The gap between batch analytics and operational reality has been widening for years. Batch processing runs every night, or every hour, or every fifteen minutes — but business events happen continuously. A fraud event that is detected at the end of the day is not a detection; it is a report. A churn signal that appears in the weekly rollup is a retrospective, not an intervention. The organizations that have moved to streaming analytics report not just faster numbers but qualitatively different capabilities: the ability to act on data before consequences are irreversible.
Building that capability is harder than the vendor marketing suggests. A streaming analytics pipeline that operates reliably at sub-100ms latency from event to insight requires careful architecture across four distinct layers: ingestion, processing, storage, and serving. Each layer has its own failure modes, scaling characteristics, and operational considerations. This guide covers what works in production in 2025, based on patterns observed across enterprise deployments.
Apache Kafka remains the de facto standard for event ingestion at production scale, and the reasons are as valid in 2025 as they were five years ago. Its append-only log model provides replay capability — the ability to reprocess historical events when bugs are discovered or logic changes — that is invaluable in production environments. Its partition model provides horizontal scalability with predictable performance characteristics. Its consumer group semantics provide at-least-once delivery out of the box, and its transactional producer API supplies the building blocks for the exactly-once semantics that frameworks like Flink and Kafka Streams construct on top of it.
The configuration decisions that matter most for streaming analytics latency are not the ones most teams optimize. Topic partition count affects throughput but not latency at moderate scale. The setting that most directly affects end-to-end latency is linger.ms on producers: this controls how long the producer waits to batch messages before sending. The default of 0ms (send immediately) minimizes latency but maximizes per-message overhead. For throughput-sensitive applications, a linger of 5-10ms provides substantial throughput improvement at minimal latency cost. For truly latency-critical applications, 0ms linger combined with acks=1 (leader acknowledgment only) trades durability for speed; acks=0 is lower-latency still, but messages can be lost without any error surfacing, which is rarely an acceptable trade for analytics data.
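The two tuning goals above can be made concrete as producer configurations. This is an illustrative sketch using librdkafka-style property names (as accepted by clients such as confluent-kafka-python); the broker address is a placeholder.

```python
# Latency-tuned: dispatch each message immediately, wait only for the
# partition leader to acknowledge. Durability is reduced: an unreplicated
# write can be lost if the leader fails.
latency_tuned = {
    "bootstrap.servers": "broker:9092",  # placeholder address
    "linger.ms": 0,    # no batching delay; send each message as it arrives
    "acks": "1",       # leader acknowledgment only
}

# Throughput-leaning: a small linger lets the producer fill batches,
# and acks=all waits for the in-sync replicas for full durability.
throughput_tuned = {
    "bootstrap.servers": "broker:9092",
    "linger.ms": 5,    # wait up to 5 ms to accumulate a batch
    "acks": "all",     # acknowledge only after replication
}
```

Either dict can be passed to a producer constructor unchanged; the point is that these two settings, not partition count, are the primary latency levers.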
Schema management for event streams is an operational concern that bites almost every team that does not address it explicitly. When a source system adds a new field to an event, removes an existing field, or changes a field's type, downstream consumers that were not expecting the change can fail silently — continuing to run but processing invalid data. The standard solution is a schema registry (Confluent's is the most widely used) that enforces schema compatibility rules on every event produced. Backward-compatible evolution guarantees that consumers upgraded to the new schema can still read data written with the old one: fields may be removed, and any field added must carry a default. Forward-compatible evolution guarantees that consumers still on the old schema can read data written with the new one: producers may add fields that older consumers simply ignore. Both directions together provide the schema safety net that production streaming systems need.
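The two compatibility directions can be sketched as checks over flat schemas. This is a minimal illustration, not a registry implementation: the `Field` type and schema shape are invented here, and a real registry applies the full Avro or Protobuf resolution rules rather than this field-set comparison.

```python
from collections import namedtuple

# A schema is modeled as a dict of field name -> Field (type, has_default).
Field = namedtuple("Field", ["type", "has_default"])

def is_backward_compatible(old, new):
    """New-schema consumers can read old data: every field added in the
    new schema must carry a default the reader can fall back on."""
    added = set(new) - set(old)
    return all(new[f].has_default for f in added)

def is_forward_compatible(old, new):
    """Old-schema consumers can read new data: every field the new schema
    drops must have had a default in the old (reader) schema."""
    removed = set(old) - set(new)
    return all(old[f].has_default for f in removed)

v1 = {"user_id": Field("string", False)}
# Adding a field WITH a default is safe in both directions.
v2 = {"user_id": Field("string", False),
      "region": Field("string", True)}
# Adding a required field (no default) breaks backward compatibility.
v3 = {"user_id": Field("string", False),
      "plan": Field("string", False)}
```

Running the checks on these examples shows why registries reject `v3` under a backward policy while accepting `v2` under both.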
Apache Flink has emerged as the dominant stream processing framework for enterprise use cases that require stateful computation, exactly-once semantics, and complex event timing logic. Its checkpoint mechanism provides consistent state snapshots that enable failure recovery without data loss or duplication. Its event-time processing model handles late-arriving events correctly, which matters whenever events traverse network paths with variable latency. For the class of analytics that requires joining multiple event streams, maintaining running aggregations over time windows, or detecting patterns across sequences of events, Flink is currently the most production-proven choice.
The most common implementation mistake in Flink deployments is treating the default parallelism and state backend settings as production configurations. Out of the box, Flink uses a heap-based state backend and keeps checkpoint data with the JobManager, which is appropriate for development but problematic in production environments where task managers are ephemeral. Production deployments should use the RocksDB state backend for large state and write checkpoints to durable object storage (S3, GCS, or Azure Blob) so that state survives task manager restarts. Because working state stays on local disk, the cost appears as checkpoint duration rather than per-record access latency; tuning the checkpoint interval and enabling incremental checkpoints keeps that overhead acceptable for most streaming analytics use cases.
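A production-oriented setup along these lines looks roughly like the following flink-conf.yaml fragment. The bucket path and interval are illustrative values, not recommendations; consult the Flink documentation for the option names valid in your version.

```yaml
# Durable state for ephemeral task managers (illustrative values):
state.backend: rocksdb                       # local RocksDB working state
state.checkpoints.dir: s3://my-bucket/ckpts  # placeholder bucket path
state.backend.incremental: true              # upload only changed SST files
execution.checkpointing.interval: 30s        # tune against state growth
```

Incremental checkpoints matter most when state is large relative to the checkpoint interval, since they bound upload volume to what actually changed.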
Window semantics are a source of confusion in every streaming system. Tumbling windows aggregate events in fixed, non-overlapping time buckets (the five-minute window from 10:00 to 10:05 is separate from the window from 10:05 to 10:10). Sliding windows aggregate over overlapping periods (a five-minute window that advances every one minute). Session windows group events within a configurable inactivity gap. The choice between these is not academic: tumbling windows are computationally cheapest and easiest to reason about, but sliding windows are often what business logic actually requires. A sales dashboard showing "trailing 30-minute revenue" is using a sliding window. Getting the window semantics wrong produces numbers that are internally consistent but analytically wrong.
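The difference between the window types is easiest to see as window-assignment logic. This is a self-contained sketch (function names and the alignment-to-epoch convention are choices made here, not any framework's API): a tumbling window maps each timestamp to exactly one bucket, while a sliding window maps it to several overlapping ones.

```python
def tumbling_window(ts, size):
    """Return the single fixed, non-overlapping bucket containing ts.
    Windows are aligned to multiples of `size` (e.g. epoch seconds)."""
    start = ts - (ts % size)
    return (start, start + size)

def sliding_windows(ts, size, slide):
    """Return every window of length `size`, advancing by `slide`,
    that contains ts. With size=5 and slide=1, each event lands
    in five overlapping windows."""
    s = ts - (ts % slide)          # latest aligned window start at or before ts
    windows = []
    while s > ts - size:           # any start after ts - size still covers ts
        windows.append((s, s + size))
        s -= slide
    return windows
```

The "trailing 30-minute revenue" dashboard is `sliding_windows(now, 30, 1)` territory (in minutes): one event contributes to thirty overlapping aggregates, which is exactly the extra computational cost the paragraph above describes.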
The output of stream processing feeds a storage layer that must simultaneously support two incompatible access patterns: real-time updates (write rate proportional to event rate) and interactive analytical queries (read patterns that require full-table scans with arbitrary filters). Row-oriented databases are optimized for the first; columnar stores are optimized for the second. The architecture that resolves this tension in 2025 is the table format layer: Apache Iceberg, Delta Lake, or Apache Hudi used as an abstraction over object storage that provides both transactional write performance and columnar read performance.
Each of these formats uses a different approach to the merge problem — how to efficiently incorporate streaming updates into a columnar store. Hudi's merge-on-read table type writes updates to small delta files and merges them lazily on read, minimizing write amplification at the cost of read-time merge overhead. Iceberg defaults to copy-on-write semantics that produce clean columnar files immediately but at higher write amplification; its v2 format also supports merge-on-read through delete files. Delta Lake builds ACID transactions on a transaction log over S3-compatible object storage and supports both full-file rewrites and deletion vectors for updates. The choice between them depends on write frequency, query frequency, and whether ACID multi-table transactions are required.
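The write-amplification trade can be made tangible with a toy cost model. The numbers and function names below are purely illustrative and ignore compression, metadata, and compaction; the point is the asymmetry in where each strategy pays its cost.

```python
def copy_on_write_cost(base_file_rows, updated_rows):
    """COW: the whole data file is rewritten per update batch. Writes are
    expensive, but subsequent reads see clean columnar files with no
    merge work."""
    return {"rows_written": base_file_rows,
            "read_time_merge_rows": 0}

def merge_on_read_cost(base_file_rows, updated_rows):
    """MOR: only a small delta file is appended. Writes are cheap, but
    every read must merge the delta against the base file until a
    compaction job folds it in."""
    return {"rows_written": updated_rows,
            "read_time_merge_rows": updated_rows}
```

For a hundred updated rows against a million-row file, COW writes four orders of magnitude more data per batch, while MOR shifts that cost onto every query until compaction runs, which is why the write-frequency-versus-query-frequency ratio drives the format choice.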
Compaction is the operational task that every team underestimates. High-frequency streaming writes produce many small files, which degrade query performance because each file requires a separate metadata lookup and read operation. Compaction jobs merge these small files into larger, query-efficient files on a schedule. Without regular compaction, query performance on streaming tables degrades continuously as file count grows. Production deployments need automated compaction jobs with monitoring to ensure they complete before the file count reaches performance-degrading levels.
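The core of a compaction planner is a bin-packing pass over the small files. This is a greedy sketch under invented names, not any table format's actual compaction service; real implementations also consider partition boundaries, sort order, and file age.

```python
def plan_compaction(file_sizes, target_size):
    """Greedily group small files into batches whose combined size
    approaches target_size; each batch becomes one rewrite into a
    single larger, query-efficient file."""
    batches, current, current_size = [], [], 0
    for size in sorted(file_sizes):      # pack smallest files first
        if current and current_size + size > target_size:
            batches.append(current)      # close the full batch
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

A table with files of 5, 10, 20, 30, and 100 MB and a 64 MB target compacts five metadata lookups down to three, and the planner should run on a schedule with the batch count and pre-compaction file count exported as monitoring metrics.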
All of the above infrastructure produces nothing unless business users can interact with the results. The serving layer for streaming analytics must handle two distinct query types: pre-aggregated metrics that update continuously (the dashboard showing current fraud rate or active session count) and ad-hoc analytical queries that need to scan historical streaming data.
Pre-aggregated metric serving is best handled with a purpose-built time-series store or a materialized view layer that is updated by the stream processing tier. Apache Druid, ClickHouse, and Pinot are the most common choices for this workload in 2025. They are designed specifically for high-cardinality, high-throughput metrics ingestion and support sub-second query latency on current metrics with days-to-months of history. They are not general-purpose analytical databases — complex multi-table joins and arbitrary aggregations are not their strength. For the metric monitoring and real-time dashboard use case, they are the right tool.
Ad-hoc historical queries over streaming data are served by the table format layer (Iceberg, Delta, or Hudi) through a distributed SQL engine. Trino, Spark SQL, and Dremio are common choices. These provide full SQL expressiveness at the cost of higher query latency — typically 2-30 seconds for complex queries over large historical windows. Users who expect sub-second response for arbitrary historical queries over petabytes of streaming data will be disappointed; the physics of scanning large volumes of data constrains what is achievable regardless of optimization. The architecture that works in practice separates the use cases: real-time metrics go through the OLAP store, historical analysis goes through the columnar analytical engine.
Pipeline monitoring is the operational investment that separates streaming systems that run reliably from those that fail unpredictably. The metrics that matter are: consumer group lag (growing lag indicates the pipeline is falling behind the event rate), checkpoint duration (increasing checkpoint time indicates state is growing faster than the checkpoint mechanism can handle), and processing-time-to-event-time skew (increasing skew indicates late-arriving events that will be missed by fixed-size windows).
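The first of those metrics, consumer group lag, reduces to simple offset arithmetic, and the alert condition is the trend rather than the absolute value. A minimal sketch (function names and the monotone-growth alert rule are choices made here):

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag = broker's latest offset minus the consumer
    group's committed offset. Both arguments map partition -> offset."""
    return {p: log_end_offsets[p] - committed_offsets.get(p, 0)
            for p in log_end_offsets}

def lag_is_growing(total_lag_samples, window=3):
    """Alert condition: total lag strictly increased across the last
    `window` samples, meaning the pipeline is falling behind the
    event rate rather than absorbing a transient spike."""
    recent = total_lag_samples[-window:]
    return len(recent) == window and all(
        a < b for a, b in zip(recent, recent[1:]))
```

Checkpoint duration and event-time skew deserve the same trend-based treatment: a single slow checkpoint is noise, while three successively slower ones are a signal.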
Late data handling is a design decision, not an afterthought. Events that arrive later than the watermark cutoff are either dropped, reprocessed in a correction window, or flagged as corrections to previously computed values. The choice has significant implications for query result consistency. Systems that silently drop late data produce results that look correct but include systematic gaps. Systems that reprocess late data produce eventually consistent results but create downstream complexity for users who need to know whether a number is final or may change. Pick one strategy explicitly and communicate it clearly to analytics consumers.
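The three strategies above amount to a routing decision per event against the current watermark. This sketch names the policy explicitly (the function name and lateness-budget parameter are illustrative, loosely following the allowed-lateness idea found in systems like Flink):

```python
def classify_event(event_time, watermark, allowed_lateness):
    """Route an event relative to the watermark:
    - on_time:    at or ahead of the watermark, processed normally
    - correction: late but within the lateness budget; emit as a
                  correction to a previously computed window
    - dropped:    beyond the budget; counted and discarded, never
                  silently ignored"""
    if event_time >= watermark:
        return "on_time"
    if event_time >= watermark - allowed_lateness:
        return "correction"
    return "dropped"
```

Whichever branch handles late events, the dropped count should be a first-class metric, and downstream consumers should be told whether a published number is final or still inside its correction window.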
Streaming analytics is not a replacement for batch analytics — it is an additional capability for the use cases where timeliness determines value. The organizations that have built it successfully treat it as a distinct engineering discipline with its own tools, operational practices, and performance models, rather than trying to make batch infrastructure run faster.
See how Dataova's streaming analytics integration connects to existing Kafka infrastructure and delivers sub-100ms latency to dashboards and alert systems.