When we started building Dataova's query engine, the benchmark we set for ourselves was deliberately uncomfortable: sub-second interactive query response on datasets measured in petabytes. Not sub-second on a warm cache. Not sub-second on pre-aggregated rollups. Sub-second on arbitrary analytical queries against raw, petabyte-scale data. That constraint forced every architectural decision we made, and this post explains how we got there.
The conventional wisdom when we started was that you could have two of three things: query speed, data freshness, and scale. Precomputing aggregations gives you speed but sacrifices freshness and limits the questions you can ask. Distributed SQL gives you scale and freshness but sacrifices speed on large scans. Our premise was that this trade-off was an artifact of the architectures that existed, not an inherent property of the problem. The last three years have validated that bet.
Columnar storage is table stakes for any analytics system in 2025 — the decision to use it is not interesting. What is interesting is how you use it. Most columnar systems store data column-by-column in contiguous blocks, which gives good scan performance on single-column predicates. The problem is that analytical queries rarely filter on a single column. A typical dimensional query might filter on date range, geographic region, product category, and customer segment simultaneously. With naive columnar storage, each predicate requires a separate scan pass, and the resulting row sets must then be intersected.
Dataova's storage layer uses a multi-dimensional sort key strategy that we developed specifically for analytical workloads. Data within each Parquet partition is sorted on a compound key derived from the query patterns observed on that dataset. A table used predominantly for time-series regional reporting might be sorted on (time, region, product). This means the storage pages that satisfy the most common combined predicates are physically contiguous on disk — a single read can fetch all the data a complex query needs rather than requiring separate scan passes per dimension.
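A minimal sketch of why a compound sort key helps, using hypothetical rows of (time, region, product, revenue). Once rows are sorted on the leading dimensions, a predicate on those dimensions maps to one contiguous slice found by binary search rather than a full scan:

```python
from bisect import bisect_left, bisect_right

# Hypothetical rows: (time, region, product, revenue).
rows = [
    (2, "eu", "a", 10.0), (1, "us", "b", 5.0),
    (1, "eu", "a", 7.5), (2, "us", "a", 3.0),
]

# Sort on the compound key (time, region, product) so rows matching
# a combined predicate on the leading dimensions sit adjacently.
rows.sort(key=lambda r: (r[0], r[1], r[2]))

# A scan for time == 1 becomes a single contiguous slice located
# by binary search instead of a pass over every row.
keys = [r[0] for r in rows]
lo, hi = bisect_left(keys, 1), bisect_right(keys, 1)
contiguous = rows[lo:hi]
```

In a real storage layer the same locality argument applies to Parquet pages rather than in-memory tuples, but the mechanism is the same.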
The sort key selection is not static. Our storage optimizer tracks query patterns over time and periodically rewrites partition sort keys to optimize for the actual query distribution observed in production. When a dataset's query patterns shift — a new analytical use case, a change in reporting focus — the storage layer adapts automatically without requiring manual reconfiguration. This is the adaptive indexing strategy that the headline of this post refers to: indexing that improves continuously as the system learns from real query workloads.
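One plausible shape for the adaptive piece, sketched under the assumption that the optimizer ranks columns by how often they appear in filter predicates (the post does not specify the actual selection algorithm; `choose_sort_key` and the log format are hypothetical):

```python
from collections import Counter

# Hypothetical query log: each entry lists the columns a query filtered on.
query_log = [
    ("time", "region"), ("time", "region", "product"),
    ("time",), ("region", "time"),
]

def choose_sort_key(log, max_cols=3):
    """Rank columns by filter frequency and use the most frequent
    ones, in order, as the compound sort key for the next rewrite."""
    freq = Counter(col for q in log for col in q)
    return tuple(col for col, _ in freq.most_common(max_cols))
```

When the observed distribution shifts, rerunning the selection on a fresh log window yields a new key, and the storage optimizer rewrites partitions to match.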
Reading the right data efficiently is necessary but not sufficient for sub-second performance. The computation on that data also has to be fast. Vectorized execution is the technique that makes CPU-bound query processing fast enough for interactive analytics at scale.
Traditional row-based query execution processes one row at a time through a pipeline of operators. Each row passes through a filter, then a projection, then an aggregation. The overhead of function call dispatch per row, combined with poor CPU cache utilization (each row is scattered across memory), produces mediocre CPU efficiency. A well-optimized row-based system might achieve 20-30% of theoretical CPU throughput.
Vectorized execution processes data in batches of 1,024 values at a time (the default batch size we use, tunable based on workload characteristics). Each operator receives a column vector of 1,024 values and applies its computation across the entire batch before passing it to the next operator. This dramatically improves CPU cache utilization — the batch fits in L1 or L2 cache during processing — and enables the compiler to generate SIMD (Single Instruction, Multiple Data) instructions that process multiple values with a single instruction. Our vectorized engine achieves 70-80% of theoretical CPU throughput on typical analytical workloads, a 3-4x improvement over row-based alternatives.
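The batch-at-a-time control flow can be sketched as a pipeline of generators, each consuming and producing whole batches. This is a structural illustration only: Python lists stand in for column vectors, and the per-batch loop is where a real engine would emit SIMD code over a cache-resident vector:

```python
BATCH = 1024  # default batch size described above

def scan(values, batch=BATCH):
    """Source operator: yield the column in fixed-size batches."""
    for i in range(0, len(values), batch):
        yield values[i:i + batch]

def vec_mul(batches, factor):
    """Projection operator: one call per batch amortizes dispatch
    overhead over 1,024 values instead of paying it per row."""
    for b in batches:
        yield [v * factor for v in b]

def vec_sum(batches):
    """Aggregation operator: fold each batch, then combine."""
    return sum(sum(b) for b in batches)

total = vec_sum(vec_mul(scan(list(range(10_000))), 2))
```

The row-based alternative would push each value through all three operators individually, paying function-dispatch cost ten thousand times instead of ten.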
For filter operations specifically, we use a bitmask approach that allows multiple filters to be evaluated simultaneously as bitwise operations before any data movement happens. A query with five WHERE clause predicates evaluates all five on the full batch, producing a single 1,024-bit mask that identifies the rows satisfying all predicates. Only those rows are passed forward to the projection and aggregation stages, shrinking the data volume moving through the rest of the pipeline to the fraction of rows that survive the combined filter — on typical analytical queries, a small fraction of the batch.
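A toy version of the mask combination, using a Python integer as the bitmask and an 8-value batch standing in for a 1,024-value one. The key point is the single bitwise AND that merges per-predicate masks before any rows move:

```python
def predicate_mask(batch, pred):
    """Evaluate one predicate over the whole batch, packing results
    into an integer bitmask (bit i set => row i passes)."""
    mask = 0
    for i, v in enumerate(batch):
        if pred(v):
            mask |= 1 << i
    return mask

batch = list(range(8))  # stand-in for a 1,024-value batch
m_even = predicate_mask(batch, lambda v: v % 2 == 0)
m_large = predicate_mask(batch, lambda v: v >= 4)
combined = m_even & m_large  # all predicates merged in one AND

# Only rows whose bit survives the combined mask move forward.
survivors = [v for i, v in enumerate(batch) if combined >> i & 1]
```

A production engine computes each per-predicate mask with SIMD comparisons rather than a Python loop, but the combining step is exactly this bitwise AND.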
Single-node vectorized execution, no matter how optimized, cannot process petabyte-scale queries in sub-second time. Parallelism across a cluster of nodes is required. The challenge is organizing that parallelism so that network data movement — the bottleneck in almost every distributed system — is minimized.
Dataova uses a partition-aware query planning strategy. When a query arrives, the query planner inspects the partition structure of every table involved. For each partition, it identifies which cluster nodes hold replicas of that partition's data. It then generates an execution plan that assigns each partition's processing to a node that already holds its data, eliminating the need to move data over the network before processing can begin. Network transfers occur only when intermediate results need to be combined — typically a small fraction of the input data volume.
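A simplified sketch of locality-preserving assignment, assuming a replica map from partition to the nodes that store it (the map, node names, and the greedy load tie-break are all illustrative, not Dataova's actual planner logic):

```python
# Hypothetical replica map: partition id -> nodes holding a replica.
replicas = {
    "p0": ["node-a", "node-b"],
    "p1": ["node-b", "node-c"],
    "p2": ["node-a", "node-c"],
}

def assign(replicas):
    """Assign each partition's processing to a node that already
    stores it, breaking ties toward the least-loaded node so no
    partition data crosses the network before processing starts."""
    load, plan = {}, {}
    for part, nodes in sorted(replicas.items()):
        node = min(nodes, key=lambda n: load.get(n, 0))
        plan[part] = node
        load[node] = load.get(node, 0) + 1
    return plan
```

Every partition lands on a node from its own replica list, so the only network traffic left is combining the (much smaller) intermediate results.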
For joins between large tables, we use a broadcast-hash join strategy when one table is small enough to fit in memory across the cluster. The smaller table is broadcast to all nodes; each node then performs the join locally against its partition of the larger table without any cross-node data movement. For large-to-large table joins, we use a repartition join with consistent hashing to ensure that matching rows are on the same node before the join executes. The cost of the repartition transfer is known and budgeted at query-planning time, rather than surfacing unpredictably in the middle of execution.
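The strategy choice reduces to a size comparison against a memory threshold, and the repartition step to routing rows by join-key hash. A sketch with a hypothetical 64 MB broadcast threshold; note the routing here is plain hash-modulo as a stand-in for the consistent hashing the post describes:

```python
BROADCAST_LIMIT = 64 * 2**20  # hypothetical per-node memory budget

def pick_join_strategy(left_bytes, right_bytes):
    """Broadcast the smaller side if it fits in memory on every
    node; otherwise repartition both sides by join-key hash."""
    if min(left_bytes, right_bytes) <= BROADCAST_LIMIT:
        return "broadcast-hash"
    return "repartition"

def route(join_key, nodes):
    """Hash-based routing: equal keys always map to the same node,
    so matching rows from both tables meet before the join runs."""
    return nodes[hash(join_key) % len(nodes)]
```

Consistent hashing improves on the modulo scheme by keeping most key-to-node mappings stable when the cluster grows or shrinks, which matters for long-lived clusters but not for this sketch.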
Sub-second queries on truly cold data — first-ever queries against data that has never been accessed before — are impossible at petabyte scale without caching. Our caching approach is designed to make common query patterns fast while being honest about what is and is not cached.
The cache operates at the column chunk level, not the row or page level. When a query scans column C of table T between time T1 and T2, the specific column chunks that were accessed are cached in a distributed in-memory tier. The next query that overlaps this range and accesses column C finds its data in cache. Queries that access different columns or different time ranges bypass the cache and read from the storage tier.
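A minimal model of chunk-level caching, with a hypothetical fixed chunk size and a plain dict standing in for the distributed in-memory tier. It also surfaces the per-query cache hit ratio mentioned later in the post:

```python
cache = {}        # (table, column, chunk_id) -> chunk data
CHUNK_ROWS = 1000  # hypothetical rows per column chunk

def read_column(table, column, start_row, end_row, storage_read):
    """Serve a column range from cached chunks where possible,
    reading misses from the storage tier and caching them.
    Returns the chunks plus the cache hit ratio for this read."""
    out, hits, total = [], 0, 0
    first, last = start_row // CHUNK_ROWS, end_row // CHUNK_ROWS
    for chunk_id in range(first, last + 1):
        total += 1
        key = (table, column, chunk_id)
        if key in cache:
            hits += 1
        else:
            cache[key] = storage_read(table, column, chunk_id)
        out.append(cache[key])
    return out, hits / total
```

Because the key includes the column, a query touching a different column of the same rows misses the cache, exactly as described above.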
Cache warming is a first-class feature. Scheduled queries run during off-peak hours to populate the cache with data that will be needed for the next day's peak usage. Workload managers can declare "warm this query pattern at 5 AM every day" as an explicit operational policy, rather than hoping that yesterday's production queries happened to warm the cache correctly.
We are deliberate about not hiding cache behavior from query planners or users. Every query response includes a cache hit ratio metric. Operations teams can see exactly what fraction of each query's data came from cache versus storage tier, which informs decisions about cache sizing, warming schedules, and query optimization. Systems that hide cache behavior make performance unpredictable; making it explicit makes it manageable.
All of the above techniques work well when the query planner has accurate statistics about the data. The challenge is that maintaining perfectly accurate statistics on petabyte-scale datasets with continuous ingestion is expensive. Traditional systems solve this by running statistics collection jobs on a schedule, which means statistics are always somewhat stale.
Dataova's query optimizer uses a sampling-based statistics approach that maintains approximate but continuously updated statistics on all tables. Every new partition that is written to the storage tier contributes a random sample to the statistics model. This means statistics are never more than one partition behind current data, at a maintenance cost proportional to the sample fraction rather than the full dataset size.
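One way such per-partition sampling can look, sketched with an illustrative `SampleStats` class (the sampling rate, API, and selectivity estimator are assumptions; the post only states that each new partition contributes a random sample):

```python
import random

class SampleStats:
    """Maintain a fixed-rate random sample across partitions. Each
    newly written partition contributes its own sample on ingest,
    so statistics lag the data by at most one partition."""

    def __init__(self, fraction=0.01, seed=0):
        self.fraction = fraction
        self.rng = random.Random(seed)
        self.sample = []

    def ingest_partition(self, rows):
        # Cost is proportional to fraction * len(rows), not the
        # full dataset size.
        self.sample.extend(
            r for r in rows if self.rng.random() < self.fraction
        )

    def estimate_selectivity(self, pred):
        """Estimate the fraction of rows a predicate keeps by
        evaluating it on the sample only."""
        if not self.sample:
            return 1.0  # no information: assume nothing filters
        return sum(1 for r in self.sample if pred(r)) / len(self.sample)
```

The planner can then multiply a table's row count by the estimated selectivity to get the cardinality estimates that drive join ordering and strategy selection.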
The optimizer also uses runtime feedback to improve plan quality. When a query executes and the actual row counts at each operator differ significantly from the estimated row counts that the plan was built on, this runtime information is fed back to update the statistics model. Over time, the optimizer's cardinality estimates improve for the specific query patterns that appear in production, concentrating statistical accuracy where it has the most impact on query performance.
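The feedback loop can be sketched as a per-operator-signature estimate that is blended with observed row counts after each execution. The exponential moving average and the `signature` key are illustrative choices, not the documented mechanism:

```python
class CardinalityFeedback:
    """Blend planner estimates with observed row counts using an
    exponential moving average, so repeated query shapes converge
    toward their actual cardinalities over time."""

    def __init__(self, alpha=0.3):
        self.alpha = alpha        # weight given to the newest run
        self.observed = {}        # operator signature -> smoothed rows

    def estimate(self, signature, planner_estimate):
        # Prefer runtime-derived estimates once we have seen this
        # operator shape; otherwise fall back to the planner's guess.
        return self.observed.get(signature, planner_estimate)

    def record(self, signature, actual_rows):
        prev = self.observed.get(signature, actual_rows)
        self.observed[signature] = (
            (1 - self.alpha) * prev + self.alpha * actual_rows
        )
```

Keying on a signature rather than query text is what concentrates accuracy on recurring production patterns instead of individual query strings.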
After three years of production deployment across enterprise customers, the performance characteristics we observe are consistent: 90th percentile query latency under 800ms for dataset sizes up to 500TB; under 1.8 seconds for dataset sizes from 500TB to 2PB; concurrent query throughput of 400+ queries per second per cluster node without performance degradation. These are not benchmark numbers on synthetic data — they are production metrics from real enterprise workloads.
The workloads where performance is most dramatically better than alternative approaches are: large-dataset aggregations with multiple filter dimensions (where sort key co-location provides the largest benefit), high-concurrency reporting workloads (where vectorized execution's CPU efficiency prevents core count from being the binding constraint), and ad-hoc exploration queries that cannot be served from precomputed rollups (where the storage optimizer and adaptive indexing matter most).
Sub-second queries at petabyte scale are not a feature of the hardware you buy — they are an outcome of every design decision in the query engine. The architectural choices described here compound: each optimization interacts with the others to produce aggregate performance that is greater than any individual technique could deliver alone.
For a technical walkthrough of Dataova's query engine architecture, visit the Platform page or contact our engineering team directly.