Zero overhead for simple queries
2-tier routing: ~10ns check skips planning for single-table queries.
Not every query needs a cost-based planner. Single-table scans with a WHERE clause are fast enough with rule-based execution. But once JOINs, CTEs, and subqueries enter the picture, choosing the right execution strategy matters. This post covers how ZeptoDB built a cost-based planner from scratch, in 7 phases.
The first design decision: don’t slow down simple queries. A needs_cost_planning() check (~10ns) routes queries:
Simple SELECT/WHERE/GROUP BY → existing fast path (zero overhead)JOIN / CTE / subquery / set-op → cost-based planning pipelineThis means 90%+ of time-series queries (single-table scans) pay nothing for the planner’s existence.
Accurate cost estimation requires knowing the data. TableStatistics collects per-column, per-partition stats:
| Statistic | Method | Cost |
|---|---|---|
| min / max | Track on append | O(1) per row |
| null_count | Increment on null | O(1) per row |
| distinct_count | HyperLogLog (64 registers) | O(1) per row, ~2-15% error |
| row_count | Counter per partition | O(1) |
| ts_min / ts_max | From timestamp column | O(1) |
The HyperLogLog uses 64 registers with FNV-1a hashing — 64 bytes of memory per column, ~30 lines of code. The 2-15% error is acceptable for cost estimation (we need order-of-magnitude accuracy, not exact counts).
Stats are updated incrementally on append and frozen when a partition is sealed (immutable). Cross-partition aggregation merges stats for query planning.
The cost model estimates execution cost for each operator type:
Cost Constants: SEQ_SCAN = 1.0 (sequential read, cache-friendly) RANDOM_IO = 4.0 (random access, cache-unfriendly) INDEX_PROBE = 0.5 (binary search, O(log n)) HASH_BUILD = 1.5 (hash map construction) HASH_PROBE = 0.5 (hash map lookup) SORT_COST = 2.0 (comparison + swap) AGG_COST = 1.0 (accumulate)The cost model estimates how many rows survive each predicate:
| Predicate | Selectivity Formula |
|---|---|
col = value | 1 / NDV (number of distinct values) |
col BETWEEN lo AND hi | (hi - lo) / (max - min) |
col IN (a, b, c) | list_size / NDV |
A AND B | sel(A) × sel(B) |
A OR B | sel(A) + sel(B) - sel(A) × sel(B) |
The crossover point between index scan and sequential scan is ~15-25% selectivity. Below 15%, the index scan’s O(log n) probe + random access is cheaper. Above 25%, sequential scan’s cache-friendly linear read wins.
The parser produces an AST (SelectStmt). The logical planner converts it to an operator tree:
Simple SELECT: SCAN → FILTER → PROJECT → SORT → LIMIT
JOIN: SCAN(left) ─┐ ├── JOIN → FILTER → PROJECT → SORT → LIMIT SCAN(right) ─┘
GROUP BY: SCAN → FILTER → AGGREGATE → SORT → LIMITBefore cost-based planning, two rule-based rewrites run:
The physical planner converts logical operators to physical operators using statistics:
| Logical | Physical | Selection Criteria |
|---|---|---|
| SCAN | SEQ_SCAN | Default |
| SCAN | INDEX_SCAN | Selectivity ≤ 0.15 |
| SCAN | SORTED_RANGE_SCAN | Timestamp range on sorted column |
| JOIN | HASH_JOIN | Default (smaller side as build) |
| JOIN | ASOF_JOIN | ASOF semantics |
| AGGREGATE | HASH_AGGREGATE | Default |
| SORT + LIMIT | TOPN_SORT | LIMIT present (partial sort) |
| SORT | FULL_SORT | No LIMIT |
For HASH_JOIN, the planner uses statistics to pick the smaller side as the build side. Building the hash map on fewer rows means less memory and faster construction:
Left: 100K rows (reference data) → build sideRight: 10M rows (trade data) → probe sideThe physical plan integrates into exec_select() for complex queries. Simple queries bypass planning entirely.
EXPLAIN output for complex queries shows the physical plan tree with cost estimates:
EXPLAIN SELECT t.*, r.nameFROM trades t JOIN ref r ON t.symbol_id = r.idWHERE t.price > 100;
Physical Plan:├── HASH_JOIN (cost: 15234.0, est_rows: 45000)│ ├── SEQ_SCAN trades (cost: 10000.0, est_rows: 100000)│ │ └── FILTER price > 100 (sel: 0.45)│ └── SEQ_SCAN ref (cost: 100.0, est_rows: 1000) [build]└── PROJECT t.*, r.nameThe final phase connects planner decisions to actual execution. Three operators were wired:
HASH_JOIN build side swap — explicitly wired. The executor reads the physical plan to determine which side to build on:
auto* hj_node = find_hash_join_node(last_physical_plan_);if (hj_node && !hj_node->build_right) { // Planner says left side is smaller — swap build/probe planner_swap = true;}INNER and FULL OUTER JOINs support the swap. LEFT/RIGHT JOINs are excluded — their semantics require specific build sides.
TOPN_SORT — already wired. The executor’s apply_order_by() already uses std::partial_sort when LIMIT is present.
INDEX_SCAN — already wired. The executor’s collect_and_intersect() automatically uses timestamp binary search and sorted column scans.
Planning overhead: ~1μs per query. Simple queries: zero overhead via the fast-path check.
Zero overhead for simple queries
2-tier routing: ~10ns check skips planning for single-table queries.
HyperLogLog statistics
64 bytes per column. Incremental updates. 2-15% error — good enough for cost estimation.
Predicate pushdown
Filters pushed below JOINs. Fewer rows enter the expensive operator.
Build side selection
Statistics-driven. Smaller side builds the hash map. Confirmed working via EXPLAIN.
Related: GROUP BY Optimization → · FlatHashMap JOINs → · Parallel Query Engine →