Skip to content

Building a Cost-Based Query Planner from Scratch

Not every query needs a cost-based planner. Single-table scans with a WHERE clause are fast enough with rule-based execution. But once JOINs, CTEs, and subqueries enter the picture, choosing the right execution strategy matters. This post covers how ZeptoDB built a cost-based planner from scratch, in 7 phases.


The first design decision: don’t slow down simple queries. A needs_cost_planning() check (~10ns) routes queries:

Simple SELECT/WHERE/GROUP BY → existing fast path (zero overhead)
JOIN / CTE / subquery / set-op → cost-based planning pipeline

This means 90%+ of time-series queries (single-table scans) pay nothing for the planner’s existence.


Accurate cost estimation requires knowing the data. TableStatistics collects per-column, per-partition stats:

StatisticMethodCost
min / maxTrack on appendO(1) per row
null_countIncrement on nullO(1) per row
distinct_countHyperLogLog (64 registers)O(1) per row, ~2-15% error
row_countCounter per partitionO(1)
ts_min / ts_maxFrom timestamp columnO(1)

The HyperLogLog uses 64 registers with FNV-1a hashing — 64 bytes of memory per column, ~30 lines of code. The 2-15% error is acceptable for cost estimation (we need order-of-magnitude accuracy, not exact counts).

Stats are updated incrementally on append and frozen when a partition is sealed (immutable). Cross-partition aggregation merges stats for query planning.


The cost model estimates execution cost for each operator type:

Cost Constants:
SEQ_SCAN = 1.0 (sequential read, cache-friendly)
RANDOM_IO = 4.0 (random access, cache-unfriendly)
INDEX_PROBE = 0.5 (binary search, O(log n))
HASH_BUILD = 1.5 (hash map construction)
HASH_PROBE = 0.5 (hash map lookup)
SORT_COST = 2.0 (comparison + swap)
AGG_COST = 1.0 (accumulate)

The cost model estimates how many rows survive each predicate:

PredicateSelectivity Formula
col = value1 / NDV (number of distinct values)
col BETWEEN lo AND hi(hi - lo) / (max - min)
col IN (a, b, c)list_size / NDV
A AND Bsel(A) × sel(B)
A OR Bsel(A) + sel(B) - sel(A) × sel(B)

The crossover point between index scan and sequential scan is ~15-25% selectivity. Below 15%, the index scan’s O(log n) probe + random access is cheaper. Above 25%, sequential scan’s cache-friendly linear read wins.


The parser produces an AST (SelectStmt). The logical planner converts it to an operator tree:

Simple SELECT:
SCAN → FILTER → PROJECT → SORT → LIMIT
JOIN:
SCAN(left) ─┐
├── JOIN → FILTER → PROJECT → SORT → LIMIT
SCAN(right) ─┘
GROUP BY:
SCAN → FILTER → AGGREGATE → SORT → LIMIT

Before cost-based planning, two rule-based rewrites run:

  1. Predicate pushdown: Move FILTER below JOIN when it references only one side. This reduces the number of rows entering the JOIN.
  2. Projection pushdown: Propagate needed columns to SCAN nodes. Don’t read columns that aren’t used.

The physical planner converts logical operators to physical operators using statistics:

LogicalPhysicalSelection Criteria
SCANSEQ_SCANDefault
SCANINDEX_SCANSelectivity ≤ 0.15
SCANSORTED_RANGE_SCANTimestamp range on sorted column
JOINHASH_JOINDefault (smaller side as build)
JOINASOF_JOINASOF semantics
AGGREGATEHASH_AGGREGATEDefault
SORT + LIMITTOPN_SORTLIMIT present (partial sort)
SORTFULL_SORTNo LIMIT

For HASH_JOIN, the planner uses statistics to pick the smaller side as the build side. Building the hash map on fewer rows means less memory and faster construction:

Left: 100K rows (reference data) → build side
Right: 10M rows (trade data) → probe side

The physical plan integrates into exec_select() for complex queries. Simple queries bypass planning entirely.

EXPLAIN output for complex queries shows the physical plan tree with cost estimates:

EXPLAIN SELECT t.*, r.name
FROM trades t JOIN ref r ON t.symbol_id = r.id
WHERE t.price > 100;
Physical Plan:
├── HASH_JOIN (cost: 15234.0, est_rows: 45000)
│ ├── SEQ_SCAN trades (cost: 10000.0, est_rows: 100000)
│ │ └── FILTER price > 100 (sel: 0.45)
│ └── SEQ_SCAN ref (cost: 100.0, est_rows: 1000) [build]
└── PROJECT t.*, r.name

The final phase connects planner decisions to actual execution. Three operators were wired:

HASH_JOIN build side swap — explicitly wired. The executor reads the physical plan to determine which side to build on:

auto* hj_node = find_hash_join_node(last_physical_plan_);
if (hj_node && !hj_node->build_right) {
// Planner says left side is smaller — swap build/probe
planner_swap = true;
}

INNER and FULL OUTER JOINs support the swap. LEFT/RIGHT JOINs are excluded — their semantics require specific build sides.

TOPN_SORT — already wired. The executor’s apply_order_by() already uses std::partial_sort when LIMIT is present.

INDEX_SCAN — already wired. The executor’s collect_and_intersect() automatically uses timestamp binary search and sorted column scans.

Planning overhead: ~1μs per query. Simple queries: zero overhead via the fast-path check.

Zero overhead for simple queries

2-tier routing: ~10ns check skips planning for single-table queries.

HyperLogLog statistics

64 bytes per column. Incremental updates. 2-15% error — good enough for cost estimation.

Predicate pushdown

Filters pushed below JOINs. Fewer rows enter the expensive operator.

Build side selection

Statistics-driven. Smaller side builds the hash map. Confirmed working via EXPLAIN.


Related: GROUP BY Optimization → · FlatHashMap JOINs → · Parallel Query Engine →