Zero overhead when unused
Lazy init + conditional compilation. No DuckDB cost if you don’t query Parquet.
ZeptoDB is optimized for hot, real-time time-series data. But what about cold data sitting in Parquet files on S3? Rather than building a full Parquet query engine, we embedded DuckDB as an in-process analytical offload — connected via columnar data conversion.
ZeptoDB’s execution engine is built for streaming columnar data in memory — append-only partitions, SIMD aggregation, microsecond latency. Parquet files are a different beast: compressed, row-group-based, potentially gigabytes on remote storage.
Building a Parquet query engine from scratch would mean reimplementing predicate pushdown on row groups, dictionary decoding, delta encoding, and dozens of compression codecs. DuckDB already does all of this extremely well.
The architecture:
Hot data (real-time) Cold data (historical)───────────────────── ──────────────────────ZeptoDB Column Store Parquet files (S3/local) │ │ ▼ ▼ZeptoDB Execution Engine DuckDB (embedded, in-process) │ │ └──────────┬───────────────────┘ ▼ QueryResultSetZeptoDB and DuckDB both work with columnar data, but in different formats. The Arrow bridge converts between them:
// Convert ZeptoDB ColumnStore columns to flat vectorsArrowData columns_to_arrow_data(const ColumnStore& store, const vector<string>& columns);Supported types: INT64, FLOAT64, TIMESTAMP_NS, INT32, FLOAT32, SYMBOL.
// Convert column data back to QueryResultSetQueryResultSet arrow_columns_to_result(const ArrowData& data);This is a columnar-to-columnar conversion — no row-by-row serialization. The data stays in contiguous arrays throughout.
The DuckDBEngine class wraps DuckDB with a pimpl pattern to isolate its headers from the rest of the codebase:
class DuckDBEngine { std::unique_ptr<Impl> impl_; // DuckDB headers only in .cpppublic: QueryResult execute(const std::string& sql); void register_parquet(const std::string& name, const std::string& path);};Key design decisions:
The executor routes queries to DuckDB in two ways:
SELECT * FROM duckdb('/data/historical/trades_2025.parquet')WHERE symbol = 'AAPL' AND price > 150ORDER BY timestamp DESCLIMIT 100;The duckdb('path') syntax explicitly tells the executor to route to DuckDB. The path can be local or S3.
// In executor configenable_duckdb_offload = true;duckdb_memory_limit_mb = 256;When enabled, queries targeting registered Parquet tables are automatically routed to DuckDB. The routing happens in exec_select() after CTE resolution but before partition scanning.
DuckDB is a large dependency (~50MB compiled). It’s optional:
option(ZEPTO_ENABLE_DUCKDB "Enable DuckDB embedding for Parquet offload" ON)All DuckDB code is behind #ifdef ZEPTO_ENABLE_DUCKDB. When disabled, the duckdb() table function returns an error message, and automatic offload is skipped. The rest of ZeptoDB is unaffected.
DuckDB v1.2.1 is pulled via CMake FetchContent — includes the Parquet extension by default.
| Scenario | Engine | Why |
|---|---|---|
| Real-time streaming queries | ZeptoDB | Microsecond latency, SIMD, in-memory |
| Last N minutes of data | ZeptoDB | Hot partitions, already in column store |
| Historical analysis (days/months) | DuckDB | Parquet predicate pushdown, compression |
| Ad-hoc exploration of cold data | DuckDB | Full SQL, no need to load into memory |
| JOIN hot + cold data | Both | ZeptoDB for hot side, DuckDB for cold side |
The typical pattern: real-time dashboards hit ZeptoDB directly. End-of-day analytics and backtesting queries use duckdb() to scan Parquet archives without loading them into memory.
Zero overhead when unused
Lazy init + conditional compilation. No DuckDB cost if you don’t query Parquet.
Columnar bridge
Arrow-style conversion between ZeptoDB and DuckDB. No row-by-row serialization.
Memory isolation
DuckDB gets its own budget. Won’t compete with ZeptoDB’s arena allocator.
Full SQL on Parquet
DuckDB’s Parquet engine handles predicate pushdown, dictionary decoding, and compression.
Related: Lock-Free Ingestion → · Cost-Based Query Planner → · Bare-Metal Tuning →