Time-series databases face a fundamental tension: real-time queries need data in memory, but storing everything in RAM is prohibitively expensive. ZeptoDB solves this with a three-tier storage architecture that moves data through hot, warm, and cold stages automatically.
| Tier | Storage | Mechanism | Characteristics |
|---|---|---|---|
| HOT | RDB (In-Memory) | ArenaAllocator | Active + recent data; sub-microsecond reads |
| WARM | HDB (NVMe) | mmap + LZ4 | Sealed partitions; ~678 µs reads |
| COLD | S3 (Parquet) | SNAPPY/ZSTD | Historical archive; DuckDB/Polars query |

Data flows downward automatically. The FlushManager runs a background thread that checks for sealed partitions every second and flushes them to disk, with no mutex on the hot ingestion path.
ZeptoDB supports three modes, configured per-deployment:
| Mode | Description | Query Target |
|---|---|---|
| PURE_IN_MEMORY | Extreme HFT tick processing | RDB only |
| TIERED | RDB (today) + HDB (history), async merge | RDB + HDB |
| PURE_ON_DISK | Backtesting / deep learning feature generation | HDB only |
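
The mode table above can be read as a mapping from mode to the tiers a query is allowed to touch. The following is a minimal sketch of that mapping, not ZeptoDB's actual API: the enum values mirror the table, while `QueryTarget`, `query_targets`, and `touches_hdb` are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Mode names are taken from the table; everything else here is illustrative.
enum class StorageMode { PURE_IN_MEMORY, TIERED, PURE_ON_DISK };

// Bit flags for the tiers a query may read from (hypothetical type).
enum QueryTarget : std::uint8_t { TARGET_RDB = 1u << 0, TARGET_HDB = 1u << 1 };

// Map a deployment mode to the set of tiers queries are routed to.
inline std::uint8_t query_targets(StorageMode mode) {
    switch (mode) {
        case StorageMode::PURE_IN_MEMORY: return TARGET_RDB;
        case StorageMode::TIERED:
            return static_cast<std::uint8_t>(TARGET_RDB | TARGET_HDB);
        case StorageMode::PURE_ON_DISK:   return TARGET_HDB;
    }
    assert(false && "unknown StorageMode");
    return 0;
}

// Convenience check: does this mode ever read from the on-disk HDB?
inline bool touches_hdb(StorageMode mode) {
    return (query_targets(mode) & TARGET_HDB) != 0;
}
```

In this sketch a TIERED deployment reports both flags, so a query planner would consult the RDB for today's data and the HDB for history.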
Inspired by kdb+'s splayed table approach, each column is stored as an independent binary file:
    hdb_data/
      {symbol_id}/
        {hour_epoch}/
          timestamp.bin    ← per-column binary
          price.bin
          volume.bin
          msg_type.bin

Each file starts with a 32-byte header, exactly half a cache line:
| Field | Size | Description |
|---|---|---|
| magic | 5B | APEXH |
| version | 1B | Format version (v1) |
| col_type | 1B | ColumnType enum |
| compression | 1B | 0=None, 1=LZ4 |
| row_count | 8B | Number of rows |
| data_size | 8B | Compressed data size |
| uncompressed_size | 8B | Original size |
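
The field sizes in the table add up to exactly 32 bytes (5 + 1 + 1 + 1 + 8 + 8 + 8). One way to picture the layout is a plain C++ struct. This is a sketch under assumptions: the field order follows the table, integers are taken to be in host byte order, and `ColumnFileHeader`, `make_header`, and `has_valid_magic` are hypothetical names rather than ZeptoDB's real definitions.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Assumed layout: fields in table order. With this order the three 64-bit
// fields land on 8-byte boundaries, so no padding is inserted.
struct ColumnFileHeader {
    char          magic[5];           // "APEXH", no NUL terminator
    std::uint8_t  version;            // format version (1)
    std::uint8_t  col_type;           // ColumnType enum value
    std::uint8_t  compression;        // 0 = none, 1 = LZ4
    std::uint64_t row_count;          // number of rows
    std::uint64_t data_size;          // compressed data size in bytes
    std::uint64_t uncompressed_size;  // original size in bytes
};
static_assert(sizeof(ColumnFileHeader) == 32, "header must be 32 bytes");

// Build a v1 header for a column file.
inline ColumnFileHeader make_header(std::uint8_t col_type, std::uint8_t compression,
                                    std::uint64_t rows, std::uint64_t stored_bytes,
                                    std::uint64_t raw_bytes) {
    assert(compression <= 1);  // only 0 (none) and 1 (LZ4) are listed
    ColumnFileHeader h{};
    std::memcpy(h.magic, "APEXH", sizeof(h.magic));
    h.version = 1;
    h.col_type = col_type;
    h.compression = compression;
    h.row_count = rows;
    h.data_size = stored_bytes;
    h.uncompressed_size = raw_bytes;
    return h;
}

// Cheap sanity check a reader could run before trusting the other fields.
inline bool has_valid_magic(const ColumnFileHeader& h) {
    return std::memcmp(h.magic, "APEXH", sizeof(h.magic)) == 0 && h.version == 1;
}
```

The `static_assert` makes the size claim a compile-time check, so an accidental field reorder that introduces padding fails the build instead of corrupting files.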
Per-column separation means queries only mmap the columns they need. A SELECT avg(price) FROM trades never touches the volume.bin file.
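To make the per-column layout concrete, here is a small sketch that builds the path of a single column file from the directory scheme above. The helper names are hypothetical, and the definition of `hour_epoch` as whole hours since the Unix epoch derived from a nanosecond timestamp is an assumption; the source does not say how the value is computed.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Assumption: hour_epoch counts whole hours since the Unix epoch.
inline std::uint64_t hour_epoch_from_ns(std::uint64_t ts_ns) {
    constexpr std::uint64_t kNsPerHour = 3600ull * 1000000000ull;
    return ts_ns / kNsPerHour;
}

// Build "<root>/<symbol_id>/<hour_epoch>/<column>.bin" (hypothetical helper).
inline std::string column_path(const std::string& root, std::uint32_t symbol_id,
                               std::uint64_t hour_epoch, const std::string& column) {
    assert(!column.empty());
    return root + "/" + std::to_string(symbol_id) + "/" +
           std::to_string(hour_epoch) + "/" + column + ".bin";
}
```

Under this scheme, a query such as SELECT avg(price) FROM trades would only need to resolve the price path for each partition it scans; no path for volume is ever built.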
LZ4 block compression is applied automatically, with a smart fallback: if the compressed result is larger than the original (e.g., random data), the raw bytes are stored instead.
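The fallback rule can be sketched independently of the compressor. In the example below the compressor is passed in as a callback so the block stays self-contained; in a real build it would wrap an LZ4 block-compression call. `Bytes`, `Compressor`, `EncodedBlock`, and `encode_with_fallback` are hypothetical names, and the flag values follow the header table (0 = none, 1 = LZ4).

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

using Bytes = std::vector<std::uint8_t>;
// Stand-in for the real compressor: takes raw bytes, returns compressed bytes.
using Compressor = std::function<Bytes(const Bytes&)>;

struct EncodedBlock {
    std::uint8_t compression;  // 0 = stored raw, 1 = compressed
    Bytes payload;             // bytes that would be written after the header
};

// Compress, but keep the raw bytes if compression made the data larger.
inline EncodedBlock encode_with_fallback(const Bytes& raw, const Compressor& compress) {
    assert(static_cast<bool>(compress));
    Bytes packed = compress(raw);
    if (packed.size() > raw.size()) {
        return EncodedBlock{0, raw};  // fallback: incompressible input
    }
    return EncodedBlock{1, std::move(packed)};
}
```

Recording the outcome in the `compression` flag is what lets a reader decide, per file, whether to decompress or to use the mapped bytes directly.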
Time-series data compresses exceptionally well — sequential timestamps and correlated prices yield a 0.31 compression ratio (69% savings):
| Metric | Value |
|---|---|
| Compression ratio | 0.31 (69% savings) |
| Compression throughput | ~1,128 MB/sec |
| Decompression | Near memory bandwidth |
HDB reads use mmap(MAP_PRIVATE) with madvise(MADV_SEQUENTIAL):
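A minimal RAII wrapper along these lines might look as follows. This is a sketch, not the actual `MappedColumn` class: only the class name and the mmap/madvise flags come from the text, while the constructor signature, accessors, and error handling are assumptions. It uses POSIX calls only.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <stdexcept>
#include <string>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

class MappedColumn {
public:
    explicit MappedColumn(const std::string& path) {
        fd_ = ::open(path.c_str(), O_RDONLY);
        if (fd_ < 0) throw std::runtime_error("open failed: " + path);
        struct stat st {};
        if (::fstat(fd_, &st) != 0 || st.st_size == 0) {
            ::close(fd_);  // destructor does not run if the constructor throws
            throw std::runtime_error("fstat failed or empty file: " + path);
        }
        size_ = static_cast<std::size_t>(st.st_size);
        void* p = ::mmap(nullptr, size_, PROT_READ, MAP_PRIVATE, fd_, 0);
        if (p == MAP_FAILED) {
            ::close(fd_);
            throw std::runtime_error("mmap failed: " + path);
        }
        data_ = static_cast<const unsigned char*>(p);
        // Advisory hint only; a failure here does not affect correctness.
        (void)::madvise(p, size_, MADV_SEQUENTIAL);
    }

    // Release the mapping and the descriptor when the object goes out of scope.
    ~MappedColumn() {
        ::munmap(const_cast<unsigned char*>(data_), size_);
        ::close(fd_);
    }

    MappedColumn(const MappedColumn&) = delete;
    MappedColumn& operator=(const MappedColumn&) = delete;

    const unsigned char* data() const { return data_; }
    std::size_t size() const { return size_; }
    unsigned char at(std::size_t i) const { assert(i < size_); return data_[i]; }

private:
    int fd_ = -1;
    std::size_t size_ = 0;
    const unsigned char* data_ = nullptr;
};
```

Copying is disabled so that two objects can never unmap the same region; a production version would likely add move operations as well.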
The MappedColumn destructor calls munmap + close automatically.

| Metric | Value |
|---|---|
| Write throughput (1M rows, uncompressed) | 4,785 MB/sec — near NVMe theoretical bandwidth |
| Write throughput (1M rows, LZ4) | 1,128 MB/sec — includes compression CPU cost |
| In-memory COUNT (1M rows) | 1.11 µs |
| Tiered HDB COUNT (1M rows) | 677.60 µs (~600× slower, still sub-ms) |
| In-memory VWAP (1M rows) | 44.84 µs |

Compared with kdb+:

| Aspect | kdb+ | ZeptoDB |
|---|---|---|
| Flush throughput | ~1–2 GB/sec | ~4.7 GB/sec |
| Partition granularity | Day | Hour (better for HFT) |
| Compression | Separate gzip step | LZ4 built-in |
| Header size | 8 bytes | 32 bytes (magic + version + metadata) |
| Memory management | Reference counting | RAII mmap |
4.7 GB/sec flush
Uncompressed writes approach NVMe SSD theoretical bandwidth by using direct write() calls and a custom binary format.
69% disk savings
LZ4 block compression is highly effective on time-series data — sequential timestamps and correlated prices.
Zero-copy reads
mmap with MADV_SEQUENTIAL lets the OS page cache handle prefetching. Uncompressed columns return direct pointers.
Hour-level partitions
Finer than kdb+'s day-level partitioning, and better suited to HFT data with millions of ticks per hour.
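The flush path mentioned above writes column buffers with write(). A single write() call may write fewer bytes than requested, so a flush routine typically loops until the whole buffer is on its way to disk. The helper below is a minimal sketch of that loop; `write_all` is a hypothetical name, and the source does not say whether ZeptoDB opens files with O_DIRECT or relies on the page cache.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <unistd.h>

// Write exactly `len` bytes to `fd`, retrying on short writes and EINTR.
// Returns false on any other error.
inline bool write_all(int fd, const void* buf, std::size_t len) {
    assert(buf != nullptr || len == 0);
    const char* p = static_cast<const char*>(buf);
    while (len > 0) {
        ssize_t n = ::write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR) continue;  // interrupted before writing: retry
            return false;
        }
        if (n == 0) return false;          // no progress: avoid spinning forever
        p += n;
        len -= static_cast<std::size_t>(n);
    }
    return true;
}
```

A column flush in this style would call `write_all` once for the 32-byte header and once for the column payload.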
The FlushManager drives the state machine:
ACTIVE → SEALED → FLUSHING → FLUSHED

| State | Activity | Detail |
|---|---|---|
| ACTIVE | Ingestion | No lock |
| SEALED | Background check | 1 s poll |
| FLUSHING | Write to disk | Async |
| FLUSHED | Reclaim arena | reset() |

A single background thread does the work, with no locking on the hot path. After a flush completes, ArenaAllocator::reset() reclaims the memory immediately.
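One lock-free way to implement such a state machine is an atomic state per partition, advanced with compare-and-swap. The sketch below shows one pass of the background check for a single partition; it is an illustration under assumptions, not ZeptoDB's FlushManager. `Partition`, `transition`, and `flush_if_sealed` are hypothetical names, the disk write and arena reset are passed in as callbacks, and the one-second polling loop and its thread are omitted.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

enum class PartitionState : std::uint8_t { ACTIVE, SEALED, FLUSHING, FLUSHED };

struct Partition {
    std::atomic<PartitionState> state{PartitionState::ACTIVE};
};

// Advance `p` from `from` to `to` only if it is currently in `from`.
inline bool transition(Partition& p, PartitionState from, PartitionState to) {
    // States only move forward, one step at a time.
    assert(static_cast<int>(to) == static_cast<int>(from) + 1);
    return p.state.compare_exchange_strong(from, to,
                                           std::memory_order_acq_rel,
                                           std::memory_order_acquire);
}

// One pass of the background check for a single partition.
// Returns true if this call flushed the partition.
template <typename WriteFn, typename ReclaimFn>
bool flush_if_sealed(Partition& p, WriteFn write_to_disk, ReclaimFn reclaim_arena) {
    if (!transition(p, PartitionState::SEALED, PartitionState::FLUSHING)) {
        return false;  // still ACTIVE, or already being flushed
    }
    write_to_disk();
    const bool done = transition(p, PartitionState::FLUSHING, PartitionState::FLUSHED);
    reclaim_arena();   // stands in for ArenaAllocator::reset()
    return done;
}
```

Because the ingestion side never takes a lock in this scheme, sealing a partition is a single atomic transition from ACTIVE to SEALED, and the background thread picks it up on its next poll.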