
HDB Tiered Storage: From Memory to Parquet to S3

Time-series databases face a fundamental tension: real-time queries need data in memory, but storing everything in RAM is prohibitively expensive. ZeptoDB solves this with a three-tier storage architecture that moves data through hot, warm, and cold stages automatically.


┌────────┬─────────────────────┬───────────────────────┐
│ HOT    │ RDB (In-Memory)     │ Active + recent data  │
│        │ ArenaAllocator      │ Sub-microsecond reads │
├────────┼─────────────────────┼───────────────────────┤
│ WARM   │ HDB (NVMe)          │ Sealed partitions     │
│        │ mmap + LZ4          │ ~678µs reads          │
├────────┼─────────────────────┼───────────────────────┤
│ COLD   │ S3 (Parquet)        │ Historical archive    │
│        │ SNAPPY/ZSTD         │ DuckDB/Polars query   │
└────────┴─────────────────────┴───────────────────────┘

Data flows downward automatically. The FlushManager runs a background thread that checks for sealed partitions every second and flushes them to disk — no mutex on the hot ingestion path.


ZeptoDB supports three storage modes, configured per deployment:

| Mode | Description | Query Target |
|---|---|---|
| PURE_IN_MEMORY | Extreme HFT tick processing | RDB only |
| TIERED | RDB (today) + HDB (history), async merge | RDB + HDB |
| PURE_ON_DISK | Backtesting / deep learning feature generation | HDB only |
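
The mode table can be read as a simple mapping from mode to the stores a query has to consult. The sketch below is only an illustration of that mapping: `StorageMode`, `QueryTargets`, and `targets_for` are hypothetical names, since the article lists the modes but not how ZeptoDB spells them in code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical enum mirroring the three modes in the table above.
enum class StorageMode : std::uint8_t { PureInMemory, Tiered, PureOnDisk };

// Which stores a query must consult (illustrative type, not ZeptoDB's API).
struct QueryTargets {
    bool rdb;  // in-memory store
    bool hdb;  // on-disk historical partitions
};

// Translate a deployment mode into its query targets.
constexpr QueryTargets targets_for(StorageMode mode) {
    switch (mode) {
        case StorageMode::PureInMemory: return {true, false};
        case StorageMode::Tiered:       return {true, true};
        case StorageMode::PureOnDisk:   return {false, true};
    }
    return {false, false};  // not reached for valid enum values
}
```

In TIERED mode both flags are set, which corresponds to the merged RDB + HDB query path in the table.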

Inspired by kdb+'s splayed table approach, each column is stored as an independent binary file:

hdb_data/
  {symbol_id}/
    {hour_epoch}/
      timestamp.bin   ← per-column binary
      price.bin
      volume.bin
      msg_type.bin
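
To make the layout concrete, here is a minimal sketch of how a column file path could be derived. It assumes that `{hour_epoch}` means whole hours since the Unix epoch and that timestamps are non-negative nanoseconds; the article does not state either, and `hour_epoch_of` and `column_path` are hypothetical helpers.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Assumption: timestamps are nanoseconds since the Unix epoch.
constexpr std::int64_t kNanosPerHour = 3600LL * 1000000000LL;

// Assumption: hour_epoch is the number of whole hours since the Unix epoch.
inline std::int64_t hour_epoch_of(std::int64_t ts_nanos) {
    return ts_nanos / kNanosPerHour;
}

// Build "<root>/<symbol_id>/<hour_epoch>/<column>.bin".
inline std::string column_path(const std::string& root,
                               std::uint32_t symbol_id,
                               std::int64_t hour_epoch,
                               const std::string& column) {
    return root + "/" + std::to_string(symbol_id) + "/" +
           std::to_string(hour_epoch) + "/" + column + ".bin";
}
```

For example, `column_path("hdb_data", 42, 473000, "price")` yields `hdb_data/42/473000/price.bin`.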

Each file starts with a 32-byte header, exactly half of a 64-byte cache line:

| Field | Size | Description |
|---|---|---|
| magic | 5B | APEXH |
| version | 1B | Format version (v1) |
| col_type | 1B | ColumnType enum |
| compression | 1B | 0=None, 1=LZ4 |
| row_count | 8B | Number of rows |
| data_size | 8B | Compressed data size |
| uncompressed_size | 8B | Original size |
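
The field sizes add up to 5 + 1 + 1 + 1 + 8 + 8 + 8 = 32 bytes. One way such a header might be declared is sketched below. It assumes the fields appear on disk in the order the table lists them and in native byte order; `ColumnFileHeader`, `make_header`, and `has_valid_magic` are hypothetical names rather than ZeptoDB's actual definitions.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Assumed layout: fields in table order. The first four fields fill 8 bytes,
// so the three 64-bit fields are naturally aligned and no padding is added.
struct ColumnFileHeader {
    char          magic[5];           // "APEXH", no NUL terminator
    std::uint8_t  version;            // format version (1)
    std::uint8_t  col_type;           // ColumnType enum value
    std::uint8_t  compression;        // 0 = none, 1 = LZ4
    std::uint64_t row_count;          // number of rows
    std::uint64_t data_size;          // size of the stored (possibly compressed) data
    std::uint64_t uncompressed_size;  // original size
};

static_assert(sizeof(ColumnFileHeader) == 32, "header must be exactly 32 bytes");
static_assert(std::is_trivially_copyable<ColumnFileHeader>::value,
              "header must be safe to read and write as raw bytes");

// Fill in a header for a column about to be written.
inline ColumnFileHeader make_header(std::uint8_t col_type, std::uint8_t compression,
                                    std::uint64_t rows, std::uint64_t data_size,
                                    std::uint64_t uncompressed_size) {
    ColumnFileHeader h{};
    std::memcpy(h.magic, "APEXH", sizeof(h.magic));
    h.version = 1;
    h.col_type = col_type;
    h.compression = compression;
    h.row_count = rows;
    h.data_size = data_size;
    h.uncompressed_size = uncompressed_size;
    return h;
}

// Cheap sanity check before trusting the rest of the header.
inline bool has_valid_magic(const ColumnFileHeader& h) {
    return std::memcmp(h.magic, "APEXH", sizeof(h.magic)) == 0;
}
```

The `static_assert` keeps the 32-byte guarantee from silently breaking if a field is added or reordered.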

Per-column separation means queries only mmap the columns they need. A SELECT avg(price) FROM trades never touches the volume.bin file.


LZ4 block compression is applied automatically, with a smart fallback: if the compressed result is larger than the original (e.g., random data), the raw bytes are stored instead.
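
The fallback rule can be sketched independently of the compressor. To stay dependency-free, the example below takes the compressor as a parameter and uses a toy run-length encoder as a stand-in; it is not LZ4, and `encode_with_fallback`, `EncodedColumn`, and `toy_rle` are hypothetical names. The sketch also falls back when the sizes are equal, which the article does not specify.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

using Bytes = std::vector<std::uint8_t>;
using Compressor = std::function<Bytes(const Bytes&)>;

// Values of the header's compression field, per the header table.
constexpr std::uint8_t kCompressionNone = 0;
constexpr std::uint8_t kCompressionLz4  = 1;

struct EncodedColumn {
    std::uint8_t  compression;        // what was actually stored
    Bytes         payload;            // bytes to write after the header
    std::uint64_t uncompressed_size;  // original size
};

// Compress, but keep the raw bytes when compression does not shrink the data.
// In a real build `compress` would wrap LZ4; here it is injected so the
// fallback logic can be exercised on its own.
inline EncodedColumn encode_with_fallback(const Bytes& raw, const Compressor& compress) {
    Bytes packed = compress(raw);
    if (packed.size() >= raw.size()) {
        return {kCompressionNone, raw, raw.size()};
    }
    return {kCompressionLz4, std::move(packed), raw.size()};
}

// Stand-in compressor (NOT LZ4): emits (run length, value) pairs.
inline Bytes toy_rle(const Bytes& in) {
    Bytes out;
    for (std::size_t i = 0; i < in.size();) {
        std::size_t run = 1;
        while (i + run < in.size() && in[i + run] == in[i] && run < 255) ++run;
        out.push_back(static_cast<std::uint8_t>(run));
        out.push_back(in[i]);
        i += run;
    }
    return out;
}
```

With the toy encoder, a column of repeated values shrinks and is stored compressed, while a column with no runs doubles in size and is therefore stored raw with the compression field set to 0.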

Time-series data compresses exceptionally well — sequential timestamps and correlated prices yield a 0.31 compression ratio (69% savings):

| Metric | Value |
|---|---|
| Compression ratio | 0.31 (69% savings) |
| Compression throughput | ~1,128 MB/sec |
| Decompression | Near memory bandwidth |

HDB reads use mmap(MAP_PRIVATE) with madvise(MADV_SEQUENTIAL):

  • Uncompressed data: zero-copy direct pointer return — the OS page cache does all the work
  • LZ4 compressed data: decompress to buffer, then return pointer
  • RAII cleanup: MappedColumn destructor calls munmap + close automatically
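
The read path described above might look roughly like the following on a POSIX system. The class name `MappedColumn` comes from the article, but its interface here is an assumption, and the sketch maps the whole file without parsing the header or handling the LZ4 case.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <stdexcept>
#include <string>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// RAII wrapper around a read-only, private file mapping (interface assumed).
class MappedColumn {
public:
    explicit MappedColumn(const std::string& path) {
        fd_ = ::open(path.c_str(), O_RDONLY);
        if (fd_ < 0) throw std::runtime_error("open failed: " + path);

        struct stat st{};
        if (::fstat(fd_, &st) != 0 || st.st_size == 0) {
            ::close(fd_);  // the destructor does not run if the constructor throws
            throw std::runtime_error("fstat failed or empty file: " + path);
        }
        size_ = static_cast<std::size_t>(st.st_size);

        void* p = ::mmap(nullptr, size_, PROT_READ, MAP_PRIVATE, fd_, 0);
        if (p == MAP_FAILED) {
            ::close(fd_);
            throw std::runtime_error("mmap failed: " + path);
        }
        data_ = p;

        // Advisory hint only: ask the kernel to read ahead aggressively.
        ::madvise(data_, size_, MADV_SEQUENTIAL);
    }

    // Unmap and close automatically when the object goes out of scope.
    ~MappedColumn() {
        ::munmap(data_, size_);
        ::close(fd_);
    }

    MappedColumn(const MappedColumn&) = delete;
    MappedColumn& operator=(const MappedColumn&) = delete;

    // Zero-copy view of the file contents, backed by the OS page cache.
    const unsigned char* data() const { return static_cast<const unsigned char*>(data_); }
    std::size_t size() const { return size_; }

private:
    int         fd_   = -1;
    void*       data_ = nullptr;
    std::size_t size_ = 0;
};
```

Copying is disabled so that two objects can never unmap the same region; a production version would probably add move operations as well.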

| Metric | Value |
|---|---|
| Write throughput (1M rows, uncompressed) | 4,785 MB/sec (near NVMe theoretical bandwidth) |
| Write throughput (1M rows, LZ4) | 1,128 MB/sec (includes compression CPU cost) |
| In-memory COUNT (1M rows) | 1.11 µs |
| Tiered HDB COUNT (1M rows) | 677.60 µs (~600× slower, still sub-ms) |
| In-memory VWAP (1M rows) | 44.84 µs |

| Aspect | kdb+ | ZeptoDB |
|---|---|---|
| Flush throughput | ~1–2 GB/sec | ~4.7 GB/sec |
| Partition granularity | Day | Hour (better for HFT) |
| Compression | Separate gzip step | LZ4 built-in |
| Header size | 8 bytes | 32 bytes (magic + version + metadata) |
| Memory management | Reference counting | RAII mmap |

4.7 GB/sec flush

Uncompressed writes approach the theoretical bandwidth of an NVMe SSD by using direct write() calls and a custom binary format.

69% disk savings

LZ4 block compression is highly effective on time-series data — sequential timestamps and correlated prices.

Zero-copy reads

mmap with MADV_SEQUENTIAL lets the OS page cache handle prefetching. Uncompressed columns return direct pointers.

Hour-level partitions

Finer than kdb+'s day-level partitioning, and better suited to HFT data with millions of ticks per hour.


The FlushManager drives the state machine:

ACTIVE       →  SEALED      →  FLUSHING     →  FLUSHED
│               │              │               │
│ ingestion     │ bg check     │ write disk    │ reclaim arena
│ (no lock)     │ (1s poll)    │ (async)       │ (reset())

Single background thread, no locking on the hot path. After flush completes, ArenaAllocator::reset() reclaims memory immediately.
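
One lock-free way to express these transitions is an atomic state per partition, advanced with compare-and-swap. The sketch below is an assumption about how this could be done, not ZeptoDB's actual code: `Partition`, `transition`, `seal`, and `flush_if_sealed` are hypothetical names, and the disk write and arena reset are passed in as callbacks.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

enum class PartitionState : std::uint8_t { Active, Sealed, Flushing, Flushed };

// Hypothetical partition handle; only the lifecycle state is modelled here.
struct Partition {
    std::atomic<PartitionState> state{PartitionState::Active};
};

// Move from `from` to `to` only if the partition is currently in `from`.
inline bool transition(Partition& p, PartitionState from, PartitionState to) {
    return p.state.compare_exchange_strong(from, to, std::memory_order_acq_rel);
}

// Ingestion side: one atomic operation, no mutex.
inline bool seal(Partition& p) {
    return transition(p, PartitionState::Active, PartitionState::Sealed);
}

// One pass of the background check for a single partition.
// Returns true if this call performed the flush.
template <class WriteFn, class ReclaimFn>
bool flush_if_sealed(Partition& p, WriteFn write_to_disk, ReclaimFn reclaim_arena) {
    if (!transition(p, PartitionState::Sealed, PartitionState::Flushing)) {
        return false;  // still active, or already flushing or flushed
    }
    write_to_disk();   // e.g. write the per-column files
    p.state.store(PartitionState::Flushed, std::memory_order_release);
    reclaim_arena();   // e.g. ArenaAllocator::reset()
    return true;
}
```

A FlushManager-style thread would call `flush_if_sealed` for each partition once per polling interval. Because the SEALED → FLUSHING step is a compare-and-swap, a partition is flushed at most once even if the check runs again before the write finishes.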

