ZeptoDB’s proprietary binary format (.bin) is optimized for maximum-speed local I/O, but it’s opaque to external tools. By adding Apache Parquet as a storage format with automatic S3 upload, historical data becomes queryable from DuckDB, Polars, Spark, and any Arrow-compatible tool — without touching ZeptoDB at all.
When a partition is sealed, the FlushManager routes it based on the configured output format:

    Partition (SEALED)
    │
    ├──[BINARY]───→ HDBWriter → {base}/{symbol}/{hour}/{col}.bin
    │               (LZ4 compressed, fastest local I/O)
    │
    ├──[PARQUET]──→ ParquetWriter → {base}/{symbol}/{hour}/{symbol}_{hour}.parquet
    │               │   (Arrow compatible, SNAPPY/ZSTD)
    │               └──[S3]──→ S3Sink → s3://{bucket}/{prefix}/{symbol}/{hour}.parquet
    │
    └──[BOTH]─────→ BINARY + PARQUET stored simultaneously

BOTH mode gives you the best of both worlds: local binary for real-time queries, Parquet on S3 for external analytics and disaster recovery.
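As a rough illustration of the routing above, the following Python sketch lists the paths a sealed partition would produce for each output format. It is a model of the diagram only, not ZeptoDB's C++ FlushManager; the names `OutputFormat` and `flush_targets`, and the `(bucket, prefix)` tuple used for S3, are hypothetical.

```python
from enum import Enum


class OutputFormat(Enum):
    """Hypothetical stand-in for the three routing choices in the diagram."""
    BINARY = "binary"
    PARQUET = "parquet"
    BOTH = "both"


def flush_targets(fmt, base, symbol, hour, columns, s3=None):
    """Return the paths a sealed partition would be written to.

    `s3` is an optional (bucket, prefix) tuple. Only the Parquet file is
    uploaded, matching the [S3] branch under ParquetWriter in the diagram.
    """
    targets = []
    if fmt in (OutputFormat.BINARY, OutputFormat.BOTH):
        # Binary output: one .bin file per column.
        targets += [f"{base}/{symbol}/{hour}/{col}.bin" for col in columns]
    if fmt in (OutputFormat.PARQUET, OutputFormat.BOTH):
        # Parquet output: one file per partition.
        targets.append(f"{base}/{symbol}/{hour}/{symbol}_{hour}.parquet")
        if s3 is not None:
            bucket, prefix = s3
            targets.append(f"s3://{bucket}/{prefix}/{symbol}/{hour}.parquet")
    return targets


if __name__ == "__main__":
    for path in flush_targets(OutputFormat.BOTH, "/data/hdb", 1, 1742648000,
                              ["price", "volume"], s3=("zepto-hdb", "prod/hdb")):
        print(path)
```

In this model, BOTH is simply the union of the BINARY and PARQUET branches, which is why the S3 upload appears only when Parquet output is enabled.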
ZeptoDB column types map cleanly to Arrow/Parquet types:
| ZeptoDB ColumnType | Arrow DataType | Parquet Physical Type |
|---|---|---|
| INT32 | int32() | INT32 |
| INT64 | int64() | INT64 |
| FLOAT32 | float32() | FLOAT |
| FLOAT64 | float64() | DOUBLE |
| TIMESTAMP_NS | timestamp(ns, UTC) | INT64 (TIMESTAMP) |
| SYMBOL | uint32() | INT32 |
| BOOL | boolean() | BOOLEAN |
Timestamps preserve nanosecond precision and UTC timezone through the full pipeline.
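Consumers need to take some care to keep that precision on the way out. Python's `datetime`, for example, resolves only microseconds, so one option is to carry the sub-second part as an integer. A minimal sketch, assuming the timestamp is available as a raw int64 count of nanoseconds since the Unix epoch; `split_ns` and `format_ns` are illustrative helper names, not part of ZeptoDB:

```python
from datetime import datetime, timezone

NS_PER_SECOND = 1_000_000_000


def split_ns(ts_ns: int) -> tuple[datetime, int]:
    """Split nanoseconds since the epoch into (UTC datetime at whole seconds, leftover ns)."""
    seconds, nanos = divmod(ts_ns, NS_PER_SECOND)
    return datetime.fromtimestamp(seconds, tz=timezone.utc), nanos


def format_ns(ts_ns: int) -> str:
    """Render an ISO 8601 UTC string with all nine fractional digits."""
    dt, nanos = split_ns(ts_ns)
    return f"{dt:%Y-%m-%dT%H:%M:%S}.{nanos:09d}Z"


if __name__ == "__main__":
    print(format_ns(1_742_648_000_123_456_789))
```

Converting the full value to a float or to a `datetime` directly would silently drop the last three digits, which is why the remainder is kept separately here.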
| Codec | Speed | Ratio | Best For |
|---|---|---|---|
| SNAPPY (default) | ★★★★★ | ★★★ | Real-time flush |
| ZSTD | ★★★ | ★★★★★ | Cold storage / S3 long-term |
| LZ4_RAW | ★★★★★ | ★★★ | Maximum-speed compression |
| NONE | — | — | Testing / debugging |
SNAPPY is the default for real-time flush — it’s fast enough to keep up with ingestion. ZSTD is recommended for S3 uploads where compression ratio matters more than speed.

    s3://{bucket}/{prefix}/{symbol_id}/{hour_epoch}.parquet

Examples:

    s3://zepto-hdb/prod/hdb/1/1742648000.parquet
    s3://zepto-hdb/prod/hdb/2/1742648000.parquet

flush_to_buffer() serializes Parquet directly to memory, so no intermediate local file is needed for S3 upload:

    auto buf = parquet_writer.flush_to_buffer(partition);
    s3_sink.upload_buffer(
        reinterpret_cast<const char*>(buf->data()), buf->size(), s3_key);

Credentials are resolved through the standard AWS SDK credential chain: environment variables → ~/.aws/credentials → IAM Role (recommended for production). For self-hosted S3 (MinIO), set endpoint_url and use_path_style = true.
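Because the object layout `s3://{bucket}/{prefix}/{symbol_id}/{hour_epoch}.parquet` is regular, downstream jobs can build and parse object URIs mechanically. The sketch below shows one way to do that in Python; `HdbObject`, `build_uri`, and `parse_uri` are hypothetical helper names, and the pattern assumes that both `symbol_id` and `hour_epoch` are plain decimal integers, as in the examples.

```python
import re
from typing import NamedTuple


class HdbObject(NamedTuple):
    """Components of one Parquet object URI (illustrative type, not a ZeptoDB API)."""
    bucket: str
    prefix: str
    symbol_id: int
    hour_epoch: int


# The prefix may itself contain slashes (e.g. "prod/hdb"), so it is matched
# greedily and the last two path segments are taken as symbol_id and hour_epoch.
_URI_RE = re.compile(
    r"^s3://(?P<bucket>[^/]+)/(?P<prefix>.+)/(?P<symbol_id>\d+)/(?P<hour_epoch>\d+)\.parquet$"
)


def build_uri(obj: HdbObject) -> str:
    """Format the components back into an s3:// URI."""
    return f"s3://{obj.bucket}/{obj.prefix}/{obj.symbol_id}/{obj.hour_epoch}.parquet"


def parse_uri(uri: str) -> HdbObject:
    """Parse a URI in the documented layout, raising ValueError on anything else."""
    m = _URI_RE.match(uri)
    if m is None:
        raise ValueError(f"not an HDB Parquet URI: {uri!r}")
    return HdbObject(m["bucket"], m["prefix"], int(m["symbol_id"]), int(m["hour_epoch"]))


if __name__ == "__main__":
    obj = parse_uri("s3://zepto-hdb/prod/hdb/1/1742648000.parquet")
    print(obj)
    print(build_uri(obj._replace(symbol_id=2)))
```

Parsing into named fields makes it straightforward to filter a bucket listing by symbol or hour before handing the matching URIs to a reader.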

    FlushConfig config;
    config.output_format = HDBOutputFormat::PARQUET;
    config.parquet_config.compression = ParquetCompression::ZSTD;
    config.enable_s3_upload = true;
    config.s3_config.bucket = "zepto-hdb-prod";
    config.s3_config.prefix = "hdb";
    config.s3_config.region = "ap-southeast-1";
    config.delete_local_after_s3 = true;  // save local storage

    FlushManager flush_mgr(pm, hdb_writer, config);
    flush_mgr.start();
    // Automatically saves Parquet and uploads to S3 when partition is sealed

For BOTH mode, set output_format = HDBOutputFormat::BOTH. Binary stays on local NVMe for real-time queries, and Parquet goes to S3 for external analytics.
DuckDB:

    -- 5-minute OHLCV bars from ZeptoDB HDB on S3
    -- The timestamp column is bucketed directly; open and close are taken
    -- from the earliest and latest rows in each bucket.
    SELECT
        time_bucket(INTERVAL '5 minutes', "timestamp") AS bar_time,
        arg_min(price, "timestamp") AS open,
        max(price) AS high,
        min(price) AS low,
        arg_max(price, "timestamp") AS close,
        sum(volume) AS volume
    FROM read_parquet('s3://zepto-hdb-prod/hdb/1/*.parquet')
    GROUP BY bar_time
    ORDER BY bar_time;

Polars:

    import polars as pl

    df = pl.read_parquet("s3://zepto-hdb-prod/hdb/1/1742648000.parquet")
    df = df.with_columns([
        pl.col("price").ewm_mean(span=20).alias("ema20")
    ])

Arrow ecosystem compatible
Standard Parquet files readable by DuckDB, Polars, Spark, pandas, and any Arrow-compatible tool.
Automatic S3 offload
Sealed partitions upload to S3 automatically. Optional local deletion saves NVMe space.
Dual format mode
BOTH mode: binary on local NVMe for sub-millisecond queries, Parquet on S3 for external analytics.
In-memory serialization
flush_to_buffer() serializes Parquet directly to memory for S3 upload — no intermediate local file.