
Parquet on S3: Historical Data Storage for Time-Series

ZeptoDB’s proprietary binary format (.bin) is optimized for maximum-speed local I/O, but it’s opaque to external tools. By adding Apache Parquet as a storage format with automatic S3 upload, historical data becomes queryable from DuckDB, Polars, Spark, and any Arrow-compatible tool — without touching ZeptoDB at all.


When a partition is sealed, the FlushManager routes it based on the configured output format:

Partition (SEALED)
├──[BINARY]───→ HDBWriter → {base}/{symbol}/{hour}/{col}.bin
│               (LZ4 compressed, fastest local I/O)
├──[PARQUET]──→ ParquetWriter → {base}/{symbol}/{hour}/{symbol}_{hour}.parquet
│               │  (Arrow compatible, SNAPPY/ZSTD)
│               └──[S3]──→ S3Sink → s3://{bucket}/{prefix}/{symbol}/{hour}.parquet
└──[BOTH]─────→ BINARY + PARQUET stored simultaneously

BOTH mode gives you the best of both worlds: local binary for real-time queries, Parquet on S3 for external analytics and disaster recovery.
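The routing in the diagram above can be modelled in a few lines. This is a minimal Python sketch for illustration only, not ZeptoDB's FlushManager code; the `OutputFormat` enum and the `destinations` function are hypothetical names, and only the writer and sink names come from the diagram.

```python
from enum import Enum


class OutputFormat(Enum):
    # Hypothetical stand-in for the configured output format.
    BINARY = "binary"
    PARQUET = "parquet"
    BOTH = "both"


def destinations(fmt: OutputFormat, s3_upload: bool) -> list[str]:
    """Return the writers and sinks a sealed partition passes through."""
    targets = []
    if fmt in (OutputFormat.BINARY, OutputFormat.BOTH):
        targets.append("HDBWriter")          # local .bin columns
    if fmt in (OutputFormat.PARQUET, OutputFormat.BOTH):
        targets.append("ParquetWriter")      # local .parquet file
        if s3_upload:
            targets.append("S3Sink")         # S3 hangs off the Parquet branch
    return targets


if __name__ == "__main__":
    for fmt in OutputFormat:
        print(fmt.name, destinations(fmt, s3_upload=True))
```

Note that in this model the S3 sink is only reachable through the Parquet branch, which matches the diagram: a BINARY-only configuration never uploads.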


ZeptoDB column types map cleanly to Arrow/Parquet types:

| ZeptoDB ColumnType | Arrow DataType | Parquet Physical Type |
|---|---|---|
| INT32 | int32() | INT32 |
| INT64 | int64() | INT64 |
| FLOAT32 | float32() | FLOAT |
| FLOAT64 | float64() | DOUBLE |
| TIMESTAMP_NS | timestamp(ns, UTC) | INT64 (TIMESTAMP) |
| SYMBOL | uint32() | INT32 |
| BOOL | boolean() | BOOLEAN |

Timestamps preserve nanosecond precision and UTC timezone through the full pipeline.
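Downstream code has to be careful not to lose that precision. The sketch below, assuming a timestamp arrives as a plain integer count of nanoseconds since the Unix epoch, splits it into a UTC `datetime` (microsecond resolution) plus the leftover nanoseconds. The function name `split_ns_timestamp` is illustrative and not part of any ZeptoDB API.

```python
from datetime import datetime, timezone


def split_ns_timestamp(ts_ns: int) -> tuple[datetime, int]:
    """Split epoch nanoseconds into (UTC datetime, sub-microsecond ns)."""
    seconds, ns_in_second = divmod(ts_ns, 1_000_000_000)
    micros, sub_micro_ns = divmod(ns_in_second, 1_000)
    # Integer arithmetic only: going through float would round the value,
    # because a 64-bit float cannot represent every integer near 1.7e18.
    dt = datetime.fromtimestamp(seconds, tz=timezone.utc).replace(microsecond=micros)
    return dt, sub_micro_ns


if __name__ == "__main__":
    dt, remainder_ns = split_ns_timestamp(1_742_648_000_123_456_789)
    print(dt.isoformat(), remainder_ns)
```

Keeping the value as an integer until the last step is the design choice that matters here; converting to `float` seconds first silently discards the low-order digits.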


| Codec | Speed / Ratio | Best For |
|---|---|---|
| SNAPPY (default) | ★★★★★★★★ | Real-time flush |
| ZSTD | ★★★★★★★★ | Cold storage / S3 long-term |
| LZ4_RAW | ★★★★★★★★ | Maximum-speed compression |
| NONE | | Testing / debugging |

SNAPPY is the default for real-time flush — it’s fast enough to keep up with ingestion. ZSTD is recommended for S3 uploads where compression ratio matters more than speed.


s3://{bucket}/{prefix}/{symbol_id}/{hour_epoch}.parquet
Examples:
s3://zepto-hdb/prod/hdb/1/1742648000.parquet
s3://zepto-hdb/prod/hdb/2/1742648000.parquet
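For tooling that needs to construct or interpret these object paths, the layout can be captured in two small helpers. This is a sketch under the assumption that the layout is exactly `s3://{bucket}/{prefix}/{symbol_id}/{hour_epoch}.parquet` with a non-empty prefix and numeric IDs; `build_s3_uri` and `parse_s3_uri` are hypothetical names.

```python
import re

# Greedy prefix match, so multi-segment prefixes such as "prod/hdb" work.
_URI_RE = re.compile(
    r"^s3://(?P<bucket>[^/]+)/(?P<prefix>.+)/"
    r"(?P<symbol_id>\d+)/(?P<hour_epoch>\d+)\.parquet$"
)


def build_s3_uri(bucket: str, prefix: str, symbol_id: int, hour_epoch: int) -> str:
    """Format the object URI for one symbol and one partition."""
    return f"s3://{bucket}/{prefix.strip('/')}/{symbol_id}/{hour_epoch}.parquet"


def parse_s3_uri(uri: str) -> tuple[str, str, int, int]:
    """Inverse of build_s3_uri; raises ValueError on a non-matching URI."""
    match = _URI_RE.match(uri)
    if match is None:
        raise ValueError(f"not a partition URI: {uri!r}")
    return (
        match["bucket"],
        match["prefix"],
        int(match["symbol_id"]),
        int(match["hour_epoch"]),
    )


if __name__ == "__main__":
    uri = build_s3_uri("zepto-hdb", "prod/hdb", 1, 1742648000)
    print(uri)
    print(parse_s3_uri(uri))
```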

flush_to_buffer() serializes Parquet directly to memory — no intermediate local file needed for S3 upload:

auto buf = parquet_writer.flush_to_buffer(partition);
s3_sink.upload_buffer(
    reinterpret_cast<const char*>(buf->data()), buf->size(), s3_key);

Credentials are resolved through the standard AWS SDK credential chain: environment variables → ~/.aws/credentials → IAM Role (recommended for production). For a self-hosted, S3-compatible store such as MinIO, set endpoint_url and use_path_style = true.
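The path-style setting changes where the bucket name appears in the request URL, which is why self-hosted stores usually need it. The sketch below illustrates the two standard S3 addressing styles; it is not ZeptoDB or AWS SDK code, `object_url` is a hypothetical helper, and the `http://localhost:9000` endpoint is an assumed example value.

```python
from typing import Optional


def object_url(
    bucket: str,
    key: str,
    region: str,
    endpoint_url: Optional[str] = None,
    use_path_style: bool = False,
) -> str:
    """Build the HTTP URL for an object under either addressing style."""
    if endpoint_url is None:
        # Default AWS endpoint for the region.
        endpoint_url = f"https://s3.{region}.amazonaws.com"
    scheme, host = endpoint_url.rstrip("/").split("://", 1)
    if use_path_style:
        # Path style: the bucket is the first path segment.
        return f"{scheme}://{host}/{bucket}/{key}"
    # Virtual-hosted style: the bucket becomes a subdomain, which a
    # single-host deployment typically cannot resolve.
    return f"{scheme}://{bucket}.{host}/{key}"


if __name__ == "__main__":
    key = "hdb/1/1742648000.parquet"
    print(object_url("zepto-hdb-prod", key, "ap-southeast-1"))
    print(object_url("zepto-hdb-prod", key, "us-east-1",
                     endpoint_url="http://localhost:9000", use_path_style=True))
```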


FlushConfig config;
config.output_format = HDBOutputFormat::PARQUET;
config.parquet_config.compression = ParquetCompression::ZSTD;
config.enable_s3_upload = true;
config.s3_config.bucket = "zepto-hdb-prod";
config.s3_config.prefix = "hdb";
config.s3_config.region = "ap-southeast-1";
config.delete_local_after_s3 = true; // save local storage
FlushManager flush_mgr(pm, hdb_writer, config);
flush_mgr.start();
// Automatically saves Parquet and uploads to S3 when partition is sealed

For BOTH mode, set output_format = HDBOutputFormat::BOTH — binary stays on local NVMe for real-time queries, Parquet goes to S3 for external analytics.


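-- 5-minute OHLCV bars from ZeptoDB HDB on S3
-- Assumes "timestamp" is read as integer nanoseconds since the epoch
SELECT
    time_bucket(INTERVAL '5 minutes', epoch_ms("timestamp" // 1000000)) AS bar_time,
    arg_min(price, "timestamp") AS open,   -- price at the earliest tick in the bar
    max(price) AS high,
    min(price) AS low,
    arg_max(price, "timestamp") AS close,  -- price at the latest tick in the bar
    sum(volume) AS volume
FROM read_parquet('s3://zepto-hdb-prod/hdb/1/*.parquet')
GROUP BY bar_time
ORDER BY bar_time;

# Polars: load one partition file and add a 20-period EMA of price
import polars as pl

df = pl.read_parquet("s3://zepto-hdb-prod/hdb/1/1742648000.parquet")
df = df.with_columns([
    pl.col("price").ewm_mean(span=20).alias("ema20")
])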
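To make the bar aggregation in the DuckDB query above concrete, here is a plain-Python sketch of the same idea: group ticks into five-minute buckets aligned to the epoch, then take first, max, min, last, and sum within each bucket. It assumes ticks are `(timestamp_ns, price, volume)` tuples; `ohlcv_bars` is an illustrative name and the sample ticks in the usage are made up.

```python
BAR_NS = 5 * 60 * 1_000_000_000  # five minutes in nanoseconds


def ohlcv_bars(ticks):
    """Aggregate (timestamp_ns, price, volume) ticks into 5-minute bars.

    Returns a dict keyed by bar start time (ns), in ascending order.
    """
    bars = {}
    # Sort by time so "open" and "close" are the first and last tick.
    for ts_ns, price, volume in sorted(ticks, key=lambda t: t[0]):
        bar_start = ts_ns - ts_ns % BAR_NS  # floor to the bucket boundary
        bar = bars.get(bar_start)
        if bar is None:
            bars[bar_start] = {
                "open": price, "high": price, "low": price,
                "close": price, "volume": volume,
            }
        else:
            bar["high"] = max(bar["high"], price)
            bar["low"] = min(bar["low"], price)
            bar["close"] = price
            bar["volume"] += volume
    return bars


if __name__ == "__main__":
    sample = [
        (300_000_000_000, 11.0, 4),
        (0, 10.0, 1),
        (1_000_000_000, 12.0, 2),
        (299_999_999_999, 9.0, 3),
    ]
    for start, bar in ohlcv_bars(sample).items():
        print(start, bar)
```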

Arrow ecosystem compatible

Standard Parquet files readable by DuckDB, Polars, Spark, pandas, and any Arrow-compatible tool.

Automatic S3 offload

Sealed partitions upload to S3 automatically. Optional local deletion saves NVMe space.

Dual format mode

BOTH mode: binary on local NVMe for sub-millisecond queries, Parquet on S3 for external analytics.

In-memory serialization

flush_to_buffer() serializes Parquet directly to memory for S3 upload — no intermediate local file.


Related: HDB Tiered Storage · Python Ecosystem Integration · DuckDB Embedding