Lock-Free Ingestion at 5.52M Events/sec
ZeptoDB ingests 5.52 million events per second on a single node. This post explains how — from the lock-free ring buffer to Highway SIMD batch copy to the LLVM JIT query engine that processes the data.
The Pipeline
```
Feed Handler → Ring Buffer (MPMC) → Drain Thread → Column Store → Query Engine
                                         ↓
                                        WAL
```

Every component on this path is designed for zero allocation and zero contention.
Ring Buffer: Lock-Free MPMC
The ingestion entry point is a multi-producer, multi-consumer (MPMC) ring buffer. Multiple feed handlers can write concurrently without locks:
- Atomic sequence numbers for coordination — no mutex, no spinlock
- Power-of-2 sizing for bitwise modulo (mask instead of division)
- Cache-line padding between producer and consumer counters to prevent false sharing
- Batch drain: the consumer thread pulls up to N items per iteration, amortizing the atomic overhead
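The four bullets above fit together in a Vyukov-style bounded MPMC queue. This is a sketch of that design, not ZeptoDB's actual code — the names (`MpmcRing`, `try_push`, `try_pop`) are illustrative — but it shows the per-slot atomic sequence numbers, the power-of-2 mask, and the cache-line padding between the producer and consumer counters:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

template <typename T>
class MpmcRing {
  struct Cell {
    std::atomic<uint64_t> seq;  // per-slot sequence number for coordination
    T value;
  };
  static constexpr size_t kCacheLine = 64;

  std::vector<Cell> cells_;
  const uint64_t mask_;  // capacity - 1; capacity must be a power of 2
  alignas(kCacheLine) std::atomic<uint64_t> head_{0};  // producer counter
  alignas(kCacheLine) std::atomic<uint64_t> tail_{0};  // consumer counter

 public:
  explicit MpmcRing(size_t capacity)
      : cells_(capacity), mask_(capacity - 1) {
    for (size_t i = 0; i < capacity; ++i)
      cells_[i].seq.store(i, std::memory_order_relaxed);
  }

  bool try_push(const T& v) {
    uint64_t pos = head_.load(std::memory_order_relaxed);
    for (;;) {
      Cell& c = cells_[pos & mask_];  // bitwise modulo, no division
      uint64_t seq = c.seq.load(std::memory_order_acquire);
      intptr_t diff = (intptr_t)seq - (intptr_t)pos;
      if (diff == 0) {
        // Slot is free: claim it by advancing the producer counter.
        if (head_.compare_exchange_weak(pos, pos + 1,
                                        std::memory_order_relaxed))
          break;
      } else if (diff < 0) {
        return false;  // buffer full
      } else {
        pos = head_.load(std::memory_order_relaxed);  // lost the race, retry
      }
    }
    Cell& c = cells_[pos & mask_];
    c.value = v;
    c.seq.store(pos + 1, std::memory_order_release);  // publish to consumers
    return true;
  }

  bool try_pop(T& out) {
    uint64_t pos = tail_.load(std::memory_order_relaxed);
    for (;;) {
      Cell& c = cells_[pos & mask_];
      uint64_t seq = c.seq.load(std::memory_order_acquire);
      intptr_t diff = (intptr_t)seq - (intptr_t)(pos + 1);
      if (diff == 0) {
        if (tail_.compare_exchange_weak(pos, pos + 1,
                                        std::memory_order_relaxed))
          break;
      } else if (diff < 0) {
        return false;  // buffer empty
      } else {
        pos = tail_.load(std::memory_order_relaxed);
      }
    }
    Cell& c = cells_[pos & mask_];
    out = c.value;
    c.seq.store(pos + mask_ + 1, std::memory_order_release);  // recycle slot
    return true;
  }
};
```

Batch draining then amounts to the consumer calling `try_pop` in a loop, up to N items per iteration, so the per-item atomic cost is amortized across the batch.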
Drain Thread: Dedicated Core
A single background thread drains the ring buffer into the column store. On bare-metal deployments, this thread is pinned to a dedicated CPU core with pthread_setaffinity_np — no context switches, no scheduler interference.
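The pinning itself is a few lines of Linux API. A minimal sketch — `pin_to_core` is an illustrative helper, with `core` being whatever dedicated core the drain thread is assigned:

```cpp
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to a single CPU core (Linux-specific).
// Returns 0 on success, an errno value (e.g. EINVAL) otherwise.
static int pin_to_core(int core) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(core, &set);  // affinity mask containing exactly one core
  return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
}
```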
Column Store: Arena Allocator
Each partition uses an arena allocator — a pre-allocated memory block with bump-pointer allocation. Appending a tick is a pointer increment, not a malloc call. The columnar layout means each field (timestamp, price, volume, symbol) is stored in a contiguous array, maximizing SIMD and cache efficiency.
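A bump-pointer arena can be sketched in a dozen lines. This is a deliberately minimal version — ZeptoDB's arena presumably adds chunk growth and per-partition resets — but the hot path really is just an aligned pointer increment:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Minimal bump-pointer arena: one pre-allocated block; each
// allocation rounds the offset up to `align` and advances it.
class Arena {
  uint8_t* base_;
  size_t cap_;
  size_t off_ = 0;

 public:
  explicit Arena(size_t cap)
      : base_(static_cast<uint8_t*>(std::malloc(cap))), cap_(cap) {}
  ~Arena() { std::free(base_); }

  void* alloc(size_t n, size_t align = 8) {
    size_t p = (off_ + align - 1) & ~(align - 1);  // round up to alignment
    if (p + n > cap_) return nullptr;              // arena exhausted
    off_ = p + n;                                  // bump the pointer
    return base_ + p;
  }
};
```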
Highway SIMD
ZeptoDB uses Google’s Highway library for portable SIMD. Highway compiles each function for multiple targets (SSE4, AVX2, AVX-512, NEON) and selects the optimal one at runtime.
Vectorized Operations
| Operation | Technique | Cache-Hot Speedup |
|---|---|---|
| sum_i64 | 4 independent accumulators + ReduceSum | 4.2x |
| filter_gt | Mask → StoreMaskBits → bit traversal | 2.6x |
| vwap | ConvertTo(i64→f64) + MulAdd FMA pipeline | 2.5x |
The key insight: 4 independent accumulators for sum. A single accumulator creates a pipeline dependency (each add waits for the previous). Four accumulators let the CPU execute 4 additions in parallel via instruction-level parallelism.
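The dependency-breaking structure is easiest to see in scalar form. Highway applies the same idea with vector registers; here plain `int64_t` accumulators stand in:

```cpp
#include <cstddef>
#include <cstdint>

// Four independent accumulators: the four adds in the loop body have
// no data dependency on each other, so the CPU can issue them in
// parallel instead of serializing on a single accumulator.
int64_t sum_i64(const int64_t* data, size_t n) {
  int64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
  size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    s0 += data[i];
    s1 += data[i + 1];
    s2 += data[i + 2];
    s3 += data[i + 3];
  }
  for (; i < n; ++i) s0 += data[i];  // scalar tail
  return (s0 + s1) + (s2 + s3);      // single reduction at the end
}
```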
For filter, SIMD replaces scalar if branches with mask-based predication — eliminating branch mispredictions entirely.
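The predication idea also reads clearly in scalar form. Highway's Mask/StoreMaskBits version does this per vector lane; in this sketch, the comparison result is used as an arithmetic 0/1 instead of a branch:

```cpp
#include <cstddef>
#include <cstdint>

// Branch-free filter: the store is unconditional, and the output
// cursor advances by the 0/1 result of the comparison. No branch,
// so no branch mispredictions. Returns the number of matches.
size_t filter_gt(const int64_t* in, size_t n, int64_t threshold,
                 uint32_t* out) {
  size_t count = 0;
  for (size_t i = 0; i < n; ++i) {
    out[count] = static_cast<uint32_t>(i);  // always write the index
    count += (in[i] > threshold);           // advance only on a match
  }
  return count;
}
```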
Cache Behavior
On cache-resident data (100K rows, ~800KB — fits in L2):
- SIMD delivers 2.5-4.2x speedup — compute-bound, SIMD wins big
On memory-bound data (10M rows, ~80MB — exceeds L3):
- Pure reads (sum) are bandwidth-limited — only 1.2x
- Compute-intensive ops (filter, VWAP) still improve 1.8-2.3x
This matches the DataBlock pipeline design: queries process 8192-row blocks that fit in L1/L2 cache, where SIMD is maximally effective.
LLVM JIT
Filter expressions are compiled to native code via LLVM OrcJIT v2:
```
SQL WHERE clause → Parser → AST → LLVM IR → Native function pointer
```

The compiled function has signature bool (*)(int64_t price, int64_t volume) — a direct function pointer call, no virtual dispatch. Compilation takes ~2.6ms; the result is cached for repeated queries.
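How the cached function pointer gets used can be sketched without the OrcJIT plumbing. Here `fake_compile` is a hypothetical stand-in for the parser → IR → codegen path and returns a hand-written filter; everything around it shows the compile-once, call-directly pattern:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// The JIT's output type: a plain function pointer, called directly.
using FilterFn = bool (*)(int64_t price, int64_t volume);

// Stand-in for JIT output, as if compiled from "price > 100".
static bool price_gt_100(int64_t price, int64_t /*volume*/) {
  return price > 100;
}

// Stand-in for the real compile path (~2.6 ms per expression).
static FilterFn fake_compile(const std::string& /*where_clause*/) {
  return &price_gt_100;
}

// Compile on first sight of an expression; repeated queries hit the cache.
FilterFn get_filter(const std::string& where_clause) {
  static std::unordered_map<std::string, FilterFn> cache;
  auto it = cache.find(where_clause);
  if (it != cache.end()) return it->second;
  FilterFn fn = fake_compile(where_clause);
  cache[where_clause] = fn;
  return fn;
}
```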
Feed Handlers
Built-in parsers for common market data protocols:
| Protocol | Throughput | Parse Latency |
|---|---|---|
| FIX 4.4 | 1.1M msg/sec | 420ns |
| ITCH | 2.8M msg/sec | 180ns |
| Binance WS | 850K msg/sec | 580ns |
| Kafka | 3.2M msg/sec | batch |
Each feed handler runs in its own thread and writes directly to the ring buffer — no intermediate queue, no serialization.
Bare-Metal Tuning
For maximum throughput on dedicated hardware:
- CPU pinning: Feed handler threads on producer cores, drain thread on a dedicated core
- NUMA awareness: Allocate ring buffer and column store on the same NUMA node
- Huge pages: 2MB huge pages for arena allocator — fewer TLB misses
- io_uring: Async WAL writes via io_uring — no fsync blocking the hot path
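The huge-page arena reservation can be sketched with mmap. A hedged example — MAP_HUGETLB needs hugepages pre-reserved by the administrator, so this falls back to a normal mapping plus MADV_HUGEPAGE (transparent huge pages); the 2MB size and flags are illustrative, not ZeptoDB's actual settings:

```cpp
#include <sys/mman.h>
#include <cstddef>

// Reserve `bytes` of anonymous memory for the arena, preferring
// explicit 2MB hugepages and falling back to THP via madvise.
void* arena_reserve(size_t bytes) {
  void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
  if (p == MAP_FAILED) {
    // No reserved hugepages: take a normal mapping and ask the
    // kernel to back it with transparent huge pages.
    p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return nullptr;
    madvise(p, bytes, MADV_HUGEPAGE);  // best-effort; result ignored
  }
  return p;
}
```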
See Bare Metal Tuning → for the full guide.
Get started: Quick Start → · Feed Handler Guide →