
Lock-Free Ingestion at 5.52M Events/sec

ZeptoDB ingests 5.52 million events per second on a single node. This post explains how: from the lock-free ring buffer, through Highway SIMD batch copies, to the LLVM JIT query engine that processes the data.


Feed Handler → Ring Buffer (MPMC) → Drain Thread → Column Store → Query Engine
                                         ↘ WAL

Every component on this path is designed for zero allocation and zero contention.

The ingestion entry point is a multi-producer, multi-consumer (MPMC) ring buffer. Multiple feed handlers can write concurrently without locks:

  • Atomic sequence numbers for coordination — no mutex, no spinlock
  • Power-of-2 sizing for bitwise modulo (mask instead of division)
  • Cache-line padding between producer and consumer counters to prevent false sharing
  • Batch drain: the consumer thread pulls up to N items per iteration, amortizing the atomic overhead
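The bullets above describe a classic bounded MPMC design. A minimal sketch, in the spirit of Vyukov's bounded queue (per-slot atomic sequence numbers, power-of-2 mask, padded counters) — the names are illustrative, not ZeptoDB's actual API:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative bounded MPMC ring: atomic sequence numbers coordinate
// producers and consumers without a mutex or spinlock.
template <typename T, size_t CapacityPow2>
class MpmcRing {
    static_assert((CapacityPow2 & (CapacityPow2 - 1)) == 0,
                  "capacity must be a power of 2 so index & mask replaces %");
    struct Slot {
        std::atomic<size_t> seq;
        T value;
    };
public:
    MpmcRing() : slots_(CapacityPow2) {
        for (size_t i = 0; i < CapacityPow2; ++i)
            slots_[i].seq.store(i, std::memory_order_relaxed);
        tail_.store(0, std::memory_order_relaxed);
        head_.store(0, std::memory_order_relaxed);
    }
    bool try_push(const T& v) {
        size_t pos = tail_.load(std::memory_order_relaxed);
        for (;;) {
            Slot& s = slots_[pos & kMask];  // bitwise modulo
            size_t seq = s.seq.load(std::memory_order_acquire);
            intptr_t diff = (intptr_t)seq - (intptr_t)pos;
            if (diff == 0) {  // slot free: claim it with a CAS on the tail
                if (tail_.compare_exchange_weak(pos, pos + 1,
                                                std::memory_order_relaxed)) {
                    s.value = v;
                    s.seq.store(pos + 1, std::memory_order_release);
                    return true;
                }
            } else if (diff < 0) {
                return false;  // buffer full
            } else {
                pos = tail_.load(std::memory_order_relaxed);  // lost the race
            }
        }
    }
    bool try_pop(T& out) {
        size_t pos = head_.load(std::memory_order_relaxed);
        for (;;) {
            Slot& s = slots_[pos & kMask];
            size_t seq = s.seq.load(std::memory_order_acquire);
            intptr_t diff = (intptr_t)seq - (intptr_t)(pos + 1);
            if (diff == 0) {  // slot published: claim it via the head counter
                if (head_.compare_exchange_weak(pos, pos + 1,
                                                std::memory_order_relaxed)) {
                    out = s.value;
                    // mark the slot reusable one full lap later
                    s.seq.store(pos + CapacityPow2, std::memory_order_release);
                    return true;
                }
            } else if (diff < 0) {
                return false;  // buffer empty
            } else {
                pos = head_.load(std::memory_order_relaxed);
            }
        }
    }
private:
    static constexpr size_t kMask = CapacityPow2 - 1;
    std::vector<Slot> slots_;
    // alignas(64) keeps producer and consumer counters on separate cache
    // lines, preventing false sharing between writer and drain threads.
    alignas(64) std::atomic<size_t> tail_;  // producers
    alignas(64) std::atomic<size_t> head_;  // consumers
};
```

A batch drain is then just a loop calling `try_pop` up to N times per wakeup, paying the atomic traffic once per claimed slot but touching the head counter's cache line in a tight, predictable pattern.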

A single background thread drains the ring buffer into the column store. On bare-metal deployments, this thread is pinned to a dedicated CPU core with pthread_setaffinity_np — no context switches, no scheduler interference.
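On Linux, pinning looks roughly like this; `pin_to_core` is an illustrative helper name, not ZeptoDB's API:

```cpp
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to a single core (Linux-only). After this call the
// scheduler will not migrate the thread, so it keeps its cache and avoids
// cross-core context switches.
inline bool pin_to_core(int core_id) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}
```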

Each partition uses an arena allocator — a pre-allocated memory block with bump-pointer allocation. Appending a tick is a pointer increment, not a malloc call. The columnar layout means each field (timestamp, price, volume, symbol) is stored in a contiguous array, maximizing SIMD and cache efficiency.
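A bump-pointer arena is small enough to sketch in full — this is the general pattern, not ZeptoDB's actual allocator:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative arena: one up-front allocation, then every "alloc" is an
// aligned pointer bump. There is no per-object free; reset() reclaims the
// whole block at once.
class Arena {
public:
    explicit Arena(size_t bytes) : buf_(bytes), off_(0) {}

    void* alloc(size_t n, size_t align = alignof(std::max_align_t)) {
        size_t p = (off_ + align - 1) & ~(align - 1);  // round up to alignment
        if (p + n > buf_.size()) return nullptr;       // arena exhausted
        off_ = p + n;                                  // the bump
        return buf_.data() + p;
    }

    void reset() { off_ = 0; }  // drop everything in O(1)

private:
    std::vector<uint8_t> buf_;
    size_t off_;
};
```

In a columnar layout, each field gets its own arena-backed array, so appending a tick is a handful of bumps into contiguous per-column storage.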


ZeptoDB uses Google’s Highway library for portable SIMD. Highway compiles each function for multiple targets (SSE4, AVX2, AVX-512, NEON) and selects the optimal one at runtime.

Operation   Technique                                   Cache-Hot Speedup
sum_i64     4 independent accumulators + ReduceSum      4.2x
filter_gt   Mask → StoreMaskBits → bit traversal        2.6x
vwap        ConvertTo(i64→f64) + MulAdd FMA pipeline    2.5x

The key insight: 4 independent accumulators for sum. A single accumulator creates a pipeline dependency (each add waits for the previous). Four accumulators let the CPU execute 4 additions in parallel via instruction-level parallelism.
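The same trick is visible even in scalar code. This sketch uses plain int64 lanes where Highway uses vector registers, but the dependency-breaking idea is identical:

```cpp
#include <cstddef>
#include <cstdint>

// Four independent dependency chains: each accumulator only waits on its
// own previous add, so the CPU can keep ~4 adds in flight at once instead
// of serializing on a single accumulator.
int64_t sum_4acc(const int64_t* data, size_t n) {
    int64_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += data[i];
        a1 += data[i + 1];
        a2 += data[i + 2];
        a3 += data[i + 3];
    }
    int64_t total = a0 + a1 + a2 + a3;  // the "ReduceSum" step
    for (; i < n; ++i) total += data[i];  // scalar remainder
    return total;
}
```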

For filter, SIMD replaces scalar if branches with mask-based predication — eliminating branch mispredictions entirely.
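A branchless scalar version shows the predication idea: the comparison result drives an index increment instead of an `if`, so there is nothing for the branch predictor to mispredict. (The SIMD version does the same with a lane mask and StoreMaskBits; this helper is illustrative.)

```cpp
#include <cstddef>
#include <cstdint>

// Writes the indices of all rows with in[i] > threshold into out_idx and
// returns the match count. The write is unconditional; only the cursor k
// advances conditionally, and that condition is arithmetic, not a branch.
size_t filter_gt(const int64_t* in, size_t n, int64_t threshold,
                 uint32_t* out_idx) {
    size_t k = 0;
    for (size_t i = 0; i < n; ++i) {
        out_idx[k] = (uint32_t)i;   // speculative write, overwritten if no match
        k += (in[i] > threshold);   // 0 or 1 — no branch misprediction possible
    }
    return k;
}
```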

On cache-resident data (100K rows, ~800KB — fits in L2):

  • SIMD delivers 2.5-4.2x speedup — compute-bound, SIMD wins big

On memory-bound data (10M rows, ~80MB — exceeds L3):

  • Pure reads (sum) are bandwidth-limited — only 1.2x
  • Compute-intensive ops (filter, VWAP) still improve 1.8-2.3x

This matches the DataBlock pipeline design: queries process 8192-row blocks that fit in L1/L2 cache, where SIMD is maximally effective.
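The block-at-a-time driver is straightforward; a sketch of the shape (the kernel here is a plain loop standing in for the SIMD kernels above):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

constexpr size_t kBlockRows = 8192;  // 8192 × 8 bytes = 64KB — L1/L2 resident

// Process the column one cache-resident chunk at a time, so each kernel
// pass runs over hot data instead of streaming the full column from DRAM.
int64_t sum_blocked(const int64_t* col, size_t n) {
    int64_t total = 0;
    for (size_t base = 0; base < n; base += kBlockRows) {
        size_t rows = std::min(kBlockRows, n - base);
        for (size_t i = 0; i < rows; ++i)  // kernel over one block
            total += col[base + i];
    }
    return total;
}
```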


Filter expressions are compiled to native code via LLVM OrcJIT v2:

SQL WHERE clause → Parser → AST → LLVM IR → Native function pointer

The compiled function has signature bool (*)(int64_t price, int64_t volume) — a direct function pointer call, no virtual dispatch. Compilation takes ~2.6ms; the result is cached for repeated queries.
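The OrcJIT machinery itself is too large to show runnably here, but the calling convention and the compile cache can be sketched. `FilterFn` matches the signature from the post; the stand-in "compiler" below returns a hand-written predicate where the real system would hand back JIT-emitted native code:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Same signature as the JIT-compiled filter: a raw function pointer,
// called directly with no virtual dispatch.
using FilterFn = bool (*)(int64_t price, int64_t volume);

// Stand-in for JIT output — a hypothetical predicate for illustration,
// equivalent to "WHERE price > 100 AND volume > 0".
static bool demo_pred(int64_t price, int64_t volume) {
    return price > 100 && volume > 0;
}

// Compile-once, cache-forever: repeated queries with the same WHERE clause
// skip the ~2.6ms compile and reuse the cached function pointer.
FilterFn compile_filter(const std::string& where_clause) {
    static std::unordered_map<std::string, FilterFn> cache;
    auto it = cache.find(where_clause);
    if (it != cache.end()) return it->second;  // cache hit
    FilterFn fn = demo_pred;  // real system: parse → AST → LLVM IR → JIT here
    cache.emplace(where_clause, fn);
    return fn;
}
```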


Built-in parsers for common market data protocols:

Protocol     Throughput      Parse Latency
FIX 4.4      1.1M msg/sec    420ns
ITCH         2.8M msg/sec    180ns
Binance WS   850K msg/sec    580ns
Kafka        3.2M msg/sec    batch

Each feed handler runs in its own thread and writes directly to the ring buffer — no intermediate queue, no serialization.


For maximum throughput on dedicated hardware:

  • CPU pinning: Feed handler threads on producer cores, drain thread on a dedicated core
  • NUMA awareness: Allocate ring buffer and column store on the same NUMA node
  • Huge pages: 2MB huge pages for arena allocator — fewer TLB misses
  • io_uring: Async WAL writes via io_uring — no fsync blocking the hot path
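To make the huge-page item concrete: on Linux the arena backing can be mapped with MAP_HUGETLB, falling back to normal pages (plus a transparent-huge-page hint) when no huge pages are reserved. A sketch — `arena_map` is an illustrative name:

```cpp
#include <cstddef>
#include <sys/mman.h>

// Try to back the arena with 2MB huge pages; each huge page covers 512
// normal pages, so the TLB needs far fewer entries for the same arena.
void* arena_map(size_t bytes) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        // No reserved huge pages — fall back to 4KB pages and ask the
        // kernel to promote them transparently where it can.
        p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p != MAP_FAILED) madvise(p, bytes, MADV_HUGEPAGE);
    }
    return p == MAP_FAILED ? nullptr : p;
}
```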

See Bare Metal Tuning → for the full guide.


Get started: Quick Start → · Feed Handler Guide →