Skip to content

Distributed Cluster Architecture: Sharding, Routing, and Replication

ZeptoDB was designed as a single-node, in-memory time-series database. Scaling it horizontally required solving three problems: transport abstraction without runtime overhead, partition routing with minimal data movement, and health monitoring that doesn’t lie. Here’s how we did it.


Zero Virtual Calls

CRTP template dispatch on the hot path — no vtable indirection.

Transport Swap = 1 Line

Switch from shared memory to RDMA by changing a single template parameter.

Consistent Hashing

128 virtual nodes per physical node — minimal partition movement on topology changes.

Full Local Testing

SharedMemBackend lets you test multi-node clusters without RDMA hardware.


The transport layer uses CRTP (Curiously Recurring Template Pattern) to eliminate virtual call overhead:

template <typename Impl>
class TransportBackend {
public:
void remote_write(const void* src, const RemoteRegion& remote,
size_t offset, size_t bytes) {
impl().do_remote_write(src, remote, offset, bytes); // inlined
}
Impl& impl() { return static_cast<Impl&>(*this); }
};

Two backends implement this interface:

BackendUse Caseremote_write impl
SharedMemBackendDev/testmemcpy via POSIX shm_open + mmap
UCXBackendProductionUCX one-sided RDMA (ucp_put_nbi)

Switching backends is a template parameter change: ClusterNode<SharedMemBackend>ClusterNode<UCXBackend>.


Each physical node maps to 128 virtual nodes on a uint64 hash ring:

Hash ring (uint64 → NodeId):
──────────────────────────────────────────
0 ... UINT64_MAX
├──●──●──●──●──●──●──●──●──●──●──●──●──┤
│ N1 N2 N1 N3 N2 N1 N3 N1 N2 N3 N2 N1 │
└─ (128 vnodes per physical node) ┘
Lookup: symbol_id → hash → upper_bound(ring) → NodeId
O(log n), cached to O(1)

When a node joins or leaves, only the affected range migrates — other nodes are untouched.

Benchmark results:

OperationThroughputLatency
route() (cached)500M ops/s2.0 ns
route() (uncached)29M ops/s34.6 ns

The HealthMonitor runs a UDP heartbeat protocol with a compact 24-byte packet:

struct HeartbeatPacket {
uint32_t magic = 0x41504558; // 'APEX'
NodeId node_id;
uint64_t seq_num;
uint64_t timestamp_ns;
};

State machine:

JOINING ──heartbeat──▶ ACTIVE
ACTIVE ──3s timeout──▶ SUSPECT
SUSPECT ──heartbeat──▶ ACTIVE (recovery)
SUSPECT ──7s more────▶ DEAD (failover triggered)

The two-phase timeout (SUSPECT → DEAD) prevents false positives from transient network blips.


ClusterNode<Transport> unifies all components:

template <typename Transport>
class ClusterNode {
Transport transport_; // SharedMem or UCX
PartitionRouter router_; // Consistent hash ring
HealthMonitor health_; // UDP heartbeat
ZeptoPipeline local_pipeline_; // Local storage + query engine
};

The ingestion hot path is simple:

ingest_tick(msg):
owner = router_.route(msg.symbol_id) // 2ns cached lookup
if owner == self:
local_pipeline_.ingest_tick(msg) // local fast path
else:
transport_.remote_write(...) // one-sided RDMA

OperationThroughputLatency
SharedMem write+fence (64B)73.9M ops/s13.5 ns
SharedMem bulk write (4KB)14.9M ops/s66.9 ns (61 GB/s)
Partition routing (cached)500.4M ops/s2.0 ns
Single-node ingest5.1M ops/s195.7 ns

SharedMem at 13.5 ns is a useful baseline — production RDMA adds network latency (1–15 μs typical), but the software overhead is already minimized.


Lesson: Always Heap-Allocate Large Objects

Section titled “Lesson: Always Heap-Allocate Large Objects”

ClusterNode is ~8 MB due to the embedded ZeptoPipeline (which contains a 65K-slot ring buffer). Two stack-allocated nodes in a test caused a 16 MB stack overflow — caught by UCX’s signal handler with confusing crash logs.

// Wrong: stack overflow
ClusterNode<SharedMemBackend> node1(cfg1), node2(cfg2); // 16MB on stack
// Right: heap allocation
auto node1 = std::make_unique<ShmNode>(cfg1);
auto node2 = std::make_unique<ShmNode>(cfg2);

In HFT systems, always heap-allocate objects with large embedded buffers.