Skip to content

RingConsensus: Lightweight Partition Ownership Protocol

Every node in a ZeptoDB cluster holds a copy of the PartitionRouter — the consistent hash ring that maps symbols to nodes. When a node joins or leaves, all copies must converge. We needed a synchronization protocol, and we chose not to use Raft. Here’s why.


Each ClusterNode, QueryCoordinator, and ComputeNode maintains an independent PartitionRouter. Without synchronization, a node addition on the coordinator doesn’t propagate — other nodes route ticks to stale destinations.


Infrequent Changes

Ring changes happen only on node failure or addition — not on every tick.

Small State

The full ring state is a node list — a few KB at most.

Existing Infrastructure

CoordinatorHA + FencingToken already provide leader election and epoch ordering.

Latency Budget

Raft consensus latency on the tick path is unacceptable for HFT workloads.

ZeptoDB’s target domains (HFT, IoT, observability) are fine with eventual consistency for routing metadata. The data path has its own correctness guarantees via fencing tokens.


The coordinator is the single source of truth for ring mutations. On a topology change, it:

  1. Mutates its local router
  2. Bumps the fencing token epoch
  3. Broadcasts a RING_UPDATE RPC to all peers
  4. Peers validate the epoch and rebuild their routers
Coordinator:
on_node_state_change(ACTIVE)
→ consensus_->propose_add(node_id)
→ router_.add_node(node_id)
→ fencing_token_.advance() // epoch N → N+1
→ broadcast RING_UPDATE to all peers
→ peers: validate(epoch) → router rebuild
→ peers: respond RING_ACK
Follower receives RING_UPDATE:
→ FencingToken::validate(epoch)
→ if epoch < last_seen: REJECT (stale)
→ else: deserialize RingSnapshot → rebuild router
→ respond RING_ACK

Two new RPC message types:

TypeIDPayload
RING_UPDATE13RingSnapshot (epoch + serialized node list)
RING_ACK14Empty (acknowledgment)

RingSnapshot serialization:

struct RingSnapshot {
uint64_t epoch;
std::vector<NodeId> nodes;
std::vector<uint8_t> serialize() const;
static RingSnapshot deserialize(const uint8_t* data, size_t len);
};

The consensus mechanism is abstracted behind RingConsensus:

class RingConsensus {
public:
virtual ~RingConsensus() = default;
virtual bool propose_add(NodeId id) = 0;
virtual bool propose_remove(NodeId id) = 0;
virtual bool apply_update(const RingSnapshot& snap) = 0;
};

EpochBroadcastConsensus is the default implementation. Swapping to Raft is a one-line change:

// Default: epoch broadcast (eventual consistency)
node.join_cluster(seeds);
// Future: Raft (strong consistency for banking/healthcare)
node.set_consensus(std::make_unique<RaftConsensus>(raft_config));
node.join_cluster(seeds);

ClusterNode::join_cluster() automatically initializes consensus and registers the RPC callback:

join_cluster(seeds):
1. Transport init (shm_open or ucp_init)
2. Register self + seeds in PartitionRouter
3. Connect to seed node
4. Start HealthMonitor (UDP thread)
5. Initialize EpochBroadcastConsensus ← new
6. Register RING_UPDATE RPC callback ← new
7. Start local pipeline (drain thread)

The is_coordinator flag in ClusterConfig determines behavior:

  • Coordinator: proposes changes through consensus (broadcast to peers)
  • Follower: applies changes received via RING_UPDATE RPC

PropertyGuarantee
ConvergenceAll nodes converge after broadcast completes
OrderingEpoch monotonicity prevents stale updates
Partition toleranceNodes that miss an update get corrected on next change
Split-brainFencingToken rejects updates from stale coordinators

The epoch acts as a logical clock. A follower that receives epoch 5 after seeing epoch 7 rejects it — the fencing token validation catches this automatically.


Epoch BroadcastRaft
✅ Zero latency on tick path❌ Consensus round-trip on every mutation
✅ Simple implementation (~200 LOC)❌ Complex (leader election, log replication)
✅ Reuses existing FencingToken infra❌ Separate leader election mechanism
⚠️ Eventual consistency✅ Strong consistency
⚠️ Single coordinator as source of truth✅ Any node can propose

For ZeptoDB’s use cases, eventual consistency with epoch ordering is the right trade-off. The interface is ready for Raft when stronger guarantees are needed.