Skip to content

Cluster Integrity and Split-Brain Defense

A distributed database that can’t handle network partitions is a distributed database that will corrupt your data. ZeptoDB’s cluster integrity layer addresses four structural issues: router synchronization, stale coordinator writes, standby promotion gaps, and distributed query merge correctness.


Split Routers

ClusterNode and QueryCoordinator each had independent PartitionRouter copies — no sync.

Stale Writes

After failover, the old coordinator could still write to data nodes.

Empty Promotion

Standby→Active promotion left the coordinator’s endpoint list empty.

Wrong Aggregates

Naive concat of distributed GROUP BY results produced incorrect values.


QueryCoordinator now accepts an external router via shared pointer:

coordinator.set_shared_router(&cluster_node.router(), &cluster_node.router_mutex());

When set, the coordinator reads from the same router as ClusterNode. When not set, it falls back to its own internal router (backward compatible). Read/write access is protected by shared_mutex — reads use shared_lock, ring mutations use unique_lock.


The core defense against split-brain. Every write RPC now carries an epoch:

RpcHeader (24 bytes):
┌──────────┬──────────┬────────────┬─────────────┬──────────┐
│ magic(4) │ type(4) │ req_id(4) │ payload(4) │ epoch(8) │
└──────────┴──────────┴────────────┴─────────────┴──────────┘
(was 16 bytes) (new field)

Server-side validation in TcpRpcServer:

void set_fencing_token(FencingToken* token);
// On TICK_INGEST or WAL_REPLICATE:
// if header.epoch < token->last_seen() → REJECT
// else → token->validate(header.epoch) → ACCEPT
Epoch valueBehavior
0Bypass fencing (legacy client compatibility)
< last_seenRejected — stale coordinator
≥ last_seenAccepted, last_seen updated

Read operations (SQL_QUERY) have no fencing — reads are safe by design.


Timeline:
─────────────────────────────────────────────────────────
Coordinator A (epoch=1) Network Partition Coordinator B
│ ╳ │
│ writes OK (epoch=1) ╳ │
│ ╳ lease acquired │
│ ╳ epoch bumped to 2 │
│ ╳ writes OK (epoch=2)│
│ ╳ │
│ partition heals ───────╳──────────────────────│
│ │
│ tries write (epoch=1) │
│ → REJECTED (last_seen=2) │
│ │
▼ data integrity preserved ▼

In production, CoordinatorHA uses Kubernetes leases for leader election:

CoordinatorHAConfig config;
config.require_lease = true;
config.lease_config.namespace_ = "zeptodb";
config.lease_config.lease_name = "coordinator-leader";

On standby→active promotion:

  1. lease_->try_acquire() — must succeed
  2. fencing_token_.advance() — epoch bump
  3. peer_rpc_->set_epoch(new_epoch) — propagate to all RPC clients
  4. Re-register all known nodes with the coordinator

On lease loss (e.g., another node acquires it):

  • Automatic demotion: ACTIVE → STANDBY
  • All subsequent writes from this node will be rejected by data nodes

When a standby is promoted, it now iterates registered_nodes_ and calls coordinator_.add_remote_node() for each. Without this, the promoted coordinator would have an empty endpoint list and couldn’t route queries.


TestWhat It Proves
StaleCoordinatorWriteRejectedCoordinator A (epoch=1) writes succeed. B promoted (epoch=2). A retries → rejected.
StaleWalReplicationRejectedNew coordinator (epoch=3) replicates WAL. Old (epoch=1) tries → rejected.
K8sLeasePreventsDualLeaderLease holder A loses to B via force_holder(). A cannot re-acquire.
LegacyClientBypassClient with epoch=0 bypasses fencing (backward compat).
PromotionReRegistersNodesRemote nodes available immediately after promotion.

  • epoch=0 bypass should be disabled in production (set_require_epoch(true))
  • CoordinatorHA doesn’t auto-wire FencingToken on promotion — caller must connect the K8sLease → FencingToken → TcpRpcClient.set_epoch chain
  • No fencing on SQL reads (by design — reads don’t mutate state)

These are acceptable trade-offs: the fencing layer prevents data corruption, which is the critical invariant. Read consistency is handled at the query layer.