Split Routers
ClusterNode and QueryCoordinator each had independent PartitionRouter copies — no sync.
A distributed database that can’t handle network partitions is a distributed database that will corrupt your data. ZeptoDB’s cluster integrity layer addresses four structural issues: router synchronization, stale coordinator writes, standby promotion gaps, and distributed query merge correctness.
Split Routers
ClusterNode and QueryCoordinator each had independent PartitionRouter copies — no sync.
Stale Writes
After failover, the old coordinator could still write to data nodes.
Empty Promotion
Standby→Active promotion left the coordinator’s endpoint list empty.
Wrong Aggregates
Naive concat of distributed GROUP BY results produced incorrect values.
QueryCoordinator now accepts an external router via shared pointer:
coordinator.set_shared_router(&cluster_node.router(), &cluster_node.router_mutex());When set, the coordinator reads from the same router as ClusterNode. When not set, it falls back to its own internal router (backward compatible). Read/write access is protected by shared_mutex — reads use shared_lock, ring mutations use unique_lock.
The core defense against split-brain. Every write RPC now carries an epoch:
RpcHeader (24 bytes):┌──────────┬──────────┬────────────┬─────────────┬──────────┐│ magic(4) │ type(4) │ req_id(4) │ payload(4) │ epoch(8) │└──────────┴──────────┴────────────┴─────────────┴──────────┘ (was 16 bytes) (new field)Server-side validation in TcpRpcServer:
void set_fencing_token(FencingToken* token);// On TICK_INGEST or WAL_REPLICATE:// if header.epoch < token->last_seen() → REJECT// else → token->validate(header.epoch) → ACCEPT| Epoch value | Behavior |
|---|---|
0 | Bypass fencing (legacy client compatibility) |
< last_seen | Rejected — stale coordinator |
≥ last_seen | Accepted, last_seen updated |
Read operations (SQL_QUERY) have no fencing — reads are safe by design.
Timeline:─────────────────────────────────────────────────────────
Coordinator A (epoch=1) Network Partition Coordinator B │ ╳ │ │ writes OK (epoch=1) ╳ │ │ ╳ lease acquired │ │ ╳ epoch bumped to 2 │ │ ╳ writes OK (epoch=2)│ │ ╳ │ │ partition heals ───────╳──────────────────────│ │ │ │ tries write (epoch=1) │ │ → REJECTED (last_seen=2) │ │ │ ▼ data integrity preserved ▼In production, CoordinatorHA uses Kubernetes leases for leader election:
CoordinatorHAConfig config;config.require_lease = true;config.lease_config.namespace_ = "zeptodb";config.lease_config.lease_name = "coordinator-leader";On standby→active promotion:
lease_->try_acquire() — must succeedfencing_token_.advance() — epoch bumppeer_rpc_->set_epoch(new_epoch) — propagate to all RPC clientsOn lease loss (e.g., another node acquires it):
ACTIVE → STANDBYWhen a standby is promoted, it now iterates registered_nodes_ and calls coordinator_.add_remote_node() for each. Without this, the promoted coordinator would have an empty endpoint list and couldn’t route queries.
| Test | What It Proves |
|---|---|
StaleCoordinatorWriteRejected | Coordinator A (epoch=1) writes succeed. B promoted (epoch=2). A retries → rejected. |
StaleWalReplicationRejected | New coordinator (epoch=3) replicates WAL. Old (epoch=1) tries → rejected. |
K8sLeasePreventsDualLeader | Lease holder A loses to B via force_holder(). A cannot re-acquire. |
LegacyClientBypass | Client with epoch=0 bypasses fencing (backward compat). |
PromotionReRegistersNodes | Remote nodes available immediately after promotion. |
epoch=0 bypass should be disabled in production (set_require_epoch(true))K8sLease → FencingToken → TcpRpcClient.set_epoch chainThese are acceptable trade-offs: the fencing layer prevents data corruption, which is the critical invariant. Read consistency is handled at the query layer.