Skip to content

Coordinator HA and Automatic Failover Recovery

A distributed database needs two things to survive node failures: a way to elect a new leader, and a way to recover the data that was on the failed node. ZeptoDB handles both — lease-based coordinator HA with fencing, and automatic re-replication on data node failure.


Part 1: Coordinator HA with Lease-Based Fencing

Section titled “Part 1: Coordinator HA with Lease-Based Fencing”

CoordinatorHA used ping-based failover without Kubernetes lease integration. During a network partition, two coordinators could both believe they were active — classic split-brain.

K8s Lease

Only the lease holder can be the active coordinator. Lease loss triggers automatic demotion.

Fencing Token

Every promotion bumps the epoch. Data nodes reject writes from stale epochs.

CoordinatorHAConfig config;
config.require_lease = true; // false = legacy ping-based (backward compat)
config.lease_config.namespace_ = "zeptodb";
config.lease_config.lease_name = "coordinator-leader";
Standby detects Active is down:
1. lease_->try_acquire() // must succeed
2. fencing_token_.advance() // epoch N → N+1
3. peer_rpc_->set_epoch(N+1) // propagate to RPC client
4. Re-register all known nodes // coordinator can route immediately
5. Fire promotion callback

When another node acquires the lease (or the current holder’s lease expires):

on_lost callback fires:
→ state_ = STANDBY
→ all subsequent writes carry stale epoch
→ data nodes reject them via FencingToken

This is automatic — no operator intervention needed.

Coordinator A (epoch=5) ──X── network partition
Coordinator B acquires lease
fencing_token_.advance() → epoch=6
peer_rpc_.set_epoch(6)
writes succeed (epoch=6)
Partition heals:
Coordinator A tries write (epoch=5)
→ Data node: validate(5) fails (last_seen=6)
→ REJECTED — no data corruption

FailoverManager removed failed nodes from the router but had no procedure for recovering unreplicated data. If callbacks weren’t registered, data was silently lost.

FailoverManager now embeds a PartitionMigrator and automatically re-replicates data when a node fails:

Node 2 goes DEAD (HealthMonitor):
→ FailoverManager::trigger_failover(2)
1. router_.remove_node(2)
2. coordinator_.remove_node(2)
3. Calculate re-replication targets:
[{new_primary=1, new_replica=3}]
4. migrator has registered nodes?
YES → async: migrate_symbol(1 → 3)
→ SELECT * FROM trades on node 1
→ replicate_wal() to node 3
→ callback with results
NO → callback immediately (graceful fallback)
struct FailoverConfig {
bool auto_re_replicate = true; // enable automatic re-replication
bool async_re_replicate = true; // run in background thread
};
DecisionRationale
Embedded PartitionMigratorSelf-contained — no external dependencies for re-replication
Async by defaultDon’t block the failover path; re-replication can take time
Graceful fallbackIf register_node() wasn’t called, skip re-replication silently
Statistics trackingre_replication_count() for monitoring

Combined Flow: Coordinator + Data Node Failure

Section titled “Combined Flow: Coordinator + Data Node Failure”
Cluster: Coordinator A (active), Coordinator B (standby), Data Nodes 1-3
Scenario: Coordinator A and Data Node 2 fail simultaneously
1. B detects A is down
→ lease_->try_acquire() succeeds
→ fencing_token_.advance() → epoch=6
→ Re-register nodes 1, 3 (node 2 already gone)
→ B is now active coordinator
2. HealthMonitor detects Node 2 is DEAD
→ FailoverManager::trigger_failover(2)
→ router_.remove_node(2)
→ Re-replication: migrate data from Node 1 → Node 3
→ Ring broadcast via RingConsensus
3. Cluster stabilized with 2 data nodes
→ All queries route correctly
→ No data loss (re-replicated from surviving replicas)

Both features are backward compatible with existing deployments:

FeatureDefaultLegacy Behavior
require_leasefalsePing-based failover (no K8s dependency)
auto_re_replicatetrueIf register_node() not called, callbacks only
async_re_replicatetrueNon-blocking failover path

Existing tests pass without modification — the new behavior activates only when explicitly configured.