When a node dies in a distributed database, removing it from the routing table is the easy part. The hard part is making sure the data it held doesn’t disappear. This post covers how ZeptoDB’s FailoverManager now automatically triggers re-replication — moving data from surviving replicas to healthy nodes without manual intervention.
The existing FailoverManager handled node failure detection and routing table cleanup correctly. When the HealthMonitor declared a node dead, the failover path would remove the node from the PartitionRouter and from the Coordinator.
But that was it. Re-replication (copying data from surviving replicas to maintain the desired replication factor) had to be done manually in the callback. If no callback was registered, data was silently lost.
The fix embeds a PartitionMigrator directly inside FailoverManager, making re-replication a first-class operation:
    struct FailoverConfig {
        bool auto_re_replicate = true;   // automatically re-replicate on failure
        bool async_re_replicate = true;  // run re-replication in background thread
    };

Two configuration knobs control the behavior:
auto_re_replicate: when true, the failover path automatically migrates data to maintain the replication factor.
async_re_replicate: when true, migration runs in a background thread so the failover callback isn't blocked.

The FailoverEvent struct now tracks re-replication outcomes:
    struct FailoverEvent {
        NodeId failed_node;
        // ... existing fields ...
        bool re_replication_attempted = false;
        bool re_replication_succeeded = false;
    };

    Node 2 declared DEAD (HealthMonitor)
    │
    └─ FailoverManager::trigger_failover(2)
       │
       ├─ 1. router_.remove_node(2)
       ├─ 2. coordinator_.remove_node(2)
       ├─ 3. Calculate re-replication targets
       │     → [{symbol: "AAPL", new_primary: 1, new_replica: 3}]
       │
       ├─ 4. migrator_ has registered nodes?
       │     │
       │     ├─ YES → run_re_replication()
       │     │     ├─ migrator_.migrate_symbol(node_1 → node_3)
       │     │     │     ├─ SELECT * FROM trades on node 1
       │     │     │     └─ replicate_wal() to node 3
       │     │     └─ Fire callback with results
       │     │
       │     └─ NO → Fire callback immediately (graceful fallback)
       │
       └─ If async: steps above run in background thread

The key design decision is the graceful fallback. If register_node() was never called (in a single-node deployment or during testing), the failover path skips re-replication entirely and fires the callback immediately. Existing code that constructs FailoverManager(router, coordinator) without registering nodes behaves identically to before.
Nodes must be registered with the migrator before re-replication can work:
    FailoverManager fm(router, coordinator, config);

    // Register nodes so migrator knows how to reach them
    fm.register_node(1, "10.0.1.1:9000");
    fm.register_node(2, "10.0.1.2:9000");
    fm.register_node(3, "10.0.1.3:9000");

The has_node() check inside trigger_failover() determines whether the migrator path is viable. If no nodes are registered, the system logs a warning and falls back to callback-only behavior.
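The registration check and fallback can be illustrated with a small stand-alone sketch. It is not ZeptoDB's implementation: StubMigrator, SketchFailoverManager, has_nodes(), migrate_all(), and on_failover() are hypothetical names invented for this example, and the stub assumes every migration succeeds. Only the branch in step 4 is modelled.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

using NodeId = std::uint32_t;

struct FailoverEvent {
    NodeId failed_node = 0;
    bool re_replication_attempted = false;
    bool re_replication_succeeded = false;
};

// Hypothetical stand-in for PartitionMigrator: it only tracks known endpoints.
class StubMigrator {
public:
    void register_node(NodeId id, std::string addr) { nodes_[id] = std::move(addr); }
    bool has_nodes() const { return !nodes_.empty(); }
    // Assumption for the sketch: every migration succeeds.
    bool migrate_all() { return true; }

private:
    std::unordered_map<NodeId, std::string> nodes_;
};

// Hypothetical, simplified manager showing only the "registered nodes?" branch.
class SketchFailoverManager {
public:
    using Callback = std::function<void(const FailoverEvent&)>;

    void register_node(NodeId id, std::string addr) {
        migrator_.register_node(id, std::move(addr));
    }
    void on_failover(Callback cb) { callback_ = std::move(cb); }

    FailoverEvent trigger_failover(NodeId failed) {
        FailoverEvent event;
        event.failed_node = failed;
        // Routing cleanup and target calculation (steps 1-3) are omitted here.
        if (migrator_.has_nodes()) {
            event.re_replication_attempted = true;
            event.re_replication_succeeded = migrator_.migrate_all();
        }
        // With no registered nodes both flags stay false: callback-only behavior.
        if (callback_) callback_(event);
        return event;
    }

private:
    StubMigrator migrator_;
    Callback callback_;
};
```

In this sketch, a manager with no registered nodes returns an event with both flags false, which matches the callback-only behavior described above. Once at least one node is registered, the migration branch runs before the callback fires.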
    void FailoverManager::run_re_replication(
        const std::vector<ReReplicationTarget>& targets,
        FailoverEvent& event) {
        event.re_replication_attempted = true;
        bool all_ok = true;
        for (const auto& target : targets) {
            bool ok = migrator_.migrate_symbol(
                target.symbol, target.source_node, target.dest_node);
            if (!ok) all_ok = false;
        }
        event.re_replication_succeeded = all_ok;
    }

Each target represents a symbol that needs to be copied from a surviving replica to a new destination. The migrator handles the actual data transfer: reading from the source node and writing via WAL replication to the destination.
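Because the two outcome flags are set independently, a failover callback can distinguish three situations. The sketch below shows one way a callback might interpret them. RecoveryStatus, classify(), and describe() are hypothetical helpers for illustration, not part of ZeptoDB, and the FailoverEvent here carries only the two flags named in the post.

```cpp
#include <string>

// Mirrors the two outcome fields from the post; other fields are omitted.
struct FailoverEvent {
    bool re_replication_attempted = false;
    bool re_replication_succeeded = false;
};

// Hypothetical classification used only in this sketch.
enum class RecoveryStatus { Skipped, Recovered, Degraded };

RecoveryStatus classify(const FailoverEvent& e) {
    // Not attempted: the fallback path ran (for example, no nodes registered).
    if (!e.re_replication_attempted) return RecoveryStatus::Skipped;
    // Attempted: succeeded is true only if every target migrated.
    return e.re_replication_succeeded ? RecoveryStatus::Recovered
                                      : RecoveryStatus::Degraded;
}

std::string describe(RecoveryStatus s) {
    switch (s) {
        case RecoveryStatus::Skipped:   return "re-replication skipped";
        case RecoveryStatus::Recovered: return "replication factor restored";
        case RecoveryStatus::Degraded:  return "at least one symbol failed to migrate";
    }
    return "unknown";
}
```

Note that run_re_replication() keeps going after a failed target, so a Degraded result in this sketch means at least one symbol failed, not that none were copied.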
When async_re_replicate is true, each re-replication spawns a detached thread tracked in async_threads_. This prevents the failover callback from blocking on potentially slow data transfers.
Automatic recovery
Re-replication triggers automatically on node failure. No manual intervention, no silent data loss.
Async by default
Background threads handle data migration without blocking the failover callback path.
Graceful fallback
Unregistered nodes skip re-replication cleanly. Full backward compatibility with existing deployments.
796 tests passing
All existing tests pass unchanged. FailoverConfig defaults preserve prior behavior.