Failover Data Recovery: Automatic Re-Replication

When a node dies in a distributed database, removing it from the routing table is the easy part. The hard part is making sure the data it held doesn’t disappear. This post covers how ZeptoDB’s FailoverManager now triggers re-replication automatically, copying data from surviving replicas to healthy nodes without manual intervention.

The Problem

The existing FailoverManager handled node failure detection and routing table cleanup correctly. When the HealthMonitor declared a node dead, the failover path would:

Remove the failed node from the PartitionRouter
Remove it from the Coordinator
Fire a callback

But that was it. Re-replication, meaning copying data from surviving replicas to restore the desired replication factor, had to be done manually in the callback. If no callback was registered, nothing restored the replication factor, and data could be silently lost.

Design: Self-Contained Recovery

The fix embeds a PartitionMigrator directly inside FailoverManager, making re-replication a first-class operation:

struct FailoverConfig {
    bool auto_re_replicate = true;   // automatically re-replicate on failure
    bool async_re_replicate = true;  // run re-replication in background thread
};

Two configuration knobs control the behavior:

auto_re_replicate: when true, the failover path automatically migrates data to maintain replication factor
async_re_replicate: when true, migration runs in a background thread so the failover callback isn’t blocked
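
As a rough illustration of how the two knobs might be combined, the sketch below copies the FailoverConfig struct from above and derives two variants from the defaults. The helper names make_sync_config and make_manual_config are hypothetical and exist only for this example.

```cpp
#include <cassert>

// Copied from the post so the sketch compiles on its own.
struct FailoverConfig {
    bool auto_re_replicate = true;   // automatically re-replicate on failure
    bool async_re_replicate = true;  // run re-replication in background thread
};

// Hypothetical helper: keep automatic re-replication on, but run it inline,
// which can be convenient in tests that inspect the outcome right away.
inline FailoverConfig make_sync_config() {
    FailoverConfig cfg;              // starts from the defaults above
    cfg.async_re_replicate = false;  // wait until migration finishes
    return cfg;
}

// Hypothetical helper: turn automatic re-replication off and leave recovery
// to whatever the failover callback does.
inline FailoverConfig make_manual_config() {
    FailoverConfig cfg;
    cfg.auto_re_replicate = false;
    return cfg;
}

inline void config_example() {
    FailoverConfig defaults;
    assert(defaults.auto_re_replicate && defaults.async_re_replicate);
    assert(!make_sync_config().async_re_replicate);
    assert(!make_manual_config().auto_re_replicate);
}
```

Either variant would be passed as the third constructor argument, in the same position as config in FailoverManager fm(router, coordinator, config).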

The FailoverEvent struct now tracks re-replication outcomes:

struct FailoverEvent {
    NodeId failed_node;
    // ... existing fields ...
    bool re_replication_attempted = false;
    bool re_replication_succeeded = false;
};
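
A callback can read the two flags together to tell three situations apart: re-replication was skipped, it ran and succeeded, or it ran and at least one migration failed. The sketch below is one way to do that. The NodeId alias and the describe_outcome helper are assumptions made for illustration, since the post does not show the real NodeId type or any such helper.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

using NodeId = std::uint32_t;  // assumption: the real NodeId type is not shown

struct FailoverEvent {
    NodeId failed_node = 0;
    // ... existing fields elided in the post ...
    bool re_replication_attempted = false;
    bool re_replication_succeeded = false;
};

// Hypothetical helper: collapse the two flags into one of three outcomes.
inline std::string describe_outcome(const FailoverEvent& ev) {
    if (!ev.re_replication_attempted) {
        return "skipped";  // fallback path: nothing was migrated
    }
    return ev.re_replication_succeeded ? "recovered" : "degraded";
}

inline void outcome_example() {
    FailoverEvent ev;
    ev.failed_node = 2;
    assert(describe_outcome(ev) == "skipped");

    ev.re_replication_attempted = true;   // migration ran but something failed
    assert(describe_outcome(ev) == "degraded");

    ev.re_replication_succeeded = true;   // every target migrated
    assert(describe_outcome(ev) == "recovered");
}
```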

Failover Flow

Node 2 declared DEAD (HealthMonitor)
│
└─ FailoverManager::trigger_failover(2)
   │
   ├─ 1. router_.remove_node(2)
   ├─ 2. coordinator_.remove_node(2)
   ├─ 3. Calculate re-replication targets
   │     → [{symbol: "AAPL", new_primary: 1, new_replica: 3}]
   │
   ├─ 4. migrator_ has registered nodes?
   │   │
   │   ├─ YES → run_re_replication()
   │   │   ├─ migrator_.migrate_symbol(node_1 → node_3)
   │   │   │   ├─ SELECT * FROM trades on node 1
   │   │   │   └─ replicate_wal() to node 3
   │   │   └─ Fire callback with results
   │   │
   │   └─ NO → Fire callback immediately (graceful fallback)
   │
   └─ If async_re_replicate: re-replication (step 4) runs in a background thread

The key design decision is the graceful fallback. If register_node() was never called — in a single-node deployment or during testing — the failover path skips re-replication entirely and fires the callback immediately. Existing code that constructs FailoverManager(router, coordinator) without registering nodes behaves identically to before.

Node Registration

Nodes must be registered with the migrator before re-replication can work:

FailoverManager fm(router, coordinator, config);

// Register nodes so migrator knows how to reach them
fm.register_node(1, "10.0.1.1:9000");
fm.register_node(2, "10.0.1.2:9000");
fm.register_node(3, "10.0.1.3:9000");

The has_node() check inside trigger_failover() determines whether the migrator path is viable. If no nodes are registered, the system logs a warning and falls back to callback-only behavior.
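
The fallback decision can be sketched with stand-in classes. Everything below is a simplified model rather than ZeptoDB code: SketchMigrator, SketchFailoverManager, on_failover, and has_registered_nodes are hypothetical names, the routing cleanup is reduced to a comment, and the actual data transfer is stubbed out as always succeeding.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

using NodeId = std::uint32_t;  // assumption: the real NodeId type is not shown

struct FailoverEvent {
    NodeId failed_node = 0;
    bool re_replication_attempted = false;
    bool re_replication_succeeded = false;
};

// Stand-in for the migrator's node registry.
class SketchMigrator {
public:
    void add_node(NodeId id, std::string addr) { nodes_[id] = std::move(addr); }
    bool has_registered_nodes() const { return !nodes_.empty(); }
private:
    std::unordered_map<NodeId, std::string> nodes_;
};

class SketchFailoverManager {
public:
    using Callback = std::function<void(const FailoverEvent&)>;

    void register_node(NodeId id, std::string addr) {
        migrator_.add_node(id, std::move(addr));
    }
    void on_failover(Callback cb) { callback_ = std::move(cb); }

    FailoverEvent trigger_failover(NodeId failed) {
        FailoverEvent ev;
        ev.failed_node = failed;
        // Router and coordinator cleanup would happen here.
        if (migrator_.has_registered_nodes()) {
            ev.re_replication_attempted = true;
            ev.re_replication_succeeded = true;  // data transfer stubbed out
        }
        // With no registered nodes both flags stay false; a real
        // implementation would log a warning at this point.
        if (callback_) callback_(ev);
        return ev;
    }

private:
    SketchMigrator migrator_;
    Callback callback_;
};

inline void fallback_example() {
    SketchFailoverManager bare;  // register_node() never called
    assert(!bare.trigger_failover(2).re_replication_attempted);
}
```

In this model the callback fires on both paths, so code written before re-replication existed keeps receiving its events and simply sees re_replication_attempted set to false.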

Core Migration Loop

void FailoverManager::run_re_replication(
    const std::vector<ReReplicationTarget>& targets,
    FailoverEvent& event) {
    event.re_replication_attempted = true;
    bool all_ok = true;
    for (const auto& target : targets) {
        // Copy one symbol from a surviving replica to its new destination.
        bool ok = migrator_.migrate_symbol(
            target.symbol, target.source_node, target.dest_node);
        // Record the failure but keep going, so one bad symbol
        // does not stop the remaining targets from being migrated.
        if (!ok) all_ok = false;
    }
    event.re_replication_succeeded = all_ok;
}

Each target represents a symbol that needs to be copied from a surviving replica to a new destination. The migrator handles the actual data transfer: reading from the source node and writing via WAL replication to the destination.
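
To see how the loop behaves when one migration fails, the sketch below runs the same loop shape against a fake migrator. FakeMigrator, the NodeId alias, and the field layout of ReReplicationTarget are assumptions for illustration. Only the symbol, source_node, and dest_node names come from the code above.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <string>
#include <vector>

using NodeId = std::uint32_t;  // assumption: the real NodeId type is not shown

struct ReReplicationTarget {
    std::string symbol;
    NodeId source_node;
    NodeId dest_node;
};

struct FailoverEvent {
    bool re_replication_attempted = false;
    bool re_replication_succeeded = false;
};

// Fake migrator: fails for any symbol in `failing` and records every attempt.
struct FakeMigrator {
    std::set<std::string> failing;
    std::vector<std::string> attempted;

    bool migrate_symbol(const std::string& symbol, NodeId, NodeId) {
        attempted.push_back(symbol);
        return failing.count(symbol) == 0;
    }
};

// Same loop shape as run_re_replication() above, written as a free function.
inline void run_re_replication(FakeMigrator& migrator,
                               const std::vector<ReReplicationTarget>& targets,
                               FailoverEvent& event) {
    event.re_replication_attempted = true;
    bool all_ok = true;
    for (const auto& target : targets) {
        bool ok = migrator.migrate_symbol(
            target.symbol, target.source_node, target.dest_node);
        if (!ok) all_ok = false;  // remember the failure, keep migrating
    }
    event.re_replication_succeeded = all_ok;
}

inline void partial_failure_example() {
    FakeMigrator migrator;
    migrator.failing.insert("MSFT");
    std::vector<ReReplicationTarget> targets = {
        {"AAPL", 1, 3}, {"MSFT", 1, 3}, {"GOOG", 1, 3}};
    FailoverEvent event;
    run_re_replication(migrator, targets, event);
    assert(!event.re_replication_succeeded);  // one symbol failed
    assert(migrator.attempted.size() == 3);   // but all three were tried
}
```

Written this way, the loop never short-circuits. Folding the call into all_ok = all_ok && migrate_symbol(...) would skip every target after the first failure. An empty target list also leaves re_replication_succeeded set to true.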

When async_re_replicate is true, each re-replication run is handed to a background thread tracked in async_threads_. This keeps the failover path from blocking on potentially slow data transfers.
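
The post does not show how async_threads_ is managed. One common pattern, assumed here, is to keep the std::thread handles joinable and join them before the owner is destroyed, so no migration outlives the objects it uses. AsyncRunner is a hypothetical helper written for this sketch, not part of ZeptoDB.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical helper: runs jobs on background threads and keeps the handles
// so they can be joined before the owner goes away.
class AsyncRunner {
public:
    AsyncRunner() = default;
    AsyncRunner(const AsyncRunner&) = delete;
    AsyncRunner& operator=(const AsyncRunner&) = delete;
    ~AsyncRunner() { join_all(); }

    // Start the job right away on a new thread and remember its handle.
    void spawn(std::function<void()> job) {
        std::lock_guard<std::mutex> lock(mu_);
        threads_.emplace_back(std::move(job));
    }

    // Wait for every thread spawned so far.
    void join_all() {
        std::vector<std::thread> pending;
        {
            std::lock_guard<std::mutex> lock(mu_);
            pending.swap(threads_);  // join outside the lock
        }
        for (auto& t : pending) {
            if (t.joinable()) t.join();
        }
    }

private:
    std::mutex mu_;
    std::vector<std::thread> threads_;
};

inline int async_example() {
    std::atomic<int> migrated{0};
    {
        AsyncRunner runner;
        for (int i = 0; i < 3; ++i) {
            // Stand-in for one re-replication run.
            runner.spawn([&migrated] { ++migrated; });
        }
    }  // destructor joins all three threads here
    return migrated.load();
}
```

Joining in the destructor trades a possibly slow shutdown for the guarantee that a background migration never touches a destroyed migrator.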

Automatic recovery

Re-replication triggers automatically on node failure. No manual intervention, no silent data loss.

Async by default

Background threads handle data migration without blocking the failover callback path.

Graceful fallback

Deployments that never call register_node() skip re-replication cleanly. Full backward compatibility with existing deployments.

796 tests passing

All existing tests pass unchanged. With no nodes registered, a FailoverManager using the default FailoverConfig behaves as it did before.