Bounded resources
Thread pool, payload limits, connection caps. No unbounded allocation under load.
There’s a well-known gap between a distributed system that passes tests and one that survives production. This post covers a batch of 15 stability fixes across ZeptoDB’s cluster layer — each addressing a specific failure mode that would eventually surface under real-world load: resource exhaustion, concurrency bugs, connection leaks, and inconsistent snapshots.
The RPC server had four resource management gaps, each a potential production incident.
The original accept_loop() called std::thread(...).detach() for every incoming connection — unbounded thread creation under load. The fix introduces a fixed-size worker pool with a task queue:
server.set_thread_pool_size(std::thread::hardware_concurrency());// Workers pull from std::queue<int> task queue// Bounded concurrency, predictable memory usageNo upper bound on RPC payload size meant a single malformed request could exhaust memory:
size_t max_payload_size_ = 64 * 1024 * 1024; // 64MB default
// handle_connection() validates before allocationif (payload_size > max_payload_size_) { ZEPTO_WARN("Payload {} exceeds limit {}", payload_size, max_payload_size_); // Connection closed, no allocation}stop() now follows a 5-phase shutdown sequence:
drain_timeout_ms (default 30s) for in-flight requestsThis ensures in-flight queries complete during rolling upgrades rather than being killed mid-execution.
A hard cap (default 1024) on concurrent connections prevents thread pool starvation and file descriptor exhaustion. Connections beyond the limit are immediately closed with a warning.
The PartitionRouter manages symbol-to-node mapping. Under concurrent access from query threads (reads) and cluster membership changes (writes), the original implementation had no synchronization.
// Writers (add_node, remove_node): exclusive lockstd::unique_lock lock(ring_mutex_);
// Readers (route, route_replica, node_count): shared lockstd::shared_lock lock(ring_mutex_);Concurrent query routing (the hot path) proceeds in parallel while membership changes (the rare path) are serialized.
ping() was calling connect_to_server() + close() — creating a new TCP connection for every health check. Over time, this leaked connections in TIME_WAIT state. The fix switches to acquire() + release() pool reuse.
bool running_ accessed from multiple threads without synchronization — a textbook data race. Fixed with std::atomic<bool> running_{false}.
Two fixes in one. First, register_node() held a lock while calling fire_event(), which could call back into the registry — classic lock-reentrance deadlock. The fix releases the lock before firing events.
Second, the K8s registry was previously a stub. Now it polls the actual Kubernetes Endpoints API:
// poll_loop() → HTTP GET on K8s Endpoints API// Auto-detects KUBERNETES_SERVICE_HOST/PORT env vars// Service account token authentication// parse_endpoints_json(): extract IP/port, stable NodeId via IP hash// reconcile(): diff current map → JOINED/LEFT eventsNon-K8s environments retain the manual register_node() fallback.
The original snapshot coordinator sent a single command to all nodes. If one failed mid-snapshot, the cluster ended up inconsistent.
Phase 1: PREPARE → Send SNAPSHOT PREPARE <id> to all nodes → Each node pauses ingestion, prepares snapshot → All succeed? → Phase 2a → Any fail? → Phase 2b
Phase 2a: COMMIT → Send SNAPSHOT COMMIT <id> → finalize + resume
Phase 2b: ABORT → Send SNAPSHOT ABORT <id> → discard + resumetake_snapshot_legacy() is retained for single-node backward compatibility.
Partition migration was previously fire-and-forget. The new implementation tracks each move:
PENDING → DUAL_WRITE → COPYING → COMMITTED → FAILED (retry up to 3x)MigrationCheckpoint tracks state per move. resume_plan() retries only FAILED moves, making migration resumable after transient failures.
Partial seed failures were silently ignored. Now: zero successful seeds → cleanup + std::runtime_error. Partial success is allowed. Bootstrap mode (no seeds) works normally.
Bounded resources
Thread pool, payload limits, connection caps. No unbounded allocation under load.
2PC snapshots
PREPARE/COMMIT/ABORT protocol ensures cluster-wide snapshot consistency.
Zero deadlocks
Lock ordering fixes in K8sNodeRegistry. Atomic flags in GossipNodeRegistry. shared_mutex in PartitionRouter.
803+ tests passing
15 new targeted tests covering payload limits, connection caps, concurrent routing, drain behavior, and 2PC abort.
Related: Cluster Integrity & Split Brain → · Failover Data Recovery → · Ring Consensus →