
ZeptoDB Kubernetes Failure Scenarios & Recovery Guide

Last updated: 2026-03-24


This document covers failure scenarios and auto/manual recovery procedures when operating ZeptoDB in Kubernetes cluster mode.

┌─────────────────────────────────────────────────────────────┐
│ Protection Layer │
│ │
│ HealthMonitor ─── heartbeat 1s ─── SUSPECT 3s ─── DEAD 10s │
│ │ │
│ ▼ │
│ FailoverManager ─── replica promote ─── re-replication │
│ │ │
│ ▼ │
│ CoordinatorHA ─── active/standby ─── auto promotion │
│ │ │
│ ▼ │
│ FencingToken ─── monotonic epoch ─── stale write rejection │
│ │ │
│ ▼ │
│ WalReplicator ─── async/sync ─── RF=2 data redundancy │
│ │ │
│ ▼ │
│ Auto-Snapshot ─── 60s interval ─── crash recovery │
│ │ │
│ ▼ │
│ K8s Lease ─── split-brain defense ─── single leader │
└─────────────────────────────────────────────────────────────┘
t=0s Heartbeat stopped (pod crash / node failure)
t=3s HealthMonitor: ACTIVE → SUSPECT
t=10s HealthMonitor: SUSPECT → DEAD
t=10s FailoverManager: replica → primary promotion
t=10s FencingToken: epoch advance (stale write blocked)
t=10s PartitionRouter: routing table updated
t=15s K8s: readinessProbe failed → Service endpoints removed
t=30s K8s: livenessProbe failed → pod restart
t=~60s K8s: new pod scheduled + started + readiness passed
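The detection side of this timeline can be sketched as a small state machine. This is an illustrative Python sketch, not ZeptoDB's actual HealthMonitor implementation; only the 3s/10s thresholds come from the diagram above.

```python
class NodeHealth:
    """Tracks a node's liveness from the time since its last heartbeat."""

    SUSPECT_AFTER = 3.0   # seconds of silence before ACTIVE -> SUSPECT
    DEAD_AFTER = 10.0     # seconds of silence before SUSPECT -> DEAD

    def __init__(self, now: float) -> None:
        self.last_heartbeat = now

    def heartbeat(self, now: float) -> None:
        self.last_heartbeat = now

    def state(self, now: float) -> str:
        silence = now - self.last_heartbeat
        if silence >= self.DEAD_AFTER:
            return "DEAD"      # FailoverManager takes over from here
        if silence >= self.SUSPECT_AFTER:
            return "SUSPECT"
        return "ACTIVE"


node = NodeHealth(now=0.0)    # last heartbeat at t=0
print(node.state(now=1.0))    # ACTIVE
print(node.state(now=3.0))    # SUSPECT
print(node.state(now=10.0))   # DEAD
```

Note that SUSPECT is a grace state: a single delayed heartbeat does not trigger failover, only ten full seconds of silence does.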

Scenario 1: Data Node Pod Crash (During Ingestion)


One data node pod is OOMKilled or crashes during data ingestion.

  • Ingestion paused for symbols where this node is primary
  • Possible loss of in-flight ticks (in ring buffer but not written to WAL)
Pod-1 crashes (primary for symbols 1, 3, 5)
HealthMonitor: Pod-1 DEAD (10s)
FailoverManager::trigger_failover(pod-1)
├── Pod-2 (replica of sym 1,3) → promoted to primary
├── Pod-0 (replica of sym 5) → promoted to primary
└── FencingToken::advance() → stale writes from zombie Pod-1 rejected
PartitionRouter updated → queries routed to new primary
K8s Deployment: new Pod-1' scheduled (30~60s)
Pod-1' starts → registers with CoordinatorHA → joins as replica
PartitionMigrator: replicates data to new Pod-1' (restores RF=2)
| Data State | Loss? |
|---|---|
| WAL record written + replica transfer complete | ✅ Safe |
| WAL record written + replica not yet transferred | ✅ Recovered via WAL replay |
| Ring buffer only (not yet written to WAL) | ❌ Lost (a few ms at most) |
| Data captured by the last auto-snapshot | ✅ Recovered via snapshot replay |
| Data after the last auto-snapshot | ❌ Up to 60s lost |
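The loss matrix above reduces to a small decision rule over the durability layers. An illustrative sketch; the function name and flags are assumptions, not ZeptoDB API:

```python
def recovery_outcome(in_wal: bool, replicated: bool, in_snapshot: bool) -> str:
    """Return the fate of a tick after a primary pod crash, per the table above."""
    if replicated:
        return "safe"                            # already on the replica
    if in_wal:
        return "recovered via WAL replay"
    if in_snapshot:
        return "recovered via snapshot replay"
    return "lost"                                # lived only in the ring buffer

print(recovery_outcome(in_wal=True, replicated=False, in_snapshot=False))
# recovered via WAL replay
```

The ordering matters: replication is checked first because a replicated tick needs no replay at all, and the ring buffer is the only layer with unavoidable loss.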
# 1. Check pod status
kubectl get pods -n zeptodb -o wide
kubectl describe pod <crashed-pod> -n zeptodb
# 2. If OOMKilled, raise the memory limit
helm upgrade zeptodb ./deploy/helm/zeptodb -n zeptodb \
--set resources.limits.memory=64Gi --wait
# 3. Check cluster status
curl -s http://$LB:8123/health | jq .
# 4. Verify data consistency
curl -X POST http://$LB:8123/ \
-d 'SELECT symbol, count(*) FROM trades GROUP BY symbol'

Scenario 2: Coordinator Pod Crash

The active coordinator pod running QueryCoordinator crashes.

Active Coordinator crash
Standby CoordinatorHA: monitor_loop() ping failed
│ (detected at config.check_interval_ms intervals)
Standby → ACTIVE promotion (CoordinatorHA::role_ = ACTIVE)
├── K8s Lease acquired (split-brain prevention)
├── FencingToken::advance() → new epoch
├── nodes re-registered from the registered_nodes_ list
└── promotion_cb_() called → external notification
Client queries: Service LB routes to alive pods
K8s: new pod starts → joins as standby coordinator
  • Distributed queries may fail for a few seconds during promotion
  • Single-node queries unaffected as data nodes handle them directly
  • K8s Lease prevents two coordinators from being active simultaneously
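The standby's promotion logic above can be sketched as follows. This is an illustrative Python sketch under stated assumptions (the real CoordinatorHA is C++, and the failure threshold here is invented); it shows why the Lease check must gate the promotion:

```python
class StandbyCoordinator:
    """Standby that promotes itself after repeated failed pings,
    but only if it can acquire the K8s Lease (split-brain guard)."""

    def __init__(self, max_failures: int = 3) -> None:
        self.role = "STANDBY"
        self.failures = 0
        self.max_failures = max_failures

    def on_ping(self, active_alive: bool, lease_acquired: bool) -> str:
        if active_alive:
            self.failures = 0                  # active answered; stay standby
            return self.role
        self.failures += 1
        # Promote only when the active is presumed dead AND the Lease is
        # held; the real promotion also advances the fencing epoch.
        if self.failures >= self.max_failures and lease_acquired:
            self.role = "ACTIVE"
        return self.role


s = StandbyCoordinator()
for _ in range(3):
    role = s.on_ping(active_alive=False, lease_acquired=True)
print(role)  # ACTIVE
```

Without the `lease_acquired` condition, a standby that merely lost network contact with the active would promote itself and create a second writer.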
# Check promotion
kubectl logs <standby-pod> -n zeptodb | grep -i "promotion\|active"
# Verify query normal operation
curl -X POST http://$LB:8123/ -d 'SELECT count(*) FROM trades'

Scenario 3: Node Drain (Maintenance / K8s upgrade)


kubectl drain is run for node maintenance. The PDB guarantees minAvailable: 2.

kubectl drain <node>
K8s: PDB check → is minAvailable: 2 still satisfied?
├── satisfied → pod eviction proceeds
│ │
│ ▼
│ preStop: sleep 15s (wait for in-flight queries to complete)
│ │
│ ▼
│ Pod terminated → HealthMonitor DEAD → failover
│ │
│ ▼
│ K8s: new pod scheduled on another node
└── not satisfied → eviction rejected (pod kept)
  • Cannot drain 2 nodes simultaneously in 3-replica setup (PDB blocks)
  • HDB snapshot recommended before drain
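The PDB gate at the top of the drain flow reduces to a one-line check. An illustrative sketch of the semantics, not Kubernetes source code:

```python
def eviction_allowed(healthy_pods: int, min_available: int = 2) -> bool:
    """PDB semantics: evicting one pod must leave >= minAvailable healthy pods."""
    return healthy_pods - 1 >= min_available

print(eviction_allowed(3))  # True:  drain proceeds
print(eviction_allowed(2))  # False: eviction rejected, pod kept
```

This is also why draining two nodes at once fails in a 3-replica setup: the first eviction drops the healthy count to 2, and the second is rejected until the replacement pod passes readiness.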
# Safe Drain Procedure
# 1. Check current status
kubectl get pods -n zeptodb -o wide
# 2. Trigger a snapshot (if possible)
curl -X POST http://$LB:8123/admin/snapshot \
-H "Authorization: Bearer $ADMIN_KEY"
# 3. Drain
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
# 4. Verify new pod is healthy
kubectl rollout status deployment/zeptodb -n zeptodb
# 5. Uncordon
kubectl uncordon <node>

Scenario 4: Split-Brain (Network Partition)


Network partition disrupts pod-to-pod communication. Risk of both sides claiming primary.

┌──────────────┐ ┌──────────────┐
│ Partition A │ ✕ │ Partition B │
│ Pod-0, Pod-1 │◄──────►│ Pod-2 │
└──────────────┘ └──────────────┘
│ │
▼ ▼
K8s Lease Competition K8s Lease Competition
│ │
▼ ▼
Lease Acquired (majority) Lease failed
→ ACTIVE maintained → STANDBY demoted
│ │
▼ ▼
FencingToken: epoch=5 epoch=4 (stale)
→ Write allowed → Write rejected

Triple Defense:

| Layer | Mechanism | Effect |
|---|---|---|
| K8s Lease | Single-leader guarantee | Only one coordinator holds the role |
| FencingToken | Monotonic epoch | WAL/ticks with a stale epoch rejected |
| HealthMonitor | SUSPECT → DEAD | Marks minority-partition nodes as DEAD |
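The fencing layer is the last line of defense if a zombie primary keeps writing after the partition heals. A minimal sketch of the monotonic-epoch idea (illustrative Python; method names are assumptions, not ZeptoDB's FencingToken API):

```python
class FencingToken:
    """Monotonically increasing epoch; writes carrying an older epoch are rejected."""

    def __init__(self) -> None:
        self.epoch = 0

    def advance(self) -> int:
        self.epoch += 1           # called on every failover / promotion
        return self.epoch

    def accept(self, write_epoch: int) -> bool:
        return write_epoch >= self.epoch


token = FencingToken()
zombie = token.advance()          # epoch 1: held by the soon-to-be-zombie primary
token.advance()                   # epoch 2: failover promotes a new primary
print(token.accept(zombie))       # False -> stale write from the zombie rejected
print(token.accept(token.epoch))  # True  -> new primary's writes accepted
```

Because the epoch only moves forward, correctness does not depend on the zombie ever learning it was demoted; its writes simply stop validating.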
# Verify partition recovery
kubectl get pods -n zeptodb -o wide
# Check all pod health
for pod in $(kubectl get pods -n zeptodb -o name); do
echo "--- $pod ---"
kubectl exec -n zeptodb $pod -- curl -s localhost:8123/health
done
# Verify data consistency (compare row counts from both sides)
curl -X POST http://$LB:8123/ \
-d 'SELECT symbol, count(*) FROM trades GROUP BY symbol ORDER BY symbol'

Scenario 5: Storage Failure (PVC / EBS)

EBS volume failure or the PVC becomes inaccessible.

  • Pod CrashLoopBackOff (data dir mount failed)
  • HDB flush failed logs
  • Queries continue with in-memory data (only HDB queries fail)
# 1. Check PVC status
kubectl describe pvc zeptodb-data -n zeptodb
kubectl get events -n zeptodb | grep -i pvc
# 2. Check EBS volume status (AWS)
aws ec2 describe-volumes --volume-ids <vol-id>
# 3a. If the PVC is healthy, restart the pod
kubectl delete pod <pod> -n zeptodb
# 3b. If the PVC is corrupted, restore from a VolumeSnapshot
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: zeptodb-data-restored
  namespace: zeptodb
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: gp3
  resources:
    requests:
      storage: 500Gi
  dataSource:
    name: zeptodb-snap-latest
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
EOF
# 4. Switch to new PVC
helm upgrade zeptodb ./deploy/helm/zeptodb -n zeptodb \
--set persistence.existingClaim=zeptodb-data-restored --wait

Scenario 6: Rolling Upgrade Failure


During a helm upgrade, the new-version pod fails its readiness probe.

helm upgrade --set image.tag=1.1.0
New Pod-2' (v1.1.0) started
readinessProbe /ready failed (bug, config error etc.)
├── maxUnavailable: 0 → existing Pod-2 (v1.0.0) kept
├── rollout halted (new pod never becomes ready)
└── All 3 existing pods continue serving normally
Manual intervention needed after timeout
# 1. Check rollout status
kubectl rollout status deployment/zeptodb -n zeptodb
kubectl get pods -n zeptodb
# 2. Check new pod logs
kubectl logs <new-pod> -n zeptodb
# 3. Immediate rollback
helm rollback zeptodb -n zeptodb
# 4. Verify rollback
kubectl rollout status deployment/zeptodb -n zeptodb
curl -s http://$LB:8123/health
| Setting | Value | Role |
|---|---|---|
| maxSurge | 1 | Creates at most 1 additional new pod |
| maxUnavailable | 0 | Never reduces the existing pod count |
| PDB minAvailable | 2 | Guarantees at least 2 pods serving |
| preStop sleep | 15s | Graceful connection drain |
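With these settings, the serving pod count during an upgrade is bounded as follows (a sketch of standard Deployment rollout arithmetic, applied to the values above):

```python
def pod_count_bounds(replicas: int, max_surge: int, max_unavailable: int) -> tuple:
    """Pod-count envelope during a rolling update: K8s never goes below
    replicas - maxUnavailable or above replicas + maxSurge."""
    return (replicas - max_unavailable, replicas + max_surge)

print(pod_count_bounds(3, max_surge=1, max_unavailable=0))  # (3, 4)
```

So a failed readiness probe can never shrink capacity: the worst case is one extra, never-ready pod sitting beside the three healthy v1.0.0 pods.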

Scenario 7: Full Cluster Failure (Disaster Recovery)


Complete K8s cluster loss (AZ failure, cluster deletion, etc.).

# 1. Provision new cluster
eksctl create cluster -f cluster-config.yaml
# 2. Check the latest backup in S3
aws s3 ls s3://your-zeptodb-backups/backups/ --recursive | tail -5
# 3. Create PVC + restore backup
kubectl create namespace zeptodb
# Restore data from S3 using temporary pod
kubectl run restore --image=amazon/aws-cli -n zeptodb \
  --overrides='{
    "spec": {
      "containers": [{
        "name": "restore",
        "image": "amazon/aws-cli",
        "command": ["sh", "-c",
          "aws s3 cp s3://your-zeptodb-backups/backups/LATEST.tar.gz /tmp/ && tar -xzf /tmp/LATEST.tar.gz -C /data/"],
        "volumeMounts": [{"name":"data","mountPath":"/data"}]
      }],
      "volumes": [{"name":"data","persistentVolumeClaim":{"claimName":"zeptodb-data"}}],
      "restartPolicy": "Never"
    }
  }'
# 4. Deploy ZeptoDB
helm install zeptodb ./deploy/helm/zeptodb -n zeptodb -f values-prod.yaml --wait
# 5. Verify data
curl -X POST http://$LB:8123/ \
-d 'SELECT count(*) FROM trades'
| Metric | Value | Basis |
|---|---|---|
| RPO (data loss) | ≤ 60s | auto-snapshot interval |
| RPO (S3 backup) | ≤ 24h | daily backup CronJob |
| RTO (pod crash) | ~60s | K8s auto-restart |
| RTO (node failure) | ~10s | FailoverManager + replica promotion |
| RTO (full cluster) | ~30min | new cluster + S3 restore |

Scenario 8: HPA Over-Scaling

A traffic spike causes the HPA to scale to 10 replicas; the new pods serve with empty data.

  • New pods have no in-memory data → query results incomplete
  • In cluster mode, coordinator scatters to all nodes → empty nodes also included
# 1. Check HPA status
kubectl get hpa -n zeptodb
kubectl describe hpa zeptodb -n zeptodb
# 2. Emergency: disable HPA + set manual replica count
kubectl patch hpa zeptodb -n zeptodb -p '{"spec":{"minReplicas":3,"maxReplicas":3}}'
# 3. Or: put new pods into Service only after data replication completes
# (configure readinessProbe to check data load completion)
autoscaling:
  scaleUp:
    stabilizationSeconds: 120   # 2min stabilization (prevents rapid scale-up)
  scaleDown:
    stabilizationSeconds: 300   # 5min stabilization
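The readiness-gating idea from step 3 can be sketched in a few lines. An illustrative sketch only: the status logic a /ready endpoint should implement, with `data_loaded` standing in for a hypothetical replication-complete flag:

```python
def ready_status(data_loaded: bool) -> int:
    """HTTP status /ready should return: 200 only once the pod's partitions
    have finished replicating, else 503 so K8s keeps the pod out of the
    Service endpoints (and out of scatter-gather queries)."""
    return 200 if data_loaded else 503

print(ready_status(False))  # 503 -> no traffic yet
print(ready_status(True))   # 200 -> pod starts receiving queries
```

Gating readiness this way fixes the root cause: an empty pod is never visible to the coordinator's scatter phase, so it cannot dilute query results.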

Scenario 9: Spot Instance Reclamation (Karpenter)


The Karpenter Analytics NodePool uses Spot instances; AWS sends a 2-minute reclamation notice.

AWS: Spot interruption notice (2min ahead)
Karpenter: immediately requests replacement instance (Fleet API)
├── Tries different AZ/size in same instance family
├── Falls back to On-Demand if Spot unavailable
└── Typically new node ready within 30~60s
K8s: pod graceful termination (preStop sleep 15s)
HealthMonitor: DEAD → FailoverManager → replica promotion
Pod scheduled on new node → service restored
  • Realtime pool runs on-demand only → unaffected by Spot reclamation
  • Only the Analytics pool allows Spot → batch queries can simply be retried
  • Karpenter consolidateAfter: 5m → empty nodes are cleaned up quickly

Karpenter vs Cluster Autoscaler Comparison

| | Cluster Autoscaler | Karpenter |
|---|---|---|
| Node provisioning | ASG → 2~5min | Fleet API → 30~60s |
| Instance selection | Fixed ASG type | Multiple types/AZs requested at once |
| Spot reclamation response | ASG rebalance (slow) | Immediate replacement request |
| Node cleanup | 10~15min | consolidateAfter config |
| Workload separation | Many ASGs to manage | Declarative NodePool separation |

| # | Scenario | Detection | Auto Recovery | Data Loss | RTO |
|---|---|---|---|---|---|
| 1 | Data node crash | HealthMonitor 10s | ✅ replica promotion | ≤ 60s (snapshot) | ~10s |
| 2 | Coordinator crash | CoordinatorHA | ✅ standby promotion | None | ~5s |
| 3 | Node drain | K8s PDB | ✅ reschedule | None | ~60s |
| 4 | Split-brain | K8s Lease + Fencing | ✅ minority partition demoted | None | ~10s |
| 5 | Storage failure | Pod CrashLoop | ❌ manual PVC recovery | HDB only | ~10min |
| 6 | Bad upgrade | Readiness failure | ✅ rollout stopped | None | ~30s (rollback) |
| 7 | Full cluster | External monitoring | ❌ manual DR | ≤ 24h (S3) | ~30min |
| 8 | HPA over-scaling | Empty query results | ❌ manual adjustment | None | ~1min |
| 9 | Spot reclamation | AWS 2min notice | ✅ Karpenter replacement | ≤ 60s (snapshot) | ~60s |