# ZeptoDB Zero-Downtime Upgrade Guide

Last updated: 2026-03-24
## Overview

ZeptoDB supports zero-downtime rolling upgrades via Helm. The strategy ensures that at least `minAvailable` pods remain serving traffic at all times during an upgrade.
## How It Works

```text
Rolling Update Flow
┌──────────────────────────────────────────────────┐
│ Pod-0 (v1.0) ──serving──┐                        │
│ Pod-1 (v1.0) ──serving──┼── LB ── clients        │
│ Pod-2 (v1.0) ──serving──┘                        │
│                                                  │
│ 1. Pod-2 gets preStop (sleep 15s, drain)         │
│ 2. Pod-2 removed from Service endpoints          │
│ 3. Pod-2 terminated, Pod-2' (v1.1) starts        │
│ 4. Pod-2' passes readiness → added to LB         │
│ 5. Repeat for Pod-1, then Pod-0                  │
└──────────────────────────────────────────────────┘
```

Key settings that make this safe:

- `maxSurge: 1, maxUnavailable: 0` – never fewer pods than the current count
- `PodDisruptionBudget: minAvailable: 2` – Kubernetes won't evict below 2 pods
- `preStop: sleep 15` – in-flight queries finish before the pod dies
- `readinessProbe` – a new pod only receives traffic after `/ready` returns 200
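These settings map onto the rendered Deployment and PodDisruptionBudget roughly as follows. This is an illustrative sketch, not the chart's exact output; names, labels, and field placement are assumptions based on the settings listed above:

```yaml
# Sketch of the relevant fields in the rendered manifests (illustrative)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zeptodb
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # one extra pod allowed during rollout
      maxUnavailable: 0    # never drop below the current count
  template:
    spec:
      containers:
        - name: zeptodb
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "15"]   # let in-flight queries finish
          readinessProbe:
            httpGet:
              path: /ready
              port: 8123
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: zeptodb
spec:
  minAvailable: 2          # k8s won't voluntarily evict below 2
  selector:
    matchLabels:
      app: zeptodb
```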
## Standard Upgrade (Image Tag Change)

```shell
# 1. Pre-flight: verify current state
helm list -n zeptodb
kubectl get pods -n zeptodb -o wide
curl -s http://<LB>:8123/health
```
```shell
# 2. Upgrade
helm upgrade zeptodb ./deploy/helm/zeptodb \
  -n zeptodb \
  --set image.tag=1.1.0 \
  --wait --timeout 5m
```
```shell
# 3. Monitor rollout
kubectl rollout status deployment/zeptodb -n zeptodb --timeout=5m
```
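When the upgrade runs from CI, a small retry helper avoids racing the readiness probe on the health check. A minimal sketch; the `wait_for_ready` function name is ours, and the stub `true` command stands in for the real check (e.g. `curl -fs http://<LB>:8123/health`):

```shell
#!/bin/sh
# Poll a command until it succeeds or the attempt budget runs out.
wait_for_ready() {
  attempts=$1
  shift
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@" > /dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Stub shown here; in practice use e.g.:
#   wait_for_ready 30 curl -fs http://<LB>:8123/health
if wait_for_ready 3 true; then
  echo "healthy"
else
  echo "timed out"
fi
```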
```shell
# 4. Verify
kubectl get pods -n zeptodb -o wide
curl -s http://<LB>:8123/health
curl -X POST http://<LB>:8123/ -d 'SELECT 1'
```

## Config-Only Upgrade (No Image Change)
ConfigMap changes trigger a rollout automatically via the `checksum/config` annotation.
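The mechanism behind this is the standard Helm checksum-annotation pattern: the pod template carries a hash of the rendered ConfigMap, so any config change alters the pod spec and triggers a rolling restart. Roughly (an illustrative excerpt; the chart's actual template path and layout may differ):

```yaml
# deploy/helm/zeptodb/templates/deployment.yaml (illustrative excerpt)
spec:
  template:
    metadata:
      annotations:
        checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}
```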
```shell
helm upgrade zeptodb ./deploy/helm/zeptodb \
  -n zeptodb \
  --set config.workerThreads=16 \
  --wait
```

## Canary Upgrade (High-Risk Changes)
For major version bumps or schema changes, use a canary approach:
```shell
# 1. Deploy canary (1 replica with new version)
helm install zeptodb-canary ./deploy/helm/zeptodb \
  -n zeptodb \
  --set replicaCount=1 \
  --set image.tag=2.0.0 \
  --set service.type=ClusterIP \
  --set autoscaling.enabled=false \
  --set podDisruptionBudget.enabled=false
```
```shell
# 2. Test canary directly
kubectl port-forward svc/zeptodb-canary 8124:8123 -n zeptodb
curl -X POST http://localhost:8124/ -d 'SELECT vwap(price, volume) FROM trades WHERE symbol = 1'
```
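A quick way to validate the canary is to run the same query against both releases and compare the answers. A sketch with stubbed responses; replace the `echo` stand-ins with the real `curl` calls shown above:

```shell
#!/bin/sh
# Stand-ins for the real queries; in practice:
#   stable=$(curl -s -X POST http://<LB>:8123/ -d 'SELECT 1')
#   canary=$(curl -s -X POST http://localhost:8124/ -d 'SELECT 1')
stable=$(echo "1")
canary=$(echo "1")

if [ "$stable" = "$canary" ]; then
  echo "canary matches stable"
else
  echo "MISMATCH: stable=$stable canary=$canary" >&2
  exit 1
fi
```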
```shell
# 3. If OK, promote
helm upgrade zeptodb ./deploy/helm/zeptodb -n zeptodb --set image.tag=2.0.0 --wait
helm uninstall zeptodb-canary -n zeptodb
```
```shell
# 3b. If NOT OK, remove the canary
helm uninstall zeptodb-canary -n zeptodb
```

## Rollback
```shell
# Instant rollback to previous revision
helm rollback zeptodb -n zeptodb
```
```shell
# Rollback to a specific revision
helm history zeptodb -n zeptodb
helm rollback zeptodb <REVISION> -n zeptodb
```
```shell
# Monitor
kubectl rollout status deployment/zeptodb -n zeptodb
```

## Cluster Mode Upgrade
When `cluster.enabled: true`, extra care is needed for distributed state.
```shell
# 1. Check cluster health before upgrade
curl -s http://<LB>:8123/health | jq .
```
```shell
# 2. Pause ingestion (if possible) to reduce in-flight state,
#    or rely on WAL replay for consistency
```
```shell
# 3. Upgrade with extended grace period
helm upgrade zeptodb ./deploy/helm/zeptodb \
  -n zeptodb \
  --set image.tag=1.1.0 \
  --set gracefulShutdown.preStopSleepSeconds=30 \
  --set gracefulShutdown.terminationGracePeriodSeconds=60 \
  --wait --timeout 10m
```
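For repeatability, the same overrides can live in a values file. The `cluster-upgrade-values.yaml` filename is our choice; the keys mirror the `--set` flags above:

```yaml
# cluster-upgrade-values.yaml
image:
  tag: "1.1.0"
gracefulShutdown:
  preStopSleepSeconds: 30
  terminationGracePeriodSeconds: 60
```

Apply it with `helm upgrade zeptodb ./deploy/helm/zeptodb -n zeptodb -f cluster-upgrade-values.yaml --wait --timeout 10m`.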
```shell
# 4. Verify cluster re-formation
# Nodes re-register automatically via CoordinatorHA
curl -s http://<LB>:8123/health
```

## Pre-Upgrade Checklist
- Current deployment healthy (`/health` returns 200 on all pods)
- HDB snapshot taken (backup before upgrade)
- New image tested locally or in staging
- `helm diff` reviewed (if the helm-diff plugin is installed)
- Monitoring dashboard open (Grafana)
- Rollback plan confirmed (`helm rollback` ready)
```shell
# Optional: preview changes
helm diff upgrade zeptodb ./deploy/helm/zeptodb -n zeptodb --set image.tag=1.1.0
```

## Troubleshooting
### Pod stuck in Pending
```shell
kubectl describe pod <pod> -n zeptodb
# Common causes: insufficient resources, PVC not bound
```

### Readiness probe failing on new pod
```shell
kubectl logs <pod> -n zeptodb
curl http://<pod-ip>:8123/ready
# Check whether the new version has startup issues
# Rollback: helm rollback zeptodb -n zeptodb
```

### Rollout stuck (deadline exceeded)
```shell
kubectl rollout status deployment/zeptodb -n zeptodb
# If stuck > 5 min:
kubectl rollout undo deployment/zeptodb -n zeptodb
```

Note that `kubectl rollout undo` bypasses Helm; once the cluster is stable, follow up with `helm rollback` so Helm's revision history matches what is actually deployed.