# ZeptoDB Kubernetes Operations Guide

Last updated: 2026-03-24
## Table of Contents

- Architecture Overview
- Initial Deployment
- Day-2 Operations
- Monitoring & Alerting
- Scaling
- Backup & Recovery
- Upgrades & Rollback
- Security
- Cluster Mode
- Troubleshooting
- Runbooks
See also: Failure Scenarios & Recovery Guide — Automatic/manual recovery procedures for 8 failure scenarios
## 1. Architecture Overview

```text
Kubernetes Cluster
└─ Namespace: zeptodb
   ├─ Deployment (3 replicas): Pod-0 / Pod-1 / Pod-2 (ZeptoDB :8123)
   ├─ Service (LoadBalancer :8123) + Headless Service (pod discovery)
   └─ ConfigMap | PVC (gp3 500Gi) | PDB | HPA | ServiceMonitor

Prometheus (ServiceMonitor, 15s scrape)    Grafana (dashboard + 9 alert rules)
```

### Helm Chart Components
| Resource | Template | Purpose |
|---|---|---|
| Deployment | deployment.yaml | ZeptoDB pods (rolling update) |
| Service | service.yaml | LoadBalancer + Headless |
| ConfigMap | configmap.yaml | zeptodb.conf |
| PVC | pvc.yaml | gp3 500Gi persistent storage |
| HPA | hpa.yaml | Auto-scaling (3–10 replicas) |
| PDB | pdb.yaml | minAvailable: 2 |
| ServiceMonitor | servicemonitor.yaml | Prometheus scrape config |
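The PDB is small but load-bearing during drains and upgrades. The rendered resource looks roughly like this (a sketch; the selector labels are assumed from the chart's naming conventions):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: zeptodb
  namespace: zeptodb
spec:
  minAvailable: 2          # never evict below 2 pods voluntarily
  selector:
    matchLabels:
      app.kubernetes.io/name: zeptodb
```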
## 2. Initial Deployment

### Prerequisites
```bash
# Required
kubectl version --client   # 1.26+
helm version               # 3.x

# Verify cluster access
kubectl cluster-info
kubectl get nodes
```

### Deploy with Helm (Recommended)

```bash
# Create namespace
kubectl create namespace zeptodb

# Install
helm install zeptodb ./deploy/helm/zeptodb \
  -n zeptodb \
  --set image.repository=your-registry/zeptodb \
  --set image.tag=1.0.0

# Verify
kubectl get all -n zeptodb
```

### Production values override
`values-prod.yaml`:

```yaml
replicaCount: 3

image:
  repository: your-registry/zeptodb
  tag: "1.0.0"

resources:
  requests:
    cpu: "4"
    memory: "16Gi"
  limits:
    cpu: "8"
    memory: "32Gi"

persistence:
  storageClass: gp3
  size: 500Gi

config:
  workerThreads: 8
  parallelThreshold: 100000

autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 10

podDisruptionBudget:
  enabled: true
  minAvailable: 2

# Graviton (ARM) nodes
nodeSelector:
  kubernetes.io/arch: arm64
  # or for x86:
  # kubernetes.io/arch: amd64
```

```bash
helm install zeptodb ./deploy/helm/zeptodb \
  -n zeptodb \
  -f values-prod.yaml \
  --wait --timeout 5m
```

### Post-Deploy Verification
```bash
# All pods running
kubectl get pods -n zeptodb -o wide

# Health check
export LB=$(kubectl get svc zeptodb -n zeptodb \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
curl -s http://$LB:8123/health
curl -s http://$LB:8123/ready
```
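LoadBalancer DNS can take a minute or two to start resolving after install, so one-shot curl checks may fail transiently. A small retry helper makes the verification scriptable (illustrative; the function name and timeout values are arbitrary, not part of the chart):

```bash
# wait_healthy URL [TIMEOUT_SECONDS]
# Polls URL until curl succeeds (-f treats HTTP errors as failures),
# returning 0 on success or 1 once the timeout expires.
wait_healthy() {
  local url=$1 timeout=${2:-60} start
  start=$(date +%s)
  until curl -sf "$url" >/dev/null 2>&1; do
    if [ $(( $(date +%s) - start )) -ge "$timeout" ]; then
      return 1
    fi
    sleep 2
  done
}

# Example:
#   wait_healthy "http://$LB:8123/ready" 120 || echo "not ready after 2m"
```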
```bash
# Test query
curl -X POST http://$LB:8123/ -d 'SELECT 1'
```

## 3. Day-2 Operations
### Daily Checks

```bash
#!/bin/bash
# daily-check.sh — run from cron or manually

NS=zeptodb
LB=$(kubectl get svc zeptodb -n $NS \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')

echo "=== Pod Status ==="
kubectl get pods -n $NS -o wide

echo "=== Health ==="
curl -sf http://$LB:8123/health && echo " OK" || echo " FAIL"

echo "=== Readiness ==="
curl -sf http://$LB:8123/ready && echo " OK" || echo " FAIL"

echo "=== HPA ==="
kubectl get hpa -n $NS

echo "=== PVC ==="
kubectl get pvc -n $NS

echo "=== Recent Events ==="
kubectl get events -n $NS --sort-by='.lastTimestamp' | tail -10
```

### Configuration Changes
When the ConfigMap is changed, the `checksum/config` annotation automatically triggers a rollout.
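The mechanism is the standard Helm checksum-annotation pattern; inside the chart's deployment template it looks roughly like this (a sketch; the template path is assumed):

```yaml
# deploy/helm/zeptodb/templates/deployment.yaml (sketch)
spec:
  template:
    metadata:
      annotations:
        # Hash of the rendered ConfigMap; changes to zeptodb.conf change
        # the pod template, which triggers a rolling update.
        checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}
```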
```bash
# Change worker threads
helm upgrade zeptodb ./deploy/helm/zeptodb -n zeptodb \
  --set config.workerThreads=16 \
  --wait

# Change multiple settings at once
helm upgrade zeptodb ./deploy/helm/zeptodb -n zeptodb \
  -f values-prod.yaml \
  --set config.workerThreads=16 \
  --set config.queryCacheSize=2000 \
  --wait
```

### Checking Logs
```bash
# Logs for a specific pod
kubectl logs -f <pod-name> -n zeptodb

# Logs for all pods (stern recommended)
stern zeptodb -n zeptodb

# Previous crash logs
kubectl logs <pod-name> -n zeptodb --previous

# Logs since a specific time
kubectl logs <pod-name> -n zeptodb --since=1h
```

### Pod Restart
Section titled “Pod Restart”# Full rolling restart (zero-downtime)kubectl rollout restart deployment/zeptodb -n zeptodb
# Delete a specific pod only (Deployment auto-recreates it)kubectl delete pod <pod-name> -n zeptodb4. Monitoring & Alerting
### Prometheus Setup

```bash
# Enable ServiceMonitor (requires Prometheus Operator)
helm upgrade zeptodb ./deploy/helm/zeptodb -n zeptodb \
  --set serviceMonitor.enabled=true \
  --set serviceMonitor.interval=15s
```

In environments without ServiceMonitor, use Pod annotation-based scraping:

```yaml
# Already included in deployment.yaml
annotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "8123"
  prometheus.io/path: "/metrics"
```

### Key Metrics
```bash
# Check directly
curl -s http://$LB:8123/metrics
```

| Metric | Type | Alert Threshold |
|---|---|---|
| `zepto_server_up` | gauge | == 0 → critical |
| `zepto_server_ready` | gauge | == 0 for 5m → warning |
| `zepto_ticks_ingested_total` | counter | rate < 1000/s → warning |
| `zepto_ticks_dropped_total` | counter | rate > 1000/s → warning |
| `zepto_queries_executed_total` | counter | rate > 100/s → info |
| `zepto_rows_scanned_total` | counter | rate > 10M/s → warning |
### Alert Rules (9 rules)

Defined in `monitoring/zeptodb-alerts.yml`:
| Alert | Severity | Condition |
|---|---|---|
| ZeptoDBDown | critical | zepto_server_up == 0 for 1m |
| ZeptoDBNotReady | warning | zepto_server_ready == 0 for 5m |
| HighTickDropRate | warning | drop rate > 1000/s for 2m |
| HighQueryRate | info | query rate > 100/s for 5m |
| HighRowScanRate | warning | scan rate > 10M/s for 5m |
| LowIngestionRate | warning | ingestion < 1000/s for 10m |
| HighDiskUsage | warning | disk < 20% free for 5m |
| HighMemoryUsage | warning | memory < 10% free for 5m |
| HighCPUUsage | warning | CPU > 90% for 10m |
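If the rules file is loaded through the Prometheus Operator, each table row corresponds to one rule in a PrometheusRule resource. As an illustrative sketch (the exact `expr` and group name in `monitoring/zeptodb-alerts.yml` may differ), HighTickDropRate would look like:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: zeptodb-alerts
  namespace: monitoring
spec:
  groups:
    - name: zeptodb
      rules:
        - alert: HighTickDropRate
          expr: rate(zepto_ticks_dropped_total[2m]) > 1000
          for: 2m
          labels:
            severity: warning
```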
### Grafana Dashboard

```bash
# Import dashboard
kubectl create configmap grafana-zeptodb \
  -n monitoring \
  --from-file=monitoring/grafana-dashboard.json

# Or import via Grafana UI → Import → monitoring/grafana-dashboard.json
```

Grafana can also connect directly to ZeptoDB as a ClickHouse data source (port 8123, ClickHouse-compatible API).
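For the direct connection, a Grafana provisioning file can register the data source. This is a sketch only: it assumes the ClickHouse plugin (`grafana-clickhouse-datasource`) and the in-cluster service DNS name; the `jsonData` field names follow that plugin and should be checked against its documentation:

```yaml
# provisioning/datasources/zeptodb.yaml (sketch)
apiVersion: 1
datasources:
  - name: ZeptoDB
    type: grafana-clickhouse-datasource
    access: proxy
    jsonData:
      host: zeptodb.zeptodb.svc   # assumed in-cluster Service DNS name
      port: 8123
      protocol: http
```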
## 5. Scaling

### Cluster Requirements

See EKS Cluster Requirements for full cluster setup, including the K8s version, Auto Mode, and custom NodePool configuration.
### EKS Auto Mode (Node Auto-Scaling)

EKS Auto Mode includes built-in Karpenter — no separate install is needed. Nodes are provisioned via the EC2 Fleet API when pods are pending.

```bash
# Check node pools (built-in + custom)
kubectl get nodepools
kubectl get nodeclasses

# Check node claims (active nodes)
kubectl get nodeclaims

# Monitor scaling events
kubectl describe nodepool zepto-realtime
kubectl describe nodepool zepto-analytics
```

Two custom node pools are configured:
| Pool | Trigger | Capacity | Consolidation |
|---|---|---|---|
| zepto-realtime | Pending pods with zeptodb.com/role: realtime | On-Demand only | WhenEmpty, after 30m |
| zepto-analytics | Pending pods with zeptodb.com/role: analytics | Spot + On-Demand | WhenEmptyOrUnderutilized, after 5m |
Scaling flow: HPA increases replicas → pods pending → Auto Mode provisions node (30-60s) → pods scheduled.
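A pool like zepto-realtime can be expressed as a Karpenter NodePool. The following is an illustrative sketch (the NodeClass reference and requirements are assumptions, not the exact manifest from this repo):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: zepto-realtime
spec:
  template:
    metadata:
      labels:
        zeptodb.com/role: realtime       # matches pods' nodeSelector
    spec:
      nodeClassRef:
        group: eks.amazonaws.com         # EKS Auto Mode built-in NodeClass
        kind: NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]          # On-Demand only, per the table
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30m
```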
### Horizontal Pod Autoscaler (HPA)

Default configuration: auto-scales between 3–10 replicas based on 70% CPU / 80% memory thresholds.
```bash
# Check HPA status
kubectl get hpa -n zeptodb
kubectl describe hpa zeptodb -n zeptodb

# Manual scale
kubectl scale deployment zeptodb -n zeptodb --replicas=5

# Change HPA settings
helm upgrade zeptodb ./deploy/helm/zeptodb -n zeptodb \
  --set autoscaling.minReplicas=5 \
  --set autoscaling.maxReplicas=20 \
  --set autoscaling.targetCPU=60
```

### Scale-Down Protection

```yaml
# Already configured in values.yaml
autoscaling:
  scaleDown:
    stabilizationSeconds: 300   # Scale down only after a 5-minute stabilization window
  scaleUp:
    stabilizationSeconds: 60    # Scale up after a 1-minute stabilization window
```

### Vertical Scaling
```bash
helm upgrade zeptodb ./deploy/helm/zeptodb -n zeptodb \
  --set resources.requests.cpu=8 \
  --set resources.requests.memory=32Gi \
  --set resources.limits.cpu=16 \
  --set resources.limits.memory=64Gi \
  --wait
```

### Node Selection (Graviton / x86)

```bash
# Graviton (ARM) nodes
helm upgrade zeptodb ./deploy/helm/zeptodb -n zeptodb \
  --set nodeSelector."kubernetes\.io/arch"=arm64

# Dedicated instance type
helm upgrade zeptodb ./deploy/helm/zeptodb -n zeptodb \
  --set nodeSelector."node\.kubernetes\.io/instance-type"=c7g.4xlarge
```

## 6. Backup & Recovery
### In-Cluster Backup (CronJob)

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: zeptodb-backup
  namespace: zeptodb
spec:
  schedule: "0 2 * * *"   # Daily at 02:00 UTC
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: amazon/aws-cli:latest
              env:
                - name: S3_BUCKET
                  value: "your-zeptodb-backups"
                - name: DATA_DIR
                  value: "/opt/zeptodb/data"
              command:
                - /bin/sh
                - -c
                - |
                  TIMESTAMP=$(date +%Y%m%d_%H%M%S)
                  tar -czf /tmp/zeptodb-${TIMESTAMP}.tar.gz -C ${DATA_DIR} .
                  aws s3 cp /tmp/zeptodb-${TIMESTAMP}.tar.gz \
                    s3://${S3_BUCKET}/backups/zeptodb-${TIMESTAMP}.tar.gz \
                    --storage-class STANDARD_IA
                  echo "Backup completed: zeptodb-${TIMESTAMP}.tar.gz"
              volumeMounts:
                - name: data
                  mountPath: /opt/zeptodb/data
                  readOnly: true
          volumes:
            - name: data
              persistentVolumeClaim:
                claimName: zeptodb-data
```

```bash
kubectl apply -f deploy/k8s/backup-cronjob.yaml
```
```bash
# Trigger manual backup
kubectl create job --from=cronjob/zeptodb-backup zeptodb-backup-manual -n zeptodb

# Check backup status
kubectl get jobs -n zeptodb
kubectl logs job/zeptodb-backup-manual -n zeptodb
```

### PVC Snapshot (EBS)
Section titled “PVC Snapshot (EBS)”# VolumeSnapshot (requires CSI driver)cat <<EOF | kubectl apply -f -apiVersion: snapshot.storage.k8s.io/v1kind: VolumeSnapshotmetadata: name: zeptodb-snap-$(date +%Y%m%d) namespace: zeptodbspec: volumeSnapshotClassName: ebs-csi-snapclass source: persistentVolumeClaimName: zeptodb-dataEOF
```bash
# Verify snapshot
kubectl get volumesnapshot -n zeptodb
```

### Recovery from Snapshot
Section titled “Recovery from Snapshot”# Create new PVC from snapshotcat <<EOF | kubectl apply -f -apiVersion: v1kind: PersistentVolumeClaimmetadata: name: zeptodb-data-restored namespace: zeptodbspec: accessModes: [ReadWriteOnce] storageClassName: gp3 resources: requests: storage: 500Gi dataSource: name: zeptodb-snap-20260324 kind: VolumeSnapshot apiGroup: snapshot.storage.k8s.ioEOF
# Replace PVC in Deploymenthelm upgrade zeptodb ./deploy/helm/zeptodb -n zeptodb \ --set persistence.existingClaim=zeptodb-data-restored \ --wait7. Upgrades & Rollback
For details, see the Rolling Upgrade Guide.
### Standard Upgrade

```bash
# 1. Pre-flight
kubectl get pods -n zeptodb -o wide
curl -s http://$LB:8123/health

# 2. Upgrade
helm upgrade zeptodb ./deploy/helm/zeptodb -n zeptodb \
  --set image.tag=1.1.0 \
  --wait --timeout 5m

# 3. Monitor
kubectl rollout status deployment/zeptodb -n zeptodb

# 4. Verify
curl -s http://$LB:8123/health
curl -X POST http://$LB:8123/ -d 'SELECT 1'
```

### Zero-Downtime Guarantee Mechanisms
Section titled “Zero-Downtime Guarantee Mechanisms”| Setting | Value | Effect |
|---|---|---|
| `maxSurge` | 1 | Create 1 new pod first |
| `maxUnavailable` | 0 | Maintain existing pod count |
| PDB `minAvailable` | 2 | Guarantee minimum 2 pods |
| `preStop` sleep | 15s | Wait for in-flight queries to complete |
| `readinessProbe` | `/ready` | Only ready pods receive traffic |
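In the Deployment spec these settings correspond roughly to the following (a sketch built from the table values; container name, port, and probe timing are assumptions):

```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # bring up 1 new pod before taking any down
      maxUnavailable: 0    # never dip below the current replica count
  template:
    spec:
      containers:
        - name: zeptodb
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "15"]   # drain in-flight queries
          readinessProbe:
            httpGet:
              path: /ready
              port: 8123
```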
### Rollback

```bash
# Immediate rollback
helm rollback zeptodb -n zeptodb

# Rollback to a specific revision
helm history zeptodb -n zeptodb
helm rollback zeptodb <REVISION> -n zeptodb

# kubectl rollback (without Helm)
kubectl rollout undo deployment/zeptodb -n zeptodb
```

### Canary Deployment
```bash
# 1. Canary deployment (1 replica)
helm install zeptodb-canary ./deploy/helm/zeptodb -n zeptodb \
  --set replicaCount=1 \
  --set image.tag=2.0.0 \
  --set service.type=ClusterIP \
  --set autoscaling.enabled=false \
  --set podDisruptionBudget.enabled=false

# 2. Canary testing
kubectl port-forward svc/zeptodb-canary 8124:8123 -n zeptodb
curl -X POST http://localhost:8124/ -d 'SELECT vwap(price, volume) FROM trades WHERE symbol = 1'

# 3a. Success → promote
helm upgrade zeptodb ./deploy/helm/zeptodb -n zeptodb --set image.tag=2.0.0 --wait
helm uninstall zeptodb-canary -n zeptodb

# 3b. Failure → remove
helm uninstall zeptodb-canary -n zeptodb
```

## 8. Security
### TLS Termination

```bash
# Create TLS Secret
kubectl create secret tls zeptodb-tls \
  -n zeptodb \
  --cert=/path/to/cert.pem \
  --key=/path/to/key.pem
```
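If cert-manager is available, the same Secret can be issued and renewed automatically instead of being created by hand (illustrative; the issuer name is an assumption):

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: zeptodb-tls
  namespace: zeptodb
spec:
  secretName: zeptodb-tls        # same Secret name the Ingress references
  dnsNames:
    - zeptodb.example.com
  issuerRef:
    name: letsencrypt-prod       # hypothetical ClusterIssuer
    kind: ClusterIssuer
```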
```bash
# Ingress with TLS
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: zeptodb
  namespace: zeptodb
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "HTTP"
spec:
  tls:
    - hosts:
        - zeptodb.example.com
      secretName: zeptodb-tls
  rules:
    - host: zeptodb.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: zeptodb
                port:
                  number: 8123
EOF
```

### API Key / JWT Secrets
```bash
# API keys file
kubectl create secret generic zeptodb-auth \
  -n zeptodb \
  --from-file=keys.txt=/path/to/keys.txt

# JWT secret
kubectl create secret generic zeptodb-jwt \
  -n zeptodb \
  --from-literal=JWT_SECRET='your-jwt-secret'

# Vault integration (Secrets Store CSI)
# → SecretsProvider chain: Vault KV v2 → K8s file → env var
```

### Network Policy
```yaml
# Allow access only from same namespace + monitoring
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: zeptodb-netpol
  namespace: zeptodb
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: zeptodb
  policyTypes: [Ingress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: zeptodb
        - namespaceSelector:
            matchLabels:
              name: monitoring
      ports:
        - port: 8123
          protocol: TCP
```

### RBAC (Kubernetes)
```yaml
# Role for operators
# Note: RBAC requires full resource names; the "hpa" short name is not
# valid here, so horizontalpodautoscalers is spelled out.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: zeptodb-operator
  namespace: zeptodb
rules:
  - apiGroups: ["", "apps", "autoscaling"]
    resources: ["pods", "deployments", "services", "configmaps", "horizontalpodautoscalers"]
    verbs: ["get", "list", "watch", "update", "patch"]
  - apiGroups: [""]
    resources: ["pods/log", "pods/exec"]
    verbs: ["get", "create"]
```

## 9. Cluster Mode
For operating a ZeptoDB distributed cluster on Kubernetes.
### Enable Cluster

```bash
helm upgrade zeptodb ./deploy/helm/zeptodb -n zeptodb \
  --set cluster.enabled=true \
  --set cluster.rpcPortOffset=100 \
  --set cluster.heartbeatPort=9100 \
  --set headless.enabled=true
```

Direct pod-to-pod communication via the Headless Service:

- RPC: `<pod-name>.zeptodb-headless.zeptodb.svc:8223`
- Heartbeat: UDP `:9100`
### Cluster Health

```bash
# Check cluster status for each pod
for pod in $(kubectl get pods -n zeptodb -l app.kubernetes.io/name=zeptodb -o name); do
  echo "--- $pod ---"
  kubectl exec -n zeptodb $pod -- curl -s http://localhost:8123/health
  echo
done
```

### Cluster Upgrade Considerations
- CoordinatorHA handles automatic re-registration
- FencingToken prevents split-brain
- Increase `gracefulShutdown` time during upgrades to ensure the WAL is flushed

```bash
helm upgrade zeptodb ./deploy/helm/zeptodb -n zeptodb \
  --set image.tag=1.1.0 \
  --set gracefulShutdown.preStopSleepSeconds=30 \
  --set gracefulShutdown.terminationGracePeriodSeconds=60 \
  --wait --timeout 10m
```

## 10. Troubleshooting
### Pod Fails to Start

```bash
# Check status
kubectl describe pod <pod> -n zeptodb

# Common causes:
# - ImagePullBackOff  → Check image path/authentication
# - Pending           → Insufficient resources (kubectl describe node)
# - CrashLoopBackOff  → Check logs (kubectl logs --previous)
```

### Readiness Probe Failure
Section titled “Readiness Probe Failure”kubectl logs <pod> -n zeptodb | grep -i "error\|fail\|ready"
# Check directly from inside the podkubectl exec -n zeptodb <pod> -- curl -s http://localhost:8123/readyPVC Not Bound
Section titled “PVC Not Bound”kubectl describe pvc zeptodb-data -n zeptodb
# Check StorageClasskubectl get sc# If gp3 StorageClass does not exist, it needs to be createdOOMKilled
Section titled “OOMKilled”# Check memory usagekubectl top pods -n zeptodb
# Increase limitshelm upgrade zeptodb ./deploy/helm/zeptodb -n zeptodb \ --set resources.limits.memory=64Gi \ --waitSlow Queries
Section titled “Slow Queries”# Check query plan with EXPLAINcurl -X POST http://$LB:8123/ -d 'EXPLAIN SELECT ...'
# Check running queries via Admin APIcurl -H "Authorization: Bearer $ADMIN_KEY" http://$LB:8123/admin/queries
# Kill slow querycurl -X DELETE -H "Authorization: Bearer $ADMIN_KEY" \ http://$LB:8123/admin/queries/<query-id>HPA Not Scaling
Section titled “HPA Not Scaling”kubectl describe hpa zeptodb -n zeptodb
# Check metrics-serverkubectl top pods -n zeptodb# "error: Metrics API not available" → metrics-server needs to be installed11. Runbooks
### Runbook: Emergency Restart

```bash
# 1. Record current state
kubectl get pods -n zeptodb -o wide > /tmp/zeptodb-state.txt

# 2. Rolling restart (zero-downtime)
kubectl rollout restart deployment/zeptodb -n zeptodb
kubectl rollout status deployment/zeptodb -n zeptodb --timeout=5m

# 3. Verify
curl -s http://$LB:8123/health
curl -X POST http://$LB:8123/ -d 'SELECT 1'
```

### Runbook: Disk Full
Section titled “Runbook: Disk Full”# 1. Checkkubectl exec -n zeptodb <pod> -- df -h /opt/zeptodb/data
# 2. Clean up old HDB data (TTL setting)curl -X POST http://$LB:8123/ \ -d "ALTER TABLE trades SET TTL 90 DAYS"
# 3. Expand PVC (if StorageClass has allowVolumeExpansion: true)kubectl patch pvc zeptodb-data -n zeptodb \ -p '{"spec":{"resources":{"requests":{"storage":"1Ti"}}}}'Runbook: Node Drain (Maintenance)
Section titled “Runbook: Node Drain (Maintenance)”# PDB guarantees minAvailable: 2, so drain is safekubectl drain <node> --ignore-daemonsets --delete-emptydir-data
# After maintenance is completekubectl uncordon <node>Runbook: Complete Redeployment
Section titled “Runbook: Complete Redeployment”# 1. Backupkubectl create job --from=cronjob/zeptodb-backup zeptodb-pre-redeploy -n zeptodbkubectl wait --for=condition=complete job/zeptodb-pre-redeploy -n zeptodb --timeout=10m
# 2. Deletehelm uninstall zeptodb -n zeptodb# PVC is preserved (not deleted by helm uninstall)
# 3. Redeployhelm install zeptodb ./deploy/helm/zeptodb -n zeptodb -f values-prod.yaml --wait
# 4. Verifycurl -s http://$LB:8123/healthQuick Reference
Section titled “Quick Reference”# === Status ===kubectl get all -n zeptodbkubectl get hpa -n zeptodbkubectl get pvc -n zeptodbkubectl get events -n zeptodb --sort-by='.lastTimestamp' | tail -20
# === Logs ===kubectl logs -f deployment/zeptodb -n zeptodbkubectl logs <pod> -n zeptodb --previous
# === Health ===curl http://$LB:8123/healthcurl http://$LB:8123/readycurl http://$LB:8123/metrics
# === Helm ===helm list -n zeptodbhelm history zeptodb -n zeptodbhelm get values zeptodb -n zeptodb
# === Upgrade ===helm upgrade zeptodb ./deploy/helm/zeptodb -n zeptodb --set image.tag=X.Y.Z --waithelm rollback zeptodb -n zeptodb
# === Scale ===kubectl scale deployment zeptodb -n zeptodb --replicas=5kubectl top pods -n zeptodb