
ZeptoDB Kubernetes Compatibility & HA Test Report

- Date: 2026-04-08
- Cluster: EKS zepto-k8s-compat (us-east-1), K8s v1.32.12
- Nodes: 3x t3.xlarge (4 vCPU, 16 GiB)
- Image: nginx:1.27-alpine (stand-in for ZeptoDB)


| Suite | Tests | Passed | Failed |
| --- | --- | --- | --- |
| Compatibility (T01–T27) | 27 | 27 | 0 |
| HA (HA01–HA06) | 6 | 6 | 0 |
| Performance (PERF01–PERF05) | 5 | 5 | 0 |
| Total | 38 | 38 | 0 |

| ID | Test | Result | Notes |
| --- | --- | --- | --- |
| T01 | Helm lint | PASS | No errors, 1 info (icon recommended) |
| T02 | Helm template (default values) | PASS | All resources render correctly |
| T03 | Helm template (cluster + karpenter) | PASS | RPC/heartbeat ports, NodePool/EC2NodeClass rendered |
| ID | Test | Result | Notes |
| --- | --- | --- | --- |
| T04 | Pods reach Running state | PASS | 2/2 running |
| T05 | Pods pass readiness probe | PASS | 2/2 ready |
| T06 | Service has endpoints | PASS | 2 addresses |
| T07 | Headless service (clusterIP=None) | PASS | Pod discovery works |
| T08 | ConfigMap contains expected keys | PASS | port, worker_threads, analytics_mode, data_dir |
| T09 | PodDisruptionBudget exists | PASS | minAvailable=1 |
| ID | Test | Result | Notes |
| --- | --- | --- | --- |
| T10 | Anti-affinity (hostname spread) | PASS | preferredDuringScheduling rule present |
| T11 | Standard K8s labels | PASS | name, instance, version |
| T12 | Environment variables | PASS | POD_NAME, POD_NAMESPACE, POD_IP, APEX_WORKER_THREADS |
| T13 | preStop lifecycle hook | PASS | Graceful shutdown enabled |
| T14 | RollingUpdate strategy | PASS | maxUnavailable=0 |
| T15 | ConfigMap checksum annotation | PASS | Config change triggers rollout |
| T25 | terminationGracePeriodSeconds | PASS | 15s |
| T26 | Resource requests/limits | PASS | Both set |
| ID | Test | Result | Notes |
| --- | --- | --- | --- |
| T22 | Headless DNS resolution | PASS | Resolves to pod IPs |
| T23 | Pod-to-pod connectivity | PASS | Direct IP communication |
| T24 | ClusterIP service routing | PASS | Service routes to backend pods |
| ID | Test | Result | Notes |
| --- | --- | --- | --- |
| T16 | Rolling update execution | PASS | 2 ready after image tag change |
| T17 | Pod delete auto-recovery | PASS | Deployment recreates pod |
| T18 | PDB blocks eviction | PASS | Second eviction rejected |
| T19 | Helm rollback | PASS | Previous revision restored |
| T20 | Scale up (2→3) | PASS | 3 ready |
| T21 | Scale down (3→2) | PASS | 2 running |
| T27 | No warning events | PASS | Clean |

| ID | Test | Result | Details |
| --- | --- | --- | --- |
| HA01 | 3 pods on 3 nodes | PASS | Each pod on a separate node (anti-affinity working) |
| HA02 | Node drain + recovery | PASS | Pod migrated, service stayed available, recovery=1.1s |
| HA03 | PDB blocks concurrent drain | PASS | Second drain rejected by PDB |
| HA04 | Pod kill + service continuity | PASS | Service reachable during kill, recovery=9.3s |
| HA05 | Rolling update zero-downtime | PASS | 20 probes during rollout, 0 failures |
| HA06 | Scale 3→5→3 | PASS | Scale-up=1.3s, scale-down clean |

| Metric | Value | Unit | Notes |
| --- | --- | --- | --- |
| Pod startup latency (avg) | 5.22 | sec | Schedule + pull + ready (3 samples, stdev=0.02s) |
| Rolling update duration (3 replicas) | 30.36 | sec | Full rollout with maxUnavailable=0 |
| Node drain recovery | 1.11 | sec | Time until 3 pods ready after drain |
| Pod kill recovery | 9.33 | sec | Time until replacement pod ready |
| Service failover time | 7.25 | sec | Time until service consistently routes around dead pod |
| Scale 3→5 time | 1.26 | sec | Time until 5 pods ready |
| Pod-to-pod RTT (avg) | 1721 | ms | Includes kubectl exec overhead (~1.5s) |
| Pod-to-pod RTT (min) | 1580 | ms | Lower bound with kubectl overhead |
| HTTP sequential throughput | 50.79 | req/s | 100 sequential requests pod-to-pod |
- Pod-to-pod RTT includes kubectl exec overhead (~1.5s per call). Actual network latency is sub-millisecond within the same VPC. For accurate network benchmarks, use an in-pod tool like wrk or hey.
- HTTP throughput is limited by sequential wget calls. Real throughput with concurrent connections would be orders of magnitude higher.
- Pod startup latency of ~5s is typical for EKS with IfNotPresent pull policy (image already cached). First pull would add 5–15s.
- Service failover of ~7s aligns with readinessProbe configuration (periodSeconds=5, failureThreshold=2 = 10s max detection + endpoint removal).

5. Helm Chart Issues Found (Static Analysis)


Issue 1: Deployment + Single PVC (Shared Storage)


Severity: Medium (production impact)

The chart uses a Deployment with a single PersistentVolumeClaim. With ReadWriteOnce access mode, only one node can mount the volume. When replicas > 1, pods on different nodes cannot mount the same PVC.

Recommendation: Use StatefulSet with volumeClaimTemplates for per-pod storage, or switch to ReadWriteMany (requires EFS or similar).
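A minimal sketch of the StatefulSet direction (all names, image, mount path, and storage size below are illustrative, not taken from the chart):

```yaml
# Hypothetical StatefulSet excerpt: volumeClaimTemplates gives each replica
# its own PVC, so ReadWriteOnce volumes work across nodes.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zeptodb
spec:
  serviceName: zeptodb-headless   # headless service for stable pod DNS (cf. T07/T22)
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: zeptodb
  template:
    metadata:
      labels:
        app.kubernetes.io/name: zeptodb
    spec:
      containers:
        - name: zeptodb
          image: nginx:1.27-alpine
          volumeMounts:
            - name: data
              mountPath: /var/lib/zeptodb   # illustrative data_dir
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]   # per-pod volume, so RWO is sufficient
        resources:
          requests:
            storage: 10Gi
```

With this shape, scaling to N replicas creates N independent PVCs (data-zeptodb-0, data-zeptodb-1, …) instead of contending for one shared volume.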

Issue 2: spec.replicas Conflicts with HPA

Severity: Low

When HPA is enabled, the Deployment still sets spec.replicas from values.yaml. Every helm upgrade resets the replica count to the Helm value, overriding HPA’s scaling decisions.

Recommendation: Conditionally omit spec.replicas when autoscaling.enabled=true.
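A sketch of the conditional, assuming conventional value names (autoscaling.enabled, replicaCount) and a fullname helper that may not match the chart's actual templates:

```yaml
# Hypothetical deployment.yaml template excerpt: render replicas only when
# the HPA is not managing the count, so helm upgrade leaves it untouched.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "zeptodb.fullname" . }}
spec:
  {{- if not .Values.autoscaling.enabled }}
  replicas: {{ .Values.replicaCount }}
  {{- end }}
```

When spec.replicas is omitted, the API server preserves the current value on apply, which is exactly what the HPA needs.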

Issue 3: Hugepages Not Overridable via Values


Severity: Low

resources.requests.hugepages-2Mi in default values.yaml cannot be removed via override values — Helm deep-merges maps. Test environments without hugepages must explicitly set hugepages-2Mi: "0".

Recommendation: Move hugepages into a separate conditional block controlled by performanceTuning.hugepages.enabled.
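One possible shape for that conditional block (key names are assumptions, not the chart's actual keys):

```yaml
# values.yaml sketch -- hypothetical keys for opt-in hugepages
performanceTuning:
  hugepages:
    enabled: false      # test environments simply leave this off
    amount: 512Mi
---
# deployment.yaml template excerpt: the hugepages request only renders when
# explicitly enabled, so overrides never need to force hugepages-2Mi: "0".
resources:
  requests:
    cpu: {{ .Values.resources.requests.cpu }}
    memory: {{ .Values.resources.requests.memory }}
    {{- if .Values.performanceTuning.hugepages.enabled }}
    hugepages-2Mi: {{ .Values.performanceTuning.hugepages.amount }}
    {{- end }}
```

Because the key is now absent by default rather than present with a value, Helm's deep-merge of override values no longer pins it in place.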


```
# Tools
kubectl v1.26+
helm v3.x
eksctl v0.200+
python 3.13
```
```sh
# Create cluster (if needed)
eksctl create cluster -f tests/k8s/eks-compat-cluster.yaml

# Scale to 3 nodes for HA tests
eksctl scale nodegroup --cluster=zepto-k8s-compat --name=test-nodes --nodes=3 --region=us-east-1

# Compatibility tests (27 scenarios)
python3.13 tests/k8s/test_k8s_compat.py

# HA + Performance tests (11 scenarios)
python3.13 tests/k8s/test_k8s_ha_perf.py

# Cleanup
python3.13 tests/k8s/test_k8s_compat.py --cleanup
python3.13 tests/k8s/test_k8s_ha_perf.py --cleanup
eksctl delete cluster -f tests/k8s/eks-compat-cluster.yaml --disable-nodegroup-eviction
```
| Resource | Cost/hr |
| --- | --- |
| EKS control plane | $0.10 |
| 3x t3.xlarge On-Demand | $0.50 |
| EBS gp3 (if PVC enabled) | ~$0.02 |
| Total | ~$0.62/hr |

Test duration: ~10 min (both suites). Total cost per run: ~$0.10 + cluster creation time.


| File | Description |
| --- | --- |
| tests/k8s/eks-compat-cluster.yaml | EKS cluster config (3x t3.xlarge) |
| tests/k8s/test-values.yaml | Lightweight Helm values for testing |
| tests/k8s/test_k8s_compat.py | Compatibility test suite (27 tests) |
| tests/k8s/test_k8s_ha_perf.py | HA + Performance test suite (11 tests) |
| tests/k8s/run_k8s_compat.sh | One-shot script (create → test → delete) |
| docs/operations/K8S_TEST_REPORT.md | This document |