ZeptoDB Kubernetes Compatibility & HA Test Report
Date: 2026-04-08
Cluster: EKS zepto-k8s-compat (us-east-1), K8s v1.32.12
Nodes: 3x t3.xlarge (4 vCPU, 16 GiB)
Image: nginx:1.27-alpine (stand-in for ZeptoDB)
1. Test Summary
Section titled “1. Test Summary”| Suite | Tests | Passed | Failed |
|---|---|---|---|
| Compatibility (T01–T27) | 27 | 27 | 0 |
| HA (HA01–HA06) | 6 | 6 | 0 |
| Performance (PERF01–PERF05) | 5 | 5 | 0 |
| Total | 38 | 38 | 0 |
2. Compatibility Tests (T01–T27)
Section titled “2. Compatibility Tests (T01–T27)”Static Validation
Section titled “Static Validation”| ID | Test | Result | Notes |
|---|---|---|---|
| T01 | Helm lint | PASS | No errors, 1 info (icon recommended) |
| T02 | Helm template (default values) | PASS | All resources render correctly |
| T03 | Helm template (cluster + karpenter) | PASS | RPC/heartbeat ports, NodePool/EC2NodeClass rendered |
Deployment Basics
Section titled “Deployment Basics”| ID | Test | Result | Notes |
|---|---|---|---|
| T04 | Pods reach Running state | PASS | 2/2 running |
| T05 | Pods pass readiness probe | PASS | 2/2 ready |
| T06 | Service has endpoints | PASS | 2 addresses |
| T07 | Headless service (clusterIP=None) | PASS | Pod discovery works |
| T08 | ConfigMap contains expected keys | PASS | port, worker_threads, analytics_mode, data_dir |
| T09 | PodDisruptionBudget exists | PASS | minAvailable=1 |
Pod Spec Validation
Section titled “Pod Spec Validation”| ID | Test | Result | Notes |
|---|---|---|---|
| T10 | Anti-affinity (hostname spread) | PASS | preferredDuringScheduling rule present |
| T11 | Standard K8s labels | PASS | name, instance, version |
| T12 | Environment variables | PASS | POD_NAME, POD_NAMESPACE, POD_IP, APEX_WORKER_THREADS |
| T13 | preStop lifecycle hook | PASS | Graceful shutdown enabled |
| T14 | RollingUpdate strategy | PASS | maxUnavailable=0 |
| T15 | ConfigMap checksum annotation | PASS | Config change triggers rollout |
| T25 | terminationGracePeriodSeconds | PASS | 15s |
| T26 | Resource requests/limits | PASS | Both set |
Networking
Section titled “Networking”| ID | Test | Result | Notes |
|---|---|---|---|
| T22 | Headless DNS resolution | PASS | Resolves to pod IPs |
| T23 | Pod-to-pod connectivity | PASS | Direct IP communication |
| T24 | ClusterIP service routing | PASS | Service routes to backend pods |
Lifecycle Operations
Section titled “Lifecycle Operations”| ID | Test | Result | Notes |
|---|---|---|---|
| T16 | Rolling update execution | PASS | 2 ready after image tag change |
| T17 | Pod delete auto-recovery | PASS | Deployment recreates pod |
| T18 | PDB blocks eviction | PASS | Second eviction rejected |
| T19 | Helm rollback | PASS | Previous revision restored |
| T20 | Scale up (2→3) | PASS | 3 ready |
| T21 | Scale down (3→2) | PASS | 2 running |
| T27 | No warning events | PASS | Clean |
3. HA Tests (HA01–HA06)
Section titled “3. HA Tests (HA01–HA06)”| ID | Test | Result | Details |
|---|---|---|---|
| HA01 | 3 pods on 3 nodes | PASS | Each pod on a separate node (anti-affinity working) |
| HA02 | Node drain + recovery | PASS | Pod migrated, service stayed available, recovery=1.1s |
| HA03 | PDB blocks concurrent drain | PASS | Second drain rejected by PDB |
| HA04 | Pod kill + service continuity | PASS | Service reachable during kill, recovery=9.3s |
| HA05 | Rolling update zero-downtime | PASS | 20 probes during rollout, 0 failures |
| HA06 | Scale 3→5→3 | PASS | Scale-up=1.3s, scale-down clean |
4. Performance Benchmarks
Section titled “4. Performance Benchmarks”| Metric | Value | Unit | Notes |
|---|---|---|---|
| Pod startup latency (avg) | 5.22 | sec | Schedule + pull + ready (3 samples, stdev=0.02s) |
| Rolling update duration (3 replicas) | 30.36 | sec | Full rollout with maxUnavailable=0 |
| Node drain recovery | 1.11 | sec | Time until 3 pods ready after drain |
| Pod kill recovery | 9.33 | sec | Time until replacement pod ready |
| Service failover time | 7.25 | sec | Time until service consistently routes around dead pod |
| Scale 3→5 time | 1.26 | sec | Time until 5 pods ready |
| Pod-to-pod RTT (avg) | 1721 | ms | Includes kubectl exec overhead (~1.5s) |
| Pod-to-pod RTT (min) | 1580 | ms | Lower bound with kubectl overhead |
| HTTP sequential throughput | 50.79 | req/s | 100 sequential requests pod-to-pod |
Notes on Performance Numbers
Section titled “Notes on Performance Numbers”- Pod-to-pod RTT includes
kubectl execoverhead (~1.5s per call). Actual network latency is sub-millisecond within the same VPC. For accurate network benchmarks, use an in-pod tool likewrkorhey. - HTTP throughput is limited by sequential
wgetcalls. Real throughput with concurrent connections would be orders of magnitude higher. - Pod startup latency of ~5s is typical for EKS with
IfNotPresentpull policy (image already cached). First pull would add 5–15s. - Service failover of ~7s aligns with readinessProbe configuration (periodSeconds=5, failureThreshold=2 = 10s max detection + endpoint removal).
5. Helm Chart Issues Found (Static Analysis)
Section titled “5. Helm Chart Issues Found (Static Analysis)”Issue 1: Deployment + Single PVC (Shared Storage)
Section titled “Issue 1: Deployment + Single PVC (Shared Storage)”Severity: Medium (production impact)
The chart uses a Deployment with a single PersistentVolumeClaim. With ReadWriteOnce access mode, only one node can mount the volume. When replicas > 1, pods on different nodes cannot mount the same PVC.
Recommendation: Use StatefulSet with volumeClaimTemplates for per-pod storage, or switch to ReadWriteMany (requires EFS or similar).
Issue 2: HPA + spec.replicas Conflict
Section titled “Issue 2: HPA + spec.replicas Conflict”Severity: Low
When HPA is enabled, the Deployment still sets spec.replicas from values.yaml. Every helm upgrade resets the replica count to the Helm value, overriding HPA’s scaling decisions.
Recommendation: Conditionally omit spec.replicas when autoscaling.enabled=true.
Issue 3: Hugepages Not Overridable via Values
Section titled “Issue 3: Hugepages Not Overridable via Values”Severity: Low
resources.requests.hugepages-2Mi in default values.yaml cannot be removed via override values — Helm deep-merges maps. Test environments without hugepages must explicitly set hugepages-2Mi: "0".
Recommendation: Move hugepages into a separate conditional block controlled by performanceTuning.hugepages.enabled.
6. Test Execution
Section titled “6. Test Execution”Prerequisites
Section titled “Prerequisites”# Toolskubectl v1.26+helm v3.xeksctl v0.200+python 3.13Run All Tests
Section titled “Run All Tests”# Create cluster (if needed)eksctl create cluster -f tests/k8s/eks-compat-cluster.yaml
# Scale to 3 nodes for HA testseksctl scale nodegroup --cluster=zepto-k8s-compat --name=test-nodes --nodes=3 --region=us-east-1
# Compatibility tests (27 scenarios)python3.13 tests/k8s/test_k8s_compat.py
# HA + Performance tests (11 scenarios)python3.13 tests/k8s/test_k8s_ha_perf.py
# Cleanuppython3.13 tests/k8s/test_k8s_compat.py --cleanuppython3.13 tests/k8s/test_k8s_ha_perf.py --cleanupeksctl delete cluster -f tests/k8s/eks-compat-cluster.yaml --disable-nodegroup-evictionEstimated Cost
Section titled “Estimated Cost”| Resource | Cost/hr |
|---|---|
| EKS control plane | $0.10 |
| 3x t3.xlarge On-Demand | $0.50 |
| EBS gp3 (if PVC enabled) | ~$0.02 |
| Total | ~$0.62/hr |
Test duration: ~10 min (both suites). Total cost per run: ~$0.10 + cluster creation time.
7. Test File Index
Section titled “7. Test File Index”| File | Description |
|---|---|
tests/k8s/eks-compat-cluster.yaml | EKS cluster config (3x t3.xlarge) |
tests/k8s/test-values.yaml | Lightweight Helm values for testing |
tests/k8s/test_k8s_compat.py | Compatibility test suite (27 tests) |
tests/k8s/test_k8s_ha_perf.py | HA + Performance test suite (11 tests) |
tests/k8s/run_k8s_compat.sh | One-shot script (create → test → delete) |
docs/operations/K8S_TEST_REPORT.md | This document |