ZeptoDB Production Operations Guide
1. Initial Setup

1.1 Service Installation

    # 1. Build ZeptoDB
    cd /home/ec2-user/zeptodb
    mkdir build && cd build
    cmake .. -DCMAKE_BUILD_TYPE=Release
    make -j$(nproc)

    # 2. Install production service (requires root)
    cd ..
    sudo ./deploy/scripts/install_service.sh

Installation includes:
- ✅ `zeptodb` user created
- ✅ Directories: `/opt/zeptodb`, `/data/zeptodb`, `/var/log/zeptodb`
- ✅ systemd service: `zeptodb.service`
- ✅ Cron jobs: backup (02:00), EOD (18:00 weekdays)
- ✅ Log rotation: 30-day retention
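The items above can be spot-checked once the installer has finished. The following is a minimal sketch, not part of the ZeptoDB tooling: the helper name `missing_paths` is hypothetical, and the paths and names are the ones listed above.

```shell
# Print every path argument that does not exist (hypothetical helper).
missing_paths() {
  for p in "$@"; do
    [ -e "$p" ] || printf '%s\n' "$p"
  done
  return 0
}

# Usage sketch (run on the installed host):
#   missing_paths /opt/zeptodb /data/zeptodb /var/log/zeptodb
#   id zeptodb                      # confirms the service user exists
#   systemctl cat zeptodb.service   # prints the installed unit file
```

An empty result from `missing_paths` means all three directories are present.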
1.2 Start Service

    # Start service
    sudo systemctl start zeptodb

    # Check status
    sudo systemctl status zeptodb

    # View logs
    sudo journalctl -u zeptodb -f

1.3 Health Check

    # Liveness probe (server alive?)
    curl http://localhost:8123/health
    # {"status":"healthy"}

    # Readiness probe (ready to serve queries?)
    curl http://localhost:8123/ready
    # {"status":"ready"}

    # Statistics
    curl http://localhost:8123/stats

2. Monitoring
2.1 Prometheus + Grafana Setup

    # Run monitoring stack with Docker Compose
    cd /home/ec2-user/zeptodb/monitoring
    docker-compose up -d

    # Check services
    docker-compose ps

Access:
- Grafana: http://localhost:3000 (admin/zepto-admin-2026)
- Prometheus: http://localhost:9090
2.2 Grafana Dashboard Configuration
- Log in to Grafana
- Configuration → Data Sources → Add Prometheus
  - URL: `http://prometheus:9090`
- Dashboards → Import
  - File: `grafana-dashboard.json`
2.3 Check Metrics

    # Check Prometheus metrics
    curl http://localhost:8123/metrics

Key metrics:

| Metric | Type | Description |
|---|---|---|
| `zepto_ticks_ingested_total` | counter | Total ingested ticks |
| `zepto_ticks_stored_total` | counter | Stored ticks |
| `zepto_ticks_dropped_total` | counter | Dropped ticks |
| `zepto_queries_executed_total` | counter | Executed queries |
| `zepto_rows_scanned_total` | counter | Scanned rows |
| `zepto_server_up` | gauge | Server running state (0/1) |
| `zepto_server_ready` | gauge | Readiness state (0/1) |
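The two tick counters in the table can be combined into a quick drop-rate check. This is a sketch under two assumptions that the guide does not confirm: both counters are exposed without labels, and dropped divided by ingested is the ratio of interest. The helper name `drop_percent` is hypothetical.

```shell
# Read Prometheus text-format metrics on stdin and print
# dropped / ingested as a percentage (hypothetical helper).
drop_percent() {
  awk '
    $1 == "zepto_ticks_ingested_total" { ingested = $2 }
    $1 == "zepto_ticks_dropped_total"  { dropped  = $2 }
    END {
      if (ingested > 0) printf "%.4f\n", 100 * dropped / ingested
      else print "n/a"
    }
  '
}

# Usage sketch:
#   curl -s http://localhost:8123/metrics | drop_percent
```

Because both values are counters, the result is an average since process start; a rate over a recent window is better computed in Prometheus itself.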
2.4 Alert Configuration
Edit Alertmanager configuration:

    vi /home/ec2-user/zeptodb/monitoring/alertmanager.yml

Slack Webhook setup:
- Slack → Apps → Incoming Webhooks
- Copy Webhook URL
- Replace `YOUR_SLACK_WEBHOOK_URL` in `alertmanager.yml`

PagerDuty setup:
- PagerDuty → Service → Integrations → Prometheus
- Copy Integration Key
- Replace `YOUR_PAGERDUTY_SERVICE_KEY` in `alertmanager.yml`
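The two placeholder replacements can also be scripted instead of edited by hand. The sketch below assumes the placeholders appear literally in the file; `fill_placeholder` is a hypothetical helper, and the URL in the usage line is a stand-in, not a real webhook.

```shell
# Replace a literal placeholder with a value in a file, keeping a .bak copy.
# Uses "|" as the sed delimiter, so the value must not contain "|", "&",
# or backslashes (hypothetical helper).
fill_placeholder() {
  file="$1"; placeholder="$2"; value="$3"
  sed -i.bak "s|${placeholder}|${value}|g" "$file"
}

# Usage sketch:
#   cd /home/ec2-user/zeptodb/monitoring
#   fill_placeholder alertmanager.yml YOUR_SLACK_WEBHOOK_URL "https://hooks.example.invalid/..."
#   fill_placeholder alertmanager.yml YOUR_PAGERDUTY_SERVICE_KEY "$PAGERDUTY_KEY"
```

If `amtool` is installed, `amtool check-config alertmanager.yml` can validate the result before Alertmanager is restarted.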
3. Logging
3.1 Structured Logging

ZeptoDB produces structured logs in JSON format.

Log locations:
- File: `/var/log/zeptodb/zeptodb.log`
- systemd: `journalctl -u zeptodb`

Log levels:
- `TRACE` - Detailed debug
- `DEBUG` - Development info
- `INFO` - General info
- `WARN` - Warnings
- `ERROR` - Errors
- `CRITICAL` - Fatal errors
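Since the levels above are ordered by severity, the JSON log file can be filtered by a minimum level. The sketch below uses `jq` and assumes one JSON object per line with a `level` field, as in the log example in section 3.3; `filter_min_level` is a hypothetical helper.

```shell
# Print only log entries at or above a minimum level (default WARN).
# Reads JSON lines on stdin; entries with an unknown level are dropped.
filter_min_level() {
  min="${1:-WARN}"
  jq -c --arg min "$min" '
    ["TRACE","DEBUG","INFO","WARN","ERROR","CRITICAL"] as $order
    | .level as $lvl
    | select(($order | index($lvl)) >= ($order | index($min)))
  '
}

# Usage sketch:
#   sudo cat /var/log/zeptodb/zeptodb.log | filter_min_level ERROR
```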
3.2 Viewing Logs

    # Recent logs (journalctl)
    sudo journalctl -u zeptodb -n 100

    # Live logs
    sudo journalctl -u zeptodb -f

    # JSON log file
    sudo tail -f /var/log/zeptodb/zeptodb.log | jq .

    # Filter errors only
    sudo journalctl -u zeptodb -p err

3.3 Log Example

    {
      "timestamp": "2026-03-22T14:30:45.123+0900",
      "level": "INFO",
      "message": "Query executed successfully",
      "component": "QueryExecutor",
      "details": "SELECT sum(volume) FROM trades - 1.2ms"
    }

3.4 Log Rotation

Automatically configured (`/etc/logrotate.d/zeptodb`):
- Daily rotation
- 30-day retention
- Compressed storage
4. Backup & Recovery
4.1 Automatic Backup

Cron configuration (auto-runs):

    0 2 * * * /opt/zeptodb/scripts/backup.sh

Manual backup:

    sudo -u zeptodb /opt/zeptodb/scripts/backup.sh

Backup contents:
- HDB (Historical Database)
- WAL (Write-Ahead Log)
- Config files
- Metadata

Backup location:
- Local: `/backup/zeptodb/zeptodb-backup-YYYYMMDD_HHMMSS.tar.gz`
- S3: `s3://${S3_BUCKET}/backups/` (optional)
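Because the local archive names embed a `YYYYMMDD_HHMMSS` timestamp, sorting them by name also sorts them by time. The sketch below uses that to find the newest archive; `latest_backup` is a hypothetical helper, not part of the shipped scripts.

```shell
# Print the newest backup archive in a directory (default /backup/zeptodb).
# Relies on the YYYYMMDD_HHMMSS suffix sorting lexicographically by time.
latest_backup() {
  dir="${1:-/backup/zeptodb}"
  ls "$dir"/zeptodb-backup-*.tar.gz 2>/dev/null | sort | tail -n 1
}

# Usage sketch: derive the backup name without directory or extension.
#   basename "$(latest_backup)" .tar.gz
```

The helper prints nothing when the directory holds no matching archive, which makes it easy to alert on a missing nightly backup.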
4.2 S3 Backup Configuration

    # Set environment variables
    export ZEPTO_S3_BACKUP_BUCKET="my-apex-backups"
    export AWS_REGION="us-east-1"

    # Required IAM permissions:
    # - s3:PutObject
    # - s3:GetObject
    # - s3:ListBucket
    # - s3:DeleteObject

4.3 Disaster Recovery

Restore from local backup:

    # 1. Stop ZeptoDB
    sudo systemctl stop zeptodb

    # 2. Restore backup
    sudo /opt/zeptodb/scripts/restore.sh zeptodb-backup-20260322_020000

    # 3. Restart service
    sudo systemctl start zeptodb

Restore from S3 backup:

    sudo /opt/zeptodb/scripts/restore.sh zeptodb-backup-20260322_020000 --from-s3

Skip WAL replay:

    sudo /opt/zeptodb/scripts/restore.sh zeptodb-backup-20260322_020000 --skip-wal-replay

4.4 Backup Retention Policy
Local:
- Retention period: 30 days (default)
- Configure via `BACKUP_RETENTION_DAYS` environment variable

S3:
- Lifecycle policy recommended
- STANDARD_IA (Infrequent Access) storage class
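One way to apply the S3 recommendation is a lifecycle rule that transitions backup objects to STANDARD_IA. The sketch below only generates the rule document; the rule ID, the `backups/` prefix default, and the 30-day threshold are illustrative assumptions (S3 does not transition objects to STANDARD_IA before 30 days), and `lifecycle_json` is a hypothetical helper.

```shell
# Emit an S3 lifecycle configuration that moves objects under a prefix
# to STANDARD_IA after a number of days (hypothetical helper).
lifecycle_json() {
  prefix="${1:-backups/}"; days="${2:-30}"
  cat <<EOF
{
  "Rules": [
    {
      "ID": "zeptodb-backups-to-ia",
      "Status": "Enabled",
      "Filter": { "Prefix": "${prefix}" },
      "Transitions": [
        { "Days": ${days}, "StorageClass": "STANDARD_IA" }
      ]
    }
  ]
}
EOF
}

# Usage sketch (requires s3:PutLifecycleConfiguration on the bucket):
#   lifecycle_json backups/ 30 > lifecycle.json
#   aws s3api put-bucket-lifecycle-configuration \
#     --bucket "$ZEPTO_S3_BACKUP_BUCKET" \
#     --lifecycle-configuration file://lifecycle.json
```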
5. Automation
5.1 systemd Service

Service management:

    # Start
    sudo systemctl start zeptodb

    # Stop
    sudo systemctl stop zeptodb

    # Restart
    sudo systemctl restart zeptodb

    # Status
    sudo systemctl status zeptodb

    # Enable auto-start on boot
    sudo systemctl enable zeptodb

Service features:
- ✅ Auto-restart (5 seconds after failure)
- ✅ CPU affinity (cores 0-7)
- ✅ OOM protection (priority -900)
- ✅ Resource limits (1M files, unlimited memory)
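These features can be checked against the running unit with `systemctl show`. The sketch assumes the unit implements them with the standard systemd directives (`Restart=`, `RestartSec=`, `CPUAffinity=`, `OOMScoreAdjust=`, `LimitNOFILE=`); that mapping is an assumption, not something this guide states. `unit_prop` is a hypothetical helper.

```shell
# Extract one property value from "Key=Value" lines such as those
# printed by `systemctl show` (hypothetical helper).
unit_prop() {
  awk -F= -v key="$1" '$1 == key { print substr($0, length(key) + 2) }'
}

# Usage sketch:
#   systemctl show zeptodb -p Restart -p RestartUSec -p CPUAffinity \
#     -p OOMScoreAdjust -p LimitNOFILE
#   systemctl show zeptodb -p OOMScoreAdjust | unit_prop OOMScoreAdjust
```

Under that assumption, the second command would be expected to print `-900`.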
5.2 EOD (End-of-Day) Process
Auto-runs (cron):

    0 18 * * 1-5 /opt/zeptodb/scripts/eod_process.sh

Manual run:

    sudo -u zeptodb /opt/zeptodb/scripts/eod_process.sh

EOD tasks:
- RDB → HDB flush
- Statistics collection
- WAL cleanup (compress after 7 days, delete after 30 days)
- Automatic backup
- Disk usage check
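The WAL cleanup rule (compress after 7 days, delete after 30 days) can be expressed with two `find` passes. This is a sketch of the policy only, not the contents of `eod_process.sh`; the default directory and the `.wal` / `.wal.gz` file names are assumptions, and `wal_cleanup` is a hypothetical helper.

```shell
# Compress *.wal files older than 7 days and delete *.wal.gz files
# older than 30 days (hypothetical helper; GNU find assumed).
wal_cleanup() {
  wal_dir="${1:-/data/zeptodb/wal}"
  # Delete first, so files compressed in this run are judged on a later run.
  find "$wal_dir" -type f -name '*.wal.gz' -mtime +30 -delete
  find "$wal_dir" -type f -name '*.wal' -mtime +7 -exec gzip {} +
}

# Usage sketch:
#   sudo -u zeptodb sh -c '. ./wal_cleanup.sh && wal_cleanup /data/zeptodb/wal'
```

`gzip` keeps the original modification time on the compressed file, so the 30-day limit is measured from when the WAL segment was last written, not from when it was compressed.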
Logs:

    tail -f /var/log/zeptodb/eod.log

5.3 Auto Tuning

Bare metal tuning:

    sudo /opt/zeptodb/scripts/tune_bare_metal.sh

Tuning items:
- CPU governor → performance
- Turbo Boost enabled
- Hugepages 32GB
- IRQ affinity
- Network stack
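To verify the hugepage item after tuning, it helps to know how many pages 32GB corresponds to. The sketch below treats 32GB as 32 GiB and takes the page size in kB, as reported by `Hugepagesize` in `/proc/meminfo`; both are assumptions about how the tuning script counts, and `hugepages_needed` is a hypothetical helper.

```shell
# Number of hugepages needed for a target size.
# Arguments: size in GiB, hugepage size in kB.
hugepages_needed() {
  echo $(( $1 * 1024 * 1024 / $2 ))
}

hugepages_needed 32 2048      # 2 MiB pages -> 16384
hugepages_needed 32 1048576   # 1 GiB pages -> 32

# Compare with the live system:
#   grep -E 'HugePages_Total|Hugepagesize' /proc/meminfo
#   cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor   # expect "performance"
```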
6. Troubleshooting
6.1 Service Won’t Start

    # Check logs
    sudo journalctl -u zeptodb -n 100

    # Check config file
    cat /opt/zeptodb/config.yaml

    # Check port conflict
    sudo lsof -i :8123

    # Check permissions
    ls -la /data/zeptodb

6.2 High Tick Drop Rate
Causes:
- Ring buffer too small
- CPU overload
- Disk I/O bottleneck
Resolution:

    # 1. Check metrics
    curl http://localhost:8123/metrics | grep dropped

    # 2. Check CPU usage
    top -u zeptodb

    # 3. Increase ring buffer size (config.yaml)
    ring_buffer_size: 1048576  # default 524288

    # 4. Restart
    sudo systemctl restart zeptodb

6.3 Slow Queries

    # 1. Check rows scanned
    curl http://localhost:8123/stats

    # 2. Check query execution plan (EXPLAIN)
    curl -X POST http://localhost:8123/ -d "EXPLAIN SELECT ..."

    # 3. Check HDB compression status
    du -sh /data/zeptodb/hdb

    # 4. Verify parallel query enabled (config.yaml)
    query_threads: 8

6.4 Disk Full

    # 1. Check usage
    df -h /data/zeptodb

    # 2. Clean old HDB directories
    #    -mindepth 1 keeps the hdb root itself; -prune stops find from
    #    descending into directories it is about to remove
    find /data/zeptodb/hdb -mindepth 1 -type d -mtime +90 -prune -exec rm -rf {} +

    # 3. Compress WAL
    find /data/zeptodb/wal -name "*.wal" -mtime +7 -exec gzip {} \;

    # 4. Clean backups
    find /backup/zeptodb -name "*.tar.gz" -mtime +30 -delete

6.5 Out of Memory

    # 1. Check memory usage
    free -h
    pmap -x $(pgrep zepto-server)

    # 2. OOM killer logs
    dmesg | grep -i "out of memory"

    # 3. Check hugepages
    cat /proc/meminfo | grep Huge

    # 4. Restart process (clear memory)
    sudo systemctl restart zeptodb

6.6 Prometheus Metrics Not Visible

    # 1. Check /metrics endpoint
    curl http://localhost:8123/metrics

    # 2. Check Prometheus targets
    curl http://localhost:9090/api/v1/targets | jq .

    # 3. Prometheus logs
    docker logs zepto-prometheus

    # 4. Check firewall
    sudo iptables -L -n | grep 8123

7. Performance Optimization
7.1 Bare Metal Environment

    # CPU isolation (GRUB); editing this file requires root
    sudo vi /etc/default/grub
    # GRUB_CMDLINE_LINUX="isolcpus=0-7 nohz_full=0-7"
    # Regenerate the GRUB configuration and reboot for the change to take effect

    # Auto tuning
    sudo /opt/zeptodb/scripts/tune_bare_metal.sh

    # Check NUMA topology
    numactl --hardware

7.2 Cloud Environment
AWS optimization:
- Instance: `c7g.16xlarge` (64 vCPU, 128GB RAM)
- Storage: `io2` EBS (64K IOPS)
- Network: Enhanced Networking (ENA)
- Placement Group: `cluster`
Kubernetes optimization:

    # Resource allocation
    resources:
      requests:
        cpu: "32"
        memory: "64Gi"
      limits:
        cpu: "64"
        memory: "128Gi"

    # Node affinity
    nodeSelector:
      node.kubernetes.io/instance-type: c7g.16xlarge

8. Security
8.1 Network Security

    # Firewall configuration (iptables)
    sudo iptables -A INPUT -p tcp --dport 8123 -s 10.0.0.0/8 -j ACCEPT
    sudo iptables -A INPUT -p tcp --dport 8123 -j DROP

    # UFW
    sudo ufw allow from 10.0.0.0/8 to any port 8123

8.2 TLS Configuration

    server:
      tls:
        enabled: true
        cert_file: /etc/zeptodb/server.crt
        key_file: /etc/zeptodb/server.key

8.3 Authentication

    auth:
      enabled: true
      type: basic  # or jwt
      users:
        - username: admin
          password_hash: "$2a$10$..."

9. Contact
For issues:
- GitHub Issues: https://github.com/zeptodb/zeptodb/issues
- Slack: #zeptodb-support
- Email: support@zeptodb.com
Critical incidents:
- PagerDuty: ZeptoDB Critical Alerts
- On-call: +1-XXX-XXX-XXXX