
ZeptoDB Production Operations Guide


# 1. Build ZeptoDB
cd /home/ec2-user/zeptodb
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
# 2. Install production service (requires root)
cd ..
sudo ./deploy/scripts/install_service.sh

Installation includes:

  • ✅ User: zeptodb service user created
  • ✅ Directories: /opt/zeptodb, /data/zeptodb, /var/log/zeptodb
  • ✅ systemd service: zeptodb.service
  • ✅ Cron jobs: backup (02:00), EOD (18:00 weekdays)
  • ✅ Log rotation: 30-day retention
# Start service
sudo systemctl start zeptodb
# Check status
sudo systemctl status zeptodb
# View logs
sudo journalctl -u zeptodb -f
# Liveness probe (server alive?)
curl http://localhost:8123/health
# {"status":"healthy"}
# Readiness probe (ready to serve queries?)
curl http://localhost:8123/ready
# {"status":"ready"}
# Statistics
curl http://localhost:8123/stats
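Deploy scripts often need to block until the server is actually serving. A minimal sketch of a readiness gate built on the /ready probe above; `wait_ready` is a hypothetical helper name, and the JSON check is a plain string match so jq is not required:

```shell
# Poll the readiness endpoint until it reports ready or a timeout expires.
# Usage: wait_ready [url] [max_tries]
wait_ready() {
  url=${1:-http://localhost:8123/ready}
  tries=${2:-30}
  while [ "$tries" -gt 0 ]; do
    case "$(curl -fsS "$url" 2>/dev/null)" in
      *'"status":"ready"'*) echo ready; return 0 ;;
    esac
    tries=$((tries - 1))
    sleep 1
  done
  echo timeout
  return 1
}
# Example: wait_ready http://localhost:8123/ready 30
```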

# Run monitoring stack with Docker Compose
cd /home/ec2-user/zeptodb/monitoring
docker-compose up -d
# Check services
docker-compose ps

Access:

  1. Log in to Grafana
  2. Configuration → Data Sources → Add Prometheus
    • URL: http://prometheus:9090
  3. Dashboards → Import
    • File: grafana-dashboard.json
# Check Prometheus metrics
curl http://localhost:8123/metrics

Key metrics:

Metric                        Type     Description
----------------------------  -------  --------------------------
zepto_ticks_ingested_total    counter  Total ingested ticks
zepto_ticks_stored_total      counter  Stored ticks
zepto_ticks_dropped_total     counter  Dropped ticks
zepto_queries_executed_total  counter  Executed queries
zepto_rows_scanned_total      counter  Scanned rows
zepto_server_up               gauge    Server running state (0/1)
zepto_server_ready            gauge    Readiness state (0/1)

Edit Alertmanager configuration:

vi /home/ec2-user/zeptodb/monitoring/alertmanager.yml

Slack Webhook setup:

  1. Slack → Apps → Incoming Webhooks
  2. Copy Webhook URL
  3. Replace YOUR_SLACK_WEBHOOK_URL in alertmanager.yml

PagerDuty setup:

  1. PagerDuty → Service → Integrations → Prometheus
  2. Copy Integration Key
  3. Replace YOUR_PAGERDUTY_SERVICE_KEY in alertmanager.yml

ZeptoDB produces structured logs in JSON format.

Log locations:

  • File: /var/log/zeptodb/zeptodb.log
  • systemd: journalctl -u zeptodb

Log levels:

  • TRACE - Detailed debug
  • DEBUG - Development info
  • INFO - General info
  • WARN - Warnings
  • ERROR - Errors
  • CRITICAL - Fatal errors
# Recent logs (journalctl)
sudo journalctl -u zeptodb -n 100
# Live logs
sudo journalctl -u zeptodb -f
# JSON log file
sudo tail -f /var/log/zeptodb/zeptodb.log | jq .
# Filter errors only
sudo journalctl -u zeptodb -p err
Example log entry:

{
  "timestamp": "2026-03-22T14:30:45.123+0900",
  "level": "INFO",
  "message": "Query executed successfully",
  "component": "QueryExecutor",
  "details": "SELECT sum(volume) FROM trades - 1.2ms"
}
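Because each log line is a self-contained JSON object, severity filtering on the file works with plain grep. A sketch under that assumption; `filter_errors` is an illustrative helper, not part of ZeptoDB, and the pattern tolerates an optional space after the colon as in the example entry:

```shell
# Keep only ERROR/CRITICAL entries from a JSON-lines log file.
filter_errors() {
  grep -E '"level": ?"(ERROR|CRITICAL)"'
}

# Demo with inline sample lines; in production pipe the log file instead:
#   filter_errors < /var/log/zeptodb/zeptodb.log
printf '%s\n' \
  '{"level": "INFO", "message": "Query executed successfully"}' \
  '{"level": "ERROR", "message": "WAL write failed"}' \
  '{"level": "CRITICAL", "message": "Data directory unwritable"}' | filter_errors
```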

Automatically configured (/etc/logrotate.d/zeptodb):

  • Daily rotation
  • 30-day retention
  • Compressed storage

Cron configuration (auto-runs):

0 2 * * * /opt/zeptodb/scripts/backup.sh

Manual backup:

sudo -u zeptodb /opt/zeptodb/scripts/backup.sh

Backup contents:

  • HDB (Historical Database)
  • WAL (Write-Ahead Log)
  • Config files
  • Metadata

Backup location:

  • Local: /backup/zeptodb/zeptodb-backup-YYYYMMDD_HHMMSS.tar.gz
  • S3: s3://${S3_BUCKET}/backups/ (optional)
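The installed backup.sh is the authoritative implementation; as a rough sketch of the packaging and naming convention it follows (temporary directories stand in for the real /data and /backup paths, so this is safe to run anywhere):

```shell
# Package a data directory into a timestamped tarball, mirroring the
# zeptodb-backup-YYYYMMDD_HHMMSS.tar.gz naming shown above.
STAMP=$(date +%Y%m%d_%H%M%S)
SRC=$(mktemp -d)    # stands in for /data/zeptodb
DEST=$(mktemp -d)   # stands in for /backup/zeptodb
echo "demo" > "$SRC/hdb.dat"
tar -czf "$DEST/zeptodb-backup-$STAMP.tar.gz" -C "$SRC" .
ls "$DEST"
```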
# Set environment variables
export ZEPTO_S3_BACKUP_BUCKET="my-apex-backups"
export AWS_REGION="us-east-1"
# Required IAM permissions:
# - s3:PutObject
# - s3:GetObject
# - s3:ListBucket
# - s3:DeleteObject

Restore from local backup:

# 1. Stop ZeptoDB
sudo systemctl stop zeptodb
# 2. Restore backup
sudo /opt/zeptodb/scripts/restore.sh zeptodb-backup-20260322_020000
# 3. Restart service
sudo systemctl start zeptodb

Restore from S3 backup:

sudo /opt/zeptodb/scripts/restore.sh zeptodb-backup-20260322_020000 --from-s3

Skip WAL replay:

sudo /opt/zeptodb/scripts/restore.sh zeptodb-backup-20260322_020000 --skip-wal-replay

Local:

  • Retention period: 30 days (default)
  • Configure via BACKUP_RETENTION_DAYS environment variable

S3:

  • Lifecycle policy recommended
  • STANDARD_IA (Infrequent Access) storage class
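The local retention policy above can be sketched with find against BACKUP_RETENTION_DAYS. This demo uses a throwaway directory with backdated files (GNU touch `-d` syntax) so it never touches real backups:

```shell
# Prune local backups older than the retention window (default 30 days).
RETENTION_DAYS=${BACKUP_RETENTION_DAYS:-30}
BK=$(mktemp -d)     # stands in for /backup/zeptodb
touch -d '40 days ago' "$BK/zeptodb-backup-old.tar.gz"
touch                  "$BK/zeptodb-backup-new.tar.gz"
find "$BK" -name '*.tar.gz' -mtime +"$RETENTION_DAYS" -delete
ls "$BK"   # only the recent backup remains
```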

Service management:

# Start
sudo systemctl start zeptodb
# Stop
sudo systemctl stop zeptodb
# Restart
sudo systemctl restart zeptodb
# Status
sudo systemctl status zeptodb
# Enable auto-start on boot
sudo systemctl enable zeptodb

Service features:

  • ✅ Auto-restart (5 seconds after failure)
  • ✅ CPU affinity (cores 0-7)
  • ✅ OOM protection (oom_score_adj -900)
  • ✅ Resource limits (1M files, unlimited memory)

Auto-runs (cron):

0 18 * * 1-5 /opt/zeptodb/scripts/eod_process.sh

Manual run:

sudo -u zeptodb /opt/zeptodb/scripts/eod_process.sh

EOD tasks:

  1. RDB → HDB flush
  2. Statistics collection
  3. WAL cleanup (compress after 7 days, delete after 30 days)
  4. Automatic backup
  5. Disk usage check
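The WAL cleanup step (compress after 7 days, delete after 30) can be sketched with two find passes. The demo below runs against a throwaway directory with backdated files; note gzip preserves the original file's timestamp, so the delete pass still sees the true age:

```shell
WAL=$(mktemp -d)    # stands in for /data/zeptodb/wal
touch -d '10 days ago' "$WAL/a.wal"      # old enough to compress
touch -d '40 days ago' "$WAL/b.wal.gz"   # old enough to delete
touch                  "$WAL/c.wal"      # fresh, left alone
# Compress WALs older than 7 days, then delete compressed WALs older than 30.
find "$WAL" -name '*.wal'    -mtime +7  -exec gzip {} \;
find "$WAL" -name '*.wal.gz' -mtime +30 -delete
ls "$WAL"
```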

Logs:

tail -f /var/log/zeptodb/eod.log

Bare metal tuning:

sudo /opt/zeptodb/scripts/tune_bare_metal.sh

Tuning items:

  • CPU governor → performance
  • Turbo Boost enabled
  • Hugepages 32GB
  • IRQ affinity
  • Network stack
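To verify the governor change took effect, read the standard Linux cpufreq sysfs path; `check_governor` is an illustrative helper, and it degrades gracefully because the path is often absent in VMs and containers:

```shell
check_governor() {
  gov=/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
  if [ -r "$gov" ]; then
    echo "governor: $(cat "$gov")"   # expect "performance" after tuning
  else
    echo "governor: cpufreq sysfs not available on this host"
  fi
}
check_governor
```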

# Check logs
sudo journalctl -u zeptodb -n 100
# Check config file
cat /opt/zeptodb/config.yaml
# Check port conflict
sudo lsof -i :8123
# Check permissions
ls -la /data/zeptodb

Causes:

  • Ring buffer too small
  • CPU overload
  • Disk I/O bottleneck

Resolution:

# 1. Check metrics
curl http://localhost:8123/metrics | grep dropped
# 2. Check CPU usage
top -u zeptodb
# 3. Increase ring buffer size (config.yaml)
ring_buffer_size: 1048576 # default 524288
# 4. Restart
sudo systemctl restart zeptodb
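When sizing the ring buffer it helps to quantify the drop rate from the counters on /metrics. A sketch using awk over Prometheus text exposition; `drop_rate` is an illustrative helper, and in production you would pipe `curl -s localhost:8123/metrics` into it instead of the sample lines:

```shell
# Compute dropped/ingested ratio from Prometheus text exposition.
drop_rate() {
  awk '
    /^zepto_ticks_ingested_total/ { ing = $2 }
    /^zepto_ticks_dropped_total/  { dr  = $2 }
    END { if (ing > 0) printf "drop_rate=%.4f%%\n", 100 * dr / ing }'
}
printf '%s\n' \
  'zepto_ticks_ingested_total 1000000' \
  'zepto_ticks_dropped_total 1500' | drop_rate
# drop_rate=0.1500%
```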
# 1. Check rows scanned
curl http://localhost:8123/stats
# 2. Check query execution plan (EXPLAIN)
curl -X POST http://localhost:8123/ -d "EXPLAIN SELECT ..."
# 3. Check HDB compression status
du -sh /data/zeptodb/hdb
# 4. Verify parallel query enabled (config.yaml)
query_threads: 8
# 1. Check usage
df -h /data/zeptodb
# 2. Clean old HDB
find /data/zeptodb/hdb -mindepth 1 -maxdepth 1 -type d -mtime +90 -exec rm -rf {} +
# 3. Compress WAL
find /data/zeptodb/wal -name "*.wal" -mtime +7 -exec gzip {} \;
# 4. Clean backups
find /backup/zeptodb -name "*.tar.gz" -mtime +30 -delete
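A small threshold check could gate the cleanup above before alerts fire. `check_disk` is an illustrative helper and the 80% threshold is an example default, not a ZeptoDB setting; on a real host feed it `df --output=pcent /data/zeptodb | tail -1 | tr -dc 0-9`:

```shell
check_disk() {  # check_disk <used_percent> <threshold_percent>
  if [ "$1" -ge "$2" ]; then
    echo "WARN: disk ${1}% >= ${2}% threshold"
    return 1
  fi
  echo "OK: disk ${1}% below ${2}% threshold"
}
check_disk 65 80
# OK: disk 65% below 80% threshold
```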
# 1. Check memory usage
free -h
pmap -x $(pgrep zepto-server)
# 2. OOM killer logs
dmesg | grep -i "out of memory"
# 3. Check hugepages
cat /proc/meminfo | grep Huge
# 4. Restart process (clear memory)
sudo systemctl restart zeptodb
# 1. Check /metrics endpoint
curl http://localhost:8123/metrics
# 2. Check Prometheus targets
curl http://localhost:9090/api/v1/targets | jq .
# 3. Prometheus logs
docker logs zepto-prometheus
# 4. Check firewall
sudo iptables -L -n | grep 8123

# CPU isolation (GRUB)
vi /etc/default/grub
# GRUB_CMDLINE_LINUX="isolcpus=0-7 nohz_full=0-7"
# Auto tuning
sudo /opt/zeptodb/scripts/tune_bare_metal.sh
# Check NUMA topology
numactl --hardware

AWS optimization:

  • Instance: c7g.16xlarge (64 vCPU, 128GB RAM)
  • Storage: io2 EBS (64K IOPS)
  • Network: Enhanced Networking (ENA)
  • Placement Group: cluster

Kubernetes optimization:

# Resource allocation
resources:
  requests:
    cpu: "32"
    memory: "64Gi"
  limits:
    cpu: "64"
    memory: "128Gi"
# Node affinity
nodeSelector:
  node.kubernetes.io/instance-type: c7g.16xlarge

# Firewall configuration (iptables)
sudo iptables -A INPUT -p tcp --dport 8123 -s 10.0.0.0/8 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 8123 -j DROP
# UFW
sudo ufw allow from 10.0.0.0/8 to any port 8123
config.yaml

server:
  tls:
    enabled: true
    cert_file: /etc/zeptodb/server.crt
    key_file: /etc/zeptodb/server.key
config.yaml

auth:
  enabled: true
  type: basic # or jwt
  users:
    - username: admin
      password_hash: "$2a$10$..."

For critical incidents:

  • PagerDuty: ZeptoDB Critical Alerts
  • On-call: +1-XXX-XXX-XXXX