ZeptoDB Bare Metal Tuning Guide
Detailed guide for extracting maximum performance from ZeptoDB on bare-metal servers. For cloud/K8s tuning, see CLOUD_PERFORMANCE_TUNING.md.
Last Updated: 2026-03-31
Tested On: Intel Xeon 6975P-C (8C), 31GB RAM, Amazon Linux 2023
Quick Start
# 1. Kernel boot parameters (reboot required)
sudo vi /etc/default/grub
# Add: isolcpus=0-3 nohz_full=0-3 rcu_nocbs=0-3 transparent_hugepage=never \
#      processor.max_cstate=1 intel_idle.max_cstate=0 \
#      default_hugepagesz=2M hugepagesz=2M hugepages=16384
sudo grub2-mkconfig -o /boot/grub2/grub.cfg && sudo reboot

# 2. Runtime tuning
sudo ./deploy/scripts/tune_bare_metal.sh

# 3. Build (production)
cmake .. -G Ninja -DCMAKE_BUILD_TYPE=Release \
    -DZEPTO_USE_TCMALLOC=ON -DZEPTO_USE_LTO=ON
ninja -j$(nproc)

# 4. Run
sudo numactl --cpunodebind=0 --membind=0 \
    taskset -c 0-3 ./zepto_http_server --port 8123 --hugepages

1. CPU Pinning & Isolation
Kernel Boot Parameters
Add to GRUB_CMDLINE_LINUX in /etc/default/grub:
isolcpus=0-3 nohz_full=0-3 rcu_nocbs=0-3

| Parameter | Purpose |
|---|---|
| isolcpus=0-3 | Remove cores 0-3 from the general scheduler so they are dedicated to ZeptoDB |
| nohz_full=0-3 | Disable timer ticks on the isolated cores, removing tick-induced jitter |
| rcu_nocbs=0-3 | Offload RCU callbacks from the isolated cores, keeping them off the hot path |
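The three parameters are easy to mistype, so it can help to scan a command line for them. The sketch below defines `missing_boot_params`, a hypothetical helper that is not part of ZeptoDB. It works on any string, so it can be pointed at the edited GRUB_CMDLINE_LINUX value before rebooting or at /proc/cmdline afterwards.

```shell
# Hypothetical helper (not part of ZeptoDB): print the required kernel
# parameters that are absent from a command-line string.
# Usage: missing_boot_params "<cmdline>" [required parameter names...]
missing_boot_params() {
    local cmdline=$1 name missing=""
    shift
    # Default to the three isolation parameters from the table above.
    [ $# -gt 0 ] || set -- isolcpus nohz_full rcu_nocbs
    for name in "$@"; do
        case " $cmdline " in
            *" $name="*) ;;                  # parameter present
            *) missing="$missing $name" ;;   # parameter absent
        esac
    done
    # Empty output and exit status 0 mean every parameter was found.
    echo "${missing# }"
    [ -z "$missing" ]
}

# Example: missing_boot_params "$(cat /proc/cmdline)"
```

The function only checks that each parameter name is present; it does not validate the core ranges.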
After editing:
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot

Runtime CPU Affinity
ZeptoDB’s ResourceIsolation supports per-thread pinning via pthread_setaffinity_np (implemented in src/execution/resource_isolation.cpp).
For manual control:
# Pin server to isolated cores
sudo numactl --cpunodebind=0 --membind=0 \
    taskset -c 0-3 ./zepto_http_server --port 8123

# Verify
taskset -p $(pidof zepto_http_server)

IRQ Affinity
Move network interrupts away from ZeptoDB cores:
# Move NIC IRQs to system cores (8-15, hex mask ff00)
for irq in $(grep -E 'eth|mlx|ixgbe' /proc/interrupts | cut -d: -f1 | tr -d ' '); do
    echo ff00 > /proc/irq/$irq/smp_affinity 2>/dev/null || true
done

Multi-NUMA Layout
For dual-socket servers, run one instance per NUMA node:
# Node 0: realtime ingestion
numactl --cpunodebind=0 --membind=0 taskset -c 0-7 \
    ./zepto_http_server --port 8123 --node-id 0

# Node 1: analytics queries
numactl --cpunodebind=1 --membind=1 taskset -c 16-23 \
    ./zepto_http_server --port 8124 --node-id 1

Verify NUMA allocation:
numastat -p $(pidof zepto_http_server)

2. Hugepages
ZeptoDB’s arena allocator uses mmap with MAP_HUGETLB for 2MB pages. Without hugepages, an 8.9GB working set requires ~2.3M TLB entries (4KB pages) vs 4,448 entries (2MB pages), a 512× difference in TLB pressure.
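The page counts above can be reproduced with a few lines of shell arithmetic. This is a minimal sketch: `tlb_entries` is an illustrative name, and the 8896MB input assumes the 8.9GB working set is 278 arenas of 32MB, as in the sizing example later in this section.

```shell
# Pages (and therefore TLB entries) needed to map a working set.
# Usage: tlb_entries <working_set_MB> <page_size_KB>
tlb_entries() {
    local ws_mb=$1 page_kb=$2
    # Convert MB to KB, then round up to a whole number of pages.
    echo $(( (ws_mb * 1024 + page_kb - 1) / page_kb ))
}

tlb_entries 8896 4      # 4KB pages → 2277376
tlb_entries 8896 2048   # 2MB pages → 4448
```

The ratio between the two results is 512, which is simply 2MB / 4KB.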
Allocation
Boot-time (recommended, avoids fragmentation):
default_hugepagesz=2M hugepagesz=2M hugepages=16384

Runtime:
# Allocate 32GB of 2MB hugepages
echo 16384 > /proc/sys/vm/nr_hugepages

# Compact memory first if fragmented
echo 1 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/compact_memory

Persistent (sysctl):
echo "vm.nr_hugepages = 16384" >> /etc/sysctl.d/99-zepto-tuning.conf
sysctl -p /etc/sysctl.d/99-zepto-tuning.conf

Sizing Formula
hugepages = (num_partitions × arena_size_MB) / 2 + headroom_20%
Example: 278 partitions × 32MB = 8.9GB (4,448 pages) → echo 4608 > /proc/sys/vm/nr_hugepages (minimum: 9GB, before the 20% headroom)
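The formula can be sketched as a small shell function. `hugepages_needed` is an illustrative name, and the sketch assumes the 20% headroom is applied to the page count and rounded up. The guide does not state how its script rounds.

```shell
# Number of 2MB hugepages for a given partition count and arena size.
# Usage: hugepages_needed <num_partitions> <arena_size_MB> [headroom_percent]
hugepages_needed() {
    local partitions=$1 arena_mb=$2 headroom_pct=${3:-20}
    # Arena memory in MB divided by the 2MB page size, rounded up.
    local base=$(( (partitions * arena_mb + 1) / 2 ))
    # Add headroom, rounding up to a whole page.
    echo $(( (base * (100 + headroom_pct) + 99) / 100 ))
}

hugepages_needed 278 32 0   # arenas only → 4448
hugepages_needed 278 32     # with 20% headroom → 5338
```

Under these assumptions the example workload needs 5338 pages with headroom, so the 4608 shown above is a floor rather than a target.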
1GB Hugepages (Advanced)
For very large working sets (>32GB), 1GB pages further reduce TLB misses:
hugepagesz=1G hugepages=32

⚠️ 1GB pages must be allocated at boot. Runtime allocation is not supported. ZeptoDB’s arena allocator currently requests 2MB pages only (see Not Yet Implemented).
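Before adding the boot parameter, it may be worth confirming that the CPU advertises 1GB page support. On x86 this shows up as the `pdpe1gb` flag in /proc/cpuinfo. The helper name below is hypothetical, and the function takes the flag list as an argument so it can be tried on any string.

```shell
# Succeeds if the given CPU flag list contains the x86 "pdpe1gb" flag,
# which indicates hardware support for 1GB pages.
cpu_supports_1g_pages() {
    case " $1 " in
        *" pdpe1gb "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example:
#   cpu_supports_1g_pages "$(grep -m1 '^flags' /proc/cpuinfo)" && echo "1GB pages supported"
```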
Verification
grep -i huge /proc/meminfo
# HugePages_Total: 16384
# HugePages_Free: 12000 ← some used by ZeptoDB
# Hugepagesize: 2048 kB

3. C-State & CPU Frequency
Disable Deep C-States
Deep sleep states (C2+) add wake-up latency (10-200μs). For consistent sub-microsecond response:
Boot parameter (strongest):
processor.max_cstate=1 intel_idle.max_cstate=0

Runtime:
for state in /sys/devices/system/cpu/cpu*/cpuidle/state[2-9]/disable; do
    echo 1 > "$state" 2>/dev/null || true
done

CPU Governor
# Set all cores to performance (fixed max frequency)
for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$gov" 2>/dev/null || true
done

⚠️ AWS EC2 instances virtualize CPU frequency controls, so scaling_governor may not be writable. This is expected.
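Because the loop above ignores write failures, a follow-up check shows which cores actually ended up on the performance governor. This is a minimal sketch: `non_performance_cpus` is a hypothetical helper, and the sysfs root is a parameter only so the function can be run against a test directory.

```shell
# Hypothetical helper: list CPUs whose governor is not "performance".
# Usage: non_performance_cpus [sysfs_cpu_root]
non_performance_cpus() {
    local root=${1:-/sys/devices/system/cpu} f gov cpu
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        # Skip cores without a readable governor (e.g. virtualized hosts).
        [ -r "$f" ] || continue
        gov=$(cat "$f")
        cpu=${f#"$root"/}      # strip the root prefix
        cpu=${cpu%%/*}         # keep only the "cpuN" component
        [ "$gov" = "performance" ] || echo "$cpu: $gov"
    done
    return 0
}

# Example: non_performance_cpus    (no output means every readable core is set)
```

On hosts where scaling_governor is not exposed, the function prints nothing, which matches the expected EC2 behavior noted above.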
Turbo Boost
Disable for latency consistency (turbo causes frequency transitions):
# Intel
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
# AMD
echo 0 > /sys/devices/system/cpu/cpufreq/boost

4. Kernel Sysctl Parameters
All parameters applied by deploy/scripts/tune_bare_metal.sh:
# Memory
vm.nr_hugepages = 16384          # 32GB of 2MB pages (adjust to workload)
vm.swappiness = 0                # never swap (in-memory DB)
kernel.numa_balancing = 0        # ZeptoDB manages NUMA explicitly
vm.zone_reclaim_mode = 0         # don't reclaim from local zone
vm.stat_interval = 120           # reduce vmstat overhead
vm.dirty_ratio = 80              # delay writeback (HDB flush is explicit)
vm.min_free_kbytes = 262144      # 256MB free reserve for hugepage allocation
vm.vfs_cache_pressure = 50       # keep dentries/inodes longer

# Scheduler
kernel.watchdog = 0              # disable soft lockup detector
kernel.nmi_watchdog = 0          # disable NMI watchdog
kernel.timer_migration = 0       # don't migrate timers across CPUs

# Security (keep enabled)
kernel.randomize_va_space = 2    # ASLR ON: disabling it causes L3 cache set aliasing

⚠️ randomize_va_space=0 was tested and caused a +4ms regression on Xbar due to L3 cache set aliasing from deterministic virtual addresses. Keep ASLR enabled.
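To confirm that the values listed above are live, a conf file can be compared against /proc/sys. The sketch below is an assumption-laden helper, not something shipped with ZeptoDB: `check_sysctl_conf` is a hypothetical name, it handles single-valued keys only, and the /proc/sys root is overridable so the logic can be tested without touching the system.

```shell
# Hypothetical helper: compare "key = value" lines in a sysctl conf file
# against the live values under /proc/sys.
# Usage: check_sysctl_conf <conf_file> [proc_sys_root]
check_sysctl_conf() {
    local conf=$1 root=${2:-/proc/sys} key value path actual rc=0
    while IFS='=' read -r key value; do
        # Trim whitespace around the key and the value.
        key=$(echo "$key" | tr -d '[:space:]')
        value=$(echo "$value" | sed 's/^[[:space:]]*//; s/[[:space:]]*$//')
        [ -n "$key" ] || continue
        # vm.swappiness maps to <root>/vm/swappiness.
        path="$root/$(echo "$key" | tr . /)"
        actual=$(cat "$path" 2>/dev/null) || actual="<unreadable>"
        if [ "$actual" != "$value" ]; then
            echo "MISMATCH $key: expected $value, got $actual"
            rc=1
        fi
    done <<EOF
$(sed 's/#.*//' "$conf")
EOF
    return $rc
}

# Example: check_sysctl_conf /etc/sysctl.d/99-zepto-tuning.conf
```

Comments are stripped before parsing, so the annotated listing above can be fed to the function as-is.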
Persistent Configuration
sudo cp /etc/sysctl.d/99-zepto-tuning.conf /etc/sysctl.d/99-zepto-tuning.conf.bak
sudo ./deploy/scripts/tune_bare_metal.sh  # writes 99-zepto-tuning.conf

5. Network Tuning
# Low-latency polling
sysctl -w net.core.busy_poll=50          # busy-poll 50μs before sleeping
sysctl -w net.core.busy_read=50          # busy-read 50μs
sysctl -w net.ipv4.tcp_low_latency=1     # prefer latency over throughput (legacy option; recent kernels ignore it)

# Buffer sizes (128MB)
sysctl -w net.core.rmem_max=134217728
sysctl -w net.core.wmem_max=134217728
sysctl -w net.ipv4.tcp_rmem="4096 87380 134217728"
sysctl -w net.ipv4.tcp_wmem="4096 65536 134217728"

# Backlog
sysctl -w net.core.netdev_max_backlog=10000

Disable THP
Transparent Huge Pages cause unpredictable latency spikes from background compaction:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

Or via boot parameter: transparent_hugepage=never
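The THP sysfs files list every mode and mark the active one with square brackets, for example `always madvise [never]`. A small parser makes the active mode easy to check in scripts. `thp_active_mode` is a hypothetical helper name.

```shell
# Extract the bracketed (active) mode from a THP sysfs value.
# Usage: thp_active_mode "<contents of .../transparent_hugepage/enabled>"
thp_active_mode() {
    # Print only the token between "[" and "]".
    echo "$1" | sed -n 's/.*\[\([a-z+_]*\)\].*/\1/p'
}

# Example:
#   thp_active_mode "$(cat /sys/kernel/mm/transparent_hugepage/enabled)"
```

The same function works for the defrag file, whose modes may include a `+` character.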
6. Build Optimization
CMake Options
| Option | Default | Purpose |
|---|---|---|
| ZEPTO_USE_TCMALLOC | OFF | tcmalloc_minimal: 12% faster for small allocations |
| ZEPTO_USE_LTO | OFF | Link-Time Optimization: 8% gain from cross-TU inlining |
Recommended Production Build
cmake .. -G Ninja \
    -DCMAKE_BUILD_TYPE=Release \
    -DZEPTO_USE_TCMALLOC=ON \
    -DZEPTO_USE_LTO=ON

ninja -j$(nproc)

PGO (Profile-Guided Optimization)
PGO provides an additional 2-5% by optimizing branch prediction and inlining based on real workload profiles.
# Step 1: Instrumented build
cmake .. -G Ninja -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_CXX_FLAGS="-O3 -march=native -fprofile-generate=/tmp/zepto_pgo"
ninja -j$(nproc)

# Step 2: Run representative workload
./zepto_tests --gtest_filter="Benchmark.*:SqlExecutor*:FinancialFunction*:WindowJoin*"

# Step 3: Optimized build using profile
cmake .. -G Ninja -DCMAKE_BUILD_TYPE=Release \
    -DZEPTO_USE_TCMALLOC=ON -DZEPTO_USE_LTO=ON \
    -DCMAKE_CXX_FLAGS="-O3 -march=native -flto -fprofile-use=/tmp/zepto_pgo -fprofile-correction"
ninja -j$(nproc)

Allocator Comparison
| Allocator | Xbar 1M | Improvement |
|---|---|---|
| glibc malloc | 53.2ms | baseline |
| jemalloc | 51.6ms | +3% |
| tcmalloc_minimal | 47.5ms | +12% |
tcmalloc’s per-thread caches handle ZeptoDB’s pattern of many short-lived small allocations far better than glibc.
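The Improvement column appears consistent with a speedup ratio (baseline time divided by new time, minus one) rather than a reduction in latency. That reading is an inference from the numbers, not something the table states. A sketch for reproducing the figures, with `speedup_pct` as an illustrative name:

```shell
# Speedup of a new timing relative to a baseline, as a rounded percentage.
# Usage: speedup_pct <baseline_ms> <new_ms>
speedup_pct() {
    awk -v base="$1" -v t="$2" 'BEGIN { printf "%.0f\n", (base / t - 1) * 100 }'
}

speedup_pct 53.2 51.6   # jemalloc → 3
speedup_pct 53.2 47.5   # tcmalloc_minimal → 12
```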
7. Benchmarking & Verification
Run Benchmarks
cd build
ninja zepto_tests
./zepto_tests --gtest_filter="Benchmark*"

Expected Results (Tuned)
| Benchmark | Untuned | Tuned (LTO+PGO+tcmalloc+hugepages) | Improvement |
|---|---|---|---|
| Xbar 1M rows | 45.2ms | 12.4ms | -73% |
| EMA 1M rows | 2.15ms | 2.23ms | flat |
| Window JOIN 100K² | ~12ms | 11.6ms | flat |
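When comparing local runs against this table, note that the Improvement column here reads as a relative change in latency (negative is faster). The sketch below computes that figure. `latency_change_pct` is an illustrative name, and the interpretation is inferred from the Xbar row.

```shell
# Relative change in latency from an untuned to a tuned run, in percent.
# Usage: latency_change_pct <untuned_ms> <tuned_ms>
latency_change_pct() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.0f\n", (after - before) / before * 100 }'
}

latency_change_pct 45.2 12.4   # Xbar 1M rows → -73
latency_change_pct 2.15 2.23   # EMA 1M rows → 4
```

A change of a few percent in either direction, as in the EMA row, is what the table labels flat.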
Profiling
# perf stat: check IPC (should be > 1.0 after tuning)
sudo perf stat -d ./zepto_tests --gtest_filter="Benchmark.Xbar*"

# Flamegraph (stackcollapse-perf.pl and flamegraph.pl come from the FlameGraph tools)
sudo perf record -F 999 -a -g -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > zepto_flame.svg

# NUMA memory stats
numastat -p $(pidof zepto_http_server)

# Hugepage usage
grep Huge /proc/meminfo

Verification Checklist
# CPU governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor    # → performance

# Hugepages allocated
cat /proc/sys/vm/nr_hugepages                                # → 16384

# NUMA balancing off
cat /proc/sys/kernel/numa_balancing                          # → 0

# THP disabled
cat /sys/kernel/mm/transparent_hugepage/enabled              # → [never]

# Isolated cores
cat /proc/cmdline | grep -o 'isolcpus=[^ ]*'                 # → isolcpus=0-3

# C-states
cat /sys/devices/system/cpu/cpu0/cpuidle/state2/disable      # → 1

8. Automation
tune_bare_metal.sh
The script at deploy/scripts/tune_bare_metal.sh applies all runtime tuning in one step:
sudo ./deploy/scripts/tune_bare_metal.sh

It handles CPU governor, turbo boost, hugepages, IRQ affinity, network stack, NUMA balancing, swappiness, THP, and C-states, and persists the settings to /etc/sysctl.d/99-zepto-tuning.conf.
AI Tuner
For iterative profiling and optimization, use the AI-driven tuner:
python3 deploy/scripts/ai_tune_bare_metal.py

This uses Claude Opus with extended thinking to profile the system, run benchmarks, and apply tuning commands iteratively. See devlog/015 for detailed results.
systemd Service
sudo ./deploy/scripts/install_service.sh
# Installs zeptodb.service with NUMA/CPU affinity pre-configured

Not Yet Implemented
| Feature | Status | Notes |
|---|---|---|
| io_uring for HDB I/O | ❌ Not implemented | No io_uring code in codebase. HDB flush uses synchronous write(). Would benefit WAL writes and HDB flush on NVMe. |
| 1GB hugepage support | ❌ Untested | Arena allocator uses 2MB pages only. Kernel boot param works but ZeptoDB doesn’t request 1GB pages via MAP_HUGE_1GB. |
| Clang ThinLTO | ⚠️ Blocked | Requires matching lld version. GCC -flto works. Clang 19 + lld-19 needed (sudo yum install lld19). |
| isolcpus in-code detection | ❌ Not implemented | Server doesn’t verify it’s running on isolated cores. Manual taskset required. |
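Until the server performs the isolated-core check itself, a wrapper script can approximate it. The sketch below is one possible approach, not ZeptoDB's: both function names are hypothetical, and it compares a process's `Cpus_allowed_list` from /proc/PID/status with the kernel's list in /sys/devices/system/cpu/isolated.

```shell
# Expand a kernel CPU list such as "0-3,8" into "0 1 2 3 8".
cpulist_expand() {
    local part lo hi out=""
    local IFS=,
    for part in $1; do
        lo=${part%-*}    # range start (or the single CPU)
        hi=${part#*-}    # range end (or the single CPU)
        while [ "$lo" -le "$hi" ]; do
            out="$out $lo"
            lo=$((lo + 1))
        done
    done
    echo "${out# }"
}

# Succeeds only if every CPU in <allowed_list> is also in <isolated_list>.
runs_on_isolated_cores() {
    local allowed isolated cpu
    allowed=$(cpulist_expand "$1")
    isolated=" $(cpulist_expand "$2") "
    [ -n "$allowed" ] || return 1
    for cpu in $allowed; do
        case "$isolated" in
            *" $cpu "*) ;;       # CPU is isolated
            *) return 1 ;;       # CPU is shared with the general scheduler
        esac
    done
    return 0
}

# Example:
#   pid=$(pidof zepto_http_server)
#   allowed=$(awk '/^Cpus_allowed_list/ {print $2}' "/proc/$pid/status")
#   runs_on_isolated_cores "$allowed" "$(cat /sys/devices/system/cpu/isolated)" \
#       || echo "warning: server is not confined to isolated cores"
```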