
Bare-Metal Performance Tuning for Microsecond Latency

ZeptoDB reduced xbar query latency from 45.2ms to 12.4ms — a 73% improvement — through systematic bare-metal tuning. This post is the practical guide: what we tuned, what worked, what didn’t, and why.

Hardware: Intel Xeon 6975P-C, 8 cores, 31GB RAM, single NUMA node, Amazon Linux 2023.


Before any tuning, perf stat revealed the root cause:

cycles: 4,857,773,539
instructions: 1,506,218,500
IPC: 0.31 ← CPU spends 70% of cycles stalling

Normal compute-bound code runs at IPC 2-4+. An IPC of 0.31 means the CPU is waiting on memory almost all the time. Two culprits: TLB pressure from missing huge pages, and per-row heap allocations in the GROUP BY path.
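
The IPC figure is just the ratio of the two counters above. As a minimal sketch (the function name is ours, not part of ZeptoDB), the same check can be scripted against any pair of perf stat counters:

```cpp
#include <cassert>
#include <cstdint>

// Instructions retired per CPU cycle, computed from the two perf stat
// counters. The function name is illustrative, not a ZeptoDB API.
double instructions_per_cycle(std::uint64_t instructions, std::uint64_t cycles) {
    assert(cycles > 0 && "cycle count must be non-zero");
    return static_cast<double>(instructions) / static_cast<double>(cycles);
}
```

Calling `instructions_per_cycle(1506218500ULL, 4857773539ULL)` with the counters shown above gives roughly 0.31.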


ZeptoDB allocates 278 partitions with 32MB arenas each (~8.9GB total). Without pre-allocated huge pages, every mmap call with MAP_HUGETLB failed, and the allocation path silently fell back to 4KB pages:

4KB pages: 8.9GB / 4KB = ~2.3M TLB entries needed
2MB pages: 8.9GB / 2MB = ~4,448 TLB entries needed
─────────────────────────
512x fewer TLB entries
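
The fallback itself is harmless; the problem is that it was silent. Here is a minimal sketch of a mapping helper that tries huge pages first and logs when it has to fall back. The `ArenaMapping` struct and `map_arena` function are hypothetical names for illustration, not ZeptoDB's actual arena code, and the sketch assumes Linux:

```cpp
#include <sys/mman.h>

#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdio>
#include <cstring>

// Result of one arena mapping attempt (hypothetical type for this sketch).
struct ArenaMapping {
    void* addr;         // nullptr if both attempts failed
    std::size_t bytes;  // mapped length, 0 on failure
    bool huge;          // true if the mapping used MAP_HUGETLB
};

// Try huge pages first. If that fails, log the reason and fall back to
// regular pages, so the fallback is visible in the logs.
ArenaMapping map_arena(std::size_t bytes) {
    assert(bytes > 0);
    const int prot = PROT_READ | PROT_WRITE;
    const int base_flags = MAP_PRIVATE | MAP_ANONYMOUS;

    void* p = mmap(nullptr, bytes, prot, base_flags | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED) return {p, bytes, true};

    std::fprintf(stderr, "MAP_HUGETLB failed (%s); falling back to regular pages\n",
                 std::strerror(errno));
    p = mmap(nullptr, bytes, prot, base_flags, -1, 0);
    if (p == MAP_FAILED) return {nullptr, 0, false};
    return {p, bytes, false};
}
```

A caller can check the `huge` flag after startup and refuse to run benchmarks if any arena landed on regular pages.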

The fix is one command:

echo 4608 > /proc/sys/vm/nr_hugepages # pre-allocate 9GB of 2MB pages

Huge pages must be reserved before the process starts, ideally right after boot. The kernel needs contiguous physical memory for each 2MB page, and free memory fragments as the system runs.
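
Because the kernel can reserve fewer pages than requested when memory is fragmented, it is worth checking the pool before launching. The sketch below parses `/proc/meminfo`-style text for the `HugePages_Free` field; the helper names are our own, and taking a stream rather than a file path is a choice made so the parsing can be exercised without touching `/proc`:

```cpp
#include <cassert>
#include <istream>
#include <optional>
#include <sstream>
#include <string>

// Return the integer value of a "Name:   value" line from /proc/meminfo-style
// text, or std::nullopt if the field is missing or not numeric.
std::optional<long> meminfo_field(std::istream& in, const std::string& name) {
    const std::string prefix = name + ":";
    std::string line;
    while (std::getline(in, line)) {
        if (line.compare(0, prefix.size(), prefix) == 0) {
            std::istringstream rest(line.substr(prefix.size()));
            long value = 0;
            if (rest >> value) return value;
            return std::nullopt;
        }
    }
    return std::nullopt;
}

// True if at least `needed` huge pages are currently free.
bool enough_free_huge_pages(std::istream& meminfo, long needed) {
    assert(needed >= 0);
    const std::optional<long> free_pages = meminfo_field(meminfo, "HugePages_Free");
    return free_pages.has_value() && *free_pages >= needed;
}
```

In practice the stream would be a `std::ifstream` opened on `/proc/meminfo`, and `needed` would be 4448 for the arena layout described above.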


Parameters that helped:

# Reduce VM overhead
sysctl -w vm.swappiness=1
sysctl -w vm.stat_interval=120 # reduce vmstat timer interrupts
sysctl -w vm.dirty_ratio=80 # delay writeback
sysctl -w vm.min_free_kbytes=262144 # keep 256MB free to avoid reclaim stalls
# Disable noise sources
sysctl -w kernel.watchdog=0
sysctl -w kernel.nmi_watchdog=0
sysctl -w kernel.timer_migration=0
# NUMA (for multi-socket systems)
sysctl -w vm.numa_balancing=0 # disable automatic page migration
sysctl -w vm.zone_reclaim_mode=0 # allocate from other zones/nodes instead of reclaiming the local one
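
Values set with `sysctl -w` do not survive a reboot, so a benchmark run can quietly start on default settings. A small sketch of a startup check follows; it relies on the fact that each dotted sysctl name corresponds to a file under `/proc/sys`. The helper names are hypothetical:

```cpp
#include <algorithm>
#include <cassert>
#include <fstream>
#include <optional>
#include <string>

// Map a dotted sysctl name to its /proc/sys path,
// e.g. "vm.swappiness" -> "/proc/sys/vm/swappiness".
std::string sysctl_path(std::string name) {
    assert(!name.empty());
    std::replace(name.begin(), name.end(), '.', '/');
    return "/proc/sys/" + name;
}

// Read the current value of a sysctl as text (first line of the file),
// or std::nullopt if the key does not exist or is not readable.
std::optional<std::string> read_sysctl(const std::string& name) {
    std::ifstream file(sysctl_path(name));
    std::string value;
    if (file && std::getline(file, value)) return value;
    return std::nullopt;
}
```

For example, comparing `read_sysctl("vm.swappiness")` against `"1"` before a run confirms the setting above is still in effect.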

Parameters that did not help:

Disabling ASLR (kernel.randomize_va_space=0): made xbar 4ms slower. Deterministic virtual addresses cause L3 cache set aliasing: multiple hot pages map to the same cache sets, causing evictions. Keep ASLR enabled.

CPU governor tuning: not accessible on AWS EC2, where the hypervisor controls CPU frequency. On bare-metal hardware, set the performance governor.

C-state disabling: marginal improvement. We disabled C-states 2-9 to avoid wakeup latency, but the effect was within noise on this workload.


The xbar GROUP BY path allocates std::vector<int64_t> per row as a group key — 1M heap allocations for 1M rows. Different allocators handle this pattern very differently:

| Allocator | Xbar 1M | vs. glibc |
|---|---|---|
| glibc malloc | 53.2ms | baseline |
| jemalloc | 51.6ms | +3% |
| tcmalloc_minimal | 47.5ms | +12% |

tcmalloc’s per-thread caches handle the pattern of many small, short-lived allocations far better than glibc’s arena approach. Enable it with a CMake flag:

cmake .. -DAPEX_USE_TCMALLOC=ON
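
To see why the allocator matters for this path, the per-row pattern can be reproduced in isolation. The sketch below uses a counting allocator to show that building one `std::vector` key per row costs one heap allocation per row. All names are illustrative and none of this is ZeptoDB code; it needs C++17 for the inline variable:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Tally of allocations made through CountingAllocator (illustration only).
inline std::size_t g_allocs = 0;

// Minimal allocator that counts allocate() calls and otherwise defers
// to std::allocator.
template <class T>
struct CountingAllocator {
    using value_type = T;
    CountingAllocator() = default;
    template <class U>
    CountingAllocator(const CountingAllocator<U>&) noexcept {}
    T* allocate(std::size_t n) {
        ++g_allocs;
        return std::allocator<T>{}.allocate(n);
    }
    void deallocate(T* p, std::size_t n) noexcept {
        std::allocator<T>{}.deallocate(p, n);
    }
};
template <class T, class U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) { return false; }

using Key = std::vector<std::int64_t, CountingAllocator<std::int64_t>>;

// Build one single-element vector key per row and return how many heap
// allocations that took.
std::size_t allocations_for_rows(std::size_t rows) {
    g_allocs = 0;
    for (std::size_t i = 0; i < rows; ++i) {
        Key key{static_cast<std::int64_t>(i)};  // one allocation per row
        assert(key.size() == 1);
    }
    return g_allocs;
}
```

Every one of those allocations is small and freed almost immediately, which is the pattern the allocator comparison above is measuring.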

We tested the build options systematically, each layer added on top of the previous:

| Build Configuration | Xbar 1M | Cumulative |
|---|---|---|
| O3 + march=native | 53ms | baseline |
| + tcmalloc | 48ms | -9% |
| + PGO | 47.3ms | -11% |
| + LTO | 43.7ms | -18% |

PGO (Profile-Guided Optimization) collects runtime profiles to guide inlining and branch layout:

# Step 1: Instrument
cmake .. -DCMAKE_CXX_FLAGS="-O3 -march=native -fprofile-generate=/tmp/pgo"
ninja && ./zepto_tests --gtest_filter="Benchmark.*:SqlExecutor*"
# Step 2: Optimize with profile
cmake .. -DCMAKE_CXX_FLAGS="-O3 -march=native -flto -fprofile-use=/tmp/pgo"
ninja # rebuild using the collected profile

LTO (Link-Time Optimization) enables cross-translation-unit inlining. The hot path spans executor.cpp → partition_manager → arena — LTO lets the compiler inline across these boundaries.


After all OS and compiler tuning, profiling revealed the true bottleneck: make_group_key() allocating a std::vector<int64_t> per row.

// Before: heap allocation per row (1M mallocs for 1M rows)
auto make_group_key = [&](const Partition& part, uint32_t idx)
-> std::vector<int64_t> { ... }
// After: flat int64_t key for single-column GROUP BY
std::unordered_map<int64_t, std::vector<GroupState>> groups;
// Zero per-row heap allocations

This single change took xbar from 53ms to 13.3ms (-75%). Combined with LTO, PGO, and tcmalloc, it reaches 12.4ms (-73% from the original 45.2ms).
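
The difference between the two key shapes can be sketched in a self-contained form. This is not ZeptoDB's executor: `GroupState` here is a hypothetical count/sum pair, `xbar` is assumed to floor a non-negative timestamp to the start of its bar, and the "before" version uses `std::map` only because `std::vector` has no standard hash.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <unordered_map>
#include <vector>

struct GroupState {  // hypothetical aggregate state for this sketch
    std::int64_t count = 0;
    std::int64_t sum = 0;
};

// Floor a timestamp to the start of its bar (assumes ts >= 0 and bar > 0).
std::int64_t xbar(std::int64_t ts, std::int64_t bar) {
    assert(bar > 0);
    return ts - ts % bar;
}

// Before: a std::vector key is built per row, so every row allocates.
std::map<std::vector<std::int64_t>, GroupState> group_by_vector_key(
    const std::vector<std::int64_t>& ts, const std::vector<std::int64_t>& val,
    std::int64_t bar) {
    assert(ts.size() == val.size());
    std::map<std::vector<std::int64_t>, GroupState> groups;
    for (std::size_t i = 0; i < ts.size(); ++i) {
        std::vector<std::int64_t> key{xbar(ts[i], bar)};  // heap allocation
        GroupState& g = groups[key];
        ++g.count;
        g.sum += val[i];
    }
    return groups;
}

// After: a flat int64_t key, so the only allocations are the map's own
// nodes, one per distinct group rather than one per row.
std::unordered_map<std::int64_t, GroupState> group_by_flat_key(
    const std::vector<std::int64_t>& ts, const std::vector<std::int64_t>& val,
    std::int64_t bar) {
    assert(ts.size() == val.size());
    std::unordered_map<std::int64_t, GroupState> groups;
    for (std::size_t i = 0; i < ts.size(); ++i) {
        GroupState& g = groups[xbar(ts[i], bar)];
        ++g.count;
        g.sum += val[i];
    }
    return groups;
}
```

Both functions produce the same aggregates; only the cost of forming the key differs. The flat form applies only to single-column GROUP BY, as noted in the snippet above.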


| Benchmark | Original | Final | Improvement |
|---|---|---|---|
| Xbar 1M rows | 45.2ms | 12.4ms | -73% |
| EMA 1M rows | 2.15ms | 2.23ms | ~flat |
| Window JOIN 100K² | ~12ms | 11.58ms | ~flat |

EMA and Window JOIN were already efficient — their hot paths don’t hit the GROUP BY allocation pattern or TLB pressure.
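
The percentages in this post are latency reductions measured against different baselines (53ms for the build-configuration and GROUP BY steps, 45.2ms for the overall result), so it helps to recompute them explicitly. A minimal sketch, with a function name of our own choosing:

```cpp
#include <cassert>

// Percentage reduction in latency going from before_ms to after_ms.
double latency_reduction_pct(double before_ms, double after_ms) {
    assert(before_ms > 0.0);
    return (before_ms - after_ms) / before_ms * 100.0;
}
```

With the figures above, `latency_reduction_pct(53.0, 43.7)` is about 17.5, `latency_reduction_pct(53.0, 13.3)` is about 74.9, and `latency_reduction_pct(45.2, 12.4)` is about 72.6, matching the rounded 18%, 75%, and 73%.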


The build command with the allocator and compiler options combined:

cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CXX_FLAGS="-O3 -march=native -flto \
    -fprofile-use=/path/to/pgo -fprofile-correction" \
  -DAPEX_USE_TCMALLOC=ON \
  -DAPEX_USE_LTO=ON

Huge pages

512x fewer TLB entries. Pre-allocate before process start. Single biggest OS-level win.

tcmalloc

12% faster for allocation-heavy workloads. Per-thread caches eliminate contention.

LTO + PGO

18% cumulative together with tcmalloc; LTO + PGO alone account for 48ms → 43.7ms (about 9%). Cross-TU inlining + profile-guided branch layout.

Application-level

75% from eliminating per-row vector allocation. Always profile the application first.

Lesson: OS and compiler tuning gave 18%. Fixing the application-level allocation pattern gave 75%. Always profile the application before tuning the kernel.

