Skip to content

Aggressive SIMD/JIT Optimization: From 2x to 4.2x Speedup

ZeptoDB’s first SIMD pass (Highway v1) delivered a 2x speedup — respectable, but still 3-5x slower than kdb+ on filter operations. This post covers the second optimization round that pushed filter performance to 4.9x faster and reached kdb+ parity.


After v1, the gap was clear:

Operation (1M rows)v1 Resultkdb+ ReferenceGap
filter_gt_i641,341μs200–400μs3–5x slower
sum_i64265μs100–200μs~1.5x
vwap530μs~200μs~2.5x
JIT filter1,254μs~530μs2.4x

Filter was the worst offender. Root cause analysis pointed to three problems in the v1 output path.


The v1 filter used StoreMaskBits → extract indices via ctz bit scan:

StoreMaskBits → uint8_t mask_bytes
→ while(bits) { ctz → out_indices[out_count++] = i + k; bits &= bits-1; }

Three problems compound here:

  1. Write bandwidth: 1M rows at 50% selectivity → 500K uint32_t = 2MB of index writes, blowing past L2 cache
  2. Branch misprediction: The while(bits) loop has data-dependent iteration counts — the branch predictor can’t learn irregular bit patterns
  3. Serialization: out_count++ creates a store-to-load forwarding dependency chain

The BitMask approach eliminates all three:

StoreMaskBits → uint8_t mask_bytes
→ out_bits[word_idx] |= (bits << bit_off) // that's it

Output shrinks from 2MB to 128KB (1M / 8 bytes), fitting entirely in L2. No branches — just bitwise OR. Population count uses hardware POPCNT.

Results:

Datasetv1BitMaskSpeedup
1M rows1,169μs272μs4.3x
100K rows (cache-hot)98μs7μs14x

The 100K result is dramatic because the entire working set fits in L2 — pure compute, no memory stalls.


For sum_i64, we tried 4-way scalar unrolling with prefetch:

// 4 independent accumulator chains — breaks the dependency
s0 += data[i+0]; s1 += data[i+1];
s2 += data[i+2]; s3 += data[i+3];
__builtin_prefetch(data + i + 64, 0, 1); // prefetch to L2

At 100K rows (800KB, cache-resident): 4.2x speedup. At 1M rows (8MB): no improvement. The roofline analysis explains why:

Data: 1M × 8B = 8MB (near L3 boundary)
DRAM bandwidth: ~25-50 GB/s
Compute intensity: 1 ADD / 8 bytes = 0.125 FLOP/byte
Roofline ceiling: 0.125 × 25 GB/s ≈ 3.1 GFLOP/s
Actual ADD throughput: ~10+ GFLOP/s
→ Fully memory-bound — no compute optimization helps

We also tried 8-way SIMD unrolling to fill more of the reorder buffer (Skylake: 224 entries). Same result at 1M+ — memory bandwidth is the ceiling.

Takeaway: Know your roofline. SIMD wins in the cache-resident regime; beyond L3, you need algorithmic changes (streaming, fusion) to improve.


VWAP reads two arrays simultaneously (prices + volumes = 16MB total). We tried 4-way FMA unrolling with dual prefetch:

pv0 = MulAdd(p0, v0, pv0); pv1 = MulAdd(p1, v1, pv1);
pv2 = MulAdd(p2, v2, pv2); pv3 = MulAdd(p3, v3, pv3);

Result at 1M: 532μs → 531μs. The memory controller is already saturated by two concurrent read streams. FMA throughput (0.5 CPI on Skylake) is irrelevant when you’re waiting on DRAM.


JIT: O3 Passes and the Bulk Loop Experiment

Section titled “JIT: O3 Passes and the Bulk Loop Experiment”

The v1 JIT only set OptimizeForSize — an attribute hint, not an optimization pass. Adding the actual O3 pipeline:

impl_->jit->getIRTransformLayer().setTransform(
[](ThreadSafeModule tsm, MaterializationResponsibility& r) {
pb.buildPerModuleDefaultPipeline(OptimizationLevel::O3).run(mod, mam);
return tsm;
});

This gave a 10% improvement (1,254μs → 1,129μs). Still 2.1x slower than native C++ — the per-row function call overhead dominates.

To eliminate call overhead, we generated a loop directly in LLVM IR:

void bulk_fn(prices*, volumes*, n, out_indices*, out_count*)
loop: load → cmp → store → inc

Result: 4.7x slower than per-row. Root cause: missing noalias on pointer parameters prevented LLVM from vectorizing the loop. Without noalias, the compiler assumes prices and volumes might alias out_indices — blocking all loop transformations.

The fix (for a future iteration): add noalias attributes and use PHI nodes instead of alloca for loop counters.


Filter: 4.9x faster

BitMask output: 128KB vs 2MB, no branches, fits in L2. Reached kdb+ parity.

Memory wall identified

Sum/VWAP at 1M+ rows are DRAM-bandwidth-bound. SIMD only helps in cache-resident regime.

JIT O3 passes

10% gain from real optimization passes. Bulk loop blocked by missing noalias — fix identified.

OptimizationBeforeAfterSpeedup
filter_gt_i64 (1M)1,341μs272μs4.9x
sum_i64 (100K, cache-hot)25μs6μs4.2x
JIT filter (1M)1,254μs1,129μs1.1x

The biggest lesson: measure before optimizing. The roofline model predicted exactly which operations would benefit from SIMD and which wouldn’t. BitMask filtering was the clear winner because it reduced both compute and memory traffic simultaneously.


Related: Lock-Free Ingestion → · Bare-Metal Tuning → · JIT SIMD Emit →