Aggressive SIMD/JIT Optimization: From 2x to 4.2x Speedup

ZeptoDB’s first SIMD pass (Highway v1) delivered a 2x speedup — respectable, but still 3-5x slower than kdb+ on filter operations. This post covers the second optimization round that pushed filter performance to 4.9x faster and reached kdb+ parity.

The Starting Point

After v1, the gap was clear:

Operation (1M rows)	v1 Result	kdb+ Reference	Gap
filter_gt_i64	1,341μs	200–400μs	3–5x slower
sum_i64	265μs	100–200μs	~1.5x
vwap	530μs	~200μs	~2.5x
JIT filter	1,254μs	~530μs	2.4x

Filter was the worst offender. Root cause analysis pointed to three problems in the v1 output path.

BitMask Filter: The 4.9x Win

The v1 filter used StoreMaskBits → extract indices via ctz bit scan:

StoreMaskBits → uint8_t mask_bytes
→ while(bits) { ctz → out_indices[out_count++] = i + k; bits &= bits-1; }

Three problems compound here:

Write bandwidth: 1M rows at 50% selectivity → 500K uint32_t = 2MB of index writes, blowing past L2 cache
Branch misprediction: The while(bits) loop has data-dependent iteration counts — the branch predictor can’t learn irregular bit patterns
Serialization: out_count++ creates a store-to-load forwarding dependency chain

The BitMask approach eliminates all three:

StoreMaskBits → uint8_t mask_bytes
→ out_bits[word_idx] |= (bits << bit_off)   // that's it

Output shrinks from 2MB to 128KB (1M / 8 bytes), fitting entirely in L2. No branches — just bitwise OR. Population count uses hardware POPCNT.

Results:

Dataset	v1	BitMask	Speedup
1M rows	1,169μs	272μs	4.3x
100K rows (cache-hot)	98μs	7μs	14x

The 100K result is dramatic because the entire working set fits in L2 — pure compute, no memory stalls.

Sum: Hitting the Memory Wall

For sum_i64, we tried 4-way scalar unrolling with prefetch:

// 4 independent accumulator chains — breaks the dependency
s0 += data[i+0];  s1 += data[i+1];
s2 += data[i+2];  s3 += data[i+3];
__builtin_prefetch(data + i + 64, 0, 1);  // prefetch to L2

At 100K rows (800KB, cache-resident): 4.2x speedup. At 1M rows (8MB): no improvement. The roofline analysis explains why:

Data: 1M × 8B = 8MB (near L3 boundary)
DRAM bandwidth: ~25-50 GB/s
Compute intensity: 1 ADD / 8 bytes = 0.125 FLOP/byte
Roofline ceiling: 0.125 × 25 GB/s ≈ 3.1 GFLOP/s
Actual ADD throughput: ~10+ GFLOP/s
→ Fully memory-bound — no compute optimization helps

We also tried 8-way SIMD unrolling to fill more of the reorder buffer (Skylake: 224 entries). Same result at 1M+ — memory bandwidth is the ceiling.

Takeaway: Know your roofline. SIMD wins in the cache-resident regime; beyond L3, you need algorithmic changes (streaming, fusion) to improve.

VWAP: Dual-Array Prefetch

VWAP reads two arrays simultaneously (prices + volumes = 16MB total). We tried 4-way FMA unrolling with dual prefetch:

pv0 = MulAdd(p0, v0, pv0);  pv1 = MulAdd(p1, v1, pv1);
pv2 = MulAdd(p2, v2, pv2);  pv3 = MulAdd(p3, v3, pv3);

Result at 1M: 532μs → 531μs. The memory controller is already saturated by two concurrent read streams. FMA throughput (0.5 CPI on Skylake) is irrelevant when you’re waiting on DRAM.

JIT: O3 Passes and the Bulk Loop Experiment

O3 Transform Layer

The v1 JIT only set OptimizeForSize — an attribute hint, not an optimization pass. Adding the actual O3 pipeline:

impl_->jit->getIRTransformLayer().setTransform(
    [](ThreadSafeModule tsm, MaterializationResponsibility& r) {
        pb.buildPerModuleDefaultPipeline(OptimizationLevel::O3).run(mod, mam);
        return tsm;
    });

This gave a 10% improvement (1,254μs → 1,129μs). Still 2.1x slower than native C++ — the per-row function call overhead dominates.

Bulk Loop IR: A Failed Experiment

To eliminate call overhead, we generated a loop directly in LLVM IR:

void bulk_fn(prices*, volumes*, n, out_indices*, out_count*)
  loop: load → cmp → store → inc

Result: 4.7x slower than per-row. Root cause: missing noalias on pointer parameters prevented LLVM from vectorizing the loop. Without noalias, the compiler assumes prices and volumes might alias out_indices — blocking all loop transformations.

The fix (for a future iteration): add noalias attributes and use PHI nodes instead of alloca for loop counters.

Summary

Filter: 4.9x faster

BitMask output: 128KB vs 2MB, no branches, fits in L2. Reached kdb+ parity.

Memory wall identified

Sum/VWAP at 1M+ rows are DRAM-bandwidth-bound. SIMD only helps in cache-resident regime.

JIT O3 passes

10% gain from real optimization passes. Bulk loop blocked by missing noalias — fix identified.

Optimization	Before	After	Speedup
filter_gt_i64 (1M)	1,341μs	272μs	4.9x
sum_i64 (100K, cache-hot)	25μs	6μs	4.2x
JIT filter (1M)	1,254μs	1,129μs	1.1x

The biggest lesson: measure before optimizing. The roofline model predicted exactly which operations would benefit from SIMD and which wouldn’t. BitMask filtering was the clear winner because it reduced both compute and memory traffic simultaneously.