Filter: 4.9x faster
BitMask output: 128KB vs 2MB, no branches, fits in L2. Reached kdb+ parity.
ZeptoDB’s first SIMD pass (Highway v1) delivered a 2x speedup — respectable, but still 3-5x slower than kdb+ on filter operations. This post covers the second optimization round that pushed filter performance to 4.9x faster and reached kdb+ parity.
After v1, the gap was clear:
| Operation (1M rows) | v1 Result | kdb+ Reference | Gap |
|---|---|---|---|
| filter_gt_i64 | 1,341μs | 200–400μs | 3–5x slower |
| sum_i64 | 265μs | 100–200μs | ~1.5x |
| vwap | 530μs | ~200μs | ~2.5x |
| JIT filter | 1,254μs | ~530μs | 2.4x |
Filter was the worst offender. Root cause analysis pointed to three problems in the v1 output path.
The v1 filter used StoreMaskBits → extract indices via ctz bit scan:
StoreMaskBits → uint8_t mask_bytes→ while(bits) { ctz → out_indices[out_count++] = i + k; bits &= bits-1; }Three problems compound here:
uint32_t = 2MB of index writes, blowing past L2 cachewhile(bits) loop has data-dependent iteration counts — the branch predictor can’t learn irregular bit patternsout_count++ creates a store-to-load forwarding dependency chainThe BitMask approach eliminates all three:
StoreMaskBits → uint8_t mask_bytes→ out_bits[word_idx] |= (bits << bit_off) // that's itOutput shrinks from 2MB to 128KB (1M / 8 bytes), fitting entirely in L2. No branches — just bitwise OR. Population count uses hardware POPCNT.
Results:
| Dataset | v1 | BitMask | Speedup |
|---|---|---|---|
| 1M rows | 1,169μs | 272μs | 4.3x |
| 100K rows (cache-hot) | 98μs | 7μs | 14x |
The 100K result is dramatic because the entire working set fits in L2 — pure compute, no memory stalls.
For sum_i64, we tried 4-way scalar unrolling with prefetch:
// 4 independent accumulator chains — breaks the dependencys0 += data[i+0]; s1 += data[i+1];s2 += data[i+2]; s3 += data[i+3];__builtin_prefetch(data + i + 64, 0, 1); // prefetch to L2At 100K rows (800KB, cache-resident): 4.2x speedup. At 1M rows (8MB): no improvement. The roofline analysis explains why:
Data: 1M × 8B = 8MB (near L3 boundary)DRAM bandwidth: ~25-50 GB/sCompute intensity: 1 ADD / 8 bytes = 0.125 FLOP/byteRoofline ceiling: 0.125 × 25 GB/s ≈ 3.1 GFLOP/sActual ADD throughput: ~10+ GFLOP/s→ Fully memory-bound — no compute optimization helpsWe also tried 8-way SIMD unrolling to fill more of the reorder buffer (Skylake: 224 entries). Same result at 1M+ — memory bandwidth is the ceiling.
Takeaway: Know your roofline. SIMD wins in the cache-resident regime; beyond L3, you need algorithmic changes (streaming, fusion) to improve.
VWAP reads two arrays simultaneously (prices + volumes = 16MB total). We tried 4-way FMA unrolling with dual prefetch:
pv0 = MulAdd(p0, v0, pv0); pv1 = MulAdd(p1, v1, pv1);pv2 = MulAdd(p2, v2, pv2); pv3 = MulAdd(p3, v3, pv3);Result at 1M: 532μs → 531μs. The memory controller is already saturated by two concurrent read streams. FMA throughput (0.5 CPI on Skylake) is irrelevant when you’re waiting on DRAM.
The v1 JIT only set OptimizeForSize — an attribute hint, not an optimization pass. Adding the actual O3 pipeline:
impl_->jit->getIRTransformLayer().setTransform( [](ThreadSafeModule tsm, MaterializationResponsibility& r) { pb.buildPerModuleDefaultPipeline(OptimizationLevel::O3).run(mod, mam); return tsm; });This gave a 10% improvement (1,254μs → 1,129μs). Still 2.1x slower than native C++ — the per-row function call overhead dominates.
To eliminate call overhead, we generated a loop directly in LLVM IR:
void bulk_fn(prices*, volumes*, n, out_indices*, out_count*) loop: load → cmp → store → incResult: 4.7x slower than per-row. Root cause: missing noalias on pointer parameters prevented LLVM from vectorizing the loop. Without noalias, the compiler assumes prices and volumes might alias out_indices — blocking all loop transformations.
The fix (for a future iteration): add noalias attributes and use PHI nodes instead of alloca for loop counters.
Filter: 4.9x faster
BitMask output: 128KB vs 2MB, no branches, fits in L2. Reached kdb+ parity.
Memory wall identified
Sum/VWAP at 1M+ rows are DRAM-bandwidth-bound. SIMD only helps in cache-resident regime.
JIT O3 passes
10% gain from real optimization passes. Bulk loop blocked by missing noalias — fix identified.
| Optimization | Before | After | Speedup |
|---|---|---|---|
| filter_gt_i64 (1M) | 1,341μs | 272μs | 4.9x |
| sum_i64 (100K, cache-hot) | 25μs | 6μs | 4.2x |
| JIT filter (1M) | 1,254μs | 1,129μs | 1.1x |
The biggest lesson: measure before optimizing. The roofline model predicted exactly which operations would benefit from SIMD and which wouldn’t. BitMask filtering was the clear winner because it reduced both compute and memory traffic simultaneously.
Related: Lock-Free Ingestion → · Bare-Metal Tuning → · JIT SIMD Emit →