Single-cycle hash
CRC32 hardware instruction. One cycle per key, excellent distribution for integers.
Every JOIN in ZeptoDB — ASOF, Hash, Window — builds a hash map on the critical path. std::unordered_map was the bottleneck. This post explains why, and how a custom FlatHashMap with hardware-accelerated hashing fixed it.
std::unordered_map is node-based. Every bucket is a heap-allocated linked list node:
Bucket 0 → [Node*] → [Node*] → nullptrBucket 1 → nullptrBucket 2 → [Node*] → nullptr...Three problems for JOIN workloads:
std::hash<int64_t> is typically an identity function or a multi-cycle computation. Neither is ideal.For a hash JOIN that builds on 100K rows and probes with 1M rows, these costs add up to millions of cache misses.
The replacement is a flat, open-addressing hash map specialized for int64_t keys:
┌─────────┬─────────┬─────────┬─────────┬─────────┐│ key|val │ key|val │ empty │ key|val │ empty │└─────────┴─────────┴─────────┴─────────┴─────────┘ slot 0 slot 1 slot 2 slot 3 slot 4All entries live in a single contiguous array. No pointers, no linked lists, no per-entry heap allocations.
static uint64_t hash_key(int64_t key) noexcept { return _mm_crc32_u64(0, static_cast<uint64_t>(key)); // SSE4.2}_mm_crc32_u64 is a single-cycle hardware instruction on x86 (SSE4.2). It produces excellent distribution for integer keys. On non-x86 platforms, a splitmix64 fallback provides equivalent quality.
size_t idx = hash_key(key) & mask_; // bitwise AND, not modulowhile (occupied_[idx]) { if (keys_[idx] == key) return &values_[idx]; // found idx = (idx + 1) & mask_; // linear probe}Linear probing is cache-friendly — sequential slots are adjacent in memory, so the CPU prefetcher helps. Power-of-2 capacity means the modulo is a single bitwise AND.
capacity_ = next_power_of_2(entries * 4 / 3);At 75% load, the expected probe length is less than 2. This means most lookups check 1-2 slots — and those slots are likely in the same cache line.
JOIN maps are build-once, probe-many. The build phase inserts all entries; the probe phase only reads. No tombstones, no deletion logic, no complexity.
Minimal, purpose-built for JOIN operators:
template<typename V>class FlatHashMap { V& operator[](int64_t key); // insert or access V* find(int64_t key) noexcept; // nullptr on miss void for_each(auto&& fn) const; // iterate all entries size_t size() const noexcept;};Specialized for int64_t keys — the only key type used in JOIN operators (symbol IDs, timestamps).
All three JOIN operators were updated:
| Operator | Usage | Pattern |
|---|---|---|
AsofJoinOperator | Symbol grouping | Build: group right rows by symbol. Probe: find matching group per left row. |
HashJoinOperator | INNER/LEFT/RIGHT/FULL | Build: index one side by join key. Probe: match each row from the other side. |
WindowJoinOperator | Symbol + time range | Build: group by symbol. Probe: find group, then scan time window within group. |
The build/probe pattern is identical across all three — only the match logic differs.
Time-series JOINs have a specific access pattern:
FlatHashMap is optimized for exactly this pattern:
Single-cycle hash
CRC32 hardware instruction. One cycle per key, excellent distribution for integers.
Cache-friendly probing
Linear probing on contiguous array. CPU prefetcher helps. Expected probe length < 2.
Zero allocation overhead
One contiguous allocation for the entire map. No per-entry heap nodes.
Build-once, probe-many
No erase support needed. Simpler code, no tombstone overhead.
The replacement was validated against all existing JOIN tests:
The API surface is small enough that the entire implementation fits in a single header file (~120 lines).
Related: How ASOF JOIN Works → · SIMD Window JOIN → · Cost-Based Query Planner →