Skip to content

FlatHashMap: Why We Replaced std::unordered_map for JOINs

Every JOIN in ZeptoDB — ASOF, Hash, Window — builds a hash map on the critical path. std::unordered_map was the bottleneck. This post explains why, and how a custom FlatHashMap with hardware-accelerated hashing fixed it.


std::unordered_map is node-based. Every bucket is a heap-allocated linked list node:

Bucket 0 → [Node*] → [Node*] → nullptr
Bucket 1 → nullptr
Bucket 2 → [Node*] → nullptr
...

Three problems for JOIN workloads:

  1. Pointer chasing: Every lookup follows a pointer to a heap-allocated node. Each pointer dereference is a potential cache miss (~100 cycles to DRAM).
  2. Poor locality: Nodes are scattered across the heap. Building a map of 100K entries creates 100K separate allocations at random addresses.
  3. Hash overhead: std::hash<int64_t> is typically an identity function or a multi-cycle computation. Neither is ideal.

For a hash JOIN that builds on 100K rows and probes with 1M rows, these costs add up to millions of cache misses.


The replacement is a flat, open-addressing hash map specialized for int64_t keys:

┌─────────┬─────────┬─────────┬─────────┬─────────┐
│ key|val │ key|val │ empty │ key|val │ empty │
└─────────┴─────────┴─────────┴─────────┴─────────┘
slot 0 slot 1 slot 2 slot 3 slot 4

All entries live in a single contiguous array. No pointers, no linked lists, no per-entry heap allocations.

static uint64_t hash_key(int64_t key) noexcept {
return _mm_crc32_u64(0, static_cast<uint64_t>(key)); // SSE4.2
}

_mm_crc32_u64 is a single-cycle hardware instruction on x86 (SSE4.2). It produces excellent distribution for integer keys. On non-x86 platforms, a splitmix64 fallback provides equivalent quality.

size_t idx = hash_key(key) & mask_; // bitwise AND, not modulo
while (occupied_[idx]) {
if (keys_[idx] == key) return &values_[idx]; // found
idx = (idx + 1) & mask_; // linear probe
}

Linear probing is cache-friendly — sequential slots are adjacent in memory, so the CPU prefetcher helps. Power-of-2 capacity means the modulo is a single bitwise AND.

capacity_ = next_power_of_2(entries * 4 / 3);

At 75% load, the expected probe length is less than 2. This means most lookups check 1-2 slots — and those slots are likely in the same cache line.

JOIN maps are build-once, probe-many. The build phase inserts all entries; the probe phase only reads. No tombstones, no deletion logic, no complexity.


Minimal, purpose-built for JOIN operators:

template<typename V>
class FlatHashMap {
V& operator[](int64_t key); // insert or access
V* find(int64_t key) noexcept; // nullptr on miss
void for_each(auto&& fn) const; // iterate all entries
size_t size() const noexcept;
};

Specialized for int64_t keys — the only key type used in JOIN operators (symbol IDs, timestamps).


All three JOIN operators were updated:

OperatorUsagePattern
AsofJoinOperatorSymbol groupingBuild: group right rows by symbol. Probe: find matching group per left row.
HashJoinOperatorINNER/LEFT/RIGHT/FULLBuild: index one side by join key. Probe: match each row from the other side.
WindowJoinOperatorSymbol + time rangeBuild: group by symbol. Probe: find group, then scan time window within group.

The build/probe pattern is identical across all three — only the match logic differs.


Time-series JOINs have a specific access pattern:

  1. Build side is typically the smaller table (reference data, or the right side of an ASOF JOIN)
  2. Probe side streams through the larger table sequentially
  3. Keys are integers — symbol IDs and timestamps, not strings

FlatHashMap is optimized for exactly this pattern:

  • Contiguous memory layout means the build phase fills cache lines sequentially
  • Integer keys use hardware CRC32 — one cycle per hash
  • Linear probing keeps probe-phase accesses cache-local
  • No erase means zero overhead from tombstone management

Single-cycle hash

CRC32 hardware instruction. One cycle per key, excellent distribution for integers.

Cache-friendly probing

Linear probing on contiguous array. CPU prefetcher helps. Expected probe length < 2.

Zero allocation overhead

One contiguous allocation for the entire map. No per-entry heap nodes.

Build-once, probe-many

No erase support needed. Simpler code, no tombstone overhead.


The replacement was validated against all existing JOIN tests:

  • 9 new FlatHashMap unit tests (insert, find, collision, iteration, capacity)
  • 23 existing HashJoin tests: all pass unchanged
  • 52 total join/window-related tests: zero regressions

The API surface is small enough that the entire implementation fits in a single header file (~120 lines).


Related: How ASOF JOIN Works → · SIMD Window JOIN → · Cost-Based Query Planner →