Write 'AAPL', store uint32
Dictionary encoding: human-readable SQL, integer-speed storage and queries.
The #1 user-facing gap in ZeptoDB: every demo required explaining “symbol 1 means AAPL.” This post covers the dictionary-encoded string column implementation that lets users write WHERE symbol = 'AAPL' while keeping storage and query performance identical to integer columns.
kdb+ uses symbols (interned strings) natively. For ZeptoDB to be a credible kdb+ replacement, users need to write:
INSERT INTO trades VALUES ('AAPL', 15025, 100, 1711900800000000000)SELECT * FROM trades WHERE symbol = 'AAPL'Not:
INSERT INTO trades VALUES (1, 15025, 100, 1711900800000000000)SELECT * FROM trades WHERE symbol = 1 -- "1 means AAPL, trust me"Low-cardinality columns (symbol, exchange, side, currency) cover 95%+ of string use cases in HFT — dictionary encoding is the right approach.
INSERT ... VALUES ('AAPL', 150.25, 100) │ ▼ ┌─────────────┐ │ Parser │ TokenType::STRING → InsertValue::STRING └──────┬──────┘ ▼ ┌─────────────┐ │ Executor │ StringDictionary::intern("AAPL") → uint32 code └──────┬──────┘ ▼ ┌─────────────┐ │ ColumnStore │ Stored as uint32 (same as integer columns) └──────┬──────┘ ▼ ┌─────────────┐ │ HTTP Output │ symbol_dict->lookup(code) → "AAPL" in JSON └─────────────┘Strings are interned at INSERT time and resolved at output time. Everything in between — storage, filtering, aggregation, JOIN — operates on integer codes.
class StringDictionary { std::unordered_map<std::string, uint32_t> index_; // string → code std::vector<std::string> strings_; // code → string
public: uint32_t intern(const std::string& s); // get-or-create code const std::string& lookup(uint32_t c); // code → string std::optional<uint32_t> find(const std::string& s); // lookup without creating};intern(): O(1) amortized hash insert. Same string always returns same code.lookup(): O(1) vector index. Used only at output time.find(): O(1) hash lookup. Used in WHERE evaluation — returns empty if string not found (→ empty result set, no scan needed).The initial implementation used string_view as hash map keys, pointing into the strings_ vector. When the vector grew and reallocated, all existing keys became dangling pointers. The same string inserted twice would get different codes.
Fix: use std::string keys (owned copies). The memory overhead is negligible for low-cardinality columns — 10,000 unique symbols × ~10 bytes = ~100KB.
The dictionary is global (per ZeptoPipeline), not per-partition. This is critical:
symbol_id works correctlyclass ZeptoPipeline { StringDictionary symbol_dict_; // global, shared across all partitions};WHERE evaluation with string predicates:
SELECT * FROM trades WHERE symbol = 'AAPL'Expr with is_string = true, value_str = "AAPL"symbol_dict.find("AAPL") → uint32_t codeThe find() call happens once per query, not per row. After resolution, the scan operates on integer comparisons — identical performance to WHERE symbol = 42.
The HTTP server resolves codes back to strings in JSON responses:
{ "columns": ["symbol", "price", "volume"], "data": [["AAPL", 15025, 100], ["GOOG", 28100, 50]]}QueryResultSet carries a symbol_dict pointer. The HTTP serializer checks each column — if it’s a SYMBOL type, it calls lookup(code) to produce the string representation.
Integer symbol IDs still work:
-- Both work, can coexist in the same databaseSELECT * FROM trades WHERE symbol = 'AAPL'SELECT * FROM trades WHERE symbol = 1Legacy TickMessage-based ingestion (integer symbol IDs) and new SQL INSERT (string symbols) can coexist. The dictionary simply maps strings to codes; integer-only workflows bypass it entirely.
In a multi-node cluster, extract_symbol_filter() can’t resolve string → code without a global registry. String WHERE queries fall back to scatter-gather: the coordinator broadcasts to all nodes, each resolves locally.
Future optimization: a global symbol registry would enable direct routing (Tier A) for string predicates.
Zero regression — benchmarks unchanged:
| Benchmark | Time |
|---|---|
| xbar GROUP BY 1M rows | 11.20ms |
| EMA 1M rows | 2.15ms |
| Window JOIN 100K×100K | 11.00ms |
Dictionary lookup is O(1) hash. Intern is O(1) amortized. Stored codes are uint32 — same width as before. The only new work is the one-time find() call per string predicate per query.
Write 'AAPL', store uint32
Dictionary encoding: human-readable SQL, integer-speed storage and queries.
Global dictionary
One dictionary per pipeline. Consistent codes across all partitions and queries.
Zero query overhead
String → code resolution happens once per query. Scan operates on integer comparisons.
Backward compatible
Integer symbol IDs still work. Legacy and new ingestion paths coexist.
Related: SQL Parser → · DDL Schema Management → · SQL DML →