Read-after-write
drain_sync() after INSERT guarantees immediate SELECT visibility. No eventual consistency.
Time-series databases are traditionally append-only. But real-world use cases demand corrections (UPDATE wrong prices), purges (DELETE test data), and standard SQL ingestion (INSERT). This post covers how ZeptoDB implements all three DML operations on its columnar storage.
-- Basic (default column order: symbol, price, volume, timestamp)INSERT INTO trades VALUES (1, 15000, 100, 1000000000)
-- Multi-row batchINSERT INTO trades VALUES (1, 15050, 200, 2000000000), (2, 16000, 300, 3000000000)
-- Column specification (timestamp auto-generated if omitted)INSERT INTO trades (symbol, price, volume) VALUES (1, 15100, 150)SQL INSERT → Parser → InsertStmt AST → Executor → pipeline_.ingest_tick(msg) // into ring buffer → pipeline_.drain_sync() // flush to column store → SELECT immediately availableThe key design decision: drain_sync() is called after INSERT to guarantee read-after-write consistency. Without it, a subsequent SELECT might not see the just-inserted rows (they’d still be in the ring buffer).
When the column list omits timestamp, the executor generates one via std::chrono::system_clock::now() in nanoseconds — matching the behavior users expect from INSERT INTO trades (symbol, price, volume) VALUES (...).
No HTTP server changes were needed. The existing POST / → executor_.execute(sql) path handles INSERT identically to SELECT:
curl -X POST http://localhost:8123/ \ -d "INSERT INTO trades VALUES (1, 15000, 100, 1000000000)"This means ClickHouse migration customers can point existing INSERT pipelines at ZeptoDB without code changes.
UPDATE trades SET price = 15200 WHERE symbol = 1 AND price > 15100UPDATE trades SET price = 15200, volume = 500 WHERE symbol = 1UPDATE doesn’t allocate new storage. It writes directly into the existing column spans:
auto span = col_vec.as_span<int64_t>();span[row_idx] = new_value; // direct memory writeThe executor:
Cost: O(matched rows). No copying, no reallocation, no new partition creation.
Multiple SET assignments are applied in a single pass over the matched rows. Each assignment maps a column name to a literal value — the executor resolves column names to ColumnVector pointers once, then iterates.
DELETE FROM trades WHERE symbol = 1 AND price < 15000DELETE FROM trades WHERE symbol = 1 -- all rows for a symbolDELETE can’t just zero out rows — that would leave gaps in the columnar arrays that every subsequent scan would need to skip. Instead, it compacts:
Before: [A, B, C, D, E] — delete B and D ↓Shift: [A, C, E, _, _] ↓Shrink: [A, C, E] — set_size(3)For each column in the partition:
set_size(new_count) to shrink the logical sizeArena memory from deleted rows is not reclaimed — the arena allocator is append-only. This is acceptable because:
A recurring theme across all three DML operations: symbol is a partition key, not a column. The executor extracts symbol conditions via has_where_symbol() and routes to the correct partition(s) before scanning any data.
INSERT: symbol value → partition routing via ingest_tick()UPDATE: WHERE symbol = N → scan only that partitionDELETE: WHERE symbol = N → compact only that partitionWithout this optimization, every DML operation would scan all partitions — O(total rows) instead of O(partition rows).
| Token | Statement |
|---|---|
INSERT, INTO, VALUES | INSERT INTO table VALUES (...) |
UPDATE, SET | UPDATE table SET col = val WHERE ... |
DELETE_KW | DELETE FROM table WHERE ... |
A subtle conflict: adding the SET token broke ALTER TABLE SET TTL parsing, because the parser expected an IDENT token with value “SET” but now received TokenType::SET. Fixed by accepting both token types in the ALTER TABLE parser.
In a multi-node cluster, the QueryCoordinator routes DML based on the symbol:
| Operation | Routing |
|---|---|
| INSERT | Route to node owning the symbol |
| UPDATE with symbol filter | Route to owning node |
| UPDATE without symbol | Broadcast to all nodes |
| DELETE with symbol filter | Route to owning node |
| DELETE without symbol | Broadcast to all nodes |
| DDL | Broadcast to all nodes |
SET price = price * 1.01 isn’t supported — only literal values.Read-after-write
drain_sync() after INSERT guarantees immediate SELECT visibility. No eventual consistency.
Zero-copy UPDATE
Direct span writes — no allocation, no copying. O(matched rows) cost.
Compaction DELETE
Shift + shrink preserves dense columnar layout. Arena memory reclaimed at partition eviction.
ClickHouse compatible
Same INSERT syntax, same HTTP endpoint. Migration customers need zero code changes.
Related: DDL Schema Management → · Storage Tiering → · SQL Parser →