Agent Memory Benchmarks: Search, Context, Cache, and Snapshots

Agent memory performance matters because recall sits directly in the agent loop. If every turn waits on slow context retrieval, the agent feels slow before the model is even called.

ZeptoDB benchmarks the memory layer separately from the time-series core, then shows how both fit together: microsecond evidence retrieval from live tables, millisecond memory search, exact/semantic cache lookup, and zero-copy Python for model-side workflows.

For any external comparison, apply the benchmark criteria first: every number must be disclosed together with its scope, hardware, build flags, dataset shape, cache state, run protocol, and tail latency.
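As an illustration of what disclosing the run protocol and tail latency can look like in practice, here is a minimal timing harness. It is a sketch, not ZeptoDB's benchmark code: `percentile` and `measure` are hypothetical helper names, the warm-up and run counts are arbitrary defaults, and the nearest-rank percentile is an assumed convention.

```python
import math
import platform
import time


def percentile(sorted_samples, pct):
    """Nearest-rank percentile of an ascending list (pct in 0..100)."""
    if not sorted_samples:
        raise ValueError("no samples")
    rank = max(1, math.ceil(pct / 100.0 * len(sorted_samples)))
    return sorted_samples[rank - 1]


def measure(op, warmup=100, runs=1000):
    """Time op() after a warm-up phase and return latency stats in ms."""
    for _ in range(warmup):  # warm caches so the reported cache state is known
        op()
    samples = []
    for _ in range(runs):
        start = time.perf_counter_ns()
        op()
        samples.append((time.perf_counter_ns() - start) / 1e6)
    samples.sort()
    return {
        # Run protocol, reported next to the numbers it produced.
        "warmup": warmup,
        "runs": runs,
        "p50_ms": percentile(samples, 50),
        "p95_ms": percentile(samples, 95),
        "p99_ms": percentile(samples, 99),
        # Minimal environment disclosure; real reports need far more detail.
        "machine": platform.machine(),
        "python": platform.python_version(),
    }


if __name__ == "__main__":
    data = list(range(10_000))
    print(measure(lambda: sum(data)))
```

Reporting the warm-up count and run count in the same record as the percentiles keeps a number from being quoted without its protocol.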

Benchmark shape

The Agent Memory benchmark uses client-supplied 128-dimensional float32 embeddings. It measures:

Filtered memory search
Context assembly under a token budget
Exact prompt cache lookup
Semantic cache lookup
Sidecar snapshot save/load
Optional sparse-projection ANN candidate generation

The memory layer ranks candidates by tenant/session filters, embedding similarity, importance, pinned boost, recency, and access count. Context assembly deduplicates repeated content and respects an optional token budget.
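The ranking and assembly steps above can be sketched in a few lines of Python. This is an illustration only: the `Memory` fields, the scoring weights, the one-day recency decay, and the whitespace-based token estimate are all assumptions, not ZeptoDB's actual schema or scoring formula.

```python
import math
from dataclasses import dataclass


@dataclass
class Memory:  # hypothetical record shape
    tenant: str
    session: str
    content: str
    embedding: list
    importance: float = 0.0
    pinned: bool = False
    age_seconds: float = 0.0
    access_count: int = 0


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def score(mem, query):
    # Assumed weights; a real system would tune or expose these.
    recency = math.exp(-mem.age_seconds / 86_400.0)  # decays over about a day
    return (cosine(mem.embedding, query)
            + 0.3 * mem.importance
            + (0.5 if mem.pinned else 0.0)
            + 0.2 * recency
            + 0.05 * math.log1p(mem.access_count))


def search(memories, query, tenant, session, k=5):
    # Filter by tenant/session first, then rank only the scoped records.
    scoped = [m for m in memories if m.tenant == tenant and m.session == session]
    return sorted(scoped, key=lambda m: score(m, query), reverse=True)[:k]


def assemble_context(ranked, token_budget=None):
    seen, out, used = set(), [], 0
    for mem in ranked:
        key = " ".join(mem.content.lower().split())  # normalize for dedup
        if key in seen:
            continue
        cost = len(mem.content.split())  # crude token estimate (assumption)
        if token_budget is not None and used + cost > token_budget:
            continue  # skip items that would exceed the budget
        seen.add(key)
        out.append(mem.content)
        used += cost
    return out
```

Filtering before scoring keeps another tenant's records out of the candidate set entirely, rather than relying on a low score to exclude them.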

10K memory records

Operation	p50	p95
Memory search top-K	1.23ms	1.40ms
Context assembly	1.34ms	1.41ms
Exact cache lookup	<0.01ms	<0.01ms
Semantic cache lookup	0.07ms	0.07ms
Snapshot save	5.79ms	-
Snapshot load	11.60ms	-

For many operational agents, 10K scoped memories is already a meaningful working set: current user/session memory, incident summaries, pinned runbooks, prior diagnoses, and cache entries.

Sparse-projection ANN sweep

Sparse-projection ANN is a derived in-memory candidate index. It can reduce filtered-search latency at larger memory counts, but it is recall-sensitive, and it falls back to a filtered scan when it cannot produce enough filtered candidates.
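The candidate-then-fallback behavior can be illustrated with a small sketch. The article does not describe the index internals, so this assumes an LSH-style scheme: sparse random sign projections (each one reads only a few dimensions) hash a vector into a bucket, and search falls back to a filtered scan when the bucket yields fewer than k records for the tenant. All function names and parameters here are hypothetical.

```python
import random
from collections import defaultdict


def make_projections(dim, n_bits=8, nonzeros=4, seed=0):
    """Each projection touches only `nonzeros` dimensions with +/-1 weights."""
    rng = random.Random(seed)
    return [[(i, rng.choice((-1.0, 1.0))) for i in rng.sample(range(dim), nonzeros)]
            for _ in range(n_bits)]


def signature(vec, projections):
    """Pack the sign of each sparse projection into one integer bucket key."""
    bits = 0
    for proj in projections:
        value = sum(w * vec[i] for i, w in proj)
        bits = (bits << 1) | (1 if value >= 0 else 0)
    return bits


def build_index(records, projections):
    """records: list of (tenant, vector). Returns bucket key -> record ids."""
    index = defaultdict(list)
    for rid, (_, vec) in enumerate(records):
        index[signature(vec, projections)].append(rid)
    return index


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def search(records, index, projections, query, tenant, k):
    bucket = index.get(signature(query, projections), [])
    candidates = [rid for rid in bucket if records[rid][0] == tenant]
    used_fallback = len(candidates) < k
    if used_fallback:
        # Not enough filtered candidates: scan every record for the tenant.
        candidates = [rid for rid, (t, _) in enumerate(records) if t == tenant]
    candidates.sort(key=lambda rid: dot(records[rid][1], query), reverse=True)
    return candidates[:k], used_fallback
```

Returning the fallback flag alongside the results makes it possible to report how often a benchmark run actually exercised the index rather than the scan.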

Records	Search p50	Search p95	Context p50	Context p95	ANN rebuild
10K	0.19ms	0.41ms	0.38ms	0.52ms	12.36ms
100K	2.41ms	4.68ms	2.77ms	2.98ms	138.37ms
1M	32.03ms	36.27ms	25.48ms	29.96ms	1691.56ms

This is useful as a current baseline, not the final word on million-memory search. Stronger ANN index families remain a follow-up area.

How this compares with the time-series core

5.52M events/sec

The ingestion path captures live observations, tool calls, cache events, and model-call telemetry without turning the agent stack into a separate logging system.

272us query on 1M rows

Evidence retrieval stays fast enough to happen before the agent acts, not only after an incident review.

522ns Python zero-copy

Query results can move into Python, NumPy, Pandas, and PyTorch without serialization overhead.

0.07ms semantic cache lookup

Repeated operational prompts can reuse prior responses when application policy allows it.

What to benchmark in your own app

Raw p50 latency is only one part of the picture. For an agent workload, measure the full turn:

1. Query recent time-series evidence.
2. Retrieve memories with tenant/session filters.
3. Assemble context under a token budget.
4. Check exact and semantic cache.
5. Call the model only on cache miss.
6. Write back the decision, cache event, model call, and tool calls.
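One way to measure the full turn is to time each stage and the total in the same run. The sketch below does that with pluggable stage functions. Everything in it is hypothetical: `PromptCache`, `run_turn`, the SHA-256 exact key, and the 0.95 cosine threshold are illustrative choices, not ZeptoDB's API or defaults.

```python
import hashlib
import math
import time


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class PromptCache:
    def __init__(self, threshold=0.95):  # assumed similarity threshold
        self.exact = {}
        self.semantic = []  # list of (embedding, response)
        self.threshold = threshold

    @staticmethod
    def _key(prompt):
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt, embedding):
        key = self._key(prompt)
        if key in self.exact:
            return self.exact[key], "exact"
        best, best_sim = None, self.threshold
        for emb, response in self.semantic:
            sim = cosine(emb, embedding)
            if sim >= best_sim:
                best, best_sim = response, sim
        return (best, "semantic") if best is not None else (None, "miss")

    def put(self, prompt, embedding, response):
        self.exact[self._key(prompt)] = response
        self.semantic.append((embedding, response))


def run_turn(prompt, embedding, cache, query_evidence, retrieve_memories,
             assemble, call_model, write_back):
    timings = {}

    def timed(name, fn, *args):
        start = time.perf_counter()
        result = fn(*args)
        timings[name] = (time.perf_counter() - start) * 1000.0  # ms
        return result

    turn_start = time.perf_counter()
    evidence = timed("evidence", query_evidence)
    memories = timed("memory_search", retrieve_memories, embedding)
    context = timed("context", assemble, evidence, memories)
    response, source = timed("cache", cache.get, prompt, embedding)
    if response is None:  # call the model only on a cache miss
        response = timed("model", call_model, prompt, context)
        cache.put(prompt, embedding, response)
    timed("write_back", write_back, response, source)
    timings["total"] = (time.perf_counter() - turn_start) * 1000.0
    return response, source, timings
```

Collecting per-stage timings over many turns, then taking percentiles of both the stages and the total, shows which stage dominates the tail rather than only the median.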

That is the workload ZeptoDB is designed around: one timeline for facts, context, cache, and decisions.

More detail

Benchmarks: Full site benchmark page with time-series and Agent Memory numbers

Agent Memory Guide: API surface and Python sketch for memory, cache, and context

Why Agent Memory Needs Time-Series: The product rationale behind replayable operational memory