Agent Memory Benchmarks: Search, Context, Cache, and Snapshots

Agent memory performance matters because recall sits directly in the agent loop. If every turn waits on slow context retrieval, the agent feels slow before the model is even called.

ZeptoDB benchmarks the memory layer separately from the time-series core, then shows how both fit together: microsecond evidence retrieval from live tables, millisecond memory search, exact/semantic cache lookup, and zero-copy Python for model-side workflows.

For any external comparison, use the benchmark criteria first: scope, hardware, build flags, dataset shape, cache state, run protocol, and tail latency must be disclosed with the number.
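As a rough illustration of what disclosing the run protocol and tail latency can look like, here is a minimal Python sketch of a timing harness. It is not ZeptoDB's benchmark code: the `bench` and `percentile` helpers, the warmup and run counts, and the nearest-rank percentile method are all assumptions made for this example.

```python
import math
import time


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def bench(fn, warmup=100, runs=1000):
    """Time fn() after a warmup phase and report the protocol alongside the numbers."""
    for _ in range(warmup):          # warm caches so the cache state is stated, not accidental
        fn()
    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter_ns()
        fn()
        samples_ms.append((time.perf_counter_ns() - start) / 1e6)
    return {
        "warmup": warmup,
        "runs": runs,
        "p50_ms": percentile(samples_ms, 50),
        "p95_ms": percentile(samples_ms, 95),
        "p99_ms": percentile(samples_ms, 99),
        "max_ms": max(samples_ms),
    }


if __name__ == "__main__":
    print(bench(lambda: sum(range(1000)), warmup=10, runs=200))
```

Reporting the warmup and run counts next to the percentiles keeps the number and its protocol together, which is the point of the criteria above.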


The Agent Memory benchmark uses client-supplied 128-dimensional float32 embeddings. It measures:

  • Filtered memory search
  • Context assembly under a token budget
  • Exact prompt cache lookup
  • Semantic cache lookup
  • Sidecar snapshot save/load
  • Optional sparse-projection ANN candidate generation

The memory layer ranks candidates by tenant/session filters, embedding similarity, importance, pinned boost, recency, and access count. Context assembly deduplicates repeated content and respects an optional token budget.
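The ranking and assembly steps described above can be sketched as follows. This is an illustrative Python model only, not ZeptoDB's implementation: the `Memory` fields, the score weights, the one-hour recency decay, and the whitespace token count are all hypothetical choices made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Memory:
    tenant: str
    session: str
    text: str
    embedding: list
    importance: float = 0.0   # assumed range 0..1
    pinned: bool = False
    age_seconds: float = 0.0
    access_count: int = 0


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def score(m, query):
    # Hypothetical weights; a real system would tune or configure these.
    recency = math.exp(-m.age_seconds / 3600.0)
    popularity = math.log1p(m.access_count)
    return (cosine(m.embedding, query)
            + 0.3 * m.importance
            + (0.5 if m.pinned else 0.0)
            + 0.2 * recency
            + 0.05 * popularity)


def search_memories(memories, query, tenant, session, k):
    # Filters are applied first so other tenants/sessions never reach ranking.
    scoped = [m for m in memories if m.tenant == tenant and m.session == session]
    return sorted(scoped, key=lambda m: score(m, query), reverse=True)[:k]


def assemble_context(ranked, token_budget=None):
    seen, out, used = set(), [], 0
    for m in ranked:
        key = " ".join(m.text.lower().split())   # normalize case and whitespace for dedup
        if key in seen:
            continue
        cost = len(m.text.split())               # crude whitespace token estimate
        if token_budget is not None and used + cost > token_budget:
            continue                             # skip items that would exceed the budget
        seen.add(key)
        out.append(m.text)
        used += cost
    return out
```

Passing `token_budget=None` models the optional budget: every deduplicated item is kept, in ranked order.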


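| Operation             | p50      | p95     |
|-----------------------|----------|---------|
| Memory search top-K   | 1.23 ms  | 1.40 ms |
| Context assembly      | 1.34 ms  | 1.41 ms |
| Exact cache lookup    | 0.00 ms  | 0.00 ms |
| Semantic cache lookup | 0.07 ms  | 0.07 ms |
| Snapshot save         | 5.79 ms  | -       |
| Snapshot load         | 11.60 ms | -       |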

For many operational agents, 10K scoped memories is already a meaningful working set: current user/session memory, incident summaries, pinned runbooks, prior diagnoses, and cache entries.


Sparse-projection ANN is a derived in-memory candidate index. It can reduce filtered-search latency at larger memory counts, but it is recall-sensitive and can fall back to filtered scan when it cannot produce enough filtered candidates.
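One way to picture candidate generation with a filtered-scan fallback is the sketch below. It is one possible reading of "sparse projection" (sign bits of sparse random ±1 projections, used as a bucket key), not a description of ZeptoDB's index; `BITS`, `NONZERO`, the record layout, and all function names are assumptions for illustration.

```python
import random
from collections import defaultdict

DIM, BITS, NONZERO = 128, 12, 8   # hypothetical index parameters


def make_projections(seed=0):
    rng = random.Random(seed)
    # Each projection reads only NONZERO coordinates, with +1/-1 weights.
    return [[(i, rng.choice((-1.0, 1.0))) for i in rng.sample(range(DIM), NONZERO)]
            for _ in range(BITS)]


def bucket(vec, projections):
    code = 0
    for proj in projections:
        s = sum(w * vec[i] for i, w in proj)
        code = (code << 1) | int(s >= 0.0)   # one sign bit per projection
    return code


def build_index(records, projections):
    # Derived structure: it can always be rebuilt from the records.
    index = defaultdict(list)
    for rid, rec in enumerate(records):
        index[bucket(rec["embedding"], projections)].append(rid)
    return index


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ann_search(records, index, projections, query, tenant, k):
    ids = index.get(bucket(query, projections), [])
    cands = [i for i in ids if records[i]["tenant"] == tenant]
    used_fallback = len(cands) < k
    if used_fallback:
        # Not enough filtered candidates: fall back to an exact filtered scan.
        cands = [i for i, r in enumerate(records) if r["tenant"] == tenant]
    cands.sort(key=lambda i: dot(records[i]["embedding"], query), reverse=True)
    return cands[:k], used_fallback
```

The sketch also shows why such an index is recall-sensitive: a near neighbor that lands in a different bucket is never scored unless the fallback triggers.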

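| Records | Search p50 | Search p95 | Context p50 | Context p95 | ANN rebuild |
|---------|------------|------------|-------------|-------------|-------------|
| 10K     | 0.19 ms    | 0.41 ms    | 0.38 ms     | 0.52 ms     | 12.36 ms    |
| 100K    | 2.41 ms    | 4.68 ms    | 2.77 ms     | 2.98 ms     | 138.37 ms   |
| 1M      | 32.03 ms   | 36.27 ms   | 25.48 ms    | 29.96 ms    | 1691.56 ms  |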

This is useful as a current baseline, not the final word on million-memory search. Stronger ANN index families remain a follow-up area.


How this compares with the time-series core

5.52M events/sec

The ingestion path captures live observations, tool calls, cache events, and model-call telemetry without turning the agent stack into a separate logging system.

272us query on 1M rows

Evidence retrieval stays fast enough to happen before the agent acts, not only after an incident review.

522ns Python zero-copy

Query results can move into Python, NumPy, Pandas, and PyTorch without serialization overhead.

0.07ms semantic cache lookup

Repeated operational prompts can reuse prior responses when application policy allows it.
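To make the two cache paths concrete, here is a small Python sketch of an exact lookup backed by a hash of the prompt, with a semantic lookup over stored embeddings as the second step. The `PromptCache` class, the SHA-256 key, the linear scan, and the 0.95 similarity threshold are hypothetical and stand in for whatever policy the application chooses.

```python
import hashlib
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class PromptCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold   # assumed policy knob for semantic reuse
        self.exact = {}              # sha256(prompt) -> response
        self.semantic = []           # list of (embedding, response)

    @staticmethod
    def _key(prompt):
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def put(self, prompt, embedding, response):
        self.exact[self._key(prompt)] = response
        self.semantic.append((embedding, response))

    def get(self, prompt, embedding):
        # Exact path: a single hash lookup.
        hit = self.exact.get(self._key(prompt))
        if hit is not None:
            return "exact", hit
        # Semantic path: best stored entry at or above the threshold.
        best, best_sim = None, self.threshold
        for emb, resp in self.semantic:
            sim = cosine(emb, embedding)
            if sim >= best_sim:
                best, best_sim = resp, sim
        if best is not None:
            return "semantic", best
        return "miss", None
```

Raising the threshold trades hit rate for safety: fewer prompts reuse a prior response, but the ones that do are closer to the original wording.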


Raw p50 latency is only one part of the picture. For an agent workload, measure the full turn:

  1. Query recent time-series evidence.
  2. Retrieve memories with tenant/session filters.
  3. Assemble context under a token budget.
  4. Check exact and semantic cache.
  5. Call the model only on cache miss.
  6. Write back the decision, cache event, model call, and tool calls.
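The six steps above can be sketched as a single timed turn. This is a minimal Python outline under stated assumptions: the `stages` dictionary and its callables (`query_evidence`, `search_memory`, `assemble_context`, `cache_lookup`, `call_model`, `write_back`) are hypothetical placeholders supplied by the caller, not ZeptoDB APIs.

```python
import time


def timed(timings, name, fn, *args):
    """Run fn(*args) and record its wall-clock latency in milliseconds."""
    start = time.perf_counter()
    result = fn(*args)
    timings[name] = (time.perf_counter() - start) * 1000.0
    return result


def run_turn(prompt, stages):
    timings = {}
    evidence = timed(timings, "evidence", stages["query_evidence"], prompt)
    memories = timed(timings, "memory", stages["search_memory"], prompt)
    context = timed(timings, "context", stages["assemble_context"], evidence, memories)
    response = timed(timings, "cache", stages["cache_lookup"], prompt)
    cache_hit = response is not None
    if not cache_hit:
        # The model is called only when both cache paths miss.
        response = timed(timings, "model", stages["call_model"], prompt, context)
    timed(timings, "write_back", stages["write_back"], prompt, response, cache_hit)
    timings["total"] = sum(timings.values())
    return response, timings
```

Collecting `timings` per turn and then taking p50 and p95 of `total` gives a number that reflects what the agent actually waits for, rather than any single stage in isolation.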

That is the workload ZeptoDB is designed around: one timeline for facts, context, cache, and decisions.