Agent Memory Python Quickstart

This quickstart shows the smallest useful Agent Memory loop:

Connect to a ZeptoDB HTTP server.
Store scoped memories with client-supplied embeddings.
Retrieve context under a token budget.
Check exact and semantic prompt cache.
Store the response and write back a decision memory.

ZeptoDB does not call embedding or LLM providers from the server. The application owns embeddings, prompts, model calls, and provider credentials.

Start ZeptoDB

Use Docker for a local server:

docker run -p 8123:8123 zeptodb/zeptodb:latest

Or run the local binary:

./zepto_http_server --port 8123

For persistence, start the server with an Agent Memory directory:

./zepto_http_server --port 8123 --agent-memory-dir ./agent_memory

Install the Python client

pip install zeptodb

If you are working from the source tree, use the local package environment instead.

Run one memory turn

import hashlib
import json

import zepto_py as zepto


def embed(text: str, dims: int = 8) -> list[float]:
    """Small deterministic demo embedding. Replace with your provider."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [((digest[i] / 255.0) * 2.0) - 1.0 for i in range(dims)]


def call_model(prompt: str) -> str:
    """Mock provider call. Replace with your OpenAI/Anthropic/local model call."""
    return "Check press-7 vibration, inspect bearing wear, and compare the last maintenance window."


db = zepto.connect("localhost", 8123)

tenant_id = "factory-a"
namespace = "maintenance"
user_id = "operator-12"
session_id = "shift-2026-05-28"
agent_id = "maintenance-agent"

# 1. Seed memories.
db.memory.put(
    tenant_id=tenant_id,
    namespace=namespace,
    user_id=user_id,
    session_id=session_id,
    agent_id=agent_id,
    type="incident",
    content="Press-7 vibration rose before bearing wear was found during the last inspection.",
    metadata_json=json.dumps({"machine_id": "press-7", "source": "maintenance_log"}),
    embedding=embed("press-7 vibration bearing wear"),
    token_count=14,
    importance=0.9,
    pinned=True,
)

db.memory.put(
    tenant_id=tenant_id,
    namespace=namespace,
    user_id=user_id,
    session_id=session_id,
    agent_id=agent_id,
    type="runbook",
    content="For rising vibration, compare current, temperature, lubrication, and recent bearing service.",
    metadata_json=json.dumps({"machine_id": "press-7", "source": "runbook"}),
    embedding=embed("rising vibration inspection checklist"),
    token_count=13,
    importance=0.7,
)

# 2. Retrieve context for the current question.
question = "Why is press-7 vibration rising?"
context = db.memory.get_context(
    tenant_id=tenant_id,
    namespace=namespace,
    user_id=user_id,
    session_id=session_id,
    agent_id=agent_id,
    query_embedding=embed(question),
    token_budget=256,
    limit=5,
)

memory_lines = [
    f"- {m.get('content', '')}"
    for m in context.get("memories", [])
]
prompt = "\n".join([
    "Use the retrieved operational memory to answer the question.",
    "",
    "Question:",
    question,
    "",
    "Retrieved memory:",
    *memory_lines,
])

# 3. Check exact/semantic cache before calling a model provider.
cached = db.cache.lookup(
    prompt,
    embedding=embed(prompt),
    tenant_id=tenant_id,
    namespace=namespace,
    semantic_threshold=0.92,
)

if cached.get("hit"):
    response = cached.get("entry", {}).get("response", "")
    source = f"cache:{cached.get('kind', 'unknown')}"
else:
    response = call_model(prompt)
    db.cache.store(
        prompt,
        response,
        embedding=embed(prompt),
        tenant_id=tenant_id,
        namespace=namespace,
        metadata_json=json.dumps({"question": question, "agent_id": agent_id}),
        token_count=len(response.split()),
    )
    source = "provider"

# 4. Write back the decision as memory.
decision_id = db.memory.put(
    tenant_id=tenant_id,
    namespace=namespace,
    user_id=user_id,
    session_id=session_id,
    agent_id=agent_id,
    type="decision",
    content=response,
    metadata_json=json.dumps({"question": question, "source": source}),
    embedding=embed(response),
    token_count=len(response.split()),
    importance=0.85,
)

print({
    "source": source,
    "context_memories": len(context.get("memories", [])),
    "context_tokens": context.get("token_count", 0),
    "decision_id": decision_id,
    "response": response,
})

Add live evidence

Agent Memory becomes more useful when it sits beside live time-series evidence. In a real agent, retrieve recent rows first, then combine SQL evidence with memory context:

recent = db.query("""
    SELECT ts, vibration, temperature, current
    FROM machine_sensors
    WHERE machine_id = 'press-7'
    ORDER BY ts
""")

evidence = recent.to_dict()

The agent prompt can then include both:

recent time-series evidence from SQL
retrieved memories from db.memory.get_context(...)
cache result from db.cache.lookup(...)
decision write-back through db.memory.put(...)

Next steps

Agent Memory Guide Full API surface, performance shape, and current operating model

Why Time-Series Matters Why operational memory needs a replayable event timeline

Benchmarks Search, context, cache, snapshot, ingestion, and query latency numbers