
HTTP Server Observability: Structured Logging and Request Tracing

ZeptoDB’s HTTP server previously had no request-level logging. There was no way to trace individual requests, identify slow queries, or correlate client-side errors with server-side events. This post covers the observability layer that addresses all three.


Every HTTP request produces a structured JSON log entry:

{
  "request_id": "r0001a3",
  "method": "POST",
  "path": "/",
  "status": 200,
  "duration_us": 532,
  "request_bytes": 42,
  "response_bytes": 1024,
  "remote_addr": "10.0.1.5",
  "subject": "algo-service"
}

Entries are emitted via zeptodb::util::Logger (asynchronous JSON output to a rotating file). The log level is determined by the status code:

Status range    Log level
2xx, 3xx        INFO
4xx             WARN
5xx             ERROR

Access log entries carry the component tag "http", which makes it easy to separate them from other server events in log aggregation tools.
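To make the status-to-level rule and the entry shape concrete, here is a minimal C++ sketch. `LogLevel`, `level_for_status`, `json_escape`, `AccessLogEntry`, and `to_json` are hypothetical names for illustration, not ZeptoDB's API; the real server hands entries to zeptodb::util::Logger, whose interface this post does not show. Field names follow the example entry above.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical level enum; the real Logger's level type may differ.
enum class LogLevel { Info, Warn, Error };

// Map an HTTP status code to a log level using the ranges in the table.
LogLevel level_for_status(int status) {
    if (status >= 500) return LogLevel::Error;
    if (status >= 400) return LogLevel::Warn;
    return LogLevel::Info;  // 2xx and 3xx
}

// Escape a string for embedding inside a JSON string literal.
std::string json_escape(const std::string& s) {
    std::string out;
    out.reserve(s.size());
    for (unsigned char c : s) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n"; break;
            case '\r': out += "\\r"; break;
            case '\t': out += "\\t"; break;
            default:
                if (c < 0x20) {  // other control characters become \u00XX
                    char buf[7];
                    std::snprintf(buf, sizeof buf, "\\u%04x", c);
                    out += buf;
                } else {
                    out += static_cast<char>(c);
                }
        }
    }
    return out;
}

// Hypothetical in-memory form of one access log entry.
struct AccessLogEntry {
    std::string request_id, method, path, remote_addr, subject;
    int status = 0;
    std::uint64_t duration_us = 0, request_bytes = 0, response_bytes = 0;
};

// Serialize one entry as a single-line JSON object.
std::string to_json(const AccessLogEntry& e) {
    return "{\"request_id\":\"" + json_escape(e.request_id) +
           "\",\"method\":\"" + json_escape(e.method) +
           "\",\"path\":\"" + json_escape(e.path) +
           "\",\"status\":" + std::to_string(e.status) +
           ",\"duration_us\":" + std::to_string(e.duration_us) +
           ",\"request_bytes\":" + std::to_string(e.request_bytes) +
           ",\"response_bytes\":" + std::to_string(e.response_bytes) +
           ",\"remote_addr\":\"" + json_escape(e.remote_addr) +
           "\",\"subject\":\"" + json_escape(e.subject) + "\"}";
}
```

Escaping matters because the path, and possibly the subject, originate from the client: an unescaped quote or newline in either would break the one-entry-per-line JSON format.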


Queries exceeding 100ms (or returning errors) get a dedicated log entry:

{
  "query_id": "q_a1b2c3",
  "subject": "algo-service",
  "duration_us": 150234,
  "rows": 50000,
  "ok": true,
  "sql": "SELECT vwap(price, volume) FROM trades WHERE ..."
}

The SQL text is truncated to 200 characters so that a large query cannot produce a multi-megabyte log entry. Slow query entries carry the component tag "query".

This log is usually the quickest route to performance problems in production: sort by duration_us and the worst offenders surface immediately.
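The slow-query rule and the truncation step can be sketched as two small functions. The 100 ms threshold and 200-character limit come from the text; the constant and function names are hypothetical. The post does not say whether the limit counts bytes or characters, so this sketch assumes bytes and backs up to a UTF-8 code point boundary, keeping a truncated entry valid UTF-8. "Exceeding 100ms" is read as strictly greater than 100 ms.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Values from the post; names are hypothetical.
constexpr std::uint64_t kSlowQueryThresholdUs = 100'000;  // 100 ms
constexpr std::size_t kMaxLoggedSqlBytes = 200;

// A query is logged if it exceeded the threshold or returned an error.
bool should_log_slow_query(std::uint64_t duration_us, bool ok) {
    return !ok || duration_us > kSlowQueryThresholdUs;
}

// Truncate SQL for logging without splitting a multi-byte UTF-8 sequence.
std::string truncate_sql(const std::string& sql,
                         std::size_t max_bytes = kMaxLoggedSqlBytes) {
    if (sql.size() <= max_bytes) return sql;
    std::size_t cut = max_bytes;
    // While the byte at `cut` is a continuation byte (10xxxxxx), the cut
    // would land inside a code point, so move it back to the lead byte.
    while (cut > 0 && (static_cast<unsigned char>(sql[cut]) & 0xC0) == 0x80) {
        --cut;
    }
    return sql.substr(0, cut);
}
```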


Every HTTP response includes a unique request identifier:

HTTP/1.1 200 OK
X-Request-Id: r0001a3
Content-Type: application/json

The ID uses a monotonic counter (r<hex>), ensuring uniqueness within a process. Clients can log this value and use it to correlate their errors with server-side access log entries.
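A sketch of such a generator, assuming a single process-wide `std::atomic` counter. The class name is hypothetical, and zero-padding to six hex digits is an assumption inferred from the example ID `r0001a3`; the post only specifies the `r<hex>` shape.

```cpp
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical request ID generator: atomic counter rendered as "r" + hex.
class RequestIdGenerator {
public:
    std::string next() {
        // fetch_add is atomic, so concurrent handler threads never
        // receive the same value; the first ID is r000001.
        const std::uint64_t n = counter_.fetch_add(1, std::memory_order_relaxed) + 1;
        char buf[24];  // "r" + up to 16 hex digits + NUL
        std::snprintf(buf, sizeof buf, "r%06llx",
                      static_cast<unsigned long long>(n));
        return buf;
    }

private:
    std::atomic<std::uint64_t> counter_{0};
};
```

In this sketch the 419th request (0x1a3) receives `r0001a3`. Relaxed memory ordering is sufficient because only the uniqueness of each value matters, not its ordering relative to other memory operations.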

Typical debugging workflow:

Client log: "Query failed, request_id=r0001a3"
Server log: grep "r0001a3" /var/log/zeptodb/access.json
Result: {"request_id":"r0001a3","status":500,"duration_us":30012,...}

Startup and shutdown are logged as structured events:

{"event": "server_start", "port": 8123, "tls": false, "auth": true, "async": true}
{"event": "server_stop", "port": 8123}

These events give operators a precise record of when a server started, with which configuration, and when it stopped.


Two metrics are exposed for monitoring dashboards:

Metric                        Type     Description
zepto_http_requests_total     Counter  Total HTTP requests served
zepto_http_active_sessions    Gauge    Current number of active sessions

These integrate with the existing Prometheus ServiceMonitor in the Helm chart. Combined with the access log, you get both real-time dashboards and detailed per-request forensics.
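For reference, here is roughly how the two metrics look in the Prometheus text exposition format. `HttpMetrics` and its hook methods are hypothetical; the post does not describe how ZeptoDB defines a session or which endpoint serves the metrics.

```cpp
#include <atomic>
#include <cstdint>
#include <string>

// Hypothetical holder for the two metrics named in the table.
struct HttpMetrics {
    std::atomic<std::uint64_t> requests_total{0};  // counter: only increases
    std::atomic<std::int64_t> active_sessions{0};  // gauge: rises and falls

    void on_request() { requests_total.fetch_add(1, std::memory_order_relaxed); }
    void on_session_open() { active_sessions.fetch_add(1, std::memory_order_relaxed); }
    void on_session_close() { active_sessions.fetch_sub(1, std::memory_order_relaxed); }

    // Render both metrics in the Prometheus text exposition format
    // (HELP and TYPE comment lines followed by "name value").
    std::string render() const {
        std::string out;
        out += "# HELP zepto_http_requests_total Total HTTP requests served\n";
        out += "# TYPE zepto_http_requests_total counter\n";
        out += "zepto_http_requests_total " + std::to_string(requests_total.load()) + "\n";
        out += "# HELP zepto_http_active_sessions Current active sessions\n";
        out += "# TYPE zepto_http_active_sessions gauge\n";
        out += "zepto_http_active_sessions " + std::to_string(active_sessions.load()) + "\n";
        return out;
    }
};
```

The distinction between the two types matters on the dashboard side: Prometheus functions such as `rate()` assume a monotonically increasing counter, while a gauge is graphed as its current value.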


HTTP Request
├─→ Generate X-Request-Id (monotonic counter)
├─→ Execute handler (query, admin, health, etc.)
├─→ Measure duration
├─→ Access log entry (util::Logger, async JSON)
│ └─→ Log level based on status code
├─→ Slow query log (if duration > 100ms or error)
├─→ Prometheus counter increment
└─→ Response with X-Request-Id header
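The pipeline above can be sketched as a wrapper around any handler. Everything here (`Response`, `Hooks`, `observe`) is hypothetical, and the slow query step is left as a comment because it only applies to query handlers. The detail worth copying is the clock: `std::chrono::steady_clock` is monotonic, so a wall-clock adjustment during a request cannot produce a negative or inflated `duration_us`.

```cpp
#include <chrono>
#include <cstdint>
#include <functional>
#include <string>

// Hypothetical response shape; request_id is what would be sent back
// in the X-Request-Id header.
struct Response {
    int status = 200;
    std::string body;
    std::string request_id;
};

// Hypothetical hooks standing in for the ID generator, the access log
// (which picks the level from the status) and the Prometheus counter.
struct Hooks {
    std::function<std::string()> next_request_id;
    std::function<void(const std::string&, int, std::uint64_t)> access_log;
    std::function<void()> count_request;
};

// Run a handler through the steps of the diagram, in the same order.
Response observe(const std::function<Response()>& handler, const Hooks& hooks) {
    const std::string id = hooks.next_request_id();  // generate X-Request-Id
    const auto start = std::chrono::steady_clock::now();
    Response resp = handler();                        // execute handler
    const auto elapsed = std::chrono::steady_clock::now() - start;
    const auto duration_us = static_cast<std::uint64_t>(  // measure duration
        std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count());
    hooks.access_log(id, resp.status, duration_us);   // access log entry
    // Slow query log: query handlers would check duration/errors here (omitted).
    hooks.count_request();                            // Prometheus counter increment
    resp.request_id = id;                             // response carries X-Request-Id
    return resp;
}
```

A production version would also catch exceptions thrown by the handler and log them as 5xx, so that failures never bypass the access log.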

Logging is asynchronous: util::Logger buffers entries and writes them from a background thread, so the request hot path never blocks on log I/O.
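The underlying pattern is a producer/consumer queue. The sketch below is not zeptodb::util::Logger, whose internals the post does not show. It is a hypothetical `AsyncLogger` in which request threads take a short lock only to enqueue a string, and a single background thread swaps out the whole batch and performs the slow writes without holding the lock.

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <utility>

// Hypothetical async logger: request threads enqueue, one thread writes.
class AsyncLogger {
public:
    // `sink` performs the actual (possibly slow) write, e.g. to a file.
    explicit AsyncLogger(std::function<void(const std::string&)> sink)
        : sink_(std::move(sink)), worker_([this] { run(); }) {}

    ~AsyncLogger() {
        {
            std::lock_guard<std::mutex> lock(mu_);
            stopping_ = true;
        }
        cv_.notify_one();
        worker_.join();  // the worker drains pending entries before exiting
    }

    // Called on the request path: a short critical section, no I/O.
    void log(std::string line) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            queue_.push_back(std::move(line));
        }
        cv_.notify_one();
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mu_);
        for (;;) {
            cv_.wait(lock, [this] { return stopping_ || !queue_.empty(); });
            if (queue_.empty() && stopping_) return;
            // Take the whole batch, then write it without holding the lock.
            std::deque<std::string> batch;
            batch.swap(queue_);
            lock.unlock();
            for (const auto& line : batch) sink_(line);
            lock.lock();
        }
    }

    std::function<void(const std::string&)> sink_;
    std::mutex mu_;
    std::condition_variable cv_;
    std::deque<std::string> queue_;
    bool stopping_ = false;
    std::thread worker_;  // declared last so it starts after the other members exist
};
```

Two caveats apply to this sketch: the queue is unbounded, so a sink that falls behind grows memory without limit (a real logger would cap the queue or drop entries), and file rotation, which the post mentions, would live inside the sink.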

Structured JSON logs

Every request is logged as JSON with its request ID, duration, status, and client identity. Machine-parseable and grep-friendly.

Slow query detection

Queries over 100 ms, and queries that fail, are logged automatically with SQL, duration, and row count. Sort by duration to find bottlenecks.

Request tracing

X-Request-Id in every response. Clients log it, operators grep for it, and a client-side error maps to its server-side log entry in a single lookup.

Prometheus metrics

Request counter and active session gauge for real-time dashboards. Integrates with existing Helm ServiceMonitor.

