Python Ecosystem Integration: NumPy, Pandas, and Polars

Quants prototype in Jupyter notebooks with pandas and Polars, then need to move data into ZeptoDB for production-scale real-time queries. The zepto_py package bridges this gap with zero-copy Arrow paths and vectorized batch ingestion — no row-by-row Python iteration.

Package Architecture

The zepto_py/ package provides six modules:

Module	Purpose
`connection.py`	HTTP client — `query_pandas()`, `query_polars()`, `ingest_pandas()`
`dataframe.py`	Standalone converters — `from_pandas()`, `from_polars()`, `to_pandas()`, `to_polars()`
`arrow.py`	`ArrowSession` — zero-copy Arrow ingest/export, DuckDB registration
`streaming.py`	`StreamingSession` — batch ingest with progress callbacks
`utils.py`	Dependency inspector — `check_dependencies()`, `versions()`
`__init__.py`	Public API surface

Key API

import zepto_py as zeptodb

# Connect via HTTP
db = zeptodb.connect("localhost", 8123)
df = db.query_pandas("SELECT sym, avg(price) FROM trades GROUP BY sym")

# Ingest from pandas
db.ingest_pandas(ticks_df)

# Ingest from Polars (Arrow path, no intermediate copy)
db.ingest_polars(ticks_pl)

# ArrowSession — zero-copy interop
from zepto_py import ArrowSession
# `pipeline` is an already-created ZeptoDB pipeline handle (setup not shown)
sess = ArrowSession(pipeline)
sess.ingest_arrow(arrow_table)           # pa.Table → ZeptoDB
tbl  = sess.to_arrow(symbol=1)           # ZeptoDB → pa.Table (zero-copy)
conn = sess.to_duckdb(symbol=1)          # Register as DuckDB table
pl_df = sess.to_polars_zero_copy(sym=1)  # ZeptoDB → Polars via Arrow

# StreamingSession — high-throughput with progress
stream = zeptodb.StreamingSession(pipeline, batch_size=50_000)
stream.ingest_pandas(ticks_df, show_progress=True)
# Ingested 1,000,000 rows in 1.82s (549,451 rows/sec)
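
The batching and progress reporting behind a streaming ingest can be sketched in plain Python. This is only an illustration under stated assumptions, not the StreamingSession implementation: `batch_bounds`, `stream_ingest`, `format_summary`, and the `ingest_fn` callback are hypothetical names introduced here.

```python
import time

def batch_bounds(n_rows, batch_size):
    """Yield (start, stop) row ranges that cover n_rows in order."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)

def stream_ingest(n_rows, batch_size, ingest_fn, on_progress=None):
    """Call ingest_fn once per batch and report progress after each one."""
    started = time.perf_counter()
    done = 0
    for start, stop in batch_bounds(n_rows, batch_size):
        ingest_fn(start, stop)  # hypothetical: one bulk call per batch
        done = stop
        if on_progress is not None:
            on_progress(done, n_rows)
    return done, time.perf_counter() - started

def format_summary(rows, seconds):
    """Format a throughput line in the style of the sample output above."""
    return f"Ingested {rows:,} rows in {seconds:.2f}s ({rows / seconds:,.0f} rows/sec)"

progress = []
rows, elapsed = stream_ingest(
    120_000,
    50_000,
    ingest_fn=lambda start, stop: None,  # placeholder for the real ingest call
    on_progress=lambda done, total: progress.append((done, total)),
)
print(format_summary(1_000_000, 1.82))
```

The last batch is simply shorter than `batch_size`, so no padding or special casing is needed.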

Arrow as the Universal Intermediary

All Polars paths go through Arrow (Polars is Arrow-native). This gives:

                    ┌─────────┐
  pandas ──→ numpy ─┤         ├─→ ZeptoDB C++
  Polars ──→ Arrow ─┤  zepto  │
  DuckDB ──→ Arrow ─┤   _py   ├─→ pa.Table
  NumPy  ──→ direct ┤         ├─→ RecordBatchReader
                    └─────────┘

True zero-copy between Polars and ZeptoDB via Arrow buffers
Compatibility with DuckDB, Ray, Spark via RecordBatchReader
Timestamps preserve nanosecond precision and UTC timezone
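
To illustrate why integer timestamps keep nanosecond precision, here is a small NumPy-only sketch. It does not use zepto_py or Arrow. It only shows that a `datetime64[ns]` column can be reinterpreted as int64 nanoseconds since the epoch without copying. NumPy datetimes carry no timezone, so treating them as UTC is an assumption of this example.

```python
import numpy as np

# One timestamp with a full nine-digit fractional second.
ts = np.array(["2024-01-02T14:30:00.123456789"], dtype="datetime64[ns]")

# Reinterpret the same buffer as int64 nanoseconds since the Unix epoch.
epoch_ns = ts.view(np.int64)

# Reinterpreting back yields the original values, bit for bit.
roundtrip = epoch_ns.view("datetime64[ns]")

# The sub-second part survives as an exact integer.
frac_ns = int(epoch_ns[0]) % 1_000_000_000

# A float64 has a 53-bit mantissa, while epoch nanoseconds for current
# dates exceed 2**53, so a float column cannot hold every nanosecond value.
exceeds_float_range = int(epoch_ns[0]) > 2**53
```

Because both `view` calls reuse the same buffer, the conversion costs nothing regardless of column length.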

Vectorized Batch Ingestion

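The previous design used iterrows(), which allocates a new pandas Series (plus boxed Python objects for its values) on every row, so ingesting n rows costs O(n) interpreter-level allocations. The current implementation uses vectorized NumPy batch extraction: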

from_polars(df)  →  df.slice()           (zero-copy view)
                 →  Series.to_numpy()    (Arrow buffer direct reference)
                 →  pipeline.ingest_batch(syms, prices, vols)  ← single C++ call

Key properties:

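df.slice() in Polars returns a view, with no data copy
Series.to_numpy() for numeric types can return a view of the Arrow buffer when the column has no nulls and a single chunk; otherwise Polars copies the data
ingest_batch() is a single C++ call with a tight loop, so there is no per-row Python interpreter or GIL overhead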
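
The properties above can be sketched with NumPy arrays standing in for Polars columns. `RecordingPipeline` is a hypothetical stand-in that only counts calls; it is not the real C++ pipeline, and `ingest_columns` is an illustrative helper rather than part of zepto_py.

```python
import numpy as np

class RecordingPipeline:
    """Hypothetical stand-in for the C++ pipeline; it only records calls."""

    def __init__(self):
        self.calls = 0
        self.rows = 0

    def ingest_batch(self, syms, prices, vols):
        self.calls += 1
        self.rows += len(syms)

def ingest_columns(pipeline, syms, prices, vols, offset=0, length=None):
    """Pass a window of whole columns to the pipeline in one call."""
    stop = len(syms) if length is None else offset + length
    window = slice(offset, stop)
    # Basic slicing of a NumPy array returns a view, not a copy.
    pipeline.ingest_batch(syms[window], prices[window], vols[window])

n = 1_000_000
syms = np.ones(n, dtype=np.int64)
prices = np.arange(n, dtype=np.int64)
vols = np.full(n, 100, dtype=np.int64)

pipe = RecordingPipeline()
ingest_columns(pipe, syms, prices, vols)

# The sliced window shares memory with the source column.
is_view = np.shares_memory(prices, prices[10:20])
```

One call covers all one million rows, whereas a row loop would cross the Python/C++ boundary once per row.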

Performance Impact

Method	Time for 1M rows	Speedup vs. iterrows()
`iterrows()` (old)	~30–60s	1×
`from_polars()` vectorized	~0.3s	~100–200×
`from_pandas()` vectorized	~0.5s	~60–120×

Float Price Handling

Real-world DataFrames usually carry prices as float64 (e.g., 150.25), while the C++ pipeline stores them as int64 fixed-point values. Two mechanisms handle the conversion:

Python side: price_scale parameter (e.g., price_scale=100.0 stores prices in cents, so 150.25 becomes 15025)
C++ side: ingest_float_batch(syms, prices_f64, vols_f64, price_scale) accepts float64 NumPy arrays and applies the scale in C++, avoiding Python-side conversion overhead
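
A minimal NumPy sketch of the Python-side conversion follows. The helper names `to_fixed_point` and `from_fixed_point` are hypothetical, and rounding to the nearest integer is an assumption of this example: the source does not say whether zepto_py rounds or truncates.

```python
import numpy as np

def to_fixed_point(prices, price_scale=100.0):
    """Scale float prices and round to the nearest int64 tick."""
    scaled = np.asarray(prices, dtype=np.float64) * price_scale
    # Rounding matters: a product such as 0.29 * 100.0 can land just below
    # the intended integer in binary floating point, and truncation would
    # then drop a whole tick.
    return np.rint(scaled).astype(np.int64)

def from_fixed_point(ticks, price_scale=100.0):
    """Convert int64 ticks back to float prices."""
    return np.asarray(ticks, dtype=np.int64) / price_scale

prices = np.array([150.25, 0.29, 1.15])
ticks = to_fixed_point(prices)        # prices expressed in cents
restored = from_fixed_point(ticks)
```

Choosing price_scale fixes the smallest representable price increment, so instruments quoted to four decimals would need a scale of 10000.0 rather than 100.0.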

Graceful Degradation

All modules use optional imports guarded with HAS_* flags:

No pyarrow? The Arrow path falls back to row iteration
No pandas? Polars-only workflows still work
No hard dependency requirements at import time

Interoperability Matrix

ZeptoDB →	pandas	Polars	NumPy	Arrow	DuckDB
via HTTP	`query_pandas()`	`query_polars()`	`query_numpy()`	—	—
via pipeline	`to_pandas()`	`to_polars()`	`get_column()`	`to_arrow()`	`to_duckdb()`
zero-copy	numpy view	via Arrow	direct	yes	Arrow register

→ ZeptoDB	pandas	Polars	Arrow	generator
batch	`from_pandas()`	`from_polars()`	`ingest_arrow()`	`ingest_iter()`
streaming	`StreamingSession`	`StreamingSession`	`ArrowSession`	`ingest_iter()`

60–200× faster ingestion

Vectorized numpy batch extraction replaces row-by-row Python iteration. Single C++ call per batch.

Zero-copy Arrow paths

Polars → Arrow → ZeptoDB with no data copies. DuckDB registers Arrow tables directly.

Graceful degradation

Optional imports mean the package works with any subset of pandas, Polars, pyarrow, DuckDB.

208 tests, 100% pass

Comprehensive coverage: Arrow roundtrips, streaming, fast ingest, pandas/Polars/DuckDB integration.