
Python Ecosystem Integration: NumPy, Pandas, and Polars

Quants prototype in Jupyter notebooks with pandas and Polars, then need to move data into ZeptoDB for production-scale real-time queries. The zepto_py package bridges this gap with zero-copy Arrow paths and vectorized batch ingestion — no row-by-row Python iteration.


The zepto_py/ package provides six modules:

Module          Purpose
connection.py   HTTP client: query_pandas(), query_polars(), ingest_pandas()
dataframe.py    Standalone converters: from_pandas(), from_polars(), to_pandas(), to_polars()
arrow.py        ArrowSession: zero-copy Arrow ingest/export, DuckDB registration
streaming.py    StreamingSession: batch ingest with progress callbacks
utils.py        Dependency inspector: check_dependencies(), versions()
__init__.py     Public API surface

# ticks_df, ticks_pl, arrow_table, df, and pipeline are assumed to exist already.
import zepto_py as zeptodb
from zepto_py import ArrowSession

# Connect via HTTP
db = zeptodb.connect("localhost", 8123)
df = db.query_pandas("SELECT sym, avg(price) FROM trades GROUP BY sym")

# Ingest from pandas
db.ingest_pandas(ticks_df)

# Ingest from Polars (Arrow path, zero overhead)
db.ingest_polars(ticks_pl)

# ArrowSession: zero-copy interop
sess = ArrowSession(pipeline)
sess.ingest_arrow(arrow_table)            # pa.Table → ZeptoDB
tbl = sess.to_arrow(symbol=1)             # ZeptoDB → pa.Table (zero-copy)
conn = sess.to_duckdb(symbol=1)           # Register as DuckDB table
pl_df = sess.to_polars_zero_copy(sym=1)   # ZeptoDB → Polars via Arrow

# StreamingSession: high-throughput ingest with progress reporting
sess = zeptodb.StreamingSession(pipeline, batch_size=50_000)
sess.ingest_pandas(df, show_progress=True)
# Ingested 1,000,000 rows in 1.82s (549,451 rows/sec)
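
The batching behind a streaming ingest can be pictured with a small standard-library sketch. It is not the StreamingSession implementation: batch_bounds, ingest_in_batches, and the lambda standing in for the C++ batch call are hypothetical names, and the only thing shown is how a 1,000,000-row frame splits into 50,000-row batches with a progress callback after each one.

```python
import time
from typing import Callable, Iterator, Tuple


def batch_bounds(n_rows: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, stop) row offsets that cover n_rows in batch_size chunks."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)


def ingest_in_batches(
    n_rows: int,
    batch_size: int,
    ingest_fn: Callable[[int, int], int],
    on_progress: Callable[[int, int], None] = lambda done, total: None,
) -> Tuple[int, float]:
    """Call ingest_fn once per batch; return (rows ingested, elapsed seconds)."""
    t0 = time.perf_counter()
    done = 0
    for start, stop in batch_bounds(n_rows, batch_size):
        done += ingest_fn(start, stop)  # hypothetical stand-in for one batch call
        on_progress(done, n_rows)       # report cumulative progress
    return done, time.perf_counter() - t0


# Stand-in ingest function: pretends every row in [start, stop) was accepted.
progress_log = []
rows, elapsed = ingest_in_batches(
    1_000_000,
    50_000,
    ingest_fn=lambda start, stop: stop - start,
    on_progress=lambda done, total: progress_log.append(done),
)
print(f"Ingested {rows:,} rows in {len(progress_log)} batches")
```

With batch_size=50_000, one million rows become 20 batch calls, so the Python-level loop runs 20 times rather than once per row.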

All Polars paths go through Arrow, because Polars is Arrow-native:

                  ┌─────────┐
pandas ──→ numpy ─┤         ├─→ ZeptoDB C++
Polars ──→ Arrow ─┤  zepto  │
DuckDB ──→ Arrow ─┤  _py    ├─→ pa.Table
NumPy  ──→ direct ┤         ├─→ RecordBatchReader
                  └─────────┘

This gives:

  • True zero-copy between Polars and ZeptoDB via Arrow buffers
  • Compatibility with DuckDB, Ray, Spark via RecordBatchReader
  • Timestamps preserve nanosecond precision and UTC timezone

The previous design used iterrows(), which builds a Python Series object for every row, so ingesting n rows costs O(n) Python object allocations. The current implementation uses vectorized numpy batch extraction:

from_polars(df) → df.slice()                                  (zero-copy view)
                → Series.to_numpy()                           (Arrow buffer direct reference)
                → pipeline.ingest_batch(syms, prices, vols)   ← single C++ call

Key properties:

  • df.slice() in Polars returns a view — no data copy
  • Series.to_numpy() on a numeric column with no nulls and a single chunk returns a view over the Arrow buffer; otherwise Polars copies
  • ingest_batch() is a single C++ call with a tight loop, so there is no per-row Python or GIL overhead

Method                      1M rows     Speedup
iterrows() (old)            ~30–60s     baseline
from_polars() vectorized    ~0.3s       ~100–200×
from_pandas() vectorized    ~0.5s       ~60–120×
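
The view-then-batch flow described above can be sketched with plain NumPy. This is an illustration under stated assumptions, not the from_polars() source: FakePipeline is a hypothetical stand-in for the C++ pipeline that only counts what it receives, and ingest_columns is a made-up helper name. It shows that slicing a column yields a view and that the pipeline is called once per batch, not once per row.

```python
import numpy as np


class FakePipeline:
    """Hypothetical stand-in for the C++ pipeline; it only counts what it receives."""

    def __init__(self) -> None:
        self.calls = 0
        self.rows = 0

    def ingest_batch(self, syms: np.ndarray, prices: np.ndarray, vols: np.ndarray) -> int:
        assert len(syms) == len(prices) == len(vols)
        self.calls += 1
        self.rows += len(syms)
        return len(syms)


def ingest_columns(pipeline, syms, prices, vols, batch_size: int) -> int:
    """Pass contiguous column slices to the pipeline, one call per batch."""
    total = 0
    for start in range(0, len(syms), batch_size):
        stop = start + batch_size
        # Basic slicing of a NumPy array returns a view, not a copy.
        total += pipeline.ingest_batch(
            syms[start:stop], prices[start:stop], vols[start:stop]
        )
    return total


n = 1_000_000
syms = np.ones(n, dtype=np.int64)
prices = np.arange(n, dtype=np.int64)
vols = np.full(n, 100, dtype=np.int64)

pipe = FakePipeline()
ingested = ingest_columns(pipe, syms, prices, vols, batch_size=50_000)
is_view = np.shares_memory(prices[:50_000], prices)  # True: the slice aliases the column
print(ingested, pipe.calls, is_view)
```

The Python loop body runs 20 times for one million rows; all per-row work would happen inside the batch call.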

Real DataFrames have float64 prices (e.g., 150.25). The C++ pipeline stores int64 (fixed-point). Two conversion mechanisms:

  1. Python side: price_scale parameter (e.g., price_scale=100.0 stores 150.25 as 15025 cents)
  2. C++ side: ingest_float_batch(syms, prices_f64, vols_f64, price_scale) — accepts float64 numpy arrays, applies scale in C++ (no Python overhead)
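
The scaling step can be sketched in NumPy as follows. The helper names to_fixed_point and from_fixed_point are hypothetical and this is not the library's code; it assumes the scale is applied by multiplication and that values are rounded to the nearest integer. The second sample price shows why rounding matters: 0.29 * 100.0 evaluates to 28.999999999999996 in float64, so plain truncation would store 28.

```python
import numpy as np


def to_fixed_point(prices: np.ndarray, price_scale: float) -> np.ndarray:
    """Scale float64 prices and round to the nearest int64 value."""
    return np.rint(prices * price_scale).astype(np.int64)


def from_fixed_point(ticks: np.ndarray, price_scale: float) -> np.ndarray:
    """Convert stored int64 values back to float64 prices."""
    return ticks.astype(np.float64) / price_scale


prices = np.array([150.25, 0.29], dtype=np.float64)

ticks = to_fixed_point(prices, price_scale=100.0)   # [15025, 29]
truncated = (prices * 100.0).astype(np.int64)       # [15025, 28], off by one cent
restored = from_fixed_point(ticks, price_scale=100.0)

print(ticks.tolist(), truncated.tolist(), restored.tolist())
```

Doing the same multiply-and-round inside C++ avoids allocating the intermediate scaled array on the Python side.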

All modules use optional imports guarded with HAS_* flags:

  • No pyarrow? Arrow path falls back to row-iteration
  • No pandas? Polars-only workflows still work
  • No hard dependency requirements at import time
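
One common way to build such flags is sketched below using only the standard library. The helper names has_module, pick_polars_ingest_path, and require are hypothetical; only the HAS_* naming comes from the text above, and the actual zepto_py guards may look different (for example, a try/except ImportError around each import).

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if `name` can be imported, without importing it."""
    return importlib.util.find_spec(name) is not None


# Flags are computed once at import time; a missing package does not raise here.
HAS_PANDAS = has_module("pandas")
HAS_POLARS = has_module("polars")
HAS_PYARROW = has_module("pyarrow")


def pick_polars_ingest_path(has_pyarrow: bool = HAS_PYARROW) -> str:
    """Choose the Arrow path when pyarrow is present, else row iteration."""
    return "arrow" if has_pyarrow else "row-iteration"


def require(flag: bool, package: str) -> None:
    """Raise only when a feature that needs `package` is actually called."""
    if not flag:
        raise ImportError(f"{package} is required for this feature")


print(pick_polars_ingest_path(True), pick_polars_ingest_path(False))
```

Deferring the ImportError to the call site is what lets a Polars-only environment import the package without pandas installed.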

ZeptoDB →      pandas          Polars          NumPy          Arrow       DuckDB
via HTTP       query_pandas()  query_polars()  query_numpy()  -           -
via pipeline   to_pandas()     to_polars()     get_column()   to_arrow()  to_duckdb()
zero-copy      numpy view      via Arrow       direct         yes         Arrow register

→ ZeptoDB      pandas            Polars            Arrow           generator
batch          from_pandas()     from_polars()     ingest_arrow()  ingest_iter()
streaming      StreamingSession  StreamingSession  ArrowSession    ingest_iter()

100–200× faster ingestion

Vectorized numpy batch extraction replaces row-by-row Python iteration. Single C++ call per batch.

Zero-copy Arrow paths

Polars → Arrow → ZeptoDB with no data copies. DuckDB registers Arrow tables directly.

Graceful degradation

Optional imports mean the package works with any subset of pandas, Polars, pyarrow, DuckDB.

208 tests, 100% pass

Comprehensive coverage: Arrow roundtrips, streaming, fast ingest, pandas/Polars/DuckDB integration.


Related: Arrow Flight Server → · Parquet on S3 → · Zero-Copy Python →