Quants prototype in Jupyter notebooks with pandas and Polars, then need to move data into ZeptoDB for production-scale real-time queries. The zepto_py package bridges this gap with zero-copy Arrow paths and vectorized batch ingestion — no row-by-row Python iteration.
The zepto_py/ package provides six modules:
| Module | Purpose |
|---|---|
| connection.py | HTTP client: query_pandas(), query_polars(), ingest_pandas() |
| dataframe.py | Standalone converters: from_pandas(), from_polars(), to_pandas(), to_polars() |
| arrow.py | ArrowSession: zero-copy Arrow ingest/export, DuckDB registration |
| streaming.py | StreamingSession: batch ingest with progress callbacks |
| utils.py | Dependency inspector: check_dependencies(), versions() |
| __init__.py | Public API surface |
```python
import zepto_py as zeptodb

# ticks_df, ticks_pl, arrow_table, and pipeline are assumed to exist already.

# Connect via HTTP
db = zeptodb.connect("localhost", 8123)
df = db.query_pandas("SELECT sym, avg(price) FROM trades GROUP BY sym")

# Ingest from pandas
db.ingest_pandas(ticks_df)

# Ingest from Polars (Arrow path, zero overhead)
db.ingest_polars(ticks_pl)

# ArrowSession: zero-copy interop
from zepto_py import ArrowSession
sess = ArrowSession(pipeline)
sess.ingest_arrow(arrow_table)            # pa.Table → ZeptoDB
tbl = sess.to_arrow(symbol=1)             # ZeptoDB → pa.Table (zero-copy)
conn = sess.to_duckdb(symbol=1)           # Register as DuckDB table
pl_df = sess.to_polars_zero_copy(sym=1)   # ZeptoDB → Polars via Arrow

# StreamingSession: high-throughput with progress
sess = zeptodb.StreamingSession(pipeline, batch_size=50_000)
sess.ingest_pandas(df, show_progress=True)
# Ingested 1,000,000 rows in 1.82s (549,451 rows/sec)
```

All Polars paths go through Arrow (Polars is Arrow-native). This gives:
```text
                   ┌─────────┐
pandas ──→ numpy  ─┤         ├─→ ZeptoDB C++
Polars ──→ Arrow  ─┤  zepto  │
DuckDB ──→ Arrow  ─┤  _py    ├─→ pa.Table
NumPy  ──→ direct ─┤         ├─→ RecordBatchReader
                   └─────────┘
```

The previous design used iterrows(), which allocates a Python object for every row it visits. The current implementation uses vectorized numpy batch extraction:
```text
from_polars(df)
  → df.slice()                                  (zero-copy view)
  → Series.to_numpy()                           (Arrow buffer direct reference)
  → pipeline.ingest_batch(syms, prices, vols)   ← single C++ call
```

Key properties:
- df.slice() in Polars returns a view, so no data is copied
- Series.to_numpy() for numeric types returns the Arrow buffer directly (provided the column has no nulls and sits in a single chunk; otherwise Polars copies)
- ingest_batch() is a single C++ call with a tight loop, so there is no GIL contention per row

| Method | 1M rows | Speedup |
|---|---|---|
| iterrows() (old) | ~30–60s | 1× |
| from_polars() vectorized | ~0.3s | ~100–200× |
| from_pandas() vectorized | ~0.5s | ~60–120× |
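
To make the contrast with row-by-row iteration concrete, here is a minimal sketch of column-wise batching. It uses plain NumPy arrays in place of a DataFrame. The ingest_batch function below is a hypothetical pure-Python stand-in for the pipeline's C++ entry point, and ingest_columns is an illustrative helper, not part of zepto_py.

```python
import numpy as np

def ingest_batch(syms, prices, vols):
    """Hypothetical stand-in for the single C++ call made per batch.

    Returns the number of rows it was handed.
    """
    assert len(syms) == len(prices) == len(vols)
    return len(syms)

def ingest_columns(syms, prices, vols, batch_size=50_000):
    """Feed whole column slices to ingest_batch(), one call per batch.

    Returns (rows_ingested, number_of_calls).
    """
    total_rows, calls = 0, 0
    for start in range(0, len(syms), batch_size):
        stop = start + batch_size
        # Basic slicing of a NumPy array yields a view, so no column
        # data is copied while forming each batch.
        total_rows += ingest_batch(
            syms[start:stop], prices[start:stop], vols[start:stop]
        )
        calls += 1
    return total_rows, calls

n = 120_000
syms = np.ones(n, dtype=np.int64)                  # symbol ids
prices = np.arange(n, dtype=np.int64) + 15_000     # fixed-point prices
vols = np.full(n, 100, dtype=np.int64)             # volumes

rows, calls = ingest_columns(syms, prices, vols)
print(rows, calls)  # 120000 rows handed over in 3 calls
```

The Python-level loop runs once per batch rather than once per row, which is where the per-row interpreter overhead goes away.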
Real DataFrames have float64 prices (e.g., 150.25). The C++ pipeline stores int64 (fixed-point). Two conversion mechanisms:
- price_scale parameter (e.g., 100.0 stores cents)
- ingest_float_batch(syms, prices_f64, vols_f64, price_scale): accepts float64 numpy arrays, applies scale in C++ (no Python overhead)

All modules use optional imports guarded with HAS_* flags.
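
The float64-to-int64 conversion driven by price_scale can be illustrated in plain Python, whose floats are the same IEEE 754 doubles as float64. This is only a sketch of the arithmetic. The source does not say whether the C++ side rounds or truncates, so rounding to nearest is an assumption here, and to_fixed_point is a hypothetical helper name.

```python
def to_fixed_point(prices, price_scale=100.0):
    """Scale float prices to integers, rounding to the nearest unit.

    Assumption: round-to-nearest. The real pipeline's rounding mode
    is not documented in this section.
    """
    return [round(p * price_scale) for p in prices]

prices = [150.25, 0.29]

fixed = to_fixed_point(prices)
print(fixed)  # [15025, 29]

# Why rounding matters: 0.29 * 100.0 evaluates to 28.999999999999996
# in binary floating point, so plain truncation loses a cent.
truncated = [int(p * 100.0) for p in prices]
print(truncated)  # [15025, 28]
```

With price_scale=100.0 the stored integer is a count of cents, so 150.25 becomes 15025.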
| ZeptoDB → | pandas | Polars | NumPy | Arrow | DuckDB |
|---|---|---|---|---|---|
| via HTTP | query_pandas() | query_polars() | query_numpy() | — | — |
| via pipeline | to_pandas() | to_polars() | get_column() | to_arrow() | to_duckdb() |
| zero-copy | numpy view | via Arrow | direct | yes | Arrow register |

| → ZeptoDB | pandas | Polars | Arrow | generator |
|---|---|---|---|---|
| batch | from_pandas() | from_polars() | ingest_arrow() | ingest_iter() |
| streaming | StreamingSession | StreamingSession | ArrowSession | ingest_iter() |
100–200× faster ingestion
Vectorized numpy batch extraction replaces row-by-row Python iteration. Single C++ call per batch.
Zero-copy Arrow paths
Polars → Arrow → ZeptoDB with no data copies. DuckDB registers Arrow tables directly.
Graceful degradation
Optional imports mean the package works with any subset of pandas, Polars, pyarrow, DuckDB.
208 tests, 100% pass
Comprehensive coverage: Arrow roundtrips, streaming, fast ingest, pandas/Polars/DuckDB integration.
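
The graceful degradation noted above rests on optional imports guarded by HAS_* flags. A minimal sketch of that pattern follows. The specific flag names and the require helper are illustrative assumptions, not the package's actual code.

```python
# Each optional dependency is imported inside try/except so that a
# missing package disables a feature instead of breaking the import.
try:
    import pandas as pd
    HAS_PANDAS = True
except ImportError:
    pd = None
    HAS_PANDAS = False

try:
    import polars as pl
    HAS_POLARS = True
except ImportError:
    pl = None
    HAS_POLARS = False

def require(flag, package):
    """Raise a clear error if an optional dependency is unavailable."""
    if not flag:
        raise ImportError(
            f"{package} is required for this feature; "
            f"install it with: pip install {package}"
        )

def rows_to_pandas(rows):
    """Build a pandas DataFrame, failing early if pandas is absent."""
    require(HAS_PANDAS, "pandas")
    return pd.DataFrame(rows)

print("pandas available:", HAS_PANDAS)
print("polars available:", HAS_POLARS)
```

Functions that need a given library check its flag first, so users with only a subset of the libraries installed see an actionable error only when they call a feature they cannot use.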