ARM Graviton: 766/766 Tests Passing, 3x Faster GROUP BY

Shipping a high-performance C++ database on a single architecture is table stakes. Shipping it on two — with SIMD vectorization, JIT compilation, and protocol parsers all working identically — is the real test. This post covers ZeptoDB’s full verification on AWS Graviton (aarch64): 766/766 tests passing, with some benchmarks that surprised us.


AWS Graviton instances offer ~20% cost savings over equivalent x86 instances. For a database that runs 24/7 in production, that’s a significant line item. But cost savings mean nothing if the database doesn’t work correctly — and ZeptoDB relies heavily on architecture-specific features:

  • Google Highway SIMD: vectorized column scans, aggregations, and filter evaluation
  • LLVM JIT: runtime code generation for query expressions
  • Feed parsers: bit-level protocol parsing for FIX and ITCH market data

All of these need to work correctly on ARM’s NEON instruction set, not just x86’s SSE/AVX.

| | x86 Instance | Graviton Instance |
|---|---|---|
| Architecture | x86_64 | aarch64 |
| CPU | Intel Xeon 6975P (8 vCPU) | Graviton (4 vCPU) |
| OS | Amazon Linux 2023 | Amazon Linux 2023 |
| Compiler | Clang 19.1.7 | Clang 19.1.7 |
| Highway SIMD | 1.2.0 | 1.2.0 |
| LLVM JIT | 19.1.7 | 19.1.7 |

RAM: 15 GB

Same compiler, same library versions, same OS. The software stack is identical; the hardware differs in CPU architecture and, per the table above, in vCPU count (8 vs 4).

CMake + Ninja in Release mode. 137/137 targets built successfully on both architectures — identical target count, no conditional compilation needed. The Python binding (zeptodb.cpython-39-aarch64-linux-gnu.so) also generated successfully.

Highway’s architecture abstraction is the key enabler here. Code written against Highway’s portable API compiles to NEON on ARM and AVX2/AVX-512 on x86 without #ifdef blocks.

| Test Suite | x86_64 | aarch64 |
|---|---|---|
| Unit Tests | 619/619 ✅ | 619/619 ✅ |
| Feed Tests | 21/21 ✅ | 21/21 ✅ |
| Migration Tests | 126/126 ✅ | 126/126 ✅ |
| Total | 766/766 | 766/766 |

Zero platform-specific failures. Every test — from basic column operations through JIT-compiled expressions to FIX/ITCH protocol parsing — produces identical results on both architectures.


We ran the standard micro-benchmarks on the Graviton instance:

| Metric | Graviton (aarch64) | x86 (previous) | Ratio |
|---|---|---|---|
| xbar GROUP BY (1M rows) | 7.99 ms | 24 ms | 3x faster |
| ITCH Parser | 17.18 ns/msg (58.2M msg/s) | 23.3 ns/msg (42.9M msg/s) | 1.36x faster |
| FIX Parser | 358.97 ns/msg (2.79M msg/s) | | |

The 3x faster xbar GROUP BY on Graviton was unexpected. The likely explanation is memory access pattern differences — Graviton’s memory subsystem handles the sequential column scan + hash aggregation pattern particularly well. HugePages were not configured on the Graviton instance (fallback warning present), so the actual gap may narrow with identical memory configurations.
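To make the "sequential column scan + hash aggregation" pattern concrete, here is a minimal sketch of an xbar-style GROUP BY. It assumes xbar follows the kdb+ convention of rounding a value down to a multiple of the bucket width. The names `xbar` and `xbar_sum` and the two-column layout are hypothetical illustrations, not ZeptoDB's implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Assumed semantics (kdb+ style): round ts down to a multiple of width.
// Valid for non-negative ts and width > 0.
inline std::int64_t xbar(std::int64_t ts, std::int64_t width) {
    return ts - (ts % width);
}

// Hypothetical aggregation: SUM(value) GROUP BY xbar(width, ts).
std::unordered_map<std::int64_t, double>
xbar_sum(const std::vector<std::int64_t>& ts,
         const std::vector<double>& value,
         std::int64_t width) {
    std::unordered_map<std::int64_t, double> sums;
    // Guard against columns of different length.
    const std::size_t n = ts.size() < value.size() ? ts.size() : value.size();
    for (std::size_t i = 0; i < n; ++i) {
        // Both columns are read sequentially; the hash table write is the
        // only non-sequential memory access in the loop.
        sums[xbar(ts[i], width)] += value[i];
    }
    return sums;
}
```

With a width of 5, timestamps 0, 1, and 4 land in bucket 0, timestamps 5 and 9 in bucket 5, and timestamp 10 in bucket 10. The loop body shows why this workload is sensitive to the memory subsystem: it does very little arithmetic per row and a lot of loads and stores.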

The ITCH parser at 58.2 million messages per second on ARM is notable — this is a bit-level binary protocol parser where every cycle counts.
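The throughput and ratio figures in the table follow directly from the per-message latencies. The helpers below are a small sketch of that arithmetic; the function names are ours and are not part of any benchmark harness.

```cpp
// Messages per second implied by a per-message latency in nanoseconds.
constexpr double msgs_per_second(double ns_per_msg) {
    return 1e9 / ns_per_msg;
}

// How many times faster `candidate` is than `baseline`.
// Both arguments are latencies in the same unit; lower is better.
constexpr double speedup(double baseline, double candidate) {
    return baseline / candidate;
}
```

For example, `msgs_per_second(17.18)` is about 58.2 million and `msgs_per_second(23.3)` is about 42.9 million, so `speedup(23.3, 17.18)` is about 1.36. For the GROUP BY, `speedup(24.0, 7.99)` is about 3.0.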


One test failure appeared during verification: FIXMessageBuilderTest.BuildLogon. The test constructed a message with FIXMessageBuilder("ZEPTO", "SERVER") but asserted 49=APEX — a SenderCompID mismatch.

// Before (wrong):
// Expected: 49=APEX
// Actual: 49=ZEPTO
// Fix: tests/feeds/test_fix_parser.cpp:165
// Changed expected value to match constructor argument
// 49=APEX → 49=ZEPTO

This was a pre-existing test typo that failed on both architectures. Not an ARM-specific issue.
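For readers unfamiliar with FIX, the sketch below shows why tag 49 must echo the sender passed to the builder. It is a hypothetical stand-in, not ZeptoDB's `FIXMessageBuilder`: the function names and the FIX.4.4 version string are assumptions, and fields a real Logon also carries (MsgSeqNum, SendingTime, EncryptMethod, HeartBtInt) are omitted for brevity.

```cpp
#include <cstdio>
#include <string>

constexpr char SOH = '\x01';  // FIX field delimiter

// Append one tag=value field followed by SOH.
inline void add_field(std::string& out, int tag, const std::string& value) {
    out += std::to_string(tag);
    out += '=';
    out += value;
    out += SOH;
}

// Hypothetical Logon builder: the sender argument becomes tag 49.
std::string build_logon(const std::string& sender, const std::string& target) {
    std::string body;
    add_field(body, 35, "A");     // MsgType = Logon
    add_field(body, 49, sender);  // SenderCompID
    add_field(body, 56, target);  // TargetCompID

    std::string msg;
    add_field(msg, 8, "FIX.4.4");                    // BeginString (assumed version)
    add_field(msg, 9, std::to_string(body.size()));  // BodyLength in bytes
    msg += body;

    // CheckSum: sum of all preceding bytes modulo 256, as three digits.
    unsigned sum = 0;
    for (unsigned char c : msg) sum += c;
    char buf[4];
    std::snprintf(buf, sizeof buf, "%03u", sum % 256);
    add_field(msg, 10, buf);
    return msg;
}
```

Calling `build_logon("ZEPTO", "SERVER")` produces a message containing `49=ZEPTO`, which is the value the corrected assertion checks for.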


ZeptoDB is fully portable across x86_64 and aarch64 with zero code changes:

  • Highway SIMD abstracts NEON vs SSE/AVX transparently
  • LLVM JIT generates native ARM code through the same IR pipeline
  • Feed parsers (FIX, ITCH, Binance) work identically — no endianness or alignment issues
  • Python bindings build and load correctly on aarch64

For deployment, this means teams can choose Graviton instances for the cost savings without giving up functionality: every test that passes on x86_64 also passes on aarch64.
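One common way to keep a binary parser free of endianness and alignment issues is to assemble integers byte by byte instead of casting the buffer pointer. ITCH encodes integers in big-endian order, while both x86_64 and aarch64 Linux run little-endian, so the decode step is needed on either platform. The helpers below are a sketch of that technique; the names are hypothetical and this is not necessarily how ZeptoDB's parsers are written.

```cpp
#include <cstdint>

// Read a big-endian 16-bit unsigned integer from a byte buffer.
// Byte-wise access is valid at any alignment and on any host byte order.
inline std::uint16_t read_be16(const unsigned char* p) {
    return static_cast<std::uint16_t>((p[0] << 8) | p[1]);
}

// Read a big-endian 32-bit unsigned integer from a byte buffer.
inline std::uint32_t read_be32(const unsigned char* p) {
    return (static_cast<std::uint32_t>(p[0]) << 24) |
           (static_cast<std::uint32_t>(p[1]) << 16) |
           (static_cast<std::uint32_t>(p[2]) << 8)  |
            static_cast<std::uint32_t>(p[3]);
}
```

Given the bytes `12 34 56 78` at an odd offset, `read_be32` returns `0x12345678` on both architectures. Optimizing compilers typically recognize this pattern and emit a single load plus a byte swap, so the portable form does not have to cost extra cycles.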

766/766 tests passing

Every test suite — unit, feed, migration — produces identical results on x86_64 and aarch64.

3x faster GROUP BY

xbar aggregation over 1M rows completed in 7.99 ms on Graviton vs 24 ms on x86, likely due to a memory subsystem advantage.

58.2M ITCH msg/s

Bit-level protocol parsing at full speed on ARM NEON. No performance penalty for portability.

Zero #ifdef blocks

Highway SIMD + LLVM JIT abstract the architecture. Same source, same behavior, different ISA.

