Skip to content

Feed Handler Performance Optimization Guide

  • FIX Parser: < 500ns per message
  • ITCH Parser: < 300ns per message
  • Multicast UDP: < 1μs latency
  • Total throughput: 5M+ messages/sec

Before (string copies):

// Slow: string copy per field
// Each extract_field() call materializes a std::string — one heap
// allocation + copy per FIX tag read, on the hot path.
std::string symbol = extract_field(msg, 55); // copy occurs (tag 55 = Symbol, per the zero-copy example below)
std::string price_str = extract_field(msg, 44); // second allocation + copy (tag 44 — presumably Price; confirm against parser)
double price = std::stod(price_str); // another copy

After (zero-copy):

// Fast: store pointer only — the parser hands back a view into the
// original message buffer, so no allocation and no copy occur.
const char* symbol_ptr;
size_t symbol_len;
parser.get_field_view(55, symbol_ptr, symbol_len); // tag 55 = Symbol, no copy
// Fetch the price field as its own view (tag 44), then parse it
// directly from the buffer — the original example mistakenly reused
// undeclared names (ptr/len) and parsed the price out of the symbol view.
const char* price_ptr;
size_t price_len;
parser.get_field_view(44, price_ptr, price_len); // tag 44 = Price, no copy
double price = parse_double_fast(price_ptr, price_len); // direct conversion without copy

Performance improvement: 2-3x faster


Before (scalar search):

// Slow: check 1 byte at a time
// Linear scan of [start, end) for the FIX field delimiter SOH (0x01).
// Returns a pointer to the first occurrence, or nullptr if none exists.
const char* find_soh(const char* start, const char* end) {
    for (const char* p = start; p < end; ++p) {
        if (*p == 0x01) {
            return p;
        }
    }
    return nullptr;
}

After (SIMD search):

// Fast: check 32 bytes at once (AVX2)
// Scans [start, end) for the FIX field delimiter SOH (0x01) and returns a
// pointer to the first occurrence, or nullptr if the range contains none.
// The original left the scalar remainder as a "..." placeholder, so matches
// in the final (end - start) % 32 bytes were silently dropped — implemented
// below. target("avx2") lets this translation unit build without a global
// -mavx2 flag (gcc/clang).
__attribute__((target("avx2")))
const char* find_soh_avx2(const char* start, const char* end) {
    const char SOH = 0x01;
    const __m256i soh_vec = _mm256_set1_epi8(SOH); // hoisted: loop-invariant
    while (start + 32 <= end) {
        __m256i chunk = _mm256_loadu_si256((const __m256i*)start);
        __m256i cmp = _mm256_cmpeq_epi8(chunk, soh_vec);
        int mask = _mm256_movemask_epi8(cmp);
        if (mask != 0) {
            // Lowest set bit = earliest matching byte within the chunk.
            return start + __builtin_ctz(mask);
        }
        start += 32;
    }
    // Scalar tail: fewer than 32 bytes remain.
    while (start < end) {
        if (*start == SOH) return start;
        ++start;
    }
    return nullptr;
}

Performance improvement: 5-10x faster (CPU dependent)

Compiler flags:

Terminal window
-mavx2 # Enable AVX2
-march=native # Optimize for current CPU

Before (allocate each time):

// Slow: repeated malloc/free
// Every message pays a dynamic allocation and deallocation; under memory
// pressure the allocator may take locks or enter the kernel (not a syscall
// on every call, but unbounded worst-case latency on the hot path).
void process_message() {
Tick* tick = new Tick(); // heap allocation — may lock / page-fault into the kernel
// ...
delete tick; // heap deallocation — same worst-case cost
}

After (Memory Pool):

// Fast: use pre-allocated pool
// All Tick storage is reserved up front; per-message "allocation" is
// just handing out the next pre-built slot.
TickMemoryPool pool(100000); // sized once at initialization — capacity of 100k ticks
void process_message() {
Tick* tick = pool.allocate(); // only pointer increment (per the pool's design — confirm against TickMemoryPool)
// ...
// no delete needed (reuse via pool reset)
}

Performance improvement: 10-20x faster (allocation cost eliminated)


Before (Mutex-based):

// Slow: lock contention
// Producer threads serialize on a single mutex; any contention stalls
// the feed-handler hot path.
std::mutex mutex;
std::queue<Tick> queue;
void push(const Tick& tick) {
std::lock_guard<std::mutex> lock(mutex); // wait for lock — blocks under contention
queue.push(tick); // copy into the queue (may also allocate a node/segment)
}

After (Lock-free):

// Fast: use CAS (Compare-And-Swap)
// NOTE(review): behavior when the ring is full (drop vs. overwrite vs.
// spin) depends on LockFreeRingBuffer — confirm against its implementation.
LockFreeRingBuffer<Tick> buffer(10000); // fixed capacity, storage pre-allocated once
void push(const Tick& tick) {
buffer.push(tick); // no lock, atomic only
}

Performance improvement: 3-5x faster (multi-threaded environment)


Before (standard library):

// Slow: strtod, strtol (locale checks, etc.)
// std::stod/std::stoll also require a null-terminated std::string
// (forcing a copy from the message buffer) and throw on parse failure.
double price = std::stod(str); // locale-aware, throwing
int64_t qty = std::stoll(str); // second full scan of the same text

After (custom implementation):

// Fast: direct conversion without locale
// Parses a non-negative decimal number ("123" or "123.45") from a
// non-null-terminated buffer of length len. No locale lookups, no
// allocation, no exceptions. The original stopped after the integer
// part (the decimal point was a TODO), so "123.45" parsed as 123.0 —
// fractional handling is implemented below.
// NOTE(review): no sign/exponent support — assumes unsigned fixed-point
// price fields; confirm against the feed specs in use.
double parse_double_fast(const char* str, size_t len) {
    double result = 0.0;
    size_t i = 0;
    // Integer part.
    for (; i < len && str[i] >= '0' && str[i] <= '9'; ++i) {
        result = result * 10.0 + (str[i] - '0');
    }
    // Fractional part, if a decimal point follows.
    if (i < len && str[i] == '.') {
        ++i;
        double scale = 0.1; // weight of the next fractional digit
        for (; i < len && str[i] >= '0' && str[i] <= '9'; ++i) {
            result += (str[i] - '0') * scale;
            scale *= 0.1;
        }
    }
    return result;
}

Performance improvement: 2-3x faster


Before (False Sharing):

// Slow: different threads use same cache line
// Both counters fit within the first 16 bytes, so two threads updating
// count1 and count2 independently still ping-pong the same cache line
// between cores (false sharing).
struct Stats {
std::atomic<uint64_t> count1; // bytes 0-7
std::atomic<uint64_t> count2; // bytes 8-15 (same cache line!)
};

After (Padding):

// Fast: each on separate cache lines
// alignas(64) places each counter at the start of its own 64-byte
// cache line, eliminating false sharing between writer threads.
// NOTE(review): consider std::hardware_destructive_interference_size
// (C++17 <new>) instead of the literal 64 — toolchain support permitting.
struct Stats {
alignas(64) std::atomic<uint64_t> count1; // bytes 0-63
alignas(64) std::atomic<uint64_t> count2; // bytes 64-127
};

Performance improvement: 2-4x faster in multi-threaded environments


| Item | Before | After | Improvement |
| --- | --- | --- | --- |
| FIX Parser | 800ns | 350ns | 2.3x |
| ITCH Parser | 450ns | 250ns | 1.8x |
| Symbol Mapping | 120ns | 50ns | 2.4x |
| Item (msg/sec) | Before | After | Improvement |
| --- | --- | --- | --- |
| FIX (single-threaded) | 1.2M | 2.8M | 2.3x |
| ITCH (single-threaded) | 2.2M | 4.0M | 1.8x |
| ITCH (4 threads) | 6.0M | 12.0M | 2.0x |
| Item | Before (malloc) | After (Pool) | Improvement |
| --- | --- | --- | --- |
| Allocation time | 150ns | 8ns | 18.7x |
| Deallocation time | 180ns | 0ns | — (eliminated) |

# Option A: tune for the exact machine doing the compiling (dev/colo box).
set(CMAKE_CXX_FLAGS "-O3 -march=native -mtune=native")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -mavx2 -mavx512f")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -flto") # Link-Time Optimization
# -ffast-math relaxes IEEE semantics (reassociation, no NaN checks) —
# verify this is acceptable for price arithmetic before enabling.
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -ffast-math") # FP optimization
# Option B: portable deployment baseline. NOTE: this line RESETS the
# flags set above — choose ONE of the two options, not both.
set(CMAKE_CXX_FLAGS "-O3 -march=x86-64-v3")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -mavx2")
# Exclude AVX-512 (not all instances support it)

Terminal window
# Pin to core 0
taskset -c 0 ./feed_handler
# Or in code
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(0, &cpuset);
pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);
Terminal window
# Feed Handler 1: cores 0-1
taskset -c 0-1 ./feed_handler_nasdaq &
# Feed Handler 2: cores 2-3
taskset -c 2-3 ./feed_handler_cme &
# ZeptoDB Pipeline: cores 4-7
taskset -c 4-7 ./zepto_server &

Terminal window
# Run and allocate memory on NUMA node 0
numactl --cpunodebind=0 --membind=0 ./feed_handler
#include <numa.h>
// Allocate memory on NUMA node 0
void* buffer = numa_alloc_onnode(size, 0);

Terminal window
# Increase receive buffer (prevent packet loss)
sudo sysctl -w net.core.rmem_max=134217728
sudo sysctl -w net.core.rmem_default=134217728
Terminal window
# Pin NIC IRQ to core 0
echo 1 > /proc/irq/IRQ_NUM/smp_affinity
Terminal window
# Performance mode (maximum Turbo Boost)
sudo cpupower frequency-set -g performance

Terminal window
# Profile for 10 seconds
perf record -F 999 -g ./feed_handler
# Analyze results
perf report
Terminal window
# Generate Flame Graph
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg
Terminal window
# HPC Performance Characterization
vtune -collect hpc-performance ./feed_handler
vtune -report hotspots

  • Zero-copy parsing
  • SIMD (AVX2 minimum)
  • Memory Pool
  • Lock-free data structures
  • Fast number parsing
  • Cache-line alignment
  • CPU pinning (cores 0-1)
  • NUMA awareness
  • Huge pages (2MB)
  • IRQ affinity
  • Kernel bypass (DPDK)
  • perf profile
  • Flame Graph
  • Cache miss analysis
  • Branch prediction analysis

  • ✅ FIX Parser: 350ns (target: 500ns)
  • ✅ ITCH Parser: 250ns (target: 300ns)
  • ✅ Throughput: 12M msg/sec (target: 5M)
  • ✅ End-to-end: < 1μs
  • ✅ Jitter: < 100ns (99.9%)
  • ✅ Packet loss: < 0.001%

Conclusion: Production ready ✅