High-Frequency Data Processing on Dedicated Servers

Updated on March 23, 2026 by Sam

High-frequency data processing has a precise definition: systems that must ingest, process, and act on data streams at rates that exhaust the capacity of typical web hosting infrastructure. Financial market data feeds arriving at 100,000 updates per second, industrial sensor networks transmitting telemetry from thousands of devices simultaneously, and real-time aggregation pipelines that must reduce millions of events per minute to queryable summaries all require dedicated bare metal hardware, for reasons that go beyond simple CPU capacity.

Table of Contents

- Why Bare Metal Matters for High-Frequency Workloads
- Use Case 1: Financial Market Data Feeds
- Use Case 2: Industrial Sensor Networks
- Use Case 3: Real-Time Aggregation Pipelines
- Hardware Tuning for High-Frequency Workloads on Linux
- Storage for High-Frequency Data
- InMotion Hosting's Dedicated Infrastructure for High-Frequency Workloads

Why Bare Metal Matters for High-Frequency Workloads

Virtualized infrastructure introduces non-deterministic latency at the worst possible points in a high-frequency processing pipeline. The hypervisor's scheduler determines when a VM's virtual CPUs execute, and under load, a VM competing with other tenants for physical CPU time experiences scheduling delays of 1-10ms. For web applications, 5ms of scheduling jitter is invisible. For a financial data feed processor that must react to market events in under 1ms, 5ms of scheduling jitter is a disqualifying problem.

Bare metal dedicated servers eliminate the hypervisor layer entirely: your processes run directly on the physical CPU, with no scheduling intermediary. Combined with Linux real-time kernel options, CPU affinity pinning, and NUMA-aware memory allocation, dedicated servers can achieve sub-millisecond processing latency for high-frequency workloads that virtualized infrastructure cannot reliably match.
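On Linux, the CPU affinity pinning mentioned above can be applied from user space with no extra tooling. A minimal sketch using the standard library's `os.sched_setaffinity` (the helper name and core numbers are illustrative):

```python
import os

def pin_to_cores(cores):
    """Restrict the calling process to the given CPU cores (Linux only).

    Returns the affinity set actually in effect afterwards.
    """
    os.sched_setaffinity(0, cores)  # pid 0 means the current process
    return os.sched_getaffinity(0)

# Example: pin the current process to a single core so the scheduler
# never migrates it (and its cache working set) to another core.
# pin_to_cores({0})
```

The same call works per-thread when invoked from inside the thread, which is how a network receive thread and its processing threads can each be tied to their own core.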
AMD's EPYC processor architecture documentation notes that the 4545P's chiplet design provides consistent memory access latency across all cores, which matters for NUMA-sensitive high-frequency workloads where memory access patterns can dominate processing time.

Use Case 1: Financial Market Data Feeds

Financial data providers (Bloomberg, Refinitiv, CME Group) publish market data at rates that require dedicated processing infrastructure. An equities feed during active trading can deliver 50,000-500,000 updates per second across thousands of instruments.

Processing requirements:

- Low-latency network stack: Kernel bypass networking (DPDK, RDMA) eliminates TCP stack overhead for the most latency-sensitive implementations; standard kernel networking is sufficient for most use cases below 1 million messages/second.
- Lock-free data structures: Traditional mutex-based queues introduce contention at high message rates; lock-free ring buffers allow producer and consumer threads to operate without blocking.
- CPU affinity: Pin the network receive thread and processing threads to specific CPU cores to eliminate scheduling variability.

A basic Python implementation of a high-throughput message queue using multiprocessing:

```python
import multiprocessing as mp
import time
from queue import Empty, Full

class HighFrequencyProcessor:
    def __init__(self, num_workers=8):
        self.queue = mp.Queue(maxsize=100000)
        self.results = mp.Queue()
        self.workers = []
        # Pin workers to specific cores for consistent latency
        for i in range(num_workers):
            p = mp.Process(
                target=self._worker,
                args=(self.queue, self.results, i),
                daemon=True
            )
            p.start()
            self.workers.append(p)

    def _worker(self, queue, results, worker_id):
        # Set CPU affinity if psutil is available
        try:
            import psutil
            psutil.Process().cpu_affinity([worker_id % mp.cpu_count()])
        except ImportError:
            pass
        while True:
            try:
                message = queue.get(timeout=0.001)
            except Empty:
                continue
            results.put(self._process_message(message))

    def _process_message(self, message):
        # Application-specific processing logic
        return {
            'timestamp': time.time_ns(),
            'symbol': message.get('symbol'),
            'price': message.get('price'),
            'processed': True
        }

    def ingest(self, message):
        try:
            self.queue.put_nowait(message)
            return True
        except Full:
            # Queue full - implement backpressure or drop strategy
            return False
```

For implementations where microsecond latency matters, Rust is the language of choice on Linux. Its ownership model eliminates garbage collection pauses that would otherwise introduce unpredictable latency spikes at the worst moments. LMAX Disruptor's ring buffer pattern provides a proven lock-free queue architecture, with open source implementations available in Java (the reference implementation) and Rust.
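Python's GIL rules out genuinely lock-free execution, but the Disruptor's core idea is worth sketching: a fixed-size ring indexed by monotonically increasing sequence numbers, with a power-of-two capacity so slot lookup is a cheap bitwise AND. A single-producer, single-consumer sketch (class and method names are illustrative; a real implementation depends on atomic operations and memory barriers):

```python
class SpscRingBuffer:
    """Single-producer single-consumer ring buffer, Disruptor-style sketch.

    Sequence numbers grow without bound; `seq & mask` maps a sequence to
    its slot. The producer may run at most `capacity` sequences ahead of
    the consumer.
    """

    def __init__(self, capacity=1024):
        assert capacity & (capacity - 1) == 0, "capacity must be a power of two"
        self._mask = capacity - 1
        self._slots = [None] * capacity
        self._head = 0  # next sequence the producer will claim
        self._tail = 0  # next sequence the consumer will read

    def try_publish(self, item):
        if self._head - self._tail > self._mask:
            return False  # full: consumer has not caught up
        self._slots[self._head & self._mask] = item
        self._head += 1
        return True

    def try_consume(self):
        if self._tail == self._head:
            return None  # empty
        item = self._slots[self._tail & self._mask]
        self._tail += 1
        return item
```

Because each index is written by exactly one side (head by the producer, tail by the consumer), neither thread ever blocks the other; that single-writer discipline is what the Disruptor pattern formalizes.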
Go is a practical alternative for teams that need near-real-time throughput with simpler concurrency primitives; its goroutine scheduler handles thousands of concurrent message handlers without the manual thread management Python requires.

Use Case 2: Industrial Sensor Networks

IoT sensor networks from manufacturing equipment, smart grid infrastructure, or environmental monitoring systems generate high-volume telemetry that must be ingested, validated, and aggregated in real time. A typical industrial IoT deployment might include 10,000 sensors transmitting readings every second: 10,000 messages/second sustained, with bursts during anomaly detection events. Processing each message involves timestamp normalization, unit conversion, range validation, and aggregation into time-series storage.

InfluxDB is the standard time-series database for high-frequency sensor data. Its line protocol format is optimized for high-throughput writes:

```shell
# Write multiple points in a single HTTP request (batch writes)
curl -i -XPOST 'http://localhost:8086/write?db=sensors&precision=ns' \
  --data-binary '
sensor_data,facility=plant1,device=temp_sensor_001 temperature=72.4,humidity=45.2 1675000000000000000
sensor_data,facility=plant1,device=temp_sensor_002 temperature=71.8,humidity=44.9 1675000000000000001
sensor_data,facility=plant1,device=pressure_001 pressure=14.7,flow_rate=125.3 1675000000000000002'
```

Batch writes significantly outperform individual writes at high message rates.
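The same batched write can be produced programmatically. A sketch of a batching writer that accumulates line protocol points and flushes them in one POST, using only the standard library (the URL and database name mirror the curl example above; error handling is deliberately minimal):

```python
import urllib.request

def to_line(measurement, tags, fields, ts_ns):
    """Format one point in InfluxDB line protocol (simplified: assumes
    tag/field values need no escaping)."""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {ts_ns}"

class BatchWriter:
    """Accumulate points and flush them in batches over one HTTP request."""

    def __init__(self, url="http://localhost:8086/write?db=sensors&precision=ns",
                 batch_size=5000):
        self.url = url
        self.batch_size = batch_size
        self.buffer = []

    def add(self, measurement, tags, fields, ts_ns):
        self.buffer.append(to_line(measurement, tags, fields, ts_ns))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        body = "\n".join(self.buffer).encode()
        req = urllib.request.Request(self.url, data=body, method="POST")
        urllib.request.urlopen(req)  # raises on connection errors / bad status
        self.buffer.clear()
```

In a real pipeline, `flush()` would also run on a timer so points do not sit in a partially filled buffer during quiet periods.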
InfluxDB's documentation on write performance recommends batches of 5,000-10,000 points per write request for maximum throughput.

Kafka sits upstream of InfluxDB in most production sensor pipelines, acting as a durable message buffer that absorbs ingestion spikes and allows multiple consumers to process the same data stream for different purposes:

```shell
# Create a Kafka topic for sensor data with appropriate partitioning:
# 32 partitions (one per processing thread), replication factor 1
# (single-server deployment)
kafka-topics.sh --create \
  --topic sensor-readings \
  --partitions 32 \
  --replication-factor 1 \
  --bootstrap-server localhost:9092
```

32 partitions allow 32 parallel consumer threads to process sensor data simultaneously. On the Extreme server's 16-core EPYC (32 threads), this maps cleanly to maximum parallelism without over-subscription.

Use Case 3: Real-Time Aggregation Pipelines

Aggregation pipelines reduce high-velocity event streams to queryable summaries: page view counts per minute, transaction totals by hour, active user sessions by region. The challenge is computing these aggregations in real time while ingesting millions of raw events per hour.

Apache Flink and Apache Kafka Streams are the standard open source tools for streaming aggregation at scale. For single-server deployments on dedicated hardware, Kafka Streams is simpler to operate (no separate cluster required) while providing most of the same aggregation capabilities.
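Conceptually, a tumbling-window count keys each event by (page, window start), where the window start is the event timestamp rounded down to the window size. A minimal Python sketch of that bookkeeping (illustrative only; Kafka Streams maintains the equivalent counts in windowed state stores, as the Java example that follows shows):

```python
from collections import defaultdict

WINDOW_NS = 60_000_000_000  # 1-minute windows, in nanoseconds

def window_start(ts_ns):
    """Align a timestamp to the start of its tumbling window."""
    return ts_ns - (ts_ns % WINDOW_NS)

class TumblingWindowCounter:
    """Count events per (key, window) pair."""

    def __init__(self):
        self.counts = defaultdict(int)

    def add(self, key, ts_ns):
        self.counts[(key, window_start(ts_ns))] += 1

    def get(self, key, ts_ns):
        return self.counts[(key, window_start(ts_ns))]
```

Because windows never overlap, each event touches exactly one counter, which keeps per-event work constant regardless of event rate.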
A Kafka Streams aggregation pipeline in Java:

```java
StreamsBuilder builder = new StreamsBuilder();

// Read from input topic
KStream<String, PageViewEvent> pageViews = builder.stream("page-views");

// Aggregate into 1-minute tumbling windows
KTable<Windowed<String>, Long> viewCounts = pageViews
    .groupBy((key, value) -> value.getPageId())
    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
    .count(Materialized.as("page-view-counts"));

// Write aggregated results to output topic
viewCounts.toStream()
    .map((windowedKey, count) -> KeyValue.pair(
        windowedKey.key(),
        new AggregatedCount(windowedKey.window().startTime(), count)
    ))
    .to("page-view-aggregates");
```

State stores for windowed aggregations consume significant memory. A pipeline maintaining 1-hour rolling windows across 100,000 unique page IDs requires roughly 1-2GB of state per pipeline stage. The Extreme server's 192GB DDR5 RAM provides enough headroom to run multiple aggregation stages with generous state allocation without memory pressure.

Hardware Tuning for High-Frequency Workloads on Linux

Several Linux kernel and hardware configuration options specifically benefit high-frequency processing workloads.

CPU frequency scaling: High-frequency processing benefits from consistent CPU clock speeds.
Disable frequency scaling to prevent cores from running at reduced frequency between bursts:

```shell
# Set performance governor (run at maximum frequency always)
for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
  echo performance > $cpu
done

# Make persistent via cpupower
cpupower frequency-set -g performance
```

NUMA awareness: On chiplet architectures like the AMD EPYC 4545P's, memory access latency can vary depending on which NUMA node memory is allocated from relative to the accessing core. For latency-sensitive workloads, pin processing threads to cores within the same NUMA node as the memory they access:

```shell
# Check NUMA topology
numactl --hardware

# Run a process with NUMA affinity (bind to node 0 CPUs and memory)
numactl --cpunodebind=0 --membind=0 ./your_processor
```

Huge pages: The Linux kernel's default 4KB memory pages require many TLB entries for large working sets.
Enabling 2MB huge pages reduces TLB misses for memory-intensive processing:

```shell
# Allocate 512 huge pages (512 x 2MB = 1GB)
echo 512 > /proc/sys/vm/nr_hugepages

# Persist across reboots
echo "vm.nr_hugepages = 512" >> /etc/sysctl.conf
```

IRQ affinity: For high-throughput network processing, pin network interrupt handling to specific CPU cores to avoid cache thrashing when interrupts are handled on different cores:

```shell
# First identify NIC interrupt numbers
cat /proc/interrupts | grep eth0

# Set affinity (example: pin interrupt 23 to core 0)
echo 1 > /proc/irq/23/smp_affinity
```

Storage for High-Frequency Data

High-frequency workloads often generate substantial data volumes. A financial data feed processing 100,000 updates/second, storing each event at 200 bytes, generates 20MB/second, or about 1.7TB per day. InMotion Hosting's Extreme server includes 2x3.84TB NVMe SSDs, providing approximately 4 days of raw storage at this rate before archival is required.

For longer retention, configure a tiered storage strategy:

- Hot storage (NVMe): Last 48-72 hours of raw data, fully queryable
- Warm storage (object storage): 30-90 days, compressed, queryable with some latency
- Cold storage (archive): Beyond 90 days, compressed, slow retrieval

Apache Parquet format provides columnar compression that reduces financial and sensor time-series data to 10-20% of raw size while remaining queryable by analytical tools like Apache Spark, DuckDB, or ClickHouse.
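The retention arithmetic above generalizes to any feed. A quick sizing sketch (the function name is illustrative; numbers are from the example: 100,000 updates/s at 200 bytes on 7.68TB of NVMe):

```python
def retention_days(updates_per_sec, bytes_per_event, capacity_tb):
    """Days of raw retention before archival is required.

    Ignores filesystem overhead, indexes, and compression, so treat the
    result as an upper bound on raw retention.
    """
    bytes_per_day = updates_per_sec * bytes_per_event * 86_400
    return capacity_tb * 1e12 / bytes_per_day

# 100,000 updates/s x 200 bytes = 20MB/s = ~1.73TB/day
# On 2 x 3.84TB NVMe: about 4.4 days of raw retention
days = retention_days(100_000, 200, 7.68)
```

Running the same calculation with Parquet's 10-20% compression ratio shows why the warm tier stretches the same capacity to roughly 22-44 days of history.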
InMotion Hosting's Dedicated Infrastructure for High-Frequency Workloads

The Extreme server's combination of an AMD EPYC 4545P (16 cores, 32 threads), 192GB DDR5 ECC RAM, 2x3.84TB NVMe SSD, and a 3 Gbps base port speed (upgradeable to 10 Gbps) addresses the specific constraints of high-frequency data processing: CPU parallelism for concurrent message processing, memory bandwidth for large state stores, NVMe throughput for high-velocity writes, and network capacity for data ingestion from external sources.

The 3 Gbps base port is particularly relevant for sensor network deployments and financial feed aggregators where inbound data volume is sustained rather than bursty. Teams that need guaranteed throughput rather than burst headroom can add port speed in 1 Gbps increments.

Being bare metal eliminates hypervisor scheduling jitter, the property that makes dedicated servers specifically appropriate for latency-sensitive processing workloads that cloud VMs cannot reliably serve. For applications where processing latency is measured in microseconds rather than milliseconds, InMotion's dedicated server lineup provides the hardware foundation that high-frequency workloads require.

Get AMD Performance for Your Workload

InMotion's Extreme Dedicated Server pairs an AMD EPYC 4545P processor with 192GB DDR5 RAM and burstable 10Gbps bandwidth, built for streaming, APIs, and CRM applications that demand burst capacity. Choose fully managed hosting with Premier Care for expert administration or self-managed bare metal for complete control.
Explore the Extreme Plan