TL;DR
An open-source execution engine called SubMicro Engine reports sub-microsecond end-to-end latency, with a median of 890 ns and tight tail latency (p99 921 ns, p99.9 1,047 ns). The project combines C++17 and Rust, lock-free queues, SIMD vectorization and deterministic replay to target institutional-grade algorithmic trading research and engineering.
What happened
The developers of SubMicro Engine have published an execution engine that, by their own measurements, delivers sub-microsecond end-to-end trading decisions. The system is implemented in C++17 with Rust components and emphasizes a lock-free design using single-producer / single-consumer and multi-producer / single-consumer queues to avoid mutex contention. The pipeline has seven stages (market data, order book, signal processing, ML inference, strategy, risk checks, order send), and the team provides a per-stage breakdown whose listed figures total 931 ns, close to the reported 890 ns end-to-end median (per-stage medians need not sum exactly to the median of the end-to-end total). SIMD optimizations (AVX-512), zero-copy I/O (DMA, shared memory), pre-allocated memory pools and nanosecond timing via TSC/PTP are highlighted. Measurements were taken on an Intel Xeon Platinum 8280 @ 2.7GHz in an isolated, real-time-kernel, bare-metal setup with C-states and Turbo Boost disabled. The project also documents deterministic replay with SHA-256 verification and a changelog of optimization milestones.
Why it matters
- Sub-microsecond decision latency changes the scale at which automated strategies perceive and react to market events.
- Deterministic, bit-identical replay with cryptographic checksums can support reproducibility and post-trade auditability.
- Lock-free queues and zero-copy I/O reduce software-induced jitter, which is critical for consistent latency performance.
- An evidence-based changelog and specified measurement methodology help reviewers evaluate claimed performance under controlled conditions.
Key facts
- Median end-to-end latency reported: 890 nanoseconds; p99: 921 ns; p99.9: 1,047 ns.
- Seven-stage pipeline latency breakdown: Market Data 87 ns, Order Book 50 ns, Signal Processing 150 ns, ML Inference 400 ns, Strategy 200 ns, Risk Check 10 ns, Order Send 34 ns.
- Core languages: C++17 for hot paths and Rust for memory-safe components.
- Lock-free architecture with SPSC/MPSC queues and zero mutexes; project lists '0 locks used'.
- SIMD AVX-512 vectorization applied across hot paths; vectorized OFI reduced to 40 ns in v1.5.0.
- Zero-copy operations via DMA and shared memory; custom NIC driver work for Intel X710/X722 and Mellanox ConnectX-5/6 is documented.
- Measurements performed on Intel Xeon Platinum 8280 @ 2.7GHz, isolated core, Real-Time Linux kernel, bare metal, C-states OFF, Turbo Boost OFF; TSC jitter ±5 ns, PTP offset ±17 ns.
- Deterministic replay with SHA-256 manifests, fixed RNG seeds and timestamp-ordered events for bit-identical verification.
- Repository and live demo are presented as research and educational resources; footer explicitly states 'Not for Production Trading' while feature list also indicates 'Production Ready'.
- The project reports a live-simulation throughput figure of about 1.2M decisions per second in the demo.
What to watch next
- Independent third-party validation of the reported sub-microsecond latency under real exchange conditions: not confirmed in the source
- Public case studies or production deployments showing behavior under live market stress: not confirmed in the source
- Broader community audits or ports to alternative hardware/cloud environments: not confirmed in the source
Quick glossary
- AVX-512: A CPU instruction set extension that enables wide SIMD (single instruction, multiple data) operations to process multiple data elements in parallel for faster numeric computation.
- SPSC / MPSC queue: Single-producer single-consumer and multi-producer single-consumer lock-free queue designs used for passing data between threads with minimal synchronization overhead.
- Time Stamp Counter (TSC): A processor register that counts CPU cycles; TSC-based timestamps are often used for high-resolution timing measurements.
- Zero-copy I/O: Techniques that move data between device and application memory without intermediate copies to reduce latency and CPU overhead.
- Deterministic replay: A capability to reproduce program execution exactly, typically by recording inputs and timestamps and re-running with fixed seeds to obtain bit-identical outputs.
Reader FAQ
Is the engine open-source and available to inspect?
Yes. The project is presented with links to a public GitHub repository and documentation.
What latency numbers does the project report and under what conditions?
Median end-to-end latency 890 ns, p99 921 ns, p99.9 1,047 ns; measurements were made on an Intel Xeon Platinum 8280 @ 2.7GHz on bare metal with an RT kernel and power-management features disabled.
Does the system include risk controls?
Yes. The project documents atomic pre-trade checks, configurable position limits, and notional caps as part of its risk-management features.
Are production deployments or exchange listings documented?
Not confirmed in the source.
Which NICs and kernel-bypass approaches are supported?
The codebase documents custom NIC driver support for Intel X710/X722 and Mellanox ConnectX-5/6, Solarflare ef_vi integration and a DPDK/XDP-style kernel-bypass architecture.

Sources
- Ultra-Low-Latency Trading System
- Sub-microsecond (890 ns) trading execution research system
- Richard-Rose/SubMicroTrading: Ultra Low Latency …
- driven, hardware-accelerated, ultra-low-latency trading …