TL;DR

A developer revisited the claim that disk I/O is no longer the limiting factor for simple text-processing tasks. Sequential reads on the test machine were very fast (~1.6 GB/s cold cache, ~12.8 GB/s warm cache), yet common single-threaded word-count implementations fell far short of those rates until a hand-vectorized AVX2 routine reached 1.45 GB/s.

What happened

A developer re-ran and extended the experiments behind a recent claim that I/O has outpaced CPU-bound processing for typical interview-style problems. Measured sequential read speeds on the test machine reached about 1.6 GB/s on a cold cache and 12.8 GB/s on a warm cache. An existing C word-frequency counter, compiled with GCC 12 at -O3 -march=native, processed a 425 MB test file at only ~278 MB/s. Micro-optimizations that reduced branchiness, such as pre-lowercasing the input, improved throughput modestly (to ~330 MB/s with clang). The standard wc -w utility ran at roughly 245 MB/s on the same input. The author then implemented an explicitly vectorized AVX2 word counter using PMOVMSKB, bit-trick iteration (including __builtin_ffs), unrolled loops, and aligned buffers; the resulting single-threaded program reached about 1.45 GB/s on a warm cache and remained slower on a cold cache after kernel caches were dropped. The code is published on GitHub.
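
For concreteness, a minimal scalar word counter in the spirit of the baselines above might look like the following. This is an illustrative sketch, not the author's published code: it streams the file in 1 MiB chunks and counts whitespace-to-word transitions in a branchy inner loop, the pattern that stalls around a few hundred MB/s in the measurements above.

    #include <ctype.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        if (argc < 2) { fprintf(stderr, "usage: %s FILE\n", argv[0]); return 1; }
        FILE *f = fopen(argv[1], "rb");
        if (!f) { perror("fopen"); return 1; }
        static unsigned char buf[1 << 20];   /* 1 MiB read buffer */
        uint64_t words = 0;
        int in_space = 1;                    /* treat start of file as whitespace */
        size_t n;
        while ((n = fread(buf, 1, sizeof buf, f)) > 0) {
            for (size_t i = 0; i < n; i++) {
                int s = isspace(buf[i]);     /* branchy test, hard to auto-vectorize */
                if (s == 0 && in_space) words++;  /* count whitespace-to-word transitions */
                in_space = (s != 0);
            }
        }
        fclose(f);
        printf("%llu\n", (unsigned long long)words);
        return 0;
    }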

Why it matters

  • Sequential storage bandwidth has risen substantially, shifting pressure onto CPU processing for streaming workloads.
  • Branch-heavy scalar code can prevent compilers from auto-vectorizing, leaving performance on the table (see the branchless sketch after this list).
  • Hand-written vectorized code can approach I/O rates but requires careful instruction-level work and is nontrivial to produce and maintain.
  • Performance tuning choices (vectorization, alignment, loop unrolling, bit tricks) materially affect single-threaded throughput for common text tasks.
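
As a sketch of the branchiness point, the scalar inner loop shown earlier can be rewritten as straight-line data flow. The ASCII-only whitespace test is an assumption of this sketch, not the author's exact change, but the branch-free shape is what gives a compiler a chance to auto-vectorize:

    #include <stddef.h>
    #include <stdint.h>

    /* Branchless inner loop: the count advances by exactly 1 at each
     * whitespace-to-word transition, with no data-dependent branches. */
    static uint64_t count_words_branchless(const unsigned char *buf, size_t n) {
        uint64_t words = 0;
        int prev_space = 1;                  /* byte -1 counts as whitespace */
        for (size_t i = 0; i < n; i++) {
            unsigned char c = buf[i];
            int s = (c == ' ') | (c == '\t') | (c == '\n') | (c == '\r');
            words += (uint64_t)((s ^ 1) & prev_space);  /* +1 only at word starts */
            prev_space = s;
        }
        return words;
    }

Loops of this shape commonly compile to SIMD code at -O3; the source reports ~330 MB/s for its micro-optimized variant under clang.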

Key facts

  • Measured sequential read speeds: ~1.6 GB/s (cold cache) and ~12.8 GB/s (warm cache).
  • Test input: a 425 MB file composed of 100 copies of the King James Version of the Bible.
  • GCC 12 compiled optimized C counter ran in about 1.525s (≈278 MB/s) on the input.
  • A simple pre-lowercasing change improved throughput to ~330 MB/s when compiled with clang.
  • wc -w on the same file ran in about 1.758s (≈245.2 MB/s).
  • A hand-written AVX2 vectorized word counter produced 1.45 GB/s on a warm cache (real time 0.227s).
  • After dropping caches, the AVX2 program ran in 0.395s (cold-cache run), with user time still dominating.
  • The AVX2 implementation used PMOVMSKB, bit-mask iteration (ffs), immintrin.h intrinsics, explicit unrolling, and 32-byte alignment (a sketch of the core loop follows this list).
  • The author reports that the hand-optimized single-threaded word counter reached roughly 11% of the machine's warm-cache sequential read bandwidth.
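
The core of the AVX2 technique named in these facts can be sketched as follows (compile with -mavx2 or -march=native). Assumptions in this sketch that are not from the source: only space and newline count as whitespace, and loads are unaligned with no unrolling. The published code also aligns buffers, unrolls, and iterates set bits with __builtin_ffs where word positions matter; for plain counting, a popcount over the word-start mask suffices.

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Count words 32 bytes at a time: build a whitespace bitmask with byte
     * compares plus VPMOVMSKB, then count whitespace-to-word transitions. */
    static uint64_t count_words_avx2(const uint8_t *buf, size_t len) {
        const __m256i sp = _mm256_set1_epi8(' ');
        const __m256i nl = _mm256_set1_epi8('\n');
        uint64_t words = 0;
        uint32_t prev = 1;                    /* byte -1 counts as whitespace */
        for (size_t i = 0; i + 32 <= len; i += 32) {
            __m256i v  = _mm256_loadu_si256((const __m256i *)(buf + i));
            __m256i ws = _mm256_or_si256(_mm256_cmpeq_epi8(v, sp),
                                         _mm256_cmpeq_epi8(v, nl));
            uint32_t m = (uint32_t)_mm256_movemask_epi8(ws);  /* bit i: byte i is ws */
            uint32_t starts = ~m & ((m << 1) | prev);  /* non-ws byte after a ws byte */
            words += (uint64_t)__builtin_popcount(starts);
            prev = m >> 31;                   /* carry last byte's state to next block */
        }
        return words;
    }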

What to watch next

  • Optimizing hash-map layouts for better cache locality or using stack-based perfect hashing for short words — not confirmed in the source.
  • Porting the vectorized approach to AVX-512 or wider ISAs where available — not confirmed in the source.
  • Multi-threaded or parallel streaming implementations to amortize CPU work across cores — not confirmed in the source.

Quick glossary

  • Sequential read bandwidth: The sustained throughput achieved when reading large contiguous blocks of data from storage sequentially.
  • Vectorization: Transforming algorithms to operate on multiple data elements in parallel using SIMD registers and instructions.
  • AVX2: An x86 SIMD instruction set extension that provides 256-bit wide vector registers and operations for integer and floating-point data.
  • PMOVMSKB: An x86 SIMD instruction that extracts a mask of most-significant bits from vector registers into a general-purpose register, often used to convert per-byte comparisons into a bitmask.
  • ffs (find first set): A bit operation that returns the index of the least-significant set bit in an integer; useful for iterating the set bits of a mask, as the micro-example after this glossary shows.
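
To make the last two entries concrete, here is a small hypothetical example (compile with -mavx2) that compares 32 bytes against the letter 'x', extracts the comparison result as a bitmask via the PMOVMSKB intrinsic _mm256_movemask_epi8, and walks the set bits with __builtin_ffs:

    #include <immintrin.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        const char data[33] = "aaaa x bbbb x cccc x dddd x eeee";  /* 32 bytes */
        __m256i v    = _mm256_loadu_si256((const __m256i *)data);
        __m256i hits = _mm256_cmpeq_epi8(v, _mm256_set1_epi8('x'));
        uint32_t mask = (uint32_t)_mm256_movemask_epi8(hits);  /* PMOVMSKB */
        while (mask) {
            int idx = __builtin_ffs((int)mask) - 1;  /* least-significant set bit */
            printf("'x' at byte %d\n", idx);
            mask &= mask - 1;                        /* clear that bit, continue */
        }
        return 0;
    }

This prints the byte offsets 5, 12, 19, and 26.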

Reader FAQ

Did the author conclude that I/O is no longer the bottleneck?
The author found that storage can be much faster than many single-threaded text-processing implementations, supporting the claim that I/O can outpace naive CPU-bound code.

How close did processing get to raw read bandwidth?
A hand-vectorized single-threaded word counter reached about 1.45 GB/s, roughly 11% of the ~12.8 GB/s warm-cache sequential read speed measured on that machine (1.45 / 12.8 ≈ 0.11).

Is the code used in the experiments available?
Yes. The author published the code on GitHub.

Were multi-threaded or distributed approaches tested?
Not confirmed in the source.

Sources

  • "I/O is no longer the bottleneck?" (Nov 27th, 2022), a response to Ben Hoyt's blog post claiming that, contrary to popular belief, I/O is not the bottleneck in typical programming tasks.
