TL;DR
An experiment revisited a claim that disk I/O is no longer the limiting factor for simple text-processing tasks. Sequential read bandwidth on the test machine is very high, but common single-threaded word-count implementations fell far short of it; only a hand-vectorized AVX2 routine closed most of the gap, reaching 1.45 GB/s.
What happened
A developer re-ran and extended Ben Hoyt's recent claim that I/O has outpaced CPU-bound processing for typical interview-style problems. Measured sequential read speeds on the test machine reached about 1.6 GB/s with a cold cache and 12.8 GB/s with a warm cache. An existing C word-frequency counter, compiled with GCC 12 at -O3 -march=native, processed a 425 MB test file at only ~278 MB/s. Micro-optimizations that reduced branchiness improved throughput slightly, to ~330 MB/s when compiled with clang, while the standard wc -w utility ran at roughly 245 MB/s on the same input. The author then implemented an explicit AVX2 vectorized word counter using PMOVMSKB, bit-trick mask iteration (including __builtin_ffs), unrolled loops, and aligned buffers; the resulting single-threaded program reached about 1.45 GB/s on a warm cache and remained slower on a cold cache after dropping kernel caches. The code is published on GitHub.
Why it matters
- Sequential storage bandwidth has risen substantially, shifting pressure onto CPU processing for streaming workloads.
- Branch-heavy scalar code can prevent compilers from auto-vectorizing, leaving performance on the table.
- Hand-written vectorized code can approach I/O rates but requires careful instruction-level work and is nontrivial to produce and maintain.
- Performance tuning choices (vectorization, alignment, loop unrolling, bit tricks) materially affect single-threaded throughput for common text tasks.
Key facts
- Measured sequential read speeds: ~1.6 GB/s (cold cache) and ~12.8 GB/s (warm cache).
- Test input: a 425 MB file composed of 100 copies of the King James Version of the Bible.
- The optimized C counter, compiled with GCC 12, ran in about 1.525s (≈278 MB/s) on the input.
- A simple pre-lowercasing change improved throughput to ~330 MB/s when compiled with clang.
- wc -w on the same file ran in about 1.758s (≈245.2 MB/s).
- A hand-written AVX2 vectorized word counter produced 1.45 GB/s on a warm cache (real time 0.227s).
- After dropping caches, the AVX2 program ran in 0.395s (cold-cache run), with user time still dominating.
- The AVX2 implementation used PMOVMSKB, bit-mask iteration (ffs), immintrin.h intrinsics, explicit unrolling and 32-byte alignment.
- Author reports the hand-optimized single-threaded word counter reached roughly 11% of the machine's warm-cache sequential read bandwidth.
What to watch next
- Optimizing hash-map layouts for better cache locality or using stack-based perfect hashing for short words — not confirmed in the source.
- Porting the vectorized approach to AVX-512 or wider ISAs where available — not confirmed in the source.
- Multi-threaded or parallel streaming implementations to amortize CPU work across cores — not confirmed in the source.
Quick glossary
- Sequential read bandwidth: The sustained throughput achieved when reading large contiguous blocks of data from storage sequentially.
- Vectorization: Transforming algorithms to operate on multiple data elements in parallel using SIMD registers and instructions.
- AVX2: An x86 SIMD instruction set extension that provides 256-bit wide vector registers and operations for integer and floating-point data.
- PMOVMSKB: An x86 SIMD instruction that extracts a mask of most-significant bits from vector registers into a general-purpose register, often used to convert per-byte comparisons into a bitmask.
- ffs (find first set): A bit operation that returns the index of the least-significant set bit in an integer; useful for iterating set bits in masks.
Reader FAQ
Did the author conclude that I/O is no longer the bottleneck?
The author found storage can be much faster than many single-threaded text-processing implementations, supporting the claim that I/O can outpace naive CPU-bound code.
How close did processing get to raw read bandwidth?
A hand-vectorized single-threaded word counter reached about 1.45 GB/s, which was roughly 11% of the warm-cache sequential read speed measured on that machine.
Is the code used in the experiments available?
Yes. The author published the code on GitHub.
Were multi-threaded or distributed approaches tested?
Not confirmed in the source.
Sources
- I/O is no longer the bottleneck? (Nov 27, 2022)
- I/O is no longer the bottleneck
- Why Your App is Crawling: The CPU vs. I/O Bottleneck …