TL;DR

A developer revisited the claim that disk I/O is no longer the limiting factor for simple text-processing tasks. Sequential reads on the test machine were very fast (~1.6 GB/s cold cache, ~12.8 GB/s warm cache), yet common single-threaded word-count implementations fell far short of those rates until a hand-vectorized AVX2 routine reached 1.45 GB/s.

What happened

A developer re-ran and extended the experiments behind a recent claim that I/O has outpaced CPU-bound processing for typical interview-style problems. Measured sequential read speeds on the test machine reached about 1.6 GB/s on a cold cache and 12.8 GB/s on a warm cache. An existing C word-frequency counter, compiled with GCC 12 at -O3 -march=native, processed a 425 MB test file at only ~278 MB/s. Micro-optimizations that reduced branchiness, such as pre-lowercasing the input, improved throughput modestly (to ~330 MB/s with clang). The standard wc -w utility ran at roughly 245 MB/s on the same input. The author then implemented an explicitly vectorized AVX2 word counter using PMOVMSKB, bit-trick iteration (including __builtin_ffs), unrolled loops, and aligned buffers; the resulting single-threaded program reached about 1.45 GB/s on a warm cache and remained slower on a cold cache after kernel caches were dropped. The code is published on GitHub.
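
For concreteness, a minimal scalar word counter in the spirit of the baselines above might look like the following. This is an illustrative sketch, not the author's published code: it streams the file in 1 MiB chunks and counts whitespace-to-word transitions in a branchy inner loop, the pattern that stalls around a few hundred MB/s in the measurements above.

    #include <ctype.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        if (argc < 2) { fprintf(stderr, "usage: %s FILE\n", argv[0]); return 1; }
        FILE *f = fopen(argv[1], "rb");
        if (!f) { perror("fopen"); return 1; }
        static unsigned char buf[1 << 20];   /* 1 MiB read buffer */
        uint64_t words = 0;
        int in_space = 1;                    /* treat start of file as whitespace */
        size_t n;
        while ((n = fread(buf, 1, sizeof buf, f)) > 0) {
            for (size_t i = 0; i < n; i++) {
                int s = isspace(buf[i]);     /* branchy test, hard to auto-vectorize */
                if (s == 0 && in_space) words++;  /* count whitespace-to-word transitions */
                in_space = (s != 0);
            }
        }
        fclose(f);
        printf("%llu\n", (unsigned long long)words);
        return 0;
    }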

Why it matters

  • Sequential storage bandwidth has risen substantially, shifting pressure onto CPU processing for streaming workloads.
  • Branch-heavy scalar code can prevent compilers from auto-vectorizing, leaving performance on the table (see the branchless sketch after this list).
  • Hand-written vectorized code can approach I/O rates but requires careful instruction-level work and is nontrivial to produce and maintain.
  • Performance tuning choices (vectorization, alignment, loop unrolling, bit tricks) materially affect single-threaded throughput for common text tasks.
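
As a sketch of the branchiness point, the scalar inner loop shown earlier can be rewritten as straight-line data flow. The ASCII-only whitespace test is an assumption of this sketch, not the author's exact change, but the branch-free shape is what gives a compiler a chance to auto-vectorize:

    #include <stddef.h>
    #include <stdint.h>

    /* Branchless inner loop: the count advances by exactly 1 at each
     * whitespace-to-word transition, with no data-dependent branches. */
    static uint64_t count_words_branchless(const unsigned char *buf, size_t n) {
        uint64_t words = 0;
        int prev_space = 1;                  /* byte -1 counts as whitespace */
        for (size_t i = 0; i < n; i++) {
            unsigned char c = buf[i];
            int s = (c == ' ') | (c == '\t') | (c == '\n') | (c == '\r');
            words += (uint64_t)((s ^ 1) & prev_space);  /* +1 only at word starts */
            prev_space = s;
        }
        return words;
    }

Loops of this shape commonly compile to SIMD code at -O3; the source reports ~330 MB/s for its micro-optimized variant under clang.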

Key facts

  • Measured sequential read speeds: ~1.6 GB/s (cold cache) and ~12.8 GB/s (warm cache).
  • Test input: a 425 MB file composed of 100 copies of the King James Version of the Bible.
  • GCC 12 compiled optimized C counter ran in about 1.525s (≈278 MB/s) on the input.
  • A simple pre-lowercasing change improved throughput to ~330 MB/s when compiled with clang.
  • wc -w on the same file ran in about 1.758s (≈245.2 MB/s).
  • A hand-written AVX2 vectorized word counter produced 1.45 GB/s on a warm cache (real time 0.227s).
  • After dropping caches, the AVX2 program ran in 0.395s (cold-cache run), with user time still dominating.
  • The AVX2 implementation used PMOVMSKB, bit-mask iteration (ffs), immintrin.h intrinsics, explicit unrolling, and 32-byte alignment (a sketch of the core loop follows this list).
  • The author reports that the hand-optimized single-threaded word counter reached roughly 11% of the machine's warm-cache sequential read bandwidth.
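
The core of the AVX2 technique named in these facts can be sketched as follows (compile with -mavx2 or -march=native). Assumptions in this sketch that are not from the source: only space and newline count as whitespace, and loads are unaligned with no unrolling. The published code also aligns buffers, unrolls, and iterates set bits with __builtin_ffs where word positions matter; for plain counting, a popcount over the word-start mask suffices.

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Count words 32 bytes at a time: build a whitespace bitmask with byte
     * compares plus VPMOVMSKB, then count whitespace-to-word transitions. */
    static uint64_t count_words_avx2(const uint8_t *buf, size_t len) {
        const __m256i sp = _mm256_set1_epi8(' ');
        const __m256i nl = _mm256_set1_epi8('\n');
        uint64_t words = 0;
        uint32_t prev = 1;                    /* byte -1 counts as whitespace */
        for (size_t i = 0; i + 32 <= len; i += 32) {
            __m256i v  = _mm256_loadu_si256((const __m256i *)(buf + i));
            __m256i ws = _mm256_or_si256(_mm256_cmpeq_epi8(v, sp),
                                         _mm256_cmpeq_epi8(v, nl));
            uint32_t m = (uint32_t)_mm256_movemask_epi8(ws);  /* bit i: byte i is ws */
            uint32_t starts = ~m & ((m << 1) | prev);  /* non-ws byte after a ws byte */
            words += (uint64_t)__builtin_popcount(starts);
            prev = m >> 31;                   /* carry last byte's state to next block */
        }
        return words;
    }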

What to watch next

  • Optimizing hash-map layouts for better cache locality or using stack-based perfect hashing for short words — not confirmed in the source.
  • Porting the vectorized approach to AVX-512 or wider ISAs where available — not confirmed in the source.
  • Multi-threaded or parallel streaming implementations to amortize CPU work across cores — not confirmed in the source.

Quick glossary

  • Sequential read bandwidth: The sustained throughput achieved when reading large contiguous blocks of data from storage sequentially.
  • Vectorization: Transforming algorithms to operate on multiple data elements in parallel using SIMD registers and instructions.
  • AVX2: An x86 SIMD instruction set extension that provides 256-bit wide vector registers and operations for integer and floating-point data.
  • PMOVMSKB: An x86 SIMD instruction that extracts a mask of most-significant bits from vector registers into a general-purpose register, often used to convert per-byte comparisons into a bitmask.
  • ffs (find first set): A bit operation that returns the index of the least-significant set bit in an integer; useful for iterating the set bits of a mask, as the micro-example after this glossary shows.
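
To make the last two entries concrete, here is a small hypothetical example (compile with -mavx2) that compares 32 bytes against the letter 'x', extracts the comparison result as a bitmask via the PMOVMSKB intrinsic _mm256_movemask_epi8, and walks the set bits with __builtin_ffs:

    #include <immintrin.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        const char data[33] = "aaaa x bbbb x cccc x dddd x eeee";  /* 32 bytes */
        __m256i v    = _mm256_loadu_si256((const __m256i *)data);
        __m256i hits = _mm256_cmpeq_epi8(v, _mm256_set1_epi8('x'));
        uint32_t mask = (uint32_t)_mm256_movemask_epi8(hits);  /* PMOVMSKB */
        while (mask) {
            int idx = __builtin_ffs((int)mask) - 1;  /* least-significant set bit */
            printf("'x' at byte %d\n", idx);
            mask &= mask - 1;                        /* clear that bit, continue */
        }
        return 0;
    }

This prints the byte offsets 5, 12, 19, and 26.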

Reader FAQ

Did the author conclude that I/O is no longer the bottleneck?
The author found that storage can be much faster than many single-threaded text-processing implementations, supporting the claim that I/O can outpace naive CPU-bound code.

How close did processing get to raw read bandwidth?
A hand-vectorized single-threaded word counter reached about 1.45 GB/s, roughly 11% of the ~12.8 GB/s warm-cache sequential read speed measured on that machine (1.45 / 12.8 ≈ 0.11).

Is the code used in the experiments available?
Yes. The author published the code on GitHub.

Were multi-threaded or distributed approaches tested?
Not confirmed in the source.

Sources

  • "I/O is no longer the bottleneck?" (Nov 27th, 2022), a response to Ben Hoyt's blog post claiming that, contrary to popular belief, I/O is not the bottleneck in typical programming tasks.
