TL;DR
zpdf is a zero-copy PDF text-extraction library written in Zig that emphasizes memory-mapped I/O, SIMD string operations and parallel page extraction. Benchmarks supplied by the project show multi-threaded extraction running roughly 5x to 18x faster than MuPDF's mutool text extraction, with a reported peak throughput of 41,000 pages/sec on one test document.
What happened
A new open-source library called zpdf provides PDF text extraction implemented in the Zig programming language. The project emphasizes zero-copy, memory-mapped file reads, streaming extraction without intermediate allocations, SIMD-accelerated string handling and multi-threaded page-level extraction. The repository supplies feature lists and benchmarks comparing zpdf to MuPDF 1.26's mutool convert -F text. In sequential runs zpdf is reported to be roughly 2.7x–4.4x faster across several test documents; in parallel mode the project reports speedups from about 5.2x up to 18x, depending on the file, with a peak throughput of 41,000 pages/sec on the Intel SDM document. The author notes that mutool's text conversion is single-threaded by design while zpdf parallelizes extraction across pages, so the parallel figures compare a multi-threaded run against a single-threaded one. The codebase includes a CLI, library API examples and a build/test workflow; it requires Zig 0.15.2 or later.
Why it matters
- Faster text extraction can reduce processing time for large corpora of PDFs used in indexing, search or data analysis.
- Memory-mapped, zero-copy design minimizes intermediate allocations, which helps when handling very large documents.
- Parallel page extraction lets the library scale with available cores for batch workloads.
- Support for common PDF decompression filters and font encodings improves extraction accuracy across diverse documents.
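To make the page-level parallelism concrete, here is a minimal Python sketch of the general fan-out pattern: each page is handed to a worker and the results are collected in page order. It is not zpdf's API. The names `extract_page_text` and `extract_all` are hypothetical, and the per-page work is a trivial stand-in for real content-stream parsing.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_page_text(page_bytes: bytes) -> str:
    # Hypothetical stand-in for real per-page extraction work.
    return page_bytes.decode("latin-1").upper()


def extract_all(pages: list[bytes], workers: int = 4) -> list[str]:
    # Executor.map() yields results in input order, so the page order
    # is preserved even when workers finish out of order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract_page_text, pages))


pages = [b"page one", b"page two", b"page three"]
texts = extract_all(pages)
print(texts)
```

In CPython the global interpreter lock limits the speedup threads can give CPU-bound work, so this only illustrates the structure (independent pages, ordered merge), not the throughput a native implementation could reach.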
Key facts
- zpdf is written in Zig and distributed under the MIT license.
- Requirements: Zig 0.15.2 or later.
- Supports memory-mapped file reading, streaming text extraction and SIMD string operations.
- Decompression filters implemented: FlateDecode, ASCII85, ASCIIHex, LZW, RunLength.
- Font encoding support includes WinAnsi, MacRoman and ToUnicode CMap; CID font handling for Type0 composite fonts and Identity-H/V encoding is implemented.
- XRef table and stream parsing (PDF 1.5+) and incremental PDF updates (/Prev chain) are implemented.
- Provided CLI commands include extract, info and bench; example build commands: zig build and zig build test.
- Benchmark highlights versus MuPDF 1.26 (mutool convert -F text): sequential speedups reported ~2.7x–4.4x; parallel speedups up to ~18x for one tested document.
- Peak throughput reported by the project: 41,000 pages/sec (Intel SDM, parallel).
- Project repository metadata (at time of publication): 26 stars, 0 forks.
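For readers unfamiliar with PDF stream filters, the following Python sketch decodes three of the filters listed above using only the standard library. It illustrates what the filters do as defined by the PDF specification and does not reflect how zpdf implements them; the function names are hypothetical.

```python
import binascii
import zlib


def ascii_hex_decode(data: bytes) -> bytes:
    # Hex digits with optional whitespace; '>' marks end of data.
    body = data.split(b">", 1)[0]
    digits = b"".join(body.split())
    if len(digits) % 2:
        digits += b"0"  # an odd trailing digit is read as if followed by 0
    return binascii.unhexlify(digits)


def flate_decode(data: bytes) -> bytes:
    # FlateDecode streams are zlib-wrapped deflate data.
    return zlib.decompress(data)


def run_length_decode(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        n = data[i]
        if n == 128:  # end-of-data marker
            break
        if n < 128:  # copy the next n + 1 bytes literally
            out += data[i + 1 : i + 2 + n]
            i += n + 2
        else:  # repeat the next byte 257 - n times
            out += data[i + 1 : i + 2] * (257 - n)
            i += 2
    return bytes(out)


# Filters can be chained: here the stream was Flate-compressed, then hex-encoded.
original = b"BT (Hello) Tj ET"
encoded = binascii.hexlify(zlib.compress(original)) + b">"
decoded = flate_decode(ascii_hex_decode(encoded))
rle = run_length_decode(bytes([2]) + b"abc" + bytes([254]) + b"x" + bytes([128]))
```

A decoder applies a chained stream's filters in the order they are listed in the stream dictionary, which is why the hex layer is removed before the Flate layer here.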
What to watch next
- Broader adoption in tooling and downstream projects (not confirmed in the source)
- Independent benchmarks across a wider set of PDF types and encodings (not confirmed in the source)
- Integration with existing PDF toolchains or packaging for common platforms (not confirmed in the source)
Quick glossary
- Memory-mapped file: An OS feature that maps a file's contents directly into a process's address space to allow efficient, low-copy access.
- SIMD: Single Instruction, Multiple Data — CPU instructions that operate on multiple data points in parallel to speed up certain computations like string operations.
- XRef table/stream: PDF cross-reference structures that locate objects within a file; cross-reference streams were introduced in PDF 1.5 as a compressed alternative to the plain-text table.
- CMap / ToUnicode CMap: Character map resources in PDFs used to translate character codes in a font to Unicode code points for correct text extraction.
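The first and third glossary entries can be tied together with a short Python sketch: it memory-maps a file and searches the mapping for the `startxref` keyword, which precedes the byte offset of the cross-reference section near the end of a PDF. The helper name `find_startxref` and the tiny placeholder file are assumptions for illustration; this is not zpdf code.

```python
import mmap
import os
import tempfile


def find_startxref(path: str) -> int:
    # Map the file read-only; the search below runs over the mapping
    # without first reading the whole file into a Python bytes object.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = mm.rfind(b"startxref")
            if pos == -1:
                raise ValueError("no startxref keyword found")
            # Only the short tail after the keyword is copied out.
            tail = mm[pos + len(b"startxref"):]
            return int(tail.split()[0])


# Placeholder file that only mimics the end of a PDF; not a valid document.
demo = b"%PDF-1.7\n(placeholder body)\nstartxref\n1234\n%%EOF\n"
tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".pdf")
try:
    tmp.write(demo)
    tmp.close()
    offset = find_startxref(tmp.name)
finally:
    os.remove(tmp.name)
print(offset)
```

Searching backwards with `rfind` matters for incrementally updated files, where the last `startxref` points at the most recent cross-reference section.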
Reader FAQ
Is zpdf open-source?
Yes; the project is released under the MIT license.
Does zpdf support parallel text extraction?
Yes; the project implements multi-threaded, page-level parallel extraction.
What version of Zig is required?
The repository states Zig 0.15.2 or later is required.
Can zpdf render PDF pages to images?
Not confirmed in the source.
Are the benchmark results representative for all PDFs?
The source provides specific benchmark results for several documents; broader generalization is not confirmed in the source.
Sources
- Zpdf: PDF text extraction in Zig – 5x faster than MuPDF
- MuPDF: The ultimate library for managing PDF documents
- Using MuPDF.js instead of pdf.js