TL;DR
Free Law Project released x-ray, an open-source Python tool that detects ineffective redactions in PDF files. It scans PDFs for drawn rectangles, inspects the underlying text, and reports locations and recovered text as JSON or Python objects.
What happened
Free Law Project published x-ray, a Python library that finds improperly applied redactions in PDF documents. The project grew out of the organization's experience collecting large numbers of PDFs and repeatedly encountering redactions implemented as visual overlays (for example, black rectangles) that leave the underlying selectable text intact. x-ray uses the PyMuPDF library to parse each PDF, locate rectangle objects, and identify letters occupying the same area. It then renders the rectangle as an image and checks whether the overlay is a single color; if it is, x-ray reports the region as a bad redaction. The tool runs from the command line or as a Python module, accepts local paths, URLs, and in-memory bytes, and emits structured JSON that maps page numbers to bounding boxes and recovered text. The repository is public on GitHub under a permissive BSD license. The maintainers note that detection is not perfect and welcome help with tougher cases.
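The detection steps described above can be pictured with a minimal pure-Python sketch. It is not x-ray's code and does not call PyMuPDF: Box, is_single_color, find_bad_redactions, and the render callback are hypothetical names, and the page data is invented for illustration. The sketch only shows the logic of "find letters under a rectangle, then check that the rendered rectangle is one solid color".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle: (x0, y0) is top-left, (x1, y1) is bottom-right."""
    x0: float
    y0: float
    x1: float
    y1: float

    def intersects(self, other: "Box") -> bool:
        # Two boxes overlap only if they overlap on both axes.
        return (self.x0 < other.x1 and other.x0 < self.x1
                and self.y0 < other.y1 and other.y0 < self.y1)


def is_single_color(pixels):
    """True if the rendered region is non-empty and every pixel is identical."""
    flat = [p for row in pixels for p in row]
    return len(flat) > 0 and len(set(flat)) == 1


def find_bad_redactions(rects, chars, render):
    """Return [{'bbox': ..., 'text': ...}] for rectangles that cover text.

    rects:  list of Box, the rectangles drawn on one page
    chars:  list of (Box, str) pairs, one per character in the text layer
    render: callable Box -> 2D list of pixel values (stand-in for rasterizing)
    """
    results = []
    for rect in rects:
        hidden = "".join(ch for box, ch in chars if rect.intersects(box))
        # Flag the rectangle only if it covers text AND renders as one color.
        if hidden.strip() and is_single_color(render(rect)):
            results.append(
                {"bbox": (rect.x0, rect.y0, rect.x1, rect.y1), "text": hidden}
            )
    return results


def solid_black(box):
    """Pretend rasterizer: every pixel of the region is black."""
    return [[(0, 0, 0)] * 4 for _ in range(3)]


# Invented page: one rectangle drawn over the word "secret", one letter outside.
rect = Box(10, 10, 70, 22)
chars = [(Box(12 + 9 * i, 11, 20 + 9 * i, 21), c) for i, c in enumerate("secret")]
chars.append((Box(100, 11, 108, 21), "x"))

print(find_bad_redactions([rect], chars, solid_black))
```

The single-color check is what separates an opaque overlay from, say, a bordered table cell or an image that legitimately sits near text; a real implementation has to deal with many more cases than this sketch does.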
Why it matters
- Poorly applied redactions can leave sensitive information recoverable; automated detection helps identify those failures.
- Organizations that manage large PDF collections can use the tool to scan many documents programmatically.
- Open-source licensing and simple JSON output make integration into existing workflows and tooling practical.
- Improving redaction detection reduces legal and privacy risks for publishers and archives that distribute PDFs.
Key facts
- x-ray is available as a CLI tool and a Python module; output is JSON or Python objects.
- Installation options include pip (pip install x-ray) and the uv package manager (uv add x-ray).
- The tool accepts local file paths, URLs (https:// prefix triggers download) and bytes objects in memory.
- Output structure: a mapping of page numbers to a list of objects with bbox (coordinates) and text (recovered content).
- Detection process uses PyMuPDF to find rectangles, locate letters in the same area, render the rectangle, and check whether the overlay is a single color.
- Repository is hosted on GitHub under a BSD license; contributors must sign a contributor license agreement before contributing.
- Releases are automated via GitHub Actions; maintainers document release steps (update CHANGES.md, pyproject.toml, tag commit).
- Maintainers acknowledge the tool is not exhaustive and reference the issues list for unsupported redaction patterns.
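As an illustration of the output structure listed above, the following sketch flattens a report into rows that are easy to log or load into a spreadsheet. The sample JSON is invented and only mirrors the described shape (page number mapped to a list of objects with bbox and text); summarize is a hypothetical helper, not part of x-ray. Page numbers are treated as strings because JSON object keys are always strings.

```python
import json

# Invented report in the described shape; the values are illustrative only.
sample_output = """
{
  "1": [
    {"bbox": [58.5, 72.2, 75.6, 739.4], "text": "example hidden text"}
  ],
  "3": [
    {"bbox": [10.0, 20.0, 110.0, 32.0], "text": "first"},
    {"bbox": [10.0, 40.0, 110.0, 52.0], "text": "second"}
  ]
}
"""


def summarize(report_json):
    """Flatten a page -> findings mapping into sorted (page, bbox, text) rows."""
    report = json.loads(report_json)
    rows = []
    for page, findings in report.items():
        for finding in findings:
            rows.append((int(page), tuple(finding["bbox"]), finding["text"]))
    return sorted(rows)


for page, bbox, text in summarize(sample_output):
    print(f"page {page}: {text!r} at {bbox}")
```

A flat list like this is convenient when scanning a large collection, since rows from many files can be concatenated and filtered without walking nested dictionaries.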
What to watch next
- Work to broaden detection to additional types of bad redactions beyond simple monochrome overlays, as noted in the repo issues.
- Repository activity and new releases via GitHub Actions that may add support for more edge cases and Python versions.
- Community contributions and pull requests; contributors must follow the project's CLA and contribution guidelines.
Quick glossary
- Redaction: The process of removing or obscuring sensitive content from a document before publication.
- PDF: Portable Document Format, a file format used to present documents independent of software, hardware, or operating system.
- bbox: Bounding box, a rectangle described by coordinates that indicates the position of an element on a page.
- PyMuPDF: A Python library for reading, writing and manipulating PDF and other document formats.
- CLI: Command-line interface, a way to run software by typing commands in a terminal or shell.
Reader FAQ
How do I install x-ray?
Install via pip (pip install x-ray) or with the uv package manager (uv add x-ray).
Can x-ray recover text hidden by a black rectangle?
Yes, as long as the underlying text is still present in the file. It reports regions where a visual overlay appears to hide selectable text and returns the recovered text.
Does x-ray work on files hosted online?
Yes. If you pass a string starting with https://, x-ray will download and inspect that PDF.
Is the detection perfect?
No. The maintainers state that the PDF format is complex and that x-ray does not handle every bad-redaction pattern.
Do I need to sign anything to contribute?
Yes. Contributors are asked to sign a contributor license agreement before their first contribution.
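The input handling mentioned in the FAQ (local paths, https:// strings, and in-memory bytes) can be pictured as a small dispatch on the argument's type. This is a sketch of the idea only: classify_input is a hypothetical function, not x-ray's API, and accepting pathlib.Path objects is an assumption made for the example.

```python
from pathlib import Path


def classify_input(source):
    """Decide how a PDF source would be loaded: 'bytes', 'url', or 'path'."""
    if isinstance(source, bytes):
        return "bytes"  # already in memory, nothing to fetch or open
    if isinstance(source, str) and source.startswith("https://"):
        return "url"    # would be downloaded before inspection
    if isinstance(source, (str, Path)):
        return "path"   # treated as a local file (Path support is assumed)
    raise TypeError(f"unsupported PDF source: {type(source).__name__}")


for candidate in (b"%PDF-1.7", "https://example.com/doc.pdf", "docs/file.pdf"):
    print(repr(candidate), "->", classify_input(candidate))
```

Checking for bytes first keeps the string tests simple, and rejecting anything else with a TypeError makes a mistyped argument fail early instead of being treated as a path.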
Sources
- X-ray: a Python library for finding bad redactions in PDF documents
- X-Ray Bad Redaction Detector