TL;DR
A TypeScript library called bridge-anonymization masks and later restores PII to enable safe use of external translation and LLM services while keeping sensitive data local. It combines regex rules and an ONNX-backed NER model, preserves context for translation, and encrypts the mapping used to rehydrate text.
What happened
Developer Tom Jordi Ruesch open-sourced bridge-anonymization, a local-first TypeScript library that masks personally identifiable information (PII) for translation workflows and then restores it after external machine translation or LLM processing. The tool applies a lifecycle of Detect -> Mask -> Translate -> Rehydrate and runs the masking and rehydration steps entirely on-device in Node.js or Bun. Detection is hybrid: deterministic regexes handle structured items like IBANs and credit cards (with checksum validation), while a quantized ONNX NER model identifies names, organizations and locations. To preserve translation quality, the project adds lightweight semantic enrichment via lookup tables (gender and GeoNames) and uses a fuzzy tag matcher to map mangled tags back to original values. The PII mapping table is encrypted using AES-256-GCM, and the project is available under an MIT license on npm and GitHub.
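The Detect -> Mask -> Translate -> Rehydrate lifecycle can be illustrated with a minimal sketch. It handles only email addresses, and every name in it (maskEmails, rehydrateStrict, fakeTranslate, the PII tag format) is a hypothetical stand-in rather than the library's actual API. The external translation call is simulated by a local function.

```typescript
// Illustrative sketch only; none of these names come from bridge-anonymization.
interface MaskResult {
  masked: string;
  piiMap: Map<string, string>;
}

// Detect + Mask: replace each email address with a numbered placeholder tag.
function maskEmails(text: string): MaskResult {
  const piiMap = new Map<string, string>();
  let counter = 0;
  const emailPattern = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
  const masked = text.replace(emailPattern, (match: string): string => {
    counter += 1;
    const id = `EMAIL_${counter}`;
    piiMap.set(id, match); // the raw value never leaves this process
    return `<PII type="EMAIL" id="${id}"/>`;
  });
  return { masked, piiMap };
}

// Rehydrate: swap placeholder tags back for the original values.
// This strict version assumes the tag comes back from the API unchanged.
function rehydrateStrict(text: string, piiMap: Map<string, string>): string {
  const tagPattern = /<PII type="EMAIL" id="(EMAIL_\d+)"\/>/g;
  return text.replace(tagPattern, (tag: string, id: string): string => {
    const original = piiMap.get(id);
    return original !== undefined ? original : tag;
  });
}

// Stand-in for the external translation service; it only ever sees masked text.
function fakeTranslate(maskedText: string): string {
  return maskedText.replace("Contact", "Kontaktieren Sie");
}

const demoInput = "Contact anna@example.com today.";
const maskResult = maskEmails(demoInput);
const translatedDemo = fakeTranslate(maskResult.masked);
const restoredDemo = rehydrateStrict(translatedDemo, maskResult.piiMap);
```

In this sketch the text passed to fakeTranslate contains only the placeholder tag, and the mapping needed to undo the masking stays in the local Map.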
Why it matters
- Helps teams send redacted text to external translation or LLM services without irreversibly losing grammatical context required for accurate translation.
- Runs detection and rehydration locally, reducing the risk of exposing raw PII to third-party APIs.
- Encrypting the PII map reduces the risk from persisted state or local storage leaks.
- Offers a pragmatic trade-off between runtime cost and accuracy by combining fast regexes with a quantized NER model and compact lookup tables.
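The fast regex path for structured PII can be sketched with a credit card detector that filters regex candidates through the Luhn checksum. The function names passesLuhn and findCardNumbers are hypothetical, and the candidate pattern (13 to 19 digits with optional spaces or hyphens) is an assumption, not the library's own rule.

```typescript
// Illustrative sketch only; helper names and the candidate pattern are assumptions.

// Luhn checksum: walking from the right, double every second digit,
// subtract 9 from any result above 9, and require the total to be divisible by 10.
function passesLuhn(digits: string): boolean {
  if (digits.length === 0) return false;
  let sum = 0;
  let shouldDouble = false;
  for (let i = digits.length - 1; i >= 0; i -= 1) {
    let d = digits.charCodeAt(i) - 48; // '0' has char code 48
    if (d < 0 || d > 9) return false;
    if (shouldDouble) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
    shouldDouble = !shouldDouble;
  }
  return sum % 10 === 0;
}

// Find digit runs that look like card numbers and keep only those that pass Luhn.
function findCardNumbers(text: string): string[] {
  const candidate = /\b\d(?:[ -]?\d){12,18}\b/g;
  const hits: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = candidate.exec(text)) !== null) {
    const digitsOnly = m[0].replace(/[ -]/g, "");
    if (passesLuhn(digitsOnly)) hits.push(m[0]);
  }
  return hits;
}
```

The checksum step is what keeps a purely pattern-based detector from flagging arbitrary 16-digit numbers such as order IDs.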
Key facts
- Library name: bridge-anonymization; implementation: TypeScript.
- Runs on-device for masking and rehydration in Node.js or Bun (supports onnxruntime-node and onnxruntime-web).
- Hybrid detection: regex for structured PII (IBAN, credit cards with Luhn, emails) and ONNX NER for soft PII (names, orgs, locations).
- Provides anonymizeRegexOnly() for low-latency streams and a full anonymize() pipeline for higher-precision scrubbing.
- Uses a quantized (INT8) XLM-RoBERTa ONNX model (~280MB) by default; the author claims it retains about 95% or more of the full model's accuracy.
- Semantic enrichment in V1 relies on lookup tables: gender-guesser (~40k Western names) and GeoNames (cities >15k population) to add attributes like gender or location type.
- Fuzzy Tag Matcher tolerates changes introduced by external APIs (spacing, quotes, attribute order) to reliably rehydrate masked tokens.
- PII mapping table is encrypted with AES-256-GCM; raw PII stays in local memory, and persisted state is encrypted at rest.
- Project is MIT licensed and available on GitHub and npm (package @elanlanguages/bridge-anonymization).
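The idea behind a fuzzy tag matcher can be sketched as follows. The source does not describe the library's matching algorithm, so this is one possible approach under stated assumptions: the tag format (a PII element with type and id attributes) and the function name rehydrateFuzzy are hypothetical. The sketch tolerates extra whitespace, changed quote characters, reordered attributes, and a dropped self-closing slash.

```typescript
// Illustrative sketch only; the tag format and function name are assumptions.
function rehydrateFuzzy(text: string, values: Map<string, string>): string {
  // Match anything that still looks like a PII tag, however its attributes were mangled.
  const looseTag = /<\s*PII\b([^<>]*?)\/?\s*>/gi;
  return text.replace(looseTag, (tag: string, attrs: string): string => {
    // Extract the id regardless of attribute order, spacing, or quote style.
    const idMatch = /\bid\s*=\s*["'“”]?\s*([A-Za-z0-9_]+)/i.exec(attrs);
    if (idMatch === null) return tag; // no id found: leave the tag untouched
    const original = values.get(idMatch[1]);
    return original !== undefined ? original : tag; // unknown id: leave the tag untouched
  });
}

const fuzzyValues = new Map<string, string>([["1", "Anna Schmidt"]]);
const cleanTag = rehydrateFuzzy('Hallo <PII type="PERSON" id="1"/>!', fuzzyValues);
const mangledTag = rehydrateFuzzy("Hallo < pii id = '1' type = 'PERSON' >!", fuzzyValues);
```

Both calls resolve to the same restored sentence, while a tag whose id is not in the map is left in place instead of being silently dropped.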
What to watch next
- Planned research into ML-based semantic enrichment to replace or augment lookup tables (described as a future step).
- Coverage and accuracy beyond Western names and large cities: the V1 lookup tables cover roughly 40k Western names and cities above 15k population, so broader coverage is an area for improvement.
Quick glossary
- PII: Personally Identifiable Information — data that can be used to identify a specific individual, such as names, emails, or identification numbers.
- ONNX: Open Neural Network Exchange — a format and runtime ecosystem that allows machine learning models to run across different frameworks and platforms.
- NER: Named Entity Recognition — an NLP technique that identifies and classifies proper names and entities in text (people, organizations, locations, etc.).
- Quantization: A model compression technique that reduces the precision of neural network weights (e.g., to INT8) to lower size and speed up inference with modest accuracy trade-offs.
- AES-256-GCM: A symmetric encryption algorithm and authenticated mode that provides confidentiality and integrity for stored or transmitted data.
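To make the AES-256-GCM entry concrete, here is a minimal sketch of encrypting and decrypting a PII map with Node's built-in crypto module. The source does not describe how the library manages keys or serializes its map, so the JSON serialization, the SealedMap shape, and the function names are assumptions made for illustration.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "crypto";

// Illustrative sketch only; the storage shape and function names are assumptions.
interface SealedMap {
  iv: Buffer;
  authTag: Buffer;
  ciphertext: Buffer;
}

function encryptPiiMap(piiMap: Record<string, string>, key: Buffer): SealedMap {
  // 96-bit nonce; it must never repeat for the same key.
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([
    cipher.update(JSON.stringify(piiMap), "utf8"),
    cipher.final(),
  ]);
  // The auth tag lets decryption detect any modification of the ciphertext.
  return { iv, authTag: cipher.getAuthTag(), ciphertext };
}

function decryptPiiMap(sealed: SealedMap, key: Buffer): Record<string, string> {
  const decipher = createDecipheriv("aes-256-gcm", key, sealed.iv);
  decipher.setAuthTag(sealed.authTag);
  // final() throws if the tag does not match, i.e. the data was tampered with.
  const plaintext = Buffer.concat([decipher.update(sealed.ciphertext), decipher.final()]);
  return JSON.parse(plaintext.toString("utf8")) as Record<string, string>;
}

const demoKey = randomBytes(32); // 256-bit key; a real deployment would load it from secure storage
const sealedMap = encryptPiiMap({ PERSON_1: "Anna Schmidt" }, demoKey);
const recoveredMap = decryptPiiMap(sealedMap, demoKey);
```

Because GCM is an authenticated mode, a modified ciphertext fails at decipher.final() instead of decrypting to corrupted values, which is the integrity property the glossary entry refers to.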
Reader FAQ
Is the mapping between placeholders and original PII stored securely?
Yes. The library encrypts the PII map using AES-256-GCM and keeps raw PII in local memory, with the persisted state encrypted at rest.
Does bridge-anonymization send raw PII to external translation or LLM APIs?
No. The workflow masks PII locally and sends only the anonymized text to external services; the mapping used to restore values remains local and encrypted.
Is the project open-source and where can I get it?
Yes. It is MIT licensed and distributed via npm and GitHub (links provided in the source).
Does it support disambiguation of names and locations for non-Western contexts?
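Not confirmed in the source. The V1 lookup tables it describes cover roughly 40k Western names and cities above 15k population.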
Does the library auto-download the NER model?
According to the source, the quantized model (~280MB) is auto-downloaded on first run when using the default quantized mode.
Sources
- Show HN: A local-first, reversible PII scrubber for AI workflows
- A local-first, reversible PII scrubber for AI workflows
- Secure LLM Usage With Reversible Data Anonymization
- PII Detection and Anonymization with PySpark on Microsoft …