TL;DR
Exa-d is an in‑house data framework built to keep a web-scale search index current by storing datasets on S3 using Lance. It represents index fields as typed, declarative columns, tracks completeness per fragment, and plans work by diffing the desired state against what is materialized to run only necessary computation.
What happened
Building a modern search index requires ingesting hundreds of billions of pages and keeping derived artifacts—text extractions, metadata, embeddings, and other signals—up to date as source pages change. Exa built exa-d, a custom data processing framework, to address this problem. Engineers declare columns and their dependencies (base columns for ingested data, derived columns for transformations), and the framework uses that logical dependency graph to determine execution order. Physical storage lives in Lance files on S3; datasets are split into fragments that may contain partial schemas, so derived columns can be added or patched per fragment. A planner computes the difference between the ideal state (all columns populated) and the dataset's actual state, then topologically sorts dependent work and parallelizes execution across fragments and machines. The source's code examples show column declarations, an execute_columns call, and a fragment-level write operation that adds or replaces a single column file without rewriting unrelated data.
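The declarative column model can be sketched in a few lines. This is a hypothetical illustration, not exa-d's actual API: the `Column` class and the column names (`html`, `text`, `tokens`, `embedding`) are assumptions chosen to mirror the tokenizer/embedding example described in the source.

```python
from dataclasses import dataclass


# Hypothetical sketch of exa-d-style declarative columns; the real API
# differs. A column declares its name, logical type, and upstream inputs.
@dataclass(frozen=True)
class Column:
    name: str
    dtype: str            # logical type, e.g. "str" or "tensor"
    deps: tuple = ()      # names of input columns (empty = base column)


# Base columns hold ingested data; derived columns declare their inputs,
# forming the logical dependency graph the framework executes.
html = Column("html", "str")
text = Column("text", "str", deps=("html",))                 # text extraction
tokens = Column("tokens", "tensor", deps=("text",))          # Tokenizer: str -> Tensor
embedding = Column("embedding", "tensor", deps=("tokens",))  # EmbeddingModel

SCHEMA = {c.name: c for c in (html, text, tokens, embedding)}
```

Because each derived column names its inputs explicitly, the framework can derive execution order from the declarations alone rather than from hand-written pipeline code.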
Why it matters
- Reduces write amplification by allowing column-level updates for individual fragments rather than rewriting entire rows or files.
- Supports faster iteration on search signals and ML-derived features because columns are typed and declared as contracts, catching mismatches early.
- Enables targeted repair and backfills: engineers can invalidate or replace only the affected fragment columns.
- Improves compute efficiency by planning work from a detailed view of what is already materialized, so only missing or stale outputs are recomputed.
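The last point, planning from a view of what is already materialized, amounts to a diff between ideal and actual state. A minimal sketch, under the assumption that fragment state can be modeled as a set of present column names (the real planner works against Lance fragment metadata):

```python
# Hypothetical planner sketch: diff the ideal state (every fragment has
# every column) against what is materialized, and emit only missing work.
def plan(fragments: dict, all_columns: list) -> list:
    """fragments maps fragment id -> set of column names already present."""
    tasks = []
    for frag_id, present in fragments.items():
        for col in all_columns:
            if col not in present:
                tasks.append((frag_id, col))  # (fragment, column) to compute
    return tasks


# Fragment 0 already has text extracted; fragment 1 has only raw pages.
fragments = {0: {"html", "text"}, 1: {"html"}}
tasks = plan(fragments, ["html", "text", "embedding"])
```

Only the missing (fragment, column) pairs are scheduled; everything already materialized is skipped.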
Key facts
- Exa evaluated existing stacks (data warehouses, SQL transformation layers, orchestrators) before building exa-d.
- Logical model: index fields are represented as typed columns with declared dependencies; derived columns reference their inputs and implementations.
- Storage model: datasets are stored in Lance on S3 as collections of fragments; fragments may have partial schemas and missing derived columns are expected.
- Incremental updates rely on writing or deleting a single column file for a fragment, avoiding rewrites of unrelated columns in that fragment.
- Execution planning is a diff between the ideal fully-populated state and the dataset's actual state; a topological sort enforces dependency order.
- Work is parallelized at fragment granularity across heterogeneous compute resources, computing only what is necessary and skipping cached or recoverable outputs.
- Column definitions act as contracts (examples in the source show Tokenizer: str → Tensor and an EmbeddingModel deriving embeddings from tokens).
- Patch operations and fragment metadata provide a global view of which columns are present without auxiliary bookkeeping tables.
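The incremental-update fact above can be made concrete with a toy model. Here a fragment is represented as a dict mapping column names to per-column file contents; this is an assumed simplification of Lance's on-disk layout, not its actual API, but it shows why a patch touches one file and leaves the rest alone:

```python
# Toy model of a column-level patch at fragment granularity (assumed
# structure, not Lance's actual API): each fragment holds one file per
# column, so replacing a derived column rewrites exactly one file.
def patch_column(fragment: dict, column: str, new_file: bytes) -> dict:
    """Return a new fragment view with only `column`'s file replaced."""
    patched = dict(fragment)       # other column files are reused as-is
    patched[column] = new_file
    return patched


frag = {"html": b"raw pages", "text": b"extracted", "embedding": b"v1"}
frag2 = patch_column(frag, "embedding", b"v2")
```

The write amplification of re-embedding a fragment is then proportional to the embedding column alone, not to the fragment's full row data.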
What to watch next
- Whether performance benchmarks at web scale (throughput, latency, cost) will be published: not confirmed in the source
- If exa-d or parts of Lance will be released or adopted as open source: not confirmed in the source
- How the system handles transactional consistency or concurrent writers in large deployments: not confirmed in the source
Quick glossary
- S3: An object storage service commonly used for scalable, durable storage of files and data objects.
- Lance: A file/dataset format used in this system to store data on S3 as fragments; it records fragment-level metadata and column presence.
- Column (in exa-d): A typed field in the dataset model; columns can be base (ingested) or derived (computed) and declare dependencies and types.
- Fragment: A unit of physical storage in a dataset that contains files for some subset of columns; fragments can have partial schemas.
- Dependency graph: A directed graph that describes which columns depend on which inputs, used to determine execution order for derived fields.
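Ordering such a dependency graph is a standard topological sort, which Python's stdlib `graphlib` handles directly. The column names below are illustrative, mirroring the tokens-to-embedding chain described in the source:

```python
from graphlib import TopologicalSorter

# Map each derived column to the set of columns it depends on.
deps = {
    "text": {"html"},
    "tokens": {"text"},
    "embedding": {"tokens"},
}

# static_order() yields nodes so that every column appears after its inputs.
order = list(TopologicalSorter(deps).static_order())
```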
Reader FAQ
What is exa-d?
An internal data processing framework Exa built to store and maintain a web-scale index on S3, representing fields as typed, declarative columns and storing data in Lance fragments.
How does exa-d avoid rewriting large amounts of data when updating derived fields?
By storing data as fragments with partial schemas in Lance and writing or deleting single-column files per fragment so unaffected columns are not rewritten.
Does exa-d compute everything from scratch for each change?
No. The planner diffs the ideal state against the materialized state and only schedules computation for missing or outdated outputs, with work parallelized across fragments.
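Because fragments are independent units of work, the scheduled (fragment, column) tasks can run concurrently. A hedged sketch using a thread pool; `compute_column` is a stand-in for the real (possibly GPU-backed) column implementation, not anything from exa-d itself:

```python
from concurrent.futures import ThreadPoolExecutor


# Stand-in for a real column implementation; returns a placeholder result
# tagged with the fragment and column it was computed for.
def compute_column(task):
    frag_id, col = task
    return (frag_id, col, f"{col}@frag{frag_id}")


# Tasks as emitted by a planner diff: only missing outputs are listed.
tasks = [(0, "embedding"), (1, "text"), (1, "embedding")]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(compute_column, tasks))
```

In a real deployment the same fan-out would span machines rather than threads, but the scheduling shape is the same: independent fragments, dependency order enforced per column.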
Is exa-d open source or available to external users?
Not confirmed in the source.

Sources
- Exa-d: How to store the web in S3
- Using LanceDB with S3 as your Vector Database
- Bring Vector Search And Storage To The Data Lake With …
- lance-format/lance