TL;DR
Exa-d is an in‑house data framework built to keep a web-scale search index current by storing datasets on S3 using Lance. It represents index fields as typed, declarative columns, tracks completeness per fragment, and plans work by diffing the desired state against what is materialized to run only necessary computation.
What happened
Building a modern search index requires ingesting hundreds of billions of pages and keeping derived artifacts—text extractions, metadata, embeddings, and other signals—up to date as source pages change. Exa built exa-d, a custom data processing framework, to address this problem. Engineers declare columns and their dependencies (base columns for ingested data, derived columns for transformations), and the framework uses that logical dependency graph to determine execution order. Physical storage lives in Lance files on S3; datasets are split into fragments that may contain partial schemas, so derived columns can be added or patched per fragment. A planner computes the difference between the ideal state (all columns populated) and the dataset's actual state, then topologically sorts dependent work and parallelizes execution across fragments and machines. The source's code examples show column declarations, an execute_columns call, and a fragment-level write operation that adds or replaces a single column file without rewriting unrelated data.
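The declarative column model can be sketched in a few lines. This is a hypothetical illustration, not exa-d's actual API: the `Column` class and the column names (`html`, `text`, `tokens`, `embedding`) are assumptions chosen to mirror the tokenizer/embedding example described in the source.

```python
from dataclasses import dataclass


# Hypothetical sketch of exa-d-style declarative columns; the real API
# differs. A column declares its name, logical type, and upstream inputs.
@dataclass(frozen=True)
class Column:
    name: str
    dtype: str            # logical type, e.g. "str" or "tensor"
    deps: tuple = ()      # names of input columns (empty = base column)


# Base columns hold ingested data; derived columns declare their inputs,
# forming the logical dependency graph the framework executes.
html = Column("html", "str")
text = Column("text", "str", deps=("html",))                 # text extraction
tokens = Column("tokens", "tensor", deps=("text",))          # Tokenizer: str -> Tensor
embedding = Column("embedding", "tensor", deps=("tokens",))  # EmbeddingModel

SCHEMA = {c.name: c for c in (html, text, tokens, embedding)}
```

Because each derived column names its inputs explicitly, the framework can derive execution order from the declarations alone rather than from hand-written pipeline code.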
Why it matters
- Reduces write amplification by allowing column-level updates for individual fragments rather than rewriting entire rows or files.
- Supports faster iteration on search signals and ML-derived features because columns are typed and declared as contracts, catching mismatches early.
- Enables targeted repair and backfills: engineers can invalidate or replace only the affected fragment columns.
- Improves compute efficiency by planning work from a detailed view of what is already materialized, so only missing or stale outputs are recomputed.
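The last point, planning from a view of what is already materialized, amounts to a diff between ideal and actual state. A minimal sketch, under the assumption that fragment state can be modeled as a set of present column names (the real planner works against Lance fragment metadata):

```python
# Hypothetical planner sketch: diff the ideal state (every fragment has
# every column) against what is materialized, and emit only missing work.
def plan(fragments: dict, all_columns: list) -> list:
    """fragments maps fragment id -> set of column names already present."""
    tasks = []
    for frag_id, present in fragments.items():
        for col in all_columns:
            if col not in present:
                tasks.append((frag_id, col))  # (fragment, column) to compute
    return tasks


# Fragment 0 already has text extracted; fragment 1 has only raw pages.
fragments = {0: {"html", "text"}, 1: {"html"}}
tasks = plan(fragments, ["html", "text", "embedding"])
```

Only the missing (fragment, column) pairs are scheduled; everything already materialized is skipped.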
Key facts
- Exa evaluated existing stacks (data warehouses, SQL transformation layers, orchestrators) before building exa-d.
- Logical model: index fields are represented as typed columns with declared dependencies; derived columns reference their inputs and implementations.
- Storage model: datasets are stored in Lance on S3 as collections of fragments; fragments may have partial schemas and missing derived columns are expected.
- Incremental updates rely on writing or deleting a single column file for a fragment, avoiding rewrites of unrelated columns in that fragment.
- Execution planning is a diff between the ideal fully-populated state and the dataset's actual state; a topological sort enforces dependency order.
- Work is parallelized at fragment granularity across heterogeneous compute resources, computing only what is necessary and skipping cached or recoverable outputs.
- Column definitions act as contracts (examples in the source show Tokenizer: str → Tensor and an EmbeddingModel deriving embeddings from tokens).
- Patch operations and fragment metadata provide a global view of which columns are present without auxiliary bookkeeping tables.
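The incremental-update fact above can be made concrete with a toy model. Here a fragment is represented as a dict mapping column names to per-column file contents; this is an assumed simplification of Lance's on-disk layout, not its actual API, but it shows why a patch touches one file and leaves the rest alone:

```python
# Toy model of a column-level patch at fragment granularity (assumed
# structure, not Lance's actual API): each fragment holds one file per
# column, so replacing a derived column rewrites exactly one file.
def patch_column(fragment: dict, column: str, new_file: bytes) -> dict:
    """Return a new fragment view with only `column`'s file replaced."""
    patched = dict(fragment)       # other column files are reused as-is
    patched[column] = new_file
    return patched


frag = {"html": b"raw pages", "text": b"extracted", "embedding": b"v1"}
frag2 = patch_column(frag, "embedding", b"v2")
```

The write amplification of re-embedding a fragment is then proportional to the embedding column alone, not to the fragment's full row data.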
What to watch next
- Whether performance benchmarks at web scale (throughput, latency, cost) will be published: not confirmed in the source
- If exa-d or parts of Lance will be released or adopted as open source: not confirmed in the source
- How the system handles transactional consistency or concurrent writers in large deployments: not confirmed in the source
Quick glossary
- S3: An object storage service commonly used for scalable, durable storage of files and data objects.
- Lance: A file/dataset format used in this system to store data on S3 as fragments; it records fragment-level metadata and column presence.
- Column (in exa-d): A typed field in the dataset model; columns can be base (ingested) or derived (computed) and declare dependencies and types.
- Fragment: A unit of physical storage in a dataset that contains files for some subset of columns; fragments can have partial schemas.
- Dependency graph: A directed graph that describes which columns depend on which inputs, used to determine execution order for derived fields.
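Ordering such a dependency graph is a standard topological sort, which Python's stdlib `graphlib` handles directly. The column names below are illustrative, mirroring the tokens-to-embedding chain described in the source:

```python
from graphlib import TopologicalSorter

# Map each derived column to the set of columns it depends on.
deps = {
    "text": {"html"},
    "tokens": {"text"},
    "embedding": {"tokens"},
}

# static_order() yields nodes so that every column appears after its inputs.
order = list(TopologicalSorter(deps).static_order())
```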
Reader FAQ
What is exa-d?
An internal data processing framework Exa built to store and maintain a web-scale index on S3, representing fields as typed, declarative columns and storing data in Lance fragments.
How does exa-d avoid rewriting large amounts of data when updating derived fields?
By storing data as fragments with partial schemas in Lance and writing or deleting single-column files per fragment so unaffected columns are not rewritten.
Does exa-d compute everything from scratch for each change?
No. The planner diffs the ideal state against the materialized state and only schedules computation for missing or outdated outputs, with work parallelized across fragments.
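Because fragments are independent units of work, the scheduled (fragment, column) tasks can run concurrently. A hedged sketch using a thread pool; `compute_column` is a stand-in for the real (possibly GPU-backed) column implementation, not anything from exa-d itself:

```python
from concurrent.futures import ThreadPoolExecutor


# Stand-in for a real column implementation; returns a placeholder result
# tagged with the fragment and column it was computed for.
def compute_column(task):
    frag_id, col = task
    return (frag_id, col, f"{col}@frag{frag_id}")


# Tasks as emitted by a planner diff: only missing outputs are listed.
tasks = [(0, "embedding"), (1, "text"), (1, "embedding")]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(compute_column, tasks))
```

In a real deployment the same fan-out would span machines rather than threads, but the scheduling shape is the same: independent fragments, dependency order enforced per column.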
Is exa-d open source or available to external users?
Not confirmed in the source.

Sources
- Exa-d: How to store the web in S3
- Using LanceDB with S3 as your Vector Database
- Bring Vector Search And Storage To The Data Lake With …
- lance-format/lance