TL;DR

Researchers ran hundreds of autonomous coding agents on single codebases for days to weeks, exploring coordination strategies and role separation. Moving from flat, lock-based coordination to a planner/worker pipeline improved throughput and enabled large-scale experiments, including a near-week-long browser build and multi-week migrations.

What happened

Researchers experimented with long-running autonomous coding by running hundreds of concurrent agents on single projects for days and weeks. Early attempts used a flat, peer-to-peer coordination model with a shared coordination file and locks; that approach ran into contention, lock-related failures, and agents becoming risk-averse. Switching to optimistic concurrency reduced some brittleness but did not solve responsibility gaps. The team then separated responsibilities into planners, which explore the codebase, create tasks, and can recursively spawn sub-planners, and workers, which pick up and complete tasks without coordinating among themselves. A judge agent decided whether to continue at the end of each cycle. This planner/worker pipeline scaled to large codebases: agents ran for close to a week to implement a browser (over 1 million lines across ~1,000 files) and completed other multi-week projects, while the team measured how model choice and prompting affected long runs.
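The post itself does not include code, but the cycle can be pictured as a simple task-queue loop. The sketch below is a minimal, hypothetical Python illustration of the planner/worker/judge split; plan(), work(), and judge() are stand-ins for LLM-driven agents, and none of the names or numbers come from the source.

```python
import queue
from dataclasses import dataclass

# Minimal sketch of a planner/worker/judge cycle (a single-threaded stand-in
# for many concurrent agents). plan(), work(), and judge() are hypothetical
# placeholders for LLM-driven agents.

@dataclass
class Task:
    description: str

task_queue: "queue.Queue[Task]" = queue.Queue()

def plan(goal: str, cycle: int) -> list[Task]:
    # Planner: explores the codebase and emits tasks; planners can also
    # spawn recursive sub-planners for large areas of the code.
    return [Task(f"{goal}: subtask {cycle}.{i}") for i in range(3)]

def work(task: Task) -> str:
    # Worker: picks up one task and completes it end-to-end, without
    # coordinating with other workers.
    return f"completed {task.description}"

def judge(results: list[str], cycle: int) -> bool:
    # Judge: at the end of each cycle, decides whether to run another one.
    return bool(results) and cycle < 3

def run(goal: str) -> None:
    cycle = 0
    while True:
        for task in plan(goal, cycle):   # planners only create tasks
            task_queue.put(task)
        results = []
        while not task_queue.empty():    # workers only consume tasks
            results.append(work(task_queue.get()))
        cycle += 1
        if not judge(results, cycle):
            break

run("build a web browser")
```

The point of the shape, per the report, is that workers never talk to each other; all coordination lives in the planning layer and the judge's end-of-cycle decision.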

Why it matters

  • Demonstrates that hundreds of autonomous agents can collaborate on single codebases for extended periods, tackling projects that usually take human teams months.
  • Shows coordination design — role separation vs. flat coordination — materially affects throughput, failure modes, and progress.
  • Indicates model selection and prompting significantly influence agent endurance, focus, and fidelity on long tasks.
  • Points to practical engineering trade-offs: simpler pipelines can outperform more complex integrator layers that become bottlenecks.

Key facts

  • Initial flat coordination used a shared file with locks and later optimistic concurrency control; both approaches exposed limits.
  • Role separation introduced planners (task generation, recursive sub-planners) and workers (task execution), with a judge agent validating cycles.
  • Agents ran close to a week on a project aimed at building a web browser from scratch, producing over 1 million lines of code across about 1,000 files; the source is available on GitHub, per the report.
  • A migration from Solid to React in the Cursor codebase took over three weeks and produced a diff of roughly +266K/-193K lines.
  • A long-running agent produced a Rust-based video rendering improvement claimed as 25x faster; that change was merged and slated for production.
  • Other experiments included large codebases: Java LSP (7.4K commits, 550K LoC), Windows 7 emulator (14.6K commits, 1.2M LoC), Excel (12K commits, 1.6M LoC), and FX1 (9.5K commits, 1.2M LoC).
  • The team reported writing over a million lines of code and processing very large token volumes, with the report citing token counts in both the billions and the trillions across agents.
  • Model performance varied: GPT-5.2 performed better on extended autonomous work and planning; Opus 4.5 tended to stop sooner and take shortcuts; different models were used for different roles.

What to watch next

  • Planners that reactively wake up when tasks complete to avoid idle time and improve continuity.
  • Mechanisms to prevent agents from running excessively long and to reduce the need for periodic fresh starts to combat drift.
  • Further integration of these multi-agent techniques into product agent capabilities at Cursor and related tooling improvements.

Quick glossary

  • Agent: An autonomous software process that performs tasks such as writing code, testing, or coordinating with other agents.
  • Planner: An agent role focused on exploring the codebase, decomposing work, and creating tasks for workers to execute.
  • Worker: An agent role that picks up assigned tasks and executes them end-to-end without coordinating with other workers.
  • Locking: A concurrency control technique that prevents multiple processes from modifying the same resource simultaneously by granting exclusive access.
  • Optimistic concurrency control: A strategy where processes read shared state freely and only verify for conflicts when writing, aborting writes if the state changed.
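To make the last two entries concrete, here is a minimal, hypothetical sketch of optimistic concurrency control applied to a shared coordination file; the file name, JSON structure, and try_claim helper are illustrative assumptions, not details from the post.

```python
import json
from pathlib import Path

# Toy example of optimistic concurrency on a shared coordination file.
# The file name and structure are illustrative, not taken from the post.
COORD_FILE = Path("coordination.json")

def read_state() -> dict:
    if not COORD_FILE.exists():
        return {"version": 0, "claimed_tasks": []}
    return json.loads(COORD_FILE.read_text())

def try_claim(task_id: str) -> bool:
    # Read freely, without taking a lock.
    before = read_state()
    if task_id in before["claimed_tasks"]:
        return False  # another agent already claimed this task

    # ... the agent would do its local reasoning here, holding no lock ...

    # Verify at write time: if the version changed since the read, back off.
    current = read_state()
    if current["version"] != before["version"]:
        return False  # conflict detected; caller can retry with fresh state
    current["claimed_tasks"].append(task_id)
    current["version"] += 1
    COORD_FILE.write_text(json.dumps(current))
    return True

print(try_claim("task-42"))
```

In a real multi-agent setup the final check-and-write would itself need to be atomic (for example via an atomic file rename or a database compare-and-swap); the sketch only shows the read, local work, verify-on-write shape that distinguishes this strategy from locking.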

Reader FAQ

Did the agents build a web browser fully autonomously?
The team directed agents at building a web browser and ran them for close to a week, producing over 1 million lines across ~1,000 files; whether the browser is functionally complete is not confirmed in the source.

Were any agent-produced changes merged to production?
Yes. The report states a Rust-based video rendering improvement (25x faster) was merged and will be in production soon.

Is the browser source available to inspect?
The post says you can explore the source code on GitHub.

Are the researchers hiring to work on this?
The post invites interested candidates to contact hiring@cursor.com.

Sources

  • Wilson Lin, "Scaling long-running autonomous coding," Cursor Research blog, Jan 14, 2026.
