MirrorCode evaluates AI’s long-horizon coding capabilities with 22 open-source tasks
A new benchmark from METR and Epoch AI tests whether AI agents can reimplement entire software programs from scratch, and the results are striking
Here’s something that should make every software engineer pause mid-coffee-sip: an AI model just reimplemented a 16,000-line bioinformatics toolkit. Not refactored it. Not debugged it. Rebuilt the whole thing from scratch, in a different programming language, passing 99.95% of over 2,000 tests.
MirrorCode, a new benchmark co-developed by AI evaluation organizations METR and Epoch AI, is designed to measure something that most existing coding benchmarks don’t even attempt. Instead of asking AI to solve neat little algorithmic puzzles, it asks a far more demanding question: can an AI agent autonomously reimplement an entire real-world software program without ever seeing the source code?
How MirrorCode actually works
The benchmark selects real command-line interface programs, gives the AI agent access only to the program’s behavior (inputs and outputs, no source code), and asks it to build a functional replica.
The preliminary results, published on April 10, 2026, cover more than 20 target programs spanning a wide range of domains. Unix utilities, bioinformatics tools, interpreters, static analysis software, cryptography implementations, and compression algorithms all made the cut. Each reimplementation is evaluated through hundreds to thousands of end-to-end tests requiring exact output matching. No partial credit. No “close enough.”
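To make the "exact output matching" idea concrete, here is a minimal sketch of what a black-box comparison harness could look like. It is not MirrorCode's actual harness: the names Case, run and pass_rate are hypothetical, and comparing both standard output and exit code is an assumption made for illustration. The two toy "programs" are just Python one-liners standing in for a reference tool and its reimplementation.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class Case:
    """One end-to-end test case (hypothetical structure, for illustration)."""
    args: list[str]   # extra command-line arguments
    stdin: str = ""   # text piped to standard input


def run(cmd: list[str], case: Case) -> tuple[int, str]:
    """Run one command on one case and return (exit code, stdout)."""
    proc = subprocess.run(
        cmd + case.args,
        input=case.stdin,
        capture_output=True,
        text=True,
        timeout=30,
    )
    return proc.returncode, proc.stdout


def pass_rate(reference: list[str], candidate: list[str], cases: list[Case]) -> float:
    """Fraction of cases where the candidate matches the reference exactly.

    Assumption: "exact" here means identical stdout and exit code.
    There is no partial credit for output that is merely similar.
    """
    passed = sum(run(reference, c) == run(candidate, c) for c in cases)
    return passed / len(cases)


if __name__ == "__main__":
    # Toy stand-ins: both upper-case their input, but the candidate omits
    # the trailing newline that the reference prints.
    ref = [sys.executable, "-c", "import sys; print(sys.stdin.read().upper())"]
    cand = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    cases = [Case(args=[], stdin="abc")]
    print("reference vs itself:", pass_rate(ref, ref, cases))
    print("reference vs candidate:", pass_rate(ref, cand, cases))
```

The toy candidate differs from the reference by a single trailing newline, which is enough to fail the case. That is the practical meaning of "no partial credit": near-identical output still counts as a failure.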
The standout result belongs to Claude Opus 4.6, which tackled gotree, an open-source bioinformatics toolkit written in Go. The original program clocks in at roughly 16,000 lines of code. Claude rebuilt it in Rust, condensing it to around 7,700 lines while passing 99.95% of 2,001 tests. That’s a single failed test out of 2,001.
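The arithmetic behind those figures is easy to check. The short sketch below uses only the numbers reported above; the variable names are illustrative, and the line counts are the article's rounded approximations.

```python
total_tests = 2001       # test count reported for gotree
reported_rate = 0.9995   # 99.95% pass rate, as reported (rounded)

# The reported rate is rounded, so round to the nearest whole test.
passed = round(total_tests * reported_rate)
failed = total_tests - passed
print(passed, failed)                 # 2000 1
print(f"{passed / total_tests:.2%}")  # 99.95%

go_lines = 16_000    # approximate size of the original Go program
rust_lines = 7_700   # approximate size of the Rust reimplementation
print(f"{rust_lines / go_lines:.0%}")  # 48%
```

In other words, the Rust version is a little under half the length of the original while failing exactly one test.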
The model also successfully reimplemented smaller programs like choose (roughly 650 lines) and cal (approximately 1,200 lines).
The human comparison is uncomfortable
METR and Epoch AI estimated that the gotree reimplementation task would take a human engineer somewhere between 2 and 17 weeks. Claude completed it with a budget of up to 1 billion tokens. The research found that performance improvements correlated directly with increased token budgets: the AI did measurably better when given more room to work.
Not everything was solved, though. Larger, more complex programs like Pkl remained unsolved under the tested limits.
What this means for investors
For companies building on AI-assisted development tools, these results suggest the ceiling is much higher than current commercial products indicate. Today’s AI coding assistants mostly autocomplete functions and suggest snippets. MirrorCode demonstrates that the underlying models are capable of something far more ambitious: autonomous engineering at the scale of entire applications.
There’s a risk dimension too. The gap between what AI can do on benchmarks and what it can do in messy, real-world production environments remains significant. Pkl’s unsolved status is a reminder that scaling these capabilities to larger, more complex systems is not guaranteed. The correlation between token budget and performance also means that compute costs matter. Running a billion tokens through a frontier model isn’t cheap, and the economics only work if the output reliably replaces human effort.