MirrorCode evaluates AI's long-horizon coding capabilities with 22 open-source tasks

Here’s something that should make every software engineer pause mid-coffee-sip: an AI model just reimplemented a 16,000-line bioinformatics toolkit. Not refactored it. Not debugged it. Rebuilt the whole thing from scratch, in a different programming language, passing 99.95% of over 2,000 tests.

MirrorCode, a new benchmark co-developed by AI evaluation organizations METR and Epoch AI, is designed to measure something that most existing coding benchmarks don’t even attempt. Instead of asking AI to solve neat little algorithmic puzzles, it asks a more existential question: can an AI agent autonomously reimplement an entire real-world software program without ever seeing the source code?

How MirrorCode actually works

The benchmark selects real command-line interface programs, gives the AI agent access only to the program’s behavior (inputs and outputs, no source code), and asks it to build a functional replica.

The preliminary results, published on April 10, 2026, cover more than 20 target programs spanning a wide range of domains. Unix utilities, bioinformatics tools, interpreters, static analysis software, cryptography implementations, and compression algorithms all made the cut. Each reimplementation is evaluated through hundreds to thousands of end-to-end tests requiring exact output matching. No partial credit. No “close enough.”
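To make "exact output matching" concrete, here is a minimal sketch of how such an all-or-nothing check could be scored. This is not MirrorCode's actual harness: the `Case` type, the `run` and `pass_rate` helpers, and the choice to compare exit codes alongside standard output are all assumptions made for illustration.

```python
import subprocess
import sys
from dataclasses import dataclass, field


@dataclass
class Case:
    """One hypothetical end-to-end test: CLI arguments plus stdin text."""
    args: list[str] = field(default_factory=list)
    stdin: str = ""


def run(program: list[str], case: Case) -> tuple[int, str]:
    """Run a program on one case; return its exit code and stdout."""
    proc = subprocess.run(
        program + case.args,
        input=case.stdin,
        capture_output=True,
        text=True,
        timeout=30,
    )
    return proc.returncode, proc.stdout


def pass_rate(reference: list[str], candidate: list[str], cases: list[Case]) -> float:
    """Fraction of cases where the candidate matches the reference exactly.

    A case counts only if exit code and stdout are identical (assumed
    criteria); there is no partial credit for near matches.
    """
    passed = sum(run(reference, c) == run(candidate, c) for c in cases)
    return passed / len(cases)


if __name__ == "__main__":
    # Stand-in programs: a "reference" that upper-cases stdin, and a
    # "replica" that is almost right but appends a trailing space.
    ref = [sys.executable, "-c", "import sys; print(sys.stdin.read().upper(), end='')"]
    bad = [sys.executable, "-c", "import sys; print(sys.stdin.read().upper() + ' ', end='')"]
    cases = [Case(stdin="abc"), Case(stdin="xyz")]
    print(pass_rate(ref, ref, cases))
    print(pass_rate(ref, bad, cases))
```

Under this kind of scoring, a replica whose output differs by a single trailing space fails the case outright, which is why near-perfect pass rates on thousands of tests are hard to reach.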

The standout result belongs to Claude Opus 4.6, which tackled gotree, an open-source bioinformatics toolkit written in Go. The original program clocks in at roughly 16,000 lines of code. Claude rebuilt it in Rust, condensing it to around 7,700 lines while passing 99.95% of 2,001 tests. That’s a single failed test out of 2,001.
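The headline figures can be sanity-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above (2,001 tests, a 99.95% pass rate, roughly 16,000 and 7,700 lines). The variable names are illustrative, and rounding to two decimal places is an assumption about how the percentage was reported.

```python
TESTS_TOTAL = 2001
REPORTED_PASS_PCT = 99.95

# Which failure counts are consistent with the reported pass rate
# once it is rounded to two decimal places?
consistent_failures = [
    failed
    for failed in range(0, 10)
    if round((TESTS_TOTAL - failed) / TESTS_TOTAL * 100, 2) == REPORTED_PASS_PCT
]

# Approximate line counts quoted in the article.
ORIGINAL_LINES = 16_000
REWRITE_LINES = 7_700
size_ratio_pct = round(REWRITE_LINES / ORIGINAL_LINES * 100, 1)

print(consistent_failures)  # failure counts matching 99.95%
print(size_ratio_pct)       # rewrite size as a percentage of the original
```

Only one failed test fits the reported rate, and the Rust version comes out at a little under half the size of the Go original by line count.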

The model also successfully reimplemented smaller programs like choose (roughly 650 lines) and cal (approximately 1,200 lines).

The human comparison is uncomfortable

METR and Epoch AI estimated that the gotree reimplementation task would take a human engineer somewhere between 2 and 17 weeks. Claude completed it with a budget of up to 1 billion tokens. The research found that performance rose as token budgets increased: the AI did better when given more room to think.

Not everything was solved, though. Larger, more complex programs like Pkl remained unsolved under the tested limits.

What this means for investors

For companies building on AI-assisted development tools, these results suggest the ceiling is much higher than current commercial products indicate. Today’s AI coding assistants mostly autocomplete functions and suggest snippets. MirrorCode demonstrates that the underlying models are capable of something far more ambitious: autonomous engineering at the scale of entire applications.

There’s a risk dimension too. The gap between what AI can do on benchmarks and what it can do in messy, real-world production environments remains significant. Pkl’s unsolved status is a reminder that scaling these capabilities to larger, more complex systems is not guaranteed. The correlation between token budget and performance also means that compute costs matter. Running a billion tokens through a frontier model isn’t cheap, and the economics only work if the output reliably replaces human effort.

Disclosure: This article was edited by Editorial Team. For more information on how we create and review content, see our Editorial Policy.