OpenAI introduces GeneBench to evaluate AI on computational biology’s hardest problems
A new benchmark reveals that even the best AI models struggle to solve the kinds of multi-stage genomics problems that take senior scientists weeks to crack.
OpenAI has published GeneBench, a benchmark designed to stress-test AI models on the kinds of problems that make computational biologists earn their salaries. The benchmark, released as a bioRxiv preprint on April 23, 2026, is not asking models to explain what DNA is. It is asking them to do the actual work.
Each of GeneBench’s 103 problems spans multiple analytical stages across ten domains in genomics and quantitative biology. A senior scientist would need roughly 10 to 40 hours to work through a single one of them. Current AI models are, to put it politely, not there yet.
What the numbers actually say
GPT-5.5 Pro posted the highest pass rate among evaluated models, at 33.2%. That sounds modest until you see what the rest of the field managed.
The standard version of GPT-5.5 reached a 25.0% pass rate. Gemini 3.1 Pro landed at 11.2%. Roughly 60% of the benchmark’s problems remained below a 20% pass rate even for the best models tested.
J. Li and collaborators published the benchmark alongside developments related to GPT-5.5, framing it as both an evaluation tool and a signal of where the field stands heading into a period of rapid model development.
A June 2026 update to a model called GPT-Rosalind achieved a 21.6% pass rate on GeneBench compared to GPT-5.5’s 20.4%, while using 31% fewer tokens to get there.
Why a benchmark like this matters
GeneBench grounds its problems in work that reflects what scientists actually do. Multi-stage genomics analysis involves long inference chains, domain-specific reasoning, and decisions that compound across many steps.
That roughly 60% of problems remain below a 20% pass rate, even for the best models tested, shows researchers, investors, and companies where the ceiling currently sits.
What this means for the market
AI companies are increasingly positioning their models as tools for drug discovery, genomic analysis, and biomedical research. A 33.2% pass rate on problems requiring 10 to 40 hours of senior scientist effort is both an honest admission of current limits and a baseline that future models can be measured against.
The GPT-Rosalind efficiency result adds another dimension. If a model can approach the performance of a larger, more expensive model while consuming 31% fewer tokens, the unit economics of deploying AI in research workflows improve considerably.
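To make the unit-economics point concrete, here is a rough back-of-the-envelope sketch using the reported GPT-Rosalind comparison (21.6% versus 20.4% pass rate, 31% fewer tokens). It assumes the 31% figure refers to average tokens per attempt on the same problem set, and that cost scales linearly with tokens; neither assumption is confirmed by the source. The function name and parameters are illustrative only.

```python
def relative_tokens_per_solved(pass_a, pass_b, token_saving_b):
    """Tokens model B spends per solved problem, relative to model A.

    Assumption: model A uses 1.0 unit of tokens per attempt and model B
    uses (1 - token_saving_b) units. Expected tokens per solved problem
    is then tokens-per-attempt divided by pass rate.
    """
    tokens_a_per_solve = 1.0 / pass_a
    tokens_b_per_solve = (1.0 - token_saving_b) / pass_b
    return tokens_b_per_solve / tokens_a_per_solve


# Reported figures: GPT-5.5 at 20.4%, GPT-Rosalind at 21.6% with 31% fewer tokens.
ratio = relative_tokens_per_solved(pass_a=0.204, pass_b=0.216, token_saving_b=0.31)
print(f"{ratio:.3f}")  # prints 0.652
```

Under these assumptions, the combined effect of a slightly higher pass rate and lower token use works out to roughly 35% fewer tokens per solved problem, a somewhat larger gain than the 31% headline figure alone suggests.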