OpenAI launches LifeSciBench to evaluate AI in life sciences
The new benchmark uses 750 expert-authored tasks and nearly 20,000 evaluation criteria to stress-test AI models on real-world biological research workflows
OpenAI just dropped what amounts to a final exam for AI models trying to do real science. LifeSciBench, published on June 17, is a benchmarking tool built to measure how well AI systems handle actual life sciences research: not the sanitized textbook version, but the messy, multi-step, figure-laden work that PhD scientists do every day.
The benchmark includes 750 tasks spanning seven distinct research workflows, from evidence handling and analysis to experimental design, scientific reasoning, and communication.
What makes LifeSciBench different
The 750 tasks were authored and reviewed by 173 PhD-level scientists with backgrounds in biotechnology and pharmaceuticals. An additional 453 expert reviewers helped validate them. Tasks went through an average of six automated review cycles each, and expert consensus required at least 90% agreement before a task made it into the final set.
The tasks come loaded with 1,062 attached artifacts, including figures, PDFs, and datasets. That matters because real research doesn’t happen in clean text boxes. It happens in spreadsheets with missing columns, in blurry gel images, in 40-page supplementary files that nobody wants to read. LifeSciBench forces AI models to deal with all of it.
Of the tasks, 79% require multi-step reasoning, with an average of four reasoning steps per task. The assessment rubric contains 19,020 individual criteria evaluating the correctness, justification, and usefulness of AI-generated responses.
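For a sense of scale, 19,020 criteria spread over 750 tasks works out to roughly 25 criteria per task. The short Python sketch below does that arithmetic and then shows one simple way a rubric of pass/fail criteria could be rolled up into a task score. The `Criterion` class, the `task_score` function, and the unweighted "fraction of criteria met" rule are all illustrative assumptions; the article does not describe how OpenAI actually weights or aggregates its criteria.

```python
from dataclasses import dataclass

# Figures reported in the article.
TOTAL_TASKS = 750
TOTAL_CRITERIA = 19_020


@dataclass
class Criterion:
    """One rubric line item (hypothetical structure, not OpenAI's schema)."""
    dimension: str  # "correctness", "justification", or "usefulness"
    met: bool       # whether the model's response satisfied this criterion


def criteria_per_task(total_criteria: int, total_tasks: int) -> float:
    """Average number of rubric criteria attached to each task."""
    return total_criteria / total_tasks


def task_score(criteria: list[Criterion]) -> float:
    """Assumed scoring rule: unweighted fraction of criteria that were met."""
    if not criteria:
        raise ValueError("a task needs at least one criterion")
    return sum(c.met for c in criteria) / len(criteria)


# A made-up rubric for a single task, for illustration only.
example = [
    Criterion("correctness", True),
    Criterion("correctness", True),
    Criterion("justification", False),
    Criterion("usefulness", True),
]

print(round(criteria_per_task(TOTAL_CRITERIA, TOTAL_TASKS), 2))  # 25.36
print(task_score(example))  # 0.75
```

A real grading scheme could weight the three dimensions differently or penalize specific failures; the sketch only illustrates how fine-grained the rubric is relative to the number of tasks.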
The seven biological domains covered span the breadth of modern life sciences research. The seven workflow categories (evidence handling, analysis, design and optimization, scientific reasoning, validation and operations, translation, and scientific communication) map directly onto how scientists actually spend their time.
GPT-Rosalind and the competitive landscape
LifeSciBench serves as the primary measuring stick for GPT-Rosalind, OpenAI’s specialized life sciences model that was first introduced in April 2026.
According to OpenAI’s results, GPT-Rosalind leads other models on overall LifeSciBench scores. The competition it was measured against includes GPT-5.5, Grok 4.3, and Gemini 3.1 Pro.
LifeSciBench also joins a growing ecosystem of specialized scientific benchmarks. It complements MedChemBench for medicinal chemistry, GeneBench for genomics, and LabWorkBench for wet-lab troubleshooting, each evaluating token-efficient performance in its respective domain.
What this means for crypto and AI investors
There’s no direct crypto angle here. LifeSciBench is a pure AI research infrastructure play, and none of the major crypto-focused outlets have drawn connections to blockchain or decentralized science (DeSci) protocols in their coverage.
Still, the sheer scale of expert involvement (173 contributors and 453 reviewers) highlights something decentralized science protocols have been trying to solve: how to coordinate large numbers of domain experts around a shared research goal. OpenAI did it through traditional means, hiring and contracting. Whether token-incentivized coordination could achieve similar quality at similar scale remains one of DeSci’s biggest open questions.