Biohub unveils AI world model for protein biology, aiming to reshape drug discovery
The Zuckerberg-Chan backed nonprofit releases three open-source models trained on billions of protein sequences, part of a $500 million push into virtual biology.
Think of proteins as the Lego bricks of life, tiny molecular machines that do everything from fighting infections to building muscle. The problem is there are billions of possible configurations, and figuring out which ones matter for treating disease has traditionally been about as efficient as assembling a 10,000-piece set without the instruction manual.
Biohub, the nonprofit biomedical research organization backed by Mark Zuckerberg and Priscilla Chan, just released what it calls a “world model of protein biology.” The open-source system, announced on May 27, combines three AI models designed to predict protein structures, design functional binders, and map the protein universe at a scale that would make previous efforts look like a rough sketch on a napkin.
What Biohub actually built
The release includes three distinct tools, each handling a different piece of the protein puzzle.
First is ESMC, a protein language model trained on billions of sequences. In English: it learned the “grammar” of proteins by reading an enormous library of them, letting it predict how unfamiliar proteins might behave based on patterns it has absorbed.
Second is ESMFold2, which focuses on structure prediction and design. Where ESMC understands the vocabulary, ESMFold2 figures out the 3D shapes. This matters because a protein’s shape determines its function, much like how a key’s shape determines which lock it opens.
Third is the ESM Atlas, a massive resource containing 6.8 billion protein sequences and 1.1 billion predicted structures. That is not a typo. Billion, with a B. To put that in perspective, scientists have experimentally determined the structures of roughly 200,000 proteins over the past several decades. The ESM Atlas multiplies that knowledge base by a factor of roughly 5,000.
All three models are available under a permissive MIT license, meaning academic researchers, startups, and large pharmaceutical companies can all use them freely. Biohub is also offering hosted access and compute credits through its platform, lowering the barrier for labs that lack the GPU firepower to run these models on their own hardware.
The $500 million bet on virtual biology
This release does not exist in a vacuum. It is part of Biohub’s broader $500 million Virtual Biology Initiative, which aims to build open datasets and sophisticated cell-level modeling tools for the global scientific community.
Earlier ESM models were implemented in 2024, establishing the foundational architecture. The new suite represents a significant leap forward, integrating multiple capabilities into a cohesive system rather than offering isolated tools.
Alex Rives, who leads the science effort at Biohub, has been driving the development of these protein language models. The team’s approach treats biological sequences like text, applying transformer architectures (the same technology underpinning large language models like GPT) to the molecular world. The difference is that instead of predicting the next word in a sentence, these models predict the next amino acid in a protein chain, or the shape that chain will fold into.
Biohub says its new models have already demonstrated therapeutic-level binder affinity in lab trials. In English: the protein binders designed by the AI stuck to their targets well enough to potentially work as drugs. That is a meaningful validation, though it is worth noting the distance between lab results and approved therapies remains substantial.