Inception Labs’ Mercury 2 outperforms Google’s DiffusionGemma in the race to replace autoregressive AI
Both models ditch token-by-token text generation for parallel diffusion, but Mercury 2 keeps its reasoning skills intact while doing it
The AI industry just got its first real horse race in diffusion-based language models, and the startup is beating the tech giant. Inception Labs’ Mercury 2, which launched in February 2026, is outperforming Google DeepMind’s DiffusionGemma on a metric that matters more than raw speed: maintaining sophisticated reasoning while generating text in parallel.
Here’s why that distinction is important. Traditional large language models, the kind powering ChatGPT and Claude, generate text one token at a time, left to right, like a typewriter. Diffusion language models (dLLMs) take a fundamentally different approach, generating multiple tokens simultaneously through a denoising process. In plain terms: instead of writing a sentence word by word, they sketch the whole thing at once and then refine it over a few passes, more like a painter than a typist.
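To make the painter-versus-typist contrast concrete, here is a toy sketch in Python. It is not Mercury 2's or DiffusionGemma's actual algorithm, which neither company has described here. The "denoiser" is a hypothetical stand-in that already knows the answer and assigns random confidence scores, and the names `toy_denoiser`, `diffusion_generate`, and `tokens_per_step` are invented for illustration. The only point is the step count: a typist needs one step per token, while a parallel refiner fills several blanks per pass.

```python
import random

MASK = "_"  # placeholder for a position that has not been filled in yet


def toy_denoiser(target, draft, rng):
    """Hypothetical stand-in for a trained model.

    For every still-masked position it proposes a token (here simply the
    known answer) together with a made-up confidence score.
    """
    return {
        i: (target[i], rng.random())
        for i, tok in enumerate(draft)
        if tok == MASK
    }


def diffusion_generate(target, tokens_per_step=3, seed=0):
    """Start from an all-masked draft and fill several slots per pass."""
    rng = random.Random(seed)
    draft = [MASK] * len(target)
    steps = 0
    while MASK in draft:
        proposals = toy_denoiser(target, draft, rng)
        # Commit the most confident proposals in one parallel pass.
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:tokens_per_step]:
            draft[i] = proposals[i][0]
        steps += 1
    return draft, steps


def autoregressive_generate(target):
    """Emit one token per step, strictly left to right."""
    out, steps = [], 0
    for tok in target:
        out.append(tok)
        steps += 1
    return out, steps


sentence = "the quick brown fox jumps over the lazy dog".split()
ar_text, ar_steps = autoregressive_generate(sentence)
dlm_text, dlm_steps = diffusion_generate(sentence, tokens_per_step=3)
print(ar_steps, dlm_steps)  # 9 steps for the typist, 3 passes for the painter
```

In a real model each pass is also more expensive than a single autoregressive step, so fewer passes does not translate one-to-one into wall-clock speedup.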
The numbers behind Mercury 2’s edge
Mercury 2 generates about 1,009 tokens per second when running on NVIDIA’s Blackwell GPUs. That throughput figure alone would be impressive, but Inception Labs paired it with pricing aimed at undercutting established competitors: $0.25 per million input tokens and $0.75 per million output tokens.
The company positions those rates as competitive against Claude Haiku 4.5 and GPT-5.2 Mini, both of which are already regarded as the budget-friendly, speed-focused options on the market.
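For a sense of scale, the quoted figures can be turned into a back-of-the-envelope estimate. The sketch below uses only the published rates and throughput; the request size (2,000 input tokens, 500 output tokens) is a made-up example, and the function names are illustrative. It ignores network latency, queueing, and time to first token, so treat the timing as a lower bound rather than a measurement.

```python
# Figures quoted in the article for Mercury 2.
INPUT_PRICE_PER_M = 0.25    # USD per million input tokens
OUTPUT_PRICE_PER_M = 0.75   # USD per million output tokens
TOKENS_PER_SECOND = 1009    # reported throughput on NVIDIA Blackwell GPUs


def request_cost(input_tokens, output_tokens):
    """Estimated cost in USD of one request at the quoted rates."""
    return (
        input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    ) / 1_000_000


def generation_seconds(output_tokens, tokens_per_second=TOKENS_PER_SECOND):
    """Idealized generation time, assuming the quoted throughput is sustained."""
    return output_tokens / tokens_per_second


# Hypothetical workload: a 2,000-token prompt producing a 500-token answer.
cost = request_cost(2_000, 500)
seconds = generation_seconds(500)
print(f"${cost:.6f} per request, about {seconds:.2f} s to generate")
```

At those rates the hypothetical request costs $0.000875, or a little under a dollar per thousand requests, and the answer would be generated in roughly half a second.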
Google DeepMind’s DiffusionGemma, which launched on June 10, 2026, as an experimental open-source model, is built on a 26B-parameter Gemma 4 mixture-of-experts (MoE) architecture and claims inference up to four times faster than standard autoregressive models. Both models use diffusion for parallel text generation rather than traditional token-by-token prediction. The difference is in what survives the switch: Mercury 2 appears to retain the reasoning capabilities of its predecessor models while delivering that parallel generation speed, whereas DiffusionGemma, still labeled experimental, hasn’t demonstrated the same balance. So far, only Mercury 2 seems to generate in parallel without sacrificing output quality on reasoning benchmarks.
A startup with serious backing
Inception Labs was founded in 2024 by a team that includes Stanford’s Stefano Ermon, a researcher whose work on diffusion models has been foundational to the field. In November 2025, the company raised $50 million in a round led by Menlo Ventures.
Google DeepMind chose to release DiffusionGemma as open-source, which follows a familiar playbook. Open-sourcing experimental models lets Google seed the developer ecosystem, gather feedback at scale, and iterate faster than a closed commercial product could.
What this means for investors
Neither Inception Labs nor Google DeepMind has any direct connection to crypto, blockchain protocols, or digital assets. There are no tokens associated with either project, no decentralized compute integrations, no on-chain inference layers.
The relevance is indirect and runs through compute infrastructure. The speed and cost profile of Mercury 2, in particular, makes it viable for real-time applications where latency is critical. If diffusion-based language models prove they can match or exceed autoregressive models on quality while dramatically beating them on speed, the entire inference infrastructure market could be repriced. Parallel generation workloads stress GPUs differently than sequential ones, which could change which hardware configurations are most valuable for decentralized compute providers.