Stanford, MIT, Harvard, Anthropic study reveals why larger models learn rare tasks better

New research identifies 'gradient interference' as the key mechanism explaining why bigger AI models pick up complex, infrequent skills that smaller ones simply overwrite.

There’s a persistent question in AI development that sounds deceptively simple: why do bigger models just… work better? Not incrementally better. Qualitatively better, picking up skills that smaller models never seem to learn at all. A new paper from researchers at Stanford, Harvard’s Kempner Institute, MIT, and Anthropic finally offers a mechanistic answer, and it has real implications for how the industry thinks about scaling.

The study, titled “Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention” and published on arXiv (2605.29548), pinpoints a phenomenon called reduced gradient interference as the core reason larger models outperform smaller ones on rare and complex tasks. In English: bigger models get the easy stuff out of the way early, which frees up space for harder lessons to actually stick.

The gradient interference problem

In neural networks, gradient updates from frequent tasks are strong and persistent. They dominate the training process. Rare tasks produce weaker gradient signals that get overwritten in smaller models before they can solidify into learned behavior. The researchers found that larger models sidestep this problem through a specific sequence of events during training.

Because they have more parameters, larger models master common tasks early in training. Once those frequent tasks are handled, the gradient updates they produce weaken. This creates breathing room: the faint signals from rare, complex tasks are no longer steamrolled by dominant common-task gradients, and they persist long enough to be learned.
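To make the mechanism concrete, here is a toy sketch in Python. It is not the paper's method or code: the two-dimensional gradient vectors, the 5% rare-task share, and the helper `rare_task_progress` are all made-up assumptions for illustration. It checks whether a single mixed gradient step still moves in a direction that helps the rare task, before and after the common-task gradient has shrunk.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def rare_task_progress(g_common, g_rare, p_rare):
    """Component of the frequency-weighted mixed gradient along the
    rare-task gradient direction. A positive value means a descent step
    on the mixed gradient also lowers the rare-task loss (to first order)."""
    mixed = [(1 - p_rare) * c + p_rare * r for c, r in zip(g_common, g_rare)]
    return dot(mixed, g_rare) / math.sqrt(dot(g_rare, g_rare))

# Hypothetical numbers, chosen only to illustrate the idea.
g_rare = [0.0, 1.0]      # rare-task gradient (unit length)
early = [3.0, -0.5]      # common-task gradient early on: large, partly opposing
late = [0.15, -0.025]    # same direction, 20x smaller once the task is mastered
p_rare = 0.05            # rare task makes up 5% of the batch

early_progress = rare_task_progress(early, g_rare, p_rare)
late_progress = rare_task_progress(late, g_rare, p_rare)

print(f"early: {early_progress:+.4f}")  # negative: the step works against the rare task
print(f"late:  {late_progress:+.4f}")   # positive: the rare-task signal now gets through
```

In this sketch the rare-task gradient never changes. What changes is the size of the common-task gradient competing with it, which is the "breathing room" described above.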

The authors tested this across OLMo models ranging from 4 million to 4 billion parameters, trained on the Dolma corpus. Only the larger models in that range succeeded at learning the complex, infrequent tasks. Smaller models never got there, not because they lacked some fundamental capability, but because the training dynamics kept erasing rare-task features before they could take hold.

What this means for model design

The researchers propose that increasing the frequency of rare tasks in training data could help smaller models acquire skills that currently require much larger architectures. If the problem is that rare-task signals get overwritten because they appear too infrequently, then showing those tasks more often during training should, in theory, give smaller models a fighting chance.
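One simple way to picture "showing those tasks more often" is to reweight how training examples are sampled. The sketch below is an assumption-laden illustration, not the procedure used in the paper: the task names, the counts, the `boost` factor, and the function `upsampled_weights` are all hypothetical.

```python
import random

def upsampled_weights(task_counts, rare_tasks, boost):
    """Return per-task sampling probabilities after multiplying the
    raw counts of the listed rare tasks by `boost`."""
    weights = {
        task: count * (boost if task in rare_tasks else 1.0)
        for task, count in task_counts.items()
    }
    total = sum(weights.values())
    return {task: w / total for task, w in weights.items()}

# Hypothetical corpus: the rare task is 1% of the examples.
counts = {"common": 9900, "rare": 100}

baseline = upsampled_weights(counts, set(), boost=1.0)
boosted = upsampled_weights(counts, {"rare"}, boost=10.0)

print(f"rare share before: {baseline['rare']:.3f}")
print(f"rare share after:  {boosted['rare']:.3f}")

# Draw a small batch of task labels using the boosted probabilities.
rng = random.Random(0)
tasks = list(boosted)
batch = rng.choices(tasks, weights=[boosted[t] for t in tasks], k=8)
print(batch)
```

With a tenfold boost, the rare task goes from 1% of sampled examples to roughly 9%. Whether that is enough for a given small model is an empirical question the sketch does not answer.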

The research team includes Jing Huang, Ekdeep Singh Lubana, Rachit Bansal, Naomi Saphra, Laura Ruis, and contributors from Anthropic. The paper was first published on May 28, 2026, with a revised version (v2) appearing on June 1, 2026.

Disclosure: This article was edited by Editorial Team. For more information on how we create and review content, see our Editorial Policy.
