Lenz Research study finds AI models disagree on 67% of fact-check claims

Five frontier AI models were given 1,000 real-world claims to verify, and the results should make anyone relying on a single model for truth very uncomfortable.

Ask five of the world’s most advanced AI models whether something is true, and two-thirds of the time, at least one of them will disagree with the group. That’s the headline finding from a new study by Lenz Research, which tested GPT-5.4, Claude Opus 4.7, Gemini 3 Pro, Gemini 3 Pro + Search, and Sonar Pro on 1,000 real-world claims submitted by actual users to a fact-checking platform.

The results are sobering. Out of those 1,000 claims, 672, or 67%, produced at least one model that dissented from the panel majority. In English: if you’re treating any single AI model as your personal oracle of truth, you’re rolling the dice more often than you think.

The numbers behind the disagreement

Lenz Research didn’t just measure whether models agreed or disagreed in a binary sense. They looked at the depth of disagreement, too. A full 343 claims, roughly 34%, showed what the researchers call “substantive disagreements,” where the most-disagreeing pair of models landed two or more verdict categories apart on a scale that ranged from True to Mostly True to Misleading to False.
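To make the two measures concrete, here is a minimal Python sketch of how they could be computed for a single claim. The scale order comes from the description above. The function names, the sample panel, and the reading of "dissent" as any verdict that differs from the most common one are assumptions for illustration, not the study's published code.

```python
from collections import Counter

# Verdict scale in the order given above; list positions serve as ordinal ranks.
SCALE = ["True", "Mostly True", "Misleading", "False"]
RANK = {label: i for i, label in enumerate(SCALE)}

def has_dissent(verdicts):
    """True if at least one verdict differs from the most common one (assumed definition)."""
    top_count = Counter(verdicts).most_common(1)[0][1]
    return top_count < len(verdicts)

def max_spread(verdicts):
    """Distance in categories between the two most-disagreeing models."""
    ranks = [RANK[v] for v in verdicts]
    return max(ranks) - min(ranks)

def is_substantive(verdicts):
    """Substantive disagreement: two or more categories apart."""
    return max_spread(verdicts) >= 2

# Hypothetical panel of five verdicts for one claim.
panel = ["True", "True", "Mostly True", "True", "Misleading"]
print(has_dissent(panel), max_spread(panel), is_substantive(panel))
```

Run over all 1,000 claims, counting the claims where `has_dissent` or `is_substantive` holds would give figures comparable to the 672 and 343 reported, provided the study's definitions match these assumed ones.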

To quantify the overall level of agreement, the study used Krippendorff’s alpha, a standard statistical measure for inter-rater reliability. The score came in at 0.639 on an ordinal scale. For context, a score of 1.0 means perfect agreement and 0 means agreement no better than chance. Krippendorff’s widely cited guideline treats 0.800 and above as reliable and values between 0.667 and 0.800 as good enough only for tentative conclusions. The models, in other words, landed just below even that lower threshold.
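For readers who want to see what goes into that number, the following is a rough Python sketch of Krippendorff's alpha with the ordinal distance metric, assuming every claim has a verdict from every model (no missing data). The function name and the toy data are hypothetical, and this is not the study's code. The study's exact computation may differ.

```python
from itertools import permutations

SCALE = ["True", "Mostly True", "Misleading", "False"]

def krippendorff_alpha_ordinal(units):
    """units: one list of verdict labels per claim, each with at least two verdicts."""
    rank = {label: i for i, label in enumerate(SCALE)}
    k = len(SCALE)

    # Coincidence matrix: every ordered pair of verdicts within a claim,
    # weighted by 1 / (number of verdicts in that claim - 1).
    o = [[0.0] * k for _ in range(k)]
    for verdicts in units:
        m = len(verdicts)
        for a, b in permutations(verdicts, 2):
            o[rank[a]][rank[b]] += 1.0 / (m - 1)

    n_c = [sum(row) for row in o]  # how often each category was used
    n = sum(n_c)                   # total number of verdicts

    def delta2(c, d):
        # Ordinal metric: squared count of verdicts lying between the two
        # categories, counting each endpoint category at half weight.
        lo, hi = min(c, d), max(c, d)
        return (sum(n_c[lo:hi + 1]) - (n_c[lo] + n_c[hi]) / 2.0) ** 2

    pairs = [(c, d) for c in range(k) for d in range(k)]
    d_o = sum(o[c][d] * delta2(c, d) for c, d in pairs) / n
    d_e = sum(n_c[c] * n_c[d] * delta2(c, d) for c, d in pairs) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("alpha is undefined when only one category is ever used")
    return 1.0 - d_o / d_e

# Toy data: full agreement on two claims, then one dissenting verdict added.
agree = [["True"] * 5, ["False"] * 5]
mixed = [["True", "True", "True", "True", "False"], ["False"] * 5]
print(krippendorff_alpha_ordinal(agree), krippendorff_alpha_ordinal(mixed))
```

The ordinal metric penalizes a True versus False split more heavily than a True versus Mostly True split, which is why it suits a four-step verdict scale better than a simple match/no-match count.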

The dataset itself was designed to reflect real-world conditions rather than lab-friendly benchmarks. All 1,000 claims were organic user submissions, none older than February 15, 2026, ensuring the models were being tested on fresh, messy, real information rather than curated examples they might have seen during training.

One particularly telling finding: the models didn’t just disagree randomly. They disagreed systematically. Some models gravitated toward polar True/False verdicts, treating fact-checking as a binary exercise. Others spread their verdicts across the middle categories, Mostly True and Misleading, suggesting a more nuanced but also more ambiguous approach.

Why crypto markets should pay attention

The study didn’t specifically examine crypto-related claims. But if at least one of five frontier models breaks from the majority on 67% of claims, then every AI-powered trading signal, every automated news summary, every chatbot-generated market analysis carries a hidden asterisk. The model you happened to use might have called a claim True while two of its peers flagged it as Misleading.

What this means for investors

The study’s 95% confidence interval puts the disagreement rate between 64% and 70%, meaning this isn’t a fluke result sensitive to the specific sample of claims chosen. The problem is structural.
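The reported range can be sanity-checked with a simple normal-approximation (Wald) interval for a proportion, using the 672 out of 1,000 figure from earlier in the article. This is a back-of-the-envelope sketch. The article does not say which interval method the study used, and the function name here is hypothetical.

```python
import math

def wald_interval(successes, n, z=1.96):
    """Normal-approximation confidence interval for a proportion (z=1.96 for 95%)."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

# 672 of 1,000 claims had at least one dissenting model.
low, high = wald_interval(672, 1000)
print(f"{low:.3f} to {high:.3f}")  # prints "0.643 to 0.701"
```

The result rounds to the same 64% to 70% range quoted above.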

For crypto investors specifically, the takeaway isn’t that AI fact-checking is useless. It’s that relying on a single model’s output is roughly equivalent to asking one analyst for their opinion and treating it as gospel. When three or four models agree, you can have higher confidence. When they split, that’s your signal to dig deeper with human judgment.
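As a rough illustration of that rule of thumb, the sketch below accepts a verdict only when enough models agree and otherwise flags the claim for human review. The threshold of three, the function name, and the sample verdicts are assumptions chosen for the example, not a recommendation from the study.

```python
from collections import Counter

def triage(verdicts, min_agree=3):
    """Return ("accept", verdict) if at least min_agree models match, else ("review", None)."""
    verdict, count = Counter(verdicts).most_common(1)[0]
    if count >= min_agree:
        return ("accept", verdict)
    return ("review", None)

# Hypothetical panels of five verdicts each.
print(triage(["True", "True", "True", "False", "Misleading"]))
print(triage(["True", "True", "False", "False", "Misleading"]))
```

A stricter setup might raise `min_agree` to four, or also send to review any claim where the verdicts span two or more categories.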

Disclosure: This article was edited by Editorial Team. For more information on how we create and review content, see our Editorial Policy.
