Beyond Benchmarks: Disagreement Among Frontier LLMs on Real-World Fact-Checks

On 67% of 1,000 recent real-user fact-check claims, a panel of five frontier LLMs splits: at least one model dissents from the majority verdict, or no strict majority forms at all. The five models (GPT-5.4, Claude Opus 4.7, Gemini 3 Pro, Gemini 3 Pro + Search, Sonar Pro) were each given the same claim and asked to pick a verdict from a 4-bucket rubric (True / Mostly True / Misleading / False). Because exactly one bucket can be correct per claim, any disagreement among the panel means at least one model has assigned an incorrect label.

Key findings:

- 67% of claims (672/1,000; 95% CI 64–70%) have at least one frontier model dissenting from the panel majority, or no strict majority forming at all.
- 34% of claims (343/1,000; 95% CI 31–37%) involve a substantive disagreement, defined as a gap of at least 2 buckets between the most-disagreeing pair of frontier verdicts.
- Krippendorff's α (ordinal) = 0.639 across 5 raters on 1,000 items, indicating nontrivial but limited agreement.
- Unanimity concentrates at the True/False poles: of 328 unanimous claims, only 4 are unanimous-Misleading and 0 are unanimous-Mostly-True.

The claims are real recent submissions to Lenz, a fact-checking platform, rather than curated benchmarks, so the disagreement is contamination-resistant by construction. No LLM grader is used; all measurements derive from direct parsed-label equality across the 5 verdicts. Every reported rate carries a Wilson 95% CI.

This deposit contains the v1.0 PDF snapshot. Full per-claim CSV, HTML rendering, methodology, and changelog: https://lenz.io/research/llm-disagreement
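To make the measurements above concrete, the following is a minimal Python sketch of how a single claim's five verdicts could be classified and how a Wilson 95% interval could be computed from a count. It is an illustration under stated assumptions, not the authors' pipeline: the function names `classify_panel` and `wilson_interval` are hypothetical, the ordinal bucket order is assumed to follow the order in which the rubric is listed, and a "strict majority" is assumed to mean more than half of the five verdicts.

```python
from collections import Counter
from math import sqrt

# ASSUMPTION: the rubric is ordinal in the order it is listed in the abstract.
BUCKETS = ["True", "Mostly True", "Misleading", "False"]


def classify_panel(verdicts):
    """Summarize one claim's panel of parsed verdict labels (hypothetical helper)."""
    counts = Counter(verdicts)
    top_label, top_count = counts.most_common(1)[0]
    # ASSUMPTION: strict majority = more than half of the panel (3+ of 5).
    has_strict_majority = top_count > len(verdicts) / 2
    ranks = [BUCKETS.index(v) for v in verdicts]
    max_gap = max(ranks) - min(ranks)  # gap between the most-disagreeing pair
    return {
        "unanimous": len(counts) == 1,
        "strict_majority": top_label if has_strict_majority else None,
        # A dissent from the majority, or no strict majority, both mean
        # the panel is not unanimous.
        "split": len(counts) > 1,
        "max_gap": max_gap,
        "substantive": max_gap >= 2,
    }


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Illustrative panel (made-up verdicts, not taken from the dataset).
summary = classify_panel(["True", "True", "Mostly True", "Misleading", "True"])
low, high = wilson_interval(672, 1000)
print(summary["split"], summary["substantive"], round(low * 100), round(high * 100))
```

Under these assumptions, `wilson_interval(672, 1000)` and `wilson_interval(343, 1000)` round to the 64–70% and 31–37% intervals reported above. The "split" condition is also equivalent to non-unanimity, which matches the reported counts: 1,000 − 328 unanimous claims = 672.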
