
We study a systematic failure mode in language models: when the true answer to a STEM question is surprising relative to training-data priors, models prefer plausible-sounding distractors over the correct answer. We build a 97-fact STEM benchmark spanning six domains (calculus, physics, chemistry, statistics, linear algebra, constants) and evaluate six models from GPT-2 (117M) to Qwen3-4B using log-probability multiple-choice ranking. Accuracy rises from 16% to 77% with scale, but systematic errors persist even at 4B parameters. We identify four bias patterns (positivity, linearity, missing-constant, truncation) that recur unchanged at every model scale. A transfer-matrix experiment shows zero cross-pattern generalization from single-pattern adapters, while mixed training achieves 70-100% per-pattern accuracy. Log-probability margin is a perfect binary oracle on the 40-fact probe set: a positive margin predicts a correct answer with 100% precision and recall, and margin magnitude tracks domain difficulty. v1.1 changes: expanded limitations section, replaced informal self-references with DOI citations, strengthened abstract opening, added GitHub link.
Part of the rho-eval / knowledge-fidelity research program. Paper 9 of 9. Code: https://github.com/SolomonB14D3/knowledge-fidelity
factual accuracy, model calibration, log-probability, language models, adapters, STEM benchmarks, multiple-choice evaluation, bias correction, knowledge fidelity
