Powered by OpenAIRE graph
Preprint
Data sources: ZENODO

Prior Substitution: How Language Models Default to Training Priors When Categorical Rules Are Ignored

Authors: Chakraborty, Sameer

Abstract

We document prior substitution, a systematic failure mode in which language models that cannot apply a categorical rule default to the answer most frequent in their training data rather than failing randomly. Using MendEval, a 240-problem benchmark spanning genetics, organic chemistry, stoichiometry, and Bayesian inference that uses fictional entities to control for training-data contamination, we show that wrong answers cluster on specific high-frequency answers in categorical domains (up to 100% of errors on a single answer) but scatter uniformly in continuous domains. The pattern persists across five models from Mistral-7B to GPT-4o; larger models default to different but equally dominant priors, learning better shortcuts rather than better reasoning. We introduce the Rule Utilization Score (RUS), a mechanistic probe based on sequence log-probability under rule-swapped and rule-scrambled prompts, and show that problems on which the model ignores the rule (low RUS) are significantly more likely to trigger prior substitution (p = 0.001). Crucially, chain-of-thought prompting does not eliminate the effect: even with step-by-step reasoning, GPT-4o still defaults to a single mechanism on 100% of its wrong organic chemistry answers. Prior substitution is distinct from random failure, general capability limitations, and dataset artifacts; it is a predictable failure mode tied to the structure of the answer space and the training distribution.
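The two quantities in the abstract can be illustrated with a short sketch. This is not the paper's code: the abstract gives neither the exact clustering statistic nor the RUS formula, so both functions below are assumptions. `error_concentration` measures the share of wrong answers that land on the single most common wrong answer (1.0 would correspond to "100% of errors on a single answer"). `rule_utilization_score` is a hypothetical form of RUS: the drop in the gold answer's sequence log-probability when the rule in the prompt is swapped or scrambled, averaged over the two perturbations. The log-probabilities are taken as inputs; obtaining them from a model is out of scope here.

```python
from collections import Counter
from statistics import mean


def error_concentration(predictions, gold):
    """Return (modal_wrong_answer, share) over all wrong predictions.

    share = (count of the most common wrong answer) / (number of wrong answers).
    Values near 1.0 indicate clustering on one answer; a uniform scatter over
    k distinct wrong answers gives roughly 1/k. Returns (None, 0.0) if there
    are no errors.
    """
    wrong = [p for p, g in zip(predictions, gold) if p != g]
    if not wrong:
        return None, 0.0
    answer, count = Counter(wrong).most_common(1)[0]
    return answer, count / len(wrong)


def rule_utilization_score(logp_original, logp_swapped, logp_scrambled):
    """Hypothetical RUS (assumed form, not the paper's definition).

    Each argument is the sequence log-probability of the gold answer under
    the original, rule-swapped, and rule-scrambled prompt. If the model uses
    the rule, perturbing it should make the gold answer less likely, giving a
    large positive score; a score near 0 means the answer barely depends on
    the rule, i.e. the model is ignoring it.
    """
    return logp_original - mean([logp_swapped, logp_scrambled])


if __name__ == "__main__":
    # Toy mechanism labels, purely illustrative.
    preds = ["SN2", "SN2", "E1", "SN2", "SN1"]
    gold = ["SN1", "SN2", "E2", "E1", "SN1"]
    print(error_concentration(preds, gold))            # 2 of 3 errors are "SN2"
    print(rule_utilization_score(-1.2, -4.0, -3.6))    # rule matters: large score
    print(rule_utilization_score(-2.0, -2.1, -1.9))    # rule ignored: near zero
```

Averaging the swapped and scrambled conditions is one design choice among several; taking the minimum of the two drops would be a stricter variant that requires both perturbations to hurt the gold answer.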
