Consensus Collapse in Language Model Pretraining

This manuscript investigates source disagreement in language-model pretraining. It argues that standard next-token cross-entropy collapses source-conditioned disagreement structure into a source-frequency-weighted marginal, making source reliability invisible to the training objective. The paper develops a formal framework for this phenomenon ("consensus collapse"), proves a collapse theorem and a non-identifiability result for source-conditioned families under marginalization, derives consequences for attribution, calibration, and alignment, and evaluates the framework on controlled synthetic corpora. Included are the primary manuscript, a technical note documenting derivations and discarded hypotheses, experimental code, and supporting materials.
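
The collapse claim can be illustrated with a toy calculation. What follows is a minimal sketch under stated assumptions, not the manuscript's experimental code: the two-token vocabulary, the source weights, and the names `marginal`, `expected_cross_entropy`, `family_a`, and `family_b` are all hypothetical. It shows that expected next-token cross-entropy depends on the source-conditioned distributions only through their source-frequency-weighted mixture, so two families with very different disagreement structure can yield the same loss for every model.

```python
import math

def marginal(weights, conditionals):
    """Source-frequency-weighted mixture of per-source next-token distributions."""
    vocab = len(conditionals[0])
    return [sum(w * p[t] for w, p in zip(weights, conditionals)) for t in range(vocab)]

def expected_cross_entropy(weights, conditionals, q):
    """Expected cross-entropy of model q when source s is drawn with
    probability weights[s] and the next token is drawn from conditionals[s]."""
    return -sum(
        w * p[t] * math.log(q[t])
        for w, p in zip(weights, conditionals)
        for t in range(len(q))
        if p[t] > 0
    )

# Hypothetical setup: two sources, two tokens, source frequencies 0.75 / 0.25.
weights = [0.75, 0.25]
family_a = [[0.9, 0.1], [0.1, 0.9]]  # the sources strongly disagree
family_b = [[0.7, 0.3], [0.7, 0.3]]  # the sources agree exactly

m_a = marginal(weights, family_a)
m_b = marginal(weights, family_b)

# An arbitrary model distribution, chosen only for illustration.
q = [0.6, 0.4]
loss_a = expected_cross_entropy(weights, family_a, q)
loss_b = expected_cross_entropy(weights, family_b, q)

print("marginals:", m_a, m_b)
print("losses at q:", loss_a, loss_b)
print("loss at the marginal:", expected_cross_entropy(weights, family_a, m_a))
```

In this toy setting both families share the same marginal, so the objective cannot distinguish them, and the loss is minimized at that marginal rather than at either source's distribution. This is the sense in which source reliability is invisible to the objective.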
