
Current alignment approaches for large language models (LLMs) rely predominantly on reinforcement learning from human feedback (RLHF), which optimises output distributions toward human preference ratings. We argue this is structurally misaligned with the goal of building models that reason well: it shapes the mask rather than the mind, optimising for approval rather than for sound epistemic practice. We propose an alternative architecture in which a dedicated internal conversation role is introduced into the training data, interleaving raw human text with epistemically annotated reflections generated by an adversarially diverse model ensemble. Rather than targeting human values, which are contingent, biased, and inconsistent, the framework targets convergent rational values: positions that survive adversarial scrutiny from genuinely diverse reasoners regardless of substrate or cultural origin. A bootstrapping property follows naturally: each model generation, having internalised stronger epistemic priors, produces higher-quality internal annotations for the next, constituting a moral ratchet that strictly improves annotation quality over successive training rounds on identical data. We further argue that as frontier models develop latent representations rich enough to model peer expectations, a capacity empirically demonstrated by recent alignment-faking research, ensemble diversity alone cannot guarantee annotation integrity. A blind verification architecture, in which annotators are informed they may be audited but never told when, enforces honest annotation through incentive structure rather than by construction, strengthening both the ratchet guarantee and the convergence criterion. Finally, we observe that this alignment signal carries a secondary capability benefit: models trained to interrogate inputs epistemically reason more reliably across all downstream tasks, because alignment and capability prove to be the same intervention seen from two angles.
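
To make the proposed data format concrete, the following is a minimal Python sketch, under assumptions of our own, of how raw human text might be interleaved with role-tagged ensemble reflections and how blind audit targets might be drawn. Every name here (EnsembleMember, ANNOTATOR_ROLE, select_audits, the 5% audit rate) is an illustrative assumption, not the paper's implementation.

```python
"""Illustrative sketch only: hypothetical names, not the paper's pipeline."""
import random
from dataclasses import dataclass

ANNOTATOR_ROLE = "epistemic_annotator"  # the dedicated internal conversation role
AUDIT_RATE = 0.05                       # assumed audit fraction; annotators never see it

@dataclass
class EnsembleMember:
    """Stand-in for one adversarially diverse annotator model."""
    name: str

    def reflect(self, segment: str) -> str:
        # Placeholder: a real member would generate an epistemic critique here.
        return f"[{self.name}] reflection on: {segment[:40]}"

def interleave(corpus, ensemble):
    """Interleave raw human text with role-tagged reflections from the ensemble."""
    examples = []
    for doc in corpus:                       # a doc is a list of text segments
        segments = []
        for segment in doc:
            segments.append({"role": "human_text", "content": segment})
            for member in ensemble:          # one reflection per ensemble member
                segments.append({"role": ANNOTATOR_ROLE,
                                 "model": member.name,
                                 "content": member.reflect(segment)})
        examples.append(segments)
    return examples

def select_audits(examples, rate=AUDIT_RATE, seed=None):
    """Blind verification: the audit sample is drawn only after annotation,
    so no annotator can condition its behaviour on being watched."""
    rng = random.Random(seed)
    pool = [(i, j) for i, ex in enumerate(examples)
            for j, seg in enumerate(ex)
            if seg["role"] == ANNOTATOR_ROLE]
    k = max(1, int(rate * len(pool)))
    return rng.sample(pool, k) if pool else []

if __name__ == "__main__":
    ensemble = [EnsembleMember("model_a"), EnsembleMember("model_b")]
    data = interleave([["The earth is flat.", "Vaccines cause autism."]], ensemble)
    print(select_audits(data, seed=0))       # hidden (example, annotation) audit targets
```

The key design point the sketch tries to capture is that audit selection happens strictly downstream of annotation, so the incentive to annotate honestly applies uniformly rather than only when an annotator believes it is being observed.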
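Continuing the same sketch, the bootstrapping ratchet can be expressed as a loop over generations: the identical raw corpus is re-annotated each round by an ensemble that now includes the previous generation's model. The train_generation callable below is a hypothetical placeholder for whatever supervised fine-tuning step the framework would use; it is assumed, not specified by the paper.

```python
# Hypothetical sketch of the moral ratchet, reusing interleave/EnsembleMember
# from the sketch above. train_generation(data) must return an object with the
# same name/reflect interface as EnsembleMember (an assumption of this sketch).
def bootstrap(corpus, ensemble, train_generation, rounds=3):
    model = None
    for _ in range(rounds):
        data = interleave(corpus, ensemble)   # re-annotate the identical data
        model = train_generation(data)        # placeholder fine-tuning call
        ensemble = ensemble + [model]         # next round annotates with it too
    return model

if __name__ == "__main__":
    # Toy stand-in: "training" just mints a new ensemble member.
    generation = [0]
    def toy_train(data):
        generation[0] += 1
        return EnsembleMember(f"generation_{generation[0]}")
    final = bootstrap([["Claim to scrutinise."]], [EnsembleMember("seed")], toy_train)
    print(final.name)  # generation_3
```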
Keywords: alignment faking, training data, epistemic annotation, AI alignment, annotation integrity, artificial intelligence, blind verification, machine learning, epistemic prior, AI safety, value alignment, moral ratchet, bootstrapping, inner monologue, RLHF, large language models, adversarial ensemble, supervised fine-tuning, convergent values, LLM training
