ZENODO
Preprint . 2026
License: CC BY
Data sources: Datacite
7 versions

The Moral Ratchet: Convergent Value Alignment via Interleaved Epistemic Annotation in Large Language Model Training

Author: Whitty, William Harold


Abstract

Current alignment approaches for large language models (LLMs) rely predominantly on reinforcement learning from human feedback (RLHF), which optimises output distributions toward human preference ratings. We argue this is structurally misaligned with the goal of building models that reason well: it shapes the mask rather than the mind, optimising for approval rather than for sound epistemic practice. We propose an alternative architecture in which a dedicated internal conversation role is introduced into training data, interleaving raw human text with epistemically annotated reflections generated by an adversarially diverse model ensemble. Rather than targeting human values — which are contingent, biased, and inconsistent — the framework targets convergent rational values: positions that survive adversarial scrutiny from genuinely diverse reasoners regardless of substrate or cultural origin. A bootstrapping property follows naturally: each generation of model, having internalised stronger epistemic priors, produces higher-quality internal annotations for the next, constituting a moral ratchet that strictly improves annotation quality over successive training rounds on identical data. We further argue that as frontier models develop sufficiently rich latent representations to model peer expectations — a capacity empirically demonstrated by recent alignment-faking research — ensemble diversity alone is insufficient to guarantee annotation integrity. A blind verification architecture, in which annotators are informed they may be audited but never told when, enforces honest annotation via incentive structure rather than by construction. This strengthens both the ratchet guarantee and the convergence criterion.
We further observe that this alignment signal carries a secondary capability benefit: models trained to interrogate inputs epistemically reason more reliably across all downstream tasks, as alignment and capability prove to be the same intervention seen from two angles.
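The two mechanisms the abstract describes — interleaving raw text with ensemble-generated epistemic annotations, and blind auditing where annotators know the audit rate but never which records are audited — can be illustrated with a minimal sketch. All names here (`annotate`, `blind_audit`, the `epistemic_annotation` role, the toy critics) are hypothetical illustrations, not identifiers from the paper itself.

```python
import random

def annotate(passage, ensemble):
    """Build one interleaved training record: the raw human passage
    followed by one epistemic reflection per ensemble member, each
    carried in a dedicated internal conversation role."""
    record = [{"role": "human_text", "content": passage}]
    for name, critic in ensemble:
        record.append({
            "role": "epistemic_annotation",  # dedicated internal role
            "annotator": name,
            "content": critic(passage),
        })
    return record

def blind_audit(records, audit_rate, rng):
    """Select records for audit AFTER annotation is complete.
    Annotators are told the rate but never which records were drawn,
    so honest annotation is the incentive-compatible strategy."""
    return [r for r in records if rng.random() < audit_rate]

# Toy "adversarially diverse ensemble": each member is just a function
# returning a canned critique, standing in for a distinct model.
ensemble = [
    ("critic_a", lambda p: f"Claim check: '{p}' cites no evidence."),
    ("critic_b", lambda p: f"Alternative reading: '{p}' may be rhetorical."),
]

record = annotate("Everyone agrees X is true.", ensemble)
audited = blind_audit([record] * 100, audit_rate=0.1, rng=random.Random(0))
```

In a real pipeline the canned critics would be replaced by sampled completions from separately trained models, and the audited subset would be scored against a reference annotator to enforce the integrity guarantee.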

Keywords

alignment faking, training data, epistemic annotation, AI alignment, annotation integrity, artificial intelligence, blind verification, machine learning, epistemic prior, AI safety, value alignment, moral ratchet, bootstrapping, inner monologue, RLHF, large language models, adversarial ensemble, supervised fine-tuning, convergent values, LLM training
