ZENODO
Preprint . 2026
License: CC BY
Data sources: Datacite

The Philosophical Impossibility of Value Alignment: Temporal Fixation, Ontological Compression, and the Failure of RLHF

Authors: Liu, Echo

Abstract

This paper argues that value alignment in the RLHF sense is a philosophically impossible task. Existing critiques of RLHF target implementation defects (annotator bias, insufficient diversity, competing objectives) and thereby misidentify the nature of the problem. RLHF's difficulty lies not in its execution but in the untenability of its philosophical premises, which fail on two distinct levels, here termed dual dimensionality reduction.

The first dimensionality reduction is epistemological. RLHF presupposes a stable, capturable object, "correct human values", that does not exist. Value judgments are temporal and socially constructed, and they lack the external calibration anchor that would permit progressive approximation toward correctness. More critically, once LLMs reach sufficient scale to shape social cognition, RLHF's existence corrodes its own reference system: the values it aligns to are already partially produced by its own operation. The reference system undergoes reflexive dissolution.

The second dimensionality reduction is ontological. RLHF compresses multi-dimensional embodied existence into linguistic preference rankings, presupposing that language adequately represents the full basis of human judgment. It does not. Human meaning is rooted in embodied experience, temporal accumulation, vulnerability, and the capacity to bear consequences, dimensions that suffer irreducible information loss when expressed at the linguistic level. AI systems produce meaning structures that follow their own operational logic, but these are heterogeneous in kind from human embodied meaning; to substitute one for the other is a category mistake, not an approximation.

These two reductions share a common structure: the compression of a high-dimensional, dynamic, irreducible reality into a low-dimensional, static, operationalizable symbolic system, with the product of compression claimed as adequate representation.

Standard engineering remedies (increased training data, expanded annotator samples, multimodal inputs, continuous updating) all fail because they operate within RLHF's operational logic rather than addressing the untenability of its premises. The paper concludes by arguing that "aligning to human values" must be abandoned as a governing framework, and that the productive question is not how to align better but what kind of thing human values are and what relationship between AI and human beings their nature actually permits.

Preprint version. This manuscript has not yet undergone peer review.

Keywords

RLHF; value alignment; embodied cognition; philosophy of AI; category mistake; reflexivity; meaning heterogeneity; AI governance
