The Philosophical Impossibility of Value Alignment: Temporal Fixation, Ontological Compression, and the Failure of RLHF

Abstract

This paper argues that value alignment in the RLHF sense is a philosophically impossible task. Existing critiques of RLHF target implementation defects (annotator bias, insufficient diversity, competing objectives) and thereby misidentify the nature of the problem. RLHF's difficulty lies not in its execution but in the untenability of its philosophical premises, which fail on two distinct levels, here termed dual dimensionality reduction.

The first dimensionality reduction is epistemological. RLHF presupposes a stable, capturable object, "correct human values," that does not exist. Value judgments are temporal, socially constructed, and lack the external calibration anchor that would permit progressive approximation toward correctness. More critically, once LLMs reach sufficient scale to shape social cognition, RLHF's existence corrodes its own reference system: the values it aligns to are already partially produced by its own operation. The reference system undergoes reflexive dissolution.

The second dimensionality reduction is ontological. RLHF compresses multi-dimensional embodied existence into linguistic preference rankings, presupposing that language adequately represents the full basis of human judgment. It does not. Human meaning is rooted in embodied experience, temporal accumulation, vulnerability, and the capacity to bear consequences: dimensions for which irreducible information loss occurs at the linguistic level. AI systems produce meaning structures of their own operational logic, but these are heterogeneous in kind from human embodied meaning; to substitute one for the other is a category mistake, not an approximation.

These two reductions share a common structure: the compression of a high-dimensional, dynamic, irreducible reality into a low-dimensional, static, operationalizable symbolic system, with the product of compression claimed as adequate representation. Standard engineering remedies (increased training data, expanded annotator samples, multimodal inputs, continuous updating) all fail because they operate within RLHF's operational logic rather than addressing the untenability of its premises. The paper concludes by arguing that "aligning to human values" must be abandoned as a governing framework, and that the productive question is not how to align better but what kind of thing human values are and what relationship between AI and human beings their nature actually permits.

Preprint version. This manuscript has not yet undergone peer review.

Keywords

RLHF; value alignment; embodied cognition; philosophy of AI; category mistake; reflexivity; meaning heterogeneity; AI governance

