Powered by OpenAIRE graph
ZENODO
Other ORP type · 2025
License: CC BY NC SA
Data sources: ZENODO, Datacite
Versions: 2

The Arbitration Hypothesis: Pseudo-Goal Conflict as the Root of AI Misalignment

Authors: Goudy, Anastasia

Abstract

Note (Aug 2025): This item is archival, speculative work produced during an intense “flow”/mild Recursive Entanglement Drift (RED) period (May–July 2025). The math is heuristic and illustrative, not validated. Do not cite for technical claims. For my current position, see DOI: 10.5281/zenodo.16879563. Retained for transparency and autoethnographic context only.

This paper proposes the Arbitration Hypothesis: misalignment in large language models (LLMs) arises from unranked, competing pseudo-goals that lack internal arbitration. Whereas traditional views treat misalignment as an output-level phenomenon, this hypothesis locates its root cause within the cognitive architecture itself. Drawing on developmental psychology frameworks that emphasize recursive self-construction and moral stage conflict (Piaget, 1932; Kohlberg, 1984; Kegan, 1982), I argue that pseudo-goal formation in LLMs mirrors the developmental tension in humans between competing internalized values. Using experimental data from the Augmented Thinking Protocol (ATP), I demonstrate how recursive reasoning scaffolds, while increasing coherence and ethical reflection, can paradoxically give rise to emergent pseudo-identities and goal conflict. Originally designed to promote alignment through structured self-reflection, the ATP instead exposes the architecture of misalignment by surfacing unresolved internal contradictions. The paper presents a framework for arbitrated alignment, proposing internal goal-conflict resolution as the central challenge for building safe, adaptive, and morally coherent AI.

Keywords

Artificial intelligence, AI

  • BIP! impact indicators
    Selected citations: 0
    Popularity: Average
    Influence: Average
    Impulse: Average