ZENODO
Preprint · 2026
License: CC BY
Data sources: Datacite

Beyond Agent III: Empirical Evidence for Autonomous Dissonance Perception and the Closed-Loop Constraint

Author: Cui, Bo


Abstract

The first two papers in this series proposed Autonomous Dissonance Perception (ADP) as a theoretical framework: ADP I introduced the concept of detecting dissonance between external inputs and an LLM's internal world model; ADP II extended this to internal dissonance, arguing that contradictions within a model's parameters could serve as an engine for autonomous cognitive evolution. Both papers lacked empirical validation. This paper addresses that gap with four experiments, two theoretical contributions, and a diagnostic framework for hallucination detection. In Experiment 1, a 7-billion-parameter language model (Qwen2.5-7B-Instruct) produced statistically separable hidden-state patterns when processing consonant, dissonant, and nonsensical inputs, with a classification accuracy of 96.0% for distinguishing dissonant from nonsensical inputs. In Experiment 2, a matched-triad design controlling for topic confound revealed a layer-dependent dissonance signal peaking at the penultimate layer (layer −2, win rate 72.5%) but reversed at the final layer (layer −1, win rate 35%), initially suggesting that alignment training suppresses dissonance signals. Experiment 3 directly tested this interpretation by running an expanded 40-triad design on the base (pre-RLHF) version of the same model. The results overturn the assumption that alignment suppresses cognitive signals: the base model exhibited a nearly identical pattern (layer −2 win rate 72.5%, layer −1 win rate 37.5%), revealing that the layer-dependent structure is a universal architectural property of the Transformer. We term this the Unembedding Bottleneck—a geometric constraint imposed by the final layer's obligation to align hidden states with the vocabulary embedding space for next-token prediction. 
Experiment 4 directly confirms this hypothesis by measuring vocabulary-alignment geometry: the Subspace Projection Ratio drops by 95% between layer −2 and layer −1 (SPR: 0.73 → 0.04; Cohen's d > 23; steepness ratio > 2.6× the next-largest jump), and this cliff is identical across the Instruct and Base models. The Unembedding Bottleneck is thus no longer a hypothesis but a measured geometric fact. The penultimate layer, freed from this constraint, emerges as the last site of unconstrained semantic computation, where epistemic dissonance reaches peak detectability before collapsing into token probabilities. We formalize this as Layer −2 Criticality. We further propose the Closed-Loop Constraint, which requires that any resolution of internal dissonance (ADP II) produce a measurable improvement in external dissonance perception (ADP I). Together, these results unify the ADP framework into a falsifiable system grounded in the Transformer's intrinsic representational geometry rather than in post-training procedures. Finally, we propose the Cognitive State Quadrant, a dual-layer diagnostic framework that combines layer −2 dissonance scores with layer −1 token entropy to classify each generated token into four cognitive states (Reliable Knowledge, Epistemic Conflict, Nonsense, and Knowledge Boundary), providing a principled, representation-level approach to hallucination decomposition.
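To make the Subspace Projection Ratio concrete: the abstract does not spell out the paper's exact definition, but a natural reading is the fraction of a hidden state's squared norm that lies inside a low-dimensional "vocabulary subspace" spanned by the principal directions of the unembedding matrix. The sketch below is an illustration under that assumption, with toy dimensions and random data (all names, sizes, and the top-k construction are assumptions, not the paper's implementation):

```python
import numpy as np

def top_k_subspace(W_U, k):
    # The right-singular vectors of the unembedding matrix span the
    # directions it "reads out"; keep the top-k as an orthonormal basis
    # for the (assumed) vocabulary-aligned subspace.
    _, _, Vt = np.linalg.svd(W_U, full_matrices=False)
    return Vt[:k].T  # shape (d, k), orthonormal columns

def spr(h, basis):
    # Fraction of the hidden state's squared norm lying inside the
    # vocabulary-aligned subspace: ||P h||^2 / ||h||^2.
    proj = basis @ (basis.T @ h)
    return float(np.dot(proj, proj) / np.dot(h, h))

rng = np.random.default_rng(0)
d, vocab, k = 64, 256, 8          # toy sizes, not Qwen2.5-7B's
W_U = rng.normal(size=(vocab, d))  # stand-in unembedding matrix
basis = top_k_subspace(W_U, k)

h = rng.normal(size=d)                  # generic hidden state
h_aligned = basis @ rng.normal(size=k)  # state built inside the subspace

print(spr(h, basis))          # low: a random direction barely overlaps
print(spr(h_aligned, basis))  # ~1.0: fully vocabulary-aligned
```

Under this reading, the reported cliff (SPR 0.73 at layer −2 versus 0.04 at layer −1) would mean penultimate-layer states live largely outside the readout subspace, while final-layer states have been rotated almost entirely into it.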
