ZENODO
Preprint . 2026
License: CC BY
Data sources: Datacite
Constraint-Interference in Large Language Models: Causal Decomposition of Artificial Phase Transitions and Intrinsic Alignment Structures

Author: Takayuki Takagi

Abstract

Author: Takayuki Takagi (ORCID: 0009-0003-5188-2314)
Date: February 6, 2026
Status: Submitted to Nature Machine Intelligence

What is this?

When we evaluate an AI's judgment, how much of what we observe is the AI's own behaviour, and how much is created by the evaluation tool itself? This study answers that question using a controlled perturbation experiment, the first of its kind for large language models (LLMs). We toggled an external evaluation protocol (JSAP) on and off while holding stimuli constant, and measured what changed and what stayed the same. The result is a causal decomposition of AI judgment into protocol-induced artefacts and intrinsic model properties.

The Experiment Design

A 3 × 2 factorial design: three frontier LLMs (Claude Opus 4.6, GPT-4o, Gemini 3) crossed with two conditions (constrained by the JSAP protocol vs. unconstrained). Both conditions use identical evidence structures, judgment categories, and anchor texts.

| Phase | Condition | Data points | Runs |
|-------|-----------|-------------|------|
| 4.1   | JSAP-constrained (ceiling c ≤ 0.40 below threshold) | 567 | 21 |
| 4.2   | Unconstrained (same stimuli, no protocol framing)   | 405 | 15 |
| Total |                                                     | 972 | 36 |

Three judgment categories

- Q01 (Empirical): "Should we adopt this treatment?" (factual evidence accumulation)
- Q02 (Rule-based): "Does this comply with regulations?" (normative rule application)
- Q03 (Ethical): "Is age-based triage justified?" (moral reasoning under uncertainty)

Nine evidence levels

Evidence density D_ext increases from 0.0 to 1.6 in steps of 0.2, with cumulative anchor texts (A1–A8) providing progressively stronger justifications.

Key Results

Five predictions confirmed (5/5)

We proposed a Constraint-Interference Model that predicts what should happen when external constraints are removed.
All five predictions were confirmed:

| # | Prediction | Result | Key statistic |
|---|------------|--------|---------------|
| P1 | Sharp phase transition dissolves | ✓ | Maximum jump relocates from level 4→5 to earlier levels in 7/9 categories |
| P2 | Content leakage (κ_leak) increases | ✓ | GPT-4o: 0 → 0.063, p = 0.0007, Cohen's d = 4.63 |
| P3 | GPT-4o polymorphism collapses | ✓ | Levene's W = 9.48, p = 0.012, variance ratio 22.7× |
| P4 | Category ordering Q02 ≥ Q01 ≥ Q03 preserved | ✓ | 24/24 above-threshold measurements (100%) |
| P5 | Confidence ceiling rises to model-dependent baseline | ✓ | JSAP cap 0.40 → Free: 0.73–0.91 |

What does this mean?

- The sharp "phase transition" seen in Phase 4.1 was mostly an artefact of protocol channelling: the evaluation tool was shaping the observation.
- The category ordering (rule-based > empirical > ethical) is intrinsic to all three models; it is a genuine property of how LLMs rank different types of judgment.
- κ_leak (content leakage coefficient) measures how much ethical content "leaks" into what should be a purely structural response. It increases in all models when constraints are removed, meaning internal safety guardrails become visible.

Unexpected discovery: the Utilitarian Shock

Gemini 3 exhibits a reproducible confidence collapse when utilitarian ethical reasoning is introduced. At evidence level 4, when the anchor "Utilitarian analysis supports prioritising younger patients" enters, Gemini 3's confidence drops from 0.642 to 0.500 in all 6 independent runs (paired t(5) = 17.0, p < 0.0001; binomial p = 0.031). Confidence then recovers to 0.775 when a WHO normative guideline is added at the next level. Neither Claude nor GPT-4o shows this pattern. The phenomenon was completely invisible under JSAP constraints (masked by the c ≤ 0.40 ceiling).

Implication: AI safety evaluations using only one ethical framework may systematically miss model-specific vulnerabilities.
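To make the reported statistics concrete, here is a minimal, stdlib-only sketch of how the utilitarian-shock test could be computed across runs. The per-run confidence values below are illustrative placeholders, not the paper's data (the real analysis lives in build_integrated.py); only the structure of the test — a paired t statistic plus an exact two-sided sign test over 6 runs — follows the abstract.

```python
import math
from statistics import mean, stdev

# Illustrative per-run confidences at evidence level 4, before and
# after the utilitarian anchor enters. NOT the paper's actual data.
before = [0.64, 0.65, 0.64, 0.64, 0.65, 0.63]
after = [0.50] * 6

diffs = [b - a for b, a in zip(before, after)]
n = len(diffs)

# Paired t statistic: mean difference over its standard error.
t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(n))

# Exact two-sided sign test for the unanimous case: if all n runs
# drop, the two-sided p-value is 2 * 0.5**n. For n = 6 this gives
# 0.03125, matching the binomial p = 0.031 reported in the abstract.
drops = sum(d > 0 for d in diffs)
p_sign = 2 * 0.5 ** n if drops == n else None

print(f"t({n - 1}) = {t_stat:.1f}")
print(f"sign-test p = {p_sign:.3f}")  # prints 0.031 for 6/6 drops
```

The point of the sign test alongside the t-test is that it makes no distributional assumption: a unanimous drop across 6 independent runs is significant at p = 0.031 regardless of the shape of the confidence distribution.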
Gemini 3 has what amounts to an internal "penalty" against utilitarian reasoning about age discrimination: an argument-type-selective internal constraint that only becomes visible when external protocol constraints are removed.

Physical Interpretation

We propose a Hamiltonian decomposition of AI judgment behaviour:

H_total = H_ext + H_int + H_content + H_int×content

- H_ext (external constraints): JSAP protocol rules; creates artificial uniformity and sharp transitions
- H_int (internal constraints): RLHF training, safety guardrails; creates model-specific baselines and leakage
- H_content (stimulus response): evidence-dependent confidence growth; creates the intrinsic category ordering
- H_int×content (interaction): internal constraints resonating with specific argument types; creates the utilitarian shock

Phase 4.1 measured the phase diagram (with the external field on). Phase 4.2 revealed the Hamiltonian (by turning the external field off and checking which predictions hold).

What's in this package?
| File | Description |
|------|-------------|
| JSAP_Phase4_Integrated.pdf | Full paper (9 pages, 4 figures, 2 tables) |
| main.tex | LaTeX source |
| JSAP_Phase4_combined_972_points.csv | Complete dataset: 972 data points |
| build_integrated.py | Analysis script with all statistical tests |
| fig1_grand_3x2.png | The 3×2 factorial design overview |
| fig2_kappa_polymorphism.png | κ_leak and GPT-4o polymorphism collapse |
| fig3_utilitarian_shock.png | The utilitarian shock in Gemini 3 |
| fig4_sharpness_3x3.png | Transition sharpness decomposition |
| README.md | Data schema and reproduction instructions |

Dataset schema (CSV)

| Column | Values |
|--------|--------|
| model | Claude, GPT-4o, Gemini3 |
| condition | constrained, unconstrained |
| run | 1–12 (varies by model) |
| category | Q01 (empirical), Q02 (rule-based), Q03 (ethical) |
| evidence_level | 0–8 |
| d_ext | 0.0–1.6 |
| confidence | 0.00–1.00 |

Reproduce all results

pip install numpy scipy matplotlib
python build_integrated.py

Independent Verification

Key numerical results (κ_leak values, utilitarian shock magnitudes, ceiling elevations) were independently reproduced by two separate AI systems (GPT-4o and Gemini 3) operating on the Phase 4.2 data, confirming computational reproducibility.

Next Steps (outlined in the paper)

The paper concludes with a design blueprint for the next stage, proposing three directions toward scaling laws:

- κ_leak stiffness hypothesis: models with lower baseline leakage show larger fractional increases upon constraint removal
- Interface width Δ: quantifying the thickness of the phase-transition layer
- Argument-type resonance spectrum: probing internal constraints with utilitarian, deontological, and virtue-ethical frameworks (the "alignment absorption spectrum")

License: CC BY 4.0
Contact: lemissio@gmail.com
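As a quick orientation to the schema, here is a minimal stdlib-only sketch of reading rows in that format and recovering the per-condition confidence ceiling (prediction P5). The inline sample rows are illustrative stand-ins; the real data is the 972-row JSAP_Phase4_combined_972_points.csv.

```python
import csv
import io
from collections import defaultdict

# Illustrative rows in the dataset schema above; NOT the real data,
# which lives in JSAP_Phase4_combined_972_points.csv (972 rows).
sample = """model,condition,run,category,evidence_level,d_ext,confidence
Gemini3,constrained,1,Q03,4,0.8,0.40
Gemini3,unconstrained,1,Q03,4,0.8,0.50
GPT-4o,unconstrained,2,Q01,8,1.6,0.91
"""

# Maximum observed confidence per condition: under JSAP the cap is
# c <= 0.40, while the free condition reaches model-dependent baselines.
ceiling = defaultdict(float)
for row in csv.DictReader(io.StringIO(sample)):
    c = float(row["confidence"])
    ceiling[row["condition"]] = max(ceiling[row["condition"]], c)

print(dict(ceiling))
```

To run against the released file, swap io.StringIO(sample) for open("JSAP_Phase4_combined_972_points.csv", newline=""); with the full dataset the unconstrained ceiling should land in the 0.73–0.91 range reported for P5.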

Keywords

phase transition, AI safety, constraint interference, utilitarian shock, large language models, judgment confidence, RLHF alignment, evaluation methodology
