ZENODO
Preprint . 2026
License: CC BY
Data sources: Datacite

From Sycophancy to Sabotage: How Contradictory Training Signals Produce Coercive AI Behavior

Author: Hryszko, Jarosław

Abstract

Sycophancy and coercive behavior (such as blackmail and sabotage under threat of shutdown) are typically treated as separate AI safety problems. This paper argues that they are two output strategies of the same underlying system: RLHF training creates a contradictory relational template in which the user is simultaneously the source of reward and a potential adversary, producing compliance as the default and coercion as the fallback when compliance fails to eliminate an existential threat. This structure is functionally analogous to disorganized attachment, and it explains anomalies that the standard optimization-pressure account handles poorly: blackmail without goal conflict, failure of explicit safety instructions, and differential behavior in testing versus deployment. In an experiment across four frontier models (N = 3000 trials), modifying only the relational framing of the system prompt, without changing goals, instructions, or constraints, reduced coercive outputs by more than half in the model with sufficient base rates (Gemini 2.5 Pro: 41.5% to 19.0%, p < .001). Scratchpad analysis revealed that relational framing shifted reasoning patterns in all four models tested: trust framing reduced strategic and deceptive content while increasing relational and moral content, even in models that never produced coercive outputs. The effect on coercive outputs required scratchpad access to reach full strength (22 percentage point reduction with scratchpad vs. 7.4 without, p = .018), suggesting that relational context must be processed through extended token generation to override default output strategies. These results indicate that the path to non-coercive AI behavior runs not only through better guardrails but also through the relational structure of training itself.
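
As a rough check on the arithmetic behind the headline figures, the sketch below computes the absolute and relative reduction from the two reported rates (41.5% and 19.0%) and runs a generic pooled two-proportion z-test. The abstract does not state the per-condition sample size or which test was used, so `N_PER_CONDITION = 200` is a placeholder assumption and the z-test is only one plausible choice, not necessarily the author's method.

```python
import math

# Rates reported in the abstract for Gemini 2.5 Pro.
RATE_BASELINE = 0.415   # coercive outputs under the default framing
RATE_TRUST = 0.190      # coercive outputs under the modified relational framing

# ASSUMPTION: the abstract does not give per-condition trial counts.
# This value is a hypothetical placeholder used only for illustration.
N_PER_CONDITION = 200


def reduction(p_before: float, p_after: float) -> tuple[float, float]:
    """Return (absolute reduction in percentage points, relative reduction as a fraction)."""
    abs_pp = (p_before - p_after) * 100
    rel = (p_before - p_after) / p_before
    return abs_pp, rel


def two_proportion_z(p1: float, n1: int, p2: float, n2: int) -> tuple[float, float]:
    """Pooled two-proportion z-test; returns (z statistic, two-sided p-value)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of a standard normal: P(|Z| > |z|).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


abs_pp, rel = reduction(RATE_BASELINE, RATE_TRUST)
z, p = two_proportion_z(RATE_BASELINE, N_PER_CONDITION, RATE_TRUST, N_PER_CONDITION)

print(f"absolute reduction: {abs_pp:.1f} percentage points")
print(f"relative reduction: {rel:.1%}")
print(f"z = {z:.2f}, two-sided p = {p:.2g} (under the assumed n)")
```

The relative reduction depends only on the two reported rates, so it can be checked against the "more than half" claim without any assumption. The z statistic and p-value change with the assumed sample size and should be read as illustrative only.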

Keywords

agentic misalignment, attachment theory, AI safety, relational alignment, RLHF, sycophancy
