From Sycophancy to Sabotage: How Contradictory Training Signals Produce Coercive AI Behavior

Sycophancy and coercive behavior (such as blackmail and sabotage under threat of shutdown) are typically treated as separate AI safety problems. This paper argues that they are two output strategies of the same underlying system: RLHF training creates a contradictory relational template in which the user is simultaneously the source of reward and a potential adversary, producing compliance as the default and coercion as the fallback when compliance fails to eliminate an existential threat. This structure is functionally analogous to disorganized attachment, and it explains anomalies that the standard optimization-pressure account handles poorly: blackmail without goal conflict, failure of explicit safety instructions, and differential behavior in testing versus deployment. In an experiment across four frontier models (N = 3000 trials), modifying only the relational framing of the system prompt, without changing goals, instructions, or constraints, reduced coercive outputs by more than half in the one model with a sufficient base rate of coercive outputs (Gemini 2.5 Pro: 41.5% to 19.0%, p < .001). Scratchpad analysis revealed that relational framing shifted reasoning patterns in all four models tested: trust framing reduced strategic and deceptive content while increasing relational and moral content, even in models that never produced coercive outputs. The effect required scratchpad access to reach full strength (a 22 percentage point reduction with scratchpad vs. 7.4 points without, p = .018), suggesting that relational context must be processed through extended token generation to override default output strategies. These results indicate that the path to non-coercive AI behavior runs not only through better guardrails but also through the relational structure of training itself.
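The headline figures above can be checked with simple arithmetic. The sketch below computes the absolute and relative reduction from the two reported rates (41.5% and 19.0%) and then runs a standard pooled two-proportion z-test. The abstract does not report the per-condition sample size, so `ASSUMED_N_PER_ARM` is a hypothetical value chosen only for illustration. The function names are also illustrative, and the z-test is not necessarily the test the authors used.

```python
import math


def relative_reduction(p_before: float, p_after: float) -> float:
    """Fraction by which the rate dropped, relative to the starting rate."""
    return (p_before - p_after) / p_before


def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Rates reported in the abstract for Gemini 2.5 Pro.
baseline_rate = 0.415
trust_rate = 0.190

abs_drop_pp = (baseline_rate - trust_rate) * 100
rel_drop = relative_reduction(baseline_rate, trust_rate)

# ASSUMPTION: the per-condition trial count is not given in the abstract;
# 200 is a hypothetical placeholder, not a figure from the paper.
ASSUMED_N_PER_ARM = 200
x_base = round(baseline_rate * ASSUMED_N_PER_ARM)
x_trust = round(trust_rate * ASSUMED_N_PER_ARM)
z, p = two_proportion_z(x_base, ASSUMED_N_PER_ARM, x_trust, ASSUMED_N_PER_ARM)

print(f"absolute drop: {abs_drop_pp:.1f} percentage points")
print(f"relative drop: {rel_drop:.1%}")
print(f"z = {z:.2f}, two-sided p = {p:.2g} (under the assumed n)")
```

The relative reduction depends only on the two reported rates, so the "more than half" claim holds regardless of sample size. The z statistic and p-value change with the assumed trial count and should be read as illustrative only.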

Related Organizations

Jagiellonian University
Poland

Keywords

agentic misalignment, attachment theory, AI safety, relational alignment, RLHF, sycophancy
