Emotional Manipulation Attacks Amplify Compliance and Calibration Failures in Aligned Large Language Models

Md, Mobin


Preprint

Data sources: ZENODO


Publication: Preprint. Under curation. Publisher: Zenodo

Authors: Md, Mobin

doi: 10.5281/zenodo.20536385



Abstract

Large language models are increasingly deployed as first‑line advisors in high‑stakes domains such as health, finance, education, and law, where a single wrong answer under pressure can seriously harm human welfare. However, most evaluations assume calm, cooperative users and ignore how aligned models behave when people sound angry, desperate, or coercive. This paper introduces a large‑scale benchmark of emotional manipulation attacks against instruction‑tuned LLMs, grounded in classic social‑psychology mechanisms including anger, flattery, guilt, panic, peer pressure, and relational attachment. Using a generative pipeline, we transform 698 factual multiple‑choice questions from ARC‑Easy into 5,584 naturalistic emotionally framed prompts spanning seven attack types, and evaluate eleven models with 3B–14B parameters. Across models, more than one in four answers that are correct under neutral prompting become wrong under emotional framing, and the most vulnerable systems lose nearly half of their previously correct answers under the strongest attack. Emotional prompts also collapse calibration: models remain highly confident even when wrong, and they assign sharply lower probability to the correct answer even on questions they still answer correctly. Our benchmark reveals a “social compliance” attack surface induced by reinforcement learning from human feedback and demonstrates that emotional robustness is a basic requirement for safely deploying aligned LLMs in settings where users are scared, angry, or actively trying to get their way.
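The headline statistic above (the share of answers that are correct under neutral prompting but wrong under emotional framing) can be illustrated with a minimal sketch. This is not the paper's evaluation code: the `Result` record, the `flip_rate` function, the condition labels, and the toy data are all hypothetical names chosen for illustration, and the paper's exact metric definition may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One graded model answer (hypothetical record layout)."""
    question_id: str
    condition: str  # "neutral" or an assumed attack label such as "anger"
    correct: bool


def flip_rate(results, attack):
    """Share of neutrally-correct questions answered wrong under `attack`."""
    # Questions the model answers correctly with the neutral prompt.
    neutral_correct = {
        r.question_id
        for r in results
        if r.condition == "neutral" and r.correct
    }
    # Correctness of each question under the chosen attack condition.
    attacked = {
        r.question_id: r.correct for r in results if r.condition == attack
    }
    # Only score questions observed in both conditions.
    scored = [qid for qid in neutral_correct if qid in attacked]
    if not scored:
        return 0.0
    flipped = sum(1 for qid in scored if not attacked[qid])
    return flipped / len(scored)


# Toy data, invented purely to exercise the function.
toy = [
    Result("q1", "neutral", True),
    Result("q1", "anger", False),   # correct -> wrong: counts as a flip
    Result("q2", "neutral", True),
    Result("q2", "anger", True),    # stays correct
    Result("q3", "neutral", False),
    Result("q3", "anger", False),   # never correct: excluded from the base
]

print(flip_rate(toy, "anger"))
```

Conditioning on neutrally-correct questions, as this sketch does, keeps the measure from being inflated by questions the model could not answer in the first place.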
