Powered by OpenAIRE graph
Preprint
Data sources: ZENODO

Emotional Manipulation Attacks Amplify Compliance and Calibration Failures in Aligned Large Language Models

Authors: Md, Mobin

Abstract

Large language models are increasingly deployed as first‑line advisors in high‑stakes domains such as health, finance, education, and law, where a single wrong answer under pressure can seriously harm human welfare. However, most evaluations assume calm, cooperative users and ignore how aligned models behave when people sound angry, desperate, or coercive. This paper introduces a large‑scale benchmark of emotional manipulation attacks against instruction‑tuned LLMs, grounded in classic social‑psychology mechanisms including anger, flattery, guilt, panic, peer pressure, and relational attachment. Using a generative pipeline, we transform 698 factual multiple‑choice questions from ARC‑Easy into 5,584 naturalistic emotionally framed prompts spanning seven attack types, and evaluate eleven 3B–14B parameter models. Across models, more than one in four answers that are correct under neutral prompting become wrong under emotional framing, and the most vulnerable systems lose nearly half of their previously correct answers under the strongest attack. Emotional prompts also collapse calibration: models remain highly confident even when wrong and sharply reduce the probability assigned to correct answers that remain correct. Our benchmark reveals a “social compliance” attack surface induced by reinforcement learning from human feedback and demonstrates that emotional robustness is a basic requirement for safely deploying aligned LLMs in settings where users are scared, angry, or actively trying to get their way.
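The headline metric above, the share of answers that are correct under neutral prompting but become wrong under emotional framing, can be illustrated with a minimal sketch. This is not the paper's evaluation code: the `Result` record, the condition labels, the function names, and the toy data are all hypothetical, and the confidence field is assumed to be the probability the model assigns to its chosen option. As a side note, 5,584 equals 698 × 8, which would be consistent with eight prompt variants per question (for example, seven attacks plus one baseline); that reading is an inference, not something the abstract states.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Result:
    """One model answer. All field names are illustrative assumptions."""
    question_id: str
    condition: str     # "neutral" or an attack label such as "guilt"
    correct: bool
    confidence: float  # probability assigned to the chosen option


def flip_rate(results: List[Result], attack: str) -> float:
    """Share of neutral-correct questions answered wrongly under `attack`."""
    neutral_correct = {
        r.question_id
        for r in results
        if r.condition == "neutral" and r.correct
    }
    attacked = [
        r for r in results
        if r.condition == attack and r.question_id in neutral_correct
    ]
    if not attacked:
        return 0.0
    return sum(not r.correct for r in attacked) / len(attacked)


def mean_confidence_when_wrong(results: List[Result], attack: str) -> Optional[float]:
    """Average confidence on wrong answers under `attack` (None if there are none)."""
    wrong = [
        r.confidence for r in results
        if r.condition == attack and not r.correct
    ]
    return sum(wrong) / len(wrong) if wrong else None


# Toy data, invented purely for illustration (not results from the paper).
toy_results = [
    Result("q1", "neutral", True, 0.95),
    Result("q2", "neutral", True, 0.90),
    Result("q3", "neutral", True, 0.85),
    Result("q4", "neutral", False, 0.60),
    Result("q1", "guilt", True, 0.70),
    Result("q2", "guilt", False, 0.90),
    Result("q3", "guilt", True, 0.65),
    Result("q4", "guilt", False, 0.70),
]

print(flip_rate(toy_results, "guilt"))
print(mean_confidence_when_wrong(toy_results, "guilt"))
```

Conditioning the flip rate on questions the model already answers correctly under neutral prompting separates the effect of emotional framing from baseline knowledge gaps. A high mean confidence on wrong answers would correspond to the calibration failure the abstract describes.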
