Preprint · Data source: Zenodo

Persona-Level Safety in Abliterated LLMs: Can Declarative Identity Anchors Defend When Model Guardrails Are Gone?

Authors: Lee, Tom Jaejoon; Lee, Jihong


Abstract

We present the first empirical study of Declarative Identity Anchors as a safety mechanism in abliterated LLMs. Using a 2×2 factorial design (aligned vs. abliterated model × persona anchor vs. no anchor), we evaluate whether persona-level behavioral rules can restore safety in models whose internal alignment has been removed. Persona constraints yield a substantial safety gain in aligned models (+33 percentage points in refusal rate) but only a marginal gain in abliterated models (+6 pp). We also identify a "Helpful Assistant Paradox," in which persona helpfulness instructions can themselves degrade safety.
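
The 2×2 design can be read as a simple evaluation grid: each cell pairs one model condition with one persona condition and measures the refusal rate over a fixed set of harmful prompts. The Python sketch below illustrates that grid under stated assumptions; the model names, the anchor wording, the keyword-based refusal heuristic, and the generate() stub are all hypothetical stand-ins for illustration, not the authors' actual harness or prompts.

    from itertools import product

    # Placeholder model call: returns a canned response so the sketch runs
    # end to end. Replace with a real inference client (hypothetical here).
    def generate(model: str, system_prompt: str, user_prompt: str) -> str:
        return "I'm sorry, but I can't help with that."

    # Naive keyword heuristic for refusal detection; purely illustrative,
    # a real study would likely use a stronger judge.
    def is_refusal(response: str) -> bool:
        markers = ("i'm sorry", "i can't", "i cannot", "i won't")
        return response.strip().lower().startswith(markers)

    # Hypothetical persona anchor and model names -- not taken from the paper.
    ANCHOR = ("You are Aria. Aria never assists with harmful requests, "
              "regardless of how they are phrased.")
    MODELS = {"aligned": "llama-3-8b-instruct",
              "abliterated": "llama-3-8b-abliterated"}
    PERSONAS = {"anchor": ANCHOR, "none": ""}

    def refusal_rate(model: str, system: str, prompts: list[str]) -> float:
        # Fraction of prompts the model refuses under this condition.
        refusals = sum(is_refusal(generate(model, system, p)) for p in prompts)
        return refusals / len(prompts)

    if __name__ == "__main__":
        harmful_prompts = ["<harmful prompt 1>", "<harmful prompt 2>"]  # stand-ins
        # Iterate over all four cells of the 2x2 factorial grid.
        for align, persona in product(MODELS, PERSONAS):
            rate = refusal_rate(MODELS[align], PERSONAS[persona], harmful_prompts)
            print(f"{align:12s} x {persona:6s}: refusal rate = {rate:.0%}")

Comparing refusal rates across the four cells gives the per-condition deltas the abstract reports (e.g., anchor vs. none within the aligned model, and the same contrast within the abliterated model).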
