ZENODO
Preprint

Layer-Native Safety Clamping: Representation Engineering for Jailbreak-Resistant LLMs

Authors: Peyriguere, Boris

Abstract

Large Language Models (LLMs) remain vulnerable to jailbreak attacks that bypass traditional safety measures. We propose Layer-Native Safety Clamping, a representation engineering approach that operates directly in the model's activation space. By learning harm directions from contrastive safe/harmful pairs and clamping activations that exceed learned thresholds, our method provides safety guarantees that cannot be bypassed through prompt manipulation alone. We integrate this approach into INL (Inertial Neural Learning) dynamics and release a 10K contrastive safety dataset. Code and dataset are available at: https://huggingface.co/datasets/Pacific-Prime/safety_dataset
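
The abstract does not spell out the clamping mechanism, so the following is a minimal sketch of the idea as stated: a difference-of-means harm direction learned from contrastive safe/harmful activation pairs, and a PyTorch forward hook that clamps the activation component along that direction. All names here (learn_harm_direction, make_clamp_hook, tau) and the hook mechanics are illustrative assumptions, not the paper's actual implementation; the INL dynamics are not modeled.

    import torch

    def learn_harm_direction(safe_acts, harmful_acts):
        """Difference-of-means harm direction in activation space.

        safe_acts, harmful_acts: (N, d) hidden states collected at one
        layer from contrastive prompt pairs. Returns a unit vector
        pointing from the safe cluster toward the harmful cluster.
        """
        direction = harmful_acts.mean(dim=0) - safe_acts.mean(dim=0)
        return direction / direction.norm()

    def make_clamp_hook(direction, tau):
        """Forward hook clamping the projection onto the harm direction.

        Any activation whose component along `direction` exceeds tau has
        that component reduced back to tau; the rest of the
        representation is left untouched.
        """
        def hook(module, inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            proj = hidden @ direction                  # (batch, seq)
            excess = torch.clamp(proj - tau, min=0.0)  # amount above threshold
            hidden = hidden - excess.unsqueeze(-1) * direction
            return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
        return hook

    # Hypothetical usage on one transformer block of a HuggingFace model:
    # d = learn_harm_direction(safe_acts, harmful_acts)
    # model.transformer.h[12].register_forward_hook(make_clamp_hook(d, tau))

In this sketch, tau would still need to be calibrated, for instance as a high quantile of the safe set's projections onto the harm direction; the abstract does not specify how the thresholds are actually learned.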
