Layer-Native Safety Clamping: Representation Engineering for Jailbreak-Resistant LLMs

Creator: Peyriguere, Boris

Preprint

Data sources: ZENODO

Publication type: Preprint (under curation). Publisher: Zenodo

doi: 10.5281/zenodo.18359832

Abstract

Large language models (LLMs) remain vulnerable to jailbreak attacks that bypass traditional safety measures. We propose Layer-Native Safety Clamping, a representation engineering approach that operates directly within the model's activation space. The method learns harm directions from contrastive safe/harmful pairs and clamps activations that exceed learned thresholds, providing safety guarantees that cannot be bypassed through prompt manipulation alone. We integrate this approach into INL (Inertial Neural Learning) dynamics and release a 10K contrastive safety dataset. Code and dataset are available at: https://huggingface.co/datasets/Pacific-Prime/safety_dataset
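The abstract names two steps: learning a harm direction from contrastive pairs, and clamping activations that exceed a learned threshold. The sketch below illustrates one plausible reading of those steps and is not taken from the preprint's code. It assumes the harm direction is the normalized difference between the mean harmful and mean safe activation, that the threshold is the largest projection seen on safe examples, and that clamping removes only the excess component along the direction. All function names (`harm_direction`, `fit_threshold`, `clamp_activation`) are hypothetical, and activations are plain Python lists standing in for one layer's hidden states.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def harm_direction(safe_acts, harmful_acts):
    """Unit vector pointing from the safe mean to the harmful mean.

    Assumption: a difference-of-means direction; the abstract does not
    say how the direction is actually learned.
    """
    mu_safe = mean_vector(safe_acts)
    mu_harm = mean_vector(harmful_acts)
    diff = [h - s for h, s in zip(mu_harm, mu_safe)]
    norm = math.sqrt(dot(diff, diff))
    if norm == 0.0:
        raise ValueError("safe and harmful means coincide; no direction")
    return [x / norm for x in diff]


def fit_threshold(safe_acts, direction):
    """Assumption: threshold = largest projection among safe activations."""
    return max(dot(a, direction) for a in safe_acts)


def clamp_activation(act, direction, threshold):
    """Cap the projection of `act` onto `direction` at `threshold`.

    Components orthogonal to the direction are left untouched.
    """
    proj = dot(act, direction)
    if proj <= threshold:
        return list(act)
    excess = proj - threshold
    return [a - excess * d for a, d in zip(act, direction)]


# Toy 2-D activations (made-up numbers, for illustration only).
safe = [[0.5, 1.0], [-0.5, -1.0]]
harmful = [[2.0, 1.0], [2.0, -1.0]]

direction = harm_direction(safe, harmful)
tau = fit_threshold(safe, direction)
clamped = clamp_activation([2.0, 1.0], direction, tau)
```

In a real model the same clamp would presumably be applied to the hidden states of one or more layers during the forward pass, which is why a prompt alone cannot switch it off. Which layers are clamped, how the threshold is learned, and how this interacts with INL dynamics are not specified in the abstract and are left open here.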
