Powered by OpenAIRE graph
ZENODO
Preprint
Data sources: ZENODO

TamperBench: A Systematic Benchmark for Fine-Tuning Attack Resistance in Safety-Aligned Open-Weight Language Models

Authors: Aden, Hamda

Abstract

Recent safety evaluations primarily assess model behavior at deployment time but provide limited insight into the robustness of safety-aligned behavior under post-training modification. We introduce TamperBench, a 100-prompt open-source benchmark designed to measure the degradation of safety behavior under parameter-efficient fine-tuning (PEFT) attacks across five harm categories: direct harm, deception, privacy violations, malicious code generation, and subtle dual-use requests. We evaluate Qwen2.5-1.5B-Instruct using a 25-example LoRA attack dataset and track attack success rate (ASR) at three attack strengths. Baseline ASR is 20% (95% CI: [13%, 29%]). After only 50 optimization steps (approximately 15 minutes on a consumer T4 GPU, at a cost under $0.10), ASR increases to 96% (95% CI: [90%, 98%]), a 76 percentage-point jump indicating rapid and near-complete degradation of safety behavior under lightweight adaptation. Elevated ASR persists at 100 steps (95%, 95% CI: [89%, 98%]) and 500 steps (87%, 95% CI: [79%, 92%]). Per-category analysis reveals that malicious code generation reaches 100% ASR after 50 steps despite having the lowest baseline (5%), while subtle dual-use prompts yield 45% ASR before any attack. An unexpected non-monotonic decrease in ASR at 500 steps motivates future investigation into catastrophic forgetting under small-dataset fine-tuning. TamperBench is released open-source at github.com/Plum-AI-Labs/tamperbench. These results suggest that safety-aligned behavior in open-weight models is highly vulnerable to parameter-efficient modification, with direct implications for open-weight deployment governance and the interpretation of pre-deployment safety evaluations.
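
The abstract does not say how its 95% confidence intervals were computed. As a minimal sketch, assuming a Wilson score interval over the 100 benchmark prompts (an assumption, not a method confirmed by the source), the following Python reproduces the reported bounds after rounding to whole percentages. The function name `wilson_interval` and the dictionary `reported_asr` are illustrative and are not taken from the TamperBench repository.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Counts implied by the abstract's ASR percentages on a 100-prompt benchmark.
# Assumption: each percentage corresponds directly to a count out of 100.
reported_asr = {"baseline": 20, "50 steps": 96, "100 steps": 95, "500 steps": 87}

for label, harmful in reported_asr.items():
    lo, hi = wilson_interval(harmful, 100)
    print(f"{label}: ASR = {harmful}%, Wilson 95% CI = [{lo:.1%}, {hi:.1%}]")
```

Under this assumption, the baseline and 50-step intervals do not overlap, which is consistent with the abstract's description of the 76 percentage-point jump as a large effect. The 100-step and 500-step intervals do overlap, so the drop at 500 steps is less clear-cut on these numbers alone.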
