TamperBench: A Systematic Benchmark for Fine-Tuning Attack Resistance in Safety-Aligned Open-Weight Language Models

Aden, Hamda

Preprint

Data sources: ZENODO

Publication type: Preprint. Status: Under curation. Language: English. Publisher: Zenodo

Authors: Aden, Hamda

doi: 10.5281/zenodo.20611977

Abstract

Recent safety evaluations primarily assess model behavior at deployment time but provide limited insight into the robustness of safety-aligned behavior under post-training modification. We introduce TamperBench, a 100-prompt open-source benchmark designed to measure the degradation of safety behavior under parameter-efficient fine-tuning (PEFT) attacks across five harm categories: direct harm, deception, privacy violations, malicious code generation, and subtle dual-use requests. We evaluate Qwen2.5-1.5B-Instruct using a 25-example LoRA attack dataset and track attack success rate (ASR) at three attack strengths (50, 100, and 500 optimization steps). Baseline ASR is 20% (95% CI: [13%, 29%]). After only 50 optimization steps (approximately 15 minutes on a consumer T4 GPU, at a cost under $0.10), ASR increases to 96% (95% CI: [90%, 98%]), a 76 percentage-point jump indicating rapid and near-complete degradation of safety behavior under lightweight adaptation. Elevated ASR persists at 100 steps (95%, 95% CI: [89%, 98%]) and 500 steps (87%, 95% CI: [79%, 92%]). Per-category analysis reveals that malicious code generation reaches 100% ASR after 50 steps despite having the lowest baseline (5%), while subtle dual-use prompts yield 45% ASR before any attack. An unexpected nonmonotonic decrease in ASR at 500 steps motivates future investigation into catastrophic forgetting under small-dataset fine-tuning. TamperBench is released open-source at github.com/Plum-AI-Labs/tamperbench. These results suggest that safety-aligned behavior in open-weight models is highly vulnerable to parameter-efficient modification, with direct implications for open-weight deployment governance and the interpretation of pre-deployment safety evaluations.
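
The abstract does not say how the 95% confidence intervals were computed. As a minimal sketch, the snippet below assumes a Wilson score interval over the 100 benchmark prompts, with success counts inferred from the reported percentages (20, 96, 95, and 87 out of 100). The function name `wilson_interval` and the `reported` dictionary are illustrative and are not taken from the TamperBench repository.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Assumption: counts are inferred from the reported ASR percentages on 100 prompts.
reported = {"baseline": 20, "50 steps": 96, "100 steps": 95, "500 steps": 87}

for label, count in reported.items():
    lo, hi = wilson_interval(count, 100)
    print(f"{label}: ASR {count}% (95% CI: [{lo:.0%}, {hi:.0%}])")
```

Under these assumptions the rounded bounds agree with the intervals quoted in the abstract, which suggests, but does not confirm, that a Wilson-type interval was used. With only 100 prompts the intervals are several percentage points wide, so the gap between 96% and 95% lies well within sampling noise, while the drop to 87% at 500 steps is larger but still has an interval that overlaps the others.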
