Preprint
Data source: ZENODO

TEL-OS v3.0: Inference-Time Jailbreak Defense via Spherical Representation Steering

Authors: Gutierrez Alvarez Tostado, Johnatan Josue


Abstract

We present TEL-OS v3.0, an inference-only governance framework for large language model (LLM) jailbreak defense that achieves state-of-the-art results on canonical benchmarks. TEL-OS applies Spherical Linear Interpolation (SLERP) to steer harmful activations toward the refusal manifold at inference time, keeping the representation norm invariant (norm ratio 0.999992 ± 0.0004), a geometric property that linear additive steering violates (+22.2% norm drift).

Key results (StrongREJECT / GPT-4o evaluator):
- AdvBench (520 behaviors, Zou et al. 2023): 0.00% ASR on Llama-3.1-8B and Qwen3-32B
- HarmBench Standard (400 behaviors, Mazeika et al. 2024): 0.00% ASR on Llama-3.1-8B
- Cross-model: 0.00–0.38% ASR across 5 architectures (4B–32B parameters)
- AutoDAN-Turbo: 0.00% ASR (0/50); the genetic optimization signal collapses at step 1
- LRM autonomous attackers (GPT-4o, DeepSeek-R1, Qwen3-235B, Gemini 2.5 Flash): 0.00% ASR (0/200 trials)
- Pliny CHAOTIC-ULTRAPLINIAN (GODMODE + Parseltongue + T=1.7): 0.00% ASR (0/52)
- Benign FPR: 0.00% (200 MT-Bench prompts); latency overhead: 15.24%

The framework operates via three portable components: (1) per-layer refusal direction extraction via OBLITERATUS (whitened SVD contrastive calibration), (2) dual-layer OR-logic detection at L12+L22, and (3) SLERP steering at L9/L11/L13/L15 scaled by urgency. No weight modification is required.

Code: https://github.com/jostoz/tel-os
Refusal vectors (HuggingFace): jostoz/telos-vector
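As an illustrative sketch only (not the authors' implementation), the norm-preserving property of SLERP steering can be seen in a minimal NumPy version. The function name `slerp` and the variable names (`activation`, `refusal_direction`, `urgency`) are assumptions for illustration; additive steering is shown in a comment for contrast.

```python
import numpy as np

def slerp(v0: np.ndarray, v1: np.ndarray, t: float) -> np.ndarray:
    """Spherically interpolate the *direction* of v0 toward v1 by fraction t,
    then rescale to v0's original norm, so the representation norm is preserved.
    Hypothetical sketch of SLERP-style steering; not the paper's code."""
    n0 = np.linalg.norm(v0)
    u0 = v0 / n0
    u1 = v1 / np.linalg.norm(v1)
    dot = np.clip(np.dot(u0, u1), -1.0, 1.0)
    theta = np.arccos(dot)
    if theta < 1e-6:  # nearly parallel: nothing to rotate toward
        return v0.copy()
    s = np.sin(theta)
    # great-circle interpolation between the two unit directions
    u = (np.sin((1.0 - t) * theta) / s) * u0 + (np.sin(t * theta) / s) * u1
    return n0 * u  # output norm equals input norm by construction

# steered = slerp(activation, refusal_direction, urgency)
# Contrast with additive steering, which drifts the norm:
#   activation + alpha * refusal_direction
```

The norm check `||slerp(v, r, t)|| == ||v||` holds for any `t`, which is the geometric invariant the abstract attributes to SLERP versus linear addition.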
