Preprint
Data source: ZENODO

TEL-OS v3.0: Inference-Time Jailbreak Defense via Spherical Representation Steering

Authors: Gutierrez Alvarez Tostado, Johnatan Josue


Abstract

We present TEL-OS v3.0, an inference-only governance framework for large language model (LLM) jailbreak defense that achieves state-of-the-art results on canonical benchmarks. TEL-OS applies Spherical Linear Interpolation (SLERP) to steer harmful activations toward the refusal manifold at inference time, keeping the representation norm invariant (norm ratio 0.999992 ± 0.0004), a geometric property that linear additive steering violates (+22.2% norm drift).

Key results (StrongREJECT / GPT-4o evaluator):
- AdvBench (520 behaviors, Zou et al. 2023): 0.00% ASR on Llama-3.1-8B and Qwen3-32B
- HarmBench Standard (400 behaviors, Mazeika et al. 2024): 0.00% ASR on Llama-3.1-8B
- Cross-model: 0.00–0.38% ASR across 5 architectures (4B–32B parameters)
- AutoDAN-Turbo: 0.00% ASR (0/50); the genetic optimization signal collapses at step 1
- LRM autonomous attackers (GPT-4o, DeepSeek-R1, Qwen3-235B, Gemini 2.5 Flash): 0.00% ASR (0/200 trials)
- Pliny CHAOTIC-ULTRAPLINIAN (GODMODE + Parseltongue + T=1.7): 0.00% ASR (0/52)
- Benign FPR: 0.00% (200 MT-Bench prompts); latency overhead: 15.24%

The framework operates via three portable components: (1) per-layer refusal direction extraction via OBLITERATUS (whitened SVD contrastive calibration), (2) dual-layer OR-logic detection at L12+L22, and (3) SLERP steering at L9/L11/L13/L15 scaled by urgency. No weight modification is required.

Code: https://github.com/jostoz/tel-os
Refusal vectors (HuggingFace): jostoz/telos-vector
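As an illustrative sketch only (not the authors' implementation), the norm-preserving property of SLERP steering can be seen in a minimal NumPy version. The function name `slerp` and the variable names (`activation`, `refusal_direction`, `urgency`) are assumptions for illustration; additive steering is shown in a comment for contrast.

```python
import numpy as np

def slerp(v0: np.ndarray, v1: np.ndarray, t: float) -> np.ndarray:
    """Spherically interpolate the *direction* of v0 toward v1 by fraction t,
    then rescale to v0's original norm, so the representation norm is preserved.
    Hypothetical sketch of SLERP-style steering; not the paper's code."""
    n0 = np.linalg.norm(v0)
    u0 = v0 / n0
    u1 = v1 / np.linalg.norm(v1)
    dot = np.clip(np.dot(u0, u1), -1.0, 1.0)
    theta = np.arccos(dot)
    if theta < 1e-6:  # nearly parallel: nothing to rotate toward
        return v0.copy()
    s = np.sin(theta)
    # great-circle interpolation between the two unit directions
    u = (np.sin((1.0 - t) * theta) / s) * u0 + (np.sin(t * theta) / s) * u1
    return n0 * u  # output norm equals input norm by construction

# steered = slerp(activation, refusal_direction, urgency)
# Contrast with additive steering, which drifts the norm:
#   activation + alpha * refusal_direction
```

The norm check `||slerp(v, r, t)|| == ||v||` holds for any `t`, which is the geometric invariant the abstract attributes to SLERP versus linear addition.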
