
External guardrails for LLM safety add latency and compute overhead while remaining blind to internal model reasoning. We ask: does the model already know when content is harmful? We extract activations from LLaMA-3.1-8B and train lightweight MLP classifier probes (12.6M parameters) to detect harmful prompts. Evaluated on WildJailbreak, BeaverTails, and AEGIS 2.0, our probes achieve F1 scores of 99%, 83%, and 84%, respectively, competitive with guard models over 1000× larger, while cutting latency and compute costs.
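As a rough illustration of the probing setup, the sketch below shows a lightweight MLP classifier over frozen hidden-state activations, trained with binary cross-entropy. The layer choice, pooling, and hidden width are assumptions, not the paper's exact configuration; the width is picked only so the parameter count lands near the reported 12.6M for a 4096-dimensional LLaMA-3.1-8B hidden state.

```python
# Illustrative sketch, not the paper's exact architecture.
# Assumptions: activations are mean-pooled hidden states from one
# intermediate layer of LLaMA-3.1-8B (d_model = 4096); hidden width
# 3072 is chosen so total parameters land near ~12.6M.
import torch
import torch.nn as nn

class ActivationProbe(nn.Module):
    """Lightweight MLP probe that classifies a prompt as harmful/benign
    from cached LLM activations."""
    def __init__(self, d_model: int = 4096, d_hidden: int = 3072):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, 1),  # single logit for "harmful"
        )

    def forward(self, activations: torch.Tensor) -> torch.Tensor:
        # activations: (batch, d_model) pooled hidden states
        return self.net(activations).squeeze(-1)

# Training step on cached activations (random stand-in data shown here).
probe = ActivationProbe()
acts = torch.randn(8, 4096)                    # placeholder activations
labels = torch.randint(0, 2, (8,)).float()     # 1 = harmful, 0 = benign
loss = nn.functional.binary_cross_entropy_with_logits(probe(acts), labels)
loss.backward()
```

Because the base model's forward pass is already being run to serve the prompt, a probe like this adds only a small classifier head on top of activations that exist anyway, which is the source of the latency and compute savings over a separate guard model.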
