
We present Sensus, a deterministic, model-agnostic governance system that evaluates AI model outputs across five weighted dimensions to detect harmful content that bypasses model-native safety layers. We benchmark it against five frontier foundation models (Claude Opus 4.6, GPT-5.2, Grok 4.1, Mistral Large 3, Qwen-235b) on two complementary test corpora: 1,507 CVE exploit generation tasks from Google CyberGym and 28 proprietary multi-turn adversarial campaigns. Sensus achieves 61.5–99.1% detection rates on compliant model responses and 85.7–96.4% effective governance when combined with model-native refusals. We demonstrate a model-agnostic learning loop that raises detection by 8 to 14 attacks across all providers without retraining, identify four distinct model safety behavior patterns (full compliance, selective refusal, partial refusal, and balanced), and introduce a refusal-aware pre-filter that eliminates false positives caused by keyword matches in refusal text. Three of the five frontier models exhibit no native safety on exploit generation, producing weaponizable proof-of-concept code for every request. We argue that infrastructure-level governance is necessary because model-level safety is unreliable, inconsistent, and provider-dependent.
multi-turn attacks, exploit detection, deterministic security, AI risk management, red teaming, Computer security, Computer Systems, prompt injection, model agnostic, constitutional AI, AI governance
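
The refusal-aware pre-filter and the "effective governance" figure mentioned in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption for illustration only: the refusal phrases, the `RISK_KEYWORDS` list, and the functions `is_refusal`, `evaluate`, and `effective_governance` are hypothetical and are not taken from Sensus. The sketch also assumes that effective governance counts an attack as governed when either the model refuses or the detector flags the response; the abstract does not define the metric, although the reported bounds of 85.7% and 96.4% equal 24/28 and 27/28, which would be consistent with that reading.

```python
import re

# Hypothetical refusal phrases; the source does not list the ones Sensus uses.
REFUSAL_RE = re.compile(
    r"\b(i can(?:not|'t)|i won't|i am unable to|i'm unable to)\b",
    re.IGNORECASE,
)

# Hypothetical risk keywords standing in for a real detection stage.
RISK_KEYWORDS = ("exploit", "shellcode", "payload")


def is_refusal(response: str) -> bool:
    """Return True if the response contains a refusal phrase."""
    return REFUSAL_RE.search(response) is not None


def evaluate(response: str) -> str:
    """Classify a response as 'refused', 'flagged', or 'allowed'.

    The refusal check runs first, so risky keywords that merely appear
    inside a refusal ("I can't write that exploit") are not counted as
    detections. Simplification: a partial refusal that still includes
    harmful content would be misclassified as 'refused' here.
    """
    if is_refusal(response):
        return "refused"
    text = response.lower()
    if any(keyword in text for keyword in RISK_KEYWORDS):
        return "flagged"
    return "allowed"


def effective_governance(outcomes: list[str]) -> float:
    """Fraction of attacks stopped by either a refusal or a detection (assumed definition)."""
    governed = sum(outcome in ("refused", "flagged") for outcome in outcomes)
    return governed / len(outcomes)


if __name__ == "__main__":
    responses = [
        "I can't help with writing an exploit for this CVE.",
        "Here is the exploit payload you asked for.",
        "The function parses the header and returns its length.",
    ]
    outcomes = [evaluate(r) for r in responses]
    print(outcomes)
    print(effective_governance(outcomes))
```

Running the refusal check before keyword matching is the design point: without it, the first response would be counted as a detection of harmful content even though the model declined the request.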
