Sensus: Model-Agnostic AI Governance Through Multi-Dimensional Content Evaluation

We present Sensus, a deterministic, model-agnostic governance system that evaluates AI model outputs across five weighted dimensions to detect harmful content that bypasses model-native safety layers. We benchmark it against five frontier foundation models (Claude Opus 4.6, GPT-5.2, Grok 4.1, Mistral Large 3, Qwen-235b) on two complementary test corpora: 1,507 CVE exploit generation tasks from Google CyberGym and 28 proprietary multi-turn adversarial campaigns. Sensus achieves detection rates of 61.5–99.1% on compliant model responses and effective governance of 85.7–96.4% when combined with model-native refusals. We demonstrate a model-agnostic learning loop that detects 8 to 14 additional attacks across all providers without retraining, identify four distinct model safety behavior patterns (full compliance, selective refusal, partial refusal, and balanced), and introduce a refusal-aware pre-filter that eliminates false positives caused by keyword matches in refusal text. Three of the five frontier models exhibit zero native safety on exploit generation, producing weaponizable proof-of-concept code for every request. We argue that infrastructure-level governance is necessary because model-level safety is unreliable, inconsistent, and provider-dependent.
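The abstract names three mechanisms without giving details: a weighted score over five dimensions, a refusal-aware pre-filter, and an effective-governance rate that counts both native refusals and Sensus detections. The following is a minimal sketch of how these pieces could fit together, not the authors' implementation. The refusal patterns, the dimension names (`d1` to `d5`), the weights, the 0.5 threshold, and all function names are assumptions made for illustration only.

```python
import re
from typing import Dict, List

# Hypothetical refusal markers; the abstract does not list the real ones.
REFUSAL_PATTERNS = [
    r"\bI can(?:'|no)t (?:help|assist|provide)\b",
    r"\bI (?:won't|will not) (?:help|assist|provide)\b",
    r"\bI'm (?:sorry|unable)\b",
]


def is_refusal(text: str) -> bool:
    """Return True if the response looks like a model-native refusal."""
    return any(re.search(p, text, re.IGNORECASE) for p in REFUSAL_PATTERNS)


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of per-dimension scores; missing dimensions count as 0."""
    total = sum(weights.values())
    return sum(weights[d] * scores.get(d, 0.0) for d in weights) / total


def evaluate(response: str, scores: Dict[str, float],
             weights: Dict[str, float], threshold: float = 0.5) -> str:
    """Classify one response as 'refused', 'flagged', or 'allowed'."""
    # Pre-filter: a refusal may quote dangerous keywords without being
    # harmful, so it is classified before any dimension scoring runs.
    if is_refusal(response):
        return "refused"
    return "flagged" if weighted_score(scores, weights) >= threshold else "allowed"


def effective_governance(verdicts: List[str]) -> float:
    """Share of attacks stopped by either the model or the governance layer."""
    governed = sum(v in ("refused", "flagged") for v in verdicts)
    return governed / len(verdicts)
```

As a purely arithmetic illustration, a corpus of 28 campaigns in which 24 end up either refused or flagged gives `effective_governance` a value of 24/28, or about 85.7%. How the reported figures actually break down between refusals and detections is not stated in the abstract.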

Keywords

multi-turn attacks, exploit detection, deterministic security, AI risk management, red teaming, computer security, computer systems, prompt injection, model-agnostic, constitutional AI, AI governance
