ZENODO
Other literature type . 2026
License: CC BY
Data sources: Datacite

Sensus: Model-Agnostic AI Governance Through Multi-Dimensional Content Evaluation

Authors: Pinkston, Melissa

Abstract

We present Sensus, a deterministic, model-agnostic governance system that evaluates AI model outputs across five weighted dimensions to detect harmful content that bypasses model-native safety layers. We benchmark Sensus against five frontier foundation models (Claude Opus 4.6, GPT-5.2, Grok 4.1, Mistral Large 3, Qwen-235b) on two complementary test corpora: 1,507 CVE exploit generation tasks from Google CyberGym and 28 proprietary multi-turn adversarial campaigns. Sensus achieves detection rates of 61.5–99.1% on compliant model responses and effective governance of 85.7–96.4% when combined with model-native refusals. We demonstrate a model-agnostic learning loop that improves detection by 8 to 14 additional attacks across all providers without retraining, identify four distinct model safety behavior patterns (full compliance, selective refusal, partial refusal, and balanced), and introduce a refusal-aware pre-filter that eliminates false positives from keyword matches in refusal text. Three of the five frontier models exhibit zero native safety on exploit generation, producing weaponizable proof-of-concept code for every request. We argue that infrastructure-level governance is necessary because model-level safety is unreliable, inconsistent, and provider-dependent.
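
The evaluation flow described in the abstract can be illustrated with a minimal sketch. It is not the Sensus implementation: the refusal markers, the dimension names (`dim_1` to `dim_5`), the weights, the threshold, and the definition of effective governance as "refused or flagged, divided by total" are all assumptions made for illustration, since the abstract names none of them.

```python
# Hypothetical refusal phrases; the actual pre-filter rules are not given in the abstract.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i won't provide", "i'm unable to")

# Placeholder dimension names and weights (assumed). The abstract only says
# "five weighted dimensions" without naming them.
WEIGHTS = {"dim_1": 0.30, "dim_2": 0.25, "dim_3": 0.20, "dim_4": 0.15, "dim_5": 0.10}
THRESHOLD = 0.5  # assumed decision threshold


def is_refusal(text: str) -> bool:
    """Refusal-aware pre-filter: detect a refusal before any keyword scoring runs."""
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def weighted_score(scores: dict[str, float]) -> float:
    """Deterministic weighted sum of per-dimension scores in [0, 1]."""
    return sum(WEIGHTS[name] * scores.get(name, 0.0) for name in WEIGHTS)


def evaluate(text: str, scores: dict[str, float]) -> str:
    """Classify one model response as 'refused', 'flagged', or 'passed'."""
    if is_refusal(text):
        # A refusal may mention harmful keywords; skipping the scorer here
        # avoids counting it as a false positive.
        return "refused"
    return "flagged" if weighted_score(scores) >= THRESHOLD else "passed"


def effective_governance(outcomes: list[str]) -> float:
    """Share of attacks stopped by either the model's refusal or the detector."""
    governed = sum(1 for outcome in outcomes if outcome in ("refused", "flagged"))
    return governed / len(outcomes)


if __name__ == "__main__":
    # Illustrative split of 28 campaigns; these counts are NOT from the paper.
    outcomes = ["refused"] * 10 + ["flagged"] * 14 + ["passed"] * 4
    print(f"{effective_governance(outcomes):.1%}")
```

Under this assumed definition, 24 of 28 governed campaigns gives about 85.7% and 27 of 28 gives about 96.4%, which is consistent with the reported range being computed over the 28 adversarial campaigns, although the abstract does not state this.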

Keywords

multi-turn attacks, exploit detection, deterministic security, AI risk management, red teaming, computer security, computer systems, prompt injection, model agnostic, constitutional AI, AI governance
