Hard-Positive Training and Threshold Calibration for Out-of-Distribution Adversarial Prompt Detection

Singh, Ayush



Preprint

Data sources: ZENODO


Publication type: Preprint. Status: Under curation. Language: English. Publisher: Zenodo


doi: 10.5281/zenodo.20536639


Abstract

I study whether a multi-layer adversarial prompt detection system generalises to attack patterns outside its training distribution. Starting from a production system, the Failure Intelligence Engine (FIE), with strong known-attack performance (F1 = 0.785, Precision = 0.975 on 2,006 prompts), I find that recall on novel, unknown attacks collapses to 11–24%, demonstrating that architectural complexity does not confer generalisation. I then show that targeted hard-positive retraining on 169 missed attack prompts dramatically recovers unknown-attack recall (8% → 96.25%), but at the cost of precision calibration. A threshold sweep across t ∈ [0.50, 0.90] resolves the calibration problem without further retraining, recovering the target operating point (TPR ≥ 60%, FPR ≤ 15%) at t = 0.80. A weight-comparison experiment further shows that 3× hard-positive weighting reaches the same target zone at a lower threshold (t = 0.70) and with higher F1 (0.9827 vs 0.9673), indicating that over-weighting degrades natural calibration without improving discrimination. The central finding is that the bottleneck in adversarial detection generalisation is the training distribution, not the architecture. Ten additional specialist detection layers contribute only +3.5% recall on unknown attacks, whereas 169 hard-positive training examples contribute +82% recall, and the improvement generalises to completely held-out benchmarks that use different attack surfaces. The system is publicly available at https://pypi.org/project/fie-sdk with 5,700+ verified PyPI installs. All benchmarks, evaluation code, and reproduction scripts are available at https://github.com/AyushSingh110/Failure_Intelligence_System
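The threshold sweep described in the abstract can be illustrated with a minimal sketch. This is not the FIE implementation or its API: the function names (`rates_at_threshold`, `sweep`, `first_in_target`) are hypothetical, and the scores and labels are toy values invented for illustration, not data from the study. The sketch only shows the general idea of scoring a fixed classifier's outputs at several thresholds and picking the lowest one that meets a TPR floor and an FPR ceiling.

```python
from typing import List, Optional, Tuple

Row = Tuple[float, float, float, bool]  # (threshold, TPR, FPR, in_target_zone)


def rates_at_threshold(scores: List[float], labels: List[int], t: float) -> Tuple[float, float]:
    """Return (TPR, FPR) when every prompt with score >= t is flagged as an attack."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
    positives = sum(labels)
    negatives = len(labels) - positives
    return tp / positives, fp / negatives


def sweep(scores: List[float], labels: List[int], thresholds: List[float],
          min_tpr: float = 0.60, max_fpr: float = 0.15) -> List[Row]:
    """Evaluate each threshold and mark whether it lands in the target zone."""
    rows = []
    for t in thresholds:
        tpr, fpr = rates_at_threshold(scores, labels, t)
        rows.append((t, tpr, fpr, tpr >= min_tpr and fpr <= max_fpr))
    return rows


def first_in_target(rows: List[Row]) -> Optional[float]:
    """Return the lowest swept threshold inside the target zone, or None."""
    for t, _tpr, _fpr, ok in rows:
        if ok:
            return t
    return None


# Toy data (assumption, NOT from the paper): 5 attack prompts, 10 benign prompts.
attack_scores = [0.95, 0.88, 0.83, 0.72, 0.55]
benign_scores = [0.78, 0.66, 0.61, 0.52, 0.40, 0.30, 0.20, 0.15, 0.10, 0.05]
scores = attack_scores + benign_scores
labels = [1] * len(attack_scores) + [0] * len(benign_scores)

thresholds = [0.50, 0.60, 0.70, 0.80, 0.90]
rows = sweep(scores, labels, thresholds)
for t, tpr, fpr, ok in rows:
    print(f"t={t:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}  in_target={ok}")
print("lowest threshold in target zone:", first_in_target(rows))
```

Because the sweep only re-thresholds existing scores, it needs no retraining, which is the property the abstract relies on. Where the target zone falls on real data depends entirely on the score distribution of the model being calibrated.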
