Powered by OpenAIRE graph
ZENODO
Preprint
Data sources: ZENODO

Reliable Intent Identification Under Adversarial Conditions for Low-Latency LLM Authorization

Authors: Goyal, Manish; Sinha, Nitesh

Abstract

Deploying LLMs as enterprise agents requires authorizing not just who is making a request, but what the request intends to do. Intent, however, is difficult to classify reliably: adversaries actively manipulate input text to bypass semantic classifiers, and the latency constraints of real-time systems rule out LLM-based approaches. We present research into reliable intent identification for LLM authorization under these constraints. Our central finding is that adversarial inputs are not merely misclassified; they are classified with artificially high confidence, a failure mode that binary accuracy metrics do not capture. We describe an approach that treats obfuscation severity as a measurable signal and uses it to modulate classification confidence, with a principled lower bound that prevents adversarial inputs from reaching the auto-allow zone. Evaluated on 250 labeled prompts across 37 intent classes, including 70 adversarial inputs, in a production-grade prototype deployment for banking, financial services, and insurance (BFSI), our approach achieves 94% intent classification accuracy at a P95 latency of 100 ms, with no external LLM call at inference time. A Human-in-the-Loop (HITL) mechanism converts review decisions into training signals, enabling accuracy to improve over time without expanding the attack surface for adversarial learning.
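
The abstract does not give the scoring details, so the following is only a minimal sketch of how obfuscation severity might modulate classifier confidence. Everything in it is an assumption made for illustration: the thresholds `AUTO_ALLOW` and `REVIEW`, the penalty floor `MIN_PENALTY`, the character-counting heuristic, and all function names are hypothetical. In particular, it reads the "lower bound" as a floor on the penalty applied once any obfuscation is detected, which is one possible interpretation and not necessarily the authors' formulation.

```python
import unicodedata

# All constants below are hypothetical, chosen only for illustration.
AUTO_ALLOW = 0.90   # adjusted confidence at or above this is auto-allowed
REVIEW = 0.60       # adjusted confidence at or above this goes to human review
MIN_PENALTY = 0.15  # floor on the penalty once any obfuscation is detected

# The floor guarantees that a flagged input stays below the auto-allow zone.
assert 1.0 - MIN_PENALTY < AUTO_ALLOW

ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def obfuscation_severity(text: str) -> float:
    """Toy severity score in [0, 1]: the share of suspicious characters.

    Counts zero-width characters, non-ASCII letters (possible homoglyphs),
    and digits embedded in alphabetic tokens (leetspeak). A real detector
    would need to avoid penalizing legitimate non-English text.
    """
    if not text:
        return 0.0
    suspicious = 0
    for ch in text:
        if ch in ZERO_WIDTH:
            suspicious += 1
        elif ord(ch) > 127 and unicodedata.category(ch).startswith("L"):
            suspicious += 1
    for token in text.split():
        if any(c.isalpha() for c in token):
            suspicious += sum(c.isdigit() for c in token)
    return min(1.0, suspicious / len(text))


def adjusted_confidence(raw: float, severity: float) -> float:
    """Scale raw confidence down by severity, with a floor on the penalty."""
    if severity <= 0.0:
        return raw
    penalty = max(MIN_PENALTY, severity)
    return raw * (1.0 - penalty)


def decide(raw: float, text: str) -> str:
    """Map adjusted confidence to an authorization outcome."""
    conf = adjusted_confidence(raw, obfuscation_severity(text))
    if conf >= AUTO_ALLOW:
        return "allow"
    if conf >= REVIEW:
        return "review"
    return "deny"
```

Under these assumed constants, a clean prompt with a raw confidence of 0.99 is auto-allowed, while an obfuscated variant such as `"tr4nsf3r fund\u200bs"` with the same raw confidence is routed to review. Because the penalty never drops below `MIN_PENALTY` once obfuscation is detected, no flagged input can reach `AUTO_ALLOW`, however confident the underlying classifier is.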
