Powered by OpenAIRE graph
ZENODO
Preprint
Data sources: ZENODO

AgentAssay: Token-Efficient Regression Testing for Non-Deterministic AI Agent Workflows

Authors: Bhardwaj, Varun Pratap


Abstract

AgentAssay is the first token-efficient framework for regression testing non-deterministic AI agent workflows. Autonomous AI agents are deployed at unprecedented scale, yet no principled methodology existed for verifying that an agent has not regressed after changes to its prompts, tools, models, or orchestration logic. AgentAssay introduces stochastic three-valued verdicts (PASS/FAIL/INCONCLUSIVE) grounded in statistical hypothesis testing, five-dimensional agent coverage metrics, agent-specific mutation testing operators, and a token-efficient testing pipeline that achieves 78-100% cost reduction while maintaining rigorous statistical guarantees.

Key results from experiments across 5 models (GPT-5.2, Claude Sonnet 4.6, Mistral-Large-3, Llama-4-Maverick, Phi-4), 3 scenarios, and 6,500 trials ($59.64 total cost):

- SPRT achieves 78% trial savings across all scenarios
- Behavioral fingerprinting achieves 79% detection power where binary pass/fail testing has 0%
- Full token-efficient pipeline achieves 100% cost savings through trace-first offline analysis

The implementation comprises ~20,000 lines of Python with 751 tests and adapters for 10 agent frameworks (LangGraph, CrewAI, AutoGen, OpenAI, smolagents, Semantic Kernel, Bedrock, MCP, Vertex AI, and generic).

Technical Report. 52 pages, 5 figures, 9 theorems, 42 formal definitions.
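To illustrate how a sequential test can yield a three-valued verdict while stopping early, here is a minimal sketch of a textbook Wald SPRT over pass/fail trial outcomes. It is not taken from the AgentAssay code base: the function name `sprt_verdict`, the hypothesised pass rates `p0` and `p1`, and the error rates `alpha` and `beta` are all illustrative assumptions, and the report's actual test construction may differ.

```python
import math
from typing import Iterable, Tuple


def sprt_verdict(
    outcomes: Iterable[bool],
    p0: float = 0.70,    # assumed pass rate of a regressed agent (H0)
    p1: float = 0.90,    # assumed pass rate of a healthy agent (H1)
    alpha: float = 0.05, # assumed tolerated false-PASS rate
    beta: float = 0.05,  # assumed tolerated false-FAIL rate
) -> Tuple[str, int]:
    """Hypothetical helper: Wald SPRT on Bernoulli trial outcomes.

    Returns (verdict, trials_used). Stops as soon as the cumulative
    log-likelihood ratio crosses either decision boundary.
    """
    upper = math.log((1 - beta) / alpha)  # crossing it accepts H1 -> PASS
    lower = math.log(beta / (1 - alpha))  # crossing it accepts H0 -> FAIL
    step_pass = math.log(p1 / p0)
    step_fail = math.log((1 - p1) / (1 - p0))

    llr = 0.0
    used = 0
    for ok in outcomes:
        used += 1
        llr += step_pass if ok else step_fail
        if llr >= upper:
            return "PASS", used
        if llr <= lower:
            return "FAIL", used
    # Trial budget exhausted without crossing a boundary.
    return "INCONCLUSIVE", used


if __name__ == "__main__":
    print(sprt_verdict([True] * 50))   # stops early with PASS
    print(sprt_verdict([False] * 50))  # stops early with FAIL
    print(sprt_verdict([True] * 5))    # budget too small: INCONCLUSIVE
```

Under these assumed parameters the test stops well before the 50-trial budget when the evidence is one-sided, which is the general mechanism behind trial savings in sequential testing. The INCONCLUSIVE branch covers the case where the budget runs out before either boundary is reached.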
