
AgentAssay is the first token-efficient framework for regression testing non-deterministic AI agent workflows. Autonomous AI agents are deployed at unprecedented scale, yet no principled methodology existed for verifying that an agent has not regressed after changes to its prompts, tools, models, or orchestration logic. AgentAssay introduces stochastic three-valued verdicts (PASS/FAIL/INCONCLUSIVE) grounded in statistical hypothesis testing, five-dimensional agent coverage metrics, agent-specific mutation testing operators, and a token-efficient testing pipeline that achieves 78-100% cost reduction while maintaining rigorous statistical guarantees. Key results from experiments across 5 models (GPT-5.2, Claude Sonnet 4.6, Mistral-Large-3, Llama-4-Maverick, Phi-4), 3 scenarios, and 6,500 trials ($59.64 total cost): - SPRT achieves 78% trial savings across all scenarios - Behavioral fingerprinting achieves 79% detection power where binary pass/fail testing has 0% - Full token-efficient pipeline achieves 100% cost savings through trace-first offline analysis The implementation comprises ~20,000 lines of Python with 751 tests and adapters for 10 agent frameworks (LangGraph, CrewAI, AutoGen, OpenAI, smolagents, Semantic Kernel, Bedrock, MCP, Vertex AI, and generic). Technical Report. 52 pages, 5 figures, 9 theorems, 42 formal definitions.
