
Tool-using AI agents fail in production in ways that are hard to bisect: a model upgrade subtly changes a tool-call shape, a prompt edit reorders steps, a tool's response schema drifts, a budget tightens, and the failure surface looks identical at the API layer. This paper presents Agent Trajectory Replay, a small, single-concern artifact for capturing the full sequence of tool calls, arguments, and intermediate state from an agent run, and replaying it deterministically against a candidate version to surface the exact step at which behavior diverges. The contribution is a minimal trajectory record format that captures tool-call shape, argument values, retry hints, and budget state, plus a diff algorithm that ranks regressions by where the trajectory first differs. The artifact is published as a small TypeScript library on npm with a 1:1 Python port, and an MCP-server variant so a remote LLM can ask 'replay this trajectory' as a tool. The paper documents the trace format, the diff algorithm, and the operational pattern of using trajectory snapshots as regression fixtures in CI.

DOI: 10.5281/zenodo.20073574 (Zenodo concept record)
Artifact paper repo: https://github.com/MukundaKatta/agent-trajectory-replay-paper
License: CC BY 4.0
workflow evaluation, tool use, regression testing, agent debugging, trajectory replay, AI agents
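
As an illustration of the record format and first-divergence diff described in the abstract, the following is a minimal sketch in Python. It is not the published library's API: the names `Step`, `Divergence`, `first_divergence`, and `rank_regressions`, the field names, and the comparison order (tool name, argument keys, argument values, retry hint, budget) are all assumptions made for this example. Ranking earliest divergence first is likewise an assumed reading of "ranks regressions by where the trajectory first differs".

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Step:
    """One recorded tool call (hypothetical field names, not the library's schema)."""
    tool: str                               # tool name, part of the call shape
    args: dict[str, Any]                    # argument values sent to the tool
    retry_hint: Optional[str] = None        # e.g. "backoff"; None if absent
    budget_remaining: Optional[int] = None  # budget state after this call


@dataclass
class Divergence:
    index: int   # position of the first differing step
    kind: str    # "tool", "arg_shape", "arg_value", "retry_hint", "budget", "length"
    detail: str  # human-readable description of the difference


def first_divergence(baseline: list[Step], candidate: list[Step]) -> Optional[Divergence]:
    """Return the first step at which two trajectories differ, or None if identical."""
    for i, (b, c) in enumerate(zip(baseline, candidate)):
        if b.tool != c.tool:
            return Divergence(i, "tool", f"{b.tool} -> {c.tool}")
        if set(b.args) != set(c.args):
            changed = sorted(set(b.args) ^ set(c.args))
            return Divergence(i, "arg_shape", f"keys differ: {changed}")
        for key in sorted(b.args):
            if b.args[key] != c.args[key]:
                return Divergence(i, "arg_value", f"{key}: {b.args[key]!r} -> {c.args[key]!r}")
        if b.retry_hint != c.retry_hint:
            return Divergence(i, "retry_hint", f"{b.retry_hint!r} -> {c.retry_hint!r}")
        if b.budget_remaining != c.budget_remaining:
            return Divergence(i, "budget", f"{b.budget_remaining} -> {c.budget_remaining}")
    if len(baseline) != len(candidate):
        # One trajectory is a strict prefix of the other.
        shorter = min(len(baseline), len(candidate))
        return Divergence(shorter, "length", f"{len(baseline)} steps -> {len(candidate)} steps")
    return None


def rank_regressions(
    baselines: dict[str, list[Step]],
    candidates: dict[str, list[Step]],
) -> list[tuple[str, Divergence]]:
    """Diff each named fixture and sort by earliest divergence (assumed ranking)."""
    found = []
    for name, baseline in baselines.items():
        d = first_divergence(baseline, candidates[name])
        if d is not None:
            found.append((name, d))
    return sorted(found, key=lambda item: (item[1].index, item[0]))


# Example: the candidate run adds an unexpected "timeout" argument at step 1.
baseline = [
    Step("search", {"query": "refund policy"}, budget_remaining=9),
    Step("fetch", {"url": "https://example.com/policy"}, budget_remaining=8),
]
candidate = [
    Step("search", {"query": "refund policy"}, budget_remaining=9),
    Step("fetch", {"url": "https://example.com/policy", "timeout": 5}, budget_remaining=8),
]
result = first_divergence(baseline, candidate)
```

In a CI setting, a stored baseline trajectory of this kind could serve as a regression fixture: the job would fail when `first_divergence` returns something other than `None`, and the reported step index would point at where to start debugging.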
