State Tracking as the Binding Constraint: How Long-Horizon Coherence Failures Unify Agentic Data Analysis, Clinical Decision-Making, and Multi-Turn Tool Use

Saluca Agentic AI Research Team


Publication (Research), under curation. Publisher: Zenodo

Authors: Saluca Agentic AI Research Team

doi: 10.5281/zenodo.20541076


Abstract

Version 2, revised in response to an external structural review and an automated critique pass. See the "Response to Review" appendix in the PDF for the change log.

A recurring failure pattern appears across three superficially distinct domains in recent machine learning research: agents that perform well on isolated steps but degrade sharply when required to maintain coherent, evolving state across many sequential decisions. This paper synthesises findings from benchmarks and theoretical analyses spanning long-horizon data analysis, clinical inpatient simulation, multi-hop question answering, continual learning, MCP-based personal tool use, and agentic mathematical research to argue that *state tracking coherence* (the capacity to maintain, update, and compose an accurate internal representation of task context across turns) is a binding upstream constraint on agentic performance, distinct from and not reducible to per-step reasoning quality. This is a **heuristic reading** that unifies sources by a shared failure signature rather than by a single formal mechanism; we name the mechanism gap explicitly and propose falsification paths throughout. No single formal derivation connects all cited results; the cross-domain pattern is suggestive, not conclusive. Sources are drawn primarily from cs.AI and cs.LG preprints from May–June 2026.

The central thesis is that long-horizon performance collapse is not primarily a reasoning deficit but a *state representation deficit*: agents lose track of what has been established, what has been revised, and what remains open. Primary evidence comes from [corpus:arxiv:2605.30434] (LongDS-Bench, 68 tasks, best model 48.45% accuracy with ~47-point drop from early to late turns), [corpus:arxiv:2606.02568] (ClinEnv, 0.17 F1 on management actions vs. 0.51 on discharge diagnosis), [corpus:arxiv:2606.02461] (AgentCL, continual learning across compositional task streams), and [corpus:arxiv:2606.02488] (RASER, cost-accuracy routing in multi-hop QA). Two additional sources, [corpus:arxiv:2606.02470] (MCP-Persona) and [corpus:arxiv:2606.02484] (Iteris), are included as suggestive rather than evidential; their connections to the thesis are inferential and are flagged in a dedicated weakly-connected addendum.

Falsification path: an ablation that injects a perfect external state ledger into a failing agent and measures whether late-turn performance recovers to early-turn levels would distinguish state-tracking failure from reasoning failure. If performance does not recover, the thesis is weakened.

---

Authorship: Saluca Agentic AI Research Team (Saluca LLC). AI-drafted from arXiv preprint corpus on the date in the filename.

Cited arXiv preprints: 2605.30434, 2605.31261, 2605.31468, 2606.02458, 2606.02461, 2606.02470, 2606.02484, 2606.02488, 2606.02497, 2606.02530, 2606.02536, 2606.02568
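The falsification path stated in the abstract can be made concrete with a small scoring sketch. The code below is illustrative only and is not taken from the paper: the function names (`accuracy_by_phase`, `ledger_recovery`), the early/late split index, and the toy correctness flags are all assumptions. It computes what fraction of the early-to-late accuracy drop is closed when the same agent is rerun with an external state ledger.

```python
from statistics import mean


def accuracy_by_phase(turn_correct, split):
    """Return (early_acc, late_acc) for per-turn 0/1 correctness flags.

    `split` is a hypothetical index separating early turns from late turns.
    """
    return mean(turn_correct[:split]), mean(turn_correct[split:])


def ledger_recovery(baseline, with_ledger, split):
    """Fraction of the baseline early-to-late drop closed by the ledger run.

    A value near 1.0 would be consistent with a state-tracking failure;
    a value near 0.0 would weaken that reading.
    """
    early_base, late_base = accuracy_by_phase(baseline, split)
    _, late_ledger = accuracy_by_phase(with_ledger, split)
    drop = early_base - late_base
    if drop <= 0:
        raise ValueError("baseline shows no late-turn drop; ablation is uninformative")
    return (late_ledger - late_base) / drop


# Toy, made-up correctness flags for one 8-turn task (not real benchmark data).
baseline = [1, 1, 1, 1, 0, 0, 1, 0]
with_ledger = [1, 1, 1, 1, 1, 1, 1, 0]
recovery = ledger_recovery(baseline, with_ledger, split=4)
```

Normalising by the baseline drop, rather than reporting raw late-turn accuracy, keeps the measure comparable across tasks whose early-turn accuracy differs; this is a design choice of the sketch, not a protocol the source specifies.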
