
Most evaluation of conversational AI relies on short, prompt‑based tests that fail to reflect how people actually use these systems across diverse situations. Such tests do not capture the demands of extended interaction, shifting user intent, or the cumulative effects of context over time. This paper introduces the Argo AI Testing Protocol (the Argo Protocol), a conceptual approach for evaluating AI systems within the User Interaction Space, defined as the full set of observable outputs and interactions available to a user. The Protocol outlines a long‑form, multi‑dimensional perspective on evaluation, recognising that behaviour emerges across extended interaction rather than in isolated prompts. It describes a set of conceptual load dimensions that influence model behaviour, without prescribing specific procedures, measurements, or implementation details. The Protocol’s purpose is to provide a vocabulary and framing that developers can adapt to their own environments, rather than a fixed or prescriptive testing method. The aim is not to define a standard, though the Protocol may serve as a starting point should the field require a formalised approach in the future. By grounding evaluation in the observable behaviour of the User Interaction Space under sustained, multi‑dimensional conditions, the Argo Protocol offers a conceptual route toward more realistic assessment of how AI systems behave in real use.
computer-human interaction, AI testing, LLM, user interaction space, artificial intelligence, model stability, long-form evaluation, conversational models, quality assurance, information systems, stress testing, failure modes, behavioral reliability, distributed computing, multi-axis load testing, prompt engineering, red teaming, computer security, testing protocol, systems theory
