When Agents Get Lost: Dissecting Failure Modes in Graph-Based Navigation Instruction Evaluation

Shami, Farzad; Abedini, Kimia; Hosseini, Seyed Hossein; Van de Weghe, Nico; Tenkanen, Henrikki

Conference object

Data sources: ZENODO

Publication type: Conference object | Under curation | English | Publisher: Zenodo

doi: 10.5281/zenodo.20232766

Abstract

Vision-and-Language Navigation (VLN) requires agents to interpret natural language instructions for spatial reasoning, yet evaluating instruction quality remains challenging when agents fail. Closing this gap requires a principled understanding of why navigation instructions fail, grounded in a systematic analysis of failure patterns in spatial reasoning tasks. We first present a taxonomy of navigation instruction failures that groups failure cases into four categories: (i) linguistic properties, (ii) topological constraints, (iii) agent limitations, and (iv) execution barriers. We then introduce a dataset of 492 annotated failed navigation traces collected from GROKE, a vision-free evaluation framework that uses OpenStreetMap (OSM) data. The dataset characterizes failure dynamics in spatial grounding to guide the development of better instruction generation, evaluation systems, and navigation agents. Our analysis of these traces shows that agent limitations (74.2%) are the dominant error category, with stop-location errors and planning failures as the most frequent subcategories. Together, the dataset and taxonomy provide actionable insights: instruction generation systems can identify and avoid under-specification patterns, and evaluation frameworks can systematically distinguish instruction quality issues from agent-specific artifacts.

Code: https://fuzsh.github.io/lost/
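As a rough illustration of how a four-category taxonomy like the one in the abstract could be applied to annotated traces, the sketch below tallies the share of traces per category. It is not the authors' code or the dataset's schema: the `FailureTrace` fields, the `category_shares` helper, and the toy records (including the "ambiguous junction" label) are hypothetical and invented for this example.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class FailureCategory(Enum):
    # The four top-level categories named in the abstract.
    LINGUISTIC = "linguistic properties"
    TOPOLOGICAL = "topological constraints"
    AGENT = "agent limitations"
    EXECUTION = "execution barriers"


@dataclass(frozen=True)
class FailureTrace:
    # Hypothetical record layout; the real dataset schema may differ.
    trace_id: str
    category: FailureCategory
    subcategory: str


def category_shares(traces):
    """Return the percentage of traces per category, rounded to one decimal."""
    counts = Counter(t.category for t in traces)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cat: round(100 * n / total, 1) for cat, n in counts.items()}


# Toy input, invented for illustration only; not taken from the dataset.
toy = [
    FailureTrace("t1", FailureCategory.AGENT, "stop-location error"),
    FailureTrace("t2", FailureCategory.AGENT, "planning failure"),
    FailureTrace("t3", FailureCategory.LINGUISTIC, "under-specification"),
    FailureTrace("t4", FailureCategory.TOPOLOGICAL, "ambiguous junction"),
]
shares = category_shares(toy)
for cat, pct in shares.items():
    print(f"{cat.value}: {pct}%")
```

Run over the full set of annotated traces, the same kind of tally would yield per-category percentages such as the 74.2% reported for agent limitations.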
