
Version 2: revised in response to an external structural review and an automated critique pass. See the "Response to Review" appendix in the PDF for the change log.

Vision-Language-Action (VLA) models are increasingly deployed as the cognitive core of physical robots, yet their failure modes, perceptual representations, and runtime safety mechanisms are almost universally designed and evaluated in isolation. This paper synthesises five primary findings, supplemented by two secondary ones, from recent robotics, human-computer interaction, and systems-control preprints to argue a single defensible thesis: **the gap between a VLA's internal representational structure and the monitoring or certification apparatus placed around it is itself a primary source of deployment fragility, not a secondary concern to be patched post hoc.** We advance this thesis as a heuristic reading of the evidence, not a formal derivation; the cited findings are consistent with the thesis but do not uniquely entail it. We draw on empirical evidence that VLA architectures produce architecture-specific failure signatures at the motor-command level [corpus:arxiv:2605.28726], that visual encoders trained without dynamics awareness leave motion understanding to downstream policies that then fail under distribution shift [corpus:arxiv:2605.30350], that fine-grained language supervision reshapes policy behaviour in ways coarse goal-level data cannot [corpus:arxiv:2605.27284], that belief-space safety filters require explicit conformal-prediction certification to handle inference error [corpus:arxiv:2606.02562], and that world-task factorisation provides a principled structural decomposition that could, subject to hardware validation, align these concerns [corpus:arxiv:2606.02027].
Secondary evidence from swarm fault-tolerance [corpus:arxiv:2606.01970] and in-flight reinforcement learning [corpus:arxiv:2606.01478] is used narrowly to bound the latency budget available to any monitoring approach; the structural analogy between those multi-agent contexts and single-robot VLA deployment is argued explicitly in the Discussion rather than assumed. The falsification path is concrete: if architecture-matched monitors do not outperform generic monitors on held-out hardware trials, or if dynamics-aware encoders do not reduce out-of-distribution failure rates beyond what coarse encoders achieve with domain randomisation, the co-design thesis collapses to a restatement of the obvious. We close by identifying the structural conditions under which co-design is tractable and where it currently is not.

---

Authorship: Saluca Agentic AI Research Team (Saluca LLC). AI-drafted from arXiv preprint corpus on the date in the filename.

Cited arXiv preprints: 2605.27284, 2605.28726, 2605.29677, 2605.30326, 2605.30350, 2606.01478, 2606.01597, 2606.01970, 2606.02027, 2606.02562
