Why ARC-AGI-3 Is Hard: A Preregistered Structural Analysis

ARC-AGI-3 is the first interactive reasoning benchmark in the ARC-AGI series. Humans score 100%. Frontier AI systems score below 1%. We argue that this gap exists not because AI systems lack any single capability (exploration, modeling, planning, or goal-setting) but because they lack a second space in which to govern their reasoning. The Functional Model of Intelligence (FMI) distinguishes two spaces: a conceptual space navigated by external operators (System I pattern-matching and System II path-construction), and a fitness space navigated by internal operators that maintain dynamical stability and determine which external operator to deploy. Current AI architectures operate in one space. General intelligence requires two. The mode-switching gap between humans and current AI is not a missing capability in conceptual space; it reflects the absence of the fitness-space machinery that governs navigation through conceptual space. We derive five falsifiable predictions from this two-space analysis, filed before meaningful agent performance data on ARC-AGI-3 exists, and specify the statistical tests and falsification conditions for each. We also present supporting empirical evidence from an independent domain: a witness protocol applied to peer reviews demonstrates the same structural distinction between first-order reasoning (navigation in conceptual space), second-order reasoning (governance from fitness space), and third-order reasoning (constructing governance instruments for other agents). All formal derivations are presented in the companion Supplement.
