
No prior automated system tracks hypothesis-level evidence across the full Active Inference and Free Energy Principle (FEP) literature. Manual synthesis cannot keep pace with a field that has grown at a compound annual rate of 20.36% across 2005â2026, and the FEPâs theoretical generality has invited falsifiability critiques that only hypothesis-specific evidence profiling can address. Building on pioneering systematic manual annotation paired with ontology-based anal- ysis at the scale of hundreds of papers, we present a computational meta-analysis framework that automates and scales this approach. The pipeline retrieves literature from arXiv, Semantic Scholar, and OpenAlex, deduplicating đ = 819 papers via a canonical identifier hierarchy (DOI > arXiv ID > Semantic Scholar ID > OpenAlex ID). It classifies papers into a three-tier taxonomy spanning eight categories: A (Core Theory), B (Tools & Translation), and C (Application Domains). An LLM-powered extraction system then evaluates each abstract against eight core hypotheses, producing structured nanopublicationsâeach encoding directionality, a confidence score, and natural-language reasoningâthat populate an RDF-compatible knowledge graph scored by a citation-weighted evidence function. All extracted assertions are automatically generated and have not been manually validated; hypothesis scores should be considered preliminary. The resulting evidence landscape reveals a field where application domains (Domain C, 64.0%) collectively dominate the corpus, with tools development (Domain B, 20.8%)âincluding pymdp, RxInfer.jl, and interpretable alternatives such as Free Energy Projective Simulationâand core theory (Domain A, 15.2%) rounding out the taxonomy. Non-negative matrix factorization identifies 5 latent topics that cross-cut the keyword taxonomy, and citation network analysis exposes a sparse yet structured graph (2,176 intra-corpus edges out of 29,323 total outgoing referencesâonly 7.4% reference resolution, reflecting the corpusâs specialised scope rather than the underlying citation density of any single paper) anchored by pronounced hub papers. Hypothesis scores cluster into three tiers: a broad consensus tier (score > 0.83) covering five hypothesesâH7 Morphogenesis, H2 AIF Optimality, H4 Predictive Coding, H6 Clinical Utility, and H5 Scalability; a near-consensus boundary (H8 Language AIF, score â+0.83); a moderate debate tier (H3 Markov Blanket Realism, â+0.78); and a diffuse tier (H1 FEP Universality, â+0.48) where a large neutral plurality reflects the principleâs broad invocation without explicit empirical testâthough absolute score magnitudes are inflated by publication bias and linguistic asymmetry in academic writing, making relative rankings and temporal trajectories more reliable than point estimates. By demonstrating that automated LLM-driven assertion extractionâoperating without human-validated ground truthâcan generate scalable, queryable representations of scientific evidence, this work provides a reusable architecture for living literature reviewsâcontinuously updated knowledge graphs that track hypothesis-level consensus across rapidly evolving fields. All code, results, and methods to reproduce this manuscript are open source at https://github.com/ActiveInferenceInstitute/act_inf_metaanalysis/ .
