
This deposit contains the code, data, and supplementary materials for the paper "Systematic Ablation Reveals Hidden Failures in Multi-Agent AI for Science" by Bianchi & Schokker, 2026.

The paper introduces a systematic ablation methodology for retrieval-augmented multi-agent AI systems, validated through a triple-triangulation evaluation framework that combines deterministic ground-truth metrics, calibrated LLM-as-judge scoring, and natural-language-inference fact-checking. The methodology is applied to more than 36,000 individual evaluations spanning 200 scientific papers and 250 expert-curated questions across ten experiments.

Contents of this deposit:

- corpus_papers.csv: 200-paper manifest (50 core bioinformatics / veterinary epidemiology papers + 150 arXiv distractors) with DOIs, PMC IDs, arXiv IDs, source URLs, and licensing.
- download_corpus.py: Python script that recreates the corpus on demand from the manifest.
- corpus_README.md: reproduction guide for the 200-paper corpus.
- corpus_metadata.json: per-paper metadata for the 50 core papers.
- ground_truth.json: 250 expert-curated evaluation questions with expected answers and concepts.
- validation_main.json, validation_cross_document.json, validation_synthesis.json, validation_ood.json: question validation results across the four categories.
- mentori_results.tar.gz: raw JSON outputs from all ten experiments (V4-0 through V4-9), 47 MB compressed, ~200 MB extracted.
- paper_figures.Rmd: single source of truth for all main and Extended Data figures, as an R Markdown document.
- paper_figures_tiff.tar.gz: pre-rendered TIFF versions of every figure at 300 dpi (the submission versions for the Extended Data figures).
- paper_figures_pdf.tar.gz: pre-rendered PDF (vector) versions of every figure (the submission versions for the main figures).
To reproduce the paper figures from scratch:

    git clone https://github.com/vbianchi/Mentori.git
    cd Mentori
    ./publication/data/download_results.sh
    Rscript -e "rmarkdown::render('publication/reports/paper_figures.Rmd')"

The Mentori multi-agent workspace itself is an open-source software release available at https://github.com/vbianchi/Mentori and is licensed separately under MIT for the code and CC-BY 4.0 for figures and derived data.

The 200-paper evaluation corpus is NOT redistributed in primary form in this deposit due to publisher copyright. Use download_corpus.py (included) together with corpus_papers.csv to reconstruct the exact corpus on demand.
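For readers who want to inspect the raw experiment outputs without running the R pipeline, the following is a minimal Python sketch for loading every JSON file from mentori_results.tar.gz into memory. It is not part of the deposit: the function name `load_json_results` is hypothetical, and because the archive's internal folder layout is not described here, the sketch assumes only that the archive is gzip-compressed and contains files ending in `.json`.

```python
import json
import tarfile
from pathlib import Path


def load_json_results(archive_path):
    """Return {member_name: parsed_json} for every .json file in a tar.gz archive.

    Hypothetical helper: it makes no assumption about folder names inside
    the archive, only that result files end in ".json".
    """
    results = {}
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive.getmembers():
            # Skip directories and anything that is not a JSON file.
            if not member.isfile() or not member.name.endswith(".json"):
                continue
            # Read the member directly from the archive without extracting to disk.
            with archive.extractfile(member) as handle:
                results[member.name] = json.load(handle)
    return results


if __name__ == "__main__":
    archive = Path("mentori_results.tar.gz")
    if archive.exists():
        loaded = load_json_results(archive)
        print(f"Loaded {len(loaded)} JSON files")
```

Reading members in place avoids writing the roughly 200 MB of extracted files to disk, at the cost of holding the parsed results in memory; for selective analysis, filtering on `member.name` before parsing would be a natural refinement.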
multi-agent AI, retrieval-augmented generation, RAG evaluation, LLM-as-a-judge, benchmark, ablation, scientific AI, construct validity, faithfulness, natural language inference, bioinformatics, veterinary epidemiology, reproducible research
