ZENODO
Preprint . 2026
License: CC BY
Data sources: ZENODO

ACAR: Adaptive Complexity Routing for Multi-Model Ensembles with Auditable Decision Traces

Authors: Kumaresan, Ramchand

Abstract

We present ACAR (Adaptive Complexity & Attribution Routing), a measurement framework for studying multi-model orchestration under auditable conditions. ACAR uses self-consistency variance (σ) computed from N=3 samples to route tasks across single-model, two-model, and three-model execution modes, implemented atop TEAMLLM, a deterministic substrate with immutable artifacts and complete decision traces. We evaluate across 1,510 tasks spanning four benchmarks (MathArena, Reasoning Gym, LiveCodeBench, SuperGPQA) with Claude Sonnet 4, GPT-4o, and Gemini 2.0 Flash, producing 7,550+ auditable runs. What holds: σ-based routing achieves 55.6% accuracy, exceeding the two-model baseline (54.4%) while avoiding full ensembling on 54.2% of tasks; the mechanism is model-agnostic and requires no learned components. What does not hold: (1) Retrieval augmentation decreased accuracy by 3.4 percentage points—median retrieval similarity was only 0.167, demonstrating that experience injection without semantic alignment introduces harmful noise rather than grounding. (2) When models agree on incorrect answers (σ=0), no downstream ensemble can recover; this “agreement-but-wrong” failure mode is intrinsic to self-consistency and bounds achievable accuracy at 8 percentage points below full ensembling. (3) Attribution estimates based on proxy signals (response similarity, entropy) showed weak correlation with ground-truth leave-one-out values; practical attribution requires explicit counterfactual computation. This paper documents which assumptions fail in practice, providing falsifiable baselines for future work on routing, retrieval, and multi-model attribution.
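The abstract's core mechanism — routing by self-consistency disagreement across N=3 samples — can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the disagreement metric (fraction of samples deviating from the majority answer, standing in for the paper's σ) and the routing thresholds `low` and `high` are assumptions chosen for demonstration.

```python
from collections import Counter

def route_by_self_consistency(samples, low=0.0, high=0.5):
    """Route a task to an execution tier by disagreement among N samples.

    `samples` is a list of answer strings from one model (e.g. N=3 draws).
    Disagreement is 1 - (majority count / N): 0.0 means full agreement.
    Thresholds `low`/`high` are illustrative, not taken from the paper.
    """
    counts = Counter(samples)
    majority = counts.most_common(1)[0][1]
    disagreement = 1.0 - majority / len(samples)
    if disagreement <= low:
        return "single-model"   # samples agree: cheap path suffices
    elif disagreement <= high:
        return "two-model"      # mild disagreement: partial ensemble
    return "three-model"        # high disagreement: full ensemble
```

Note that this also exposes the "agreement-but-wrong" failure mode the abstract describes: three identical wrong answers give disagreement 0.0, so the router takes the single-model path and no downstream ensemble is ever consulted.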

Keywords

Large Language Models, Multi-model routing, Model Evaluation
