ZENODO
Preprint . 2026
License: CC BY
Data sources: ZENODO

ACAR: Adaptive Complexity Routing for Multi-Model Ensembles with Auditable Decision Traces

Authors: Kumaresan, Ramchand

Abstract

We present ACAR (Adaptive Complexity & Attribution Routing), a measurement framework for studying multi-model orchestration under auditable conditions. ACAR uses self-consistency variance (σ) computed from N=3 samples to route tasks across single-model, two-model, and three-model execution modes, implemented atop TEAMLLM, a deterministic substrate with immutable artifacts and complete decision traces. We evaluate across 1,510 tasks spanning four benchmarks (MathArena, Reasoning Gym, LiveCodeBench, SuperGPQA) with Claude Sonnet 4, GPT-4o, and Gemini 2.0 Flash, producing 7,550+ auditable runs. What holds: σ-based routing achieves 55.6% accuracy, exceeding the two-model baseline (54.4%) while avoiding full ensembling on 54.2% of tasks; the mechanism is model-agnostic and requires no learned components. What does not hold: (1) Retrieval augmentation decreased accuracy by 3.4 percentage points—median retrieval similarity was only 0.167, demonstrating that experience injection without semantic alignment introduces harmful noise rather than grounding. (2) When models agree on incorrect answers (σ=0), no downstream ensemble can recover; this “agreement-but-wrong” failure mode is intrinsic to self-consistency and bounds achievable accuracy at 8 percentage points below full ensembling. (3) Attribution estimates based on proxy signals (response similarity, entropy) showed weak correlation with ground-truth leave-one-out values; practical attribution requires explicit counterfactual computation. This paper documents which assumptions fail in practice, providing falsifiable baselines for future work on routing, retrieval, and multi-model attribution.
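The abstract's core mechanism — routing by self-consistency disagreement across N=3 samples — can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the disagreement metric (fraction of samples deviating from the majority answer, standing in for the paper's σ) and the routing thresholds `low` and `high` are assumptions chosen for demonstration.

```python
from collections import Counter

def route_by_self_consistency(samples, low=0.0, high=0.5):
    """Route a task to an execution tier by disagreement among N samples.

    `samples` is a list of answer strings from one model (e.g. N=3 draws).
    Disagreement is 1 - (majority count / N): 0.0 means full agreement.
    Thresholds `low`/`high` are illustrative, not taken from the paper.
    """
    counts = Counter(samples)
    majority = counts.most_common(1)[0][1]
    disagreement = 1.0 - majority / len(samples)
    if disagreement <= low:
        return "single-model"   # samples agree: cheap path suffices
    elif disagreement <= high:
        return "two-model"      # mild disagreement: partial ensemble
    return "three-model"        # high disagreement: full ensemble
```

Note that this also exposes the "agreement-but-wrong" failure mode the abstract describes: three identical wrong answers give disagreement 0.0, so the router takes the single-model path and no downstream ensemble is ever consulted.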

Keywords

Large Language Models, Multi-model routing, Model Evaluation
