
doi: 10.5281/zenodo.18044748 , 10.5281/zenodo.18045015 , 10.5281/zenodo.18042791 , 10.48550/arxiv.2601.00847 , 10.5281/zenodo.18067219 , 10.5281/zenodo.18045379 , 10.5281/zenodo.18073416 , 10.5281/zenodo.18080972 , 10.5281/zenodo.18042792 , 10.5281/zenodo.18049864 , 10.5281/zenodo.18050162 , 10.5281/zenodo.18044986
arXiv: 2601.00847
Modern AI inference systems treat transformer execution as mandatory, conflating model capability with execution necessity. We reframe inference as a control-plane decision problem: determining when execution is necessary and when correctness can be preserved through alternative pathways. We introduce Meaning-First Execution (MFEE), a control-plane architecture that implements this framework by selectively invoking transformer inference only when required. MFEE operates as a gating layer above existing stacks without modifying models, weights, or parameters. Across 1,000 diverse prompts under deterministic decoding, MFEE achieves a 78.1% execution reduction while maintaining 100% exact-match equivalence for invoked executions. Comparative evaluation shows that pattern-based routers achieve at most 53.3% avoidance with correctness failures, while MFEE reaches 100% avoidance with zero failures through semantic analysis. We prove this limitation in Theorem 1: any router operating solely on finite feature maps cannot simultaneously guarantee zero false skips and positive avoidance on feature-collision pairs. These results establish execution governance as a foundational layer in ML systems infrastructure, orthogonal to model-level optimization techniques.
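The gating idea in the abstract can be sketched as a decision function that runs before the model is ever invoked. The sketch below is illustrative only: the names (`semantic_gate`, `resolve`, `GateDecision`) are hypothetical and not the paper's actual API, and the toy resolver stands in for MFEE's semantic analysis, which the paper does not specify here.

```python
# Minimal sketch of a control-plane gate in the spirit of MFEE.
# All identifiers are illustrative assumptions, not the paper's API.
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class GateDecision:
    execute: bool             # True -> fall through to transformer inference
    response: Optional[str]   # filled only when execution is avoided


def semantic_gate(prompt: str,
                  resolve: Callable[[str], Optional[str]]) -> GateDecision:
    """Decide whether transformer execution is necessary for `prompt`.

    `resolve` stands in for semantic analysis: it returns an equivalent
    response when one can be guaranteed, else None. A pattern-based router
    would replace this with matching on a finite feature map, which (per
    Theorem 1) cannot guarantee zero false skips with positive avoidance
    on feature-collision pairs.
    """
    equivalent = resolve(prompt)
    if equivalent is not None:
        # Execution avoided: serve the answer through the alternative pathway.
        return GateDecision(execute=False, response=equivalent)
    # No guaranteed-equivalent pathway: invoke the model unchanged.
    return GateDecision(execute=True, response=None)


# Toy resolver: avoids execution only for prompts it can answer exactly.
def toy_resolver(prompt: str) -> Optional[str]:
    table = {"What is 2+2?": "4"}
    return table.get(prompt)


d1 = semantic_gate("What is 2+2?", toy_resolver)
d2 = semantic_gate("Summarize this paper.", toy_resolver)
```

Because the gate sits above the serving stack and never touches weights, it composes with model-level optimizations (quantization, distillation) rather than replacing them.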
24 pages, 5 figures. Deterministic evaluation protocol. Includes theoretical analysis and empirical validation on GPT-2 and Gemma 2 9B
FOS: Computer and information sciences, AI Infrastructure, Runtime Optimization, Production ML Systems, Reproducible Benchmarks, Transformer Inference, GPU Cost Reduction, Compute Avoidance, Transformer Equivalence, Inference Gating, Energy-Efficient AI, Machine Learning (cs.LG), Machine Learning, Inference Optimization, Meaning-First Execution, Deterministic Evaluation, ML Systems
