
doi: 10.5281/zenodo.18044748 , 10.5281/zenodo.18045015 , 10.5281/zenodo.18042791 , 10.48550/arxiv.2601.00847 , 10.5281/zenodo.18067219 , 10.5281/zenodo.18045379 , 10.5281/zenodo.18073416 , 10.5281/zenodo.18080972 , 10.5281/zenodo.18042792 , 10.5281/zenodo.18049864 , 10.5281/zenodo.18050162 , 10.5281/zenodo.18044986
arXiv: 2601.00847
Modern AI inference systems treat transformer execution as mandatory, conflating model capability with execution necessity. We reframe inference as a control-plane decision problem: determining when execution is necessary and when correctness can be preserved through alternative pathways. We introduce Meaning-First Execution (MFEE), a control-plane architecture that implements this framework by selectively invoking transformer inference only when required. MFEE operates as a gating layer above existing stacks without modifying models, weights, or parameters. Across 1,000 diverse prompts under deterministic decoding, MFEE achieves a 78.1% execution reduction while maintaining 100% exact-match equivalence for invoked executions. Comparative evaluation shows that pattern-based routers achieve at most 53.3% avoidance with correctness failures, while MFEE reaches 100% avoidance with zero failures through semantic analysis. We prove this limitation in Theorem 1: any router operating solely on finite feature maps cannot simultaneously guarantee zero false skips and positive avoidance on feature-collision pairs. These results establish execution governance as a foundational layer in ML systems infrastructure, orthogonal to model-level optimization techniques.
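The gating idea in the abstract can be sketched as a decision function that runs before the model is ever invoked. The sketch below is illustrative only: the names (`semantic_gate`, `resolve`, `GateDecision`) are hypothetical and not the paper's actual API, and the toy resolver stands in for MFEE's semantic analysis, which the paper does not specify here.

```python
# Minimal sketch of a control-plane gate in the spirit of MFEE.
# All identifiers are illustrative assumptions, not the paper's API.
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class GateDecision:
    execute: bool             # True -> fall through to transformer inference
    response: Optional[str]   # filled only when execution is avoided


def semantic_gate(prompt: str,
                  resolve: Callable[[str], Optional[str]]) -> GateDecision:
    """Decide whether transformer execution is necessary for `prompt`.

    `resolve` stands in for semantic analysis: it returns an equivalent
    response when one can be guaranteed, else None. A pattern-based router
    would replace this with matching on a finite feature map, which (per
    Theorem 1) cannot guarantee zero false skips with positive avoidance
    on feature-collision pairs.
    """
    equivalent = resolve(prompt)
    if equivalent is not None:
        # Execution avoided: serve the answer through the alternative pathway.
        return GateDecision(execute=False, response=equivalent)
    # No guaranteed-equivalent pathway: invoke the model unchanged.
    return GateDecision(execute=True, response=None)


# Toy resolver: avoids execution only for prompts it can answer exactly.
def toy_resolver(prompt: str) -> Optional[str]:
    table = {"What is 2+2?": "4"}
    return table.get(prompt)


d1 = semantic_gate("What is 2+2?", toy_resolver)
d2 = semantic_gate("Summarize this paper.", toy_resolver)
```

Because the gate sits above the serving stack and never touches weights, it composes with model-level optimizations (quantization, distillation) rather than replacing them.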
24 pages, 5 figures. Deterministic evaluation protocol. Includes theoretical analysis and empirical validation on GPT-2 and Gemma 2 9B
FOS: Computer and information sciences, AI Infrastructure, Runtime Optimization, Production ML Systems, Reproducible Benchmarks, Transformer Inference, GPU Cost Reduction, Compute Avoidance, Transformer Equivalence, Inference Gating, Energy-Efficient AI, Machine Learning (cs.LG), Machine Learning, Inference Optimization, Meaning-First Execution, Deterministic Evaluation, ML Systems
