TestGate: Why Evidence Beats Uncertainty for Local LLM Reliability

Local large language models are increasingly used in privacy-sensitive and cost-conscious settings, but their outputs are often unreliable: code fails tests, JSON is malformed, and SQL is invalid. Unlike hosted APIs with structured output guarantees, local deployments must handle reliability at the application layer.

This technical report presents TestGate, a systems-oriented study of three post-generation reliability strategies that work with any local model:

(1) Uncertainty-based routing, using token-level entropy and probability margins from logprobs.
(2) Evidence-gated routing, using test execution to decide selective escalation.
(3) Contract-first generation, using strict JSON schemas with validation and deterministic compilation.

We evaluate these strategies on a 50-task subset of HumanEval and a structured-output benchmark (JSON, SQL, Python stubs) using local Qwen2.5-Coder (7B, 14B) and Llama-3.2 (3B) models via Ollama.

Key findings:

(1) Uncertainty-based routing degrades code performance (pass@1 drops from 0.54 to 0.42 in repair mode).
(2) Evidence-gated routing improves pass@1 from 0.54 to 0.72, even outperforming the larger model alone (0.68) while escalating selectively.
(3) Contract-first generation dramatically improves structured validity for code-specialized models (0.70 → 1.00) but reduces validity for general-purpose models.

The central conclusion is that LLM reliability is a systems design problem. Execution-based evidence is a stronger signal than model confidence, and reliability strategies must be matched to model training characteristics.

All code, evaluation artifacts, and reproducibility details are available at: https://github.com/ranausmanai/testgate
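
To make the contrast between the first two strategies concrete, the following is a minimal Python sketch and not the TestGate implementation. All names here (`uncertainty_signals`, `run_tests`, `evidence_gated_route`, and the `small_model` / `large_model` callables) are hypothetical. It assumes that the uncertainty signals are averaged over token positions from top-k logprobs, and that escalation simply regenerates with the larger model; the report's own repair and escalation logic may differ.

```python
import math
from typing import Callable, Dict, Sequence, Tuple


def uncertainty_signals(top_logprobs: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Return (mean token entropy, mean top-1 minus top-2 probability margin).

    top_logprobs[i] holds the log-probabilities of the top-k candidates at
    token position i. Entropy is computed over the renormalised top-k only,
    which approximates (and is not equal to) full-vocabulary entropy.
    """
    if not top_logprobs:
        raise ValueError("need at least one token position")
    entropies, margins = [], []
    for position in top_logprobs:
        probs = sorted((math.exp(lp) for lp in position), reverse=True)
        total = sum(probs)
        norm = [p / total for p in probs]
        entropies.append(-sum(p * math.log(p) for p in norm if p > 0))
        margins.append(norm[0] - (norm[1] if len(norm) > 1 else 0.0))
    n = len(top_logprobs)
    return sum(entropies) / n, sum(margins) / n


def run_tests(candidate_source: str, test_source: str) -> bool:
    """Run the candidate and then its tests; True only if nothing raises.

    Assumption for brevity: in-process exec. A real harness should run
    untrusted model output in a sandboxed subprocess with a timeout.
    """
    namespace: Dict[str, object] = {}
    try:
        exec(candidate_source, namespace)
        exec(test_source, namespace)
    except Exception:
        return False
    return True


def evidence_gated_route(
    prompt: str,
    test_source: str,
    small_model: Callable[[str], str],
    large_model: Callable[[str], str],
) -> Tuple[str, str]:
    """Keep the small model's draft if it passes the tests, else escalate."""
    draft = small_model(prompt)
    if run_tests(draft, test_source):
        return draft, "small"
    return large_model(prompt), "escalated"
```

In this sketch the routing decision depends only on whether the draft passes its tests, not on how confident the model was while producing it. A uncertainty-based router would instead compare the values returned by `uncertainty_signals` against thresholds, which is the signal the report finds less reliable for code.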

Keywords

large language model
