ZENODO
Report . 2026
License: CC BY
Data sources: Datacite

TestGate: Why Evidence Beats Uncertainty for Local LLM Reliability

Authors: Usman, Rana

Abstract

Local large language models are increasingly used in privacy-sensitive and cost-conscious settings, but their outputs are often unreliable: code fails tests, JSON is malformed, and SQL is invalid. Unlike hosted APIs with structured output guarantees, local deployments must handle reliability at the application layer. This technical report presents TestGate, a systems-oriented study of three post-generation reliability strategies that work with any local model:

  • Uncertainty-based routing using token-level entropy and probability margins from logprobs
  • Evidence-gated routing using test execution to decide selective escalation
  • Contract-first generation using strict JSON schemas with validation and deterministic compilation

We evaluate these strategies on a 50-task subset of HumanEval and a structured-output benchmark (JSON, SQL, Python stubs) using local Qwen2.5-Coder (7B, 14B) and Llama-3.2 (3B) models via Ollama.

Key findings:

  • Uncertainty-based routing degrades code performance (pass@1 drops from 0.54 to 0.42 in repair mode).
  • Evidence-gated routing improves pass@1 from 0.54 to 0.72, even outperforming the larger model alone (0.68) while escalating selectively.
  • Contract-first generation dramatically improves structured validity for code-specialized models (0.70 → 1.00) but reduces validity for general-purpose models.

The central conclusion is that LLM reliability is a systems design problem. Execution-based evidence is a stronger signal than model confidence, and reliability strategies must be matched to model training characteristics.

All code, evaluation artifacts, and reproducibility details are available at: https://github.com/ranausmanai/testgate
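
To make the evidence-gated idea concrete, the following is a minimal Python sketch, not the report's implementation. It assumes each model is wrapped as a plain callable from prompt to source code; the names `passes_tests`, `evidence_gated`, `small_stub`, and `large_stub` are hypothetical, and the stubs stand in for real calls to a local model. The gate runs the small model's draft against the task's tests in a separate interpreter and escalates to the larger model only when the tests fail.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Tuple

# Hypothetical type: a model is any function mapping a prompt to source code.
Model = Callable[[str], str]


def passes_tests(candidate: str, test_code: str, timeout_s: float = 10.0) -> bool:
    """Run candidate code followed by its tests in a fresh interpreter.

    Returns True only when the process exits with status 0 within the timeout.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "check.py"
        script.write_text(candidate + "\n\n" + test_code + "\n", encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0


def evidence_gated(
    prompt: str, test_code: str, small_model: Model, large_model: Model
) -> Tuple[str, bool]:
    """Return (code, escalated). Escalate only when the draft fails its tests."""
    draft = small_model(prompt)
    if passes_tests(draft, test_code):
        return draft, False
    return large_model(prompt), True


# Assumption: stub models for illustration; a real setup would query a local model.
def small_stub(prompt: str) -> str:
    return "def add(a, b):\n    return a - b"  # deliberately wrong draft


def large_stub(prompt: str) -> str:
    return "def add(a, b):\n    return a + b"


TESTS = "assert add(2, 3) == 5"

if __name__ == "__main__":
    code, escalated = evidence_gated("Write add(a, b).", TESTS, small_stub, large_stub)
    print("escalated:", escalated)
    print(code)
```

The routing decision here depends only on an observable outcome (the test exit status) and never on model confidence, which is the distinction the abstract draws between evidence and uncertainty. Running generated code in a subprocess with a timeout gives only basic isolation; a real deployment would likely need stronger sandboxing.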

Keywords

large language model
