Monitoring Verifier Health in Test-Time Scaling Using Stochastic Power Metrics

Test-time scaling methods such as LLM-as-a-Verifier (Mirhoseini et al., 2026) improve answer selection by scoring candidate outputs with log-probability rank signals. These methods assume that the verifier remains reliably discriminative throughout the sampling process. We identify a gap: no existing method monitors whether the verifier is currently healthy, that is, whether it is still producing a meaningful discriminative signal or has begun to plateau, drift, or produce flat rankings. This paper proposes applying the stochastic power metric P(t) = E(t) × W(t) as a real-time verifier health signal. E(t) measures whether the verifier's current score spread exceeds its own adaptive expected spread. W(t) measures how consistently it does so. When P(t) drops below a threshold, the verifier has lost discrimination power and continued sampling yields diminishing returns. In a stylized simulation calibrated to published TerminalBench 2.0 results, the power metric correctly identifies verifier plateau states and reduces unnecessary candidate generation by 84–96% while retaining quality scores of 0.944–0.976 relative to full-budget verification. This framing is consistent with sequential decision-making theory: the verifier health signal is an instance of the Resource Commitment Principle applied to the verification layer of test-time scaling.
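
As a rough illustration of how a signal of the form P(t) = E(t) × W(t) could be tracked during sampling, the sketch below makes several assumptions that the abstract does not specify: the score spread is taken as the population standard deviation of the verifier scores in a round, the adaptive expected spread is an exponential moving average of past spreads, E(t) is the ratio of the two, and W(t) is the fraction of recent rounds in which the spread exceeded its expectation. The class name `VerifierHealthMonitor` and the parameters `alpha`, `window`, and `threshold` are hypothetical and are not taken from the paper.

```python
from collections import deque
from statistics import pstdev


class VerifierHealthMonitor:
    """Illustrative tracker for P(t) = E(t) * W(t); every definition here is an assumption."""

    def __init__(self, alpha=0.3, window=5, threshold=0.5):
        self.alpha = alpha                # EMA smoothing factor for the expected spread (assumed)
        self.threshold = threshold        # stopping threshold on P(t) (assumed)
        self.expected = None              # adaptive expected spread, set on the first round
        self.wins = deque(maxlen=window)  # 1 if the spread beat its expectation, else 0

    def update(self, scores):
        """Consume one round of verifier scores; return P(t), or None on the first round."""
        spread = pstdev(scores)
        if self.expected is None:
            self.expected = spread        # the first round only establishes a baseline
            return None
        # E(t): current spread relative to the expected spread (0.0 if the baseline is zero).
        e = spread / self.expected if self.expected > 0 else 0.0
        # W(t): fraction of recent rounds in which the spread exceeded its expectation.
        self.wins.append(1 if e > 1.0 else 0)
        w = sum(self.wins) / len(self.wins)
        # Update the expectation only after it has been used for this round.
        self.expected = self.alpha * spread + (1 - self.alpha) * self.expected
        return e * w

    def should_stop(self, p):
        """Suggest halting candidate generation once P(t) falls below the threshold."""
        return p is not None and p < self.threshold


monitor = VerifierHealthMonitor()
first = monitor.update([0.1, 0.5, 0.9])    # baseline round
flat = monitor.update([0.5, 0.5, 0.5])     # flat ranking: zero spread
wide = monitor.update([0.0, 0.5, 1.0])     # spread recovers
```

Under these assumptions, a round of identical scores drives P(t) to zero and triggers the stopping rule, while a round with a wide spread raises it again. The actual definitions of E(t) and W(t) in the paper may differ.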
