Artifact for "Beyond Benchmarks: A Case Study of LLM-Generated Verus Specification Failures on Asterinas Vostd"

Large language models have recently shown promise in automating formal specification generation, but their effectiveness has mainly been demonstrated on self-contained benchmark tasks. This paper examines whether such results transfer to industrial verification by studying LLM-generated Verus specifications on Asterinas vostd, an industrial OS-kernel verification project built around cross-file dependencies and Entity-Model proof abstractions. We analyze six formal verification targets spanning 156 task instances, 50 Rust source files, and 7,149 lines of code, together with 1,587 reviewed failure episodes produced by DeepSeek-V4-Flash, GPT-4o, and Qwen-Coder during generation and repair. Our results show that aggregate benchmark pass rates alone obscure the real industrial bottlenecks. Failures are systematic rather than random: they cluster across four verification stages (i.e., repository integration, Verus legality, semantic modeling, and proof construction), and their distributions vary across models and targets. We further show that verifier diagnostics do not always identify the true source of failure, and that verifier-guided repair can regress or stall instead of converging. These findings indicate that LLM-generated specifications fail not merely because industrial code is harder, but because failures are structured, stage-dependent, and often repaired in the wrong direction. Future evaluation and tooling should therefore move beyond final pass/fail outcomes and account for where specifications break, whether diagnostics identify the real cause, and whether repair makes progress or simply cycles through new errors.

Related Organizations

Shenzhen University
China (People's Republic of)

Impact byBIP!

	selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	0
	popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.	Average
	influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	Average
	impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.	Average

Found an issue? Give us feedback

0

Average