Powered by OpenAIRE graph
Found an issue? Give us feedback
image/svg+xml art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos Open Access logo, converted into svg, designed by PLoS. This version with transparent background. http://commons.wikimedia.org/wiki/File:Open_Access_logo_PLoS_white.svg art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos http://www.plos.org/ ZENODOarrow_drop_down
image/svg+xml art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos Open Access logo, converted into svg, designed by PLoS. This version with transparent background. http://commons.wikimedia.org/wiki/File:Open_Access_logo_PLoS_white.svg art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos http://www.plos.org/
ZENODO
Dataset . 2026
Data sources: ZENODO
ZENODO
Dataset . 2026
Data sources: Datacite
ZENODO
Dataset . 2026
Data sources: Datacite
ZENODO
Dataset . 2026
Data sources: Datacite
versions View all 3 versions
addClaim

Artifact for "Beyond Benchmarks: A Case Study of LLM-Generated Verus Specification Failures on Asterinas Vostd"

Authors: Liu, Dugang;

Artifact for "Beyond Benchmarks: A Case Study of LLM-Generated Verus Specification Failures on Asterinas Vostd"

Abstract

Large language models have recently shown promise in automating formal specification generation, but their effectiveness has mainly been demonstrated on self-contained benchmark tasks. This paper examines whether such results transfer to industrial verification by studying LLM-generated Verus specifications on Asterinas vostd, an industrial OS-kernel verification project built around cross-file dependencies and Entity-Model proof abstractions. We analyze six formal verification targets spanning 156 task instances, 50 Rust source files, and 7,149 lines of code, together with 1,587 reviewed failure episodes produced by DeepSeek-V4-Flash, GPT-4o, and Qwen-Coder during generation and repair. Our results show that aggregate benchmark pass rates alone obscure the real industrial bottlenecks. Failures are systematic rather than random: they cluster across four verification stages (i.e., repository integration, Verus legality, semantic modeling, and proof construction), and their distributions vary across models and targets. We further show that verifier diagnostics do not always identify the true source of failure, and that verifier-guided repair can regress or stall instead of converging. These findings indicate that LLM-generated specifications fail not merely because industrial code is harder, but because failures are structured, stage-dependent, and often repaired in the wrong direction. Future evaluation and tooling should therefore move beyond final pass/fail outcomes and account for where specifications break, whether diagnostics identify the real cause, and whether repair makes progress or simply cycles through new errors.

Related Organizations
  • BIP!
    Impact byBIP!
    selected citations
    These citations are derived from selected sources.
    This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
    0
    popularity
    This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
    Average
    influence
    This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
    Average
    impulse
    This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
    Average
Powered by OpenAIRE graph
Found an issue? Give us feedback
selected citations
These citations are derived from selected sources.
This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
BIP!Citations provided by BIP!
popularity
This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
BIP!Popularity provided by BIP!
influence
This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
BIP!Influence provided by BIP!
impulse
This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
BIP!Impulse provided by BIP!
0
Average
Average
Average