CE-Bench

Overview This project focuses on investigating the capabilities of large language models (LLMs) in identifying spurious counterexamples during program verification processes. The core challenge addressed is determining whether counterexamples generated during verification are genuine violations or artifacts of the proof technique. Benchmark Construction Methodology Our benchmark construction employs two distinct approaches to generate spurious counterexamples: Approach 1: Invariant Generation-Based Method 1.Utilize the invariant generation tool LaM4Inv to generate clauses via large language models 2.Apply SMT solvers to get inductive counterexamples Approach 2: Boundary State Analysis Method 1. Use OMT solving on the entire inductive invariant obtained from LaM4Inv to derive boundary values for each variable 2. Determine the corresponding boundary states TAG 1. For states satisfying the program's precondition: - Insert the state back at the beginning of the original C program's loop 2. For states not satisfying the precondition: - Insert the state at the end of the loop body 3. Insert assertion in the loop body: `assert(!(target state));` 4. Ultimate Automizer: - FALSE (counterexample) → state is reachable → (must GENUINE) - TRUE (no counterexample) → state is unreachable (must SPURIOUS) - UNKNOWN (not unsed) → cannot decide Research Goals This project aims to evaluate and enhance the ability of LLMs to distinguish between genuine and spurious counterexamples in formal program verification, contributing to more robust and reliable verification methodologies.

Impact byBIP!

	selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	0
	popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.	Average
	influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	Average
	impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.	Average

Found an issue? Give us feedback

0

Average