CASTLE: Benchmarking Dataset for Static Code Analyzers and LLMs Towards CWE Detection


Richard A. Dubniczky; Krisztofer Zoltan Horvát; Tamás Bisztray; Mohamed Amine Ferrag; Lucas C. Cordeiro; Norbert Tihanyi



arXiv.org e-Print Archive

Preprint . 2025

Data sources: arXiv.org e-Print Archive

Pure University of Manchester

Conference object . 2025

License: CC BY

Data sources: Pure University of Manchester

The University of Manchester - Institutional Repository

Contribution for newspaper or weekly magazine . 2025

Data sources: The University of Manchester - Institutional Repository

https://doi.org/10.1007/978-3-...

Part of book or chapter of book . 2025 . Peer-reviewed

License: Springer Nature TDM

Data sources: Crossref

https://dx.doi.org/10.48550/ar...

Article . 2025

License: CC BY

Data sources: Datacite


Publication: Part of book or chapter of book, Article, Preprint, Conference object, Contribution for newspaper or weekly magazine. 09 Jul 2025. Embargo end date: 01 Jan 2025. English. Publisher: Springer Nature Switzerland


doi: 10.1007/978-3-031-98208-8_15, 10.48550/arxiv.2503.09433

arXiv: 2503.09433



Abstract

Identifying vulnerabilities in source code is crucial, especially in critical software components. Existing methods such as static analysis, dynamic analysis, formal verification, and, more recently, Large Language Models (LLMs) are widely used to detect security flaws. This paper introduces CASTLE (CWE Automated Security Testing and Low-Level Evaluation), a benchmarking framework for evaluating the vulnerability detection capabilities of different methods. We assess 13 static analysis tools, 10 LLMs, and 2 formal verification tools using a hand-crafted dataset of 250 micro-benchmark programs covering 25 common CWEs. We propose the CASTLE Score, a novel evaluation metric to ensure fair comparison. Our results reveal key differences: ESBMC (a formal verification tool) minimizes false positives but struggles with vulnerabilities beyond the scope of model checking, such as weak cryptography or SQL injection. Static analyzers suffer from high false-positive rates, which increases the manual validation effort required of developers. LLMs perform exceptionally well on the CASTLE dataset when identifying vulnerabilities in small code snippets. However, their accuracy declines and hallucinations increase as code size grows. These results suggest that LLMs could play a pivotal role in future security solutions, particularly within code completion frameworks, where they can provide real-time guidance to prevent vulnerabilities. The dataset is accessible at https://github.com/CASTLE-Benchmark.
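To make the false-positive trade-off described in the abstract concrete, the following is a minimal sketch of how a tool's findings on labeled micro-benchmarks could be tallied into true positives, false positives, and false negatives. It is not the CASTLE Score, whose definition is not given in this record; the `Benchmark` type, the `tally` and `precision_recall` helpers, the file names, and the sample findings are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Benchmark:
    """Hypothetical micro-benchmark: file name plus ground-truth CWE id (None if safe)."""
    name: str
    cwe: Optional[int]


def tally(benchmarks: List[Benchmark], findings: Dict[str, Set[int]]) -> Dict[str, int]:
    """Count TP/FP/FN for one tool; `findings` maps file name to reported CWE ids."""
    tp = fp = fn = 0
    for b in benchmarks:
        reported = findings.get(b.name, set())
        if b.cwe is not None and b.cwe in reported:
            tp += 1  # the labeled vulnerability was found
        elif b.cwe is not None:
            fn += 1  # the labeled vulnerability was missed
        # Any reported CWE other than the labeled one needs manual review.
        fp += len(reported - {b.cwe})
    return {"tp": tp, "fp": fp, "fn": fn}


def precision_recall(counts: Dict[str, int]) -> Tuple[float, float]:
    """Standard precision and recall, returning 0.0 when a denominator is zero."""
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Illustrative data only: two vulnerable programs and one safe program.
benchmarks = [Benchmark("a.c", 787), Benchmark("b.c", 89), Benchmark("c.c", None)]
findings = {"a.c": {787}, "b.c": {476}, "c.c": {787, 190}}

counts = tally(benchmarks, findings)
print(counts, precision_recall(counts))
```

In this toy run, the two spurious reports on the safe program `c.c` pull precision down even though one of the two real vulnerabilities is found, which is the kind of manual-validation cost the abstract attributes to static analyzers.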

Related Organizations

University of Oslo, Norway
University of Salford, United Kingdom
University of Guelma, Algeria
Technology Innovation Institute, United Arab Emirates
Eötvös Loránd University, Hungary


Keywords

Software Engineering (cs.SE), FOS: Computer and information sciences, Computer Science - Software Engineering, Large Language Models, Computer Science - Cryptography and Security, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Generative AI, Security, Security Analysis, Cryptography and Security (cs.CR), Static Code Analysis

Impact by BIP!

- selected citations: 1. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

