A collection of datasets for software vulnerability detection

This is a collection of datasets used for AI-based software vulnerability detection. All datasets are in .csv format, and each row represents one sample. Each dataset contains functions written in C, and the target of each function is either 0 (non-vulnerable) or 1 (vulnerable).

data_C_Lin2017_test.csv:
Reference paper: Vulnerability Discovery with Function Representation Learning from Unlabeled Projects, 2017.
Data source on GitHub: https://github.com/DanielLin1986/function_representation_learning
This dataset includes 44 vulnerable and 577 non-vulnerable functions from the LibPNG project.

data_C_LineVul_test.csv:
Reference paper: LineVul: A Transformer-based Line-Level Vulnerability Prediction, 2022.
Data source on Hugging Face: https://huggingface.co/datasets/Partha117/LineVul_Test_Dataset
This dataset includes 1055 vulnerable and 17809 non-vulnerable functions.

data_C_PrimeVul_test.csv:
Reference paper: Vulnerability Detection with Code Language Models: How Far Are We?, 2024.
Data source on GitHub: https://github.com/DLVulDet/PrimeVul
From the data source, primevul_test.jsonl was used to create this dataset.
This dataset includes 695 vulnerable and 25213 non-vulnerable functions.

data_C_Choi2017_test.csv:
Reference paper: End-to-End Prediction of Buffer Overruns from Raw Source Code via Neural Memory Networks, 2017.
Data source on GitHub: https://github.com/mjc92/buffer_overrun_memory_networks
From GitHub, all the data in trainnig_100.txt, test_1_100.txt, test_2_100.txt, test_3_100.txt, test_4_100.txt, and the corresponding _labels.txt files were combined to create this dataset.
This dataset includes 7054 vulnerable and 6946 non-vulnerable functions.

data_C_Devign_test.csv:
Reference paper: Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks, 2019.
Data source on Hugging Face: https://huggingface.co/datasets/claudios/code_x_glue_devign
From Hugging Face, all the data in the train, validation, and test splits were combined to create this dataset.
This dataset includes 12460 vulnerable and 14858 non-vulnerable functions.

data_C_Ours_{train,test}.csv:
This dataset was manually collected from GitHub projects with CVEs registered in the NVD from 2002 to 2023. The 6,766 non-vulnerable functions were extracted from the DiverseVul dataset to increase code diversity.
The training set includes 5413 vulnerable and 5413 non-vulnerable functions. The test set includes 1353 vulnerable and 1353 non-vulnerable functions.
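
Since every file shares the same row-per-function layout with a binary target, the class counts listed above can be checked with a few lines of Python. The following is a minimal sketch, not part of the datasets themselves: the column names "func" and "target" are assumptions (the description does not state the CSV headers), and the in-memory sample stands in for a real file.

```python
import csv
import io
from collections import Counter


def count_labels(csv_file, label_column="target"):
    """Count how many rows carry each label value in an open CSV file.

    label_column is a hypothetical header name; adjust it to match
    the actual header of the file being read.
    """
    reader = csv.DictReader(csv_file)
    counts = Counter()
    for row in reader:
        counts[int(row[label_column])] += 1
    return counts


# Tiny in-memory stand-in for one of the .csv files. The column names
# "func" and "target" are assumptions, not taken from the datasets.
sample = io.StringIO(
    'func,target\n'
    '"int f(void) { return 0; }",0\n'
    '"void g(char *s) { char b[4]; strcpy(b, s); }",1\n'
    '"int h(int a) { return a + 1; }",0\n'
)

counts = count_labels(sample)
print(f"non-vulnerable: {counts[0]}, vulnerable: {counts[1]}")
```

To run this against a real file, pass a handle opened with open(path, newline="", encoding="utf-8") instead of the in-memory sample. If the header assumption holds, data_C_Lin2017_test.csv should report 577 non-vulnerable and 44 vulnerable rows. Long C functions may exceed the csv module's default field size limit, in which case csv.field_size_limit can be raised before reading.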

Related Organizations

Luxembourg Institute of Science and Technology
Luxembourg

Funded by

EC | LAZARUS