
This is a collection of datasets used for AI-based software vulnerability detection. All datasets are in .csv format, and each row represents one sample. Each dataset contains functions written in C, and the target of each function is either 0 (non-vulnerable) or 1 (vulnerable).

data_C_Lin2017_test.csv:
Reference paper: Vulnerability Discovery with Function Representation Learning from Unlabeled Projects, 2017.
Data source on GitHub: https://github.com/DanielLin1986/function_representation_learning
This dataset includes 44 vulnerable and 577 non-vulnerable functions from the LibPNG project.

data_C_LineVul_test.csv:
Reference paper: LineVul: A Transformer-based Line-Level Vulnerability Prediction, 2022.
Data source on Hugging Face: https://huggingface.co/datasets/Partha117/LineVul_Test_Dataset
This dataset includes 1055 vulnerable and 17809 non-vulnerable functions.

data_C_PrimeVul_test.csv:
Reference paper: Vulnerability Detection with Code Language Models: How Far Are We?, 2024.
Data source on GitHub: https://github.com/DLVulDet/PrimeVul
From the data source, primevul_test.jsonl was used to create this dataset.
This dataset includes 695 vulnerable and 25213 non-vulnerable functions.

data_C_Choi2017_test.csv:
Reference paper: End-to-End Prediction of Buffer Overruns from Raw Source Code via Neural Memory Networks, 2017.
Data source on GitHub: https://github.com/mjc92/buffer_overrun_memory_networks
From GitHub, all the data in trainnig_100.txt, test_1_100.txt, test_2_100.txt, test_3_100.txt, test_4_100.txt, and the corresponding _labels.txt files were combined to create this dataset.
This dataset includes 7054 vulnerable and 6946 non-vulnerable functions.

data_C_Devign_test.csv:
Reference paper: Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks, 2019.
Data source on Hugging Face: https://huggingface.co/datasets/claudios/code_x_glue_devign
From Hugging Face, all the data in the train, validation, and test splits were combined to create this dataset.
This dataset includes 12460 vulnerable and 14858 non-vulnerable functions.

data_C_Ours_{train,test}.csv:
The vulnerable functions in this dataset were manually collected from GitHub projects with CVEs registered in the NVD from 2002 to 2023. The 6,766 non-vulnerable functions were extracted from the DiverseVul dataset to increase code diversity.
The training set includes 5413 vulnerable and 5413 non-vulnerable functions. The test set includes 1353 vulnerable and 1353 non-vulnerable functions.
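
As a quick sanity check after downloading, the class counts quoted above can be compared with what is actually in each file. The sketch below is only illustrative: it assumes each .csv has a header row and that the label column is named `target` (the description calls the label the "target", but the exact column name is not stated, so pass a different `label_column` if needed). The helper names `count_labels` and `matches_description` are hypothetical and not part of the dataset.

```python
import csv
from collections import Counter

# Counts quoted in the dataset description: (vulnerable, non_vulnerable).
EXPECTED_COUNTS = {
    "data_C_Lin2017_test.csv": (44, 577),
    "data_C_LineVul_test.csv": (1055, 17809),
    "data_C_PrimeVul_test.csv": (695, 25213),
    "data_C_Choi2017_test.csv": (7054, 6946),
    "data_C_Devign_test.csv": (12460, 14858),
    "data_C_Ours_train.csv": (5413, 5413),
    "data_C_Ours_test.csv": (1353, 1353),
}


def count_labels(path, label_column="target"):
    """Count rows per label value (0 or 1) in one dataset file.

    Assumption: the file has a header row and the label column holds
    the integers 0 and 1; "target" is a guessed column name.
    """
    # C functions can be long, so raise the per-field size limit.
    csv.field_size_limit(10**9)
    # newline="" lets the csv module handle multi-line quoted fields.
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        return Counter(int(row[label_column]) for row in reader)


def matches_description(path, filename, label_column="target"):
    """Return True if the file's counts equal the counts quoted above."""
    counts = count_labels(path, label_column)
    return (counts[1], counts[0]) == EXPECTED_COUNTS[filename]
```

For example, `matches_description("data_C_Lin2017_test.csv", "data_C_Lin2017_test.csv")` would be expected to return `True` if the file sits in the working directory and the column-name assumption holds.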
