Automatically Detect Software Security Vulnerabilities Based on Natural Language Processing Techniques and Machine Learning Algorithms

Publication: Article, 11 May 2022
Publisher: The Institute for Research and Community Services (LPPM) ITB
Journal: Journal of ICT Research and Applications, volume 16, pages 70-87 (ISSN: 2337-5787, eISSN: 2338-5499)

Authors: Do Xuan Cho; Vu Ngoc Son; Duong Duc;

doi: 10.5614/itbj.ict.res.appl.2022.16.1.5

Abstract

Software vulnerabilities are a serious problem because attackers often compromise systems by exploiting them. Vulnerabilities can be detected with two main kinds of method: i) signature-based detection, which compares software against a list of known security vulnerabilities; and ii) behavior analysis-based detection, which analyzes the software code using classification algorithms. To detect software security vulnerabilities more accurately, this study proposes a new approach that combines a technique for analyzing and standardizing software code with the random forest (RF) classification algorithm. The novelty and advantage of the proposed method is that, rather than trying to define the behaviors of functions in order to identify abnormal ones, it uses the Word2vec natural language processing model to normalize functions and extract their features. Finally, to detect security vulnerabilities in the functions, the study uses a popular and effective supervised machine learning algorithm.
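The "standardizing software code" step can be pictured with a minimal sketch. The abstract does not specify the normalization rules, so everything here is an assumption made for illustration: the function name `normalize_function`, the `FUN`/`VAR` placeholder scheme, and the small keyword and library-call sets are hypothetical and are not taken from the paper. The idea shown is only that user-chosen names are replaced by generic placeholders, so that a token-embedding model such as Word2vec would see structure rather than arbitrary identifiers.

```python
import re

# Illustrative subsets only (assumptions); a real tool would use full lists.
C_KEYWORDS = {"int", "char", "void", "if", "else", "for", "while",
              "return", "sizeof", "struct", "unsigned", "const"}
KNOWN_CALLS = {"strcpy", "memcpy", "malloc", "free", "printf", "strlen"}

IDENT_RE = re.compile(r"[A-Za-z_]\w*")
# Identifiers, integer literals, or any single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def normalize_function(source):
    """Tokenize a C-like function and map user-defined names to placeholders.

    Names followed by "(" become FUN1, FUN2, ...; other names become
    VAR1, VAR2, ... Keywords and known library calls are kept verbatim.
    """
    tokens = TOKEN_RE.findall(source)
    var_names, fun_names = {}, {}
    normalized = []
    for i, tok in enumerate(tokens):
        is_ident = IDENT_RE.fullmatch(tok) is not None
        if not is_ident or tok in C_KEYWORDS or tok in KNOWN_CALLS:
            normalized.append(tok)
            continue
        # Treat an identifier directly followed by "(" as a function name.
        is_call = i + 1 < len(tokens) and tokens[i + 1] == "("
        table, prefix = (fun_names, "FUN") if is_call else (var_names, "VAR")
        if tok not in table:
            table[tok] = prefix + str(len(table) + 1)
        normalized.append(table[tok])
    return normalized


if __name__ == "__main__":
    code = "void copy_input(char *src) { char buf[16]; strcpy(buf, src); }"
    print(" ".join(normalize_function(code)))
```

Under these assumptions, the resulting token sequences (one per function) are the kind of input that could then be embedded with Word2vec and passed to a random forest classifier; that later stage is not shown here.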

Related Organizations

Posts and Telecommunications Institute of Technology, Viet Nam
FPT University, Viet Nam

Keywords

source code features, machine learning algorithms, software security vulnerability detection, natural language processing techniques, Telecommunication, TK5101-6720, Information technology, T58.5-58.64, software vulnerabilities

Impact by BIP!

selected citations: 13
popularity: Top 10%
influence: Top 10%
impulse: Top 10%

Fields of Science

engineering and technology

electrical engineering, electronic engineering, information engineering
