Feature Extraction Methods for Binary Code Similarity Detection Using Neural Machine Translation Models

Keywords: Binary code similarity detection, machine learning, Electrical engineering. Electronics. Nuclear engineering, neural machine translation, TK1-9971

Publication: Article, 01 Jan 2023
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Journal: IEEE Access, volume 11, pages 102796-102805 (eISSN: 2169-3536)

Authors: Norimitsu Ito; Masaki Hashimoto; Akira Otsuka

doi: 10.1109/access.2023.3316215


Abstract

Binary code similarity detection is an effective analysis technique for detecting vulnerabilities, bugs, and plagiarism in software whose source code cannot be obtained. The recent proliferation of IoT devices has also increased the demand for similarity detection across different architectures. However, few examples exist of feature extraction methods based on neural machine translation (NMT) models being applied to similarity detection at the basic block level across different architectures. In this research, we propose new methods that extract features faster and detect similarities across different architectures more accurately than existing NMT-based methods for basic block feature extraction. We assume that the intermediate representation of an NMT model trained to translate basic blocks across different architectures captures the semantics of the instructions in each basic block, and we therefore adopt this intermediate representation as the basic block features. We then apply the linear transformation used in bilingual word embedding to align the embedding spaces of basic blocks from different architectures. This enables similarity detection at the basic block level across different architectures with higher accuracy than the distance learning method that existing research uses to align the embedding spaces. In the evaluation experiment, we compared Precision at k (P@k) against existing methods on the same dataset, and our method achieved the highest accuracy, 92%. We also compared the time required for feature extraction on GPUs and found that our method was up to 16 times faster.
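The two ideas the abstract names, aligning embedding spaces with a linear transformation and scoring retrieval with P@k, can be illustrated with a minimal sketch. This is not the authors' implementation: the least-squares fit, the cosine-similarity ranking, the synthetic random "embeddings", and every function name (fit_linear_map, precision_at_k, demo) are assumptions made only for illustration.

```python
import numpy as np


def fit_linear_map(src, tgt):
    """Least-squares W minimizing ||src @ W - tgt||_F over paired rows.

    Assumption: a plain least-squares fit stands in for the bilingual
    word-embedding transformation; the paper may use a different variant.
    """
    W, _, _, _ = np.linalg.lstsq(src, tgt, rcond=None)
    return W


def precision_at_k(queries, candidates, k):
    """Fraction of queries whose true match is in the top-k candidates.

    Row i of `queries` is assumed to match row i of `candidates`;
    candidates are ranked by cosine similarity.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    sims = q @ c.T
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = (topk == np.arange(len(q))[:, None]).any(axis=1)
    return float(hits.mean())


def demo(seed=0, n_train=150, n_test=50, dim=16):
    """Synthetic stand-in for basic block embeddings of two architectures."""
    rng = np.random.default_rng(seed)
    hidden_map = rng.normal(size=(dim, dim))  # unknown relation between spaces
    src = rng.normal(size=(n_train + n_test, dim))
    tgt = src @ hidden_map + 0.01 * rng.normal(size=src.shape)

    # Fit the map on paired training rows, evaluate on held-out rows.
    W = fit_linear_map(src[:n_train], tgt[:n_train])
    baseline = precision_at_k(src[n_train:], tgt[n_train:], k=1)
    aligned = precision_at_k(src[n_train:] @ W, tgt[n_train:], k=1)
    return W, baseline, aligned


if __name__ == "__main__":
    _, baseline, aligned = demo()
    print(f"P@1 without alignment: {baseline:.2f}")
    print(f"P@1 with linear map:   {aligned:.2f}")
```

In this toy setting the held-out pairs are retrieved far more reliably after the map is applied, which is the effect the abstract attributes to aligning the embedding spaces. The numbers it prints say nothing about the results reported in the paper.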

Related Organizations

National Police Academy
Japan
Institute of Information Security
Japan


Impact by BIP!

Selected citations: 1
Popularity: Average
Influence: Average
Impulse: Average
