Publication · Preprint · 2021

Variable Name Recovery in Decompiled Binary Code using Constrained Masked Language Modeling

Banerjee, Pratyay; Pal, Kuntal Kumar; Wang, Fish; Baral, Chitta
Open Access English
  • Published: 23 Mar 2021
Abstract
Decompilation is the procedure of transforming binary programs into a high-level representation, such as source code, for human analysts to examine. While modern decompilers can reconstruct and recover much information that is discarded during compilation, inferring variable names is still extremely difficult. Inspired by recent advances in natural language processing, we propose a novel solution to infer variable names in decompiled code based on Masked Language Modeling, Byte-Pair Encoding, and neural architectures such as Transformers and BERT. Our solution takes raw decompiler output, which is less semantically meaningful, as input, and enriches it ...
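To illustrate the masked-language-modeling setup the abstract describes (not the authors' actual pipeline, which operates on BPE subtokens), a minimal sketch of the masking step: decompiler-generated placeholder identifiers such as `v1` or `a2` are replaced with a `[MASK]` token, and a model is then trained to predict the original names from context. The function name and the `v`/`a` naming pattern are assumptions based on common decompiler output.

```python
import re

def mask_decompiler_vars(code: str) -> str:
    """Replace decompiler-generated placeholder names (e.g. v1, a2)
    with a [MASK] token, mimicking the masked-language-modeling setup
    on raw decompiler output. Hypothetical helper for illustration only;
    the paper's pipeline additionally applies Byte-Pair Encoding."""
    return re.sub(r"\b[va]\d+\b", "[MASK]", code)

# Example: a line of raw decompiler output with generic variable names.
decompiled = "int v1; v1 = a1 + 2; return v1;"
print(mask_decompiler_vars(decompiled))
# -> int [MASK]; [MASK] = [MASK] + 2; return [MASK];
```

A trained model would then fill each `[MASK]` with a meaningful name (e.g. `count`) inferred from the surrounding code context.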
Subjects
free text keywords: Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Cryptography and Security
54 references, page 1 of 4

[1] Miltiadis Allamanis, Earl T Barr, Christian Bird, and Charles Sutton. 2014. Learning natural coding conventions. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. 281-293.

[2] Miltiadis Allamanis, Earl T Barr, Premkumar Devanbu, and Charles Sutton. 2018. A survey of machine learning for big code and naturalness. ACM Computing Surveys (CSUR) 51, 4 (2018), 1-37.

[3] Fiorella Artuso, Giuseppe Antonio Di Luna, Luca Massarelli, and Leonardo Querzoni. 2019. Function Naming in Stripped Binaries Using Neural Networks. arXiv preprint arXiv:1912.07946 (2019). [OpenAIRE]

[4] Rohan Bavishi, Michael Pradel, and Koushik Sen. 2018. Context2Name: A deep learning-based approach to infer natural variable names from usage contexts. arXiv preprint arXiv:1809.05193 (2018).

[5] Tony Beltramelli. 2018. pix2code: Generating code from a graphical user interface screenshot. In Proceedings of the ACM SIGCHI Symposium on Engineering Interactive Computing Systems. 1-6.

[6] Sahil Bhatia and Rishabh Singh. 2016. Automated correction for syntax errors in programming assignments using recurrent neural networks. arXiv preprint arXiv:1603.06129 (2016). [OpenAIRE]

[7] Marcel Bruch, Martin Monperrus, and Mira Mezini. 2009. Learning from examples to improve code completion systems. In Proceedings of the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on the foundations of software engineering. 213-222.

[8] Juan Caballero and Zhiqiang Lin. 2016. Type Inference on Executables. Comput. Surveys 48, 4 (2016), 1-35. https://doi.org/10.1145/2896499

[9] Joshua Charles Campbell, Abram Hindle, and José Nelson Amaral. 2014. Syntax errors just aren't natural: improving error reporting with language models. In Proceedings of the 11th Working Conference on Mining Software Repositories. 252-261.

[10] Luigi Cerulo, Michele Ceccarelli, Massimiliano Di Penta, and Gerardo Canfora. 2013. A hidden Markov model to detect coded information islands in free text. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM). IEEE, 157-166.

[11] F Chagnon. [n.d.]. IDA-Decompiler.

[12] Yaniv David, Uri Alon, and Eran Yahav. 2019. Neural reverse engineering of stripped binaries. arXiv preprint arXiv:1902.09122 (2019).

[13] Premkumar Devanbu. 2015. New initiative: the naturalness of software. In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, Vol. 2. IEEE, 543-546.

[14] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).

[15] The Ghidra decompiler. 2019. https://ghidra-sre.org/
