publication . Preprint . 2019

In Nomine Function: Naming Functions in Stripped Binaries with Neural Networks

Artuso, Fiorella; Di Luna, Giuseppe Antonio; Massarelli, Luca; Querzoni, Leonardo
Open Access English
  • Published: 17 Dec 2019
Abstract
In this paper we investigate the problem of automatically naming pieces of assembly code, where by naming we mean assigning to an assembly function a string of words that a human reverse engineer would likely assign. We formally and precisely define the framework in which our investigation takes place: we define the problem and justify the choices we made in designing the training and the tests. We performed an analysis on a large real-world corpus of nearly 9 million functions taken from more than 22k software packages. In this framework we test baselines coming from the field of Natural Language Proc...
Subjects
free text keywords: Computer Science - Machine Learning, Computer Science - Computation and Language, Statistics - Machine Learning