publication . Preprint . 2019

A Literature Study of Embeddings on Source Code

Chen, Zimin; Monperrus, Martin
Open Access, English
Published: 05 Apr 2019
Abstract
Natural language processing has improved tremendously since the success of word embedding techniques such as word2vec. Recently, the same idea has been applied to source code with encouraging results. In this survey, we collect and discuss uses of word embedding techniques on programs and source code. The articles were collected by asking authors of related work and through an extensive search on Google Scholar. Each article is assigned to one of five categories: (1) embedding of tokens; (2) embedding of functions or methods; (3) embedding of sequences or sets of method calls; (4) embedding of binary code; (5) other embeddings. We also provide ...
Subjects
free text keywords: Computer Science - Machine Learning, Computer Science - Programming Languages, Computer Science - Software Engineering, Statistics - Machine Learning