
The application of neural networks in software engineering has greatly eased the burden of the traditional approach of extracting code features manually. Previous code feature extraction models usually either treat code as natural language or depend heavily on the domain knowledge of experts. Treating code as natural language is too simplistic and can easily cause information loss, while models built on expert-designed heuristic rules are usually too complicated and lack extensibility and generalization. To address these problems, this paper proposes a model based on a convolutional neural network and a recurrent neural network that extracts code features from the abstract syntax tree (AST). To mitigate the vanishing gradient problem caused by the large size of the AST, the AST is split into a sequence of small ASTs, which are then fed into the model. The convolutional neural network extracts structural information and the recurrent neural network extracts sequential information. The whole procedure requires no expert domain knowledge to guide training; the model automatically learns how to extract features from code that has been labeled with classification categories. The trained encoder is evaluated on a similar code search task, where it achieves a Top1 of 0.560, an NDCG of 0.679 and an MRR of 0.638. Compared with recent state-of-the-art deep learning models for feature extraction and with common similar code detection tools, the proposed model has significant advantages.
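The splitting step can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes Python source, uses the standard `ast` module, and assumes that the split is made at statement granularity, which the abstract does not specify. The names `statement_nodes`, `small_tree` and `split_ast` are hypothetical.

```python
import ast


def statement_nodes(node):
    """Yield every statement node under `node` in preorder (source order)."""
    for child in ast.iter_child_nodes(node):
        if isinstance(child, ast.stmt):
            yield child
        yield from statement_nodes(child)


def small_tree(node):
    """Build a (type_name, children) tuple for one statement.

    Nested statements are cut off, so a compound statement such as `if`
    keeps only its header (for example the test expression). Its body
    statements become separate small trees. This granularity is an
    assumption made for illustration.
    """
    children = [
        small_tree(child)
        for child in ast.iter_child_nodes(node)
        if not isinstance(child, ast.stmt)
    ]
    return (type(node).__name__, children)


def split_ast(source):
    """Parse `source` and return its sequence of small statement trees."""
    tree = ast.parse(source)
    return [small_tree(stmt) for stmt in statement_nodes(tree)]


if __name__ == "__main__":
    example = "def f(x):\n    if x > 0:\n        return x\n    return -x\n"
    for label, children in split_ast(example):
        print(label, len(children))
```

In a pipeline of the kind the abstract describes, each small tree would be encoded separately and the resulting vectors passed in order to a sequence model, so that no single input tree grows with the size of the whole program.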
code feature extraction, program comprehension, code classification, Electronic computers. Computer science, similar code search, QA75.5-76.95
