Supporting Concept Extraction and Identifier Quality Improvement through Programmers' Lexicon Analysis
Abebe, Surafel Lemma
- Publisher: University of Trento
Identifiers play an important role in communicating the intentions associated with the program entities they represent. The information captured in identifiers support programmers to (re-)build the â€œmental modelâ€ of the software and facilitates understanding. (Re-)building the â€œmental modelâ€ and understanding large software, however, is difficult and expensive. Besides, the effort involved in the process heavily depends on the quality of the programmersâ€™ lexicon used to construct the identifiers.
This thesis addresses the problem of program understanding focusing on (i) concept extraction, and (ii) quality of the lexicon used in identifiers. To address the first problem (concept extraction), two ontology extraction approaches exploiting the natural language information captured in identifiers and structural information of the source code are proposed and evaluated. We have also proposed a method to automatically train a natural language analyzer for identifiers. The trained analyzer is used for concept extraction. The evaluation was conducted on a program understanding task, concept location. Results show that the extracted concepts increase the effectiveness of concept location queries. Besides extracting concepts from the source code, we have investigated information retrieval (IR) based techniques to filter domain concepts from implementation concepts.
To address the second problem (quality of the lexicon used in identifiers), we have defined a publicly available catalog of lexicon bad smells (LBS) and developed a suite of tools to automatically detect them. LBS indicate some potential lexicon construction problems that can be addressed through refactoring. The impact of LBS on concept location and the contribution they can give to fault prediction have been studied empirically. Results indicate that LBS refactoring has a significant positive impact on IR-based concept location task and contributes to improve fault prediction, when used in conjunction with structural metrics. In addition to detecting LBS in identifiers, we try also to fix them. We have proposed an approach which uses the concepts extracted from the source code to suggest names which can be used to complete or replace an identifier. The evaluation of the approach shows that it provides useful suggestions, which can effectively support programmers to write consistent names.