Python code smells detection using conventional machine learning models

Publication: Article, Other literature type

Date: 29 May 2023

Language: English

Publisher: PeerJ

Journal: PeerJ Computer Science, volume 9, page e1370 (eissn: 2376-5992)

Authors: Rana Sandouka; Hamoud Aljamaan;

doi: 10.7717/peerj-cs.1370, 10.5281/zenodo.7512515, 10.5281/zenodo.7512516

pmid: 37346528

pmc: PMC10280480


Abstract

Code smells are poor design or implementation choices that hinder code maintenance and reduce software quality, so detecting them is important in software development. Recent studies have applied machine learning algorithms to code smell detection, but most of them relied on datasets built from Java code. This article proposes a Python code smell dataset for the Large Class and Long Method code smells. The dataset contains 1,000 samples for each code smell, with 18 features extracted from the source code. We also investigated the detection performance of six machine learning models as baselines for Python code smell detection, evaluating them with Accuracy and the Matthews correlation coefficient (MCC). The results indicate that the Random Forest ensemble performed best on Python Large Class detection, achieving the highest MCC of 0.77, while the decision tree performed best on Python Long Method detection, achieving the highest MCC of 0.89.
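For readers unfamiliar with the two evaluation measures named in the abstract, the following is a minimal sketch of how Accuracy and MCC can be computed from binary predictions (1 = smelly, 0 = not smelly). It is not the authors' evaluation code; the function names and the toy labels are illustrative assumptions, and the convention of returning 0.0 when the MCC denominator is zero is a common choice rather than something stated in the article.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = smelly)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of samples classified correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, ranging from -1 to +1.

    Assumption: return 0.0 when the denominator is zero (for example,
    when the classifier predicts a single class for every sample).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical toy labels, not data from the article.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
print("accuracy:", round(accuracy(*counts), 3))
print("mcc:", round(mcc(*counts), 3))
```

Unlike Accuracy, MCC uses all four cells of the confusion matrix, so a model that labels every sample with the majority class scores 0 rather than appearing strong.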

Related Organizations

King Fahd University of Petroleum and Minerals
Saudi Arabia

Keywords

Detection, Artificial Intelligence, Electronic computers. Computer science, Machine learning, QA75.5-76.95, Code smell, Large class, Long method, Python

Impact by BIP!

- Selected citations: 20
- Popularity: Top 10%
- Influence: Top 10%
- Impulse: Top 10%