A Support Vector Machine based approach for plagiarism detection in Python code submissions in undergraduate settings

Publication: Article, 13 Jun 2024
Publisher: Frontiers Media SA
Journal: Frontiers in Computer Science, volume 6 (eISSN: 2624-9898)

Authors: Nandini Gandhi; Kaushik Gopalan; Prajish Prasad

doi: 10.3389/fcomp.2024.1393723


Abstract

Mechanisms for plagiarism detection play a crucial role in maintaining academic integrity, serving both to penalize wrongdoing and to deter it preemptively. This manuscript proposes a customized algorithm for detecting source code plagiarism in the Python programming language. Our approach combines textual and syntactic techniques, employing a support vector machine (SVM) to combine various indicators of similarity into a single similarity score. The algorithm was trained and tested using a sample of code submissions for 4 coding problems each from 45 volunteers; 15 of these were original submissions while the other 30 were plagiarized samples. The submissions for two of the questions were used for training and the other two for testing, using the leave-p-out cross-validation strategy to avoid overfitting. We compare the performance of the proposed method with two widely used tools, MOSS and JPlag, and find that the proposed method yields a small but significant improvement in accuracy compared to JPlag, while significantly outperforming MOSS in flagging plagiarized samples.
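To make the idea of combining textual and syntactic indicators concrete, the following is a minimal sketch, not the authors' implementation. It computes one textual indicator (character-level similarity of the raw source) and one syntactic indicator (similarity of the sequences of abstract syntax tree node types) for a pair of Python submissions. The function names, the choice of `difflib.SequenceMatcher` as the similarity measure, and the two sample submissions are all assumptions made for illustration; the abstract does not specify which indicators the paper uses.

```python
import ast
import difflib


def textual_similarity(src_a, src_b):
    """Textual indicator: matching-character ratio of the raw source strings."""
    return difflib.SequenceMatcher(None, src_a, src_b).ratio()


def node_types(source):
    """AST node type names in breadth-first order.

    Identifier names and literal values are not part of the sequence,
    so renaming variables leaves it unchanged.
    """
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]


def syntactic_similarity(src_a, src_b):
    """Syntactic indicator: similarity of the two AST node-type sequences."""
    matcher = difflib.SequenceMatcher(None, node_types(src_a), node_types(src_b))
    return matcher.ratio()


def similarity_features(src_a, src_b):
    """Feature vector for one pair of submissions (hypothetical layout)."""
    return [textual_similarity(src_a, src_b), syntactic_similarity(src_a, src_b)]


# Illustrative pair: the second submission only renames identifiers.
original = (
    "def total(xs):\n"
    "    s = 0\n"
    "    for x in xs:\n"
    "        s += x\n"
    "    return s\n"
)
renamed = (
    "def add_all(values):\n"
    "    acc = 0\n"
    "    for v in values:\n"
    "        acc += v\n"
    "    return acc\n"
)

features = similarity_features(original, renamed)
print(features)
```

In this example the renamed copy differs textually from the original but has an identical AST node-type sequence, which is the kind of disagreement between indicators that a trained classifier such as an SVM could weigh when producing a final similarity score.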

Related Organizations

Flame University
India

Keywords

source code plagiarism detection, Support Vector Machine, Python programming, Abstract Syntax Trees, Electronic computers. Computer science, QA75.5-76.95, textual similarity

Impact by BIP!

Selected citations: 2
Popularity: Top 10%
Influence: Average
Impulse: Average


Access route: gold

Fields of Science

engineering and technology

electrical engineering, electronic engineering, information engineering
