Name: AIGCodeSet: A New Annotated Dataset for AI Generated Code Detection
Keywords: Software Engineering (cs.SE), FOS: Computer and information sciences, Computer Science - Software Engineering, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence

Publication: Article, Preprint
Date: 25 Jun 2025
Embargo end date: 01 Jan 2024
Publisher: IEEE
Journal: 2025 33rd Signal Processing and Communications Applications Conference (SIU)

Authors: Demirok, Basak; Kutlu, Mucahid

doi: 10.1109/siu66497.2025.11112334, 10.48550/arxiv.2412.16594

arXiv: 2412.16594

Abstract

While large language models provide significant convenience for software development, they can raise ethical issues in job interviews and student assignments. Determining whether a piece of code was written by a human or generated by an artificial intelligence (AI) model is therefore a critical problem. In this study, we present AIGCodeSet, which consists of 2,828 AI-generated and 4,755 human-written Python code samples; the AI-generated samples were created using CodeLlama 34B, Codestral 22B, and Gemini 1.5 Flash. In addition, we share the results of our experiments with baseline detection methods. Our experiments show that a Bayesian classifier outperforms the other models.

Related Organizations

Qatar University, Qatar
TOBB University of Economics and Technology, Turkey

Impact by BIP!

Selected citations: 0. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
Popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
Influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
Impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
