Data correlation matrix-based spam URL detection using machine learning algorithms

Name: Data correlation matrix-based spam URL detection using machine learning algorithms
Creator: Funda Akar
Keywords: classification; machine learning; Spam Detection; Tree-based Algorithms; Machine Learning (Other); Artificial Intelligence (Other); 02 engineering and technology; 0202 electrical engineering, electronic engineering, information engineering; 0209 industrial biotechnology

Journal of Scientific Reports-A

Article . 2024 . Peer-reviewed

Data sources: Crossref

TÜBİTAK ULAKBİM DergiPark

Article . 2024

Data sources: TÜBİTAK ULAKBİM DergiPark

Publication: Article, 31 Mar 2024
Publisher: Kütahya Dumlupinar Üniversitesi
Journal: Journal of Scientific Reports-A (eISSN: 2687-6167)

Authors: Funda Akar

doi: 10.59313/jsr-a.1422913

Abstract

In recent years, the widespread availability of internet access has brought both advantages and disadvantages. Users now enjoy numerous benefits, including unlimited access to vast amounts of information and seamless communication with others. However, this accessibility also exposes users to various threats, including malicious software and deceptive practices, and many individuals have been victimized as a result. Common issues include spam emails, fake websites, and phishing attempts. Given how essential internet usage is in contemporary society, developing systems that protect users from such malicious activities has become imperative. Accordingly, this study used eight prominent machine learning algorithms to identify spam URLs in a large dataset. Since the dataset contained only the URLs and their spam labels, additional features, such as URL length and the number of digits, had to be extracted. Including such features improves the decisions made by the machine learning models and results in more efficient detection. Because the effectiveness of feature extraction significantly affects the results of the methods, the study first extracted features and then trained models based on the feature weights. This paper proposes a data correlation matrix-based approach for spam URL detection using machine learning algorithms. The distinctive aspect of the study is its feature extraction process, which aims to identify the most impactful features, followed by model training that takes the weighting of these features into account. The entire dataset was used without any reduction. Experimental findings indicate that tree-based machine learning algorithms yield superior results. Among all applied methods, Random Forest achieved the highest success rate, with a detection rate of 96.33% for the non-spam class. Additionally, a combined and weighted calculation method yielded an accuracy of 94.16% across both spam and non-spam data.
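
The feature extraction and correlation-based weighting described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract names only URL length and digit count as features, so `dot_count` and `host_length` are hypothetical additions, and the weighting rule (absolute Pearson correlation with the spam label, normalized to sum to 1) is an assumption made for illustration. All function names are likewise hypothetical.

```python
import math
from urllib.parse import urlparse


def extract_features(url: str) -> dict[str, float]:
    """Derive simple numeric features from a raw URL string."""
    # Prepend "//" when the URL has no "://" so urlparse can locate the host.
    parsed = urlparse(url if "://" in url else "//" + url)
    return {
        "url_length": float(len(url)),                          # named in the abstract
        "digit_count": float(sum(ch.isdigit() for ch in url)),  # named in the abstract
        "dot_count": float(url.count(".")),                     # assumed extra feature
        "host_length": float(len(parsed.netloc)),               # assumed extra feature
    }


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def correlation_weights(urls: list[str], labels: list[int]) -> dict[str, float]:
    """Weight each feature by its absolute correlation with the label (1 = spam).

    Assumed weighting rule: weights are normalized so that they sum to 1.
    """
    rows = [extract_features(u) for u in urls]
    targets = [float(y) for y in labels]
    raw = {
        name: abs(pearson([row[name] for row in rows], targets))
        for name in rows[0]
    }
    total = sum(raw.values())
    if total == 0:
        return {name: 0.0 for name in raw}
    return {name: value / total for name, value in raw.items()}
```

Calling `correlation_weights(urls, labels)` on a labeled list of URLs returns one weight per feature. In a fuller pipeline these weights could be used to rank or scale features before training a classifier such as Random Forest, though how the paper applies its weights is not specified in this record.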

Related Organizations

Erzincan University
Turkey
Erzincan Binali Yıldırım University
Turkey


Impact by BIP!

- Selected citations: 1. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- Influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

Open access route: gold

Fields of Science

engineering and technology

electrical engineering, electronic engineering, information engineering
