Article
Data sources: ZENODO

Real-Time Phishing Website Detection via Mutual Information-Driven Feature Selection and Random Forest Ensemble Classification

Authors: Dr. Vanita Rani; Dr. Vanya Bardeja; Dr. Himani Sharma

Abstract

Phishing attacks are among the most prevalent and economically damaging threats in contemporary cybersecurity, using counterfeit websites to harvest sensitive user credentials. This paper introduces a machine learning-based phishing website detection framework built on the PhiUSIIL Phishing URL Dataset, which contains 235,795 labelled URL samples. The original dataset comprises 56 features derived from URL structure, HTML content, and webpage metadata. To improve model efficiency and reduce computational overhead, feature selection based on Mutual Information (MI) scoring was applied, reducing the feature space from 56 to 20 URL-extractable features with negligible performance degradation. Four machine learning algorithms were systematically evaluated: Random Forest, Decision Tree, Gradient Boosting, and Logistic Regression. The Random Forest classifier with 200 estimators performed best, attaining an accuracy of 97.38% and an AUC-ROC of 0.9973; 5-fold cross-validation gave a mean accuracy of 97.36% ± 0.04%, indicating robust generalisation. A deterministic rule-based override layer was also incorporated to classify, with high confidence, URLs that carry unambiguous phishing or legitimate signals. The complete system is deployed as an interactive Streamlit web application that enables real-time URL classification. These findings indicate that a compact set of URL-based features, paired with a robust ensemble classifier, yields an effective and practically deployable phishing detection solution.
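
The MI-based selection step can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how MI was estimated, and the feature names and toy data below are hypothetical. The sketch assumes discrete feature values and uses the plug-in estimate I(X;Y) = Σ p(x,y) log[p(x,y) / (p(x) p(y))]. Continuous features would first need binning or a different estimator, such as the one behind scikit-learn's `mutual_info_classif`.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in nats for two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log(p(x,y) / (p(x) * p(y))), expressed with raw counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


def select_top_k(features, labels, k):
    """Score each feature column against the label and keep the k best."""
    scores = {name: mutual_information(col, labels)
              for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Toy data (hypothetical, not from PhiUSIIL): label 1 = phishing.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "has_ip_host":   [1, 1, 1, 1, 0, 0, 0, 0],  # tracks the label exactly
    "uses_https":    [0, 0, 0, 1, 1, 1, 1, 1],  # partly informative
    "has_subdomain": [1, 0, 1, 0, 1, 0, 1, 0],  # independent of the label
}

selected, scores = select_top_k(features, labels, k=2)
print(selected)
```

In this toy case the feature that is statistically independent of the label scores zero and is dropped. The same ranking idea, applied to the dataset's 56 features with k = 20, would give a reduced feature set of the kind the abstract describes.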
