Real-Time Phishing Website Detection via Mutual Information-Driven Feature Selection and Random Forest Ensemble Classification

Dr. Vanita Rani; Dr. Vanya Bardeja; Dr. Himani Sharma

Article

Data sources: ZENODO

Journal on Communications

Publication: Article (under curation). Publisher: Journal on Communications

doi: 10.5281/zenodo.20539079

Abstract

Phishing attacks represent one of the most prevalent and economically damaging threats in contemporary cybersecurity, exploiting counterfeit websites to harvest sensitive user credentials. This paper introduces a machine learning-based phishing website detection framework constructed upon the PhiUSIIL Phishing URL Dataset, encompassing 235,795 labelled URL samples. The original dataset comprises 56 features derived from URL structure, HTML content, and webpage metadata. To enhance model efficiency and reduce computational overhead, a feature selection methodology grounded in Mutual Information (MI) scoring was applied, contracting the feature space from 56 to 20 URL-extractable features with negligible performance degradation. Four machine learning algorithms were systematically evaluated: Random Forest, Decision Tree, Gradient Boosting, and Logistic Regression. The Random Forest classifier configured with 200 estimators delivered superior performance, attaining an accuracy of 97.38%, an AUC-ROC of 0.9973, and robust generalisation through 5-fold cross-validation yielding a mean accuracy of 97.36% ± 0.04%. A deterministic rule-based override layer was further incorporated to manage unambiguous phishing or legitimate signals with high confidence. The complete system is deployed as an interactive Streamlit web application enabling real-time URL classification. These findings affirm that a compact suite of URL-based features, paired with a robust ensemble classifier, yields an effective and practically deployable phishing detection solution.
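
The Mutual Information ranking step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which MI estimator the authors used, so the code below assumes the simple plug-in (count-based) estimator for discrete features. The feature names, the toy data, and the helper functions `mutual_information` and `select_top_k` are hypothetical and are not taken from the paper or from the PhiUSIIL dataset.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in nats for two discrete sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written in terms of counts
        mi += (c / n) * math.log(c * n / (count_x[x] * count_y[y]))
    return mi


def select_top_k(features, labels, k):
    """Score every feature column against the labels and keep the k best."""
    scores = {name: mutual_information(col, labels)
              for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Toy data, invented for illustration: 1 = phishing, 0 = legitimate.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "has_ip_host": [1, 1, 1, 1, 0, 0, 0, 0],  # matches the label exactly
    "uses_https":  [0, 0, 0, 1, 1, 1, 1, 0],  # partly informative
    "even_length": [0, 1, 0, 1, 0, 1, 0, 1],  # independent of the label
}

selected, scores = select_top_k(features, labels, k=2)
for name in selected:
    print(f"{name}: {scores[name]:.4f} nats")
```

In this toy case the feature that is independent of the label scores zero and is dropped, which mirrors the idea of shrinking 56 features to the 20 most informative ones. A real pipeline would also need an estimator suited to continuous features before passing the reduced set to the classifier.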
