
Phishing attacks are among the most prevalent and economically damaging threats in contemporary cybersecurity, using counterfeit websites to harvest sensitive user credentials. This paper introduces a machine learning-based phishing website detection framework built on the PhiUSIIL Phishing URL Dataset, which contains 235,795 labelled URL samples. The original dataset comprises 56 features derived from URL structure, HTML content, and webpage metadata. To improve model efficiency and reduce computational overhead, features were ranked by Mutual Information (MI) score, reducing the feature space from 56 to 20 URL-extractable features with negligible performance degradation. Four machine learning algorithms were systematically evaluated: Random Forest, Decision Tree, Gradient Boosting, and Logistic Regression. The Random Forest classifier configured with 200 estimators performed best, attaining an accuracy of 97.38% and an AUC-ROC of 0.9973; 5-fold cross-validation yielded a mean accuracy of 97.36% ± 0.04%, indicating stable generalisation. A deterministic rule-based override layer was further incorporated to resolve, with high confidence, URLs that carry unambiguous phishing or legitimate signals. The complete system is deployed as an interactive Streamlit web application enabling real-time URL classification. These findings indicate that a compact set of URL-based features, paired with a robust ensemble classifier, yields an effective and practically deployable phishing detection solution.
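The MI-based feature ranking summarised above can be illustrated with a minimal sketch. It is not the paper's implementation: the feature names and the toy data are hypothetical, and it assumes discrete features so that MI can be computed exactly from counts using I(X;Y) = Σ p(x,y) log[p(x,y) / (p(x) p(y))]. For the continuous features in a real dataset, a library estimator such as scikit-learn's `mutual_info_classif` would likely be the more practical choice.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log(p(x,y) / (p(x) p(y))), written in terms of raw counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


def select_top_k(features, labels, k):
    """Score each feature column against the label and keep the k best."""
    scores = {name: mutual_information(col, labels)
              for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Toy data: 1 = phishing, 0 = legitimate. Feature names are hypothetical.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "has_ip_host": [1, 1, 1, 1, 0, 0, 0, 0],  # fully determines the label
    "uses_https":  [0, 0, 0, 1, 1, 1, 1, 0],  # partially informative
    "is_even_row": [0, 1, 0, 1, 0, 1, 0, 1],  # independent of the label
}

selected, scores = select_top_k(features, labels, k=2)
print(selected)
```

In this toy case the feature that is statistically independent of the label scores zero and is dropped, which is the behaviour the 56-to-20 reduction relies on: low-MI features carry little information about the class and can be removed at small cost.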
