Comparative Analysis of Logistic Regression and XGBoost for Depression Detection from Reddit Posts

Tripathi, Saransh; Tripathi, Nikhil; Singh, Pramod

Preprint

Data sources: ZENODO

Publication type: Preprint; Status: Under curation; Language: English; Publisher: Zenodo

doi: 10.5281/zenodo.20533103

Abstract

Depression is a major mental health disorder that affects millions of individuals worldwide and often remains undiagnosed due to social stigma and limited access to professional care. The increasing use of social media platforms such as Reddit provides an opportunity to analyze textual expressions that may contain indicators of depressive behavior. This study investigates the effectiveness of machine learning techniques for depression detection using textual data collected from Reddit posts. A Reddit Depression Dataset containing 7,731 posts was analyzed through exploratory data analysis, text preprocessing, TF-IDF feature extraction, and VADER sentiment analysis. The extracted features were evaluated using two machine learning classifiers: Logistic Regression and XGBoost. Performance was assessed using accuracy, precision, recall, F1-score, confusion matrix analysis, and ROC-AUC. Experimental results demonstrated strong classification performance. Logistic Regression achieved an accuracy of 94.44% and an AUC of 0.986, while XGBoost achieved a slightly higher accuracy of 94.70%. The findings indicate that TF-IDF lexical features provide substantial predictive information for distinguishing depressive and non-depressive posts. Sentiment analysis further revealed noticeable differences in emotional polarity between the two classes. The study presents a reproducible and computationally efficient framework for depression detection using publicly available data and open-source tools. The proposed workflow is suitable for academic research and educational environments where interpretability, simplicity, and reproducibility are important considerations.
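The TF-IDF and Logistic Regression stage described in the abstract can be illustrated with a minimal sketch. It assumes scikit-learn is installed; the toy posts, labels, helper names (`build_model`, `evaluate`), and hyperparameters are hypothetical and are not taken from the study, and the VADER and XGBoost steps are omitted. The printed scores on this invented data say nothing about the reported results.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy posts invented for illustration only
# (1 = depressive, 0 = non-depressive); NOT the Reddit dataset.
train_texts = [
    "i feel hopeless and empty every single day",
    "nothing matters anymore and i cannot sleep",
    "so tired of feeling worthless and alone",
    "i cried all night and feel hopeless again",
    "had a great hike with friends this weekend",
    "excited about my new job and the team",
    "cooked a delicious dinner and watched a movie",
    "the concert last night was amazing fun",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

test_texts = [
    "feeling empty and worthless again",
    "great weekend with friends and a fun movie",
]
test_labels = [1, 0]


def build_model():
    """TF-IDF features followed by Logistic Regression (assumed settings)."""
    return make_pipeline(
        TfidfVectorizer(lowercase=True, stop_words="english"),
        LogisticRegression(max_iter=1000),
    )


def evaluate(model, texts, labels):
    """Compute a subset of the metrics named in the abstract."""
    predictions = model.predict(texts)
    # Probability of the positive (depressive) class, used for ROC-AUC.
    positive_scores = model.predict_proba(texts)[:, 1]
    return {
        "accuracy": accuracy_score(labels, predictions),
        "f1": f1_score(labels, predictions, zero_division=0),
        "roc_auc": roc_auc_score(labels, positive_scores),
    }


model = build_model().fit(train_texts, train_labels)
metrics = evaluate(model, test_texts, test_labels)
print(metrics)
```

Wrapping the vectorizer and classifier in one pipeline means the TF-IDF vocabulary is learned only from the training posts, which avoids leaking test-set vocabulary into the features. A gradient-boosted classifier could be substituted for the final pipeline step to mirror the comparison in the abstract.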
