ZENODO
Other research product, 2026
License: CC BY
Data source: ZENODO
Fake News Detection with Big Data: Binary Text Classification Using Apache Spark and PySpark

Authors: Naily, Rehab

Abstract

Project Overview

This repository contains the code and presentation for a big data project on fake news detection through binary text classification. The goal is to build a scalable system using Apache Spark and PySpark that handles large volumes of textual data efficiently. The project was developed as part of a practical exam (TP) in big data processing.

Problem Statement

The spread of disinformation online demands automated, scalable systems for fact verification.

Objective

Develop a distributed machine learning model capable of classifying press articles as Reliable (0) or Fake News (1).

Dataset

- Total articles: ~45,000 (Fake.csv + True.csv)
- Distribution: 52% fake (23,481 articles) / 48% real (21,417 articles)

Big Data Architecture

Technological choice: Apache Spark via PySpark, for distributed and fast processing of large volumes of textual data.

Spark configuration:
- Version: Spark 3.5.4
- Mode: local[*]
- Application name: FakeNewsDetection
- Driver memory: 4 GB

Step 1: Preparation and Cleaning

- Loading and merging: load Fake.csv and True.csv, add a label column (1 for fake, 0 for true), and union them into a single combined DataFrame.
- Text cleaning (UDF): a PySpark User Defined Function, applied in parallel, that removes URLs, removes special characters and digits, converts text to lowercase, and collapses multiple spaces.
- Class distribution: the dataset remains balanced, with 52% fake and 48% real articles.

ML Pipeline: Text Transformation to Vectors

- Tokenizer: splits the cleaned text into individual words (tokens).
- StopWordsRemover: removes common words (e.g., "the", "a", "in") to improve relevance.
- HashingTF: converts tokens into term-frequency vectors (1,000 dimensions).
- IDF: weights term frequencies by the rarity of each term in the corpus.

Unsupervised Analysis

K-Means clustering (K=5) explores the natural structure of the data to validate topic separation and identify dominant themes.
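The text-cleaning step described above can be sketched as a plain Python function and then registered as a PySpark UDF. The regex patterns here are illustrative assumptions, not necessarily the exact patterns used in the repository's notebook:

```python
import re

def clean_text(text: str) -> str:
    """Cleaning steps from the project: strip URLs, remove special
    characters and digits, lowercase, and collapse repeated spaces."""
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-zA-Z\s]", " ", text)       # remove special chars and digits
    text = text.lower()                            # convert to lowercase
    text = re.sub(r"\s+", " ", text).strip()       # collapse multiple spaces
    return text

# In the Spark job, this function would be registered as a UDF and
# applied in parallel across the combined DataFrame, e.g.:
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import StringType
#   clean_udf = udf(clean_text, StringType())
#   df = df.withColumn("clean_text", clean_udf(df["text"]))
```

Wrapping a Python function with `udf` lets Spark distribute it across partitions, which is what makes this cleaning step scale to the full 45,000-article corpus.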
Cluster results:
- Cluster 0: Trump, Republican, Hillary, Clinton, political
- Cluster 1: said, government, people, state, federal
- Cluster 2: police, shooting, US, CIA, intelligence
- Cluster 3: Trump, group, America, organization, political
- Cluster 4: HUD, lobbying, housing, agencies, federal

Conclusion: the clusters reveal distinct political and social themes (politics, government, security, organizations, lobbying), confirming the semantic richness of the corpus and the relevance of the classification task.

Data Split

- Strategy: stratified split to preserve the balanced class distribution.
- Ratio: 80% training (~35,500 articles) / 20% test (~8,700 articles).
- Reproducibility: random seed = 42 for consistent results.

Classification Models

- Logistic Regression: a simple linear model, effective for binary classification, that produces calibrated probabilities. Parameters: maxIter=100, regParam=0.01.
- Naive Bayes: based on Bayes' theorem; performs well on text classification under the assumption of feature independence. Parameters: type=multinomial, smoothing=1.0.
- Feature engineering: the TF-IDF vectorizer is fitted on the 80% training split.

Evaluation Metrics

Five key metrics summarize binary classification performance:
- Accuracy: proportion of correct predictions, (TP + TN) / (TP + TN + FP + FN).
- Precision: among articles predicted "fake", the share that are truly fake, TP / (TP + FP).
- Recall: among truly fake articles, the share correctly identified, TP / (TP + FN).
- F1-score: harmonic mean of precision and recall, 2 × (Precision × Recall) / (Precision + Recall).
- AUC-ROC: ability to distinguish the two classes across decision thresholds (from 0 to 1, where 1 is perfect).
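The four threshold-based metrics above follow directly from confusion-matrix counts; a minimal pure-Python sketch (independent of Spark, with illustrative counts rather than the project's actual confusion matrix):

```python
def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 from confusion-matrix
    counts, mirroring the formulas listed above. AUC-ROC is omitted: it
    needs the full distribution of prediction scores, not just counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Illustrative counts only, not results from this project:
m = binary_metrics(tp=90, tn=85, fp=10, fn=15)
```

In the Spark pipeline itself, these metrics would typically come from MLlib's MulticlassClassificationEvaluator, with BinaryClassificationEvaluator providing the AUC-ROC.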

The proliferation of online misinformation requires scalable, automated systems for fact-checking. This project develops a distributed machine learning model to classify news articles as Reliable (0) or Fake News (1). Using Apache Spark via PySpark, we process a dataset of approximately 45,000 articles (Fake.csv and True.csv) with a near-balanced distribution (52% fake, 48% real). The pipeline includes data loading, text cleaning (removing URLs, special characters, digits, and extra spaces), feature engineering with TF-IDF, unsupervised K-Means clustering (K=5) for thematic exploration, and supervised classification with Logistic Regression and Naive Bayes. Evaluated on a stratified 80/20 train-test split, the Logistic Regression model outperforms Naive Bayes, reaching 99.21% accuracy, precision, recall, and F1-score, and an AUC-ROC of 0.9994. These results demonstrate the effectiveness of big data techniques for high-accuracy fake news detection. The upload includes the Jupyter notebook (code implementation) and a PDF presentation (project summary).

Keywords

fake news detection, Apache Spark, PySpark, machine learning, text classification, big data analytics
