ZENODO
Other research product, 2026
License: CC BY
Data source: ZENODO
Fake News Detection with Big Data: Binary Text Classification Using Apache Spark and PySpark

Authors: Naily, Rehab

Abstract

Project Overview

This repository contains the code and presentation for a big data project on fake news detection through binary text classification. The goal is to build a scalable system using Apache Spark and PySpark that handles large volumes of textual data efficiently. The project was developed as part of a practical exam (TP) in big data processing.

Problem Statement

The spread of disinformation online demands automated, scalable systems for fact verification.

Objective

Develop a distributed machine learning model capable of classifying press articles as Reliable (0) or Fake News (1).

Dataset

- Total articles: ~45,000 (Fake.csv + True.csv)
- Distribution: 52% fake (23,481 articles) / 48% real (21,417 articles)

Big Data Architecture

Technological choice: Apache Spark via PySpark, for distributed and fast processing of large volumes of textual data.

Spark configuration:
- Version: Spark 3.5.4
- Mode: local[*]
- Application name: FakeNewsDetection
- Driver memory: 4 GB

Step 1: Preparation and Cleaning

- Loading and merging: load Fake.csv and True.csv, add a label column (1 for fake, 0 for true), and union them into a single combined DataFrame.
- Text cleaning (UDF): a PySpark User Defined Function, applied in parallel, that removes URLs, removes special characters and digits, converts text to lowercase, and collapses multiple spaces.
- Class distribution: the dataset remains balanced, with 52% fake and 48% real articles.

ML Pipeline: Text Transformation to Vectors

- Tokenizer: splits the cleaned text into individual words (tokens).
- StopWordsRemover: removes common words (e.g., "the", "a", "in") to improve relevance.
- HashingTF: converts tokens into term-frequency vectors (1,000 dimensions).
- IDF: weights term frequencies by the rarity of each term in the corpus.

Unsupervised Analysis

K-Means clustering (K=5) explores the natural structure of the data to validate topic separation and identify dominant themes.
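The text-cleaning step described above can be sketched as a plain Python function and then registered as a PySpark UDF. The regex patterns here are illustrative assumptions, not necessarily the exact patterns used in the repository's notebook:

```python
import re

def clean_text(text: str) -> str:
    """Cleaning steps from the project: strip URLs, remove special
    characters and digits, lowercase, and collapse repeated spaces."""
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-zA-Z\s]", " ", text)       # remove special chars and digits
    text = text.lower()                            # convert to lowercase
    text = re.sub(r"\s+", " ", text).strip()       # collapse multiple spaces
    return text

# In the Spark job, this function would be registered as a UDF and
# applied in parallel across the combined DataFrame, e.g.:
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import StringType
#   clean_udf = udf(clean_text, StringType())
#   df = df.withColumn("clean_text", clean_udf(df["text"]))
```

Wrapping a Python function with `udf` lets Spark distribute it across partitions, which is what makes this cleaning step scale to the full 45,000-article corpus.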
Cluster results:
- Cluster 0: Trump, Republican, Hillary, Clinton, political
- Cluster 1: said, government, people, state, federal
- Cluster 2: police, shooting, US, CIA, intelligence
- Cluster 3: Trump, group, America, organization, political
- Cluster 4: HUD, lobbying, housing, agencies, federal

Conclusion: the clusters reveal distinct political and social themes (politics, government, security, organizations, lobbying), confirming the semantic richness of the corpus and the relevance of the classification task.

Data Split

- Strategy: stratified split to preserve the balanced class distribution.
- Ratio: 80% training (~35,500 articles) / 20% test (~8,700 articles).
- Reproducibility: random seed = 42 for consistent results.

Classification Models

- Logistic Regression: a simple linear model, effective for binary classification, that produces calibrated probabilities. Parameters: maxIter=100, regParam=0.01.
- Naive Bayes: based on Bayes' theorem; performs well on text classification under the assumption of feature independence. Parameters: type=multinomial, smoothing=1.0.
- Feature engineering: the TF-IDF vectorizer is fitted on the 80% training split.

Evaluation Metrics

Five key metrics summarize binary classification performance:
- Accuracy: proportion of correct predictions, (TP + TN) / (TP + TN + FP + FN).
- Precision: among articles predicted "fake", the share that are truly fake, TP / (TP + FP).
- Recall: among truly fake articles, the share correctly identified, TP / (TP + FN).
- F1-score: harmonic mean of precision and recall, 2 × (Precision × Recall) / (Precision + Recall).
- AUC-ROC: ability to distinguish the two classes across decision thresholds (from 0 to 1, where 1 is perfect).
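The four threshold-based metrics above follow directly from confusion-matrix counts; a minimal pure-Python sketch (independent of Spark, with illustrative counts rather than the project's actual confusion matrix):

```python
def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 from confusion-matrix
    counts, mirroring the formulas listed above. AUC-ROC is omitted: it
    needs the full distribution of prediction scores, not just counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Illustrative counts only, not results from this project:
m = binary_metrics(tp=90, tn=85, fp=10, fn=15)
```

In the Spark pipeline itself, these metrics would typically come from MLlib's MulticlassClassificationEvaluator, with BinaryClassificationEvaluator providing the AUC-ROC.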

The proliferation of online misinformation requires scalable, automated systems for fact-checking. This project develops a distributed machine learning model to classify news articles as Reliable (0) or Fake News (1). Using Apache Spark via PySpark, we process a dataset of approximately 45,000 articles (Fake.csv and True.csv) with a near-balanced distribution (52% fake, 48% real). The pipeline includes data loading, text cleaning (removing URLs, special characters, digits, and extra spaces), feature engineering with TF-IDF, unsupervised K-Means clustering (K=5) for thematic exploration, and supervised classification with Logistic Regression and Naive Bayes. Evaluated on a stratified 80/20 train-test split, the Logistic Regression model outperforms Naive Bayes, reaching 99.21% accuracy, precision, recall, and F1-score, and an AUC-ROC of 0.9994. These results demonstrate the effectiveness of big data techniques for high-accuracy fake news detection. The upload includes the Jupyter notebook (code implementation) and a PDF presentation (project summary).

Keywords

fake news detection, Apache Spark, PySpark, machine learning, text classification, big data analytics
