From Birdwatch to Community Notes, from Twitter to X: four years of community-based content moderation

Dataset and Code Description

This repository contains the data and code used to analyse interactions within the Community Notes platform from January 23, 2021, to January 23, 2025. The files are organised as follows:

🧪 Code Notebooks

Create_graphs.ipynb: Constructs full interaction networks and separate sub-networks (helpful, somewhat helpful, unhelpful) from the monthly raw rating files.
Url_analysis.ipynb: Detects the language of each note and extracts any URLs or domain names mentioned.
BERTopic_English_hard_PCA100_UMAP10_MinCluster500.ipynb: Applies BERTopic to English-language notes to extract latent topics. Dimensionality is reduced using PCA (100 components) and UMAP (10 dimensions). Only clusters with at least 500 notes are retained to ensure robustness.

📄 Data Files

Notes Data

notes_with_lang.csv: All Community Notes written between January 23, 2021, and January 23, 2025, with detected language, extracted URLs, and domain names.
english_notes_with_nlp.csv: Subset of English notes with BERTopic topics, topic numbers, and keyword representations.

Each note file contains the following variables:

noteId: Unique ID of the note.
noteAuthorParticipantId: Unique ID of the note's author.
tweetId: ID of the tweet the note addresses.
date: Date the note was written (YYYY-MM-DD).
Timestamp: Time the note was written (HH:MM:SS).
language: Detected language of the note.
extracted_urls: List of URLs mentioned in the note.
news_source: List of extracted domain names.
BERTopic_word (only in English notes file): Main topic name.
BERTopic_number (only in English notes file): Numeric topic identifier.
BERTopic_representation (only in English notes file): List of keywords representing the topic.

Rating Data

Monthly rating files are stored in the rating monthly files/ directory with the naming format ratings_m_yyyy.csv. Each file includes:

noteId: ID of the rated note.
raterParticipantId: ID of the participant giving the rating.
helpfulnessLevel: Rating category (HELPFUL, SOMEWHAT_HELPFUL, NOT_HELPFUL).
helpful, notHelpful: Deprecated binary flags (use helpfulnessLevel instead).

🌐 Network Files

Each month’s ratings are used to construct interaction graphs with user-to-user edges based on rating behaviours.

Whole Networks (whole_network__.graphml): Full user interaction networks, with edges annotated by the number of helpful, unhelpful, and somewhat helpful ratings. Each edge contains:

source: Rater’s participant ID.
target: Note author’s participant ID.
helpful, unhelpful, somewhathelpful: Count of ratings by type from rater to author.

Helpful Networks (network___helpful.graphml): Subnetworks based on helpful ratings only.
Somewhat Helpful Networks (network___somewhat.graphml): Subnetworks based on somewhat helpful ratings.
Unhelpful Networks (network___unhelpful.graphml): Subnetworks based on unhelpful ratings.
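The edge definition above (rater as source, note author as target, one counter per rating type) can be illustrated with a minimal sketch. This is not the code in Create_graphs.ipynb. It assumes that each rating is linked to its note's author by looking up noteId in a notes file, and that NOT_HELPFUL ratings feed the unhelpful counter. The toy rows and the function names build_edges and subnetwork are hypothetical and only serve the illustration.

```python
import csv
import io
from collections import defaultdict

# Made-up rows standing in for a notes file and one monthly rating file.
NOTES_CSV = """noteId,noteAuthorParticipantId
n1,authorA
n2,authorB
n3,authorA
"""
RATINGS_CSV = """noteId,raterParticipantId,helpfulnessLevel
n1,rater1,HELPFUL
n3,rater1,HELPFUL
n1,rater2,NOT_HELPFUL
n2,rater1,SOMEWHAT_HELPFUL
n9,rater2,HELPFUL
"""

# Assumed mapping from helpfulnessLevel values to the edge attribute names.
LEVEL_TO_ATTR = {
    "HELPFUL": "helpful",
    "SOMEWHAT_HELPFUL": "somewhathelpful",
    "NOT_HELPFUL": "unhelpful",
}


def build_edges(notes_file, ratings_file):
    """Count ratings per (rater, author) pair, split by rating type."""
    author_of = {
        row["noteId"]: row["noteAuthorParticipantId"]
        for row in csv.DictReader(notes_file)
    }
    edges = defaultdict(
        lambda: {"helpful": 0, "unhelpful": 0, "somewhathelpful": 0}
    )
    for row in csv.DictReader(ratings_file):
        author = author_of.get(row["noteId"])
        attr = LEVEL_TO_ATTR.get(row["helpfulnessLevel"])
        if author is None or attr is None:
            continue  # skip ratings of unknown notes or unknown levels
        edges[(row["raterParticipantId"], author)][attr] += 1
    return dict(edges)


def subnetwork(edges, attr):
    """Keep only the edges that carry at least one rating of the given type."""
    return {pair: c[attr] for pair, c in edges.items() if c[attr] > 0}


edges = build_edges(io.StringIO(NOTES_CSV), io.StringIO(RATINGS_CSV))
helpful_edges = subnetwork(edges, "helpful")

for (rater, author), counts in sorted(edges.items()):
    print(rater, "->", author, counts)
```

In this sketch the whole network corresponds to edges, and each sub-network is obtained by filtering on one counter. Writing the result to GraphML would need a graph library and is left out here.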

