
Many existing fake news datasets analyze either news content or social media posts in isolation. To overcome this limitation, we present a triangulated dataset that systematically interlinks four components: original news articles, social media posts, multimedia content, and veracity labels. The original news articles are sourced from NELA-GT and from mainstream media outlets, providing a foundational layer of factual reporting. These articles are paired with their social media derivatives: posts from platforms such as Twitter and Reddit, together with metadata such as engagement statistics and bot-likelihood scores. Multimedia content, both images and videos, is drawn from datasets such as FakeNewsNet to support analysis of visual misinformation. Veracity labels come from fact-checked claims in the TruthSeekers repository, so that each instance carries a trusted assessment of truthfulness.
The resulting dataset contains 158,400 aligned instances spanning text, image data, social interaction context, and temporal metadata. Alignment uses a multi-tiered method: URL and keyword matching with a Levenshtein distance threshold of 0.85, and multimodal validation that requires a CLIP similarity score above 0.7. Together, these techniques ensure high-confidence matching across modalities.
Compared with existing datasets such as FakeNewsNet and LIAR, our triangulated dataset offers three advantages: it includes social context features such as bot scores and retweet graphs, it supports multimodal pairings of text, images, and social media posts, and it allows provenance tracking by comparing original and manipulated versions of content.
For instance, it enables detailed tracing of how a legitimate BBC article titled “Climate Accord Signed” may be repurposed into a misleading viral tweet like “Politicians FAKED climate deal!” accompanied by doctored images. This level of integration provides researchers with a powerful tool to study the lifecycle and mutation of fake news across platforms and modalities.
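The text-matching and CLIP-validation tiers described above can be illustrated with a minimal Python sketch. It is not the dataset's actual pipeline. It assumes that the 0.85 threshold applies to a length-normalized Levenshtein similarity (the description does not say how the distance is normalized) and that the CLIP similarity has already been computed elsewhere and is passed in as a number. The function names `levenshtein`, `similarity`, and `is_aligned` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def is_aligned(article_text: str, post_text: str, clip_score: float,
               text_threshold: float = 0.85, clip_threshold: float = 0.7) -> bool:
    """Accept a pair only if both the text tier and the (precomputed) CLIP tier pass."""
    text_ok = similarity(article_text, post_text) >= text_threshold
    clip_ok = clip_score > clip_threshold
    return text_ok and clip_ok
```

Under these assumptions, a near-verbatim repost such as "Climate accord signed!" passes the text tier against "Climate Accord Signed", whereas a heavily reworded post such as "Politicians FAKED climate deal!" does not. Linking that kind of post to its source would have to rely on the URL-matching tier, which this sketch does not cover.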
Twitter Data
