
This dataset accompanies the paper "LLMTaxo: Leveraging Large Language Models for Constructing Taxonomy of Factual Claims from Social Media" (Findings of ACL 2025). It contains the curated data used in the paper's taxonomy construction experiments, which focus on factual claims extracted from social media discussions in three topic domains: COVID-19 vaccines, climate change, and cybersecurity. The dataset is designed to support research in taxonomy construction and factual claim analysis.

Contents

tweets.csv: The IDs of 384,676 tweets collected from X (formerly Twitter) for the three domains above. (Note: the Facebook data used in the paper are not included due to data-sharing restrictions and privacy policies.)

Taxonomies: Nine final taxonomies of factual claims, one for each combination of the three LLMs (Zephyr, GPT-4o mini, Gemini 2.0 Flash) and the three datasets. Each taxonomy has three hierarchical levels: broad, medium, and detailed topics.
Factual Claim, Social Media, Taxonomy
Twitter Data
