MultiClaim: Multilingual Previously Fact-Checked Claim Retrieval is a dataset that can be used to train and test models for combating disinformation. The dataset consists of 206k claims fact-checked by professional fact-checkers and 28k social media posts gathered from the wild. Each social media post has at least one claim assigned. The main idea is to develop information retrieval models that assign the appropriate claims to each post.

Paper: https://aclanthology.org/2023.emnlp-main.1027/
Preprint: https://arxiv.org/abs/2305.07991
GitHub repository: https://github.com/kinit-sk/multiclaim

References

If you use this dataset in any publication, project, tool or in any other form, please cite the following paper:

@inproceedings{pikuliak-etal-2023-multilingual,
    title = "Multilingual Previously Fact-Checked Claim Retrieval",
    author = "Pikuliak, Mat{\'u}{\v{s}} and Srba, Ivan and Moro, Robert and Hromadka, Timo and Smole{\v{n}}, Timotej and Meli{\v{s}}ek, Martin and Vykopal, Ivan and Simko, Jakub and Podrou{\v{z}}ek, Juraj and Bielikova, Maria",
    editor = "Bouamor, Houda and Pino, Juan and Bali, Kalika",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.1027",
    doi = "10.18653/v1/2023.emnlp-main.1027",
    pages = "16477--16500",
}

Contents

fact_check_post_mapping.csv - Mapping between fact-checks and social media posts:
    fact_check_id
    post_id

fact_checks.csv - Data about fact-checks:
    fact_check_id
    claim - The translated text (see below) of the fact-check claim.
    instances - Instances of the fact-check: a list of timestamps and URLs.
    title - The translated text (see below) of the fact-check title.

posts.csv - Data about social media posts:
    post_id
    instances - Instances of the post: a list of timestamps and the social media platforms where it appeared.
    ocr - A list of translated texts (see below) of the OCR transcripts of the images attached to the post.
    verdicts - A list of verdicts attached by Meta (e.g., False information).
    text - The translated text (see below) of the text written by the user.

What is a translated text?

A tuple of the text, its translation to English, and the detected languages. In the sample below, we have the original Croatian text, its English translation, and finally the predicted language composition (hbs = Serbo-Croatian):

(
    '"...bolnice su pune ? ti ina, muk...upravo sada, bolnica Rebro..tragi no sme no',
    '"...hospitals are full? silence, silence... right now, Rebro hospital... tragically funny',
    [('hbs', 1.0)]
)
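
The file layout described above can be read with a few lines of Python. The following is a minimal sketch, not the authors' loading code. It assumes that each translated-text cell is stored as a Python-literal tuple, as the sample suggests, and that the mapping file has a header row with the two column names listed above. The helper names `parse_translated_text` and `load_mapping`, and the in-memory sample values, are hypothetical and only serve as illustration.

```python
import ast
import csv
import io
from collections import defaultdict


def parse_translated_text(cell):
    """Parse one translated-text cell into (original, english, languages).

    Assumption: the cell holds a Python-literal tuple like the sample in
    the description. Empty cells are returned as None.
    """
    if not cell:
        return None
    original, english, languages = ast.literal_eval(cell)
    return original, english, languages


def load_mapping(handle):
    """Group fact_check_id values by post_id from a mapping CSV file object.

    IDs are kept as strings so no assumption is made about their type.
    """
    mapping = defaultdict(list)
    for row in csv.DictReader(handle):
        mapping[row["post_id"]].append(row["fact_check_id"])
    return dict(mapping)


# Tiny in-memory stand-ins for the real files (all values are made up).
sample_cell = "('Ahoj svet', 'Hello world', [('slk', 1.0)])"
sample_mapping = io.StringIO("fact_check_id,post_id\n10,1\n11,1\n12,2\n")

original, english, languages = parse_translated_text(sample_cell)
post_to_fact_checks = load_mapping(sample_mapping)
```

With the real data, the same functions could be applied to an opened `fact_check_post_mapping.csv` and to the `claim`, `title`, and `text` columns. The `ocr` column is described as a list of translated texts, so it would need one extra level of iteration.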
disinformation combatting, fact-checking, information retrieval, semantic similarity