
If you would like to request access to these files, please fill out the form below. In order to share the dataset with you, we ask that you agree to the following terms:

- You will use the dataset strictly for research purposes.
- The request for access must be sent from an official, existing e-mail address of the relevant university, faculty, or other scientific or research institution (for verification purposes).
- You will not attempt to identify, deanonymize, or contact the authors of the social media posts included in this dataset.
- You will not re-share the dataset (or any of its parts) with anyone not included in this request.
- You will appropriately cite the papers mentioned in the dataset description in any publication, project, or tool using this dataset.
- You understand how the dataset was created and that the manual or automatically predicted annotations may not be 100% correct.
- You acknowledge that you are fully responsible for the use of the dataset (data) and for any infringement of third-party rights (in particular copyright) that may arise from its use beyond the intended purposes. Neither the authors nor the Kempelen Institute of Intelligent Technologies (KInIT) are responsible for your actions.
The MultiCheckWorthy (MultiCW) dataset is a balanced multilingual benchmarking dataset for check-worthy claim detection, covering 16 languages, 6 topical domains, and 2 writing styles. The dataset consists of 123,722 samples, evenly distributed between noisy and structured texts, with balanced representation of the check-worthy and non-check-worthy classes across all languages. Each claim is accompanied by its English translation, detected topic, writing style, language code, check-worthiness label, and a list of detected named entities. The dataset was composed from existing datasets and balanced by translating samples from those datasets as well as by adding samples collected from Wikipedia. The dataset is partitioned into training, validation, and test sets. In addition, we construct a separate out-of-distribution (OOD) set covering 4 other languages (it, mk, nl, and my) to evaluate model generalization beyond the in-distribution data. Below is the number of samples included in each set:

- Train: 86,691
- Validation: 18,491
- Test: 18,540
- Out-of-distribution (OOD): 27,761
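To illustrate how such a dataset might be consumed, the sketch below builds a tiny pandas DataFrame with hypothetical column names (the actual MultiCW schema and file format may differ) and filters it down to the check-worthy samples:

```python
import pandas as pd

# Hypothetical rows mimicking the described fields: original text, English
# translation, language code, topic, writing style, check-worthiness label,
# and detected named entities. Column names are assumptions, not the
# official MultiCW schema.
rows = [
    {"text": "Ukupno je potvrđeno 1 000 novih slučajeva.",
     "text_en": "A total of 1,000 new cases were confirmed.",
     "lang": "hr", "topic": "health", "style": "structured",
     "label": 1, "entities": ["1 000"]},
    {"text": "lol this weather tho",
     "text_en": "lol this weather tho",
     "lang": "en", "topic": "other", "style": "noisy",
     "label": 0, "entities": []},
]
df = pd.DataFrame(rows)

# Keep only the check-worthy samples (label == 1).
check_worthy = df[df["label"] == 1]
print(len(check_worthy))  # 1
```

The same pattern (filtering on the label, language, or style columns) would apply to the real train/validation/test splits once loaded from disk.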
Large Language Models, check-worthy claims, fine-tuning Transformers
