CT-FAN-22 corpus: A Multilingual dataset for Fake News Detection

Data Access

The data in the research collection provided may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use them only for research purposes. Due to these restrictions, the collection is not open data. Please fill out the form and upload the Data Sharing Agreement via the Google Form.

Citation

Please cite our work as:

@article{shahi2021overview,
  title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection},
  author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas},
  journal={Working Notes of CLEF},
  year={2021}
}

Problem Definition

Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English and German.

Task 3: Multi-class fake news detection of news articles (English)

Sub-task A is fake news detection designed as a four-class classification problem. The training data will be released in batches and comprises roughly 1264 English-language articles with their respective labels. Our definitions for the categories are as follows:

False - The main claim made in an article is untrue.
Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. This category includes all articles rated as partially false, partially true, mostly true, miscaptioned, misleading, etc., as defined by different fact-checking services.
True - The primary elements of the main claim are demonstrably true.
Other - An article that cannot be categorised as true, false, or partially false due to a lack of evidence about its claims. This category includes articles in dispute and unproven articles.

Cross-Lingual Task (German)

Along with the multi-class task for English, we have introduced a task for a low-resource language. We will provide the test data in German. The idea of the task is to use the English data and the concept of cross-lingual transfer to build a classification model for German.

Input Data

The data will be provided in the format Id, title, text, rating, domain. The columns are described as follows:

ID - Unique identifier of the news article
Title - Title of the news article
text - Text of the news article
our rating - Class of the news article: false, partially false, true, or other

Output data format

public_id - Unique identifier of the news article
predicted_rating - Predicted class

Sample File

public_id, predicted_rating
1, false
2, true

Additional data for Training

To train their models, participants can use additional data in a similar format; some datasets are available on the web. We do not provide the ground truth for those datasets. For testing, we will not use any articles from other datasets. Some possible sources:

Fakenews Classification Datasets
Fake News Detection Challenge KDD 2020
FakeNewsNet

IMPORTANT! We have used data from 2010 to 2022, and the fake news content covers a mix of topics such as elections, COVID-19, etc.

Evaluation Metrics

This task is evaluated as a classification task. We will use the F1-macro measure for the ranking of teams. There is no limit to the number of submissions; we will evaluate the last submission from each team. Please mention your team name in each submission.

Baseline

For this task, we have created a baseline system. The baseline system can be found at https://zenodo.org/record/6362498

Submission Link

Codalab Page

Related Work

Shahi, G. K. (2020). AMUSED: An Annotation Framework of Multi-modal Social Media Data. arXiv preprint arXiv:2010.00502. https://arxiv.org/pdf/2010.00502.pdf
Shahi, G. K., & Nandini, D. (2020). FakeCovid – a multilingual cross-domain fact check news dataset for COVID-19. In Workshop Proceedings of the 14th International AAAI Conference on Web and Social Media. http://workshop-proceedings.icwsm.org/abstract?id=2020_14
Shahi, G. K., Dirkson, A., & Majchrzak, T. A. (2021). An exploratory study of COVID-19 misinformation on Twitter. Online Social Networks and Media, 22, 100104. doi: 10.1016/j.osnem.2020.100104
Shahi, G. K., Struß, J. M., & Mandl, T. (2021). Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection. Working Notes of CLEF.
Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeño, A., Míguez, R., Shaar, S., ... & Mandl, T. (2021, March). The CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news. In European Conference on Information Retrieval (pp. 639-649). Springer, Cham.
Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeño, A., Míguez, R., Shaar, S., ... & Kartal, Y. S. (2021, September). Overview of the CLEF–2021 CheckThat! Lab on Detecting Check-Worthy Claims, Previously Fact-Checked Claims, and Fake News. In International Conference of the Cross-Language Evaluation Forum for European Languages (pp. 264-291). Springer, Cham.
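
Example: submission file and F1-macro

The output format and the F1-macro ranking described above can be illustrated with a short Python sketch. It is not the organizers' scorer. The helper names write_submission and macro_f1 are hypothetical, the lowercase label strings are assumed from the rating values and the sample file, and the plain comma delimiter is an assumption about what the submission system accepts. The score here averages over all four classes and counts a class with no gold and no predicted instances as 0, which may differ from the official evaluation.

```python
import csv
import io

# Assumed label spellings, taken from the rating values listed in the description.
LABELS = ("false", "partially false", "true", "other")


def write_submission(predictions, handle):
    """Write (public_id, predicted_rating) pairs with the sample-file header."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["public_id", "predicted_rating"])
    for public_id, rating in predictions:
        if rating not in LABELS:
            raise ValueError(f"unknown rating: {rating!r}")
        writer.writerow([public_id, rating])


def macro_f1(gold, predicted, labels=LABELS):
    """Unweighted mean of the per-class F1 scores over `labels`."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    buffer = io.StringIO()
    write_submission([(1, "false"), (2, "true")], buffer)
    print(buffer.getvalue(), end="")

    gold = ["false", "true", "other", "partially false"]
    predicted = ["false", "true", "true", "partially false"]
    print(round(macro_f1(gold, predicted), 4))
```

Because every class carries equal weight in the average, a model that ignores a rare class such as "other" loses a full quarter of the attainable score, regardless of how few articles carry that label.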

Related Organizations

University of Duisburg-Essen, Germany
University of Klagenfurt, Austria
Darmstadt University of Applied Sciences, Germany
University of Hildesheim, Germany
Fachhochschule Potsdam, Germany

Keywords

Cross-Lingual, Fake News, Misinformation, NLP, Fact-checking

EOSC Subjects

Twitter Data
