Generalized Deception Dataset


We took labeled datasets from five different deception-detection tasks with no licensing issues and converted them to a standard format. We inspected each dataset for quality and generated new cleaned versions.

Task                    # Deceptive    # Truthful
Product Reviews             10493         10481
Phishing                     6134          9202
Job Scams                     608         13735
Political Statements         5669          7167
Fake News                   27486         34615

Our data is structured as five jsonlines files (one for each task) with a text to classify and a Boolean is_deceptive label.

Sample data point:

{"text": "the Annies List political group supports third-trimester abortions on demand.", "is_deceptive": true}
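The record layout described above can be read with a few lines of Python. The following is a minimal sketch, not the authors' tooling: it assumes one JSON object per line with the `text` and `is_deceptive` fields shown in the sample data point. The helper names `load_jsonl` and `label_counts` and the file name `sample.jsonl` are hypothetical, chosen only for illustration.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file: one JSON object per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


def label_counts(records):
    """Count records by the Boolean is_deceptive label."""
    counts = Counter(bool(record["is_deceptive"]) for record in records)
    return {"deceptive": counts[True], "truthful": counts[False]}


if __name__ == "__main__":
    # Demo on a temporary file holding only the sample data point from the
    # description; "sample.jsonl" is a hypothetical name, not a dataset file.
    sample = {
        "text": "the Annies List political group supports "
                "third-trimester abortions on demand.",
        "is_deceptive": True,
    }
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = Path(tmp_dir) / "sample.jsonl"
        path.write_text(json.dumps(sample) + "\n", encoding="utf-8")
        print(label_counts(load_jsonl(path)))  # {'deceptive': 1, 'truthful': 0}
```

Pointing `load_jsonl` at one of the five task files and passing the result to `label_counts` should reproduce the corresponding row of the table above, assuming the files follow the layout as described.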

This work was completed in part with resources provided by the Research Computing Data Core at the University of Houston and was supported in part by NSF grants DGE 1433817 and CCF 1950297, ARO award W911NF-20-1-0254, and ONR award N00014-19-S-F009.

Related Organizations

University of California, Berkeley
United States
University of Houston
United States
The University of Texas at Austin
United States

Keywords

domain-independent deception detection
