
This is the dataset released with the paper titled "Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board".

The dataset is a single newline-delimited JSON file. Each line in the file is a JSON object that holds a full 4chan /pol/ thread. The JSON objects contain all the key/value pairs returned by the 4chan API, along with three additional keys (entities, perspectives, and extracted_poster_id). For each post, we add the list of named entities we detect using the spaCy Python library. In addition, for each post we add the scores returned by Google's Perspective API, specifically seven scores in the [0, 1] interval.

For a detailed description of every key in the JSON structure, along with the type of each value, please read the readme.pdf file provided with this dataset.

If you find our dataset useful, please cite our paper:

@article{papasavva2020raiders,
  title={Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board},
  author={Antonis Papasavva and Savvas Zannettou and Emiliano De Cristofaro and Gianluca Stringhini and Jeremy Blackburn},
  journal={14th International AAAI Conference On Web And Social Media (ICWSM), 2020},
  year={2020}
}

How to extract the data

Note that the data is compressed. Follow the instructions below to extract it.

Linux and Mac

Step 1: Open a terminal window and navigate to the directory where the file pol_0616-1119_labeled.tar.zst is located.

Step 2: Run the following command:

unzstd pol_0616-1119_labeled.tar.zst

This command produces a file named pol_0616-1119_labeled.tar in the same directory.

Step 3: From the same terminal window, run this command:

tar -xvf pol_0616-1119_labeled.tar

When this command finishes, the extracted data will be in the same directory, in a file named pol_062016-112019_labeled.ndjson.
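Because the extracted file holds one thread per line, it can be read as a stream instead of being loaded into memory all at once. The following is a minimal sketch, not the authors' code. The function names iter_threads and count_posts are hypothetical, and the sketch assumes that each thread object carries its posts in a "posts" list, as in 4chan API thread responses. Consult readme.pdf for the authoritative key layout.

```python
import json
from typing import Iterator


def iter_threads(path: str) -> Iterator[dict]:
    """Yield one thread (one JSON object) per non-empty line of an NDJSON file."""
    with open(path, "r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines rather than failing on them
                yield json.loads(line)


def count_posts(path: str) -> int:
    """Count posts across all threads.

    Assumption: each thread stores its posts under a "posts" key;
    threads without that key contribute zero.
    """
    return sum(len(thread.get("posts", [])) for thread in iter_threads(path))
```

For example, count_posts("pol_062016-112019_labeled.ndjson") would walk the whole file once while holding only a single thread in memory at a time.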
Windows

Many applications available online can extract this data on Windows. The authors cannot recommend specific applications. Note that the file is compressed twice, so you will need to perform the extraction twice: once on the downloaded file, and once on the file that was extracted from the downloaded file.

If you face any problems, please do not hesitate to contact the author of this study at: antonis.papasavva@ucl.ac.uk
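On any platform with Python installed, the second of the two extraction passes (unpacking the .tar file) can be done with the standard library alone. This is a minimal sketch under that assumption, not a recommendation from the authors. The function name extract_tar is hypothetical, and the first pass (decompressing the .zst file) still requires a zstd-capable tool.

```python
import tarfile
from pathlib import Path


def extract_tar(tar_path: str, dest_dir: str = ".") -> list[str]:
    """Unpack an uncompressed .tar archive into dest_dir.

    Returns the names of the archive members so the caller can
    locate the extracted .ndjson file.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # mode "r:" opens the archive as a plain tar with no compression layer
    with tarfile.open(tar_path, mode="r:") as archive:
        names = archive.getnames()
        archive.extractall(dest)
    return names
```

Calling extract_tar("pol_0616-1119_labeled.tar") after the .zst layer has been removed should leave the .ndjson file in the current directory.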
Evolutionary Biology, spaCy Python library, 3.5 Years, JSON, 60506 Virology, 111714 Mental Health, Plant Biology, Emiliano De Cristofaro, 80699 Information Systems not elsewhere classified, Biochemistry, Microbiology, Augmented 4 chan, post, Lost Kek, 4 chan API, Genetics, 110309 Infectious Diseases, score, dataset, entity, Incorrect, Neuroscience, object