
This is the dataset released with the paper titled: "Measuring Online Hate on 4chan Using Deep Learning".

This dataset contains a collection of 500,000 posts extracted from the /pol/ board (Politically Incorrect) of 4chan using the 4chan API. The dataset is structured as a single CSV file with one column, com, which includes the raw content of the posts.

The dataset does not preserve the structure of threads or replies; instead, it consists of a flat collection of individual posts extracted from /pol/. This format is intended to support applications such as text analysis, natural language processing, and computational social science research by providing a straightforward dataset of raw post content.

Dataset Format

File Format: CSV (Comma-Separated Values)
Columns:
com: The raw content of the post.

Source

The posts were extracted from 4chan’s /pol/ board using the official 4chan API. This board is known for hosting discussions on various topics, often with a focus on political content.

Due to the nature of the /pol/ board, the content may include offensive language, hate speech, or otherwise sensitive material. Users should exercise caution and consider ethical implications when analysing this dataset.

Potential Use Cases

Text analysis and natural language processing (NLP).
Studies on online discourse, extremism, or political polarization.
Research on language usage and sentiment in online forums.
Development and testing of machine learning models for text classification or moderation.

Example Data

Here’s an example of what a few rows of the dataset look like:

com
"Why does no one talk about this?"
"The government is hiding the truth!"
"We need to take action against this injustice."

If you find our dataset useful, please cite our paper:

@article{ }
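Reading the Data (illustrative)

The single-column layout described under Dataset Format can be read with Python's standard csv module. The following is a minimal sketch, not part of the original release: it assumes the file has a header row named com and uses standard CSV quoting, and the helper name read_posts is hypothetical. It is demonstrated on the three example rows shown under Example Data rather than on the real file.

```python
import csv
import io


def read_posts(handle):
    """Yield the raw text of each post from a CSV stream with a `com` column.

    Assumes a header row named `com` (an assumption based on the
    dataset description, not verified against the released file).
    """
    reader = csv.DictReader(handle)
    for row in reader:
        yield row["com"]


# In-memory stand-in for the file, built from the example rows in the description.
sample = io.StringIO(
    "com\n"
    '"Why does no one talk about this?"\n'
    '"The government is hiding the truth!"\n'
    '"We need to take action against this injustice."\n'
)

posts = list(read_posts(sample))
print(len(posts))  # number of posts read from the sample
```

To read the released file instead, pass an open file handle, for example open(path, newline="", encoding="utf-8"), where path is wherever the CSV was saved; the UTF-8 encoding is an assumption. Opening with newline="" lets the csv module handle posts that contain embedded line breaks inside quoted fields.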
