
The SlangTrack (ST) Dataset is a curated resource for slang detection in natural language processing. It focuses on words that occur in both slang and non-slang contexts, so that a binary classifier can learn to distinguish the two senses. Each usage is illustrated with examples, supporting fine-grained linguistic and computational analysis for NLP researchers and practitioners.

Key Features:
- Unique Words: 48,508
- Total Tokens: 310,170
- Average Post Length: 34.6 words
- Average Sentences per Post: 3.74

These features provide a robust contextual framework for slang detection and semantic analysis.

Target Word Selection: The target words were chosen to support fine-grained analysis. Each word in the dataset:
- Appears in both the slang SD wordlist and the Corpus of Historical American English (COHA).
- Has between 2 and 8 distinct senses, including both slang and non-slang meanings.
- Was cross-referenced against trusted resources: Green's Dictionary of Slang, Urban Dictionary, Online Slang Dictionary, and Oxford English Dictionary.
- Has at least one slang sense and one dominant non-slang sense.
- Is not a proper noun, to maintain linguistic relevance and focus.

Data Sources and Collection:
1. Corpus of Historical American English (COHA): Historical examples were extracted from the cleaned version of COHA (CCOHA). The data spans the years 1980–2010, capturing how the target words evolved over time.
2. Twitter: Twitter was selected for its dynamic, real-time communication, which offers rich examples of contemporary slang and informal language. For each target word, 1,000 examples were collected from tweets posted between 2010 and 2020, reflecting modern usage.

Dataset Scope: The final dataset comprises ten target words that meet the strict selection criteria above, ensuring linguistic and computational relevance. Each word:
- Demonstrates semantic diversity, balancing slang and non-slang senses.
- Is well represented in both historical (COHA) and modern (Twitter) contexts.

The SlangTrack Dataset is a public resource intended to foster research in slang detection, semantic evolution, and informal language processing. By combining historical and contemporary sources, it provides a comprehensive platform for exploring the nuances of slang in natural language.
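The target word selection criteria described above can be read as a simple filter over candidate words. The following is a minimal sketch of that reading, not the authors' actual pipeline: the `Candidate` record, its field names, and the `is_eligible` function are hypothetical, and the example words are invented placeholders rather than entries from the dataset. The requirement that the non-slang sense be "dominant" is not modeled here, since the description does not say how dominance was measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record describing one candidate target word."""
    word: str
    in_sd_wordlist: bool    # listed in the slang SD wordlist
    in_coha: bool           # attested in COHA
    slang_senses: int       # slang senses found in the reference dictionaries
    non_slang_senses: int   # non-slang senses found in the reference dictionaries
    is_proper_noun: bool


def is_eligible(c: Candidate) -> bool:
    """Apply the stated selection criteria to one candidate (assumed encoding)."""
    total_senses = c.slang_senses + c.non_slang_senses
    return (
        c.in_sd_wordlist
        and c.in_coha
        and 2 <= total_senses <= 8      # between 2 and 8 distinct senses
        and c.slang_senses >= 1         # at least one slang sense
        and c.non_slang_senses >= 1     # at least one non-slang sense
        and not c.is_proper_noun        # proper nouns are excluded
    )


# Invented placeholder candidates, for illustration only.
candidates = [
    Candidate("alpha", True, True, 1, 2, False),    # meets every criterion
    Candidate("beta", True, False, 1, 2, False),    # not attested in COHA
    Candidate("gamma", True, True, 0, 3, False),    # no slang sense
    Candidate("Delta", True, True, 1, 1, True),     # proper noun
    Candidate("epsilon", True, True, 4, 6, False),  # 10 senses, above the limit
]

selected = [c.word for c in candidates if is_eligible(c)]
print(selected)  # prints ['alpha']
```

Encoding the criteria as one predicate keeps each rule visible on its own line, which makes it easy to check a candidate list against the description or to relax a single rule.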
Twitter Data
