SlangTrack Dataset

The SlangTrack (ST) Dataset is a curated resource for slang detection in natural language processing. It focuses on words that occur in both slang and non-slang senses, so that a binary classifier can be trained to distinguish the two. Each usage is illustrated with examples, supporting fine-grained linguistic and computational analysis by NLP researchers and practitioners.

Key Features:

- Unique Words: 48,508
- Total Tokens: 310,170
- Average Post Length: 34.6 words
- Average Sentences per Post: 3.74

These figures give each example enough surrounding context for slang detection and semantic analysis.

Target Word Selection:

The target words were chosen to support fine-grained analysis. Each word in the dataset:

- Appears in both the slang SD wordlist and the Corpus of Historical American English (COHA).
- Has between 2 and 8 distinct senses, including both slang and non-slang meanings.
- Was cross-referenced against trusted resources: Green's Dictionary of Slang, Urban Dictionary, Online Slang Dictionary, and Oxford English Dictionary.
- Has at least one slang sense and one dominant non-slang sense.
- Is not a proper noun, to maintain linguistic relevance and focus.

Data Sources and Collection:

1. Corpus of Historical American English (COHA): Historical examples were extracted from the cleaned version of COHA (CCOHA). The data spans the years 1980–2010, capturing how the target words evolved over time.
2. Twitter: Twitter was selected for its dynamic, real-time communication, which offers rich examples of contemporary slang and informal language. For each target word, 1,000 examples were collected from tweets posted between 2010 and 2020, reflecting modern usage.

Dataset Scope:

The final dataset comprises ten target words that meet the selection criteria above. Each word:

- Demonstrates semantic diversity, balancing slang and non-slang senses.
- Is well represented in both historical (COHA) and modern (Twitter) contexts.

The SlangTrack Dataset is a public resource intended to support research in slang detection, semantic evolution, and informal language processing. By combining historical and contemporary sources, it provides a platform for exploring the nuances of slang in natural language.
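The target word selection criteria described above can be sketched as a simple filter. This is only an illustration, not the authors' actual pipeline: the `Candidate` record, its field names, and the placeholder entries are hypothetical, and the requirement that the non-slang sense be dominant is not modeled because the description does not say how dominance was measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one candidate word (field names are assumptions)."""
    word: str
    in_slang_wordlist: bool   # appears in the slang wordlist
    in_coha: bool             # appears in COHA
    slang_senses: int         # number of slang senses found in the dictionaries
    non_slang_senses: int     # number of non-slang senses found in the dictionaries
    is_proper_noun: bool


def is_eligible(c: Candidate) -> bool:
    """Apply the stated criteria: present in both sources, 2 to 8 senses,
    at least one slang and one non-slang sense, and not a proper noun."""
    total_senses = c.slang_senses + c.non_slang_senses
    return (
        c.in_slang_wordlist
        and c.in_coha
        and 2 <= total_senses <= 8
        and c.slang_senses >= 1
        and c.non_slang_senses >= 1
        and not c.is_proper_noun
    )


# Placeholder candidates for illustration only; these are not dataset entries.
candidates = [
    Candidate("word_a", True, True, 2, 3, False),    # meets every criterion
    Candidate("word_b", True, True, 0, 4, False),    # no slang sense
    Candidate("word_c", True, False, 1, 1, False),   # missing from COHA
    Candidate("word_d", True, True, 5, 6, False),    # 11 senses, above the limit
    Candidate("Word_E", True, True, 1, 2, True),     # proper noun
]

selected = [c.word for c in candidates if is_eligible(c)]
print(selected)
```

Keeping each criterion as a separate boolean term makes it easy to report which rule excluded a given candidate, which is useful when auditing a small hand-selected word list such as this one.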

EOSC Subjects

Twitter Data

