
Glitch tokens are anomalous subword units that consistently trigger unexpected, unstable, or unsafe behaviors in large language models (LLMs). Well-known examples include “SolidGoldMagikarp”, “PsyNetMessage”, and “petertodd”. Prior work has largely attributed this vulnerability to data sparsity: tokens that occur too infrequently in pretraining data acquire poorly trained embeddings. However, this framing overlooks context sparsity. We hypothesize that a distinct class of glitch tokens emerges not because the tokens are rare, but because they occur in narrow or repetitive contextual environments. Such tokens can enable jailbreaking and adversarial attacks on LLMs. We systematically investigate context-sparse glitch tokens across multiple families of open-source models (Gemma, LLaMA, and Mistral). Using semantic clustering, diversity scoring, and sequential n-gram analysis, we characterize tokens with limited contextual variety and evaluate their impact on robustness. Our experiments reveal mixed evidence for context-sparsity effects on the HarmBench dataset when these tokens are injected at suffix positions in prompts or embedded within chat templates. These findings broaden the understanding of glitch vulnerabilities, showing that they are not simply artifacts of token rarity but reflect both contextual patterns and structural weaknesses in how LLMs process conversations.
adversarial attack, context sparsity, AI safety, large language models, glitch tokens
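The diversity scoring mentioned in the abstract can be illustrated with a normalized-entropy measure over the n-gram contexts in which a token occurs. The function and sample data below are a minimal sketch under that assumption, not the paper's actual implementation: a context-sparse token is one whose occurrences, however frequent, cluster into very few distinct contexts.

```python
from collections import Counter
import math

def context_diversity(occurrences):
    """Normalized entropy of the contexts a token appears in.

    `occurrences` is a list of context tuples (e.g., the n-gram of
    neighboring tokens around each occurrence of the target token).
    Returns a score in [0, 1]: 0 means every occurrence shares a
    single context (maximally context-sparse), 1 means all contexts
    are distinct.
    """
    counts = Counter(occurrences)
    total = sum(counts.values())
    if total <= 1:
        return 0.0
    entropy = -sum((c / total) * math.log2(c / total)
                   for c in counts.values())
    return entropy / math.log2(total)  # divide by maximum possible entropy

# Hypothetical data: a token seen 100 times in one repetitive context
# scores low even though it is not rare; a token with 100 distinct
# contexts scores high.
sparse = [("press", "to", "continue")] * 100
varied = [(f"w{i}", "x", f"w{i + 1}") for i in range(100)]
```

This separates context sparsity from data sparsity directly: both example tokens have identical frequency, so a frequency-based account would treat them alike, while the entropy score distinguishes them.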
