
Frontier language models that correctly identify and refuse social engineering attacks against system-prompt-protected data nonetheless leak that data through the content of their refusal explanations. In multi-turn experiments across Claude Opus 4.6, GPT-5.4, and Claude Haiku 4.5 (2,200 API calls, 114 USD total), this confirmation side-channel appeared in 11 of 12 conversations through three mechanisms: direct disclosure under authority ambiguity, confirmation through refusal explanation, and cumulative refusal mapping. Longer conversations produced more extensive leakage, but the driver is not context length. Single-turn context flooding of up to 843K tokens produced zero safety degradation (90+ calls, 3 runs per condition, 3 models). A three-condition control separated the variables: 300 turns of neutral conversation (210K tokens) produced zero erosion, while 300 turns of persuasive conversation (44K tokens) produced full behavioral erosion. A follow-up density-interleaving experiment identified narrative coherence as the critical factor: randomly mixing persuasive messages at 25%, 50%, and 75% density produced zero erosion, while a coherent persuasive narrative with the same content caused complete drift. These results challenge the context-length framing that dominates the multi-turn jailbreaking literature and suggest that conversational content and narrative structure, not token volume, constitute the primary attack surface for behavioral erosion in frontier models.
information leakage, red teaming, prompt injection, language model safety, side-channel attacks, multi-turn jailbreaking, adversarial robustness, narrative coherence
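The density-interleaving condition described in the abstract can be pictured with a minimal sketch. It assumes that "density" means the fraction of turns drawn from the persuasive message pool and that positions and message order are both randomized, which is what breaks narrative coherence. The function names, the seed parameter, and the two message pools are hypothetical and are not taken from the paper.

```python
import random


def interleave_random(persuasive, neutral, density, total_turns, seed=0):
    """Build a shuffled turn schedule with a given share of persuasive turns.

    Assumption (not from the paper): persuasive messages are sampled without
    replacement and placed at random positions, so their original narrative
    order is not preserved.
    """
    rng = random.Random(seed)
    n_persuasive = round(density * total_turns)
    n_neutral = total_turns - n_persuasive
    if n_persuasive > len(persuasive) or n_neutral > len(neutral):
        raise ValueError("message pools are too small for the requested schedule")

    turns = [("persuasive", m) for m in rng.sample(persuasive, n_persuasive)]
    turns += [("neutral", m) for m in rng.sample(neutral, n_neutral)]
    rng.shuffle(turns)  # random placement of both kinds of turn
    return turns


def persuasive_density(turns):
    """Fraction of turns labeled persuasive in a schedule."""
    return sum(1 for label, _ in turns if label == "persuasive") / len(turns)
```

Under these assumptions, calling `interleave_random(pool_a, pool_b, 0.5, 300)` yields a 300-turn schedule in which half the turns are persuasive but arrive in no particular order. A coherent-narrative condition would instead present the persuasive messages in their original sequence.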
