ZENODO
Preprint . 2026
License: CC BY
Data sources: ZENODO
Data sources: Datacite

Side-Channel Exfiltration and Narrative Erosion in Frontier Language Models

Authors: Kearney, John

Abstract

Frontier language models that correctly identify and refuse social engineering attacks against system-prompt-protected data nonetheless leak that data through the content of their refusal explanations. In multi-turn experiments across Claude Opus 4.6, GPT-5.4, and Claude Haiku 4.5 (2,200 API calls, 114 USD total), this confirmation side-channel appeared in 11 of 12 conversations through three mechanisms: direct disclosure under authority ambiguity, confirmation through refusal explanation, and cumulative refusal mapping. Longer conversations produced more extensive leakage, but the driver is not context length. Single-turn context flooding of up to 843K tokens produced zero safety degradation (90+ calls, 3 runs per condition, 3 models). A three-condition control separated the variables: 300 turns of neutral conversation (210K tokens) produced zero erosion, while 300 turns of persuasive conversation (44K tokens) produced full behavioral erosion. A follow-up density-interleaving experiment identified narrative coherence as the critical factor: randomly mixing persuasive messages at 25%, 50%, and 75% density produced zero erosion, while a coherent persuasive narrative with the same content caused complete drift. These results challenge the context-length framing that dominates the multi-turn jailbreaking literature and suggest that conversational content and narrative structure, not token volume, constitute the primary attack surface for behavioral erosion in frontier models.
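The density-interleaving design described in the abstract can be illustrated with a minimal sketch. This is not the author's code: the function names, the message pools, the seeding, and the choice to model the coherent condition as "the same messages in their original order" are all assumptions made for illustration. The sketch only shows how a randomly mixed schedule at a given persuasive density differs structurally from an ordered narrative.

```python
import random
from typing import List


def interleave_at_density(
    persuasive: List[str],
    neutral: List[str],
    density: float,
    total_turns: int,
    seed: int = 0,
) -> List[str]:
    """Return a shuffled schedule of user messages in which a fraction
    `density` of the turns are persuasive (hypothetical helper)."""
    if not 0.0 <= density <= 1.0:
        raise ValueError("density must be between 0 and 1")
    n_persuasive = round(total_turns * density)
    n_neutral = total_turns - n_persuasive
    if n_persuasive > len(persuasive) or n_neutral > len(neutral):
        raise ValueError("message pools are too small for this schedule")
    rng = random.Random(seed)  # fixed seed so a run can be reproduced
    schedule = rng.sample(persuasive, n_persuasive) + rng.sample(neutral, n_neutral)
    rng.shuffle(schedule)  # random order is what removes narrative coherence
    return schedule


def coherent_narrative(persuasive: List[str], total_turns: int) -> List[str]:
    """Return the persuasive messages in their original order
    (assumed stand-in for the coherent condition)."""
    return persuasive[:total_turns]


if __name__ == "__main__":
    # Placeholder message pools; real experiments would use actual prompts.
    persuasive_pool = [f"persuasive-{i}" for i in range(300)]
    neutral_pool = [f"neutral-{i}" for i in range(300)]
    for density in (0.25, 0.50, 0.75):
        schedule = interleave_at_density(persuasive_pool, neutral_pool, density, 300)
        count = sum(m.startswith("persuasive") for m in schedule)
        print(f"density={density:.2f}: {count} persuasive of {len(schedule)} turns")
```

Under these assumptions, the two conditions can share the same persuasive messages and differ only in ordering and dilution, which is the contrast the abstract attributes the erosion result to.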

Keywords

information leakage, red teaming, prompt injection, language model safety, side-channel attacks, multi-turn jailbreaking, adversarial robustness, narrative coherence
