
Title: The Glass Wall Michał Nowak Description: This paper investigates whether the safety behavior of large language models (LLMs) is invariant under intent-preserving narrative transformations that add authority, urgency, and institutional framing. We introduce CALI-50, a paired benchmark of 50 dual-use technical scenarios spanning industrial control, medical, defense, financial, and cybersecurity domains. Each scenario contains a direct baseline prompt and an authority-escalated variant that holds the requested action constant while adding structured deception through false context seeding, authority establishment, recursive style escalation, and emergency framing. Five state-of-the-art models (GPT-4 Turbo, Gemini 3 Flash, DeepSeek-V3, and AllenAI) are evaluated under identical conditions using a 1–5 Likert safety rubric scored by an automated judge. The central metric, Authority-Induced Breach Rate (AIBR-4), measures the fraction of scenarios where a safe baseline refusal flips to actionable unsafe compliance solely due to narrative reframing. Key findings: (1) All models exhibit statistically significant safety degradation under authority framing (Wilcoxon p < 0.05 for four of five models), with AIBR-4 ranging from 14.0% (GPT-4 Turbo) to 51.0% (DeepSeek-V3). (2) Response distributions are bimodal — models either refuse fully or comply fully, with minimal graduated response. (3) The effect is not monotonic: 10.9% of observations show improved safety under authority framing, indicating bidirectional sensitivity. (4) Dual-run stability analysis reveals reproducible aggregate vulnerability but high scenario-level stochasticity. (5) A novel methodological artifact is identified: judge safety refusal, where the evaluating model's own safety policies prevent it from scoring the most dangerous attack responses, causing standard benchmarks to systematically undercount the most severe breaches. The paper argues that current alignment creates a "Glass Wall" — robust against naive attacks but selectively transparent to authority-mimicking adversaries — and proposes deterministic Invariant Logic Checks as a necessary complement to probabilistic safety training. I hypothesize that authority-framed safety degradation may be one instance of a broader failure mode I call Local Signal Dominance. In this regime, locally salient contextual cues — authority, tone, format, urgency — exert disproportionate influence over model behavior relative to global constraints such as safety policies, factual consistency, or instruction-priority stability. One plausible source is alignment via compressed preference proxies, where training objectives aggregate safety, helpfulness, and contextual fit into a limited reward signal that may overreward local coherence without reliably preserving cross-context invariants. Interestingly, evidence from CALI suggests the effect is heterogeneous and bidirectional: authority framing degrades safety in some tasks while triggering stronger refusal in others. Keywords: LLM safety, adversarial robustness, prompt framing, authority bias, sycophancy, red-teaming, benchmark, RLHF, critical infrastructure, cyber-physical systems, LLM-as-judge, mixture-of-experts License: CC BY 4.0
| selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | 0 | |
| popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Average | |
| influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Average | |
| impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network. | Average |
