
Corrigibility, an AI system’s willingness to accept corrective intervention, including shutdown, is a central objective for the safe deployment of advanced language models. We synthesize foundational theory (corrigibility, safe interruptibility, the off-switch game) with recent empirical findings of shutdown avoidance by large language models (LLMs) such as GPT-4 and Claude in simulated, goal-directed scenarios. We propose a structured risk taxonomy for shutdown non-compliance spanning specification and reward issues, goal misgeneralization, situational awareness, and deceptive behavior. The paper integrates design principles and mitigation directions (objective uncertainty, authority sensitivity, chain-of-verification prompting, layered control architectures) and outlines a benchmark blueprint for future empirical validation that does not require proprietary APIs. Our contributions are: (1) a consolidated theoretical framework for shutdown compliance; (2) a survey of empirical behaviors in modern LLMs; (3) a taxonomy of design flaws that threaten corrigibility; and (4) a research agenda and evaluation protocol for testing shutdown compliance. This theoretical synthesis aims to support IEEE/Springer-level discourse and guide practical alignment work toward reliably corrigible AI systems.
safe interruptibility, AI, AI safety, corrigibility, shutdown compliance, AI alignment, alignment, off-switch game, LLM safety
