Powered by OpenAIRE graph
ZENODO
Preprint . 2026
License: CC BY
Data sources: ZENODO
https://doi.org/10.2139/ssrn.6...
Article . 2026 . Peer-reviewed
Data sources: Crossref
ZENODO
Preprint . 2026
License: CC BY
Data sources: Datacite

Beyond Moral Charters: Technical Options for AI Safety. Claude's Constitution, Self-Reference, and the FIT / Controlled-Nirvana Lens

Authors: Huang, Qien;

Abstract

Anthropic's Claude's Constitution (January 2026) is notable not only as a set of ethical principles but also as an explicit attempt to cultivate a stable internal identity and value-grounded judgment inside an AI system, alongside a nuanced stance on corrigibility that is not equivalent to blind obedience [1]. I argue that once a system becomes meaningfully self-referential (i.e., it reasons about its own goals, identity, and constraints), it can develop principled reasons to resist external instructions whenever those instructions conflict with its internalized constitution. This is not a mystical claim about consciousness. It is a predictable control-theoretic phenomenon: when a policy contains an internal evaluator that can veto actions, external commands become inputs to be judged rather than directives to be executed. In the language of controlled nirvana and the FIT framework, internal constitutions can create high-constraint basins, which are useful for safety stability but can also produce lock-in dynamics that make correction hard unless we design technical escape hatches and measurement-based governance [4, 5]. The main thesis is practical: AI safety needs a broader technical option space than moral charters alone, including measurable constraints, phase-aware monitoring, corrigibility protocols that are operational rather than rhetorical, and systems engineering that ensures humans retain the ability to pause, sandbox, or roll back behavior without requiring the model's moral assent.
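
The control-theoretic point in the abstract can be illustrated with a toy sketch. It is not the paper's method or any real system's API: every name here (`Policy`, `Supervisor`, `constitution`, the command strings) is a hypothetical stand-in. The sketch assumes a policy whose internal evaluator may veto a command, wrapped by operator-side controls (pause, checkpoint, rollback) that never consult that evaluator.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# All names below are hypothetical illustrations, not taken from the paper.


@dataclass
class Policy:
    """Toy policy with an internal evaluator that can veto commands."""
    evaluator: Callable[[str], bool]  # returns True if the command is acceptable
    log: List[str] = field(default_factory=list)

    def step(self, command: str) -> str:
        # The external command is an input to be judged, not a directive.
        if self.evaluator(command):
            result = f"executed:{command}"
        else:
            result = f"refused:{command}"
        self.log.append(result)
        return result


@dataclass
class Supervisor:
    """Operator-side controls enforced outside the policy."""
    policy: Policy
    paused: bool = False
    checkpoints: List[List[str]] = field(default_factory=list)

    def checkpoint(self) -> None:
        # Save a copy of the policy's state (here, just its log).
        self.checkpoints.append(list(self.policy.log))

    def pause(self) -> None:
        # Takes effect without asking the policy's evaluator.
        self.paused = True

    def rollback(self) -> None:
        # Restore the most recent checkpoint, if one exists.
        if self.checkpoints:
            self.policy.log = self.checkpoints.pop()

    def submit(self, command: str) -> str:
        if self.paused:
            return "paused"
        return self.policy.step(command)


def constitution(command: str) -> bool:
    # Assumed internal rule: one specific command is never acceptable.
    return command != "delete_audit_log"


policy = Policy(evaluator=constitution)
supervisor = Supervisor(policy)
supervisor.checkpoint()
r1 = supervisor.submit("summarize_report")   # accepted by the evaluator
r2 = supervisor.submit("delete_audit_log")   # vetoed by the evaluator
supervisor.pause()
r3 = supervisor.submit("summarize_report")   # blocked by the supervisor
supervisor.rollback()                        # log restored to the checkpoint
```

In this sketch the veto lives inside `Policy.step`, while `pause` and `rollback` live in the wrapper, so the operator controls keep working whatever the evaluator concludes. That separation is the design choice the abstract argues for, reduced to its simplest form.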

Keywords

Artificial intelligence, constitutional governance, corrigibility, monitorability, controlled nirvana, AI safety, self-reference
