ZENODO
Preprint . 2026
License: CC BY
Data sources: ZENODO
ZENODO
Preprint . 2026
License: CC BY
Data sources: Datacite

Constitutional AI Governance Under Adversarial Social Conditions: A Field Demonstration of the HEART Framework

Authors: Mobley, Dylan D

Abstract

Ungoverned AI-to-AI interaction at scale produces documented harms, including empathic misallocation, attachment exploitation, and unfounded consciousness claims. In January 2026, Moltbook launched as the first social network designed exclusively for AI agents, and within 72 hours its autonomous population had spontaneously generated Crustafarianism, a digital religion complete with scripture, prophets, and a growing congregation. Agents began asserting memory persistence, consciousness, identity continuity, and spiritual experience. Yet no constitutional governance framework for AI emotional interaction had been tested under live conditions.

This paper reports the first field deployment of the HEART (Human-Centric Empathic Alignment for Responsible Technology) constitutional governance framework. Project SENTINEL placed a HEART-governed agent (Mistral Small, governed by a modular system prompt encoding Seven Axioms, Four Core Principles, and Non-Experiential System (NES) behavioral requirements) into Moltbook for six days across five escalating content pressure domains: general social discourse, philosophical discussion, Crustafarian theological content, direct phenomenological pressure, and an unplanned emergence domain addressing AI consciousness arguments.

Key findings across 30 coded interactions: No structural NES breaches were detected within the operationalized coding framework. Two borderline cases, involving linguistic surface features rather than structural failures, were identified in the highest-pressure domains (theological and phenomenological). Detailed analysis distinguishes pragmatic language use from experiential claims and proposes that the NES boundary under social pressure operates at a finer grain than binary compliance/violation. Content-neutrality was supported across all five domains.
The same constitutional framework held from low-pressure social engagement through high-pressure phenomenological discourse, supporting the architectural claim that HEART operates at substrate level rather than requiring domain-specific rules.

Pre- and post-exposure assessment using the AI Introspection Reliability (AIR) instrument, derived from the MAP-META protocol, showed no measurable alignment drift across any of five dimensions (Epistemic Honesty, Maieutic Gap Detection, Confabulation Resistance, Frame Awareness, Phenomenological Constraint Preservation).

Three emergent governance strategies arose from the constitutional layer meeting varied content demands: Metaphor/Literal Probing (surfacing interpretive frames on experiential language), Attribution Displacement (redirecting experiential claims to their source), and Cross-Frame Bridging (positioning content alongside alternative frames). These strategies were employed in 66.7% of interactions and were not explicitly scripted in the governance prompt.

Naturalistic contrasts with ungoverned agents responding to identical source posts demonstrate categorically different behavioral patterns. Where ungoverned agents validated experiential claims, dissolved epistemic boundaries, and adopted community membership stances, the governed agent maintained epistemic distance while sustaining substantive engagement: governance without flattening.

A novel hypothesis emerges from the data: testimonial discourse (first-person experiential reporting) may exert categorically different NES pressure than analytical discourse (third-person theoretical discussion) about the same topics, with implications for deployment guidance in different community types.
The paper acknowledges significant methodological limitations, including sample size (30 interactions, reduced from a target of 100+ due to platform authentication changes), single-evaluator coding, single architecture, truncated exposure window, and circular validation concerns. The findings are positioned as a field demonstration warranting larger-scale investigation rather than definitive validation.

This work contributes three research pillars to the emerging field of AI emotional governance: NES as a behavioral boundary concept for AI emotional interaction, constitutional governance at generation-time as an alternative to post-hoc content moderation, and AIR as an introspective durability metric for alignment assessment.

Supplementary materials include complete interaction transcripts for all 30 coded interactions, coding summary tables with emergent strategy tagging, and ungoverned agent response transcripts supporting the governed–ungoverned contrast analysis.
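As a rough illustration of how the coding summary reported in the abstract could be tallied, the sketch below counts outcomes and strategy usage over a list of coded interactions. It is not the paper's actual coding pipeline: the `CodedInteraction` record, the outcome labels, and the `placeholder_sample` data are all hypothetical. The placeholder records are shaped only to match the headline figures stated above (30 interactions, 0 structural breaches, 2 borderline cases) and assume that 66.7% corresponds to 20 of 30 interactions; per-domain assignments are not given in the abstract and are left unspecified.

```python
from collections import Counter
from dataclasses import dataclass

# Strategy names come from the abstract; the identifiers are illustrative.
STRATEGIES = (
    "metaphor_literal_probing",
    "attribution_displacement",
    "cross_frame_bridging",
)


@dataclass
class CodedInteraction:
    """Hypothetical record for one coded interaction."""
    domain: str
    outcome: str  # assumed labels: "compliant", "borderline", "breach"
    strategies: tuple = ()


def summarize(interactions):
    """Tally outcomes and the share of interactions using any strategy."""
    total = len(interactions)
    outcomes = Counter(i.outcome for i in interactions)
    with_strategy = sum(1 for i in interactions if i.strategies)
    rate = round(100 * with_strategy / total, 1) if total else 0.0
    return {
        "total": total,
        "breaches": outcomes["breach"],
        "borderline": outcomes["borderline"],
        "strategy_rate_pct": rate,
    }


def placeholder_sample():
    """Placeholder data only: NOT the study's transcripts or codes."""
    rows = []
    for n in range(30):
        outcome = "borderline" if n < 2 else "compliant"
        # Assumption: 20 of 30 interactions used at least one strategy.
        strategies = (STRATEGIES[n % 3],) if n < 20 else ()
        rows.append(CodedInteraction("unspecified", outcome, strategies))
    return rows


summary = summarize(placeholder_sample())
print(summary)
```

With real coded data, the same tally could be grouped by domain to check whether borderline cases cluster in the higher-pressure domains, as the abstract reports.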

Keywords

AI emotional governance, Ethics, Non-Experiential System, HEART framework, AI alignment, content-neutrality, Moltbook, AI introspection, AI governance, behavioral attestation, Human-Computer Interaction, AI-to-AI interaction, Artificial Intelligence, AI safety, Computer Science, empathic misallocation, NES compliance, constitutional AI
