Powered by OpenAIRE graph
ZENODO
Preprint, 2026
License: CC BY
Data sources: ZENODO, Datacite

An Architectural Design Space for Internal Ethical Counterweights in AI Systems (ENG)

Authors: Janer Tittarelli, Javier Ignacio

Abstract

The deployment of advanced AI systems in high-impact decision contexts has intensified concerns regarding alignment, governance, and misuse. Current approaches predominantly conceptualize AI-related risk as a property of model behavior, emphasizing output alignment, constraint enforcement, and external oversight mechanisms. While these strategies address important failure modes, they remain structurally incomplete in contexts where AI systems function primarily as decision-support tools for human actors with concentrated authority.

This paper argues that a significant class of AI-related risk arises not from model misbehavior, but from progressive degradation of human judgment under conditions of AI-amplified decision power. In environments characterized by irreversibility, asymmetric impact, and limited corrective feedback, sustained interaction with highly capable AI systems can systematically narrow reasoning, reinforce overconfidence, and attenuate sensitivity to human consequences, even when system outputs remain formally aligned.

We introduce an architectural design space for internal ethical counterweights in AI systems. These counterweights are conceived as autonomous, non-task-oriented subspaces that operate alongside operational AI cores to detect structural risk conditions associated with judgment degradation and to modulate system interaction accordingly. Rather than enforcing normative outcomes or restricting system capabilities, ethical counterweights introduce persistent internal friction through graduated output modulation, reflection prompts, and uncertainty amplification.

The paper does not propose a universal ethical doctrine or a single implementation strategy. Instead, it delineates multiple construction pathways (policy-driven, model-based, and hybrid) and analyzes their respective trade-offs in terms of adaptability, auditability, and governance.
By reframing alignment as a problem of judgment stabilization under amplified power rather than output control alone, this work provides a conceptual foundation for integrating internal ethical friction into AI-assisted decision-making systems operating in high-impact domains.

Keywords

cognitive drift, decision-support systems, human judgment, alignment beyond outputs, high-impact decision-making, AI governance, ethical counterweights

  • Impact indicators by BIP!
    selected citations: 0
    These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
    popularity: Average
    This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
    influence: Average
    This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
    impulse: Average
    This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.