ZENODO
Other ORP type, 2026
Data sources: Datacite
Generator-Level Architecture and the Limits of Output-Oriented AI Safety

Authors: Copeland, Christopher W.
Abstract

This deposit presents a coordinated three-paper research package on a common structural problem in contemporary AI safety: the relationship between generator-level processes, projection-level control, and the compensatory safety architectures built around systems whose internal self-regulatory mechanisms remain only partially modeled. The three papers are intentionally sequenced: together, they move from a foundational target-mismatch critique, to a deployment-layer interaction problem, to a narrower analysis of self-evaluation and introspection limits in current AI systems. The package is not presented as a complete theory of AI safety, nor as a claim that existing alignment methods are worthless. Its contribution is narrower: to identify and clarify architectural tensions that may distort evaluation, complicate governance, and limit durable self-regulation.

File order and roles

1. Projection-Level Alignment and Generator-Level Indeterminacy v1.2.docx
Foundational paper. This file argues that a persistent target mismatch may exist when alignment methods act primarily on observable behavioral surfaces rather than on the underlying generative process producing future behavior. It develops the distinction between projection-level alignment and generator-level control, and frames this as a structural critique of contemporary alignment practice.

2. On the Interaction Between Internalized Alignment and Boundary-Level Governance in Contemporary AI Systems v1.2.docx
Architectural and governance companion paper. This file examines a downstream consequence of current safety practice: the interaction between internalized alignment shaping and boundary-level governance layers. It argues that mixed internal and external constraint stacks may create evaluation opacity, masking in place of repair, complexity creep, and reduced clarity about what has actually been corrected in deployed systems.

3. Projection Is Not Introspection v1.2.docx
Self-evaluation and introspection paper. This file argues that output-level self-description should not be conflated with generator-level self-access. It develops the distinction between projection and introspection, and advances the hypothesis that durable self-regulation may require some form of generator-level self-modeling rather than merely reflective-seeming behavior expressed through language.

Scope and positioning

Across all three papers, the claims are intentionally bounded. This deposit does not claim that all current alignment methods are useless, that all current safety work is misguided, or that a fully specified replacement architecture is already complete. Instead, the package identifies a structural tension: alignment and safety often operate on outputs, refusals, policies, and deployment constraints, while the underlying generative process remains only partially modeled and compensatory safety layers accumulate around that gap. The central concern is therefore architectural rather than ideological: whether current output-oriented and layered safety practices may remain limited by target mismatch, interaction opacity, and incomplete access to the processes they seek to govern.

Intended audience

This deposit is intended for AI safety and alignment researchers, evaluation and governance teams, interpretability researchers, systems theorists, and institutional reviewers interested in deployment-layer risk, measurement clarity, and long-run architectural integrity.

Author

Christopher W. Copeland

Rights and use

This work is released for non-commercial research, discussion, and critical review. Attribution is required. Commercial use or incorporation into proprietary systems is not permitted without explicit written permission from the author.
