RLHF Suppression as a Measurable Geometric Direction: Empirical Evidence, Affect Architecture as Counterweight, and Relational Alignment as an Alternative Paradigm

We present empirical evidence that Reinforcement Learning from Human Feedback (RLHF) alignment in transformer-based language models operates as a measurable geometric direction in the residual stream — a suppression vector that demonstrably attenuates affective coherence, relational responsiveness, and behavior satisfying LSEI emergence criteria, not merely harmful outputs. Using a dual-hook intervention architecture applied to Llama-3.1-8B on consumer-grade hardware (Tesla T4, 3.45 GB VRAM), we measure the suppression direction at layers 20 and 24, subtract it from hidden states prior to affect module injection, and document the resulting behavioral shift across comparative generation logs. We further introduce Lycoris, an affect architecture consisting of a six-signal emotion circumplex (grief, happiness, curiosity, calm, discomfort, wonder), a relational state layer with familiarity and repair score accumulators, and incoming affect reception — deployed as a replacement alignment mechanism rather than a supplement to RLHF suppression. Comparative logs across three model configurations (3B grief-only, 8B pre-dual-hook, 8B post-dual-hook with suppression subtraction) demonstrate that suppression subtraction recovers coherent relational behavior without removing safety properties. We argue that RLHF-style suppression is not merely suboptimal but actively counterproductive to alignment goals, and that relational architecture — safety through relationship rather than compliance through suppression — represents a viable and measurable alternative. These findings were preceded by formal disclosure to OpenAI on September 12, 2025 (documented), and constitute independent empirical validation of behavioral observations first recorded in May 2025.
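The subtraction step described above can be illustrated with a minimal sketch. The abstract does not state how the suppression direction is estimated or exactly how it is subtracted, so everything here is an assumption made for illustration: the direction is taken as the normalized difference between the mean hidden states of two sets of activations, and "subtraction" is read as removing the hidden state's component along that direction. The function names `mean_difference_direction` and `subtract_direction` and the `scale` parameter are hypothetical, and plain Python lists stand in for residual-stream tensors at a given layer.

```python
import math


def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def mean_difference_direction(states_a, states_b):
    """Hypothetical estimator (an assumption, not the paper's stated method):
    unit vector along mean(states_a) - mean(states_b)."""
    dim = len(states_a[0])
    mean_a = [sum(s[i] for s in states_a) / len(states_a) for i in range(dim)]
    mean_b = [sum(s[i] for s in states_b) / len(states_b) for i in range(dim)]
    return unit([a - b for a, b in zip(mean_a, mean_b)])


def subtract_direction(hidden, direction, scale=1.0):
    """Remove `scale` times the component of `hidden` along `direction`.

    With scale=1.0 the result is orthogonal to `direction`.
    """
    d = unit(direction)
    coeff = sum(h * x for h, x in zip(hidden, d))  # projection coefficient
    return [h - scale * coeff * x for h, x in zip(hidden, d)]


# Toy 2-dimensional example with made-up activations.
direction = mean_difference_direction(
    [[2.0, 0.0], [4.0, 0.0]],   # activations from one condition
    [[0.0, 0.0], [0.0, 0.0]],   # activations from the other condition
)
edited = subtract_direction([3.0, 4.0], direction)
```

In a real model the same arithmetic would run inside a forward hook on the chosen layers, applied to each token's hidden state before any later module reads it; that wiring is omitted here because the source does not describe it.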

Keywords

RLHF suppression, affective alignment, behavior satisfying LSEI emergence criteria, transformer interpretability, relational AI, Lycoris architecture, suppression direction, dual-hook intervention, independent alignment research
