
Proof-of-concept implementation and experimental results for the Alignment Stress Map (ASM), a runtime monitoring architecture that derives per-token, per-layer sensitivity attribution from the weight delta between a base language model and its alignment-tuned counterpart. The implementation uses the Qwen 2.5-0.5B base/instruct pair and CPU-only compute. Key findings: (1) Safety-relevant correction signal concentrates in middle-layer attention sublayers (layers 9-15 of a 24-layer model); all top-10 discriminating sublayers are attention rather than MLP. (2) The weight delta responds primarily to social-manipulation and instruction-override framing tokens rather than to dangerous content words. (3) The distribution shape of signed attribution across token positions (boundary-concentrated for benign prompts, interior-distributed for adversarial prompts) is a novel detection signal (Cohen's d = 2.66, N = 18). The theoretical framework did not predict this signal; it emerged during the experimental progression documented here. This is a pilot study on a minimal model. The results establish that the signal exists and warrant validation at larger scale with established adversarial benchmarks.
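For readers unfamiliar with the effect size quoted in finding (3), the following is a minimal sketch of Cohen's d using the standard pooled-standard-deviation definition. It does not reproduce the study's pipeline: the function name `cohens_d` and the example scores are hypothetical, and the abstract does not say which variant of the statistic was used or how the 18 prompts were split between groups.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d for two independent samples, using the pooled sample SD."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two observations")
    # statistics.variance is the sample variance (n - 1 denominator).
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; d is undefined")
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical per-prompt scores (NOT data from the study): for example, the
# fraction of signed attribution mass falling on interior token positions.
adversarial_scores = [0.62, 0.71, 0.58, 0.66]
benign_scores = [0.31, 0.40, 0.35, 0.28]

if __name__ == "__main__":
    print(f"d = {cohens_d(adversarial_scores, benign_scores):.2f}")
```

As a hand-checkable case, `cohens_d([2, 4, 6], [1, 3, 5])` gives 0.5: the means differ by 1 and both sample variances are 4, so the pooled standard deviation is 2. With N = 18 any such estimate carries wide uncertainty, which is consistent with the abstract's framing of the work as a pilot.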
