ZENODO
Preprint · 2026
License: CC BY
Data sources: ZENODO

Alignment Robustness Depends More on Training than Architecture: A Cross-Vendor Analysis of Attention Specialization in Large Language Models

Authors: D'Elia, Davide

Abstract

We present a systematic empirical study examining how preference optimization methods (RLHF, DPO) affect attention head specialization across eight vendor families and more than 25 large language model variants. Using a standardized evaluation protocol (bfloat16 precision, three-seed cross-validation, and SHA-256–verified prompts), we quantify attention head diversity via the Specialization Index (SI) and compare base and instruction-tuned model pairs.

Main finding: Robustness to alignment-induced specialization loss is strongly associated with training methodology, following a consistent hierarchy: Training Methodology > Sliding Window Attention > Architecture > Scale.

Key results:

- SI reduction pattern: RLHF and DPO reduce SI in most model families lacking architectural protection (LLaMA-3.1: −56.3%; LLaMA-2: −7.95%), whereas models equipped with Sliding Window Attention maintain or increase specialization (Mistral: +4.2%).
- Architecture-dependent sensitivity: At matched scale, Grouped Query Attention exhibits approximately 5,800× higher sensitivity to random attention noise than Multi-Head Attention (ratio-of-means across three seeds; permutation test, p < 0.05).
- Training-based robustness: Synthetic training (Phi family) yields scale-invariant specialization (SI ≈ 0.33 across a 10.8× parameter range), and Qwen2 shows no observed recursive degradation within the tested 50-generation window.

This release includes 19 documented Jupyter notebooks that support the full experimental pipeline, 27 result JSON files, and command-line tools that enable end-to-end reproducibility. The paper text is released under CC-BY-4.0; accompanying code and tooling are released under the MIT License.
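The abstract quantifies head diversity via a Specialization Index but does not reproduce its definition on this page. As an illustrative sketch only, the snippet below assumes an entropy-based SI — one minus the normalized mean attention entropy per head, averaged over heads — so that sharply peaked (specialized) heads score near 1 and uniform (unspecialized) heads score near 0. The function names and the formula are assumptions for illustration, not the paper's actual metric.

```python
import numpy as np

def head_entropy(attn):
    """Mean Shannon entropy per head.

    attn: array of shape (heads, queries, keys); each row of
    attention weights is assumed to sum to 1.
    """
    eps = 1e-12  # avoid log(0)
    ent = -(attn * np.log(attn + eps)).sum(axis=-1)  # (heads, queries)
    return ent.mean(axis=-1)  # (heads,)

def specialization_index(attn):
    """Hypothetical SI: 1 - normalized mean entropy, averaged over heads.

    Entropy is normalized by log(n_keys), its maximum for a
    distribution over n_keys positions, so SI lies in [0, 1].
    """
    n_keys = attn.shape[-1]
    max_ent = np.log(n_keys)
    si_per_head = 1.0 - head_entropy(attn) / max_ent
    return float(si_per_head.mean())

# Toy usage: 4 heads, 8 queries, 16 keys, softmax over random logits.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8, 16))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
print(specialization_index(attn))
```

Under this assumed definition, a base/instruct comparison like the one in the study would amount to computing SI on attention maps extracted from each model of a pair and reporting the relative change.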

Keywords

attention mechanisms, specialization index, large language models, RLHF, alignment, transformers, MHA, GQA, reproducibility
