What Outermost Layers Do: A Cross-Architecture Mechanistic Study of Trained Transformers

Romero Castella, Ivan Alejandro

Preprint

Data sources: ZENODO

Publication: Preprint. Status: Under curation. Language: English. Publisher: Zenodo

Authors: Romero Castella, Ivan Alejandro

doi: 10.5281/zenodo.20575450

Abstract

Trained transformers process language through stacks of structurally identical layers, but those layers do not behave identically. The first and last few layers appear to do something qualitatively distinct from the layers in between, and exactly what they do has remained unclear. We characterize the outermost layers of three pretrained models (DistilBERT, BERT, and GPT-2) using geometric metrics, reconstructibility from context, and a combination of linear probing and causal ablation, with hypotheses pre-registered before any numbers were extracted. We report three findings. First, a sandwich pattern generalizes across the three architectures: entry and exit translator regions of near-fixed size surround a compositional core that absorbs additional depth. Second, the entry and exit translators operate in directionally opposite ways in the encoders and in the decoder. Third, the dominant principal direction of GPT-2's final layer, which captures roughly 35% of total variance, is orthogonal to part-of-speech, lexical, positional, and sentiment information. We close with observations on how these layer-wise differences relate to active questions about cross-model representation sharing.
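
As an illustration of what "a principal direction capturing a share of total variance" means, the following is a minimal sketch, not the author's code. It assumes hidden states are available as plain lists of floats, one vector per token. The function name `top_direction_variance_share`, the use of power iteration, and the toy data are all assumptions made for this example and do not come from the preprint.

```python
import math
import random


def top_direction_variance_share(rows, iters=200, seed=0):
    """Estimate the first principal direction and its share of total variance.

    rows: list of equal-length lists of floats (hypothetical hidden states,
    one vector per token). Power iteration is used here only to keep the
    sketch dependency-free; it is not claimed to be the preprint's method.
    """
    n, d = len(rows), len(rows[0])

    # Center each dimension so variance is measured around the mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]

    # Total variance is the trace of the covariance matrix.
    total_var = sum(v * v for r in centered for v in r) / n
    if total_var == 0.0:
        return [0.0] * d, 0.0

    # Random unit-length starting vector.
    rng = random.Random(seed)
    direction = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in direction))
    direction = [v / norm for v in direction]

    for _ in range(iters):
        # Apply the covariance implicitly: C v = X^T (X v) / n.
        proj = [sum(a * b for a, b in zip(r, direction)) for r in centered]
        new = [sum(p * r[j] for p, r in zip(proj, centered)) / n for j in range(d)]
        norm = math.sqrt(sum(v * v for v in new))
        if norm == 0.0:
            break
        direction = [v / norm for v in new]

    # Variance of the data projected onto the estimated direction.
    proj = [sum(a * b for a, b in zip(r, direction)) for r in centered]
    top_var = sum(p * p for p in proj) / n
    return direction, top_var / total_var


# Toy data (made up): variance 2.0 along the first axis, 0.5 along the second.
toy = [[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
direction, share = top_direction_variance_share(toy)
print(round(share, 3))  # prints 0.8
```

In this toy case the first axis carries 2.0 of the 2.5 units of total variance, so the share is 0.8. A value of roughly 0.35, as the abstract reports for GPT-2's final layer, would mean that a single direction accounts for about a third of the spread of the representations.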