
Trained transformers process language through stacks of structurally identical layers, yet those layers do not behave identically. The first and last few layers appear to do something qualitatively different from the layers in between, but what exactly they do has remained unclear. We characterize the outermost layers of three pretrained models (DistilBERT, BERT, and GPT-2) using geometric metrics, reconstructibility from context, and a combination of linear probing and causal ablation, with all hypotheses pre-registered before any numbers were extracted. We report three findings. First, a sandwich pattern generalizes across the three architectures: a compositional core absorbs additional depth, while the translator regions at entry and exit retain near-fixed size. Second, the entry and exit translators operate in directionally opposite ways in the encoders versus the decoder. Third, the dominant principal direction of GPT-2's final layer, which captures roughly 35% of total variance, is orthogonal to part-of-speech, lexical, positional, and sentiment information. We close with observations on how these layer-wise differences relate to active questions about cross-model representation sharing.
