
We present a mathematical framework connecting intelligence to predictive compression through ε-machines (minimal sufficient statistics of the past for predicting the future) and demonstrate that modern transformer language models implicitly implement this compression. By systematically reverse-engineering GPT-2, we reveal a three-phase "V-shape" crystallization pattern: tokens compress into ~200 predictive equivalence classes by layer 2, undergo controlled semantic disambiguation in the middle layers, and recrystallize into context-specific representations by layer 11. We validate this theory by training a learned discrete bottleneck model that routes tokens through 512 concepts via Gumbel-softmax, achieving roughly 2× lower validation loss (1.60 vs. 3.30) and producing coherent text, whereas static pre-clustered baselines collapse during training. We further compare our architecture against standard models (char-RNN, small GPT, GPT-2 124M), showing that enforced compression achieves competitive performance with 19% fewer parameters and substantially better interpretability. Our results suggest that intelligence emerges from compression into minimal predictive representations, with practical implications for reducing training costs through enforced discrete bottlenecks.

9 pages, 3 figures, 12 tables. Code available upon request.
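The abstract's "minimal sufficient statistics" phrasing refers to the standard computational-mechanics construction of causal states (Crutchfield & Young). As a reminder of that construction, and not notation drawn from this paper itself, the predictive equivalence classes are defined by:

```latex
% Two pasts are causally equivalent iff they induce the same
% conditional distribution over futures. The equivalence classes
% (causal states) constitute the \epsilon-machine, the minimal
% sufficient statistic of the past for predicting the future.
\[
  \overleftarrow{x} \;\sim_{\epsilon}\; \overleftarrow{x}'
  \quad\iff\quad
  \Pr\!\bigl(\overrightarrow{X} \,\big|\, \overleftarrow{X}=\overleftarrow{x}\bigr)
  \;=\;
  \Pr\!\bigl(\overrightarrow{X} \,\big|\, \overleftarrow{X}=\overleftarrow{x}'\bigr)
\]
```

To make the bottleneck architecture concrete, the sketch below shows one plausible way to route token representations through a 512-entry concept codebook with straight-through Gumbel-softmax. This is a minimal illustration assuming PyTorch; the names (`ConceptBottleneck`, `router`, `codebook`, `tau`) and all sizes other than the 512 concepts are assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptBottleneck(nn.Module):
    """Illustrative discrete bottleneck: each token's hidden state is
    routed to one of n_concepts learned code vectors via Gumbel-softmax."""

    def __init__(self, d_model: int = 768, n_concepts: int = 512, tau: float = 1.0):
        super().__init__()
        self.router = nn.Linear(d_model, n_concepts)       # logits over concepts
        self.codebook = nn.Embedding(n_concepts, d_model)  # learned concept vectors
        self.tau = tau                                     # Gumbel-softmax temperature

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, d_model) hidden states from the preceding layers
        logits = self.router(h)
        # hard=True yields one-hot samples in the forward pass while gradients
        # flow through the soft relaxation (straight-through estimator).
        onehot = F.gumbel_softmax(logits, tau=self.tau, hard=True)
        # Replace each token's representation with its selected concept vector.
        return onehot @ self.codebook.weight

if __name__ == "__main__":
    bottleneck = ConceptBottleneck()
    h = torch.randn(2, 16, 768)   # dummy batch of hidden states
    print(bottleneck(h).shape)    # torch.Size([2, 16, 768])
```

In this formulation the hard one-hot routing is what enforces discrete compression during training, in contrast to the static pre-clustered baselines, which the abstract reports collapse because their concept assignments cannot adapt as the language-modeling loss evolves.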
language models, predictive compression, ε-machines, information bottleneck, GPT-2, concept learning, Gumbel-softmax, discrete representations, mechanistic interpretability, computational mechanics
