Preprint
Data sources: ZENODO

Grokking Has Finite Capacity: Measuring and Overcoming Limits on Simultaneous Algorithmic Discovery

Authors: Kearney, John

Abstract

Neural networks have a finite capacity for algorithmic discovery through grokking, the phenomenon where models generalize long after memorizing training data. Using modular arithmetic as a testbed, we show that a d=128 transformer reliably groks up to 5 simultaneous operations but collapses completely at 6. This collapse is not gradual: across all 5 random seeds, the model fails on every operation, including ones it trivially solves alone. We localize the interference to the shared embedding layer and show two interventions that recover full performance: separate per-operation embeddings (a minimal fix at this scale) and a split architecture with independent modules (a general solution). An automatic pipeline discovers optimal operation groupings through gradient similarity analysis with no domain expertise. The split achieves equivalent performance at roughly half the parameters of the smallest monolithic model that can match it. These results suggest that as models are asked to learn more diverse algorithms, some form of representational separation will be necessary to avoid capacity limits.
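The abstract describes a pipeline that discovers operation groupings via gradient similarity analysis, but does not spell out the algorithm. A minimal illustrative sketch of one plausible approach is below: compute a per-operation gradient vector, then greedily merge operations whose pairwise cosine similarity stays positive (i.e. whose gradients do not conflict). The function names, the greedy rule, and the threshold are all assumptions, not the paper's actual method.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two flattened gradient vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def group_tasks_by_gradient_similarity(task_grads, threshold=0.0):
    """Greedy grouping sketch (assumed, not the paper's algorithm):
    a task joins an existing group only if its gradient has cosine
    similarity above `threshold` with every member of that group;
    otherwise it starts a new group (its own module in a split model)."""
    groups = []
    for name, grad in task_grads.items():
        placed = False
        for group in groups:
            if all(cosine_similarity(grad, task_grads[m]) > threshold
                   for m in group):
                group.append(name)
                placed = True
                break
        if not placed:
            groups.append([name])
    return groups

# Toy example: addition and subtraction gradients roughly align,
# while multiplication's gradient points the opposite way.
task_grads = {
    "x+y": np.array([1.0, 0.5, 0.0]),
    "x-y": np.array([0.9, 0.6, 0.1]),
    "x*y": np.array([-1.0, -0.4, 0.2]),
}
print(group_tasks_by_gradient_similarity(task_grads))
# → [['x+y', 'x-y'], ['x*y']]
```

In a real setting the gradient vectors would come from backpropagating each operation's loss through the shared parameters; the grouping then decides which operations can safely share an embedding and which need their own module.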
