Preprint
Data sources: ZENODO

Grokking Has Finite Capacity: Measuring and Overcoming Limits on Simultaneous Algorithmic Discovery

Authors: Kearney, John

Abstract

Neural networks have a finite capacity for algorithmic discovery through grokking, the phenomenon where models generalize long after memorizing training data. Using modular arithmetic as a testbed, we show that a d=128 transformer reliably groks up to 5 simultaneous operations but collapses completely at 6. This collapse is not gradual: across all 5 random seeds, the model fails on every operation, including ones it trivially solves alone. We localize the interference to the shared embedding layer and show two interventions that recover full performance: separate per-operation embeddings (a minimal fix at this scale) and a split architecture with independent modules (a general solution). An automatic pipeline discovers optimal operation groupings through gradient similarity analysis with no domain expertise. The split achieves equivalent performance at roughly half the parameters of the smallest monolithic model that can match it. These results suggest that as models are asked to learn more diverse algorithms, some form of representational separation will be necessary to avoid capacity limits.
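The abstract describes a pipeline that discovers operation groupings via gradient similarity analysis, but does not spell out the algorithm. A minimal illustrative sketch of one plausible approach is below: compute a per-operation gradient vector, then greedily merge operations whose pairwise cosine similarity stays positive (i.e. whose gradients do not conflict). The function names, the greedy rule, and the threshold are all assumptions, not the paper's actual method.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two flattened gradient vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def group_tasks_by_gradient_similarity(task_grads, threshold=0.0):
    """Greedy grouping sketch (assumed, not the paper's algorithm):
    a task joins an existing group only if its gradient has cosine
    similarity above `threshold` with every member of that group;
    otherwise it starts a new group (its own module in a split model)."""
    groups = []
    for name, grad in task_grads.items():
        placed = False
        for group in groups:
            if all(cosine_similarity(grad, task_grads[m]) > threshold
                   for m in group):
                group.append(name)
                placed = True
                break
        if not placed:
            groups.append([name])
    return groups

# Toy example: addition and subtraction gradients roughly align,
# while multiplication's gradient points the opposite way.
task_grads = {
    "x+y": np.array([1.0, 0.5, 0.0]),
    "x-y": np.array([0.9, 0.6, 0.1]),
    "x*y": np.array([-1.0, -0.4, 0.2]),
}
print(group_tasks_by_gradient_similarity(task_grads))
# → [['x+y', 'x-y'], ['x*y']]
```

In a real setting the gradient vectors would come from backpropagating each operation's loss through the shared parameters; the grouping then decides which operations can safely share an embedding and which need their own module.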
