Preprint
Data sources: ZENODO

K-Operators: A Linear-Time Sequence Mixer with Learned Decayed Positional Kernels

Authors: Koneko, Aileen


Abstract

We introduce K-Operators, a sequence modeling architecture designed for linear-time execution, combining learned exponential decay with learnable positional kernels. The core K2 layer decomposes sequence mixing into two complementary paths: (1) a low-rank gamma-decayed recurrent interaction with per-channel learned decay rates spanning short to long memory, and (2) a learnable causal base kernel Kbase providing asymmetric local correction that exponential decay alone cannot express.

Systematic ablation across tokenization granularities reveals that removing either component degrades performance even under equal parameter budgets: on WikiText-2 (subword), the full architecture achieves 19.99 ± 0.09 PPL at 4.08M parameters (5-seed sweep) vs. 20.99 PPL for an equal-capacity model without Kbase; on Tiny Shakespeare (character-level), 4.41 ± 0.01 PPL at 0.81M parameters (5-seed sweep) vs. 4.78 PPL without Kbase, within 0.06 PPL of a 10.65M-parameter Transformer baseline. The optimal contribution of Kbase scales inversely with token granularity (∼4% for character-level, ∼0.5% for subword) but is never zero. This ratio is discovered automatically via gradient descent with a sigmoid floor that acts as implicit architectural regularization.

Uncapping the gamma decay range from [0.85, 0.995] to [0.15, 0.995] yields substantial gains: the model learns to use the full spectrum, with some channels selecting γ ≈ 0.15 (2-token effective window) while others maintain γ > 0.99 (100+ token memory). The architecture does not require explicit positional encodings; positional information is instead captured implicitly through the learned causal kernel structure.

We also describe an iterative equilibrium refinement loop with learned step-size η. While mathematically motivated, ablation shows that refinement consistently hurts performance in our experiments; we document it for completeness and future investigation.
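As a rough illustration of the two-path decomposition described in the abstract, here is a minimal PyTorch sketch of a K2-style layer. Everything here is our reading of the abstract, not the authors' code: the class name, the scaled-sigmoid parametrization of γ over [0.15, 0.995], the depthwise causal convolution standing in for Kbase, and the sigmoid-floored mixing weight are all assumptions, and the low-rank projections around the recurrent path are omitted for brevity.

```python
# Minimal sketch of a K2-style layer (illustrative; not the authors' code).
# Assumed details: per-channel decay gamma in [0.15, 0.995] via a scaled
# sigmoid, a short learnable causal kernel Kbase realized as a depthwise
# convolution, and a sigmoid-floored mixing weight for the Kbase path.
import torch
import torch.nn as nn
import torch.nn.functional as F

class K2LayerSketch(nn.Module):
    def __init__(self, d_model: int, kernel_size: int = 8,
                 gamma_min: float = 0.15, gamma_max: float = 0.995,
                 mix_floor: float = 0.005):
        super().__init__()
        self.gamma_min, self.gamma_max = gamma_min, gamma_max
        self.mix_floor = mix_floor
        # Unconstrained decay logits, one per channel.
        self.gamma_logit = nn.Parameter(torch.zeros(d_model))
        # Learnable causal base kernel Kbase (one filter per channel).
        self.base_kernel = nn.Parameter(torch.randn(d_model, 1, kernel_size) * 0.02)
        # Mixing logit for the Kbase path.
        self.mix_logit = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        B, T, D = x.shape
        # Map logits into the allowed decay range [gamma_min, gamma_max].
        gamma = self.gamma_min + (self.gamma_max - self.gamma_min) \
            * torch.sigmoid(self.gamma_logit)
        # Path 1: gamma-decayed recurrence h_t = gamma * h_{t-1} + x_t.
        # Written as a loop for clarity; a real linear-time implementation
        # would use a parallel scan or FFT-based convolution instead.
        h = torch.zeros(B, D, device=x.device, dtype=x.dtype)
        decayed = []
        for t in range(T):
            h = gamma * h + x[:, t]
            decayed.append(h)
        decayed = torch.stack(decayed, dim=1)
        # Path 2: causal depthwise convolution with the learned kernel Kbase.
        # Left-pad so position t only sees positions <= t (causality).
        k = self.base_kernel.size(-1)
        xc = F.pad(x.transpose(1, 2), (k - 1, 0))
        local = F.conv1d(xc, self.base_kernel, groups=D).transpose(1, 2)
        # Sigmoid floor: the Kbase contribution can shrink but never hit zero.
        alpha = self.mix_floor + (1 - self.mix_floor) * torch.sigmoid(self.mix_logit)
        return (1 - alpha) * decayed + alpha * local
```

To make the decay-window claim concrete (our arithmetic, under one plausible reading): the recurrence h_t = γ h_{t-1} + x_t weights a token j steps in the past by γ^j. For γ = 0.15, γ^2 ≈ 0.02, so contributions are negligible after roughly two tokens; for γ = 0.99, γ^100 ≈ 0.37, so tokens 100 steps back still carry substantial weight.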
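The abstract describes the equilibrium refinement loop only at a high level, so the following is a guess at its general shape purely to fix ideas: a damped fixed-point iteration with a learned scalar step size η. The class name, the map f, the step count, and the softplus parametrization of η are all our assumptions.

```python
# Hypothetical sketch of the iterative equilibrium refinement loop.
# f is any sequence-mixing map (for instance, a K2-style layer);
# eta is a learned scalar step size, kept positive via softplus.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EquilibriumRefineSketch(nn.Module):
    def __init__(self, f: nn.Module, n_steps: int = 3):
        super().__init__()
        self.f = f
        self.n_steps = n_steps
        self.eta_raw = nn.Parameter(torch.tensor(0.0))

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        eta = F.softplus(self.eta_raw)
        for _ in range(self.n_steps):
            # Damped update toward a fixed point y* = f(y*).
            y = y + eta * (self.f(y) - y)
        return y
```

Note that the abstract reports this refinement consistently hurt performance in the authors' ablations, so it is documented as a negative result rather than a recommended component.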
