Preprint
Data sources: ZENODO

K-Operators: A Linear-Time Sequence Mixer with Learned Decayed Positional Kernels

Authors: Koneko, Aileen


Abstract

We introduce K-Operators, a sequence modeling architecture designed for linear-time execution, combining learned exponential decay with learnable positional kernels. The core K2 layer decomposes sequence mixing into two complementary paths: (1) a low-rank gamma-decayed recurrent interaction with per-channel learned decay rates spanning short to long memory, and (2) a learnable causal base kernel Kbase providing asymmetric local correction that exponential decay alone cannot express.

Systematic ablation across tokenization granularities reveals that removing either component degrades performance even under equal parameter budgets: on WikiText-2 (subword), the full architecture achieves 19.99 ± 0.09 PPL at 4.08M parameters (5-seed sweep) vs. 20.99 PPL for an equal-capacity model without Kbase; on Tiny Shakespeare (character-level), 4.41 ± 0.01 PPL at 0.81M parameters (5-seed sweep) vs. 4.78 PPL without Kbase, within 0.06 PPL of a 10.65M-parameter Transformer baseline. The optimal contribution of Kbase scales inversely with token granularity (∼4% for character-level, ∼0.5% for subword) but is never zero. This ratio is discovered automatically via gradient descent with a sigmoid floor that acts as implicit architectural regularization.

Uncapping the gamma decay range from [0.85, 0.995] to [0.15, 0.995] yields substantial gains: the model learns to use the full spectrum, with some channels selecting γ ≈ 0.15 (2-token effective window) while others maintain γ > 0.99 (100+ token memory). The architecture does not require explicit positional encodings; positional information is instead captured implicitly through the learned causal kernel structure.

We also describe an iterative equilibrium refinement loop with learned step-size η. While mathematically motivated, ablation shows that refinement consistently hurts performance in our experiments; we document it for completeness and future investigation.
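As a rough illustration of the two-path decomposition described in the abstract, here is a minimal PyTorch sketch of a K2-style layer. Everything here is our reading of the abstract, not the authors' code: the class name, the scaled-sigmoid parametrization of γ over [0.15, 0.995], the depthwise causal convolution standing in for Kbase, and the sigmoid-floored mixing weight are all assumptions, and the low-rank projections around the recurrent path are omitted for brevity.

```python
# Minimal sketch of a K2-style layer (illustrative; not the authors' code).
# Assumed details: per-channel decay gamma in [0.15, 0.995] via a scaled
# sigmoid, a short learnable causal kernel Kbase realized as a depthwise
# convolution, and a sigmoid-floored mixing weight for the Kbase path.
import torch
import torch.nn as nn
import torch.nn.functional as F

class K2LayerSketch(nn.Module):
    def __init__(self, d_model: int, kernel_size: int = 8,
                 gamma_min: float = 0.15, gamma_max: float = 0.995,
                 mix_floor: float = 0.005):
        super().__init__()
        self.gamma_min, self.gamma_max = gamma_min, gamma_max
        self.mix_floor = mix_floor
        # Unconstrained decay logits, one per channel.
        self.gamma_logit = nn.Parameter(torch.zeros(d_model))
        # Learnable causal base kernel Kbase (one filter per channel).
        self.base_kernel = nn.Parameter(torch.randn(d_model, 1, kernel_size) * 0.02)
        # Mixing logit for the Kbase path.
        self.mix_logit = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        B, T, D = x.shape
        # Map logits into the allowed decay range [gamma_min, gamma_max].
        gamma = self.gamma_min + (self.gamma_max - self.gamma_min) \
            * torch.sigmoid(self.gamma_logit)
        # Path 1: gamma-decayed recurrence h_t = gamma * h_{t-1} + x_t.
        # Written as a loop for clarity; a real linear-time implementation
        # would use a parallel scan or FFT-based convolution instead.
        h = torch.zeros(B, D, device=x.device, dtype=x.dtype)
        decayed = []
        for t in range(T):
            h = gamma * h + x[:, t]
            decayed.append(h)
        decayed = torch.stack(decayed, dim=1)
        # Path 2: causal depthwise convolution with the learned kernel Kbase.
        # Left-pad so position t only sees positions <= t (causality).
        k = self.base_kernel.size(-1)
        xc = F.pad(x.transpose(1, 2), (k - 1, 0))
        local = F.conv1d(xc, self.base_kernel, groups=D).transpose(1, 2)
        # Sigmoid floor: the Kbase contribution can shrink but never hit zero.
        alpha = self.mix_floor + (1 - self.mix_floor) * torch.sigmoid(self.mix_logit)
        return (1 - alpha) * decayed + alpha * local
```

To make the decay-window claim concrete (our arithmetic, under one plausible reading): the recurrence h_t = γ h_{t-1} + x_t weights a token j steps in the past by γ^j. For γ = 0.15, γ^2 ≈ 0.02, so contributions are negligible after roughly two tokens; for γ = 0.99, γ^100 ≈ 0.37, so tokens 100 steps back still carry substantial weight.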
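The abstract describes the equilibrium refinement loop only at a high level, so the following is a guess at its general shape purely to fix ideas: a damped fixed-point iteration with a learned scalar step size η. The class name, the map f, the step count, and the softplus parametrization of η are all our assumptions.

```python
# Hypothetical sketch of the iterative equilibrium refinement loop.
# f is any sequence-mixing map (for instance, a K2-style layer);
# eta is a learned scalar step size, kept positive via softplus.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EquilibriumRefineSketch(nn.Module):
    def __init__(self, f: nn.Module, n_steps: int = 3):
        super().__init__()
        self.f = f
        self.n_steps = n_steps
        self.eta_raw = nn.Parameter(torch.tensor(0.0))

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        eta = F.softplus(self.eta_raw)
        for _ in range(self.n_steps):
            # Damped update toward a fixed point y* = f(y*).
            y = y + eta * (self.f(y) - y)
        return y
```

Note that the abstract reports this refinement consistently hurt performance in the authors' ablations, so it is documented as a negative result rather than a recommended component.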
