xLSTM: Extended Long Short-Term Memory

Keywords: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)

Maximilian Beck; Korbinian Pöppel; Markus Spanring; Andreas Auer; Oleksandra Prudnikova; Michael Kopp; Günter Klambauer; Johannes Brandstetter; Sepp Hochreiter



arXiv.org e-Print Archive | Preprint, 2024 | Data sources: arXiv.org e-Print Archive

https://doi.org/10.52202/07901... | Article, 2024, Peer-reviewed | Data sources: Crossref

https://dx.doi.org/10.48550/ar... | Article, 2024 | License: CC BY | Data sources: Datacite

DBLP | Article | Data sources: DBLP

DBLP | Conference object | Data sources: DBLP


Publication: Article, Preprint, Conference object
Date: 01 Jan 2024
Embargo end date: 01 Jan 2024
Publisher: Neural Information Processing Systems Foundation, Inc. (NeurIPS)
Journal: Advances in Neural Information Processing Systems 37


doi: 10.52202/079017-3417, 10.48550/arxiv.2405.04517

arXiv: 2405.04517


Abstract

In the 1990s, the constant error carousel and gating were introduced as the central ideas of the Long Short-Term Memory (LSTM). Since then, LSTMs have stood the test of time and contributed to numerous deep learning success stories; in particular, they constituted the first Large Language Models (LLMs). However, the advent of the Transformer technology with parallelizable self-attention at its core marked the dawn of a new era, outpacing LSTMs at scale. We now raise a simple question: How far do we get in language modeling when scaling LSTMs to billions of parameters, leveraging the latest techniques from modern LLMs, but mitigating known limitations of LSTMs? Firstly, we introduce exponential gating with appropriate normalization and stabilization techniques. Secondly, we modify the LSTM memory structure, obtaining (i) sLSTM with a scalar memory, a scalar update, and new memory mixing, and (ii) mLSTM that is fully parallelizable with a matrix memory and a covariance update rule. Integrating these LSTM extensions into residual block backbones yields xLSTM blocks that are then residually stacked into xLSTM architectures. Exponential gating and modified memory structures boost xLSTM capabilities to perform favorably when compared to state-of-the-art Transformers and State Space Models, both in performance and scaling.

Code available at https://github.com/NX-AI/xlstm
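
To make "exponential gating with appropriate normalization and stabilization" concrete, here is a minimal sketch of one scalar-memory cell step in plain Python. It is not taken from the linked repository. It assumes the exponential variant of the forget gate, takes the cell input `z` and output gate `o` as already-activated values, and leaves out memory mixing (the recurrent connections) entirely. The names `slstm_scalar_step` and `run_cell`, and the initial stabilizer state of negative infinity, are illustrative assumptions.

```python
import math


def slstm_scalar_step(c, n, m, z, i_pre, f_pre, o):
    """One stabilized step of a scalar cell with exponential input and forget gates.

    c, n, m: previous cell state, normalizer state, and stabilizer state.
    z: cell input; i_pre, f_pre: gate pre-activations; o: output gate value.
    """
    # Stabilizer: running maximum in the log domain of the two gate paths.
    m_new = max(f_pre + m, i_pre)
    # Stabilized gates: both exponents are <= 0, so exp() cannot overflow.
    i_gate = math.exp(i_pre - m_new)
    f_gate = math.exp(f_pre + m - m_new)
    # Scalar memory update and normalizer update share the same gates.
    c_new = f_gate * c + i_gate * z
    n_new = f_gate * n + i_gate
    # Normalization: the common scaling factor cancels in the ratio c / n.
    h = o * c_new / n_new
    return c_new, n_new, m_new, h


def run_cell(zs, i_pres, f_pres, o=1.0):
    """Run the cell over a sequence; initial state is an assumption of this sketch."""
    c, n, m = 0.0, 0.0, float("-inf")
    hs = []
    for z, i_pre, f_pre in zip(zs, i_pres, f_pres):
        c, n, m, h = slstm_scalar_step(c, n, m, z, i_pre, f_pre, o)
        hs.append(h)
    return hs


if __name__ == "__main__":
    # Pre-activations this large would overflow a naive exp(); the stabilized step does not.
    print(run_cell([0.5, -0.3], [1000.0, 1000.0], [1000.0, 1000.0]))
```

Because the stabilizer rescales `c` and `n` by the same factor, the output `h` matches the unstabilized recurrence whenever the latter is computable, while the stabilized form stays finite for large gate pre-activations.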


4 Research products, page 1 of 1

RedPajama-Data software on GitHub
IsRelatedTo
mamba software on GitHub
IsRelatedTo
RWKV-LM software on GitHub
IsRelatedTo
flash-linear-attention software on GitHub
IsRelatedTo

Impact by BIP!

selected citations: 9. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Top 10%. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Top 10%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.


Open access route: Green
