
This master’s thesis explores the trade-off between computational efficiency and modeling performance in Transformer architectures by studying and improving attention mechanisms. The work consists of two parts: (1) a comparative empirical study evaluating five attention variants (standard Multi-Head Attention, FlashAttention, Sparse Attention, Sliding-Window Attention, and Linear Attention) across multiple Transformer model families, including encoder-only (BERT-style), decoder-only (GPT-style), and full encoder–decoder architectures; this cross-architecture evaluation highlights how each mechanism behaves under different structural constraints and reveals that linear attention consistently underperforms in decoder-only setups because of its weaker long-range modeling; and (2) a technical contribution proposing two new hybrid mechanisms, Linear Sparse Attention and Linear Sliding-Window Attention, which enhance the expressiveness of linear attention while preserving its linear-time complexity. Experiments show that both hybrids significantly outperform standard linear attention and narrow the performance gap to full attention, offering a promising path toward efficient, scalable Transformer models deployable in resource-constrained settings.
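The thesis itself details the hybrid formulations; as a rough illustration of the idea, the sketch below combines a kernel feature-map linear-attention path (global context at O(N) cost) with a local sliding-window softmax path. The feature map (elu + 1, as in common linear-attention formulations), the fixed mixing weight `gate`, the non-causal setting, and all function names are illustrative assumptions, not the thesis's actual design, which may use learned gates, causal masking, or a different kernel.

```python
import torch
import torch.nn.functional as F

def feature_map(x):
    # elu(x) + 1 keeps features positive, a common kernel choice for linear
    # attention; the thesis may use a different feature map.
    return F.elu(x) + 1.0

def linear_attention(q, k, v, eps=1e-6):
    # q, k, v: (batch, seq_len, dim). Non-causal linear attention:
    # softmax(QK^T)V is replaced by phi(Q)(phi(K)^T V), normalized per query,
    # which costs O(N * d^2) instead of O(N^2 * d).
    q, k = feature_map(q), feature_map(k)
    kv = torch.einsum("bnd,bne->bde", k, v)  # (batch, d, d) key/value summary
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps)  # normalizer
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)

def sliding_window_attention(q, k, v, window=64):
    # Exact softmax attention restricted to a local band: each query attends
    # only to keys within `window` positions. This toy version materializes
    # the full N x N score matrix for clarity; an efficient kernel would
    # compute only the band, for O(N * window * d) cost.
    b, n, d = q.shape
    scores = torch.einsum("bqd,bkd->bqk", q, k) / d ** 0.5
    idx = torch.arange(n, device=q.device)
    mask = (idx[None, :] - idx[:, None]).abs() > window  # True = masked out
    scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

def linear_sliding_window_attention(q, k, v, window=64, gate=0.5):
    # Hypothetical hybrid: blend the global linear path with the local exact
    # path. `gate` is a placeholder mixing weight; the thesis's mechanism may
    # instead use learned gates or split the paths across heads.
    return gate * sliding_window_attention(q, k, v, window) + \
        (1.0 - gate) * linear_attention(q, k, v)

# Usage: a single head over a toy batch of sequences.
q = torch.randn(2, 128, 32)
k = torch.randn(2, 128, 32)
v = torch.randn(2, 128, 32)
out = linear_sliding_window_attention(q, k, v)
print(out.shape)  # torch.Size([2, 128, 32])
```

The intuition matches the abstract: the linear path supplies global context at linear cost, while the windowed path restores the precise local interactions that pure linear attention tends to blur.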
Keywords: Transformers, Attention Mechanisms, Linear Attention, Hybrid Attention, Deep Learning, Large Language Models, BERT, GPT, Encoder-Decoder Models, Machine Learning
