Linear attention is (maybe) all you need (to understand Transformer optimization)

arXiv:2310.01082 · 79 citations · ranked #373 of 2297 papers in ICLR 2024

Abstract

Transformer training is notoriously difficult, requiring careful optimizer design and the use of various heuristics. We make progress towards understanding the subtleties of training Transformers by carefully studying a simple yet canonical linearized shallow Transformer model. Specifically, we train linear Transformers to solve regression tasks, inspired by J. von Oswald et al. (ICML 2023) and K. Ahn et al. (NeurIPS 2023). Most importantly, we observe that our proposed linearized models can reproduce several prominent aspects of Transformer training dynamics. Consequently, the results obtained in this paper suggest that a simple linearized Transformer model could actually be a valuable, realistic abstraction for understanding Transformer optimization.
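The setup the abstract describes can be pictured with a minimal sketch: an attention layer with the softmax removed (linear attention), trained on in-context linear regression prompts. The layer structure, token format, and dimensions below are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class LinearAttention(nn.Module):
    """One attention head with the softmax removed (linear attention)."""

    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        # x: (batch, seq_len, dim); attention scores are used directly, no softmax.
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / x.shape[1]
        return x + scores @ v  # residual update


def regression_prompt(batch, n_points, d):
    """In-context linear regression prompt (assumed format): tokens are (x_i, y_i)
    pairs followed by a query x whose label slot is zeroed out."""
    w = torch.randn(batch, d, 1)
    xs = torch.randn(batch, n_points + 1, d)
    ys = xs @ w                      # (batch, n_points + 1, 1)
    target = ys[:, -1].clone()       # label of the query token
    ys[:, -1] = 0.0                  # hide the query's label from the model
    return torch.cat([xs, ys], dim=-1), target
```

A few such layers can be stacked and trained with Adam on the squared error between the model's output at the query position and the hidden label; this is the kind of simple, controlled setting in which the paper studies Transformer optimization behavior.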

Citation History

Jan 28, 2026: 0 citations · Feb 13, 2026: 79 citations