Transformers, parallel computation, and logarithmic depth
60 citations · #244 of 2635 papers in ICML 2024
Abstract
We show that a constant number of self-attention layers can efficiently simulate—and be simulated by—a constant number of communication rounds of Massively Parallel Computation. As a consequence, we show that logarithmic depth is sufficient for transformers to solve basic computational tasks that cannot be efficiently solved by several other neural sequence models and sub-quadratic transformer approximations. We thus establish parallelism as a key distinguishing property of transformers.
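To make the parallel-computation analogy in the abstract concrete, here is a minimal NumPy sketch (not taken from the paper) of a single self-attention layer: every token aggregates information from all other tokens in one step, which is the per-layer, all-to-all exchange that the abstract compares to a round of Massively Parallel Computation. The dimensions, weight matrices, and function names below are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_layer(X, Wq, Wk, Wv):
    """One self-attention layer over a sequence X of shape (n, d).

    Every token attends to every other token in a single application,
    so one layer corresponds to one parallel, all-to-all aggregation step.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (n, n) all-pairs interactions
    return softmax(scores) @ V               # each row: weighted aggregate of all tokens

# Toy usage: n tokens of dimension d; one layer = one "round" of communication.
rng = np.random.default_rng(0)
n, d = 8, 16
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Y = self_attention_layer(X, Wq, Wk, Wv)
print(Y.shape)  # (8, 16)
```

Stacking a logarithmic number of such layers therefore gives a logarithmic number of these all-to-all rounds, which is the regime the abstract argues suffices for the tasks considered.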
Citation History: 0 citations on Jan 28, 2026; 60 citations on Feb 13, 2026.