Learning to Keep a Promise: Scaling Language Model Decoding Parallelism with Learned Asynchronous Decoding

arXiv:2502.11517 · ICML 2025

Abstract

Decoding with autoregressive language models traditionally occurs sequentially, generating one token after another. Recent attempts to introduce parallelism require a pre-determined structure in the generated content, for example by pattern-matching on bullet points. In this work, we present a new technique that automates parallel generation by dynamically exploiting the semantic independence of generation outputs to implement asynchronous decoding. We introduce Pasta-Lang, an annotation language that lets language models initiate asynchronous decoding at inference time, together with an accompanying Pasta-Lang interpreter that performs on-the-fly asynchronous decoding, effectively implementing parallel generation and speeding up inference. We also present an instruction-finetuning dataset of Pasta-Lang-annotated responses for teaching LLMs to annotate semantic independence, along with the methodology used to create it. Our evaluation shows that using the interpreter with a Pasta-Lang-equipped model achieves significant speedup while maintaining generation quality.
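To make the idea concrete, the following is a minimal, hypothetical sketch of the interpreter concept described above: a model emits annotations marking semantically independent spans, and an interpreter decodes those spans concurrently before stitching the results back together. The tag names (`<async>`/`</async>`), the `decode` stub, and the thread-pool-based parallelism are illustrative assumptions, not the paper's actual Pasta-Lang syntax or implementation.

```python
"""Toy illustration of asynchronous decoding over annotated spans.

Assumptions (not from the paper): tag names, the stub decoder, and
thread-based parallelism are placeholders for a real LM serving stack.
"""
import re
from concurrent.futures import ThreadPoolExecutor


def decode(prompt: str) -> str:
    """Stand-in for a call to an autoregressive LM; returns placeholder text."""
    return f"[generated continuation of: {prompt!r}]"


def interpret(annotated: str) -> str:
    """Decode <async>...</async> spans in parallel, then reassemble in order."""
    # Split the annotated draft into ordinary text and independent spans.
    parts = re.split(r"<async>(.*?)</async>", annotated, flags=re.S)
    # Odd indices hold the independent spans; decode them concurrently.
    with ThreadPoolExecutor() as pool:
        futures = {i: pool.submit(decode, parts[i]) for i in range(1, len(parts), 2)}
        for i, fut in futures.items():
            parts[i] = fut.result()
    return "".join(parts)


if __name__ == "__main__":
    draft = (
        "Here are two points. "
        "<async>First point: explain topic A.</async> "
        "<async>Second point: explain topic B.</async>"
    )
    print(interpret(draft))
```

In the paper's setting, the model itself decides where such annotations go at inference time, and the speedup comes from the interpreter decoding the independent spans in parallel rather than strictly left to right.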
