Learning to Keep a Promise: Scaling Language Model Decoding Parallelism with Learned Asynchronous Decoding

arXiv:2502.11517 · ICML 2025

Abstract

Decoding with autoregressive language models traditionally occurs sequentially, generating one token after another. Recent attempts to introduce parallelism require a pre-determined structure in the generated content, for example by pattern-matching on bullet points. In this work, we present a new technique that automates parallel generation by dynamically exploiting the semantic independence of generation outputs to implement asynchronous decoding. We introduce Pasta-Lang, an annotation language that lets language models initiate asynchronous decoding at inference time, together with an accompanying Pasta-Lang interpreter that performs on-the-fly asynchronous decoding, effectively implementing parallel generation and speeding up inference. We also present an instruction-finetuning dataset of Pasta-Lang-annotated responses for teaching LLMs to annotate semantic independence, along with the methodology used to create it. Our evaluation shows that using the interpreter with a Pasta-Lang-equipped model achieves significant speedup while maintaining generation quality.
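To make the idea concrete, the following is a minimal, hypothetical sketch of the interpreter concept described above: a model emits annotations marking semantically independent spans, and an interpreter decodes those spans concurrently before stitching the results back together. The tag names (`<async>`/`</async>`), the `decode` stub, and the thread-pool-based parallelism are illustrative assumptions, not the paper's actual Pasta-Lang syntax or implementation.

```python
"""Toy illustration of asynchronous decoding over annotated spans.

Assumptions (not from the paper): tag names, the stub decoder, and
thread-based parallelism are placeholders for a real LM serving stack.
"""
import re
from concurrent.futures import ThreadPoolExecutor


def decode(prompt: str) -> str:
    """Stand-in for a call to an autoregressive LM; returns placeholder text."""
    return f"[generated continuation of: {prompt!r}]"


def interpret(annotated: str) -> str:
    """Decode <async>...</async> spans in parallel, then reassemble in order."""
    # Split the annotated draft into ordinary text and independent spans.
    parts = re.split(r"<async>(.*?)</async>", annotated, flags=re.S)
    # Odd indices hold the independent spans; decode them concurrently.
    with ThreadPoolExecutor() as pool:
        futures = {i: pool.submit(decode, parts[i]) for i in range(1, len(parts), 2)}
        for i, fut in futures.items():
            parts[i] = fut.result()
    return "".join(parts)


if __name__ == "__main__":
    draft = (
        "Here are two points. "
        "<async>First point: explain topic A.</async> "
        "<async>Second point: explain topic B.</async>"
    )
    print(interpret(draft))
```

In the paper's setting, the model itself decides where such annotations go at inference time, and the speedup comes from the interpreter decoding the independent spans in parallel rather than strictly left to right.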
