HexGen-2: Disaggregated Generative Inference of LLMs in Heterogeneous Environment

21 citations

arXiv:2502.07903

#839 of 3827 papers in ICLR 2025

Top Authors

Youhe Jiang, Ran Yan, Binhang Yuan

Topics

generative inference, large language models, heterogeneous GPUs, disaggregated inference, KV cache communication, distributed system, constraint optimization, graph partitioning

Abstract

Disaggregating the prefill and decoding phases is an effective new paradigm for generative inference of large language models (LLMs): it eliminates prefill-decoding interference and optimizes resource allocation. However, how to deploy the disaggregated inference paradigm across a group of heterogeneous GPUs, which can be an economical alternative to deployment over homogeneous high-performance GPUs, remains an open problem. To this end, we introduce HexGen-2, a distributed system for efficient and economical LLM serving on heterogeneous GPUs following the disaggregated paradigm. Built on top of HexGen, the core component of HexGen-2 is a scheduling algorithm that formalizes the allocation of disaggregated LLM inference computations and communications over heterogeneous GPUs and network connections as a constraint optimization problem. We leverage graph partitioning and max-flow algorithms to co-optimize resource allocation, parallel strategies for distinct inference phases, and the efficiency of inter-phase key-value (KV) cache communications. We conduct extensive experiments to evaluate HexGen-2 on OPT (30B) and Llama-2 (70B) models in various real-world settings. The results show that, compared with state-of-the-art systems given the same price budget, HexGen-2 delivers up to a 2.0 times and on average a 1.3 times improvement in serving throughput and reduces the average inference latency by 1.5 times. It also achieves comparable inference performance with a 30% lower price budget.
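The abstract names max-flow as one ingredient of the scheduler but does not spell out the graph construction here. As a rough illustration only, the sketch below runs a textbook Edmonds-Karp max-flow on a toy graph in which a source feeds two prefill groups, each prefill group can ship KV cache to two decoding groups over links of limited capacity, and the decoding groups drain into a sink. The node names, the capacities, and the reading of capacity as a relative request rate are all assumptions made for this example; none of them come from HexGen-2.

```python
from collections import deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp max-flow: augment along BFS shortest paths until none remain.

    `capacity` maps each node to a dict of {neighbor: edge capacity}.
    """
    # Residual graph: a copy of the capacities plus zero-capacity reverse edges.
    residual = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u, nbrs in capacity.items():
        for v in nbrs:
            residual.setdefault(v, {}).setdefault(u, 0)

    total = 0
    while True:
        # BFS from the source over edges that still have residual capacity.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total  # no augmenting path left

        # Find the bottleneck capacity along the path that BFS discovered.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]

        # Push the bottleneck amount along the path and update reverse edges.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


# Hypothetical toy topology (all names and numbers are illustrative).
toy_capacity = {
    "src": {"prefill_a": 10, "prefill_b": 6},        # prefill throughput limits
    "prefill_a": {"decode_x": 4, "decode_y": 3},     # KV cache link limits
    "prefill_b": {"decode_x": 2, "decode_y": 5},
    "decode_x": {"sink": 5},                         # decoding throughput limits
    "decode_y": {"sink": 9},
    "sink": {},
}

if __name__ == "__main__":
    print(max_flow(toy_capacity, "src", "sink"))
```

In this toy setting the flow value is bounded by whichever of prefill capacity, link capacity, or decoding capacity forms the tightest cut, which is the kind of bottleneck a scheduler would want to expose when assigning GPUs to phases.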

Citation History

Jan 26, 2026 to Feb 13, 2026

Jan 27, 2026: 17 (+17)
Feb 3, 2026: 18 (+1)
Feb 13, 2026: 21 (+3)