"llm inference efficiency" Papers
9 papers found
CodeGEMM: A Codebook-Centric Approach to Efficient GEMM in Quantized LLMs
Gunho Park, Jeongin Bae, Byeongwook Kim et al.
NeurIPS 2025 · arXiv:2512.17970
KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems
Hancheng Ye, Zhengqi Gao, Mingyuan Ma et al.
NeurIPS 2025 · arXiv:2510.12872
4 citations
Progressive Mixed-Precision Decoding for Efficient LLM Inference
Hao (Mark) Chen, Fuwen Tan, Alexandros Kouris et al.
ICLR 2025 · arXiv:2410.13461
8 citations
Pushing the Limits of BFP on Narrow Precision LLM Inference
Hui Wang, Yuan Cheng, Xiaomeng Han et al.
AAAI 2025 · arXiv:2502.00026
1 citation
RazorAttention: Efficient KV Cache Compression Through Retrieval Heads
Hanlin Tang, Yang Lin, Jing Lin et al.
ICLR 2025 · arXiv:2407.15891
62 citations
STBLLM: Breaking the 1-Bit Barrier with Structured Binary LLMs
Peijie Dong, Lujun Li, Yuedong Zhong et al.
ICLR 2025 · arXiv:2408.01803
32 citations
CLLMs: Consistency Large Language Models
Siqi Kou, Lanxiang Hu, Zhezhi He et al.
ICML 2024 · arXiv:2403.00835
58 citations
Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference
Harry Dong, Xinyu Yang, Zhenyu Zhang et al.
ICML 2024 · arXiv:2402.09398
79 citations
Online Cascade Learning for Efficient Inference over Streams
Lunyiu Nie, Zhimin Ding, Erdong Hu et al.
ICML 2024 · arXiv:2402.04513
16 citations