"kv cache optimization" Papers

11 papers found

$\text{D}_{2}\text{O}$: Dynamic Discriminative Operations for Efficient Long-Context Inference of Large Language Models

Zhongwei Wan, Xinjian Wu, Yu Zhang et al.

ICLR 2025
22 citations

Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference

Yuan Feng, Junlin Lv, Yukun Cao et al.

NeurIPS 2025 · arXiv:2407.11550
106 citations

Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification

Wenxuan Huang, Zijie Zhai, Yunhang Shen et al.

ICLR 2025 · arXiv:2412.00876
42 citations

Let the Code LLM Edit Itself When You Edit the Code

Zhenyu He, Jun Zhang, Shengjie Luo et al.

ICLR 2025 (oral) · arXiv:2407.03157
3 citations

MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding

Ranajoy Sadhukhan, Jian Chen, Zhuoming Chen et al.

ICLR 2025 · arXiv:2408.11049
64 citations

Neural Attention Search

Difan Deng, Marius Lindauer

NeurIPS 2025 · arXiv:2502.13251
1 citation

PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation

Ao Wang, Hui Chen, Jianchao Tan et al.

NeurIPS 2025 · arXiv:2412.03409
6 citations

Sim-LLM: Optimizing LLM Inference at the Edge through Inter-Task KV Reuse

Ruikun Luo, Changwei Gu, Qiang He et al.

NeurIPS 2025

When Attention Sink Emerges in Language Models: An Empirical View

Xiangming Gu, Tianyu Pang, Chao Du et al.

ICLR 2025 · arXiv:2410.10781
98 citations

Bifurcated Attention for Single-Context Large-Batch Sampling

Ben Athiwaratkun, Sujan Kumar Gonugondla, Sanjay Krishna Gouda et al.

ICML 2024

QUEST: Query-Aware Sparsity for Efficient Long-Context LLM Inference

Jiaming Tang, Yilong Zhao, Kan Zhu et al.

ICML 2024 · arXiv:2406.10774
248 citations