"kv cache optimization" Papers
11 papers found
D2O: Dynamic Discriminative Operations for Efficient Long-Context Inference of Large Language Models
Zhongwei Wan, Xinjian Wu, Yu Zhang et al.
ICLR 2025
22 citations
Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
Yuan Feng, Junlin Lv, Yukun Cao et al.
NEURIPS 2025 · arXiv:2407.11550
106 citations
Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification
Wenxuan Huang, Zijie Zhai, Yunhang Shen et al.
ICLR 2025 · arXiv:2412.00876
42 citations
Let the Code LLM Edit Itself When You Edit the Code
Zhenyu He, Jun Zhang, Shengjie Luo et al.
ICLR 2025 (oral) · arXiv:2407.03157
3 citations
MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding
Ranajoy Sadhukhan, Jian Chen, Zhuoming Chen et al.
ICLR 2025 · arXiv:2408.11049
64 citations
Neural Attention Search
Difan Deng, Marius Lindauer
NEURIPS 2025 · arXiv:2502.13251
1 citation
PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation
Ao Wang, Hui Chen, Jianchao Tan et al.
NEURIPS 2025 · arXiv:2412.03409
6 citations
Sim-LLM: Optimizing LLM Inference at the Edge through Inter-Task KV Reuse
Ruikun Luo, Changwei Gu, Qiang He et al.
NEURIPS 2025
When Attention Sink Emerges in Language Models: An Empirical View
Xiangming Gu, Tianyu Pang, Chao Du et al.
ICLR 2025 · arXiv:2410.10781
98 citations
Bifurcated Attention for Single-Context Large-Batch Sampling
Ben Athiwaratkun, Sujan Kumar Gonugondla, Sanjay Krishna Gouda et al.
ICML 2024
QUEST: Query-Aware Sparsity for Efficient Long-Context LLM Inference
Jiaming Tang, Yilong Zhao, Kan Zhu et al.
ICML 2024 · arXiv:2406.10774
248 citations