Lexico: Extreme KV Cache Compression via Sparse Coding over Universal Dictionaries

arXiv:2412.08890
13 citations
Ranked #439 of 3340 papers in ICML 2025

Abstract

We introduce Lexico, a novel KV cache compression method that leverages sparse coding with a universal dictionary. Our key finding is that the key-value cache in modern LLMs can be accurately approximated as sparse linear combinations of atoms from a small, input-agnostic dictionary of ~4k atoms, enabling efficient compression across different input prompts, tasks, and models. Using orthogonal matching pursuit for sparse approximation, Lexico achieves flexible compression ratios through direct sparsity control. On GSM8K, across multiple model families (Mistral, Llama 3, Qwen2.5), Lexico maintains 90-95% of the original performance while using only 15-25% of the full KV cache memory, outperforming both quantization and token eviction methods. Notably, Lexico remains effective in low-memory regimes where 2-bit quantization fails, achieving up to 1.7x better compression on LongBench and GSM8K while maintaining high accuracy.
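The core mechanism the abstract describes, approximating each key or value vector as a sparse combination of atoms from a fixed dictionary via orthogonal matching pursuit (OMP), can be illustrated with a minimal NumPy sketch. This is a generic textbook OMP, not Lexico's actual implementation; the dictionary here is random for demonstration, whereas the paper's universal dictionary is learned.

```python
import numpy as np

def omp(D, x, s):
    """Greedy orthogonal matching pursuit: approximate x with at most
    s columns (atoms) of the dictionary D. Storing only the s selected
    indices and coefficients, instead of the dense vector x, is the
    source of the compression."""
    residual = x.copy()
    support = []
    coeffs = np.zeros(D.shape[1])
    for _ in range(s):
        # pick the atom most correlated with the current residual
        idx = int(np.argmax(np.abs(D.T @ residual)))
        if idx not in support:
            support.append(idx)
        # least-squares refit on all selected atoms, then update residual
        sol, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ sol
    coeffs[support] = sol
    return coeffs

# toy demo: a 16-dim "KV vector" encoded with 4 atoms of a 64-atom dictionary
rng = np.random.default_rng(0)
D = rng.standard_normal((16, 64))
D /= np.linalg.norm(D, axis=0)  # unit-norm atoms
x = rng.standard_normal(16)
c = omp(D, x, s=4)
print(np.count_nonzero(c))  # at most 4 nonzero coefficients
```

Varying `s` trades reconstruction error against memory, which is the "direct sparsity control" the abstract refers to.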

Citation History

Jan 28, 2026: 0 citations
Feb 13, 2026: 13 citations