"memory-efficient inference" Papers
7 papers found
Computation and Memory-Efficient Model Compression with Gradient Reweighting
Zhiwei Li, Yuesen Liao, Binrui Wu et al.
NeurIPS 2025
Memory-Efficient Visual Autoregressive Modeling with Scale-Aware KV Cache Compression
Kunjun Li, Zigeng Chen, Cheng-Yen Yang et al.
NeurIPS 2025 · arXiv:2505.19602 · 9 citations
Surprising Effectiveness of Pretraining Ternary Language Model at Scale
Ayush Kaushal, Tejas Vaidhya, Arnab Mondal et al.
ICLR 2025 · arXiv:2407.12327 · 13 citations
UniGist: Towards General and Hardware-aligned Sequence-level Long Context Compression
Chenlong Deng, Zhisong Zhang, Kelong Mao et al.
NeurIPS 2025 · arXiv:2509.15763 · 4 citations
CaM: Cache Merging for Memory-efficient LLMs Inference
Yuxin Zhang, Yuxuan Du, Gen Luo et al.
ICML 2024
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
Zirui Liu, Jiayi Yuan, Hongye Jin et al.
ICML 2024 · arXiv:2402.02750 · 368 citations (illustrative sketch after this list)
MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed Precision Quantization
Tianchen Zhao, Xuefei Ning, Tongcheng Fang et al.
ECCV 2024 · arXiv:2405.17873 · 37 citations
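Of the results above, KIVI names the most self-contained technique: tuning-free asymmetric 2-bit quantization of the KV cache (arXiv:2402.02750). As a point of reference, here is a minimal sketch of per-group asymmetric 2-bit quantization. The function names, toy shapes, and grouping axes are illustrative assumptions, not KIVI's implementation; the paper's actual design (reportedly per-channel grouping for keys, per-token grouping for values, plus a small full-precision window of recent tokens) is specified in the paper itself.

```python
import numpy as np

def quantize_2bit_asym(x: np.ndarray, axis: int):
    """Asymmetric 2-bit quantization with a per-group scale and zero-point.

    Every slice along `axis` forms one group sharing a (scale, min) pair.
    Returns integer codes in [0, 3] plus the parameters for dequantization.
    NOTE: illustrative sketch only, not the KIVI codebase.
    """
    levels = 2**2 - 1  # 2-bit -> 4 codes: 0..3
    x_min = x.min(axis=axis, keepdims=True)
    x_max = x.max(axis=axis, keepdims=True)
    scale = (x_max - x_min) / levels
    scale = np.where(scale == 0, 1.0, scale)  # guard constant groups
    q = np.clip(np.round((x - x_min) / scale), 0, levels).astype(np.uint8)
    return q, scale, x_min

def dequantize(q: np.ndarray, scale: np.ndarray, x_min: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale + x_min

# Toy KV cache of shape (num_tokens, head_dim); shapes are hypothetical.
rng = np.random.default_rng(0)
K = rng.normal(size=(8, 16)).astype(np.float32)
V = rng.normal(size=(8, 16)).astype(np.float32)

# Keys grouped per channel (reduce over tokens, axis=0), values per token
# (reduce over channels, axis=1) -- the grouping split attributed to KIVI.
qk, sk, zk = quantize_2bit_asym(K, axis=0)
qv, sv, zv = quantize_2bit_asym(V, axis=1)
print("max key reconstruction error:", np.abs(dequantize(qk, sk, zk) - K).max())
```

The asymmetric part is the per-group zero-point (`x_min`): unlike symmetric quantization around zero, the four codes cover exactly the observed [min, max] range of each group, which matters at 2 bits where only four levels are available.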