"inference efficiency" Papers

35 papers found

Adaptive Rank Allocation: Speeding Up Modern Transformers with RaNA Adapters

Roberto Garcia, Jerry Liu, Daniel Sorvisto et al.

ICLR 2025 · arXiv:2503.18216 · 1 citation

Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs

Qizhe Zhang, Mengzhen Liu, Lichen Li et al.

NEURIPS 2025 · arXiv:2506.10967 · 22 citations

Can LLMs Outshine Conventional Recommenders? A Comparative Evaluation

Qijiong Liu, Jieming Zhu, Lu Fan et al.

NEURIPS 2025 · arXiv:2503.05493 · 4 citations

CatVTON: Concatenation Is All You Need for Virtual Try-On with Diffusion Models

Zheng Chong, Xiao Dong, Haoxiang Li et al.

ICLR 2025 · arXiv:2407.15886 · 68 citations

Complexity Experts are Task-Discriminative Learners for Any Image Restoration

Eduard Zamfir, Zongwei Wu, Nancy Mehta et al.

CVPR 2025 · arXiv:2411.18466 · 32 citations

DecompNet: Enhancing Time Series Forecasting Models with Implicit Decomposition

Donghao Luo, Xue Wang

NEURIPS 2025

Depth-Width Tradeoffs for Transformers on Graph Tasks

Gilad Yehudai, Clayton Sanford, Maya Bechler-Speicher et al.

NEURIPS 2025 · spotlight

Diffusion on Demand: Selective Caching and Modulation for Efficient Generation

Hee Min Choi, Hyoa Kang, Dokwan Oh et al.

NEURIPS 2025

DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models

Saeed Ranjbar Alvar, Gursimran Singh, Mohammad Akbari et al.

CVPR 2025 · arXiv:2503.02175 · 57 citations

DLP: Dynamic Layerwise Pruning in Large Language Models

Yuli Chen, Bo Cheng, Jiale Han et al.

ICML 2025 · arXiv:2505.23807 · 2 citations

Faster Cascades via Speculative Decoding

Harikrishna Narasimhan, Wittawat Jitkrittum, Ankit Singh Rawat et al.

ICLR 2025 · arXiv:2405.19261 · 23 citations

FastLongSpeech: Enhancing Large Speech-Language Models for Efficient Long-Speech Processing

Shoutao Guo, Shaolei Zhang, Qingkai Fang et al.

NEURIPS 2025 · arXiv:2507.14815 · 3 citations

Mixture Compressor for Mixture-of-Experts LLMs Gains More

Wei Huang, Yue Liao, Jianhui Liu et al.

ICLR 2025 · arXiv:2410.06270 · 24 citations

Optimize Incompatible Parameters Through Compatibility-aware Knowledge Integration

Zheqi Lv, Keming Ye, Zishu Wei et al.

AAAI 2025 · arXiv:2501.07596 · 1 citation

Overfill: Two-Stage Models for Efficient Language Model Decoding

Woojeong Kim, Junxiong Wang, Jing Nathan Yan et al.

COLM 2025 · arXiv:2508.08446

PACT: Pruning and Clustering-Based Token Reduction for Faster Visual Language Models

Dhouib Mohamed, Davide Buscaldi, Vanier Sonia et al.

CVPR 2025 · arXiv:2504.08966 · 21 citations

PICASO: Permutation-Invariant Context Composition with State Space Models

Tian Yu Liu, Alessandro Achille, Matthew Trager et al.

ICLR 2025 · oral · arXiv:2502.17605

PixelMan: Consistent Object Editing with Diffusion Models via Pixel Manipulation and Generation

Liyao Jiang, Negar Hassanpour, Mohammad Salameh et al.

AAAI 2025 · arXiv:2412.14283 · 5 citations

Pruning All-Rounder: Rethinking and Improving Inference Efficiency for Large Vision Language Models

Wei Suo, Ji Ma, Mengyang Sun et al.

ICCV 2025 · arXiv:2412.06458 · 1 citation

RepLDM: Reprogramming Pretrained Latent Diffusion Models for High-Quality, High-Efficiency, High-Resolution Image Generation

Boyuan Cao, Jiaxin Ye, Yujie Wei et al.

NEURIPS 2025 · spotlight · arXiv:2410.06055 · 9 citations

RouteLLM: Learning to Route LLMs from Preference Data

Isaac Ong, Amjad Almahairi, Vincent Wu et al.

ICLR 2025 · 24 citations

Selective Attention Improves Transformer

Yaniv Leviathan, Matan Kalman, Yossi Matias

ICLR 2025 · arXiv:2410.02703 · 21 citations

Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping

Weili Zeng, Ziyuan Huang, Kaixiang Ji et al.

ICCV 2025 · arXiv:2503.21817 · 6 citations

Thinker: Learning to Think Fast and Slow

Stephen Chung, Wenyu Du, Jie Fu

NEURIPS 2025 · arXiv:2505.21097 · 8 citations

To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning

Zayne Sprague, Fangcong Yin, Juan Rodriguez et al.

ICLR 2025 · arXiv:2409.12183 · 250 citations

Týr-the-Pruner: Structural Pruning LLMs via Global Sparsity Distribution Optimization

Guanchen Li, Yixing Xu, Zeping Li et al.

NEURIPS 2025 · arXiv:2503.09657 · 6 citations

Value-Guided Search for Efficient Chain-of-Thought Reasoning

Kaiwen Wang, Jin Zhou, Jonathan Chang et al.

NEURIPS 2025 · arXiv:2505.17373 · 7 citations

Variational Best-of-N Alignment

Afra Amini, Tim Vieira, Elliott Ash et al.

ICLR 2025 · arXiv:2407.06057 · 38 citations

Zebra-Llama: Towards Extremely Efficient Hybrid Models

Mingyu Yang, Mehdi Rezagholizadeh, Guihong Li et al.

NEURIPS 2025 · arXiv:2505.17272 · 7 citations

APT: Adaptive Pruning and Tuning Pretrained Language Models for Efficient Training and Inference

Bowen Zhao, Hannaneh Hajishirzi, Qingqing Cao

ICML 2024 · arXiv:2401.12200 · 28 citations

Deciphering RNA Secondary Structure Prediction: A Probabilistic K-Rook Matching Perspective

Cheng Tan, Zhangyang Gao, Hanqun Cao et al.

ICML 2024 · arXiv:2212.14041 · 2 citations

Dyn-Adapter: Towards Disentangled Representation for Efficient Visual Recognition

Yurong Zhang, Honghao Chen, Zhang Xinyu et al.

ECCV 2024 · arXiv:2407.14302 · 4 citations

Efficient Denoising Diffusion via Probabilistic Masking

Weizhong Zhang, Zhiwei Zhang, Renjie Pi et al.

ICML 2024

Tandem Transformers for Inference Efficient LLMs

Aishwarya P S, Pranav Nair, Yashas Samaga et al.

ICML 2024 · arXiv:2402.08644 · 10 citations

Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers

Hongjie Wang, Bhishma Dedhia, Niraj Jha

CVPR 2024 · arXiv:2305.17328 · 61 citations