Multi-Layer Visual Feature Fusion in Multimodal LLMs: Methods, Analysis, and Best Practices

arXiv:2503.06063 · 16 citations · #512 of 2873 papers in CVPR 2025

Abstract

Multimodal Large Language Models (MLLMs) have made significant advancements in recent years, with visual features playing an increasingly critical role in enhancing model performance. However, the integration of multi-layer visual features in MLLMs remains underexplored, particularly with regard to optimal layer selection and fusion strategies. Existing methods often rely on arbitrary design choices, leading to suboptimal outcomes. In this paper, we systematically investigate two core aspects of multi-layer visual feature fusion: (1) selecting the most effective visual layers and (2) identifying the best fusion approach with the language model. Our experiments reveal that while combining visual features from multiple stages improves generalization, incorporating additional features from the same stage typically leads to diminished performance. Furthermore, we find that direct fusion of multi-layer visual features at the input stage consistently yields superior and more stable performance across various configurations. We make all our code publicly available: https://github.com/EIT-NLP/Layer_Select_Fuse_for_MLLM.
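
To make the paper's central finding concrete, below is a minimal sketch of what "direct fusion of multi-layer visual features at the input stage" can look like in PyTorch. This is not the authors' implementation (see their repository for that); the class name MultiLayerInputFusion, the choice of channel-wise concatenation followed by an MLP projector, and all dimensions are illustrative assumptions.

```python
# Minimal sketch of input-stage multi-layer visual feature fusion.
# Assumptions (not taken from the paper's code): per-layer hidden states
# from a CLIP-style ViT are available; layer choices, module names, and
# dimensions are illustrative only.
import torch
import torch.nn as nn


class MultiLayerInputFusion(nn.Module):
    """Fuse features from several vision-encoder layers and project them
    into the LLM embedding space so they can be concatenated with text
    embeddings at the LLM input (input-stage fusion)."""

    def __init__(self, vis_dim: int, llm_dim: int, num_layers: int):
        super().__init__()
        # Concatenate the selected layers along the channel dimension,
        # then map to the LLM hidden size with a small MLP projector.
        self.proj = nn.Sequential(
            nn.Linear(vis_dim * num_layers, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, layer_feats: list[torch.Tensor]) -> torch.Tensor:
        # layer_feats: one [batch, num_patches, vis_dim] tensor per
        # selected vision-encoder layer (ideally from different stages).
        fused = torch.cat(layer_feats, dim=-1)   # [B, P, vis_dim * num_layers]
        return self.proj(fused)                  # [B, P, llm_dim]


if __name__ == "__main__":
    B, P, vis_dim, llm_dim = 2, 576, 1024, 4096
    # e.g. features taken from shallow, middle, and deep encoder stages
    feats = [torch.randn(B, P, vis_dim) for _ in range(3)]
    fusion = MultiLayerInputFusion(vis_dim, llm_dim, num_layers=3)
    visual_tokens = fusion(feats)                # [2, 576, 4096]
    text_embeds = torch.randn(B, 32, llm_dim)    # placeholder text embeddings
    # Input-stage fusion: visual tokens are prepended to the text
    # embeddings before they enter the language model.
    llm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)
    print(llm_inputs.shape)                      # torch.Size([2, 608, 4096])
```

The key point the sketch illustrates is that fusion happens before the language model sees anything: the multi-layer visual features are merged and projected once, and the LLM receives a single sequence of visual tokens, rather than having extra features injected into its internal layers.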

Citation History

Date          Citations
Jan 24, 2026  0
Jan 26, 2026  15 (+15)
Jan 27, 2026  15
Feb 3, 2026   15
Feb 13, 2026  16 (+1)