Chaoyou Fu

Google Scholar OpenReview

h-index

papers

2,968

total citations

papers (19)

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

NEURIPS 2025 arXiv

1,277

citations

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

CVPR 2025 arXiv

917

citations

CM-NAS: Cross-Modality Neural Architecture Search for Visible-Infrared Person Re-Identification

ICCV 2021 arXiv

141

citations

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

NEURIPS 2025 arXiv

138

citations

Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM

ICML 2025 arXiv

112

citations

MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency

ICML 2025 arXiv

citations

No Time to Train: Empowering Non-Parametric Networks for Few-shot 3D Scene Segmentation

CVPR 2024 arXiv

citations

VITA-Audio: Fast Interleaved Audio-Text Token Generation for Efficient Large Speech-Language Model

NEURIPS 2025

citations

InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption

CVPR 2025 arXiv

citations

Pareidolia Face Reenactment

CVPR 2021 arXiv

citations

Learning Interleaved Image-Text Comprehension in Vision-Language Large Models

ICLR 2025 arXiv

citations

CAPro: Webly Supervised Learning with Cross-modality Aligned Prototypes

NEURIPS 2023 arXiv

citations

Zooming from Context to Cue: Hierarchical Preference Optimization for Multi-Image MLLMs

NEURIPS 2025 arXiv

citations

Information Bottleneck Disentanglement for Identity Swapping

CVPR 2021

citations

Rethinking Image Cropping: Exploring Diverse Compositions From Global Views

CVPR 2022

citations

papers (19)

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

CM-NAS: Cross-Modality Neural Architecture Search for Visible-Infrared Person Re-Identification

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM

MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency

Aligning and Prompting Everything All at Once for Universal Visual Perception

Cross-Spectral Face Hallucination via Disentangling Independent Factors

Multi-modal Queried Object Detection in the Wild

AOT: Appearance Optimal Transport Based Identity Swapping for Forgery Detection

No Time to Train: Empowering Non-Parametric Networks for Few-shot 3D Scene Segmentation

VITA-Audio: Fast Interleaved Audio-Text Token Generation for Efficient Large Speech-Language Model

InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption

Pareidolia Face Reenactment

Learning Interleaved Image-Text Comprehension in Vision-Language Large Models

CAPro: Webly Supervised Learning with Cross-modality Aligned Prototypes

Zooming from Context to Cue: Hierarchical Preference Optimization for Multi-Image MLLMs

Information Bottleneck Disentanglement for Identity Swapping

Rethinking Image Cropping: Exploring Diverse Compositions From Global Views
