Pan Zhang

papers

2,082

total citations

papers (24)

Prototypical Pseudo Label Denoising and Target Structure Learning for Domain Adaptive Semantic Segmentation

CVPR 2021, arXiv

564

citations

OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation

CVPR 2024, arXiv

385

citations

MetaPortrait: Identity-Preserving Talking Head Generation With Fast Personalized Adaptation

CVPR 2023, arXiv

citations

SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree

ICCV 2025, arXiv

citations

OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?

CVPR 2025, arXiv

citations

Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction

CVPR 2025, arXiv

citations

FreeDrag: Feature Dragging for Reliable Point-based Image Editing

CVPR 2024, arXiv

citations

SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation

ICML 2025, arXiv

citations

MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models

ICLR 2025, arXiv

citations

MM-IFEngine: Towards Multimodal Instruction Following

ICCV 2025, arXiv

citations

Light-A-Video: Training-free Video Relighting via Progressive Light Fusion

ICCV 2025, arXiv

citations

BUOL: A Bottom-Up Framework With Occupancy-Aware Lifting for Panoptic 3D Scene Reconstruction From a Single Image

CVPR 2023, arXiv

citations

Real-Time Neural Character Rendering with Pose-Guided Multiplane Images

ECCV 2022, arXiv

citations

HiFlow: Training-free High-Resolution Image Generation with Flow-Aligned Guidance

NeurIPS 2025, arXiv

citations

CoCosNet v2: Full-Resolution Correspondence Learning for Image Translation

CVPR 2021, arXiv

citations

Bootstrap3D: Improving Multi-view Diffusion Model with Synthetic Data

ICCV 2025, arXiv

citations

ByTheWay: Boost Your Text-to-Video Generation Model to Higher Quality in a Training-free Way

CVPR 2025, arXiv

citations

Conical Visual Concentration for Efficient Large Vision-Language Models

CVPR 2025

citations

X-Prompt: Generalizable Auto-Regressive Visual Learning with In-Context Prompting

ICCV 2025

citations

Deciphering Cross-Modal Alignment in Large Vision-Language Models via Modality Integration Rate

ICCV 2025

citations


papers (24)

Prototypical Pseudo Label Denoising and Target Structure Learning for Domain Adaptive Semantic Segmentation

OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation

Cross-Domain Correspondence Learning for Exemplar-Based Image Translation

Bringing Old Photos Back to Life

Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

V3Det: Vast Vocabulary Visual Detection Dataset

MetaPortrait: Identity-Preserving Talking Head Generation With Fast Personalized Adaptation

SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree

OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?

Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction

FreeDrag: Feature Dragging for Reliable Point-based Image Editing

SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation

MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models

MM-IFEngine: Towards Multimodal Instruction Following

Light-A-Video: Training-free Video Relighting via Progressive Light Fusion

BUOL: A Bottom-Up Framework With Occupancy-Aware Lifting for Panoptic 3D Scene Reconstruction From a Single Image

Real-Time Neural Character Rendering with Pose-Guided Multiplane Images

HiFlow: Training-free High-Resolution Image Generation with Flow-Aligned Guidance

CoCosNet v2: Full-Resolution Correspondence Learning for Image Translation

Bootstrap3D: Improving Multi-view Diffusion Model with Synthetic Data

ByTheWay: Boost Your Text-to-Video Generation Model to Higher Quality in a Training-free Way

Conical Visual Concentration for Efficient Large Vision-Language Models

X-Prompt: Generalizable Auto-Regressive Visual Learning with In-Context Prompting

Deciphering Cross-Modal Alignment in Large Vision-Language Models via Modality Integration Rate
