Yiwu Zhong

papers

2,753

total citations

papers (13)

Learning Procedure-Aware Video Representation From Instructional Videos and Their Narrations

CVPR 2023, arXiv

AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning

ICCV 2025, arXiv

Towards Modern Image Manipulation Localization: A Large-Scale Dataset and Novel Methods

CVPR 2024

Revisiting Tampered Scene Text Detection in the Era of Generative AI

AAAI 2025, arXiv

Fine-grained Spatiotemporal Grounding on Egocentric Videos

ICCV 2025, arXiv

PAVE: Patching and Adapting Video Large Language Models

CVPR 2025, arXiv

A Simple Baseline for Weakly-Supervised Scene Graph Generation

ICCV 2021

Grounded Language-Image Pre-Training

RegionCLIP: Region-Based Language-Image Pretraining

Comprehensive Image Captioning via Scene Graph Decomposition

Towards Learning a Generalist Model for Embodied Navigation

Learning Concise and Descriptive Attributes for Visual Recognition

Learning To Generate Scene Graph From Natural Language Supervision
