Yujie Zhong

2,162 total citations

papers (24)

InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models

ICCV 2025, arXiv

RoboTron-Drive: All-in-One Large Multimodal Model for Autonomous Driving

ICCV 2025, arXiv

Mr. DETR: Instructive Multi-Route Training for Detection Transformers

CVPR 2025

CO-MOT: Boosting End-to-end Transformer-based Multi-Object Tracking via Coopetition Label Assignment and Shadow Sets

ICLR 2025

DisTime: Distribution-based Time Representation for Video Large Language Models

ICCV 2025, arXiv

HyperSeg: Hybrid Segmentation Assistant with Fine-grained Visual Perceiver

CVPR 2025

RoboTron-Sim: Improving Real-World Driving via Simulated Hard-Case

ICCV 2025, arXiv

Advancing Visual Large Language Model for Multi-granular Versatile Perception

ICCV 2025, arXiv

v-CLR: View-Consistent Learning for Open-World Instance Segmentation

CVPR 2025, arXiv

Layer-wise Vision Injection with Disentangled Attention for Efficient LVLMs

ICCV 2025

papers (24)

TOOD: Task-Aligned One-Stage Object Detection

PromptDet: Towards Open-Vocabulary Detection Using Uncurated Images

TriDet: Temporal Action Detection With Relative Boundary Modeling

Exploring Classification Equilibrium in Long-Tailed Object Detection

DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers

ReAct: Temporal Action Detection with Relational Queries

Intelligent Grimm - Open-ended Visual Storytelling via Latent Diffusion Models

Adaptive Sparse Pairwise Loss for Object Re-Identification

Open-Vocabulary Semantic Segmentation with Decoupled One-Pass Network

InstaGen: Enhancing Object Detection by Training on Synthetic Dataset

Cross-Architecture Self-Supervised Video Representation Learning

AeDet: Azimuth-Invariant Multi-View 3D Object Detection

Representation Sharing for Fast Object Detector Search and Beyond

UniMD: Towards Unifying Moment Retrieval and Temporal Action Detection

InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models

RoboTron-Drive: All-in-One Large Multimodal Model for Autonomous Driving

Mr. DETR: Instructive Multi-Route Training for Detection Transformers

CO-MOT: Boosting End-to-end Transformer-based Multi-Object Tracking via Coopetition Label Assignment and Shadow Sets

DisTime: Distribution-based Time Representation for Video Large Language Models

HyperSeg: Hybrid Segmentation Assistant with Fine-grained Visual Perceiver

RoboTron-Sim: Improving Real-World Driving via Simulated Hard-Case

Advancing Visual Large Language Model for Multi-granular Versatile Perception

v-CLR: View-Consistent Learning for Open-World Instance Segmentation

Layer-wise Vision Injection with Disentangled Attention for Efficient LVLMs
