Yi Jiang

papers

5,611

total citations

papers (27)

Infinity∞: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis

CVPR 2025, arXiv

201

citations

Towards Grand Unification of Object Tracking

ECCV 2022, arXiv

171

citations

In Defense of Online Models for Video Instance Segmentation

ECCV 2022, arXiv

138

citations

SeqFormer: Sequential Transformer for Video Instance Segmentation

ECCV 2022, arXiv

135

citations

TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation

CVPR 2025, arXiv

128

citations

Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models

ECCV 2024, arXiv

107

citations

Learning to Segment the Tail

CVPR 2020, arXiv

citations

General Object Foundation Model for Images and Videos at Scale

CVPR 2024, arXiv

citations

UniTok: a Unified Tokenizer for Visual Generation and Understanding

NeurIPS 2025, arXiv

citations

CoDet: Co-occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection

NeurIPS 2023, arXiv

citations

Goku: Flow Based Video Generative Foundation Models

CVPR 2025, arXiv

citations

Multimodal Transformer with Variable-Length Memory for Vision-and-Language Navigation

ECCV 2022, arXiv

citations

Generative Region-Language Pretraining for Open-Ended Object Detection

CVPR 2024, arXiv

citations

Segment Every Reference Object in Spatial and Temporal Spaces

ICCV 2023, arXiv

citations

Rethinking Resolution in the Context of Efficient Video Recognition

NeurIPS 2022, arXiv

citations

EGC: Image Generation and Classification via a Diffusion Energy-Based Model

ICCV 2023, arXiv

citations

InstMove: Instance Motion for Object-Centric Video Segmentation

CVPR 2023, arXiv

citations

Enhancing Adversarial Transferability with Adversarial Weight Tuning

AAAI 2025, arXiv

citations

Exploring Transformers for Open-world Instance Segmentation

ICCV 2023, arXiv

citations

InfinityStar: Unified Spacetime AutoRegressive Modeling for Visual Generation

NeurIPS 2025

citations

SA-Occ: Satellite-Assisted 3D Occupancy Prediction in Real World

ICCV 2025, arXiv

citations

A Unified Environmental Network for Pedestrian Trajectory Prediction

AAAI 2024

citations

ByteTrack: Multi-Object Tracking by Associating Every Detection Box

Sparse R-CNN: End-to-End Object Detection With Learnable Proposals

DanceTrack: Multi-Object Tracking in Uniform Appearance and Diverse Motion

Universal Instance Perception As Object Discovery and Retrieval

Language As Queries for Referring Video Object Segmentation
