Jianfeng Wang

5,065 total citations

papers (28)

NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis

NeurIPS 2022, arXiv

Boosting Weakly Supervised Object Detection with Progressive Knowledge Transfer

ECCV 2020, arXiv

MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning

CVPR 2024, arXiv

Rethinking Bayesian Deep Learning Methods for Semi-Supervised Volumetric Medical Image Segmentation

CVPR 2022, arXiv

MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos

ICLR 2025, arXiv

Segment and Caption Anything

CVPR 2024, arXiv

Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding

CVPR 2023, arXiv

SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation

ICLR 2025, arXiv

DAP: Detection-Aware Pre-Training With Weak Supervision

CVPR 2021, arXiv

MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos

CVPR 2024, arXiv

IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation

ECCV 2024, arXiv

LiVOS: Light Video Object Segmentation with Gated Linear Matching

CVPR 2025, arXiv

A Simple Approach and Benchmark for 21,000-Category Object Detection

ECCV 2022

Label Distribution Learning on Auxiliary Label Space Graphs for Facial Expression Recognition

CVPR 2020

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

Segment Everything Everywhere All at Once

End-to-End Semi-Supervised Object Detection With Soft Teacher

An Empirical Study of Training End-to-End Vision-and-Language Transformers

Generalized Decoding for Pixel, Image, and Language

Scaling Up Vision-Language Pre-Training for Image Captioning

End-to-End Object Detection With Fully Convolutional Network

ReCo: Region-Controlled Text-to-Image Generation

TAP: Text-Aware Pre-Training for Text-VQA and Text-Caption

Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone

UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling

Injecting Semantic Concepts Into End-to-End Image Captioning

Compressing Visual-Linguistic Model via Knowledge Distillation

RSG: A Simple but Effective Module for Learning Imbalanced Datasets