Shih-Fu Chang
28 papers · 3,187 total citations

Papers (28)
- VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text · NeurIPS 2021 · arXiv · 689 citations
- Open-Vocabulary Object Detection Using Captions · CVPR 2021 · arXiv · 546 citations
- Learning Visual Commonsense for Robust Scene Graph Generation · ECCV 2020 · arXiv · 312 citations
- Bridging Knowledge Graphs to Generate Scene Graphs · ECCV 2020 · arXiv · 233 citations
- Few-Shot Object Detection With Fully Cross-Transformer · CVPR 2022 · arXiv · 176 citations
- Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners · NeurIPS 2022 · arXiv · 164 citations
- CLIP-Event: Connecting Text and Images With Event Structures · CVPR 2022 · arXiv · 145 citations
- Query Adaptive Few-Shot Object Detection With Heterogeneous Graph Convolutional Networks · ICCV 2021 · arXiv · 131 citations
- Learning To Recognize Procedural Activities With Distant Supervision · CVPR 2022 · arXiv · 98 citations
- Multimodal Clustering Networks for Self-Supervised Learning From Unlabeled Videos · ICCV 2021 · arXiv · 97 citations
- Partner-Assisted Learning for Few-Shot Image Classification · ICCV 2021 · arXiv · 77 citations
- Vx2Text: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs · CVPR 2021 · arXiv · 74 citations
- Context-Gated Convolution · ECCV 2020 · arXiv · 68 citations
- DiGeo: Discriminative Geometry-Aware Learning for Generalized Few-Shot Object Detection · CVPR 2023 · arXiv · 64 citations
- Weakly Supervised Visual Semantic Parsing · CVPR 2020 · arXiv · 59 citations
- Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training · ECCV 2022 · arXiv · 57 citations
- Supervised Masked Knowledge Distillation for Few-Shot Transformers · CVPR 2023 · arXiv · 51 citations
- Task-Adaptive Negative Envision for Few-Shot Open-Set Recognition · CVPR 2022 · arXiv · 30 citations
- Towards Fast Adaptation of Pretrained Contrastive Models for Multi-Channel Video-Language Retrieval · CVPR 2023 · arXiv · 27 citations
- MoDE: CLIP Data Experts via Clustering · CVPR 2024 · arXiv · 25 citations
- SCHEMA: State CHangEs MAtter for Procedure Planning in Instructional Videos · ICLR 2024 · arXiv · 21 citations
- Co-Grounding Networks With Semantic Attention for Referring Expression Comprehension in Videos · CVPR 2021 · arXiv · 18 citations
- What When and Where? Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions · CVPR 2024 · arXiv · 9 citations
- Fine-Grained Visual Entailment · ECCV 2022 · arXiv · 7 citations
- RAP: Retrieval-Augmented Planner for Adaptive Procedure Planning in Instructional Videos · ECCV 2024 · arXiv · 4 citations
- Learning to Learn Words from Visual Scenes · ECCV 2020 · arXiv · 4 citations
- Beyond Grounding: Extracting Fine-Grained Event Hierarchies across Modalities · AAAI 2024 · arXiv · 1 citation
- Few-Shot End-to-End Object Detection via Constantly Concentrated Encoding across Heads · ECCV 2022 · 0 citations