"vision transformer" Papers
52 papers found • Page 1 of 2
A Hidden Stumbling Block in Generalized Category Discovery: Distracted Attention
Qiyu Xu, Zhanxuan Hu, Yu Duan et al.
Beyond Token Probes: Hallucination Detection via Activation Tensors with ACT-ViT
Guy Bar-Shalom, Fabrizio Frasca, Yaniv Galron et al.
Cross-modal Ship Re-Identification via Optical and SAR Imagery: A Novel Dataset and Method
Han Wang, Shengyang Li, Jian Yang et al.
Depth Pro: Sharp Monocular Metric Depth in Less Than a Second
Alexey Bochkovskiy, Amaël Delaunoy, Hugo Germain et al.
Efficient Concertormer for Image Deblurring and Beyond
Pin-Hung Kuo, Jinshan Pan, Shao-Yi Chien et al.
EMPLACE: Self-Supervised Urban Scene Change Detection
Tim Alpherts, Sennay Ghebreab, Nanne van Noord
Enhancing Vision-Language Model with Unmasked Token Alignment
Hongsheng Li, Jihao Liu, Boxiao Liu et al.
EOV-Seg: Efficient Open-Vocabulary Panoptic Segmentation
Hongwei Niu, Jie Hu, Jianghang Lin et al.
FSFM: A Generalizable Face Security Foundation Model via Self-Supervised Facial Representation Learning
Gaojian Wang, Feng Lin, Tong Wu et al.
GSPN-2: Efficient Parallel Sequence Modeling
Hongjun Wang, Yitong Jiang, Collin McCarthy et al.
Hybrid Spiking Vision Transformer for Object Detection with Event Cameras
Qi Xu, Jie Deng, Jiangrong Shen et al.
On the Role of Hidden States of Modern Hopfield Network in Transformer
Semantic Equitable Clustering: A Simple and Effective Strategy for Clustering Vision Tokens
Qihang Fan, Huaibo Huang, Mingrui Chen et al.
SPA: 3D Spatial-Awareness Enables Effective Embodied Representation
Haoyi Zhu, Honghui Yang, Yating Wang et al.
Sparse autoencoders reveal selective remapping of visual concepts during adaptation
Hyesu Lim, Jinho Choi, Jaegul Choo et al.
TADFormer: Task-Adaptive Dynamic TransFormer for Efficient Multi-Task Learning
Seungmin Baek, Soyul Lee, Hayeon Jo et al.
VITRIX-UniViTAR: Unified Vision Transformer with Native Resolution
Limeng Qiao, Yiyang Gan, Bairui Wang et al.
Agglomerative Token Clustering
Joakim Bruslund Haurum, Sergio Escalera, Graham W. Taylor et al.
Attention Guided CAM: Visual Explanations of Vision Transformer Guided by Self-Attention
Saebom Leem, Hyunseok Seo
CLIP-KD: An Empirical Study of CLIP Model Distillation
Chuanguang Yang, Zhulin An, Libo Huang et al.
ConvNet vs Transformer, Supervised vs CLIP: Beyond ImageNet Accuracy
Kirill Vishniakov, Zhiqiang Shen, Zhuang Liu
Data-free Neural Representation Compression with Riemannian Neural Dynamics
Zhengqi Pei, Anran Zhang, Shuhui Wang et al.
Enhancing Vision Transformer: Amplifying Non-Linearity in Feedforward Network Module
Yixing Xu, Chao Li, Dong Li et al.
FairViT: Fair Vision Transformer via Adaptive Masking
Bowei Tian, Ruijie Du, Yanning Shen
FiT: Flexible Vision Transformer for Diffusion Model
Zeyu Lu, ZiDong Wang, Di Huang et al.
GeminiFusion: Efficient Pixel-wise Multimodal Fusion for Vision Transformer
Ding Jia, Jianyuan Guo, Kai Han et al.
HEAL-SWIN: A Vision Transformer On The Sphere
Oscar Carlsson, Jan E. Gerken, Hampus Linander et al.
IFTR: An Instance-Level Fusion Transformer for Visual Collaborative Perception
Shaohong Wang, Lu Bin, Xinyu Xiao et al.
Information Flow in Self-Supervised Learning
Zhiquan Tan, Jingqin Yang, Weiran Huang et al.
Learning Adaptive and View-Invariant Vision Transformer for Real-Time UAV Tracking
Yongxin Li, Mengyuan Liu, You Wu et al.
Learning with Unmasked Tokens Drives Stronger Vision Learners
Taekyung Kim, Sanghyuk Chun, Byeongho Heo et al.
Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification
Pingping Zhang, Yuhao Wang, Yang Liu et al.
MedSegDiff-V2: Diffusion-based Medical Image Segmentation with Transformer
Junde Wu, Wei Ji, Huazhu Fu et al.
Memory Consolidation Enables Long-Context Video Understanding
Ivana Balazevic, Yuge Shi, Pinelopi Papalampidi et al.
MetaAT: Active Testing for Label-Efficient Evaluation of Dense Recognition Tasks
Sanbao Su, Xin Li, Thang Doan et al.
One-stage Prompt-based Continual Learning
Youngeun Kim, Yuhang Li, Priyadarshini Panda
Outlier-aware Slicing for Post-Training Quantization in Vision Transformer
Yuexiao Ma, Huixia Li, Xiawu Zheng et al.
Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning
Shiming Chen, Wenjin Hou, Salman Khan et al.
Question Aware Vision Transformer for Multimodal Reasoning
Roy Ganz, Yair Kittenplon, Aviad Aberdam et al.
Rejuvenating image-GPT as Strong Visual Representation Learners
Sucheng Ren, Zeyu Wang, Hongru Zhu et al.
S2WAT: Image Style Transfer via Hierarchical Vision Transformer Using Strips Window Attention
Chiyu Zhang, Xiaogang Xu, Lei Wang et al.
Semantic-Aware Autoregressive Image Modeling for Visual Representation Learning
Kaiyou Song, Shan Zhang, Tong Wang
SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design
Seokju Yun, Youngmin Ro
SiT: Symmetry-invariant Transformers for Generalisation in Reinforcement Learning
Matthias Weissenbacher, Rishabh Agarwal, Yoshinobu Kawahara
Statistical Test for Attention Maps in Vision Transformers
Tomohiro Shiraishi, Daiki Miwa, Teruyuki Katsuoka et al.
Stochastic positional embeddings improve masked image modeling
Amir Bar, Florian Bordes, Assaf Shocher et al.
Video Super-Resolution Transformer with Masked Inter&Intra-Frame Attention
Xingyu Zhou, Leheng Zhang, Xiaorui Zhao et al.
ViP: A Differentially Private Foundation Model for Computer Vision
Yaodong Yu, Maziar Sanjabi, Yi Ma et al.
Vision Transformer Off-the-Shelf: A Surprising Baseline for Few-Shot Class-Agnostic Counting
Zhicheng Wang, Liwen Xiao, Zhiguo Cao et al.
ViT-Calibrator: Decision Stream Calibration for Vision Transformer
Lin Chen, Zhijie Jia, Lechao Cheng et al.