"cross-attention mechanisms" Papers

18 papers found

$\text{I}^2\text{AM}$: Interpreting Image-to-Image Latent Diffusion Models via Bi-Attribution Maps

Junseo Park, Hyeryung Jang

ICLR 2025

DGQ: Distribution-Aware Group Quantization for Text-to-Image Diffusion Models

Hyogon Ryu, NaHyeon Park, Hyunjung Shim

ICLR 2025 · arXiv:2501.04304 · 7 citations

Enhancing Text-to-Image Diffusion Transformer via Split-Text Conditioning

Yu Zhang, Jialei Zhou, Xinchen Li et al.

NeurIPS 2025 · arXiv:2505.19261 · 7 citations

Fair Generation without Unfair Distortions: Debiasing Text-to-Image Generation with Entanglement-Free Attention

Jeonghoon Park, Juyoung Lee, Chaeyeon Chung et al.

ICCV 2025 · arXiv:2506.13298 · 3 citations

Grounding Continuous Representations in Geometry: Equivariant Neural Fields

David Wessels, David Knigge, Riccardo Valperga et al.

ICLR 2025 · arXiv:2406.05753 · 13 citations

Improving Editability in Image Generation with Layer-wise Memory

Daneul Kim, Jaeah Lee, Jaesik Park

CVPR 2025 · arXiv:2505.01079 · 1 citation

Prediction-Feedback DETR for Temporal Action Detection

Jihwan Kim, Miso Lee, Cheol-Ho Cho et al.

AAAI 2025 · arXiv:2408.16729 · 6 citations

SALAD: Skeleton-aware Latent Diffusion for Text-driven Motion Generation and Editing

Seokhyeon Hong, Chaelin Kim, Serin Yoon et al.

CVPR 2025 · arXiv:2503.13836 · 14 citations

ViLU: Learning Vision-Language Uncertainties for Failure Prediction

Marc Lafon, Yannis Karmim, Julio Silva-Rodríguez et al.

ICCV 2025 · arXiv:2507.07620 · 2 citations

AugDETR: Improving Multi-scale Learning for Detection Transformer

Jinpeng Dong, Yutong Lin, Chen Li et al.

ECCV 2024 · 4 citations

Commonsense for Zero-Shot Natural Language Video Localization

Meghana Holla, Ismini Lourentzou

AAAI 2024 · arXiv:2312.17429 · 5 citations

Compositional Text-to-Image Synthesis with Attention Map Control of Diffusion Models

Ruichen Wang, Zekang Chen, Chen Chen et al.

AAAI 2024 · arXiv:2305.13921 · 93 citations

Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model

Danni Yang, Ruohan Dong, Jiayi Ji et al.

ECCV 2024 · arXiv:2407.05352 · 9 citations

Revealing Vision-Language Integration in the Brain with Multimodal Networks

Vighnesh Subramaniam, Colin Conwell, Christopher Wang et al.

ICML 2024 · arXiv:2406.14481 · 18 citations

Text-Conditioned Resampler For Long Form Video Understanding

Bruno Korbar, Yongqin Xian, Alessio Tonioni et al.

ECCV 2024 · arXiv:2312.11897 · 24 citations

Towards Effective Usage of Human-Centric Priors in Diffusion Models for Text-based Human Image Generation

Junyan Wang, Zhenhong Sun, Stewart Tan et al.

CVPR 2024 · arXiv:2403.05239 · 18 citations

Unveiling and Mitigating Memorization in Text-to-image Diffusion Models through Cross Attention

Jie Ren, Yaxin Li, Shenglai Zeng et al.

ECCV 2024 · arXiv:2403.11052 · 52 citations

Zero-Painter: Training-Free Layout Control for Text-to-Image Synthesis

Marianna Ohanyan, Hayk Manukyan, Zhangyang Wang et al.

CVPR 2024 · arXiv:2406.04032 · 10 citations