"safety mechanisms" Papers
9 papers found
AutoPrompt: Automated Red-Teaming of Text-to-Image Models via LLM-Driven Adversarial Prompts
Yufan Liu, Wanqian Zhang, Huashan Chen et al.
ICCV 2025 · arXiv:2510.24034
1 citation
Finding and Reactivating Post-Trained LLMs' Hidden Safety Mechanisms
Mingjie Li, Wai Man Si, Michael Backes et al.
NEURIPS 2025
1 citation
Is Your Multimodal Language Model Oversensitive to Safe Queries?
Xirui Li, Hengguang Zhou, Ruochen Wang et al.
ICLR 2025 · arXiv:2406.17806
23 citations
LLMs Encode Harmfulness and Refusal Separately
Jiachen Zhao, Jing Huang, Zhengxuan Wu et al.
NEURIPS 2025 · arXiv:2507.11878
10 citations
Red-Teaming Text-to-Image Systems by Rule-based Preference Modeling
Yichuan Cao, Yibo Miao, Xiao-Shan Gao et al.
NEURIPS 2025 · arXiv:2505.21074
2 citations
Safety Depth in Large Language Models: A Markov Chain Perspective
Ching-Chia Kao, Chia-Mu Yu, Chun-Shien Lu et al.
NEURIPS 2025
1 citation
Transstratal Adversarial Attack: Compromising Multi-Layered Defenses in Text-to-Image Models
Chunlong Xie, Kangjie Chen, Shangwei Guo et al.
NEURIPS 2025 (spotlight)
Concept Arithmetics for Circumventing Concept Inhibition in Diffusion Models
Vitali Petsiuk, Kate Saenko
ECCV 2024 · arXiv:2404.13706
8 citations
Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts
Zhi-Yi Chin, Chieh Ming Jiang, Ching-Chun Huang et al.
ICML 2024 · arXiv:2309.06135
132 citations