Poster "harmful content generation" Papers
12 papers found
AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs
Anselm Paulus, Arman Zharmagambetov, Chuan Guo et al.
ICML 2025 · arXiv:2404.16873
132 citations
DarkBench: Benchmarking Dark Patterns in Large Language Models
Esben Kran, Hieu Minh Nguyen, Akash Kundu et al.
ICLR 2025 · arXiv:2503.10728
18 citations
Durable Quantization Conditioned Misalignment Attack on Large Language Models
Peiran Dong, Haowei Li, Song Guo
ICLR 2025
2 citations
Fantastic Targets for Concept Erasure in Diffusion Models and Where To Find Them
Anh Bui, Thuy-Trang Vu, Long Vuong et al.
ICLR 2025 · arXiv:2501.18950
20 citations
Information Retrieval Induced Safety Degradation in AI Agents
Cheng Yu, Benedikt Stroebl, Diyi Yang et al.
NeurIPS 2025 · arXiv:2505.14215
One Head to Rule Them All: Amplifying LVLM Safety through a Single Critical Attention Head
Junhao Xia, Haotian Zhu, Shuchao Pang et al.
NeurIPS 2025
PalmBench: A Comprehensive Benchmark of Compressed Large Language Models on Mobile Platforms
Yilong Li, Jingyu Liu, Hao Zhang et al.
ICLR 2025 · arXiv:2410.05315
7 citations
ProAdvPrompter: A Two-Stage Journey to Effective Adversarial Prompting for LLMs
Hao Di, Tong He, Haishan Ye et al.
ICLR 2025
2 citations
Training-Free Safe Text Embedding Guidance for Text-to-Image Diffusion Models
Byeonghu Na, Mina Kang, Jiseok Kwak et al.
NeurIPS 2025 · arXiv:2510.24012
VLMs can Aggregate Scattered Training Patches
Zhanhui Zhou, Lingjie Chen, Chao Yang et al.
NeurIPS 2025 · arXiv:2506.03614
Safeguard Text-to-Image Diffusion Models with Human Feedback Inversion
Sanghyun Kim, Seohyeon Jung, Balhae Kim et al.
ECCV 2024 · arXiv:2407.21032
10 citations
Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models
Yongshuo Zong, Ondrej Bohdal, Tingyang Yu et al.
ICML 2024 · arXiv:2402.02207
123 citations