Poster papers matching "harmful content generation"

12 papers found

AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs

Anselm Paulus, Arman Zharmagambetov, Chuan Guo et al.

ICML 2025 · arXiv:2404.16873
132 citations

DarkBench: Benchmarking Dark Patterns in Large Language Models

Esben Kran, Hieu Minh Nguyen, Akash Kundu et al.

ICLR 2025 · arXiv:2503.10728
18 citations

Durable Quantization Conditioned Misalignment Attack on Large Language Models

Peiran Dong, Haowei Li, Song Guo

ICLR 2025
2 citations

Fantastic Targets for Concept Erasure in Diffusion Models and Where To Find Them

Anh Bui, Thuy-Trang Vu, Long Vuong et al.

ICLR 2025 · arXiv:2501.18950
20 citations

Information Retrieval Induced Safety Degradation in AI Agents

Cheng Yu, Benedikt Stroebl, Diyi Yang et al.

NeurIPS 2025 · arXiv:2505.14215

One Head to Rule Them All: Amplifying LVLM Safety through a Single Critical Attention Head

Junhao Xia, Haotian Zhu, Shuchao Pang et al.

NeurIPS 2025

PalmBench: A Comprehensive Benchmark of Compressed Large Language Models on Mobile Platforms

Yilong Li, Jingyu Liu, Hao Zhang et al.

ICLR 2025 · arXiv:2410.05315
7 citations

ProAdvPrompter: A Two-Stage Journey to Effective Adversarial Prompting for LLMs

Hao Di, Tong He, Haishan Ye et al.

ICLR 2025
2 citations

Training-Free Safe Text Embedding Guidance for Text-to-Image Diffusion Models

Byeonghu Na, Mina Kang, Jiseok Kwak et al.

NeurIPS 2025 · arXiv:2510.24012

VLMs can Aggregate Scattered Training Patches

Zhanhui Zhou, Lingjie Chen, Chao Yang et al.

NeurIPS 2025 · arXiv:2506.03614

Safeguard Text-to-Image Diffusion Models with Human Feedback Inversion

Sanghyun Kim, Seohyeon Jung, Balhae Kim et al.

ECCV 2024 · arXiv:2407.21032
10 citations

Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models

Yongshuo Zong, Ondrej Bohdal, Tingyang Yu et al.

ICML 2024 · arXiv:2402.02207
123 citations