"model safety" Papers
7 papers found
On Large Language Model Continual Unlearning
Chongyang Gao, Lixu Wang, Kaize Ding et al.
ICLR 2025 · arXiv:2407.10223
30 citations
On the Role of Attention Heads in Large Language Model Safety
Zhenhong Zhou, Haiyang Yu, Xinghua Zhang et al.
ICLR 2025 · arXiv:2410.13708
43 citations
Robust LLM safeguarding via refusal feature adversarial training
Lei Yu, Virginie Do, Karen Hambardzumyan et al.
ICLR 2025 · arXiv:2409.20089
45 citations
VLMs can Aggregate Scattered Training Patches
Zhanhui Zhou, Lingjie Chen, Chao Yang et al.
NeurIPS 2025 · arXiv:2506.03614
Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
Danny Halawi, Alexander Wei, Eric Wallace et al.
ICML 2024 · arXiv:2406.20053
65 citations
Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models
Yifan Li, Hangyu Guo, Kun Zhou et al.
ECCV 2024 · arXiv:2403.09792
101 citations
Position: Building Guardrails for Large Language Models Requires Systematic Design
Yi Dong, Ronghui Mu, Gaojie Jin et al.
ICML 2024