"safety guardrails" Papers
3 papers found
Conference
GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs
Advik Basani, Xiao Zhang
NEURIPS 2025arXiv:2411.14133
12
citations
GuardAgent: Safeguard LLM Agents via Knowledge-Enabled Reasoning
Zhen Xiang, Linzhi Zheng, Yanjie Li et al.
ICML 2025
69
citations
On the Role of Attention Heads in Large Language Model Safety
Zhenhong Zhou, Haiyang Yu, Xinghua Zhang et al.
ICLR 2025arXiv:2410.13708
43
citations