"jailbreaking attacks" Papers

15 papers found

$R^2$-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning

Mintong Kang, Bo Li

ICLR 2025, arXiv:2407.05557
34 citations

AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs

Anselm Paulus, Arman Zharmagambetov, Chuan Guo et al.

ICML 2025, arXiv:2404.16873
132 citations

Attention! Your Vision Language Model Could Be Maliciously Manipulated

Xiaosen Wang, Shaokang Wang, Zhijin Ge et al.

NeurIPS 2025, arXiv:2505.19911
3 citations

BadRobot: Jailbreaking Embodied LLM Agents in the Physical World

Hangtao Zhang, Chenyu Zhu, Xianlong Wang et al.

ICLR 2025
6 citations

HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models

Seanie Lee, Haebin Seong, Dong Bok Lee et al.

ICLR 2025, arXiv:2410.01524
15 citations

Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks

Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion

ICLR 2025, arXiv:2404.02151
401 citations

Lifelong Safety Alignment for Language Models

Haoyu Wang, Yifei Zhao, Zeyu Qin et al.

NeurIPS 2025, arXiv:2505.20259
7 citations

Persistent Pre-training Poisoning of LLMs

Yiming Zhang, Javier Rando, Ivan Evtimov et al.

ICLR 2025, arXiv:2410.13722
38 citations

ProAdvPrompter: A Two-Stage Journey to Effective Adversarial Prompting for LLMs

Hao Di, Tong He, Haishan Ye et al.

ICLR 2025
2 citations

Understanding and Enhancing the Transferability of Jailbreaking Attacks

Runqi Lin, Bo Han, Fengwang Li et al.

ICLR 2025, arXiv:2502.03052
18 citations

Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications

Boyi Wei, Kaixuan Huang, Yangsibo Huang et al.

ICML 2024, arXiv:2402.05162
184 citations

Fast Adversarial Attacks on Language Models In One GPU Minute

Vinu Sankar Sadasivan, Shoumik Saha, Gaurang Sriramanan et al.

ICML 2024, arXiv:2402.15570
72 citations

RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content

Zhuowen Yuan, Zidi Xiong, Yi Zeng et al.

ICML 2024, arXiv:2403.13031
67 citations

Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models

Yongshuo Zong, Ondrej Bohdal, Tingyang Yu et al.

ICML 2024, arXiv:2402.02207
123 citations

The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models?

Qinyu Zhao, Ming Xu, Kartik Gupta et al.

ECCV 2024, arXiv:2403.09037
15 citations