"jailbreak attacks" Papers

25 papers found

ALMGuard: Safety Shortcuts and Where to Find Them as Guardrails for Audio–Language Models

Weifei Jin, Yuxin Cao, Junjie Su et al.

NeurIPS 2025 · arXiv:2510.26096 · 1 citation

Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs

Zhixin Xie, Xurui Song, Jun Luo

NeurIPS 2025 · arXiv:2510.02833 · 2 citations

Bits Leaked per Query: Information-Theoretic Bounds for Adversarial Attacks on LLMs

Masahiro Kaneko, Timothy Baldwin

NeurIPS 2025 (spotlight) · arXiv:2510.17000

BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks

Yunhan Zhao, Xiang Zheng, Lin Luo et al.

ICLR 2025 · arXiv:2410.20971 · 20 citations

Concept-ROT: Poisoning Concepts in Large Language Models with Model Editing

Keltin Grimes, Marco Christiani, David Shriver et al.

ICLR 2025 · arXiv:2412.13341 · 6 citations

CoP: Agentic Red-teaming for Large Language Models using Composition of Principles

Chen Xiong, Pin-Yu Chen, Tsung-Yi Ho

NeurIPS 2025 · arXiv:2506.00781 · 5 citations

Durable Quantization Conditioned Misalignment Attack on Large Language Models

Peiran Dong, Haowei Li, Song Guo

ICLR 2025 · 2 citations

Efficient Jailbreak Attack Sequences on Large Language Models via Multi-Armed Bandit-Based Context Switching

Aditya Ramesh, Shivam Bhardwaj, Aditya Saibewar et al.

ICLR 2025 · 3 citations

Endless Jailbreaks with Bijection Learning

Brian R.Y. Huang, Max Li, Leonard Tang

ICLR 2025 · arXiv:2410.01294 · 16 citations

FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts

Yichen Gong, Delong Ran, Jinyuan Liu et al.

AAAI 2025 · arXiv:2311.05608 · 302 citations

GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs

Advik Basani, Xiao Zhang

NeurIPS 2025 · arXiv:2411.14133 · 12 citations

Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models

Ma Teng, Xiaojun Jia, Ranjie Duan et al.

ICCV 2025 · arXiv:2412.05934 · 21 citations

IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves

Ruofan Wang, Juncheng Li, Yixu Wang et al.

ICCV 2025 · arXiv:2411.00827 · 9 citations

Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models

Guobin Shen, Dongcheng Zhao, Yiting Dong et al.

ICLR 2025 · arXiv:2410.02298 · 13 citations

Jailbreak-AudioBench: In-Depth Evaluation and Analysis of Jailbreak Threats for Large Audio Language Models

Hao Cheng, Erjia Xiao, Jing Shao et al.

NeurIPS 2025 · arXiv:2501.13772 · 8 citations

Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency

Shiji Zhao, Ranjie Duan, Fengxiang Wang et al.

ICCV 2025 · arXiv:2501.04931 · 30 citations

Perception-Guided Jailbreak Against Text-to-Image Models

Yihao Huang, Le Liang, Tianlin Li et al.

AAAI 2025 · arXiv:2408.10848 · 27 citations

Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models

Biao Yi, Tiansheng Huang, Sishuo Chen et al.

ICLR 2025 · arXiv:2506.16447 · 23 citations

Reasoning as an Adaptive Defense for Safety

Taeyoun Kim, Fahim Tajwar, Aditi Raghunathan et al.

NeurIPS 2025 · arXiv:2507.00971 · 11 citations

Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks

Jiawei Wang, Yushen Zuo, Yuanjun Chai et al.

ICCV 2025 · arXiv:2504.01308

Safety Alignment Should be Made More Than Just a Few Tokens Deep

Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu et al.

ICLR 2025 · arXiv:2406.05946 · 303 citations

Short-length Adversarial Training Helps LLMs Defend Long-length Jailbreak Attacks: Theoretical and Empirical Evidence

Shaopeng Fu, Liang Ding, Jingfeng Zhang et al.

NeurIPS 2025 · arXiv:2502.04204 · 6 citations

T2V-OptJail: Discrete Prompt Optimization for Text-to-Video Jailbreak Attacks

Jiayang Liu, Siyuan Liang, Shiqian Zhao et al.

NeurIPS 2025 · arXiv:2505.06679 · 6 citations

Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast

Xiangming Gu, Xiaosen Zheng, Tianyu Pang et al.

ICML 2024 · arXiv:2402.08567 · 103 citations

Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models

Yifan Li, Hangyu Guo, Kun Zhou et al.

ECCV 2024 · arXiv:2403.09792 · 101 citations