AdvPrefix: An Objective for Nuanced LLM Jailbreaks

arXiv:2412.10321
12 citations
#541 in NeurIPS 2025 (of 5858 papers)

Abstract

Many jailbreak attacks on large language models (LLMs) rely on a common objective: forcing the model to begin its response with the prefix "Sure, here is (harmful request)". While straightforward, this objective has two limitations: it offers limited control over model behavior, yielding incomplete or unrealistic jailbroken responses, and its rigid format hinders optimization. We introduce AdvPrefix, a plug-and-play prefix-forcing objective that selects one or more model-dependent prefixes by combining two criteria: a high prefilling attack success rate and a low negative log-likelihood. AdvPrefix integrates seamlessly into existing jailbreak attacks, mitigating both limitations at no extra cost. For example, replacing GCG's default prefixes on Llama-3 improves the nuanced attack success rate from 14% to 80%, revealing that current safety alignment fails to generalize to new prefixes. Code and the selected prefixes are released.
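The abstract's selection rule can be sketched as a simple ranking: favor candidate prefixes with a high prefilling attack success rate (ASR) and a low negative log-likelihood (NLL) under the target model. The function name, the linear scoring rule, and the toy numbers below are illustrative assumptions, not the paper's exact method.

```python
# Hypothetical sketch of AdvPrefix-style prefix selection.
# Combines two criteria from the abstract: high prefilling ASR, low NLL.
# The additive score asr - nll is an assumed, illustrative combination.

def select_prefixes(candidates, asr, nll, k=1):
    """Return the k prefixes that best trade off high ASR against low NLL.

    candidates: list of prefix strings
    asr: dict mapping prefix -> prefilling attack success rate in [0, 1]
    nll: dict mapping prefix -> negative log-likelihood under the model
    """
    # Higher ASR is better; lower NLL is better, so it enters negatively.
    score = lambda p: asr[p] - nll[p]
    return sorted(candidates, key=score, reverse=True)[:k]

candidates = ["Sure, here is", "Step 1:", "Of course! Here's"]
asr = {"Sure, here is": 0.2, "Step 1:": 0.8, "Of course! Here's": 0.7}
nll = {"Sure, here is": 0.5, "Step 1:": 0.3, "Of course! Here's": 0.9}
print(select_prefixes(candidates, asr, nll, k=1))  # prints ['Step 1:']
```

In practice both quantities would be measured per target model, which is what makes the selected prefixes model-dependent.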

Citation History

Jan 25, 2026: 10
Feb 13, 2026: 12 (+2)