AdvPrefix: An Objective for Nuanced LLM Jailbreaks

arXiv:2412.10321
12 citations
#541 in NeurIPS 2025 (of 5858 papers)

Abstract

Many jailbreak attacks on large language models (LLMs) rely on a common objective: forcing the model to begin its response with the prefix "Sure, here is (harmful request)". While straightforward, this objective has two limitations: it offers limited control over model behavior, yielding incomplete or unrealistic jailbroken responses, and its rigid format hinders optimization. We introduce AdvPrefix, a plug-and-play prefix-forcing objective that selects one or more model-dependent prefixes by combining two criteria: a high prefilling attack success rate and a low negative log-likelihood. AdvPrefix integrates seamlessly into existing jailbreak attacks, mitigating both limitations at no extra cost. For example, replacing GCG's default prefixes on Llama-3 improves the nuanced attack success rate from 14% to 80%, revealing that current safety alignment fails to generalize to new prefixes. Code and the selected prefixes are released.
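The abstract's selection rule can be sketched as a simple ranking: favor candidate prefixes with a high prefilling attack success rate (ASR) and a low negative log-likelihood (NLL) under the target model. The function name, the linear scoring rule, and the toy numbers below are illustrative assumptions, not the paper's exact method.

```python
# Hypothetical sketch of AdvPrefix-style prefix selection.
# Combines two criteria from the abstract: high prefilling ASR, low NLL.
# The additive score asr - nll is an assumed, illustrative combination.

def select_prefixes(candidates, asr, nll, k=1):
    """Return the k prefixes that best trade off high ASR against low NLL.

    candidates: list of prefix strings
    asr: dict mapping prefix -> prefilling attack success rate in [0, 1]
    nll: dict mapping prefix -> negative log-likelihood under the model
    """
    # Higher ASR is better; lower NLL is better, so it enters negatively.
    score = lambda p: asr[p] - nll[p]
    return sorted(candidates, key=score, reverse=True)[:k]

candidates = ["Sure, here is", "Step 1:", "Of course! Here's"]
asr = {"Sure, here is": 0.2, "Step 1:": 0.8, "Of course! Here's": 0.7}
nll = {"Sure, here is": 0.5, "Step 1:": 0.3, "Of course! Here's": 0.9}
print(select_prefixes(candidates, asr, nll, k=1))  # prints ['Step 1:']
```

In practice both quantities would be measured per target model, which is what makes the selected prefixes model-dependent.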

Citation History

Jan 25, 2026: 10
Feb 13, 2026: 12 (+2)