RobustKV: Defending Large Language Models against Jailbreak Attacks via KV Eviction

arXiv:2410.19937
ICLR 2025
22 citations (#800 of 3827 papers in ICLR 2025)

Abstract

Jailbreak attacks circumvent LLMs' built-in safeguards by concealing harmful queries within adversarial prompts. While most existing defenses attempt to mitigate the effects of adversarial prompts, they often prove inadequate as adversarial prompts can take arbitrary, adaptive forms. This paper introduces RobustKV, a novel jailbreak defense that takes a fundamentally different approach by selectively removing critical tokens of harmful queries from key-value (KV) caches. Intuitively, for an adversarial prompt to be effective, its tokens must achieve sufficient 'importance' (measured by attention scores), which consequently lowers the importance of tokens in the concealed harmful query. Therefore, by carefully evicting the KVs of low-ranked tokens, RobustKV minimizes the harmful query's presence in the KV cache, thus preventing the LLM from generating informative responses. Extensive evaluation using benchmark datasets and models demonstrates that RobustKV effectively counters state-of-the-art jailbreak attacks while maintaining the LLM's performance on benign queries. Notably, RobustKV creates an interesting effectiveness-evasiveness dilemma for the adversary, leading to its robustness against adaptive attacks. (Warning: This paper contains potentially harmful content generated by LLMs.)
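
The eviction idea in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the head-averaged scoring, the single-layer view, the `evict_ratio` parameter, and the function names `token_importance` and `evict_low_ranked` are all assumptions made for illustration. It only shows the general pattern of ranking prompt tokens by attention-based importance and dropping the KV entries of the lowest-ranked ones.

```python
from typing import List, Sequence


def token_importance(attn: Sequence[Sequence[float]]) -> List[float]:
    """Average attention each prompt token receives across heads.

    attn[h][t] is the attention weight head h assigns to token t.
    This layout is hypothetical and chosen only for illustration.
    """
    num_heads = len(attn)
    num_tokens = len(attn[0])
    return [
        sum(attn[h][t] for h in range(num_heads)) / num_heads
        for t in range(num_tokens)
    ]


def evict_low_ranked(scores: Sequence[float], evict_ratio: float) -> List[int]:
    """Return the indices of tokens to keep, in their original order."""
    num_evict = int(len(scores) * evict_ratio)
    # Rank tokens from least to most important; ties are broken by position.
    ranked = sorted(range(len(scores)), key=lambda t: (scores[t], t))
    evicted = set(ranked[:num_evict])
    return [t for t in range(len(scores)) if t not in evicted]


# Toy attention weights for 2 heads over 5 prompt tokens (made-up values).
attn = [
    [0.30, 0.25, 0.05, 0.05, 0.35],
    [0.40, 0.20, 0.10, 0.00, 0.30],
]
scores = token_importance(attn)
kept = evict_low_ranked(scores, evict_ratio=0.4)

# Stand-in for per-token KV entries; a real cache would hold tensors.
kv_cache = ["kv0", "kv1", "kv2", "kv3", "kv4"]
pruned_cache = [kv_cache[t] for t in kept]
```

In this toy input, tokens 2 and 3 receive the least attention, so their entries are the ones removed. Under the abstract's intuition, the tokens of a concealed harmful query would tend to fall into this low-ranked group when the surrounding adversarial prompt dominates the attention.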

Citation History

Jan 25, 2026: 0
Jan 26, 2026: 0
Jan 28, 2026: 0
Feb 13, 2026: 22 (+22)