Poster "harmful behavior mitigation" Papers
2 papers found
Finding and Reactivating Post-Trained LLMs' Hidden Safety Mechanisms
Mingjie Li, Wai Man Si, Michael Backes et al.
NeurIPS 2025
1 citation
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Mantas Mazeika, Long Phan, Xuwang Yin et al.
ICML 2024, arXiv:2402.04249
802 citations