"refusal behavior" Papers
2 papers found
Conference
SCANS: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering
Zouying Cao, Yifei Yang, Hai Zhao
AAAI 2025, arXiv:2408.11491
23 citations
Shh, don't say that! Domain Certification in LLMs
Cornelius Emde, Alasdair Paren, Preetham Arvind et al.
ICLR 2025, arXiv:2502.19320
4 citations