Papers by Xinpeng Wang
2 papers found
Refusal Direction is Universal Across Safety-Aligned Languages
Xinpeng Wang, Mingyang Wang, Yihong Liu et al.
NeurIPS 2025, arXiv:2505.17306
5 citations
Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation
Xinpeng Wang, Chengzhi (Martin) Hu, Paul Röttger et al.
ICLR 2025, arXiv:2410.03415
26 citations