"toxicity mitigation" Papers
4 papers found
Controlling Language and Diffusion Models by Transporting Activations
Pau Rodriguez, Arno Blaas, Michal Klein et al.
ICLR 2025, arXiv:2410.23054
22 citations
Detoxifying Large Language Models via Autoregressive Reward Guided Representation Editing
Yisong Xiao, Aishan Liu, Siyuan Liang et al.
NeurIPS 2025, arXiv:2510.01243
2 citations
Learning and Forgetting Unsafe Examples in Large Language Models
Jiachen Zhao, Zhun Deng, David Madras et al.
ICML 2024 (oral), arXiv:2312.12736
25 citations
Whispering Experts: Neural Interventions for Toxicity Mitigation in Language Models
Xavi Suau, Pieter Delobelle, Katherine Metcalf et al.
ICML 2024, arXiv:2407.12824
28 citations