"toxicity reduction" Papers
3 papers found
Conference
Controlling Large Language Models Through Concept Activation Vectors
Hanyu Zhang, Xiting Wang, Chengao Li et al.
AAAI 2025, arXiv:2501.05764
20 citations
Large Language Models can Become Strong Self-Detoxifiers
Ching-Yun Ko, Pin-Yu Chen, Payel Das et al.
ICLR 2025
3 citations
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
Andrew Lee, Xiaoyan Bai, Itamar Pres et al.
ICML 2024, arXiv:2401.01967
165 citations