"safe deployment" Papers
2 papers found
Detoxifying Large Language Models via Autoregressive Reward Guided Representation Editing
Yisong Xiao, Aishan Liu, Siyuan Liang et al.
NeurIPS 2025, arXiv:2510.01243
2 citations
Constrained Reinforcement Learning Under Model Mismatch
Zhongchang Sun, Sihong He, Fei Miao et al.
ICML 2024, arXiv:2405.01327
11 citations