"toxicity detection" Papers
2 papers found
An Auditing Test to Detect Behavioral Shift in Language Models
Leo Richter, Xuanli He, Pasquale Minervini et al.
ICLR 2025 (oral), arXiv:2410.19406
2 citations
Characterizing Large Language Model Geometry Helps Solve Toxicity Detection and Generation
Randall Balestriero, Romain Cosentino, Sarath Shekkizhar
ICML 2024, arXiv:2312.01648
6 citations