Poster "large language model safety" Papers
3 papers found
FALCON: Fine-grained Activation Manipulation by Contrastive Orthogonal Unalignment for Large Language Model
Jinwei Hu, Zhenglin Huang, Xiangyu Yin et al.
NeurIPS 2025 · arXiv:2502.01472 · 1 citation
From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring
Yang Li, Qiang Sheng, Yehan Yang et al.
NeurIPS 2025 · arXiv:2506.09996 · 8 citations
The WMDP Benchmark: Measuring and Reducing Malicious Use with Unlearning
Nathaniel Li, Alexander Pan, Anjali Gopal et al.
ICML 2024 · arXiv:2403.03218 · 333 citations