"large language model safety" Papers
4 papers found
Adaptive Defense against Harmful Fine-Tuning for Large Language Models via Bayesian Data Scheduler
Zixuan Hu, Li Shen, Zhenyi Wang et al.
NeurIPS 2025 (Spotlight) · arXiv:2510.27172 · 3 citations
FALCON: Fine-grained Activation Manipulation by Contrastive Orthogonal Unalignment for Large Language Model
Jinwei Hu, Zhenglin Huang, Xiangyu Yin et al.
NeurIPS 2025 · arXiv:2502.01472 · 1 citation
From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring
Yang Li, Qiang Sheng, Yehan Yang et al.
NeurIPS 2025 · arXiv:2506.09996 · 8 citations
The WMDP Benchmark: Measuring and Reducing Malicious Use with Unlearning
Nathaniel Li, Alexander Pan, Anjali Gopal et al.
ICML 2024 · arXiv:2403.03218 · 333 citations