Poster "safety alignment" Papers

26 papers found

A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement

Hui Yuan, Yifan Zeng, Yue Wu et al.

ICLR 2025 · arXiv:2410.13828
5 citations

Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs

Zhixin Xie, Xurui Song, Jun Luo

NeurIPS 2025 · arXiv:2510.02833
2 citations

Can a Large Language Model be a Gaslighter?

Wei Li, Luyao Zhu, Yang Song et al.

ICLR 2025 · arXiv:2410.09181
2 citations

Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements

Jingyu Zhang, Ahmed Elgohary Ghoneim, Ahmed Magooda et al.

ICLR 2025 · arXiv:2410.08968
24 citations

CoP: Agentic Red-teaming for Large Language Models using Composition of Principles

Chen Xiong, Pin-Yu Chen, Tsung-Yi Ho

NeurIPS 2025 · arXiv:2506.00781
5 citations

Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models

Rui Ye, Jingyi Chai, Xiangrui Liu et al.

ICLR 2025 · arXiv:2406.10630
18 citations

Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization

Xiyue Peng, Hengquan Guo, Jiawei Zhang et al.

NeurIPS 2025 · arXiv:2410.19933
5 citations

Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models

Shuyang Hao, Bryan Hooi, Jun Liu et al.

CVPR 2025 · arXiv:2411.18000
6 citations

Failures to Find Transferable Image Jailbreaks Between Vision-Language Models

Rylan Schaeffer, Dan Valentine, Luke Bailey et al.

ICLR 2025 · arXiv:2407.15211
24 citations

From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring

Yang Li, Qiang Sheng, Yehan Yang et al.

NeurIPS 2025 · arXiv:2506.09996
8 citations

Improved Techniques for Optimization-Based Jailbreaking on Large Language Models

Xiaojun Jia, Tianyu Pang, Chao Du et al.

ICLR 2025 · arXiv:2405.21018
85 citations

Jailbreak-AudioBench: In-Depth Evaluation and Analysis of Jailbreak Threats for Large Audio Language Models

Hao Cheng, Erjia Xiao, Jing Shao et al.

NeurIPS 2025 · arXiv:2501.13772
8 citations

Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks

Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion

ICLR 2025 · arXiv:2404.02151
401 citations

Lifelong Safety Alignment for Language Models

Haoyu Wang, Yifei Zhao, Zeyu Qin et al.

NeurIPS 2025 · arXiv:2505.20259
7 citations

Neither Valid nor Reliable? Investigating the Use of LLMs as Judges

Khaoula Chehbouni, Mohammed Haddou, Jackie CK Cheung et al.

NeurIPS 2025 · arXiv:2508.18076
11 citations

Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models

Biao Yi, Tiansheng Huang, Sishuo Chen et al.

ICLR 2025 · arXiv:2506.16447
23 citations

Safe RLHF-V: Safe Reinforcement Learning from Multi-modal Human Feedback

Jiaming Ji, Xinyu Chen, Rui Pan et al.

NeurIPS 2025 · arXiv:2503.17682
9 citations

Safety Alignment Should be Made More Than Just a Few Tokens Deep

Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu et al.

ICLR 2025 · arXiv:2406.05946
303 citations

Safety Depth in Large Language Models: A Markov Chain Perspective

Ching-Chia Kao, Chia-Mu Yu, Chun-Shien Lu et al.

NeurIPS 2025
1 citation

SafeVid: Toward Safety Aligned Video Large Multimodal Models

Yixu Wang, Jiaxin Song, Yifeng Gao et al.

NeurIPS 2025 · arXiv:2505.11926
4 citations

Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons

Jianhui Chen, Xiaozhi Wang, Zijun Yao et al.

NeurIPS 2025 · arXiv:2406.14144
24 citations

Understanding and Enhancing Safety Mechanisms of LLMs via Safety-Specific Neuron

Yiran Zhao, Wenxuan Zhang, Yuxi Xie et al.

ICLR 2025
29 citations

Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications

Boyi Wei, Kaixuan Huang, Yangsibo Huang et al.

ICML 2024 · arXiv:2402.05162
184 citations

Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression

Junyuan Hong, Jinhao Duan, Chenhui Zhang et al.

ICML 2024 · arXiv:2403.15447
49 citations

PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition

Ziyang Zhang, Qizhen Zhang, Jakob Foerster

ICML 2024 · arXiv:2405.07932
34 citations

Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models

Yongshuo Zong, Ondrej Bohdal, Tingyang Yu et al.

ICML 2024 · arXiv:2402.02207
123 citations