Poster "safety alignment" Papers

26 papers found

A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement

Hui Yuan, Yifan Zeng, Yue Wu et al.

ICLR 2025 · arXiv:2410.13828
5 citations

Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs

Zhixin Xie, Xurui Song, Jun Luo

NeurIPS 2025 · arXiv:2510.02833
2 citations

Can a Large Language Model be a Gaslighter?

Wei Li, Luyao Zhu, Yang Song et al.

ICLR 2025 · arXiv:2410.09181
2 citations

Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements

Jingyu Zhang, Ahmed Elgohary Ghoneim, Ahmed Magooda et al.

ICLR 2025 · arXiv:2410.08968
24 citations

CoP: Agentic Red-teaming for Large Language Models using Composition of Principles

Chen Xiong, Pin-Yu Chen, Tsung-Yi Ho

NeurIPS 2025 · arXiv:2506.00781
5 citations

Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models

Rui Ye, Jingyi Chai, Xiangrui Liu et al.

ICLR 2025 · arXiv:2406.10630
18 citations

Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization

Xiyue Peng, Hengquan Guo, Jiawei Zhang et al.

NeurIPS 2025 · arXiv:2410.19933
5 citations

Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models

Shuyang Hao, Bryan Hooi, Jun Liu et al.

CVPR 2025 · arXiv:2411.18000
6 citations

Failures to Find Transferable Image Jailbreaks Between Vision-Language Models

Rylan Schaeffer, Dan Valentine, Luke Bailey et al.

ICLR 2025 · arXiv:2407.15211
24 citations

From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring

Yang Li, Qiang Sheng, Yehan Yang et al.

NeurIPS 2025 · arXiv:2506.09996
8 citations

Improved Techniques for Optimization-Based Jailbreaking on Large Language Models

Xiaojun Jia, Tianyu Pang, Chao Du et al.

ICLR 2025 · arXiv:2405.21018
85 citations

Jailbreak-AudioBench: In-Depth Evaluation and Analysis of Jailbreak Threats for Large Audio Language Models

Hao Cheng, Erjia Xiao, Jing Shao et al.

NeurIPS 2025 · arXiv:2501.13772
8 citations

Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks

Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion

ICLR 2025 · arXiv:2404.02151
401 citations

Lifelong Safety Alignment for Language Models

Haoyu Wang, Yifei Zhao, Zeyu Qin et al.

NeurIPS 2025 · arXiv:2505.20259
7 citations

Neither Valid nor Reliable? Investigating the Use of LLMs as Judges

Khaoula Chehbouni, Mohammed Haddou, Jackie CK Cheung et al.

NeurIPS 2025 · arXiv:2508.18076
11 citations

Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models

Biao Yi, Tiansheng Huang, Sishuo Chen et al.

ICLR 2025 · arXiv:2506.16447
23 citations

Safe RLHF-V: Safe Reinforcement Learning from Multi-modal Human Feedback

Jiaming Ji, Xinyu Chen, Rui Pan et al.

NeurIPS 2025 · arXiv:2503.17682
9 citations

Safety Alignment Should be Made More Than Just a Few Tokens Deep

Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu et al.

ICLR 2025 · arXiv:2406.05946
303 citations

Safety Depth in Large Language Models: A Markov Chain Perspective

Ching-Chia Kao, Chia-Mu Yu, Chun-Shien Lu et al.

NeurIPS 2025
1 citation

SafeVid: Toward Safety Aligned Video Large Multimodal Models

Yixu Wang, Jiaxin Song, Yifeng Gao et al.

NeurIPS 2025 · arXiv:2505.11926
4 citations

Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons

Jianhui Chen, Xiaozhi Wang, Zijun Yao et al.

NeurIPS 2025 · arXiv:2406.14144
24 citations

Understanding and Enhancing Safety Mechanisms of LLMs via Safety-Specific Neuron

Yiran Zhao, Wenxuan Zhang, Yuxi Xie et al.

ICLR 2025
29 citations

Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications

Boyi Wei, Kaixuan Huang, Yangsibo Huang et al.

ICML 2024 · arXiv:2402.05162
184 citations

Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression

Junyuan Hong, Jinhao Duan, Chenhui Zhang et al.

ICML 2024 · arXiv:2403.15447
49 citations

PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition

Ziyang Zhang, Qizhen Zhang, Jakob Foerster

ICML 2024 · arXiv:2405.07932
34 citations

Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models

Yongshuo Zong, Ondrej Bohdal, Tingyang Yu et al.

ICML 2024 · arXiv:2402.02207
123 citations