"ai safety" Papers
16 papers found
A Black Swan Hypothesis: The Role of Human Irrationality in AI Safety
Hyunin Lee, Chanwoo Park, David Abel et al.
Combining Cost Constrained Runtime Monitors for AI Safety
Tim Hua, James Baskerville, Henri Lemoine et al.
Neural Interactive Proofs
Lewis Hammond, Sam Adam-Day
PolyGuard: A Multilingual Safety Moderation Tool for 17 Languages
Priyanshu Kumar, Devansh Jain, Akhila Yerukola et al.
Position: Require Frontier AI Labs To Release Small "Analog" Models
Shriyash Upadhyay, Philip Quirke, Narmeen Oozeer et al.
Tell me about yourself: LLMs are aware of their learned behaviors
Jan Betley, Xuchan Bao, Martín Soto et al.
Who Speaks for the Trigger? Dynamic Expert Routing in Backdoored Mixture-of-Experts Transformers
Xin Zhao, Xiaojun Chen, Bingshan Liu et al.
AI Control: Improving Safety Despite Intentional Subversion
Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan et al.
Circumventing Concept Erasure Methods For Text-To-Image Generative Models
Minh Pham, Kelly Marshall, Niv Cohen et al.
Feedback Loops With Language Models Drive In-Context Reward Hacking
Alexander Pan, Erik Jones, Meena Jagadeesan et al.
Fundamental Limitations of Alignment in Large Language Models
Yotam Wolf, Noam Wies, Oshri Avnery et al.
Position: Cracking the Code of Cascading Disparity Towards Marginalized Communities
Golnoosh Farnadi, Mohammad Havaei, Negar Rostamzadeh
Position: Explain to Question not to Justify
Przemyslaw Biecek, Wojciech Samek
Position: Near to Mid-term Risks and Opportunities of Open-Source Generative AI
Francisco Eiras, Aleksandar Petrov, Bertie Vidgen et al.
Position: Open-Endedness is Essential for Artificial Superhuman Intelligence
Edward Hughes, Michael Dennis, Jack Parker-Holder et al.
Scalable AI Safety via Doubly-Efficient Debate
Jonah Brown-Cohen, Geoffrey Irving, Georgios Piliouras