Poster "ai safety" Papers

14 papers found

A Black Swan Hypothesis: The Role of Human Irrationality in AI Safety

Hyunin Lee, Chanwoo Park, David Abel et al.

ICLR 2025, arXiv:2407.18422
4 citations

Combining Cost Constrained Runtime Monitors for AI Safety

Tim Hua, James Baskerville, Henri Lemoine et al.

NeurIPS 2025, arXiv:2507.15886
9 citations

Neural Interactive Proofs

Lewis Hammond, Sam Adam-Day

ICLR 2025, arXiv:2412.08897
5 citations

Position: Require Frontier AI Labs To Release Small "Analog" Models

Shriyash Upadhyay, Philip Quirke, Narmeen Oozeer et al.

NeurIPS 2025

Who Speaks for the Trigger? Dynamic Expert Routing in Backdoored Mixture-of-Experts Transformers

Xin Zhao, Xiaojun Chen, Bingshan Liu et al.

NeurIPS 2025, arXiv:2510.13462

AI Control: Improving Safety Despite Intentional Subversion

Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan et al.

ICML 2024, arXiv:2312.06942
110 citations

Circumventing Concept Erasure Methods For Text-To-Image Generative Models

Minh Pham, Kelly Marshall, Niv Cohen et al.

ICLR 2024, arXiv:2308.01508
70 citations

Feedback Loops With Language Models Drive In-Context Reward Hacking

Alexander Pan, Erik Jones, Meena Jagadeesan et al.

ICML 2024, arXiv:2402.06627
60 citations

Fundamental Limitations of Alignment in Large Language Models

Yotam Wolf, Noam Wies, Oshri Avnery et al.

ICML 2024, arXiv:2304.11082
178 citations

Position: Cracking the Code of Cascading Disparity Towards Marginalized Communities

Golnoosh Farnadi, Mohammad Havaei, Negar Rostamzadeh

ICML 2024, arXiv:2406.01757
3 citations

Position: Explain to Question not to Justify

Przemyslaw Biecek, Wojciech Samek

ICML 2024, arXiv:2402.13914
27 citations

Position: Near to Mid-term Risks and Opportunities of Open-Source Generative AI

Francisco Eiras, Aleksandar Petrov, Bertie Vidgen et al.

ICML 2024

Position: Open-Endedness is Essential for Artificial Superhuman Intelligence

Edward Hughes, Michael Dennis, Jack Parker-Holder et al.

ICML 2024

Scalable AI Safety via Doubly-Efficient Debate

Jonah Brown-Cohen, Geoffrey Irving, Georgios Piliouras

ICML 2024, arXiv:2311.14125
39 citations