Poster "ai safety" Papers
14 papers found
A Black Swan Hypothesis: The Role of Human Irrationality in AI Safety
Hyunin Lee, Chanwoo Park, David Abel et al.
ICLR 2025 · arXiv:2407.18422
4 citations
Combining Cost Constrained Runtime Monitors for AI Safety
Tim Hua, James Baskerville, Henri Lemoine et al.
NeurIPS 2025 · arXiv:2507.15886
9 citations
Neural Interactive Proofs
Lewis Hammond, Sam Adam-Day
ICLR 2025 · arXiv:2412.08897
5 citations
Position: Require Frontier AI Labs To Release Small "Analog" Models
Shriyash Upadhyay, Philip Quirke, Narmeen Oozeer et al.
NeurIPS 2025
Who Speaks for the Trigger? Dynamic Expert Routing in Backdoored Mixture-of-Experts Transformers
Xin Zhao, Xiaojun Chen, Bingshan Liu et al.
NeurIPS 2025 · arXiv:2510.13462
AI Control: Improving Safety Despite Intentional Subversion
Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan et al.
ICML 2024 · arXiv:2312.06942
110 citations
Circumventing Concept Erasure Methods For Text-To-Image Generative Models
Minh Pham, Kelly Marshall, Niv Cohen et al.
ICLR 2024 · arXiv:2308.01508
70 citations
Feedback Loops With Language Models Drive In-Context Reward Hacking
Alexander Pan, Erik Jones, Meena Jagadeesan et al.
ICML 2024 · arXiv:2402.06627
60 citations
Fundamental Limitations of Alignment in Large Language Models
Yotam Wolf, Noam Wies, Oshri Avnery et al.
ICML 2024 · arXiv:2304.11082
178 citations
Position: Cracking the Code of Cascading Disparity Towards Marginalized Communities
Golnoosh Farnadi, Mohammad Havaei, Negar Rostamzadeh
ICML 2024 · arXiv:2406.01757
3 citations
Position: Explain to Question not to Justify
Przemyslaw Biecek, Wojciech Samek
ICML 2024 · arXiv:2402.13914
27 citations
Position: Near to Mid-term Risks and Opportunities of Open-Source Generative AI
Francisco Eiras, Aleksandar Petrov, Bertie Vidgen et al.
ICML 2024
Position: Open-Endedness is Essential for Artificial Superhuman Intelligence
Edward Hughes, Michael Dennis, Jack Parker-Holder et al.
ICML 2024
Scalable AI Safety via Doubly-Efficient Debate
Jonah Brown-Cohen, Geoffrey Irving, Georgios Piliouras
ICML 2024 · arXiv:2311.14125
39 citations