"ai safety" Papers
16 papers found
A Black Swan Hypothesis: The Role of Human Irrationality in AI Safety
Hyunin Lee, Chanwoo Park, David Abel et al.
Combining Cost Constrained Runtime Monitors for AI Safety
Tim Hua, James Baskerville, Henri Lemoine et al.
Neural Interactive Proofs
Lewis Hammond, Sam Adam-Day
PolyGuard: A Multilingual Safety Moderation Tool for 17 Languages
Priyanshu Kumar, Devansh Jain, Akhila Yerukola et al.
Position: Require Frontier AI Labs To Release Small "Analog" Models
Shriyash Upadhyay, Philip Quirke, Narmeen Oozeer et al.
Tell me about yourself: LLMs are aware of their learned behaviors
Jan Betley, Xuchan Bao, Martín Soto et al.
Who Speaks for the Trigger? Dynamic Expert Routing in Backdoored Mixture-of-Experts Transformers
Xin Zhao, Xiaojun Chen, Bingshan Liu et al.
AI Control: Improving Safety Despite Intentional Subversion
Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan et al.
Circumventing Concept Erasure Methods For Text-To-Image Generative Models
Minh Pham, Kelly Marshall, Niv Cohen et al.
Feedback Loops With Language Models Drive In-Context Reward Hacking
Alexander Pan, Erik Jones, Meena Jagadeesan et al.
Fundamental Limitations of Alignment in Large Language Models
Yotam Wolf, Noam Wies, Oshri Avnery et al.
Position: Cracking the Code of Cascading Disparity Towards Marginalized Communities
Golnoosh Farnadi, Mohammad Havaei, Negar Rostamzadeh
Position: Explain to Question not to Justify
Przemyslaw Biecek, Wojciech Samek
Position: Near to Mid-term Risks and Opportunities of Open-Source Generative AI
Francisco Eiras, Aleksandar Petrov, Bertie Vidgen et al.
Position: Open-Endedness is Essential for Artificial Superhuman Intelligence
Edward Hughes, Michael Dennis, Jack Parker-Holder et al.
Scalable AI Safety via Doubly-Efficient Debate
Jonah Brown-Cohen, Geoffrey Irving, Georgios Piliouras