Papers by Martín Soto
2 papers found
Conference
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Jan Betley, Daniel Tan, Niels Warncke et al.
ICML 2025 (oral), arXiv:2502.17424
108 citations
Tell me about yourself: LLMs are aware of their learned behaviors
Jan Betley, Xuchan Bao, Martín Soto et al.
ICLR 2025 (oral), arXiv:2501.11120
59 citations