Papers by Tomek Korbak
7 papers found
Fundamental Limitations in Pointwise Defences of LLM Finetuning APIs
Xander Davies, Eric Winsor, Alexandra Souly et al.
NeurIPS 2025 · arXiv:2502.14828 · 10 citations
Inverse Scaling: When Bigger Isn't Better
Joe Cavanagh, Andrew Gritsevskiy, Najoung Kim et al.
ICLR 2025 · arXiv:2306.09479 · 186 citations
Looking Inward: Language Models Can Learn About Themselves by Introspection
Felix Jedidja Binder, James Chua, Tomek Korbak et al.
ICLR 2025 (oral) · arXiv:2410.13787 · 44 citations
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Javier Rando, Tony Wang, Stewart Slocum et al.
ICLR 2025 · arXiv:2307.15217 · 750 citations
Compositional Preference Models for Aligning LMs
Dongyoung Go, Tomek Korbak, Germán Kruszewski et al.
ICLR 2024 · arXiv:2310.13011 · 25 citations
The Reversal Curse: LLMs trained on “A is B” fail to learn “B is A”
Lukas Berglund, Meg Tong, Maximilian Kaufmann et al.
ICLR 2024
Towards Understanding Sycophancy in Language Models
Mrinank Sharma, Meg Tong, Tomek Korbak et al.
ICLR 2024 · arXiv:2310.13548 · 526 citations