"learning rate warmup" Papers
3 papers found
Conference
What Does It Mean to Be a Transformer? Insights from a Theoretical Hessian Analysis
Weronika Ormaniec, Felix Dangel, Sidak Pal Singh
ICLR 2025arXiv:2410.10986
10
citations
Rotational Equilibrium: How Weight Decay Balances Learning Across Neural Networks
Atli Kosson, Bettina Messmer, Martin Jaggi
ICML 2024arXiv:2305.17212
33
citations
When Will Gradient Regularization Be Harmful?
Yang Zhao, Hao Zhang, Xiuyuan Hu
ICML 2024arXiv:2406.09723
2
citations