The Optimization Landscape of SGD Across the Feature Learning Strength

arXiv:2410.04642 · ICLR 2025 · 12 citations · ranked #1335 of 3,827 ICLR 2025 papers

Abstract

We consider neural networks (NNs) where the final layer is down-scaled by a fixed hyperparameter $\gamma$. Recent work has identified $\gamma$ as controlling the strength of feature learning. As $\gamma$ increases, network evolution changes from "lazy" kernel dynamics to "rich" feature-learning dynamics, with a host of associated benefits including improved performance on common tasks. In this work, we conduct a thorough empirical investigation of the effect of scaling $\gamma$ across a variety of models and datasets in the online training setting. We first examine the interaction of $\gamma$ with the learning rate $\eta$, identifying several scaling regimes in the $\gamma$-$\eta$ plane which we explain theoretically using a simple model. We find that the optimal learning rate $\eta^*$ scales non-trivially with $\gamma$. In particular, $\eta^* \propto \gamma^2$ when $\gamma \ll 1$ and $\eta^* \propto \gamma^{2/L}$ when $\gamma \gg 1$ for a feed-forward network of depth $L$. Using this optimal learning rate scaling, we proceed with an empirical study of the under-explored "ultra-rich" $\gamma \gg 1$ regime. We find that networks in this regime display characteristic loss curves, starting with a long plateau followed by a drop-off, sometimes followed by one or more additional staircase steps. We find networks of different large $\gamma$ values optimize along similar trajectories up to a reparameterization of time. We further find that optimal online performance is often found at large $\gamma$ and could be missed if this hyperparameter is not tuned. Our findings indicate that analytical study of the large-$\gamma$ limit may yield useful insights into the dynamics of representation learning in performant models.
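
To make the setup concrete, here is a minimal PyTorch sketch of the parameterization the abstract describes: a depth-$L$ feed-forward network whose readout is divided by $\gamma$ (so larger $\gamma$ pushes training from lazy toward rich dynamics), trained with SGD at a learning rate scaled as $\eta^* \propto \gamma^2$ for $\gamma \ll 1$ and $\eta^* \propto \gamma^{2/L}$ for $\gamma \gg 1$. The class and function names (`GammaScaledMLP`, `scaled_lr`), the base rate `eta0`, and the hard switch between the two regimes at $\gamma = 1$ are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class GammaScaledMLP(nn.Module):
    """Depth-L ReLU MLP whose readout is down-scaled by gamma: small gamma keeps the
    network close to lazy/kernel dynamics, large gamma forces rich feature learning."""

    def __init__(self, in_dim: int, width: int, depth: int, out_dim: int, gamma: float):
        super().__init__()
        self.gamma = gamma
        layers, d = [], in_dim
        for _ in range(depth - 1):                    # depth - 1 hidden layers
            layers += [nn.Linear(d, width), nn.ReLU()]
            d = width
        layers.append(nn.Linear(d, out_dim))          # final readout layer
        self.body = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # "Final layer down-scaled by gamma": dividing the readout by gamma shrinks
        # the output scale, so the weights must move more to fit the targets.
        return self.body(x) / self.gamma


def scaled_lr(gamma: float, depth: int, eta0: float = 1e-3) -> float:
    """Learning-rate scaling reported in the abstract: eta* ~ gamma^2 for gamma << 1
    and eta* ~ gamma^(2/L) for gamma >> 1. The base rate eta0 and the hard switch at
    gamma = 1 are illustrative choices, not taken from the paper."""
    return eta0 * gamma ** 2 if gamma < 1.0 else eta0 * gamma ** (2.0 / depth)


if __name__ == "__main__":
    depth, gamma = 3, 10.0                             # an "ultra-rich" gamma >> 1 setting
    model = GammaScaledMLP(in_dim=32, width=256, depth=depth, out_dim=1, gamma=gamma)
    opt = torch.optim.SGD(model.parameters(), lr=scaled_lr(gamma, depth))
    x, y = torch.randn(64, 32), torch.randn(64, 1)     # one dummy online batch
    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Placing $\gamma$ only on the readout, rather than rescaling every layer, follows the abstract's phrasing that "the final layer is down-scaled by a fixed hyperparameter $\gamma$".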

Citation History

First tracked at 10 citations on Jan 26, 2026; 11 citations by Feb 3, 2026; 12 citations by Feb 13, 2026.