"training dynamics" Papers
28 papers found
A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement
Hui Yuan, Yifan Zeng, Yue Wu et al.
Attention layers provably solve single-location regression
Pierre Marion, Raphaël Berthier, Gérard Biau et al.
Beyond Random: Automatic Inner-loop Optimization in Dataset Distillation
Muquan Li, Hang Gou, Dongyang Zhang et al.
Bridging Critical Gaps in Convergent Learning: How Representational Alignment Evolves Across Layers, Training, and Distribution Shifts
Chaitanya Kapoor, Sudhanshu Srivastava, Meenakshi Khosla
Contrastive Learning with Data Misalignment: Feature Purity, Training Dynamics and Theoretical Generalization Guarantees
Jiawei Sun, Shuai Zhang, Hongkang Li et al.
Dynamic Loss-Based Sample Reweighting for Improved Large Language Model Pretraining
Daouda Sow, Herbert Woisetschläger, Saikiran Bulusu et al.
Feature Averaging: An Implicit Bias of Gradient Descent Leading to Non-Robustness in Neural Networks
Binghui Li, Zhixuan Pan, Kaifeng Lyu et al.
Flatness is Necessary, Neural Collapse is Not: Rethinking Generalization via Grokking
Ting Han, Linara Adilova, Henning Petzka et al.
From Next-Token to Mathematics: The Learning Dynamics of Mathematical Reasoning in Language Models
Shubhra Mishra, Gabriel Poesia, Noah Goodman
Global Convergence in Neural ODEs: Impact of Activation Functions
Tianxiang Gao, Siyuan Sun, Hailiang Liu et al.
Memorization in Graph Neural Networks
Adarsh Jamadandi, Jing Xu, Adam Dziedzic et al.
On the Feature Learning in Diffusion Models
Andi Han, Wei Huang, Yuan Cao et al.
On the Performance Analysis of Momentum Method: A Frequency Domain Perspective
Xianliang Li, Jun Luo, Zhiwei Zheng et al.
PolyPythias: Stability and Outliers across Fifty Language Model Pre-Training Runs
Oskar van der Wal, Pietro Lesci, Max Müller-Eberstein et al.
Position: Algebra Unveils Deep Learning - An Invitation to Neuroalgebraic Geometry
Giovanni Luca Marchetti, Vahid Shahverdi, Stefano Mereta et al.
Scaling Off-Policy Reinforcement Learning with Batch and Weight Normalization
Daniel Palenicek, Florian Vogt, Joe Watson et al.
The emergence of sparse attention: impact of data distribution and benefits of repetition
Nicolas Zucchet, Francesco D'Angelo, Andrew Lampinen et al.
Trained Mamba Emulates Online Gradient Descent in In-Context Linear Regression
Jiarui Jiang, Wei Huang, Miao Zhang et al.
Transformers Learn to Implement Multi-step Gradient Descent with Chain of Thought
Jianhao Huang, Zixuan Wang, Jason Lee
Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training
Tony Bonnaire, Raphaël Urfin, Giulio Biroli et al.
Dynamic Data Selection for Efficient SSL via Coarse-to-Fine Refinement
Aditay Tripathi, Pradeep Shenoy, Anirban Chakraborty
Evolving Subnetwork Training for Large Language Models
Hanqi Li, Lu Chen, Da Ma et al.
How Graph Neural Networks Learn: Lessons from Training Dynamics
Chenxiao Yang, Qitian Wu, David Wipf et al.
Learning Associative Memories with Gradient Descent
Vivien Cabannes, Berfin Simsek, Alberto Bietti
Rethinking Fast Adversarial Training: A Splitting Technique To Overcome Catastrophic Overfitting
Masoumeh Zareapoor, Pourya Shamsolmoali
Stability-Informed Initialization of Neural Ordinary Differential Equations
Theodor Westny, Arman Mohammadi, Daniel Jung et al.
United We Stand: Using Epoch-Wise Agreement of Ensembles to Combat Overfit
Uri Stern, Daniel Shwartz, Daphna Weinshall
What is Dataset Distillation Learning?
William Yang, Ye Zhu, Zhiwei Deng et al.