"gradient accumulation" Papers
2 papers found
Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation is Wasteful
Martin Marek, Sanae Lotfi, Aditya Somasundaram et al.
NeurIPS 2025 · arXiv:2507.07101
22 citations
The AdEMAMix Optimizer: Better, Faster, Older
Matteo Pagliardini, Pierre Ablin, David Grangier
ICLR 2025 · arXiv:2409.03137
27 citations