"language model scaling" Papers
13 papers found

CoTFormer: A Chain of Thought Driven Architecture with Budget-Adaptive Computation Cost at Inference
Amirkeivan Mohtashami, Matteo Pagliardini, Martin Jaggi
ICLR 2025 · arXiv:2310.10845 · 15 citations

Exploiting Vocabulary Frequency Imbalance in Language Model Pre-training
Woojin Chung, Jeonghoon Kim
NeurIPS 2025 · arXiv:2508.15390 · 2 citations

Language models scale reliably with over-training and on downstream tasks
Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar et al.
ICLR 2025 · arXiv:2403.08540 · 79 citations

Overestimation in LLM Evaluation: A Controlled Large-Scale Study on Data Contamination’s Impact on Machine Translation
Muhammed Yusuf Kocyigit, Eleftheria Briakou, Daniel Deutsch et al.
ICML 2025 (oral) · arXiv:2501.18771 · 7 citations

RAT: Bridging RNN Efficiency and Attention Accuracy via Chunk-based Sequence Modeling
Xiuying Wei, Anunay Yadav, Razvan Pascanu et al.
NeurIPS 2025 · arXiv:2507.04416

Reasoning with Latent Thoughts: On the Power of Looped Transformers
Nikunj Saunshi, Nishanth Dikkala, Zhiyuan Li et al.
ICLR 2025 · arXiv:2502.17416 · 79 citations

Scaling Laws for Precision
Tanishq Kumar, Zachary Ankner, Benjamin Spector et al.
ICLR 2025 · arXiv:2411.04330 · 68 citations

SkyLadder: Better and Faster Pretraining via Context Window Scheduling
Tongyao Zhu, Qian Liu, Haonan Wang et al.
NeurIPS 2025 · arXiv:2503.15450 · 3 citations

Tensor Product Attention Is All You Need
Yifan Zhang, Yifeng Liu, Huizhuo Yuan et al.
NeurIPS 2025 (spotlight) · arXiv:2501.06425 · 34 citations

Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws
Nikhil Sardana, Jacob Portes, Alexandre (Sasha) Doubov et al.
ICML 2024 · arXiv:2401.00448 · 123 citations

Data Engineering for Scaling Language Models to 128K Context
Yao Fu, Rameswar Panda, Xinyao Niu et al.
ICML 2024 · arXiv:2402.10171 · 186 citations

Mechanistic Design and Scaling of Hybrid Architectures
Michael Poli, Armin Thomas, Eric Nguyen et al.
ICML 2024 · arXiv:2403.17844 · 53 citations

Why Larger Language Models Do In-context Learning Differently?
Zhenmei Shi, Junyi Wei, Zhuoyan Xu et al.
ICML 2024 · arXiv:2405.19592 · 49 citations