"language modeling" Papers
48 papers found
Accelerated Sampling from Masked Diffusion Models via Entropy Bounded Unmasking
Heli Ben-Hamu, Itai Gat, Daniel Severo et al.
AdaFisher: Adaptive Second Order Optimization via Fisher Information
Damien Gomes, Yanlei Zhang, Eugene Belilovsky et al.
BindGPT: A Scalable Framework for 3D Molecular Design via Language Modeling and Reinforcement Learning
Artem Zholus, Maksim Kuznetsov, Roman Schutski et al.
Chunk-Distilled Language Modeling
Yanhong Li, Karen Livescu, Jiawei Zhou
Continuous Diffusion Model for Language Modeling
Jaehyeong Jo, Sung Ju Hwang
DeltaProduct: Improving State-Tracking in Linear RNNs via Householder Products
Julien Siems, Timur Carstensen, Arber Zela et al.
Differential Transformer
Tianzhu Ye, Li Dong, Yuqing Xia et al.
Flexible Language Modeling in Continuous Space with Transformer-based Autoregressive Flows
Ruixiang Zhang, Shuangfei Zhai, Jiatao Gu et al.
From Bytes to Ideas: Language Modeling with Autoregressive U-Nets
Mathurin Videau, Badr Youbi Idrissi, Alessandro Leite et al.
Generalized Gradient Norm Clipping & Non-Euclidean $(L_0,L_1)$-Smoothness
Thomas Pethick, Wanyun Xie, Mete Erdogan et al.
Glauber Generative Model: Discrete Diffusion Models via Binary Classification
Harshit Varma, Dheeraj Nagaraj, Karthikeyan Shanmugam
Improving Bilinear RNN with Closed-loop Control
Jiaxi Hu, Yongqi Pan, Jusen Du et al.
Language Models Are Implicitly Continuous
Samuele Marro, Davide Evangelista, X. Huang et al.
Local Mixtures of Experts: Essentially Free Test-Time Training via Model Merging
Ryo Bertolissi, Jonas Hübotter, Ido Hakimi et al.
Longhorn: State Space Models are Amortized Online Learners
Bo Liu, Rui Wang, Lemeng Wu et al.
MIND over Body: Adaptive Thinking using Dynamic Computation
Mrinal Mathur, Barak Pearlmutter, Sergey Plis
Nested Learning: The Illusion of Deep Learning Architectures
Ali Behrouz, Meisam Razaviyayn, Peilin Zhong et al.
Next Semantic Scale Prediction via Hierarchical Diffusion Language Models
Cai Zhou, Chenyu Wang, Dinghuai Zhang et al.
Scaling up Masked Diffusion Models on Text
Shen Nie, Fengqi Zhu, Chao Du et al.
Selective Attention Improves Transformer
Yaniv Leviathan, Matan Kalman, Yossi Matias
ShortListing Model: A Streamlined Simplex Diffusion for Discrete Variable Generation
Yuxuan Song, Zhe Zhang, Yu Pei et al.
Soup-of-Experts: Pretraining Specialist Models via Parameters Averaging
Pierre Ablin, Angelos Katharopoulos, Skyler Seto et al.
SuperBPE: Space Travel for Language Models
Alisa Liu, Jonathan Hayase, Valentin Hofmann et al.
The AdEMAMix Optimizer: Better, Faster, Older
Matteo Pagliardini, Pierre Ablin, David Grangier
Tight Clusters Make Specialized Experts
Stefan Nielsen, Rachel Teo, Laziz Abdullaev et al.
AMPA: Adaptive Mixed Precision Allocation for Low-Bit Integer Training
Li Ding, Wen Fei, Yuyang Huang et al.
An Independence-promoting Loss for Music Generation with Language Models
Jean-Marie Lemercier, Simon Rouard, Jade Copet et al.
Cached Transformers: Improving Transformers with Differentiable Memory Cache
Zhaoyang Zhang, Wenqi Shao, Yixiao Ge et al.
Can Mamba Learn How To Learn? A Comparative Study on In-Context Learning Tasks
Jong Ho Park, Jaden Park, Zheyang Xiong et al.
Differentiable Model Scaling using Differentiable Topk
Kai Liu, Ruohui Wang, Jianfei Gao et al.
Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution
Aaron Lou, Chenlin Meng, Stefano Ermon
Eureka-Moments in Transformers: Multi-Step Tasks Reveal Softmax Induced Optimization Problems
David T. Hoffmann, Simon Schrodi, Jelena Bratulić et al.
Exploring Transformer Extrapolation
Zhen Qin, Yiran Zhong, Hui Deng
Gated Linear Attention Transformers with Hardware-Efficient Training
Songlin Yang, Bailin Wang, Yikang Shen et al.
Improving Transformers with Dynamically Composable Multi-Head Attention
Da Xiao, Qingye Meng, Shengping Li et al.
In-Context Language Learning: Architectures and Algorithms
Ekin Akyürek, Bailin Wang, Yoon Kim et al.
Matrix Information Theory for Self-Supervised Learning
Yifan Zhang, Zhiquan Tan, Jingqin Yang et al.
Modeling Language Tokens as Functionals of Semantic Fields
Zhengqi Pei, Anran Zhang, Shuhui Wang et al.
MultiMax: Sparse and Multi-Modal Attention Learning
Yuxuan Zhou, Mario Fritz, Margret Keuper
PolySketchFormer: Fast Transformers via Sketching Polynomial Kernels
Praneeth Kacham, Vahab Mirrokni, Peilin Zhong
PolyVoice: Language Models for Speech to Speech Translation
Qianqian Dong, Zhiying Huang, Qiao Tian et al.
Positive Concave Deep Equilibrium Models
Mateusz Gabor, Tomasz Piotrowski, Renato L. G. Cavalcante
SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization
Jialong Guo, Xinghao Chen, Yehui Tang et al.
SpikeLM: Towards General Spike-Driven Language Modeling via Elastic Bi-Spiking Mechanisms
Xingrun Xing, Zheng Zhang, Ziyi Ni et al.
StableMask: Refining Causal Masking in Decoder-only Transformer
Qingyu Yin, Xuzheng He, Xiang Zhuang et al.
Stay on Topic with Classifier-Free Guidance
Guillaume Sanchez, Alexander Spangher, Honglu Fan et al.
Trainable Transformer in Transformer
Abhishek Panigrahi, Sadhika Malladi, Mengzhou Xia et al.
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Tri Dao, Albert Gu