"language modeling" Papers
48 papers found
Accelerated Sampling from Masked Diffusion Models via Entropy Bounded Unmasking
Heli Ben-Hamu, Itai Gat, Daniel Severo et al.
AdaFisher: Adaptive Second Order Optimization via Fisher Information
Damien Gomes, Yanlei Zhang, Eugene Belilovsky et al.
BindGPT: A Scalable Framework for 3D Molecular Design via Language Modeling and Reinforcement Learning
Artem Zholus, Maksim Kuznetsov, Roman Schutski et al.
Chunk-Distilled Language Modeling
Yanhong Li, Karen Livescu, Jiawei Zhou
Continuous Diffusion Model for Language Modeling
Jaehyeong Jo, Sung Ju Hwang
DeltaProduct: Improving State-Tracking in Linear RNNs via Householder Products
Julien Siems, Timur Carstensen, Arber Zela et al.
Differential Transformer
Tianzhu Ye, Li Dong, Yuqing Xia et al.
Flexible Language Modeling in Continuous Space with Transformer-based Autoregressive Flows
Ruixiang Zhang, Shuangfei Zhai, Jiatao Gu et al.
From Bytes to Ideas: Language Modeling with Autoregressive U-Nets
Mathurin Videau, Badr Youbi Idrissi, Alessandro Leite et al.
Generalized Gradient Norm Clipping & Non-Euclidean $(L_0,L_1)$-Smoothness
Thomas Pethick, Wanyun Xie, Mete Erdogan et al.
Glauber Generative Model: Discrete Diffusion Models via Binary Classification
Harshit Varma, Dheeraj Nagaraj, Karthikeyan Shanmugam
Improving Bilinear RNN with Closed-loop Control
Jiaxi Hu, Yongqi Pan, Jusen Du et al.
Language Models Are Implicitly Continuous
Samuele Marro, Davide Evangelista, X. Huang et al.
Local Mixtures of Experts: Essentially Free Test-Time Training via Model Merging
Ryo Bertolissi, Jonas Hübotter, Ido Hakimi et al.
Longhorn: State Space Models are Amortized Online Learners
Bo Liu, Rui Wang, Lemeng Wu et al.
MIND over Body: Adaptive Thinking using Dynamic Computation
Mrinal Mathur, Barak Pearlmutter, Sergey Plis
Nested Learning: The Illusion of Deep Learning Architectures
Ali Behrouz, Meisam Razaviyayn, Peilin Zhong et al.
Next Semantic Scale Prediction via Hierarchical Diffusion Language Models
Cai Zhou, Chenyu Wang, Dinghuai Zhang et al.
Scaling up Masked Diffusion Models on Text
Shen Nie, Fengqi Zhu, Chao Du et al.
Selective Attention Improves Transformer
Yaniv Leviathan, Matan Kalman, Yossi Matias
ShortListing Model: A Streamlined Simplex Diffusion for Discrete Variable Generation
Yuxuan Song, Zhe Zhang, Yu Pei et al.
Soup-of-Experts: Pretraining Specialist Models via Parameters Averaging
Pierre Ablin, Angelos Katharopoulos, Skyler Seto et al.
SuperBPE: Space Travel for Language Models
Alisa Liu, Jonathan Hayase, Valentin Hofmann et al.
The AdEMAMix Optimizer: Better, Faster, Older
Matteo Pagliardini, Pierre Ablin, David Grangier
Tight Clusters Make Specialized Experts
Stefan Nielsen, Rachel Teo, Laziz Abdullaev et al.
AMPA: Adaptive Mixed Precision Allocation for Low-Bit Integer Training
Li Ding, Wen Fei, Yuyang Huang et al.
An Independence-promoting Loss for Music Generation with Language Models
Jean-Marie Lemercier, Simon Rouard, Jade Copet et al.
Cached Transformers: Improving Transformers with Differentiable Memory Cache
Zhaoyang Zhang, Wenqi Shao, Yixiao Ge et al.
Can Mamba Learn How To Learn? A Comparative Study on In-Context Learning Tasks
Jong Ho Park, Jaden Park, Zheyang Xiong et al.
Differentiable Model Scaling using Differentiable Topk
Kai Liu, Ruohui Wang, Jianfei Gao et al.
Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution
Aaron Lou, Chenlin Meng, Stefano Ermon
Eureka-Moments in Transformers: Multi-Step Tasks Reveal Softmax Induced Optimization Problems
David T. Hoffmann, Simon Schrodi, Jelena Bratulić et al.
Exploring Transformer Extrapolation
Zhen Qin, Yiran Zhong, Hui Deng
Gated Linear Attention Transformers with Hardware-Efficient Training
Songlin Yang, Bailin Wang, Yikang Shen et al.
Improving Transformers with Dynamically Composable Multi-Head Attention
Da Xiao, Qingye Meng, Shengping Li et al.
In-Context Language Learning: Architectures and Algorithms
Ekin Akyürek, Bailin Wang, Yoon Kim et al.
Matrix Information Theory for Self-Supervised Learning
Yifan Zhang, Zhiquan Tan, Jingqin Yang et al.
Modeling Language Tokens as Functionals of Semantic Fields
Zhengqi Pei, Anran Zhang, Shuhui Wang et al.
MultiMax: Sparse and Multi-Modal Attention Learning
Yuxuan Zhou, Mario Fritz, Margret Keuper
PolySketchFormer: Fast Transformers via Sketching Polynomial Kernels
Praneeth Kacham, Vahab Mirrokni, Peilin Zhong
PolyVoice: Language Models for Speech to Speech Translation
Qianqian Dong, Zhiying Huang, Qiao Tian et al.
Positive Concave Deep Equilibrium Models
Mateusz Gabor, Tomasz Piotrowski, Renato L. G. Cavalcante
SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization
Jialong Guo, Xinghao Chen, Yehui Tang et al.
SpikeLM: Towards General Spike-Driven Language Modeling via Elastic Bi-Spiking Mechanisms
Xingrun Xing, Zheng Zhang, Ziyi Ni et al.
StableMask: Refining Causal Masking in Decoder-only Transformer
Qingyu Yin, Xuzheng He, Xiang Zhuang et al.
Stay on Topic with Classifier-Free Guidance
Guillaume Sanchez, Alexander Spangher, Honglu Fan et al.
Trainable Transformer in Transformer
Abhishek Panigrahi, Sadhika Malladi, Mengzhou Xia et al.
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Tri Dao, Albert Gu