"length generalization" Papers
19 papers found
A Formal Framework for Understanding Length Generalization in Transformers
Xinting Huang, Andy Yang, Satwik Bhattamishra et al.
Beyond Single-Task: Robust Multi-Task Length Generalization for LLMs
Yi Hu, Shijia Kang, Haotong Yang et al.
Born a Transformer -- Always a Transformer? On the Effect of Pretraining on Architectural Abilities
Mayank Jobanputra, Yana Veitsman, Yash Sarrof et al.
Extrapolation by Association: Length Generalization Transfer in Transformers
Ziyang Cai, Nayoung Lee, Avi Schwarzschild et al.
Generalizing Reasoning Problems to Longer Lengths
Changnan Xiao, Bing Liu
Hardware-aligned Hierarchical Sparse Attention for Efficient Long-term Memory Access
Xiang Hu, Jiaqi Leng, Jun Zhao et al.
HoPE: Hybrid of Position Embedding for Long Context Vision-Language Models
Haoran Li, Yingjie Qin, Baoyuan Ou et al.
Language Models Need Inductive Biases to Count Inductively
Yingshan Chang, Yonatan Bisk
Length Generalization via Auxiliary Tasks
Pranjal Awasthi, Anupam Gupta, Ravi Kumar
Looped Transformers for Length Generalization
Ying Fan, Yilun Du, Kannan Ramchandran et al.
Mamba Modulation: On the Length Generalization of Mamba Models
Peng Lu, Jerry Huang, Qiuhao Zeng et al.
Provable Length Generalization in Sequence Prediction via Spectral Filtering
Annie Marsden, Evan Dogariu, Naman Agarwal et al.
Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling
Liliang Ren, Yang Liu, Yadong Lu et al.
Structured Linear CDEs: Maximally Expressive and Parallel-in-Time Sequence Models
Benjamin Walker, Lingyi Yang, Nicola Muca Cirone et al.
Transformers Provably Learn Chain-of-Thought Reasoning with Length Generalization
Yu Huang, Zixin Wen, Aarti Singh et al.
Case-Based or Rule-Based: How Do Transformers Do the Math?
Yi Hu, Xiaojuan Tang, Haotong Yang et al.
Gated Linear Attention Transformers with Hardware-Efficient Training
Songlin Yang, Bailin Wang, Yikang Shen et al.
Mastering Symbolic Operations: Augmenting Language Models with Compiled Neural Networks
Yixuan Weng, Minjun Zhu, Fei Xia et al.
Transformers Provably Learn Sparse Token Selection While Fully-Connected Nets Cannot
Zixuan Wang, Stanley Wei, Daniel Hsu et al.