Kaifeng Lyu

Affiliations

Tsinghua University

papers

786 total citations

papers (13)

Safety Alignment Should be Made More Than Just a Few Tokens Deep

ICLR 2025, arXiv

303 citations

Understanding the Generalization Benefit of Normalization Layers: Sharpness Reduction

NeurIPS 2022, arXiv

Gradient Descent on Two-layer Nets: Margin Maximization and Simplicity Bias

NeurIPS 2021, arXiv

On the SDEs and Scaling Rules for Adaptive Gradient Algorithms

NeurIPS 2022, arXiv

Reconciling Modern Deep Learning with Traditional Optimization Analyses: The Intrinsic Learning Rate

NeurIPS 2020, arXiv

New Definitions and Evaluations for Saliency Methods: Staying Intrinsic, Complete and Sound

NeurIPS 2022, arXiv

A Quadratic Synchronization Rule for Distributed Deep Learning

ICLR 2024, arXiv

Data Mixing Can Induce Phase Transitions in Knowledge Acquisition

NeurIPS 2025, arXiv

Adam Reduces a Unique Form of Sharpness: Theoretical Insights Near the Minimizer Manifold

NeurIPS 2025, arXiv

papers (13)

Safety Alignment Should be Made More Than Just a Few Tokens Deep

Understanding the Generalization Benefit of Normalization Layers: Sharpness Reduction

Gradient Descent on Two-layer Nets: Margin Maximization and Simplicity Bias

On the SDEs and Scaling Rules for Adaptive Gradient Algorithms

Reconciling Modern Deep Learning with Traditional Optimization Analyses: The Intrinsic Learning Rate

Dichotomy of Early and Late Phase Implicit Biases Can Provably Induce Grokking

RNNs are not Transformers (Yet): The Key Bottleneck on In-Context Retrieval

A Multi-Power Law for Loss Curve Prediction Across Learning Rate Schedules

Efficient stagewise pretraining via progressive subnetworks

New Definitions and Evaluations for Saliency Methods: Staying Intrinsic, Complete and Sound

A Quadratic Synchronization Rule for Distributed Deep Learning

Data Mixing Can Induce Phase Transitions in Knowledge Acquisition

Adam Reduces a Unique Form of Sharpness: Theoretical Insights Near the Minimizer Manifold
