Audio-driven Talking Face Generation with Stabilized Synchronization Loss

13 citations · #878 of 2387 papers in ECCV 2024

Abstract

Talking face generation aims to create a realistic video with accurate lip synchronization and high visual quality from a given audio track and reference video, while preserving the subject's identity and visual characteristics. In this paper, we begin by identifying several issues in existing synchronization learning methods: unstable training, as well as degraded lip synchronization and visual quality, caused by the lip-sync loss and SyncNet. We further tackle the lip-leaking problem, in which lip information leaks from the identity reference, and propose a silent-lip generator that prevents leakage by altering the lips of the identity reference. We then introduce a stabilized synchronization loss and AVSyncNet to alleviate the problems caused by the lip-sync loss and SyncNet. Finally, we present an adaptive triplet loss to enhance visual quality and apply a post-processing technique to obtain high-quality videos. Experiments show that our model outperforms state-of-the-art methods in both visual quality and lip synchronization. Comprehensive ablation studies further validate our individual contributions as well as their complementary effects.
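For context, the lip-sync loss the abstract criticizes is conventionally a SyncNet-style contrastive objective (as popularized by Wav2Lip): an audio embedding and a lip-region embedding are compared by cosine similarity, and binary cross-entropy pushes in-sync pairs toward 1 and off-sync pairs toward 0. The sketch below illustrates that standard formulation only, not the paper's stabilized variant; the function name, embedding size, and batch shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def syncnet_lip_sync_loss(audio_emb: torch.Tensor,
                          lip_emb: torch.Tensor,
                          is_synced: torch.Tensor) -> torch.Tensor:
    """Minimal sketch of a SyncNet-style lip-sync loss (Wav2Lip-like):
    cosine similarity between audio and lip-region embeddings, trained
    with binary cross-entropy so in-sync pairs score near 1."""
    sim = F.cosine_similarity(audio_emb, lip_emb, dim=-1)
    # Clamp into (0, 1) so BCE stays well defined for negative similarities.
    prob = sim.clamp(1e-7, 1.0 - 1e-7)
    return F.binary_cross_entropy(prob, is_synced.float())

# Illustrative usage with random embeddings (512-d is an assumption).
audio_emb = torch.randn(8, 512)
lip_emb = torch.randn(8, 512)
labels = torch.randint(0, 2, (8,))  # 1 = in-sync pair, 0 = off-sync pair
loss = syncnet_lip_sync_loss(audio_emb, lip_emb, labels)
```

Per the abstract, the paper's stabilized synchronization loss and AVSyncNet are proposed precisely to mitigate the training instability and quality degradation associated with this kind of objective.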

Citation History

Jan 25, 2026: 0
Jan 27, 2026: 0
Jan 28, 2026: 0
Feb 13, 2026: 13