Audio-driven Talking Face Generation with Stabilized Synchronization Loss

13 citations · #878 of 2387 papers in ECCV 2024

Abstract

Talking face generation aims to create a realistic video with accurate lip synchronization and high visual quality from a given audio track and reference video, while preserving the subject's identity and visual characteristics. In this paper, we begin by identifying several issues in existing synchronization learning methods: unstable training, as well as degraded lip synchronization and visual quality, caused by the lip-sync loss and SyncNet. We further tackle the lip-leaking problem, in which lip information leaks from the identity reference, and propose a silent-lip generator that prevents leakage by altering the lips of the identity reference. We then introduce a stabilized synchronization loss and AVSyncNet to alleviate the problems caused by the lip-sync loss and SyncNet. Finally, we present an adaptive triplet loss to enhance visual quality and apply a post-processing technique to obtain high-quality videos. Experiments show that our model outperforms state-of-the-art methods in both visual quality and lip synchronization. Comprehensive ablation studies further validate our individual contributions as well as their complementary effects.
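For context, the lip-sync loss the abstract criticizes is conventionally a SyncNet-style contrastive objective (as popularized by Wav2Lip): an audio embedding and a lip-region embedding are compared by cosine similarity, and binary cross-entropy pushes in-sync pairs toward 1 and off-sync pairs toward 0. The sketch below illustrates that standard formulation only, not the paper's stabilized variant; the function name, embedding size, and batch shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def syncnet_lip_sync_loss(audio_emb: torch.Tensor,
                          lip_emb: torch.Tensor,
                          is_synced: torch.Tensor) -> torch.Tensor:
    """Minimal sketch of a SyncNet-style lip-sync loss (Wav2Lip-like):
    cosine similarity between audio and lip-region embeddings, trained
    with binary cross-entropy so in-sync pairs score near 1."""
    sim = F.cosine_similarity(audio_emb, lip_emb, dim=-1)
    # Clamp into (0, 1) so BCE stays well defined for negative similarities.
    prob = sim.clamp(1e-7, 1.0 - 1e-7)
    return F.binary_cross_entropy(prob, is_synced.float())

# Illustrative usage with random embeddings (512-d is an assumption).
audio_emb = torch.randn(8, 512)
lip_emb = torch.randn(8, 512)
labels = torch.randint(0, 2, (8,))  # 1 = in-sync pair, 0 = off-sync pair
loss = syncnet_lip_sync_loss(audio_emb, lip_emb, labels)
```

Per the abstract, the paper's stabilized synchronization loss and AVSyncNet are proposed precisely to mitigate the training instability and quality degradation associated with this kind of objective.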

Citation History

Jan 25, 2026: 0
Jan 27, 2026: 0
Jan 28, 2026: 0
Feb 13, 2026: 13