LatentHOI: On the Generalizable Hand Object Motion Generation with Latent Hand Diffusion
Abstract
Current research on generating 3D hand-object interaction motion focuses primarily on in-domain objects. Generalization to unseen objects is essential for practical applications, yet it remains both challenging and largely unexplored. In this paper, we propose LatentHOI, a novel approach designed to tackle the challenge of generalizing hand-object interaction synthesis to unseen objects. Our key insight is to decouple high-level temporal motion from fine-grained spatial hand-object interaction via a latent diffusion model coupled with a Grasping Variational Autoencoder (Grasp-VAE). This design regularizes the model in two ways: it enforces a conditional dependency between spatial grasping and temporal motion, and it constrains grasps to a regularized latent space, both of which improve generalization. We conduct extensive experiments in an unseen-object setting on both single-hand grasping and bimanual motion datasets, including GRAB, DexYCB, and OakInk. Quantitative and qualitative evaluations demonstrate that our method significantly improves the realism and physical plausibility of generated motions for unseen objects, in both single-hand and bimanual manipulation, compared to the state of the art.
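To make the decoupling described above concrete, the following is a minimal PyTorch-style sketch of the two-component design: a VAE that encodes per-frame grasps into a regularized latent space, and a temporal denoiser that diffuses over the resulting latent trajectories. All module names, feature dimensions, and the simplified DDPM-style training step are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class GraspVAE(nn.Module):
    """Encodes a per-frame grasp representation (hypothetically, MANO
    parameters plus object-relative features) into a regularized latent."""
    def __init__(self, grasp_dim=99, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(grasp_dim, 256), nn.ReLU(),
            nn.Linear(256, 2 * latent_dim),  # mean and log-variance
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, grasp_dim),
        )

    def encode(self, x):
        mu, logvar = self.encoder(x).chunk(2, dim=-1)
        return mu, logvar

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        recon = self.decoder(z)
        # The KL term regularizes the latent space, which the abstract
        # credits for better generalization to unseen objects.
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon, z, kl


class LatentMotionDenoiser(nn.Module):
    """Temporal denoiser over sequences of grasp latents: diffusion handles
    high-level motion while the VAE handles fine-grained spatial grasping."""
    def __init__(self, latent_dim=32, cond_dim=64, hidden=256):
        super().__init__()
        self.in_proj = nn.Linear(latent_dim + cond_dim + 1, hidden)
        self.temporal = nn.GRU(hidden, hidden, batch_first=True)
        self.out_proj = nn.Linear(hidden, latent_dim)

    def forward(self, z_noisy, t, obj_cond):
        # z_noisy: (B, T, latent_dim); t: (B,); obj_cond: (B, cond_dim),
        # a placeholder object encoding that supplies the conditional
        # dependency between spatial grasping and temporal motion.
        B, T, _ = z_noisy.shape
        t_emb = t.float().view(B, 1, 1).expand(B, T, 1)
        c = obj_cond.unsqueeze(1).expand(B, T, -1)
        h = self.in_proj(torch.cat([z_noisy, c, t_emb], dim=-1))
        h, _ = self.temporal(h)
        return self.out_proj(h)  # predicted noise


# One simplified DDPM-style (epsilon-prediction) training step in the
# latent space produced by the Grasp-VAE:
if __name__ == "__main__":
    B, T = 4, 30
    vae, denoiser = GraspVAE(), LatentMotionDenoiser()
    grasps = torch.randn(B, T, 99)     # placeholder per-frame grasps
    obj_cond = torch.randn(B, 64)      # placeholder object features
    with torch.no_grad():
        mu, logvar = vae.encode(grasps)
        z0 = vae.reparameterize(mu, logvar)  # clean latent trajectory
    t = torch.randint(0, 1000, (B,))
    # Toy cosine noise schedule for illustration only.
    alpha_bar = torch.cos(t.float() / 1000 * torch.pi / 2).pow(2).view(B, 1, 1)
    noise = torch.randn_like(z0)
    z_t = alpha_bar.sqrt() * z0 + (1 - alpha_bar).sqrt() * noise
    loss = nn.functional.mse_loss(denoiser(z_t, t, obj_cond), noise)
    loss.backward()
    print(f"denoising loss: {loss.item():.4f}")
```

At generation time, the denoiser would produce a latent trajectory conditioned on the object, and the Grasp-VAE decoder would map each latent back to a full per-frame grasp; this split is what the abstract refers to as decoupling temporal motion from spatial interaction.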