Abstract
Much of the research on real-time motion generation focuses primarily on kinematic aspects, often resulting in physically implausible outcomes. In this paper, we present POMP ("Physics-cOnstrainable Motion Generative Model through Phase Manifolds"), a kinematics-based framework that synthesizes physically realistic motions by leveraging phase manifolds to align motion priors with physics constraints. POMP operates as a frame-by-frame autoregressive model with three core components: a diffusion-based kinematic module, a simulation-based dynamic module, and a phase encoding module. At each timestep, the kinematic module first generates an initial pose, which the dynamic module then revises through a simulation step to incorporate physical constraints. Although an individual simulation step induces negligible kinematic distortion, the accumulated discrepancies can drive the result outside the motion prior learned by the kinematic module, causing subsequent motion generation to fail. To address this, the phase encoding module applies semantic alignment in the phase manifold, projecting the simulated result back onto the motion prior. Moreover, we present a pipeline in Unity for generating terrain maps and capturing full-body motion impulses from existing motion capture datasets. The collected terrain topology and motion impulse data facilitate the training of POMP, enabling it to respond robustly to underlying contact forces and applied dynamics. Extensive evaluations demonstrate the efficacy of POMP across various tasks.
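The per-frame loop described above can be illustrated with a minimal toy sketch. This is not the POMP implementation. The pose is a 2D point, the motion prior is assumed to be the unit circle, and the functions `kinematic_step`, `dynamic_step`, `phase_encode`, and `rollout` are hypothetical stand-ins for the diffusion-based kinematic module, the simulation step, and the phase encoding module. The constants `dphase`, `drift`, and `ground` are likewise illustrative assumptions.

```python
import math


def kinematic_step(phase, dphase=0.1):
    """Hypothetical stand-in for the kinematic module.

    Proposes the next pose on the toy motion prior (the unit circle)
    by advancing the current phase by a fixed increment.
    """
    nxt = phase + dphase
    return (math.cos(nxt), math.sin(nxt))


def dynamic_step(pose, drift=0.02, ground=-0.5):
    """Hypothetical stand-in for one simulation step.

    Applies a small downward drift and a ground constraint, so the
    revised pose generally leaves the unit circle.
    """
    x, y = pose
    return (x, max(y - drift, ground))


def phase_encode(pose):
    """Hypothetical stand-in for the phase encoding module.

    Maps a (possibly off-prior) pose to a phase angle. Conditioning the
    next kinematic step on this phase amounts to projecting the
    simulated pose back onto the toy prior.
    """
    return math.atan2(pose[1], pose[0])


def rollout(num_frames):
    """Frame-by-frame autoregressive loop: propose, simulate, re-align."""
    phase = 0.0
    frames = []
    for _ in range(num_frames):
        proposal = kinematic_step(phase)   # initial pose from the prior
        pose = dynamic_step(proposal)      # revise with physics constraints
        frames.append(pose)                # emit the simulated pose
        phase = phase_encode(pose)         # align back to the prior
    return frames


if __name__ == "__main__":
    for x, y in rollout(5):
        print(f"{x:+.3f} {y:+.3f}")
```

In this sketch the emitted frames are the simulated poses, while the next proposal is conditioned only on the re-encoded phase, so each kinematic step starts from a state that lies on the toy prior and per-step distortion does not accumulate in the conditioning signal.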