PoseLLaVA: Pose Centric Multimodal LLM for Fine-Grained 3D Pose Manipulation

6 citations · Ranked #733 of 3,028 papers in AAAI 2025

Abstract

Manipulating human poses based on natural language is an emerging research field that has traditionally focused on coarse commands such as “walking” or “dancing.” However, fine-grained pose manipulation, like instructing “put both hands in front of the stomach,” remains underexplored. In this paper, we introduce PoseLLaVA, a pioneering model that integrates SMPL-based pose representations into the multimodal LLaVA framework. Through a novel pose encoder-decoder mechanism, PoseLLaVA achieves precise alignment between the pose, textual, and visual modalities, enabling detailed control over pose manipulation tasks. PoseLLaVA excels in three key tasks, all driven by detailed language instructions: pose estimation, pose generation, and pose adjustment. We further introduce PosePart, a fine-grained pose adjustment dataset in which each sample contains an initial pose, a target pose, and a specific adjustment instruction, mimicking the guidance a human instructor might provide. Extensive evaluations across these tasks demonstrate significant improvements over existing methods on metrics such as MPJPE and PA-MPJPE, which measure SMPL reconstruction error, and Recall rates, which assess feature alignment across modalities. Specifically, PoseLLaVA reduces MPJPE by more than 20% compared to state-of-the-art methods on the pose adjustment and generation tasks. We also demonstrate the feasibility of combining PoseLLaVA with generative models such as diffusion models for pose image editing, highlighting its potential in language-controlled pose manipulation.
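The abstract describes each PosePart sample as an (initial pose, target pose, instruction) triple. A minimal sketch of what such a sample might look like follows; the class, the field names, and the 72-dimensional axis-angle SMPL pose layout are assumptions for illustration, not the dataset's released format.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class PosePartSample:
    """Hypothetical layout of one PosePart sample, inferred from the
    abstract; the actual release may use different names and encodings."""
    initial_pose: np.ndarray  # assumed SMPL body pose, axis-angle, shape (72,)
    target_pose: np.ndarray   # assumed SMPL body pose after the adjustment
    instruction: str          # e.g. "put both hands in front of the stomach"
```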
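MPJPE and PA-MPJPE are the standard 3D pose reconstruction metrics the abstract cites: MPJPE is the mean Euclidean distance between predicted and ground-truth joints, and PA-MPJPE measures the same error after a rigid similarity (Procrustes) alignment that removes global scale, rotation, and translation. The NumPy sketch below implements these standard formulations; the 24-joint shape matches the SMPL skeleton, but nothing here is taken from PoseLLaVA's code.

```python
import numpy as np

def mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean Per-Joint Position Error: mean Euclidean distance between
    predicted and ground-truth joints, in the units of the input."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

def pa_mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Procrustes-Aligned MPJPE: similarity-align the prediction to the
    ground truth (scale, rotation, translation) before measuring error."""
    # Centre both joint sets.
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    # Optimal rotation from the SVD of the cross-covariance matrix.
    U, S, Vt = np.linalg.svd(p.T @ g)
    # Fix a possible reflection so R is a proper rotation.
    if np.linalg.det(Vt.T @ U.T) < 0:
        Vt[-1] *= -1
        S[-1] *= -1
    R = Vt.T @ U.T
    # Optimal isotropic scale.
    scale = S.sum() / (p ** 2).sum()
    aligned = scale * p @ R.T + mu_g
    return mpjpe(aligned, gt)

# Example: two random 24-joint SMPL skeletons.
pred = np.random.randn(24, 3)
gt = np.random.randn(24, 3)
print(mpjpe(pred, gt), pa_mpjpe(pred, gt))
```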

Citation History

Jan 27, 2026: 5 citations
Feb 13, 2026: 6 citations (+1)