Diffusion-Based Imaginative Coordination for Bimanual Manipulation

2 citations

arXiv:2507.11296

#1094 of 2701 papers in ICCV 2025

Top Authors

Huilin Xu, Jian Ding, Jiakun Xu, Ruixiang Wang, Jun Chen, Jinjie Mai, Yanwei Fu, Bernard Ghanem, Feng Xu, Mohamed Elhoseiny

Topics

bimanual manipulation, diffusion models, video prediction, action prediction, latent space encoding, attention mechanism, robotic coordination, unified framework

Abstract

Bimanual manipulation is crucial in robotics, enabling complex tasks in industrial automation and household services. However, it poses significant challenges due to the high-dimensional action space and intricate coordination requirements. Video prediction has recently been studied for representation learning and control because it captures rich dynamic and behavioral information, but its potential for enhancing bimanual coordination remains underexplored. To bridge this gap, we propose a unified diffusion-based framework for the joint optimization of video and action prediction. Specifically, we propose a multi-frame latent prediction strategy that encodes future states in a compressed latent space, preserving task-relevant features. Furthermore, we introduce a unidirectional attention mechanism in which video prediction is conditioned on the action, while action prediction remains independent of video prediction. This design allows us to omit video prediction during inference, significantly enhancing efficiency. Experiments on two simulated benchmarks and a real-world setting show that our method significantly improves the success rate over the strong baseline ACT, with a 24.9% increase on ALOHA, an 11.1% increase on RoboTwin, and a 32.5% increase in real-world experiments. Our models and code are publicly available at https://github.com/return-sleep/Diffusion_based_imaginative_Coordination.
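The unidirectional attention described in the abstract can be illustrated with a toy attention mask. The sketch below is not the authors' implementation: the token ordering (observation, then action, then video), the helper names `build_mask` and `masked_attention`, and the identity projections are all assumptions made for illustration. It shows only the general idea that if no action query can attend to a video key, the action outputs are unchanged when the video tokens are dropped.

```python
import math

def build_mask(n_obs, n_act, n_vid):
    """mask[q][k] is True when query token q may attend to key token k.

    Assumed token order for this sketch: observation, action, video.
    """
    n = n_obs + n_act + n_vid
    first_vid = n_obs + n_act
    mask = []
    for q in range(n):
        q_is_video = q >= first_vid
        # Video queries see every token; all other queries are blocked
        # from video keys, so information flows one way only.
        mask.append([q_is_video or k < first_vid for k in range(n)])
    return mask

def masked_attention(tokens, mask):
    """Toy single-head dot-product attention with identity projections."""
    dim = len(tokens[0])
    out = []
    for q, query in enumerate(tokens):
        keys = [k for k in range(len(tokens)) if mask[q][k]]
        scores = [
            sum(a * b for a, b in zip(query, tokens[k])) / math.sqrt(dim)
            for k in keys
        ]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * tokens[k][d] for w, k in zip(weights, keys)) / total
            for d in range(dim)
        ])
    return out

# Hypothetical toy tokens: 2 observation, 1 action, 2 video.
obs = [[1.0, 0.0], [0.0, 1.0]]
act = [[0.5, 0.5]]
vid = [[2.0, -1.0], [-1.0, 2.0]]

# Training-style pass with video tokens present.
full = masked_attention(obs + act + vid, build_mask(2, 1, 2))
# Inference-style pass with the video tokens omitted.
pruned = masked_attention(obs + act, build_mask(2, 1, 0))

action_unchanged = all(abs(a - b) < 1e-12 for a, b in zip(full[2], pruned[2]))
print(action_unchanged)
```

Because the action row of the mask never selects a video key, the action token's output is computed from the same keys in both passes, which is why the video branch can be skipped at inference in a design of this kind.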
