From Notation to Gesture: Virtual Conductor Gesture Generation in VR via Structured Score Semantics
Abstract
A conductor avatar plays a dual role in immersive Virtual Reality (VR) interactive systems, interpreting musical scores and guiding orchestral performance. Rule-based score-driven methods achieve precise synchronization with predefined conducting templates or videos but are constrained by pre-authored data. Audio-driven frameworks offer greater adaptability through real-time gesture generation, yet they often fail to capture the symbolic semantics of musical scores. To overcome these limitations, we propose a novel score-driven gesture generation framework that translates symbolic musical representations into plausible conducting gestures. Our approach adopts a two-stage architecture, combining a contrastive learning stage that pre-trains a score encoder with a generative learning stage that synthesizes gestures. The score encoder explicitly models musical features such as tempo, chord, intensity, and cycle semantics, which directly inform gesture generation. To support this research, we introduce the Multimodal Symphonic Conducting Dataset (MSCD), the first synchronized dataset comprising conducting gestures, performance audio, and editable symbolic scores, effectively bridging the gap between musical semantics and gesture synthesis. Qualitative and quantitative analyses demonstrate the effectiveness of our approach, and a user study identifies the strengths and limitations of the current work.
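To make the two-stage design concrete, the sketch below illustrates the contrastive pre-training stage the abstract describes: a score encoder is aligned with a gesture encoder over synchronized (score, gesture) pairs via a symmetric InfoNCE loss, after which a generative model would be conditioned on the resulting score embeddings. This is a minimal illustrative sketch, not the authors' implementation; all module names, feature dimensions, and hyperparameters (ScoreEncoder, GestureEncoder, score_dim, motion_dim, the temperature) are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ScoreEncoder(nn.Module):
    """Maps per-beat symbolic features (tempo, chord, intensity, cycle)
    into a shared embedding space. Sizes are illustrative."""
    def __init__(self, score_dim=16, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(score_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim)
        )

    def forward(self, x):  # x: (batch, score_dim)
        return F.normalize(self.net(x), dim=-1)

class GestureEncoder(nn.Module):
    """Maps a window of conducting motion into the same embedding space."""
    def __init__(self, motion_dim=63, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(motion_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim)
        )

    def forward(self, x):  # x: (batch, motion_dim)
        return F.normalize(self.net(x), dim=-1)

def info_nce(score_emb, gesture_emb, temperature=0.07):
    """Symmetric InfoNCE: matched (score, gesture) pairs in the batch are
    positives; all other pairings serve as negatives."""
    logits = score_emb @ gesture_emb.t() / temperature
    targets = torch.arange(len(logits), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Stage 1: one illustrative pre-training step on synchronized pairs.
score_enc, gesture_enc = ScoreEncoder(), GestureEncoder()
opt = torch.optim.Adam(
    list(score_enc.parameters()) + list(gesture_enc.parameters()), lr=1e-4
)
scores = torch.randn(32, 16)   # stand-in for symbolic score features
motions = torch.randn(32, 63)  # stand-in for time-aligned gesture windows
loss = info_nce(score_enc(scores), gesture_enc(motions))
opt.zero_grad(); loss.backward(); opt.step()

In Stage 2, the pre-trained score encoder would be frozen (or fine-tuned) and its embeddings fed as conditioning to a gesture generator; the abstract does not specify the generator's form, so it is omitted here.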