MACE-Dance: Motion-Appearance Cascaded Experts for Music-Driven Dance Video Generation
Anonymous Project Page
Abstract
With the rise of online dance-video platforms and rapid advances in AI-generated content (AIGC), music-driven dance video generation has emerged as a compelling research direction. Despite substantial progress in related domains such as music-driven 3D dance generation, pose-driven image animation, and audio-driven talking-head synthesis, methods from these domains cannot be directly adapted to this task. Moreover, the few existing studies on this task still struggle to jointly achieve high-quality visual appearance and human motion. Accordingly, we present MACE-Dance, a music-driven dance video generation framework built on cascaded Mixture-of-Experts (MoE). The Motion Expert performs music-to-3D-motion generation, enforcing kinematic plausibility and artistic expressiveness, while the Appearance Expert performs motion- and reference-conditioned video synthesis, preserving visual identity with spatiotemporal coherence. Specifically, the Motion Expert adopts a diffusion model with a BiMamba-Transformer hybrid architecture and a Guidance-Free Training (GFT) strategy, achieving state-of-the-art (SOTA) performance on the 3D dance generation task; the Appearance Expert adopts a decoupled Kinematic–Aesthetic fine-tuning strategy, achieving SOTA performance on the pose-driven image animation task. To better benchmark this task, we curate a large-scale, diverse dataset and design a motion–appearance evaluation protocol, on which MACE-Dance likewise achieves SOTA performance.
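To make the cascade concrete, below is a minimal Python sketch of the two-stage inference flow described above. All names, signatures, tensor shapes, and feature dimensions here are illustrative assumptions (the abstract does not specify an API); the placeholder bodies stand in for the actual diffusion-based experts.

import numpy as np

# Hypothetical interfaces: MotionExpert / AppearanceExpert, the 438-dim music
# features, and the 24-joint axis-angle motion format are assumptions for
# illustration, not the released implementation.

class MotionExpert:
    """Stage 1: music features -> 3D motion (e.g., per-frame joint rotations).
    Stands in for the BiMamba-Transformer diffusion model with GFT."""
    def generate(self, music_features: np.ndarray) -> np.ndarray:
        num_frames = music_features.shape[0]
        # Placeholder output: one pose per music frame (24 joints, axis-angle).
        return np.zeros((num_frames, 24, 3))

class AppearanceExpert:
    """Stage 2: (motion, reference image) -> video frames.
    Stands in for the motion- and reference-conditioned video synthesizer."""
    def render(self, motion: np.ndarray, reference: np.ndarray) -> np.ndarray:
        num_frames = motion.shape[0]
        h, w, c = reference.shape
        # Placeholder output: repeat the reference image for every frame.
        return np.broadcast_to(reference, (num_frames, h, w, c)).copy()

def mace_dance(music_features: np.ndarray, reference_image: np.ndarray) -> np.ndarray:
    """Cascade: Stage 1 produces the kinematics; Stage 2 conditions on them
    plus a reference image to preserve identity and appearance."""
    motion = MotionExpert().generate(music_features)
    return AppearanceExpert().render(motion, reference_image)

if __name__ == "__main__":
    music = np.random.randn(120, 438)          # 120 frames of audio features
    ref = np.zeros((256, 256, 3), np.uint8)    # reference dancer image
    video = mace_dance(music, ref)
    print(video.shape)                         # (120, 256, 256, 3)

The design point the sketch captures is the decoupling: motion quality is decided entirely in Stage 1, so Stage 2 only needs to solve a conditional rendering problem rather than learning choreography and appearance jointly.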