MACE-Dance:

Motion-Appearance Cascaded Experts for
Music-Driven Dance Video Generation

Anonymous Project Page

Through the collaboration of its cascaded experts, MACE-Dance generates diverse dance videos that exhibit kinematically plausible and artistically expressive motion while maintaining a spatiotemporally coherent appearance.

Abstract

With the rise of online dance-video platforms and rapid advances in AI-generated content (AIGC), music-driven dance generation has emerged as a compelling research direction. Despite substantial progress in related domains such as music-driven 3D dance generation, pose-driven image animation, and audio-driven talking-head synthesis, methods from these domains cannot be directly adapted to this task. Moreover, the few existing studies on this task still struggle to achieve high-quality visual appearance and realistic human motion at the same time. Accordingly, we present MACE-Dance, a music-driven dance video generation framework built on a cascaded Mixture-of-Experts (MoE). The Motion Expert performs music-to-3D-motion generation, enforcing kinematic plausibility and artistic expressiveness, while the Appearance Expert performs video synthesis conditioned on the motion and a reference image, preserving visual identity with spatiotemporal coherence. Specifically, the Motion Expert adopts a diffusion model with a BiMamba-Transformer hybrid architecture and a Guidance-Free Training (GFT) strategy, achieving state-of-the-art (SOTA) performance on the 3D dance generation task; the Appearance Expert adopts a decoupled Kinematic–Aesthetic fine-tuning strategy, achieving SOTA performance on the pose-driven image animation task. To better benchmark music-driven dance video generation, we curate a large-scale and diverse dataset and design a motion–appearance evaluation protocol. Under this dataset and protocol, MACE-Dance also achieves SOTA performance.
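The cascade described in the abstract can be pictured as two stages chained together: music features go into the Motion Expert, and its 3D motion output, together with a reference image, goes into the Appearance Expert. The following is a minimal sketch of that data flow only. Every name in it (`CascadedDancePipeline`, `toy_motion_expert`, `toy_appearance_expert`, the list-based data types) is a hypothetical placeholder, and the toy functions are trivial stand-ins rather than the diffusion or fine-tuned models used in MACE-Dance.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical placeholder types, chosen only to illustrate the data flow.
MusicFeatures = List[float]   # one scalar feature per frame
Motion = List[List[float]]    # one pose vector per frame
Video = List[str]             # one placeholder "frame" label per pose


@dataclass
class CascadedDancePipeline:
    """Chains two experts: music -> 3D motion -> video frames."""
    motion_expert: Callable[[MusicFeatures], Motion]
    appearance_expert: Callable[[Motion, str], Video]

    def generate(self, music: MusicFeatures, reference_image: str) -> Video:
        # Stage 1: the Motion Expert maps music features to a motion sequence.
        motion = self.motion_expert(music)
        # Stage 2: the Appearance Expert renders frames from the motion,
        # conditioned on a reference image that fixes the visual identity.
        return self.appearance_expert(motion, reference_image)


def toy_motion_expert(music: MusicFeatures) -> Motion:
    # Stand-in (assumption): emit a 3-value "pose" per music feature.
    return [[f, f, f] for f in music]


def toy_appearance_expert(motion: Motion, reference_image: str) -> Video:
    # Stand-in (assumption): label one frame per pose with the reference name.
    return [f"{reference_image}:frame{i}" for i in range(len(motion))]


pipeline = CascadedDancePipeline(toy_motion_expert, toy_appearance_expert)
frames = pipeline.generate([0.1, 0.5, 0.9], "ref.png")
```

Because the two experts meet only at the motion representation, either stage could in principle be trained, evaluated, or replaced independently, which is consistent with the abstract reporting results for each expert on its own task.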