Recent advances in motion diffusion models have led to remarkable progress in diverse motion generation tasks, including text-to-motion synthesis. However, existing approaches represent motions as dense frame sequences, requiring the model to process redundant or less informative frames. Processing dense animation frames imposes significant training and computational cost, especially when learning complex distributions of large motion datasets using modern neural architectures. This severely limits the performance of generative motion models on downstream tasks. Inspired by professional animators, who mainly focus on sparse keyframes, we propose a novel diffusion framework explicitly designed around sparse and semantically meaningful keyframes. Our method reduces computation by masking non-keyframes and efficiently interpolating missing frames. We dynamically refine the keyframe mask during inference to prioritize informative frames in later diffusion steps. Extensive experiments show that our approach consistently outperforms state-of-the-art methods in text alignment and motion realism, while also maintaining high performance with significantly fewer diffusion steps. We further validate the robustness of our framework by using it as a generative prior and adapting it to different downstream tasks.
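To make the idea of masking non-keyframes and interpolating the missing frames concrete, here is a minimal sketch. It assumes a motion stored as a `(T, D)` array and uses plain linear interpolation between keyframes. The function name `interpolate_from_keyframes`, the toy data, and the choice of linear interpolation are illustrative assumptions, not the interpolation used in the paper, and the dynamic mask refinement during inference is not shown.

```python
import numpy as np


def interpolate_from_keyframes(motion, keyframe_mask):
    """Rebuild a dense motion from its keyframes by linear interpolation.

    motion:        (T, D) array of per-frame features (hypothetical layout).
    keyframe_mask: (T,) boolean array, True where the frame is a keyframe.
    Frames where the mask is False are ignored and re-estimated.
    """
    motion = np.asarray(motion, dtype=float)
    keyframe_mask = np.asarray(keyframe_mask, dtype=bool)
    key_idx = np.flatnonzero(keyframe_mask)
    if key_idx.size == 0:
        raise ValueError("at least one keyframe is required")

    frames = np.arange(motion.shape[0])
    dense = np.empty_like(motion)
    # np.interp works on 1-D data, so interpolate each feature channel separately.
    # Frames before the first or after the last keyframe take the nearest keyframe value.
    for d in range(motion.shape[1]):
        dense[:, d] = np.interp(frames, key_idx, motion[key_idx, d])
    return dense


# Toy motion: one linear channel and one quadratic channel over 9 frames.
T = 9
t = np.arange(T, dtype=float)
motion = np.stack([t, t ** 2], axis=1)

# Keep only frames 0, 4, and 8 as keyframes; all other frames are masked out.
mask = np.zeros(T, dtype=bool)
mask[[0, 4, 8]] = True

dense = interpolate_from_keyframes(motion, mask)
```

In this toy case the linear channel is recovered exactly, while the quadratic channel is only approximated between keyframes, which illustrates why the choice of keyframes matters.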
Comparison 1 - 3
MDM ❌ (wrong start pose)
sMDM (Ours) 🟢
MotionGPT 🟢
sMDM-stella (Ours) 🟢
MDM ❌ (not in a monster style)
sMDM (Ours) 🟢
MotionGPT 🟢
sMDM-stella (Ours) 🟢
MDM 🤔 (static movement)
sMDM (Ours) 🟢
MotionGPT ❌ (not dancing)
sMDM-stella (Ours) 🟢
Comparison 4 - 6
MDM ❌ (not a catwalk)
sMDM (Ours) 🟢
MotionGPT ❌ (not forward, not a catwalk)
sMDM-stella (Ours) 🟢
MDM ❌ (skips forward step)
sMDM (Ours) 🟢
MotionGPT ❌ (skips forward step)
sMDM-stella (Ours) 🟢
MDM 🤔 (no finish turn)
sMDM (Ours) 🤔 (no finish turn)
MotionGPT ❌ (not dancing)
sMDM-stella (Ours) 🟢
Failure Case
MDM 🟢
sMDM (Ours) ❌ (not wiggling)
sMDM-stella (Ours) 🤔 (wiggling?)
Comparison 1 - 2
Text 1 : "A person skates on the ice."
Text 2 : "A person is pushed and falls to the ground."
Text 3 : "A person climbs up the stairs and then goes down."
Text 4 : "A person is doing a salsa dance."
PriorMDM 🤔 (less expressive)
sPriorMDM (Ours) 🟢
Text 1 : "A person crawls and then stands up."
Text 2 : "A person picks up a ball from the ground and then throws it away."
Text 3 : "A person punches and kicks in the air."
Text 4 : "A person waves his body."
PriorMDM ❌ (fails in 1 and 4)
sPriorMDM (Ours) 🟢
Additional Samples 1 - 3
Text 1 : "A person leaps upward with explosive energy."
Text 2 : "A person spins around in a rapid twirl several times."
Text 3 : "A person drinks a cup of water."
Text 4 : "A person starts running on a treadmill."
sPriorMDM
Text 1 : "A person sits down on a chair."
Text 2 : "A person stands and stretches slowly, raising both arms overhead."
Text 3 : "A person walks toward the window to draw back the curtains."
Text 4 : "A person moves into the kitchen to prepare a cup of coffee."
sPriorMDM
Text 1 : "A person begins with a light walk along a park pathway."
Text 2 : "A person sprints along a track."
Text 3 : "A person stops and performs a series of arm swings."
Text 4 : "A person finishes by pausing to stretch on a grassy area."
sPriorMDM
Comparison
Text 1 : "A person is warming up."
Text 2 : "A person walks forward, bends down to pick something up off the ground."
Text 3 : "A person is getting up and perform jumping jacks."
Text 4 : "A person is running back and forth."
Text 5 : "A person punches in a manner consistent with martial arts."
Text 6 : "A person dances like a pop star."
DiP 🟢
sDiP (Ours) 🟢