Less is More: Improving Motion Diffusion Models with Sparse Keyframes


Seoul National University
Roblox
The Chinese University of Hong Kong


Abstract

Recent advances in motion diffusion models have led to remarkable progress in diverse motion generation tasks, including text-to-motion synthesis. However, existing approaches represent motions as dense frame sequences, requiring the model to process redundant or less informative frames. The processing of dense animation frames imposes significant training and computational complexity, especially when learning complex distributions of large motion datasets using modern neural architectures. This severely limits the performance of generative motion models for downstream tasks.

Inspired by professional animators who mainly focus on sparse keyframes, we propose a novel diffusion framework explicitly designed around sparse and semantically meaningful keyframes. Our method reduces computation by masking non-keyframes and efficiently interpolating missing frames. We dynamically refine the keyframe mask during inference to prioritize informative frames in later diffusion steps. Extensive experiments show that our approach consistently outperforms state-of-the-art methods in text alignment and motion realism, while also effectively maintaining high performance at significantly fewer diffusion steps. We further validate the robustness of our framework by using it as a generative prior and adapting it to different downstream tasks.

Method Overview

We present a Sparse Motion Diffusion Model (sMDM) that improves motion diffusion models by focusing on sparse keyframes. sMDM uses a binary keyframe mask to exclude non-keyframes from the self-attention layers. During training, a keyframe mask is derived from the clean input frames via keyframe selection. At inference, the model starts with a uniform keyframe mask at earlier timesteps (t > T'), then updates the mask by selecting keyframes from noised input frames for later timesteps. Finally, sMDM reconstructs the dense motion by linearly interpolating features of the selected keyframes. To ensure smooth interpolation, we replace the input and output linear layers with Lipschitz MLPs. Red boxes indicate the changes from the baseline.

Method Overview Image
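To make the keyframe-centric design concrete, the snippet below is a minimal sketch (ours, not the released code) of the two operations the overview describes: building a binary keyframe mask and linearly interpolating the missing frames from keyframe features. The velocity-based selector and the [batch, frames, features] tensor layout are assumptions; the paper's actual selection rule and the Lipschitz MLP layers are not reproduced here.

    import torch
    import torch.nn.functional as F


    def select_keyframes(motion: torch.Tensor, num_key: int) -> torch.Tensor:
        """Pick the `num_key` frames with the largest frame-to-frame change.

        Returns a boolean mask of shape [batch, frames]; True marks a keyframe.
        The first and last frames are always kept so interpolation is well defined.
        """
        B, T, _ = motion.shape
        vel = (motion[:, 1:] - motion[:, :-1]).norm(dim=-1)   # [B, T-1] per-frame motion magnitude
        vel = F.pad(vel, (1, 0))                              # align to T frames
        idx = vel.topk(num_key, dim=1).indices                # most dynamic frames
        mask = torch.zeros(B, T, dtype=torch.bool, device=motion.device)
        mask.scatter_(1, idx, True)
        mask[:, 0] = mask[:, -1] = True
        return mask


    def interpolate_from_keyframes(feats: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        """Linearly interpolate non-keyframe features between neighboring keyframes."""
        B, T, D = feats.shape
        out = feats.clone()
        for b in range(B):
            keys = mask[b].nonzero(as_tuple=True)[0]
            for k0, k1 in zip(keys[:-1], keys[1:]):
                span = (k1 - k0).item()
                if span <= 1:
                    continue
                w = torch.linspace(0, 1, span + 1, device=feats.device)[1:-1, None]
                out[b, k0 + 1:k1] = (1 - w) * feats[b, k0] + w * feats[b, k1]
        return out


    # Inside the transformer, non-keyframes can be excluded from attention by
    # passing key_padding_mask=~mask to torch.nn.MultiheadAttention, so that
    # only keyframe tokens are attended to.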

Text-conditioned Motion Generation

We showcase the performance of our model on text-to-motion synthesis tasks using the HumanML3D dataset. Our models consistently generate high-quality motions that closely align with the input text prompts. We have two variants of our method, denoted sMDM and sMDM-stella. Like many motion diffusion models, sMDM uses CLIP ViT-B/32 as its text encoder. In contrast, sMDM-stella employs the larger Stella-1.5B to explore the compatibility of our approach with more advanced text encoders. Additionally, we visualize results from the baseline MDM and MotionGPT models, which serve as representatives of approaches using standard and advanced text encoders, respectively.
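For reference, the snippet below sketches how a CLIP ViT-B/32 text embedding can be obtained with the openai-clip package. How sMDM injects this embedding into the diffusion backbone, and the corresponding Stella-1.5B pipeline used by sMDM-stella, are not shown, and the exact conditioning code may differ.

    import torch
    import clip

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, _ = clip.load("ViT-B/32", device=device)

    prompt = "a person with their knees on the ground and their hands, gets up."
    with torch.no_grad():
        tokens = clip.tokenize([prompt]).to(device)
        text_emb = model.encode_text(tokens)  # [1, 512] embedding used as the text condition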

Comparison 1 - 3

Baselines sometimes fail to generate motions that faithfully reflect the text conditions, while our models consistently generate text-aligned motions.

Text Condition : "a person with their knees on the ground and their hands, gets up."

MDM ❌ (wrong start pose)

sMDM (Ours) 🟢

MotionGPT 🟢

sMDM-stella (Ours) 🟢


Text Condition : "the person is walking forward and turn around like a monster."

MDM ❌ (not a monster style)

sMDM (Ours) 🟢

MotionGPT 🟢

sMDM-stella (Ours) 🟢


Text Condition : "the person is doing a dance move."

MDM 🤔 (static movement)

sMDM (Ours) 🟢

MotionGPT ❌ (not dancing)

sMDM-stella (Ours) 🟢

Comparison 4 - 6

Our models excel at capturing nuanced styles and contextual details from complex textual descriptions.

Text Condition : "A woman walks with a forward direction and a crossed, catwalk motion style to reach an idle position."

MDM ❌ (not a catwalk)

sMDM (Ours) 🟢

MotionGPT ❌ (not a forward catwalk)

sMDM-stella (Ours) 🟢


Text Condition : "a person ducks down suddenly, steps forward, and stands back up."

MDM ❌ (skips forward step)

sMDM (Ours) 🟢

MotionGPT ❌ (skips forward step)

sMDM-stella (Ours) 🟢


Text Condition : "a person dancing, starting in a t pose, the. turns, continues to dance and finishes with another turn."

MDM 🤔 (no finishing turn)

sMDM (Ours) 🤔 (no finishing turn)

MotionGPT ❌ (not dancing)

sMDM-stella (Ours) 🟢

Failure Case

However, our models occasionally fail. As illustrated in this example, our model struggles to capture wiggling motions, which involve high-frequency body movements. Although increasing the number of keyframes could address this issue, it might also diminish the advantages of using a sparse keyframe approach. Adaptive or learnable keyframe selection could be potential solutions. We leave a more in-depth investigation of this limitation to future work.

Text Condition : "a person wiggles their hips back and forth while waving their arms."

MDM 🟢

sMDM (Ours) ❌ (not wiggling)

sMDM-stella (Ours) 🤔 (wiggling?)

Long-Sequence Generation

We further demonstrate that sMDM can serve as a generative prior in a downstream task. Here, we adopt the DoubleTake strategy, which uses a pre-trained MDM or sMDM to generate multiple short motion segments and then revises their overlaps to ensure coherent transitions. To distinguish between the two pre-trained priors (both with 50 diffusion steps and a transformer encoder), we denote them PriorMDM and sPriorMDM. We set the transition length to 1 second and use a 0.25-second margin for blending each segment. In the videos, transitions are indicated with yellow meshes. Note that the four input texts are reflected in the generated motions sequentially.
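As a rough illustration of the stitching step, the sketch below crossfades consecutive segments over a shared 1-second transition. It assumes 20 fps motion tensors of shape [frames, features] and substitutes a simple linear blend for DoubleTake's actual revision of the overlap; the 0.25-second margin handling is omitted.

    import torch

    FPS = 20
    TRANSITION = int(1.0 * FPS)  # 1-second transition shared by consecutive segments


    def stitch(segments: list[torch.Tensor]) -> torch.Tensor:
        """Concatenate segments, crossfading the shared transition frames."""
        out = segments[0]
        for seg in segments[1:]:
            a = out[-TRANSITION:]                         # tail of the previous segment
            b = seg[:TRANSITION]                          # head of the next segment
            w = torch.linspace(0, 1, TRANSITION).unsqueeze(-1)
            blended = (1 - w) * a + w * b                 # linear crossfade over the overlap
            out = torch.cat([out[:-TRANSITION], blended, seg[TRANSITION:]], dim=0)
        return out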

Comparison 1 - 2

Compared to PriorMDM, sPriorMDM generates much more expressive motions while preserving naturalness and smoothness in the transitions. We attribute this difference to the keyframe-centric training of sMDM, which enables the model to learn realistic motion dynamics from a sparse set of frames.

Text 1 : "A person skates on the ice."

Text 2 : "A person is pushed and falls to the ground."

Text 3 : "A person climbs up the stairs and then goes down."

Text 4 : "A person is doing a salsa dance."

PriorMDM 🤔 (less expressive)

sPriorMDM (Ours) 🟢


Text 1 : "A person crawls and then stands up."

Text 2 : "A person picks up a ball from the ground and then throws it away."

Text 3 : "A person punches and kicks in the air."

Text 4 : "A person waves his body."

PriorMDM ❌ (fails on Texts 1 and 4)

sPriorMDM (Ours) 🟢

Additional Samples 1 - 3

We present additional long-sequence samples generated by sPriorMDM from diverse sets of text prompts.

Text 1 : "A person leaps upward with explosive energy."

Text 2 : "A person spins around in a rapid twirl several times."

Text 3 : "A person drinks a cup of water."

Text 4 : "A person starts running on a treadmill."

sPriorMDM


Text 1 : "A person sits down on a chair."

Text 2 : "A person stands and stretches slowly, raising both arms overhead."

Text 3 : "A person walks toward the window to draw back the curtains."

Text 4 : "A person moves into the kitchen to prepare a cup of coffee."

sPriorMDM


Text 1 : "A person begins with a light walk along a park pathway."

Text 2 : "A person sprints along a track."

Text 3 : "A person stops and performs a series of arm swings."

Text 4 : "A person finishes by pausing to stretch on a grassy area."

sPriorMDM

Real-Time Character Control

We further validate our approach by applying it to Diffusion Planner (DiP), an autoregressive motion diffusion model designed for character control. Unlike standard motion generators, DiP predicts 2 seconds of future motion from 1 second of past trajectories. We name our model sDiP, which is trained with the same keyframe-centric approach as sMDM.
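The sketch below illustrates the autoregressive rollout such a controller implies: each step denoises a 2-second future window conditioned on the last 1 second of generated motion and the current text prompt. The model.denoise call, the rollout function, and the 20 fps assumption are hypothetical stand-ins rather than the actual DiP / sDiP interface.

    import torch

    FPS = 20
    PAST = 1 * FPS      # 1 second of past motion used as context
    FUTURE = 2 * FPS    # 2 seconds of future motion predicted per step


    def rollout(model, init_motion: torch.Tensor, prompts: list[str],
                steps_per_prompt: int) -> torch.Tensor:
        """Autoregressively extend `init_motion` ([frames, features]) prompt by prompt."""
        motion = init_motion
        for prompt in prompts:
            for _ in range(steps_per_prompt):
                context = motion[-PAST:]                        # most recent second of motion
                noise = torch.randn(FUTURE, motion.shape[-1])
                future = model.denoise(noise, context, prompt)  # hypothetical sampler call
                motion = torch.cat([motion, future], dim=0)
        return motion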

Comparison

We evaluate our model on extended scenarios in which the character performs a series of actions. In these scenarios, the character is dynamically controlled using text prompts that change every 8 seconds. We visualize the progression of time using the following color sequence: red, orange, yellow, green, blue, and purple.

Text 1 : "A person is warming up."

Text 2 : "A person walks forward, bends down to pick something up off the ground."

Text 3 : "A person is getting up and perform jumping jacks."

Text 4 : "A person is running back and forth."

Text 5 : "A person punches in a manner consistent with martial arts."

Text 6 : "A person dances like a pop star."

DiP 🟢

sDiP (Ours) 🟢

Although DiP often generates motions as plausible as sDiP's, we find that sDiP produces more expressive and contextually relevant motions. This observation is supported by the quantitative evaluation, where our models, including the target-conditioning variant (sDiP-T), clearly outperform the baseline on all metrics.
