Publications
Listed below are my publications in reverse chronological order.
2025
- [arXiv] Less is More: Improving Motion Diffusion Models with Sparse Keyframes. Jinseok Bae, Inwoo Hwang, Young Yoon Lee, Ziyu Guo, Joseph Liu, Yizhak Ben-Shabat, Young Min Kim, and Mubbasir Kapadia. 2025.
Recent advances in motion diffusion models have led to remarkable progress in diverse motion generation tasks, including text-to-motion synthesis. However, existing approaches represent motions as dense frame sequences, requiring the model to process redundant or less informative frames. The processing of dense animation frames imposes significant training complexity, especially when learning intricate distributions of large motion datasets even with modern neural architectures. This severely limits the performance of generative motion models for downstream tasks. Inspired by professional animators who mainly focus on sparse keyframes, we propose a novel diffusion framework explicitly designed around sparse and geometrically meaningful keyframes. Our method reduces computation by masking non-keyframes and efficiently interpolating missing frames. We dynamically refine the keyframe mask during inference to prioritize informative frames in later diffusion steps. Extensive experiments show that our approach consistently outperforms state-of-the-art methods in text alignment and motion realism, while also effectively maintaining high performance at significantly fewer diffusion steps. We further validate the robustness of our framework by using it as a generative prior and adapting it to different downstream tasks. Source code and pre-trained models will be released upon acceptance.
@article{bae2025less,
  title  = {Less is More: Improving Motion Diffusion Models with Sparse Keyframes},
  author = {Bae, Jinseok and Hwang, Inwoo and Lee, Young Yoon and Guo, Ziyu and Liu, Joseph and Ben-Shabat, Yizhak and Kim, Young Min and Kapadia, Mubbasir},
  year   = {2025},
}
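For intuition only, the sketch below illustrates the keyframe idea behind this paper in isolation: non-keyframe poses are dropped and re-filled by interpolating between sparse keyframes. It is not the released code; the array shapes, the hand-picked keyframe indices, and the use of plain linear interpolation are my own simplifying assumptions.

```python
# Hypothetical illustration of keyframe masking and linear in-filling (NumPy only).
import numpy as np

def interpolate_from_keyframes(motion, keyframe_idx):
    """Fill every non-keyframe by linearly interpolating between the
    surrounding keyframes, per feature dimension.

    motion       : (T, D) array of pose features
    keyframe_idx : sorted 1-D array of frame indices kept as keyframes
                   (must include frame 0 and frame T-1)
    """
    T, D = motion.shape
    frames = np.arange(T)
    filled = np.empty_like(motion)
    for d in range(D):
        filled[:, d] = np.interp(frames, keyframe_idx, motion[keyframe_idx, d])
    return filled

if __name__ == "__main__":
    T, D = 60, 6
    dense = np.cumsum(np.random.randn(T, D) * 0.05, axis=0)   # smooth synthetic motion
    keys = np.array([0, 14, 29, 44, 59])                      # sparse keyframes
    sparse_recon = interpolate_from_keyframes(dense, keys)
    print("mean reconstruction error:", np.abs(sparse_recon - dense).mean())
```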
- [CVPR Workshop (HuMoGen)] Goal-Driven Human Motion Generation in Diverse Tasks. Inwoo Hwang, Jinseok Bae, Donggeun Lim, and Young Min Kim. 2025.
We propose a framework for goal-driven human motion generation, which can synthesize interaction-rich scenarios. Given the goal positions for key joints, our pipeline automatically generates natural full-body motion that approaches the target in cluttered environments. The pipeline handles the complex constraints in a tractable formulation by disentangling motion generation into two stages. The first stage computes trajectories for key joints such as the hands and feet, encouraging the character to naturally approach the target position while avoiding physical violations. We demonstrate that diffusion-based guidance sampling can flexibly adapt to the local scene context while satisfying goal conditions. The second stage then generates plausible full-body motion that follows the key-joint trajectories. The proposed pipeline applies to various scenarios that must concurrently account for 3D scene geometry and body joint configurations.
@article{hwang2025goal,
  title  = {Goal-Driven Human Motion Generation in Diverse Tasks},
  author = {Hwang, Inwoo and Bae, Jinseok and Lim, Donggeun and Kim, Young Min},
  year   = {2025},
}
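The snippet below is a toy illustration of guidance sampling toward a goal position, the mechanism the first stage relies on. The noise schedule stand-in, the guidance weighting, and every function name are invented for demonstration rather than taken from the authors' implementation.

```python
# Toy, self-contained sketch of goal guidance during sampling: at every step the
# trajectory estimate is nudged toward a target position for a key joint.
import numpy as np

def goal_guidance_step(traj, goal, step, n_steps, guidance_scale=0.3):
    """One simplified 'denoising + guidance' update.

    traj : (T, 3) current noisy estimate of a key-joint trajectory
    goal : (3,)   target position the final frame should reach
    """
    # Stand-in for the model's denoising update: shrink the noise a little.
    traj = traj * 0.98
    # Guidance: gradient of 0.5 * ||traj[-1] - goal||^2 w.r.t. the last frame.
    grad = np.zeros_like(traj)
    grad[-1] = traj[-1] - goal
    # Later steps get stronger guidance, mimicking schedule-dependent weighting.
    weight = guidance_scale * (step + 1) / n_steps
    return traj - weight * grad

goal = np.array([2.0, 0.0, 1.0])
traj = np.random.randn(30, 3)
for s in range(50):
    traj = goal_guidance_step(traj, goal, s, 50)
print("final-frame distance to goal:", np.linalg.norm(traj[-1] - goal))
```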
- [Eurographics Short Paper] Audio-aided Character Control for Inertial Measurement Tracking. Hojun Jang, Jinseok Bae, and Young Min Kim. 2025.
Physics-based character control generates realistic motion dynamics by leveraging kinematic priors from large-scale data within a simulation engine. The simulated motion respects physical plausibility, while dynamic cues like contacts and forces guide compelling human-scene interaction. However, leveraging audio cues, which can capture physical contacts in a cost-effective way, has been less explored in animating human motions. In this work, we demonstrate that audio inputs can enhance accuracy in predicting footsteps and capturing human locomotion dynamics. Experiments validate that audio-aided control from sparse observations (e.g., an IMU sensor on a VR headset) enhances the prediction accuracy of contact dynamics and motion tracking, offering a practical auxiliary signal for robotics, gaming, and virtual environments.
@article{jang2025audio,
  title  = {Audio-aided Character Control for Inertial Measurement Tracking},
  author = {Jang, Hojun and Bae, Jinseok and Kim, Young Min},
  year   = {2025},
}
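As a rough picture of why audio helps, the sketch below fuses a short-time audio energy envelope with IMU acceleration to flag footstep contacts on synthetic signals. The thresholds, window size, and AND-style fusion rule are assumptions chosen for demonstration, not the paper's learned model.

```python
# Toy sketch: fusing an audio energy envelope with IMU acceleration to flag
# footstep contacts. Purely illustrative; not the paper's method.
import numpy as np

def short_time_energy(signal, win=256):
    pad = np.pad(signal**2, (win // 2, win // 2 - 1), mode="edge")
    kernel = np.ones(win) / win
    return np.convolve(pad, kernel, mode="valid")

def detect_contacts(audio, imu_acc, audio_thresh=0.5, acc_thresh=1.5):
    """A frame is a contact candidate when BOTH the audio envelope and the
    IMU acceleration magnitude exceed their thresholds."""
    env = short_time_energy(audio)
    env = env / (env.max() + 1e-8)            # normalize to [0, 1]
    return (env > audio_thresh) & (np.abs(imu_acc) > acc_thresh)

# Synthetic demo: impulsive "step" sounds and matching acceleration spikes.
n = 4800
audio = 0.02 * np.random.randn(n)
imu_acc = 0.1 * np.random.randn(n)
for t in [800, 2000, 3200, 4400]:
    audio[t:t + 50] += 0.8
    imu_acc[t:t + 50] += 3.0
contacts = detect_contacts(audio, imu_acc)
print("frames flagged as contact:", int(contacts.sum()))
```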
- [Eurographics] Versatile Physics-based Character Control with Hybrid Latent Representation. Jinseok Bae, Jungdam Won, Donggeun Lim, Inwoo Hwang, and Young Min Kim. 2025.
We present a versatile latent representation that enables a physically simulated character to efficiently utilize motion priors. To build a powerful motion embedding that is shared across multiple tasks, the physics controller should employ a rich latent space that is easily explored and capable of generating high-quality motion. We propose integrating continuous and discrete latent representations to build a versatile motion prior that can be adapted to a wide range of challenging control tasks. Specifically, we build a discrete latent model to capture a distinctive posterior distribution without collapse, and simultaneously augment the sampled vector with continuous residuals to generate high-quality, smooth motion without jittering. We further incorporate Residual Vector Quantization, which not only maximizes the capacity of the discrete motion prior, but also efficiently abstracts the action space during the task learning phase. We demonstrate that our agent can produce diverse yet smooth motions simply by traversing the learned motion prior through unconditional motion generation. Furthermore, our model robustly satisfies sparse goal conditions with highly expressive natural motions, including head-mounted device tracking and motion in-betweening at irregular intervals, which could not be achieved with existing latent representations.
@article{bae2025hybrid,
  title  = {Versatile Physics-based Character Control with Hybrid Latent Representation},
  author = {Bae, Jinseok and Won, Jungdam and Lim, Donggeun and Hwang, Inwoo and Kim, Young Min},
  year   = {2025},
}
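The core quantization idea can be sketched compactly: residual vector quantization (RVQ) stacks several codebooks, and the leftover continuous residual is kept alongside the discrete codes. The codebook sizes and random codebooks below are placeholders; this is a sketch of the general RVQ-plus-residual mechanism, not the paper's network.

```python
# Minimal sketch of the hybrid idea: residual vector quantization (RVQ) over a
# latent code, with the remaining continuous residual kept and added back.
import numpy as np

rng = np.random.default_rng(0)
n_levels, codebook_size, dim = 3, 16, 8
codebooks = rng.normal(size=(n_levels, codebook_size, dim))   # made-up codebooks

def rvq_encode(z, codebooks):
    """Quantize z with stacked codebooks; return code indices, the quantized
    vector, and the leftover continuous residual."""
    residual = z.copy()
    quantized = np.zeros_like(z)
    indices = []
    for book in codebooks:
        dists = np.linalg.norm(book - residual, axis=1)
        idx = int(dists.argmin())
        indices.append(idx)
        quantized += book[idx]
        residual -= book[idx]
    return indices, quantized, residual

z = rng.normal(size=dim)
codes, z_q, z_res = rvq_encode(z, codebooks)
z_hybrid = z_q + z_res          # discrete part + continuous residual
print("codes:", codes)
print("reconstruction error with residual added back:", np.linalg.norm(z_hybrid - z))
```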
2023
- [ICCV] Dynamic Mesh Recovery from Partial Point Cloud Sequence. Hojun Jang, Minkwan Kim, Jinseok Bae, and Young Min Kim. 2023.
The exact 3D dynamics of the human body provides crucial evidence for analyzing the consequences of physical interaction between the body and the environment, which can eventually assist everyday activities in a wide range of applications. However, optimizing for 3D configurations from image observations requires a significant amount of computation, whereas real-world 3D measurements often suffer from noisy observations or complex occlusion. We resolve the challenge by learning a latent distribution representing strong temporal priors. We use a conditional variational autoencoder (CVAE) architecture with a transformer to train the motion priors on large-scale motion datasets. Our feature follower then effectively aligns the feature space of noisy, partial observations with the input required by the pre-trained motion priors, and quickly recovers a complete mesh sequence of motion. We demonstrate that the transformer-based autoencoder can collect the necessary spatio-temporal correlations and remain robust to various corruptions, such as missing temporal frames or noisy observations under severe occlusion. Our framework is general and can be applied to recover the full 3D dynamics of other subjects with parametric representations.
@article{jang2023dynamic,
  title  = {Dynamic Mesh Recovery from Partial Point Cloud Sequence},
  author = {Jang, Hojun and Kim, Minkwan and Bae, Jinseok and Kim, Young Min},
  year   = {2023},
  video  = {https://www.youtube.com/watch?v=OgineYrkgRE},
  page   = {https://hojunjang17.github.io/DynamicMeshRecovery/},
}
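For readers unfamiliar with the backbone, the following is a minimal conditional VAE in PyTorch showing the encode/reparameterize/decode pattern a motion prior of this kind builds on. The flat MLP layers, dimensions, and conditioning vector are placeholders; the paper's transformer architecture and feature follower are not reproduced here.

```python
# Minimal conditional VAE sketch (PyTorch): encode a pose sequence to a latent
# distribution, sample with the reparameterization trick, decode conditioned on
# features from a partial observation. Sizes are placeholders.
import torch
import torch.nn as nn

class TinyCVAE(nn.Module):
    def __init__(self, seq_dim=64, cond_dim=32, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(seq_dim + cond_dim, 128), nn.ReLU())
        self.to_mu = nn.Linear(128, latent_dim)
        self.to_logvar = nn.Linear(128, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, 128), nn.ReLU(), nn.Linear(128, seq_dim)
        )

    def forward(self, motion, cond):
        h = self.encoder(torch.cat([motion, cond], dim=-1))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterize
        recon = self.decoder(torch.cat([z, cond], dim=-1))
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon, kl

model = TinyCVAE()
motion = torch.randn(4, 64)    # flattened pose sequence (placeholder)
partial = torch.randn(4, 32)   # features from a partial point cloud (placeholder)
recon, kl = model(motion, partial)
print(recon.shape, float(kl))
```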
- [SIGGRAPH] PMP: Learning to Physically Interact with Environments using Part-wise Motion Priors. Jinseok Bae, Jungdam Won, Donggeun Lim, Cheol-Hui Min, and Young Min Kim. 2023.
We present a method to animate a character incorporating multiple part-wise motion priors (PMP). While previous works allow creating realistic articulated motions from reference data, the range of motion is largely limited by the available samples. Especially for interaction-rich scenarios, it is impractical to acquire every possible interacting motion, as the combination of physical parameters increases exponentially. The proposed PMP allows us to assemble multiple part skills to animate a character, creating a diverse set of motions from different combinations of existing data. In our pipeline, we can train an agent with a wide range of part-wise priors: each body part can obtain kinematic insight into the style from motion captures, or extract dynamics-related information from additional part-specific simulation. For example, we can first train a general interaction skill, e.g. grasping, only for the dexterous part, and then combine the expert trajectories from the pre-trained agent with the kinematic priors of other limbs. Eventually, our whole-body agent learns a novel physical interaction skill even in the absence of object trajectories in the reference motion sequence.
@article{bae2023pmp,
  title  = {PMP: Learning to Physically Interact with Environments using Part-wise Motion Priors},
  author = {Bae, Jinseok and Won, Jungdam and Lim, Donggeun and Min, Cheol-Hui and Kim, Young Min},
  year   = {2023},
  video  = {https://www.youtube.com/watch?v=WdLGvKdNG-0},
  page   = {https://jinseokbae.github.io/pmp},
}
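A toy sketch of the part-wise idea: each body part gets its own style score against part-specific reference statistics, and the scores are combined into a single whole-body reward. The scoring function, the part grouping, and the averaging rule below are illustrative assumptions, not the paper's adversarial formulation.

```python
# Toy sketch of combining part-wise style rewards. Each body part has its own
# stand-in scoring function; the whole-body reward averages the part scores.
import numpy as np

def part_score(part_obs, part_mean):
    """Stand-in for a per-part prior: higher when the observed part motion is
    close to that part's reference statistics."""
    return float(np.exp(-np.linalg.norm(part_obs - part_mean) ** 2))

def whole_body_style_reward(obs_by_part, ref_means):
    scores = [part_score(obs_by_part[name], ref_means[name]) for name in obs_by_part]
    return float(np.mean(scores))

ref_means = {"arms": np.zeros(6), "legs": np.zeros(6), "torso": np.zeros(3)}
obs = {"arms": 0.1 * np.random.randn(6),
       "legs": 0.1 * np.random.randn(6),
       "torso": 0.1 * np.random.randn(3)}
print("style reward:", whole_body_style_reward(obs, ref_means))
```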
2022
- [AAAI] Neural Marionette: Unsupervised Learning of Motion Skeleton and Latent Dynamics from Volumetric Video. Jinseok Bae, Hojun Jang, Cheol-Hui Min, Hyungun Choi, and Young Min Kim. 2022.
We present Neural Marionette, an unsupervised approach that discovers the skeletal structure from a dynamic sequence and learns to generate diverse motions that are consistent with the observed motion dynamics. Given a video stream of point cloud observation of an articulated body under arbitrary motion, our approach discovers the unknown low-dimensional skeletal relationship that can effectively represent the movement. Then the discovered structure is utilized to encode the motion priors of dynamic sequences in a latent structure, which can be decoded to the relative joint rotations to represent the full skeletal motion. Our approach works without any prior knowledge of the underlying motion or skeletal structure, and we demonstrate that the discovered structure is even comparable to the hand-labeled ground truth skeleton in representing a 4D sequence of motion. The skeletal structure embeds the general semantics of possible motion space that can generate motions for diverse scenarios. We verify that the learned motion prior is generalizable to the multi-modal sequence generation, interpolation of two poses, and motion retargeting to a different skeletal structure.
@article{bae2022neural,
  title  = {Neural Marionette: Unsupervised Learning of Motion Skeleton and Latent Dynamics from Volumetric Video},
  author = {Bae, Jinseok and Jang, Hojun and Min, Cheol-Hui and Choi, Hyungun and Kim, Young Min},
  year   = {2022},
}
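Since the decoded output is a set of relative joint rotations over a discovered skeleton, a small forward-kinematics sketch shows how such rotations plus a parent list turn into global joint positions. The 4-joint chain, bone offsets, and single-axis rotations below are invented for illustration; they only demonstrate the generic FK step, not the paper's model.

```python
# Toy forward-kinematics sketch: relative rotations + a parent list -> global
# joint positions.
import numpy as np

def rot_z(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def forward_kinematics(parents, offsets, local_rots):
    """parents[i] is the parent joint index (-1 for the root);
    offsets[i] is the bone vector in the parent frame;
    local_rots[i] is the joint's relative rotation matrix."""
    n = len(parents)
    global_rot = [None] * n
    global_pos = [None] * n
    for i in range(n):                       # parents assumed to precede children
        if parents[i] == -1:
            global_rot[i] = local_rots[i]
            global_pos[i] = offsets[i]
        else:
            p = parents[i]
            global_rot[i] = global_rot[p] @ local_rots[i]
            global_pos[i] = global_pos[p] + global_rot[p] @ offsets[i]
    return np.stack(global_pos)

parents = [-1, 0, 1, 2]                      # a simple 4-joint chain
offsets = np.array([[0, 0, 0], [0, 1, 0], [0, 1, 0], [0, 1, 0]], dtype=float)
local_rots = [rot_z(a) for a in (0.0, 0.3, 0.3, 0.3)]
print(forward_kinematics(parents, offsets, local_rots))
```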
2021
- [Eurographics Short Paper] Auto-rigging 3D Bipedal Characters in Arbitrary Poses. Jeonghwan Kim, Hyeontae Son, Jinseok Bae, and Young Min Kim. 2021.
We present an end-to-end algorithm that can automatically rig a given 3D character so that it is ready for 3D animation. The animation of a virtual character requires the skeletal motion defined with bones and joints, and the corresponding deformation of the mesh represented with skin weights. While the conventional animation pipeline requires the initial 3D character to be in a predefined default pose, our pipeline can rig a 3D character in an arbitrary pose. We handle the increased ambiguity by fixing the skeletal topology and solving for the full deformation space. After the skeletal positions and orientations are fully recovered, we can deform the provided 3D character into the default pose, from which we can animate the character with the help of recent motion-retargeting techniques. Our results show that we can successfully animate initially deformed characters, which was not possible with previous works.
@article{kim2021autorigging,
  title  = {Auto-rigging 3D Bipedal Characters in Arbitrary Poses},
  author = {Kim, Jeonghwan and Son, Hyeontae and Bae, Jinseok and Kim, Young Min},
  year   = {2021},
  paper  = {https://diglib.eg.org/handle/10.2312/egs20211023},
  video  = {https://www.youtube.com/watch?v=1UVNbxYLkE8},
}
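The two quantities an auto-rigging pipeline must ultimately produce, per-bone transforms and per-vertex skin weights, feed a standard linear blend skinning step, sketched below on random data. The two-bone setup and all shapes are invented; this is the generic LBS formula, not the paper's network.

```python
# Toy linear-blend-skinning sketch: deform mesh vertices from per-bone
# transforms and per-vertex skin weights.
import numpy as np

def linear_blend_skinning(vertices, weights, bone_rots, bone_trans):
    """vertices: (V, 3), weights: (V, B), bone_rots: (B, 3, 3), bone_trans: (B, 3)."""
    # Transform every vertex by every bone, then blend with the skin weights.
    per_bone = np.einsum("bij,vj->bvi", bone_rots, vertices) + bone_trans[:, None, :]
    return np.einsum("vb,bvi->vi", weights, per_bone)

V, B = 5, 2
verts = np.random.rand(V, 3)
w = np.random.rand(V, B)
w = w / w.sum(axis=1, keepdims=True)          # weights sum to 1 per vertex
rots = np.stack([np.eye(3), np.eye(3)])       # identity rotations for the demo
trans = np.array([[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print(linear_blend_skinning(verts, w, rots, trans))
```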
- [CVPR] GATSBI: Generative Agent-centric Spatio-temporal Object Interaction. Cheol-Hui Min, Jinseok Bae, Junho Lee, and Young Min Kim. 2021.
We present GATSBI, a generative model that can transform a sequence of raw observations into a structured latent representation that fully captures the spatio-temporal context of the agent’s actions. In vision-based decision-making scenarios, an agent faces complex high-dimensional observations where multiple entities interact with each other. The agent requires a good scene representation of the visual observation that discerns essential components and consistently propagates along the time horizon. Our method, GATSBI, utilizes unsupervised object-centric scene representation learning to separate an active agent, static background, and passive objects. GATSBI then models the interactions reflecting the causal relationships among decomposed entities and predicts physically plausible future states. Our model generalizes to a variety of environments where different types of robots and objects dynamically interact with each other. We show GATSBI achieves superior performance on scene decomposition and video prediction compared to its state-of-the-art counterparts.
@article{min2021gatsbi,
  title  = {GATSBI: Generative Agent-centric Spatio-temporal Object Interaction},
  author = {Min, Cheol-Hui and Bae, Jinseok and Lee, Junho and Kim, Young Min},
  year   = {2021},
  video  = {https://www.youtube.com/watch?v=nAf87_0T5CE},
}
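As a rough picture of object-centric decomposition, the sketch below composes a frame from per-slot appearances (agent, object, background) mixed by softmax masks; a model like the one described infers these slots from video rather than drawing them at random. All shapes, slot counts, and names are illustrative assumptions.

```python
# Toy sketch of object-centric composition: a frame is explained as a weighted
# sum of per-slot appearances with softmax mixing masks.
import numpy as np

def compose_frame(appearances, mask_logits):
    """appearances: (K, H, W, 3), mask_logits: (K, H, W) -> composed (H, W, 3)."""
    logits = mask_logits - mask_logits.max(axis=0, keepdims=True)
    masks = np.exp(logits)
    masks = masks / masks.sum(axis=0, keepdims=True)     # softmax over slots
    return (masks[..., None] * appearances).sum(axis=0)

K, H, W = 3, 8, 8                     # slots: agent, object, background
apps = np.random.rand(K, H, W, 3)
logits = np.random.randn(K, H, W)
frame = compose_frame(apps, logits)
print(frame.shape, bool(frame.min() >= 0.0))
```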