HuMoR: 3D Human Motion Model for Robust Pose Estimation

Davis Rempe¹  Tolga Birdal¹  Aaron Hertzmann²  Jimei Yang²  Srinath Sridhar³  Leonidas J. Guibas¹
¹Stanford University  ²Adobe Research  ³Brown University
International Conference on Computer Vision (ICCV) 2021 (Oral Presentation)

Paper | Supplementary | Code

We introduce HuMoR: a 3D Human Motion Model for Robust Estimation of temporal pose and shape. Though substantial progress has been made in estimating 3D human motion and shape from dynamic observations, recovering plausible pose sequences in the presence of noise and occlusions remains a challenge. For this purpose, we propose an expressive generative model in the form of a conditional variational autoencoder, which learns a distribution of the change in pose at each step of a motion sequence. Furthermore, we introduce a flexible optimization-based approach that leverages HuMoR as a motion prior to robustly estimate plausible pose and shape from ambiguous observations. Through extensive evaluations, we demonstrate that our model generalizes to diverse motions and body shapes after training on a large motion capture dataset, and enables motion reconstruction from multiple input modalities including 3D keypoints and RGB(-D) videos.

Method Overview

Rather than describing likely poses, HuMoR models a probability distribution of possible pose transitions, formulated as a conditional variational autoencoder (CVAE). Though not explicitly physics-based, its components correspond to a physical model: the latent space can be interpreted as generalized forces, which are inputs to a dynamics model with numerical integration (the decoder). Moreover, ground contacts are explicitly predicted and used to constrain pose estimation at test time.

HuMoR conditional variational autoencoder architecture.
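
To make this concrete, below is a minimal PyTorch sketch of such a transition CVAE. It is an illustration under assumed state and latent dimensions, not the released implementation: the encoder infers a latent transition from consecutive states, a conditional prior predicts it from the previous state alone, and the decoder outputs a change in state that is added back to the previous state, acting as one step of numerical integration.

import torch
import torch.nn as nn

class TransitionCVAE(nn.Module):
    # Sketch of a CVAE over pose transitions; dimensions and layer sizes
    # are illustrative assumptions.
    def __init__(self, state_dim=138, latent_dim=48, hidden=512):
        super().__init__()
        # Encoder q(z | x_prev, x_curr): infers the latent transition.
        self.encoder = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),  # mean and log-variance
        )
        # Conditional prior p(z | x_prev): predicts likely transitions.
        self.prior = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),
        )
        # Decoder: predicts the change in state from (z, x_prev); adding it
        # back to x_prev acts as one step of numerical integration.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, x_prev, x_curr):
        mu_q, logvar_q = self.encoder(torch.cat([x_prev, x_curr], -1)).chunk(2, -1)
        mu_p, logvar_p = self.prior(x_prev).chunk(2, -1)
        z = mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp()  # reparameterize
        x_next = x_prev + self.decoder(torch.cat([z, x_prev], -1))  # integrate
        return x_next, (mu_q, logvar_q), (mu_p, logvar_p)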



After training on the large AMASS motion capture dataset, we use HuMoR as a motion prior at test time for 3D human perception from noisy and partial observations across different input modalities such as RGB(-D) video and 2D or 3D joint sequences. We introduce a test-time optimization (TestOpt) that estimates the parameters of 3D motion, body shape, the ground plane, and contact points. The optimization is robust because it (i) parameterizes the motion in the latent space of the CVAE and (ii) uses HuMoR priors to regularize the solution toward the space of plausible motions.

Our optimization procedure leverages HuMoR to recover plausible motions from many modalities even under noise and occlusions.
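
The sketch below illustrates this idea under the assumed TransitionCVAE interface from above: the motion is parameterized by a sequence of latent transitions plus an initial state, decoded through the fixed model, and scored by a data term on the observations plus a penalty on latents that are unlikely under the learned prior. The function names, loss weights, and iteration counts here are hypothetical, not the released TestOpt code.

import torch

def test_opt(model, x0, observations, data_term, T, latent_dim,
             num_iters=200, lr=0.01, w_prior=0.1):
    # Optimize the latent transitions z_1..z_T and the initial state,
    # keeping the CVAE weights fixed.
    z_seq = torch.zeros(T, latent_dim, requires_grad=True)
    x0 = x0.clone().requires_grad_(True)
    optim = torch.optim.Adam([z_seq, x0], lr=lr)
    for _ in range(num_iters):
        optim.zero_grad()
        x, states, prior_loss = x0, [], 0.0
        for t in range(T):
            mu_p, logvar_p = model.prior(x).chunk(2, -1)
            # Penalize latents that are unlikely under the conditional prior.
            prior_loss = prior_loss + (((z_seq[t] - mu_p) ** 2) / logvar_p.exp()).sum()
            x = x + model.decoder(torch.cat([z_seq[t], x], -1))  # decode one step
            states.append(x)
        motion = torch.stack(states)  # (T, state_dim) decoded motion
        loss = data_term(motion, observations) + w_prior * prior_loss
        loss.backward()
        optim.step()
    return motion.detach()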



Next, we show a sampling of video results for our approach. For a more detailed explanation of these results and more supplementary videos corresponding to each section of the paper, see the full supplementary material webpage.

Results on 3D Data
Our proposed optimization using HuMoR can fit to 3D observations such as partial keypoints or noisy joints, recovering motion, shape, and ground contacts. See additional keypoint and joint examples.
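
As one hypothetical example of a data term for this setting (not the paper's exact energy), a masked, robust distance between regressed joints and observed 3D keypoints could look like the following; joints_from_state, which maps motion states to 3D joint positions, is an assumed helper.

import torch

def keypoint_data_term(motion, observations, mask, joints_from_state, sigma=0.05):
    # motion: (T, state_dim) optimized states; observations: (T, J, 3)
    # observed 3D keypoints; mask: (T, J) with 1 where a keypoint is observed.
    joints = joints_from_state(motion)                # (T, J, 3) regressed joints
    sq_dist = ((joints - observations) ** 2).sum(-1)  # (T, J) squared distances
    robust = sq_dist / (sq_dist + sigma ** 2)         # Geman-McClure-style robustifier
    return (robust * mask).sum()                      # ignore unobserved joints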

Results on RGB Video
In these examples, our optimization fits to 2D joints detected in RGB videos while maintaining robustness to occlusions. Here we show examples of motion, shape, and contact predictions. See additional examples and comparisons here and here.
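
For RGB input, a data term of this kind would instead compare projected 3D joints to the 2D detections, weighted by detector confidence so occluded or uncertain joints contribute less. A minimal sketch, assuming camera-frame 3D joints and pinhole intrinsics K (an illustration, not the paper's exact energy):

import torch

def reprojection_data_term(joints3d, joints2d, conf, K):
    # joints3d: (T, J, 3) joints in the camera frame; joints2d: (T, J, 2)
    # detections; conf: (T, J) detector confidences; K: (3, 3) intrinsics.
    cam = joints3d @ K.T                   # apply pinhole intrinsics
    proj = cam[..., :2] / cam[..., 2:3]    # perspective divide to pixels
    sq_dist = ((proj - joints2d) ** 2).sum(-1)
    return (conf * sq_dist).sum()          # down-weight occluded/uncertain joints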


HuMoR also generalizes to fast, dynamic motions such as the dancing shown below. See additional dancing results here.
Results on RGB-D Video
Similarly, our method handles RGB-D data by fitting to both 2D joints and the 3D point cloud. Here we also show the estimated ground plane within the ground-truth scene mesh for reference. See additional results.
Generation Results
Finally, as a standalone generative model, HuMoR can produce plausible random motions. Here we show motion samples on test-set body shapes (left) and on a single body starting from the same initial state (right). See additional results.
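
With the assumed TransitionCVAE interface from above, generation reduces to a simple rollout: sample each latent transition from the learned conditional prior and decode it to advance the state one step.

import torch

@torch.no_grad()
def sample_motion(model, x0, T=60):
    # Roll the transition model forward from an initial state x0.
    x, states = x0, [x0]
    for _ in range(T):
        mu_p, logvar_p = model.prior(x).chunk(2, -1)
        z = mu_p + torch.randn_like(mu_p) * (0.5 * logvar_p).exp()  # sample prior
        x = x + model.decoder(torch.cat([z, x], -1))  # decode and integrate
        states.append(x)
    return torch.stack(states)  # (T + 1, state_dim) sampled motion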

Additional Results
A more detailed explanation of the above results and more videos are available on the full supplementary material webpage.
Acknowledgments
This work was supported by the Toyota Research Institute ("TRI") under the University 2.0 program, grants from the Samsung GRO program and the Ford-Stanford Alliance, a Vannevar Bush Faculty Fellowship, NSF grant IIS-1763268, and NSF grant CNS-2038897. TRI provided funds to assist the authors with their research, but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.

This project page template is based on this page.
Citation
@inproceedings{rempe2021humor,
    author={Rempe, Davis and Birdal, Tolga and Hertzmann, Aaron and Yang, Jimei and Sridhar, Srinath and Guibas, Leonidas J.},
    title={HuMoR: 3D Human Motion Model for Robust Pose Estimation},
    booktitle={International Conference on Computer Vision (ICCV)},
    year={2021}
}

Contact
For any questions, please contact Davis Rempe at drempe@stanford.edu.