We introduce HuMoR: a 3D Human Motion Model for Robust Estimation of temporal pose and shape.
Though substantial progress has been made in estimating 3D human motion and shape from dynamic observations,
recovering plausible pose sequences in the presence of noise and occlusions remains a challenge. For this purpose,
we propose an expressive generative model in the form of a conditional variational autoencoder, which learns a distribution
of the change in pose at each step of a motion sequence. Furthermore, we introduce a flexible optimization-based approach
that leverages HuMoR as a motion prior to robustly estimate plausible pose and shape from ambiguous observations.
Through extensive evaluations, we demonstrate that our model generalizes to diverse motions and body shapes after training
on a large motion capture dataset, and enables motion reconstruction from multiple input modalities including 3D keypoints
and RGB(-D) videos.
Method Overview
Rather than describing likely poses, HuMoR models a probability distribution of possible pose transitions,
formulated as a conditional variational autoencoder (CVAE). Though not explicitly physics-based, its components
correspond to a physical model: the latent space can be interpreted as generalized forces, which are inputs to a
dynamics model with numerical integration (the decoder). Moreover, ground contacts are explicitly predicted and used
to constrain pose estimation at test time.
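To make the dynamics reading concrete, here is a minimal, dependency-free sketch of one decoder step. The dimensions, the linear "decoder" weights, and the frame interval are all illustrative assumptions; in HuMoR the decoder is a learned neural network over the full SMPL body state.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions for illustration only; these sizes and weights are
# assumptions, not HuMoR's actual architecture.
STATE_DIM = 6     # stand-in for the body state (root + joint parameters)
LATENT_DIM = 4    # stand-in for the latent transition code
DT = 1.0 / 30.0   # assumed frame interval

# Hypothetical linear "decoder"; in HuMoR this is a learned network.
W = rng.normal(scale=0.1, size=(LATENT_DIM + STATE_DIM, STATE_DIM))

def decode_step(x_prev, z):
    """One decoder step read as a dynamics model: the latent z acts like a
    generalized force, the decoder predicts the change of state, and that
    change is numerically integrated (an Euler step) onto x_prev."""
    delta = np.concatenate([z, x_prev]) @ W  # predicted change in pose
    return x_prev + DT * delta               # numerical integration

x0 = np.zeros(STATE_DIM)
x1 = decode_step(x0, rng.normal(size=LATENT_DIM))
```

Rolling this step forward autoregressively, one latent code per frame, yields a full motion sequence.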
After training on the large AMASS motion capture dataset, we use HuMoR as a motion prior at test time for 3D human perception
from noisy and partial observations across different input modalities such as RGB(-D) video and 2D or 3D joint sequences.
We introduce a robust test-time optimization (TestOpt) which estimates the parameters of 3D motion, body shape,
the ground plane, and contact points. The optimization gives robust results by (i) parameterizing the motion in the
latent space of the CVAE, and (ii) using HuMoR priors to regularize the optimization towards the space of plausible motions.
Our optimization procedure leverages HuMoR to recover plausible motions from many
modalities even under noise and occlusions.
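The structure of such a latent-space fit can be sketched with a toy problem: optimize a latent sequence so that its decoded motion matches noisy observations, with a prior term keeping the latents plausible. The linear "decoder", the simple Gaussian prior, and all sizes here are illustrative assumptions standing in for HuMoR's learned networks and priors.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy sizes and a hypothetical linear "decoder" (assumptions, not HuMoR's).
T, STATE_DIM, LATENT_DIM, DT = 10, 6, 4, 1.0 / 30.0
W = rng.normal(scale=0.1, size=(LATENT_DIM + STATE_DIM, STATE_DIM))

def rollout(x0, zs):
    """Decode a latent sequence into a motion by repeated integration."""
    xs, x = [], x0
    for z in zs:
        x = x + DT * (np.concatenate([z, x]) @ W)
        xs.append(x)
    return np.stack(xs)

# Noisy "observations" of a motion generated from ground-truth latents.
x0 = np.zeros(STATE_DIM)
obs = rollout(x0, rng.normal(size=(T, LATENT_DIM))) \
      + 0.01 * rng.normal(size=(T, STATE_DIM))

def objective(zs):
    data = np.sum((rollout(x0, zs) - obs) ** 2)  # fit the observations
    prior = 1e-3 * np.sum(zs ** 2)               # stay near plausible motions
    return data + prior

# Plain gradient descent with finite-difference gradients (dependency-free).
zs = np.zeros((T, LATENT_DIM))
lr, eps = 5.0, 1e-6
initial = objective(zs)
for _ in range(100):
    base = objective(zs)
    grad = np.zeros_like(zs)
    for i in range(T):
        for j in range(LATENT_DIM):
            zp = zs.copy()
            zp[i, j] += eps
            grad[i, j] = (objective(zp) - base) / eps
    zs -= lr * grad
final = objective(zs)
```

Because the decision variables are latent transitions rather than raw poses, every candidate during optimization decodes to a motion the model considers possible, which is what makes the fit robust to noise and occlusions.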
Next, we show a sampling of video results for our approach. For a more detailed explanation of these results and
more supplementary videos corresponding to each section of the paper, see
the full supplementary material webpage.
Results on 3D Data
Our proposed optimization using HuMoR can fit to 3D observations such as partial keypoints or noisy joints,
recovering motion, shape, and ground contacts. See additional keypoint and joint examples.
Results on RGB Video
In these examples, our optimization fits to 2D joints detected in RGB videos while maintaining robustness
to occlusions. Here we show examples of motion, shape, and contact predictions.
See additional examples and comparisons here
and here.
HuMoR also generalizes to fast, dynamic motions such as the dancing shown below. See additional dancing results here.
Results on RGB-D Video
Similarly, our method works on RGB-D data by fitting to 2D joints and the 3D point cloud. Here we also
show the ground plane output within the true scene mesh for reference.
See additional results.
Generation Results
Finally, HuMoR as a standalone generative model can produce plausible random motions. Here we show
motion samples on test-set body shapes (left) and on a single body starting from the same initial state (right).
See additional results.
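Sampling from the model amounts to an autoregressive rollout: draw one latent transition per frame and integrate the decoded change of state. The sketch below uses toy sizes, a linear "decoder", and a standard normal latent prior as stand-ins; HuMoR's actual prior is a learned conditional Gaussian and its decoder is a neural network over the SMPL state.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative sizes and a hypothetical linear "decoder" (assumptions).
STATE_DIM, LATENT_DIM, DT = 6, 4, 1.0 / 30.0
W = rng.normal(scale=0.1, size=(LATENT_DIM + STATE_DIM, STATE_DIM))

def sample_motion(x0, steps):
    """Generate a motion from an initial state: sample one latent transition
    per frame and integrate the decoded change of state (rollout)."""
    xs, x = [x0], x0
    for _ in range(steps):
        z = rng.normal(size=LATENT_DIM)            # sample transition code
        x = x + DT * (np.concatenate([z, x]) @ W)  # decode + integrate
        xs.append(x)
    return np.stack(xs)

x0 = np.zeros(STATE_DIM)
m1 = sample_motion(x0, 60)
m2 = sample_motion(x0, 60)  # same initial state, different random motion
```

Starting two rollouts from the same initial state, as in the right-hand results above, produces distinct motions because a fresh latent is sampled at every frame.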
This work was supported by the Toyota Research Institute ("TRI") under the University 2.0 program, grants from the
Samsung GRO program and the Ford-Stanford Alliance, a Vannevar Bush Faculty Fellowship, NSF grant IIS-1763268, and NSF grant CNS-2038897.
TRI provided funds to assist the authors with their research but this article solely reflects the opinions and conclusions of its authors
and not TRI or any other Toyota entity.
@inproceedings{rempe2021humor,
author={Rempe, Davis and Birdal, Tolga and Hertzmann, Aaron and Yang, Jimei and Sridhar, Srinath and Guibas, Leonidas J.},
title={HuMoR: 3D Human Motion Model for Robust Pose Estimation},
booktitle={International Conference on Computer Vision (ICCV)},
year={2021}
}