This page contains extensive qualitative results that supplement experiments in Sec. 5 of the main paper and the appendix.
To see a quick sampling of results instead, please visit the main project page.
Each set of qualitative results can be accessed with the links below, or simply scroll down to watch
the videos in order. Specific sequences are linked with buttons for easy reference.
1. Generative Sampling
These results show the capabilities of HuMoR as a standalone generative model. In each example, a random initial
state is sampled from the AMASS test set, and then sequences are sampled through autoregressive rollout.
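To make the rollout procedure concrete, below is a minimal sketch of how autoregressive sampling from a CVAE motion model could look. The `prior` and `decode` methods and the state representation are illustrative assumptions, not the released HuMoR API.

```python
import torch

def sample_rollout(cvae, x0, num_steps):
    """Sample a motion by repeatedly drawing a latent transition and decoding it.

    Assumed (hypothetical) interface, for illustration only:
      cvae.prior(x_prev)     -> (mu, logvar) of the conditional prior p(z_t | x_{t-1})
      cvae.decode(z, x_prev) -> the next state x_t
    x0 is an initial state, e.g. sampled from the AMASS test set.
    """
    states = [x0]
    x_prev = x0
    for _ in range(num_steps):
        mu, logvar = cvae.prior(x_prev)                       # conditional prior over the transition
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # draw z_t ~ N(mu, sigma^2)
        x_prev = cvae.decode(z, x_prev)                       # decode the next state
        states.append(x_prev)
    return torch.stack(states, dim=0)                         # (num_steps + 1, state_dim)
```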
(0:03) Randomly sampled sequences using the qualitative variation of HuMoR on 4 body shapes not seen during training.
(0:34) Same random sampling but with person-ground contacts (output from the CVAE decoder) visualized in red.
(0:43) Demonstrates the diversity of random rollout by showing 5 sampled sequences all beginning at the same initial state.
(1:03) Demonstrates the stability of random sampling by generating a minute-long sequence.
(1:27) Randomly sampled sequences using HuMoR (Qual), HuMoR, and MVAE, all starting from the same initial state.
Note that MVAE quickly diverges to unrealistic poses and HuMoR tends to suffer from plausibility artifacts such
as feet floating above or penetrating the floor.
2. Estimation from 3D: Occluded Keypoints (Section 5.4)
These results demonstrate using test-time optimization (TestOpt) with HuMoR as a motion prior to fit to partially
observed 3D keypoint data (generated from the AMASS test set). In each example, Observations+Ground Truth
shows the observed keypoints in blue on the ground truth body. The output motion is shown on the opaque body mesh,
with the observed keypoints again in blue for reference.
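As a rough illustration of what this test-time optimization could look like, the sketch below optimizes the initial state and a sequence of latent transitions so that the decoded motion matches the observed keypoints while staying likely under the conditional prior. The `cvae.prior`/`cvae.decode`/`joint_regressor` interfaces and the weights are assumptions for illustration; the full objective in the paper includes additional terms omitted here.

```python
import torch

def fit_to_keypoints(cvae, joint_regressor, obs_kp, vis_mask, x0_init,
                     num_iters=300, lr=0.01, w_prior=0.1):
    """Sketch of test-time optimization with a motion prior (illustrative, not the released code).

    obs_kp:   (T, J, 3) partially observed 3D keypoints
    vis_mask: (T, J) 1.0 where a keypoint is observed, 0.0 where it is occluded
    x0_init:  initial guess for the state at the first frame
    joint_regressor(states) -> (T, J, 3) joints, e.g. via the SMPL body model (assumed helper)
    """
    T = obs_kp.shape[0]
    x0 = x0_init.clone().requires_grad_(True)
    latents = torch.zeros(T - 1, cvae.latent_dim, requires_grad=True)  # one latent per transition
    optim = torch.optim.Adam([x0, latents], lr=lr)

    for _ in range(num_iters):
        optim.zero_grad()
        states, prior_nll = [x0], 0.0
        x_prev = x0
        for t in range(T - 1):
            mu, logvar = cvae.prior(x_prev)
            z = latents[t]
            # negative log-likelihood of z under the conditional prior (up to constants)
            prior_nll = prior_nll + (0.5 * (z - mu) ** 2 / logvar.exp() + 0.5 * logvar).sum()
            x_prev = cvae.decode(z, x_prev)
            states.append(x_prev)
        motion = torch.stack(states, dim=0)                   # (T, state_dim)
        pred_kp = joint_regressor(motion)                     # (T, J, 3)
        data_term = (vis_mask[..., None] * (pred_kp - obs_kp) ** 2).sum()
        loss = data_term + w_prior * prior_nll
        loss.backward()
        optim.step()
    return motion.detach()
```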
(0:03) Results on 5 different sequences fitting to partial keypoints that are occluded below the hips. Despite
only observing the upper body, our method produces plausible walking, side stepping, jogging, running, and kicking.
(0:43) Comparing results on 6 different sequences to baselines. VPoser-t fails to produce any plausible lower-body
motion since it uses only a pose prior, while using MVAE as the motion prior often gives unnatural and implausible
motions. This is especially noticeable when watching the feet, as indicated by the red and green boxes in the first few results.
3. Estimation from 3D: Noisy Joints (Section 5.4)
These results demonstrate fitting to noisy 3D joint data (generated from the AMASS test set). In each example,
Observations+Ground Truth shows the observed joints in green with the ground truth body and motion.
(0:03) Results on 4 different sequences using our proposed approach. The output on the right shows the estimated body
mesh and resulting SMPL joints, with ground contacts colored in red. Despite severe noise, TestOpt with HuMoR recovers
smooth motions with highly accurate contacts.
(0:27) Comparing results on 4 different sequences. Output results for each method show the recovered body mesh along with noisy
joint observations in green for reference. The VPoser-t baseline gives overly smoothed motions (as indicated by colored boxes around feet) while still
being affected by noise (e.g. see the head jerkiness indicated by boxes). HuMoR allows for large accelerations to fit dynamic, yet smooth, motions.
4. Estimation from RGB: i3DB Data Baseline Comparisons (Section 5.5)
These results demonstrate fitting to 2D joints detected from RGB videos in the i3DB dataset, which contains heavy occlusions.
For each example sequence, the input video and output of TestOpt with HuMoR (both motion and contacts) are shown first. Next,
HuMoR results are compared to the VIBE and VPoser-t baselines - first from the camera view and then from an
alternate view with the predicted ground plane shown for reference.
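For intuition, fitting to 2D detections replaces the 3D keypoint data term with a camera reprojection term. A confidence-weighted, robust reprojection energy of the kind commonly used in SMPLify-style fitting could look like the sketch below (illustrative only; the exact robust penalty and weighting in the paper may differ).

```python
import torch

def reprojection_energy(pred_joints_cam, obs_joints_2d, conf, K, sigma_sq=100.0):
    """Confidence-weighted robust 2D reprojection term (illustrative sketch).

    pred_joints_cam: (T, J, 3) predicted 3D joints in camera coordinates
    obs_joints_2d:   (T, J, 2) detected 2D joints (e.g. from OpenPose)
    conf:            (T, J) detection confidences
    K:               (3, 3) pinhole camera intrinsics
    """
    proj = pred_joints_cam @ K.T                              # apply intrinsics
    proj_2d = proj[..., :2] / proj[..., 2:3].clamp(min=1e-6)  # perspective divide
    sq_err = ((proj_2d - obs_joints_2d) ** 2).sum(dim=-1)     # per-joint squared pixel error
    robust = sq_err / (sq_err + sigma_sq)                     # Geman-McClure-style robust penalty
    return (conf * robust).sum()
```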
(0:03) Results for a sidestepping motion with occluded feet. Note that the VIBE output is very noisy over time and
both VIBE and VPoser-t produce an implausible neutral pose when feet are occluded, causing inconsistent foot heights
that are particularly noticeable in the alternate view (as indicated by colored boxes).
(0:21) Results for a sitting motion with heavy occlusions. Due to the occluded lower body, VIBE tends to fail while
VPoser-t produces a standing pose that severely penetrates the floor (as indicated by colored boxes). HuMoR, in contrast,
produces a plausible sitting motion that remains consistent with the standing pose at the start of the sequence.
(0:39, 0:57, 1:15) Additional examples. The advantage of our approach
is especially apparent when observing the feet (e.g. consistent foot height and improved contacts).
5. Estimation from RGB: i3DB Data Ablation Comparisons (Section 5.5)
Similarly, we compare results to ablations of the full HuMoR CVAE: No Delta (does not predict change in state) and
Standard Prior (does not learn a conditional prior). These two are of particular interest since these design choices
are common in previous variational motion models.
(0:03) Results for a sitting motion with heavy occlusions. Both ablations produce standing poses with severe ground
penetrations (as indicated by colored boxes) due to their inability to learn a generalizable motion model.
(0:21) Results for a walking motion with mild occlusions. Both ablations produce foot skating artifacts with floor penetrations
and Standard Prior estimates an incorrect stepping sequence.
(0:39, 0:57) Additional examples. Observing the feet
shows the subtle, but key, improvements when using HuMoR.
6. Estimation from RGB: PROX Data Baseline Comparisons (Section 5.5)
Similar to i3DB, we compare results for fitting to RGB observations in the PROX dataset. For each example sequence, just the output of our
method is shown first, followed by the comparison to PROX-RGB and VPoser-t baselines.
In all examples, PROX-RGB produces temporally incoherent results since it operates on single frames. However, it also uses the scene
mesh as input, which allows for plausible poses when the person is fully visible. Under occlusions, though, the scene mesh does not help much
and PROX-RGB often reverts to a mean leg pose similar to VPoser-t and VIBE.
7. Estimation from RGB-D: PROX Data Baseline Comparisons (Section 5.5)
Next, we show results for fitting to RGB-D observations from the PROX dataset. For each example sequence, just the output of our
method is shown first. In these examples we show the predicted motion as well as the ground plane estimation (instead of contacts).
The ground plane is rendered within the true scene mesh for reference only; our method does not use the scene mesh as input or output.
Next, our results are compared to the PROX-D and VPoser-t baselines, first overlaid on the input video and then within
the ground truth scene geometry.
(0:03) Walking sequence with lower-body occlusions. As seen in the overlaid RGB, PROX-D produces unrealistic sliding/floating when the legs are occluded.
However, when the feet are visible, PROX-D produces realistic contacts with the scene floor mesh since it uses the geometry as input. HuMoR produces natural
motion under occlusions, but the ground plane estimation is not perfect, so the feet slightly float above the true scene geometry.
(0:30) Sitting-to-standing motion with mild foot occlusions. As seen when visualized in the true scene mesh, HuMoR notably produces fewer penetrations with the
couch than PROX-D, despite not using any geometric constraints. This indicates that a motion prior can, by itself, improve the plausibility of
environment interactions.
8. Estimation of Fast & Dynamic Motions (Appendix F.1)
Most results so far have shown common motions (e.g. walking, sitting) in occluded settings. However, fitting with HuMoR can also capture fast and dynamic motions
from full-body observations. In the following results, we show that despite not training on many dance motions, HuMoR effectively generalizes to complex
dynamic movements and allows for large accelerations to accurately fit 3D keypoints and 2D joints captured from dancing motions. 3D keypoint data is from the
DanceDB subset of AMASS (not used for training HuMoR) - the ground truth motion and shape, along with the observed keypoints, are shown on the left,
while our fitting results are shown on the right alongside the ground truth keypoints. RGB videos are from the AIST dataset.
(0:04, 0:10, 0:16, 0:22, 0:28) DanceDB 3D keypoint results. HuMoR fits fast motions and difficult poses that are infrequent in the training data, thanks to the
generalizability gained from operating on pairs of frames.
(0:34, 0:40) Results from fitting to 2D joints detected in AIST RGB videos.
(0:46) Fitting to a long AIST video by splitting the clip into small overlapping windows and optimizing each window separately, with consistency constraints
between adjacent windows (see the sketch below). Note the incorrect motion during the cartwheels - this is caused by poor 2D joint detections from OpenPose.
Despite this, the optimization recovers robustly and produces reasonable results for the remainder of the sequence.
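A minimal sketch of this windowing scheme is shown below; the window length, overlap, and exact form of the consistency constraint are illustrative assumptions, not the paper's settings. Each window would then be optimized in turn (e.g. with an energy like the one sketched earlier), adding the consistency term for the frames it shares with the already-optimized previous window.

```python
import torch

def make_windows(num_frames, win_len=60, overlap=10):
    """Split a long clip into overlapping [start, end) windows."""
    windows, start = [], 0
    while True:
        end = min(start + win_len, num_frames)
        windows.append((start, end))
        if end == num_frames:
            break
        start = end - overlap
    return windows

def consistency_term(prev_motion, cur_motion, overlap=10):
    """Penalize disagreement on the frames shared by two adjacent windows.

    prev_motion: states already optimized for the previous window; cur_motion: states for the
    current window, whose first `overlap` frames coincide with the previous window's last ones.
    """
    return ((prev_motion[-overlap:] - cur_motion[:overlap]) ** 2).sum()
```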
9. Failure Cases
Finally, we look at specific failure cases of TestOpt using HuMoR.
(0:04) Extreme occlusions (e.g. only a few visible joints), especially at the first frame, make for a
difficult optimization that often lands in local minima with implausible motions.
(0:12) Our method depends on motion to resolve ambiguity. For a nearly static person,
as shown in this example, it may produce implausible motion when occlusions cause ambiguity:
here it predicts standing when the person is clearly sitting, since standing is more
likely under the prior.
(0:18) Similarly, since the ground plane estimation depends on motion, this estimation can be
incorrect if the person is static.
(0:27) When observed motions are far from the CVAE training data, e.g. lying down in this example,
the ground plane estimation may have large errors
in an attempt to make the motion likely under the prior.