This page contains extensive qualitative results that supplement experiments in Sec. 5 of the main paper and the appendix.
To see a quick sampling of results instead, please visit the main project page.
Each set of qualitative results can be accessed with the links below, or simply scroll down to watch
the videos in order. Specific sequences are linked with buttons for easy reference.
1. Generative Sampling
These results show the capabilities of HuMoR as a standalone generative model. In each example, a random initial
state is sampled from the AMASS test set, and then sequences are sampled through autoregressive rollout.
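To make the rollout procedure concrete, below is a minimal sketch of how autoregressive sampling from a CVAE motion model could look. The `prior` and `decode` methods and the state representation are illustrative assumptions, not the released HuMoR API.

```python
import torch

def sample_rollout(cvae, x0, num_steps):
    """Sample a motion by repeatedly drawing a latent transition and decoding it.

    Assumed (hypothetical) interface, for illustration only:
      cvae.prior(x_prev)     -> (mu, logvar) of the conditional prior p(z_t | x_{t-1})
      cvae.decode(z, x_prev) -> the next state x_t
    x0 is an initial state, e.g. sampled from the AMASS test set.
    """
    states = [x0]
    x_prev = x0
    for _ in range(num_steps):
        mu, logvar = cvae.prior(x_prev)                       # conditional prior over the transition
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # draw z_t ~ N(mu, sigma^2)
        x_prev = cvae.decode(z, x_prev)                       # decode the next state
        states.append(x_prev)
    return torch.stack(states, dim=0)                         # (num_steps + 1, state_dim)
```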
(0:03) Randomly sampled sequences using the qualitative variation of HuMoR on 4 body shapes not seen during training.
(0:34) Same random sampling but with person-ground contacts (output from the CVAE decoder) visualized in red.
(0:43) Demonstrates the diversity of random rollout by showing 5 sampled sequences all beginning at the same initial state.
(1:03) Demonstrates the stability of random sampling by generating a minute-long sequence.
(1:27) Randomly sampled sequences using HuMoR (Qual), HuMoR, and MVAE, all starting from the same initial state.
Note that MVAE quickly diverges to unrealistic poses and HuMoR tends to suffer from plausibility artifacts such
as feet floating above or penetrating the floor.
2. Estimation from 3D: Occluded Keypoints (Section 5.4)
These results demonstrate using test-time optimization (TestOpt) with HuMoR as a motion prior to fit to partially
observed 3D keypoint data (generated from the AMASS test set). In each example, Observations+Ground Truth
shows the observed keypoints in blue on the ground truth body. The output motion is shown on the opaque body mesh,
with the observed keypoints again in blue for reference.
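As a rough illustration of what this test-time optimization could look like, the sketch below optimizes the initial state and a sequence of latent transitions so that the decoded motion matches the observed keypoints while staying likely under the conditional prior. The `cvae.prior`/`cvae.decode`/`joint_regressor` interfaces and the weights are assumptions for illustration; the full objective in the paper includes additional terms omitted here.

```python
import torch

def fit_to_keypoints(cvae, joint_regressor, obs_kp, vis_mask, x0_init,
                     num_iters=300, lr=0.01, w_prior=0.1):
    """Sketch of test-time optimization with a motion prior (illustrative, not the released code).

    obs_kp:   (T, J, 3) partially observed 3D keypoints
    vis_mask: (T, J) 1.0 where a keypoint is observed, 0.0 where it is occluded
    x0_init:  initial guess for the state at the first frame
    joint_regressor(states) -> (T, J, 3) joints, e.g. via the SMPL body model (assumed helper)
    """
    T = obs_kp.shape[0]
    x0 = x0_init.clone().requires_grad_(True)
    latents = torch.zeros(T - 1, cvae.latent_dim, requires_grad=True)  # one latent per transition
    optim = torch.optim.Adam([x0, latents], lr=lr)

    for _ in range(num_iters):
        optim.zero_grad()
        states, prior_nll = [x0], 0.0
        x_prev = x0
        for t in range(T - 1):
            mu, logvar = cvae.prior(x_prev)
            z = latents[t]
            # negative log-likelihood of z under the conditional prior (up to constants)
            prior_nll = prior_nll + (0.5 * (z - mu) ** 2 / logvar.exp() + 0.5 * logvar).sum()
            x_prev = cvae.decode(z, x_prev)
            states.append(x_prev)
        motion = torch.stack(states, dim=0)                   # (T, state_dim)
        pred_kp = joint_regressor(motion)                     # (T, J, 3)
        data_term = (vis_mask[..., None] * (pred_kp - obs_kp) ** 2).sum()
        loss = data_term + w_prior * prior_nll
        loss.backward()
        optim.step()
    return motion.detach()
```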
(0:03) Results on 5 different sequences fitting to partial keypoints that are occluded below the hips. Despite
only observing the upper body, our method produces plausible walking, side stepping, jogging, running, and kicking.
(0:43) Comparing results on 6 different sequences to baselines. VPoser-t fails to produce any plausible lower-body
motion since it uses only a pose prior, while using MVAE as the motion prior often gives unnatural and implausible
motions. This is especially noticeable when watching the feet, as indicated by the red and green boxes in the first few results.
3. Estimation from 3D: Noisy Joints (Section 5.4)
These results demonstrate fitting to noisy 3D joint data (generated from the AMASS test set). In each example,
Observations+Ground Truth shows the observed joints in green with the ground truth body and motion.
(0:03) Results on 4 different sequences using our proposed approach. The output on the right shows the estimated body
mesh and resulting SMPL joints, with ground contacts colored in red. Despite severe noise, TestOpt with HuMoR recovers
smooth motions with highly accurate contacts.
(0:27) Comparing results on 4 different sequences. Output results for each method show the recovered body mesh along with noisy
joint observations in green for reference. The VPoser-t baseline gives overly smoothed motions (as indicated by colored boxes around feet) while still
being affected by noise (e.g. see the head jerkiness indicated by boxes). HuMoR allows for large accelerations to fit dynamic, yet smooth, motions.
4. Estimation from RGB: i3DB Data Baseline Comparisons (Section 5.5)
These results demonstrate fitting to 2D joints detected from RGB videos in the i3DB dataset, which contains heavy occlusions.
For each example sequence, the input video and output of TestOpt with HuMoR (both motion and contacts) are shown first. Next,
HuMoR results are compared to the VIBE and VPoser-t baselines - first from the camera view and then from an
alternate view with the predicted ground plane shown for reference.
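For intuition, fitting to 2D detections replaces the 3D keypoint data term with a camera reprojection term. A confidence-weighted, robust reprojection energy of the kind commonly used in SMPLify-style fitting could look like the sketch below (illustrative only; the exact robust penalty and weighting in the paper may differ).

```python
import torch

def reprojection_energy(pred_joints_cam, obs_joints_2d, conf, K, sigma_sq=100.0):
    """Confidence-weighted robust 2D reprojection term (illustrative sketch).

    pred_joints_cam: (T, J, 3) predicted 3D joints in camera coordinates
    obs_joints_2d:   (T, J, 2) detected 2D joints (e.g. from OpenPose)
    conf:            (T, J) detection confidences
    K:               (3, 3) pinhole camera intrinsics
    """
    proj = pred_joints_cam @ K.T                              # apply intrinsics
    proj_2d = proj[..., :2] / proj[..., 2:3].clamp(min=1e-6)  # perspective divide
    sq_err = ((proj_2d - obs_joints_2d) ** 2).sum(dim=-1)     # per-joint squared pixel error
    robust = sq_err / (sq_err + sigma_sq)                     # Geman-McClure-style robust penalty
    return (conf * robust).sum()
```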
(0:03) Results for a sidestepping motion with occluded feet. Note that the VIBE output is very noisy over time and
both VIBE and VPoser-t produce an implausible neutral pose when feet are occluded, causing inconsistent foot heights
that are particularly noticeable in the alternate view (as indicated by colored boxes).
(0:21) Results for a sitting motion with heavy occlusions. Due to the occluded lower body, VIBE tends to fail while
VPoser-t produces a standing pose that severely penetrates the floor (as indicated by colored boxes). HuMoR, in contrast,
produces a plausible sitting motion that remains consistent with the standing pose at the start of the sequence.
(0:39, 0:57, 1:15) Additional examples. The advantage of our approach
is especially apparent when observing the feet (e.g. consistent foot height and improved contacts).
5. Estimation from RGB: i3DB Data Ablation Comparisons (Section 5.5)
Similarly, we compare results to ablations of the full HuMoR CVAE: No Delta (does not predict change in state) and
Standard Prior (does not learn a conditional prior). These two are of particular interest since these design choices
are common in previous variational motion models.
(0:03) Results for a sitting motion with heavy occlusions. Both ablations produce standing poses with severe ground
penetrations (as indicated by colored boxes) due to their inability to learn a generalizable motion model.
(0:21) Results for a walking motion with mild occlusions. Both ablations produce foot skating artifacts with floor penetrations
and Standard Prior estimates an incorrect stepping sequence.
(0:39, 0:57) Additional examples. Observing the feet
shows the subtle, but key, improvements when using HuMoR.
6. Estimation from RGB: PROX Data Baseline Comparisons (Section 5.5)
Similar to i3DB, we compare results for fitting to RGB observations in the PROX dataset. For each example sequence, just the output of our
method is shown first, followed by the comparison to PROX-RGB and VPoser-t baselines.
In all examples, PROX-RGB produces temporally incoherent results since it operates on single frames. However, it also uses the scene
mesh as input, which allows for plausible poses when the person is fully visible. Under occlusions, though, the scene mesh does not help much
and PROX-RGB often reverts to a mean leg pose similar to VPoser-t and VIBE.
7. Estimation from RGB-D: PROX Data Baseline Comparisons (Section 5.5)
Next, we show results for fitting to RGB-D observations from the PROX dataset. For each example sequence, just the output of our
method is shown first. In these examples we show the predicted motion as well as the ground plane estimation (instead of contacts).
The ground plane is rendered within the true scene mesh for reference only; our method does not use the scene mesh as input or output.
Next, our results are compared to the PROX-D and VPoser-t baselines, first overlaid on the input video and then within
the ground truth scene geometry.
(0:03) Walking sequence with lower-body occlusions. As seen in the overlaid RGB, PROX-D produces unrealistic sliding/floating when the legs are occluded.
However, when the feet are visible, PROX-D produces realistic contacts with the scene floor mesh since it uses the geometry as input. HuMoR produces natural
motion under occlusions, but the ground plane estimation is not perfect, so the feet slightly float above the true scene geometry.
(0:30) Sitting-to-standing motion with mild foot occlusions. As seen when visualized in the true scene mesh, HuMoR notably produces fewer penetrations with the
couch than PROX-D, despite not using any geometric constraints. This indicates that a motion prior can, by itself, improve the plausibility of
environment interactions.
8. Estimation of Fast & Dynamic Motions (Appendix F.1)
Most results so far have shown common motions (e.g. walking, sitting) in occluded settings. However, fitting with HuMoR can also capture fast and dynamic motions
from full-body observations. In the following results, we show that despite not training on many dance motions, HuMoR effectively generalizes to complex
dynamic movements and allows for large accelerations to accurately fit 3D keypoints and 2D joints captured from dancing motions. 3D keypoint data is from the
DanceDB subset of AMASS (not used for training HuMoR) - the ground truth motion and shape, along with the observed keypoints, are shown on the left,
while our fitting results are shown on the right alongside the ground truth keypoints. RGB videos are from the AIST dataset.
(0:04, 0:10, 0:16, 0:22, 0:28) DanceDB 3D keypoint results. HuMoR fits fast motions and difficult poses that are infrequent in the training data, thanks to the
generalizability gained from operating on pairs of frames.
(0:34, 0:40) Results from fitting to 2D joints detected in AIST RGB videos.
(0:46) Fitting to a long AIST video by splitting the clip into small overlapping windows and optimizing each window separately, with consistency constraints
between adjacent windows (see the sketch below). Note the incorrect motion during the cartwheels - this is caused by poor 2D joint detections from OpenPose.
Despite this, the optimization recovers robustly and produces reasonable results for the remainder of the sequence.
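A minimal sketch of this windowing scheme is shown below; the window length, overlap, and exact form of the consistency constraint are illustrative assumptions, not the paper's settings. Each window would then be optimized in turn (e.g. with an energy like the one sketched earlier), adding the consistency term for the frames it shares with the already-optimized previous window.

```python
import torch

def make_windows(num_frames, win_len=60, overlap=10):
    """Split a long clip into overlapping [start, end) windows."""
    windows, start = [], 0
    while True:
        end = min(start + win_len, num_frames)
        windows.append((start, end))
        if end == num_frames:
            break
        start = end - overlap
    return windows

def consistency_term(prev_motion, cur_motion, overlap=10):
    """Penalize disagreement on the frames shared by two adjacent windows.

    prev_motion: states already optimized for the previous window; cur_motion: states for the
    current window, whose first `overlap` frames coincide with the previous window's last ones.
    """
    return ((prev_motion[-overlap:] - cur_motion[:overlap]) ** 2).sum()
```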
9. Failure Cases
Finally, we look at specific failure cases of TestOpt using HuMoR.
(0:04) Extreme occlusions (e.g. only a few visible joints), especially at the first frame, make for a
difficult optimization that often lands in local minima with implausible motions.
(0:12) Our method depends on motion to resolve ambiguity. For a nearly static person,
as shown in this example, it may produce implausible motion when occlusions cause ambiguity:
here it predicts standing when the person is clearly sitting, since standing is more
likely under the prior.
(0:18) Similarly, since the ground plane estimation depends on motion, this estimation can be
incorrect if the person is static.
(0:27) When observed motions are far from the CVAE training data, e.g. lying down in this example,
the ground plane estimation may have large errors
in an attempt to make the motion likely under the prior.