Novel View Synthesis | Dense 3D Point Tracking
Gaussian splatting has become a popular representation for novel-view synthesis, exhibiting clear strengths in efficiency, photometric quality, and compositional editability. Following its success, many works have extended Gaussians to 4D, showing that dynamic Gaussians maintain these benefits while also tracking scene geometry far better than alternative representations. Yet, these methods assume dense multi-view videos as supervision, constraining their use to controlled capture settings. In this work, we extend the capability of Gaussian scene representations to casually captured monocular videos. We show that existing 4D Gaussian methods dramatically fail in this setup because the monocular setting is underconstrained. Building on this finding, we propose Dynamic Gaussian Marbles (DGMarbles), consisting of three core modifications that target the difficulties of the monocular setting. First, DGMarbles uses isotropic Gaussian "marbles", reducing the degrees of freedom of each Gaussian and constraining the optimization to focus on motion and appearance over local shape. Second, DGMarbles employs a hierarchical divide-and-conquer learning strategy to guide the optimization towards solutions with coherent motion. Finally, DGMarbles adds image-level and geometry-level priors into the optimization, including a tracking loss that takes advantage of recent progress in point tracking. By constraining the optimization in these ways, DGMarbles learns Gaussian trajectories that enable novel-view rendering and accurately capture the 3D motion of the scene elements. We evaluate on the (monocular) Nvidia Dynamic Scenes dataset and the DyCheck iPhone dataset, and show that DGMarbles significantly outperforms other Gaussian baselines in quality and is on par with non-Gaussian representations, all while maintaining the efficiency, compositionality, editability, and tracking benefits of Gaussians.
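To make the "marble" idea concrete, here is a minimal PyTorch-style sketch of an isotropic Gaussian parameterization with explicit per-frame trajectories. This is our own illustrative code, not the released implementation; all class and attribute names are placeholders.

```python
# Minimal sketch (hypothetical names): each Gaussian keeps a single scalar
# radius instead of a full anisotropic covariance, plus one position per
# frame, so optimization focuses on motion and appearance over local shape.
import torch
import torch.nn as nn

class GaussianMarbles(nn.Module):
    def __init__(self, num_gaussians: int, num_frames: int):
        super().__init__()
        # One 3D position per Gaussian per frame -> explicit trajectories.
        self.positions = nn.Parameter(torch.zeros(num_frames, num_gaussians, 3))
        # Isotropic: a single log-radius per Gaussian, shared across frames.
        self.log_radius = nn.Parameter(torch.zeros(num_gaussians))
        # Appearance: RGB color and opacity per Gaussian.
        self.color = nn.Parameter(torch.rand(num_gaussians, 3))
        self.opacity_logit = nn.Parameter(torch.zeros(num_gaussians))

    def covariance(self) -> torch.Tensor:
        # Isotropic covariance sigma^2 * I removes per-Gaussian rotation and
        # anisotropic scale, cutting the degrees of freedom of each Gaussian.
        sigma2 = self.log_radius.exp() ** 2              # (N,)
        return sigma2[:, None, None] * torch.eye(3)      # (N, 3, 3)

    def frame(self, t: int):
        # Gaussians at frame t: positions, covariances, colors, opacities.
        return (self.positions[t], self.covariance(),
                self.color, torch.sigmoid(self.opacity_logit))
```

In the full method, the hierarchical divide-and-conquer strategy mentioned above constrains how these per-frame trajectories are grown during training; the sketch omits that schedule.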
We augment the popular Nvidia Dynamic Scenes dataset by removing the standard multi-view training signals. This results in training on a single stationary camera's video stream, a setting with very little multi-view information. Even without multi-view cues, our method can track geometry and perform high-quality novel view synthesis.
Balloon1 | Umbrella | Playground
The best part about DGMarbles is that it works out-of-the-box on real-world videos. Here are a few videos from DAVIS, YouTube-VOS, and other online sources. To save ourselves time and headache, we did NOT estimate camera poses on any of the videos below! Instead, we use a dynamic background to account for arbitrary motion, including camera motion.
What if we pretend the background is moving? In this work, we treat both the foreground and the background as dynamic content. Unlike previous work, this lets us reason about the scene's 3D structure directly in the camera's frame of reference, i.e., without any provided camera poses. No need to sweat at your workstation hoping that COLMAP (or your other favorite SLAM method) converges on a highly dynamic video!
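To illustrate what "no camera poses" means in practice, here is a small sketch of projection under this convention. The pinhole math is standard; the function name and signature are purely illustrative, and the reading of "camera frame of reference" as identity extrinsics is our assumption, not code from the paper.

```python
# Sketch: projecting Gaussian centers that are already expressed in the
# camera's frame of reference, so no world-to-camera pose is needed.
# Standard pinhole model; names are illustrative, not from the codebase.
import torch

def project_to_image(means_cam: torch.Tensor, fx: float, fy: float,
                     cx: float, cy: float) -> torch.Tensor:
    """means_cam: (N, 3) Gaussian centers in camera coordinates -> (N, 2) pixels."""
    x, y, z = means_cam.unbind(-1)
    u = fx * x / z + cx
    v = fy * y / z + cy
    return torch.stack([u, v], dim=-1)

# Under this convention, a camera pan at frame t is absorbed into the
# per-frame positions of ALL Gaussians (foreground and background alike),
# rather than into a separate set of camera extrinsics.
```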
Input Video | Ground Truth (Novel View) | DGMarbles (Ours) | TNeRF | Nerfies | HyperNeRF | 4D Gaussians | Dynamic Gaussians
Still, monocular novel view synthesis of dynamic, open-world scenes is an extremely challenging problem, and DGMarbles certainly has limitations. One key limitation is the quality of off-the-shelf 2D priors. For instance, when off-the-shelf tracking fails under a total occlusion, DGMarbles may also fail to track the occluded object. This usually results in "duplicating" the object, with the Gaussians "teleporting" from the first copy to the second. Here's an example:
CoTracker Tracking Prediction | Gaussian Marbles (Training View) | Gaussian Marbles (Novel Views)