- We are hosting the poster session on GatherTown (check out the GatherTown instructions on how to join).
Archival Workshop Publications
1. Object Detection in Aerial Images with Uncertainty-Aware Graph Network
Abstract: In this work, we propose a novel uncertainty-aware object detection framework with a structured graph, where nodes and edges denote objects and their spatial-semantic similarities, respectively. Specifically, we aim to consider relationships among objects to contextualize them effectively. To achieve this, we first detect objects and then measure their semantic and spatial distances to construct an object graph, which is then processed by a graph neural network (GNN) to refine the visual CNN features of the objects. However, refining the CNN features and detection results of every object is inefficient and may not be necessary, as they include correct predictions with low uncertainties. Therefore, we propose to handle uncertain objects by not only transferring representations from certain objects (sources) to uncertain objects (targets) over the directed graph, but also refining CNN features only for objects regarded as uncertain, using their representational outputs from the GNN. Furthermore, we compute the training loss with larger weights on uncertain objects, to concentrate on improving uncertain object predictions while maintaining high performance on certain objects. We refer to our model as Uncertainty-Aware Graph network for object DETection (UAGDet). We experimentally validate our model on the challenging large-scale aerial image dataset DOTA, which contains many objects of small to large sizes per image, and show that it improves the performance of the existing object detection network.
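To make the graph construction and uncertainty weighting more concrete, here is a minimal PyTorch sketch. It is not the authors' UAGDet implementation: the distance threshold, the cosine-similarity choice, and the linear uncertainty weighting are assumptions made only for illustration.

```python
import torch
import torch.nn.functional as F

def build_object_graph(centers, features, spatial_thresh=50.0, semantic_thresh=0.5):
    """Connect detected objects whose spatial distance is small and whose
    semantic (feature) similarity is high.

    centers:  (N, 2) detected box centers, in pixels (assumed format)
    features: (N, D) per-object CNN features
    Returns a directed edge index of shape (2, E) for a GNN.
    """
    spatial_dist = torch.cdist(centers, centers)                        # (N, N)
    semantic_sim = F.cosine_similarity(features.unsqueeze(1),
                                       features.unsqueeze(0), dim=-1)   # (N, N)
    adjacency = (spatial_dist < spatial_thresh) & (semantic_sim > semantic_thresh)
    adjacency.fill_diagonal_(False)          # no self-loops
    return adjacency.nonzero().t()           # (2, E)

def uncertainty_weighted_loss(per_object_loss, uncertainty, alpha=1.0):
    """Weight the per-object detection loss more heavily for uncertain objects."""
    weights = 1.0 + alpha * uncertainty      # larger weight -> focus on uncertain objects
    return (weights * per_object_loss).mean()
```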
Non-archival Paper Presentations
2. Learning Task-Agnostic 3D Representations of Objects by De-Rendering
Abstract: State-of-the-art unsupervised representation learning methods typically do not exploit the physical properties of objects, such as geometry, albedo, lighting, and camera view, and, when they do, multi-view images are often needed for training. We show that de-rendering, a way to reverse the rendering process to recover these properties from single images without supervision, can also be used to learn task-agnostic representations, which we dub physically disentangled representations (PDRs). While de-renderers predict distinct physical properties, the features learned in the process may not be disentangled. To ensure meaningful features are encoded by de-rendering, and thus prevent overreliance on decoders, we propose a novel Leave-One-Out, Cycle Contrastive loss (LOOCC) to improve feature disentanglement with respect to physical properties, which leads to higher downstream accuracy. We evaluate PDRs on downstream clustering tasks, including car classification and face identification. We compare our method with other generative representation learning methods on these tasks and find that PDRs consistently yield higher accuracy, outperforming the evaluated baselines by as much as 18%.
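The interface below is a rough sketch of what a single-image de-renderer with one head per physical factor could look like, with the concatenated factor codes serving as the task-agnostic representation. It is an illustrative assumption, not the paper's architecture, and it omits the LOOCC objective entirely.

```python
import torch
import torch.nn as nn

class DeRenderingEncoder(nn.Module):
    """Hypothetical de-rendering encoder: a shared backbone with one head per
    physical factor (geometry, albedo, lighting, camera view)."""
    def __init__(self, factor_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.heads = nn.ModuleDict({
            name: nn.Linear(128, factor_dim)
            for name in ("geometry", "albedo", "lighting", "view")
        })

    def forward(self, image):                    # image: (B, 3, H, W)
        h = self.backbone(image)                 # (B, 128)
        factors = {k: head(h) for k, head in self.heads.items()}
        # The physically disentangled representation is the concatenation of factor codes.
        return torch.cat(list(factors.values()), dim=-1), factors
```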
3. Inductive Biases for Object-Centric Representations in the Presence of Complex Textures
Abstract: Understanding which inductive biases could be helpful for the unsupervised learning of object-centric representations of natural scenes is challenging.
4. PartAfford: Part-level Affordance Discovery from Cross-category 3D Objects
Abstract: Understanding what objects could furnish for humans, namely learning object affordance, is the crux of bridging perception and action. In the vision community, prior work primarily focuses on learning object affordance with dense (e.g., per-pixel) supervision. In stark contrast, humans learn object affordance without dense labels. As such, the fundamental question in devising a computational model is: what is the natural way to learn object affordance from geometry with humanlike sparse supervision? In this work, we present a new task of part-level affordance discovery (PartAfford): given only the affordance labels per object, the machine is tasked to (i) decompose 3D shapes into parts and (ii) discover how each part of the object corresponds to a certain affordance category. We propose a novel learning framework for PartAfford, which discovers part-level representations by leveraging only affordance-set supervision and geometric primitive regularization, without dense supervision. To learn and evaluate PartAfford, we construct a part-level, cross-category 3D object affordance dataset, annotated with 24 affordance categories shared among >25,000 objects. We demonstrate that our method enables both the abstraction of 3D objects and part-level affordance discovery, with generalizability to difficult and cross-category examples. Further ablations reveal the contribution of each component.
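As a rough illustration of set-level supervision, the sketch below pools per-part affordance logits into one object-level prediction and trains it against the object's affordance set with a multi-label loss. The pooling choice and loss are assumptions for illustration, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def affordance_set_loss(part_affordance_logits, object_affordance_set):
    """Set-level supervision sketch.

    part_affordance_logits: (P, A) logits for P discovered parts and A affordance classes
    object_affordance_set:  (A,) multi-hot affordance labels for the whole object
    """
    # Assume the object "has" an affordance if any of its parts does (max pooling).
    object_logits = part_affordance_logits.max(dim=0).values
    return F.binary_cross_entropy_with_logits(object_logits,
                                              object_affordance_set.float())
```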
5. OPD: Single-view 3D Openable Part Detection
Abstract: We address the task of predicting which parts of an object can open and how they move when they do so. The input is a single image of an object; as output, we detect which parts of the object can open and predict the motion parameters describing the articulation of each openable part. To tackle this task, we create two datasets of 3D objects: OPDSynth, based on existing synthetic objects, and OPDReal, based on RGBD reconstructions of real objects. We then design OPDRCNN, a neural architecture that detects openable parts and predicts their motion parameters. Our experiments show that this is a challenging task, especially when considering generalization across object categories and the limited amount of information in a single image. Our architecture outperforms baselines and prior work, especially for RGB image inputs.
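The sketch below shows one plausible data structure for the task's output: an openable part with its motion type, axis, and origin. The field layout is an assumption for illustration and may differ from the actual OPDRCNN output format.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class OpenablePart:
    """One detected openable part and its predicted articulation parameters."""
    box: np.ndarray            # (4,) 2D bounding box of the part in the image
    motion_type: str           # "rotation" (e.g. a door) or "translation" (e.g. a drawer)
    motion_axis: np.ndarray    # (3,) unit direction of the motion axis
    motion_origin: np.ndarray  # (3,) a point on the axis (meaningful for rotations)

# Example: a cabinet door hinged about a vertical axis.
door = OpenablePart(
    box=np.array([120, 40, 260, 300]),
    motion_type="rotation",
    motion_axis=np.array([0.0, 0.0, 1.0]),
    motion_origin=np.array([0.45, -0.10, 0.0]),
)
```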
6. PolarMOT: How Far Can Geometric Relations Take Us in 3D Multi-Object Tracking?
Abstract: Most (3D) multi-object tracking methods rely on appearance-based cues for data association. By contrast, we investigate how far we can get by encoding only geometric relationships between objects in 3D space as cues for data-driven data association. We encode 3D detections as nodes in a graph, where spatial and temporal pairwise relations among objects are encoded via localized polar coordinates on graph edges. This representation makes our geometric relations invariant to global transformations and smooth trajectory changes, especially under non-holonomic motion. This allows our graph neural network to learn to effectively encode temporal and spatial interactions and to fully leverage contextual and motion cues to obtain the final scene interpretation by posing data association as edge classification. We establish a new state of the art on the nuScenes dataset and, more importantly, show that our method, PolarMOT, generalizes remarkably well across different locations (Boston, Singapore, Karlsruhe) and datasets (nuScenes and KITTI).
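A small sketch of the core idea of localized polar edge features: relate two detections by range and bearing measured in the source detection's own frame, plus the time offset, so the feature does not change under global translations and rotations of the scene. The exact edge parameterization in PolarMOT may differ.

```python
import numpy as np

def polar_edge_features(pos_i, yaw_i, pos_j, t_i, t_j):
    """Encode the relation i -> j in detection i's local frame.

    pos_*: (x, y) ground-plane position of each 3D detection
    yaw_i: heading of detection i in radians
    Returns (range, relative bearing, time difference).
    """
    dx, dy = pos_j[0] - pos_i[0], pos_j[1] - pos_i[1]
    rng = np.hypot(dx, dy)                                    # distance between the objects
    bearing = np.arctan2(dy, dx) - yaw_i                      # angle of j seen from i's frame
    bearing = np.arctan2(np.sin(bearing), np.cos(bearing))    # wrap to (-pi, pi]
    return np.array([rng, bearing, t_j - t_i])
```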
7. Self-Supervised Representation Learning from Videos of Audible Interactions
Abstract: We propose a self-supervised algorithm to learn representations from egocentric video data. Given the uncurated nature of long-form continuous videos, learning effective representations requires focusing on the moments in time when interactions take place. To achieve this, we leverage audio signals to identify moments of likely interaction, and we propose a novel self-supervised objective that learns from the audible state changes caused by interactions. We validate these contributions on two large-scale egocentric datasets, EPIC-Kitchens-100 and Ego4D, and show improvements on the downstream task of action recognition.
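As a hedged stand-in for "use audio to find moments of likely interaction", the sketch below simply ranks fixed-length windows of the audio track by short-time energy; the paper's actual moment-selection procedure and self-supervised objective are not reproduced here.

```python
import numpy as np

def likely_interaction_moments(audio, sr, win=0.5, top_k=16):
    """Rank time windows of an egocentric clip by short-time audio energy.

    audio: mono waveform as a 1D array, sr: sample rate in Hz,
    win: window length in seconds. Returns start times (seconds) of the
    top_k highest-energy windows, in temporal order.
    """
    hop = int(win * sr)
    n = len(audio) // hop
    energy = np.array([np.mean(audio[i * hop:(i + 1) * hop] ** 2) for i in range(n)])
    top = np.argsort(energy)[::-1][:top_k]   # indices of the loudest windows
    return np.sort(top) * win
```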
Contact Info
E-mail: kaichun@cs.stanford.edu