- We are hosting the poster session on GatherTown (check out the GatherTown instructions on how to join).
Archival Workshop Publications
1. Object Detection in Aerial Images with Uncertainty-Aware Graph Network
Abstract: In this work, we propose a novel uncertainty-aware object detection framework with a structured graph, where nodes and edges denote objects and their spatial-semantic similarities, respectively. Specifically, we aim to consider relationships among objects to contextualize them effectively. To achieve this, we first detect objects and then measure their semantic and spatial distances to construct an object graph, which is then processed by a graph neural network (GNN) to refine the visual CNN features of the objects. However, refining the CNN features and detection results of every object is inefficient and may not be necessary, as they include correct predictions with low uncertainties. Therefore, we propose to handle uncertain objects by not only transferring representations from certain objects (sources) to uncertain objects (targets) over the directed graph, but also refining CNN features only for objects regarded as uncertain, using their representational outputs from the GNN. Furthermore, we compute the training loss with larger weights on uncertain objects, to concentrate on improving uncertain object predictions while maintaining high performance on certain objects. We refer to our model as Uncertainty-Aware Graph network for object DETection (UAGDet). We experimentally validate our model on the challenging large-scale aerial image dataset DOTA, which contains many objects of small to large sizes per image, and show that it improves the performance of the existing object detection network.
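To make the graph construction and uncertainty weighting more concrete, here is a minimal PyTorch sketch. It is not the authors' UAGDet implementation: the distance threshold, the cosine-similarity choice, and the linear uncertainty weighting are assumptions made only for illustration.

```python
import torch
import torch.nn.functional as F

def build_object_graph(centers, features, spatial_thresh=50.0, semantic_thresh=0.5):
    """Connect detected objects whose spatial distance is small and whose
    semantic (feature) similarity is high.

    centers:  (N, 2) detected box centers, in pixels (assumed format)
    features: (N, D) per-object CNN features
    Returns a directed edge index of shape (2, E) for a GNN.
    """
    spatial_dist = torch.cdist(centers, centers)                        # (N, N)
    semantic_sim = F.cosine_similarity(features.unsqueeze(1),
                                       features.unsqueeze(0), dim=-1)   # (N, N)
    adjacency = (spatial_dist < spatial_thresh) & (semantic_sim > semantic_thresh)
    adjacency.fill_diagonal_(False)          # no self-loops
    return adjacency.nonzero().t()           # (2, E)

def uncertainty_weighted_loss(per_object_loss, uncertainty, alpha=1.0):
    """Weight the per-object detection loss more heavily for uncertain objects."""
    weights = 1.0 + alpha * uncertainty      # larger weight -> focus on uncertain objects
    return (weights * per_object_loss).mean()
```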
Non-archival Paper Presentations
2. Learning Task-Agnostic 3D Representations of Objects by De-Rendering
Abstract: State-of-the-art unsupervised representation learning methods typically do not exploit the physical properties of objects, such as geometry, albedo, lighting, and camera view, and, when they do, multi-view images are often needed for training. We show that de-rendering, a way to reverse the rendering process to recover these properties from single images without supervision, can also be used to learn task-agnostic representations, which we dub physically disentangled representations (PDRs). While de-renderers predict distinct physical properties, the features learned in the process may not be disentangled. To ensure meaningful features are encoded by de-rendering, and thus prevent overreliance on decoders, we propose a novel Leave-One-Out, Cycle Contrastive loss (LOOCC) to improve feature disentanglement with respect to physical properties, which leads to higher downstream accuracy. We evaluate PDRs on downstream clustering tasks, including car classification and face identification. We compare our method with other generative representation learning methods on these tasks and find that PDRs consistently yield higher accuracy, outperforming the evaluated baselines by as much as 18%.
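The interface below is a rough sketch of what a single-image de-renderer with one head per physical factor could look like, with the concatenated factor codes serving as the task-agnostic representation. It is an illustrative assumption, not the paper's architecture, and it omits the LOOCC objective entirely.

```python
import torch
import torch.nn as nn

class DeRenderingEncoder(nn.Module):
    """Hypothetical de-rendering encoder: a shared backbone with one head per
    physical factor (geometry, albedo, lighting, camera view)."""
    def __init__(self, factor_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.heads = nn.ModuleDict({
            name: nn.Linear(128, factor_dim)
            for name in ("geometry", "albedo", "lighting", "view")
        })

    def forward(self, image):                    # image: (B, 3, H, W)
        h = self.backbone(image)                 # (B, 128)
        factors = {k: head(h) for k, head in self.heads.items()}
        # The physically disentangled representation is the concatenation of factor codes.
        return torch.cat(list(factors.values()), dim=-1), factors
```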
3. Inductive Biases for Object-Centric Representations in the Presence of Complex Textures
Abstract: Understanding which inductive biases could be helpful for the unsupervised learning of object-centric representations of natural scenes is challenging.
4. PartAfford: Part-level Affordance Discovery from Cross-category 3D Objects
Abstract: Understanding what objects could furnish for humans, namely learning object affordance, is the crux of bridging perception and action. In the vision community, prior work primarily focuses on learning object affordance with dense (e.g., per-pixel) supervision. In stark contrast, humans learn object affordance without dense labels. As such, the fundamental question in devising a computational model is: what is the natural way to learn object affordance from geometry with humanlike sparse supervision? In this work, we present a new task of part-level affordance discovery (PartAfford): given only the affordance labels per object, the machine is tasked to (i) decompose 3D shapes into parts and (ii) discover how each part of the object corresponds to a certain affordance category. We propose a novel learning framework for PartAfford, which discovers part-level representations by leveraging only affordance-set supervision and geometric primitive regularization, without dense supervision. To learn and evaluate PartAfford, we construct a part-level, cross-category 3D object affordance dataset, annotated with 24 affordance categories shared among >25,000 objects. We demonstrate that our method enables both the abstraction of 3D objects and part-level affordance discovery, with generalizability to difficult and cross-category examples. Further ablations reveal the contribution of each component.
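As a rough illustration of set-level supervision, the sketch below pools per-part affordance logits into one object-level prediction and trains it against the object's affordance set with a multi-label loss. The pooling choice and loss are assumptions for illustration, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def affordance_set_loss(part_affordance_logits, object_affordance_set):
    """Set-level supervision sketch.

    part_affordance_logits: (P, A) logits for P discovered parts and A affordance classes
    object_affordance_set:  (A,) multi-hot affordance labels for the whole object
    """
    # Assume the object "has" an affordance if any of its parts does (max pooling).
    object_logits = part_affordance_logits.max(dim=0).values
    return F.binary_cross_entropy_with_logits(object_logits,
                                              object_affordance_set.float())
```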
5. OPD: Single-view 3D Openable Part Detection
Abstract: We address the task of predicting which parts of an object can open and how they move when they do so. The input is a single image of an object; as output, we detect which parts of the object can open and predict the motion parameters describing the articulation of each openable part. To tackle this task, we create two datasets of 3D objects: OPDSynth, based on existing synthetic objects, and OPDReal, based on RGBD reconstructions of real objects. We then design OPDRCNN, a neural architecture that detects openable parts and predicts their motion parameters. Our experiments show that this is a challenging task, especially when considering generalization across object categories and the limited amount of information in a single image. Our architecture outperforms baselines and prior work, especially for RGB image inputs.
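The sketch below shows one plausible data structure for the task's output: an openable part with its motion type, axis, and origin. The field layout is an assumption for illustration and may differ from the actual OPDRCNN output format.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class OpenablePart:
    """One detected openable part and its predicted articulation parameters."""
    box: np.ndarray            # (4,) 2D bounding box of the part in the image
    motion_type: str           # "rotation" (e.g. a door) or "translation" (e.g. a drawer)
    motion_axis: np.ndarray    # (3,) unit direction of the motion axis
    motion_origin: np.ndarray  # (3,) a point on the axis (meaningful for rotations)

# Example: a cabinet door hinged about a vertical axis.
door = OpenablePart(
    box=np.array([120, 40, 260, 300]),
    motion_type="rotation",
    motion_axis=np.array([0.0, 0.0, 1.0]),
    motion_origin=np.array([0.45, -0.10, 0.0]),
)
```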
6. PolarMOT: How Far Can Geometric Relations Take Us in 3D Multi-Object Tracking?
Abstract: Most (3D) multi-object tracking methods rely on appearance-based cues for data association. By contrast, we investigate how far we can get by encoding only geometric relationships between objects in 3D space as cues for data-driven data association. We encode 3D detections as nodes in a graph, where spatial and temporal pairwise relations among objects are encoded via localized polar coordinates on graph edges. This representation makes our geometric relations invariant to global transformations and smooth trajectory changes, especially under non-holonomic motion. This allows our graph neural network to learn to effectively encode temporal and spatial interactions and to fully leverage contextual and motion cues to obtain the final scene interpretation by posing data association as edge classification. We establish a new state of the art on the nuScenes dataset and, more importantly, show that our method, PolarMOT, generalizes remarkably well across different locations (Boston, Singapore, Karlsruhe) and datasets (nuScenes and KITTI).
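A small sketch of the core idea of localized polar edge features: relate two detections by range and bearing measured in the source detection's own frame, plus the time offset, so the feature does not change under global translations and rotations of the scene. The exact edge parameterization in PolarMOT may differ.

```python
import numpy as np

def polar_edge_features(pos_i, yaw_i, pos_j, t_i, t_j):
    """Encode the relation i -> j in detection i's local frame.

    pos_*: (x, y) ground-plane position of each 3D detection
    yaw_i: heading of detection i in radians
    Returns (range, relative bearing, time difference).
    """
    dx, dy = pos_j[0] - pos_i[0], pos_j[1] - pos_i[1]
    rng = np.hypot(dx, dy)                                    # distance between the objects
    bearing = np.arctan2(dy, dx) - yaw_i                      # angle of j seen from i's frame
    bearing = np.arctan2(np.sin(bearing), np.cos(bearing))    # wrap to (-pi, pi]
    return np.array([rng, bearing, t_j - t_i])
```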
7. Self-Supervised Representation Learning from Videos of Audible Interactions
Abstract: We propose a self-supervised algorithm to learn representations from egocentric video data. Given the uncurated nature of long-form continuous videos, learning effective representations requires focusing on the moments in time when interactions take place. To achieve this, we leverage audio signals to identify moments of likely interaction, and we propose a novel self-supervised objective that learns from the audible state changes caused by interactions. We validate these contributions on two large-scale egocentric datasets, EPIC-Kitchens-100 and Ego4D, and show improvements on the downstream task of action recognition.
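As a hedged stand-in for "use audio to find moments of likely interaction", the sketch below simply ranks fixed-length windows of the audio track by short-time energy; the paper's actual moment-selection procedure and self-supervised objective are not reproduced here.

```python
import numpy as np

def likely_interaction_moments(audio, sr, win=0.5, top_k=16):
    """Rank time windows of an egocentric clip by short-time audio energy.

    audio: mono waveform as a 1D array, sr: sample rate in Hz,
    win: window length in seconds. Returns start times (seconds) of the
    top_k highest-energy windows, in temporal order.
    """
    hop = int(win * sr)
    n = len(audio) // hop
    energy = np.array([np.mean(audio[i * hop:(i + 1) * hop] ** 2) for i in range(n)])
    top = np.argsort(energy)[::-1][:top_k]   # indices of the loudest windows
    return np.sort(top) * win
```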
Contact Info
E-mail: kaichun@cs.stanford.edu