TRI at CVPR 2022

Toyota Research Institute
Jun 17, 2022

The Conference on Computer Vision and Pattern Recognition (CVPR) is one of the top international venues in computer science, with a focus on computer vision and machine learning. CVPR 2022 will be a hybrid conference, with both in-person and virtual attendance options, and will take place June 19–24 in New Orleans, Louisiana.

This year, Toyota Research Institute (TRI) is once again a top sponsor and will be presenting new research findings and participating in a number of workshops. Check out the main conference and workshops below to learn where TRI researchers will be present. We look forward to seeing you online and talking to you at this year’s CVPR — you can find us at booth #715!

Main Conference

Multi-Frame Self-Supervised Depth with Transformers

Vitor Guizilini, Rares Ambrus, Dian Chen, Sergey Zakharov, Adrien Gaidon

Poster Session: Tuesday, June 21, 2022, 10:00 AM–12:30 PM CDT

Even though self-supervised monocular depth estimation leverages multi-view consistency at training time, most methods are still single-frame at inference, which limits the expressivity of learned features and leads to suboptimal results compared to multi-frame methods. Unlike single-frame approaches, which rely solely on appearance-based features, multi-frame depth estimation methods also leverage geometric relationships between images, learned via feature matching, leading to superior performance. In this paper, we revisit feature matching for self-supervised monocular depth estimation and propose a novel transformer architecture for cost volume generation. We use depth-discretized epipolar sampling to select matching candidates and refine predictions through a series of self- and cross-attention layers.

These layers sharpen the matching probability between pixel features, improving over standard similarity metrics prone to ambiguities and local minima. The refined cost volume is decoded into depth estimates, and the whole pipeline is trained end-to-end from videos using only a photometric objective. Experiments on the KITTI and DDAD (Dense Depth for Automated Driving) datasets show that our DepthFormer architecture establishes a new state of the art in self-supervised monocular depth estimation and is even competitive with highly specialized supervised single-frame architectures. We also show that our learned cross-attention network yields representations transferable across datasets, increasing the effectiveness of pre-training strategies.
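
For readers who want a concrete picture of the matching step, here is a minimal PyTorch sketch of an attention-refined cost volume. The module names, shapes, and hyperparameters are illustrative assumptions, not the released DepthFormer code (see the [code] link below for the real implementation).

```python
# Illustrative sketch only: per-pixel matching costs over D discretized depth
# hypotheses are refined with attention, then decoded into an expected depth.
import torch
import torch.nn as nn

class CostVolumeRefiner(nn.Module):
    def __init__(self, dim=32, heads=4, layers=2):
        super().__init__()
        self.proj = nn.Linear(1, dim)  # lift raw matching costs to tokens
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.attn = nn.TransformerEncoder(layer, num_layers=layers)
        self.head = nn.Linear(dim, 1)  # score per depth hypothesis

    def forward(self, cost, depth_bins):
        # cost: (N, D) appearance-matching cost per pixel and depth hypothesis,
        # gathered along the epipolar line; depth_bins: (D,) candidate depths.
        tokens = self.attn(self.proj(cost.unsqueeze(-1)))   # (N, D, dim)
        prob = self.head(tokens).squeeze(-1).softmax(-1)    # sharpened matching prob.
        return (prob * depth_bins).sum(-1)                  # expected depth per pixel

# Toy usage with random costs standing in for feature matching between frames.
depth_bins = torch.linspace(0.5, 80.0, 64)
depth = CostVolumeRefiner()(torch.randn(1024, 64), depth_bins)  # (1024,)
```

In the actual method, cross-attention also runs across frames and a full decoder produces the depth map; the sketch only shows the core idea of turning noisy per-pixel similarities into a sharper matching distribution over depth.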

Read more about our work on deploying monodepth in the real world as well as our earlier depth estimation work on TRI’s Medium blog (part 1, part 2).

[paper] [website] [code]

Revealing Occlusions with 4D Neural Fields

Basile Van Hoorick, Purva Tendulkar, Dídac Surís, Dennis Park, Simon Stent, Carl Vondrick

Oral Session: Tuesday, June 21, 2022, 1:30 PM–3:00 PM CDT

For computer vision systems to operate in dynamic situations, they need to be able to represent and reason about object permanence. We introduce a framework for learning to estimate 4D visual representations from monocular RGB-D video, which is able to persist objects, even once they become obstructed by occlusions. Unlike traditional video representations, we encode point clouds into a continuous representation, which permits the model to attend across the spatiotemporal context to resolve occlusions. On two large video datasets that we release along with this paper, our experiments show that the representation is able to successfully reveal occlusions for several tasks, without any architectural changes. Visualizations show that the attention mechanism automatically learns to follow occluded objects. Since our approach can be trained end-to-end and is easily adaptable, we believe it will be useful for handling occlusions in many video understanding tasks.
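
As a rough illustration of what a continuous representation over point clouds can look like, here is a minimal sketch: a pooled point encoder plus an MLP queried at continuous (x, y, z, t) coordinates. The encoder, decoder, and shapes are our own simplifications, not the architecture from the paper.

```python
# Hypothetical sketch, not the released model: a continuous 4D field that maps
# spatiotemporal queries plus a scene latent to occupancy.
import torch
import torch.nn as nn

class PointEncoder(nn.Module):
    """Pools an (x, y, z, t, r, g, b) point cloud into a single scene latent."""
    def __init__(self, dim=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(7, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, points):                      # points: (B, N, 7)
        return self.mlp(points).max(dim=1).values   # (B, dim), permutation-invariant

class Field4D(nn.Module):
    """Occupancy at continuous (x, y, z, t) queries, conditioned on the latent."""
    def __init__(self, dim=128):
        super().__init__()
        self.decoder = nn.Sequential(nn.Linear(dim + 4, dim), nn.ReLU(),
                                     nn.Linear(dim, 1))

    def forward(self, latent, queries):             # queries: (B, Q, 4)
        latent = latent.unsqueeze(1).expand(-1, queries.shape[1], -1)
        return self.decoder(torch.cat([latent, queries], dim=-1)).sigmoid()

points = torch.randn(2, 4096, 7)                    # RGB-D video lifted to points
queries = torch.rand(2, 256, 4)                     # spatiotemporal query samples
occ = Field4D()(PointEncoder()(points), queries)    # (2, 256, 1)
```

Because the decoder can be queried at arbitrary spacetime coordinates, it can in principle return estimates for regions that are occluded in the input frames, which is the property the paper exploits (the paper's model uses attention over the encoded points rather than a single pooled latent).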

[paper] [website]

Discovering Objects that Can Move

Zhipeng Bao, Pavel Tokmakov, Allan Jabri, Yu-Xiong Wang, Adrien Gaidon, Martial Hebert

Poster Session: Thursday, June 23, 2022, 10:00 AM–12:30 PM CDT

This paper studies the problem of object discovery — separating objects from the background without manual labels. Existing approaches rely on appearance cues, such as color, texture, and location, to group pixels into object-like regions. However, by relying on appearance alone, these methods fail to reliably separate objects from the background in cluttered scenes. This is a fundamental limitation, since the definition of an object is inherently ambiguous and context-dependent. To resolve this ambiguity, in this work we choose to focus on dynamic objects — entities that are capable of moving independently in the world. We then scale recent auto-encoder-based frameworks for unsupervised object discovery from toy, synthetic images to complex, real-world scenes by simplifying their architecture and augmenting the resulting model with a weak learning signal from a motion segmentation algorithm. We demonstrate that, despite only capturing a small subset of the objects, this signal is enough to bias the model, which then learns to segment both moving and static instances of dynamic objects.

We show that this model scales to our newly collected, photo-realistic synthetic dataset with street driving scenarios. Additionally, we leverage ground-truth segmentation and flow annotations in this dataset for thorough ablation and evaluation. Finally, our experiments on the real-world KITTI dataset demonstrate that the proposed approach outperforms both heuristic- and learning-based methods by capitalizing on motion cues.
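
To illustrate how a weak motion signal can bias an auto-encoder toward objects, here is a hypothetical loss sketch. The function and tensor names are ours, and the training objective in the paper differs in its details.

```python
# Hypothetical sketch: combining an auto-encoder reconstruction objective with
# a weak, partial motion-segmentation signal.
import torch
import torch.nn.functional as F

def discovery_loss(recon, image, slot_masks, motion_mask, weight=0.1):
    """recon, image: (B, 3, H, W); slot_masks: (B, K, H, W) soft per-slot masks;
    motion_mask: (B, H, W) binary mask from an off-the-shelf motion segmenter,
    which may cover only a few of the moving objects."""
    rec = F.mse_loss(recon, image)
    # Per pixel, take the best-matching slot and supervise it with the
    # (noisy, partial) motion mask, biasing slots toward movable entities.
    best_slot = slot_masks.max(dim=1).values          # (B, H, W)
    motion = F.binary_cross_entropy(best_slot.clamp(1e-6, 1 - 1e-6),
                                    motion_mask.float())
    return rec + weight * motion

B, K, H, W = 2, 7, 64, 64
loss = discovery_loss(torch.rand(B, 3, H, W), torch.rand(B, 3, H, W),
                      torch.rand(B, K, H, W).softmax(dim=1),
                      torch.rand(B, H, W) > 0.9)
```

Even though the motion mask captures only some of the moving objects, the extra term biases the grouping toward entities that can move, which is the intuition described above.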

[paper] [website]

Revisiting the “Video” in Video-Language Understanding

Shyamal Buch, Cristóbal Eyzaguirre, Adrien Gaidon, Jiajun Wu, Li Fei-Fei, Juan Carlos Niebles

Oral Session: Tuesday, June 21, 2022, 1:30 PM–3:00 PM CDT

Videos offer the promise of understanding not only what can be discerned from a single image (e.g., scenes, people, and objects), but also multi-frame event temporality, causality, and dynamics. Correspondingly, a central question lies at the heart of video research: “What makes a video task uniquely suited to videos, beyond what can be understood from a single image?” Building on recent progress in self-supervised image-language models, we revisit this question in the context of video and language tasks. We propose the atemporal probe (ATP), a new model for video-language analysis which provides a stronger bound on the baseline accuracy of multimodal models constrained by image-level understanding. By applying this model to standard discriminative video and language tasks, such as video question answering and text-to-video retrieval, we characterize the limitations and potential of current video-language benchmarks. We find that understanding of event temporality is often not necessary to achieve strong or state-of-the-art performance, even compared with recent large-scale video-language models and in contexts intended to benchmark deeper video-level understanding.

We also demonstrate how ATP can improve both video-language dataset and model design. We describe a technique for leveraging ATP to better disentangle dataset subsets with a higher concentration of temporally challenging data, improving benchmarking efficacy for causal and temporal understanding. Further, we show that effectively integrating ATP into full video-level temporal models can improve efficiency and state-of-the-art accuracy.
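
As a loose sketch of what an atemporal, image-level bound can look like in code, here is a minimal example; it is not the released ATP model, and the selector, shapes, and frozen-embedding assumption are ours.

```python
# Hypothetical sketch of an atemporal probe: pick one frame per clip without
# using temporal order, then score it against candidate texts.
import torch
import torch.nn as nn

class AtemporalProbe(nn.Module):
    """Operates on frozen per-frame and per-text embeddings (e.g., from an
    image-language model such as CLIP)."""
    def __init__(self, dim=512):
        super().__init__()
        self.selector = nn.Linear(dim, 1)   # learned, order-agnostic frame score

    def forward(self, frame_emb, text_emb):
        # frame_emb: (B, T, dim) shuffled frame features; text_emb: (B, C, dim)
        weights = self.selector(frame_emb).softmax(dim=1)     # (B, T, 1)
        chosen = (weights * frame_emb).sum(dim=1)             # (B, dim) soft frame selection
        return torch.einsum('bd,bcd->bc', chosen, text_emb)   # similarity per candidate

probe = AtemporalProbe()
logits = probe(torch.randn(4, 16, 512), torch.randn(4, 5, 512))  # (4, 5)
```

Because the selector never sees frame order, any accuracy such a probe reaches on a benchmark can be attributed to image-level rather than temporal understanding, which is how it serves as a bound.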

[paper] [blog]

Workshops

Women in Computer Vision (WiCV) — June 19

Website: https://sites.google.com/view/wicv/home

TRI is once again sponsoring this half-day workshop, one of the leading DEI scientific events in computer vision and machine learning, where Dr. Jie Li and Dr. Adrien Gaidon will be participating as mentors. We look forward to seeing you there, and don’t hesitate to reach out!

AVA: Accessibility, Vision, and Autonomy Workshop — June 20

[website]

The overarching goal of this workshop is to gather researchers, students, and advocates who work at the intersection of accessibility, computer vision, and autonomous systems. We plan to use the workshop to identify challenges and pursue solutions for the current lack of shared and principled development tools for data-driven vision-based accessibility systems. For instance, there is a general lack of vision-based benchmarks and methods relevant to accessibility (e.g., people with disabilities and mobility aids are currently mostly absent from large-scale datasets in pedestrian detection). Our workshop will provide a unique opportunity for fostering a mutual discussion between accessibility, computer vision, and robotics researchers and practitioners.

Dr. Adrien Gaidon will be a keynote speaker talking about Principle-centric AI.

How Far Can Synthetic Data Take Us? 7th Workshop on Benchmarking Multi-Target Tracking — June 20

[website]

Synthetic data has the potential to enable the next generation of deep learning algorithms to thrive on unprecedented amounts of free labeled data while avoiding privacy and dataset bias concerns. As recently shown in our MOTSynth work, models trained on synthetic data can already achieve competitive performance when tested on real datasets. At the 7th BMTT workshop, we aim to bring the tracking community together to further explore the potential of synthetic data. We have an exciting line-up of speakers and are organizing two challenges aimed at advancing the state of the art in synthetic-to-real tracking.

Dr. Adrien Gaidon and Dr. Pavel Tokmakov will be sharing recent progress on our use of synthetic data for improved video understanding.

Omnidirectional Computer Vision — June 20

[website]

Our objective is to provide a venue for novel research in omnidirectional computer vision, with an eye toward actualizing these ideas for commercial or societal benefit. As omnidirectional cameras become more widespread, we want to bridge the gap between the research and application of omnidirectional vision technologies. Omnidirectional cameras are already widespread in a number of application areas, such as automotive, surveillance, photography, simulation, and other use cases that benefit from large fields of view. More recently, they have garnered interest for use in virtual and augmented reality. We want to encourage the development of new models that natively operate on omnidirectional imagery, as well as close the performance gap between perspective-image and omnidirectional algorithms. This full-day workshop has twelve invited speakers from both academia and industry.

Dr. Vitor Guizilini will be talking about our recent research in self-supervised learning in multi-camera systems.
