Abstract

A comprehensive semantic understanding of a scene is important for many applications, but in what space should diverse semantic information (e.g., objects, scene categories, material types, textures) be grounded, and what should its structure be? Aspiring to have one unified structure that hosts diverse types of semantics, we follow the Scene Graph paradigm in 3D, generating a 3D Scene Graph. Given a 3D mesh and registered panoramic images, we construct a graph that spans the entire building and includes semantics on objects (e.g., class, material, and other attributes), rooms (e.g., scene category, volume) and cameras (e.g., location), as well as the relationships among these entities. However, this process is prohibitively labor-intensive if done manually. To alleviate this, we devise a semi-automatic framework that employs existing detection methods and enhances them using two main constraints: I. framing of query images sampled on panoramas to maximize the performance of 2D detectors, and II. multi-view consistency enforcement across 2D detections that originate in different camera locations.

3D Scene Graph

The input to our method consists of 3D mesh models, registered RGB panoramas and the corresponding camera parameters. The output is the 3D Scene Graph of the scanned space, which we formulate as a four-layered graph. Each layer has a set of nodes, each node has a set of attributes, and there are edges between nodes which represent their relationships.
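As an illustration of the structure described above, here is a minimal sketch of a layered graph whose nodes carry attributes and whose edges carry relationships. The layer names (building, room, object, camera) are an assumption inferred from the abstract, and the class names, attribute keys, and relation names are hypothetical; this is not the released data format.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple

# Assumed layer names, inferred from the abstract (not the released format).
LAYERS = ("building", "room", "object", "camera")


@dataclass
class Node:
    node_id: str
    layer: str
    attributes: Dict[str, Any] = field(default_factory=dict)


@dataclass
class SceneGraph:
    nodes: Dict[str, Node] = field(default_factory=dict)
    # Each edge is (source id, relation name, target id).
    edges: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_node(self, node_id: str, layer: str, **attributes: Any) -> Node:
        if layer not in LAYERS:
            raise ValueError(f"unknown layer: {layer}")
        node = Node(node_id, layer, dict(attributes))
        self.nodes[node_id] = node
        return node

    def add_edge(self, source: str, relation: str, target: str) -> None:
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append((source, relation, target))

    def layer(self, name: str) -> List[Node]:
        """Return all nodes that belong to one layer."""
        return [n for n in self.nodes.values() if n.layer == name]

    def neighbors(self, node_id: str, relation: Optional[str] = None) -> List[str]:
        """Return targets of outgoing edges, optionally filtered by relation."""
        return [t for s, r, t in self.edges
                if s == node_id and (relation is None or r == relation)]


def build_example() -> SceneGraph:
    """Build a tiny graph with one node per layer (illustrative values)."""
    graph = SceneGraph()
    graph.add_node("building_0", "building", num_floors=1)
    graph.add_node("room_0", "room", scene_category="kitchen")
    graph.add_node("object_0", "object", object_class="chair", material="wood")
    graph.add_node("camera_0", "camera", location=(1.0, 2.0, 1.5))
    graph.add_edge("building_0", "contains", "room_0")
    graph.add_edge("room_0", "contains", "object_0")
    graph.add_edge("camera_0", "located_in", "room_0")
    return graph
```

Keeping edges as plain (source, relation, target) triples makes it easy to add new relationship types without changing the node classes.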

Automatic Labeling Pipeline

To construct the 3D Scene Graph we need to identify its elements, their attributes, and relationships. Given the number of elements and the scale, annotating the input RGB and 3D mesh data with object labels and their segmentation masks is the major labor bottleneck. We present an automatic method that uses existing semantic detectors to bootstrap the annotation pipeline and minimize human labor.

I. Framing

Having RGB panoramas as input gives the opportunity to formulate a framing approach that samples rectilinear images from them with the objective to maximize detection accuracy. It utilizes two heuristics: (a) placing the object at the center of the image and (b) having the image properly zoomed-in around it to provide enough context. We begin by densely sampling rectilinear images on the panorama, with the goal of having at least one image that satisfies the heuristics for each object in the scene. After we get the detection results for each frame, we aggregate them back on the panorama using a weighted voting scheme.
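The two steps above, dense sampling and weighted voting, can be sketched as follows. This is a simplified illustration, not the released code: the sampling grid (yaw, pitch, field of view), the `Detection` fields, and the weight formula (detector score multiplied by how centered the detection is) are all assumptions chosen to mirror the two heuristics. Reprojecting each detection mask onto panorama pixels is assumed to have happened already.

```python
from collections import defaultdict
from itertools import product
from typing import Dict, FrozenSet, Iterable, List, NamedTuple, Tuple

Pixel = Tuple[int, int]  # (row, col) on the panorama


class Frame(NamedTuple):
    yaw_deg: float
    pitch_deg: float
    fov_deg: float


def sample_frames(yaw_step: float = 30.0,
                  pitches: Tuple[float, ...] = (-30.0, 0.0, 30.0),
                  fovs: Tuple[float, ...] = (60.0, 90.0)) -> List[Frame]:
    """Densely sample view directions and zoom levels (assumed grid)."""
    yaws = [i * yaw_step for i in range(int(360 / yaw_step))]
    return [Frame(y, p, f) for y, p, f in product(yaws, pitches, fovs)]


class Detection(NamedTuple):
    label: str
    score: float                  # detector confidence in [0, 1]
    center_offset: float          # normalized distance from frame center, in [0, 1]
    pano_pixels: FrozenSet[Pixel]  # mask pixels after reprojection to the panorama


def vote_weight(det: Detection) -> float:
    """Assumed weight: confident, well-centered detections count more."""
    return det.score * (1.0 - det.center_offset)


def aggregate(detections: Iterable[Detection]) -> Dict[Pixel, str]:
    """Weighted vote per panorama pixel; the heaviest label wins."""
    votes: Dict[Pixel, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for det in detections:
        weight = vote_weight(det)
        for pixel in det.pano_pixels:
            votes[pixel][det.label] += weight
    return {pixel: max(tally, key=tally.get) for pixel, tally in votes.items()}
```

With this weighting, a pixel covered by a centered, confident "chair" detection and an off-center "table" detection is labeled "chair", which is the behavior the centering heuristic is meant to encourage.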

II. Multi-View Consistency

With the RGB panoramas registered on the 3D mesh, we can annotate the mesh by projecting the 2D pixel labels onto the 3D surfaces. Projecting a single panorama alone does not yield an accurate segmentation, because of imperfect panorama results, the commonly poor reconstruction of certain objects, and camera registration errors. These cause labels to "leak" onto neighboring objects. However, the objects in the scene are visible from multiple panoramas, which enables using multi-view consistency to fix such issues.
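One simple way to picture multi-view consistency is a per-face vote over the labels projected from each panorama. The sketch below uses an unweighted majority vote with an agreement threshold, which is a simplifying assumption rather than the method's actual voting rule; the function name, the face-id-to-label dictionaries, and the `min_agreement` parameter are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def fuse_labels(per_panorama: Iterable[Dict[int, str]],
                min_agreement: float = 0.5) -> Dict[int, Optional[str]]:
    """Fuse per-panorama face labels into one label per mesh face.

    Each input dict maps a mesh face id to the label projected from one
    panorama. A face keeps its most frequent label only if that label's
    share of the votes exceeds `min_agreement`; otherwise it is left
    unlabeled (None) so that it can be reviewed later.
    """
    votes: Dict[int, Counter] = {}
    for labels in per_panorama:
        for face_id, label in labels.items():
            votes.setdefault(face_id, Counter())[label] += 1

    fused: Dict[int, Optional[str]] = {}
    for face_id, counter in votes.items():
        label, count = counter.most_common(1)[0]
        share = count / sum(counter.values())
        fused[face_id] = label if share > min_agreement else None
    return fused


# Face 1 receives a "leaked" chair label from one panorama only.
pano_a = {0: "chair", 1: "chair", 2: "table", 3: "sofa"}
pano_b = {0: "chair", 1: "table", 2: "table", 3: "wall"}
pano_c = {0: "chair", 1: "table"}
fused = fuse_labels([pano_a, pano_b, pano_c])
```

In this example the leaked label on face 1 is outvoted by the two panoramas that agree on "table", while face 3, where the two views disagree evenly, stays unlabeled.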

Results

To interact with the 2D and 3D results per model visit here.

Gibson Integration

We integrate the 3D Scene Graph as a modality in the Gibson [1] simulator, allowing on-the-fly computation of 2D scene graphs per frame.

[1] Xia, F., Zamir, A. R., He, Z., Sax, A., Malik, J., & Savarese, S. (2018). Gibson env: Real-world perception for embodied agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 9068-9079).

Paper

3D Scene Graph: A Structure for Unified Semantics, 3D Space, and Camera

Armeni I., He Z., Gwak J., Zamir A. R., Fischer M., Malik J., Savarese S.

ICCV 2019

Team


Iro Armeni	Stanford University - CIFE, SVL
Zhi-Yang He	Stanford University - SVL
JunYoung Gwak	Stanford University - SVL
Amir R. Zamir	Stanford University - SVL; UC Berkeley - BAIR
Martin Fischer	Stanford University - CIFE
Jitendra Malik	UC Berkeley - BAIR
Silvio Savarese	Stanford University - SVL

News

Next:	Annotations for Gibson's large split to follow.
Feb 18th 2020:	Released code to generate automatic annotations given a 2D detector, as well as code to compute attributes and relationships.
Jan 22nd 2020:	Released annotations for Gibson's medium split.
Oct 8th 2019:	Released annotations for Gibson's tiny split.

Stanford Vision and Learning Lab

Center for Integrated Facility Engineering