A Structure for Unified Semantics, 3D Geometry and Camera
A comprehensive semantic understanding of a scene is important for many applications - but in what space should diverse semantic information (e.g., objects, scene categories, material types, textures, etc.) be grounded, and what should its structure be? Aspiring to one unified structure that hosts diverse types of semantics, we follow the Scene Graph paradigm in 3D, generating a 3D Scene Graph. Given a 3D mesh and registered panoramic images, we construct a graph that spans the entire building and includes semantics on objects (e.g., class, material, and other attributes), rooms (e.g., scene category, volume, etc.) and cameras (e.g., location, etc.), as well as the relationships among these entities. However, this process is prohibitively labor-heavy if done manually. To alleviate this, we devise a semi-automatic framework that employs existing detection methods and enhances them using two main constraints: I. framing of query images sampled on panoramas to maximize the performance of 2D detectors, and II. multi-view consistency enforcement across 2D detections that originate from different camera locations.
3D Scene Graph
The input to our method consists of 3D mesh models, registered RGB panoramas and the corresponding camera parameters. The output is the 3D Scene Graph of the scanned space, which we formulate as a four-layered graph. Each layer has a set of nodes, each node has a set of attributes, and there are edges between nodes which represent their relationships.
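The four-layered formulation can be sketched as a plain graph container. This is a hypothetical illustration of the structure described above (the layer names, attribute keys, and relationship labels below are illustrative assumptions, not the released data format):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    layer: str                    # one of: "building", "room", "object", "camera"
    attributes: dict = field(default_factory=dict)

@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)   # node_id -> Node
    edges: list = field(default_factory=list)   # (src_id, dst_id, relationship)

    def add_node(self, node_id, layer, **attributes):
        self.nodes[node_id] = Node(layer, attributes)

    def add_edge(self, src, dst, relationship):
        self.edges.append((src, dst, relationship))

# Example: one node per layer, with attributes and cross-layer relationships.
graph = SceneGraph()
graph.add_node("building_0", "building")
graph.add_node("room_3", "room", scene_category="living room", volume_m3=62.4)
graph.add_node("obj_17", "object", cls="chair", material="wood")
graph.add_node("cam_5", "camera", location=(1.2, 0.8, 1.6))
graph.add_edge("room_3", "building_0", "part_of")
graph.add_edge("obj_17", "room_3", "located_in")
graph.add_edge("cam_5", "obj_17", "observes")
```

Edges are kept as typed triples so that relationships between any two layers (object-room, camera-object, etc.) live in the same structure.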
Automatic Labeling Pipeline
To construct the 3D Scene Graph we need to identify its elements, their attributes, and relationships. Given the number of elements and the scale of the buildings, annotating the input RGB and 3D mesh data with object labels and their segmentation masks is the major labor bottleneck. We present an automatic method that uses existing semantic detectors to bootstrap the annotation pipeline and minimize human labor.
Having RGB panoramas as input gives the opportunity to formulate a framing approach that samples rectilinear images from them with the objective to maximize detection accuracy. It utilizes two heuristics: (a) placing the object at the center of the image and (b) having the image properly zoomed-in around it to provide enough context. We begin by densely sampling rectilinear images on the panorama, with the goal of having at least one image that satisfies the heuristics for each object in the scene. After we get the detection results for each frame, we aggregate them back on the panorama using a weighted voting scheme.
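The aggregation step can be sketched as a per-pixel weighted vote. This is a minimal illustration assuming each frame's detection has already been projected back to a boolean pixel mask on the panorama; the weighting scheme here (a single scalar per detection, e.g. detection confidence) is a simplifying assumption, not the exact scheme used:

```python
import numpy as np

def aggregate_votes(pano_shape, frame_detections, num_classes):
    """Fuse per-frame detections into a per-pixel label map on the panorama.

    frame_detections: list of (mask, class_id, weight), where mask is a
    boolean array of shape pano_shape marking the detection's pixels.
    """
    votes = np.zeros(pano_shape + (num_classes,), dtype=np.float32)
    for mask, class_id, weight in frame_detections:
        votes[..., class_id][mask] += weight   # each frame casts a weighted vote
    labels = votes.argmax(axis=-1)             # winning class per pixel
    labels[votes.sum(axis=-1) == 0] = -1       # pixels with no votes stay unlabeled
    return labels
```

Overlapping frames thus reinforce each other where they agree, and the highest-weighted class wins where they conflict.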
With the RGB panoramas registered on the 3D mesh, we can annotate it by projecting the 2D pixel labels onto the 3D surfaces. A mere projection of a single panorama does not yield an accurate segmentation: imperfect panorama labels, commonly poor reconstruction of certain objects, and camera registration errors all cause labels to "leak" onto neighboring objects. However, the objects in the scene are visible from multiple panoramas, which enables using multi-view consistency to fix such issues.
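One simple way to realize this consistency check is a per-face majority vote across panoramas. The sketch below is an assumption about how such a step could look (the agreement threshold and the face-vote input format are illustrative, not the paper's exact procedure):

```python
from collections import Counter

def consolidate_face_labels(face_votes, min_agreement=0.5):
    """face_votes: dict mapping face_id -> list of labels, one per panorama
    that observes the face. Keeps a label only when enough views agree."""
    labels = {}
    for face_id, votes in face_votes.items():
        label, count = Counter(votes).most_common(1)[0]
        if count / len(votes) >= min_agreement:
            labels[face_id] = label    # views agree: accept the label
        else:
            labels[face_id] = None     # conflicting views: leave unlabeled
    return labels
```

A "leaked" label from one misregistered panorama is then outvoted by the other views of the same face, rather than corrupting the final mesh segmentation.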
We integrate the 3D Scene Graph as a modality in the Gibson simulator (Xia et al., 2018), allowing on-the-fly computation of 2D scene graphs per frame.
Xia, F., Zamir, A. R., He, Z., Sax, A., Malik, J., & Savarese, S. (2018). Gibson Env: Real-world perception for embodied agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 9068-9079).