Sensor Depth To Mesh Point Cloud Feature Transfer: Implementation Details
On 3D benchmarks like ScanNet [6] and ScanNet200 [41], the objective is to label a point cloud derived from a mesh rather than the depth map from the sensor. Hence, on those benchmarks, instead of upsampling the 1/8 resolution feature map to 1/4, we trilinearly interpolate features from the 1/8 resolution feature map to the provided point cloud sampled from the mesh. This means: for each vertex in the mesh, we trilinearly interpolate from our computed 3D features to obtain interpolated features. We additionally interpolate in the same way from the unprojected 1/4 resolution feature map in the backbone, to form an additive skip connection.
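To make the feature transfer concrete, the sketch below shows trilinear sampling of a 3D feature volume at mesh-vertex locations. It assumes the 1/8-resolution features have been scattered into a dense voxel grid whose axes (D, H, W) correspond to world (z, y, x); in the actual model the features live on unprojected sensor points, and the function name, arguments, and dense-grid layout here are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def interpolate_grid_features(feat_grid, xyz_min, voxel_size, query_xyz):
    """Trilinearly sample a dense 3D feature grid at query points.

    feat_grid : (C, D, H, W) feature volume (e.g. 1/8-resolution 3D features)
    xyz_min   : (3,) world-space corner of the grid
    voxel_size: float, edge length of one voxel
    query_xyz : (M, 3) mesh-vertex coordinates in world space
    returns   : (M, C) per-vertex interpolated features
    """
    C, D, H, W = feat_grid.shape
    # Normalize query coordinates to [-1, 1] in (x, y, z) order for grid_sample.
    extent = torch.tensor([W, H, D], dtype=query_xyz.dtype,
                          device=query_xyz.device) * voxel_size
    norm = 2.0 * (query_xyz - xyz_min) / extent - 1.0      # (M, 3)
    grid = norm.view(1, -1, 1, 1, 3)                        # (1, M, 1, 1, 3)
    # For 5D inputs, mode="bilinear" performs trilinear interpolation.
    sampled = F.grid_sample(feat_grid.unsqueeze(0), grid,
                            mode="bilinear", align_corners=False)
    return sampled.view(C, -1).t()                          # (M, C)
```

The same routine can be applied to the unprojected 1/4-resolution backbone features, with the two results summed to realize the additive skip connection.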
Shared 2D-3D segmentation mask decoder: Our segmentation decoder is a Transformer, similar to Mask2Former's decoder head, which takes as input upsampled 2D or 3D feature maps and outputs corresponding 2D or 3D segmentation masks and their semantic classes. Specifically, we instantiate a set of N learnable object queries responsible for decoding individual instances. These queries are iteratively refined by a Query Refinement block, which consists of cross-attention to the upsampled features, followed by a self-attention between the queries. Except for the positional embeddings, all attention and query weights are shared between 2D and 3D. We use Fourier positional encodings in 2D, while in 3D we encode the XYZ coordinates of the 3D tokens with an MLP. The refined queries are used to predict instance masks and semantic classes. For mask prediction, the queries do a token-wise dot product with the highest-resolution upsampled features. For semantic class prediction, we use an MLP over the queries, mapping them to class logits. We refer readers to Mask2Former [4] for further details.
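A minimal sketch of one such refinement round is given below, assuming a DETR-style layer layout; the class name, dimensions, and normalization placement are assumptions for illustration rather than the exact architecture.

```python
import torch
import torch.nn as nn

class QueryRefinementBlock(nn.Module):
    """One refinement round: cross-attention to upsampled feature tokens,
    then self-attention among the queries. The same weights serve 2D and 3D;
    only the positional encodings of the feature tokens differ."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, queries, feat_tokens, feat_pos):
        # queries:     (B, N, dim) object queries
        # feat_tokens: (B, T, dim) upsampled 2D pixel or 3D point tokens
        # feat_pos:    (B, T, dim) Fourier PE (2D) or MLP-encoded XYZ (3D)
        x, _ = self.cross_attn(queries, feat_tokens + feat_pos, feat_tokens)
        queries = self.norm1(queries + x)
        x, _ = self.self_attn(queries, queries, queries)
        return self.norm2(queries + x)

# Mask prediction: token-wise dot product with the highest-resolution features.
# masks = torch.einsum("bnc,btc->bnt", queries, highres_tokens)
```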
Open vocabulary class decoder: Drawing inspiration from prior open-vocabulary detection methods [19, 29, 61], we introduce an alternative classification head capable of handling an arbitrary number of semantic classes. This modification is essential for joint training on multiple datasets. Similar to BUTD-DETR [19] and GLIP [29], we supply the model with a detection prompt formed by concatenating object categories into a sentence (e.g., “Chair. Table. Sofa.”) and encode it using RoBERTa [32]. In the query-refinement block, queries additionally attend to these text tokens before attending to the upsampled feature maps. For semantic class prediction, we first perform a dot-product operation between queries and language tokens, generating one logit per token in the detection prompt. The logits corresponding to prompt tokens for a specific object class are then averaged to derive per-class logits. This can handle multi-word noun phrases such as “shower curtain”, where we average the logits corresponding to “shower” and “curtain”. The segmentation masks are predicted by a pixel-/point-wise dot-product, in the same fashion as described earlier.
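The token-to-class aggregation can be sketched as follows, assuming the tokenizer's span of each class name in the prompt is known; the function name and argument layout are assumptions, but the dot product and per-span averaging follow the description above.

```python
import torch

def per_class_logits(queries, text_tokens, class_token_spans):
    """Convert per-token logits into per-class logits.

    queries          : (B, N, C) refined object queries
    text_tokens      : (B, T, C) RoBERTa features of the detection prompt
    class_token_spans: list of (start, end) token indices covering each class
                       name, e.g. "shower curtain" spans two tokens
    returns          : (B, N, num_classes) classification logits
    """
    # One logit per query per prompt token.
    token_logits = torch.einsum("bnc,btc->bnt", queries, text_tokens)
    # Average the logits over each class-name span.
    class_logits = [token_logits[..., s:e].mean(dim=-1)
                    for (s, e) in class_token_spans]
    return torch.stack(class_logits, dim=-1)
```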
Implementation details: We initialize our model with the pre-trained weights from Mask2Former [4] trained on COCO [31]. Subsequently, we train all parameters end-to-end, including both pre-trained and new parameters from 3D fusion layers. During training in 3D scenes, our model processes a sequence of N consecutive frames, usually comprising 25 frames. At test time, we input all images in the scene to our model, with an average of 90 images per scene in ScanNet. We use the vanilla closed-vocabulary decoding head for all experiments except when training jointly on 2D-3D datasets, where we use our open-vocabulary class decoder, which lets us handle the different label spaces in these datasets. That is, during training we employ open-vocabulary mask decoding for joint 2D and 3D datasets and vanilla closed-vocabulary decoding otherwise. Training continues until convergence on 2 NVIDIA A100s with 40 GB VRAM, with an effective batch size of 6 in 3D and 16 in 2D. For joint training on 2D and 3D datasets, we alternate sampling 2D and 3D batches with batch sizes of 3 and 8 per GPU, respectively. We adopt Mask2Former's strategy, using Hungarian matching to assign queries to ground-truth instances, together with its supervision losses. While our model is only trained for instance segmentation, it can perform semantic segmentation for free at test time, like Mask2Former. We refer to Mask2Former [4] for more details.
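A sketch of how the alternating 2D/3D sampling could be organized is shown below; the dataloaders, the loss interface of the model, and all function names are assumptions for illustration, not the training code used in the paper.

```python
def infinite(loader):
    # Re-iterate a dataloader indefinitely (fresh shuffling each epoch).
    while True:
        for batch in loader:
            yield batch

def joint_training_loop(model, optimizer, loader_2d, loader_3d, num_steps):
    """Alternate 2D image batches and 3D scene batches during joint training.

    loader_2d yields per-GPU batches of single images; loader_3d yields
    per-GPU batches of scene clips of ~25 consecutive posed frames.
    """
    it_2d, it_3d = infinite(loader_2d), infinite(loader_3d)
    for step in range(num_steps):
        # Even steps draw a 2D batch, odd steps a 3D batch.
        batch = next(it_2d) if step % 2 == 0 else next(it_3d)
        losses = model(batch)           # assumed to return a dict of losses
        loss = sum(losses.values())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```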
4. Experiments

4.1. Evaluation on 3D benchmarks

Datasets: First, we test our model on 3D instance and semantic segmentation in the ScanNet [6] and ScanNet200 [41] benchmarks. The goal of these benchmarks is to label the point cloud extracted from the 3D mesh of a scene reconstructed from raw sensor data. ScanNet evaluates on 20 common semantic classes, while ScanNet200 uses 200 classes, which is more representative of the long-tailed object distribution encountered in the real world. We report results on the official validation split of these datasets here and on the official test split in the supplementary.
Evaluation metrics: We follow the standard evaluation metrics, namely mean Average Precision (mAP) for instance segmentation and mean Intersection over Union (mIoU) for semantic segmentation.
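For reference, the semantic-segmentation metric has the standard definition below, where C is the number of classes and TP, FP, FN are counted per class over all evaluated points; the instance-segmentation mAP follows the ScanNet benchmark convention of averaging AP over a range of IoU thresholds.

\[
\mathrm{mIoU} = \frac{1}{C} \sum_{c=1}^{C} \frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FP}_c + \mathrm{FN}_c}
\]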
Baselines: In instance segmentation, our main baseline is the SOTA 3D method Mask3D [44]. For a thorough comparison, we train both Mask3D and our model with sensor RGB-D point cloud input and evaluate them on the benchmark-provided mesh-sampled point clouds. We also compare with the following recent and concurrent works: PBNet [58], QueryFormer [34] and MAFT [28]. QueryFormer and MAFT explore query initialization and refinement in a Mask3D-like architecture and thus have complementary advantages to ours. Unlike ODIN, these methods