
Figure 2: Simultaneous LiDAR object detection and segmentation network with polar pillars. We adopt the same backbone as in PointPillars and add a semantic segmentation head in parallel with the detection heads. The input wedge-shaped pillars are unfolded into a rectangular feature map for convolution. The object (green box) is distorted because the end near the sensor looks bigger and the end far from the sensor looks smaller. Feature Undistortion is applied to the classification head to mimic bilinear sampling and interpolate cartesian pillar features from polar pillar features. Range Stratified Convolution & Normalization is applied to the center offset regression head.
only one pillar along the height dimension. Following MVF [37], we adopt dynamic voxelization to sample all points within each pillar, instead of randomly sampling a fixed number of points per pillar.
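The following is a minimal sketch of dynamic voxelization over polar pillars, assuming a fixed polar grid resolution and mean pooling inside each pillar; the function name, grid size, and pooling choice are our own illustration, not the released implementation.

```python
import torch

def dynamic_polar_voxelize(points, r_max=50.0, num_r=512, num_theta=512):
    """points: (N, C) tensor whose first two channels are x, y in the sensor frame."""
    x, y = points[:, 0], points[:, 1]
    r = torch.sqrt(x ** 2 + y ** 2)                       # range of each point
    theta = torch.atan2(y, x)                             # azimuth in (-pi, pi]
    r_idx = torch.clamp((r / r_max * num_r).long(), max=num_r - 1)
    t_idx = torch.clamp(((theta + torch.pi) / (2 * torch.pi) * num_theta).long(),
                        max=num_theta - 1)
    pillar_idx = r_idx * num_theta + t_idx                # flat pillar id for every point

    # Scatter all point features into their pillars; each pillar keeps every point
    # it contains (dynamic), rather than a fixed random sample per pillar.
    num_pillars = num_r * num_theta
    feat_sum = torch.zeros(num_pillars, points.shape[1]).index_add_(0, pillar_idx, points)
    counts = torch.zeros(num_pillars).index_add_(0, pillar_idx, torch.ones_like(r)).clamp(min=1)
    return (feat_sum / counts.unsqueeze(1)).view(num_r, num_theta, -1), pillar_idx
```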
3.3 Simultaneous Detection and Segmentation
We design PolarStream, a simultaneous object detection and segmentation network, by extending PointPillars [15], one of the most widely used 3D object detectors balancing accuracy and speed. As shown in Fig. 2, PolarStream consists of a Pillar Feature Encoder, followed by a 2D CNN backbone and a U-Net [24]-like structure. On top are the detection and segmentation heads.
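A minimal structural sketch of this composition is shown below; the module names and interfaces are our own assumptions for illustration, not the released code.

```python
import torch.nn as nn

class PolarStreamSketch(nn.Module):
    """Pillar feature encoder -> 2D backbone with U-Net-like decoder -> parallel heads."""
    def __init__(self, encoder, backbone, det_heads, seg_head):
        super().__init__()
        self.encoder = encoder          # polar pillar feature encoder
        self.backbone = backbone        # 2D CNN + U-Net-like upsampling
        self.det_heads = det_heads      # CenterPoint-style detection heads
        self.seg_head = seg_head        # semantic segmentation head

    def forward(self, points):
        pillar_feat = self.encoder(points)        # (B, C, R, Theta) pillar feature map
        bev_feat = self.backbone(pillar_feat)     # multi-scale BEV features
        return self.det_heads(bev_feat), self.seg_head(pillar_feat, bev_feat)
```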
Detection Heads
We adopt CenterPoint [34] heads with modifications to make them compatible with polar pillars. To assign targets to the 10-class heatmap indicating the objects, the Gaussian radius of the object center is computed using the span of range and azimuth of the object bounding box, instead of the length and width of the box. Following CenterPoint, we also regress the center offset as $d_x, d_y$, the bounding box size $l, w, h$ as $\log l, \log w, \log h$, and predict the bounding box height $z$. We regress the relative bounding box orientation $\phi$ as $\cos\phi, \sin\phi$ and the relative velocity as $v_x, v_y$, similar to [22]. Unlike most methods, which use multi-group detection heads that partition object classes into several groups according to their size, we use single-group detection heads to balance accuracy and speed. A comparison against multi-group detection heads is shown in the Supplementary. For streaming data with $n > 1$, we apply the stateful-NMS proposed in Han et al. [13].
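A minimal sketch of this target encoding follows; the array layouts, names, and the pillar-center argument are our own assumptions for illustration. The heatmap radius would be derived from the range/azimuth span computed here rather than from the box length and width.

```python
import numpy as np

def polar_span(corners_xy):
    """corners_xy: (4, 2) BEV corners of a box; returns its span in range and azimuth.
    (Handling of the azimuth wrap-around at +/- pi is omitted for brevity.)"""
    r = np.linalg.norm(corners_xy, axis=1)
    theta = np.arctan2(corners_xy[:, 1], corners_xy[:, 0])
    return r.max() - r.min(), theta.max() - theta.min()

def regression_targets(box, pillar_center):
    """box: dict with cx, cy, z, l, w, h, yaw, vx, vy; pillar_center: (2,) array."""
    dx = box["cx"] - pillar_center[0]                       # center offset d_x
    dy = box["cy"] - pillar_center[1]                       # center offset d_y
    log_size = np.log([box["l"], box["w"], box["h"]])       # log l, log w, log h
    orient = [np.cos(box["yaw"]), np.sin(box["yaw"])]       # cos(phi), sin(phi)
    return np.concatenate([[dx, dy, box["z"]], log_size, orient,
                           [box["vx"], box["vy"]]])         # offsets, z, size, yaw, velocity
```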
Segmentation Head
To extend PointPillars for segmentation, we add a semantic segmentation head in parallel with the detection heads. The segmentation head consists of a single 1x1 convolution layer. Its input is the concatenation of the output of the pillar feature encoder and bilinearly upsampled features from the 2D backbone.
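A minimal PyTorch sketch of this head is given below, with channel counts and the number of classes as assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegHead(nn.Module):
    def __init__(self, enc_channels=64, backbone_channels=384, num_classes=16):
        super().__init__()
        # Single 1x1 convolution over the concatenated feature map.
        self.cls = nn.Conv2d(enc_channels + backbone_channels, num_classes, kernel_size=1)

    def forward(self, enc_feat, backbone_feat):
        # Bilinearly upsample backbone features to the full pillar-grid resolution.
        up = F.interpolate(backbone_feat, size=enc_feat.shape[-2:],
                           mode="bilinear", align_corners=False)
        return self.cls(torch.cat([enc_feat, up], dim=1))   # per-pillar class logits
```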
Panoptic Fusion
Similar to Panoptic-PolarNet [39], for each point belonging to things, we predict the instance id as the id of the box whose category is the same and whose center is the nearest. For streaming data with $n > 1$, the panoptic segmentation task is not well defined. For example, the points in the $i$-th sector may belong to a box in the $(i+1)$-th sector if the majority of that box lies in the $(i+1)$-th sector. However, when we perform panoptic fusion for the $i$-th sector, we do not yet have information from the $(i+1)$-th sector. Therefore we choose global panoptic fusion for streaming point clouds, i.e., we assign instance ids according to the boxes from all sectors of the same sweep.
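A minimal sketch of this fusion rule follows; the unassigned-id convention and array layouts are our own assumptions for illustration.

```python
import numpy as np

def panoptic_fusion(point_xy, point_cls, boxes_xy, boxes_cls, boxes_id):
    """point_xy: (2,) point location; boxes_xy: (M, 2) box centers from the whole sweep;
    boxes_cls, boxes_id: (M,) predicted class and instance id per box."""
    same_cls = np.where(boxes_cls == point_cls)[0]          # candidate boxes of the same class
    if same_cls.size == 0:
        return 0                                            # no matching box: leave unassigned
    dists = np.linalg.norm(boxes_xy[same_cls] - point_xy, axis=1)
    return boxes_id[same_cls[dists.argmin()]]               # id of the nearest same-class box
```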
Multi-Task Learning
We adopt Focal Loss [16] for classification and the L1 loss for bounding box regression, orientation, and velocity estimation. For segmentation, we use the weighted cross-entropy loss and the Lovász-softmax loss [2]. The total loss is the weighted sum of the losses for each component.
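As a schematic illustration only (the weight symbols $\lambda$ are our own notation, not values from the paper), the objective has the form
$$\mathcal{L} = \lambda_{\mathrm{cls}}\,\mathcal{L}_{\mathrm{focal}} + \lambda_{\mathrm{reg}}\,\mathcal{L}_{\mathrm{L1}} + \lambda_{\mathrm{seg}}\,\mathcal{L}_{\mathrm{wce}} + \lambda_{\mathrm{ls}}\,\mathcal{L}_{\mathrm{ls}},$$
where $\mathcal{L}_{\mathrm{L1}}$ collects the box, orientation, and velocity regression terms.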
Feature Undistortion
As mentioned in Sec. 1, objects have distorted appearances with polar pillars, so we propose Feature Undistortion to undistort the features. As shown on the top right of Fig. 2, the idea of undistortion is to interpolate features at cartesian pillar locations from the original polar pillar locations so that the translation-invariant property of convolution applies. We find the connection