
Computers & Graphics 124 (2024) 104101


Special Section on RAGI 2024

An immersive labeling method for large point clouds


Tianfang Lin a,∗, Zhongyuan Yu b, Matthew McGinity b, Stefan Gumhold a

a The Chair of Computer Graphics and Visualization, TU Dresden, Germany
b Immersive Experience Lab, TU Dresden, Germany

ARTICLE INFO

Keywords: Virtual reality, Point cloud, Immersive labeling

ABSTRACT

3D point clouds, such as those produced by 3D scanners, often require labeling – the accurate classification of each point into structural or semantic categories – before they can be used in their intended application. However, in the absence of fully automated methods, such labeling must be performed manually, which can prove extremely time and labor intensive. To address this we present a virtual reality tool for accelerating and improving the manual labeling of very large 3D point clouds. The labeling tool provides a variety of 3D interactions for efficient viewing, selection and labeling of points using the controllers of consumer VR-kits. The main contribution of our work is a mixed CPU/GPU-based data structure that supports rendering, selection and labeling with immediate visual feedback at high frame rates necessary for a convenient VR experience. Our mixed CPU/GPU data structure supports fluid interaction with very large point clouds in VR, which is not possible with existing continuous level-of-detail rendering algorithms. We evaluate our method with 25 users on tasks involving point clouds of up to 50 million points and find convincing results that support the case for VR-based point cloud labeling.

1. Introduction

3D scans of real objects, structures and scenes are increasingly being employed in fields ranging from building virtual environments to train robots or autonomous cars [1] to building infrastructure management [2], archaeology [3], museums [4], surveying [5,6] and forensics [7], remote sensing [8,9] and the creation of 3D assets for computer games [10] and art [11]. Current methods for capturing real-world structures, such as LIDAR or photogrammetry, tend to produce dense, unstructured 3D point clouds which, before they can be used in their intended application, must often be classified into different subsets, be it by semantic meaning or spatial structure. For example, erroneous points might be classified as noise, or sets of points identified as individual objects (e.g., chair, door) or types (e.g., pedestrians, trees, cars) or material (e.g., concrete, metal). We refer to this annotation as labeling. The accurate labeling of points is often an essential stage in the workflow of any project involving point clouds (e.g., see [3]).

Nowadays, labeling techniques have been developed and widely used for diverse 2D/3D data sets and as preliminaries in a variety of fields [12–14].

Successful labeling depends on both the accuracy of segmentation of individual points into particular subsets and correctly assigning the appropriate semantic classification to each subset. Recently, progress has been made on the automation of both these tasks using deep learning methods for segmentation and classification [15]. However, for many applications, fully automated methods may not be sufficient in terms of accuracy, and so manual methods are required. Furthermore, machine learning methods often require correctly labeled "ground truth" data sets for training; synthetic data derived from converting 3D models into point clouds, though automatically labeled, lacks the detail and complexity of real-world labeling, leading to gaps in model comprehension of real-world contexts, so such training data must be produced manually [12]. The essential challenge facing the manual labeling of large point clouds is the immensely time-consuming, labor-intensive effort of the task. For example, labeling of the Dublin City point cloud with an appropriate tutorial, supervision, and careful cross-checking multiple times required over 2500 h of manual labor [13]. The required work, combined with the increasing need for large point cloud annotation, has seen the emergence of commercial services solely dedicated to the manual labeling of point clouds (e.g., [16–19]).

In this work we explore methods for reducing the time and labor required to manually label large point clouds. The majority of existing tools, such as Pointly.AI [20] and point_labeler [12], only support 2D displays, and so must present 3D data on a 2D screen. With large, dense point clouds, 2D displays suffer from heavy occlusion, and making fast, accurate 3D selections of points can be both slow and error-prone.

∗ Corresponding author.
E-mail addresses: [email protected] (T. Lin), [email protected] (Z. Yu), [email protected] (M. McGinity),
[email protected] (S. Gumhold).

https://doi.org/10.1016/j.cag.2024.104101
Received 30 May 2024; Received in revised form 28 August 2024; Accepted 1 October 2024
Available online 5 October 2024
0097-8493/© 2024 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license
(http://creativecommons.org/licenses/by/4.0/).

We investigated whether an immersive VR interface can accelerate this process. Depth perception can aid selection in 3D, and the ability to change viewpoint rapidly helps to deal with heavy occlusion. Furthermore, VR interaction with controllers is 3D: each controller is tracked with 6-DOF and provides numerous buttons, supporting the mapping of a greater palette of interaction possibilities in comparison to a 2D mouse device.

The development of a VR-based labeling system, however, faces a number of challenges. The first is the design of a user interface that permits fluid labeling operations. The second is the question of performance. VR demands high-resolution rendering at high frame rates. It also requires accurate, low-latency interaction where users experience immediate cause-and-effect relationships. To support this, methods for efficiently rendering, selecting, labeling and modifying large point clouds are required for a successful VR labeling tool.

To conquer the rendering challenge, Schütz et al. [21] proposed a continuous level-of-detail (CLOD) rendering strategy for large point clouds in VR that view-dependently filters large point clouds with the help of a compute shader in order to meet the performance requirements. Fast selection, which requires spatial querying of the point cloud, and fast labeling, which requires modification of point labels, remain a challenge.

In this paper, we propose a mixed CPU/GPU-based data structure that permits continuous level-of-detail rendering and rapid selection, translation and removal of points with instantaneous visual feedback. Instant and high-fidelity feedback is necessary for effective labeling of large point clouds and is rarely provided by existing methods. With our algorithm, labeling of large point clouds of 50 million points or more in VR at around 90 Hz is feasible. Our main contributions are as follows:

1. a chunk-based compute shader selection algorithm for fast spatial queries and modification of subsets of large point clouds,
2. acceleration of the continuous level-of-detail algorithm through the addition of CPU-based visibility culling,
3. instantaneous visual feedback of user interactions, including undo and restore operations,
4. greatly improved labeling efficiency in terms of positioning and detail observation.

To validate our contribution, we tested our method on our own dataset and on public datasets [22–24]. Additionally, we conducted a user study to evaluate how well our application supports labeling of large 3D point clouds. The results of the user study show that with our immersive labeling tool the efficiency of labeling large point clouds is significantly improved. To the best of our knowledge, our method is the first VR labeling application that supports large point clouds with minimal preprocessing while preserving the original color and geometry features of the point cloud.

2. Related work

2.1. Point cloud rendering in VR

Real-time rendering of large point clouds in VR is the basis of any VR labeling tool. The essential problem is that render times are proportional to the number of points, thereby limiting the number of points that can be rendered at interactive frame rates. As such, methods for accelerating the rendering of point clouds to permit ever larger numbers at interactive rates are desired. A common approach is to find strategies for rendering only a subset of the points without degrading rendering quality. Gobbetti and Marton [25] presented a simple point-based multi-resolution structure. They were the first to apply a level-of-detail (LOD) approach with points grouped into constant-sized "chunks" and organized in a level-of-detail representation. Chunks are culled according to visibility, and the appropriate level of detail is selected according to a function of distance or screen-space size. The number of chunks is typically many orders of magnitude lower than the number of points, and so sorting and prioritizing chunks becomes a tractable task. Such schemes, being very efficient on GPUs and relatively easy to implement, have become a common approach to accelerate rendering of large point clouds. Alternatively, Futterlieb et al. [26] proposed a point-based rendering approach that preserves details from the previous frame if the camera does not move and otherwise adds additional detail. "Popping" artifacts are reduced through an image-based LOD-blending method using multiple render targets. Discher et al. [27] presented a point-based and image-based multi-pass rendering technique that allows for visualizing massive 3D point clouds on VR devices, accelerated with a kd-tree used for view-frustum and occlusion culling. As chunk-based methods suffer from visible "popping" of chunks on level-of-detail changes, Schütz et al. [21] proposed a continuous level-of-detail method that per frame creates a reduced vertex buffer from the full list of points in a way that gradually decreases point density with increasing distance, thereby eliminating chunk-based "popping". This method supports large point cloud rendering in VR and is deemed best suited for our purposes.

2.2. 2D/3D point cloud selection & labeling

In the 2D/3D labeling approach proposed in [28], the user creates labeled control points through mouse clicks, from which the labels are propagated to other points based on the shortest path in the neighborhood graph. For complex 3D shapes, one part usually contains points from the background or the foreground and is discarded. This procedure is repeated a few times until all remaining points belong to the same class. A typical application is the labeling of 3D intracranial aneurysms [14], where users describe a boundary by clicking several points and the connection between two points is determined by the shortest path. After creating a closed boundary line, users annotate the aneurysm part by selecting a point inside of it. The enclosed area is calculated automatically by propagation from the point to the boundary line along the surface. This method took around 8 h for one dataset. (Using our method, labeling takes 2 min per aneurysm, and the entire dataset can be completed in less than 4 h.) Similarly, the domain-specific application point_labeler has been applied to label the SemanticKITTI dataset [12]. This tool primarily features a mouse-based polygon selector and painting functionality, which can inadvertently lead to over-labeling of background points when labeling points on the front of an object in a 2D display. Additionally, the lack of an undo feature constrains its usability and flexibility in detailed labeling tasks. Pointly.AI [20] is an advanced labeling tool for large point clouds, which is based on Potree [29]. However, Pointly.AI is still a mouse- and 2D-display-based labeling tool, which is not ideal for scenes with a lot of occlusion, such as an indoor scene with walls and cluttered objects. As a mouse usually operates with 2 degrees of freedom, the viewpoint must be adjusted many times before reaching a suitable observation point.

2.3. VR-based point cloud selection & labeling

Lubos et al. [30] presented "Touching the Cloud", a bi-manual user interface for immersive selection and annotation of point clouds. By tracking the user's hands using an OpenNI sensor and displaying them in the virtual environment, the user can touch the virtual 3D point cloud in midair and transform it with pinch gestures inspired by smartphone-based interaction. However, tracking noise and jitter of the hands was observed, impacting the selection accuracy and speed of the interface. Burgess et al. [31] designed an interactive spatial selection method for a large-scale 3D point cloud visualizer. Although it used the Leap Motion 3D hand tracker, the display was 2D. Valentin et al. [32] presented a novel interactive and online approach to 3D scene understanding, which allows users to simultaneously scan their environment whilst interactively segmenting the scene simply by reaching out and
touching any desired object or surface. This work, however, does not accommodate very large point clouds or accurate labeling.

Gaugne et al. [33] proposed a display infrastructure associated with a set of tools that allows archaeologists to interact instantly with point clouds. The resulting framework employs a level-of-detail strategy for rendering large point clouds and supports the positioning and visualization of photographic views for point clouds by using a tracked wand equipped with a joystick or a tactile tablet connected to the application. However, their method does not support point selection or labeling.

Point cloud labeling in virtual reality was proposed for the first time in [34], which considers the challenge of labeling animated point clouds in VR. By labeling a person or an object, segmenting the point clouds in each animation frame, and establishing how the segments in each frame correspond with those in surrounding frames, their method makes it possible to segment time-varying point clouds efficiently, albeit of a rather small size. Large point clouds demand a more efficient rendering strategy. Another labeling approach is given in [35], which is optimized for the annotation of cuboid-like objects in point clouds. Individual points and points in more challenging constellations cannot be processed efficiently. In 2021, Zingsheim et al. [36] proposed a VR-based approach to label live-captured 3D scenes collaboratively. It offers efficient and intuitive 3D labeling and annotation, as in the case of physically touching or painting on surfaces, and allows remotely connected users to simultaneously enter and interact with a shared environment through VR devices. Schmitz et al. [37] presented a semi-automatic method for the interactive segmentation of textured point clouds, but the interactivity limit of their method is around 6 million points. When the number of points increases, their tool is no longer able to stay within their prescribed bounds regarding computation time. Franzluebbers et al. [38] presented a hybrid immersive headset- and desktop-based virtual reality annotation system for point clouds of specific agriculture datasets by deploying continuous level-of-detail rendering, but since they only annotate cotton data, their results were still based on relatively small and sparse point clouds.

None of the aforementioned methods support large-scale dense point clouds as addressed by our work.

Shooting Labels [39] supports larger 3D scenes by first converting point clouds into voxels or meshes. Users have to split the point cloud into several chunks at 3 different resolutions beforehand and then import the split data into the project. Users then interactively shoot pre-defined labels at the 3D scene. However, preprocessing is quite complex and the voxelization decreases the resolution of the original point cloud, so detailed features and color information are lost.

Immersive-Labeler [40,41] proposed a concept based on the full immersion of the user in a VR-based environment that represents the 3D point cloud scene. However, their work did not explain the technical details, and the visible examples are mainly sparse outdoor scenes.

Compared to previous methods, our VR labeling tool processes large point clouds without preprocessing, offers instant visual feedback, and preserves their original color and geometry.

3. Design

3.1. Requirements

As we mentioned in Section 1, labeling of large point clouds of 50 million points or more in VR at around 90 Hz is the essential task. Therefore, our system is designed to support:

1. large point clouds (50 million points or more),
2. high frame rate, high resolution rendering,
3. fluid navigation of the scene and fast scaling, rotation and translation of the entire point cloud,
4. instantaneous selecting, labeling, deleting and undo with immediate visual feedback,
5. intuitive, accurate and fast methods for selecting arbitrarily shaped sets of points anywhere within the point cloud,
6. saving and loading of labels.

To fulfill the above requirements, we developed an immersive labeling method by designing an immersive user interface and a mixed CPU/GPU-based data structure. We first present the labeling interface and basic functions offered by our system in Section 3.2, and describe the technical components necessary to achieve the labeling work in Section 4.

3.2. Interface design

Inspired by Cassie [42], we endeavoured to keep the interface minimal and user interactions simple, to keep the user focused on the labeling task at hand. Our system targets the HTC Vive Pro head-mounted display, with a resolution of 1440 × 1600 per eye and a 90 Hz frame rate. Interaction is achieved through two hand-held Vive controllers as shown in Fig. 1. The button mappings are shown on text plates close to the controllers. The text plates are automatically hidden if the HMD does not look in the direction of a controller (Fig. 2).

Fig. 1. Button mapping of the Vive controller [43].

Fig. 2. Text plates are rendered next to the controllers to show current rendering parameters and to explain current button mappings.

3.2.1. Navigation

Large point clouds typically represent room-scale environments or extensive terrains, necessitating rapid movement through these spaces. To allow the use of our labeling tool in large as well as small spaces, navigation in virtual environments is supported in three ways. The user can teleport to any position in the point cloud using the left controller to shoot a ray to the desired location, as shown in Fig. 3. Alternatively, the user can at any time simply "grab" the point cloud with the controllers and scale, rotate or translate it. Finally, a rapid method of moving in small increments forward or backward in the direction of the right-hand controller is offered through pressing UP and DOWN on the right controller's touch-pad.
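As an illustration of the "grab" interaction, the following C++ sketch (not the authors' implementation; the glm types and the function name grab_delta are assumptions) derives a per-frame transform from the two controller positions: a uniform scale from the change in controller distance and a translation from the midpoint motion. A rotation term from the inter-controller direction can be added in the same way.

// Sketch of a two-handed "grab" transform; illustrative only.
#include <glm/glm.hpp>
#include <glm/gtc/matrix_transform.hpp>
#include <algorithm>

struct Grab { glm::vec3 left, right; };            // controller positions

glm::mat4 grab_delta(const Grab& start, const Grab& now) {
    float scale = glm::length(now.right - now.left) /
                  std::max(glm::length(start.right - start.left), 1e-6f);
    glm::vec3 pivot = 0.5f * (start.left + start.right);
    glm::vec3 move  = 0.5f * (now.left + now.right) - pivot;

    glm::mat4 m(1.0f);
    m = glm::translate(m, pivot + move);           // follow the hands...
    m = glm::scale(m, glm::vec3(scale));           // ...and scale about the grab midpoint
    m = glm::translate(m, -pivot);
    return m;                                      // pre-multiplied onto the cloud's model matrix while the grab is held
}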

Fig. 3. Top: overview of teleportation shown from a top-down perspective in a room-scale virtual environment. The user teleported from the blue sphere to the red sphere, with the violet arrow indicating the direction of teleportation. Bottom: sequential views from the user's perspective of the initial position and the target position (red sphere), respectively.

Fig. 4. The 3 selection primitives for labeling points. Top-left: the point cloud of an outdoor scene; Top-right: labeling the tree using the Sphere primitive; Middle-left: the point cloud of a church; Middle-right: labeling the church using the Cube primitive; Bottom-left: the point cloud of a room; Bottom-right: labeling points out of the wall using the Clipping Plane.

Fig. 5. Left: the original room-scale point cloud; Middle: corrupted or unwanted points (shown in red) are labeled as Delete; Right: cleaning corrupted or unwanted points out of the wall through one or multiple delete operations.

3.2.2. Labeling

The labeling interface is a crucial component of our design, inspired by immersive drawing and sketching tools such as AdaptiBrush [44] and Cassie [42]. The user interacts with the tool using two controllers, one in the dominant hand and one in the non-dominant hand. However, unlike these tools, our labeling tool for large point clouds does not require complex selection tools that burden the GPU. Instead, it is designed to ensure intuitive and rapid selection of the point cloud.

We therefore assign the labeling function to the dominant hand (default: right) and the labels to the non-dominant hand (default: left). For the dominant hand, we designed 3 basic geometric primitives for labeling points, as shown in Fig. 4: Sphere, Cube, and Clipping Plane. The sphere and cube primitives are mainly designed for painting points close to the user. The clipping plane primitive is used for cutting points out of a wall or boundary and provides a method for selecting large numbers of points rapidly. Above the non-dominant hand controller floats a colored sphere array from which the user can choose a selection primitive with the tip of the dominant hand controller.
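A minimal sketch of what containment tests for the three selection primitives can look like, assuming glm vector types; the struct and function names are illustrative and not taken from the paper.

// Sketch of point-in-primitive tests for Sphere, Cube and Clipping Plane.
#include <glm/glm.hpp>

struct Sphere    { glm::vec3 center; float radius; };
struct Cube      { glm::vec3 center; glm::mat3 orientation; float half_extent; };
struct ClipPlane { glm::vec3 point; glm::vec3 normal; };   // selects the half-space behind the plane

inline bool inside(const Sphere& s, const glm::vec3& p) {
    return glm::dot(p - s.center, p - s.center) <= s.radius * s.radius;
}
inline bool inside(const Cube& c, const glm::vec3& p) {
    // transform into the cube's local frame and compare against the half extent
    glm::vec3 q = glm::transpose(c.orientation) * (p - c.center);
    return glm::all(glm::lessThanEqual(glm::abs(q), glm::vec3(c.half_extent)));
}
inline bool inside(const ClipPlane& cp, const glm::vec3& p) {
    return glm::dot(p - cp.point, cp.normal) <= 0.0f;
}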
On the non-dominant hand, labels are presented in a grid attached to the controller below the selection primitives. The first label is dedicated to point deletion, the second is for removing labels from already labeled points. The remaining labels are for user-defined label classes and can be customized with a text file that is read on startup. Two labels are always present: Delete and Restore. To remove noise points and outliers during labeling, the user can select the Delete label, as shown in Fig. 5, to delete corrupted or unwanted points. To relabel uncertain points, the user can select the Restore label to erase the labels and return the labeled points to their original color.

Undo mechanics are one of the standard features of many computer systems, and undo in VR tasks can effectively increase users' success and help them recover from errors and unwanted situations [45]. Labeling large point clouds in VR is particularly error-prone. Therefore, we designed an undo function, which uses the MENU button of the left hand as the Undo button. Undo is done on a per-stroke basis, where each stroke corresponds to a labeling action started by pressing the trigger and ending when the trigger is released again. In this way the user can also undo a labeling operation to go back to previous states after he/she has wrongly added points to the current selection.

3.2.3. Config mode

To support various scene sizes and densities of point clouds, we provide a Config mode to adjust key parameters of the CLOD rendering. The CLOD-factor, which controls the spacing between points based on camera-to-point distance, and the point size can be adjusted using LEFT and RIGHT clicks on the touch-pads of the right and left controllers, respectively. These adjustments allow for precise control over the level of point overlap, with a larger CLOD-factor rendering fewer points.

4. CPU/GPU-based processing

Based on the design of interaction in VR, when labeling points, i.e. changing the color of points, the user must get instant visual feedback; otherwise the user may select wrong points due to rendering latency. To achieve this, our system needs to put both rendering and interaction on the GPU as much as possible, but the current view and the controller pose of the VR device still require uploading from the CPU to the GPU. Aiming at minimizing the latency, we propose a hybrid CPU/GPU data structure that reduces the amount of data transmitted between CPU and GPU. As shown in Fig. 6, a chunk-based strategy is introduced to speed up labeling. Chunks are selected on the CPU based on the current view and the interaction primitive. In this way, GPU-based labeling only has to operate on the chunks of interest. In this section, we provide details of this mixed CPU/GPU-based data structure.
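The split described above can be summarized in a small C++ sketch; the type names are assumptions that mirror the buffer names used in this section, not the authors' actual class layout.

// Sketch of the CPU/GPU state split: chunk metadata lives on the CPU,
// per-point data and labels live in persistent GPU buffers.
#include <cstdint>
#include <vector>
#include <GL/glew.h>

struct AABB  { float min[3], max[3]; };
struct Chunk { uint32_t first_point; uint32_t point_count; AABB bounds; };

struct MixedPointCloudState {
    // CPU side: coarse acceleration data, culled every frame
    std::vector<Chunk> chunk_array;        // "ChunkArray"
    std::vector<uint32_t> visible_chunks;  // "ChunkIndexArray" after culling

    // GPU side: handles of the shader storage buffers
    GLuint point_buffer = 0;          // 16 bytes per point (position, color, LOD)
    GLuint label_buffer = 0;          // one label per point
    GLuint history_entry_buffer = 0;  // ring buffer of undo entries
    GLuint render_point_buffer = 0;   // rebuilt per frame by the reduce pass
};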

Fig. 6. Overview of the mixed CPU/GPU-based data structure for immersive point cloud cleaning and labeling. Data containers are shown in light green/light blue, CPU functions in gray and GPU programs in lilac.

Fig. 7. Left: visualization of the chunks of one of our room-scale point clouds; the blue wire-frame boxes represent the chunks' 3D bounding boxes. Right: visualization of the remaining chunks in the view frustum after culling.

4.1. CPU-based processing

On the CPU side, the main tasks are data IO, chunk generation and the selection of chunks based on frustum culling and the currently chosen interaction primitive.

4.1.1. Point cloud I/O

Point clouds are input and output in OBJ, PLY, or other binary formats, which mainly store the coordinates, optional normal vectors, and RGB values of the point cloud. After loading the point cloud, we use the Potree Converter [29] in the adapted version of [21] to organize points into an octree and assign a level-of-detail (LOD) to each point, which is stored as a single unsigned byte value. In addition, after generating the LOD, the points are split into chunks.

For large point clouds, it is important to support multi-session labeling. To simplify this we defined our own binary file format with the extension .lpc (Large Point Cloud/LOD Point Cloud) to store header information on rendering style, chunk array, and point cloud transformation as well as the point cloud itself with per-point position, optional normal, RGB color, and LOD. With this information, point clouds can be edited multiple times and the user can continue with the previously adjusted render style and point cloud transformation.
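The on-disk layout of .lpc is not specified in detail here; the following is a hypothetical header sketch that is merely consistent with the description above, with all field names assumed.

// Hypothetical .lpc header layout (illustrative assumption, not the published format).
#include <cstdint>

#pragma pack(push, 1)
struct LpcHeader {
    char     magic[4];          // e.g. "LPC1"
    uint32_t point_count;
    uint8_t  has_normals;       // optional per-point normals present?
    uint8_t  render_style;      // circle or square splats
    float    clod_factor;       // previously adjusted render parameters
    float    point_size;
    float    transform[16];     // stored point cloud transformation (column major)
    uint32_t chunk_count;       // chunk array follows the header
};
#pragma pack(pop)
// After the header: chunk_count chunk records, then per point
// position (3 x float), optional normal, RGB color and one LOD byte.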
Fig. 6 illustrates on the left side the organization of the chunks and points into the ChunkArray and the PointArray. If labels are not part of the loaded file format, they are initialized to unlabeled.

4.1.2. Chunk-based selection strategy

When labeling points, it is essential to determine which points intersect with the selection primitive. This calculation is crucial when dealing with two extreme cases: labeling a single point or the entire point cloud. Traversing all points or constructing a neighborhood tree can be quite time-consuming in a VR-based application. Therefore, to mitigate these issues, our method exploits the chunks as an acceleration data structure for point labeling. The chunks are generated from the 3D bounding box of the whole point cloud, as shown in Fig. 7. The 3D bounding box of the whole point cloud is defined by the extents of the point cloud and is uniformly partitioned into chunks of N×N×N dimensions (where N = 10 by default and can be adjusted based on the size of the point cloud). Each entry of the ChunkArray contains the index of the first point, the number of points, and its own bounding box. The bounding box of each chunk is calculated for two main uses: on the one hand, culling techniques can reduce the number of chunks that need to be considered for rendering, and on the other hand, the bounding box is used to judge whether the chunk intersects with the current selection primitive.
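A sketch of the chunk record and of the CPU-side chunk selection, assuming a sphere primitive as the selection shape; the helper and field names are illustrative, and the sphere/AABB test stands in for the tests of the other primitives.

// Sketch of CPU-side chunk selection for labeling (illustrative names).
#include <cstdint>
#include <vector>
#include <algorithm>

struct AABB  { float min[3], max[3]; };
struct Chunk { uint32_t first_point, point_count; AABB bounds; };

// Does the sphere selection primitive touch a chunk's bounding box?
inline bool sphere_intersects_aabb(const float c[3], float r, const AABB& b) {
    float d2 = 0.f;
    for (int i = 0; i < 3; ++i) {
        float v = std::max(b.min[i] - c[i], std::max(0.f, c[i] - b.max[i]));
        d2 += v * v;
    }
    return d2 <= r * r;
}

// Per labeling action: collect only the chunks intersecting the primitive, so the
// GPU label pass is dispatched over a small fraction of the cloud.
std::vector<uint32_t> cull_chunks_for_labeling(const std::vector<Chunk>& chunks,
                                               const float center[3], float radius) {
    std::vector<uint32_t> selected;
    for (uint32_t i = 0; i < chunks.size(); ++i)
        if (sphere_intersects_aabb(center, radius, chunks[i].bounds))
            selected.push_back(i);
    return selected;   // the "ChunkIndexArray" driving the GPU dispatches
}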

4.2. GPU-based processing

On the GPU side, PointBuffer, LabelBuffer, HistoryEntryBuffer, and RenderPointBuffer are created in VRAM. PointBuffer follows the same format of 16 bytes per point as in [21]. PointBuffer and LabelBuffer both have as many elements as the original point cloud. RenderPointBuffer is built for each frame and stores the points that need to be rendered. For better packing it is organized into two buffer objects, one for the 16-byte points (with color and LOD) and one for the point indices that allow access to the corresponding label in the LabelBuffer. The HistoryEntryBuffer stores undo information.

In the following we discuss the four GPU compute programs reduce, render, label, and undo, used to render and manipulate the point cloud in a chunk-based manner. Fig. 6 illustrates the data flow with arrows pointing from the data source to the data receiver. Darker arrows illustrate write access. Mixed-colored arrows illustrate data flow from the CPU to the GPU.
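A hedged sketch of how these buffers can be allocated as OpenGL shader storage buffers; the 16-byte point stride follows the description above, while the binding points, usage hints and sizes are assumptions.

// Sketch of allocating and binding the persistent GPU buffers (illustrative).
#include <cstddef>
#include <cstdint>
#include <GL/glew.h>

struct GpuBuffers { GLuint points, labels, history; };

GpuBuffers create_buffers(size_t point_count, size_t history_capacity) {
    GpuBuffers b{};
    glGenBuffers(1, &b.points);
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, b.points);
    glBufferData(GL_SHADER_STORAGE_BUFFER, point_count * 16, nullptr, GL_STATIC_DRAW);   // 16 bytes/point, data uploaded later

    glGenBuffers(1, &b.labels);
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, b.labels);
    glBufferData(GL_SHADER_STORAGE_BUFFER, point_count * sizeof(uint32_t), nullptr, GL_DYNAMIC_DRAW);

    glGenBuffers(1, &b.history);
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, b.history);
    glBufferData(GL_SHADER_STORAGE_BUFFER, history_capacity * 2 * sizeof(uint32_t), nullptr, GL_DYNAMIC_DRAW);

    // expose them to the compute programs under fixed binding points
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, b.points);
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, b.labels);
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 2, b.history);
    return b;
}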
4.2.1. Reduce

The reduce program is a pure compute shader program that is implemented in the same way as the reduce step in [21] and fills the RenderPointBuffer with the points to be rendered based on the current view. The reduce step in [21] creates a new vertex buffer by iterating through all points in the full point cloud and copying those that match the desired level of detail into the target buffer. We extended the original implementation in two ways. Firstly, we gave the reduce function access to the interaction primitive such that all points inside the interaction primitive are passed on to the RenderPointBuffer. This ensures that the user sees all the points that he or she currently interacts with. Only for the clipping plane primitive we relaxed this strategy and restricted the pass-through of points to a half sphere of 0.5 m around the controller.

The second modification is to exploit the chunks. For this we perform view frustum culling on the CPU and only consider chunks that are at least partly visible in the current view. In Fig. 6 this filtering step is denoted as the cull function and the result is the ChunkIndexArray containing the relevant chunks. Unlike [21], which runs the reduce program once across the whole point cloud, we take the simple approach of dispatching the reduce program once per visible chunk, which transmits less data from CPU to GPU.

To support labeling of points, reduce also stores the point indices in the RenderPointBuffer.
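A sketch of dispatching the reduce program once per visible chunk; the uniform names and work-group size are assumptions rather than the authors' shader interface, and the Chunk record matches the earlier sketch.

// Sketch of per-chunk dispatch of the reduce compute program (illustrative).
#include <cstdint>
#include <vector>
#include <GL/glew.h>

struct AABB  { float min[3], max[3]; };
struct Chunk { uint32_t first_point, point_count; AABB bounds; };

void run_reduce(GLuint reduce_program,
                const std::vector<Chunk>& chunks,
                const std::vector<uint32_t>& visible_chunks) {
    glUseProgram(reduce_program);
    const GLuint local_size = 256;                       // work group size assumed in the shader
    for (uint32_t ci : visible_chunks) {
        const Chunk& c = chunks[ci];
        // tell the shader which slice of the PointBuffer belongs to this chunk
        glUniform1ui(glGetUniformLocation(reduce_program, "u_first_point"), c.first_point);
        glUniform1ui(glGetUniformLocation(reduce_program, "u_point_count"), c.point_count);
        GLuint groups = (c.point_count + local_size - 1) / local_size;
        glDispatchCompute(groups, 1, 1);
    }
    // make the freshly written RenderPointBuffer visible to the draw pass
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT | GL_VERTEX_ATTRIB_ARRAY_BARRIER_BIT);
}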
4.3.2. Point deleting
Deleting unwanted and corrupt points is one of the main tasks in
our work. Deleting points on the CPU can increase latency due to the
need to modify the octree and regenerate level-of-detail information. 5.2. Dataset preparation
To give users instant visual feedback, labels of DELETED are assigned
to the selected points in the labeling buffer LabelBuffer on the For practical labeling, we tested our method on heterogeneous point
GPU. If points are labeled as DELETED, the points are simply skipped clouds including indoor, outdoor, and medical datasets. We discov-
during rendering. This operation can preserve the original structure ered that the open-source indoor datasets [46,47], obtained via fusion
of the point cloud, and so does not have to generate new level-of- methods or low-quality Lidar and comprising up to 20 million points,
detail information whenever points are deleted. Instead of deleting contain sparse points, which place less stress on the GPU. Outdoor point
points instantly, our approach deletes points while exporting them to clouds [12,22] typically have no points in the sky – more than half of
an external file. the view frustum area is free of points – resulting in less rendering stress
on the GPU after culling. To ensure a view frustum densely filled with
4.3.3. Undo operation points during measurements, we acquired our own in-house datasets,
As we mentioned in Section 3, an undo operation is provided. For which contain complex scenes with densely packed and highly occluded
this, a circular buffer HistoryEntryBuffer is created in VRAM. objects. To acquire large point clouds, we scanned several rooms in our
Every time the labeling operation is triggered, the operation and the building using the Leica BLK360 laser scanner. It can capture 360,000
index of the operation are uploaded to HistoryEntryBuffer. For points per second and has additionally panoramic HDR imagery and
this we exploit an atomic counter. On the CPU side we keep track of the thermal imaging capability. Our dataset includes four room-scale point
actions in the ActionStack. When an undo operation is executed, the clouds. The number of points of each point cloud ranges from 21.2M
last labeling operation will be restored by fetching the range of history to 50.9M. The scanned point clouds are shown in Fig. 8.
entries in the top action on the ActionStack. The undo program is
used to roll back all labels in the LabelBuffer to their state before 5.3. Performance
the rolled back action. The history capacity is set to accommodate 2𝑛 − 1
points (𝑛 = 24 by default), where 𝑛 can be adjusted based on the size The performance of our system is determined by two aspects: the
of VRAM. For operations that exceed the history capacity, users can use efficiency of users’ judgment and positioning in the model and their
the RESTORE to erase the labels and relabel points. familiarity with the labeling tool; and the rendering frame rate of large
point clouds while actively labeling. We evaluate our method with
4.3.4. Point cloud rendering extensive experiments: three experiments were conducted to measure
We employ the Continuous Level-of-Detail Point Cloud rendering
the execution time of draw performance, the labeling performance of
algorithm from Schütz, which can render large point clouds in real
the chunk-based strategy, and how long it takes for a skilled user to
time [21]. The render program draws points from RenderPoint-
label a large point cloud. In all experiments, trained users participated.
Buffer, which contains the currently relevant points generated with
Training was done in multiple labeling sessions over 10 types of scenes
the reduce program. Our implementation supports two rendering
in order to familiarize the users with our tool’s button map and in-
styles: a circle or a square. Furthermore, we incorporate label infor-
teraction features. In 5.3.3, the labeling work on large samples was
mation with the help of the additionally stored point index from the
completed by a single skilled user.
LabelBuffer as well as information on the interaction primitive. The
label can discard the point or change its color. If the point is inside of
5.3.1. Evaluation of rendering
the interaction primitive we also change its color to the highlight color.
The execution time is measured using high precision timers pro-
This allows the user to preview selection action as long as the trigger
is not pressed without changing the labels in the LabelBuffer. vided in C++ standard library 〈chrono〉. The draw performance was
measured during real labeling sessions in which the user was actively
5. Result labeling the point cloud during measurement. In addition, the user
was instructed to keep the point cloud in their view frustum during
5.1. Implementation measurement, so that measurements reflect performance during dense
point cloud rendering. Besides the CLOD-factor and the splat size of
Our method is implemented in C++ with OpenGL and OpenVR. The the points, different point densities in the different point clouds also
test system was equipped with an Intel(R) Core(TM) i9-9900K CPU @ affect rendering time. For our acquired point clouds, we measured
3.60 GHz, 64 GB RAM and an Nvidia GeForce RTX 2080 Ti with 11 GB three combinations of the rendering parameters CLOD-factor and size of
VRAM. HTC Vive Pro with lighthouse tracking 2.0 is utilized for the point 𝑃 𝑠𝑖𝑧𝑒: (1.0, 1.0), (1.0, 0.2), and (2.3, 0.2). The first setting (1.0, 1.0)
evaluation of our point cloud labeling tool. is the default, where some overlap exists between rendered points.

6
T. Lin, Z. Yu, M. McGinity et al. Computers & Graphics 124 (2024) 104101

Table 1
Performance of the regular rendering pass.
Model (points)     CLOD   Psize   Cull&Reduce (ms)   Draw (ms, ×2)   Sum (ms)
office_A (21.2M)   1.0    1.0     0.77               8.22            17.21
                   1.0    0.2     0.80               3.01            6.82
                   2.3    0.2     0.71               2.07            4.85
lab_A (50.7M)      1.0    1.0     0.67               9.00            18.67
                   1.0    0.2     0.57               4.42            9.41
                   2.3    0.2     0.56               3.78            8.12
lab_B (50.9M)      1.0    1.0     0.68               9.43            19.54
                   1.0    0.2     0.57               3.73            8.03
                   2.3    0.2     0.55               2.98            6.51
lab_C (34.7M)      1.0    1.0     1.09               11.56           24.21
                   1.0    0.2     1.14               4.16            9.46
                   2.3    0.2     0.92               1.69            4.30

Fig. 9. Comparison between the rendering frame rates of labeling Office_A with and without chunk-based selection over 180 frames. The orange plot shows the decrease of frame rate during labeling and relabeling without chunk-based selection.

Fig. 10. Frame rate over 7200 frames of a real labeling session, which remains relatively steady at 90 Hz.

Fig. 11. Labeling results on an outdoor scene and an indoor scene.

Fig. 12. The ground truth of bildstein1 in Semantic3D rendered in VR; several evident labeling errors are highlighted in the zoomed-in red rectangles.

Modifying the Psize and CLOD-factor can help reduce the overlap without significantly compromising the overall point cloud quality.

The main steps that affect rendering time are CPU-side frustum culling as well as GPU-side reduction and drawing. The labeling is only executed when the user triggers that event. The durations of the frustum culling, reduce, and draw steps were therefore measured separately and a total duration of these steps was calculated.

Table 1 lists for each model (with its number of points) the durations of culling & reduction and of drawing. The total duration is obtained as the sum of culling and reduction plus the drawing time multiplied by 2, due to stereo rendering. As the CLOD method is flexible, the total duration can be limited to 11.1 ms by manually adjusting the CLOD-factor and Psize. Consequently, the rendering frame rate can be held steady at 90 Hz.

5.3.2. Evaluation of chunk-based selection strategy

To validate the chunk-based selection, we compared its performance with non-chunk-based selection on our own dataset. The experiment involved a consistent labeling path (from one fixed 3D position to another 3D position in the room-scale point cloud) and speed (0.7 m/s) to ensure realistic and comparable measurements. Point clouds are uniformly partitioned into chunks of 10 × 10 × 10 dimensions. CLOD-factor = 2.3 and Psize = 0.2 are set for rendering these point clouds. One of the results, shown in Fig. 9, reveals that chunk-based labeling maintains a steady frame rate of 90 Hz, and shows that searching for points only in intersecting chunks, rather than traversing the whole point cloud, successfully allows for point-cloud interactions without significantly impacting the frame rate.

To illustrate the rendering frame rate in actual labeling sessions, we recorded sessions where a user labeled the point cloud lab_B using various selection primitives. As shown in Fig. 10, the frame rates remained around 90 Hz, though there were fluctuations due to the user's sudden movements.

5.3.3. Evaluation on heterogeneous datasets

For large point clouds, we tested the immersive labeling method on several public datasets, Semantic3D [22] and Baidu Apollo [23] (point clouds in the context of autonomous driving), and on our own dataset. The data are classified into 3 types: Indoor, Rural and Urban. The trained users were asked to label our own datasets with 7 labels, namely 1. chair, 2. table, 3. table leg, 4. screen, 5. robot arm, 6. small objects on the table, 7. unlabeled points, and to label part of the Semantic3D and Baidu Apollo datasets with 8 labels, namely 1. man-made terrain, 2. natural terrain, 3. high vegetation, 4. low vegetation, 5. buildings, 6. hard scape, 7. scanning artifacts, 8. cars. The labeling times are shown in Table 2. Part of the labeling results for the Semantic3D dataset are shown in Figs. 11 and 13. As Semantic3D offers a public ground truth for their data, we compare their ground truth, as shown in Fig. 12, with our labeling result, as shown in Fig. 13. The ground truth of the Semantic3D data was rendered in VR; as their labeling is not on the original dense point cloud, some labels are null, but the details of the ground truth can still be checked visually in VR. The main issue in the Semantic3D dataset is that some details were ignored. However, with immersive labeling in our tool, visual feedback is provided instantly to the user, so the user can label details accurately and check the labeling result in an immersive environment; consequently, the errors in Fig. 12 can be avoided with our tool.

Typically, our immersive labeling method effectively avoids the occlusion issue by letting users change their perspective or re-position the point cloud.
Fig. 13. The labeling result of bildstein1 from the Semantic3D dataset by our proposed method.

Fig. 14. Left: using a cuboid selection primitive from the top view; Right: user's view positioned close to the road level. With the immersive experience, occlusion by tree branches is avoided.

Fig. 15. The labeling tasks for the user study. Left: labeling the table of a room-scale point cloud. Right: labeling the building of an outdoor scene.

Table 2
Labeling time on large samples.
Model (points)                Type     Time
office_A (21.2M)              Indoor   44 min 50 s
lab_A (50.7M)                 Indoor   40 min 17 s
lab_B (50.9M)                 Indoor   48 min 10 s
lab_C (34.7M)                 Indoor   57 min 14 s
bildstein1 (29.3M)            Rural    11 min 15 s
bildstein3 (23.8M)            Rural    14 min 38 s
bildstein5 (24.7M)            Rural    14 min 14 s
domfountain1 (35.4M)          Urban    7 min 00 s
domfountain2 (35.2M)          Urban    8 min 17 s
domfountain3 (35.0M)          Urban    10 min 17 s
untermaederbrunnen1 (16.7M)   Rural    27 min 39 s
untermaederbrunnen3 (19.8M)   Rural    22 min 01 s
neugasse (50.1M)              Urban    8 min 37 s
apollo_data_0 (1.81M)         Urban    10 min 02 s
apollo_data_1 (3.36M)         Urban    15 min 09 s
apollo_data_2 (6.12M)         Urban    18 min 41 s
apollo_data_3 (9.41M)         Urban    24 min 09 s

In Fig. 14, when users observe the road from the top view with a traditional 2D/3D labeling tool, part of the tree branches stretch into the road and these redundant branches occlude the road when selecting points. In our tool, users can change their perspective and label the ground directly by using the positioning and scaling functions.

According to Table 2, we found that the number of points does not significantly affect labeling time. The type of scene and the number of labels are the main factors: the more labels, the more time labeling takes. Indoor scenes contain numerous and complicated objects, so they cost more time than others. Urban scenes take less time than rural scenes, because the buildings and streets in urban scenes are more regular, man-made facilities are easier to recognize, and the boundary points between objects are clear and continuous. In rural scenes, by contrast, trees, bushes, courtyards and streets are intertwined and the boundary points are ambiguous, which takes longer to observe and judge. The Apollo autonomous driving point clouds in particular come without color information, which further slows down the labeling task, but the objects can still be recognized in the immersive experience. Furthermore, labeling large point clouds with other tools can be extremely time consuming — many days for a single cloud, especially when users require very detailed results. For instance, our initial trials labeling lab_B with Pointly.AI [20] consumed more than 25 h, and labeling bildstein1 consumed more than 40 h.

5.4. User study

Paramount to the success of any VR application is the user experience. As such, we conducted a user study to assess the subjective experience of performing point cloud labeling with our system. Our study goals were twofold. First, as detailed in Section 5.4.2, we evaluated whether our system achieved fluid navigation of the scene and provided smooth rendering without VR motion sickness for labeling tasks. Second, as outlined in Section 5.4.3, we aimed to verify whether our system, with instant visual feedback in VR, enhanced the labeling efficiency for large point clouds.

5.4.1. Participants

The target users are professionals who have experience with VR and with processing point clouds. To achieve a reasonable level of statistical power [48], we recruited 25 unpaid participants (6 females and 19 males) including staff and students from our university. All of the participants were between 20 and 35 years old and had a computer science background; 23 of 25 had experience with VR, and 5 of 25 had experience with processing point clouds in VR.

5.4.2. Evaluation of rendering performance and interaction design

In this part, we evaluate the rendering performance and interaction design of our immersive labeling tool.

Procedures: Participants needed to warm up in advance to ensure that they were fully familiar with the use of the system. We first provided each participant with a 10-minute tutorial and let him/her do two labeling tasks, as Fig. 15 shows: label the table of lab_B, and label the buildings of the bildstein1 model. The total duration for each participant was around 30 min.

To get first insights into the evaluation of immersive labeling for large point clouds, we split the questions into two parts: on rendering and on interaction design.

The rendering part focused on four questions: (1) whether there are popping artifacts when moving around in the large point cloud; (2) whether there is simulation sickness during movement; (3) whether there is simulation sickness during labeling; (4) whether there is simulation sickness on deletion.

The main questions on interaction design are: (1) whether the current teleport function is helpful for moving in the point cloud; (2) whether the positioning function of the model is helpful for adjusting the pose of the point cloud; (3) whether the scaling function is helpful for adjusting the size of the model; (4) whether the UI design is easy to remember and use.

Data Collection and Analysis: In this user study, participants were asked to rate the labeling tool on 5-point Likert scales (5 is highest); the higher the chosen value, the better the perceived rendering experience. A 5 indicates the best experience, without any simulation sickness or popping artifacts, while 1 implies the worst experience.
Table 4
Wilcoxon signed-rank test of our hypotheses (p-values).
          Pointly.AI–SL   Pointly.AI–Ours   SL–Ours
Scene A   5.96e−08        5.96e−08          5.96e−08
Scene B   1.82e−05        0.141             5.96e−08
Scene C   5.96e−08        5.96e−08          5.96e−08

Fig. 16. Left: the results on rendering performance and simulation sickness. Right: the results on interaction design.

Table 3
Questions and users' input on a scale of 1 to 5.
Question                                                                                          Mean   SD
Whether there are popping artifacts when moving around?                                          4.96   0.20
Whether there is simulation sickness during movement?                                            4.92   0.28
Whether there is simulation sickness during labeling?                                            4.92   0.28
Whether there is simulation sickness on deletion?                                                5.0    0.00
If the scaling function is helpful for adjusting the size of the model?                          4.92   0.28
If the positioning function of the model is helpful for adjusting the pose of the point cloud?   4.96   0.20
If the current teleport function is helpful?                                                     4.92   0.28
If the UI design is easy to remember and use?                                                    4.92   0.28

The rating results for popping artifacts and simulation sickness are shown in Fig. 16. The number of people giving each rating is represented by light blue, orange, gray and yellow, respectively. As the results show, there is slight simulation sickness during rendering, and there is no simulation sickness during labeling and cleaning of points.

The users' ratings for interaction design are also shown in Fig. 16, again with the number of people per rating represented by light blue, orange, gray and yellow, respectively. According to the rating results, participants are satisfied with our interaction design. All functions could be learned and used easily by following the instructions on the text plates. Table 3 reveals that the average values are close to 5.0 and the standard deviations (SD) are comparatively small, indicating that the performance of our method is both stable and reliable.

5.4.3. Comparison to existing labeling tools

For further evaluation of the efficiency of our immersive labeling system, we made a comparison with Pointly.AI [20] and ShootingLabels [39]. Pointly.AI is a web-based labeling tool using a 2D display and a mouse as input device. ShootingLabels (abbreviated as SL in this section) and our tool are VR-based. For the comparison we conducted another user study according to a repeated-measures approach, where each participant used all three tools and performed with each tool the same three labeling tasks on the three room-scale scenes A, B, and C from the Stanford 2D-3D-Semantics database [24] shown in Fig. 17.

Procedure: The participants did not know which tool is ours. To avoid dependence on the order in which the tools are used, we randomized the tool order over the participants. In the beginning the participants were trained on all three tools in the same way: 10 min for familiarizing with the features of the tool and 20 min performing simple training tasks on the test scenes A, B, and C. After training, the participants were asked to perform the following three labeling tasks with all tools in the randomized order: (1) an independent object (chair) in scene A, (2) two easy-to-access objects (book and tabletop) in scene B, and (3) objects with heavy occlusion (multiple chairs and a tabletop) in scene C. Times for labeling completion were measured in seconds and labeling accuracy was determined by visual inspection, for two reasons: the free versions of Pointly.AI and SL do not support export of labeled datasets, and the labels provided in datasets such as Semantic3D and Stanford 2D-3D-Semantics do not provide error-free labeling. After the users completed all labeling tasks, their feedback and remarks were collected directly by having them fill in a questionnaire.

With the user study we wanted to validate two hypotheses:

• H1: With our tool, point cloud labeling tasks can be performed faster than with Pointly.AI and ShootingLabels.
• H2: The VR-based tools allow faster labeling of point clouds than the 2D-display-based tool Pointly.AI.

Results: Fig. 18 plots all task completion times, and the corresponding boxplots are shown on the right side of Fig. 17.

For none of the nine scene–tool combinations did the labeling completion time follow a normal distribution. Therefore, we used the non-parametric Wilcoxon signed-rank test to compare pairs of tools on the three labeling tasks. We tested the null hypothesis that the completion times for two tools are the same, based on a significance level of α = 0.05. Of the resulting p-values shown in Table 4, eight out of nine pairwise comparisons demonstrated extraordinarily low values (p ≤ 1.82e−05), such that the null hypothesis must be rejected, validating our hypothesis H1 that our method is, with very high significance, faster than the other two methods. The exceptional case with the high p-value of p = 0.141 is the comparison of Pointly.AI and our method for scene B, where the labeling task could be solved efficiently with a box primitive available in both tools. On the other hand, H2 could not be validated, as the 2D-display-based Pointly.AI was faster than the VR-based SL.

In addition to the comparison of task completion times, we analyzed the preference of the participants for the different tools. While our method and Pointly.AI preserve the detail of the point cloud, ShootingLabels is based on a voxelization and furthermore removes the color information. In scene A, labeling the chair requires users to navigate themselves under the chair so they can label its underside; among the three methods, only ours allowed this successfully. In scene B, the book is on the table with no occlusion, so Pointly.AI can also achieve this with its box selector. However, in scene C, heavy occlusions exist between table and chairs; here Pointly.AI has to select some parts and then unselect the overlapping parts, and SL has to restore over-labeled parts. With Pointly.AI and SL it is not easy to change the perspective, so these tools took more time in the user study than our method. From this we conclude that occlusion can be dealt with better using our immersive labeling tool.

Based on these considerations we asked the participants in the questionnaire which tool provides the best visual presentation of the point cloud, and which tool provides the easiest navigation. Furthermore, we asked the users to rank the tools according to their overall preference. Of the 25 participants, 23 rated our tool as providing the best visual presentation as well as the easiest navigation; only 2 participants selected Pointly.AI as best in visual presentation and easiest for navigation. In the general ranking of the tools we found four different orders in decreasing preference: Ours → Pointly.AI → SL (20 times), Ours → SL → Pointly.AI (2 times), Pointly.AI → Ours → SL (2 times), Pointly.AI → SL → Ours (1 time).
Fig. 17. The results of the comparison among scenes A, B and C with three labeling methods. The first column is Pointly.AI (2D screen based labeling tool), the second column
represents ShootingLabels (VR based labeling tool) and the third column shows our method (VR based labeling tool). The right-most column shows the statistics of task completion
time in seconds.

For statistical significance testing of the null hypotheses that there is no difference between the tools, we again used α = 0.05. For both visual presentation and navigation, the chi-squared test provided a p-value of 7.19e−05, so the null hypothesis can be rejected with very high significance. As post-hoc tests we compared each pair of tools and found that only the comparisons between our tool and Pointly.AI (p = 0.0063) and between our tool and SL (p = 0.00041) showed statistically significant differences, whereas the comparison between SL and Pointly.AI (p = 0.53) does not allow the null hypothesis to be rejected. The general tool ranking was analyzed with the Friedman test, which is suitable for ordinal data. Again we obtained a very low p-value of 9.05e−09 for the null hypothesis that all tools are equally preferred. Post-hoc tests between Pointly.AI and SL, Pointly.AI and ours, and SL and ours resulted in p-values of 0.00014, 0.00346, and 8.94e−07, respectively. Thus, together with the effect sizes, all three statistical evaluations show a very high significance for the participants' preference of our tool over the other tools.
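The reported Friedman p-value can be checked directly against the ranking counts given above. Assuming that every participant gave a strict order of the three tools (no tied ranks), the rank sums over the n = 25 participants and k = 3 tools are R_Ours = 20·1 + 2·1 + 2·2 + 1·3 = 29, R_Pointly.AI = 49 and R_SL = 72, so that the Friedman statistic evaluates to

χ²_F = 12 / (n·k·(k+1)) · Σ_j R_j² − 3·n·(k+1) = 12 / (25·3·4) · (29² + 49² + 72²) − 3·25·4 = 337.04 − 300 = 37.04.

Referred to a χ² distribution with k − 1 = 2 degrees of freedom, this gives p = e^(−37.04/2) ≈ 9.05e−09, which reproduces the p-value reported above.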
Based on the results of the statistical tests for task completion times and preference rankings, it is evident that our tool provides a superior visual presentation of point clouds and easier navigation compared to the two other existing tools. Despite being a VR-based tool, SL was rated even below the 2D-display-based tool Pointly.AI in terms of user experience. We speculate that this is due to its lack of color information and its reduced visual presentation quality.

Fig. 18. The results of the task completion times of individual users. The horizontal axis represents participants 1 to 25.
6. Limitations and discussion

In the previous Sections 5.4.2 and 5.4.3, we illustrated how users label points in an immersive environment and compared our method with two state-of-the-art methods. Our approach shows great advantages in terms of positioning and detail observation. In the immersive environment, users dive into the 3D point cloud and can locate and select target points intuitively, rapidly and accurately. The interaction design proves fast and fluid. Instant visual feedback, undo and label-clearing operations help users label the point cloud accurately. Exploiting the scaling functionality, users can observe the point cloud at arbitrarily close scales. The Move Around function (moving in small increments forward or backward) and the Positioning function ("grabbing" the point cloud with the controllers) were the most convenient and flexible combination for navigating inside the point cloud and were used most often to adjust the viewpoint during labeling.

Admittedly, labeling a large 3D point cloud is complex. In the user study, manual labeling of large point clouds took a long time, and there are far too many variables that need to be controlled. Even if the variable of the labeled target area is controlled, great uncertainty remains in the interaction process. We admit that the user study could be extended, but we would also like to point out that we are the first to perform a user-study-based comparison of labeling tools for large point clouds.

On the other hand, our acceleration data structure successfully balances rendering and interaction in virtual reality. The chunk-based data structure incurs very little cost on the CPU side, yet it significantly reduces the workload of traversing the point cloud. While ensuring rendering performance in an immersive environment, it greatly enhances the ability to interact with large point clouds. It not only allows changing the color of points in an immersive environment but also offers the potential to change the geometry (3D position) of points, making it a very promising data structure for 3D scene rearrangement, 3D modeling and the metaverse.
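To illustrate how such a chunk-based structure keeps the CPU side cheap, the following minimal C++ sketch shows a spherical labeling brush that first tests the bounding sphere of each chunk and only then touches the points of intersecting chunks. This is a simplified illustration with assumed names and layout (Chunk, label_sphere, label_buffer), not the implementation used in our system; in particular, the fine per-point pass of our tool runs in a GPU label program rather than on the CPU.

    // Illustrative sketch only: a chunked point cloud lets the CPU reject whole
    // chunks against a spherical selection brush, so that only the points of
    // intersecting chunks need to be touched. The per-point pass shown here on
    // the CPU corresponds to the GPU label program in the real system.
    #include <cmath>
    #include <cstdint>
    #include <vector>

    struct Vec3 { float x, y, z; };

    static float dist2(const Vec3& a, const Vec3& b) {
        float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
        return dx * dx + dy * dy + dz * dz;
    }

    // A chunk references a contiguous point range and caches a bounding sphere.
    struct Chunk {
        std::uint32_t first = 0, count = 0; // range [first, first + count) in 'points'
        Vec3 center{};                      // bounding sphere center of the chunk
        float radius = 0.f;                 // bounding sphere radius of the chunk
    };

    struct ChunkedPointCloud {
        std::vector<Vec3> points;                // all points, stored chunk by chunk
        std::vector<std::uint32_t> label_buffer; // per-point labels (the "LabelBuffer")
        std::vector<Chunk> chunks;

        // Label all points inside a spherical brush of the VR controller.
        void label_sphere(const Vec3& brush_center, float brush_radius, std::uint32_t label) {
            const float r2 = brush_radius * brush_radius;
            for (const Chunk& c : chunks) {
                const float d = std::sqrt(dist2(c.center, brush_center));
                if (d - c.radius > brush_radius)
                    continue; // chunk entirely outside the brush: none of its points is visited
                // A chunk that lies entirely inside the brush (d + c.radius <= brush_radius)
                // could be handled with a single chunk-level entry, which is the
                // ChunkLabelBuffer idea discussed at the end of this section.
                for (std::uint32_t i = c.first; i < c.first + c.count; ++i)
                    if (dist2(points[i], brush_center) <= r2)
                        label_buffer[i] = label; // fine per-point pass
            }
        }
    };

The intended effect is that the CPU-side cost of a selection grows with the number of chunks rather than with the number of points, while per-point work is confined to the chunks that actually intersect the brush.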
However, we did observe unexpected issues with rendering. For instance, one user noticed slight popping artifacts and two users suffered from mild simulation sickness, which may have been caused by the chunk-wise data structure. To solve these issues, we may allocate a
labeling buffer ChunkLabelBuffer for chunks, similar to the LabelBuffer, and pass it to the label program so that users can instantly label whole chunks without touching the underlying points (cf. the sketch above); in this way, the execution time of the label GPU program is expected to be reduced.

Furthermore, our current work focuses on bridging the gap between rendering large point clouds and fluid interaction in VR by implementing semantic annotation. Beyond semantic labeling, however, our system has the potential to be extended to instance segmentation by adding an interaction model for assigning instance identifiers. In VR, users could assign instance identifiers to points labeled with semantic information through the VR controller, visually checking whether each point has been labeled with both semantic information and an instance identifier, thus enabling panoptic segmentation.
7. Conclusions and future work

In this paper, we presented an immersive method for labeling large point clouds in VR with an intuitive interaction paradigm. To obtain high frame rates and instant visual feedback with large point clouds, we designed a novel mixed CPU/GPU-based data handling scheme that allows users to instantly select points. This mixed CPU/GPU-based data structure greatly improved the efficiency of labeling while preserving the real-time rendering of large point clouds. Furthermore, the experiments and the user study show that the basic functions and interactions are robust. Our method can be applied to arbitrary point cloud datasets.

In future work, we will leverage the chunk-based labeling discussed in the previous section to improve performance further. To ensure a high frame rate during immersive point cloud labeling, we plan to investigate automated solutions for adapting rendering parameters as well as kd-tree- or octree-based acceleration data structures for spatial queries and modification. In addition, we plan more interaction features, such as interactively changing the position of points and copying and pasting points using our mixed CPU/GPU data structure. We also plan to investigate the use of multiple simultaneous perspectives inside VR, thereby providing the best of both a windowed 2D interface and VR. The source code of this work will be made available at https://github.com/lintianfang/Immersive_labeling_pc under the MIT license. We expect that our work can be beneficial to a diversity of related fields that use point cloud labeling and segmentation.

CRediT authorship contribution statement

Tianfang Lin: Writing – review & editing, Writing – original draft, Methodology. Zhongyuan Yu: Writing – review & editing, Validation, Data curation. Matthew McGinity: Writing – review & editing, Supervision. Stefan Gumhold: Writing – review & editing, Supervision, Methodology.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

The authors do not have permission to share data.

Acknowledgments

This work was funded by the German Research Foundation (DFG, Deutsche Forschungsgemeinschaft) as part of Germany's Excellence Strategy – EXC 2050/1 – Project ID 390696704 – Cluster of Excellence "Centre for Tactile Internet with Human-in-the-Loop" (CeTI) of Technische Universität Dresden. This project was also co-funded by the European Union and co-financed from tax revenues on the basis of the budget adopted by the Saxon State Parliament (Project No. 100690214). It was also supported by the Federal Ministry of Education and Research of Germany and by the Sächsische Staatsministerium für Wissenschaft, Kultur und Tourismus in the programme Center of Excellence for AI-research "Center for Scalable Data Analytics and Artificial Intelligence Dresden/Leipzig" (ScaDS.AI), and was also partially funded by the Federal Ministry of Education and Research of Germany in the joint project 6G-life (16KISK002).

Appendix A. Supplementary data

Supplementary material related to this article can be found online at https://doi.org/10.1016/j.cag.2024.104101.

References

[1] Kühner T, Kümmerle J. Large-scale volumetric scene reconstruction using lidar. In: 2020 IEEE international conference on robotics and automation. IEEE; 2020, p. 6261–7.
[2] Wang J, Sun W, Shou W, Wang X, Wu C, Chong H-Y, et al. Integrating BIM and LiDAR for real-time construction quality control. J Intell Robot Syst 2015;79:417–32.
[3] Lozić E, Štular B. Documentation of archaeology-specific workflow for airborne LiDAR data processing. Geosciences 2021;11(1):26.
[4] Esmaeili H, Thwaites H, Woods PC. Workflows and challenges involved in creation of realistic immersive virtual museum, heritage, and tourism experiences: a comprehensive reference for 3D asset capturing. In: 2017 13th international conference on signal-image technology & internet-based systems. IEEE; 2017, p. 465–72.
[5] Gonzalez-Aguilera D, Crespo-Matellan E, Hernandez-Lopez D, Rodriguez-Gonzalvez P. Automated urban analysis based on lidar-derived building models. IEEE Trans Geosci Remote Sens 2012;51(3):1844–51.
[6] Guan H, Li J, Cao S, Yu Y. Use of mobile LiDAR in road information inventory: A review. Int J Image Data Fusion 2016;7(3):219–42.
[7] Maiese A, Manetti AC, Ciallella C, Fineschi V. The introduction of a new diagnostic tool in forensic pathology: LiDAR sensor for 3D autopsy documentation. Biosensors 2022;12(2):132.
[8] Dong P, Chen Q. LiDAR remote sensing and applications. CRC Press; 2017.
[9] Wang K, Franklin SE, Guo X, Cattet M. Remote sensing of ecology, biodiversity and conservation: a review from the perspective of remote sensing specialists. Sensors 2010;10(11):9647–67.
[10] Statham N. Use of photogrammetry in video games: a historical overview. Games Culture 2020;15(3):289–307.
[11] Ivsic L, Rajcic N, McCormack J, Dziekan V. The art of point clouds: 3D LiDAR scanning and photogrammetry in science & art. In: 10th international conference on digital and interactive arts. 2021, p. 1–8.
[12] Behley J, Garbade M, Milioto A, Quenzel J, Behnke S, Stachniss C, Gall J. SemanticKITTI: A dataset for semantic scene understanding of lidar sequences. In: Proceedings of the IEEE/CVF international conference on computer vision. 2019, p. 9297–307.
[13] Zolanvari S, Ruano S, Rana A, Cummins A, da Silva RE, Rahbar M, et al. DublinCity: Annotated LiDAR point cloud and its applications. 2019, arXiv preprint arXiv:1909.03613.
[14] Yang X, Xia D, Kin T, Igarashi T. IntrA: 3d intracranial aneurysm dataset for deep learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020, p. 2656–66.
[15] Huang J, You S. Point cloud labeling using 3d convolutional neural network. In: 2016 23rd international conference on pattern recognition. IEEE; 2016, p. 2670–5.
[16] Anolytics. 3D point cloud annotation services | 3D point cloud dataset. 2022, https://www.anolytics.ai/3d-point-cloud-annotation/.
[17] Lidar 3D point cloud semantic segmentation annotation services. 2022, https://www.cogitotech.com/lidar-3d-point-cloud-annotation.
[18] Mindy. 3D point cloud annotation company. 2022, https://mindy-support.com/services-post/3d-point-cloud/.
[19] Sama. 3D point cloud annotation for LiDAR & radar. 2022, https://www.sama.com/3d-lidar-radar-point-cloud-annotation/.


[20] Pointly GmbH. 2024, https://pointly.ai/.
[21] Schütz M, Krösl K, Wimmer M. Real-time continuous level of detail rendering of point clouds. In: 2019 IEEE conference on virtual reality and 3D user interfaces. IEEE; 2019, p. 103–10.
[22] Hackel T, Savinov N, Ladicky L, Wegner JD, Schindler K, Pollefeys M. Semantic3D.net: A new large-scale point cloud classification benchmark. 2017, arXiv preprint arXiv:1704.03847.
[23] Liao M, Lu F, Zhou D, Zhang S, Li W, Yang R. DVI: Depth guided video inpainting for autonomous driving. In: European conference on computer vision. Springer; 2020, p. 1–17.
[24] Armeni I, Sax S, Zamir AR, Savarese S. Joint 2d-3d-semantic data for indoor scene understanding. 2017, arXiv preprint arXiv:1702.01105.
[25] Gobbetti E, Marton F. Layered point clouds: a simple and efficient multiresolution structure for distributing and rendering gigantic point-sampled models. Comput Graph 2004.
[26] Futterlieb J, Teutsch C, Berndt D. Smooth visualization of large point clouds. IADIS Int J Comput Sci Inf Syst 2016;11(2):146–58.
[27] Discher S, Masopust L, Schulz S. A point-based and image-based multi-pass rendering technique for visualizing massive 3D point clouds in VR environments. J WSCG 2018;26:76–84.
[28] Monica R, Aleotti J, Zillich M, Vincze M. Multi-label point cloud annotation by selection of sparse control points. In: 2017 international conference on 3D vision (3DV). IEEE; 2017, p. 301–8.
[29] Schütz M. Potree: rendering large point clouds in web browsers [Ph.D. thesis], Wien; 2015.
[30] Lubos P, Beimler R, Lammers M, Steinicke F. Touching the cloud: Bimanual annotation of immersive point clouds. In: 2014 IEEE symposium on 3D user interfaces. IEEE; 2014, p. 191–2.
[31] Burgess R, Falcão AJ, Fernandes T, Ribeiro RA, Gomes M, Krone-Martins A, et al. Selection of large-scale 3d point cloud data using gesture recognition. In: Doctoral conference on computing, electrical and industrial systems. Springer; 2015, p. 188–95.
[32] Valentin J, Vineet V, Cheng M-M, Kim D, Shotton J, Kohli P, et al. SemanticPaint: Interactive 3d labeling and learning at your fingertips. ACM Trans Graph 2015;34(5):1–17.
[33] Gaugne R, Petit Q, Barreau J-B, Gouranton V. Interactive and immersive tools for point clouds in archaeology. In: ICAT-EGVE 2019 - international conference on artificial reality and telexistence - eurographics symposium on virtual environments. 2019, p. 1–8.
[34] Stets JD, Sun Y, Corning W, Greenwald SW. Visualization and labeling of point clouds in virtual reality. In: SIGGRAPH Asia 2017 posters. 2017, p. 1–2.
[35] Wirth F, Quehl J, Ota J, Stiller C. PointAtMe: efficient 3d point cloud labeling in virtual reality. In: 2019 IEEE intelligent vehicles symposium. IEEE; 2019, p. 1693–8.
[36] Zingsheim D, Stotko P, Krumpen S, Weinmann M, Klein R. Collaborative VR-based 3D labeling of live-captured scenes by remote users. IEEE Comput Graph Appl 2021;41(4):90–8.
[37] Schmitz P, Suder S, Schuster K, Kobbelt L. Interactive segmentation of textured point clouds. In: VMV. 2022, p. 25–32.
[38] Franzluebbers A, Li C, Paterson A, Johnsen K. Virtual reality point cloud annotation. In: Proceedings of the 2022 ACM symposium on spatial user interaction. 2022, p. 1–11.
[39] Ramirez PZ, Paternesi C, De Luigi L, Lella L, De Gregorio D, Di Stefano L. Shooting labels: 3d semantic labeling by virtual reality. In: 2020 IEEE international conference on artificial intelligence and virtual reality. IEEE; 2020, p. 99–106.
[40] Doula A, Güdelhöfer T, Matviienko A, Mühlhäuser M, Sanchez Guinea A. Immersive-labeler: Immersive annotation of large-scale 3d point clouds in virtual reality. In: ACM SIGGRAPH 2022 posters. 2022, p. 1–2.
[41] Doula A, Güdelhöfer T, Matviienko A, Mühlhäuser M, Guinea AS. PointCloudLab: An environment for 3D point cloud annotation with adapted visual aids and levels of immersion. In: 2023 IEEE international conference on robotics and automation. IEEE; 2023, p. 11640–6.
[42] Yu E, Arora R, Stanko T, Bærentzen JA, Singh K, Bousseau A. Cassie: Curve and surface sketching in immersive environments. In: Proceedings of the 2021 CHI conference on human factors in computing systems. 2021, p. 1–14.
[43] HTC Vive User Guide. 2020, https://developer.vive.com/resources/hardware-guides/vive-pro-specs-user-guide/.
[44] Rosales E, Araújo C, Rodriguez J, Vining N, Yoon D, Sheffer A. AdaptiBrush: adaptive general and predictable VR ribbon brush. ACM Trans Graph 2021;40(6):247–1.
[45] Rasch J, Perzl F, Weiss Y, Müller F. Just undo it: Exploring undo mechanics in multi-user virtual reality. In: Proceedings of the CHI conference on human factors in computing systems. 2024, p. 1–14.
[46] Yeshwanth C, Liu Y-C, Nießner M, Dai A. ScanNet++: A high-fidelity dataset of 3D indoor scenes. In: Proceedings of the international conference on computer vision (ICCV). 2023.
[47] Guo Y, Li Y, Ren D, Zhang X, Li J, Pu L, et al. LiDAR-Net: A real-scanned 3D point cloud dataset for indoor scenes. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2024, p. 21989–99.
[48] Nielsen J, Landauer TK. A mathematical model of the finding of usability problems. In: Proceedings of the INTERACT'93 and CHI'93 conference on human factors in computing systems. 1993, p. 206–13.
