
Direct Voxel Grid Optimization:

Super-fast Convergence for Radiance Fields Reconstruction

Cheng Sun¹,²   Min Sun¹,³   Hwann-Tzong Chen¹,⁴

[email protected]   [email protected]   [email protected]

¹ National Tsing Hua University   ² ASUS AICS Department   ³ Joint Research Center for AI Technology and All Vista Healthcare   ⁴ Aeolus Robotics

Abstract

We present a super-fast convergence approach to reconstructing the per-scene radiance field from a set of images that capture the scene with known poses. This task, which is often applied to novel view synthesis, was recently revolutionized by Neural Radiance Field (NeRF) for its state-of-the-art quality and flexibility. However, NeRF and its variants require a lengthy training time ranging from hours to days for a single scene. In contrast, our approach achieves NeRF-comparable quality and converges rapidly from scratch in less than 15 minutes with a single GPU. We adopt a representation consisting of a density voxel grid for scene geometry and a feature voxel grid with a shallow network for complex view-dependent appearance. Modeling with explicit and discretized volume representations is not new, but we propose two simple yet non-trivial techniques that contribute to fast convergence speed and high-quality output. First, we introduce post-activation interpolation on voxel density, which is capable of producing sharp surfaces at lower grid resolution. Second, direct voxel density optimization is prone to suboptimal geometry solutions, so we robustify the optimization process by imposing several priors. Finally, evaluation on five inward-facing benchmarks shows that our method matches, if not surpasses, NeRF's quality, yet it only takes about 15 minutes to train from scratch for a new scene. Code: https://github.com/sunset1995/DirectVoxGO.

[Figure 1: (a) the synthesized novel view by our method at three training checkpoints, reaching 23.64 PSNR at 2.33 mins, 32.72 PSNR at 5.07 mins, and 34.22 PSNR at 13.72 mins; (b) the training curves of different methods on the Lego scene, with the training time of each method measured on our machine with a single NVIDIA RTX 2080 Ti GPU.]

Figure 1. Super-fast convergence by our method. The key to our speedup is to optimize the volume density modeled in a dense voxel grid directly. Note that our method needs neither a conversion step from any trained implicit model (e.g., NeRF) nor cross-scene pretraining, i.e., our voxel grid representation is directly and efficiently trained from scratch for each scene.

1. Introduction

Achieving free-viewpoint navigation of 3D objects or scenes from only a set of calibrated images as input is a demanding task. For instance, it enables online product showcases to provide an immersive user experience compared to static image demonstration. Recently, Neural Radiance Fields (NeRFs) [37] have emerged as powerful representations yielding state-of-the-art quality on this task.

Despite its effectiveness in representing scenes, NeRF is known to be hampered by the need for lengthy training time and its inefficiency in rendering new views. This makes NeRF infeasible for many application scenarios. Several follow-up methods [15, 18, 29, 30, 42, 43, 66] have shown significant FPS speedups in the testing phase, some of which even achieve real-time rendering. However, only a few methods show training-time speedups, and the improvements are either not comparable to ours [1, 10, 31] or lead to worse quality [6, 59]. On a single-GPU machine, several hours of per-scene optimization or a day of pretraining is typically required.

To reconstruct a volumetric scene representation from a set of images, NeRF uses a multilayer perceptron (MLP) to implicitly learn the mapping from a queried 3D point (with a viewing direction) to its colors and densities. The queried properties along a camera ray can then be accumulated into a pixel color by volume rendering techniques. Our work takes inspiration from the recent success [15, 18, 66] of using a classic voxel grid to explicitly store the scene properties,
which enables real-time rendering and shows good quality. However, their methods cannot train from scratch and need a conversion step from the trained implicit model, which causes a bottleneck in the training time.

The key to our speedup is to use a dense voxel grid to directly model the 3D geometry (volume density). Developing an elaborate strategy for view-dependent colors is not in the main scope of this paper, and we simply use a hybrid representation (feature grid with shallow MLP) for colors. Directly optimizing the density voxel grid leads to super-fast convergence but is prone to suboptimal solutions, where the method allocates "cloud" in free space and tries to fit the photometric loss with the cloud instead of searching for a geometry with better multi-view consistency. Our solution to this problem is simple and effective. First, we initialize the density voxel grid to yield opacities very close to zero everywhere to avoid the geometry solutions being biased toward the cameras' near planes. Second, we give a lower learning rate to voxels visible to fewer views, which can avoid redundant voxels that are allocated just for explaining the observations from a small number of views. We show that the proposed solutions successfully avoid the suboptimal geometry and work well on the five datasets.

Using the voxel grid to model volume density still faces a challenge in scalability. For parsimony, our approach automatically finds a BBox that tightly encloses the volume of interest to allocate the voxel grids. Besides, we propose post-activation—applying all the activation functions after trilinearly interpolating the density voxel grid. Previous work either interpolates the voxel grid for the activated opacity or uses nearest-neighbor interpolation, which results in a smooth surface in each grid cell. Conversely, we prove mathematically and empirically that the proposed post-activation can model (beyond) a sharp linear surface within a single grid cell. As a result, we can use fewer voxels to achieve better quality—our method with 160³ dense voxels already outperforms NeRF in most cases.

In summary, we have two main technical contributions. First, we implement two priors to avoid suboptimal geometry in direct voxel density optimization. Second, we propose the post-activated voxel-grid interpolation, which enables sharp boundary modeling at lower grid resolution. The resulting key merits of this work are highlighted as follows:
• Our convergence speed is about two orders of magnitude faster than NeRF—reducing training time from 10–20 hours to 15 minutes on our machine with a single NVIDIA RTX 2080 Ti GPU, as shown in Fig. 1.
• We achieve visual quality comparable to NeRF at a rendering speed that is about 45× faster.
• Our method does not need cross-scene pretraining.
• Our grid resolution is about 160³, while the grid resolution in previous work [15, 18, 66] ranges from 512³ to 1300³ to achieve NeRF-comparable quality.

2. Related work

Representations for novel view synthesis. Image synthesis from novel viewpoints given a set of images capturing the scene is a long-standing task with rich studies. Previous work has presented several scene representations reconstructed from the input images to synthesize the unobserved viewpoints. Lumigraph [4, 16] and light field representations [7, 23, 24, 46] directly synthesize novel views by interpolating the input images but require very dense scene capture. Layered depth images [11, 45, 47, 57] work for sparse input views but rely on depth maps or estimated depth with sacrificed quality. Mesh-based representations [8, 54, 58, 63] can run in real time but have a hard time with gradient-based optimization unless template meshes are provided. Recent approaches employ 2D/3D convolutional neural networks (CNNs) to estimate multiplane images (MPIs) [12, 26, 36, 51, 56, 71] for forward-facing captures or voxel grids [17, 32, 48] for inward-facing captures. Our method uses gradient descent to optimize voxel grids directly and does not rely on neural networks to predict the grid values, and we still outperform the previous CNN-based works [17, 32, 48] by a large margin.

Neural radiance fields. Recently, NeRF [37] has stood out as a prevalent method for novel view synthesis with rapid progress; it takes a moderate number of input images with known camera poses. Unlike traditional explicit and discretized volumetric representations (e.g., voxel grids and MPIs), NeRF uses coordinate-based multilayer perceptrons (MLPs) as an implicit and continuous volumetric representation. NeRF achieves appealing quality and has good flexibility, with many follow-up extensions to various setups, e.g., relighting [2, 3, 50, 70], deformation [13, 38–40, 55], self-calibration [19, 27, 28, 35, 61], meta-learning [52], dynamic scene modeling [14, 25, 33, 41, 64], and generative modeling [5, 22, 44]. Nevertheless, NeRF has the unfavorable limitations of lengthy training progress and slow rendering speed. In this work, we mainly follow NeRF's original setup, while our method optimizes the volume density explicitly encoded in a voxel grid to speed up both training and testing by a large margin with comparable quality.

Hybrid volumetric representations. To combine NeRF's implicit representation and traditional grid representations, the coordinate-based MLP is extended to also condition on the local features in the grid. Recently, hybrid voxel [18, 30] and MPI [62] representations have shown success in fast rendering speed and result quality. We use a hybrid representation to model view-dependent color as well.

Fast NeRF rendering. NSVF [30] uses an octree in its hybrid representation to avoid redundant MLP queries in free
[Figure 2: overview diagram with panels (a) Volume rendering (Sec. 3), (b) Our scene representation (Sec. 5.2), (c) Coarse geometry searching (Sec. 5.1), and (d) Fine detail reconstruction (Sec. 5.2); rendered pixels are accumulated by Eq. (2), supervised by the observed pixels via the photometric loss (Eq. (3)) and the priors, using post-activation (Sec. 4), a tight bbox found in the coarse stage, free-space skipping, and a feature grid with a shallow MLP queried on the 3D position and viewing direction to render novel views.]

Figure 2. Approach overview. We first review NeRF in Sec. 3. In Sec. 4, we present a novel post-activated density voxel grid to support sharp surface modeling in lower grid resolutions. In Sec. 5, we show our approach to the reconstruction of radiance field with super-fast convergence, where we first find a coarse geometry in Sec. 5.1 and then reconstruct the fine details and view-dependent effects in Sec. 5.2.

space. However, NSVF still needs many training hours due to the deep MLP in its representation. Recent methods further use thousands of tiny MLPs [43] or explicit volumetric representations [15, 18, 62, 66] to achieve real-time rendering. Unfortunately, gradient-based optimization is not directly applicable to their methods due to their topological data structures or the lack of priors. As a result, these methods [15, 18, 43, 62, 66] still need a conversion step from a trained implicit model (e.g., NeRF) to their final representation that supports real-time rendering. Their training time is still burdened by the lengthy implicit model optimization.

Fast NeRF convergence. Recent works that focus on the fewer-input-views setup also bring faster convergence as a side benefit. These methods rely on generalizable pre-training [6, 59, 67] or external MVS depth information [10, 31], while ours does not. Further, they still require several per-scene fine-tuning hours [10] or fail to achieve NeRF quality in the full input-view setup [6, 59, 67]. Most recently, NeuRay [31] shows NeRF's quality with 40 minutes per-scene training time in the lower-resolution setup. Under the same GPU spec, our method achieves NeRF's quality in 15 minutes per scene on the high-resolution setup and does not require depth guidance and cross-scene pre-training.

3. Preliminaries

To represent a 3D scene for novel view synthesis, Neural Radiance Fields (NeRFs) [37] employ multilayer perceptron (MLP) networks to map a 3D position x and a viewing direction d to the corresponding density σ and view-dependent color emission c:

    (\sigma, e) = \mathrm{MLP}^{(\mathrm{pos})}(x) ,    (1a)
    c = \mathrm{MLP}^{(\mathrm{rgb})}(e, d) ,    (1b)

where the learnable MLP parameters are omitted, and e is an intermediate embedding to help the much shallower MLP^(rgb) to learn c (see NeRF++ [68] for more discussions on the architecture design). In practice, positional encoding is applied to x and d, which enables the MLPs to learn the high-frequency details from low-dimensional input [53]. For output activation, Sigmoid is applied on c; ReLU or Softplus is applied on σ (see Mip-NeRF [1] for more discussion on output activation).

To render the color of a pixel Ĉ(r), we cast the ray r from the camera center through the pixel; K points are then sampled on r between the pre-defined near and far planes; the K ordered sampled points are then used to query for their densities and colors {(σ_i, c_i)}_{i=1}^{K} (MLPs are queried in NeRF). Finally, the K queried results are accumulated into a single color with the volume rendering quadrature in accordance with the optical model given by Max [34]:

    \hat{C}(r) = \Big( \sum_{i=1}^{K} T_i \alpha_i c_i \Big) + T_{K+1} c_{\mathrm{bg}} ,    (2a)
    \alpha_i = \mathrm{alpha}(\sigma_i, \delta_i) = 1 - \exp(-\sigma_i \delta_i) ,    (2b)
    T_i = \prod_{j=1}^{i-1} (1 - \alpha_j) ,    (2c)

where α_i is the probability of termination at the point i; T_i is the accumulated transmittance from the near plane to point i; δ_i is the distance to the adjacent sampled point; and c_bg is a pre-defined background color.

Given the training images with known poses, the NeRF model is trained by minimizing the photometric MSE between the observed pixel color C(r) and the rendered color Ĉ(r):

    L_{\mathrm{photo}} = \frac{1}{|R|} \sum_{r \in R} \big\| \hat{C}(r) - C(r) \big\|_2^2 ,    (3)

where R is the set of rays in a sampled mini-batch.
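To make the quadrature concrete, here is a minimal PyTorch sketch of Eqs. (2) and (3) (our own illustration, not the released DirectVoxGO code): given per-point activated densities and colors sampled along a batch of rays, it computes the alpha values, the accumulated transmittance, the composited pixel colors, and the photometric MSE. Tensor shapes and helper names are illustrative assumptions.

```python
import torch

def volume_render(sigma, color, dists, c_bg=1.0):
    """Volume rendering quadrature of Eq. (2).

    sigma: (R, K) densities after the density activation
    color: (R, K, 3) per-point color emissions
    dists: (R, K) distances delta_i to the adjacent sampled point
    c_bg:  pre-defined background color
    """
    alpha = 1.0 - torch.exp(-sigma * dists)                        # Eq. (2b)
    # T_i = prod_{j<i}(1 - alpha_j); prepend ones so that T_1 = 1   (Eq. (2c))
    T = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha], dim=1), dim=1)
    weights = T[:, :-1] * alpha                                     # T_i * alpha_i
    rgb = (weights.unsqueeze(-1) * color).sum(dim=1)                # Eq. (2a)
    rgb = rgb + T[:, -1:] * c_bg                                    # + T_{K+1} * c_bg
    return rgb

def photometric_loss(rgb_pred, rgb_gt):
    """Eq. (3): mean squared error over the sampled mini-batch of rays R."""
    return ((rgb_pred - rgb_gt) ** 2).sum(dim=-1).mean()
```

In NeRF the per-point (σ_i, c_i) come from MLP queries; in the rest of this paper they come from direct voxel-grid queries.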
Figure 3. A single grid cell with post-activation is capable of modeling sharp linear surfaces. Left: We depict the toy task for a 2D grid cell, where a grid cell is optimized for the linear surface (decision boundary) across it. Right: Each column shows an example task for three different methods. The results show that a single grid cell with post-activation (Eq. (6c)) is adequate to recover faithfully the linear surface. Conversely, pre-activation (Eq. (6a)) and in-activation (Eq. (6b)) fail to accomplish the tasks as they can only fit into smooth results, and thus would require more grid cells to recover the surface detail. See supplementary material for the mathematical proof.

(a) Visual comparison of image fitting results under grid resolution (H/5)×(W/5). The first row is the results of pre-, in-, and post-activation. The second row is their per-pixel absolute difference to the target image.
(b) PSNRs achieved by pre-, in-, and post-activation under different grid strides. A grid stride s means that the grid resolution is (H/s)×(W/s). The black dashed line highlights that post-activation with stride ≈ 8.5 can achieve the same PSNR as pre-activation with stride 2 in this example.

Figure 4. Toy example on image fitting. The target 2D image is binary to imitate the scenario that most of the 3D space is either occupied or free. The objective is to reconstruct the target image by a low-resolution 2D grid. In each optimization step, the tunable 2D grid is queried by interpolation with pre-activation (Eq. (6a)), in-activation (Eq. (6b)), or post-activation (Eq. (6c)) to minimize the mean squared error to the target image. The result reveals that the post-activation can produce sharp boundaries even with low grid resolution (Fig. 4a) and is much better than the other two under various grid resolutions (Fig. 4b). This motivates us to model the 3D geometry directly via voxel grids with post-activation.

4. Post-activated density voxel grid

Voxel-grid representation. A voxel-grid representation models the modalities of interest (e.g., density, color, or feature) explicitly in its grid cells. Such an explicit scene representation is efficient to query for any 3D position via interpolation:

    \mathrm{interp}(x, V) : \big( \mathbb{R}^3, \mathbb{R}^{C \times N_x \times N_y \times N_z} \big) \to \mathbb{R}^{C} ,    (4)

where x is the queried 3D point, V is the voxel grid, C is the dimension of the modality, and N_x · N_y · N_z is the total number of voxels. Trilinear interpolation is applied if not specified otherwise.

Density voxel grid for volume rendering. The density voxel grid, V^(density), is a special case with C = 1, which stores the density values for volume rendering (Eq. (2)). We use σ̈ ∈ R to denote the raw voxel density before applying the density activation (i.e., a mapping of R → R≥0). In this work, we use the shifted softplus mentioned in Mip-NeRF [1] as the density activation:

    \sigma = \mathrm{softplus}(\ddot{\sigma}) = \log\big( 1 + \exp(\ddot{\sigma} + b) \big) ,    (5)

where the shift b is a hyperparameter. Using softplus instead of ReLU is crucial to optimize voxel density directly, as it is irreparable when a voxel is falsely set to a negative value with ReLU as the density activation. Conversely, softplus allows us to explore density very close to 0.

Sharp decision boundary via post-activation. The interpolated voxel density is processed by the softplus (Eq. (5)) and alpha (Eq. (2b)) functions sequentially for volume rendering. We consider three different orderings—pre-activation, in-activation, and post-activation—of plugging in the trilinear interpolation and performing the activation, given a queried 3D point x:

    \alpha^{(\mathrm{pre})} = \mathrm{interp}\big( x, \mathrm{alpha}(\mathrm{softplus}(V^{(\mathrm{density})})) \big) ,    (6a)
    \alpha^{(\mathrm{in})} = \mathrm{alpha}\big( \mathrm{interp}(x, \mathrm{softplus}(V^{(\mathrm{density})})) \big) ,    (6b)
    \alpha^{(\mathrm{post})} = \mathrm{alpha}\big( \mathrm{softplus}(\mathrm{interp}(x, V^{(\mathrm{density})})) \big) .    (6c)

The input δ to the function alpha (Eq. (2b)) is omitted for simplicity. We show that the post-activation, i.e., applying all the non-linear activation after the trilinear interpolation, is capable of producing sharp surfaces (decision boundaries) with much fewer grid cells. In Fig. 3, we use a 2D grid cell as an example to show that a grid cell with post-activation can produce a sharp linear boundary, while pre- and in-activation can only produce smooth results and thus require more cells for the surface detail. In Fig. 4, we further use binary image regression as a toy example to compare their capability, which also shows that post-activation can achieve a much better efficiency in grid cell usage.
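To make the three orderings in Eq. (6) concrete, the following is a minimal PyTorch sketch (not the released DirectVoxGO code) that trilinearly interpolates a density voxel grid and applies the shifted softplus (Eq. (5)) and alpha (Eq. (2b)) either before or after the interpolation. The helper names, the grid memory layout, and the fixed step size delta are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def trilinear(grid, xyz):
    """Trilinearly interpolate a voxel grid.

    grid: (C, Nx, Ny, Nz) voxel values
    xyz:  (P, 3) query points, each coordinate already normalized to [-1, 1]
    returns: (P, C) interpolated values
    """
    # grid_sample takes (N, C, D, H, W) volumes and (x, y, z) coordinates that
    # index (W, H, D), so the grid is fed as (1, C, Nz, Ny, Nx).
    vol = grid.permute(0, 3, 2, 1).unsqueeze(0)
    coords = xyz.view(1, -1, 1, 1, 3)
    out = F.grid_sample(vol, coords, mode='bilinear', align_corners=True)
    return out.view(grid.shape[0], -1).t()

def query_alpha(v_density, xyz, delta, b=0.0, order='post'):
    """Eq. (6): query opacity with pre-, in-, or post-activation.

    v_density: (1, Nx, Ny, Nz) raw densities; b is the softplus shift of Eq. (5).
    """
    act = lambda raw: 1.0 - torch.exp(-F.softplus(raw + b) * delta)  # Eq. (5)+(2b)
    if order == 'pre':    # Eq. (6a): interpolate already-activated alphas
        return trilinear(act(v_density), xyz)[:, 0]
    if order == 'in':     # Eq. (6b): interpolate activated densities, then alpha
        sigma = trilinear(F.softplus(v_density + b), xyz)[:, 0]
        return 1.0 - torch.exp(-sigma * delta)
    # Eq. (6c): interpolate the raw densities, then apply every activation
    return act(trilinear(v_density, xyz)[:, 0])
```

Only the 'post' branch applies the non-linearities to a signal that still varies linearly inside a cell, so a steep softplus/alpha can turn that linear ramp into a nearly binary transition—the sharp boundary illustrated in Figs. 3 and 4.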
5. Fast and direct voxel grid optimization

We depict an overview of our approach in Fig. 2. In Sec. 5.1, we first search for the coarse geometry of a scene. In Sec. 5.2, we then reconstruct the fine detail, including view-dependent effects. Hereinafter we use superscripts (c) and (f) to denote variables in the coarse and fine stages.

5.1. Coarse geometry searching

Typically, a scene is dominated by free space (i.e., unoccupied space). Motivated by this fact, we aim to efficiently find the coarse 3D areas of interest before reconstructing the fine detail and view-dependent effects, which require more computation resources. We can thus greatly reduce the number of queried points on each ray in the later fine stage.

Coarse scene representation. We use a coarse density voxel grid V^(density)(c) ∈ R^{1×Nx(c)×Ny(c)×Nz(c)} with post-activation (Eq. (6c)) to model the scene geometry. We only model view-invariant color emissions by V^(rgb)(c) ∈ R^{3×Nx(c)×Ny(c)×Nz(c)} in the coarse stage. A query of any 3D point x is efficient with interpolation:

    \ddot{\sigma}^{(c)} = \mathrm{interp}\big( x, V^{(\mathrm{density})(c)} \big) ,    (7a)
    c^{(c)} = \mathrm{interp}\big( x, V^{(\mathrm{rgb})(c)} \big) ,    (7b)

where c^(c) ∈ R³ is the view-invariant color and σ̈^(c) ∈ R is the raw volume density.

Coarse voxels allocation. We first find a bounding box (BBox) tightly enclosing the camera frustums of the training views (see the red BBox in Fig. 2c for an example). Our voxel grids are aligned with the BBox. Let Lx(c), Ly(c), Lz(c) be the lengths of the BBox and M(c) be the hyperparameter for the expected total number of voxels in the coarse stage. The voxel size is

    s^{(c)} = \sqrt[3]{ L_x^{(c)} \cdot L_y^{(c)} \cdot L_z^{(c)} / M^{(c)} } ,

so there are N_x^{(c)}, N_y^{(c)}, N_z^{(c)} = \lfloor L_x^{(c)}/s^{(c)} \rfloor, \lfloor L_y^{(c)}/s^{(c)} \rfloor, \lfloor L_z^{(c)}/s^{(c)} \rfloor voxels on each side of the BBox.

Coarse-stage points sampling. On a pixel-rendering ray, we sample query points as

    x_0 = o + t^{(\mathrm{near})} d ,    (8a)
    x_i = x_0 + i \cdot \delta^{(c)} \cdot \frac{d}{\| d \|_2} ,    (8b)

where o is the camera center, d is the ray-casting direction, t(near) is the camera near bound, and δ(c) is a hyperparameter for the step size that can be adaptively chosen according to the voxel size s(c). The query index i ranges from 1 to \lceil t^{(\mathrm{far})} \cdot \| d \|_2 / \delta^{(c)} \rceil, where t(far) is the camera far bound, so the last sampled point stops near the far plane.

Prior 1: low-density initialization. At the start of training, the importance of points far from a camera is down-weighted due to the accumulated transmittance term in Eq. (2c). As a result, the coarse density voxel grid V^(density)(c) could be accidentally trapped in a suboptimal "cloudy" geometry with higher densities at the cameras' near planes. We thus have to initialize V^(density)(c) more carefully to ensure that all sampled points on rays are visible to the cameras at the beginning, i.e., the accumulated transmittance rates T_i in Eq. (2c) are close to 1. In practice, we initialize all grid values in V^(density)(c) to 0 and set the bias term in Eq. (5) to

    b = \log\Big( \big( 1 - \alpha^{(\mathrm{init})(c)} \big)^{-\frac{1}{s^{(c)}}} - 1 \Big) ,    (9)

where α(init)(c) is a hyperparameter. Thereby, the accumulated transmittance T_i is decayed by a factor of 1 − α(init)(c) ≈ 1 for a ray that traces forward a distance of one voxel size s(c). See supplementary material for the derivation and proof.

Prior 2: view-count-based learning rate. There could be some voxels visible to too few training views in real-world capturing, while we prefer a surface with consistency in many views over a surface that can only explain a few views. In practice, we set different learning rates for different grid points in V^(density)(c). For each grid point indexed by j, we count the number of training views n_j to which point j is visible, and then scale its base learning rate by n_j / n_max, where n_max is the maximum view count over all grid points.

Training objective for coarse representation. The scene representation is reconstructed by minimizing the mean squared error between the rendered and observed colors. To regularize the reconstruction, we mainly use a background entropy loss to encourage the accumulated alpha values to concentrate on the background or the foreground. Please refer to the supplementary material for more detail.
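A minimal sketch (our own illustration, not the authors' released code) of the coarse-stage recipe above: the grid resolution follows from the BBox lengths and the voxel budget M(c), the softplus shift b is set from α(init)(c) via Eq. (9) so that a zero-initialized grid starts almost transparent, and each grid point's learning rate is scaled by how many training views see it. Counting the visible views (projecting grid points into the training cameras) is omitted, and names such as view_counts are placeholders.

```python
import math
import torch

def allocate_grid(box_lengths, num_voxels):
    """Voxel size s and per-axis resolution (Nx, Ny, Nz) from the BBox side
    lengths (Lx, Ly, Lz) and the expected total number of voxels M."""
    lx, ly, lz = box_lengths
    s = (lx * ly * lz / num_voxels) ** (1.0 / 3.0)
    return s, (int(lx // s), int(ly // s), int(lz // s))

def softplus_shift(alpha_init, voxel_size):
    """Eq. (9): pick b so that a grid initialized to zero yields an activated
    alpha of alpha_init for a ray segment of one voxel size."""
    # softplus(0 + b) must equal -log(1 - alpha_init) / s
    return math.log((1.0 - alpha_init) ** (-1.0 / voxel_size) - 1.0)

def per_voxel_lr(base_lr, view_counts):
    """Prior 2: scale the base learning rate of grid point j by n_j / n_max."""
    return base_lr * view_counts.float() / view_counts.max().clamp(min=1)

# Example usage; the BBox lengths are made up, the other values follow Sec. 6.2,
# and the update below is a plain gradient step only to show where the
# per-voxel rates would enter, not the actual optimizer:
# s_c, (nx, ny, nz) = allocate_grid((3.0, 3.0, 3.0), 100 ** 3)
# v_density = torch.zeros(1, nx, ny, nz, requires_grad=True)
# b = softplus_shift(1e-6, s_c)
# lr = per_voxel_lr(0.1, view_counts)            # view_counts: (nx, ny, nz)
# ... after loss.backward():
# v_density.data -= lr.unsqueeze(0) * v_density.grad
```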
5.2. Fine detail reconstruction

Given the optimized coarse geometry V^(density)(c) from Sec. 5.1, we can now focus on a smaller subspace to reconstruct the surface details and view-dependent effects. The optimized V^(density)(c) is frozen in this stage.

Fine scene representation. In the fine stage, we use a higher-resolution density voxel grid V^(density)(f) ∈ R^{1×Nx(f)×Ny(f)×Nz(f)} with post-activated interpolation (Eq. (6c)). Note that, alternatively, it is also possible to use a more advanced data structure [18, 30, 66] to refine the voxel grid based on the current V^(density)(c), but we leave that for future work. To model view-dependent color emission, we opt to use an explicit-implicit hybrid representation, as we find in our prior experiments that an explicit representation tends to produce worse results and an implicit representation entails a slower training speed. Our hybrid representation comprises i) a feature voxel grid V^(feat)(f) ∈ R^{D×Nx(f)×Ny(f)×Nz(f)}, where D is a hyperparameter for the feature-space dimension, and ii) a shallow MLP parameterized by Θ. Finally, queries of a 3D point x and viewing direction d are performed by

    \ddot{\sigma}^{(f)} = \mathrm{interp}\big( x, V^{(\mathrm{density})(f)} \big) ,    (10a)
    c^{(f)} = \mathrm{MLP}^{(\mathrm{rgb})}_{\Theta}\big( \mathrm{interp}(x, V^{(\mathrm{feat})(f)}), x, d \big) ,    (10b)

where c^(f) ∈ R³ is the view-dependent color emission and σ̈^(f) ∈ R is the raw volume density in the fine stage. Positional embedding [37] is applied on x and d for the MLP^(rgb)_Θ.

Known free space and unknown space. A query point is in the known free space if the post-activated alpha value from the optimized V^(density)(c) is less than the threshold τ(c). Otherwise, we say the query point is in the unknown space.

Fine voxels allocation. We densely query V^(density)(c) to find a BBox tightly enclosing the unknown space, where Lx(f), Ly(f), Lz(f) are the lengths of the BBox. The only hyperparameter is the expected total number of voxels M(f). The voxel size s(f) and the grid dimensions Nx(f), Ny(f), Nz(f) can then be derived automatically from M(f) as per Sec. 5.1.

Progressive scaling. Inspired by NSVF [30], we progressively scale our voxel grids V^(density)(f) and V^(feat)(f). Let pg_ckpt be the set of checkpoint steps. The initial number of voxels is set to ⌊M(f)/2^|pg_ckpt|⌋. When reaching a training step in pg_ckpt, we double the number of voxels, such that the number of voxels after the last checkpoint is M(f); the voxel size s(f) and the grid dimensions Nx(f), Ny(f), Nz(f) are updated accordingly. Scaling our scene representation is much simpler: at each checkpoint, we resize our voxel grids, V^(density)(f) and V^(feat)(f), by trilinear interpolation.

Fine-stage points sampling. The points sampling strategy is similar to Eq. (8) with some modifications. We first filter out rays that do not intersect with the known free space. For each ray, we adjust the near and far bounds, t(near) and t(far), to the two endpoints of the ray-box intersection. We do not adjust t(near) if x_0 is already inside the BBox.

Free space skipping. Querying V^(density)(c) (Eq. (7a)) is faster than querying V^(density)(f) (Eq. (10a)); querying for view-dependent colors (Eq. (10b)) is the slowest. We improve fine-stage efficiency with free space skipping in both training and testing. First, we skip sampled points that are in the known free space by checking the optimized V^(density)(c) (Eq. (7a)). Second, we further skip sampled points in the unknown space with low activated alpha values (thresholded at τ(f)) by querying V^(density)(f) (Eq. (10a)).

Training objective for fine representation. We use the same training losses as in the coarse stage, but with a smaller weight for the regularization losses, as we find this empirically leads to slightly better quality.

6. Experiments

6.1. Datasets

We evaluate our approach on five inward-facing datasets. Synthetic-NeRF [37] contains eight objects with realistic images synthesized by NeRF. Synthetic-NSVF [30] contains another eight objects synthesized by NSVF. Strictly following NeRF's and NSVF's setups, we set the image resolution to 800 × 800 pixels and let each scene have 100 views for training and 200 views for testing. BlendedMVS [65] is a synthetic MVS dataset that has realistic ambient lighting from real image blending. We use a subset of four objects provided by NSVF. The image resolution is 768 × 576 pixels, and one-eighth of the images are used for testing. Tanks&Temples [21] is a real-world dataset. We use a subset of five scenes provided by NSVF, each containing views captured by an inward-facing camera circling the scene. The image resolution is 1920 × 1080 pixels, and one-eighth of the images are used for testing. The DeepVoxels [48] dataset contains four simple Lambertian objects. The image resolutions are 512 × 512, and each scene has 479 views for training and 1000 views for testing.

6.2. Implementation details

We choose the same hyperparameters generally for all scenes. The expected numbers of voxels are set to M(c) = 100³ and M(f) = 160³ in the coarse and fine stages if not stated otherwise. The activated alpha values are initialized to α(init)(c) = 10⁻⁶ in the coarse stage. We use a higher α(init)(f) = 10⁻² as the query points are concentrated on the optimized coarse geometry in the fine stage. The points-sampling step sizes are set to half of the voxel sizes, i.e., δ(c) = 0.5·s(c) and δ(f) = 0.5·s(f). The shallow MLP comprises two hidden layers with 128 channels. We use the Adam optimizer [20] with a batch size of 8,192 rays to optimize the coarse and fine scene representations for 10k and 20k iterations, respectively. The base learning rates are 0.1 for all voxel grids and 10⁻³ for the shallow MLP. Exponential learning rate decay is applied. See supplementary material for detailed hyperparameter setups.
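As a hedged sketch of how this training setup can be wired up in PyTorch (the helper, its parameter names, and the decay target are our own illustration, not the released implementation): Adam gets two parameter groups with the base learning rates above, and an exponential schedule decays them smoothly over the run.

```python
import torch

def build_optimizer(voxel_grids, rgb_mlp, total_steps, decay_to=0.1):
    """Adam with separate base learning rates for the explicit voxel grids
    (0.1) and the shallow MLP (1e-3), plus per-step exponential lr decay."""
    optimizer = torch.optim.Adam([
        {'params': voxel_grids, 'lr': 0.1},           # density / feature grids
        {'params': rgb_mlp.parameters(), 'lr': 1e-3}, # shallow view-dependent MLP
    ])
    # decay so that the lr reaches decay_to * base_lr after total_steps steps
    gamma = decay_to ** (1.0 / total_steps)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)
    return optimizer, scheduler

# e.g. optimizer, scheduler = build_optimizer([v_density_f, v_feat_f], rgb_mlp, 20000)
# and call optimizer.step(); scheduler.step() once per sampled batch of 8,192 rays.
```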
6.3. Comparisons

Quantitative evaluation on the synthesized novel views. We first quantitatively compare the novel view synthesis results in Tab. 1. PSNR, SSIM [60], and LPIPS [69] are employed as evaluation metrics. Our model with M(f) = 160³ voxels already outperforms the original NeRF [37] and the improved JaxNeRF [9] re-implementation. Besides, our results are also comparable to most of the recent methods, except JaxNeRF+ [9] and Mip-NeRF [1]. Moreover, our per-scene optimization only takes about 15 minutes, while all the methods listed after NeRF in Tab. 1 need quite a few hours per scene. We also show our model with M(f) = 256³ voxels, which significantly improves our results under all metrics and achieves results more comparable to JaxNeRF+ and Mip-NeRF. We defer detailed comparisons on the much simpler DeepVoxels [48] dataset to the supplementary material, where we achieve 45.83 averaged PSNR and outperform NeRF's 40.15 and IBRNet's 42.93.

Table 1. Quantitative comparisons for novel view synthesis (each cell is PSNR↑ / SSIM↑ / LPIPS↓). Our method excels in convergence speed, i.e., 15 minutes per scene compared to many hours or days per scene using other methods. Besides, our rendering quality is better than the original NeRF [37] and the improved JaxNeRF [9] on the four datasets under all metrics. We also show comparable results to most of the recent methods. The superscript on LPIPS denotes the pre-trained model used in LPIPS; the gray numbers indicate that the code is unavailable or has an unconventional LPIPS implementation.

| Method | Synthetic-NeRF | Synthetic-NSVF | BlendedMVS | Tanks and Temples |
| SRN [49] | 22.26 / 0.846 / 0.170(vgg) | 24.33 / 0.882 / 0.141(alex) | 20.51 / 0.770 / 0.294(alex) | 24.10 / 0.847 / 0.251(alex) |
| NV [32] | 26.05 / 0.893 / 0.160(vgg) | 25.83 / 0.892 / 0.124(alex) | 23.03 / 0.793 / 0.243(alex) | 23.70 / 0.834 / 0.260(alex) |
| NeRF [37] | 31.01 / 0.947 / 0.081(vgg) | 30.81 / 0.952 / 0.043(alex) | 24.15 / 0.828 / 0.192(alex) | 25.78 / 0.864 / 0.198(alex) |
| Improved visual quality from NeRF: |
| JaxNeRF [9] | 31.69 / 0.953 / 0.068(vgg) | - | - | 27.94 / 0.904 / 0.168(vgg) |
| JaxNeRF+ [9] | 33.00 / 0.962 / 0.038 | - | - | - |
| Mip-NeRF [1] | 33.09 / 0.961 / 0.043(vgg) | - | - | - |
| Improved test-time rendering speed (and visual quality) from NeRF: |
| AutoInt [29] | 25.55 / 0.911 / 0.170 | - | - | - |
| FastNeRF [15] | 29.97 / 0.941 / 0.053 | - | - | - |
| SNeRG [18] | 30.38 / 0.950 / 0.050 | - | - | - |
| KiloNeRF [43] | 31.00 / 0.95 / 0.03 | 33.37 / 0.97 / 0.02 | 27.39 / 0.92 / 0.06 | 28.41 / 0.91 / 0.09 |
| PlenOctrees [66] | 31.71 / 0.958 / 0.053(vgg) | - | - | 27.99 / 0.917 / 0.131(vgg) |
| NSVF [30] | 31.75 / 0.953 / 0.047(alex) | 35.18 / 0.979 / 0.015(alex) | 26.89 / 0.898 / 0.114(alex) | 28.48 / 0.901 / 0.155(alex) |
| Improved convergence speed, test-time rendering speed, and visual quality from NeRF: |
| Ours (M(f)=160³) | 31.95 / 0.957 / 0.053(vgg), 0.035(alex) | 35.08 / 0.975 / 0.033(vgg), 0.019(alex) | 28.02 / 0.922 / 0.101(vgg), 0.075(alex) | 28.41 / 0.911 / 0.155(vgg), 0.148(alex) |
| Ours (M(f)=256³) | 32.80 / 0.961 / 0.045(vgg), 0.027(alex) | 36.21 / 0.980 / 0.024(vgg), 0.012(alex) | 28.64 / 0.933 / 0.081(vgg), 0.052(alex) | 28.82 / 0.920 / 0.138(vgg), 0.124(alex) |

Training time comparisons. The key merit of our work is the significant improvement in convergence speed with NeRF-comparable quality. In Tab. 2, we show a training time comparison. We also show the GPU specification after each reported time, as it is the main factor affecting run-time. NeRF [37], with a more powerful GPU, needs 1–2 days per scene to achieve 31.01 PSNR, while our method achieves a superior 31.95 and 32.80 PSNR in about 15 and 22 minutes per scene, respectively. MVSNeRF [6], IBRNet [59], and NeuRay [31] also show less per-scene training time than NeRF, but with the additional cost of running a generalizable cross-scene pre-training. MVSNeRF [6], after pre-training, optimizes a scene in 15 minutes as well, but the PSNR is degraded to 28.14. IBRNet [59] shows worse PSNR and longer training time than ours. NeuRay [31] originally reports time in the lower-resolution (NeuRay-Lo) setup, and we received the training time of the high-resolution (NeuRay-Hi) setup from the authors. NeuRay-Hi achieves 32.42 PSNR and requires 23 hours to train, while our method with M(f) = 256³ voxels achieves a superior 32.80 in about 22 minutes. For the early-stopped NeuRay-Hi, unfortunately, only its training time is retained (early-stopped NeuRay-Lo achieves NeRF-similar PSNR). NeuRay-Hi still needs 70 minutes to train with early stopping, while we only need 15 minutes to achieve NeRF-comparable quality and do not rely on generalizable pre-training or external depth information. Mip-NeRF [1] has similar run-time to NeRF but much better PSNRs, which also signifies that it needs less training time to reach NeRF's PSNR. We train early-stopped Mip-NeRFs on our machine and show the averaged PSNR and training time. The early-stopped Mip-NeRF achieves 30.85 PSNR after 6 hours of training, while we can achieve 31.95 PSNR in just 15 minutes.

Table 2. Training time comparisons. We take the training time and GPU specifications reported in previous works directly. A V100 GPU can run faster and has more storage than a 2080Ti GPU. Our method achieves good PSNR in significantly less per-scene optimization time.

| Method | PSNR↑ | Generalizable pre-training | Per-scene optimization |
| NeRF [37] | 31.01 | no need | 1–2 days (V100) |
| MVSNeRF [6] | 27.21 | 30 hrs (2080Ti) | 15 mins (2080Ti) |
| IBRNet [59] | 28.14 | 1 day (8×V100) | 6 hrs (V100) |
| NeuRay [31]† | 32.42 | 2 days (2080Ti) | 23 hrs (2080Ti) |
| Mip-NeRF [1]‡ | 30.85 | no need | 6 hrs (2080Ti) |
| Ours (M(f)=160³) | 31.95 | no need | 15 mins (2080Ti) |
| Ours (M(f)=256³) | 32.80 | no need | 22 mins (2080Ti) |
† Uses external depth information.
‡ Our reproduction with early stopping on our machine.

Rendering speed comparisons. Improving test-time rendering speed is not the main focus of this work, but we still achieve ∼45× speedups over NeRF—0.64 seconds versus 29 seconds per 800 × 800 image on our machine.

Qualitative comparison. Fig. 5 shows our rendering results on the challenging parts and compares them with the results (better than NeRF's) provided by PlenOctrees [66].
[Figure 5: qualitative comparisons; each group shows, from left to right, GT, Ours, and PlenOctree.]

Figure 5. Qualitative comparisons on the challenging parts. Top: On the ficus scene, we do not show blocking artifacts as PlenOctree does, and we recover the pot better. Middle: We produce blurrier results on the ship's body and rigging, but we do not have the background artifacts. Bottom: On the real-world captured Ignatius, we show better quality without blocking artifacts (left) and recover the color tone better (right). See supplementary material for more visualizations.

Table 3. Effectiveness of the post-activation. Geometry modeling with a density voxel grid achieves better PSNRs by using the proposed post-activated trilinear interpolation.

| Interp. | Syn.-NeRF PSNR↑ (∆) | Syn.-NSVF PSNR↑ (∆) | BlendedMVS PSNR↑ (∆) | T&T PSNR↑ (∆) |
| Nearest | 28.61 (-2.77) | 28.86 (-6.22) | 25.49 (-2.48) | 26.39 (-1.27) |
| Trilinear, pre- | 30.84 (-0.55) | 32.66 (-2.41) | 27.39 (-0.58) | 27.44 (-0.21) |
| Trilinear, in- | 29.91 (-1.48) | 32.42 (-2.66) | 27.29 (-0.68) | 27.52 (-0.13) |
| Trilinear, post- | 31.39 (-) | 35.08 (-) | 27.97 (-) | 27.66 (-) |

Table 4. Effectiveness of the imposed priors. We compare our different settings in the coarse geometry search. Top: We show their impacts on the final PSNRs after the fine-stage reconstruction. Bottom: We visualize the voxels allocated by the coarse geometry search on the Truck scene under the settings -/✓, 10⁻³/✓, 10⁻⁶/-, and 10⁻⁶/✓. Overall, low-density initialization is essential; using α(init)(c) = 10⁻⁶ and the view-count-based learning rate generally achieves cleaner voxel allocation in the coarse stage and better PSNR after the fine stage.

| α(init)(c) | View-count lr. | Syn.-NeRF PSNR↑ (∆) | Syn.-NSVF PSNR↑ (∆) | BlendedMVS PSNR↑ (∆) | T&T PSNR↑ (∆) |
| - | ✓ | 28.88 (-2.51) | 25.12 (-9.96) | 22.17 (-5.79) | 25.33 (-2.33) |
| 10⁻³ | ✓ | 30.96 (-0.42) | 27.24 (-7.84) | 23.17 (-4.79) | 26.04 (-1.61) |
| 10⁻⁴ | ✓ | 31.29 (-0.09) | 31.05 (-4.03) | 26.09 (-1.88) | 27.60 (-0.05) |
| 10⁻⁵ | ✓ | 31.41 (+0.02) | 35.04 (-0.04) | 27.36 (-0.61) | 27.63 (-0.02) |
| 10⁻⁶ | - | 31.40 (+0.01) | 35.03 (-0.04) | 27.37 (-0.60) | 27.59 (-0.07) |
| 10⁻⁷ | ✓ | 31.36 (-0.02) | 35.03 (-0.05) | 27.73 (-0.23) | 27.59 (-0.06) |
| 10⁻⁶ | ✓ | 31.39 (-) | 35.08 (-) | 27.97 (-) | 27.66 (-) |

6.4. Ablation studies

We mainly validate the effectiveness of the two proposed techniques—post-activation and the imposed priors—that enable voxel grids to model scene geometry with NeRF-comparable quality. We subsample two scenes for each dataset. See supplementary material for more detail and additional ablation studies on the number of voxels, point-sampling step size, progressive scaling, free space skipping, view-dependent color modeling, and the losses.

Effectiveness of the post-activation. We show in Sec. 4 that the proposed post-activated trilinear interpolation enables the discretized grid to model sharper surfaces. In Tab. 3, we compare the effectiveness of post-activation in scene reconstruction for novel view synthesis. Our grid in the fine stage consists of only 160³ voxels, where nearest-neighbor interpolation results in worse quality than trilinear interpolation. The proposed post-activation improves the results further compared to pre- and in-activation. We find that we gain less on the real-world captured BlendedMVS and Tanks and Temples datasets. The intuitive reason is that real-world data introduces more uncertainty (e.g., inconsistent lighting, SfM error), which results in multi-view-inconsistent and blurrier surfaces. Thus, the advantage is lessened for scene representations that can model sharper surfaces. We speculate that resolving the uncertainty in future work can increase the gain of the proposed post-activation.

Effectiveness of the imposed priors. As discussed in Sec. 5.1, it is crucial to initialize the voxel grid with low density to avoid suboptimal geometry. The hyperparameter α(init)(c) controls the initial activated alpha values via Eq. (9). In Tab. 4, we compare the quality with different α(init)(c) and with or without the view-count-based learning rate. Without the low-density initialization, the quality drops severely for all the scenes. When α(init)(c) = 10⁻⁷, we have to train the coarse stage of some scenes for more iterations. The effective range of α(init)(c) is scene-dependent. We find that α(init)(c) = 10⁻⁶ generally works well on all the scenes in this work. Finally, using the view-count-based learning rate can further improve the results and allocate noiseless voxels in the coarse stage.

7. Conclusion

Our method directly optimizes the voxel grid and achieves super-fast convergence in per-scene optimization with NeRF-comparable quality—reducing training time from many hours to 15 minutes. However, we do not deal with unbounded or forward-facing scenes, while we believe our method can be a stepping stone toward fast convergence in such scenarios. We hope our method can boost the progress of NeRF-based scene reconstruction and its applications.

Acknowledgements: This work was supported in part by the MOST grants 110-2634-F-001-009 and 110-2622-8-007-010-TE2 of Taiwan. We are grateful to the National Center for High-performance Computing for providing computational resources and facilities.
References

[1] Jonathan T. Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P. Srinivasan. Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. In ICCV, 2021.
[2] Sai Bi, Zexiang Xu, Pratul P. Srinivasan, Ben Mildenhall, Kalyan Sunkavalli, Milos Hasan, Yannick Hold-Geoffroy, David J. Kriegman, and Ravi Ramamoorthi. Neural reflectance fields for appearance acquisition. arXiv CS.CV 2106.01970, 2020.
[3] Mark Boss, Raphael Braun, Varun Jampani, Jonathan T. Barron, Ce Liu, and Hendrik P. A. Lensch. Nerd: Neural reflectance decomposition from image collections. In ICCV, 2021.
[4] Chris Buehler, Michael Bosse, Leonard McMillan, Steven J. Gortler, and Michael F. Cohen. Unstructured lumigraph rendering. In SIGGRAPH, 2001.
[5] Eric R. Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. Pi-gan: Periodic implicit generative adversarial networks for 3d-aware image synthesis. In CVPR, 2021.
[6] Anpei Chen, Zexiang Xu, Fuqiang Zhao, Xiaoshuai Zhang, Fanbo Xiang, Jingyi Yu, and Hao Su. Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo. In ICCV, 2021.
[7] Abe Davis, Marc Levoy, and Frédo Durand. Unstructured light fields. Comput. Graph. Forum, 2012.
[8] Paul E. Debevec, Camillo J. Taylor, and Jitendra Malik. Modeling and rendering architecture from photographs: A hybrid geometry- and image-based approach. In SIGGRAPH, 1996.
[9] Boyang Deng, Jonathan T. Barron, and Pratul P. Srinivasan. JaxNeRF: an efficient JAX implementation of NeRF, 2020.
[10] Kangle Deng, Andrew Liu, Jun-Yan Zhu, and Deva Ramanan. Depth-supervised nerf: Fewer views and faster training for free. arXiv CS.CV 2107.02791, 2021.
[11] Helisa Dhamo, Keisuke Tateno, Iro Laina, Nassir Navab, and Federico Tombari. Peeking behind objects: Layered depth prediction from a single image. Pattern Recognit. Lett., 2019.
[12] John Flynn, Michael Broxton, Paul E. Debevec, Matthew DuVall, Graham Fyffe, Ryan S. Overbeck, Noah Snavely, and Richard Tucker. Deepview: View synthesis with learned gradient descent. In CVPR, 2019.
[13] Guy Gafni, Justus Thies, Michael Zollhöfer, and Matthias Nießner. Dynamic neural radiance fields for monocular 4d facial avatar reconstruction. In CVPR, 2021.
[14] Chen Gao, Ayush Saraf, Johannes Kopf, and Jia-Bin Huang. Dynamic view synthesis from dynamic monocular video. In ICCV, 2021.
[15] Stephan J. Garbin, Marek Kowalski, Matthew Johnson, Jamie Shotton, and Julien P. C. Valentin. Fastnerf: High-fidelity neural rendering at 200fps. In ICCV, 2021.
[16] Steven J. Gortler, Radek Grzeszczuk, Richard Szeliski, and Michael F. Cohen. The lumigraph. In SIGGRAPH, 1996.
[17] Tong He, John P. Collomosse, Hailin Jin, and Stefano Soatto. Deepvoxels++: Enhancing the fidelity of novel view synthesis from 3d voxel embeddings. In ACCV, 2020.
[18] Peter Hedman, Pratul P. Srinivasan, Ben Mildenhall, Jonathan T. Barron, and Paul E. Debevec. Baking neural radiance fields for real-time view synthesis. In ICCV, 2021.
[19] Yoonwoo Jeong, Seokjun Ahn, Christopher Choy, Animashree Anandkumar, Minsu Cho, and Jaesik Park. Self-calibrating neural radiance fields. In ICCV, 2021.
[20] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[21] Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and temples: benchmarking large-scale scene reconstruction. ACM Trans. Graph., 2017.
[22] Adam R. Kosiorek, Heiko Strathmann, Daniel Zoran, Pol Moreno, Rosalia Schneider, Sona Mokrá, and Danilo Jimenez Rezende. Nerf-vae: A geometry aware 3d scene generative model. In ICML, 2021.
[23] Anat Levin and Frédo Durand. Linear view synthesis using a dimensionality gap light field prior. In CVPR, 2010.
[24] Marc Levoy and Pat Hanrahan. Light field rendering. In SIGGRAPH, 1996.
[25] Zhengqi Li, Simon Niklaus, Noah Snavely, and Oliver Wang. Neural scene flow fields for space-time view synthesis of dynamic scenes. In CVPR, 2021.
[26] Zhengqi Li, Wenqi Xian, Abe Davis, and Noah Snavely. Crowdsampling the plenoptic function. In ECCV, 2020.
[27] Chen-Hsuan Lin, Wei-Chiu Ma, Antonio Torralba, and Simon Lucey. BARF: bundle-adjusting neural radiance fields. In ICCV, 2021.
[28] Yen-Chen Lin, Pete Florence, Jonathan T. Barron, Alberto Rodriguez, Phillip Isola, and Tsung-Yi Lin. inerf: Inverting neural radiance fields for pose estimation. In IROS, 2021.
[29] David B. Lindell, Julien N. P. Martel, and Gordon Wetzstein. Autoint: Automatic integration for fast neural volume rendering. In CVPR, 2021.
[30] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields. In NeurIPS, 2020.
[31] Yuan Liu, Sida Peng, Lingjie Liu, Qianqian Wang, Peng Wang, Christian Theobalt, Xiaowei Zhou, and Wenping Wang. Neural rays for occlusion-aware image-based rendering. arXiv CS.CV 2107.13421, 2021.
[32] Stephen Lombardi, Tomas Simon, Jason M. Saragih, Gabriel Schwartz, Andreas M. Lehrmann, and Yaser Sheikh. Neural volumes: learning dynamic renderable volumes from images. ACM Trans. Graph., 2019.
[33] Ricardo Martin-Brualla, Noha Radwan, Mehdi S. M. Sajjadi, Jonathan T. Barron, Alexey Dosovitskiy, and Daniel Duckworth. Nerf in the wild: Neural radiance fields for unconstrained photo collections. In CVPR, 2021.
[34] Nelson L. Max. Optical models for direct volume rendering. IEEE Trans. Vis. Comput. Graph., 1995.
[35] Quan Meng, Anpei Chen, Haimin Luo, Minye Wu, Hao Su, Lan Xu, Xuming He, and Jingyi Yu. Gnerf: Gan-based neural radiance field without posed camera. In ICCV, 2021.
[36] Ben Mildenhall, Pratul P. Srinivasan, Rodrigo Ortiz Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar. Local light field fusion: practical view synthesis with prescriptive sampling guidelines. ACM Trans. Graph., 2019.
[37] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
[38] Atsuhiro Noguchi, Xiao Sun, Stephen Lin, and Tatsuya Harada. Neural articulated radiance field. In ICCV, 2021.
[39] Keunhong Park, Utkarsh Sinha, Jonathan T. Barron, Sofien Bouaziz, Dan B. Goldman, Steven M. Seitz, and Ricardo Martin-Brualla. Deformable neural radiance fields. In ICCV, 2021.
[40] Keunhong Park, Utkarsh Sinha, Peter Hedman, Jonathan T. Barron, Sofien Bouaziz, Dan B. Goldman, Ricardo Martin-Brualla, and Steven M. Seitz. Hypernerf: A higher-dimensional representation for topologically varying neural radiance fields. arXiv CS.CV 2106.13228, 2021.
[41] Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-nerf: Neural radiance fields for dynamic scenes. In CVPR, 2021.
[42] Daniel Rebain, Wei Jiang, Soroosh Yazdani, Ke Li, Kwang Moo Yi, and Andrea Tagliasacchi. Derf: Decomposed radiance fields. In CVPR, 2021.
[43] Christian Reiser, Songyou Peng, Yiyi Liao, and Andreas Geiger. Kilonerf: Speeding up neural radiance fields with thousands of tiny mlps. In ICCV, 2021.
[44] Katja Schwarz, Yiyi Liao, Michael Niemeyer, and Andreas Geiger. GRAF: generative radiance fields for 3d-aware image synthesis. In NeurIPS, 2020.
[45] Jonathan Shade, Steven J. Gortler, Li-wei He, and Richard Szeliski. Layered depth images. In SIGGRAPH, 1998.
[46] Lixin Shi, Haitham Hassanieh, Abe Davis, Dina Katabi, and Frédo Durand. Light field reconstruction using sparsity in the continuous fourier domain. ACM Trans. Graph., 2014.
[47] Meng-Li Shih, Shih-Yang Su, Johannes Kopf, and Jia-Bin Huang. 3d photography using context-aware layered depth inpainting. In CVPR, 2020.
[48] Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Nießner, Gordon Wetzstein, and Michael Zollhöfer. Deepvoxels: Learning persistent 3d feature embeddings. In CVPR, 2019.
[49] Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. Scene representation networks: Continuous 3d-structure-aware neural scene representations. In NeurIPS, 2019.
[50] Pratul P. Srinivasan, Boyang Deng, Xiuming Zhang, Matthew Tancik, Ben Mildenhall, and Jonathan T. Barron. Nerv: Neural reflectance and visibility fields for relighting and view synthesis. In CVPR, 2021.
[51] Pratul P. Srinivasan, Richard Tucker, Jonathan T. Barron, Ravi Ramamoorthi, Ren Ng, and Noah Snavely. Pushing the boundaries of view extrapolation with multiplane images. In CVPR, 2019.
[52] Matthew Tancik, Ben Mildenhall, Terrance Wang, Divi Schmidt, Pratul P. Srinivasan, Jonathan T. Barron, and Ren Ng. Learned initializations for optimizing coordinate-based neural representations. In CVPR, 2021.
[53] Matthew Tancik, Pratul P. Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T. Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. In NeurIPS, 2020.
[54] Justus Thies, Michael Zollhöfer, and Matthias Nießner. Deferred neural rendering: image synthesis using neural textures. ACM Trans. Graph., 2019.
[55] Edgar Tretschk, Ayush Tewari, Vladislav Golyanik, Michael Zollhöfer, Christoph Lassner, and Christian Theobalt. Non-rigid neural radiance fields: Reconstruction and novel view synthesis of a deforming scene from monocular video. In ICCV, 2021.
[56] Richard Tucker and Noah Snavely. Single-view view synthesis with multiplane images. In CVPR, 2020.
[57] Shubham Tulsiani, Richard Tucker, and Noah Snavely. Layer-structured 3d scene inference via view synthesis. In ECCV, 2018.
[58] Michael Waechter, Nils Moehrle, and Michael Goesele. Let there be color! large-scale texturing of 3d reconstructions. In ECCV, 2014.
[59] Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul P. Srinivasan, Howard Zhou, Jonathan T. Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas A. Funkhouser. Ibrnet: Learning multi-view image-based rendering. In CVPR, 2021.
[60] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE TIP, 2004.
[61] Zirui Wang, Shangzhe Wu, Weidi Xie, Min Chen, and Victor Adrian Prisacariu. Nerf-: Neural radiance fields without known camera parameters. arXiv CS.CV 2102.07064, 2021.
[62] Suttisak Wizadwongsa, Pakkapon Phongthawee, Jiraphon Yenphraphai, and Supasorn Suwajanakorn. Nex: Real-time view synthesis with neural basis expansion. In CVPR, 2021.
[63] Daniel N. Wood, Daniel I. Azuma, Ken Aldinger, Brian Curless, Tom Duchamp, David Salesin, and Werner Stuetzle. Surface light fields for 3d photography. In SIGGRAPH, 2000.
[64] Wenqi Xian, Jia-Bin Huang, Johannes Kopf, and Changil Kim. Space-time neural irradiance fields for free-viewpoint video. In CVPR, 2021.
[65] Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. Blendedmvs: A large-scale dataset for generalized multi-view stereo networks. In CVPR, 2020.
[66] Alex Yu, Ruilong Li, Matthew Tancik, Hao Li, Ren Ng, and Angjoo Kanazawa. Plenoctrees for real-time rendering of neural radiance fields. In ICCV, 2021.
[67] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelnerf: Neural radiance fields from one or few images. In CVPR, 2021.
[68] Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen Koltun. Nerf++: Analyzing and improving neural radiance fields. arXiv CS.CV 2010.07492, 2020.
[69] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[70] Xiuming Zhang, Pratul P. Srinivasan, Boyang Deng, Paul E. Debevec, William T. Freeman, and Jonathan T. Barron. Nerfactor: Neural factorization of shape and reflectance under an unknown illumination. arXiv CS.CV 2106.01970, 2021.
[71] Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: learning view synthesis using multiplane images. ACM Trans. Graph., 2018.
