Direct Voxel Grid Optimization: Super-Fast Convergence for Radiance Fields Reconstruction (Sun et al., CVPR 2022)
[…] which enables real-time rendering and shows good quality. However, their methods cannot train from scratch and need a conversion step from the trained implicit model, which causes a bottleneck in the training time.

The key to our speedup is to use a dense voxel grid to directly model the 3D geometry (volume density). Developing an elaborate strategy for view-dependent colors is not in the main scope of this paper, so we simply use a hybrid representation (feature grid with shallow MLP) for colors. Directly optimizing the density voxel grid leads to super-fast convergence but is prone to suboptimal solutions, where our method allocates "cloud" at free space and tries to fit the photometric loss with the cloud instead of searching for a geometry with better multi-view consistency. Our solution to this problem is simple and effective. First, we initialize the density voxel grid to yield opacities very close to zero everywhere, which avoids geometry solutions being biased toward the cameras' near planes. Second, we give a lower learning rate to voxels visible to fewer views, which avoids redundant voxels that are allocated just to explain the observations from a small number of views. We show that the proposed solutions successfully avoid suboptimal geometry and work well on the five datasets.

Using a voxel grid to model volume density still faces a challenge in scalability. For parsimony, our approach automatically finds a BBox that tightly encloses the volume of interest to allocate the voxel grids. Besides, we propose post-activation: applying all the activation functions after trilinearly interpolating the density voxel grid. Previous work either interpolates the voxel grid for the activated opacity or uses nearest-neighbor interpolation, which results in a smooth surface in each grid cell. Conversely, we prove mathematically and empirically that the proposed post-activation can model (beyond) a sharp linear surface within a single grid cell. As a result, we can use fewer voxels to achieve better quality: our method with 160³ dense voxels already outperforms NeRF in most cases.

In summary, we have two main technical contributions. First, we implement two priors to avoid suboptimal geometry in direct voxel density optimization. Second, we propose post-activated voxel-grid interpolation, which enables sharp boundary modeling at lower grid resolutions. The resulting key merits of this work are highlighted as follows:
• Our convergence speed is about two orders of magnitude faster than NeRF's, reducing training time from 10-20 hours to 15 minutes on our machine with a single NVIDIA RTX 2080 Ti GPU, as shown in Fig. 1.
• We achieve visual quality comparable to NeRF at a rendering speed that is about 45× faster.
• Our method does not need cross-scene pretraining.
• Our grid resolution is about 160³, while the grid resolution in previous work [15, 18, 66] ranges from 512³ to 1300³ to achieve NeRF-comparable quality.

2. Related work

Representations for novel view synthesis. Image synthesis from novel viewpoints, given a set of images capturing the scene, is a long-standing task with rich studies. Previous work has presented several scene representations reconstructed from the input images to synthesize unobserved viewpoints. Lumigraph [4, 16] and light field representations [7, 23, 24, 46] directly synthesize novel views by interpolating the input images but require very dense scene capture. Layered depth images [11, 45, 47, 57] work for sparse input views but rely on depth maps or estimated depth with sacrificed quality. Mesh-based representations [8, 54, 58, 63] can run in real time but have a hard time with gradient-based optimization unless template meshes are provided. Recent approaches employ 2D/3D Convolutional Neural Networks (CNNs) to estimate multiplane images (MPIs) [12, 26, 36, 51, 56, 71] for forward-facing captures or voxel grids [17, 32, 48] for inward-facing captures. Our method uses gradient descent to optimize voxel grids directly and does not rely on neural networks to predict the grid values, yet we still outperform the previous CNN-based works [17, 32, 48] by a large margin.

Neural radiance fields. Recently, NeRF [37] has stood out as a prevalent method for novel view synthesis with rapid progress; it takes a moderate number of input images with known camera poses. Unlike traditional explicit and discretized volumetric representations (e.g., voxel grids and MPIs), NeRF uses coordinate-based multilayer perceptrons (MLPs) as an implicit and continuous volumetric representation. NeRF achieves appealing quality and has good flexibility, with many follow-up extensions to various setups, e.g., relighting [2, 3, 50, 70], deformation [13, 38–40, 55], self-calibration [19, 27, 28, 35, 61], meta-learning [52], dynamic scene modeling [14, 25, 33, 41, 64], and generative modeling [5, 22, 44]. Nevertheless, NeRF has the unfavorable limitations of lengthy training and slow rendering speed. In this work, we mainly follow NeRF's original setup, while our method optimizes the volume density explicitly encoded in a voxel grid to speed up both training and testing by a large margin with comparable quality.

Hybrid volumetric representations. To combine NeRF's implicit representation with traditional grid representations, the coordinate-based MLP is extended to also condition on local features stored in a grid. Recently, hybrid voxel [18, 30] and MPI [62] representations have shown success in fast rendering speed and result quality. We use a hybrid representation to model view-dependent color as well.

Fast NeRF rendering. NSVF [30] uses an octree in its hybrid representation to avoid redundant MLP queries in free space.
Figure 2. Approach overview: (a) volume rendering (Sec. 3); (b) our scene representation (Sec. 5.2); (c) coarse geometry searching (Sec. 5.1); (d) fine detail reconstruction (Sec. 5.2). We first review NeRF in Sec. 3. In Sec. 4, we present a novel post-activated density voxel grid to support sharp surface modeling in lower grid resolutions. In Sec. 5, we show our approach to the reconstruction of radiance fields with super-fast convergence, where we first find a coarse geometry in Sec. 5.1 and then reconstruct the fine details and view-dependent effects in Sec. 5.2.
However, NSVF still needs many training hours due to the deep MLP in its representation. Recent methods further use thousands of tiny MLPs [43] or explicit volumetric representations [15, 18, 62, 66] to achieve real-time rendering. Unfortunately, gradient-based optimization is not directly applicable to their methods due to their topological data structures or the lack of priors. As a result, these methods [15, 18, 43, 62, 66] still need a conversion step from a trained implicit model (e.g., NeRF) to their final representation that supports real-time rendering. Their training time is still burdened by the lengthy implicit-model optimization.

Fast NeRF convergence. Recent works that focus on the fewer-input-views setup also bring faster convergence as a side benefit. These methods rely on generalizable pre-training [6, 59, 67] or external MVS depth information [10, 31], while ours does not. Further, they still require several per-scene fine-tuning hours [10] or fail to achieve NeRF quality in the full input-view setup [6, 59, 67]. Most recently, NeuRay [31] shows NeRF's quality with 40 minutes of per-scene training time in the lower-resolution setup. Under the same GPU spec, our method achieves NeRF's quality in 15 minutes per scene in the high-resolution setup and does not require depth guidance or cross-scene pre-training.

[…] MLP^(rgb) to learn c (see NeRF++ [68] for more discussion on the architecture design). In practice, positional encoding is applied to x and d, which enables the MLPs to learn high-frequency details from the low-dimensional input [53]. For the output activation, Sigmoid is applied on c; ReLU or Softplus is applied on σ (see Mip-NeRF [1] for more discussion on output activation).

To render the color of a pixel, Ĉ(r), we cast the ray r from the camera center through the pixel; K points are then sampled on r between the pre-defined near and far planes; the K ordered sampled points are then used to query for their densities and colors {(σ_i, c_i)}_{i=1}^K (MLPs are queried in NeRF). Finally, the K queried results are accumulated into a single color with the volume rendering quadrature, in accordance with the optical model given by Max [34]:

$$\hat{C}(\mathbf{r}) = \Big(\sum_{i=1}^{K} T_i\,\alpha_i\,\mathbf{c}_i\Big) + T_{K+1}\,\mathbf{c}_{\mathrm{bg}}, \qquad (2a)$$
$$\alpha_i = \mathrm{alpha}(\sigma_i, \delta_i) = 1 - \exp(-\sigma_i \delta_i), \qquad (2b)$$
$$T_i = \prod_{j=1}^{i-1} (1 - \alpha_j), \qquad (2c)$$
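As a concrete reference for Eq. (2), the sketch below accumulates K pre-queried samples along one ray into a pixel color in NumPy. The array names (`sigma`, `c`, `delta`, `c_bg`) are illustrative choices for this sketch rather than identifiers from the authors' released code.

```python
import numpy as np

def render_ray(sigma, c, delta, c_bg):
    """Volume rendering quadrature of Eq. (2a)-(2c).

    sigma: (K,)  densities of the K ordered samples on the ray
    c:     (K,3) per-sample colors
    delta: (K,)  distances between adjacent samples
    c_bg:  (3,)  background color weighted by T_{K+1}
    """
    alpha = 1.0 - np.exp(-sigma * delta)                  # Eq. (2b)
    T = np.concatenate([[1.0], np.cumprod(1.0 - alpha)])  # T_1..T_{K+1}, Eq. (2c)
    weights = T[:-1] * alpha                              # T_i * alpha_i
    return (weights[:, None] * c).sum(0) + T[-1] * c_bg   # Eq. (2a)
```

With T_1 = 1 and T_{K+1} equal to the product of all (1 − α_j), the background term only contributes where the ray is not yet saturated by foreground samples.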
Figure (panel a): visual comparison of image fitting results under grid resolution (H/5) × (W/5). The first row shows the results of pre-, in-, and post-activation; the second row shows their per-pixel absolute differences to the target image.
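To make the comparison above concrete, here is a minimal 1D sketch of one plausible reading of the three orderings (the defining equations from Sec. 4 are not reproduced in this excerpt): pre-activation interpolates already-activated alpha values, in-activation interpolates activated densities, and post-activation interpolates the raw grid values and activates afterwards. The plain Softplus activation, the step size, and the toy grid values are assumptions of this sketch.

```python
import numpy as np

DELTA = 0.5  # sampling step size used in the alpha mapping; illustrative

def softplus(x):
    return np.log1p(np.exp(x))

def alpha(density):
    return 1.0 - np.exp(-density * DELTA)   # Eq. (2b) with step size DELTA

def lerp(v0, v1, t):
    return (1.0 - t) * v0 + t * v1

# Two neighbouring raw density values of one grid cell (1D for readability)
# and query positions t sweeping across the cell.
v0, v1 = -20.0, 20.0
t = np.linspace(0.0, 1.0, 11)

alpha_pre  = lerp(alpha(softplus(v0)), alpha(softplus(v1)), t)  # interpolate alphas
alpha_in   = alpha(lerp(softplus(v0), softplus(v1), t))         # interpolate densities
alpha_post = alpha(softplus(lerp(v0, v1, t)))                   # interpolate raw values

# alpha_pre varies linearly across the whole cell, whereas alpha_post stays
# near 0 on one side of the cell and near 1 on the other, i.e., it can place
# a sharp boundary inside a single cell.
```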
5.1. Coarse geometry searching

Typically, a scene is dominated by free space (i.e., unoccupied space). Motivated by this fact, we aim to efficiently find the coarse 3D areas of interest before reconstructing the fine details and view-dependent effects that require more computation resources. We can thus greatly reduce the number of queried points on each ray in the later fine stage.

Coarse scene representation. We use a coarse density voxel grid V^(density)(c) ∈ R^(1×N_x^(c)×N_y^(c)×N_z^(c)) with post-activation (Eq. (6c)) to model the scene geometry. We only model view-invariant color emissions by V^(rgb)(c) ∈ R^(3×N_x^(c)×N_y^(c)×N_z^(c)) in the coarse stage. A query of any 3D point x is efficient with interpolation:

$$\ddot{\sigma}^{(c)} = \operatorname{interp}\big(\mathbf{x}, V^{(\mathrm{density})(c)}\big), \qquad (7a)$$
$$\mathbf{c}^{(c)} = \operatorname{interp}\big(\mathbf{x}, V^{(\mathrm{rgb})(c)}\big), \qquad (7b)$$

where c^(c) ∈ R³ is the view-invariant color and σ̈^(c) ∈ R is the raw volume density.

Coarse voxels allocation. We first find a bounding box (BBox) tightly enclosing the camera frustums of the training views (see the red BBox in Fig. 2c for an example). Our voxel grids are aligned with the BBox. Let L_x^(c), L_y^(c), L_z^(c) be the lengths of the BBox and M^(c) be the hyperparameter for the expected total number of voxels in the coarse stage. The voxel size is s^(c) = (L_x^(c) · L_y^(c) · L_z^(c) / M^(c))^(1/3), so there are N_x^(c), N_y^(c), N_z^(c) = ⌊L_x^(c)/s^(c)⌋, ⌊L_y^(c)/s^(c)⌋, ⌊L_z^(c)/s^(c)⌋ voxels on each side of the BBox.

Coarse-stage points sampling. On a pixel-rendering ray, we sample query points as

$$\mathbf{x}_0 = \mathbf{o} + t^{(\mathrm{near})}\,\mathbf{d}, \qquad (8a)$$

[…]

Prior 1: low-density initialization. At the start of training, the importance of points far from a camera is down-weighted due to the accumulated transmittance term in Eq. (2c). As a result, the coarse density voxel grid V^(density)(c) could be accidentally trapped into a suboptimal "cloudy" geometry with higher densities at the cameras' near planes. We thus have to initialize V^(density)(c) more carefully to ensure that all sampled points on the rays are visible to the cameras at the beginning, i.e., that the accumulated transmittance rates T_i in Eq. (2c) are close to 1. In practice, we initialize all grid values in V^(density)(c) to 0 and set the bias term in Eq. (5) to

$$b = \log\Big(\big(1 - \alpha^{(\mathrm{init})(c)}\big)^{-\frac{1}{s^{(c)}}} - 1\Big), \qquad (9)$$

where α^(init)(c) is a hyperparameter. Thereby, the accumulated transmittance T_i is decayed by 1 − α^(init)(c) ≈ 1 for a ray that traces forward a distance of one voxel size s^(c). See the supplementary material for the derivation and proof.

Prior 2: view-count-based learning rate. There could be some voxels visible to too few training views in real-world capturing, while we prefer a surface that is consistent across many views instead of a surface that can only explain a few views. In practice, we set different learning rates for different grid points in V^(density)(c). For each grid point indexed by j, we count the number of training views n_j to which point j is visible, and then scale its base learning rate by n_j / n_max, where n_max is the maximum view count over all grid points.

Training objective for coarse representation. The scene representation is reconstructed by minimizing the mean squared error between the rendered and observed colors. To regularize the reconstruction, we mainly use a background entropy loss to encourage the accumulated alpha values to concentrate on either background or foreground. Please refer to the supplementary material for more detail.

5.2. Fine detail reconstruction

Given the optimized coarse geometry V^(density)(c) from Sec. 5.1, we can now focus on a smaller subspace to reconstruct the surface details and view-dependent effects. The optimized V^(density)(c) is frozen in this stage.

Fine scene representation. In the fine stage, we use a higher-resolution density voxel grid V^(density)(f) ∈ R^(1×N_x^(f)×N_y^(f)×N_z^(f)) and model the view-dependent color emission by a hybrid representation: i) a feature voxel grid V^(feat)(f) ∈ R^(D×N_x^(f)×N_y^(f)×N_z^(f)), where D is a hyperparameter for the feature-space dimension, and ii) a shallow MLP parameterized by Θ. Finally, queries of a 3D point x and viewing direction d are performed by

$$\ddot{\sigma}^{(f)} = \operatorname{interp}\big(\mathbf{x}, V^{(\mathrm{density})(f)}\big), \qquad (10a)$$
$$\mathbf{c}^{(f)} = \mathrm{MLP}^{(\mathrm{rgb})}_{\Theta}\big(\operatorname{interp}(\mathbf{x}, V^{(\mathrm{feat})(f)}), \mathbf{x}, \mathbf{d}\big), \qquad (10b)$$

where c^(f) ∈ R³ is the view-dependent color emission and σ̈^(f) ∈ R is the raw volume density in the fine stage. Positional embedding [37] is applied to x and d for the MLP^(rgb)_Θ.
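To make the coarse-stage setup concrete, the sketch below computes the voxel size of Sec. 5.1, the bias b of Eq. (9), and the per-voxel learning-rate scaling of Prior 2 in NumPy. It assumes the raw density is activated with a Softplus shifted by b (Eq. (5) itself is not reproduced in this excerpt), and the example BBox side lengths are made up for illustration.

```python
import numpy as np

def allocate_grid(Lx, Ly, Lz, M):
    """Voxel size and per-axis resolution for a BBox with side lengths
    (Lx, Ly, Lz) and an expected total voxel count M (Sec. 5.1)."""
    s = (Lx * Ly * Lz / M) ** (1.0 / 3.0)
    return s, (int(Lx // s), int(Ly // s), int(Lz // s))

def density_bias(alpha_init, s):
    """Bias b of Eq. (9): with all raw grid values at 0 and
    sigma = softplus(0 + b), stepping one voxel size s decays the
    accumulated transmittance by exactly (1 - alpha_init)."""
    return np.log((1.0 - alpha_init) ** (-1.0 / s) - 1.0)

def per_voxel_lr(view_counts, base_lr=0.1):
    """Prior 2: scale each density grid point's base learning rate by
    n_j / n_max, so voxels seen by few training views update slowly."""
    counts = np.asarray(view_counts, dtype=np.float64)
    return base_lr * counts / counts.max()

# Coarse-stage example: M^(c) = 100^3 and alpha^(init)(c) = 1e-6 follow
# Sec. 6.2; the BBox side lengths (3.0, 3.0, 3.0) are illustrative only.
s_c, dims = allocate_grid(3.0, 3.0, 3.0, 100 ** 3)
b = density_bias(alpha_init=1e-6, s=s_c)

# Sanity check: alpha produced by one voxel-sized step at initialization.
sigma0 = np.log1p(np.exp(b))                  # softplus(0 + b)
assert np.isclose(1.0 - np.exp(-sigma0 * s_c), 1e-6)
```

In training, the values returned by per_voxel_lr would act as a per-parameter learning-rate mask over V^(density)(c); the other two helpers only set up the grid before optimization starts.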
Known free space and unknown space. A query point is in the known free space if the post-activated alpha value from the optimized V^(density)(c) is less than the threshold τ^(c). Otherwise, we say the query point is in the unknown space.

Fine voxels allocation. We densely query V^(density)(c) to find a BBox tightly enclosing the unknown space, where L_x^(f), L_y^(f), L_z^(f) are the lengths of the BBox. The only hyperparameter is the expected total number of voxels M^(f). The voxel size s^(f) and the grid dimensions N_x^(f), N_y^(f), N_z^(f) can then be derived automatically from M^(f) as per Sec. 5.1.

Progressive scaling. Inspired by NSVF [30], we progressively scale our voxel grids V^(density)(f) and V^(feat)(f). Let pg_ckpt be the set of checkpoint steps. The initial number of voxels is set to ⌊M^(f)/2^|pg_ckpt|⌋. When reaching a training step in pg_ckpt, we double the number of voxels, such that the number of voxels after the last checkpoint is M^(f); the voxel size s^(f) and the grid dimensions N_x^(f), N_y^(f), N_z^(f) are updated accordingly. Scaling our scene representation is much simpler: at each checkpoint, we resize our voxel grids, V^(density)(f) and V^(feat)(f), by trilinear interpolation.

Fine-stage points sampling. The points sampling strategy is similar to Eq. (8), with some modifications. We first filter out rays that do not intersect with the known free space. For each ray, we adjust the near and far bounds, t^(near) and t^(far), to the two endpoints of the ray-box intersection. We do not adjust t^(near) if x_0 is already inside the BBox.

Free space skipping. Querying V^(density)(c) (Eq. (7a)) is faster than querying V^(density)(f) (Eq. (10a)); querying for view-dependent colors (Eq. (10b)) is the slowest. We improve fine-stage efficiency by free space skipping in both training and testing. First, we skip sampled points that are in the known free space by checking the optimized V^(density)(c) (Eq. (7a)). Second, we further skip sampled points in the unknown space with low activated alpha values (thresholded at τ^(f)) by querying V^(density)(f) (Eq. (10a)). A code sketch of this two-level skipping is given below.

Training objective for fine representation. We use the same training losses as in the coarse stage, but with a smaller weight on the regularization losses, as we find this empirically leads to slightly better quality.

6. Experiments

6.1. Datasets

We evaluate our approach on five inward-facing datasets. Synthetic-NeRF [37] contains eight objects with realistic images synthesized by NeRF. Synthetic-NSVF [30] contains another eight objects synthesized by NSVF. Strictly following NeRF's and NSVF's setups, we set the image resolution to 800 × 800 pixels and let each scene have 100 views for training and 200 views for testing. BlendedMVS [65] is a synthetic MVS dataset with realistic ambient lighting from real image blending. We use a subset of four objects provided by NSVF. The image resolution is 768 × 576 pixels, and one-eighth of the images are held out for testing. Tanks&Temples [21] is a real-world dataset. We use a subset of five scenes provided by NSVF, each containing views captured by an inward-facing camera circling the scene. The image resolution is 1920 × 1080 pixels, and one-eighth of the images are held out for testing. The DeepVoxels [48] dataset contains four simple Lambertian objects. The image resolution is 512 × 512, and each scene has 479 views for training and 1000 views for testing.

6.2. Implementation details

We generally choose the same hyperparameters for all scenes. The expected numbers of voxels are set to M^(c) = 100³ and M^(f) = 160³ in the coarse and fine stages if not stated otherwise. The activated alpha values are initialized to α^(init)(c) = 10⁻⁶ in the coarse stage. We use a higher α^(init)(f) = 10⁻² as the query points are concentrated on the optimized coarse geometry in the fine stage. The point sampling step sizes are set to half of the voxel sizes, i.e., δ^(c) = 0.5·s^(c) and δ^(f) = 0.5·s^(f). The shallow MLP comprises two hidden layers with 128 channels. We use the Adam optimizer [20] with a batch size of 8,192 rays to optimize the coarse and fine scene representations for 10k and 20k iterations, respectively. The base learning rates are 0.1 for all voxel grids and 10⁻³ for the shallow MLP. Exponential learning rate decay is applied. See the supplementary material for detailed hyperparameter setups.

6.3. Comparisons

Quantitative evaluation on synthesized novel views. We first quantitatively compare the novel view synthesis results in Tab. 1. PSNR, SSIM [60], and LPIPS [69] are employed as evaluation metrics. Our model with M^(f) = 160³ voxels already outperforms the original NeRF [37] and the improved JaxNeRF [9] re-implementation. Besides, our results are also comparable to most of the recent methods, except JaxNeRF+ [9] and Mip-NeRF [1]. Moreover, our per-scene optimization only takes about 15 minutes, while all the methods listed after NeRF in Tab. 1 need quite a few hours per scene. We also show our model with M^(f) = 256³ voxels, which significantly improves our results under all metrics and achieves results more comparable to JaxNeRF+ and Mip-NeRF. We defer detailed comparisons on the much simpler DeepVoxels [48] dataset to the supplementary material, where we achieve 45.83 averaged PSNR and outperform NeRF's 40.15 and IBRNet's 42.93.

Training time comparisons. The key merit of our work is the significant improvement in convergence speed with NeRF-comparable quality. In Tab. 2, we show a training time comparison.
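As referenced in Sec. 5.2 above, here is a minimal NumPy sketch of the two-level free space skipping used during fine-stage sampling. The interpolation callables, thresholds, and step sizes (`tau_c`, `tau_f`, `delta_c`, `delta_f`) are illustrative placeholders, not values or identifiers from the released implementation.

```python
import numpy as np

def post_activated_alpha(interp_fn, pts, delta):
    """Trilinearly interpolate a raw density grid at pts, then apply
    Softplus and the alpha mapping of Eq. (2b) (post-activation)."""
    raw = interp_fn(pts)                      # (N,) raw densities
    sigma = np.log1p(np.exp(raw))             # softplus
    return 1.0 - np.exp(-sigma * delta)

def fine_stage_keep_mask(pts, coarse_interp, fine_interp,
                         tau_c=1e-4, tau_f=1e-4, delta_c=0.05, delta_f=0.025):
    """Boolean mask over sampled points that still need the expensive
    view-dependent color query of Eq. (10b)."""
    # Level 1: drop points in the known free space of the frozen coarse grid.
    keep = post_activated_alpha(coarse_interp, pts, delta_c) >= tau_c
    idx = np.where(keep)[0]
    # Level 2: drop unknown-space points whose fine-grid alpha is still low.
    fine_alpha = post_activated_alpha(fine_interp, pts[idx], delta_f)
    keep[idx[fine_alpha < tau_f]] = False
    return keep
```

Only the points that survive both tests go through the costly color query of Eq. (10b); the cheap coarse-grid test runs on every sampled point.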
Methods | Synthetic-NeRF (PSNR↑ / SSIM↑ / LPIPS↓) | Synthetic-NSVF (PSNR↑ / SSIM↑ / LPIPS↓) | BlendedMVS (PSNR↑ / SSIM↑ / LPIPS↓) | Tanks and Temples (PSNR↑ / SSIM↑ / LPIPS↓)
SRN [49] | 22.26 / 0.846 / 0.170vgg | 24.33 / 0.882 / 0.141alex | 20.51 / 0.770 / 0.294alex | 24.10 / 0.847 / 0.251alex
NV [32] | 26.05 / 0.893 / 0.160vgg | 25.83 / 0.892 / 0.124alex | 23.03 / 0.793 / 0.243alex | 23.70 / 0.834 / 0.260alex
NeRF [37] | 31.01 / 0.947 / 0.081vgg | 30.81 / 0.952 / 0.043alex | 24.15 / 0.828 / 0.192alex | 25.78 / 0.864 / 0.198alex
Improved visual quality from NeRF:
JaxNeRF [9] | 31.69 / 0.953 / 0.068vgg | - | - | 27.94 / 0.904 / 0.168vgg
JaxNeRF+ [9] | 33.00 / 0.962 / 0.038 | - | - | -
Mip-NeRF [1] | 33.09 / 0.961 / 0.043vgg | - | - | -
Improved test-time rendering speed (and visual quality) from NeRF:
AutoInt [29] | 25.55 / 0.911 / 0.170 | - | - | -
FastNeRF [15] | 29.97 / 0.941 / 0.053 | - | - | -
SNeRG [18] | 30.38 / 0.950 / 0.050 | - | - | -
KiloNeRF [43] | 31.00 / 0.95 / 0.03 | 33.37 / 0.97 / 0.02 | 27.39 / 0.92 / 0.06 | 28.41 / 0.91 / 0.09
PlenOctrees [66] | 31.71 / 0.958 / 0.053vgg | - | - | 27.99 / 0.917 / 0.131vgg
NSVF [30] | 31.75 / 0.953 / 0.047alex | 35.18 / 0.979 / 0.015alex | 26.89 / 0.898 / 0.114alex | 28.48 / 0.901 / 0.155alex
Improved convergence speed, test-time rendering speed, and visual quality from NeRF:
ours (M^(f)=160³) | 31.95 / 0.957 / 0.053vgg, 0.035alex | 35.08 / 0.975 / 0.033vgg, 0.019alex | 28.02 / 0.922 / 0.101vgg, 0.075alex | 28.41 / 0.911 / 0.155vgg, 0.148alex
ours (M^(f)=256³) | 32.80 / 0.961 / 0.045vgg, 0.027alex | 36.21 / 0.980 / 0.024vgg, 0.012alex | 28.64 / 0.933 / 0.081vgg, 0.052alex | 28.82 / 0.920 / 0.138vgg, 0.124alex

* The superscript denotes the pre-trained model used in LPIPS. The gray numbers in the original typeset table indicate that the code is unavailable or has an unconventional LPIPS implementation.
Table 1. Quantitative comparisons for novel view synthesis. Our method excels in convergence speed, i.e., 15 minutes per scene compared to many hours or days per scene for other methods. Besides, our rendering quality is better than the original NeRF [37] and the improved JaxNeRF [9] on the four datasets under all metrics. We also show comparable results to most of the recent methods.
[…] per scene, respectively. MVSNeRF [6], IBRNet [59], and NeuRay [31] also show less per-scene training time than NeRF, but with the additional cost of running a generalizable cross-scene pre-training. MVSNeRF [6], after pre-training, optimizes a scene in 15 minutes as well, but the PSNR is degraded to 28.14. IBRNet [59] shows worse PSNR and longer training time than ours. NeuRay [31] originally reports timings in the lower-resolution (NeuRay-Lo) setup, and we received the training time of the high-resolution (NeuRay-Hi) setup from the authors. NeuRay-Hi achieves 32.42 PSNR and requires 23 hours to train, while our method with M^(f) = 256³ voxels achieves a superior 32.80 in about 22 minutes. For the early-stopped NeuRay-Hi, unfortunately, only its training time is retained (the early-stopped NeuRay-Lo achieves NeRF-similar PSNR). NeuRay-Hi still needs 70 minutes to train with early stopping, while we only need 15 minutes to achieve NeRF-comparable quality and do not rely on generalizable pre-training or external depth information. Mip-NeRF [1] has a similar run-time to NeRF but with much better PSNRs, which also implies it can reach NeRF's PSNR with less training time. We train early-stopped Mip-NeRFs on our machine and report the averaged PSNR and training time. The early-stopped Mip-NeRF achieves 30.85 PSNR after 6 hours of training, while we achieve 31.95 PSNR in just 15 minutes.

Methods | PSNR↑ | Cross-scene pre-training | Per-scene training time
[…] | | |
Mip-NeRF [1]‡ | 30.85 | no need | 6 hrs (2080Ti)
ours (M^(f)=160³) | 31.95 | no need | 15 mins (2080Ti)
ours (M^(f)=256³) | 32.80 | no need | 22 mins (2080Ti)

† Uses external depth information.
‡ Our reproduction with early stopping on our machine.
Table 2. Training time comparisons. We take the training time and GPU specifications reported in previous works directly. A V100 GPU runs faster and has more memory than a 2080Ti GPU. Our method achieves good PSNR in significantly less per-scene optimization time.

Rendering speed comparisons. Improving test-time rendering speed is not the main focus of this work, but we still achieve a ∼45× speedup over NeRF: 0.64 seconds versus 29 seconds per 800 × 800 image on our machine.

Qualitative comparison. Fig. 5 shows our rendering results on challenging parts and compares them with the results (better than NeRF's) provided by PlenOctrees [66].
[Figure 5: qualitative comparisons of GT, Ours, and PlenOctree renderings on challenging parts.]

Interp. | Syn.-NeRF (PSNR↑ / ∆) | Syn.-NSVF (PSNR↑ / ∆) | BlendedMVS (PSNR↑ / ∆) | T&T (PSNR↑ / ∆)
Nearest | 28.61 / -2.77 | 28.86 / -6.22 | 25.49 / -2.48 | 26.39 / -1.27
Trilinear, pre- | 30.84 / -0.55 | 32.66 / -2.41 | 27.39 / -0.58 | 27.44 / -0.21
Trilinear, in- | 29.91 / -1.48 | 32.42 / -2.66 | 27.29 / -0.68 | 27.52 / -0.13
Trilinear, post- | 31.39 / - | 35.08 / - | 27.97 / - | 27.66 / -
References

[1] Jonathan T. Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P. Srinivasan. Mip-NeRF: A multiscale representation for anti-aliasing neural radiance fields. In ICCV, 2021. 1, 3, 4, 6, 7
[2] Sai Bi, Zexiang Xu, Pratul P. Srinivasan, Ben Mildenhall, Kalyan Sunkavalli, Milos Hasan, Yannick Hold-Geoffroy, David J. Kriegman, and Ravi Ramamoorthi. Neural reflectance fields for appearance acquisition. arXiv CS.CV 2106.01970, 2020. 2
[3] Mark Boss, Raphael Braun, Varun Jampani, Jonathan T. Barron, Ce Liu, and Hendrik P. A. Lensch. NeRD: Neural reflectance decomposition from image collections. In ICCV, 2021. 2
[4] Chris Buehler, Michael Bosse, Leonard McMillan, Steven J. Gortler, and Michael F. Cohen. Unstructured lumigraph rendering. In SIGGRAPH, 2001. 2
[5] Eric R. Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. Pi-GAN: Periodic implicit generative adversarial networks for 3d-aware image synthesis. In CVPR, 2021. 2
[6] Anpei Chen, Zexiang Xu, Fuqiang Zhao, Xiaoshuai Zhang, Fanbo Xiang, Jingyi Yu, and Hao Su. MVSNeRF: Fast generalizable radiance field reconstruction from multi-view stereo. In ICCV, 2021. 1, 3, 7
[7] Abe Davis, Marc Levoy, and Frédo Durand. Unstructured light fields. Comput. Graph. Forum, 2012. 2
[8] Paul E. Debevec, Camillo J. Taylor, and Jitendra Malik. Modeling and rendering architecture from photographs: A hybrid geometry- and image-based approach. In SIGGRAPH, 1996. 2
[9] Boyang Deng, Jonathan T. Barron, and Pratul P. Srinivasan. JaxNeRF: an efficient JAX implementation of NeRF, 2020. 6, 7
[10] Kangle Deng, Andrew Liu, Jun-Yan Zhu, and Deva Ramanan. Depth-supervised NeRF: Fewer views and faster training for free. arXiv CS.CV 2107.02791, 2021. 1, 3
[11] Helisa Dhamo, Keisuke Tateno, Iro Laina, Nassir Navab, and Federico Tombari. Peeking behind objects: Layered depth prediction from a single image. Pattern Recognit. Lett., 2019. 2
[12] John Flynn, Michael Broxton, Paul E. Debevec, Matthew DuVall, Graham Fyffe, Ryan S. Overbeck, Noah Snavely, and Richard Tucker. DeepView: View synthesis with learned gradient descent. In CVPR, 2019. 2
[13] Guy Gafni, Justus Thies, Michael Zollhöfer, and Matthias Nießner. Dynamic neural radiance fields for monocular 4d facial avatar reconstruction. In CVPR, 2021. 2
[14] Chen Gao, Ayush Saraf, Johannes Kopf, and Jia-Bin Huang. Dynamic view synthesis from dynamic monocular video. In ICCV, 2021. 2
[15] Stephan J. Garbin, Marek Kowalski, Matthew Johnson, Jamie Shotton, and Julien P. C. Valentin. FastNeRF: High-fidelity neural rendering at 200fps. In ICCV, 2021. 1, 2, 3, 7
[16] Steven J. Gortler, Radek Grzeszczuk, Richard Szeliski, and Michael F. Cohen. The lumigraph. In SIGGRAPH, 1996. 2
[17] Tong He, John P. Collomosse, Hailin Jin, and Stefano Soatto. DeepVoxels++: Enhancing the fidelity of novel view synthesis from 3d voxel embeddings. In Hiroshi Ishikawa, Cheng-Lin Liu, Tomás Pajdla, and Jianbo Shi, editors, ACCV, 2020. 2
[18] Peter Hedman, Pratul P. Srinivasan, Ben Mildenhall, Jonathan T. Barron, and Paul E. Debevec. Baking neural radiance fields for real-time view synthesis. In ICCV, 2021. 1, 2, 3, 5, 7
[19] Yoonwoo Jeong, Seokjun Ahn, Christopher Choy, Animashree Anandkumar, Minsu Cho, and Jaesik Park. Self-calibrating neural radiance fields. In ICCV, 2021. 2
[20] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015. 6
[21] Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and temples: benchmarking large-scale scene reconstruction. ACM Trans. Graph., 2017. 6
[22] Adam R. Kosiorek, Heiko Strathmann, Daniel Zoran, Pol Moreno, Rosalia Schneider, Sona Mokrá, and Danilo Jimenez Rezende. NeRF-VAE: A geometry aware 3d scene generative model. In ICML, 2021. 2
[23] Anat Levin and Frédo Durand. Linear view synthesis using a dimensionality gap light field prior. In CVPR, 2010. 2
[24] Marc Levoy and Pat Hanrahan. Light field rendering. In SIGGRAPH, 1996. 2
[25] Zhengqi Li, Simon Niklaus, Noah Snavely, and Oliver Wang. Neural scene flow fields for space-time view synthesis of dynamic scenes. In CVPR, 2021. 2
[26] Zhengqi Li, Wenqi Xian, Abe Davis, and Noah Snavely. Crowdsampling the plenoptic function. In ECCV, 2020. 2
[27] Chen-Hsuan Lin, Wei-Chiu Ma, Antonio Torralba, and Simon Lucey. BARF: bundle-adjusting neural radiance fields. In ICCV, 2021. 2
[28] Yen-Chen Lin, Pete Florence, Jonathan T. Barron, Alberto Rodriguez, Phillip Isola, and Tsung-Yi Lin. iNeRF: Inverting neural radiance fields for pose estimation. In IROS, 2021. 2
[29] David B. Lindell, Julien N. P. Martel, and Gordon Wetzstein. AutoInt: Automatic integration for fast neural volume rendering. In CVPR, 2021. 1, 7
[30] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields. In NeurIPS, 2020. 1, 2, 5, 6, 7
[31] Yuan Liu, Sida Peng, Lingjie Liu, Qianqian Wang, Peng Wang, Christian Theobalt, Xiaowei Zhou, and Wenping Wang. Neural rays for occlusion-aware image-based rendering. arXiv CS.CV 2107.13421, 2021. 1, 3, 7
[32] Stephen Lombardi, Tomas Simon, Jason M. Saragih, Gabriel Schwartz, Andreas M. Lehrmann, and Yaser Sheikh. Neural volumes: learning dynamic renderable volumes from images. ACM Trans. Graph., 2019. 2, 7
[33] Ricardo Martin-Brualla, Noha Radwan, Mehdi S. M. Sajjadi, Jonathan T. Barron, Alexey Dosovitskiy, and Daniel Duckworth. NeRF in the wild: Neural radiance fields for unconstrained photo collections. In CVPR, 2021. 2
[34] Nelson L. Max. Optical models for direct volume rendering. IEEE Trans. Vis. Comput. Graph., 1995. 3
[35] Quan Meng, Anpei Chen, Haimin Luo, Minye Wu, Hao Su, Lan Xu, Xuming He, and Jingyi Yu. GNeRF: GAN-based neural radiance field without posed camera. In ICCV, 2021. 2
[36] Ben Mildenhall, Pratul P. Srinivasan, Rodrigo Ortiz Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar. Local light field fusion: practical view synthesis with prescriptive sampling guidelines. ACM Trans. Graph., 2019. 2
[37] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020. 1, 2, 3, 5, 6, 7
[38] Atsuhiro Noguchi, Xiao Sun, Stephen Lin, and Tatsuya Harada. Neural articulated radiance field. In ICCV, 2021. 2
[39] Keunhong Park, Utkarsh Sinha, Jonathan T. Barron, Sofien Bouaziz, Dan B. Goldman, Steven M. Seitz, and Ricardo Martin-Brualla. Deformable neural radiance fields. In ICCV, 2021. 2
[40] Keunhong Park, Utkarsh Sinha, Peter Hedman, Jonathan T. Barron, Sofien Bouaziz, Dan B. Goldman, Ricardo Martin-Brualla, and Steven M. Seitz. HyperNeRF: A higher-dimensional representation for topologically varying neural radiance fields. arXiv CS.CV 2106.13228, 2021. 2
[41] Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-NeRF: Neural radiance fields for dynamic scenes. In CVPR, 2021. 2
[42] Daniel Rebain, Wei Jiang, Soroosh Yazdani, Ke Li, Kwang Moo Yi, and Andrea Tagliasacchi. DeRF: Decomposed radiance fields. In CVPR, 2021. 1
[43] Christian Reiser, Songyou Peng, Yiyi Liao, and Andreas Geiger. KiloNeRF: Speeding up neural radiance fields with thousands of tiny MLPs. In ICCV, 2021. 1, 3, 7
[44] Katja Schwarz, Yiyi Liao, Michael Niemeyer, and Andreas Geiger. GRAF: generative radiance fields for 3d-aware image synthesis. In NeurIPS, 2020. 2
[45] Jonathan Shade, Steven J. Gortler, Li-wei He, and Richard Szeliski. Layered depth images. In SIGGRAPH, 1998. 2
[46] Lixin Shi, Haitham Hassanieh, Abe Davis, Dina Katabi, and Frédo Durand. Light field reconstruction using sparsity in the continuous Fourier domain. ACM Trans. Graph., 2014. 2
[47] Meng-Li Shih, Shih-Yang Su, Johannes Kopf, and Jia-Bin Huang. 3d photography using context-aware layered depth inpainting. In CVPR, 2020. 2
[48] Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Nießner, Gordon Wetzstein, and Michael Zollhöfer. DeepVoxels: Learning persistent 3d feature embeddings. In CVPR, 2019. 2, 6
[49] Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. Scene representation networks: Continuous 3d-structure-aware neural scene representations. In NeurIPS, 2019. 7
[50] Pratul P. Srinivasan, Boyang Deng, Xiuming Zhang, Matthew Tancik, Ben Mildenhall, and Jonathan T. Barron. NeRV: Neural reflectance and visibility fields for relighting and view synthesis. In CVPR, 2021. 2
[51] Pratul P. Srinivasan, Richard Tucker, Jonathan T. Barron, Ravi Ramamoorthi, Ren Ng, and Noah Snavely. Pushing the boundaries of view extrapolation with multiplane images. In CVPR, 2019. 2
[52] Matthew Tancik, Ben Mildenhall, Terrance Wang, Divi Schmidt, Pratul P. Srinivasan, Jonathan T. Barron, and Ren Ng. Learned initializations for optimizing coordinate-based neural representations. In CVPR, 2021. 2
[53] Matthew Tancik, Pratul P. Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T. Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. In NeurIPS, 2020. 3
[54] Justus Thies, Michael Zollhöfer, and Matthias Nießner. Deferred neural rendering: image synthesis using neural textures. ACM Trans. Graph., 2019. 2
[55] Edgar Tretschk, Ayush Tewari, Vladislav Golyanik, Michael Zollhöfer, Christoph Lassner, and Christian Theobalt. Non-rigid neural radiance fields: Reconstruction and novel view synthesis of a deforming scene from monocular video. In ICCV, 2021. 2
[56] Richard Tucker and Noah Snavely. Single-view view synthesis with multiplane images. In CVPR, 2020. 2
[57] Shubham Tulsiani, Richard Tucker, and Noah Snavely. Layer-structured 3d scene inference via view synthesis. In ECCV, 2018. 2
[58] Michael Waechter, Nils Moehrle, and Michael Goesele. Let there be color! Large-scale texturing of 3d reconstructions. In ECCV, 2014. 2
[59] Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul P. Srinivasan, Howard Zhou, Jonathan T. Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas A. Funkhouser. IBRNet: Learning multi-view image-based rendering. In CVPR, 2021. 1, 3, 7
[60] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE TIP, 2004. 6
[61] Zirui Wang, Shangzhe Wu, Weidi Xie, Min Chen, and Victor Adrian Prisacariu. NeRF--: Neural radiance fields without known camera parameters. arXiv CS.CV 2102.07064, 2021. 2
[62] Suttisak Wizadwongsa, Pakkapon Phongthawee, Jiraphon Yenphraphai, and Supasorn Suwajanakorn. NeX: Real-time view synthesis with neural basis expansion. In CVPR, 2021. 2, 3
[63] Daniel N. Wood, Daniel I. Azuma, Ken Aldinger, Brian Curless, Tom Duchamp, David Salesin, and Werner Stuetzle. Surface light fields for 3d photography. In SIGGRAPH, 2000. 2
[64] Wenqi Xian, Jia-Bin Huang, Johannes Kopf, and Changil Kim. Space-time neural irradiance fields for free-viewpoint video. In CVPR, 2021. 2
[65] Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. BlendedMVS: A large-scale dataset for generalized multi-view stereo networks. In CVPR, 2020. 6
[66] Alex Yu, Ruilong Li, Matthew Tancik, Hao Li, Ren Ng, and Angjoo Kanazawa. PlenOctrees for real-time rendering of neural radiance fields. In ICCV, 2021. 1, 2, 3, 5, 7
[67] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelNeRF: Neural radiance fields from one or few images. In CVPR, 2021. 3
[68] Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen Koltun. NeRF++: Analyzing and improving neural radiance fields. arXiv CS.CV 2010.07492, 2020. 3
[69] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018. 6
[70] Xiuming Zhang, Pratul P. Srinivasan, Boyang Deng, Paul E. Debevec, William T. Freeman, and Jonathan T. Barron. NeRFactor: Neural factorization of shape and reflectance under an unknown illumination. arXiv CS.CV 2106.01970, 2021. 2
[71] Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: learning view synthesis using multiplane images. ACM Trans. Graph., 2018. 2