
Direct Voxel Grid Optimization:

Super-fast Convergence for Radiance Fields Reconstruction

Cheng Sun¹,²   Min Sun¹,³   Hwann-Tzong Chen¹,⁴

[email protected]   [email protected]   [email protected]

¹ National Tsing Hua University   ² ASUS AICS Department   ³ Joint Research Center for AI Technology and All Vista Healthcare   ⁴ Aeolus Robotics

Abstract

We present a super-fast convergence approach to reconstructing the per-scene radiance field from a set of images that capture the scene with known poses. This task, which is often applied to novel view synthesis, was recently revolutionized by Neural Radiance Field (NeRF) for its state-of-the-art quality and flexibility. However, NeRF and its variants require a lengthy training time ranging from hours to days for a single scene. In contrast, our approach achieves NeRF-comparable quality and converges rapidly from scratch in less than 15 minutes with a single GPU. We adopt a representation consisting of a density voxel grid for scene geometry and a feature voxel grid with a shallow network for complex view-dependent appearance. Modeling with explicit and discretized volume representations is not new, but we propose two simple yet non-trivial techniques that contribute to fast convergence speed and high-quality output. First, we introduce post-activation interpolation on voxel density, which is capable of producing sharp surfaces at lower grid resolution. Second, direct voxel density optimization is prone to suboptimal geometry solutions, so we robustify the optimization process by imposing several priors. Finally, evaluation on five inward-facing benchmarks shows that our method matches, if not surpasses, NeRF's quality, yet it only takes about 15 minutes to train from scratch for a new scene. Code: https://github.com/sunset1995/DirectVoxGO.

[Figure 1: (a) the synthesized novel view by our method at three training checkpoints, reaching 23.64 PSNR at 2.33 mins, 32.72 PSNR at 5.07 mins, and 34.22 PSNR at 13.72 mins; (b) the training curves of different methods on the Lego scene, with the training time of each method measured on our machine with a single NVIDIA RTX 2080 Ti GPU.]

Figure 1. Super-fast convergence by our method. The key to our speedup is to optimize the volume density modeled in a dense voxel grid directly. Note that our method needs neither a conversion step from any trained implicit model (e.g., NeRF) nor cross-scene pretraining, i.e., our voxel grid representation is directly and efficiently trained from scratch for each scene.

1. Introduction

Achieving free-viewpoint navigation of 3D objects or scenes from only a set of calibrated images as input is a demanding task. For instance, it enables online product showcases to provide an immersive user experience compared to static image demonstration. Recently, Neural Radiance Fields (NeRFs) [37] have emerged as powerful representations yielding state-of-the-art quality on this task.

Despite its effectiveness in representing scenes, NeRF is known to be hampered by the need for lengthy training time and its inefficiency in rendering new views. This makes NeRF infeasible for many application scenarios. Several follow-up methods [15, 18, 29, 30, 42, 43, 66] have shown significant FPS speedups in the testing phase, some of which even achieve real-time rendering. However, only a few methods show training-time speedups, and the improvements are either not comparable to ours [1, 10, 31] or lead to worse quality [6, 59]. On a single-GPU machine, several hours of per-scene optimization or a day of pretraining is typically required.

To reconstruct a volumetric scene representation from a set of images, NeRF uses a multilayer perceptron (MLP) to implicitly learn the mapping from a queried 3D point (with a viewing direction) to its colors and densities. The queried properties along a camera ray can then be accumulated into a pixel color by volume rendering techniques. Our work takes inspiration from the recent success [15, 18, 66] of using a classic voxel grid to explicitly store the scene properties,
which enables real-time rendering and shows good quality. However, their methods cannot train from scratch and need a conversion step from the trained implicit model, which causes a bottleneck in the training time.

The key to our speedup is to use a dense voxel grid to directly model the 3D geometry (volume density). Developing an elaborate strategy for view-dependent colors is not in the main scope of this paper, and we simply use a hybrid representation (feature grid with shallow MLP) for colors. Directly optimizing the density voxel grid leads to super-fast convergence but is prone to suboptimal solutions, where the method allocates "cloud" in free space and tries to fit the photometric loss with the cloud instead of searching for a geometry with better multi-view consistency. Our solution to this problem is simple and effective. First, we initialize the density voxel grid to yield opacities very close to zero everywhere to avoid the geometry solutions being biased toward the cameras' near planes. Second, we give a lower learning rate to voxels visible to fewer views, which can avoid redundant voxels that are allocated just for explaining the observations from a small number of views. We show that the proposed solutions successfully avoid the suboptimal geometry and work well on the five datasets.

Using the voxel grid to model volume density still faces a challenge in scalability. For parsimony, our approach automatically finds a BBox that tightly encloses the volume of interest to allocate the voxel grids. Besides, we propose post-activation—applying all the activation functions after trilinearly interpolating the density voxel grid. Previous work either interpolates the voxel grid for the activated opacity or uses nearest-neighbor interpolation, which results in a smooth surface in each grid cell. Conversely, we prove mathematically and empirically that the proposed post-activation can model (beyond) a sharp linear surface within a single grid cell. As a result, we can use fewer voxels to achieve better quality—our method with 160³ dense voxels already outperforms NeRF in most cases.

In summary, we have two main technical contributions. First, we implement two priors to avoid suboptimal geometry in direct voxel density optimization. Second, we propose the post-activated voxel-grid interpolation, which enables sharp boundary modeling at lower grid resolution. The resulting key merits of this work are highlighted as follows:
• Our convergence speed is about two orders of magnitude faster than NeRF—reducing training time from 10–20 hours to 15 minutes on our machine with a single NVIDIA RTX 2080 Ti GPU, as shown in Fig. 1.
• We achieve visual quality comparable to NeRF at a rendering speed that is about 45× faster.
• Our method does not need cross-scene pretraining.
• Our grid resolution is about 160³, while the grid resolution in previous work [15, 18, 66] ranges from 512³ to 1300³ to achieve NeRF-comparable quality.

2. Related work

Representations for novel view synthesis. Image synthesis from novel viewpoints given a set of images capturing the scene is a long-standing task with rich studies. Previous work has presented several scene representations reconstructed from the input images to synthesize the unobserved viewpoints. Lumigraph [4, 16] and light field representations [7, 23, 24, 46] directly synthesize novel views by interpolating the input images but require very dense scene capture. Layered depth images [11, 45, 47, 57] work for sparse input views but rely on depth maps or estimated depth with sacrificed quality. Mesh-based representations [8, 54, 58, 63] can run in real time but have a hard time with gradient-based optimization unless template meshes are provided. Recent approaches employ 2D/3D convolutional neural networks (CNNs) to estimate multiplane images (MPIs) [12, 26, 36, 51, 56, 71] for forward-facing captures or voxel grids [17, 32, 48] for inward-facing captures. Our method uses gradient descent to optimize voxel grids directly and does not rely on neural networks to predict the grid values, and we still outperform the previous CNN-based works [17, 32, 48] by a large margin.

Neural radiance fields. Recently, NeRF [37] has stood out as a prevalent method for novel view synthesis with rapid progress; it takes a moderate number of input images with known camera poses. Unlike traditional explicit and discretized volumetric representations (e.g., voxel grids and MPIs), NeRF uses coordinate-based multilayer perceptrons (MLPs) as an implicit and continuous volumetric representation. NeRF achieves appealing quality and has good flexibility, with many follow-up extensions to various setups, e.g., relighting [2, 3, 50, 70], deformation [13, 38–40, 55], self-calibration [19, 27, 28, 35, 61], meta-learning [52], dynamic scene modeling [14, 25, 33, 41, 64], and generative modeling [5, 22, 44]. Nevertheless, NeRF has the unfavorable limitations of lengthy training progress and slow rendering speed. In this work, we mainly follow NeRF's original setup, while our method optimizes the volume density explicitly encoded in a voxel grid to speed up both training and testing by a large margin with comparable quality.

Hybrid volumetric representations. To combine NeRF's implicit representation and traditional grid representations, the coordinate-based MLP is extended to also condition on the local features in the grid. Recently, hybrid voxel [18, 30] and MPI [62] representations have shown success in fast rendering speed and result quality. We use a hybrid representation to model view-dependent color as well.

Fast NeRF rendering. NSVF [30] uses an octree in its hybrid representation to avoid redundant MLP queries in free
[Figure 2: overview diagram with panels (a) Volume rendering (Sec. 3), (b) Our scene representation (Sec. 5.2), (c) Coarse geometry searching (Sec. 5.1), and (d) Fine detail reconstruction (Sec. 5.2); rendered pixels are accumulated by Eq. (2), supervised by the observed pixels via the photometric loss (Eq. (3)) and the priors, using post-activation (Sec. 4), a tight bbox found in the coarse stage, free-space skipping, and a feature grid with a shallow MLP queried on the 3D position and viewing direction to render novel views.]

Figure 2. Approach overview. We first review NeRF in Sec. 3. In Sec. 4, we present a novel post-activated density voxel grid to support sharp surface modeling in lower grid resolutions. In Sec. 5, we show our approach to the reconstruction of radiance field with super-fast convergence, where we first find a coarse geometry in Sec. 5.1 and then reconstruct the fine details and view-dependent effects in Sec. 5.2.

space. However, NSVF still needs many training hours due to the deep MLP in its representation. Recent methods further use thousands of tiny MLPs [43] or explicit volumetric representations [15, 18, 62, 66] to achieve real-time rendering. Unfortunately, gradient-based optimization is not directly applicable to their methods due to their topological data structures or the lack of priors. As a result, these methods [15, 18, 43, 62, 66] still need a conversion step from a trained implicit model (e.g., NeRF) to their final representation that supports real-time rendering. Their training time is still burdened by the lengthy implicit model optimization.

Fast NeRF convergence. Recent works that focus on the fewer-input-views setup also bring faster convergence as a side benefit. These methods rely on generalizable pre-training [6, 59, 67] or external MVS depth information [10, 31], while ours does not. Further, they still require several per-scene fine-tuning hours [10] or fail to achieve NeRF quality in the full input-view setup [6, 59, 67]. Most recently, NeuRay [31] shows NeRF's quality with 40 minutes per-scene training time in the lower-resolution setup. Under the same GPU spec, our method achieves NeRF's quality in 15 minutes per scene on the high-resolution setup and does not require depth guidance and cross-scene pre-training.

3. Preliminaries

To represent a 3D scene for novel view synthesis, Neural Radiance Fields (NeRFs) [37] employ multilayer perceptron (MLP) networks to map a 3D position x and a viewing direction d to the corresponding density σ and view-dependent color emission c:

    (\sigma, e) = \mathrm{MLP}^{(\mathrm{pos})}(x) ,    (1a)
    c = \mathrm{MLP}^{(\mathrm{rgb})}(e, d) ,    (1b)

where the learnable MLP parameters are omitted, and e is an intermediate embedding to help the much shallower MLP^(rgb) to learn c (see NeRF++ [68] for more discussions on the architecture design). In practice, positional encoding is applied to x and d, which enables the MLPs to learn the high-frequency details from low-dimensional input [53]. For output activation, Sigmoid is applied on c; ReLU or Softplus is applied on σ (see Mip-NeRF [1] for more discussion on output activation).

To render the color of a pixel Ĉ(r), we cast the ray r from the camera center through the pixel; K points are then sampled on r between the pre-defined near and far planes; the K ordered sampled points are then used to query for their densities and colors {(σ_i, c_i)}_{i=1}^{K} (MLPs are queried in NeRF). Finally, the K queried results are accumulated into a single color with the volume rendering quadrature in accordance with the optical model given by Max [34]:

    \hat{C}(r) = \Big( \sum_{i=1}^{K} T_i \alpha_i c_i \Big) + T_{K+1} c_{\mathrm{bg}} ,    (2a)
    \alpha_i = \mathrm{alpha}(\sigma_i, \delta_i) = 1 - \exp(-\sigma_i \delta_i) ,    (2b)
    T_i = \prod_{j=1}^{i-1} (1 - \alpha_j) ,    (2c)

where α_i is the probability of termination at the point i; T_i is the accumulated transmittance from the near plane to point i; δ_i is the distance to the adjacent sampled point; and c_bg is a pre-defined background color.

Given the training images with known poses, the NeRF model is trained by minimizing the photometric MSE between the observed pixel color C(r) and the rendered color Ĉ(r):

    L_{\mathrm{photo}} = \frac{1}{|R|} \sum_{r \in R} \big\| \hat{C}(r) - C(r) \big\|_2^2 ,    (3)

where R is the set of rays in a sampled mini-batch.
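To make the quadrature concrete, here is a minimal PyTorch sketch of Eqs. (2) and (3) (our own illustration, not the released DirectVoxGO code): given per-point activated densities and colors sampled along a batch of rays, it computes the alpha values, the accumulated transmittance, the composited pixel colors, and the photometric MSE. Tensor shapes and helper names are illustrative assumptions.

```python
import torch

def volume_render(sigma, color, dists, c_bg=1.0):
    """Volume rendering quadrature of Eq. (2).

    sigma: (R, K) densities after the density activation
    color: (R, K, 3) per-point color emissions
    dists: (R, K) distances delta_i to the adjacent sampled point
    c_bg:  pre-defined background color
    """
    alpha = 1.0 - torch.exp(-sigma * dists)                        # Eq. (2b)
    # T_i = prod_{j<i}(1 - alpha_j); prepend ones so that T_1 = 1   (Eq. (2c))
    T = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha], dim=1), dim=1)
    weights = T[:, :-1] * alpha                                     # T_i * alpha_i
    rgb = (weights.unsqueeze(-1) * color).sum(dim=1)                # Eq. (2a)
    rgb = rgb + T[:, -1:] * c_bg                                    # + T_{K+1} * c_bg
    return rgb

def photometric_loss(rgb_pred, rgb_gt):
    """Eq. (3): mean squared error over the sampled mini-batch of rays R."""
    return ((rgb_pred - rgb_gt) ** 2).sum(dim=-1).mean()
```

In NeRF the per-point (σ_i, c_i) come from MLP queries; in the rest of this paper they come from direct voxel-grid queries.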
Figure 3. A single grid cell with post-activation is capable of modeling sharp linear surfaces. Left: We depict the toy task for a 2D grid cell, where a grid cell is optimized for the linear surface (decision boundary) across it. Right: Each column shows an example task for three different methods. The results show that a single grid cell with post-activation (Eq. (6c)) is adequate to recover faithfully the linear surface. Conversely, pre-activation (Eq. (6a)) and in-activation (Eq. (6b)) fail to accomplish the tasks as they can only fit into smooth results, and thus would require more grid cells to recover the surface detail. See supplementary material for the mathematical proof.

(a) Visual comparison of image fitting results under grid resolution (H/5)×(W/5). The first row is the results of pre-, in-, and post-activation. The second row is their per-pixel absolute difference to the target image.
(b) PSNRs achieved by pre-, in-, and post-activation under different grid strides. A grid stride s means that the grid resolution is (H/s)×(W/s). The black dashed line highlights that post-activation with stride ≈ 8.5 can achieve the same PSNR as pre-activation with stride 2 in this example.

Figure 4. Toy example on image fitting. The target 2D image is binary to imitate the scenario that most of the 3D space is either occupied or free. The objective is to reconstruct the target image by a low-resolution 2D grid. In each optimization step, the tunable 2D grid is queried by interpolation with pre-activation (Eq. (6a)), in-activation (Eq. (6b)), or post-activation (Eq. (6c)) to minimize the mean squared error to the target image. The result reveals that the post-activation can produce sharp boundaries even with low grid resolution (Fig. 4a) and is much better than the other two under various grid resolutions (Fig. 4b). This motivates us to model the 3D geometry directly via voxel grids with post-activation.

4. Post-activated density voxel grid

Voxel-grid representation. A voxel-grid representation models the modalities of interest (e.g., density, color, or feature) explicitly in its grid cells. Such an explicit scene representation is efficient to query for any 3D position via interpolation:

    \mathrm{interp}(x, V) : \big( \mathbb{R}^3, \mathbb{R}^{C \times N_x \times N_y \times N_z} \big) \to \mathbb{R}^{C} ,    (4)

where x is the queried 3D point, V is the voxel grid, C is the dimension of the modality, and N_x · N_y · N_z is the total number of voxels. Trilinear interpolation is applied if not specified otherwise.

Density voxel grid for volume rendering. The density voxel grid, V^(density), is a special case with C = 1, which stores the density values for volume rendering (Eq. (2)). We use σ̈ ∈ R to denote the raw voxel density before applying the density activation (i.e., a mapping of R → R≥0). In this work, we use the shifted softplus mentioned in Mip-NeRF [1] as the density activation:

    \sigma = \mathrm{softplus}(\ddot{\sigma}) = \log\big( 1 + \exp(\ddot{\sigma} + b) \big) ,    (5)

where the shift b is a hyperparameter. Using softplus instead of ReLU is crucial to optimize voxel density directly, as it is irreparable when a voxel is falsely set to a negative value with ReLU as the density activation. Conversely, softplus allows us to explore density very close to 0.

Sharp decision boundary via post-activation. The interpolated voxel density is processed by the softplus (Eq. (5)) and alpha (Eq. (2b)) functions sequentially for volume rendering. We consider three different orderings—pre-activation, in-activation, and post-activation—of plugging in the trilinear interpolation and performing the activation, given a queried 3D point x:

    \alpha^{(\mathrm{pre})} = \mathrm{interp}\big( x, \mathrm{alpha}(\mathrm{softplus}(V^{(\mathrm{density})})) \big) ,    (6a)
    \alpha^{(\mathrm{in})} = \mathrm{alpha}\big( \mathrm{interp}(x, \mathrm{softplus}(V^{(\mathrm{density})})) \big) ,    (6b)
    \alpha^{(\mathrm{post})} = \mathrm{alpha}\big( \mathrm{softplus}(\mathrm{interp}(x, V^{(\mathrm{density})})) \big) .    (6c)

The input δ to the function alpha (Eq. (2b)) is omitted for simplicity. We show that the post-activation, i.e., applying all the non-linear activation after the trilinear interpolation, is capable of producing sharp surfaces (decision boundaries) with much fewer grid cells. In Fig. 3, we use a 2D grid cell as an example to show that a grid cell with post-activation can produce a sharp linear boundary, while pre- and in-activation can only produce smooth results and thus require more cells for the surface detail. In Fig. 4, we further use binary image regression as a toy example to compare their capability, which also shows that post-activation can achieve a much better efficiency in grid cell usage.
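To make the three orderings in Eq. (6) concrete, the following is a minimal PyTorch sketch (not the released DirectVoxGO code) that trilinearly interpolates a density voxel grid and applies the shifted softplus (Eq. (5)) and alpha (Eq. (2b)) either before or after the interpolation. The helper names, the grid memory layout, and the fixed step size delta are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def trilinear(grid, xyz):
    """Trilinearly interpolate a voxel grid.

    grid: (C, Nx, Ny, Nz) voxel values
    xyz:  (P, 3) query points, each coordinate already normalized to [-1, 1]
    returns: (P, C) interpolated values
    """
    # grid_sample takes (N, C, D, H, W) volumes and (x, y, z) coordinates that
    # index (W, H, D), so the grid is fed as (1, C, Nz, Ny, Nx).
    vol = grid.permute(0, 3, 2, 1).unsqueeze(0)
    coords = xyz.view(1, -1, 1, 1, 3)
    out = F.grid_sample(vol, coords, mode='bilinear', align_corners=True)
    return out.view(grid.shape[0], -1).t()

def query_alpha(v_density, xyz, delta, b=0.0, order='post'):
    """Eq. (6): query opacity with pre-, in-, or post-activation.

    v_density: (1, Nx, Ny, Nz) raw densities; b is the softplus shift of Eq. (5).
    """
    act = lambda raw: 1.0 - torch.exp(-F.softplus(raw + b) * delta)  # Eq. (5)+(2b)
    if order == 'pre':    # Eq. (6a): interpolate already-activated alphas
        return trilinear(act(v_density), xyz)[:, 0]
    if order == 'in':     # Eq. (6b): interpolate activated densities, then alpha
        sigma = trilinear(F.softplus(v_density + b), xyz)[:, 0]
        return 1.0 - torch.exp(-sigma * delta)
    # Eq. (6c): interpolate the raw densities, then apply every activation
    return act(trilinear(v_density, xyz)[:, 0])
```

Only the 'post' branch applies the non-linearities to a signal that still varies linearly inside a cell, so a steep softplus/alpha can turn that linear ramp into a nearly binary transition—the sharp boundary illustrated in Figs. 3 and 4.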
5. Fast and direct voxel grid optimization

We depict an overview of our approach in Fig. 2. In Sec. 5.1, we first search for the coarse geometry of a scene. In Sec. 5.2, we then reconstruct the fine detail, including view-dependent effects. Hereinafter we use superscripts (c) and (f) to denote variables in the coarse and fine stages.

5.1. Coarse geometry searching

Typically, a scene is dominated by free space (i.e., unoccupied space). Motivated by this fact, we aim to efficiently find the coarse 3D areas of interest before reconstructing the fine detail and view-dependent effects, which require more computation resources. We can thus greatly reduce the number of queried points on each ray in the later fine stage.

Coarse scene representation. We use a coarse density voxel grid V^(density)(c) ∈ R^{1×Nx(c)×Ny(c)×Nz(c)} with post-activation (Eq. (6c)) to model the scene geometry. We only model view-invariant color emissions by V^(rgb)(c) ∈ R^{3×Nx(c)×Ny(c)×Nz(c)} in the coarse stage. A query of any 3D point x is efficient with interpolation:

    \ddot{\sigma}^{(c)} = \mathrm{interp}\big( x, V^{(\mathrm{density})(c)} \big) ,    (7a)
    c^{(c)} = \mathrm{interp}\big( x, V^{(\mathrm{rgb})(c)} \big) ,    (7b)

where c^(c) ∈ R³ is the view-invariant color and σ̈^(c) ∈ R is the raw volume density.

Coarse voxels allocation. We first find a bounding box (BBox) tightly enclosing the camera frustums of the training views (see the red BBox in Fig. 2c for an example). Our voxel grids are aligned with the BBox. Let Lx(c), Ly(c), Lz(c) be the lengths of the BBox and M(c) be the hyperparameter for the expected total number of voxels in the coarse stage. The voxel size is

    s^{(c)} = \sqrt[3]{ L_x^{(c)} \cdot L_y^{(c)} \cdot L_z^{(c)} / M^{(c)} } ,

so there are N_x^{(c)}, N_y^{(c)}, N_z^{(c)} = \lfloor L_x^{(c)}/s^{(c)} \rfloor, \lfloor L_y^{(c)}/s^{(c)} \rfloor, \lfloor L_z^{(c)}/s^{(c)} \rfloor voxels on each side of the BBox.

Coarse-stage points sampling. On a pixel-rendering ray, we sample query points as

    x_0 = o + t^{(\mathrm{near})} d ,    (8a)
    x_i = x_0 + i \cdot \delta^{(c)} \cdot \frac{d}{\| d \|_2} ,    (8b)

where o is the camera center, d is the ray-casting direction, t(near) is the camera near bound, and δ(c) is a hyperparameter for the step size that can be adaptively chosen according to the voxel size s(c). The query index i ranges from 1 to \lceil t^{(\mathrm{far})} \cdot \| d \|_2 / \delta^{(c)} \rceil, where t(far) is the camera far bound, so the last sampled point stops near the far plane.

Prior 1: low-density initialization. At the start of training, the importance of points far from a camera is down-weighted due to the accumulated transmittance term in Eq. (2c). As a result, the coarse density voxel grid V^(density)(c) could be accidentally trapped in a suboptimal "cloudy" geometry with higher densities at the cameras' near planes. We thus have to initialize V^(density)(c) more carefully to ensure that all sampled points on rays are visible to the cameras at the beginning, i.e., the accumulated transmittance rates T_i in Eq. (2c) are close to 1. In practice, we initialize all grid values in V^(density)(c) to 0 and set the bias term in Eq. (5) to

    b = \log\Big( \big( 1 - \alpha^{(\mathrm{init})(c)} \big)^{-\frac{1}{s^{(c)}}} - 1 \Big) ,    (9)

where α(init)(c) is a hyperparameter. Thereby, the accumulated transmittance T_i is decayed by a factor of 1 − α(init)(c) ≈ 1 for a ray that traces forward a distance of one voxel size s(c). See supplementary material for the derivation and proof.

Prior 2: view-count-based learning rate. There could be some voxels visible to too few training views in real-world capturing, while we prefer a surface with consistency in many views over a surface that can only explain a few views. In practice, we set different learning rates for different grid points in V^(density)(c). For each grid point indexed by j, we count the number of training views n_j to which point j is visible, and then scale its base learning rate by n_j / n_max, where n_max is the maximum view count over all grid points.

Training objective for coarse representation. The scene representation is reconstructed by minimizing the mean squared error between the rendered and observed colors. To regularize the reconstruction, we mainly use a background entropy loss to encourage the accumulated alpha values to concentrate on the background or the foreground. Please refer to the supplementary material for more detail.
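A minimal sketch (our own illustration, not the authors' released code) of the coarse-stage recipe above: the grid resolution follows from the BBox lengths and the voxel budget M(c), the softplus shift b is set from α(init)(c) via Eq. (9) so that a zero-initialized grid starts almost transparent, and each grid point's learning rate is scaled by how many training views see it. Counting the visible views (projecting grid points into the training cameras) is omitted, and names such as view_counts are placeholders.

```python
import math
import torch

def allocate_grid(box_lengths, num_voxels):
    """Voxel size s and per-axis resolution (Nx, Ny, Nz) from the BBox side
    lengths (Lx, Ly, Lz) and the expected total number of voxels M."""
    lx, ly, lz = box_lengths
    s = (lx * ly * lz / num_voxels) ** (1.0 / 3.0)
    return s, (int(lx // s), int(ly // s), int(lz // s))

def softplus_shift(alpha_init, voxel_size):
    """Eq. (9): pick b so that a grid initialized to zero yields an activated
    alpha of alpha_init for a ray segment of one voxel size."""
    # softplus(0 + b) must equal -log(1 - alpha_init) / s
    return math.log((1.0 - alpha_init) ** (-1.0 / voxel_size) - 1.0)

def per_voxel_lr(base_lr, view_counts):
    """Prior 2: scale the base learning rate of grid point j by n_j / n_max."""
    return base_lr * view_counts.float() / view_counts.max().clamp(min=1)

# Example usage; the BBox lengths are made up, the other values follow Sec. 6.2,
# and the update below is a plain gradient step only to show where the
# per-voxel rates would enter, not the actual optimizer:
# s_c, (nx, ny, nz) = allocate_grid((3.0, 3.0, 3.0), 100 ** 3)
# v_density = torch.zeros(1, nx, ny, nz, requires_grad=True)
# b = softplus_shift(1e-6, s_c)
# lr = per_voxel_lr(0.1, view_counts)            # view_counts: (nx, ny, nz)
# ... after loss.backward():
# v_density.data -= lr.unsqueeze(0) * v_density.grad
```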
5.2. Fine detail reconstruction

Given the optimized coarse geometry V^(density)(c) from Sec. 5.1, we can now focus on a smaller subspace to reconstruct the surface details and view-dependent effects. The optimized V^(density)(c) is frozen in this stage.

Fine scene representation. In the fine stage, we use a higher-resolution density voxel grid V^(density)(f) ∈ R^{1×Nx(f)×Ny(f)×Nz(f)} with post-activated interpolation (Eq. (6c)). Note that, alternatively, it is also possible to use a more advanced data structure [18, 30, 66] to refine the voxel grid based on the current V^(density)(c), but we leave that for future work. To model view-dependent color emission, we opt to use an explicit-implicit hybrid representation, as we find in our prior experiments that an explicit representation tends to produce worse results and an implicit representation entails a slower training speed. Our hybrid representation comprises i) a feature voxel grid V^(feat)(f) ∈ R^{D×Nx(f)×Ny(f)×Nz(f)}, where D is a hyperparameter for the feature-space dimension, and ii) a shallow MLP parameterized by Θ. Finally, queries of a 3D point x and viewing direction d are performed by

    \ddot{\sigma}^{(f)} = \mathrm{interp}\big( x, V^{(\mathrm{density})(f)} \big) ,    (10a)
    c^{(f)} = \mathrm{MLP}^{(\mathrm{rgb})}_{\Theta}\big( \mathrm{interp}(x, V^{(\mathrm{feat})(f)}), x, d \big) ,    (10b)

where c^(f) ∈ R³ is the view-dependent color emission and σ̈^(f) ∈ R is the raw volume density in the fine stage. Positional embedding [37] is applied on x and d for the MLP^(rgb)_Θ.

Known free space and unknown space. A query point is in the known free space if the post-activated alpha value from the optimized V^(density)(c) is less than the threshold τ(c). Otherwise, we say the query point is in the unknown space.

Fine voxels allocation. We densely query V^(density)(c) to find a BBox tightly enclosing the unknown space, where Lx(f), Ly(f), Lz(f) are the lengths of the BBox. The only hyperparameter is the expected total number of voxels M(f). The voxel size s(f) and the grid dimensions Nx(f), Ny(f), Nz(f) can then be derived automatically from M(f) as per Sec. 5.1.

Progressive scaling. Inspired by NSVF [30], we progressively scale our voxel grids V^(density)(f) and V^(feat)(f). Let pg_ckpt be the set of checkpoint steps. The initial number of voxels is set to ⌊M(f)/2^|pg_ckpt|⌋. When reaching a training step in pg_ckpt, we double the number of voxels, such that the number of voxels after the last checkpoint is M(f); the voxel size s(f) and the grid dimensions Nx(f), Ny(f), Nz(f) are updated accordingly. Scaling our scene representation is much simpler: at each checkpoint, we resize our voxel grids, V^(density)(f) and V^(feat)(f), by trilinear interpolation.

Fine-stage points sampling. The points sampling strategy is similar to Eq. (8) with some modifications. We first filter out rays that do not intersect with the known free space. For each ray, we adjust the near and far bounds, t(near) and t(far), to the two endpoints of the ray-box intersection. We do not adjust t(near) if x_0 is already inside the BBox.

Free space skipping. Querying V^(density)(c) (Eq. (7a)) is faster than querying V^(density)(f) (Eq. (10a)); querying for view-dependent colors (Eq. (10b)) is the slowest. We improve fine-stage efficiency with free space skipping in both training and testing. First, we skip sampled points that are in the known free space by checking the optimized V^(density)(c) (Eq. (7a)). Second, we further skip sampled points in the unknown space with low activated alpha values (thresholded at τ(f)) by querying V^(density)(f) (Eq. (10a)).

Training objective for fine representation. We use the same training losses as in the coarse stage, but with a smaller weight for the regularization losses, as we find this empirically leads to slightly better quality.

6. Experiments

6.1. Datasets

We evaluate our approach on five inward-facing datasets. Synthetic-NeRF [37] contains eight objects with realistic images synthesized by NeRF. Synthetic-NSVF [30] contains another eight objects synthesized by NSVF. Strictly following NeRF's and NSVF's setups, we set the image resolution to 800 × 800 pixels and let each scene have 100 views for training and 200 views for testing. BlendedMVS [65] is a synthetic MVS dataset that has realistic ambient lighting from real image blending. We use a subset of four objects provided by NSVF. The image resolution is 768 × 576 pixels, and one-eighth of the images are used for testing. Tanks&Temples [21] is a real-world dataset. We use a subset of five scenes provided by NSVF, each containing views captured by an inward-facing camera circling the scene. The image resolution is 1920 × 1080 pixels, and one-eighth of the images are used for testing. The DeepVoxels [48] dataset contains four simple Lambertian objects. The image resolutions are 512 × 512, and each scene has 479 views for training and 1000 views for testing.

6.2. Implementation details

We choose the same hyperparameters generally for all scenes. The expected numbers of voxels are set to M(c) = 100³ and M(f) = 160³ in the coarse and fine stages if not stated otherwise. The activated alpha values are initialized to α(init)(c) = 10⁻⁶ in the coarse stage. We use a higher α(init)(f) = 10⁻² as the query points are concentrated on the optimized coarse geometry in the fine stage. The points-sampling step sizes are set to half of the voxel sizes, i.e., δ(c) = 0.5·s(c) and δ(f) = 0.5·s(f). The shallow MLP comprises two hidden layers with 128 channels. We use the Adam optimizer [20] with a batch size of 8,192 rays to optimize the coarse and fine scene representations for 10k and 20k iterations, respectively. The base learning rates are 0.1 for all voxel grids and 10⁻³ for the shallow MLP. Exponential learning rate decay is applied. See supplementary material for detailed hyperparameter setups.
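As a hedged sketch of how this training setup can be wired up in PyTorch (the helper, its parameter names, and the decay target are our own illustration, not the released implementation): Adam gets two parameter groups with the base learning rates above, and an exponential schedule decays them smoothly over the run.

```python
import torch

def build_optimizer(voxel_grids, rgb_mlp, total_steps, decay_to=0.1):
    """Adam with separate base learning rates for the explicit voxel grids
    (0.1) and the shallow MLP (1e-3), plus per-step exponential lr decay."""
    optimizer = torch.optim.Adam([
        {'params': voxel_grids, 'lr': 0.1},           # density / feature grids
        {'params': rgb_mlp.parameters(), 'lr': 1e-3}, # shallow view-dependent MLP
    ])
    # decay so that the lr reaches decay_to * base_lr after total_steps steps
    gamma = decay_to ** (1.0 / total_steps)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)
    return optimizer, scheduler

# e.g. optimizer, scheduler = build_optimizer([v_density_f, v_feat_f], rgb_mlp, 20000)
# and call optimizer.step(); scheduler.step() once per sampled batch of 8,192 rays.
```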
6.3. Comparisons

Quantitative evaluation on the synthesized novel views. We first quantitatively compare the novel view synthesis results in Tab. 1. PSNR, SSIM [60], and LPIPS [69] are employed as evaluation metrics. Our model with M(f) = 160³ voxels already outperforms the original NeRF [37] and the improved JaxNeRF [9] re-implementation. Besides, our results are also comparable to most of the recent methods, except JaxNeRF+ [9] and Mip-NeRF [1]. Moreover, our per-scene optimization only takes about 15 minutes, while all the methods listed after NeRF in Tab. 1 need quite a few hours per scene. We also show our model with M(f) = 256³ voxels, which significantly improves our results under all metrics and achieves results more comparable to JaxNeRF+ and Mip-NeRF. We defer detailed comparisons on the much simpler DeepVoxels [48] dataset to the supplementary material, where we achieve 45.83 averaged PSNR and outperform NeRF's 40.15 and IBRNet's 42.93.

Table 1. Quantitative comparisons for novel view synthesis (each cell is PSNR↑ / SSIM↑ / LPIPS↓). Our method excels in convergence speed, i.e., 15 minutes per scene compared to many hours or days per scene using other methods. Besides, our rendering quality is better than the original NeRF [37] and the improved JaxNeRF [9] on the four datasets under all metrics. We also show comparable results to most of the recent methods. The superscript on LPIPS denotes the pre-trained model used in LPIPS; the gray numbers indicate that the code is unavailable or has an unconventional LPIPS implementation.

| Method | Synthetic-NeRF | Synthetic-NSVF | BlendedMVS | Tanks and Temples |
| SRN [49] | 22.26 / 0.846 / 0.170(vgg) | 24.33 / 0.882 / 0.141(alex) | 20.51 / 0.770 / 0.294(alex) | 24.10 / 0.847 / 0.251(alex) |
| NV [32] | 26.05 / 0.893 / 0.160(vgg) | 25.83 / 0.892 / 0.124(alex) | 23.03 / 0.793 / 0.243(alex) | 23.70 / 0.834 / 0.260(alex) |
| NeRF [37] | 31.01 / 0.947 / 0.081(vgg) | 30.81 / 0.952 / 0.043(alex) | 24.15 / 0.828 / 0.192(alex) | 25.78 / 0.864 / 0.198(alex) |
| Improved visual quality from NeRF: |
| JaxNeRF [9] | 31.69 / 0.953 / 0.068(vgg) | - | - | 27.94 / 0.904 / 0.168(vgg) |
| JaxNeRF+ [9] | 33.00 / 0.962 / 0.038 | - | - | - |
| Mip-NeRF [1] | 33.09 / 0.961 / 0.043(vgg) | - | - | - |
| Improved test-time rendering speed (and visual quality) from NeRF: |
| AutoInt [29] | 25.55 / 0.911 / 0.170 | - | - | - |
| FastNeRF [15] | 29.97 / 0.941 / 0.053 | - | - | - |
| SNeRG [18] | 30.38 / 0.950 / 0.050 | - | - | - |
| KiloNeRF [43] | 31.00 / 0.95 / 0.03 | 33.37 / 0.97 / 0.02 | 27.39 / 0.92 / 0.06 | 28.41 / 0.91 / 0.09 |
| PlenOctrees [66] | 31.71 / 0.958 / 0.053(vgg) | - | - | 27.99 / 0.917 / 0.131(vgg) |
| NSVF [30] | 31.75 / 0.953 / 0.047(alex) | 35.18 / 0.979 / 0.015(alex) | 26.89 / 0.898 / 0.114(alex) | 28.48 / 0.901 / 0.155(alex) |
| Improved convergence speed, test-time rendering speed, and visual quality from NeRF: |
| Ours (M(f)=160³) | 31.95 / 0.957 / 0.053(vgg), 0.035(alex) | 35.08 / 0.975 / 0.033(vgg), 0.019(alex) | 28.02 / 0.922 / 0.101(vgg), 0.075(alex) | 28.41 / 0.911 / 0.155(vgg), 0.148(alex) |
| Ours (M(f)=256³) | 32.80 / 0.961 / 0.045(vgg), 0.027(alex) | 36.21 / 0.980 / 0.024(vgg), 0.012(alex) | 28.64 / 0.933 / 0.081(vgg), 0.052(alex) | 28.82 / 0.920 / 0.138(vgg), 0.124(alex) |

Training time comparisons. The key merit of our work is the significant improvement in convergence speed with NeRF-comparable quality. In Tab. 2, we show a training time comparison. We also show the GPU specification after each reported time, as it is the main factor affecting run-time. NeRF [37], with a more powerful GPU, needs 1–2 days per scene to achieve 31.01 PSNR, while our method achieves a superior 31.95 and 32.80 PSNR in about 15 and 22 minutes per scene, respectively. MVSNeRF [6], IBRNet [59], and NeuRay [31] also show less per-scene training time than NeRF, but with the additional cost of running a generalizable cross-scene pre-training. MVSNeRF [6], after pre-training, optimizes a scene in 15 minutes as well, but the PSNR is degraded to 28.14. IBRNet [59] shows worse PSNR and longer training time than ours. NeuRay [31] originally reports time in the lower-resolution (NeuRay-Lo) setup, and we received the training time of the high-resolution (NeuRay-Hi) setup from the authors. NeuRay-Hi achieves 32.42 PSNR and requires 23 hours to train, while our method with M(f) = 256³ voxels achieves a superior 32.80 in about 22 minutes. For the early-stopped NeuRay-Hi, unfortunately, only its training time is retained (early-stopped NeuRay-Lo achieves NeRF-similar PSNR). NeuRay-Hi still needs 70 minutes to train with early stopping, while we only need 15 minutes to achieve NeRF-comparable quality and do not rely on generalizable pre-training or external depth information. Mip-NeRF [1] has similar run-time to NeRF but much better PSNRs, which also signifies that it needs less training time to reach NeRF's PSNR. We train early-stopped Mip-NeRFs on our machine and show the averaged PSNR and training time. The early-stopped Mip-NeRF achieves 30.85 PSNR after 6 hours of training, while we can achieve 31.95 PSNR in just 15 minutes.

Table 2. Training time comparisons. We take the training time and GPU specifications reported in previous works directly. A V100 GPU can run faster and has more storage than a 2080Ti GPU. Our method achieves good PSNR in significantly less per-scene optimization time.

| Method | PSNR↑ | Generalizable pre-training | Per-scene optimization |
| NeRF [37] | 31.01 | no need | 1–2 days (V100) |
| MVSNeRF [6] | 27.21 | 30 hrs (2080Ti) | 15 mins (2080Ti) |
| IBRNet [59] | 28.14 | 1 day (8×V100) | 6 hrs (V100) |
| NeuRay [31]† | 32.42 | 2 days (2080Ti) | 23 hrs (2080Ti) |
| Mip-NeRF [1]‡ | 30.85 | no need | 6 hrs (2080Ti) |
| Ours (M(f)=160³) | 31.95 | no need | 15 mins (2080Ti) |
| Ours (M(f)=256³) | 32.80 | no need | 22 mins (2080Ti) |
† Uses external depth information.
‡ Our reproduction with early stopping on our machine.

Rendering speed comparisons. Improving test-time rendering speed is not the main focus of this work, but we still achieve ∼45× speedups over NeRF—0.64 seconds versus 29 seconds per 800 × 800 image on our machine.

Qualitative comparison. Fig. 5 shows our rendering results on the challenging parts and compares them with the results (better than NeRF's) provided by PlenOctrees [66].
[Figure 5: qualitative comparisons; each group shows, from left to right, GT, Ours, and PlenOctree.]

Figure 5. Qualitative comparisons on the challenging parts. Top: On the ficus scene, we do not show blocking artifacts as PlenOctree does, and we recover the pot better. Middle: We produce blurrier results on the ship's body and rigging, but we do not have the background artifacts. Bottom: On the real-world captured Ignatius, we show better quality without blocking artifacts (left) and recover the color tone better (right). See supplementary material for more visualizations.

Table 3. Effectiveness of the post-activation. Geometry modeling with a density voxel grid achieves better PSNRs by using the proposed post-activated trilinear interpolation.

| Interp. | Syn.-NeRF PSNR↑ (∆) | Syn.-NSVF PSNR↑ (∆) | BlendedMVS PSNR↑ (∆) | T&T PSNR↑ (∆) |
| Nearest | 28.61 (-2.77) | 28.86 (-6.22) | 25.49 (-2.48) | 26.39 (-1.27) |
| Trilinear, pre- | 30.84 (-0.55) | 32.66 (-2.41) | 27.39 (-0.58) | 27.44 (-0.21) |
| Trilinear, in- | 29.91 (-1.48) | 32.42 (-2.66) | 27.29 (-0.68) | 27.52 (-0.13) |
| Trilinear, post- | 31.39 (-) | 35.08 (-) | 27.97 (-) | 27.66 (-) |

Table 4. Effectiveness of the imposed priors. We compare our different settings in the coarse geometry search. Top: We show their impacts on the final PSNRs after the fine-stage reconstruction. Bottom: We visualize the voxels allocated by the coarse geometry search on the Truck scene under the settings -/✓, 10⁻³/✓, 10⁻⁶/-, and 10⁻⁶/✓. Overall, low-density initialization is essential; using α(init)(c) = 10⁻⁶ and the view-count-based learning rate generally achieves cleaner voxel allocation in the coarse stage and better PSNR after the fine stage.

| α(init)(c) | View-count lr. | Syn.-NeRF PSNR↑ (∆) | Syn.-NSVF PSNR↑ (∆) | BlendedMVS PSNR↑ (∆) | T&T PSNR↑ (∆) |
| - | ✓ | 28.88 (-2.51) | 25.12 (-9.96) | 22.17 (-5.79) | 25.33 (-2.33) |
| 10⁻³ | ✓ | 30.96 (-0.42) | 27.24 (-7.84) | 23.17 (-4.79) | 26.04 (-1.61) |
| 10⁻⁴ | ✓ | 31.29 (-0.09) | 31.05 (-4.03) | 26.09 (-1.88) | 27.60 (-0.05) |
| 10⁻⁵ | ✓ | 31.41 (+0.02) | 35.04 (-0.04) | 27.36 (-0.61) | 27.63 (-0.02) |
| 10⁻⁶ | - | 31.40 (+0.01) | 35.03 (-0.04) | 27.37 (-0.60) | 27.59 (-0.07) |
| 10⁻⁷ | ✓ | 31.36 (-0.02) | 35.03 (-0.05) | 27.73 (-0.23) | 27.59 (-0.06) |
| 10⁻⁶ | ✓ | 31.39 (-) | 35.08 (-) | 27.97 (-) | 27.66 (-) |

6.4. Ablation studies

We mainly validate the effectiveness of the two proposed techniques—post-activation and the imposed priors—that enable voxel grids to model scene geometry with NeRF-comparable quality. We subsample two scenes for each dataset. See supplementary material for more detail and additional ablation studies on the number of voxels, point-sampling step size, progressive scaling, free space skipping, view-dependent color modeling, and the losses.

Effectiveness of the post-activation. We show in Sec. 4 that the proposed post-activated trilinear interpolation enables the discretized grid to model sharper surfaces. In Tab. 3, we compare the effectiveness of post-activation in scene reconstruction for novel view synthesis. Our grid in the fine stage consists of only 160³ voxels, where nearest-neighbor interpolation results in worse quality than trilinear interpolation. The proposed post-activation improves the results further compared to pre- and in-activation. We find that we gain less on the real-world captured BlendedMVS and Tanks and Temples datasets. The intuitive reason is that real-world data introduces more uncertainty (e.g., inconsistent lighting, SfM error), which results in multi-view-inconsistent and blurrier surfaces. Thus, the advantage is lessened for scene representations that can model sharper surfaces. We speculate that resolving the uncertainty in future work can increase the gain of the proposed post-activation.

Effectiveness of the imposed priors. As discussed in Sec. 5.1, it is crucial to initialize the voxel grid with low density to avoid suboptimal geometry. The hyperparameter α(init)(c) controls the initial activated alpha values via Eq. (9). In Tab. 4, we compare the quality with different α(init)(c) and with or without the view-count-based learning rate. Without the low-density initialization, the quality drops severely for all the scenes. When α(init)(c) = 10⁻⁷, we have to train the coarse stage of some scenes for more iterations. The effective range of α(init)(c) is scene-dependent. We find that α(init)(c) = 10⁻⁶ generally works well on all the scenes in this work. Finally, using the view-count-based learning rate can further improve the results and allocate noiseless voxels in the coarse stage.

7. Conclusion

Our method directly optimizes the voxel grid and achieves super-fast convergence in per-scene optimization with NeRF-comparable quality—reducing training time from many hours to 15 minutes. However, we do not deal with unbounded or forward-facing scenes, while we believe our method can be a stepping stone toward fast convergence in such scenarios. We hope our method can boost the progress of NeRF-based scene reconstruction and its applications.

Acknowledgements: This work was supported in part by the MOST grants 110-2634-F-001-009 and 110-2622-8-007-010-TE2 of Taiwan. We are grateful to the National Center for High-performance Computing for providing computational resources and facilities.
References

[1] Jonathan T. Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P. Srinivasan. Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. In ICCV, 2021.
[2] Sai Bi, Zexiang Xu, Pratul P. Srinivasan, Ben Mildenhall, Kalyan Sunkavalli, Milos Hasan, Yannick Hold-Geoffroy, David J. Kriegman, and Ravi Ramamoorthi. Neural reflectance fields for appearance acquisition. arXiv CS.CV 2106.01970, 2020.
[3] Mark Boss, Raphael Braun, Varun Jampani, Jonathan T. Barron, Ce Liu, and Hendrik P. A. Lensch. Nerd: Neural reflectance decomposition from image collections. In ICCV, 2021.
[4] Chris Buehler, Michael Bosse, Leonard McMillan, Steven J. Gortler, and Michael F. Cohen. Unstructured lumigraph rendering. In SIGGRAPH, 2001.
[5] Eric R. Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. Pi-gan: Periodic implicit generative adversarial networks for 3d-aware image synthesis. In CVPR, 2021.
[6] Anpei Chen, Zexiang Xu, Fuqiang Zhao, Xiaoshuai Zhang, Fanbo Xiang, Jingyi Yu, and Hao Su. Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo. In ICCV, 2021.
[7] Abe Davis, Marc Levoy, and Frédo Durand. Unstructured light fields. Comput. Graph. Forum, 2012.
[8] Paul E. Debevec, Camillo J. Taylor, and Jitendra Malik. Modeling and rendering architecture from photographs: A hybrid geometry- and image-based approach. In SIGGRAPH, 1996.
[9] Boyang Deng, Jonathan T. Barron, and Pratul P. Srinivasan. JaxNeRF: an efficient JAX implementation of NeRF, 2020.
[10] Kangle Deng, Andrew Liu, Jun-Yan Zhu, and Deva Ramanan. Depth-supervised nerf: Fewer views and faster training for free. arXiv CS.CV 2107.02791, 2021.
[11] Helisa Dhamo, Keisuke Tateno, Iro Laina, Nassir Navab, and Federico Tombari. Peeking behind objects: Layered depth prediction from a single image. Pattern Recognit. Lett., 2019.
[12] John Flynn, Michael Broxton, Paul E. Debevec, Matthew DuVall, Graham Fyffe, Ryan S. Overbeck, Noah Snavely, and Richard Tucker. Deepview: View synthesis with learned gradient descent. In CVPR, 2019.
[13] Guy Gafni, Justus Thies, Michael Zollhöfer, and Matthias Nießner. Dynamic neural radiance fields for monocular 4d facial avatar reconstruction. In CVPR, 2021.
[14] Chen Gao, Ayush Saraf, Johannes Kopf, and Jia-Bin Huang. Dynamic view synthesis from dynamic monocular video. In ICCV, 2021.
[15] Stephan J. Garbin, Marek Kowalski, Matthew Johnson, Jamie Shotton, and Julien P. C. Valentin. Fastnerf: High-fidelity neural rendering at 200fps. In ICCV, 2021.
[16] Steven J. Gortler, Radek Grzeszczuk, Richard Szeliski, and Michael F. Cohen. The lumigraph. In SIGGRAPH, 1996.
[17] Tong He, John P. Collomosse, Hailin Jin, and Stefano Soatto. Deepvoxels++: Enhancing the fidelity of novel view synthesis from 3d voxel embeddings. In ACCV, 2020.
[18] Peter Hedman, Pratul P. Srinivasan, Ben Mildenhall, Jonathan T. Barron, and Paul E. Debevec. Baking neural radiance fields for real-time view synthesis. In ICCV, 2021.
[19] Yoonwoo Jeong, Seokjun Ahn, Christopher Choy, Animashree Anandkumar, Minsu Cho, and Jaesik Park. Self-calibrating neural radiance fields. In ICCV, 2021.
[20] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[21] Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and temples: benchmarking large-scale scene reconstruction. ACM Trans. Graph., 2017.
[22] Adam R. Kosiorek, Heiko Strathmann, Daniel Zoran, Pol Moreno, Rosalia Schneider, Sona Mokrá, and Danilo Jimenez Rezende. Nerf-vae: A geometry aware 3d scene generative model. In ICML, 2021.
[23] Anat Levin and Frédo Durand. Linear view synthesis using a dimensionality gap light field prior. In CVPR, 2010.
[24] Marc Levoy and Pat Hanrahan. Light field rendering. In SIGGRAPH, 1996.
[25] Zhengqi Li, Simon Niklaus, Noah Snavely, and Oliver Wang. Neural scene flow fields for space-time view synthesis of dynamic scenes. In CVPR, 2021.
[26] Zhengqi Li, Wenqi Xian, Abe Davis, and Noah Snavely. Crowdsampling the plenoptic function. In ECCV, 2020.
[27] Chen-Hsuan Lin, Wei-Chiu Ma, Antonio Torralba, and Simon Lucey. BARF: bundle-adjusting neural radiance fields. In ICCV, 2021.
[28] Yen-Chen Lin, Pete Florence, Jonathan T. Barron, Alberto Rodriguez, Phillip Isola, and Tsung-Yi Lin. inerf: Inverting neural radiance fields for pose estimation. In IROS, 2021.
[29] David B. Lindell, Julien N. P. Martel, and Gordon Wetzstein. Autoint: Automatic integration for fast neural volume rendering. In CVPR, 2021.
[30] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields. In NeurIPS, 2020.
[31] Yuan Liu, Sida Peng, Lingjie Liu, Qianqian Wang, Peng Wang, Christian Theobalt, Xiaowei Zhou, and Wenping Wang. Neural rays for occlusion-aware image-based rendering. arXiv CS.CV 2107.13421, 2021.
[32] Stephen Lombardi, Tomas Simon, Jason M. Saragih, Gabriel Schwartz, Andreas M. Lehrmann, and Yaser Sheikh. Neural volumes: learning dynamic renderable volumes from images. ACM Trans. Graph., 2019.
[33] Ricardo Martin-Brualla, Noha Radwan, Mehdi S. M. Sajjadi, Jonathan T. Barron, Alexey Dosovitskiy, and Daniel Duckworth. Nerf in the wild: Neural radiance fields for unconstrained photo collections. In CVPR, 2021.
[34] Nelson L. Max. Optical models for direct volume rendering. IEEE Trans. Vis. Comput. Graph., 1995.
[35] Quan Meng, Anpei Chen, Haimin Luo, Minye Wu, Hao Su, Lan Xu, Xuming He, and Jingyi Yu. Gnerf: Gan-based neural radiance field without posed camera. In ICCV, 2021.
[36] Ben Mildenhall, Pratul P. Srinivasan, Rodrigo Ortiz Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar. Local light field fusion: practical view synthesis with prescriptive sampling guidelines. ACM Trans. Graph., 2019.
[37] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
[38] Atsuhiro Noguchi, Xiao Sun, Stephen Lin, and Tatsuya Harada. Neural articulated radiance field. In ICCV, 2021.
[39] Keunhong Park, Utkarsh Sinha, Jonathan T. Barron, Sofien Bouaziz, Dan B. Goldman, Steven M. Seitz, and Ricardo Martin-Brualla. Deformable neural radiance fields. In ICCV, 2021.
[40] Keunhong Park, Utkarsh Sinha, Peter Hedman, Jonathan T. Barron, Sofien Bouaziz, Dan B. Goldman, Ricardo Martin-Brualla, and Steven M. Seitz. Hypernerf: A higher-dimensional representation for topologically varying neural radiance fields. arXiv CS.CV 2106.13228, 2021.
[41] Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-nerf: Neural radiance fields for dynamic scenes. In CVPR, 2021.
[42] Daniel Rebain, Wei Jiang, Soroosh Yazdani, Ke Li, Kwang Moo Yi, and Andrea Tagliasacchi. Derf: Decomposed radiance fields. In CVPR, 2021.
[43] Christian Reiser, Songyou Peng, Yiyi Liao, and Andreas Geiger. Kilonerf: Speeding up neural radiance fields with thousands of tiny mlps. In ICCV, 2021.
[44] Katja Schwarz, Yiyi Liao, Michael Niemeyer, and Andreas Geiger. GRAF: generative radiance fields for 3d-aware image synthesis. In NeurIPS, 2020.
[45] Jonathan Shade, Steven J. Gortler, Li-wei He, and Richard Szeliski. Layered depth images. In SIGGRAPH, 1998.
[46] Lixin Shi, Haitham Hassanieh, Abe Davis, Dina Katabi, and Frédo Durand. Light field reconstruction using sparsity in the continuous fourier domain. ACM Trans. Graph., 2014.
[47] Meng-Li Shih, Shih-Yang Su, Johannes Kopf, and Jia-Bin Huang. 3d photography using context-aware layered depth inpainting. In CVPR, 2020.
[48] Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Nießner, Gordon Wetzstein, and Michael Zollhöfer. Deepvoxels: Learning persistent 3d feature embeddings. In CVPR, 2019.
[49] Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. Scene representation networks: Continuous 3d-structure-aware neural scene representations. In NeurIPS, 2019.
[50] Pratul P. Srinivasan, Boyang Deng, Xiuming Zhang, Matthew Tancik, Ben Mildenhall, and Jonathan T. Barron. Nerv: Neural reflectance and visibility fields for relighting and view synthesis. In CVPR, 2021.
[51] Pratul P. Srinivasan, Richard Tucker, Jonathan T. Barron, Ravi Ramamoorthi, Ren Ng, and Noah Snavely. Pushing the boundaries of view extrapolation with multiplane images. In CVPR, 2019.
[52] Matthew Tancik, Ben Mildenhall, Terrance Wang, Divi Schmidt, Pratul P. Srinivasan, Jonathan T. Barron, and Ren Ng. Learned initializations for optimizing coordinate-based neural representations. In CVPR, 2021.
[53] Matthew Tancik, Pratul P. Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T. Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. In NeurIPS, 2020.
[54] Justus Thies, Michael Zollhöfer, and Matthias Nießner. Deferred neural rendering: image synthesis using neural textures. ACM Trans. Graph., 2019.
[55] Edgar Tretschk, Ayush Tewari, Vladislav Golyanik, Michael Zollhöfer, Christoph Lassner, and Christian Theobalt. Non-rigid neural radiance fields: Reconstruction and novel view synthesis of a deforming scene from monocular video. In ICCV, 2021.
[56] Richard Tucker and Noah Snavely. Single-view view synthesis with multiplane images. In CVPR, 2020.
[57] Shubham Tulsiani, Richard Tucker, and Noah Snavely. Layer-structured 3d scene inference via view synthesis. In ECCV, 2018.
[58] Michael Waechter, Nils Moehrle, and Michael Goesele. Let there be color! large-scale texturing of 3d reconstructions. In ECCV, 2014.
[59] Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul P. Srinivasan, Howard Zhou, Jonathan T. Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas A. Funkhouser. Ibrnet: Learning multi-view image-based rendering. In CVPR, 2021.
[60] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE TIP, 2004.
[61] Zirui Wang, Shangzhe Wu, Weidi Xie, Min Chen, and Victor Adrian Prisacariu. Nerf-: Neural radiance fields without known camera parameters. arXiv CS.CV 2102.07064, 2021.
[62] Suttisak Wizadwongsa, Pakkapon Phongthawee, Jiraphon Yenphraphai, and Supasorn Suwajanakorn. Nex: Real-time view synthesis with neural basis expansion. In CVPR, 2021.
[63] Daniel N. Wood, Daniel I. Azuma, Ken Aldinger, Brian Curless, Tom Duchamp, David Salesin, and Werner Stuetzle. Surface light fields for 3d photography. In SIGGRAPH, 2000.
[64] Wenqi Xian, Jia-Bin Huang, Johannes Kopf, and Changil Kim. Space-time neural irradiance fields for free-viewpoint video. In CVPR, 2021.
[65] Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. Blendedmvs: A large-scale dataset for generalized multi-view stereo networks. In CVPR, 2020.
[66] Alex Yu, Ruilong Li, Matthew Tancik, Hao Li, Ren Ng, and Angjoo Kanazawa. Plenoctrees for real-time rendering of neural radiance fields. In ICCV, 2021.
[67] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelnerf: Neural radiance fields from one or few images. In CVPR, 2021.
[68] Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen Koltun. Nerf++: Analyzing and improving neural radiance fields. arXiv CS.CV 2010.07492, 2020.
[69] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[70] Xiuming Zhang, Pratul P. Srinivasan, Boyang Deng, Paul E. Debevec, William T. Freeman, and Jonathan T. Barron. Nerfactor: Neural factorization of shape and reflectance under an unknown illumination. arXiv CS.CV 2106.01970, 2021.
[71] Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: learning view synthesis using multiplane images. ACM Trans. Graph., 2018.
