
Edify 3D: Scalable High-Quality 3D Asset Generation


NVIDIA¹
arXiv:2411.07135v1 [cs.CV] 11 Nov 2024

Figure 1: Edify 3D is a model designed for high-quality 3D asset generation. With input text prompts
and/or a reference image, our model can generate a wide range of detailed 3D assets, supporting
applications such as video game design, extended reality, simulation, and more.

Abstract
We introduce Edify 3D, an advanced solution designed for high-quality 3D asset generation. Our
method first synthesizes RGB and surface normal images of the described object at multiple
viewpoints using a diffusion model. The multi-view observations are then used to reconstruct the
shape, texture, and PBR materials of the object. Our method can generate high-quality 3D assets
with detailed geometry, clean shape topologies, high-resolution textures, and materials within 2
minutes of runtime.

https://research.nvidia.com/labs/dir/edify-3d

1. Introduction
The creation of detailed digital 3D assets is essential for developing scenes, characters, and
environments across various digital domains. This capability is invaluable to industries such as video
game design, extended reality, film production, and simulation. For 3D content to be production-ready,
it must meet industry standards, including precise mesh structures, high-resolution textures, and
material maps. Consequently, producing such high-quality 3D content is often an exceedingly complex
and time-intensive process. As demand for 3D digital experiences grows, the need for efficient, scalable
solutions in 3D asset creation becomes increasingly crucial.
¹ A detailed list of contributors and acknowledgments can be found in App. A of this paper.

© 2024 NVIDIA. All rights reserved.



Recently, many research efforts have investigated training AI models for 3D asset generation (Lin
et al., 2023). A significant challenge, however, is the limited availability of 3D assets suitable for model
training. Creating 3D content requires specialized skills and expertise, making such assets much
scarcer than other visual media like images and videos. This scarcity raises a key research question:
how can we design scalable models that efficiently generate high-quality 3D assets from such limited data?

Edify 3D is an advanced solution designed for high-quality 3D asset generation, addressing the above
challenges while meeting industry standards. Our model generates high-quality 3D assets in under 2
minutes, providing detailed geometry, clean shape topologies, organized UV maps, textures up to 4K
resolution, and physically-based rendering (PBR) materials. Compared to other text-to-3D approaches,
Edify 3D consistently produces superior 3D shapes and textures, with notable improvements in both
efficiency and scalability. This technical report provides a detailed description of Edify 3D.

Core capabilities. Edify 3D features the following capabilities:

• Text-to-3D generation. Given an input text description, Edify 3D generates a digital 3D asset with
the aforementioned properties.
• Image-to-3D generation. Edify 3D can also create a 3D asset from a reference image of the object,
automatically identifying the foreground object in the image.

Model design. The core technology of Edify 3D relies on two types of neural networks: diffusion
models (Ho et al., 2020; Song and Ermon, 2019) and Transformers (Vaswani et al., 2017). Both
architectures have demonstrated great scalability and success in improving generation quality as more
training data becomes available. Following Li et al. (2024), we train the following models:

• Multi-view diffusion models. We train multiple diffusion models to synthesize the RGB appearance
and surface normals of an object from multiple viewpoints (Shi et al., 2023). The input can be a text
prompt, a reference image, or both.
• Reconstruction model. Using the synthesized multi-view RGB and surface normal images, a
reconstruction model predicts the geometry, texture, and materials of the 3D shape. We employ a
Transformer-based model (Hong et al., 2023) to predict a neural representation of the 3D object as
latent tokens, followed by isosurface extraction and mesh processing.

The final output of Edify 3D is a 3D asset that includes the mesh geometry, texture map, and material
map. Fig. 2 illustrates the overall pipeline of Edify 3D.

In this report, we provide:

• A detailed discussion of the design choices in the Edify 3D pipeline.
• An analysis of the scaling behaviors of model components and properties.
• An application of Edify 3D to scalable 3D scene generation from input text prompts.

2. Multi-View Diffusion Model


The process of creating multi-view images is similar to that of video generation (Brooks et al.,
2024; Chen et al., 2024). We finetune text-to-image models into pose-aware multi-view diffusion
models by conditioning them with camera poses. The models take a text prompt and camera poses as
input and synthesize the object’s appearance from different viewpoints. We train the following models:

1. A base multi-view diffusion model that synthesizes the RGB appearance conditioned on the input
text prompt as well as the camera poses.
2. A multi-view ControlNet (Zhang et al., 2023) model that synthesizes the object’s surface normals,
conditioned on both the multi-view RGB synthesis and the text prompt.

3. A multi-view upscaling ControlNet that super-resolves the multi-view RGB images to a higher
resolution, conditioned on the rasterized texture and surface normals of a given 3D mesh.

[Figure 2 diagram: from the text prompt "A steampunk robot turtle with rusty mechanical parts.", the multi-view diffusion model produces RGB images and the multi-view ControlNet produces normal images; the reconstruction model predicts latent 3D tokens, followed by isosurface extraction and mesh processing to obtain the mesh geometry; an upscaling ControlNet then produces high-resolution RGB images that update the texture map and material map of the final 3D asset.]

Figure 2: Pipeline of Edify 3D. Given a text description, a multi-view diffusion model synthesizes the
RGB appearance of the described object. The generated multi-view RGB images are then used as a
condition to synthesize surface normals using a multi-view ControlNet (Zhang et al., 2023). Next, a
reconstruction model takes the multi-view RGB and normal images as input and predicts the neural 3D
representation using a set of latent tokens. This is followed by isosurface extraction and subsequent
mesh post-processing to obtain the mesh geometry. An upscaling ControlNet is used to increase the
texture resolution, conditioning on mesh rasterizations to generate high-resolution multi-view RGB
images, which are then back-projected onto the texture map.

We use the Edify Image (NVIDIA et al., 2024) model as the base diffusion architecture: a U-Net
(Ronneberger et al., 2015) with 2.7 billion parameters, operating the diffusion in pixel space. The
ControlNet encoders are initialized with the weights of the U-Net. We extend the self-attention layer
in the original text-to-image diffusion model with a new mechanism to attend across different views
(Fig. 3), so that it acts as a video diffusion model with the same weights. The camera poses (rotation
and translation) are encoded through a lightweight MLP and added to the time embeddings of the video
diffusion model architecture.
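
To make the pose conditioning concrete, the following is a minimal PyTorch sketch of the idea, not the actual Edify 3D implementation; the module name, embedding width, and pose parameterization are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CameraPoseEmbedding(nn.Module):
    """Encode per-view camera poses (rotation + translation) into the same space
    as the diffusion time embedding. A sketch; widths are assumptions."""
    def __init__(self, embed_dim: int = 1280):
        super().__init__()
        # Flattened 3x3 rotation (9 values) + 3D translation (3 values) per view.
        self.mlp = nn.Sequential(
            nn.Linear(12, embed_dim),
            nn.SiLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, rotation: torch.Tensor, translation: torch.Tensor) -> torch.Tensor:
        # rotation: (batch, views, 3, 3), translation: (batch, views, 3)
        pose = torch.cat([rotation.flatten(-2), translation], dim=-1)  # (B, V, 12)
        return self.mlp(pose)                                          # (B, V, embed_dim)

# Usage sketch: each view adds its own pose embedding to the shared timestep embedding.
B, V, D = 2, 8, 1280
time_emb = torch.randn(B, 1, D)                                        # diffusion timestep embedding
pose_emb = CameraPoseEmbedding(D)(torch.randn(B, V, 3, 3), torch.randn(B, V, 3))
cond_emb = time_emb + pose_emb                                         # (B, V, D), fed to the U-Net blocks
```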


[Figure 3 diagram: panels for viewpoint 1, viewpoint 2, and viewpoint 3, with the cross-view attention map.]

Figure 3: Cross-view attention. In standard diffusion models, each view is synthesized by the diffusion
model independently. We extend the self-attention layer (yellow boxes) in our multi-view diffusion
models to attend across other viewpoints using the same weights.
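
The cross-view attention can be sketched as follows: per-view token sequences are reshaped so that a standard self-attention layer, with unchanged weights, attends over the tokens of all views jointly. The attention module and tensor shapes below are illustrative assumptions, not the actual U-Net layers.

```python
import torch
import torch.nn as nn

def cross_view_self_attention(attn: nn.MultiheadAttention, tokens: torch.Tensor) -> torch.Tensor:
    """tokens: (batch, views, seq_len, dim). Rather than attending within each view
    independently, flatten the view axis into the sequence axis so that the same
    attention weights attend across all viewpoints."""
    b, v, n, d = tokens.shape
    x = tokens.reshape(b, v * n, d)              # merge views into one long sequence
    out, _ = attn(x, x, x, need_weights=False)   # self-attention across all views
    return out.reshape(b, v, n, d)               # restore the per-view layout

# Usage sketch
attn = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
tokens = torch.randn(2, 4, 256, 64)              # 4 views, 256 tokens per view
out = cross_view_self_attention(attn, tokens)
```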

Training. We finetune the text-to-image models on renderings of 3D objects. During training, we jointly
train on natural 2D images as well as 3D object renderings with randomly chosen numbers of views (1,
4, and 8). The diffusion models are trained using the x0 parametrization for the loss, consistent with
the approach used in base model training. For multi-view ControlNets, we train the base model with
multi-view surface normal images first. Subsequently, we add a ControlNet encoder taking RGB images
as input and train it while freezing the base model.
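
For reference, the x0 parametrization trains the network to predict the clean image rather than the added noise. A minimal sketch of such a loss, assuming a generic variance-preserving noise schedule, is shown below; it is not the exact training objective of Edify 3D.

```python
import torch

def x0_diffusion_loss(model, x0, t, alphas_cumprod):
    """x0: clean (multi-view) images, t: integer timesteps, alphas_cumprod: noise schedule.
    The model directly predicts the clean image from the noisy input (x0 parametrization)."""
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, *([1] * (x0.dim() - 1)))   # broadcast over image dims
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise        # forward diffusion step
    x0_pred = model(x_t, t)                                       # network predicts x0
    return torch.mean((x0_pred - x0) ** 2)                        # simple MSE on x0
```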

2.1. Ablation Studies


Scaling with respect to the number of viewpoints. During inference, we can sample an arbitrary
number of views while maintaining good multi-view consistency, as shown in Fig. 4. Generating more
views allows for broader coverage of the object’s regions in the multi-view images. As we later discuss
in Sec. 3, the quality of the resulting 3D reconstruction is positively correlated with the number of
multi-view observations (Furukawa and Ponce, 2009). Therefore, the ability of the multi-view diffusion
model to synthesize denser viewpoints is critical to the final 3D generation quality.

Training across different numbers of viewpoints. During training, we sample 1, 4, or 8 views for each
training object, assigning different sampling ratios for each number of views. While training with a
varying number of views enables sampling an arbitrary number of views during inference, it is still
preferable to match the training views to those expected during inference. This helps minimize the gap
between training and inference performance. We compare two models, one trained primarily on 4-view
images and one trained primarily on 8-view images, and sample 10-view images at the same viewpoints. As
shown in Fig. 5, the model trained mostly with 8-view images produces more natural-looking images
with better multi-view consistency across views than the one trained mostly with 4-view images.

3. Reconstruction Model
Extracting 3D structure from image observations is typically referred to as photogrammetry, which has
been widely applied to many 3D reconstruction tasks (Li et al., 2023; Mildenhall et al., 2020; Wang et al., 2021).


[Figure 4 panels: sampled 4 views, sampled 8 views, sampled 10 views, and sampled 12 views.]

Figure 4: Comparison of number of sampled views. All images are sampled from the same model.
Our multi-view diffusion model can synthesize object images with dense viewpoint coverage while
maintaining good multi-view consistency, making it suitable for the downstream reconstruction model.

We use a Transformer-based reconstruction model (Hong et al., 2023) to generate the 3D mesh
geometry, texture map, and material map from multi-view images. We find that a Transformer-based
model demonstrates strong generalization capabilities to unseen object images, including synthesized
outputs from 2D multi-view diffusion models (Sec. 2).

We use a decoder-only Transformer model with triplanes (Chan et al., 2023; Hong et al., 2023) as the
latent 3D representation. The input RGB and normal images serve as conditioning for the reconstruction
model, with cross-attention layers applied between triplane tokens and the input conditioning. The
triplane tokens are processed through MLPs to predict neural fields for Signed Distance Functions
(SDF) and PBR properties (Karis, 2013), which are used for SDF-based volume rendering (Yariv et al.,
2021). The neural SDF is converted into a 3D mesh through isosurface extraction (Lorensen and Cline,
1998; Shen et al., 2023). The PBR properties are baked into texture and material maps via UV mapping,
including albedo colors and material properties like roughness and metallic channels.
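
A simplified sketch of how triplane tokens can be decoded into SDF and PBR values at query points is given below; it omits the Transformer that produces the triplanes from the image tokens, and all layer sizes and channel counts are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriplaneDecoder(nn.Module):
    """Decode SDF and PBR properties from triplane features at 3D query points (a sketch)."""
    def __init__(self, channels: int = 32, hidden: int = 64):
        super().__init__()
        self.sdf_mlp = nn.Sequential(nn.Linear(3 * channels, hidden), nn.SiLU(), nn.Linear(hidden, 1))
        self.pbr_mlp = nn.Sequential(nn.Linear(3 * channels, hidden), nn.SiLU(), nn.Linear(hidden, 5))
        # PBR output: albedo RGB + roughness + metallic.

    def forward(self, triplanes: torch.Tensor, points: torch.Tensor):
        # triplanes: (B, 3, C, H, W); points: (B, N, 3) with coordinates in [-1, 1]
        feats = []
        for i, dims in enumerate([[0, 1], [0, 2], [1, 2]]):       # XY, XZ, YZ planes
            uv = points[..., dims].unsqueeze(1)                   # (B, 1, N, 2)
            f = F.grid_sample(triplanes[:, i], uv, align_corners=False)  # (B, C, 1, N)
            feats.append(f.squeeze(2).transpose(1, 2))            # (B, N, C)
        feat = torch.cat(feats, dim=-1)                           # (B, N, 3C)
        return self.sdf_mlp(feat), self.pbr_mlp(feat)             # per-point SDF and PBR values

# Usage sketch
# sdf, pbr = TriplaneDecoder()(torch.randn(1, 3, 32, 64, 64), torch.rand(1, 1024, 3) * 2 - 1)
```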


[Figure 5 panels: model trained with mostly 4 views vs. model trained with mostly 8 views.]

Figure 5: Comparison of number of training views. We compare two models trained primarily with
different numbers of views (4 vs. 8), and sample images at the same 10 views at inference time. The
model trained primarily on 8 views generates images with better multi-view consistency compared to
the 4-view counterpart.

Training. We train our reconstruction model using large-scale imagery and 3D asset data. The model is
supervised on depth, normal, mask, albedo, and material channels through SDF-based volume rendering,
with outputs rendered from artist-generated meshes. Since surface normal computation is relatively
expensive, we compute the normal only at the surface and supervise against the ground truth. We find
that aligning the uncertainty of the SDF (Yariv et al., 2021) with the corresponding rendering resolution
improves the visual quality of the final output. Additionally, we mask out object edges during loss
computation to avoid noisy samples caused by aliasing. To smooth noisy gradients across samples, we
apply exponential moving average (EMA) to aggregate the final reconstruction model weights.
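
The EMA aggregation of model weights can be summarized with the sketch below; the decay value is an assumption.

```python
import copy
import torch

@torch.no_grad()
def update_ema(ema_model, model, decay: float = 0.999):
    """Maintain an exponential moving average of the reconstruction model weights."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)

# Usage sketch: create the EMA copy once, then update it after every optimizer step.
# ema_model = copy.deepcopy(model)
# for step in range(num_steps):
#     ...training step...
#     update_ema(ema_model, model)
```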

Mesh post-processing. After obtaining the dense triangular 3D mesh from isosurface extraction, we
post-process the mesh with the following steps:

1. Retopologize into a quadrilateral (quad) mesh with simplified geometry and adaptive topologies.
2. Generate the UV mapping based on the resulting quad mesh topology.
3. Bake the albedo and material neural fields into a texture map and material map, respectively.

These post-processing steps make the resulting mesh more suitable for further editing, essential for
artistic and design-oriented downstream applications.
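
To illustrate the baking in step 3, the sketch below evaluates an albedo field at the 3D surface point behind each texel of the UV map and writes the result into a texture image; the albedo_field callable and the precomputed texel-to-surface position map are assumptions standing in for the neural field and the UV rasterization, and the material map can be baked the same way.

```python
import numpy as np

def bake_texture(albedo_field, texel_positions: np.ndarray, texel_mask: np.ndarray) -> np.ndarray:
    """albedo_field: callable mapping (N, 3) surface points to (N, 3) RGB values in [0, 1].
    texel_positions: (H, W, 3) 3D surface point behind each texel (from UV rasterization).
    texel_mask: (H, W) bool array, True where a texel is covered by the mesh surface."""
    h, w, _ = texel_positions.shape
    texture = np.zeros((h, w, 3), dtype=np.float32)
    covered = texel_positions[texel_mask]          # query only covered texels
    texture[texel_mask] = albedo_field(covered)    # bake the field values into the map
    return (texture * 255).astype(np.uint8)

# Usage sketch with a dummy field standing in for the neural albedo field.
dummy_field = lambda p: 0.5 + 0.5 * np.sin(10.0 * p)
positions = np.random.rand(512, 512, 3).astype(np.float32)
mask = np.ones((512, 512), dtype=bool)
texture_map = bake_texture(dummy_field, positions, mask)
```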

3.1. Ablation Studies


We study the scaling properties of the reconstruction model in the following aspects: (a) the number of
input images and (b) the number of latent tokens representing the shape.

Experimental setup. For validation, we randomly select 78 shapes from a held-out dataset. We
report the LPIPS (Zhang et al., 2018) score on the albedo prediction to quantify the base texture
reconstruction performance. For material prediction accuracy, we use the 𝐿2 error on the roughness
and metallic values. We also use the 𝐿2 error between the ground-truth and predicted depths as a
proxy for evaluating the geometry accuracy of the reconstructed shapes. The camera poses used for
input and output are fixed at an elevation of 20°, pointing towards the origin (Fig. 6).
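
One way to compute these metrics is sketched below using the lpips package and plain tensor operations; this is an assumption about tooling, since the report does not specify the implementation.

```python
import torch
import lpips

lpips_fn = lpips.LPIPS(net='vgg')  # perceptual metric for albedo reconstruction quality

def evaluate_views(pred_albedo, gt_albedo, pred_material, gt_material, pred_depth, gt_depth):
    """All inputs are (N, C, H, W) tensors; albedo images are scaled to [-1, 1] as lpips
    expects, the material tensors hold roughness/metallic channels, depth is one channel."""
    albedo_lpips = lpips_fn(pred_albedo, gt_albedo).mean()
    material_l2 = torch.mean((pred_material - gt_material) ** 2)
    depth_l2 = torch.mean((pred_depth - gt_depth) ** 2)
    return albedo_lpips.item(), material_l2.item(), depth_l2.item()
```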


[Figure 6 panels: 4 views, 4 views (diagonal), 8 views, and 16 views.]

Figure 6: Camera pose sets for scaling reconstruction. The camera poses are set at various azimuth
angles and a fixed elevation of 20°. The camera in yellow indicates the "frontal" pose with respect to a
canonicalized coordinate system. In each setup, the cameras are uniformly distributed around a circle
at a fixed radius, looking towards the origin. The "4 views (diagonal)" setup represents the same 4-view
pose set but with a 45° offset, resulting in the cameras looking at the object region from diagonal angles.

Albedo LPIPS error
Input \ Validation views      4         4 (diag.)   8         16
4                             0.0732    0.0791      0.0762    0.0768
4 (diag.)                     0.0802    0.0756      0.0779    0.0783
8                             0.0691    0.0698      0.0695    0.0699
16                            0.0687    0.0689      0.0688    0.0687

Material 𝐿2 error
Input \ Validation views      4         4 (diag.)   8         16
4                             0.0015    0.0020      0.0017    0.0018
4 (diag.)                     0.0024    0.0019      0.0022    0.0022
8                             0.0013    0.0012      0.0013    0.0013
16                            0.0012    0.0013      0.0013    0.0013

Depth 𝐿2 error
Input \ Validation views      4         4 (diag.)   8         16
4                             0.0689    0.0751      0.0720    0.0722
4 (diag.)                     0.0704    0.0683      0.0694    0.0696
8                             0.0626    0.0641      0.0633    0.0633
16                            0.0613    0.0626      0.0619    0.0616

Table 1: Comparison of the number of input views. The diagonal cells indicate the cases where the
input views match the validation views. The quality improves (error decreases) as the number of input
views increases, demonstrating the scalability of our reconstruction model.

Scaling with respect to the number of viewpoints. We find that our reconstruction model consistently
recovers the input views more accurately than novel views. The model scales well with respect to the
number of viewpoints, i.e., its performance improves as more information is provided. We demonstrate
this with various combinations of input and output camera views, as shown in Fig. 6.

We present the quantitative results in Tab. 1. The diagonal entries of the table are highlighted, as they
represent cases where the inputs and outputs are identical. These diagonal entries often show the best
results in each row, indicating that the model reproduces the input views most accurately. Additionally,
the results consistently improve as the number of input views increases from 4 to 16. This suggests
that the reconstruction model benefits from the additional input information.

Motivated by the model’s scaling with the number of viewpoints, we further investigate whether the
number of training views affects reconstruction quality. We evaluate the model using a fixed 8-view
setup, where the model is trained with 4, 6, 8, and 10 views. The results are shown in Fig. 7a. Although
stochastic sampling of camera poses provides diverse viewpoints during training, the reconstruction
quality continues to improve as the number of views used in each training step increases.

Scaling with respect to compute. We study the impact of compute on the reconstruction model without
changing the model size (i.e., the number of model parameters). For this analysis, we scale down the
triplane token sizes in the self-attention and cross-attention blocks to reduce computation; this
adjustment does not alter the number of model parameters. We observe from Fig. 7b that as the number
of tokens increases, the results improve proportionally with the available compute.
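
To make the quadratic scaling concrete, the sketch below estimates self-attention FLOPs as a function of triplane resolution; the per-layer cost formula and feature width are standard back-of-the-envelope assumptions, not the exact cost of the model.

```python
def self_attention_flops(triplane_res: int, dim: int = 1024) -> float:
    """Rough FLOPs of one self-attention layer over triplane tokens. A triplane with
    resolution R contributes 3 * R * R tokens; the cost is approximately
    4*N*D^2 for the Q/K/V/output projections plus 2*N^2*D for the attention itself."""
    n_tokens = 3 * triplane_res * triplane_res
    return 4 * n_tokens * dim ** 2 + 2 * n_tokens ** 2 * dim

for res in (16, 32, 64, 96, 128):
    print(f"triplane resolution {res:3d}: ~{self_attention_flops(res) / 1e9:.1f} GFLOPs per layer")
```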


[Figure 7 plots: Albedo LPIPS error, Material 𝐿2 error, and Depth 𝐿2 error, plotted against (a) the number of training views (4, 6, 8, 10) and (b) the triplane resolution (16, 32, 64, 96, 128).]

Figure 7: (a) Comparison of the number of training views. The reconstruction model continues to
improve as the number of training views increases. (b) Comparison of the number of tokens. With a
fixed number of parameters, the model requires more compute for a larger number of tokens. Note that
the self-attention FLOPs increase quadratically with the triplane resolution. The quality consistently
scales with the amount of compute.

4. Data Processing
Edify 3D is trained on a combination of non-public large-scale imagery, pre-rendered multi-view images,
and 3D shape datasets. We focus on pre-processing 3D shape data in this section. The raw 3D data
undergo several preprocessing steps to achieve the quality and format required for model training.

Format conversion. The first step of the data processing pipeline involves converting all 3D shapes
into a unified format. We triangulate the meshes, pack all texture files, and convert the materials into
metallic-roughness format. We discard shapes with corrupted textures or materials. This process
results in a collection of 3D shapes that can be rendered as intended by the original creators.
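
A rough sketch of such a conversion pass is shown below using the trimesh library; the report does not name the toolchain, so the library choice, file formats, and failure handling are assumptions.

```python
import trimesh

def convert_to_unified_format(src_path: str, dst_path: str) -> bool:
    """Load a shape, triangulate it, and re-export it as glTF binary (GLB), whose
    material model is metallic-roughness and which packs textures into one file.
    Shapes that fail to load cleanly are discarded from the training set."""
    try:
        mesh = trimesh.load(src_path, force='mesh')   # returns a single triangulated mesh
        if mesh.is_empty or len(mesh.faces) == 0:
            return False
        mesh.export(dst_path)                          # format inferred from the extension
        return True
    except Exception:
        return False                                   # corrupted textures/materials, etc.

# Usage sketch
# kept = convert_to_unified_format('asset.obj', 'asset.glb')
```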

Quality filtering. We filter out non-object-centric data from the large-scale 3D datasets. We render the
shapes from multiple viewpoints and use AI classifiers to remove partial 3D scans, large scenes, shape
collages, and shapes containing auxiliary structures such as backdrops and ground planes. To ensure
quality, this process is conducted through multiple rounds of active learning, with human experts
continuously curating challenging examples to refine the AI classifier. Additionally, we apply rule-based
filtering to remove shapes with obvious issues, such as those that are excessively thin or lack texture.

Canonical pose alignment. We align our training shapes to their canonical poses to reduce potential
ambiguity when training the model. Pose alignment is also achieved via active learning. We manually
curate a small number of examples, train a pose predictor, look for hard examples in the full dataset,
and repeat the process. Defining the canonical pose is also critical. While many objects such as cars,
animals, and shoes already have natural canonical poses, other shapes may lack a clear front side, in
which case we define the functional part as the front and prioritize maintaining left-right symmetry.

PBR rendering. To render the 3D data into images for the diffusion and reconstruction models,
we use an in-house path tracer for photorealistic rendering. We employ a diverse set of sampling
techniques for the camera parameters. Half of the images are rendered from fixed elevation angles
with consistent intrinsic parameters, while the remaining images are rendered using random camera
poses and intrinsics. As a constraint, we maintain roughly consistent object sizes within the rendered
images. This approach accommodates both text-to-3D use cases (where there is full control over the
camera parameters) and image-to-3D use cases (where the reference image may come from a wide
range of camera intrinsics).
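
A simplified sketch of this camera sampling strategy is given below; the specific elevation set, focal-length range, and distance heuristic are illustrative assumptions.

```python
import numpy as np

def sample_camera(rng: np.random.Generator, fixed_ratio: float = 0.5):
    """Return (azimuth_deg, elevation_deg, focal_mm, distance) for one rendering.
    Half of the renders use a fixed elevation and consistent intrinsics, the rest use
    random poses and intrinsics; the distance keeps the object size roughly constant."""
    azimuth = rng.uniform(0.0, 360.0)
    if rng.random() < fixed_ratio:
        elevation, focal = 20.0, 50.0                 # fixed elevation, consistent intrinsics
    else:
        elevation = rng.uniform(-10.0, 60.0)          # random camera pose
        focal = rng.uniform(24.0, 135.0)              # random intrinsics
    distance = 2.5 * focal / 50.0                     # longer lens -> move the camera farther away
    return azimuth, elevation, focal, distance

rng = np.random.default_rng(0)
cameras = [sample_camera(rng) for _ in range(8)]
```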

AI captions. To caption the 3D shapes, we render one image per shape and use a vision-language model
(VLM) to generate both long and short captions for the image. To enhance the comprehensiveness of
captions, we also provide the metadata of the shape (e.g., title, description, category tree) to the VLM.


[Figure 8 panels show the PBR rendering, albedo, and surface normal for each prompt: "A bear wearing a cowboy outfit.", "A hedgehog in a wizard outfit.", "A rabbit as a famous painter with a palette.", "A knight’s armor on a stand.", "A phonograph made of wood and gold.", "Cute isometric house, adobe style, desert tan.", "An orange factory robot arm.", "A spaceship pilot chair.", "A lion dressed as a chef.", and "A full backpack with hanging space tools."]

Figure 8: Text-to-3D generation results. We include the input text prompts as well as renderings and
surface normals of generated assets. The generated 3D meshes include detailed geometry and sharp
textures with well-decomposed albedo colors, making them suitable for various downstream editing
and rendering applications.


[Figure 9 panels show the input image, surface normal, PBR rendering, and albedo for each example.]

Figure 9: Image-to-3D generation results. We visualize the input reference images as well as renderings
and surface normals of generated assets. Edify 3D can faithfully recover the underlying 3D structures
of the reference object while also being able to hallucinate detailed textures in unseen surface regions
(e.g., the backside of the cup).

Figure 10: Quad mesh topologies. Edify 3D generates assets in the form of quad meshes with clean
topologies, making it suitable for downstream editing workflows. We visualize the quad mesh topologies
of the generated assets with their PBR renderings side-by-side.

5. Results
We showcase text-to-3D generation results from Edify 3D in Fig. 8 and image-to-3D generation in Fig. 9.
The generated meshes include detailed geometry and sharp textures, with well-decomposed albedo
colors that represent the surface’s base color. For image-to-3D generation, Edify 3D not only accurately
recovers the underlying 3D structures of the reference object, but also generates detailed textures
in regions of the surface not directly observed in the input image.

The assets generated by Edify 3D come in the form of quad meshes with well-organized topologies, as
visualized in Fig. 10. These structured meshes allow for easier manipulation and precise adjustments,
making them well-suited for various downstream editing tasks and rendering applications. This enables
seamless integration into 3D workflows that require visual fidelity and flexibility.

6. Related Work
3D asset generation. The challenge of 3D asset generation is often addressed by training models
on 3D datasets (Gupta et al., 2023; Jun and Nichol, 2023; Nichol et al., 2022; Zeng et al., 2022), but
the scarcity of these datasets limits the ability to generalize. To overcome this, recent methods have
shifted towards using models trained on large-scale image and video datasets. Score Distillation
Sampling (SDS) (Poole et al., 2023) has been adopted in earlier methods (Huang et al., 2023; Lin et al.,
2023; Sun et al., 2024; Tang et al., 2023; Wang et al., 2023, 2024; Yi et al., 2023; Zhu and Zhuang, 2023)
and extended to image-conditioned 3D generative models (Liu et al., 2023; Long et al., 2024; Qian et al.,
2024; Tang et al., 2023; Wang and Shi, 2023; Yu et al., 2023). However, they often experience slow
processing (Lorraine et al., 2023; Xie et al., 2024) and are susceptible to artifacts such as the Janus
face problem (Shi et al., 2023). To improve performance, newer techniques integrate multi-view image
generative models, focusing on producing multiple consistent views that can be reconstructed into 3D
models (Chan et al., 2023; Chen et al., 2024; Gao et al., 2024; Höllein et al., 2024; Liu et al., 2024; Shi et al.,
2023; Tang et al., 2024; Weng et al., 2023; Yang et al., 2024). However, maintaining consistency across
these views remains a challenge, leading to the development of methods that enhance reconstruction
robustness from limited views (Li et al., 2024; Liu et al., 2024).

3D reconstruction from multi-view images. 3D asset generation from limited views typically involves
3D reconstruction techniques, often with differentiable rendering, that can utilize various 3D
representations such as Neural Radiance Fields (NeRF) (Mildenhall et al., 2020). Meshes are the
most commonly used format in industrial 3D engines, yet reconstructing high-quality meshes from
multi-view images is challenging. Traditional photogrammetry pipelines, including structure from
motion (SfM) (Agarwal et al., 2011; Schönberger and Frahm, 2016; Snavely et al., 2006), multi-view
stereo (MVS) (Furukawa and Ponce, 2009; Schönberger et al., 2016), and surface extraction (Kazhdan
et al., 2006; Lorensen and Cline, 1998; Shen et al., 2021, 2023), are costly and time-consuming, often
yielding low-quality results. While NeRF-based neural rendering methods can achieve high-quality
3D reconstructions (Guédon and Lepetit, 2024; Huang et al., 2024; Kerbl et al., 2023; Li et al., 2023;
Wang et al., 2021; Yariv et al., 2021), they require dense images and extensive optimization, and
converting radiance fields into meshes can lead to suboptimal results. To address these limitations,
Transformer-based (Vaswani et al., 2017) models further improve 3D NeRF reconstruction from sparse
views by learning a feed-forward prior (Hong et al., 2023).

Texture and material generation. Earlier approaches targeting 3D texture generation given a 3D
shape include CLIP (Radford et al., 2021) for text alignment (Michel et al., 2022; Mohammad Khalid
et al., 2022) and SDS loss optimization (Poole et al., 2023). To improve 3D awareness, some text-to-3D
methods combine texture inpainting with depth-conditioned diffusion (Chen et al., 2023), albeit
slower and more artifact-prone. To enhance consistency, other techniques alternate diffusion with
reprojection (Cao et al., 2023) or generate multiple textured views simultaneously (Deng et al., 2024),
though at a higher computational cost. To further enhance realism, some methods have enabled
multi-view PBR modeling (Boss et al., 2022; Zhang et al., 2021) to extend support for generating
material properties (Chen et al., 2023; Qiu et al., 2024; Xu et al., 2023).

7. Application: 3D Scene Generation


In this section, we demonstrate an application of our Edify 3D model to scalable 3D scene generation
(Bahmani et al., 2023). High-quality, large-scale 3D scenes are pivotal for content creation and training
robust embodied AI agents. However, existing scene creation mostly depends on 3D scans with costly
human annotations (Dai et al., 2017) or direct scene modeling (Bautista et al., 2022; Chen et al., 2023;
Fridman et al., 2024), which largely limits the scalability.

[Figure 11 diagram: the input prompt "A Colorado desert scene with a skull in a gold rush theme.", the Edify 3D generated assets, the scene layout, and the resulting 3D scene generation.]

Figure 11: The high-quality 3D assets generated by Edify 3D can be extended to 3D scene generation
combined with a given scene layout, which can be generated by an LLM. This enables the generated 3D
scene to be readily rendered at high quality while being editable for various downstream applications.

With Edify 3D as the 3D asset generation API, we can design a scalable system to generate 3D scenes
from only an input text prompt (Lin et al., 2024). The system generates the scene layout of the 3D objects
with a large language model (LLM) (OpenAI et al., 2023), which specifies the positions and sizes of the 3D
objects². As a result, the system can generate a realistic and complex 3D scene that coherently aligns
with the text prompt describing the whole scene.
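
To give a sense of what an LLM-produced layout might look like, the sketch below defines a minimal layout schema (prompt, position, size) and assembles a scene from it; the schema, field names, and example objects are assumptions for illustration, not the format of Lin et al. (2024).

```python
from dataclasses import dataclass

@dataclass
class LayoutItem:
    prompt: str        # text prompt sent to the 3D asset generation API
    position: tuple    # (x, y, z) placement in the scene, in meters
    size: float        # target bounding-box size used to scale the asset

# Example layout, as an LLM might propose for a desert-themed scene prompt.
layout = [
    LayoutItem("a weathered animal skull", (0.0, 0.0, 0.0), 0.6),
    LayoutItem("a rusty gold-rush era pickaxe", (1.2, 0.0, -0.4), 0.8),
    LayoutItem("a tall desert cactus", (-1.5, 0.0, 0.8), 1.8),
]

def assemble_scene(layout):
    """Generate each asset (stubbed here) and return its transform for scene assembly."""
    scene = []
    for item in layout:
        asset = f"<3D asset for '{item.prompt}'>"   # stand-in for a call to the asset API
        scene.append({"asset": asset, "position": item.position, "scale": item.size})
    return scene

print(assemble_scene(layout))
```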

We show an example result of 3D scene generation in Fig. 11. The generated assets from Edify 3D
include detailed geometry and textures, forming a scene composition together with a generated scene
layout. Since all the 3D assets are created individually, the generated 3D scene is naturally editable for
various specialized applications, such as artist creation, 3D designs, or embodied AI simulations.

² We refer readers to Lin et al. (2024) for details on layout generation with LLMs.


8. Conclusion
In this technical report, we present Edify 3D, a solution designed for high-quality 3D asset generation.
We introduce the Edify 3D models, analyze the scaling laws of each model, and describe the data
curation pipeline. Additionally, we explore the application of Edify 3D to scalable 3D scene generation.
We are committed to advancing and developing new automation tools for 3D asset generation, making
3D content creation more accessible to everyone.

A. Contributors and Acknowledgements


A.1. Core Contributors
System design: Chen-Hsuan Lin, Xiaohui Zeng, Zhaoshuo Li, Zekun Hao, Ming-Yu Liu, Tsung-Yi Lin.
Multi-view diffusion model: Xiaohui Zeng, Qinsheng Zhang, Ming-Yu Liu, Tsung-Yi Lin.
Reconstruction model: Zhaoshuo Li, Chen-Hsuan Lin, Zekun Hao, Yen-Chen Lin, Ming-Yu Liu, Tsung-Yi
Lin.
3D data processing: Zekun Hao, Fangyin Wei, Yin Cui, Yunhao Ge, Yifan Ding, Donglai Xiang, Qianli Ma,
Jacob Munkberg, Jon Hasselgren, Chen-Hsuan Lin, Tsung-Yi Lin.
Mesh post-processing: Donglai Xiang, Qianli Ma, J.P. Lewis, Zekun Hao, Zhaoshuo Li, Fangyin Wei,
Xiaohui Zeng, Jingyi Jin, Chen-Hsuan Lin, Tsung-Yi Lin.
Evaluation: Jingyi Jin, Xiaohui Zeng, Zhaoshuo Li, Qianli Ma, Yen-Chen Lin, Chen-Hsuan Lin, Tsung-Yi Lin.

A.2. Contributors
Maciej Bala, Jacob Huffman, Alice Luo, Stella Shi, Jiashu Xu.

A.3. Acknowledgements
We thank Lars Bishop, Sanja Fidler, Jun Gao, Jinwei Gu, Aaron Lefohn, Arun Mallya, Hanzi Mao,
Seungjun Nah, Fitsum Reda, David Romero Guzman, Rohan Sawhney, Nicholas Sharp, Tianchang Shen,
Peter Shipkov, Towaki Takikawa, Heng Wang, and Martin Watt for useful research discussions and
prototyping. We also thank Dane Aconfora, Yazdan Aghaghiri, Margaret Albrecht, Arslan Ali, Sivakumar
Arayandi Thottakara, Amelia Barton, Lucas Brown, Matt Catrett, Douglas Chang, Steve Chappell,
Gerardo Delgado Cabrera, John Dickinson, Amol Fasale, Daniela Flamm Jackson, Sandra Froehlich,
Devika Ghaisas, Yugi Guvvala, Brett Hamilton, Mohammad Harrim, Nathan Horrocks, Akan Huang,
Sophia Huang, Pooya Jannaty, Pranjali Joshi, Tobias Lasser, Gabriele Leone, Aaron Licata, Ashlee
Martino-Tarr, Alexandre Milesi, Amanda Moran, Pawel Morkisz, Andrew Morse, Jashojit Mukherjee, Brad
Nemire, Dade Orgeron, David Page, Mitesh Patel, Jason Paul, Joel Pennington, Lyne Tchapmi, Jibin
Varghese, Thomas Volk, Raju Wagwani, Herb Woodruff, and Josh Young for feedback and support.


References
[1] Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Simon, Brian Curless, Steven M Seitz, and
Richard Szeliski. Building rome in a day. Communications of the ACM, 2011. 11

[2] Sherwin Bahmani, Jeong Joon Park, Despoina Paschalidou, Xingguang Yan, Gordon Wetzstein,
Leonidas Guibas, and Andrea Tagliasacchi. Cc3d: Layout-conditioned generation of compositional
3d scenes. In ICCV, 2023. 11

[3] Miguel Angel Bautista, Pengsheng Guo, Samira Abnar, Walter Talbott, Alexander Toshev, Zhuoyuan
Chen, Laurent Dinh, Shuangfei Zhai, Hanlin Goh, Daniel Ulbricht, et al. Gaudi: A neural architect for
immersive 3d scene generation. In NeurIPS, 2022. 12

[4] Mark Boss, Andreas Engelhardt, Abhishek Kar, Yuanzhen Li, Deqing Sun, Jonathan Barron, Hendrik
Lensch, and Varun Jampani. Samurai: Shape and material from unconstrained real-world arbitrary
image collections. In NeurIPS, 2022. 11

[5] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr,
Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh.
Video generation models as world simulators. 2024. URL https://openai.com/research/
video-generation-models-as-world-simulators. 2

[6] Tianshi Cao, Karsten Kreis, Sanja Fidler, Nicholas Sharp, and Kangxue Yin. Texfusion: Synthesizing
3d textures with text-guided image diffusion models. In ICCV, 2023. 11

[7] Eric R Chan, Koki Nagano, Matthew A Chan, Alexander W Bergman, Jeong Joon Park, Axel Levy,
Miika Aittala, Shalini De Mello, Tero Karras, and Gordon Wetzstein. Generative novel view synthesis
with 3d-aware diffusion models. In ICCV, 2023. 5, 11

[8] Dave Zhenyu Chen, Yawar Siddiqui, Hsin-Ying Lee, Sergey Tulyakov, and Matthias Nießner.
Text2tex: Text-driven texture synthesis via diffusion models. In ICCV, 2023. 11

[9] Rui Chen, Yongwei Chen, Ningxin Jiao, and Kui Jia. Fantasia3d: Disentangling geometry and
appearance for high-quality text-to-3d content creation. In ICCV, 2023. 11

[10] Zhaoxi Chen, Guangcong Wang, and Ziwei Liu. Scenedreamer: Unbounded 3d scene generation
from 2d image collections. TPAMI, 2023. 12

[11] Zilong Chen, Yikai Wang, Feng Wang, Zhengyi Wang, and Huaping Liu. V3d: Video diffusion models
are effective 3d generators. arXiv preprint arXiv:2403.06738, 2024. 2, 11

[12] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias
Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, 2017. 12

[13] Kangle Deng, Timothy Omernick, Alexander Weiss, Deva Ramanan, Jun-Yan Zhu, Tinghui Zhou, and
Maneesh Agrawala. Flashtex: Fast relightable mesh texturing with lightcontrolnet. arXiv preprint
arXiv:2402.13251, 2024. 11

[14] Rafail Fridman, Amit Abecasis, Yoni Kasten, and Tali Dekel. Scenescape: Text-driven consistent
scene generation. In NeurIPS, 2024. 12

[15] Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multiview stereopsis. TPAMI,
2009. 4, 11

[16] Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul
Srinivasan, Jonathan T Barron, and Ben Poole. Cat3d: Create anything in 3d with multi-view
diffusion models. arXiv preprint arXiv:2405.10314, 2024. 11


[17] Antoine Guédon and Vincent Lepetit. Sugar: Surface-aligned gaussian splatting for efficient 3d
mesh reconstruction and high-quality mesh rendering. In CVPR, 2024. 11

[18] Anchit Gupta, Wenhan Xiong, Yixin Nie, Ian Jones, and Barlas Oğuz. 3dgen: Triplane latent diffusion
for textured mesh generation. arXiv preprint arXiv:2303.05371, 2023. 11

[19] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS,
2020. 2

[20] Lukas Höllein, Norman Müller, David Novotny, Hung-Yu Tseng, Christian Richardt, Michael Zollhöfer,
Matthias Nießner, et al. Viewdiff: 3d-consistent image generation with text-to-image models. In
CVPR, 2024. 11

[21] Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli,
Trung Bui, and Hao Tan. Lrm: Large reconstruction model for single image to 3d. arXiv preprint
arXiv:2311.04400, 2023. 2, 5, 11

[22] Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2d gaussian splatting
for geometrically accurate radiance fields. In SIGGRAPH, 2024. 11

[23] Yukun Huang, Jianan Wang, Yukai Shi, Xianbiao Qi, Zheng-Jun Zha, and Lei Zhang. Dreamtime: An
improved optimization strategy for text-to-3d content creation. arXiv preprint arXiv:2306.12422,
2023. 11

[24] Heewoo Jun and Alex Nichol. Shap-e: Generating conditional 3d implicit functions. arXiv preprint
arXiv:2305.02463, 2023. 11

[25] Brian Karis. Real shading in unreal engine 4. In Physically Based Shading Theory Practice, 2013. 5

[26] Michael Kazhdan, Matthew Bolitho, and Hugues Hoppe. Poisson surface reconstruction. In SGP,
2006. 11

[27] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian
splatting for real-time radiance field rendering. ACM Transactions on Graphics (TOG), 2023. 11

[28] Jiahao Li, Hao Tan, Kai Zhang, Zexiang Xu, Fujun Luan, Yinghao Xu, Yicong Hong, Kalyan Sunkavalli,
Greg Shakhnarovich, and Sai Bi. Instant3d: Fast text-to-3d with sparse-view generation and large
reconstruction model. In ICLR, 2024. 2, 11

[29] Zhaoshuo Li, Thomas Müller, Alex Evans, Russell H Taylor, Mathias Unberath, Ming-Yu Liu, and
Chen-Hsuan Lin. Neuralangelo: High-fidelity neural surface reconstruction. In CVPR, 2023. 4, 11

[30] Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis,
Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: High-resolution text-to-3d content creation.
In CVPR, 2023. 2, 11

[31] Tsung-Yi Lin, Chen-Hsuan Lin, Yin Cui, Yunhao Ge, Seungjun Nah, Arun Mallya, Zekun Hao, Yifan
Ding, Hanzi Mao, Zhaoshuo Li, et al. Genusd: 3d scene generation made easy. In ACM SIGGRAPH
2024 Real-Time Live! 2024. 12

[32] Minghua Liu, Ruoxi Shi, Linghao Chen, Zhuoyang Zhang, Chao Xu, Xinyue Wei, Hansheng Chen,
Chong Zeng, Jiayuan Gu, and Hao Su. One-2-3-45++: Fast single image to 3d objects with
consistent multi-view generation and 3d diffusion. In CVPR, 2024. 11

[33] Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Mukund Varma T, Zexiang Xu, and Hao Su.
One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimization. In
NeurIPS, 2024. 11


[34] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick.
Zero-1-to-3: Zero-shot one image to 3d object. In ICCV, 2023. 11

[35] Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-
Hai Zhang, Marc Habermann, Christian Theobalt, et al. Wonder3d: Single image to 3d using
cross-domain diffusion. In CVPR, 2024. 11

[36] William E Lorensen and Harvey E Cline. Marching cubes: A high resolution 3d surface construction
algorithm. In Seminal graphics: pioneering efforts that shaped the field. 1998. 5, 11

[37] Jonathan Lorraine, Kevin Xie, Xiaohui Zeng, Chen-Hsuan Lin, Towaki Takikawa, Nicholas Sharp,
Tsung-Yi Lin, Ming-Yu Liu, Sanja Fidler, and James Lucas. Att3d: Amortized text-to-3d object
synthesis. In ICCV, 2023. 11

[38] Oscar Michel, Roi Bar-On, Richard Liu, Sagie Benaim, and Rana Hanocka. Text2mesh: Text-driven
neural stylization for meshes. In CVPR, 2022. 11

[39] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and
Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020. 4,
11

[40] Nasir Mohammad Khalid, Tianhao Xie, Eugene Belilovsky, and Tiberiu Popa. Clip-mesh: Generating
textured meshes from text using pretrained image-text models. In SIGGRAPH Asia, 2022. 11

[41] Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. Point-e: A system for
generating 3d point clouds from complex prompts. arXiv preprint arXiv:2212.08751, 2022. 11

[42] NVIDIA, Yuval Atzmon, Maciej Bala, Yogesh Balaji, Tiffany Cai, Yin Cui, Jiaojiao Fan, Yunhao Ge,
Siddharth Gururani, Jacob Huffman, Ronald Isaac, Pooya Jannaty, Tero Karras, Grace Lam, J. P.
Lewis, Aaron Licata, Yen-Chen Lin, Ming-Yu Liu, Qianli Ma, Arun Mallya, Ashlee Martino-Tarr, Doug
Mendez, Seungjun Nah, Chris Pruett, Fitsum Reda, Jiaming Song, Ting-Chun Wang, Fangyin Wei,
Xiaohui Zeng, Yu Zeng, and Qinsheng Zhang. Edify image: High-quality image generation with pixel
space laplacian diffusion models. 2024. 3

[43] OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni
Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical
report. arXiv preprint arXiv:2303.08774, 2023. 12

[44] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d
diffusion. In ICLR, 2023. 11

[45] Guocheng Qian, Jinjie Mai, Abdullah Hamdi, Jian Ren, Aliaksandr Siarohin, Bing Li, Hsin-Ying Lee,
Ivan Skorokhodov, Peter Wonka, Sergey Tulyakov, et al. Magic123: One image to high-quality 3d
object generation using both 2d and 3d diffusion priors. In ICLR, 2024. 11

[46] Lingteng Qiu, Guanying Chen, Xiaodong Gu, Qi Zuo, Mutian Xu, Yushuang Wu, Weihao Yuan, Zilong
Dong, Liefeng Bo, and Xiaoguang Han. Richdreamer: A generalizable normal-depth diffusion model
for detail richness in text-to-3d. In CVPR, 2024. 11

[47] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish
Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from
natural language supervision. In ICML, 2021. 11

[48] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical
image segmentation. In MICCAI, 2015. 3


[49] Johannes L Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view
selection for unstructured multi-view stereo. In ECCV, 2016. 11

[50] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In CVPR,
2016. 11

[51] Tianchang Shen, Jun Gao, Kangxue Yin, Ming-Yu Liu, and Sanja Fidler. Deep marching tetrahedra:
a hybrid representation for high-resolution 3d shape synthesis. In NeurIPS, 2021. 11

[52] Tianchang Shen, Jacob Munkberg, Jon Hasselgren, Kangxue Yin, Zian Wang, Wenzheng Chen, Zan
Gojcic, Sanja Fidler, Nicholas Sharp, and Jun Gao. Flexible isosurface extraction for gradient-based
mesh optimization. ACM Transactions on Graphics (TOG), 2023. 5, 11

[53] Ruoxi Shi, Hansheng Chen, Zhuoyang Zhang, Minghua Liu, Chao Xu, Xinyue Wei, Linghao Chen,
Chong Zeng, and Hao Su. Zero123++: a single image to consistent multi-view diffusion base model.
arXiv preprint arXiv:2310.15110, 2023. 11

[54] Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. Mvdream: Multi-view
diffusion for 3d generation. arXiv preprint arXiv:2308.16512, 2023. 2, 11

[55] Noah Snavely, Steven M Seitz, and Richard Szeliski. Photo tourism: exploring photo collections in
3d. ACM Transactions on Graphics (TOG), 2006. 11

[56] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data
distribution. In NeurIPS, 2019. 2

[57] Jingxiang Sun, Bo Zhang, Ruizhi Shao, Lizhen Wang, Wen Liu, Zhenda Xie, and Yebin Liu.
Dreamcraft3d: Hierarchical 3d generation with bootstrapped diffusion prior. In ICLR, 2024. 11

[58] Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. Dreamgaussian: Generative
gaussian splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653, 2023. 11

[59] Junshu Tang, Tengfei Wang, Bo Zhang, Ting Zhang, Ran Yi, Lizhuang Ma, and Dong Chen.
Make-it-3d: High-fidelity 3d creation from a single image with diffusion prior. In ICCV, 2023. 11

[60] Shitao Tang, Jiacheng Chen, Dilin Wang, Chengzhou Tang, Fuyang Zhang, Yuchen Fan, Vikas
Chandra, Yasutaka Furukawa, and Rakesh Ranjan. Mvdiffusion++: A dense high-resolution
multi-view diffusion model for single or sparse-view 3d object reconstruction. arXiv preprint
arXiv:2402.12712, 2024. 11

[61] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017. 2, 11

[62] Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A Yeh, and Greg Shakhnarovich. Score jacobian
chaining: Lifting pretrained 2d diffusion models for 3d generation. In CVPR, 2023. 11

[63] Peng Wang and Yichun Shi. Imagedream: Image-prompt multi-view diffusion for 3d generation.
arXiv preprint arXiv:2312.02201, 2023. 11

[64] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. Neus:
Learning neural implicit surfaces by volume rendering for multi-view reconstruction. In NeurIPS,
2021. 4, 11

[65] Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer:
High-fidelity and diverse text-to-3d generation with variational score distillation. In NeurIPS, 2024.
11


[66] Haohan Weng, Tianyu Yang, Jianan Wang, Yu Li, Tong Zhang, CL Chen, and Lei Zhang.
Consistent123: Improve consistency for one image to 3d object synthesis. arXiv preprint
arXiv:2310.08092, 2023. 11

[67] Kevin Xie, Jonathan Lorraine, Tianshi Cao, Jun Gao, James Lucas, Antonio Torralba, Sanja Fidler,
and Xiaohui Zeng. Latte3d: Large-scale amortized text-to-enhanced3d synthesis. arXiv preprint
arXiv:2403.15385, 2024. 11

[68] Xudong Xu, Zhaoyang Lyu, Xingang Pan, and Bo Dai. Matlaber: Material-aware text-to-3d via
latent brdf auto-encoder. arXiv preprint arXiv:2308.09278, 2023. 11

[69] Jiayu Yang, Ziang Cheng, Yunfei Duan, Pan Ji, and Hongdong Li. Consistnet: Enforcing 3d
consistency for multi-view images diffusion. In CVPR, 2024. 11

[70] Lior Yariv, Jiatao Gu, Yoni Kasten, and Yaron Lipman. Volume rendering of neural implicit surfaces.
In NeurIPS, 2021. 5, 6, 11

[71] Taoran Yi, Jiemin Fang, Guanjun Wu, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Qi Tian, and Xinggang
Wang. Gaussiandreamer: Fast generation from text to 3d gaussian splatting with point cloud
priors. arXiv preprint arXiv:2310.08529, 2023. 11

[72] Wangbo Yu, Li Yuan, Yan-Pei Cao, Xiangjun Gao, Xiaoyu Li, Long Quan, Ying Shan, and Yonghong
Tian. Hifi-123: Towards high-fidelity one image to 3d content generation. arXiv preprint
arXiv:2310.06744, 2023. 11

[73] Xiaohui Zeng, Arash Vahdat, Francis Williams, Zan Gojcic, Or Litany, Sanja Fidler, and Karsten Kreis.
Lion: Latent point diffusion models for 3d shape generation. In NeurIPS, 2022. 11

[74] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image
diffusion models. In ICCV, 2023. 2, 3

[75] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable
effectiveness of deep features as a perceptual metric. In CVPR, 2018. 6

[76] Xiuming Zhang, Pratul P Srinivasan, Boyang Deng, Paul Debevec, William T Freeman, and
Jonathan T Barron. Nerfactor: Neural factorization of shape and reflectance under an unknown
illumination. ACM Transactions on Graphics (ToG), 2021. 11

[77] Joseph Zhu and Peiye Zhuang. Hifa: High-fidelity text-to-3d with advanced diffusion guidance.
arXiv preprint arXiv:2305.18766, 2023. 11

