
Edify 3D: Scalable High-Quality 3D Asset Generation


NVIDIA¹
arXiv:2411.07135v1 [cs.CV] 11 Nov 2024

Figure 1: Edify 3D is a model designed for high-quality 3D asset generation. With input text prompts
and/or a reference image, our model can generate a wide range of detailed 3D assets, supporting
applications such as video game design, extended reality, simulation, and more.

Abstract
We introduce Edify 3D, an advanced solution designed for high-quality 3D asset generation. Our
method first synthesizes RGB and surface normal images of the described object at multiple
viewpoints using a diffusion model. The multi-view observations are then used to reconstruct the
shape, texture, and PBR materials of the object. Our method can generate high-quality 3D assets
with detailed geometry, clean shape topologies, high-resolution textures, and materials within 2
minutes of runtime.

https://research.nvidia.com/labs/dir/edify-3d

1. Introduction
The creation of detailed digital 3D assets is essential for developing scenes, characters, and
environments across various digital domains. This capability is invaluable to industries such as video
game design, extended reality, film production, and simulation. For 3D content to be production-ready,
it must meet industry standards, including precise mesh structures, high-resolution textures, and
material maps. Consequently, producing such high-quality 3D content is often an exceedingly complex
and time-intensive process. As demand for 3D digital experiences grows, the need for efficient, scalable
solutions in 3D asset creation becomes increasingly crucial.
¹ A detailed list of contributors and acknowledgments can be found in App. A of this paper.

© 2024 NVIDIA. All rights reserved.



Recently, many research efforts have investigated training AI models for 3D asset generation (Lin
et al., 2023). A significant challenge, however, is the limited availability of 3D assets suitable for model
training. Creating 3D content requires specialized skills and expertise, making such assets much
scarcer than other visual media like images and videos. This scarcity raises a key research question:
how can we design scalable models that efficiently generate high-quality 3D assets from such limited data?

Edify 3D is an advanced solution designed for high-quality 3D asset generation, addressing the above
challenges while meeting industry standards. Our model generates high-quality 3D assets in under 2
minutes, providing detailed geometry, clean shape topologies, organized UV maps, textures up to 4K
resolution, and physically-based rendering (PBR) materials. Compared to other text-to-3D approaches,
Edify 3D consistently produces superior 3D shapes and textures, with notable improvements in both
efficiency and scalability. This technical report provides a detailed description of Edify 3D.

Core capabilities. Edify 3D features the following capabilities:

• Text-to-3D generation. Given an input text description, Edify 3D generates a digital 3D asset with
the aforementioned properties.
• Image-to-3D generation. Edify 3D can also create a 3D asset from a reference image of the object,
automatically identifying the foreground object in the image.

Model design. The core technology of Edify 3D relies on two types of neural networks: diffusion
models (Ho et al., 2020; Song and Ermon, 2019) and Transformers (Vaswani et al., 2017). Both
architectures have demonstrated great scalability and success in improving generation quality as more
training data becomes available. Following Li et al. (2024), we train the following models:

• Multi-view diffusion models. We train multiple diffusion models to synthesize the RGB appearance
and surface normals of an object from multiple viewpoints (Shi et al., 2023). The input can be a text
prompt, a reference image, or both.
• Reconstruction model. Using the synthesized multi-view RGB and surface normal images, a
reconstruction model predicts the geometry, texture, and materials of the 3D shape. We employ a
Transformer-based model (Hong et al., 2023) to predict a neural representation of the 3D object as
latent tokens, followed by isosurface extraction and mesh processing.

The final output of Edify 3D is a 3D asset that includes the mesh geometry, texture map, and material
map. Fig. 2 illustrates the overall pipeline of Edify 3D.

In this report, we provide:

• A detailed discussion of the design choices in the Edify 3D pipeline.
• An analysis of the scaling behaviors of model components and properties.
• An application of Edify 3D to scalable 3D scene generation from input text prompts.

2. Multi-View Diffusion Model


The process of creating multi-view images is similar to that of video generation (Brooks et al.,
2024; Chen et al., 2024). We finetune text-to-image models into pose-aware multi-view diffusion
models by conditioning them with camera poses. The models take a text prompt and camera poses as
input and synthesize the object’s appearance from different viewpoints. We train the following models:

1. A base multi-view diffusion model that synthesizes the RGB appearance conditioned on the input
text prompt as well as the camera poses.
2. A multi-view ControlNet (Zhang et al., 2023) model that synthesizes the object’s surface normals,
conditioned on both the multi-view RGB synthesis and the text prompt.

3. A multi-view upscaling ControlNet that super-resolves the multi-view RGB images to a higher
resolution, conditioned on the rasterized texture and surface normals of a given 3D mesh.

[Figure 2 diagram: from the text prompt "A steampunk robot turtle with rusty mechanical parts.", the multi-view diffusion model produces RGB images and the multi-view ControlNet produces normal images; the reconstruction model predicts latent 3D tokens, followed by isosurface extraction and mesh processing to obtain the mesh geometry; an upscaling ControlNet then produces high-resolution RGB images that update the texture map and material map of the final 3D asset.]

Figure 2: Pipeline of Edify 3D. Given a text description, a multi-view diffusion model synthesizes the
RGB appearance of the described object. The generated multi-view RGB images are then used as a
condition to synthesize surface normals using a multi-view ControlNet (Zhang et al., 2023). Next, a
reconstruction model takes the multi-view RGB and normal images as input and predicts the neural 3D
representation using a set of latent tokens. This is followed by isosurface extraction and subsequent
mesh post-processing to obtain the mesh geometry. An upscaling ControlNet is used to increase the
texture resolution, conditioning on mesh rasterizations to generate high-resolution multi-view RGB
images, which are then back-projected onto the texture map.

We use the Edify Image (NVIDIA et al., 2024) model as the base diffusion architecture: a U-Net
(Ronneberger et al., 2015) with 2.7 billion parameters, operating the diffusion in pixel space. The
ControlNet encoders are initialized with the weights of the U-Net. We extend the self-attention layer
in the original text-to-image diffusion model with a new mechanism to attend across different views
(Fig. 3), so that it acts as a video diffusion model with the same weights. The camera poses (rotation
and translation) are encoded through a lightweight MLP and added to the time embeddings of the video
diffusion model architecture.
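
To make the pose conditioning concrete, the following is a minimal PyTorch sketch of the idea, not the actual Edify 3D implementation; the module name, embedding width, and pose parameterization are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CameraPoseEmbedding(nn.Module):
    """Encode per-view camera poses (rotation + translation) into the same space
    as the diffusion time embedding. A sketch; widths are assumptions."""
    def __init__(self, embed_dim: int = 1280):
        super().__init__()
        # Flattened 3x3 rotation (9 values) + 3D translation (3 values) per view.
        self.mlp = nn.Sequential(
            nn.Linear(12, embed_dim),
            nn.SiLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, rotation: torch.Tensor, translation: torch.Tensor) -> torch.Tensor:
        # rotation: (batch, views, 3, 3), translation: (batch, views, 3)
        pose = torch.cat([rotation.flatten(-2), translation], dim=-1)  # (B, V, 12)
        return self.mlp(pose)                                          # (B, V, embed_dim)

# Usage sketch: each view adds its own pose embedding to the shared timestep embedding.
B, V, D = 2, 8, 1280
time_emb = torch.randn(B, 1, D)                                        # diffusion timestep embedding
pose_emb = CameraPoseEmbedding(D)(torch.randn(B, V, 3, 3), torch.randn(B, V, 3))
cond_emb = time_emb + pose_emb                                         # (B, V, D), fed to the U-Net blocks
```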


[Figure 3 diagram: panels for viewpoint 1, viewpoint 2, and viewpoint 3, with the cross-view attention map.]

Figure 3: Cross-view attention. In standard diffusion models, each view is synthesized by the diffusion
model independently. We extend the self-attention layer (yellow boxes) in our multi-view diffusion
models to attend across other viewpoints using the same weights.
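
The cross-view attention can be sketched as follows: per-view token sequences are reshaped so that a standard self-attention layer, with unchanged weights, attends over the tokens of all views jointly. The attention module and tensor shapes below are illustrative assumptions, not the actual U-Net layers.

```python
import torch
import torch.nn as nn

def cross_view_self_attention(attn: nn.MultiheadAttention, tokens: torch.Tensor) -> torch.Tensor:
    """tokens: (batch, views, seq_len, dim). Rather than attending within each view
    independently, flatten the view axis into the sequence axis so that the same
    attention weights attend across all viewpoints."""
    b, v, n, d = tokens.shape
    x = tokens.reshape(b, v * n, d)              # merge views into one long sequence
    out, _ = attn(x, x, x, need_weights=False)   # self-attention across all views
    return out.reshape(b, v, n, d)               # restore the per-view layout

# Usage sketch
attn = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
tokens = torch.randn(2, 4, 256, 64)              # 4 views, 256 tokens per view
out = cross_view_self_attention(attn, tokens)
```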

Training. We finetune the text-to-image models on renderings of 3D objects. During training, we jointly
train on natural 2D images as well as 3D object renderings with randomly chosen numbers of views (1,
4, and 8). The diffusion models are trained using the x0 parametrization for the loss, consistent with
the approach used in base model training. For multi-view ControlNets, we train the base model with
multi-view surface normal images first. Subsequently, we add a ControlNet encoder taking RGB images
as input and train it while freezing the base model.
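
For reference, the x0 parametrization trains the network to predict the clean image rather than the added noise. A minimal sketch of such a loss, assuming a generic variance-preserving noise schedule, is shown below; it is not the exact training objective of Edify 3D.

```python
import torch

def x0_diffusion_loss(model, x0, t, alphas_cumprod):
    """x0: clean (multi-view) images, t: integer timesteps, alphas_cumprod: noise schedule.
    The model directly predicts the clean image from the noisy input (x0 parametrization)."""
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, *([1] * (x0.dim() - 1)))   # broadcast over image dims
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise        # forward diffusion step
    x0_pred = model(x_t, t)                                       # network predicts x0
    return torch.mean((x0_pred - x0) ** 2)                        # simple MSE on x0
```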

2.1. Ablation Studies


Scaling with respect to the number of viewpoints. During inference, we can sample an arbitrary
number of views while maintaining good multi-view consistency, as shown in Fig. 4. Generating more
views allows for broader coverage of the object’s regions in the multi-view images. As we later discuss
in Sec. 3, the quality of the resulting 3D reconstruction is positively correlated with the number of
multi-view observations (Furukawa and Ponce, 2009). Therefore, the ability of the multi-view diffusion
model to synthesize denser viewpoints is critical to the final 3D generation quality.

Training across different numbers of viewpoints. During training, we sample 1, 4, or 8 views for each
training object, assigning different sampling ratios for each number of views. While training with a
varying number of views enables sampling an arbitrary number of views during inference, it is still
preferable to match the training views to those expected during inference. This helps minimize the gap
between training and inference performance. We compare two models, one trained primarily on 4-view
images and one trained primarily on 8-view images, and sample 10-view images at the same viewpoints. As
shown in Fig. 5, the model trained mostly with 8-view images produces more natural-looking images
with better multi-view consistency across views than the one trained mostly with 4-view images.

3. Reconstruction Model
Extracting 3D structure from image observations is typically referred to as photogrammetry, which has
been widely applied to many 3D reconstruction tasks (Li et al., 2023; Mildenhall et al., 2020; Wang et al., 2021).


[Figure 4 panels: sampled 4 views, sampled 8 views, sampled 10 views, and sampled 12 views.]

Figure 4: Comparison of number of sampled views. All images are sampled from the same model.
Our multi-view diffusion model can synthesize object images with dense viewpoint coverage while
maintaining good multi-view consistency, making it suitable for the downstream reconstruction model.

We use a Transformer-based reconstruction model (Hong et al., 2023) to generate the 3D mesh
geometry, texture map, and material map from multi-view images. We find that a Transformer-based
model demonstrates strong generalization capabilities to unseen object images, including synthesized
outputs from 2D multi-view diffusion models (Sec. 2).

We use a decoder-only Transformer model with triplanes (Chan et al., 2023; Hong et al., 2023) as the
latent 3D representation. The input RGB and normal images serve as conditioning for the reconstruction
model, with cross-attention layers applied between triplane tokens and the input conditioning. The
triplane tokens are processed through MLPs to predict neural fields for Signed Distance Functions
(SDF) and PBR properties (Karis, 2013), which are used for SDF-based volume rendering (Yariv et al.,
2021). The neural SDF is converted into a 3D mesh through isosurface extraction (Lorensen and Cline,
1998; Shen et al., 2023). The PBR properties are baked into texture and material maps via UV mapping,
including albedo colors and material properties like roughness and metallic channels.
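
A simplified sketch of how triplane tokens can be decoded into SDF and PBR values at query points is given below; it omits the Transformer that produces the triplanes from the image tokens, and all layer sizes and channel counts are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriplaneDecoder(nn.Module):
    """Decode SDF and PBR properties from triplane features at 3D query points (a sketch)."""
    def __init__(self, channels: int = 32, hidden: int = 64):
        super().__init__()
        self.sdf_mlp = nn.Sequential(nn.Linear(3 * channels, hidden), nn.SiLU(), nn.Linear(hidden, 1))
        self.pbr_mlp = nn.Sequential(nn.Linear(3 * channels, hidden), nn.SiLU(), nn.Linear(hidden, 5))
        # PBR output: albedo RGB + roughness + metallic.

    def forward(self, triplanes: torch.Tensor, points: torch.Tensor):
        # triplanes: (B, 3, C, H, W); points: (B, N, 3) with coordinates in [-1, 1]
        feats = []
        for i, dims in enumerate([[0, 1], [0, 2], [1, 2]]):       # XY, XZ, YZ planes
            uv = points[..., dims].unsqueeze(1)                   # (B, 1, N, 2)
            f = F.grid_sample(triplanes[:, i], uv, align_corners=False)  # (B, C, 1, N)
            feats.append(f.squeeze(2).transpose(1, 2))            # (B, N, C)
        feat = torch.cat(feats, dim=-1)                           # (B, N, 3C)
        return self.sdf_mlp(feat), self.pbr_mlp(feat)             # per-point SDF and PBR values

# Usage sketch
# sdf, pbr = TriplaneDecoder()(torch.randn(1, 3, 32, 64, 64), torch.rand(1, 1024, 3) * 2 - 1)
```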


[Figure 5 panels: model trained with mostly 4 views vs. model trained with mostly 8 views.]

Figure 5: Comparison of number of training views. We compare two models trained primarily with
different numbers of views (4 vs. 8), and sample images at the same 10 views at inference time. The
model trained primarily on 8 views generates images with better multi-view consistency compared to
the 4-view counterpart.

Training. We train our reconstruction model using large-scale imagery and 3D asset data. The model is
supervised on depth, normal, mask, albedo, and material channels through SDF-based volume rendering,
with outputs rendered from artist-generated meshes. Since surface normal computation is relatively
expensive, we compute the normal only at the surface and supervise against the ground truth. We find
that aligning the uncertainty of the SDF (Yariv et al., 2021) with the corresponding rendering resolution
improves the visual quality of the final output. Additionally, we mask out object edges during loss
computation to avoid noisy samples caused by aliasing. To smooth noisy gradients across samples, we
apply exponential moving average (EMA) to aggregate the final reconstruction model weights.
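
The EMA aggregation of model weights can be summarized with the sketch below; the decay value is an assumption.

```python
import copy
import torch

@torch.no_grad()
def update_ema(ema_model, model, decay: float = 0.999):
    """Maintain an exponential moving average of the reconstruction model weights."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)

# Usage sketch: create the EMA copy once, then update it after every optimizer step.
# ema_model = copy.deepcopy(model)
# for step in range(num_steps):
#     ...training step...
#     update_ema(ema_model, model)
```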

Mesh post-processing. After obtaining the dense triangular 3D mesh from isosurface extraction, we
post-process the mesh with the following steps:

1. Retopologize into a quadrilateral (quad) mesh with simplified geometry and adaptive topologies.
2. Generate the UV mapping based on the resulting quad mesh topology.
3. Bake the albedo and material neural fields into a texture map and material map, respectively.

These post-processing steps make the resulting mesh more suitable for further editing, essential for
artistic and design-oriented downstream applications.
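
To illustrate the baking in step 3, the sketch below evaluates an albedo field at the 3D surface point behind each texel of the UV map and writes the result into a texture image; the albedo_field callable and the precomputed texel-to-surface position map are assumptions standing in for the neural field and the UV rasterization, and the material map can be baked the same way.

```python
import numpy as np

def bake_texture(albedo_field, texel_positions: np.ndarray, texel_mask: np.ndarray) -> np.ndarray:
    """albedo_field: callable mapping (N, 3) surface points to (N, 3) RGB values in [0, 1].
    texel_positions: (H, W, 3) 3D surface point behind each texel (from UV rasterization).
    texel_mask: (H, W) bool array, True where a texel is covered by the mesh surface."""
    h, w, _ = texel_positions.shape
    texture = np.zeros((h, w, 3), dtype=np.float32)
    covered = texel_positions[texel_mask]          # query only covered texels
    texture[texel_mask] = albedo_field(covered)    # bake the field values into the map
    return (texture * 255).astype(np.uint8)

# Usage sketch with a dummy field standing in for the neural albedo field.
dummy_field = lambda p: 0.5 + 0.5 * np.sin(10.0 * p)
positions = np.random.rand(512, 512, 3).astype(np.float32)
mask = np.ones((512, 512), dtype=bool)
texture_map = bake_texture(dummy_field, positions, mask)
```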

3.1. Ablation Studies


We study the scaling properties of the reconstruction model in the following aspects: (a) the number of
input images and (b) the number of latent tokens representing the shape.

Experimental setup. For validation, we randomly select 78 shapes from a held-out dataset. We
report the LPIPS (Zhang et al., 2018) score on the albedo prediction to quantify the base texture
reconstruction performance. For material prediction accuracy, we use the 𝐿2 error on the roughness
and metallic values. We also use the 𝐿2 error between the ground-truth and predicted depths as a
proxy for evaluating the geometry accuracy of the reconstructed shapes. The camera poses used for
input and output are fixed at an elevation of 20°, pointing towards the origin (Fig. 6).
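
One way to compute these metrics is sketched below using the lpips package and plain tensor operations; this is an assumption about tooling, since the report does not specify the implementation.

```python
import torch
import lpips

lpips_fn = lpips.LPIPS(net='vgg')  # perceptual metric for albedo reconstruction quality

def evaluate_views(pred_albedo, gt_albedo, pred_material, gt_material, pred_depth, gt_depth):
    """All inputs are (N, C, H, W) tensors; albedo images are scaled to [-1, 1] as lpips
    expects, the material tensors hold roughness/metallic channels, depth is one channel."""
    albedo_lpips = lpips_fn(pred_albedo, gt_albedo).mean()
    material_l2 = torch.mean((pred_material - gt_material) ** 2)
    depth_l2 = torch.mean((pred_depth - gt_depth) ** 2)
    return albedo_lpips.item(), material_l2.item(), depth_l2.item()
```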


[Figure 6 panels: 4 views, 4 views (diagonal), 8 views, and 16 views.]

Figure 6: Camera pose sets for scaling reconstruction. The camera poses are set at various azimuth
angles and a fixed elevation of 20°. The camera in yellow indicates the "frontal" pose with respect to a
canonicalized coordinate system. In each setup, the cameras are uniformly distributed around a circle
at a fixed radius, looking towards the origin. The "4 views (diagonal)" setup represents the same 4-view
pose set but with a 45° offset, resulting in the cameras looking at the object region from diagonal angles.

Albedo LPIPS error
Input \ Validation views      4         4 (diag.)   8         16
4                             0.0732    0.0791      0.0762    0.0768
4 (diag.)                     0.0802    0.0756      0.0779    0.0783
8                             0.0691    0.0698      0.0695    0.0699
16                            0.0687    0.0689      0.0688    0.0687

Material 𝐿2 error
Input \ Validation views      4         4 (diag.)   8         16
4                             0.0015    0.0020      0.0017    0.0018
4 (diag.)                     0.0024    0.0019      0.0022    0.0022
8                             0.0013    0.0012      0.0013    0.0013
16                            0.0012    0.0013      0.0013    0.0013

Depth 𝐿2 error
Input \ Validation views      4         4 (diag.)   8         16
4                             0.0689    0.0751      0.0720    0.0722
4 (diag.)                     0.0704    0.0683      0.0694    0.0696
8                             0.0626    0.0641      0.0633    0.0633
16                            0.0613    0.0626      0.0619    0.0616

Table 1: Comparison of the number of input views. The diagonal cells indicate the cases where the
input views match the validation views. The quality improves (error decreases) as the number of input
views increases, demonstrating the scalability of our reconstruction model.

Scaling with respect to the number of viewpoints. We find that our reconstruction model consistently
recovers the input views more accurately than novel views. The model scales well with respect to the
number of viewpoints, i.e., its performance improves as more information is provided. We demonstrate
this with various combinations of input and output camera views, as shown in Fig. 6.

We present the quantitative results in Tab. 1. The diagonal entries of the table are highlighted, as they
represent cases where the inputs and outputs are identical. These diagonal entries often show the best
results in each row, indicating that the model reproduces the input views most accurately. Additionally,
the results consistently improve as the number of input views increases from 4 to 16. This suggests
that the reconstruction model benefits from the additional input information.

Motivated by the model’s scaling with the number of viewpoints, we further investigate whether the
number of training views affects reconstruction quality. We evaluate the model using a fixed 8-view
setup, where the model is trained with 4, 6, 8, and 10 views. The results are shown in Fig. 7a. Although
stochastic sampling of camera poses provides diverse viewpoints during training, the reconstruction
quality continues to improve as the number of views used in each training step increases.

Scaling with respect to compute. We study the impact of compute on the reconstruction model without
changing the model size (i.e., the number of model parameters). For this analysis, we scale down the
triplane token sizes in the self-attention and cross-attention blocks to reduce computation; this
adjustment does not alter the number of model parameters. We observe from Fig. 7b that as the number
of tokens increases, the results improve proportionally with the available compute.
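
To make the quadratic scaling concrete, the sketch below estimates self-attention FLOPs as a function of triplane resolution; the per-layer cost formula and feature width are standard back-of-the-envelope assumptions, not the exact cost of the model.

```python
def self_attention_flops(triplane_res: int, dim: int = 1024) -> float:
    """Rough FLOPs of one self-attention layer over triplane tokens. A triplane with
    resolution R contributes 3 * R * R tokens; the cost is approximately
    4*N*D^2 for the Q/K/V/output projections plus 2*N^2*D for the attention itself."""
    n_tokens = 3 * triplane_res * triplane_res
    return 4 * n_tokens * dim ** 2 + 2 * n_tokens ** 2 * dim

for res in (16, 32, 64, 96, 128):
    print(f"triplane resolution {res:3d}: ~{self_attention_flops(res) / 1e9:.1f} GFLOPs per layer")
```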


[Figure 7 plots: Albedo LPIPS error, Material 𝐿2 error, and Depth 𝐿2 error, plotted against (a) the number of training views (4, 6, 8, 10) and (b) the triplane resolution (16, 32, 64, 96, 128).]

Figure 7: (a) Comparison of the number of training views. The reconstruction model continues to
improve as the number of training views increases. (b) Comparison of the number of tokens. With a
fixed number of parameters, the model requires more compute for a larger number of tokens. Note that
the self-attention FLOPs increase quadratically with the triplane resolution. The quality consistently
scales with the amount of compute.

4. Data Processing
Edify 3D is trained on a combination of non-public large-scale imagery, pre-rendered multi-view images,
and 3D shape datasets. We focus on pre-processing 3D shape data in this section. The raw 3D data
undergo several preprocessing steps to achieve the quality and format required for model training.

Format conversion. The first step of the data processing pipeline involves converting all 3D shapes
into a unified format. We triangulate the meshes, pack all texture files, and convert the materials into
metallic-roughness format. We discard shapes with corrupted textures or materials. This process
results in a collection of 3D shapes that can be rendered as intended by the original creators.
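
A rough sketch of such a conversion pass is shown below using the trimesh library; the report does not name the toolchain, so the library choice, file formats, and failure handling are assumptions.

```python
import trimesh

def convert_to_unified_format(src_path: str, dst_path: str) -> bool:
    """Load a shape, triangulate it, and re-export it as glTF binary (GLB), whose
    material model is metallic-roughness and which packs textures into one file.
    Shapes that fail to load cleanly are discarded from the training set."""
    try:
        mesh = trimesh.load(src_path, force='mesh')   # returns a single triangulated mesh
        if mesh.is_empty or len(mesh.faces) == 0:
            return False
        mesh.export(dst_path)                          # format inferred from the extension
        return True
    except Exception:
        return False                                   # corrupted textures/materials, etc.

# Usage sketch
# kept = convert_to_unified_format('asset.obj', 'asset.glb')
```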

Quality filtering. We filter out non-object-centric data from the large-scale 3D datasets. We render the
shapes from multiple viewpoints and use AI classifiers to remove partial 3D scans, large scenes, shape
collages, and shapes containing auxiliary structures such as backdrops and ground planes. To ensure
quality, this process is conducted through multiple rounds of active learning, with human experts
continuously curating challenging examples to refine the AI classifier. Additionally, we apply rule-based
filtering to remove shapes with obvious issues, such as those that are excessively thin or lack texture.

Canonical pose alignment. We align our training shapes to their canonical poses to reduce potential
ambiguity when training the model. Pose alignment is also achieved via active learning. We manually
curate a small number of examples, train a pose predictor, look for hard examples in the full dataset,
and repeat the process. Defining the canonical pose is also critical. While many objects such as cars,
animals, and shoes already have natural canonical poses, other shapes may lack a clear front side, in
which case we define the functional part as the front and prioritize maintaining left-right symmetry.

PBR rendering. To render the 3D data into images for the diffusion and reconstruction models,
we use an in-house path tracer for photorealistic rendering. We employ a diverse set of sampling
techniques for the camera parameters. Half of the images are rendered from fixed elevation angles
with consistent intrinsic parameters, while the remaining images are rendered using random camera
poses and intrinsics. As a constraint, we maintain roughly consistent object sizes within the rendered
images. This approach accommodates both text-to-3D use cases (where there is full control over the
camera parameters) and image-to-3D use cases (where the reference image may come from a wide
range of camera intrinsics).
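
A simplified sketch of this camera sampling strategy is given below; the specific elevation set, focal-length range, and distance heuristic are illustrative assumptions.

```python
import numpy as np

def sample_camera(rng: np.random.Generator, fixed_ratio: float = 0.5):
    """Return (azimuth_deg, elevation_deg, focal_mm, distance) for one rendering.
    Half of the renders use a fixed elevation and consistent intrinsics, the rest use
    random poses and intrinsics; the distance keeps the object size roughly constant."""
    azimuth = rng.uniform(0.0, 360.0)
    if rng.random() < fixed_ratio:
        elevation, focal = 20.0, 50.0                 # fixed elevation, consistent intrinsics
    else:
        elevation = rng.uniform(-10.0, 60.0)          # random camera pose
        focal = rng.uniform(24.0, 135.0)              # random intrinsics
    distance = 2.5 * focal / 50.0                     # longer lens -> move the camera farther away
    return azimuth, elevation, focal, distance

rng = np.random.default_rng(0)
cameras = [sample_camera(rng) for _ in range(8)]
```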

AI captions. To caption the 3D shapes, we render one image per shape and use a vision-language model
(VLM) to generate both long and short captions for the image. To enhance the comprehensiveness of
captions, we also provide the metadata of the shape (e.g., title, description, category tree) to the VLM.


[Figure 8 panels show the PBR rendering, albedo, and surface normal for each prompt: "A bear wearing a cowboy outfit.", "A hedgehog in a wizard outfit.", "A rabbit as a famous painter with a palette.", "A knight’s armor on a stand.", "A phonograph made of wood and gold.", "Cute isometric house, adobe style, desert tan.", "An orange factory robot arm.", "A spaceship pilot chair.", "A lion dressed as a chef.", and "A full backpack with hanging space tools."]

Figure 8: Text-to-3D generation results. We include the input text prompts as well as renderings and
surface normals of generated assets. The generated 3D meshes include detailed geometry and sharp
textures with well-decomposed albedo colors, making them suitable for various downstream editing
and rendering applications.


[Figure 9 panels show the input image, surface normal, PBR rendering, and albedo for each example.]

Figure 9: Image-to-3D generation results. We visualize the input reference images as well as renderings
and surface normals of generated assets. Edify 3D can faithfully recover the underlying 3D structures
of the reference object while also being able to hallucinate detailed textures in unseen surface regions
(e.g., the backside of the cup).

Figure 10: Quad mesh topologies. Edify 3D generates assets in the form of quad meshes with clean
topologies, making it suitable for downstream editing workflows. We visualize the quad mesh topologies
of the generated assets with their PBR renderings side-by-side.

5. Results
We showcase text-to-3D generation results from Edify 3D in Fig. 8 and image-to-3D generation in Fig. 9.
The generated meshes include detailed geometry and sharp textures, with well-decomposed albedo
colors that represent the surface’s base color. For image-to-3D generation, Edify 3D not only accurately
recovers the underlying 3D structures of the reference object, but also generates detailed textures
in regions of the surface not directly observed in the input image.

The assets generated by Edify 3D come in the form of quad meshes with well-organized topologies, as
visualized in Fig. 10. These structured meshes allow for easier manipulation and precise adjustments,
making them well-suited for various downstream editing tasks and rendering applications. This enables
seamless integration into 3D workflows that require visual fidelity and flexibility.

6. Related Work
3D asset generation. The challenge of 3D asset generation is often addressed by training models
on 3D datasets (Gupta et al., 2023; Jun and Nichol, 2023; Nichol et al., 2022; Zeng et al., 2022), but
the scarcity of these datasets limits the ability to generalize. To overcome this, recent methods have
shifted towards using models trained on large-scale image and video datasets. Score Distillation
Sampling (SDS) (Poole et al., 2023) has been adopted in earlier methods (Huang et al., 2023; Lin et al.,
2023; Sun et al., 2024; Tang et al., 2023; Wang et al., 2023, 2024; Yi et al., 2023; Zhu and Zhuang, 2023)
and extended to image-conditioned 3D generative models (Liu et al., 2023; Long et al., 2024; Qian et al.,
2024; Tang et al., 2023; Wang and Shi, 2023; Yu et al., 2023). However, they often experience slow
processing (Lorraine et al., 2023; Xie et al., 2024) and are susceptible to artifacts such as the Janus
face problem (Shi et al., 2023). To improve performance, newer techniques integrate multi-view image
generative models, focusing on producing multiple consistent views that can be reconstructed into 3D
models (Chan et al., 2023; Chen et al., 2024; Gao et al., 2024; Höllein et al., 2024; Liu et al., 2024; Shi et al.,
2023; Tang et al., 2024; Weng et al., 2023; Yang et al., 2024). However, maintaining consistency across
these views remains a challenge, leading to the development of methods that enhance reconstruction
robustness from limited views (Li et al., 2024; Liu et al., 2024).

3D reconstruction from multi-view images. 3D asset generation from limited views typically involves
3D reconstruction techniques, often with differentiable rendering, that can utilize various 3D
representations such as Neural Radiance Fields (NeRF) (Mildenhall et al., 2020). Meshes are the
most commonly used format in industrial 3D engines, yet reconstructing high-quality meshes from
multi-view images is challenging. Traditional photogrammetry pipelines, including structure from
motion (SfM) (Agarwal et al., 2011; Schönberger and Frahm, 2016; Snavely et al., 2006), multi-view
stereo (MVS) (Furukawa and Ponce, 2009; Schönberger et al., 2016), and surface extraction (Kazhdan
et al., 2006; Lorensen and Cline, 1998; Shen et al., 2021, 2023), are costly and time-consuming, often
yielding low-quality results. While NeRF-based neural rendering methods can achieve high-quality
3D reconstructions (Guédon and Lepetit, 2024; Huang et al., 2024; Kerbl et al., 2023; Li et al., 2023;
Wang et al., 2021; Yariv et al., 2021), they require dense images and extensive optimization, and
converting radiance fields into meshes can lead to suboptimal results. To address these limitations,
Transformer-based (Vaswani et al., 2017) models further improve 3D NeRF reconstruction from sparse
views by learning a feed-forward prior (Hong et al., 2023).

Texture and material generation. Earlier approaches targeting 3D texture generation given a 3D
shape include CLIP (Radford et al., 2021) for text alignment (Michel et al., 2022; Mohammad Khalid
et al., 2022) and SDS loss optimization (Poole et al., 2023). To improve 3D awareness, some text-to-3D
methods combine texture inpainting with depth-conditioned diffusion (Chen et al., 2023), albeit
slower and more artifact-prone. To enhance consistency, other techniques alternate diffusion with
reprojection (Cao et al., 2023) or generate multiple textured views simultaneously (Deng et al., 2024),
though at a higher computational cost. To further enhance realism, some methods have enabled
multi-view PBR modeling (Boss et al., 2022; Zhang et al., 2021) to extend support for generating
material properties (Chen et al., 2023; Qiu et al., 2024; Xu et al., 2023).

7. Application: 3D Scene Generation


In this section, we demonstrate an application of our Edify 3D model to scalable 3D scene generation
(Bahmani et al., 2023). High-quality, large-scale 3D scenes are pivotal for content creation and training
robust embodied AI agents. However, existing scene creation mostly depends on 3D scans with costly
human annotations (Dai et al., 2017) or direct scene modeling (Bautista et al., 2022; Chen et al., 2023;
Fridman et al., 2024), which largely limits the scalability.

[Figure 11 diagram: the input prompt "A Colorado desert scene with a skull in a gold rush theme.", the Edify 3D generated assets, the scene layout, and the resulting 3D scene generation.]

Figure 11: The high-quality 3D assets generated by Edify 3D can be extended to 3D scene generation
combined with a given scene layout, which can be generated by an LLM. This enables the generated 3D
scene to be readily rendered at high quality while being editable for various downstream applications.

With Edify 3D as the 3D asset generation API, we can design a scalable system to generate 3D scenes
from only an input text prompt (Lin et al., 2024). The system generates the scene layout of the 3D objects
with a large language model (LLM) (OpenAI et al., 2023), which specifies the positions and sizes of the 3D
objects². As a result, the system can generate a realistic and complex 3D scene that coherently aligns
with the text prompt describing the whole scene.
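
To give a sense of what an LLM-produced layout might look like, the sketch below defines a minimal layout schema (prompt, position, size) and assembles a scene from it; the schema, field names, and example objects are assumptions for illustration, not the format of Lin et al. (2024).

```python
from dataclasses import dataclass

@dataclass
class LayoutItem:
    prompt: str        # text prompt sent to the 3D asset generation API
    position: tuple    # (x, y, z) placement in the scene, in meters
    size: float        # target bounding-box size used to scale the asset

# Example layout, as an LLM might propose for a desert-themed scene prompt.
layout = [
    LayoutItem("a weathered animal skull", (0.0, 0.0, 0.0), 0.6),
    LayoutItem("a rusty gold-rush era pickaxe", (1.2, 0.0, -0.4), 0.8),
    LayoutItem("a tall desert cactus", (-1.5, 0.0, 0.8), 1.8),
]

def assemble_scene(layout):
    """Generate each asset (stubbed here) and return its transform for scene assembly."""
    scene = []
    for item in layout:
        asset = f"<3D asset for '{item.prompt}'>"   # stand-in for a call to the asset API
        scene.append({"asset": asset, "position": item.position, "scale": item.size})
    return scene

print(assemble_scene(layout))
```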

We show an example result of 3D scene generation in Fig. 11. The generated assets from Edify 3D
include detailed geometry and textures, forming a scene composition together with a generated scene
layout. Since all the 3D assets are created individually, the generated 3D scene is naturally editable for
various specialized applications, such as artist creation, 3D designs, or embodied AI simulations.

² We refer readers to Lin et al. (2024) for details on layout generation with LLMs.


8. Conclusion
In this technical report, we present Edify 3D, a solution designed for high-quality 3D asset generation.
We introduce the Edify 3D models, analyze the scaling laws of each model, and describe the data
curation pipeline. Additionally, we explore the application of Edify 3D to scalable 3D scene generation.
We are committed to advancing and developing new automation tools for 3D asset generation, making
3D content creation more accessible to everyone.

A. Contributors and Acknowledgements


A.1. Core Contributors
System design: Chen-Hsuan Lin, Xiaohui Zeng, Zhaoshuo Li, Zekun Hao, Ming-Yu Liu, Tsung-Yi Lin.
Multi-view diffusion model: Xiaohui Zeng, Qinsheng Zhang, Ming-Yu Liu, Tsung-Yi Lin.
Reconstruction model: Zhaoshuo Li, Chen-Hsuan Lin, Zekun Hao, Yen-Chen Lin, Ming-Yu Liu, Tsung-Yi
Lin.
3D data processing: Zekun Hao, Fangyin Wei, Yin Cui, Yunhao Ge, Yifan Ding, Donglai Xiang, Qianli Ma,
Jacob Munkberg, Jon Hasselgren, Chen-Hsuan Lin, Tsung-Yi Lin.
Mesh post-processing: Donglai Xiang, Qianli Ma, J.P. Lewis, Zekun Hao, Zhaoshuo Li, Fangyin Wei,
Xiaohui Zeng, Jingyi Jin, Chen-Hsuan Lin, Tsung-Yi Lin.
Evaluation: Jingyi Jin, Xiaohui Zeng, Zhaoshuo Li, Qianli Ma, Yen-Chen Lin, Chen-Hsuan Lin, Tsung-Yi Lin.

A.2. Contributors
Maciej Bala, Jacob Huffman, Alice Luo, Stella Shi, Jiashu Xu.

A.3. Acknowledgements
We thank Lars Bishop, Sanja Fidler, Jun Gao, Jinwei Gu, Aaron Lefohn, Arun Mallya, Hanzi Mao,
Seungjun Nah, Fitsum Reda, David Romero Guzman, Rohan Sawhney, Nicholas Sharp, Tianchang Shen,
Peter Shipkov, Towaki Takikawa, Heng Wang, and Martin Watt for useful research discussions and
prototyping. We also thank Dane Aconfora, Yazdan Aghaghiri, Margaret Albrecht, Arslan Ali, Sivakumar
Arayandi Thottakara, Amelia Barton, Lucas Brown, Matt Catrett, Douglas Chang, Steve Chappell,
Gerardo Delgado Cabrera, John Dickinson, Amol Fasale, Daniela Flamm Jackson, Sandra Froehlich,
Devika Ghaisas, Yugi Guvvala, Brett Hamilton, Mohammad Harrim, Nathan Horrocks, Akan Huang,
Sophia Huang, Pooya Jannaty, Pranjali Joshi, Tobias Lasser, Gabriele Leone, Aaron Licata, Ashlee
Martino-Tarr, Alexandre Milesi, Amanda Moran, Pawel Morkisz, Andrew Morse, Jashojit Mukherjee, Brad
Nemire, Dade Orgeron, David Page, Mitesh Patel, Jason Paul, Joel Pennington, Lyne Tchapmi, Jibin
Varghese, Thomas Volk, Raju Wagwani, Herb Woodruff, and Josh Young for feedback and support.


References
[1] Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Simon, Brian Curless, Steven M Seitz, and
Richard Szeliski. Building rome in a day. Communications of the ACM, 2011. 11

[2] Sherwin Bahmani, Jeong Joon Park, Despoina Paschalidou, Xingguang Yan, Gordon Wetzstein,
Leonidas Guibas, and Andrea Tagliasacchi. Cc3d: Layout-conditioned generation of compositional
3d scenes. In ICCV, 2023. 11

[3] Miguel Angel Bautista, Pengsheng Guo, Samira Abnar, Walter Talbott, Alexander Toshev, Zhuoyuan
Chen, Laurent Dinh, Shuangfei Zhai, Hanlin Goh, Daniel Ulbricht, et al. Gaudi: A neural architect for
immersive 3d scene generation. In NeurIPS, 2022. 12

[4] Mark Boss, Andreas Engelhardt, Abhishek Kar, Yuanzhen Li, Deqing Sun, Jonathan Barron, Hendrik
Lensch, and Varun Jampani. Samurai: Shape and material from unconstrained real-world arbitrary
image collections. In NeurIPS, 2022. 11

[5] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr,
Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh.
Video generation models as world simulators. 2024. URL https://openai.com/research/
video-generation-models-as-world-simulators. 2

[6] Tianshi Cao, Karsten Kreis, Sanja Fidler, Nicholas Sharp, and Kangxue Yin. Texfusion: Synthesizing
3d textures with text-guided image diffusion models. In ICCV, 2023. 11

[7] Eric R Chan, Koki Nagano, Matthew A Chan, Alexander W Bergman, Jeong Joon Park, Axel Levy,
Miika Aittala, Shalini De Mello, Tero Karras, and Gordon Wetzstein. Generative novel view synthesis
with 3d-aware diffusion models. In ICCV, 2023. 5, 11

[8] Dave Zhenyu Chen, Yawar Siddiqui, Hsin-Ying Lee, Sergey Tulyakov, and Matthias Nießner.
Text2tex: Text-driven texture synthesis via diffusion models. In ICCV, 2023. 11

[9] Rui Chen, Yongwei Chen, Ningxin Jiao, and Kui Jia. Fantasia3d: Disentangling geometry and
appearance for high-quality text-to-3d content creation. In ICCV, 2023. 11

[10] Zhaoxi Chen, Guangcong Wang, and Ziwei Liu. Scenedreamer: Unbounded 3d scene generation
from 2d image collections. TPAMI, 2023. 12

[11] Zilong Chen, Yikai Wang, Feng Wang, Zhengyi Wang, and Huaping Liu. V3d: Video diffusion models
are effective 3d generators. arXiv preprint arXiv:2403.06738, 2024. 2, 11

[12] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias
Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, 2017. 12

[13] Kangle Deng, Timothy Omernick, Alexander Weiss, Deva Ramanan, Jun-Yan Zhu, Tinghui Zhou, and
Maneesh Agrawala. Flashtex: Fast relightable mesh texturing with lightcontrolnet. arXiv preprint
arXiv:2402.13251, 2024. 11

[14] Rafail Fridman, Amit Abecasis, Yoni Kasten, and Tali Dekel. Scenescape: Text-driven consistent
scene generation. In NeurIPS, 2024. 12

[15] Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multiview stereopsis. TPAMI,
2009. 4, 11

[16] Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul
Srinivasan, Jonathan T Barron, and Ben Poole. Cat3d: Create anything in 3d with multi-view
diffusion models. arXiv preprint arXiv:2405.10314, 2024. 11


[17] Antoine Guédon and Vincent Lepetit. Sugar: Surface-aligned gaussian splatting for efficient 3d
mesh reconstruction and high-quality mesh rendering. In CVPR, 2024. 11

[18] Anchit Gupta, Wenhan Xiong, Yixin Nie, Ian Jones, and Barlas Oğuz. 3dgen: Triplane latent diffusion
for textured mesh generation. arXiv preprint arXiv:2303.05371, 2023. 11

[19] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS,
2020. 2

[20] Lukas Höllein, Norman Müller, David Novotny, Hung-Yu Tseng, Christian Richardt, Michael Zollhöfer,
Matthias Nießner, et al. Viewdiff: 3d-consistent image generation with text-to-image models. In
CVPR, 2024. 11

[21] Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli,
Trung Bui, and Hao Tan. Lrm: Large reconstruction model for single image to 3d. arXiv preprint
arXiv:2311.04400, 2023. 2, 5, 11

[22] Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2d gaussian splatting
for geometrically accurate radiance fields. In SIGGRAPH, 2024. 11

[23] Yukun Huang, Jianan Wang, Yukai Shi, Xianbiao Qi, Zheng-Jun Zha, and Lei Zhang. Dreamtime: An
improved optimization strategy for text-to-3d content creation. arXiv preprint arXiv:2306.12422,
2023. 11

[24] Heewoo Jun and Alex Nichol. Shap-e: Generating conditional 3d implicit functions. arXiv preprint
arXiv:2305.02463, 2023. 11

[25] Brian Karis. Real shading in unreal engine 4. In Physically Based Shading Theory Practice, 2013. 5

[26] Michael Kazhdan, Matthew Bolitho, and Hugues Hoppe. Poisson surface reconstruction. In SGP,
2006. 11

[27] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian
splatting for real-time radiance field rendering. ACM Transactions on Graphics (TOG), 2023. 11

[28] Jiahao Li, Hao Tan, Kai Zhang, Zexiang Xu, Fujun Luan, Yinghao Xu, Yicong Hong, Kalyan Sunkavalli,
Greg Shakhnarovich, and Sai Bi. Instant3d: Fast text-to-3d with sparse-view generation and large
reconstruction model. In ICLR, 2024. 2, 11

[29] Zhaoshuo Li, Thomas Müller, Alex Evans, Russell H Taylor, Mathias Unberath, Ming-Yu Liu, and
Chen-Hsuan Lin. Neuralangelo: High-fidelity neural surface reconstruction. In CVPR, 2023. 4, 11

[30] Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis,
Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: High-resolution text-to-3d content creation.
In CVPR, 2023. 2, 11

[31] Tsung-Yi Lin, Chen-Hsuan Lin, Yin Cui, Yunhao Ge, Seungjun Nah, Arun Mallya, Zekun Hao, Yifan
Ding, Hanzi Mao, Zhaoshuo Li, et al. Genusd: 3d scene generation made easy. In ACM SIGGRAPH
2024 Real-Time Live! 2024. 12

[32] Minghua Liu, Ruoxi Shi, Linghao Chen, Zhuoyang Zhang, Chao Xu, Xinyue Wei, Hansheng Chen,
Chong Zeng, Jiayuan Gu, and Hao Su. One-2-3-45++: Fast single image to 3d objects with
consistent multi-view generation and 3d diffusion. In CVPR, 2024. 11

[33] Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Mukund Varma T, Zexiang Xu, and Hao Su.
One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimization. In
NeurIPS, 2024. 11


[34] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick.
Zero-1-to-3: Zero-shot one image to 3d object. In ICCV, 2023. 11

[35] Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-
Hai Zhang, Marc Habermann, Christian Theobalt, et al. Wonder3d: Single image to 3d using
cross-domain diffusion. In CVPR, 2024. 11

[36] William E Lorensen and Harvey E Cline. Marching cubes: A high resolution 3d surface construction
algorithm. In Seminal graphics: pioneering efforts that shaped the field. 1998. 5, 11

[37] Jonathan Lorraine, Kevin Xie, Xiaohui Zeng, Chen-Hsuan Lin, Towaki Takikawa, Nicholas Sharp,
Tsung-Yi Lin, Ming-Yu Liu, Sanja Fidler, and James Lucas. Att3d: Amortized text-to-3d object
synthesis. In ICCV, 2023. 11

[38] Oscar Michel, Roi Bar-On, Richard Liu, Sagie Benaim, and Rana Hanocka. Text2mesh: Text-driven
neural stylization for meshes. In CVPR, 2022. 11

[39] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and
Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020. 4,
11

[40] Nasir Mohammad Khalid, Tianhao Xie, Eugene Belilovsky, and Tiberiu Popa. Clip-mesh: Generating
textured meshes from text using pretrained image-text models. In SIGGRAPH Asia, 2022. 11

[41] Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. Point-e: A system for
generating 3d point clouds from complex prompts. arXiv preprint arXiv:2212.08751, 2022. 11

[42] NVIDIA, Yuval Atzmon, Maciej Bala, Yogesh Balaji, Tiffany Cai, Yin Cui, Jiaojiao Fan, Yunhao Ge,
Siddharth Gururani, Jacob Huffman, Ronald Isaac, Pooya Jannaty, Tero Karras, Grace Lam, J. P.
Lewis, Aaron Licata, Yen-Chen Lin, Ming-Yu Liu, Qianli Ma, Arun Mallya, Ashlee Martino-Tarr, Doug
Mendez, Seungjun Nah, Chris Pruett, Fitsum Reda, Jiaming Song, Ting-Chun Wang, Fangyin Wei,
Xiaohui Zeng, Yu Zeng, and Qinsheng Zhang. Edify image: High-quality image generation with pixel
space laplacian diffusion models. 2024. 3

[43] OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni
Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical
report. arXiv preprint arXiv:2303.08774, 2023. 12

[44] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d
diffusion. In ICLR, 2023. 11

[45] Guocheng Qian, Jinjie Mai, Abdullah Hamdi, Jian Ren, Aliaksandr Siarohin, Bing Li, Hsin-Ying Lee,
Ivan Skorokhodov, Peter Wonka, Sergey Tulyakov, et al. Magic123: One image to high-quality 3d
object generation using both 2d and 3d diffusion priors. In ICLR, 2024. 11

[46] Lingteng Qiu, Guanying Chen, Xiaodong Gu, Qi Zuo, Mutian Xu, Yushuang Wu, Weihao Yuan, Zilong
Dong, Liefeng Bo, and Xiaoguang Han. Richdreamer: A generalizable normal-depth diffusion model
for detail richness in text-to-3d. In CVPR, 2024. 11

[47] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish
Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from
natural language supervision. In ICML, 2021. 11

[48] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical
image segmentation. In MICCAI, 2015. 3


[49] Johannes L Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view
selection for unstructured multi-view stereo. In ECCV, 2016. 11

[50] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In CVPR,
2016. 11

[51] Tianchang Shen, Jun Gao, Kangxue Yin, Ming-Yu Liu, and Sanja Fidler. Deep marching tetrahedra:
a hybrid representation for high-resolution 3d shape synthesis. In NeurIPS, 2021. 11

[52] Tianchang Shen, Jacob Munkberg, Jon Hasselgren, Kangxue Yin, Zian Wang, Wenzheng Chen, Zan
Gojcic, Sanja Fidler, Nicholas Sharp, and Jun Gao. Flexible isosurface extraction for gradient-based
mesh optimization. ACM Transactions on Graphics (TOG), 2023. 5, 11

[53] Ruoxi Shi, Hansheng Chen, Zhuoyang Zhang, Minghua Liu, Chao Xu, Xinyue Wei, Linghao Chen,
Chong Zeng, and Hao Su. Zero123++: a single image to consistent multi-view diffusion base model.
arXiv preprint arXiv:2310.15110, 2023. 11

[54] Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. Mvdream: Multi-view
diffusion for 3d generation. arXiv preprint arXiv:2308.16512, 2023. 2, 11

[55] Noah Snavely, Steven M Seitz, and Richard Szeliski. Photo tourism: exploring photo collections in
3d. ACM Transactions on Graphics (TOG), 2006. 11

[56] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data
distribution. In NeurIPS, 2019. 2

[57] Jingxiang Sun, Bo Zhang, Ruizhi Shao, Lizhen Wang, Wen Liu, Zhenda Xie, and Yebin Liu.
Dreamcraft3d: Hierarchical 3d generation with bootstrapped diffusion prior. In ICLR, 2024. 11

[58] Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. Dreamgaussian: Generative
gaussian splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653, 2023. 11

[59] Junshu Tang, Tengfei Wang, Bo Zhang, Ting Zhang, Ran Yi, Lizhuang Ma, and Dong Chen.
Make-it-3d: High-fidelity 3d creation from a single image with diffusion prior. In ICCV, 2023. 11

[60] Shitao Tang, Jiacheng Chen, Dilin Wang, Chengzhou Tang, Fuyang Zhang, Yuchen Fan, Vikas
Chandra, Yasutaka Furukawa, and Rakesh Ranjan. Mvdiffusion++: A dense high-resolution
multi-view diffusion model for single or sparse-view 3d object reconstruction. arXiv preprint
arXiv:2402.12712, 2024. 11

[61] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017. 2, 11

[62] Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A Yeh, and Greg Shakhnarovich. Score jacobian
chaining: Lifting pretrained 2d diffusion models for 3d generation. In CVPR, 2023. 11

[63] Peng Wang and Yichun Shi. Imagedream: Image-prompt multi-view diffusion for 3d generation.
arXiv preprint arXiv:2312.02201, 2023. 11

[64] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. Neus:
Learning neural implicit surfaces by volume rendering for multi-view reconstruction. In NeurIPS,
2021. 4, 11

[65] Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer:
High-fidelity and diverse text-to-3d generation with variational score distillation. In NeurIPS, 2024.
11


[66] Haohan Weng, Tianyu Yang, Jianan Wang, Yu Li, Tong Zhang, CL Chen, and Lei Zhang.
Consistent123: Improve consistency for one image to 3d object synthesis. arXiv preprint
arXiv:2310.08092, 2023. 11

[67] Kevin Xie, Jonathan Lorraine, Tianshi Cao, Jun Gao, James Lucas, Antonio Torralba, Sanja Fidler,
and Xiaohui Zeng. Latte3d: Large-scale amortized text-to-enhanced3d synthesis. arXiv preprint
arXiv:2403.15385, 2024. 11

[68] Xudong Xu, Zhaoyang Lyu, Xingang Pan, and Bo Dai. Matlaber: Material-aware text-to-3d via
latent brdf auto-encoder. arXiv preprint arXiv:2308.09278, 2023. 11

[69] Jiayu Yang, Ziang Cheng, Yunfei Duan, Pan Ji, and Hongdong Li. Consistnet: Enforcing 3d
consistency for multi-view images diffusion. In CVPR, 2024. 11

[70] Lior Yariv, Jiatao Gu, Yoni Kasten, and Yaron Lipman. Volume rendering of neural implicit surfaces.
In NeurIPS, 2021. 5, 6, 11

[71] Taoran Yi, Jiemin Fang, Guanjun Wu, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Qi Tian, and Xinggang
Wang. Gaussiandreamer: Fast generation from text to 3d gaussian splatting with point cloud
priors. arXiv preprint arXiv:2310.08529, 2023. 11

[72] Wangbo Yu, Li Yuan, Yan-Pei Cao, Xiangjun Gao, Xiaoyu Li, Long Quan, Ying Shan, and Yonghong
Tian. Hifi-123: Towards high-fidelity one image to 3d content generation. arXiv preprint
arXiv:2310.06744, 2023. 11

[73] Xiaohui Zeng, Arash Vahdat, Francis Williams, Zan Gojcic, Or Litany, Sanja Fidler, and Karsten Kreis.
Lion: Latent point diffusion models for 3d shape generation. In NeurIPS, 2022. 11

[74] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image
diffusion models. In ICCV, 2023. 2, 3

[75] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable
effectiveness of deep features as a perceptual metric. In CVPR, 2018. 6

[76] Xiuming Zhang, Pratul P Srinivasan, Boyang Deng, Paul Debevec, William T Freeman, and
Jonathan T Barron. Nerfactor: Neural factorization of shape and reflectance under an unknown
illumination. ACM Transactions on Graphics (ToG), 2021. 11

[77] Joseph Zhu and Peiye Zhuang. Hifa: High-fidelity text-to-3d with advanced diffusion guidance.
arXiv preprint arXiv:2305.18766, 2023. 11

