Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models

Jia, Xuhui; Zhao, Yang; Chan, Kelvin C. K.; Li, Yandong; Zhang, Han; Gong, Boqing; Hou, Tingbo; Wang, Huisheng; Su, Yu-Chuan

arXiv:2304.02642 (cs)

[Submitted on 5 Apr 2023]

Abstract: This paper proposes a method for generating images of customized objects specified by users. The method is based on a general framework that bypasses the lengthy optimization required by previous approaches, which often employ a per-object optimization paradigm. Our framework adopts an encoder to capture high-level identifiable semantics of objects, producing an object-specific embedding with only a single feed-forward pass. The acquired object embedding is then passed to a text-to-image synthesis model for subsequent generation. To effectively blend an object-aware embedding space into a well-developed text-to-image model under the same generation context, we investigate different network designs and training strategies, and propose a simple yet effective regularized joint training scheme with an object identity preservation loss. Additionally, we propose a caption generation scheme that becomes a critical piece in ensuring the object-specific embedding is faithfully reflected in the generation process, while preserving control and editing abilities. Once trained, the network is able to produce diverse content and styles, conditioned on both texts and objects. We demonstrate through experiments that our proposed method is able to synthesize images with compelling output quality, appearance diversity, and object fidelity, without the need for test-time optimization. Systematic studies are also conducted to analyze our models, providing insights for future work.
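
The data flow described in the abstract (an encoder yields an object embedding in one forward pass, which then conditions a text-to-image model alongside the text) might be pictured with the toy sketch below. It is not the authors' implementation. The linear encoder, the choice to append the embedding as one extra conditioning token, the dimensions, and the names `encode_object` and `build_conditioning` are all assumptions made for illustration, and random numbers stand in for real image features and text embeddings.

```python
import random


def encode_object(image_features, weights):
    """Single feed-forward pass: one linear map from image features to an embedding.

    Hypothetical stand-in for an encoder; note there is no per-object optimization loop.
    """
    return [sum(w * x for w, x in zip(row, image_features)) for row in weights]


def build_conditioning(text_token_embeddings, object_embedding):
    """Append the object embedding as one extra conditioning token (an assumed design)."""
    return text_token_embeddings + [object_embedding]


rng = random.Random(0)
feature_dim, embed_dim, num_text_tokens = 8, 4, 3

# Placeholder inputs: random values stand in for real image features and text embeddings.
image_features = [rng.gauss(0, 1) for _ in range(feature_dim)]
encoder_weights = [[rng.gauss(0, 1) for _ in range(feature_dim)] for _ in range(embed_dim)]
text_tokens = [[rng.gauss(0, 1) for _ in range(embed_dim)] for _ in range(num_text_tokens)]

object_embedding = encode_object(image_features, encoder_weights)
conditioning = build_conditioning(text_tokens, object_embedding)

# Sequence length grows by one token; the embedding width is unchanged.
print(len(conditioning), len(conditioning[0]))  # prints: 4 4
```

In this sketch the conditioning sequence would be handed to a generator in place of the text-only sequence. The training scheme, identity preservation loss, and caption generation mentioned in the abstract are not modeled here.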

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2304.02642 [cs.CV]
	(or arXiv:2304.02642v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2304.02642

Submission history

From: Kelvin C.K. Chan
[v1] Wed, 5 Apr 2023 17:59:32 UTC (6,646 KB)
