
A Survey on Personalized Content Synthesis with Diffusion Models

  • Review
  • Open access
  • Published: 27 September 2025
  • Volume 22, pages 817–848 (2025)

Machine Intelligence Research
  • Xulu Zhang (ORCID: orcid.org/0000-0003-2473-460X)
  • Xiaoyong Wei (ORCID: orcid.org/0000-0002-5706-5177)
  • Wentao Hu (ORCID: orcid.org/0000-0002-2071-9341)
  • Jinlin Wu (ORCID: orcid.org/0000-0001-7877-5728)
  • Jiaxin Wu (ORCID: orcid.org/0000-0003-4074-3442)
  • Wengyu Zhang (ORCID: orcid.org/0009-0001-2347-4183)
  • Zhaoxiang Zhang (ORCID: orcid.org/0000-0003-2648-3875)
  • Zhen Lei (ORCID: orcid.org/0000-0002-0791-189X)
  • Qing Li (ORCID: orcid.org/0000-0003-3370-471X)

Abstract

Recent advancements in diffusion models have significantly impacted content creation, leading to the emergence of personalized content synthesis (PCS). Given a small set of user-provided examples featuring the same subject, PCS aims to render that subject in new images specified by user-defined prompts. Over the past two years, more than 150 methods have been introduced in this area. However, existing surveys primarily focus on text-to-image generation, and few provide up-to-date summaries of PCS. This paper offers a comprehensive survey of PCS, introducing the general frameworks of PCS research, which can be categorized into test-time fine-tuning (TTF) and pre-trained adaptation (PTA) approaches. We analyze the strengths, limitations, and key techniques of these methodologies. Additionally, we explore specialized tasks within the field, such as object, face, and style personalization, while highlighting their unique challenges and innovations. Despite the promising progress, we also discuss ongoing challenges, including overfitting and the trade-off between subject fidelity and text alignment. Through this detailed overview and analysis, we propose future directions to further the development of PCS.
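The test-time fine-tuning (TTF) route the abstract names can be illustrated in miniature: the generative model stays frozen and only a small concept embedding is optimized on the handful of reference images. The sketch below is a hypothetical toy, not the survey's method — the linear "generator", dimensions, and learning rate all stand in for a real diffusion model purely for illustration.

```python
import numpy as np

# Toy sketch of test-time fine-tuning (TTF) personalization: a frozen model
# plus a small, learnable concept embedding fitted to a few reference images.
# The linear "generator" below is a hypothetical stand-in for a diffusion model.

rng = np.random.default_rng(0)
D_EMB, D_IMG, N_REFS = 8, 32, 4

W = rng.normal(size=(D_IMG, D_EMB))  # frozen "generator" weights (never updated)

def generate(embedding):
    """Frozen model: maps a concept embedding to a flat 'image'."""
    return W @ embedding

# A handful of user-provided references: noisy renderings of an unknown concept.
true_concept = rng.normal(size=D_EMB)
references = [generate(true_concept) + 0.01 * rng.normal(size=D_IMG)
              for _ in range(N_REFS)]

def loss(e):
    # Mean squared reconstruction error over the reference set.
    return np.mean([np.sum((generate(e) - r) ** 2) for r in references])

# Optimize ONLY the embedding; the model weights W stay fixed.
embedding = np.zeros(D_EMB)
lr = 0.005
initial = loss(embedding)
for _ in range(200):
    grad = np.mean([2 * W.T @ (generate(embedding) - r) for r in references],
                   axis=0)
    embedding -= lr * grad
final = loss(embedding)

print(f"loss: {initial:.3f} -> {final:.5f}")
```

A pre-trained adaptation (PTA) method would instead train an encoder offline to predict such an embedding (or other conditioning signals) from the reference images in a single forward pass, trading per-subject optimization time for a large pre-training cost.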


References

  1. T. Wu, S. He, J. Liu, S. Sun, K. Liu, Q. L. Han, Y. Tang. A brief overview of ChatGPT: The history, status quo and potential future development. IEEE/CAA Journal of Automatica Sinica, vol. 10, no. 5, pp. 1122–1136, 2023. DOI: https://doi.org/10.1109/JAS.2023.123618.

    Article  Google Scholar 

  2. F. A. Croitoru, V. Hondru, R. T. Ionescu, M. Shah. Diffusion models in vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 9, pp. 10850–10869, 2023. DOI: https://doi.org/10.1109/TPAMI.2023.3261988.

    Article  Google Scholar 

  3. V. Uc-Cetina, N. Navarro-Guerrero, A. Martin-Gonzalez, C. Weber, S. Wermter. Survey on reinforcement learning for language processing. Artificial Intelligence Review, vol. 56, no. 2, pp. 1543–1575, 2023. DOI: https://doi.org/10.1007/s10462-022-10205-5.

    Article  Google Scholar 

  4. N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, K. Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, pp. 22500–22510, 2023. DOI: https://doi.org/10.1109/CVPR52729.2023.02155.

    Google Scholar 

  5. G. Xiao, T. Yin, W. T. Freeman, F. Durand, S. Han. FastComposer: Tuning-free multi-subject image generation with localized attention. International Journal of Computer Vision, vol. 133, no. 3, pp. 1175–1194, 2025. DOI: https://doi.org/10.1007/s11263-024-02227-z.

    Article  Google Scholar 

  6. Q. Wang, X. Bai, H. Wang, Z. Qin, A. Chen, H. Li, X. Tang, Y. Hu. InstantID: Zero-shot identity-preserving generation in seconds, [Online], Available: https://arxiv.org/abs/2401.07519, 2024.

    Google Scholar 

  7. R. Gal, Y. Alaluf, Y. Atzmon, O. Patashnik, A. H. Bermano, G. Chechik, D. Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. In Proceedings of the 11th International Conference on Learning Representations, Kigali, Rwanda, 2023.

    Google Scholar 

  8. N. Kumari, B. Zhang, R. Zhang, E. Shechtman, J. Y. Zhu. Multi-concept customization of text-to-image diffusion. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, pp. 1931–1941, 2023. DOI: https://doi.org/10.1109/CVPR52729.2023.00192.

    Google Scholar 

  9. Y. Wei, Y. Zhang, Z. Ji, J. Bai, L. Zhang, W. Zuo. ELITE: Encoding visual concepts into textual embeddings for customized text-to-image generation. In Proceedings of IEEE/CVF International Conference on Computer Vision, Paris, France, pp. 15897–15907, 2023. DOI: https://doi.org/10.1109/ICCV51070.2023.01461.

    Google Scholar 

  10. Z. Liu, R. Feng, K. Zhu, Y. Zhang, K. Zheng, Y. Liu, D. Zhao, J. Zhou, Y. Cao. Cones: Concept neurons in diffusion models for customized generation. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, USA, Article number 890, 2023.

    Google Scholar 

  11. A. Voynov, Q. Chu, D. Cohen-Or, K. Aberman. P+: Extended textual conditioning in text-to-image generation, [Online], Available: https://arxiv.org/abs/2303.09522, 2023.

    Google Scholar 

  12. L. Han, Y. Li, H. Zhang, P. Milanfar, D. Metaxas, F. Yang. SVDiff: Compact parameter space for diffusion fine-tuning. In Proceedings of IEEE/CVF International Conference on Computer Vision, Paris, France, pp. 7289–7300, 2023. DOI: https://doi.org/10.1109/ICCV51070.2023.00673.

    Google Scholar 

  13. W. Chen, H. Hu, Y. Li, N. Ruiz, X. Jia, M. W. Chang, W. W. Cohen. Subject-driven text-to-image generation via apprenticeship learning. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, USA, pp. 30286–30305, 2023.

    Google Scholar 

  14. J. Shi, W. Xiong, Z. Lin, H. J. Jung. InstantBooth: Personalized text-to-image generation without test-time finetuning. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 8543–8552, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00816.

    Google Scholar 

  15. Y. Tewel, R. Gal, G. Chechik, Y. Atzmon. Key-locked rank one editing for text-to-image personalization. In Proceedings of ACM SIGGRAPH Conference, Los Angeles, USA, Article number 12, 2023. DOI: https://doi.org/10.1145/3588432.3591506.

    Google Scholar 

  16. H. Chen, Y. Zhang, S. Wu, X. Wang, X. Duan, Y. Zhou, W. Zhu. DisenBooth: Identity-preserving disentangled tuning for subject-driven text-to-image generation. In Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, 2024.

    Google Scholar 

  17. D. Li, J. Li, S. C. H. Hoi. BLIP-diffusion: Pre-trained subject representation for controllable text-to-image generation and editing. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, USA, pp. 30146–30166, 2023.

    Google Scholar 

  18. Y. Alaluf, E. Richardson, G. Metzer, D. Cohen-Or. A neural space-time representation for text-to-image personalization. ACM Transactions on Graphics, vol. 42, no. 6, Article number 243, 2023. DOI: https://doi.org/10.1145/3618322.

  19. Y. Zhang, W. Dong, F. Tang, N. Huang, H. Huang, C. Ma, T. Y. Lee, O. Deussen, C. Xu. ProSpect: Prompt spectrum for attribute-aware personalization of diffusion models. ACM Transactions on Graphics, vol. 42, no. 6, Article number 244, 2023. DOI: https://doi.org/10.1145/3618342.

  20. Y. Gu, X. Wang, J. Z. Wu, Y. Shi, Y. Chen, Z. Fan, W. Xiao, R. Zhao, S. Chang, W. Wu, Y. Ge, Y. Shan, M. Z. Shou. Mix-of-show: Decentralized low-rank adaptation for multi-concept customization of diffusion models. In Proceedings of the 37th International Conference on Neural Information Processing System, New Orleans, USA, Article number 699, 2023.

    Google Scholar 

  21. Z. Liu, Y. Zhang, Y. Shen, K. Zheng, K. Zhu, R. Feng, Y. Liu, D. Zhao, J. Zhou, Y. Cao. Cones 2: Customizable image synthesis with multiple subjects. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, USA, Article number 2508, 2023.

    Google Scholar 

  22. K. Sohn, N. Ruiz, K. Lee, D. C. Chin, I. Blok, H. Chang, J. Barber, L. Jiang, G. Entis, Y. Li, Y. Hao, I. Essa, M. Rubinstein, D. Krishnan. StyleDrop: Text-to-image generation in any style. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, USA, Article number 2920, 2023.

    Google Scholar 

  23. G. Yuan, X. Cun, Y. Zhang, M. Li, C. Qi, X. Wang, Y. Shan, H. Zheng. Inserting anybody in diffusion models via celeb basis. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, USA, Article number 3190, 2023.

    Google Scholar 

  24. D. Valevski, D. Lumen, Y. Matias, Y. Leviathan. Face0: Instantaneously conditioning a text-to-image model on a face. In Proceedings of SIGGRAPH Asia Conference Papers, Sydney, Australia, Article number 94, 2023. DOI: https://doi.org/10.1145/3610548.3618249.

    Google Scholar 

  25. M. Arar, R. Gal, Y. Atzmon, G. Chechik, D. Cohen-Or, A. Shamir, A. H. Bermano. Domain-agnostic tuning-encoder for fast personalization of text-to-image models. In Proceedings of SIGGRAPH Asia Conference Papers, Sydney, Australia, Article number 72, 2023. DOI: https://doi.org/10.1145/3610548.3618173.

    Google Scholar 

  26. N. Ruiz, Y. Li, V. Jampani, W. Wei, T. Hou, Y. Pritch, N. Wadhwa, M. Rubinstein, K. Aberman. Hyper Dream-Booth: HyperNetworks for fast personalization of text-to-image models. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 6527–6536, 2024. DOI: https://doi.org/10.1109/CV-PR52733.2024.00624.

    Google Scholar 

  27. J. Ma, J. Liang, C. Chen, H. Lu. Subject-diffusion: Open domain personalized text-to-image generation without test-time fine-tuning. In Proceedings of ACM SIGGRAPH Conference Papers, Denver, USA, Article number 25, 2024. DOI: https://doi.org/10.1145/3641519.3657469.

    Google Scholar 

  28. H. Ye, J. Zhang, S. Liu, X. Han, W. Yang. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models, [Online], Available: https://arxiv.org/abs/2308.06721, 2023.

    Google Scholar 

  29. Z. Wang, X. Wang, L. Xie, Z. Qi, Y. Shan, W. Wang, P. Luo. StyleAdapter: A unified stylized image generation model. International Journal of Computer Vision, vol. 133, no. 4, pp. 1894–1911, 2025. DOI: https://doi.org/10.1007/s11263-024-02253-x.

    Article  Google Scholar 

  30. X. Pan, L. Dong, S. Huang, Z. Peng, W. Chen, F. Wei. Kosmos-G: Generating images in context with multimodal large language models. In Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, 2024.

    Google Scholar 

  31. S. Motamed, D. P. Paudel, L. Van Gool. LEGO: Learning to disentangle and invert personalized concepts beyond object appearance in text-to-image diffusion models. In Proceedings of the 18th European Computer Vision Association, Milan, Italy, pp. 116–133, 2025. DOI: https://doi.org/10.1007/978-3-031-72633-0_7.

    Google Scholar 

  32. Y. Yan, C. Zhang, R. Wang, Y. Zhou, G. Zhang, P. Cheng, G. Yu, B. Fu. FaceStudio: Put your face everywhere in seconds, [Online], Available: https://arxiv.org/abs/2312.02663, 2023.

    Google Scholar 

  33. Z. Li, M. Cao, X. Wang, Z. Qi, M. M. Cheng, Y. Shan. PhotoMaker: Customizing realistic human photos via stacked ID embedding. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 8640–8650, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00825.

    Google Scholar 

  34. X. Peng, J. Zhu, B. Jiang, Y. Tai, D. Luo, J. Zhang, W. Lin, T. Jin, C. Wang, R. Ji. PortraitBooth: A versatile portrait model for fast identity-preserved personalization. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 27070–27080, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.02557.

    Google Scholar 

  35. X. Zhang, X. Y. Wei, J. Wu, T. Zhang, Z. Zhang, Z. Lei, Q. Li. Compositional inversion for stable diffusion models. In Proceedings of the 38th AAAI Conference on Artificial Intelligence, Vancouver, Canada, pp. 7350–7358, 2024. DOI: https://doi.org/10.1609/aaai.v38i7.28565.

    Google Scholar 

  36. Q. Sun, Y. Cui, X. Zhang, F. Zhang, Q. Yu, Y. Wang, Y. Rao, J. Liu, T. Huang, X. Wang. Generative multimodal models are in-context learners. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 14398–14409, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.01365.

    Google Scholar 

  37. L. Pang, J. Yin, H. Xie, Q. Wang, Q. Li, X. Mao. Cross initialization for personalized text-to-image generation, [Online], Available: https://arxiv.org/abs/2312.15905, 2023.

    Google Scholar 

  38. Z. Kong, Y. Zhang, T. Yang, T. Wang, K. Zhang, B. Wu, G. Chen, W. Liu, W. Luo. OMG: Occlusion-friendly personalized multi-concept generation in diffusion models. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 253–270, 2025. DOI: https://doi.org/10.1007/978-3-031-72751-1_15.

    Google Scholar 

  39. X. Zhang, W. Zhang, X. Wei, J. Wu, Z. Zhang, Z. Lei, Q. Li. Generative active learning for image synthesis personalization. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, Australia, pp. 10669–10677, 2024. DOI: https://doi.org/10.1145/3664647.3680773.

    Chapter  Google Scholar 

  40. Y. Zhang, Y. Song, J. Liu, R. Wang, J. Yu, H. Tang, H. Li, X. Tang, Y. Hu, H. Pan, Z. Jiang. SSR-encoder: Encoding selective subject representation for subject-driven generation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 8069–8078, 2024. DOI: https://doi.org/10.1109/CV-PR52733.2024.00771.

    Google Scholar 

  41. G. Zhang, K. Sohn, M. Hahn, H. Shi, I. Essa. FineStyle: Fine-grained controllable style personalization for text-to-image models. In Proceedings of the 38th International Conference on Neural Information Processing Systems, Vancouver, Canada, pp. 52937–52961, 2024.

    Google Scholar 

  42. Z. Dong, P. Wei, L. Lin. DreamArtist++: Controllable one-shot text-to-image generation via positive-negative adapter, [Online], Available: https://arxiv.org/abs/2211.11337, 2025.

    Google Scholar 

  43. A. Voronov, M. Khoroshikh, A. Babenko, M. Ryabinin. Is this loss informative? Faster text-to-image customization by tracking objective dynamics. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, USA, Article number 1630, 2023.

    Google Scholar 

  44. I. Han, S. Yang, T. Kwon, J. C. Ye. Highly personalized text embedding for image manipulation by stable diffusion, [Online], Available: https://arxiv.org/abs/2303.08767, 2023.

    Google Scholar 

  45. C. Xiang, F. Bao, C. Li, H. Su, J. Zhu. A closer look at parameter-efficient tuning in diffusion models, [Online], Available: https://arxiv.org/abs/2303.18181, 2023.

    Google Scholar 

  46. X. Jia, Y. Zhao, K. C. K. Chan, Y. Li, H. Zhang, B. Gong, T. Hou, H. Wang, Y. C. Su. Taming encoder for zero fine-tuning image customization with text-to-image diffusion models, [Online], Available: https://arxiv.org/abs/2304.02642, 2023.

    Google Scholar 

  47. J. Yang, H. Wang, Y. Zhang, R. Xiao, S. Wu, G. Chen, J. Zhao. Controllable textual inversion for personalized text-to-image generation, [Online], Available: https://arxiv.org/abs/2304.05265, 2023.

    Google Scholar 

  48. Z. Fei, M. Fan, J. Huang. Gradient-free textual inversion. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, Canada, pp. 1364–1373, 2023. DOI: https://doi.org/10.1145/3581783.3612599.

    Chapter  Google Scholar 

  49. R. Zhang, Z. Jiang, Z. Guo, S. Yan, J. Pan, H. Dong, Y. Qiao, P. Gao, H. Li. Personalize segment anything model with one shot. In Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, 2024.

    Google Scholar 

  50. O. Avrahami, K. Aberman, O. Fried, D. Cohen-Or, D. Lischinski. Break-a-scene: Extracting multiple concepts from a single image. In Proceedings of SIGGRAPH Asia Conference Papers, Sydney, Australia, Article number 96, 2023. DOI: https://doi.org/10.1145/3610548.3618154.

    Google Scholar 

  51. J. Xiao, M. Yin, Y. Gong, X. Zang, J. Ren, B. Yuan. COMCAT: Towards efficient compression and customization of attention-based vision models. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, USA, Article number 1587, 2023.

    Google Scholar 

  52. S. Hao, K. Han, S. Zhao, K. Y. K. Wong. ViCo: Plug-and-play visual condition for personalized text-to-image generation, [Online], Available: https://arxiv.org/abs/2306.00971, 2023.

    Google Scholar 

  53. Z. Qiu, W. Liu, H. Feng, Y. Xue, Y. Feng, Z. Liu, D. Zhang, A. Weller, B. Schölkopf. Controlling text-to-image diffusion by orthogonal finetuning. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, USA, pp. 79320–79362, 2023.

    Google Scholar 

  54. S. Y. Yeh, Y. G. Hsieh, Z. Gao, B. B. W. Yang, G. Oh, Y. Gong. Navigating text-to-image customization: From LyCORIS fine-tuning to model evaluation, [Online], Available: https://arxiv.org/abs/2309.14859, 2024.

    Google Scholar 

  55. X. He, Z. Cao, N. Kolkin, L. Yu, K. Wan, H. Rhodin, R. Kalarot. A data perspective on enhanced identity preservation for diffusion personalization. In Proceedings of IEEE/CVF Winter Conference on Applications of Computer Vision, Tucson, USA, pp. 3782–3791, 2025. DOI: https://doi.org/10.1109/WACV61041.2025.00372.

    Google Scholar 

  56. A. Roy, M. Suin, A. Shah, K. Shah, J. Liu, R. Chellappa. DIFFNAT: Improving diffusion image quality using natural image statistics, [Online], Available: https://arxiv.org/abs/2311.09753, 2023.

    Google Scholar 

  57. A. Agarwal, S. Karanam, T. Shukla, B. V. Srinivasan. An image is worth multiple words: Multi-attribute inversion for constrained text-to-image synthesis. In Proceedings of IEEE/CVF Winter Conference on Applications of Computer Vision, Tucson, USA, pp. 6053–6062, 2025. DOI: https://doi.org/10.1109/WACV61041.2025.00590.

    Google Scholar 

  58. R. Zhao, M. Zhu, S. Dong, D. Cheng, N. Wang, X. Gao. CatVersion: Concatenating embeddings for diffusion-based text-to-image personalization. IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 6, pp. 6047–6058, 2025. DOI: https://doi.org/10.1109/TC-SVT.2025.3531917.

    Article  Google Scholar 

  59. M. Safaee, A. Mikaeili, O. Patashnik, D. Cohen-Or, A. Mahdavi-Amiri. CLiC: Concept learning in context. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 6924–6933, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00661.

    Google Scholar 

  60. Z. Wang, W. Wei, Y. Zhao, Z. Xiao, M. Hasegawa-Johnson, H. Shi, T. Hou. HiFi tuner: High-fidelity subject-driven fine-tuning for diffusion models, [Online], Available: https://arxiv.org/abs/2312.00079, 2023.

    Google Scholar 

  61. D. Chae, N. Park, J. Kim, K. Lee. InstructBooth: Instruction-following personalized text-to-image generation, [Online], Available: https://arxiv.org/abs/2312.03011, 2024.

    Google Scholar 

  62. Y. Cai, Y. Wei, Z. Ji, J. Bai, H. Han, W. Zuo. Decoupled textual embeddings for customized image generation. In Proceedings of the 38th AAAI Conference on Artificial Intelligence, Vancouver, Canada, pp. 909–917, 2024. DOI: https://doi.org/10.1609/aaai.v38i2.27850.

    Google Scholar 

  63. B. N. Zhao, Y. Xiao, J. Xu, X. Jiang, Y. Yang, D. Li, L. Itti, V. Vineet, Y. Ge. DreamDistribution: Learning prompt distribution for diverse in-distribution generation, [Online], Available: https://arxiv.org/abs/2312.14216, 2025.

    Google Scholar 

  64. M. Hua, J. Liu, F. Ding, W. Liu, J. Wu, Q. He. Dream-Tuner: Single image is enough for subject-driven generation, [Online], Available: https://arxiv.org/abs/2312.13691, 2023.

    Google Scholar 

  65. J. Lu, C. Xie, H. Guo. Object-driven one-shot fine-tuning of text-to-image diffusion with prototypical embedding, [Online], Available: https://arxiv.org/abs/2401.15708, 2024.

    Google Scholar 

  66. W. Zeng, Y. Yan, Q. Zhu, Z. Chen, P. Chu, W. Zhao, X. Yang. Infusion: Preventing customized text-to-image diffusion from overfitting. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, Australia, pp. 3568–3577, 2024. DOI: https://doi.org/10.1145/3664647.3680894.

    Chapter  Google Scholar 

  67. M. Jones, S. Y. Wang, N. Kumari, D. Bau, J. Y. Zhu. Customizing text-to-image models with a single image pair, [Online], Available: https://arxiv.org/abs/2405.01536, 2024.

    Book  Google Scholar 

  68. H. Chen, Y. Zhang, X. Wang, X. Duan, Y. Zhou, W. Zhu. DisenDreamer: Subject-driven text-to-image generation with sample-aware disentangled tuning. IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 8, pp. 6860–6873, 2024. DOI: https://doi.org/10.1109/TCSVT.2024.3369757.

    Article  Google Scholar 

  69. Z. Fan, Z. Yin, G. Li, Y. Zhan, H. Zheng. Dream-Booth++: Boosting subject-driven generation via region-level references packing. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, Australia, pp. 11013–11021, 2024. DOI: https://doi.org/10.1145/3664647.3680734.

    Chapter  Google Scholar 

  70. F. Wu, Y. Pang, J. Zhang, L. Pang, J. Yin, B. Zhao, Q. Li, X. Mao. CoRe: Context-regularized text embedding learning for text-to-image personalization, [Online], Available: https://arxiv.org/abs/2408.15914, 2024.

    Google Scholar 

  71. J. Jin, Y. Shen, Z. Fu, J. Yang. Customized generation reimagined: Fidelity and editability harmonized. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 410–426, 2025. DOI: https://doi.org/10.1007/978-3-031-72973-7_24.

    Google Scholar 

  72. W. Chen, H. Hu, C. Saharia, W. W. Cohen. Re-imagen: Retrieval-augmented text-to-image generator. In Proceedings of the 11th International Conference on Learning Representations, Kigali, Rwanda, 2023.

    Google Scholar 

  73. X. Xu, Z. Wang, E. Zhang, K. Wang, H. Shi. Versatile diffusion: Text, images and variations all in one diffusion model. In Proceedings of IEEE/CVF International Conference on Computer Vision, Paris, France, pp. 7720–7731, 2023. DOI: https://doi.org/10.1109/ICCV51070.2023.00713.

    Google Scholar 

  74. R. Gal, M. Arar, Y. Atzmon, A. H. Bermano, G. Chechik, D. Cohen-Or. Encoder-based domain tuning for fast personalization of text-to-image models. ACM Transactions on Graphics, vol. 42, no. 4, Article number 150, 2023. DOI: https://doi.org/10.1145/3592133.

  75. Y. Ma, H. Yang, W. Wang, J. Fu, J. Liu. Unified multi-modal latent diffusion for joint subject and text conditional image generation, [Online], Available: https://arxiv.org/abs/2303.09319, 2023.

    Google Scholar 

  76. M. Kim, J. Yoo, S. Kwon. Personalized text-to-image model enhancement strategies: SOD preprocessing and CNN local feature integration. Electronics, vol. 12, no. 22, Article number 4707, 2023. DOI: https://doi.org/10.3390/electronics12224707.

  77. Y. Zhou, R. Zhang, J. Gu, T. Sun. Customization assistant for text-to-image generation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 9182–9191, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00877.

    Google Scholar 

  78. J. Pan, H. Yan, J. H. Liew, J. Feng, V. Y. F. Tan. Towards accurate guided diffusion sampling through symplectic adjoint method, [Online], Available: https://arxiv.org/abs/2312.12030, 2023.

    Google Scholar 

  79. S. Purushwalkam, A. Gokul, S. Joty, N. Naik. Boot-PIG: Bootstrapping zero-shot personalized image generation capabilities in pretrained diffusion models. In Proceedings of Computer Vision, Milan, Italy, pp. 252–269, 2025. DOI: https://doi.org/10.1007/978-3-031-91907-7_15.

    Google Scholar 

  80. N. Chen, M. Huang, Z. Chen, Y. Zheng, L. Zhang, Z. Mao. CustomContrast: A multilevel contrastive perspective for subject-driven text-to-image customization. In Proceedings of the 39th AAAI Conference on Artificial Intelligence, Philadelphia, USA, pp. 2123–2131, 2025. DOI: https://doi.org/10.1609/aaai.v39i2.32210.

    Google Scholar 

  81. K. Song, Y. Zhu, B. Liu, Q. Yan, A. Elgammal, X. Yang. MoMA: Multimodal LLM adapter for fast personalized image generation. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 117–132, 2025. DOI: https://doi.org/10.1007/978-3-031-73661-2_7.

    Google Scholar 

  82. C. Shin, J. Choi, H. Kim, S. Yoon. Large-scale text-to-image model with inpainting is a zero-shot subject-driven image generator, [Online], Available: https://arxiv.org/abs/2411.15466, 2024.

    Google Scholar 

  83. J. Park, B. Ko, H. Jang. StyleBoost: A study of personalizing text-to-image generation in any style using dreambooth. In Proceedings of the 14th International Conference on Information and Communication Technology Convergence, Jeju Island, Republic of Korea, pp. 93–98, 2023. DOI: https://doi.org/10.1109/ICTC58733.2023.10392676.

    Google Scholar 

  84. A. Hertz, A. Voynov, S. Fruchter, D. Cohen-Or. Style aligned image generation via shared attention. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 4775–4785, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00457.

    Google Scholar 

  85. J. Park, B. Ko, H. Jang. Text-to-image synthesis for any artistic styles: Advancements in personalized artistic image generation via subdivision and dual binding, [Online], Available: https://arxiv.org/html/2404.05256v1, 2024.

    Google Scholar 

  86. J. Choi, C. Shin, Y. Oh, H. Kim, J. Lee, S. Yoon. Style-friendly SNR sampler for style-driven generation, [Online], Available: https://arxiv.org/abs/2411.14793, 2024.

    Google Scholar 

  87. D. Y. Chen, H. Tennent, C. W. Hsu. ArtAdapter: Text-to-image style transfer using multi-level style encoder and explicit adaptation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 8619–8628, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00823

    Google Scholar 

  88. Y. Zhou, R. Zhang, T. Sun, J. Xu. Enhancing detail preservation for customized text-to-image generation: A regularization-free approach, [Online], Available: https://arxiv.org/abs/2305.13579, 2023.

    Google Scholar 

  89. S. Banerjee, G. Mittal, A. Joshi, C. Hegde, N. Memon. Identity-preserving aging of face images via latent diffusion models. In Proceedings of IEEE International Joint Conference on Biometrics, Ljubljana, Slovenia, 2023. DOI: https://doi.org/10.1109/IJCB57857.2023.10448860.

    Google Scholar 

  90. J. Hyung, J. Shin, J. Choo. MagiCapture: High-resolution multi-concept portrait customization, [Online], Available: https://arxiv.org/abs/2309.06895, 2024.

    Google Scholar 

  91. P. Cao, L. Yang, F. Zhou, T. Huang, Q. Song. Concept-centric personalization with large-scale diffusion priors, [Online], Available: https://arxiv.org/html/2312.08195v1, 2023.

    Google Scholar 

  92. C. Kim, J. Lee, S. Joung, B. Kim, Y. M. Baek. Instant-Family: Masked attention for zero-shot multi-ID image generation, [Online], Available: https://arxiv.org/abs/2404.19427, 2024.

    Google Scholar 

  93. X. Li, J. Zhan, S. He, Y. Xu, J. Dong, H. Zhang, Y. Du. PersonaMagic: Stage-regulated high-fidelity face customization with tandem equilibrium, [Online], Available: https://arxiv.org/abs/2412.15674, 2024.

    Google Scholar 

  94. Y. C. Su, K. C. K. Chan, Y. Li, Y. Zhao, H. Zhang, B. Gong, H. Wang, X. Jia. Identity encoder for personalized diffusion, [Online], Available: https://arxiv.org/abs/2304.07429, 2023.

    Google Scholar 

  95. Z. Chen, S. Fang, W. Liu, Q. He, M. Huang, Y. Zhang, Z. Mao. DreamIdentity: Improved editability for efficient face-identity preserved image generation, [Online], Available: https://arxiv.org/abs/2307.00300, 2023.

  96. Y. Wang, W. Zhang, J. Zheng, C. Jin. High-fidelity person-centric subject-to-image synthesis. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 7675–7684, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00733.

  97. X. Li, X. Hou, C. C. Loy. When StyleGAN meets stable diffusion: A W+ adapter for personalized image generation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 2187–2196, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00213.

  98. J. Liu, H. Huang, C. Jin, R. He. Portrait diffusion: Training-free face stylization with chain-of-painting, [Online], Available: https://arxiv.org/abs/2312.02212, 2023.

  99. H. Tang, J. Deng, Z. Pan, H. Tian, P. Chaudhari, X. Zhou. RetriBooru: Leakage-free retrieval of conditions from reference images for subject-driven generation, [Online], Available: https://arxiv.org/abs/2312.02521, 2024.

  100. J. Xu, S. Motamed, P. Vaddamanu, C. H. Wu, C. Haene, J. C. Bazin, F. De La Torre. Personalized face inpainting with diffusion models by parallel visual attention. In Proceedings of IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, USA, pp. 5420–5430, 2024. DOI: https://doi.org/10.1109/WACV57701.2024.00535.

  101. D. Y. Chen, A. K. Bhunia, S. Koley, A. Sain, P. N. Chowdhury, Y. Z. Song. DemoCaricature: Democratising caricature generation with a rough sketch. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 8629–8639, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00824.

  102. P. Achlioptas, A. Benetatos, I. Fostiropoulos, D. Skourtis. Stellar: Systematic evaluation of human-centric personalized text-to-image methods, [Online], Available: https://arxiv.org/abs/2312.06116, 2023.

  103. W. Chen, J. Zhang, J. Wu, H. Wu, X. Xiao, L. Lin. ID-aligner: Enhancing identity-preserving text-to-image generation with reward feedback learning, [Online], Available: https://arxiv.org/abs/2404.15449, 2024.

  104. K. C. Wang, D. Ostashev, Y. Fang, S. Tulyakov, K. Aberman. MoA: Mixture-of-attention for subject-context disentanglement in personalized image generation, [Online], Available: https://arxiv.org/html/2404.11565v1, 2024.

  105. S. Cui, J. Guo, X. An, J. Deng, Y. Zhao, X. Wei, Z. Feng. IDAdapter: Learning mixed features for tuning-free personalization of text-to-image models. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, USA, pp. 950–959, 2024. DOI: https://doi.org/10.1109/CVPRW63382.2024.00100.

  106. Y. Wu, Z. Li, H. Zheng, C. Wang, B. Li. Infinite-ID: Identity-preserved personalization via ID-semantics decoupling paradigm. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 279–296, 2025. DOI: https://doi.org/10.1007/978-3-031-73242-3_16.

  107. K. Shiohara, T. Yamasaki. Face2Diffusion for fast and editable face personalization. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 6850–6859, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00654.

  108. Y. Cai, Z. Jiang, Y. Liu, C. Jiang, W. Xue, W. Luo, Y. Guo. Foundation cures personalization: Recovering facial personalized models’ prompt consistency, [Online], Available: https://arxiv.org/abs/2411.15277v1, 2024.

  109. G. Qian, K. C. Wang, O. Patashnik, N. Heravi, D. Ostashev, S. Tulyakov, D. Cohen-Or, K. Aberman. Omni-ID: Holistic identity representation designed for generative tasks, [Online], Available: https://arxiv.org/html/2412.09694v1, 2024.

  110. Y. Li, H. Liu, Y. Wen, Y. J. Lee. Generate anything anywhere in any scene, [Online], Available: https://arxiv.org/abs/2306.17154, 2023.

  111. T. Rahman, S. Mahajan, H. Y. Lee, J. Ren, S. Tulyakov, L. Sigal. Visual concept-driven image generation with text-to-image diffusion model, [Online], Available: https://arxiv.org/abs/2402.11487, 2025.

  112. J. Jiang, Y. Zhang, K. Feng, X. Wu, W. Li, R. Pei, F. Li, W. Zuo. MC2: Multi-concept guidance for customized multi-concept generation, [Online], Available: https://arxiv.org/abs/2404.05268, 2024.

  113. C. Zhu, K. Li, Y. Ma, C. He, X. Li. MultiBooth: Towards generating all your concepts in an image from text. In Proceedings of the 39th AAAI Conference on Artificial Intelligence, Philadelphia, USA, pp. 10923–10931, 2025. DOI: https://doi.org/10.1609/aaai.v39i10.33187.

  114. H. Matsuda, R. Togo, K. Maeda, T. Ogawa, M. Haseyama. Multi-object editing in personalized text-to-image diffusion model via segmentation guidance. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Seoul, Republic of Korea, pp. 8140–8144, 2024. DOI: https://doi.org/10.1109/ICASSP48485.2024.10447048.

  115. D. Zhou, J. Huang, J. Bai, J. Wang, H. Chen, G. Chen, X. Hu, P. A. Heng. MagicTailor: Component-controllable personalization in text-to-image diffusion models, [Online], Available: https://arxiv.org/abs/2410.13370, 2024.

  116. X. Chen, L. Huang, Y. Liu, Y. Shen, D. Zhao, H. Zhao. AnyDoor: Zero-shot object-level image customization. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 6593–6602, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00630.

  117. Z. Yuan, M. Cao, X. Wang, Z. Qi, C. Yuan, Y. Shan. CustomNet: Zero-shot object customization with variable-viewpoints in text-to-image diffusion models, [Online], Available: https://arxiv.org/abs/2310.19784, 2023.

  118. D. Zhou, Y. Li, F. Ma, Z. Yang, Y. Yang. MIGC: Multi-instance generation controller for text-to-image synthesis. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 6818–6828, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00651.

  119. M. Patel, S. Jung, C. Baral, Y. Yang. λ-ECLIPSE: Multi-concept personalized text-to-image diffusion models by leveraging CLIP latent space, [Online], Available: https://arxiv.org/abs/2402.05195, 2024.

  120. X. Wang, S. Fu, Q. Huang, W. He, H. Jiang. MS-Diffusion: Multi-subject zero-shot image personalization with layout guidance. In Proceedings of the 13th International Conference on Learning Representations, Singapore, 2025.

  121. Z. Huang, T. Wu, Y. Jiang, K. C. K. Chan, Z. Liu. ReVersion: Diffusion-based relation inversion from images. In Proceedings of SIGGRAPH Asia Conference Papers, Tokyo, Japan, Article number 4, 2024. DOI: https://doi.org/10.1145/3680528.3687658.

  122. G. Zhang, Y. Qian, J. Deng, X. Cai. Inv-ReVersion: Enhanced relation inversion based on text-to-image diffusion models. Applied Sciences, vol. 14, no. 8, Article number 3338, 2024. DOI: https://doi.org/10.3390/app14083338.

  123. Z. Xu, S. Hao, K. Han. CusConcept: Customized visual concept decomposition with diffusion models. In Proceedings of IEEE/CVF Winter Conference on Applications of Computer Vision, Tucson, USA, pp. 3678–3687, 2025. DOI: https://doi.org/10.1109/WACV61041.2025.00362.

  124. S. Huang, B. Gong, Y. Feng, X. Chen, Y. Fu, Y. Liu, D. Wang. Learning disentangled identifiers for action-customized text-to-image generation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 7797–7806, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00745.

  125. J. Gu, Y. Wang, N. Zhao, T. J. Fu, W. Xiong, Q. Liu, Z. Zhang, H. Zhang, J. Zhang, H. Jung, X. E. Wang. Photoswap: Personalized subject swapping in images. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, USA, Article number 1529, 2023.

  126. S. Zhang, M. Ni, S. Chen, L. Wang, W. Ding, Y. Liu. A two-stage personalized virtual try-on framework with shape control and texture guidance. IEEE Transactions on Multimedia, vol. 26, pp. 10225–10236, 2024. DOI: https://doi.org/10.1109/TMM.2024.3405718.

  127. M. Chen, I. Laina, A. Vedaldi. Training-free layout control with cross-attention guidance. In Proceedings of IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, USA, pp. 5331–5341, 2024. DOI: https://doi.org/10.1109/WACV57701.2024.00526.

  128. J. Gu, N. Zhao, W. Xiong, Q. Liu, Z. Zhang, H. Zhang, J. Zhang, H. Jung, Y. Wang, X. E. Wang. SwapAnything: Enabling arbitrary object swapping in personalized visual editing, [Online], Available: https://arxiv.org/abs/2404.05717, 2024.

  129. N. Kumari, G. Su, R. Zhang, T. Park, E. Shechtman, J. Y. Zhu. Customizing text-to-image diffusion with object viewpoint control, [Online], Available: https://arxiv.org/abs/2404.12333, 2024.

  130. X. Xu, J. Guo, Z. Wang, G. Huang, I. Essa, H. Shi. Prompt-free diffusion: Taking “text” out of text-to-image diffusion models. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 8682–8692, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00829.

  131. S. Zhao, D. Chen, Y. C. Chen, J. Bao, S. Hao, L. Yuan, K. Y. K. Wong. Uni-ControlNet: All-in-one control to text-to-image diffusion models. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, USA, pp. 11127–11150, 2023.

  132. S. Y. Cheong, A. Mustafa, A. Gilbert. ViscoNet: Bridging and harmonizing visual and textual conditioning for ControlNet, [Online], Available: https://arxiv.org/abs/2312.03154, 2024.

  133. I. Najdenkoska, A. Sinha, A. Dubey, D. Mahajan, V. Ramanathan, F. Radenovic. Context diffusion: In-context aware image generation. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 375–391, 2024. DOI: https://doi.org/10.1007/978-3-031-72980-5_22.

  134. S. Mo, F. Mu, K. H. Lin, Y. Liu, B. Guan, Y. Li, B. Zhou. FreeControl: Training-free spatial control of any text-to-image diffusion model with any condition. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 7465–7475, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00713.

  135. P. Li, Q. Nie, Y. Chen, X. Jiang, K. Wu, Y. Lin, Y. Liu, J. Peng, C. Wang, F. Zheng. Tuning-free image customization with image and text guidance. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 233–250, 2025. DOI: https://doi.org/10.1007/978-3-031-73116-7_14.

  136. J. Z. Wu, Y. Ge, X. Wang, S. W. Lei, Y. Gu, Y. Shi, W. Hsu, Y. Shan, X. Qie, M. Z. Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of IEEE/CVF International Conference on Computer Vision, Paris, France, pp. 7623–7633, 2023. DOI: https://doi.org/10.1109/ICCV51070.2023.00701.

  137. P. Esser, J. Chiu, P. Atighehchian, J. Granskog, A. Germanidis. Structure and content-guided video synthesis with diffusion models. In Proceedings of IEEE/CVF International Conference on Computer Vision, Paris, France, pp. 7346–7356, 2023. DOI: https://doi.org/10.1109/ICCV51070.2023.00675.

  138. Y. Zhao, E. Xie, L. Hong, Z. Li, G. H. Lee. Make-A-Protagonist: Generic video editing with an ensemble of experts, [Online], Available: https://ar5iv.labs.arxiv.org/html/2305.08850, 2023.

  139. Y. He, M. Xia, H. Chen, X. Cun, Y. Gong, J. Xing, Y. Zhang, X. Wang, C. Weng, Y. Shan, Q. Chen. Animate-a-story: Storytelling with retrieval-augmented video generation, [Online], Available: https://arxiv.org/abs/2307.06940, 2023.

  140. R. Zhao, Y. Gu, J. Z. Wu, D. J. Zhang, J. W. Liu, W. Wu, J. Keppo, M. Z. Shou. MotionDirector: Motion customization of text-to-video diffusion models. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 273–290, 2025. DOI: https://doi.org/10.1007/978-3-031-72992-8_16.

  141. R. Wu, L. Chen, T. Yang, C. Guo, C. Li, X. Zhang. LAMP: Learn a motion pattern for few-shot video generation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 7089–7098, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00677.

  142. H. Jeong, G. Y. Park, J. C. Ye. VMC: Video motion customization using temporal attention adaption for text-to-video diffusion models. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 9212–9221, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00880.

  143. Y. Song, W. Shin, J. Lee, J. Kim, N. Kwak. SAVE: Protagonist diversification with structure agnostic video editing. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 41–57, 2025. DOI: https://doi.org/10.1007/978-3-031-72989-8_3.

  144. J. Materzynska, J. Sivic, E. Shechtman, A. Torralba, R. Zhang, B. Russell. NewMove: Customizing text-to-video models with novel motions, [Online], Available: https://arxiv.org/abs/2312.04966, 2023.

  145. Y. Wei, S. Zhang, Z. Qing, H. Yuan, Z. Liu, Y. Liu, Y. Zhang, J. Zhou, H. Shan. DreamVideo: Composing your dream videos with customized subject and motion, [Online], Available: https://arxiv.org/abs/2312.04433, 2023.

  146. Y. Zhang, F. Tang, N. Huang, H. Huang, C. Ma, W. Dong, C. Xu. MotionCrafter: One-shot motion customization of diffusion models, [Online], Available: https://arxiv.org/abs/2312.05288, 2024.

  147. Y. Ren, Y. Zhou, J. Yang, J. Shi, D. Liu, F. Liu, M. Kwon, A. Shrivastava. Customize-a-video: One-shot motion customization of text-to-video diffusion models. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 332–349, 2025. DOI: https://doi.org/10.1007/978-3-031-73024-5_20.

  148. Z. Ma, D. Zhou, C. H. Yeh, X. S. Wang, X. Li, H. Yang, Z. Dong, K. Keutzer, J. Feng. Magic-Me: Identity-specific video customized diffusion, [Online], Available: https://arxiv.org/abs/2402.09368, 2024.

  149. X. Bi, J. Lu, B. Liu, X. Cun, Y. Zhang, W. Li, B. Xiao. CustomTTT: Motion and appearance customized video generation via test-time training. In Proceedings of the 39th AAAI Conference on Artificial Intelligence, Philadelphia, USA, pp. 1871–1879, 2025. DOI: https://doi.org/10.1609/aaai.v39i2.32182.

  150. H. Li, H. Qiu, S. Zhang, X. Wang, Y. Wei, Z. Li, Y. Zhang, B. Wu, D. Cai. PersonalVideo: High ID-fidelity video customization without dynamic and semantic degradation, [Online], Available: https://arxiv.org/abs/2411.17048, 2024.

  151. Z. Wang, A. Li, L. Zhu, Y. Guo, Q. Dou, Z. Li. CustomVideo: Customizing text-to-video generation with multiple subjects, [Online], Available: https://arxiv.org/abs/2401.09962, 2024.

  152. H. Chen, X. Wang, G. Zeng, Y. Zhang, Y. Zhou, F. Han, Y. Wu, W. Zhu. VideoDreamer: Customized multi-subject text-to-video generation with disen-mix finetuning on language-video foundation models, [Online], Available: https://arxiv.org/abs/2311.00990, 2023.

  153. H. Zhao, T. Lu, J. Gu, X. Zhang, Q. Zheng, Z. Wu, H. Xu, Y. G. Jiang. MagDiff: Multi-alignment diffusion for high-fidelity video generation and editing. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 205–221, 2025. DOI: https://doi.org/10.1007/978-3-031-72649-1_12.

  154. Y. Jiang, T. Wu, S. Yang, C. Si, D. Lin, Y. Qiao, C. C. Loy, Z. Liu. VideoBooth: Diffusion-based video generation with image prompts. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 6689–6700, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00639.

  155. T. Wu, Y. Zhang, X. Cun, Z. Qi, J. Pu, H. Dou, G. Zheng, Y. Shan, X. Li. VideoMaker: Zero-shot customized video generation with the inherent force of video diffusion models, [Online], Available: https://arxiv.org/abs/2412.19645, 2024.

  156. Y. Zhou, R. Zhang, J. Gu, N. Zhao, J. Shi, T. Sun. SUGAR: Subject-driven video customization in a zero-shot manner, [Online], Available: https://arxiv.org/abs/2412.10533, 2024.

  157. G. Liu, M. Xia, Y. Zhang, H. Chen, J. Xing, Y. Wang, X. Wang, Y. Shan, Y. Yang. StyleCrafter: Taming artistic video diffusion with reference-augmented adapter learning. ACM Transactions on Graphics, vol. 43, no. 6, Article number 251, 2024. DOI: https://doi.org/10.1145/3687975.

  158. X. He, Q. Liu, S. Qian, X. Wang, T. Hu, K. Cao, K. Yan, J. Zhang. ID-animator: Zero-shot identity-preserving human video generation, [Online], Available: https://arxiv.org/abs/2404.15275, 2024.

  159. S. Yuan, J. Huang, X. He, Y. Ge, Y. Shi, L. Chen, J. Luo, L. Yuan. Identity-preserving text-to-video generation by frequency decomposition, [Online], Available: https://arxiv.org/abs/2411.17440, 2024.

  160. H. Fang, D. Qiu, B. Mao, P. Yan, H. Tang. MotionCharacter: Identity-preserving and motion controllable human video generation, [Online], Available: https://arxiv.org/abs/2411.18281, 2024.

  161. M. Feng, J. Liu, K. Yu, Y. Yao, Z. Hui, X. Guo, X. Lin, H. Xue, C. Shi, X. Li, A. Li, X. Kang, B. Lei, M. Cui, P. Ren, X. Xie. DreaMoving: A human video generation framework based on diffusion models, [Online], Available: https://arxiv.org/abs/2312.05107, 2023.

  162. C. H. Lin, J. Gao, L. Tang, T. Takikawa, X. Zeng, X. Huang, K. Kreis, S. Fidler, M. Y. Liu, T. Y. Lin. Magic3D: High-resolution text-to-3D content creation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, pp. 300–309, 2023. DOI: https://doi.org/10.1109/CVPR52729.2023.00037.

  163. A. Raj, S. Kaza, B. Poole, M. Niemeyer, N. Ruiz, B. Mildenhall, S. Zada, K. Aberman, M. Rubinstein, J. Barron, Y. Li, V. Jampani. DreamBooth3D: Subject-driven text-to-3D generation. In Proceedings of IEEE/CVF International Conference on Computer Vision, Paris, France, pp. 2349–2359, 2023. DOI: https://doi.org/10.1109/ICCV51070.2023.00223.

  164. S. Azadi, T. Hayes, A. Shah, G. Pang, D. Parikh, S. Gupta. Text-conditional contextualized avatars for zero-shot personalization, [Online], Available: https://arxiv.org/abs/2304.07410, 2023.

  165. C. Zhang, Y. Chen, Y. Fu, Z. Zhou, G. Yu, B. Wang, B. Fu, T. Chen, G. Lin, C. Shen. StyleAvatar3D: Leveraging image-text diffusion models for high-fidelity 3D avatar generation, [Online], Available: https://arxiv.org/abs/2305.19012, 2023.

  166. Y. Zeng, Y. Lu, X. Ji, Y. Yao, H. Zhu, X. Cao. AvatarBooth: High-quality and customizable 3D human avatar generation, [Online], Available: https://arxiv.org/abs/2306.09864, 2023.

  167. Y. Shi, P. Wang, J. Ye, L. Mai, K. Li, X. Yang. MVDream: Multi-view diffusion for 3D generation. In Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, 2024.

  168. Y. Ouyang, W. Chai, J. Ye, D. Tao, Y. Zhan, G. Wang. Chasing consistency in text-to-3D generation from a single image, [Online], Available: https://arxiv.org/abs/2309.03599, 2023.

  169. Y. Zhao, Z. Yan, E. Xie, L. Hong, Z. Li, G. H. Lee. Animate124: Animating one image to 4D dynamic scene, [Online], Available: https://arxiv.org/abs/2311.14603, 2023.

  170. Y. Zheng, X. Li, K. Nagano, S. Liu, O. Hilliges, S. De Mello. A unified approach for text-and image-guided 4D scene generation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 7300–7309, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00697.

  171. Y. Y. Yeh, J. B. Huang, C. Kim, L. Xiao, T. Nguyen-Phuoc, N. Khan, C. Zhang, M. Chandraker, C. S. Marshall, Z. Dong, Z. Li. TextureDreamer: Image-guided texture synthesis through geometry-aware diffusion. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 4304–4314, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00412.

  172. J. Zhuang, D. Kang, Y. P. Cao, G. Li, L. Lin, Y. Shan. TIP-editor: An accurate 3D editor following both text-prompts and image-prompts. ACM Transactions on Graphics, vol. 43, no. 4, Article number 121, 2024. DOI: https://doi.org/10.1145/3658205.

  173. T. Van Le, H. Phung, T. H. Nguyen, Q. Dao, N. N. Tran, A. Tran. Anti-DreamBooth: Protecting users from personalized text-to-image synthesis. In Proceedings of IEEE/CVF International Conference on Computer Vision, Paris, France, pp. 2116–2127, 2023. DOI: https://doi.org/10.1109/ICCV51070.2023.00202.

  174. Y. Wu, J. Zhang, F. Kerschbaum, T. Zhang. Backdooring textual inversion for concept censorship, [Online], Available: https://arxiv.org/abs/2308.10718, 2023.

  175. Y. Huang, F. Juefei-Xu, Q. Guo, J. Zhang, Y. Wu, M. Hu, T. Li, G. Pu, Y. Liu. Personalization as a shortcut for few-shot backdoor attack against text-to-image diffusion models. In Proceedings of the 38th AAAI Conference on Artificial Intelligence, Vancouver, Canada, pp. 21169–21178, 2024. DOI: https://doi.org/10.1609/aaai.v38i19.30110.

  176. J. S. Smith, Y. C. Hsu, L. Zhang, T. Hua, Z. Kira, Y. Shen, H. Jin. Continual diffusion: Continual customization of text-to-image diffusion with C-LoRA, [Online], Available: https://arxiv.org/abs/2304.06027, 2024.

  177. P. Zhang, N. Zhao, J. Liao. Text-guided vector graphics customization. In Proceedings of SIGGRAPH Asia Conference Papers, Sydney, Australia, Article number 54, 2023. DOI: https://doi.org/10.1145/3610548.3618232.

  178. H. Wang, X. Xiang, Y. Fan, J. H. Xue. Customizing 360-degree panoramas through text-to-image diffusion models. In Proceedings of IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, USA, pp. 4921–4931, 2024. DOI: https://doi.org/10.1109/WACV57701.2024.00486.

  179. K. Wang, F. Yang, B. Raducanu, J. van de Weijer. Multi-class textual-inversion secretly yields a semantic-agnostic classifier. In Proceedings of IEEE/CVF Winter Conference on Applications of Computer Vision, Tucson, USA, pp. 4400–4409, 2025. DOI: https://doi.org/10.1109/WACV61041.2025.00432.

  180. Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, B. Poole. Score-based generative modeling through stochastic differential equations. In Proceedings of the 9th International Conference on Learning Representations, 2021.

  181. T. Karras, M. Aittala, T. Aila, S. Laine. Elucidating the design space of diffusion-based generative models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, USA, pp. 26565–26577, 2022.

  182. J. Ho, A. Jain, P. Abbeel. Denoising diffusion probabilistic models. In Proceedings of the 34th International Conference on Neural Information Processing Systems, pp. 6840–6851, 2020.

  183. B. D. O. Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, vol. 12, no. 3, pp. 313–326, 1982. DOI: https://doi.org/10.1016/0304-4149(82)90051-5.

  184. C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, J. Zhu. DPM-solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. In Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, USA, pp. 5775–5787, 2022.

  185. C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, J. Zhu. DPM-solver++: Fast solver for guided sampling of diffusion probabilistic models. Machine Intelligence Research, vol. 22, no. 4, pp. 730–751, 2025. DOI: https://doi.org/10.1007/s11633-025-1562-4.

  186. D. Chen, Z. Zhou, C. Wang, C. Shen, S. Lyu. On the trajectory regularity of ODE-based diffusion sampling. In Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria, pp. 7905–7934, 2024.

  187. R. Rombach, A. Blattmann, D. Lorenz, P. Esser, B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, pp. 10684–10695, 2022. DOI: https://doi.org/10.1109/CVPR52688.2022.01042.

  188. L. Zhang, A. Rao, M. Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of IEEE/CVF International Conference on Computer Vision, Paris, France, pp. 3813–3824, 2023. DOI: https://doi.org/10.1109/ICCV51070.2023.00355.

  189. A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, M. Chen. Hierarchical text-conditional image generation with CLIP latents, [Online], Available: https://arxiv.org/abs/2204.06125, 2022.

  190. O. Ronneberger, P. Fischer, T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, pp. 234–241, 2015. DOI: https://doi.org/10.1007/978-3-319-24574-4_28.

  191. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, USA, pp. 6000–6010, 2017.

  192. J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li, L. Ouyang, J. Zhuang, J. Lee, Y. Guo, W. Manassra, P. Dhariwal, C. Chu, Y. Jiao, A. Ramesh. Improving image generation with better captions, [Online], Available: https://cdn.openai.com/papers/dall-e-3.pdf.

  193. A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, pp. 8748–8763, 2021.

  194. J. Li, D. Li, C. Xiong, S. C. H. Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, USA, pp. 12888–12900, 2022.

  195. J. Li, D. Li, S. Savarese, S. Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, USA, Article number 814, 2023.

  196. C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, P. Schramowski, S. Kundurthy, K. Crowson, L. Schmidt, R. Kaczmarczyk, J. Jitsev. LAION-5B: An open large-scale dataset for training next generation image-text models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, USA, Article number 1833, 2022.

  197. Y. Zheng, H. Yang, T. Zhang, J. Bao, D. Chen, Y. Huang, L. Yuan, D. Chen, M. Zeng, F. Wen. General facial representation learning in a visual-linguistic manner. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, pp. 18676–18688, 2022. DOI: https://doi.org/10.1109/CVPR52688.2022.01814.

  198. M. Hassanin, S. Anwar, I. Radwan, F. S. Khan, A. Mian. Visual attention methods in deep learning: An in-depth survey. Information Fusion, vol. 108, Article number 102417, 2024. DOI: https://doi.org/10.1016/j.inffus.2024.102417.

  199. A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W. Y. Lo, P. Dollár, R. Girshick. Segment anything. In Proceedings of IEEE/CVF International Conference on Computer Vision, Paris, France, pp. 3992–4003, 2023. DOI: https://doi.org/10.1109/ICCV51070.2023.00371.

  200. X. Y. Wei, Z. Q. Yang. Coaching the exploration and exploitation in active learning for interactive video retrieval. IEEE Transactions on Image Processing, vol. 22, no. 3, pp. 955–968, 2013. DOI: https://doi.org/10.1109/TIP.2012.2222902.

  201. X. Y. Wei, Z. Q. Yang. Coached active learning for interactive video search. In Proceedings of the 19th ACM International Conference on Multimedia, Scottsdale, USA, pp. 443–452, 2011. DOI: https://doi.org/10.1145/2072298.2072356.

  202. H. Yang, H. Zhu, Y. Wang, M. Huang, Q. Shen, R. Yang, X. Cao. FaceScape: A large-scale high quality 3D face dataset and detailed riggable 3D face prediction. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 598–607, 2020. DOI: https://doi.org/10.1109/CVPR42600.2020.00068.

  203. Q. Cao, L. Shen, W. Xie, O. M. Parkhi, A. Zisserman. VGGFace2: A dataset for recognising faces across pose and age. In Proceedings of the 13th IEEE International Conference on Automatic Face & Gesture Recognition, Xi’an, China, pp. 67–74, 2018. DOI: https://doi.org/10.1109/FG.2018.00020.

  204. Y. Wu, Q. Ji. Facial landmark detection: A literature survey. International Journal of Computer Vision, vol. 127, no. 2, pp. 115–142, 2019. DOI: https://doi.org/10.1007/s11263-018-1097-z.

  205. I. Adjabi, A. Ouahabi, A. Benzaoui, A. Taleb-Ahmed. Past, present, and future of face recognition: A review. Electronics, vol. 9, no. 8, Article number 1188, 2020. DOI: https://doi.org/10.3390/electronics9081188.

  206. T. Karras, S. Laine, T. Aila. A style-based generator architecture for generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 12, pp. 4217–4228, 2021. DOI: https://doi.org/10.1109/tpami.2020.2970919.

  207. E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen. LoRA: Low-rank adaptation of large language models. In Proceedings of the 10th International Conference on Learning Representations, 2022.

  208. J. Song, C. Meng, S. Ermon. Denoising diffusion implicit models. In Proceedings of the 9th International Conference on Learning Representations, 2021.

  209. T. Islam, A. Miron, X. Liu, Y. Li. Deep learning in virtual try-on: A comprehensive survey. IEEE Access, vol. 12, pp. 29475–29502, 2024. DOI: https://doi.org/10.1109/ACCESS.2024.3368612.

  210. Z. Xing, Q. Feng, H. Chen, Q. Dai, H. Hu, H. Xu, Z. Wu, Y. G. Jiang. A survey on video diffusion models. ACM Computing Surveys, vol. 57, no. 2, Article number 41, 2025. DOI: https://doi.org/10.1145/3696415.

  211. B. Poole, A. Jain, J. T. Barron, B. Mildenhall. DreamFusion: Text-to-3D using 2D diffusion, [Online], Available: https://arxiv.org/abs/2209.14988, 2022.

  212. B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, R. Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, vol. 65, no. 1, pp. 99–106, 2022. DOI: https://doi.org/10.1145/3503250.

  213. J. Xu, X. Liu, Y. Wu, Y. Tong, Q. Li, M. Ding, J. Tang, Y. Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, USA, Article number 700, 2023.

  214. X. Wu, K. Sun, F. Zhu, R. Zhao, H. Li. Human preference score: Better aligning text-to-image models with human preference. In Proceedings of IEEE/CVF International Conference on Computer Vision, Paris, France, pp. 2096–2105, 2023. DOI: https://doi.org/10.1109/ICCV51070.2023.00200.

  215. X. Wu, Y. Hao, K. Sun, Y. Chen, F. Zhu, R. Zhao, H. Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis, [Online], Available: https://arxiv.org/abs/2306.09341, 2023.

  216. Y. Kirstain, A. Polyak, U. Singer, S. Matiana, J. Penna, O. Levy. Pick-a-Pic: An open dataset of user preferences for text-to-image generation. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, USA, Article number 1594, 2023.

  217. M. Caron, H. Touvron, I. Misra, H. Jegou, J. Mairal, P. Bojanowski, A. Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of IEEE/CVF International Conference on Computer Vision, Montreal, Canada, pp. 9630–9640, 2021. DOI: https://doi.org/10.1109/ICCV48922.2021.00951.

  218. M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, USA, pp. 6629–6640, 2017.

  219. C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, pp. 2818–2826, 2016. DOI: https://doi.org/10.1109/CVPR.2016.308.

  220. Z. Liu, P. Luo, X. Wang, X. Tang. Deep learning face attributes in the wild. In Proceedings of IEEE/CVF International Conference on Computer Vision, Santiago, Chile, pp. 3730–3738, 2015. DOI: https://doi.org/10.1109/ICCV.2015.425.

  221. K. Zhang, Z. Zhang, Z. Li, Y. Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499–1503, 2016. DOI: https://doi.org/10.1109/LSP.2016.2603342.

  222. F. Schroff, D. Kalenichenko, J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Boston, USA, pp. 815–823, 2015. DOI: https://doi.org/10.1109/CVPR.2015.7298682.

  223. X. Wang, X. Zhang, Z. Luo, Q. Sun, Y. Cui, J. Wang, F. Zhang, Y. Wang, Z. Li, Q. Yu, Y. Zhao, Y. Ao, X. Min, T. Li, B. Wu, B. Zhao, B. Zhang, L. Wang, G. Liu, Z. He, X. Yang, J. Liu, Y. Lin, T. Huang, Z. Wang. Emu3: Next-token prediction is all you need, [Online], Available: https://arxiv.org/abs/2409.18869, 2024.

  224. Gemini Team Google. Gemini: A family of highly capable multimodal models, [Online], Available: https://arxiv.org/abs/2312.11805, 2023.

Acknowledgements

This work was supported in part by Chinese National Natural Science Foundation Projects, China (Nos. U23B2054, 62276254 and 62372314), Beijing Natural Science Foundation, China (No. L221013), InnoHK program, and Hong Kong Research Grants Council through Research Impact Fund, China (No. R1015-23). Open access funding provided by The Hong Kong Polytechnic University, China.

Author information

Authors and Affiliations

  1. Department of Computing, The Hong Kong Polytechnic University, Hong Kong, 999077, China

    Xulu Zhang, Xiaoyong Wei, Wentao Hu, Jiaxin Wu, Wengyu Zhang & Qing Li

  2. Center for Artificial Intelligence and Robotics, Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences, Hong Kong, 999077, China

    Xulu Zhang, Jinlin Wu, Zhaoxiang Zhang & Zhen Lei

  3. State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, 100190, China

    Jinlin Wu, Zhaoxiang Zhang & Zhen Lei

  4. School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, 100049, China

    Zhaoxiang Zhang & Zhen Lei

Corresponding authors

Correspondence to Xiaoyong Wei or Zhen Lei.

Ethics declarations

The authors declare that they have no conflict of interest related to this work.

Additional information

Colored figures are available in the online version at https://link.springer.com/journal/11633

Xulu Zhang received the B. Eng. and the M. Sc. degrees in computer science from Sichuan University, China in 2019 and 2022, respectively. He is a Ph. D. candidate in the Department of Computing, The Hong Kong Polytechnic University, China.

His research interests include image generation and active learning.

Xiaoyong Wei received the Ph. D. degree in computer science from the City University of Hong Kong, China in 2009, and worked as a postdoctoral fellow at the University of California, USA from December 2013 to December 2015. He has been a professor and the head of the Department of Computer Science, Sichuan University, China since 2010. He is an adjunct professor of Peng Cheng Laboratory, China, and a visiting professor of the Department of Computing, The Hong Kong Polytechnic University, China. He is a senior member of IEEE, has served as an Associate Editor of Interdisciplinary Sciences: Computational Life Sciences since 2020 and as the Program Chair of ICMR 2019 and ICIMCS 2012, and has been a technical committee member of over 20 conferences such as ICCV, CVPR, SIGKDD, ACM MM, ICME, and ICIP.

His research interests include multimedia computing, health computing, machine learning and large-scale data mining.

Wentao Hu received the B. Eng. degree in computer science from Shandong University, China in 2021, and the M. Sc. degree in computer science from Sun Yat-sen University, China in 2024. He is currently a Ph. D. candidate in the Department of Computing, The Hong Kong Polytechnic University, China.

His research interests include image generation and 3D reconstruction.

Jinlin Wu received the B. Sc. degree in computer science from the University of Electronic Science and Technology of China, China in 2017, and the Ph. D. degree in computer science from the University of Chinese Academy of Sciences, China in 2022. He is an assistant research fellow at the Institute of Automation, Chinese Academy of Sciences (CAS), China. He has served as the principal investigator of a National Natural Science Foundation of China (NSFC) Youth Science Fund project and has participated in several other NSFC-funded projects. He has accumulated a solid research foundation in areas such as video analysis for security and medical video understanding. He has published over 30 high-quality academic papers, with more than 500 citations.

His research interests include object detection, image recognition, and video understanding.

Jiaxin Wu received the Ph. D. degree in computer science from the City University of Hong Kong, China in 2024. She is a postdoctoral fellow in the Department of Computing at The Hong Kong Polytechnic University, China.

Her research interests include multimedia retrieval, AI for Science (AI4Science), and natural language processing (NLP).

Wengyu Zhang is an undergraduate student in the Department of Computing, The Hong Kong Polytechnic University, China.

His research interests include natural language processing (NLP), AI for Science (AI4Science), and graph learning.

Zhaoxiang Zhang received the B. Sc. degree in computer science from the Department of Electronic Science and Technology, University of Science and Technology of China, China in 2004, and the Ph. D. degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, China in 2009. From 2009 to 2015, he worked as a lecturer, an associate professor, and later the deputy director of the Department of Computer Application Technology at Beihang University, China. Since July 2015, he has been with the Institute of Automation, Chinese Academy of Sciences, where he is currently a professor. He currently focuses on deep learning models, biologically-inspired visual computing and human-like learning, and their applications in human analysis and scene understanding. He has published more than 200 papers in international journals and conferences, including reputable international journals such as IEEE Transactions on Pattern Analysis and Machine Intelligence, International Journal of Computer Vision, Journal of Machine Learning Research, IEEE Transactions on Image Processing, and IEEE Transactions on Circuits and Systems for Video Technology, and top-level international conferences like CVPR, ICCV, NIPS, ECCV, ICLR, AAAI, IJCAI and ACM MM. He has won best paper awards at several conferences and championships in international competitions, and his research has won the "Technical Innovation Award of the Chinese Association of Artificial Intelligence". He has served as the PC Chair or Area Chair of many international conferences like CVPR, ICCV, AAAI, IJCAI, ACM MM, ICPR and BICS. He is serving or has served as an Associate Editor of reputable international journals like International Journal of Computer Vision, IEEE Transactions on Circuits and Systems for Video Technology, IEEE Transactions on Biometrics, Behavior, and Identity Science, Pattern Recognition and Neurocomputing.

His research interests include pattern recognition, computer vision and machine learning.

Zhen Lei received the B. Sc. degree in automation from the University of Science and Technology of China, China in 2005, and the Ph. D. degree in computer vision from the Institute of Automation, Chinese Academy of Sciences, China in 2010, where he is currently a professor. He is an IEEE Fellow, IAPR Fellow and AAIA Fellow. He has published over 200 papers in international journals and conferences, with 33 000+ citations on Google Scholar and an h-index of 86. He was the program co-chair of IJCB2023 and the competition co-chair of IJCB2022, has served as an Area Chair for several conferences, and is an Associate Editor for the IEEE Transactions on Information Forensics and Security, IEEE Transactions on Biometrics, Behavior, and Identity Science, Pattern Recognition, Neurocomputing and IET Computer Vision journals. He is the winner of the 2019 IAPR Young Biometrics Investigator Award.

His research interests include computer vision, pattern recognition, image processing, and face recognition in particular.

Qing Li received the Ph. D. degree in data science from the University of Southern California, USA in 1989. He is currently a Chair Professor (Data Science) and the Head of the Department of Computing, The Hong Kong Polytechnic University, China. Formerly, he was the founding director of the Multimedia Software Engineering Research Centre (MERC) and a professor at City University of Hong Kong, where he worked in the Department of Computer Science from 1998 to 2018. Prior to these, he also taught at the Hong Kong University of Science and Technology and the Australian National University (Canberra, Australia). He served as a consultant to Microsoft Research Asia (Beijing, China), Motorola Global Computing and Telecommunications Division (Tianjin Regional Operations Center), and the Division of Information Technology, Commonwealth Scientific and Industrial Research Organisation (CSIRO) in Australia. He has been an adjunct professor of the University of Science and Technology of China (USTC) and Wuhan University, and a guest professor of Hunan University (Changsha, China), where he received his B. Eng. degree from the Department of Computer Science in 1982. He is also a guest professor (Software Technology) of Zhejiang University (Hangzhou, China), the leading university of the Zhejiang province, where he was born.

He has been actively involved in the research community by serving as an Associate Editor and reviewer for technical journals, and as an organizer/co-organizer of numerous international conferences. Some recent conferences in which he is playing or has played major roles include APWeb-WAIM2018, ICDM 2018, WISE2017, ICDSC2016, DASFAA2015, U-Media2014, ER2013, RecSys2013, NDBC2012, ICMR2012, CoopIS2011, WAIM2010, DASFAA2010, APWeb-WAIM2009, ER2008, WISE2007, ICWL2006, HSI2005, WAIM2004, IDEAS2003, VLDB2002, PAKDD2001, IFIP 2.6 Working Conference on Database Semantics (DS-9), IDS2000, and WISE2000. In addition, he served as a programme committee member for over fifty international conferences (including VLDB, ICDE, WWW, DASFAA, ER, CIKM, CAiSE, CoopIS, and FODO). He is currently a fellow of IEEE and IET/IEE, a member of ACM-SIGMOD and IEEE Technical Committee on Data Engineering. He is the Chairperson of the Hong Kong Web Society, and also served/is serving as an executive committee (EXCO) member of IEEE-Hong Kong Computer Chapter and ACM Hong Kong Chapter. In addition, he serves as a councilor of the Database Society of Chinese Computer Federation (CCF), a member of the Big Data Expert Committee of CCF, and is a Steering Committee member of DASFAA, ER, ICWL, UMEDIA, and WISE Society.

His research interests include data science and artificial intelligence.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Zhang, X., Wei, X., Hu, W. et al. A Survey on Personalized Content Synthesis with Diffusion Models. Mach. Intell. Res. 22, 817–848 (2025). https://doi.org/10.1007/s11633-025-1563-3

Download citation

  • Received: 07 February 2025

  • Accepted: 13 May 2025

  • Published: 27 September 2025

  • Version of record: 27 September 2025

  • Issue date: October 2025

  • DOI: https://doi.org/10.1007/s11633-025-1563-3

Keywords

  • Generative models
  • image synthesis
  • diffusion models
  • personalized content synthesis
  • subject customization
