Abstract
Recent advances in diffusion models have significantly impacted content creation, giving rise to personalized content synthesis (PCS). Given a small set of user-provided examples featuring the same subject, PCS aims to render that subject according to user-defined prompts. Over the past two years, more than 150 methods have been introduced in this area, yet existing surveys focus primarily on text-to-image generation, and few provide up-to-date summaries of PCS. This paper presents a comprehensive survey of PCS, introducing the general frameworks of PCS research, which can be categorized into test-time fine-tuning (TTF) and pre-trained adaptation (PTA) approaches. We analyze the strengths, limitations, and key techniques of these methodologies. We also examine specialized tasks within the field, such as object, face, and style personalization, highlighting their unique challenges and innovations. Despite promising progress, we discuss ongoing challenges, including overfitting and the trade-off between subject fidelity and text alignment. Through this detailed overview and analysis, we propose future directions to further the development of PCS.
X. Wang, S. Fu, Q. Huang, W. He, H. Jiang. MS-diffusion: Multi-subject zero-shot image personalization with layout guidance. In Proceedings of the 13th International Conference on Learning Representations, Singapore, 2025.
Z. Huang, T. Wu, Y. Jiang, K. C. K. Chan, Z. Liu. ReVersion: Diffusion-based relation inversion from images. In Proceedings of SIGGRAPH Asia Conference Papers, Tokyo, Japan, Article number 4, 2024. DOI: https://doi.org/10.1145/3680528.3687658.
G. Zhang, Y. Qian, J. Deng, X. Cai. Inv-ReVersion: Enhanced relation inversion based on text-to-image diffusion models. Applied Sciences, vol. 14, no. 8, Article number 3338, 2024. DOI: https://doi.org/10.3390/app14083338.
Z. Xu, S. Hao, K. Han. CusConcept: Customized visual concept decomposition with diffusion models. In Proceedings of IEEE/CVF Winter Conference on Applications of Computer Vision, Tucson, USA, pp. 3678–3687, 2025. DOI: https://doi.org/10.1109/WACV61041.2025.00362.
S. Huang, B. Gong, Y. Feng, X. Chen, Y. Fu, Y. Liu, D. Wang. Learning disentangled identifiers for action-customized text-to-image generation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 7797–7806, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00745.
J. Gu, Y. Wang, N. Zhao, T. J. Fu, W. Xiong, Q. Liu, Z. Zhang, H. Zhang, J. Zhang, H. Jung, X. E. Wang. Photoswap: Personalized subject swapping in images. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, USA, Article number 1529, 2023.
S. Zhang, M. Ni, S. Chen, L. Wang, W. Ding, Y. Liu. A two-stage personalized virtual try-on framework with shape control and texture guidance. IEEE Transactions on Multimedia, vol. 26, pp. 10225–10236, 2024. DOI: https://doi.org/10.1109/TMM.2024.3405718.
M. Chen, I. Laina, A. Vedaldi. Training-free layout control with cross-attention guidance. In Proceedings of IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, USA, pp. 5331–5341, 2024. DOI: https://doi.org/10.1109/WACV57701.2024.00526.
J. Gu, N. Zhao, W. Xiong, Q. Liu, Z. Zhang, H. Zhang, J. Zhang, H. Jung, Y. Wang, X. E. Wang. SwapAnything: Enabling arbitrary object swapping in personalized visual editing, [Online], Available: https://arxiv.org/abs/2404.05717, 2024.
N. Kumari, G. Su, R. Zhang, T. Park, E. Shechtman, J. Y. Zhu. Customizing text-to-image diffusion with object viewpoint control, [Online], Available: https://arxiv.org/abs/2404.12333, 2024.
X. Xu, J. Guo, Z. Wang, G. Huang, I. Essa, H. Shi. Prompt-free diffusion: Taking “text” out of text-to-image diffusion models. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 8682–8692, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00829.
S. Zhao, D. Chen, Y. C. Chen, J. Bao, S. Hao, L. Yuan, K. Y. K. Wong. Uni-controlNet: All-in-one control to text-to-image diffusion models. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, USA, pp. 11127–11150, 2023.
S. Y. Cheong, A. Mustafa, A. Gilbert. ViscoNet: Bridging and harmonizing visual and textual conditioning for ControlNet, [Online], Available: https://arxiv.org/abs/2312.03154, 2024.
I. Najdenkoska, A. Sinha, A. Dubey, D. Mahajan, V. Ramanathan, F. Radenovic. Context diffusion: In-context aware image generation. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 375–391, 2024. DOI: https://doi.org/10.1007/978-3-031-72980-5_22.
S. Mo, F. Mu, K. H. Lin, Y. Liu, B. Guan, Y. Li, B. Zhou. FreeControl: Training-free spatial control of any text-to-image diffusion model with any condition. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 7465–7475, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00713.
P. Li, Q. Nie, Y. Chen, X. Jiang, K. Wu, Y. Lin, Y. Liu, J. Peng, C. Wang, F. Zheng. Tuning-free image customization with image and text guidance. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 233–250, 2025. DOI: https://doi.org/10.1007/978-3-031-73116-7_14.
J. Z. Wu, Y. Ge, X. Wang, S. W. Lei, Y. Gu, Y. Shi, W. Hsu, Y. Shan, X. Qie, M. Z. Shou. Tune-A-Video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of IEEE/CVF International Conference on Computer Vision, Paris, France, pp. 7623–7633, 2023. DOI: https://doi.org/10.1109/ICCV51070.2023.00701.
P. Esser, J. Chiu, P. Atighehchian, J. Granskog, A. Germanidis. Structure and content-guided video synthesis with diffusion models. In Proceedings of IEEE/CVF International Conference on Computer Vision, Paris, France, pp. 7346–7356, 2023. DOI: https://doi.org/10.1109/ICCV51070.2023.00675.
Y. Zhao, E. Xie, L. Hong, Z. Li, G. H. Lee. Make-A-Protagonist: Generic video editing with an ensemble of experts, [Online], Available: https://ar5iv.labs.arxiv.org/html/2305.08850, 2023.
Y. He, M. Xia, H. Chen, X. Cun, Y. Gong, J. Xing, Y. Zhang, X. Wang, C. Weng, Y. Shan, Q. Chen. Animate-a-story: Storytelling with retrieval-augmented video generation, [Online], Available: https://arxiv.org/abs/2307.06940, 2023.
R. Zhao, Y. Gu, J. Z. Wu, D. J. Zhang, J. W. Liu, W. Wu, J. Keppo, M. Z. Shou. MotionDirector: Motion customization of text-to-video diffusion models. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 273–290, 2025. DOI: https://doi.org/10.1007/978-3-031-72992-8_16.
R. Wu, L. Chen, T. Yang, C. Guo, C. Li, X. Zhang. LAMP: Learn a motion pattern for few-shot video generation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 7089–7098, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00677.
H. Jeong, G. Y. Park, J. C. Ye. VMC: Video motion customization using temporal attention adaption for text-to-video diffusion models. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 9212–9221, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00880.
Y. Song, W. Shin, J. Lee, J. Kim, N. Kwak. SAVE: Protagonist diversification with structure agnostic video editing. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 41–57, 2025. DOI: https://doi.org/10.1007/978-3-031-72989-8_3.
J. Materzynska, J. Sivic, E. Shechtman, A. Torralba, R. Zhang, B. Russell. NewMove: Customizing text-to-video models with novel motions, [Online], Available: https://arxiv.org/abs/2312.04966, 2023.
Y. Wei, S. Zhang, Z. Qing, H. Yuan, Z. Liu, Y. Liu, Y. Zhang, J. Zhou, H. Shan. DreamVideo: Composing your dream videos with customized subject and motion, [Online], Available: https://arxiv.org/abs/2312.04433, 2023.
Y. Zhang, F. Tang, N. Huang, H. Huang, C. Ma, W. Dong, C. Xu. MotionCrafter: One-shot motion customization of diffusion models, [Online], Available: https://arxiv.org/abs/2312.05288, 2024.
Y. Ren, Y. Zhou, J. Yang, J. Shi, D. Liu, F. Liu, M. Kwon, A. Shrivastava. Customize-a-video: One-shot motion customization of text-to-video diffusion models. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 332–349, 2025. DOI: https://doi.org/10.1007/978-3-031-73024-5_20.
Z. Ma, D. Zhou, C. H. Yeh, X. S. Wang, X. Li, H. Yang, Z. Dong, K. Keutzer, J. Feng. Magic-Me: Identity-specific video customized diffusion, [Online], Available: https://arxiv.org/abs/2402.09368, 2024.
X. Bi, J. Lu, B. Liu, X. Cun, Y. Zhang, W. Li, B. Xiao. CustomTTT: Motion and appearance customized video generation via test-time training. In Proceedings of the 39th AAAI Conference on Artificial Intelligence, Philadelphia, USA, pp. 1871–1879, 2025. DOI: https://doi.org/10.1609/aaai.v39i2.32182.
H. Li, H. Qiu, S. Zhang, X. Wang, Y. Wei, Z. Li, Y. Zhang, B. Wu, D. Cai. PersonalVideo: High ID-fidelity video customization without dynamic and semantic degradation, [Online], Available: https://arxiv.org/abs/2411.17048, 2024.
Z. Wang, A. Li, L. Zhu, Y. Guo, Q. Dou, Z. Li. CustomVideo: Customizing text-to-video generation with multiple subjects, [Online], Available: https://arxiv.org/abs/2401.09962, 2024.
H. Chen, X. Wang, G. Zeng, Y. Zhang, Y. Zhou, F. Han, Y. Wu, W. Zhu. VideoDreamer: Customized multi-subject text-to-video generation with disen-mix finetuning on language-video foundation models, [Online], Available: https://arxiv.org/abs/2311.00990, 2023.
H. Zhao, T. Lu, J. Gu, X. Zhang, Q. Zheng, Z. Wu, H. Xu, Y. G. Jiang. MagDiff: Multi-alignment diffusion for high-fidelity video generation and editing. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 205–221, 2025. DOI: https://doi.org/10.1007/978-3-031-72649-1_12.
Y. Jiang, T. Wu, S. Yang, C. Si, D. Lin, Y. Qiao, C. C. Loy, Z. Liu. VideoBooth: Diffusion-based video generation with image prompts. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 6689–6700, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00639.
T. Wu, Y. Zhang, X. Cun, Z. Qi, J. Pu, H. Dou, G. Zheng, Y. Shan, X. Li. VideoMaker: Zero-shot customized video generation with the inherent force of video diffusion models, [Online], Available: https://arxiv.org/abs/2412.19645, 2024.
Y. Zhou, R. Zhang, J. Gu, N. Zhao, J. Shi, T. Sun. SUGAR: Subject-driven video customization in a zero-shot manner, [Online], Available: https://arxiv.org/abs/2412.10533, 2024.
G. Liu, M. Xia, Y. Zhang, H. Chen, J. Xing, Y. Wang, X. Wang, Y. Shan, Y. Yang. StyleCrafter: Taming artistic video diffusion with reference-augmented adapter learning. ACM Transactions on Graphics, vol. 43, no. 6, Article number 251, 2024. DOI: https://doi.org/10.1145/3687975.
X. He, Q. Liu, S. Qian, X. Wang, T. Hu, K. Cao, K. Yan, J. Zhang. ID-animator: Zero-shot identity-preserving human video generation, [Online], Available: https://arxiv.org/abs/2404.15275, 2024.
S. Yuan, J. Huang, X. He, Y. Ge, Y. Shi, L. Chen, J. Luo, L. Yuan. Identity-preserving text-to-video generation by frequency decomposition, [Online], Available: https://arxiv.org/abs/2411.17440, 2024.
H. Fang, D. Qiu, B. Mao, P. Yan, H. Tang. Motion-Character: Identity-preserving and motion controllable human video generation, [Online], Available: https://arxiv.org/abs/2411.18281, 2024.
M. Feng, J. Liu, K. Yu, Y. Yao, Z. Hui, X. Guo, X. Lin, H. Xue, C. Shi, X. Li, A. Li, X. Kang, B. Lei, M. Cui, P. Ren, X. Xie. DreaMoving: A human video generation framework based on diffusion models, [Online], Available: https://arxiv.org/abs/2312.05107, 2023.
C. H. Lin, J. Gao, L. Tang, T. Takikawa, X. Zeng, X. Huang, K. Kreis, S. Fidler, M. Y. Liu, T. Y. Lin. Magic3D: High-resolution text-to-3D content creation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, pp. 300–309, 2023. DOI: https://doi.org/10.1109/CVPR52729.2023.00037.
A. Raj, S. Kaza, B. Poole, M. Niemeyer, N. Ruiz, B. Mildenhall, S. Zada, K. Aberman, M. Rubinstein, J. Barron, Y. Li, V. Jampani. DreamBooth3D: Subject-driven text-to-3D generation. In Proceedings of IEEE/CVF International Conference on Computer Vision, Paris, France, pp. 2349–2359, 2023. DOI: https://doi.org/10.1109/ICCV51070.2023.00223.
S. Azadi, T. Hayes, A. Shah, G. Pang, D. Parikh, S. Gupta. Text-conditional contextualized avatars for zero-shot personalization, [Online], Available: https://arxiv.org/abs/2304.07410, 2023.
C. Zhang, Y. Chen, Y. Fu, Z. Zhou, G. Yu, B. Wang, B. Fu, T. Chen, G. Lin, C. Shen. StyleAvatar3D: Leveraging image-text diffusion models for high-fidelity 3D avatar generation, [Online], Available: https://arxiv.org/abs/2305.19012, 2023.
Y. Zeng, Y. Lu, X. Ji, Y. Yao, H. Zhu, X. Cao. Avatar-Booth: High-quality and customizable 3D human avatar generation, [Online], Available: https://arxiv.org/abs/2306.09864, 2023.
Y. Shi, P. Wang, J. Ye, L. Mai, K. Li, X. Yang. MVDream: Multi-view diffusion for 3D generation. In Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, 2024.
Y. Ouyang, W. Chai, J. Ye, D. Tao, Y. Zhan, G. Wang. Chasing consistency in text-to-3D generation from a single image, [Online], Available: https://arxiv.org/abs/2309.03599, 2023.
Y. Zhao, Z. Yan, E. Xie, L. Hong, Z. Li, G. H. Lee. Animate124: Animating one image to 4D dynamic scene, [Online], Available: https://arxiv.org/abs/2311.14603, 2023.
Y. Zheng, X. Li, K. Nagano, S. Liu, O. Hilliges, S. De Mello. A unified approach for text-and image-guided 4D scene generation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 7300–7309, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00697.
Y. Y. Yeh, J. B. Huang, C. Kim, L. Xiao, T. Nguyen-Phuoc, N. Khan, C. Zhang, M. Chandraker, C. S. Marshall, Z. Dong, Z. Li. TextureDreamer: Image-guided texture synthesis through geometry-aware diffusion. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 4304–4314, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00412.
J. Zhuang, D. Kang, Y. P. Cao, G. Li, L. Lin, Y. Shan. TIP-editor: An accurate 3D editor following both text-prompts and image-prompts. ACM Transactions on Graphics, vol. 43, no. 4, Article number 121, 2024. DOI: https://doi.org/10.1145/3658205.
T. Van Le, H. Phung, T. H. Nguyen, Q. Dao, N. N. Tran, A. Tran. Anti-DreamBooth: Protecting users from personalized text-to-image synthesis. In Proceedings of IEEE/CVF International Conference on Computer Vision, Paris, France, pp. 2116–2127, 2023. DOI: https://doi.org/10.1109/ICCV51070.2023.00202.
Y. Wu, J. Zhang, F. Kerschbaum, T. Zhang. Backdooring textual inversion for concept censorship, [Online], Available: https://arxiv.org/abs/2308.10718, 2023.
Y. Huang, F. Juefei-Xu, Q. Guo, J. Zhang, Y. Wu, M. Hu, T. Li, G. Pu, Y. Liu. Personalization as a shortcut for few-shot backdoor attack against text-to-image diffusion models. In Proceedings of the 38th AAAI Conference on Artificial Intelligence, Vancouver, Canada, pp. 21169–21178, 2024. DOI: https://doi.org/10.1609/aaai.v38i19.30110.
J. S. Smith, Y. C. Hsu, L. Zhang, T. Hua, Z. Kira, Y. Shen, H. Jin. Continual diffusion: Continual customization of text-to-image diffusion with C-LoRA, [Online], Available: https://arxiv.org/abs/2304.06027, 2024.
P. Zhang, N. Zhao, J. Liao. Text-guided vector graphics customization. In Proceedings of SIGGRAPH Asia Conference Papers, Sydney, Australia, Article number 54, 2023. DOI: https://doi.org/10.1145/3610548.3618232.
H. Wang, X. Xiang, Y. Fan, J. H. Xue. Customizing 360-degree panoramas through text-to-image diffusion models. In Proceedings of IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, USA, pp. 4921–4931, 2024. DOI: https://doi.org/10.1109/WACV57701.2024.00486.
K. Wang, F. Yang, B. Raducanu, J. van de Weijer. Multi-class textual-inversion secretly yields a semantic-agnostic classifier. In Proceedings of IEEE/CVF Winter Conference on Applications of Computer Vision, Tucson, USA, pp. 4400–4409, 2025. DOI: https://doi.org/10.1109/WACV61041.2025.00432.
Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, B. Poole. Score-based generative modeling through stochastic differential equations. In Proceedings of the 9th International Conference on Learning Representations, 2021.
T. Karras, M. Aittala, T. Aila, S. Laine. Elucidating the design space of diffusion-based generative models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, USA, pp. 26565–26577, 2022.
J. Ho, A. Jain, P. Abbeel. Denoising diffusion probabilistic models. In Proceedings of the 34th International Conference on Neural Information Processing Systems, pp. 6840–6851, 2020.
B. D. O. Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, vol. 12, no. 3, pp. 313–326, 1982. DOI: https://doi.org/10.1016/0304-4149(82)90051-5.
C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, J. Zhu. DPM-solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. In Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, USA, pp. 5775–5787, 2022.
C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, J. Zhu. DPM-solver++: Fast solver for guided sampling of diffusion probabilistic models. Machine Intelligence Research, vol. 22, no. 4, pp. 730–751, 2025. DOI: https://doi.org/10.1007/s11633-025-1562-4.
D. Chen, Z. Zhou, C. Wang, C. Shen, S. Lyu. On the trajectory regularity of ODE-based diffusion sampling. In Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria, pp. 7905–7934, 2024.
R. Rombach, A. Blattmann, D. Lorenz, P. Esser, B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, pp. 10684–10695, 2022. DOI: https://doi.org/10.1109/CVPR52688.2022.01042.
L. Zhang, A. Rao, M. Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of IEEE/CVF International Conference on Computer Vision, Paris, France, pp. 3813–3824, 2023. DOI: https://doi.org/10.1109/ICCV51070.2023.00355.
A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, M. Chen. Hierarchical text-conditional image generation with CLIP latents, [Online], Available: https://arxiv.org/abs/2204.06125, 2022.
O. Ronneberger, P. Fischer, T. Brox. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, pp. 234–241, 2015. DOI: https://doi.org/10.1007/978-3-319-24574-4_28.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, USA, pp. 6000–6010, 2017.
J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li, L. Ouyang, J. Zhuang, J. Lee, Y. Guo, W. Manassra, P. Dhariwal, C. Chu, Y. Jiao, A. Ramesh. Improving image generation with better captions, [Online], Available: https://cdn.openai.com/papers/dall-e-3.pdf, 2023.
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, pp. 8748–8763, 2021.
J. Li, D. Li, C. Xiong, S. C. H. Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, USA, pp. 12888–12900, 2022.
J. Li, D. Li, S. Savarese, S. Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, USA, Article number 814, 2023.
C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, P. Schramowski, S. Kundurthy, K. Crowson, L. Schmidt, R. Kaczmarczyk, J. Jitsev. LAION-5B: An open large-scale dataset for training next generation image-text models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, USA, Article number 1833, 2022.
Y. Zheng, H. Yang, T. Zhang, J. Bao, D. Chen, Y. Huang, L. Yuan, D. Chen, M. Zeng, F. Wen. General facial representation learning in a visual-linguistic manner. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, pp. 18676–18688, 2022. DOI: https://doi.org/10.1109/CVPR52688.2022.01814.
M. Hassanin, S. Anwar, I. Radwan, F. S. Khan, A. Mian. Visual attention methods in deep learning: An in-depth survey. Information Fusion, vol. 108, Article number 102417, 2024. DOI: https://doi.org/10.1016/j.inffus.2024.102417.
A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W. Y. Lo, P. Dollár, R. Girshick. Segment anything. In Proceedings of IEEE/CVF International Conference on Computer Vision, Paris, France, pp. 3992–4003, 2023. DOI: https://doi.org/10.1109/ICCV51070.2023.00371.
X. Y. Wei, Z. Q. Yang. Coaching the exploration and exploitation in active learning for interactive video retrieval. IEEE Transactions on Image Processing, vol. 22, no. 3, pp. 955–968, 2013. DOI: https://doi.org/10.1109/TIP.2012.2222902.
X. Y. Wei, Z. Q. Yang. Coached active learning for interactive video search. In Proceedings of the 19th ACM International Conference on Multimedia, Scottsdale, USA, pp. 443–452, 2011. DOI: https://doi.org/10.1145/2072298.2072356.
H. Yang, H. Zhu, Y. Wang, M. Huang, Q. Shen, R. Yang, X. Cao. FaceScape: A large-scale high quality 3D face dataset and detailed riggable 3D face prediction. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 598–607, 2020. DOI: https://doi.org/10.1109/CVPR42600.2020.00068.
Q. Cao, L. Shen, W. Xie, O. M. Parkhi, A. Zisserman. VGGFace2: A dataset for recognising faces across pose and age. In Proceedings of the 13th IEEE International Conference on Automatic Face & Gesture Recognition, Xi’an, China, pp. 67–74, 2018. DOI: https://doi.org/10.1109/FG.2018.00020.
Y. Wu, Q. Ji. Facial landmark detection: A literature survey. International Journal of Computer Vision, vol. 127, no. 2, pp. 115–142, 2019. DOI: https://doi.org/10.1007/s11263-018-1097-z.
I. Adjabi, A. Ouahabi, A. Benzaoui, A. Taleb-Ahmed. Past, present, and future of face recognition: A review. Electronics, vol. 9, no. 8, Article number 1188, 2020. DOI: https://doi.org/10.3390/electronics9081188.
T. Karras, S. Laine, T. Aila. A style-based generator architecture for generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 12, pp. 4217–4228, 2021. DOI: https://doi.org/10.1109/TPAMI.2020.2970919.
E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen. LoRA: Low-rank adaptation of large language models. In Proceedings of the 10th International Conference on Learning Representations, 2022.
J. Song, C. Meng, S. Ermon. Denoising diffusion implicit models. In Proceedings of the 9th International Conference on Learning Representations, 2021.
T. Islam, A. Miron, X. Liu, Y. Li. Deep learning in virtual try-on: A comprehensive survey. IEEE Access, vol. 12, pp. 29475–29502, 2024. DOI: https://doi.org/10.1109/ACCESS.2024.3368612.
Z. Xing, Q. Feng, H. Chen, Q. Dai, H. Hu, H. Xu, Z. Wu, Y. G. Jiang. A survey on video diffusion models. ACM Computing Surveys, vol. 57, no. 2, Article number 41, 2025. DOI: https://doi.org/10.1145/3696415.
B. Poole, A. Jain, J. T. Barron, B. Mildenhall. DreamFusion: Text-to-3D using 2D diffusion, [Online], Available: https://arxiv.org/abs/2209.14988, 2022.
B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, R. Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, vol. 65, no. 1, pp. 99–106, 2022. DOI: https://doi.org/10.1145/3503250.
J. Xu, X. Liu, Y. Wu, Y. Tong, Q. Li, M. Ding, J. Tang, Y. Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, USA, Article number 700, 2023.
X. Wu, K. Sun, F. Zhu, R. Zhao, H. Li. Human preference score: Better aligning text-to-image models with human preference. In Proceedings of IEEE/CVF International Conference on Computer Vision, Paris, France, pp. 2096–2105, 2023. DOI: https://doi.org/10.1109/ICCV51070.2023.00200.
X. Wu, Y. Hao, K. Sun, Y. Chen, F. Zhu, R. Zhao, H. Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis, [Online], Available: https://arxiv.org/abs/2306.09341, 2023.
Y. Kirstain, A. Polyak, U. Singer, S. Matiana, J. Penna, O. Levy. Pick-a-Pic: An open dataset of user preferences for text-to-image generation. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, USA, Article number 1594, 2023.
M. Caron, H. Touvron, I. Misra, H. Jegou, J. Mairal, P. Bojanowski, A. Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of IEEE/CVF International Conference on Computer Vision, Montreal, Canada, pp. 9630–9640, 2021. DOI: https://doi.org/10.1109/ICCV48922.2021.00951.
M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, S. Hochreiter. GANs trained by a two time-scale update rule converge to a local nash equilibrium. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, USA, pp. 6629–6640, 2017.
C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, pp. 2818–2826, 2016. DOI: https://doi.org/10.1109/CVPR.2016.308.
Z. Liu, P. Luo, X. Wang, X. Tang. Deep learning face attributes in the wild. In Proceedings of IEEE/CVF International Conference on Computer Vision, Santiago, Chile, pp. 3730–3738, 2015. DOI: https://doi.org/10.1109/ICCV.2015.425.
K. Zhang, Z. Zhang, Z. Li, Y. Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499–1503, 2016. DOI: https://doi.org/10.1109/LSP.2016.2603342.
F. Schroff, D. Kalenichenko, J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Boston, USA, pp. 815–823, 2015. DOI: https://doi.org/10.1109/CVPR.2015.7298682.
X. Wang, X. Zhang, Z. Luo, Q. Sun, Y. Cui, J. Wang, F. Zhang, Y. Wang, Z. Li, Q. Yu, Y. Zhao, Y. Ao, X. Min, T. Li, B. Wu, B. Zhao, B. Zhang, L. Wang, G. Liu, Z. He, X. Yang, J. Liu, Y. Lin, T. Huang, Z. Wang. Emu3: Next-token prediction is all you need, [Online], Available: https://arxiv.org/abs/2409.18869, 2024.
Gemini Team Google. Gemini: A family of highly capable multimodal models, [Online], Available: https://arxiv.org/abs/2312.11805, 2023.
Acknowledgements
This work was supported in part by Chinese National Natural Science Foundation Projects, China (Nos. U23B2054, 62276254 and 62372314), Beijing Natural Science Foundation, China (No. L221013), InnoHK program, and Hong Kong Research Grants Council through Research Impact Fund, China (No. R1015-23). Open access funding provided by The Hong Kong Polytechnic University, China.
Author information
Authors and Affiliations
Corresponding authors
Ethics declarations
The authors declared that they have no conflicts of interest to this work.
Additional information
Colored figures are available in the online version at https://link.springer.com/journal/11633
Xulu Zhang received the B. Eng. and the M. Sc. degrees in computer science from Sichuan University, China in 2019 and 2022. He is a Ph. D. degree candidate from the Department of Computing, The Hong Kong Polytechnic University, China.
His research interests include image generation and active learning.
Xiaoyong Wei received the Ph. D. degree in computer science from the City University of Hong Kong, China in 2009, and worked as a postdoctoral fellow at the University of California, USA from December 2013 to December 2015. He has been a professor and the head of the Department of Computer Science, Sichuan University, China since 2010. He is an adjunct professor at Peng Cheng Laboratory, China, and a visiting professor in the Department of Computing, The Hong Kong Polytechnic University, China. He is a senior member of IEEE, has served as an Associate Editor of Interdisciplinary Sciences: Computational Life Sciences since 2020, was the Program Chair of ICMR 2019 and ICIMCS 2012, and has served on the technical committees of over 20 conferences, including ICCV, CVPR, SIGKDD, ACM MM, ICME, and ICIP.
His research interests include multimedia computing, health computing, machine learning and large-scale data mining.
Wentao Hu received the B. Eng. degree in computer science from Shandong University, China in 2021, and the M. Sc. degree in computer science from Sun Yat-sen University, China in 2024. He is currently a Ph. D. candidate in the Department of Computing, The Hong Kong Polytechnic University, China.
His research interests include image generation and 3D reconstruction.
Jinlin Wu received the B. Sc. degree in computer science from the University of Electronic Science and Technology of China, China in 2017, and the Ph. D. degree in computer science from the University of Chinese Academy of Sciences, China in 2022. He is an assistant research fellow at the Institute of Automation, Chinese Academy of Sciences (CAS), China. He has served as the principal investigator of a National Natural Science Foundation of China (NSFC) Youth Science Fund project and has participated in several other NSFC-funded projects. He has built a solid research record in areas such as video analysis for security and medical video understanding, and has published over 30 high-quality academic papers with more than 500 citations.
His research interests include object detection, image recognition, and video understanding.
Jiaxin Wu received the Ph. D. degree in computer science from the City University of Hong Kong, China in 2024. She is a postdoctoral fellow in the Department of Computing at The Hong Kong Polytechnic University, China.
Her research interests include multimedia retrieval, AI for Science (AI4Science), and natural language processing (NLP).
Wengyu Zhang is an undergraduate student in the Department of Computing, The Hong Kong Polytechnic University, China.
His research interests include natural language processing (NLP), AI for Science (AI4Science), and graph learning.
Zhaoxiang Zhang received the B. Sc. degree in computer science from the Department of Electronic Science and Technology, University of Science and Technology of China, China in 2004, and the Ph. D. degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, China in 2009. From 2009 to 2015, he worked as a lecturer, associate professor, and later the deputy director of the Department of Computer Application Technology at Beihang University, China. In July 2015, he joined the Institute of Automation, Chinese Academy of Sciences, where he is currently a professor. His recent work focuses on deep learning models, biologically inspired visual computing and human-like learning, and their applications to human analysis and scene understanding. He has published more than 200 papers in international journals and conferences, including reputable international journals such as IEEE Transactions on Pattern Analysis and Machine Intelligence, International Journal of Computer Vision, Journal of Machine Learning Research, IEEE Transactions on Image Processing, IEEE Transactions on Circuits and Systems for Video Technology, and top-level international conferences such as CVPR, ICCV, NIPS, ECCV, ICLR, AAAI, IJCAI and ACM MM. He has won best paper awards at several conferences and championships in international competitions, and his research has won the "Technical Innovation Award of the Chinese Association of Artificial Intelligence". He has served as the PC Chair or Area Chair of many international conferences, such as CVPR, ICCV, AAAI, IJCAI, ACM MM, ICPR and BICS. He is serving or has served as an Associate Editor of reputable international journals, including International Journal of Computer Vision, IEEE Transactions on Circuits and Systems for Video Technology, IEEE Transactions on Biometrics, Behavior, and Identity Science, Pattern Recognition and Neurocomputing.
His research interests include pattern recognition, computer vision and machine learning.
Zhen Lei received the B. Sc. degree in automation from the University of Science and Technology of China, China in 2005, and the Ph. D. degree in computer vision from the Institute of Automation, Chinese Academy of Sciences, China in 2010, where he is currently a professor. He is an IEEE Fellow, IAPR Fellow and AAIA Fellow. He has published over 200 papers in international journals and conferences, with 33 000+ citations in Google Scholar and an h-index of 86. He was the program co-chair of IJCB2023 and the competition co-chair of IJCB2022, has served as an Area Chair for several conferences, and is an Associate Editor for IEEE Transactions on Information Forensics and Security, IEEE Transactions on Biometrics, Behavior, and Identity Science, Pattern Recognition, Neurocomputing and IET Computer Vision. He is the winner of the 2019 IAPR Young Biometrics Investigator Award.
His research interests include computer vision, pattern recognition, image processing, and face recognition in particular.
Qing Li received the Ph. D. degree in data science from the University of Southern California, USA in 1989. He is currently a Chair Professor (Data Science) and the Head of the Department of Computing, The Hong Kong Polytechnic University, China. Formerly, he was the founding director of the Multimedia Software Engineering Research Centre (MERC) and a professor at the City University of Hong Kong, where he worked in the Department of Computer Science from 1998 to 2018. Prior to these, he also taught at the Hong Kong University of Science and Technology and the Australian National University (Canberra, Australia). He served as a consultant to Microsoft Research Asia (Beijing, China), Motorola Global Computing and Telecommunications Division (Tianjin Regional Operations Center), and the Division of Information Technology, Commonwealth Scientific and Industrial Research Organisation (CSIRO) in Australia. He has been an adjunct professor of the University of Science and Technology of China (USTC) and Wuhan University, and a guest professor of Hunan University (Changsha, China), where he received the B. Eng. degree from the Department of Computer Science in 1982. He is also a guest professor (Software Technology) of Zhejiang University (Hangzhou, China), the leading university of Zhejiang province, where he was born.
He has been actively involved in the research community by serving as an Associate Editor and reviewer for technical journals, and as an organizer/co-organizer of numerous international conferences. Some recent conferences in which he is playing or has played major roles include APWeb-WAIM2018, ICDM 2018, WISE2017, ICDSC2016, DASFAA2015, U-Media2014, ER2013, RecSys2013, NDBC2012, ICMR2012, CoopIS2011, WAIM2010, DASFAA2010, APWeb-WAIM2009, ER2008, WISE2007, ICWL2006, HSI2005, WAIM2004, IDEAS2003, VLDB2002, PAKDD2001, IFIP 2.6 Working Conference on Database Semantics (DS-9), IDS2000, and WISE2000. In addition, he served as a programme committee member for over fifty international conferences (including VLDB, ICDE, WWW, DASFAA, ER, CIKM, CAiSE, CoopIS, and FODO). He is currently a fellow of IEEE and IET/IEE, a member of ACM-SIGMOD and IEEE Technical Committee on Data Engineering. He is the Chairperson of the Hong Kong Web Society, and also served/is serving as an executive committee (EXCO) member of IEEE-Hong Kong Computer Chapter and ACM Hong Kong Chapter. In addition, he serves as a councilor of the Database Society of Chinese Computer Federation (CCF), a member of the Big Data Expert Committee of CCF, and is a Steering Committee member of DASFAA, ER, ICWL, UMEDIA, and WISE Society.
His research interests include data science and artificial intelligence.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Zhang, X., Wei, X., Hu, W. et al. A Survey on Personalized Content Synthesis with Diffusion Models. Mach. Intell. Res. 22, 817–848 (2025). https://doi.org/10.1007/s11633-025-1563-3
