
A Survey on Personalized Content Synthesis with Diffusion Models

  • Review
  • Open access
  • Published: 27 September 2025
  • Volume 22, pages 817–848 (2025)

Machine Intelligence Research
  • Xulu Zhang (ORCID: orcid.org/0000-0003-2473-460X)
  • Xiaoyong Wei (ORCID: orcid.org/0000-0002-5706-5177)
  • Wentao Hu (ORCID: orcid.org/0000-0002-2071-9341)
  • Jinlin Wu (ORCID: orcid.org/0000-0001-7877-5728)
  • Jiaxin Wu (ORCID: orcid.org/0000-0003-4074-3442)
  • Wengyu Zhang (ORCID: orcid.org/0009-0001-2347-4183)
  • Zhaoxiang Zhang (ORCID: orcid.org/0000-0003-2648-3875)
  • Zhen Lei (ORCID: orcid.org/0000-0002-0791-189X)
  • Qing Li (ORCID: orcid.org/0000-0003-3370-471X)

Abstract

Recent advancements in diffusion models have significantly impacted content creation, leading to the emergence of personalized content synthesis (PCS). Given a small set of user-provided examples featuring the same subject, PCS aims to render that subject in new images specified by user-defined prompts. Over the past two years, more than 150 methods have been introduced in this area. However, existing surveys primarily focus on text-to-image generation, and few provide up-to-date summaries of PCS. This paper offers a comprehensive survey of PCS, introducing the general frameworks of PCS research, which can be categorized into test-time fine-tuning (TTF) and pre-trained adaptation (PTA) approaches. We analyze the strengths, limitations, and key techniques of these methodologies. Additionally, we explore specialized tasks within the field, such as object, face, and style personalization, while highlighting their unique challenges and innovations. Despite the promising progress, we also discuss ongoing challenges, including overfitting and the trade-off between subject fidelity and text alignment. Through this detailed overview and analysis, we propose future directions to further the development of PCS.
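The test-time fine-tuning (TTF) route the abstract names can be illustrated in miniature: the generative model stays frozen and only a small concept embedding is optimized on the handful of reference images. The sketch below is a hypothetical toy, not the survey's method — the linear "generator", dimensions, and learning rate all stand in for a real diffusion model purely for illustration.

```python
import numpy as np

# Toy sketch of test-time fine-tuning (TTF) personalization: a frozen model
# plus a small, learnable concept embedding fitted to a few reference images.
# The linear "generator" below is a hypothetical stand-in for a diffusion model.

rng = np.random.default_rng(0)
D_EMB, D_IMG, N_REFS = 8, 32, 4

W = rng.normal(size=(D_IMG, D_EMB))  # frozen "generator" weights (never updated)

def generate(embedding):
    """Frozen model: maps a concept embedding to a flat 'image'."""
    return W @ embedding

# A handful of user-provided references: noisy renderings of an unknown concept.
true_concept = rng.normal(size=D_EMB)
references = [generate(true_concept) + 0.01 * rng.normal(size=D_IMG)
              for _ in range(N_REFS)]

def loss(e):
    # Mean squared reconstruction error over the reference set.
    return np.mean([np.sum((generate(e) - r) ** 2) for r in references])

# Optimize ONLY the embedding; the model weights W stay fixed.
embedding = np.zeros(D_EMB)
lr = 0.005
initial = loss(embedding)
for _ in range(200):
    grad = np.mean([2 * W.T @ (generate(embedding) - r) for r in references],
                   axis=0)
    embedding -= lr * grad
final = loss(embedding)

print(f"loss: {initial:.3f} -> {final:.5f}")
```

A pre-trained adaptation (PTA) method would instead train an encoder offline to predict such an embedding (or other conditioning signals) from the reference images in a single forward pass, trading per-subject optimization time for a large pre-training cost.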


References

  1. T. Wu, S. He, J. Liu, S. Sun, K. Liu, Q. L. Han, Y. Tang. A brief overview of ChatGPT: The history, status quo and potential future development. IEEE/CAA Journal of Automatica Sinica, vol. 10, no. 5, pp. 1122–1136, 2023. DOI: https://doi.org/10.1109/JAS.2023.123618.

    Article  Google Scholar 

  2. F. A. Croitoru, V. Hondru, R. T. Ionescu, M. Shah. Diffusion models in vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 9, pp. 10850–10869, 2023. DOI: https://doi.org/10.1109/TPAMI.2023.3261988.

    Article  Google Scholar 

  3. V. Uc-Cetina, N. Navarro-Guerrero, A. Martin-Gonzalez, C. Weber, S. Wermter. Survey on reinforcement learning for language processing. Artificial Intelligence Review, vol. 56, no. 2, pp. 1543–1575, 2023. DOI: https://doi.org/10.1007/s10462-022-10205-5.

    Article  Google Scholar 

  4. N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, K. Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, pp. 22500–22510, 2023. DOI: https://doi.org/10.1109/CVPR52729.2023.02155.

    Google Scholar 

  5. G. Xiao, T. Yin, W. T. Freeman, F. Durand, S. Han. FastComposer: Tuning-free multi-subject image generation with localized attention. International Journal of Computer Vision, vol. 133, no. 3, pp. 1175–1194, 2025. DOI: https://doi.org/10.1007/s11263-024-02227-z.

    Article  Google Scholar 

  6. Q. Wang, X. Bai, H. Wang, Z. Qin, A. Chen, H. Li, X. Tang, Y. Hu. InstantID: Zero-shot identity-preserving generation in seconds, [Online], Available: https://arxiv.org/abs/2401.07519, 2024.

    Google Scholar 

  7. R. Gal, Y. Alaluf, Y. Atzmon, O. Patashnik, A. H. Bermano, G. Chechik, D. Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. In Proceedings of the 11th International Conference on Learning Representations, Kigali, Rwanda, 2023.

    Google Scholar 

  8. N. Kumari, B. Zhang, R. Zhang, E. Shechtman, J. Y. Zhu. Multi-concept customization of text-to-image diffusion. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, pp. 1931–1941, 2023. DOI: https://doi.org/10.1109/CVPR52729.2023.00192.

    Google Scholar 

  9. Y. Wei, Y. Zhang, Z. Ji, J. Bai, L. Zhang, W. Zuo. ELITE: Encoding visual concepts into textual embeddings for customized text-to-image generation. In Proceedings of IEEE/CVF International Conference on Computer Vision, Paris, France, pp. 15897–15907, 2023. DOI: https://doi.org/10.1109/ICCV51070.2023.01461.

    Google Scholar 

  10. Z. Liu, R. Feng, K. Zhu, Y. Zhang, K. Zheng, Y. Liu, D. Zhao, J. Zhou, Y. Cao. Cones: Concept neurons in diffusion models for customized generation. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, USA, Article number 890, 2023.

    Google Scholar 

  11. A. Voynov, Q. Chu, D. Cohen-Or, K. Aberman. P+: Extended textual conditioning in text-to-image generation, [Online], Available: https://arxiv.org/abs/2303.09522, 2023.

    Google Scholar 

  12. L. Han, Y. Li, H. Zhang, P. Milanfar, D. Metaxas, F. Yang. SVDiff: Compact parameter space for diffusion fine-tuning. In Proceedings of IEEE/CVF International Conference on Computer Vision, Paris, France, pp. 7289–7300, 2023. DOI: https://doi.org/10.1109/ICCV51070.2023.00673.

    Google Scholar 

  13. W. Chen, H. Hu, Y. Li, N. Ruiz, X. Jia, M. W. Chang, W. W. Cohen. Subject-driven text-to-image generation via apprenticeship learning. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, USA, pp. 30286–30305, 2023.

    Google Scholar 

  14. J. Shi, W. Xiong, Z. Lin, H. J. Jung. InstantBooth: Personalized text-to-image generation without test-time finetuning. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 8543–8552, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00816.

    Google Scholar 

  15. Y. Tewel, R. Gal, G. Chechik, Y. Atzmon. Key-locked rank one editing for text-to-image personalization. In Proceedings of ACM SIGGRAPH Conference, Los Angeles, USA, Article number 12, 2023. DOI: https://doi.org/10.1145/3588432.3591506.

    Google Scholar 

  16. H. Chen, Y. Zhang, S. Wu, X. Wang, X. Duan, Y. Zhou, W. Zhu. DisenBooth: Identity-preserving disentangled tuning for subject-driven text-to-image generation. In Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, 2024.

    Google Scholar 

  17. D. Li, J. Li, S. C. H. Hoi. BLIP-diffusion: Pre-trained subject representation for controllable text-to-image generation and editing. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, USA, pp. 30146–30166, 2023.

    Google Scholar 

  18. Y. Alaluf, E. Richardson, G. Metzer, D. Cohen-Or. A neural space-time representation for text-to-image personalization. ACM Transactions on Graphics, vol. 42, no. 6, Article number 243, 2023. DOI: https://doi.org/10.1145/3618322.

  19. Y. Zhang, W. Dong, F. Tang, N. Huang, H. Huang, C. Ma, T. Y. Lee, O. Deussen, C. Xu. ProSpect: Prompt spectrum for attribute-aware personalization of diffusion models. ACM Transactions on Graphics, vol. 42, no. 6, Article number 244, 2023. DOI: https://doi.org/10.1145/3618342.

  20. Y. Gu, X. Wang, J. Z. Wu, Y. Shi, Y. Chen, Z. Fan, W. Xiao, R. Zhao, S. Chang, W. Wu, Y. Ge, Y. Shan, M. Z. Shou. Mix-of-show: Decentralized low-rank adaptation for multi-concept customization of diffusion models. In Proceedings of the 37th International Conference on Neural Information Processing System, New Orleans, USA, Article number 699, 2023.

    Google Scholar 

  21. Z. Liu, Y. Zhang, Y. Shen, K. Zheng, K. Zhu, R. Feng, Y. Liu, D. Zhao, J. Zhou, Y. Cao. Cones 2: Customizable image synthesis with multiple subjects. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, USA, Article number 2508, 2023.

    Google Scholar 

  22. K. Sohn, N. Ruiz, K. Lee, D. C. Chin, I. Blok, H. Chang, J. Barber, L. Jiang, G. Entis, Y. Li, Y. Hao, I. Essa, M. Rubinstein, D. Krishnan. StyleDrop: Text-to-image generation in any style. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, USA, Article number 2920, 2023.

    Google Scholar 

  23. G. Yuan, X. Cun, Y. Zhang, M. Li, C. Qi, X. Wang, Y. Shan, H. Zheng. Inserting anybody in diffusion models via celeb basis. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, USA, Article number 3190, 2023.

    Google Scholar 

  24. D. Valevski, D. Lumen, Y. Matias, Y. Leviathan. Face0: Instantaneously conditioning a text-to-image model on a face. In Proceedings of SIGGRAPH Asia Conference Papers, Sydney, Australia, Article number 94, 2023. DOI: https://doi.org/10.1145/3610548.3618249.

    Google Scholar 

  25. M. Arar, R. Gal, Y. Atzmon, G. Chechik, D. Cohen-Or, A. Shamir, A. H. Bermano. Domain-agnostic tuning-encoder for fast personalization of text-to-image models. In Proceedings of SIGGRAPH Asia Conference Papers, Sydney, Australia, Article number 72, 2023. DOI: https://doi.org/10.1145/3610548.3618173.

    Google Scholar 

  26. N. Ruiz, Y. Li, V. Jampani, W. Wei, T. Hou, Y. Pritch, N. Wadhwa, M. Rubinstein, K. Aberman. Hyper Dream-Booth: HyperNetworks for fast personalization of text-to-image models. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 6527–6536, 2024. DOI: https://doi.org/10.1109/CV-PR52733.2024.00624.

    Google Scholar 

  27. J. Ma, J. Liang, C. Chen, H. Lu. Subject-diffusion: Open domain personalized text-to-image generation without test-time fine-tuning. In Proceedings of ACM SIGGRAPH Conference Papers, Denver, USA, Article number 25, 2024. DOI: https://doi.org/10.1145/3641519.3657469.

    Google Scholar 

  28. H. Ye, J. Zhang, S. Liu, X. Han, W. Yang. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models, [Online], Available: https://arxiv.org/abs/2308.06721, 2023.

    Google Scholar 

  29. Z. Wang, X. Wang, L. Xie, Z. Qi, Y. Shan, W. Wang, P. Luo. StyleAdapter: A unified stylized image generation model. International Journal of Computer Vision, vol. 133, no. 4, pp. 1894–1911, 2025. DOI: https://doi.org/10.1007/s11263-024-02253-x.

    Article  Google Scholar 

  30. X. Pan, L. Dong, S. Huang, Z. Peng, W. Chen, F. Wei. Kosmos-G: Generating images in context with multimodal large language models. In Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, 2024.

    Google Scholar 

  31. S. Motamed, D. P. Paudel, L. Van Gool. LEGO: Learning to disentangle and invert personalized concepts beyond object appearance in text-to-image diffusion models. In Proceedings of the 18th European Computer Vision Association, Milan, Italy, pp. 116–133, 2025. DOI: https://doi.org/10.1007/978-3-031-72633-0_7.

    Google Scholar 

  32. Y. Yan, C. Zhang, R. Wang, Y. Zhou, G. Zhang, P. Cheng, G. Yu, B. Fu. FaceStudio: Put your face everywhere in seconds, [Online], Available: https://arxiv.org/abs/2312.02663, 2023.

    Google Scholar 

  33. Z. Li, M. Cao, X. Wang, Z. Qi, M. M. Cheng, Y. Shan. PhotoMaker: Customizing realistic human photos via stacked ID embedding. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 8640–8650, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00825.

    Google Scholar 

  34. X. Peng, J. Zhu, B. Jiang, Y. Tai, D. Luo, J. Zhang, W. Lin, T. Jin, C. Wang, R. Ji. PortraitBooth: A versatile portrait model for fast identity-preserved personalization. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 27070–27080, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.02557.

    Google Scholar 

  35. X. Zhang, X. Y. Wei, J. Wu, T. Zhang, Z. Zhang, Z. Lei, Q. Li. Compositional inversion for stable diffusion models. In Proceedings of the 38th AAAI Conference on Artificial Intelligence, Vancouver, Canada, pp. 7350–7358, 2024. DOI: https://doi.org/10.1609/aaai.v38i7.28565.

    Google Scholar 

  36. Q. Sun, Y. Cui, X. Zhang, F. Zhang, Q. Yu, Y. Wang, Y. Rao, J. Liu, T. Huang, X. Wang. Generative multimodal models are in-context learners. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 14398–14409, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.01365.

    Google Scholar 

  37. L. Pang, J. Yin, H. Xie, Q. Wang, Q. Li, X. Mao. Cross initialization for personalized text-to-image generation, [Online], Available: https://arxiv.org/abs/2312.15905, 2023.

    Google Scholar 

  38. Z. Kong, Y. Zhang, T. Yang, T. Wang, K. Zhang, B. Wu, G. Chen, W. Liu, W. Luo. OMG: Occlusion-friendly personalized multi-concept generation in diffusion models. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 253–270, 2025. DOI: https://doi.org/10.1007/978-3-031-72751-1_15.

    Google Scholar 

  39. X. Zhang, W. Zhang, X. Wei, J. Wu, Z. Zhang, Z. Lei, Q. Li. Generative active learning for image synthesis personalization. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, Australia, pp. 10669–10677, 2024. DOI: https://doi.org/10.1145/3664647.3680773.

    Chapter  Google Scholar 

  40. Y. Zhang, Y. Song, J. Liu, R. Wang, J. Yu, H. Tang, H. Li, X. Tang, Y. Hu, H. Pan, Z. Jiang. SSR-encoder: Encoding selective subject representation for subject-driven generation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 8069–8078, 2024. DOI: https://doi.org/10.1109/CV-PR52733.2024.00771.

    Google Scholar 

  41. G. Zhang, K. Sohn, M. Hahn, H. Shi, I. Essa. FineStyle: Fine-grained controllable style personalization for text-to-image models. In Proceedings of the 38th International Conference on Neural Information Processing Systems, Vancouver, Canada, pp. 52937–52961, 2024.

    Google Scholar 

  42. Z. Dong, P. Wei, L. Lin. DreamArtist++: Controllable one-shot text-to-image generation via positive-negative adapter, [Online], Available: https://arxiv.org/abs/2211.11337, 2025.

    Google Scholar 

  43. A. Voronov, M. Khoroshikh, A. Babenko, M. Ryabinin. Is this loss informative? Faster text-to-image customization by tracking objective dynamics. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, USA, Article number 1630, 2023.

    Google Scholar 

  44. I. Han, S. Yang, T. Kwon, J. C. Ye. Highly personalized text embedding for image manipulation by stable diffusion, [Online], Available: https://arxiv.org/abs/2303.08767, 2023.

    Google Scholar 

  45. C. Xiang, F. Bao, C. Li, H. Su, J. Zhu. A closer look at parameter-efficient tuning in diffusion models, [Online], Available: https://arxiv.org/abs/2303.18181, 2023.

    Google Scholar 

  46. X. Jia, Y. Zhao, K. C. K. Chan, Y. Li, H. Zhang, B. Gong, T. Hou, H. Wang, Y. C. Su. Taming encoder for zero fine-tuning image customization with text-to-image diffusion models, [Online], Available: https://arxiv.org/abs/2304.02642, 2023.

    Google Scholar 

  47. J. Yang, H. Wang, Y. Zhang, R. Xiao, S. Wu, G. Chen, J. Zhao. Controllable textual inversion for personalized text-to-image generation, [Online], Available: https://arxiv.org/abs/2304.05265, 2023.

    Google Scholar 

  48. Z. Fei, M. Fan, J. Huang. Gradient-free textual inversion. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, Canada, pp. 1364–1373, 2023. DOI: https://doi.org/10.1145/3581783.3612599.

    Chapter  Google Scholar 

  49. R. Zhang, Z. Jiang, Z. Guo, S. Yan, J. Pan, H. Dong, Y. Qiao, P. Gao, H. Li. Personalize segment anything model with one shot. In Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, 2024.

    Google Scholar 

  50. O. Avrahami, K. Aberman, O. Fried, D. Cohen-Or, D. Lischinski. Break-a-scene: Extracting multiple concepts from a single image. In Proceedings of SIGGRAPH Asia Conference Papers, Sydney, Australia, Article number 96, 2023. DOI: https://doi.org/10.1145/3610548.3618154.

    Google Scholar 

  51. J. Xiao, M. Yin, Y. Gong, X. Zang, J. Ren, B. Yuan. COMCAT: Towards efficient compression and customization of attention-based vision models. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, USA, Article number 1587, 2023.

    Google Scholar 

  52. S. Hao, K. Han, S. Zhao, K. Y. K. Wong. ViCo: Plug-and-play visual condition for personalized text-to-image generation, [Online], Available: https://arxiv.org/abs/2306.00971, 2023.

    Google Scholar 

  53. Z. Qiu, W. Liu, H. Feng, Y. Xue, Y. Feng, Z. Liu, D. Zhang, A. Weller, B. Schölkopf. Controlling text-to-image diffusion by orthogonal finetuning. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, USA, pp. 79320–79362, 2023.

    Google Scholar 

  54. S. Y. Yeh, Y. G. Hsieh, Z. Gao, B. B. W. Yang, G. Oh, Y. Gong. Navigating text-to-image customization: From LyCORIS fine-tuning to model evaluation, [Online], Available: https://arxiv.org/abs/2309.14859, 2024.

    Google Scholar 

  55. X. He, Z. Cao, N. Kolkin, L. Yu, K. Wan, H. Rhodin, R. Kalarot. A data perspective on enhanced identity preservation for diffusion personalization. In Proceedings of IEEE/CVF Winter Conference on Applications of Computer Vision, Tucson, USA, pp. 3782–3791, 2025. DOI: https://doi.org/10.1109/WACV61041.2025.00372.

    Google Scholar 

  56. A. Roy, M. Suin, A. Shah, K. Shah, J. Liu, R. Chellappa. DIFFNAT: Improving diffusion image quality using natural image statistics, [Online], Available: https://arxiv.org/abs/2311.09753, 2023.

    Google Scholar 

  57. A. Agarwal, S. Karanam, T. Shukla, B. V. Srinivasan. An image is worth multiple words: Multi-attribute inversion for constrained text-to-image synthesis. In Proceedings of IEEE/CVF Winter Conference on Applications of Computer Vision, Tucson, USA, pp. 6053–6062, 2025. DOI: https://doi.org/10.1109/WACV61041.2025.00590.

    Google Scholar 

  58. R. Zhao, M. Zhu, S. Dong, D. Cheng, N. Wang, X. Gao. CatVersion: Concatenating embeddings for diffusion-based text-to-image personalization. IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 6, pp. 6047–6058, 2025. DOI: https://doi.org/10.1109/TC-SVT.2025.3531917.

    Article  Google Scholar 

  59. M. Safaee, A. Mikaeili, O. Patashnik, D. Cohen-Or, A. Mahdavi-Amiri. CLiC: Concept learning in context. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 6924–6933, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00661.

    Google Scholar 

  60. Z. Wang, W. Wei, Y. Zhao, Z. Xiao, M. Hasegawa-Johnson, H. Shi, T. Hou. HiFi tuner: High-fidelity subject-driven fine-tuning for diffusion models, [Online], Available: https://arxiv.org/abs/2312.00079, 2023.

    Google Scholar 

  61. D. Chae, N. Park, J. Kim, K. Lee. InstructBooth: Instruction-following personalized text-to-image generation, [Online], Available: https://arxiv.org/abs/2312.03011, 2024.

    Google Scholar 

  62. Y. Cai, Y. Wei, Z. Ji, J. Bai, H. Han, W. Zuo. Decoupled textual embeddings for customized image generation. In Proceedings of the 38th AAAI Conference on Artificial Intelligence, Vancouver, Canada, pp. 909–917, 2024. DOI: https://doi.org/10.1609/aaai.v38i2.27850.

    Google Scholar 

  63. B. N. Zhao, Y. Xiao, J. Xu, X. Jiang, Y. Yang, D. Li, L. Itti, V. Vineet, Y. Ge. DreamDistribution: Learning prompt distribution for diverse in-distribution generation, [Online], Available: https://arxiv.org/abs/2312.14216, 2025.

    Google Scholar 

  64. M. Hua, J. Liu, F. Ding, W. Liu, J. Wu, Q. He. Dream-Tuner: Single image is enough for subject-driven generation, [Online], Available: https://arxiv.org/abs/2312.13691, 2023.

    Google Scholar 

  65. J. Lu, C. Xie, H. Guo. Object-driven one-shot fine-tuning of text-to-image diffusion with prototypical embedding, [Online], Available: https://arxiv.org/abs/2401.15708, 2024.

    Google Scholar 

  66. W. Zeng, Y. Yan, Q. Zhu, Z. Chen, P. Chu, W. Zhao, X. Yang. Infusion: Preventing customized text-to-image diffusion from overfitting. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, Australia, pp. 3568–3577, 2024. DOI: https://doi.org/10.1145/3664647.3680894.

    Chapter  Google Scholar 

  67. M. Jones, S. Y. Wang, N. Kumari, D. Bau, J. Y. Zhu. Customizing text-to-image models with a single image pair, [Online], Available: https://arxiv.org/abs/2405.01536, 2024.

    Book  Google Scholar 

  68. H. Chen, Y. Zhang, X. Wang, X. Duan, Y. Zhou, W. Zhu. DisenDreamer: Subject-driven text-to-image generation with sample-aware disentangled tuning. IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 8, pp. 6860–6873, 2024. DOI: https://doi.org/10.1109/TCSVT.2024.3369757.

    Article  Google Scholar 

  69. Z. Fan, Z. Yin, G. Li, Y. Zhan, H. Zheng. Dream-Booth++: Boosting subject-driven generation via region-level references packing. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, Australia, pp. 11013–11021, 2024. DOI: https://doi.org/10.1145/3664647.3680734.

    Chapter  Google Scholar 

  70. F. Wu, Y. Pang, J. Zhang, L. Pang, J. Yin, B. Zhao, Q. Li, X. Mao. CoRe: Context-regularized text embedding learning for text-to-image personalization, [Online], Available: https://arxiv.org/abs/2408.15914, 2024.

    Google Scholar 

  71. J. Jin, Y. Shen, Z. Fu, J. Yang. Customized generation reimagined: Fidelity and editability harmonized. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 410–426, 2025. DOI: https://doi.org/10.1007/978-3-031-72973-7_24.

    Google Scholar 

  72. W. Chen, H. Hu, C. Saharia, W. W. Cohen. Re-imagen: Retrieval-augmented text-to-image generator. In Proceedings of the 11th International Conference on Learning Representations, Kigali, Rwanda, 2023.

    Google Scholar 

  73. X. Xu, Z. Wang, E. Zhang, K. Wang, H. Shi. Versatile diffusion: Text, images and variations all in one diffusion model. In Proceedings of IEEE/CVF International Conference on Computer Vision, Paris, France, pp. 7720–7731, 2023. DOI: https://doi.org/10.1109/ICCV51070.2023.00713.

    Google Scholar 

  74. R. Gal, M. Arar, Y. Atzmon, A. H. Bermano, G. Chechik, D. Cohen-Or. Encoder-based domain tuning for fast personalization of text-to-image models. ACM Transactions on Graphics, vol. 42, no. 4, Article number 150, 2023. DOI: https://doi.org/10.1145/3592133.

  75. Y. Ma, H. Yang, W. Wang, J. Fu, J. Liu. Unified multi-modal latent diffusion for joint subject and text conditional image generation, [Online], Available: https://arxiv.org/abs/2303.09319, 2023.

    Google Scholar 

  76. M. Kim, J. Yoo, S. Kwon. Personalized text-to-image model enhancement strategies: SOD preprocessing and CNN local feature integration. Electronics, vol. 12, no. 22, Article number 4707, 2023. DOI: https://doi.org/10.3390/electronics12224707.

  77. Y. Zhou, R. Zhang, J. Gu, T. Sun. Customization assistant for text-to-image generation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 9182–9191, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00877.

    Google Scholar 

  78. J. Pan, H. Yan, J. H. Liew, J. Feng, V. Y. F. Tan. Towards accurate guided diffusion sampling through symplectic adjoint method, [Online], Available: https://arxiv.org/abs/2312.12030, 2023.

    Google Scholar 

  79. S. Purushwalkam, A. Gokul, S. Joty, N. Naik. Boot-PIG: Bootstrapping zero-shot personalized image generation capabilities in pretrained diffusion models. In Proceedings of Computer Vision, Milan, Italy, pp. 252–269, 2025. DOI: https://doi.org/10.1007/978-3-031-91907-7_15.

    Google Scholar 

  80. N. Chen, M. Huang, Z. Chen, Y. Zheng, L. Zhang, Z. Mao. CustomContrast: A multilevel contrastive perspective for subject-driven text-to-image customization. In Proceedings of the 39th AAAI Conference on Artificial Intelligence, Philadelphia, USA, pp. 2123–2131, 2025. DOI: https://doi.org/10.1609/aaai.v39i2.32210.

    Google Scholar 

  81. K. Song, Y. Zhu, B. Liu, Q. Yan, A. Elgammal, X. Yang. MoMA: Multimodal LLM adapter for fast personalized image generation. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 117–132, 2025. DOI: https://doi.org/10.1007/978-3-031-73661-2_7.

    Google Scholar 

  82. C. Shin, J. Choi, H. Kim, S. Yoon. Large-scale text-to-image model with inpainting is a zero-shot subject-driven image generator, [Online], Available: https://arxiv.org/abs/2411.15466, 2024.

    Google Scholar 

  83. J. Park, B. Ko, H. Jang. StyleBoost: A study of personalizing text-to-image generation in any style using dreambooth. In Proceedings of the 14th International Conference on Information and Communication Technology Convergence, Jeju Island, Republic of Korea, pp. 93–98, 2023. DOI: https://doi.org/10.1109/ICTC58733.2023.10392676.

    Google Scholar 

  84. A. Hertz, A. Voynov, S. Fruchter, D. Cohen-Or. Style aligned image generation via shared attention. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 4775–4785, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00457.

    Google Scholar 

  85. J. Park, B. Ko, H. Jang. Text-to-image synthesis for any artistic styles: Advancements in personalized artistic image generation via subdivision and dual binding, [Online], Available: https://arxiv.org/html/2404.05256v1, 2024.

    Google Scholar 

  86. J. Choi, C. Shin, Y. Oh, H. Kim, J. Lee, S. Yoon. Style-friendly SNR sampler for style-driven generation, [Online], Available: https://arxiv.org/abs/2411.14793, 2024.

    Google Scholar 

  87. D. Y. Chen, H. Tennent, C. W. Hsu. ArtAdapter: Text-to-image style transfer using multi-level style encoder and explicit adaptation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 8619–8628, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00823

    Google Scholar 

  88. Y. Zhou, R. Zhang, T. Sun, J. Xu. Enhancing detail preservation for customized text-to-image generation: A regularization-free approach, [Online], Available: https://arxiv.org/abs/2305.13579, 2023.

    Google Scholar 

  89. S. Banerjee, G. Mittal, A. Joshi, C. Hegde, N. Memon. Identity-preserving aging of face images via latent diffusion models. In Proceedings of IEEE International Joint Conference on Biometrics, Ljubljana, Slovenia, 2023. DOI: https://doi.org/10.1109/IJCB57857.2023.10448860.

    Google Scholar 

  90. J. Hyung, J. Shin, J. Choo. MagiCapture: High-resolution multi-concept portrait customization, [Online], Available: https://arxiv.org/abs/2309.06895, 2024.

    Google Scholar 

  91. P. Cao, L. Yang, F. Zhou, T. Huang, Q. Song. Concept-centric personalization with large-scale diffusion priors, [Online], Available: https://arxiv.org/html/2312.08195v1, 2023.

    Google Scholar 

  92. C. Kim, J. Lee, S. Joung, B. Kim, Y. M. Baek. Instant-Family: Masked attention for zero-shot multi-ID image generation, [Online], Available: https://arxiv.org/abs/2404.19427, 2024.

    Google Scholar 

  93. X. Li, J. Zhan, S. He, Y. Xu, J. Dong, H. Zhang, Y. Du. PersonaMagic: Stage-regulated high-fidelity face customization with tandem equilibrium, [Online], Available: https://arxiv.org/abs/2412.15674, 2024.

    Google Scholar 

  94. Y. C. Su, K. C. K. Chan, Y. Li, Y. Zhao, H. Zhang, B. Gong, H. Wang, X. Jia. Identity encoder for personalized diffusion, [Online], Available: https://arxiv.org/abs/2304.07429, 2023.

    Google Scholar 

  95. Z. Chen, S. Fang, W. Liu, Q. He, M. Huang, Y. Zhang, Z. Mao. DreamIdentity: Improved editability for efficient face-identity preserved image generation, [Online], Available: https://arxiv.org/abs/2307.00300, 2023.

  96. Y. Wang, W. Zhang, J. Zheng, C. Jin. High-fidelity person-centric subject-to-image synthesis. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 7675–7684, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00733.

  97. X. Li, X. Hou, C. C. Loy. When StyleGAN meets stable diffusion: A W+ adapter for personalized image generation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 2187–2196, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00213.

  98. J. Liu, H. Huang, C. Jin, R. He. Portrait diffusion: Training-free face stylization with chain-of-painting, [Online], Available: https://arxiv.org/abs/2312.02212, 2023.

  99. H. Tang, J. Deng, Z. Pan, H. Tian, P. Chaudhari, X. Zhou. RetriBooru: Leakage-free retrieval of conditions from reference images for subject-driven generation, [Online], Available: https://arxiv.org/abs/2312.02521, 2024.

  100. J. Xu, S. Motamed, P. Vaddamanu, C. H. Wu, C. Haene, J. C. Bazin, F. De La Torre. Personalized face inpainting with diffusion models by parallel visual attention. In Proceedings of IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, USA, pp. 5420–5430, 2024. DOI: https://doi.org/10.1109/WACV57701.2024.00535.

  101. D. Y. Chen, A. K. Bhunia, S. Koley, A. Sain, P. N. Chowdhury, Y. Z. Song. DemoCaricature: Democratising caricature generation with a rough sketch. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 8629–8639, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00824.

  102. P. Achlioptas, A. Benetatos, I. Fostiropoulos, D. Skourtis. Stellar: Systematic evaluation of human-centric personalized text-to-image methods, [Online], Available: https://arxiv.org/abs/2312.06116, 2023.

  103. W. Chen, J. Zhang, J. Wu, H. Wu, X. Xiao, L. Lin. ID-aligner: Enhancing identity-preserving text-to-image generation with reward feedback learning, [Online], Available: https://arxiv.org/abs/2404.15449, 2024.

  104. K. C. Wang, D. Ostashev, Y. Fang, S. Tulyakov, K. Aberman. MoA: Mixture-of-attention for subject-context disentanglement in personalized image generation, [Online], Available: https://arxiv.org/html/2404.11565v1, 2024.

  105. S. Cui, J. Guo, X. An, J. Deng, Y. Zhao, X. Wei, Z. Feng. IDAdapter: Learning mixed features for tuning-free personalization of text-to-image models. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, USA, pp. 950–959, 2024. DOI: https://doi.org/10.1109/CVPRW63382.2024.00100.

  106. Y. Wu, Z. Li, H. Zheng, C. Wang, B. Li. Infinite-ID: Identity-preserved personalization via ID-semantics decoupling paradigm. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 279–296, 2025. DOI: https://doi.org/10.1007/978-3-031-73242-3_16.

  107. K. Shiohara, T. Yamasaki. Face2Diffusion for fast and editable face personalization. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 6850–6859, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00654.

  108. Y. Cai, Z. Jiang, Y. Liu, C. Jiang, W. Xue, W. Luo, Y. Guo. Foundation cures personalization: Recovering facial personalized models’ prompt consistency, [Online], Available: https://arxiv.org/abs/2411.15277v1, 2024.

  109. G. Qian, K. C. Wang, O. Patashnik, N. Heravi, D. Ostashev, S. Tulyakov, D. Cohen-Or, K. Aberman. Omni-ID: Holistic identity representation designed for generative tasks, [Online], Available: https://arxiv.org/html/2412.09694v1, 2024.

  110. Y. Li, H. Liu, Y. Wen, Y. J. Lee. Generate anything anywhere in any scene, [Online], Available: https://arxiv.org/abs/2306.17154, 2023.

  111. T. Rahman, S. Mahajan, H. Y. Lee, J. Ren, S. Tulyakov, L. Sigal. Visual concept-driven image generation with text-to-image diffusion model, [Online], Available: https://arxiv.org/abs/2402.11487, 2025.

  112. J. Jiang, Y. Zhang, K. Feng, X. Wu, W. Li, R. Pei, F. Li, W. Zuo. MC2: Multi-concept guidance for customized multi-concept generation, [Online], Available: https://arxiv.org/abs/2404.05268, 2024.

  113. C. Zhu, K. Li, Y. Ma, C. He, X. Li. MultiBooth: Towards generating all your concepts in an image from text. In Proceedings of the 39th AAAI Conference on Artificial Intelligence, Philadelphia, USA, pp. 10923–10931, 2025. DOI: https://doi.org/10.1609/aaai.v39i10.33187.

  114. H. Matsuda, R. Togo, K. Maeda, T. Ogawa, M. Haseyama. Multi-object editing in personalized text-to-image diffusion model via segmentation guidance. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Seoul, Republic of Korea, pp. 8140–8144, 2024. DOI: https://doi.org/10.1109/ICASSP48485.2024.10447048.

  115. D. Zhou, J. Huang, J. Bai, J. Wang, H. Chen, G. Chen, X. Hu, P. A. Heng. MagicTailor: Component-controllable personalization in text-to-image diffusion models, [Online], Available: https://arxiv.org/abs/2410.13370, 2024.

  116. X. Chen, L. Huang, Y. Liu, Y. Shen, D. Zhao, H. Zhao. AnyDoor: Zero-shot object-level image customization. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 6593–6602, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00630.

  117. Z. Yuan, M. Cao, X. Wang, Z. Qi, C. Yuan, Y. Shan. CustomNet: Zero-shot object customization with variable-viewpoints in text-to-image diffusion models, [Online], Available: https://arxiv.org/abs/2310.19784, 2023.

  118. D. Zhou, Y. Li, F. Ma, Z. Yang, Y. Yang. MIGC: Multi-instance generation controller for text-to-image synthesis. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 6818–6828, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00651.

  119. M. Patel, S. Jung, C. Baral, Y. Yang. λ-ECLIPSE: Multi-concept personalized text-to-image diffusion models by leveraging CLIP latent space, [Online], Available: https://arxiv.org/abs/2402.05195, 2024.

  120. X. Wang, S. Fu, Q. Huang, W. He, H. Jiang. MS-Diffusion: Multi-subject zero-shot image personalization with layout guidance. In Proceedings of the 13th International Conference on Learning Representations, Singapore, 2025.

  121. Z. Huang, T. Wu, Y. Jiang, K. C. K. Chan, Z. Liu. ReVersion: Diffusion-based relation inversion from images. In Proceedings of SIGGRAPH Asia Conference Papers, Tokyo, Japan, Article number 4, 2024. DOI: https://doi.org/10.1145/3680528.3687658.

  122. G. Zhang, Y. Qian, J. Deng, X. Cai. Inv-ReVersion: Enhanced relation inversion based on text-to-image diffusion models. Applied Sciences, vol. 14, no. 8, Article number 3338, 2024. DOI: https://doi.org/10.3390/app14083338.

  123. Z. Xu, S. Hao, K. Han. CusConcept: Customized visual concept decomposition with diffusion models. In Proceedings of IEEE/CVF Winter Conference on Applications of Computer Vision, Tucson, USA, pp. 3678–3687, 2025. DOI: https://doi.org/10.1109/WACV61041.2025.00362.

  124. S. Huang, B. Gong, Y. Feng, X. Chen, Y. Fu, Y. Liu, D. Wang. Learning disentangled identifiers for action-customized text-to-image generation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 7797–7806, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00745.

  125. J. Gu, Y. Wang, N. Zhao, T. J. Fu, W. Xiong, Q. Liu, Z. Zhang, H. Zhang, J. Zhang, H. Jung, X. E. Wang. Photoswap: Personalized subject swapping in images. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, USA, Article number 1529, 2023.

  126. S. Zhang, M. Ni, S. Chen, L. Wang, W. Ding, Y. Liu. A two-stage personalized virtual try-on framework with shape control and texture guidance. IEEE Transactions on Multimedia, vol. 26, pp. 10225–10236, 2024. DOI: https://doi.org/10.1109/TMM.2024.3405718.

  127. M. Chen, I. Laina, A. Vedaldi. Training-free layout control with cross-attention guidance. In Proceedings of IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, USA, pp. 5331–5341, 2024. DOI: https://doi.org/10.1109/WACV57701.2024.00526.

  128. J. Gu, N. Zhao, W. Xiong, Q. Liu, Z. Zhang, H. Zhang, J. Zhang, H. Jung, Y. Wang, X. E. Wang. SwapAnything: Enabling arbitrary object swapping in personalized visual editing, [Online], Available: https://arxiv.org/abs/2404.05717, 2024.

  129. N. Kumari, G. Su, R. Zhang, T. Park, E. Shechtman, J. Y. Zhu. Customizing text-to-image diffusion with object viewpoint control, [Online], Available: https://arxiv.org/abs/2404.12333, 2024.

  130. X. Xu, J. Guo, Z. Wang, G. Huang, I. Essa, H. Shi. Prompt-free diffusion: Taking “text” out of text-to-image diffusion models. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 8682–8692, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00829.

  131. S. Zhao, D. Chen, Y. C. Chen, J. Bao, S. Hao, L. Yuan, K. Y. K. Wong. Uni-ControlNet: All-in-one control to text-to-image diffusion models. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, USA, pp. 11127–11150, 2023.

  132. S. Y. Cheong, A. Mustafa, A. Gilbert. ViscoNet: Bridging and harmonizing visual and textual conditioning for ControlNet, [Online], Available: https://arxiv.org/abs/2312.03154, 2024.

  133. I. Najdenkoska, A. Sinha, A. Dubey, D. Mahajan, V. Ramanathan, F. Radenovic. Context diffusion: In-context aware image generation. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 375–391, 2024. DOI: https://doi.org/10.1007/978-3-031-72980-5_22.

  134. S. Mo, F. Mu, K. H. Lin, Y. Liu, B. Guan, Y. Li, B. Zhou. FreeControl: Training-free spatial control of any text-to-image diffusion model with any condition. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 7465–7475, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00713.

  135. P. Li, Q. Nie, Y. Chen, X. Jiang, K. Wu, Y. Lin, Y. Liu, J. Peng, C. Wang, F. Zheng. Tuning-free image customization with image and text guidance. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 233–250, 2025. DOI: https://doi.org/10.1007/978-3-031-73116-7_14.

  136. J. Z. Wu, Y. Ge, X. Wang, S. W. Lei, Y. Gu, Y. Shi, W. Hsu, Y. Shan, X. Qie, M. Z. Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of IEEE/CVF International Conference on Computer Vision, Paris, France, pp. 7623–7633, 2023. DOI: https://doi.org/10.1109/ICCV51070.2023.00701.

  137. P. Esser, J. Chiu, P. Atighehchian, J. Granskog, A. Germanidis. Structure and content-guided video synthesis with diffusion models. In Proceedings of IEEE/CVF International Conference on Computer Vision, Paris, France, pp. 7346–7356, 2023. DOI: https://doi.org/10.1109/ICCV51070.2023.00675.

  138. Y. Zhao, E. Xie, L. Hong, Z. Li, G. H. Lee. Make-A-Protagonist: Generic video editing with an ensemble of experts, [Online], Available: https://ar5iv.labs.arxiv.org/html/2305.08850, 2023.

  139. Y. He, M. Xia, H. Chen, X. Cun, Y. Gong, J. Xing, Y. Zhang, X. Wang, C. Weng, Y. Shan, Q. Chen. Animate-a-story: Storytelling with retrieval-augmented video generation, [Online], Available: https://arxiv.org/abs/2307.06940, 2023.

  140. R. Zhao, Y. Gu, J. Z. Wu, D. J. Zhang, J. W. Liu, W. Wu, J. Keppo, M. Z. Shou. MotionDirector: Motion customization of text-to-video diffusion models. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 273–290, 2025. DOI: https://doi.org/10.1007/978-3-031-72992-8_16.

  141. R. Wu, L. Chen, T. Yang, C. Guo, C. Li, X. Zhang. LAMP: Learn a motion pattern for few-shot video generation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 7089–7098, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00677.

  142. H. Jeong, G. Y. Park, J. C. Ye. VMC: Video motion customization using temporal attention adaption for text-to-video diffusion models. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 9212–9221, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00880.

  143. Y. Song, W. Shin, J. Lee, J. Kim, N. Kwak. SAVE: Protagonist diversification with structure agnostic video editing. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 41–57, 2025. DOI: https://doi.org/10.1007/978-3-031-72989-8_3.

  144. J. Materzynska, J. Sivic, E. Shechtman, A. Torralba, R. Zhang, B. Russell. NewMove: Customizing text-to-video models with novel motions, [Online], Available: https://arxiv.org/abs/2312.04966, 2023.

  145. Y. Wei, S. Zhang, Z. Qing, H. Yuan, Z. Liu, Y. Liu, Y. Zhang, J. Zhou, H. Shan. DreamVideo: Composing your dream videos with customized subject and motion, [Online], Available: https://arxiv.org/abs/2312.04433, 2023.

  146. Y. Zhang, F. Tang, N. Huang, H. Huang, C. Ma, W. Dong, C. Xu. MotionCrafter: One-shot motion customization of diffusion models, [Online], Available: https://arxiv.org/abs/2312.05288, 2024.

  147. Y. Ren, Y. Zhou, J. Yang, J. Shi, D. Liu, F. Liu, M. Kwon, A. Shrivastava. Customize-a-video: One-shot motion customization of text-to-video diffusion models. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 332–349, 2025. DOI: https://doi.org/10.1007/978-3-031-73024-5_20.

  148. Z. Ma, D. Zhou, C. H. Yeh, X. S. Wang, X. Li, H. Yang, Z. Dong, K. Keutzer, J. Feng. Magic-Me: Identity-specific video customized diffusion, [Online], Available: https://arxiv.org/abs/2402.09368, 2024.

  149. X. Bi, J. Lu, B. Liu, X. Cun, Y. Zhang, W. Li, B. Xiao. CustomTTT: Motion and appearance customized video generation via test-time training. In Proceedings of the 39th AAAI Conference on Artificial Intelligence, Philadelphia, USA, pp. 1871–1879, 2025. DOI: https://doi.org/10.1609/aaai.v39i2.32182.

  150. H. Li, H. Qiu, S. Zhang, X. Wang, Y. Wei, Z. Li, Y. Zhang, B. Wu, D. Cai. PersonalVideo: High ID-fidelity video customization without dynamic and semantic degradation, [Online], Available: https://arxiv.org/abs/2411.17048, 2024.

  151. Z. Wang, A. Li, L. Zhu, Y. Guo, Q. Dou, Z. Li. CustomVideo: Customizing text-to-video generation with multiple subjects, [Online], Available: https://arxiv.org/abs/2401.09962, 2024.

  152. H. Chen, X. Wang, G. Zeng, Y. Zhang, Y. Zhou, F. Han, Y. Wu, W. Zhu. VideoDreamer: Customized multi-subject text-to-video generation with disen-mix finetuning on language-video foundation models, [Online], Available: https://arxiv.org/abs/2311.00990, 2023.

  153. H. Zhao, T. Lu, J. Gu, X. Zhang, Q. Zheng, Z. Wu, H. Xu, Y. G. Jiang. MagDiff: Multi-alignment diffusion for high-fidelity video generation and editing. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 205–221, 2025. DOI: https://doi.org/10.1007/978-3-031-72649-1_12.

  154. Y. Jiang, T. Wu, S. Yang, C. Si, D. Lin, Y. Qiao, C. C. Loy, Z. Liu. VideoBooth: Diffusion-based video generation with image prompts. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 6689–6700, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00639.

  155. T. Wu, Y. Zhang, X. Cun, Z. Qi, J. Pu, H. Dou, G. Zheng, Y. Shan, X. Li. VideoMaker: Zero-shot customized video generation with the inherent force of video diffusion models, [Online], Available: https://arxiv.org/abs/2412.19645, 2024.

  156. Y. Zhou, R. Zhang, J. Gu, N. Zhao, J. Shi, T. Sun. SUGAR: Subject-driven video customization in a zero-shot manner, [Online], Available: https://arxiv.org/abs/2412.10533, 2024.

  157. G. Liu, M. Xia, Y. Zhang, H. Chen, J. Xing, Y. Wang, X. Wang, Y. Shan, Y. Yang. StyleCrafter: Taming artistic video diffusion with reference-augmented adapter learning. ACM Transactions on Graphics, vol. 43, no. 6, Article number 251, 2024. DOI: https://doi.org/10.1145/3687975.

  158. X. He, Q. Liu, S. Qian, X. Wang, T. Hu, K. Cao, K. Yan, J. Zhang. ID-animator: Zero-shot identity-preserving human video generation, [Online], Available: https://arxiv.org/abs/2404.15275, 2024.

  159. S. Yuan, J. Huang, X. He, Y. Ge, Y. Shi, L. Chen, J. Luo, L. Yuan. Identity-preserving text-to-video generation by frequency decomposition, [Online], Available: https://arxiv.org/abs/2411.17440, 2024.

  160. H. Fang, D. Qiu, B. Mao, P. Yan, H. Tang. MotionCharacter: Identity-preserving and motion controllable human video generation, [Online], Available: https://arxiv.org/abs/2411.18281, 2024.

  161. M. Feng, J. Liu, K. Yu, Y. Yao, Z. Hui, X. Guo, X. Lin, H. Xue, C. Shi, X. Li, A. Li, X. Kang, B. Lei, M. Cui, P. Ren, X. Xie. DreaMoving: A human video generation framework based on diffusion models, [Online], Available: https://arxiv.org/abs/2312.05107, 2023.

  162. C. H. Lin, J. Gao, L. Tang, T. Takikawa, X. Zeng, X. Huang, K. Kreis, S. Fidler, M. Y. Liu, T. Y. Lin. Magic3D: High-resolution text-to-3D content creation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, pp. 300–309, 2023. DOI: https://doi.org/10.1109/CVPR52729.2023.00037.

  163. A. Raj, S. Kaza, B. Poole, M. Niemeyer, N. Ruiz, B. Mildenhall, S. Zada, K. Aberman, M. Rubinstein, J. Barron, Y. Li, V. Jampani. DreamBooth3D: Subject-driven text-to-3D generation. In Proceedings of IEEE/CVF International Conference on Computer Vision, Paris, France, pp. 2349–2359, 2023. DOI: https://doi.org/10.1109/ICCV51070.2023.00223.

  164. S. Azadi, T. Hayes, A. Shah, G. Pang, D. Parikh, S. Gupta. Text-conditional contextualized avatars for zero-shot personalization, [Online], Available: https://arxiv.org/abs/2304.07410, 2023.

  165. C. Zhang, Y. Chen, Y. Fu, Z. Zhou, G. Yu, B. Wang, B. Fu, T. Chen, G. Lin, C. Shen. StyleAvatar3D: Leveraging image-text diffusion models for high-fidelity 3D avatar generation, [Online], Available: https://arxiv.org/abs/2305.19012, 2023.

  166. Y. Zeng, Y. Lu, X. Ji, Y. Yao, H. Zhu, X. Cao. AvatarBooth: High-quality and customizable 3D human avatar generation, [Online], Available: https://arxiv.org/abs/2306.09864, 2023.

  167. Y. Shi, P. Wang, J. Ye, L. Mai, K. Li, X. Yang. MVDream: Multi-view diffusion for 3D generation. In Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, 2024.

  168. Y. Ouyang, W. Chai, J. Ye, D. Tao, Y. Zhan, G. Wang. Chasing consistency in text-to-3D generation from a single image, [Online], Available: https://arxiv.org/abs/2309.03599, 2023.

  169. Y. Zhao, Z. Yan, E. Xie, L. Hong, Z. Li, G. H. Lee. Animate124: Animating one image to 4D dynamic scene, [Online], Available: https://arxiv.org/abs/2311.14603, 2023.

  170. Y. Zheng, X. Li, K. Nagano, S. Liu, O. Hilliges, S. De Mello. A unified approach for text-and image-guided 4D scene generation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 7300–7309, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00697.

  171. Y. Y. Yeh, J. B. Huang, C. Kim, L. Xiao, T. Nguyen-Phuoc, N. Khan, C. Zhang, M. Chandraker, C. S. Marshall, Z. Dong, Z. Li. TextureDreamer: Image-guided texture synthesis through geometry-aware diffusion. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 4304–4314, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00412.

  172. J. Zhuang, D. Kang, Y. P. Cao, G. Li, L. Lin, Y. Shan. TIP-editor: An accurate 3D editor following both text-prompts and image-prompts. ACM Transactions on Graphics, vol. 43, no. 4, Article number 121, 2024. DOI: https://doi.org/10.1145/3658205.

  173. T. Van Le, H. Phung, T. H. Nguyen, Q. Dao, N. N. Tran, A. Tran. Anti-DreamBooth: Protecting users from personalized text-to-image synthesis. In Proceedings of IEEE/CVF International Conference on Computer Vision, Paris, France, pp. 2116–2127, 2023. DOI: https://doi.org/10.1109/ICCV51070.2023.00202.

  174. Y. Wu, J. Zhang, F. Kerschbaum, T. Zhang. Backdooring textual inversion for concept censorship, [Online], Available: https://arxiv.org/abs/2308.10718, 2023.

  175. Y. Huang, F. Juefei-Xu, Q. Guo, J. Zhang, Y. Wu, M. Hu, T. Li, G. Pu, Y. Liu. Personalization as a shortcut for few-shot backdoor attack against text-to-image diffusion models. In Proceedings of the 38th AAAI Conference on Artificial Intelligence, Vancouver, Canada, pp. 21169–21178, 2024. DOI: https://doi.org/10.1609/aaai.v38i19.30110.

  176. J. S. Smith, Y. C. Hsu, L. Zhang, T. Hua, Z. Kira, Y. Shen, H. Jin. Continual diffusion: Continual customization of text-to-image diffusion with C-LoRA, [Online], Available: https://arxiv.org/abs/2304.06027, 2024.

  177. P. Zhang, N. Zhao, J. Liao. Text-guided vector graphics customization. In Proceedings of SIGGRAPH Asia Conference Papers, Sydney, Australia, Article number 54, 2023. DOI: https://doi.org/10.1145/3610548.3618232.

  178. H. Wang, X. Xiang, Y. Fan, J. H. Xue. Customizing 360-degree panoramas through text-to-image diffusion models. In Proceedings of IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, USA, pp. 4921–4931, 2024. DOI: https://doi.org/10.1109/WACV57701.2024.00486.

  179. K. Wang, F. Yang, B. Raducanu, J. van de Weijer. Multi-class textual-inversion secretly yields a semantic-agnostic classifier. In Proceedings of IEEE/CVF Winter Conference on Applications of Computer Vision, Tucson, USA, pp. 4400–4409, 2025. DOI: https://doi.org/10.1109/WACV61041.2025.00432.

  180. Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, B. Poole. Score-based generative modeling through stochastic differential equations. In Proceedings of the 9th International Conference on Learning Representations, 2021.

  181. T. Karras, M. Aittala, T. Aila, S. Laine. Elucidating the design space of diffusion-based generative models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, USA, pp. 26565–26577, 2022.

  182. J. Ho, A. Jain, P. Abbeel. Denoising diffusion probabilistic models. In Proceedings of the 34th International Conference on Neural Information Processing Systems, pp. 6840–6851, 2020.

  183. B. D. O. Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, vol. 12, no. 3, pp. 313–326, 1982. DOI: https://doi.org/10.1016/0304-4149(82)90051-5.

  184. C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, J. Zhu. DPM-solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. In Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, USA, pp. 5775–5787, 2022.

  185. C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, J. Zhu. DPM-solver++: Fast solver for guided sampling of diffusion probabilistic models. Machine Intelligence Research, vol. 22, no. 4, pp. 730–751, 2025. DOI: https://doi.org/10.1007/s11633-025-1562-4.

  186. D. Chen, Z. Zhou, C. Wang, C. Shen, S. Lyu. On the trajectory regularity of ODE-based diffusion sampling. In Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria, pp. 7905–7934, 2024.

  187. R. Rombach, A. Blattmann, D. Lorenz, P. Esser, B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, pp. 10684–10695, 2022. DOI: https://doi.org/10.1109/CVPR52688.2022.01042.

  188. L. Zhang, A. Rao, M. Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of IEEE/CVF International Conference on Computer Vision, Paris, France, pp. 3813–3824, 2023. DOI: https://doi.org/10.1109/ICCV51070.2023.00355.

  189. A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, M. Chen. Hierarchical text-conditional image generation with CLIP latents, [Online], Available: https://arxiv.org/abs/2204.06125, 2022.

  190. O. Ronneberger, P. Fischer, T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, pp. 234–241, 2015. DOI: https://doi.org/10.1007/978-3-319-24574-4_28.

  191. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, USA, pp. 6000–6010, 2017.

  192. J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li, L. Ouyang, J. Zhuang, J. Lee, Y. Guo, W. Manassra, P. Dhariwal, C. Chu, Y. Jiao, A. Ramesh. Improving image generation with better captions, [Online], Available: https://cdn.openai.com/papers/dall-e-3.pdf.

  193. A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, pp. 8748–8763, 2021.

  194. J. Li, D. Li, C. Xiong, S. C. H. Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, USA, pp. 12888–12900, 2022.

  195. J. Li, D. Li, S. Savarese, S. Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, USA, Article number 814, 2023.

  196. C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, P. Schramowski, S. Kundurthy, K. Crowson, L. Schmidt, R. Kaczmarczyk, J. Jitsev. LAION-5B: An open large-scale dataset for training next generation image-text models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, USA, Article number 1833, 2022.

  197. Y. Zheng, H. Yang, T. Zhang, J. Bao, D. Chen, Y. Huang, L. Yuan, D. Chen, M. Zeng, F. Wen. General facial representation learning in a visual-linguistic manner. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, pp. 18676–18688, 2022. DOI: https://doi.org/10.1109/CVPR52688.2022.01814.

  198. M. Hassanin, S. Anwar, I. Radwan, F. S. Khan, A. Mian. Visual attention methods in deep learning: An in-depth survey. Information Fusion, vol. 108, Article number 102417, 2024. DOI: https://doi.org/10.1016/j.inffus.2024.102417.

  199. A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W. Y. Lo, P. Dollár, R. Girshick. Segment anything. In Proceedings of IEEE/CVF International Conference on Computer Vision, Paris, France, pp. 3992–4003, 2023. DOI: https://doi.org/10.1109/ICCV51070.2023.00371.

  200. X. Y. Wei, Z. Q. Yang. Coaching the exploration and exploitation in active learning for interactive video retrieval. IEEE Transactions on Image Processing, vol. 22, no. 3, pp. 955–968, 2013. DOI: https://doi.org/10.1109/TIP.2012.2222902.

  201. X. Y. Wei, Z. Q. Yang. Coached active learning for interactive video search. In Proceedings of the 19th ACM International Conference on Multimedia, Scottsdale, USA, pp. 443–452, 2011. DOI: https://doi.org/10.1145/2072298.2072356.

  202. H. Yang, H. Zhu, Y. Wang, M. Huang, Q. Shen, R. Yang, X. Cao. FaceScape: A large-scale high quality 3D face dataset and detailed riggable 3D face prediction. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 598–607, 2020. DOI: https://doi.org/10.1109/CVPR42600.2020.00068.

  203. Q. Cao, L. Shen, W. Xie, O. M. Parkhi, A. Zisserman. VGGFace2: A dataset for recognising faces across pose and age. In Proceedings of the 13th IEEE International Conference on Automatic Face & Gesture Recognition, Xi’an, China, pp. 67–74, 2018. DOI: https://doi.org/10.1109/FG.2018.00020.

  204. Y. Wu, Q. Ji. Facial landmark detection: A literature survey. International Journal of Computer Vision, vol. 127, no. 2, pp. 115–142, 2019. DOI: https://doi.org/10.1007/s11263-018-1097-z.

  205. I. Adjabi, A. Ouahabi, A. Benzaoui, A. Taleb-Ahmed. Past, present, and future of face recognition: A review. Electronics, vol. 9, no. 8, Article number 1188, 2020. DOI: https://doi.org/10.3390/electronics9081188.

  206. T. Karras, S. Laine, T. Aila. A style-based generator architecture for generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 12, pp. 4217–4228, 2021. DOI: https://doi.org/10.1109/tpami.2020.2970919.

  207. E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen. LoRA: Low-rank adaptation of large language models. In Proceedings of the 10th International Conference on Learning Representations, 2022.

  208. J. Song, C. Meng, S. Ermon. Denoising diffusion implicit models. In Proceedings of the 9th International Conference on Learning Representations, 2021.

  209. T. Islam, A. Miron, X. Liu, Y. Li. Deep learning in virtual try-on: A comprehensive survey. IEEE Access, vol. 12, pp. 29475–29502, 2024. DOI: https://doi.org/10.1109/ACCESS.2024.3368612.

  210. Z. Xing, Q. Feng, H. Chen, Q. Dai, H. Hu, H. Xu, Z. Wu, Y. G. Jiang. A survey on video diffusion models. ACM Computing Surveys, vol. 57, no. 2, Article number 41, 2025. DOI: https://doi.org/10.1145/3696415.

  211. B. Poole, A. Jain, J. T. Barron, B. Mildenhall. DreamFusion: Text-to-3D using 2D diffusion, [Online], Available: https://arxiv.org/abs/2209.14988, 2022.

  212. B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, R. Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, vol. 65, no. 1, pp. 99–106, 2022. DOI: https://doi.org/10.1145/3503250.

  213. J. Xu, X. Liu, Y. Wu, Y. Tong, Q. Li, M. Ding, J. Tang, Y. Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, USA, Article number 700, 2023.

  214. X. Wu, K. Sun, F. Zhu, R. Zhao, H. Li. Human preference score: Better aligning text-to-image models with human preference. In Proceedings of IEEE/CVF International Conference on Computer Vision, Paris, France, pp. 2096–2105, 2023. DOI: https://doi.org/10.1109/ICCV51070.2023.00200.

  215. X. Wu, Y. Hao, K. Sun, Y. Chen, F. Zhu, R. Zhao, H. Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis, [Online], Available: https://arxiv.org/abs/2306.09341, 2023.

  216. Y. Kirstain, A. Polyak, U. Singer, S. Matiana, J. Penna, O. Levy. Pick-a-Pic: An open dataset of user preferences for text-to-image generation. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, USA, Article number 1594, 2023.

  217. M. Caron, H. Touvron, I. Misra, H. Jegou, J. Mairal, P. Bojanowski, A. Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of IEEE/CVF International Conference on Computer Vision, Montreal, Canada, pp. 9630–9640, 2021. DOI: https://doi.org/10.1109/ICCV48922.2021.00951.

  218. M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, USA, pp. 6629–6640, 2017.

  219. C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, pp. 2818–2826, 2016. DOI: https://doi.org/10.1109/CVPR.2016.308.

  220. Z. Liu, P. Luo, X. Wang, X. Tang. Deep learning face attributes in the wild. In Proceedings of IEEE/CVF International Conference on Computer Vision, Santiago, Chile, pp. 3730–3738, 2015. DOI: https://doi.org/10.1109/ICCV.2015.425.

  221. K. Zhang, Z. Zhang, Z. Li, Y. Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499–1503, 2016. DOI: https://doi.org/10.1109/LSP.2016.2603342.

  222. F. Schroff, D. Kalenichenko, J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Boston, USA, pp. 815–823, 2015. DOI: https://doi.org/10.1109/CVPR.2015.7298682.

  223. X. Wang, X. Zhang, Z. Luo, Q. Sun, Y. Cui, J. Wang, F. Zhang, Y. Wang, Z. Li, Q. Yu, Y. Zhao, Y. Ao, X. Min, T. Li, B. Wu, B. Zhao, B. Zhang, L. Wang, G. Liu, Z. He, X. Yang, J. Liu, Y. Lin, T. Huang, Z. Wang. Emu3: Next-token prediction is all you need, [Online], Available: https://arxiv.org/abs/2409.18869, 2024.

  224. Gemini Team Google. Gemini: A family of highly capable multimodal models, [Online], Available: https://arxiv.org/abs/2312.11805, 2023.

Acknowledgements

This work was supported in part by Chinese National Natural Science Foundation Projects, China (Nos. U23B2054, 62276254 and 62372314), Beijing Natural Science Foundation, China (No. L221013), InnoHK program, and Hong Kong Research Grants Council through Research Impact Fund, China (No. R1015-23). Open access funding provided by The Hong Kong Polytechnic University, China.

Author information

Authors and Affiliations

  1. Department of Computing, The Hong Kong Polytechnic University, Hong Kong, 999077, China

    Xulu Zhang, Xiaoyong Wei, Wentao Hu, Jiaxin Wu, Wengyu Zhang & Qing Li

  2. Center for Artificial Intelligence and Robotics, Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences, Hong Kong, 999077, China

    Xulu Zhang, Jinlin Wu, Zhaoxiang Zhang & Zhen Lei

  3. State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, 100190, China

    Jinlin Wu, Zhaoxiang Zhang & Zhen Lei

  4. School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, 100049, China

    Zhaoxiang Zhang & Zhen Lei

Corresponding authors

Correspondence to Xiaoyong Wei or Zhen Lei.

Ethics declarations

The authors declare that they have no conflict of interest related to this work.

Additional information

Colored figures are available in the online version at https://link.springer.com/journal/11633

Xulu Zhang received the B. Eng. and the M. Sc. degrees in computer science from Sichuan University, China in 2019 and 2022, respectively. He is a Ph. D. candidate in the Department of Computing, The Hong Kong Polytechnic University, China.

His research interests include image generation and active learning.

Xiaoyong Wei received the Ph. D. degree in computer science from the City University of Hong Kong, China in 2009, and worked as a postdoctoral fellow at the University of California, USA from December 2013 to December 2015. He has been a professor and the head of the Department of Computer Science, Sichuan University, China since 2010. He is an adjunct professor of Peng Cheng Laboratory, China, and a visiting professor of the Department of Computing, The Hong Kong Polytechnic University, China. He is a senior member of IEEE, has served as an Associate Editor of Interdisciplinary Sciences: Computational Life Sciences since 2020 and as the Program Chair of ICMR 2019 and ICIMCS 2012, and has been a technical committee member of over 20 conferences such as ICCV, CVPR, SIGKDD, ACM MM, ICME, and ICIP.

His research interests include multimedia computing, health computing, machine learning and large-scale data mining.

Wentao Hu received the B. Eng. degree in computer science from Shandong University, China in 2021, and the M. Sc. degree in computer science from Sun Yat-sen University, China in 2024. He is currently a Ph. D. candidate in the Department of Computing, The Hong Kong Polytechnic University, China.

His research interests include image generation and 3D reconstruction.

Jinlin Wu received the B. Sc. degree in computer science from the University of Electronic Science and Technology of China, China in 2017, and the Ph. D. degree in computer science from the University of Chinese Academy of Sciences, China in 2022. He is an assistant research fellow at the Institute of Automation, Chinese Academy of Sciences (CAS), China. He has served as the principal investigator of a National Natural Science Foundation of China (NSFC) Youth Science Fund project and has participated in several other NSFC-funded projects. He has accumulated a solid research foundation in areas such as video analysis for security and medical video understanding. He has published over 30 high-quality academic papers, with more than 500 citations.

His research interests include object detection, image recognition, and video understanding.

Jiaxin Wu received the Ph. D. degree in computer science from the City University of Hong Kong, China in 2024. She is a postdoctoral fellow in the Department of Computing at The Hong Kong Polytechnic University, China.

Her research interests include multimedia retrieval, AI for Science (AI4Science), and natural language processing (NLP).

Wengyu Zhang is an undergraduate student in the Department of Computing, The Hong Kong Polytechnic University, China.

His research interests include natural language processing (NLP), AI for Science (AI4Science), and graph learning.

Zhaoxiang Zhang received the B. Sc. degree in computer science from the Department of Electronic Science and Technology, University of Science and Technology of China, China in 2004, and the Ph. D. degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, China in 2009. From 2009 to 2015, he worked as a lecturer, an associate professor, and later the deputy director of the Department of Computer Application Technology at Beihang University, China. Since July 2015, he has been with the Institute of Automation, Chinese Academy of Sciences, where he is currently a professor. He currently focuses on deep learning models, biologically-inspired visual computing and human-like learning, and their applications in human analysis and scene understanding. He has published more than 200 papers in international journals and conferences, including reputable international journals such as IEEE Transactions on Pattern Analysis and Machine Intelligence, International Journal of Computer Vision, Journal of Machine Learning Research, IEEE Transactions on Image Processing, and IEEE Transactions on Circuits and Systems for Video Technology, and top-level international conferences like CVPR, ICCV, NIPS, ECCV, ICLR, AAAI, IJCAI and ACM MM. He has won best paper awards at several conferences and championships in international competitions, and his research has won the "Technical Innovation Award of the Chinese Association of Artificial Intelligence". He has served as the PC Chair or Area Chair of many international conferences like CVPR, ICCV, AAAI, IJCAI, ACM MM, ICPR and BICS. He is serving or has served as an Associate Editor of reputable international journals like International Journal of Computer Vision, IEEE Transactions on Circuits and Systems for Video Technology, IEEE Transactions on Biometrics, Behavior, and Identity Science, Pattern Recognition and Neurocomputing.

His research interests include pattern recognition, computer vision and machine learning.

Zhen Lei received the B. Sc. degree in automation from the University of Science and Technology of China, China in 2005, and the Ph. D. degree in computer vision from the Institute of Automation, Chinese Academy of Sciences, China in 2010, where he is currently a professor. He is an IEEE Fellow, IAPR Fellow and AAIA Fellow. He has published over 200 papers in international journals and conferences, with 33 000+ citations on Google Scholar and an h-index of 86. He was the program co-chair of IJCB2023 and the competition co-chair of IJCB2022, has served as an Area Chair for several conferences, and is an Associate Editor for the IEEE Transactions on Information Forensics and Security, IEEE Transactions on Biometrics, Behavior, and Identity Science, Pattern Recognition, Neurocomputing and IET Computer Vision journals. He is the winner of the 2019 IAPR Young Biometrics Investigator Award.

His research interests include computer vision, pattern recognition, image processing, and face recognition in particular.

Qing Li received the Ph. D. degree in data science from the University of Southern California, USA in 1989. He is currently a Chair Professor (Data Science) and the Head of the Department of Computing, The Hong Kong Polytechnic University, China. Formerly, he was the founding director of the Multimedia Software Engineering Research Centre (MERC) and a professor at City University of Hong Kong, where he worked in the Department of Computer Science from 1998 to 2018. Prior to these, he also taught at the Hong Kong University of Science and Technology and the Australian National University (Canberra, Australia). He served as a consultant to Microsoft Research Asia (Beijing, China), Motorola Global Computing and Telecommunications Division (Tianjin Regional Operations Center), and the Division of Information Technology, Commonwealth Scientific and Industrial Research Organisation (CSIRO) in Australia. He has been an adjunct professor of the University of Science and Technology of China (USTC) and Wuhan University, and a guest professor of Hunan University (Changsha, China), where he received his B. Eng. degree from the Department of Computer Science in 1982. He is also a guest professor (Software Technology) of Zhejiang University (Hangzhou, China), the leading university of the Zhejiang province, where he was born.

He has been actively involved in the research community by serving as an Associate Editor and reviewer for technical journals, and as an organizer/co-organizer of numerous international conferences. Some recent conferences in which he is playing or has played major roles include APWeb-WAIM2018, ICDM 2018, WISE2017, ICDSC2016, DASFAA2015, U-Media2014, ER2013, RecSys2013, NDBC2012, ICMR2012, CoopIS2011, WAIM2010, DASFAA2010, APWeb-WAIM2009, ER2008, WISE2007, ICWL2006, HSI2005, WAIM2004, IDEAS2003, VLDB2002, PAKDD2001, IFIP 2.6 Working Conference on Database Semantics (DS-9), IDS2000, and WISE2000. In addition, he served as a programme committee member for over fifty international conferences (including VLDB, ICDE, WWW, DASFAA, ER, CIKM, CAiSE, CoopIS, and FODO). He is currently a fellow of IEEE and IET/IEE, a member of ACM-SIGMOD and IEEE Technical Committee on Data Engineering. He is the Chairperson of the Hong Kong Web Society, and also served/is serving as an executive committee (EXCO) member of IEEE-Hong Kong Computer Chapter and ACM Hong Kong Chapter. In addition, he serves as a councilor of the Database Society of Chinese Computer Federation (CCF), a member of the Big Data Expert Committee of CCF, and is a Steering Committee member of DASFAA, ER, ICWL, UMEDIA, and WISE Society.

His research interests include data science and artificial intelligence.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Zhang, X., Wei, X., Hu, W. et al. A Survey on Personalized Content Synthesis with Diffusion Models. Mach. Intell. Res. 22, 817–848 (2025). https://doi.org/10.1007/s11633-025-1563-3

Download citation

  • Received: 07 February 2025

  • Accepted: 13 May 2025

  • Published: 27 September 2025

  • Version of record: 27 September 2025

  • Issue date: October 2025

  • DOI: https://doi.org/10.1007/s11633-025-1563-3

Keywords

  • Generative models
  • image synthesis
  • diffusion models
  • personalized content synthesis
  • subject customization
