A walk through image generation: Why we need diffusion models
Diffusion models are among the latest and most popular methods for image generation, particularly generation guided by user-provided natural language prompts. The conceptual challenge for this class of models is to design a method that is:
- Scalable to train and execute
- Able to generate a diverse range of images, including from user-provided prompts
- Able to generate natural-looking images
- Stable in training, with behavior that is easy to replicate
One approach to this problem is “autoregressive” models, where the image is generated pixel by pixel, using the previously generated pixels as successive inputs [1]. The inputs to such a model can include both the image pixels generated so far and the user’s natural-language instructions, encoded into an embedding vector. This approach is slow, because each pixel depends on every prior step of the model’s output. As we’ve seen...