...We trained all components from scratch on 115 million image-text pairs drawn from COYO-100M, CC3M, and CC12M. For the Prior and Decoder, we used the ViT-L/14 model provided by OpenAI's CLIP repository. For efficiency, we made one significant change to the original unCLIP implementation: rather than employing a trainable transformer in the decoder, we use the text encoder of ViT-L/14 in its place.
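The sketch below illustrates the general idea of this change: the decoder is conditioned on features from the pretrained ViT-L/14 text encoder instead of a transformer trained from scratch. It is a minimal illustration, assuming OpenAI's `clip` package; the `encode_caption` helper and the explicit freezing step are our own simplifications for clarity, not code taken from the implementation.

```python
import torch
import clip  # OpenAI's CLIP repository: https://github.com/openai/CLIP

# Load the ViT-L/14 CLIP model; only its text encoder is used here.
device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-L/14", device=device)

# Keep the text encoder fixed, so the decoder does not have to train
# its own text transformer (assumption: the pretrained encoder is frozen).
for p in clip_model.parameters():
    p.requires_grad = False
clip_model.eval()

@torch.no_grad()
def encode_caption(captions):
    """Return CLIP text features used to condition the decoder."""
    tokens = clip.tokenize(captions, truncate=True).to(device)
    return clip_model.encode_text(tokens)  # shape: (batch, 768) for ViT-L/14

# Example: these features stand in for the output of a trainable transformer.
text_features = encode_caption(["a photograph of an astronaut riding a horse"])
```

Under this setup the decoder has fewer parameters to optimize, which is where the efficiency gain over the original unCLIP decoder comes from.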