[Medical Image Synthesis] MediSyn: Text-Guided Diffusion Models for Broad Medical 2D and 3D Image Synthesis
Abstract
Diffusion models have recently gained significant traction due to their ability to
generate high-fidelity and diverse images and videos conditioned on text prompts.
In medicine, this application promises to address the critical challenge of data
scarcity, a consequence of barriers in data sharing, stringent patient privacy
regulations, and disparities in patient population and demographics. By generating
realistic and varying medical 2D and 3D images, these models offer a rich,
privacy-respecting resource for algorithmic training and research. To this end,
we introduce MediSyn, a pair of instruction-tuned text-guided latent diffusion
models with the ability to generate high-fidelity and diverse medical 2D and 3D
images across specialties and modalities. Through established metrics, we show
significant improvement in broad medical image and video synthesis guided by
text prompts.
1 Introduction
Deep learning in medicine has made remarkable strides, with applications ranging from
diagnostic imaging to predictive analytics [1–3]. The fusion of advanced computational
techniques with medical expertise has enabled the development of models that can
identify patterns in complex datasets, offering unprecedented insights into patient care
and disease management [4–6]. These successes have not only enhanced diagnostic and
therapeutic capabilities but have also opened new avenues for personalized medicine,
augmenting the potential for tailored patient care.
Fig. 1: Series of medical images generated by MediSyn’s 2D model, where the accompanying
captions serve as the text prompts for the model. Prompts, left to right: "Endoscopic
image showing normal stomach"; "fundoscopy image of the left retina with no diabetic
retinopathy"; "Optical coherence tomography image of a retina"; "robotic assisted
prostatectomy at the nerve-sparing phase"; "AP Frontal Chest X-ray (CXR) of a male
patient".
However, the paucity of high-quality annotated datasets remains a fundamental
barrier to the development of machine learning models in the medical field. While
large volumes of data are generated by the healthcare industry worldwide, annotating
these datasets comes at a significant cost due to the extensive domain expertise and
time commitment involved [7, 8]. Additionally, underlying medico-legal constraints
surrounding the acquisition and dissemination of medical data pose an important barrier
to the aggregation of data at the scale needed to develop machine learning
models [9]. To add to this, medical data often reflects the disease distribution of a
population, leading to imbalanced datasets with marked disparities in illness incidence
and prevalence rates. These obstacles, coupled with the under-representation of certain
populations in medical settings, can result in biased and fallible clinical decision
support systems that fail to generalize to new settings and population groups [10, 11].
In recent years, denoising diffusion probabilistic models (DDPMs) have garnered
immense interest due to their ability to synthesize diverse and high-fidelity images [12,
13]. By decomposing the generation task into a sequence of denoising steps, diffusion
models have achieved state-of-the-art results on perceived output quality and data
distribution metrics. Additionally, advancements in text embeddings [14, 15] have
enabled DDPMs to incorporate textual prompts, allowing for precise control over the
image generation process. Derived from DDPMs, latent diffusion models (LDMs) are
increasingly used due to their efficient denoising operations in latent space [16–18].
Such models have also been utilized to generate videos by incorporating temporal
functionalities [19, 20].
In this work, we focus on the ability of LDMs to generate novel datasets to overcome
class imbalances traditionally associated with medical data, and potentially reduce the
need for manual annotation of medical 2D and 3D data. We present MediSyn, a pair of
text-guided latent diffusion models for broad medical 2D and 3D modality synthesis.
To overcome the scarcity of labelled medical data, we leverage a vast corpus of more
than 5 million image-caption pairs and 100,000 video-caption pairs collected from
the public domain across numerous medical specialties, and integrate comprehensive
natural language annotations to develop a pair of versatile diffusion models for the
medical domain.
2 Related Work
Since their introduction, generative models have had a rich history in the medical field,
ranging from anomaly detection and image denoising [21, 22], to image reconstruction
and segmentation [23, 24]. For instance, DDPMs have been trained to convert MRIs to
CTs for soft tissue injury [25], synthesize labeled brain MRIs for training segmentation
models [26], denoise OCTs to erase visual artifacts [27], and reconstruct images for
accelerated MRI scans [28].
Our work, akin to that of Sagers et al. and Chambon et al., focuses on synthesizing
multi-class medical datasets through text prompts. In their work, Chambon et al. adapt
a pre-trained LDM, Stable Diffusion, on a corpus of chest x-rays (CXR) and their
corresponding radiology reports to generate CXR displaying different disease states
[29]. Similarly, Sagers et al. use DALL-E to synthesize skin lesions across all Fitzpatrick
skin types [30].
Despite impressive results, the lack of large, curated, publicly available medical
imaging datasets makes training these models challenging, often resulting in outputs
with limited diversity and realism [31]. The resulting outputs, despite being visually
impressive, are often constrained to a single imaging modality or medical subspecialty,
which restricts their utility outside the scope of their defined tasks. While
adopting a similar approach to the works outlined above, our research stands out in
several ways:
• We collect and train on one of the largest publicly available datasets of medical
images and videos to date, spanning more than 5 million image-caption pairs and
100,000 video-caption pairs (comprising volumetric scans and image sequences)
across 8 broad specialties and 9 image types.
• We present a method to generate high-fidelity, high-resolution, and diverse medical
images from a fine-tuned 2D LDM.
• Similarly, we demonstrate the ability to synthesize high-quality, coherent, and varying
medical image sequences and volumetric scans in video format from a fine-tuned
3D LDM.
• We demonstrate significant improvements in the generated outputs through standard
metrics.
3 Methods
3.1 Description of the Dataset
We assembled a set of 5,785,333 medical image-caption pairs, covering 8 specialties
and 9 imaging modalities, to train MediSyn’s 2D model. We reserved an additional
1,000 image-caption pairs (125 pairs from each specialty) for model evaluation.
For MediSyn 3D, we compiled a total of 107,216 medical video-caption pairs, spanning
2 specialties and 3 imaging modalities. We performed model evaluation on a
separate set of 200 pairs (100 from each specialty). Summary statistics for the entire
dataset are provided in Table 1.
3.1.2 Unstructured Public Dataset Collection
Similarly, there exists a variety of reference medical websites showcasing disease states,
case reports, and collections. We construct website-specific pipelines to download and
process all modality-caption pairs encountered and append them to our dataset, while
adhering to polite scraping strategies and responsible data usage. All data is spot
checked for modality and caption quality.
Stable Diffusion 2.1, while maintaining comparable image fidelity. Würstchen’s model
architecture is structured as a three-stage pipeline. Stages A and B encode the images
into highly compressed latent representations, while Stage C synthesizes (denoises)
them with text-conditioning. All textual prompts were embedded using a frozen CLIP
text encoder.
To prevent unnecessary training, we conducted ablative experiments to assess
which stages of Würstchen required fine-tuning on our medical dataset. Details of
these studies, along with the stage architectures, are found in Appendix A. Our
findings revealed that both Stages A and B were effective in compressing and upsampling
images from our medical dataset. In contrast, Stage C failed to synthesize latent
samples that align with real medical images, so we proceeded with fine-tuning its
text-conditional LDM on our medical image-text pairs.
For fine-tuning the pretrained LDM, we closely followed the original training
specifications (learning rate, optimizer, loss function, etc.) with a few adjustments. We
trained the model for roughly three epochs of the training set with a learning rate of
1e-4 and an effective batch size of 256. We initiated an exponential moving average
(EMA) of the model’s parameters at 1500 steps and subsequently updated it every
100 steps. Text captions were dropped 5% of the time to enable Classifier-Free
Guidance (CFG) [35]. This method allows us to trade off between image diversity
and fidelity by controlling the CFG scale, a sampling-time parameter. Higher values
increase alignment to textual prompts, while lower values lead to stronger mode
coverage. We conducted training on 4 A100 GPUs using PyTorch’s Distributed Data
Parallel (DDP) [36], spanning roughly 4 days.
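The caption dropout, guidance, and EMA schedule described above can be sketched as follows. This is a minimal illustration, not the training code used for MediSyn: the function names, the EMA decay value, and the exact alignment of the "every 100 steps" schedule are assumptions, and the guidance formula uses the common parameterization in which a scale of 1 recovers the purely conditional prediction.

```python
import random

import numpy as np


def maybe_drop_caption(caption, drop_prob=0.05, rng=random):
    """Replace the caption with an empty prompt with probability drop_prob.

    Training on a mix of captioned and uncaptioned samples is what lets the
    same network produce both conditional and unconditional predictions.
    """
    return "" if rng.random() < drop_prob else caption


def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: move from the unconditional prediction
    toward (and, for scale > 1, past) the conditional prediction."""
    return eps_uncond + scale * (eps_cond - eps_uncond)


def should_update_ema(step, start=1500, every=100):
    """EMA begins at `start` and is refreshed every `every` steps afterwards
    (schedule alignment is an assumption)."""
    return step >= start and (step - start) % every == 0


def ema_update(ema_params, params, decay=0.999):
    """One EMA step over a dict of arrays; the decay value is hypothetical."""
    return {k: decay * ema_params[k] + (1.0 - decay) * params[k] for k in params}
```

With a scale of 0 the combination returns the unconditional prediction, which favors mode coverage; larger scales weight the text-conditioned direction more heavily, which favors prompt alignment.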
video training and the other half for images: both spatial and temporal layers of the
U-Net are not aligned to medical data. Prior to training, we preprocess the videos
and compute their appearance changes using the ViT-B/8 version of DINOv2. For
image training, we use the middle frames of the medical videos as the model inputs.
We trained for a total of 10053 steps with a learning rate of 5e-5, covering roughly 3
epochs of the video dataset. We set the effective batch size for videos and images to 32
and 512, respectively. We dropped text captions 10% of the time. The entire training
process was completed in roughly 12 hours.
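The step count can be cross-checked against the dataset size. The short calculation below assumes one optimizer step per video batch and that the final partial batch of each epoch still counts as a step; variable names are illustrative.

```python
import math

num_videos = 107_216     # video-caption pairs in the training set (Section 3.1)
video_batch_size = 32    # effective batch size for videos
total_steps = 10_053     # total training steps reported above

# Steps needed to see every video once, counting the last partial batch.
steps_per_epoch = math.ceil(num_videos / video_batch_size)
epochs = total_steps / steps_per_epoch
checkpoints = [steps_per_epoch * k for k in (1, 2, 3)]

print(steps_per_epoch, epochs, checkpoints)
# prints: 3351 3.0 [3351, 6702, 10053]
```

Under these assumptions the three per-epoch boundaries coincide with the 3351, 6702, and 10053 step checkpoints listed for the 3D model in Table 2.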
3.5 Evaluation
Due to the lack of text-to-image models trained on broad medical data, we opted to use
the pretrained Würstchen v2 as our baseline model. For each textual prompt from our
test set, we generated corresponding images using Würstchen v2 and our fine-tuned
version at all three checkpoints. We set the CFG scale to 6.0 across all models, and
fixed the image generation size at 1024x1024 pixels. For our metric, we chose Fréchet
Inception Distance (FID) [40], which measures the fidelity and diversity of the set of
generated images relative to the original ones. For HiGen, we similarly benchmark our
fine-tuned versions against the pretrained model. We set the CFG scale to 12.0 for all
models, and fixed the dimensions of the generated videos to 448 pixels in width and 256
pixels in height. The video frame length was set to 32. We set the appearance factor
to 0.7 and motion factor to 300. We used the Fréchet Video Distance (FVD) [41] as
our quantitative metric. FVD builds on FID by incorporating temporal information:
embeddings are computed using Inflated 3D ConvNet, a video classification model.
$$\mathrm{FID}(x, y) = \lVert \mu_x - \mu_y \rVert_2^2 + \mathrm{Tr}\left(\Sigma_x + \Sigma_y - 2\,(\Sigma_x \Sigma_y)^{1/2}\right)$$
Fig. 3: In the FID formula, x and y represent the feature vectors of the generated and
original images respectively (extracted via the Inception v3 model), µx and µy denote
their mean vectors, Σx and Σy refer to their covariance matrices,
and Tr is the trace of a matrix (sum of its diagonal entries). Low FID values indicate
similar distributions, whereas high values suggest dissimilar ones.
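The quantities named in the Fig. 3 caption (feature means, covariance matrices, and the trace) combine into the Fréchet distance between two Gaussians fitted to the feature sets. The NumPy sketch below illustrates that computation under stated assumptions: the features are taken as already extracted (the Inception v3 step is omitted), the function name is hypothetical, and the trace of the matrix square root is obtained from the eigenvalues of ΣxΣy rather than from an explicit matrix square root.

```python
import numpy as np


def frechet_distance(feats_x, feats_y):
    """Frechet distance between Gaussians fitted to two feature sets.

    feats_x, feats_y: arrays of shape (num_samples, feature_dim).
    """
    mu_x, mu_y = feats_x.mean(axis=0), feats_y.mean(axis=0)
    sigma_x = np.cov(feats_x, rowvar=False)
    sigma_y = np.cov(feats_y, rowvar=False)

    diff = mu_x - mu_y
    # For positive semi-definite covariances, the eigenvalues of
    # sigma_x @ sigma_y are real and non-negative, and
    # Tr((sigma_x sigma_y)^{1/2}) equals the sum of their square roots.
    eigvals = np.linalg.eigvals(sigma_x @ sigma_y)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    return float(diff @ diff + np.trace(sigma_x) + np.trace(sigma_y) - 2.0 * tr_sqrt)
```

As a sanity check, two identical feature sets give a distance of approximately zero, and shifting every feature vector by a constant offset changes only the mean term, so the distance equals the squared norm of the offset.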
4 Results
For both the 2D and 3D models, we observed that all of our fine-tuned versions
significantly outperformed their pretrained counterparts. The FID of our
2D model, with and without EMA, decreased substantially by 55.6% and
54.4% respectively by the end of the first epoch, far exceeding the performance of base
Würstchen. However, there was little to no further improvement after epoch 1, suggesting
fast overfitting to the medical images. Additionally, performance between the EMA and
non-EMA versions was similar (differing by no more than 8.7%), suggesting
that incorporating EMA may not be necessary.
Model                         FID (Non-EMA)   FID (EMA)   FVD
Pretrained Würstchen v2       -               167.6916    -
MediSyn 2D, end of epoch 1    76.4656         74.4487     -
MediSyn 2D, end of epoch 2    76.3751         74.6191     -
MediSyn 2D, end of epoch 3    84.1539         77.1361     -
Pretrained HiGen              -               -           5046.6630
MediSyn 3D, 3351 steps        -               -           645.7636
MediSyn 3D, 6702 steps        -               -           573.5518
MediSyn 3D, 10053 steps       -               -           472.9926
Table 2: Evaluation results for MediSyn and its pretrained counterparts.
Lower FID/FVD values indicate superior performance (more similar to the
distribution of original medical images/videos).
For our 3D model, we observe that the FVD notably decreased by 87.2% by the end
of epoch 1, far surpassing the pretrained version. Unlike Würstchen, HiGen displayed
consistent improvement after the first epoch (an average decrease of 14.4%), implying
a capacity for further training.
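The percentages quoted in this section can be re-derived from the values in Table 2. The sketch below recomputes them; the helper names are illustrative, and the EMA versus non-EMA gap is assumed to be measured relative to the mean of the two scores, which is the convention that reproduces the quoted 8.7%.

```python
def pct_decrease(before, after):
    """Relative decrease from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


def pct_gap(a, b):
    """Absolute difference relative to the mean of the two values, in percent."""
    return 100.0 * abs(a - b) / ((a + b) / 2.0)


# Values copied from Table 2.
fid_pretrained = 167.6916
fid_ema = [74.4487, 74.6191, 77.1361]       # end of epochs 1, 2, 3
fid_non_ema = [76.4656, 76.3751, 84.1539]   # end of epochs 1, 2, 3
fvd = [5046.6630, 645.7636, 573.5518, 472.9926]  # pretrained, then 3 checkpoints

fid_drop_ema = pct_decrease(fid_pretrained, fid_ema[0])
fid_drop_non_ema = pct_decrease(fid_pretrained, fid_non_ema[0])
max_ema_gap = max(pct_gap(a, b) for a, b in zip(fid_ema, fid_non_ema))

fvd_drop_epoch1 = pct_decrease(fvd[0], fvd[1])
later_drops = [pct_decrease(fvd[i], fvd[i + 1]) for i in (1, 2)]
fvd_later_mean = sum(later_drops) / len(later_drops)

print(f"FID drop after epoch 1: EMA {fid_drop_ema:.1f}%, non-EMA {fid_drop_non_ema:.1f}%")
print(f"Largest EMA vs non-EMA gap: {max_ema_gap:.1f}%")
print(f"FVD drop after epoch 1: {fvd_drop_epoch1:.1f}%, later average: {fvd_later_mean:.1f}%")
```

The two later FVD decreases are roughly 11.2% and 17.5%, so the 14.4% figure is the mean of the per-epoch relative decreases rather than a single decrease measured from the epoch-1 checkpoint.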
5 Discussion
Our findings demonstrate MediSyn’s remarkable ability to generate high-fidelity and
diverse medical images, image sequences and volumetric scans across various medical
subspecialties and imaging modalities. Other medical text-driven diffusion models,
such as TauPETGen [42] for tau PET images and GenerateCT [43] for chest CT volumes,
have proven successful in generating high-quality images that accurately depict
anatomical features and clinical conditions. However, these models are constrained to a
single imaging modality and anatomical region, thereby restricting their applicability.
Moreover, they were trained on relatively small datasets sourced from a limited number
of institutions, which could lead to more biased outputs. In contrast, MediSyn, having
been trained on one of the largest publicly accessible medical image and video datasets
to date, is equipped to synthesize data that cover numerous medical disciplines,
population groups, and disease states. Leveraging our two models, we can synthesize new
medical datasets as well as augment existing ones, potentially improving a wide array
of medical machine learning tools, both general and specialized. Additionally, our
models can minimize the need to repeatedly fine-tune on specific datasets for generating
different imaging modalities, thus reducing computational costs for academic labs.
Our study had several limitations. First, we relied solely on standard quantitative
metrics, which fail to measure the clinical relevance of the generated data, specifically
their anatomical and pathological accuracy. To address this, we suggest a qualitative
evaluation by a team of clinical experts. Second, both our 2D and 3D models face
challenges in generating high-fidelity images for certain medical subspecialties and
imaging modalities, such as electrocardiograms and brain MRIs, respectively. Third,
our 3D model was limited to generating a fixed number of frames (32) for optimal
quality. This poses a challenge within medical contexts, where the number of slices or
images in medical scans can vary widely. Fourth, we employed frozen, general-domain
text encoders, which may not fully capture the subtleties present in medical text.
We suggest adapting encoders pretrained specifically on medical corpora to further
improve results.
Future research should focus on finding ways image and video generative models
can more accurately capture and exhibit the anatomical and pathological details of the
medical data they are trained on. This could be achieved by augmenting existing
model architectures or using specialized loss functions, among other approaches.
In summary, we introduced a pair of text-conditional LDMs trained on an extensive
medical image and video dataset covering various medical subspecialties and imaging
modalities. By generating high-fidelity and diverse medical 2D and 3D images,
Medisyn illustrates the potential for a singular framework to broadly address the
challenge of data scarcity in healthcare.
6 Acknowledgments
We would like to thank Stanford Sherlock for their continuous support with GPU
access. We would also like to thank John Ng, Rajiv Gandhi, and the rest of the Oracle
team for their generous support with GPU access.
Declarations
6.1 Funding
This project was supported in part by a National Heart, Lung, and Blood Institute
(NIH NHLBI) grant (1R01HL157235-01A1) (W.H.).
Appendix
A Würstchen Stages
A.1 Stage A
In Stage A, a Vector Quantized Generative Adversarial Network (VQGAN) encodes a
latent representation of the original image. Our objective was to evaluate the
performance of the pretrained VQGAN in accurately encoding and reconstructing images
from our medical dataset. To examine the fidelity of the reconstructed images, we
used the structural similarity index measure (SSIM), a metric more closely aligned
with the human visual system because it considers structural information [44]. We selected a
random sample of 1,000 images from our training set, processed each one through the
VQGAN, and calculated the SSIM between each pair of original and reconstructed
images. We obtained an average SSIM of 0.9842, so we skipped fine-tuning this stage.
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$$
Fig. 4: In the SSIM formula, x and y are the two images being compared, µx and
µy represent their mean brightness, σx² and σy² denote their variances, σxy refers to
the covariance between the two images, and c1 and c2 are constants used to stabilize
the division (preventing division by zero). SSIM values range from -1 to 1, where 1
indicates perfect similarity, 0 indicates no structural similarity, and negative values
indicate inversely correlated structure.
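To make the roles of the terms in the SSIM formula concrete, the sketch below evaluates it once over whole images with NumPy. This is a simplification for illustration: the SSIM of Wang et al. [44] is computed over local windows and then averaged, whereas this hypothetical `global_ssim` helper uses global statistics. The constants follow the defaults proposed in [44], c1 = (0.01 L)² and c2 = (0.03 L)², where L is the dynamic range of the pixel values.

```python
import numpy as np


def global_ssim(x, y, data_range=255.0, k1=0.01, k2=0.03):
    """SSIM formula evaluated once over the whole image (no local windows)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)

    # Stabilizing constants, scaled by the dynamic range of the pixel values.
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2

    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()

    numerator = (2.0 * mu_x * mu_y + c1) * (2.0 * cov_xy + c2)
    denominator = (mu_x**2 + mu_y**2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

Comparing an image with itself yields 1, while comparing it with its intensity-inverted copy yields a negative value because the covariance term changes sign. For reported results, a windowed implementation is the appropriate choice.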
A.2 Stage B
Stage B uses an LDM conditioned on an EfficientNet encoding of the original image
and the corresponding text to recreate the VQGAN representation. Since its primary
role is to refine Stage C’s latent samples, we continued employing the SSIM to allow
direct comparison to the given reference image. We passed the same images used in
Stage A to the LDM, and subsequently processed the model’s outputs through the
VQGAN-decoder to obtain the final images. We calculated the SSIM statistics between
the reconstructed images and the original ones. As the average SSIM was 0.8328, we
also skipped fine-tuning this stage.
A.3 Stage C
Stage C involves training a separate LDM, conditioned on text, in the latent space of
the EfficientNet encoder. As this stage is tasked with the actual generation (denoising)
of latent samples, we now examine Würstchen as a whole, prompting the model with
the same text captions used in the Stage B assessment. Predictably, we observed that
none of its generated images bore resemblance to real medical imagery. Thus, we
assessed that Stage C, specifically its text-conditional LDM, required fine-tuning on
our medical image-text pairs.
B HiGen components
B.1 VAE
We processed a random set of 50 videos from our training data through the VAE to
evaluate its reconstruction abilities. Following the rationale outlined in section A.1,
we employed the SSIM on a frame-by-frame basis between the original videos and
reconstructed ones. Specifically, we averaged the SSIM values across all frames for
each video pair. Noting an average SSIM of 0.9182, we skipped fine-tuning the VAE.
B.2 3D U-Net
Akin to Würstchen’s Stage C, this U-Net is responsible for the initial synthesis (denoising)
of latents. Consequently, we evaluated the entire HiGen model by using a random
set of 200 text captions from our training data for video synthesis, and observed that
the generated videos did not resemble their real counterparts. Thus, we proceeded to
fine-tune the U-Net on our medical video-text pairs.
References
[1] Tang, Y., Tang, Y., Peng, Y., et al.: Automated abnormality classification of chest
radiographs using deep convolutional neural networks. npj Digit. Med. 3, 70
(2020)
[2] Placido, D., Yuan, B., Hjaltelin, J.X., et al.: A deep learning algorithm to predict
risk of pancreatic cancer from disease trajectories. Nat Med 29, 1113–1122 (2023)
[3] Dai, L., Sheng, B., Chen, T., et al.: A deep learning system for predicting time to
progression of diabetic retinopathy. Nat Med 30, 584–594 (2024)
[4] Amgad, M., Hodge, J.M., Elsebaie, M.A.T., et al.: A population-level digital
histologic biomarker for enhanced prognosis of invasive breast cancer. Nat Med 30,
85–97 (2024)
[5] Landi, I., Glicksberg, B.S., Lee, H.C., et al.: Deep representation learning of
electronic health records to unlock patient stratification at scale. npj Digit. Med. 3,
96 (2020) https://doi.org/10.1038/s41746-020-0301-z
[6] Lu, J., Bender, B., Jin, J.Y., et al.: Deep learning prediction of patient response time
course from early data via neural-pharmacokinetic/pharmacodynamic modelling.
Nat Mach Intell 3, 696–704 (2021)
[7] Lyu, M., Mei, L., Huang, S., et al.: M4Raw: A multi-contrast, multi-repetition,
multi-channel MRI k-space dataset for low-field MRI research. Sci Data 10, 264 (2023)
[8] Liu, C., Leigh, R., Johnson, B., et al.: A large public dataset of annotated clinical
MRIs and metadata of patients with acute stroke. Sci Data 10, 548 (2023)
[9] Kohli, M.D., Summers, R.M., Geis, J.: Medical image data and datasets in the era
of machine learning—whitepaper from the 2016 c-mimi meeting dataset session.
J Digit Imaging 30, 392–399 (2017) https://doi.org/10.1007/s10278-017-9976-3
[10] Daneshjou, R., Vodrahalli, K., Novoa, R., Jenkins, M., Liang, W., Rotemberg, V.,
Ko, J., Swetter, S., Bailey, E., Gevaert, O., Mukherjee, P., Phung, M., Yekrang,
K., Fong, B., Sahasrabudhe, R., Allerup, J., Okata-Karigane, U., Zou, J., Chiou,
A.: Disparities in dermatology AI performance on a diverse, curated clinical image
set. Sci Adv 8(32), 6147 (2022)
[11] Acosta, J.N., Falcone, G.J., Rajpurkar, P., et al.: Multimodal biomedical AI. Nat
Med 28, 1773–1784 (2022) https://doi.org/10.1038/s41591-022-01981-2
[12] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models (2020)
arXiv:2006.11239 [cs.LG]
[13] Dhariwal, P., Nichol, A.: Diffusion models beat GANs on image synthesis (2021)
arXiv:2105.05233 [cs.LG]
[14] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry,
G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning
transferable visual models from natural language supervision (2021)
https://doi.org/10.48550/arXiv.2103.00020 [cs.CV]
[15] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li,
W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text
transformer (2019) https://doi.org/10.48550/arXiv.1910.10683 [cs.LG]
[16] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution
image synthesis with latent diffusion models (2021) arXiv:2112.10752 [cs.CV]
[17] Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna,
J., Rombach, R.: SDXL: Improving latent diffusion models for high-resolution image
synthesis (2023) arXiv:2307.01952 [cs.CV]
[18] Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical
text-conditional image generation with CLIP latents (2022) arXiv:2204.06125 [cs.CV]
[19] Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S.W., Fidler, S.,
Kreis, K.: Align your latents: High-resolution video synthesis with latent diffusion
models (2023) arXiv:2304.08818 [cs.CV]
[20] Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz,
D., Levi, Y., English, Z., Voleti, V., Letts, A., Jampani, V., Rombach, R.: Stable
video diffusion: Scaling latent video diffusion models to large datasets (2023)
arXiv:2311.15127 [cs.CV]
[21] Schlegl, T., Seeböck, P., Waldstein, S.M., Schmidt-Erfurth, U., Langs, G.:
Unsupervised anomaly detection with generative adversarial networks to guide marker
discovery (2017) arXiv:1703.05921
[22] Gondara, L.: Medical image denoising using convolutional denoising autoencoders
(2016) arXiv:1608.04667 [cs.CV]
[23] Bhadra, S., Zhou, W., Anastasio, M.A.: Medical image reconstruction with
image-adaptive priors learned by use of generative adversarial networks (2020)
arXiv:2001.10830 [eess.IV]
[24] Wu, J., Fu, R., Fang, H., Zhang, Y., Yang, Y., Xiong, H., Liu, H., Xu, Y.:
MedSegDiff: Medical image segmentation with diffusion probabilistic model (2022)
arXiv:2211.00611 [cs.CV]
[25] Lyu, Q., Wang, G.: Conversion between CT and MRI images using diffusion and
score-matching models (2022) arXiv:2209.12104 [eess.IV]
[26] Akbar, U.M., Larsson, M., Blystad, I., et al.: Brain tumor segmentation using
synthetic MR images - a comparison of GANs and diffusion models. Scientific Data 11,
259 (2024) https://doi.org/10.1038/s41597-024-03073-x
[27] Akter, N., Perry, S., Fletcher, J., Simunovic, M., Roy, M.: Automated artifacts
and noise removal from optical coherence tomography images using deep learning
technique. In: 2020 IEEE Symposium Series on Computational Intelligence
(SSCI), pp. 2536–2542 (2020). https://doi.org/10.1109/SSCI47803.2020.9308336
[28] Chung, H., Ye, J.C.: Score-based diffusion models for accelerated MRI. Medical
Image Analysis 80, 102479 (2022) https://doi.org/10.1016/j.media.2022.102479
[29] Chambon, P., Bluethgen, C., Delbrouck, J.-B., Sluijs, R.V., Polacin, M.,
Chaves, J.M.Z., Abraham, T.M., Purohit, S., Langlotz, C.P., Chaudhari, A.:
RoentGen: Vision-language foundation model for chest x-ray generation (2022)
arXiv:2211.12737 [cs.CV]
[30] Sagers, L.W., Diao, J.A., Melas-Kyriazi, L., Groh, M., Rajpurkar, P., Adamson,
A.S., Rotemberg, V., Daneshjou, R., Manrai, A.K.: Augmenting medical image
classifiers with synthetic data from latent diffusion models (2023)
arXiv:2308.12453 [cs.CV]
[31] Sagers, L.W., Diao, J.A., Melas-Kyriazi, L., Groh, M., Rajpurkar, P., Adamson,
A.S., Rotemberg, V., Daneshjou, R., Manrai, A.K.: Augmenting medical image
classifiers with synthetic data from latent diffusion models (2023)
arXiv:2308.12453 [cs.CV]
[32] Wilde, B., Saha, A., Broek, R.P.G., Huisman, H.: Medical diffusion on a budget:
textual inversion for medical image generation (2023) arXiv:2303.13430 [cs.CV]
[33] Akrout, M., Gyepesi, B., Holló, P., Poór, A., Kincső, B., Solis, S., Cirone,
K., Kawahara, J., Slade, D., Abid, L., Kovács, M., Fazekas, I.: Diffusion-based
data augmentation for skin disease classification: Impact across original medical
datasets to fully synthetic images (2023) arXiv:2301.04802 [cs.LG]
[34] Pernias, P., Rampas, D., Richter, M.L., Pal, C.J., Aubreville, M.: Wuerstchen:
An efficient architecture for large-scale text-to-image diffusion models (2023)
arXiv:2306.00637 [cs.CV]
[35] Ho, J., Salimans, T.: Classifier-free diffusion guidance (2022) arXiv:2207.12598
[cs.LG]
[36] Li, S., Zhao, Y., Varma, R., Salpekar, O., Noordhuis, P., Li, T., Paszke, A., Smith,
J., Vaughan, B., Damania, P., Chintala, S.: PyTorch distributed: Experiences on
accelerating data parallel training (2020) arXiv:2006.15704 [cs.DC]
[37] Bar-Tal, O., Chefer, H., Tov, O., Herrmann, C., Paiss, R., Zada, S., Ephrat, A.,
Hur, J., Liu, G., Raj, A., Li, Y., Rubinstein, M., Michaeli, T., Wang, O., Sun, D.,
Dekel, T., Mosseri, I.: Lumiere: A space-time diffusion model for video generation
(2024) arXiv:2401.12945 [cs.CV]
[38] Qing, Z., Zhang, S., Wang, J., Wang, X., Wei, Y., Zhang, Y., Gao, C., Sang,
N.: Hierarchical spatio-temporal decoupling for text-to-video generation (2023)
arXiv:2312.04483 [cs.CV]
[39] Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V.,
Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba,
W., Howes, R., Huang, P.-Y., Li, S.-W., Misra, I., Rabbat, M., Sharma, V.,
Synnaeve, G., Xu, H., Jegou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P.:
DINOv2: Learning robust visual features without supervision (2023)
arXiv:2304.07193 [cs.CV]
[40] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs
trained by a two time-scale update rule converge to a local Nash equilibrium
(2018) arXiv:1706.08500 [cs.LG]
[41] Unterthiner, T., Steenkiste, S., Kurach, K., Marinier, R., Michalski, M., Gelly, S.:
Towards accurate generative models of video: A new metric & challenges (2018)
arXiv:1812.01717 [cs.CV]
[42] Jang, S.-I., Lois, C., Thibault, E., Becker, J.A., Dong, Y., Normandin, M.D.,
Price, J.C., Johnson, K.A., Fakhri, G.E., Gong, K.: TauPETGen: Text-conditional
tau PET image synthesis based on latent diffusion models (2023)
arXiv:2306.11984 [cs.CV]
[43] Hamamci, I.E., Er, S., Sekuboyina, A., Simsar, E., Tezcan, A., Simsek, A.G.,
Esirgun, S.N., Almas, F., Dogan, I., Dasdelen, M.F., Prabhakar, C., Reynaud, H.,
Pati, S., Bluethgen, C., Ozdemir, M.K., Menze, B.: GenerateCT: Text-conditional
generation of 3D chest CT volumes (2024) arXiv:2305.16037 [cs.CV]
[44] Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality
assessment: From error visibility to structural similarity. IEEE Transactions on Image
Processing 13(4), 600–612 (2004)