[Medical Image Synthesis] MediSyn: Text-Guided Diffusion Models for Broad Medical 2D and 3D Image Synthesis

MediSyn introduces text-guided latent diffusion models for generating high-fidelity medical 2D and 3D images, addressing data scarcity in the medical field. By leveraging a large dataset of over 5 million image-caption pairs, these models enhance algorithmic training and research while ensuring patient privacy. The study demonstrates significant improvements in medical image synthesis through established metrics, showcasing the potential of these models across various medical specialties and modalities.

MediSyn: Text-Guided Diffusion Models for Broad Medical 2D and 3D Image Synthesis

Cho Joseph1*†, Zakka Cyril1†, Shad Rohan2, Wightman Ross3,
Chaudhari Akshay4, Hiesinger William1*†
1 Department of Cardiothoracic Surgery, Stanford Medicine.
2 Division of Cardiovascular Surgery, Penn Medicine.
3 Research Division, HuggingFace.
4 Radiology & Integrative Biomedical Imaging Informatics, Stanford Medicine.

*Corresponding author(s). E-mail(s): [email protected]; [email protected];
† These authors contributed equally to this work.

arXiv:2405.09806v1 [cs.CV] 16 May 2024

Abstract
Diffusion models have recently gained significant traction due to their ability to
generate high-fidelity and diverse images and videos conditioned on text prompts.
In medicine, this application promises to address the critical challenge of data
scarcity, a consequence of barriers in data sharing, stringent patient privacy
regulations, and disparities in patient population and demographics. By gener-
ating realistic and varying medical 2D and 3D images, these models offer a rich,
privacy-respecting resource for algorithmic training and research. To this end,
we introduce MediSyn, a pair of instruction-tuned text-guided latent diffusion
models with the ability to generate high-fidelity and diverse medical 2D and 3D
images across specialties and modalities. Through established metrics, we show
significant improvement in broad medical image and video synthesis guided by
text prompts.

1 Introduction
Deep learning in medicine has made remarkable strides, with applications ranging from
diagnostic imaging to predictive analytics [1–3]. The fusion of advanced computational
techniques with medical expertise has enabled the development of models that can
identify patterns in complex datasets, offering unprecedented insights into patient care
and disease management [4–6]. These successes have not only enhanced diagnostic and
therapeutic capabilities but have also opened new avenues for personalized medicine,
augmenting the potential for tailored patient care.

Fig. 1: Series of medical images generated by Medisyn’s 2D model, where the accompanying
captions serve as the text prompts for the model. Prompts shown: "Endoscopic image
showing normal stomach"; "fundoscopy image of the left retina with no diabetic
retinopathy"; "Optical coherence tomography image of a retina"; "robotic assisted
prostatectomy at the nerve-sparing phase"; "AP Frontal Chest X-ray (CXR) of a male
patient"; "magnetic resonance (MR) soft tissue fluid ankle normal"; "Alopecia areata";
"Lateral Chest X-ray (CXR) of a male patient"; "computed tomography (CT) abdomen
abd-normal"; "Colonoscopic image with polyp visible".

However, the paucity of high-quality annotated datasets remains a fundamental
barrier to the development of machine learning models in the medical field. While
large volumes of data are generated by the healthcare industry worldwide, annotating
these datasets comes at a significant cost due to the extensive domain expertise and
time commitment involved [7, 8]. Additionally, underlying medico-legal constraints
surrounding the acquisition and dispersion of medical data pose an important barrier to
the aggregation of data at the scale needed for the development of machine learning
tasks [9]. To add to this, medical data often reflects the disease distribution of a
population, leading to imbalanced datasets with marked disparities in illness incidence
and prevalence rates. These obstacles, coupled with the under-representation of
certain populations in medical settings, can result in biased and fallible clinical decision
support systems that fail to generalize to new settings and population groups [10, 11].
In recent years, denoising diffusion probabilistic models (DDPMs) have garnered
immense interest due to their ability to synthesize diverse and high-fidelity images [12,
13]. By decomposing the generation task into a sequence of denoising steps, diffusion
models have achieved state-of-the-art results on perceived output quality and data
distribution metrics. Additionally, advancements in text embeddings [14, 15] have
enabled DDPMs to incorporate textual prompts, allowing for precise control over the
image generation process. Derived from DDPMs, latent diffusion models (LDMs) are
increasingly used due to their efficient denoising operations in latent space [16–18].
Such models have also been utilized to generate videos by incorporating temporal
functionalities [19, 20].

Fig. 2: Outputs generated by Medisyn’s 3D model. Their corresponding textual
prompts are: (1) “This echocardiogram shows normal left ventricular systolic function
and qualitatively normal right ventricular function in a child. No significant heart
valve dysfunction, pericardial effusion, nor mitral valve regurgitation are present.”, (2)
“normal head CT”, (3) “knee MRI in the axial plane”, and (4) “high-quality simulated
abnormal head CT sinogram”. Please note these outputs are subsets of our full videos
(32 frames).

In this work, we focus on the ability of LDMs to generate novel datasets to overcome
class imbalances traditionally associated with medical data, and potentially reduce the
need for manual annotation of medical 2D and 3D data. We present MediSyn, a pair of
text-guided latent diffusion models for broad medical 2D and 3D modality synthesis.
To overcome the scarcity of labelled medical data, we leverage a vast corpus of more
than 5 million image-caption pairs and 100,000 video-caption pairs collected from
the public domain across numerous medical specialties, and integrate comprehensive
natural language annotations to develop a pair of versatile diffusion models for the
medical domain.

2 Related Work
Since their introduction, generative models have had a rich history in the medical field,
ranging from anomaly detection and image denoising [21, 22], to image reconstruction
and segmentation [23, 24]. For instance, DDPMs have been trained to convert MRIs to
CTs for soft tissue injury [25], synthesize labeled brain MRIs for training segmentation
models [26], denoise OCTs to erase visual artifacts [27], and reconstruct images for
accelerated MRI scans [28].
Our work, akin to that of Sagers et al. and Chambon et al., focuses on synthesizing
multi-class medical datasets through text prompts. In their work, Chambon et al. adapt
a pre-trained LDM, Stable Diffusion, on a corpus of chest x-rays (CXR) and their
corresponding radiology reports to generate CXR displaying different disease states
[29]. Similarly, Sagers et al. use DALL-E to synthesize skin lesions across all Fitzpatrick
skin types [30].
Despite impressive results, the lack of large, curated, publicly available medical
imaging datasets makes training these models challenging, often resulting in outputs
with limited diversity and realism [31]. The resulting outputs, despite being visually
impressive, are often constrained to a single imaging modality type or medical sub-
specialty which restricts their utility outside the scope of their defined tasks. While
adopting a similar approach to the works outlined above, our research stands out in
several ways:
• We collect and train on one of the largest publicly available datasets of medical
images and videos to date, spanning more than 5 million image-caption pairs and
100,000 video-caption pairs (comprising volumetric scans and image sequences)
across 8 broad specialties and 9 image types.
• We present a method to generate high-fidelity, high-resolution, and diverse medical
images from a fine-tuned 2D LDM.
• Similarly, we demonstrate the ability to synthesize high-quality, coherent, and varying
medical image sequences and volumetric scans in video format from a fine-tuned
3D LDM.
• We demonstrate significant improvements in the generated outputs through standard
metrics.

3 Methods
3.1 Description of the Dataset
We assembled a set of 5,785,333 medical image-caption pairs, covering 8 specialties
and 9 imaging modalities, to train Medisyn’s 2D model. We reserved an additional
1000 image-caption pairs (125 pairs from each specialty) for model evaluation.
For Medisyn 3D, we compiled a total of 107,216 medical video-caption pairs, spanning
2 specialties and 3 imaging modalities. We performed model evaluation on a
separate set of 200 pairs (100 from each specialty). Summary statistics for the entire
dataset are provided in Table 1.

3.1.1 Structured Public Dataset Collection

A collection of publicly available medical datasets spanning several machine learning
tasks (e.g. classification, regression) is aggregated and processed. Each dataset is
manually converted into an image-text or video-text format as appropriate by combining
all available metadata and labels into a single caption. The resulting dataset is
spot-checked to ensure the quality of the generated pairs.

3.1.2 Unstructured Public Dataset Collection
Similarly, there exists a variety of reference medical websites showcasing disease states,
case reports, and collections. We construct website-specific pipelines to download and
process all modality-caption pairs encountered and append them to our dataset, while
adhering to polite scraping strategies and responsible data usage. All data are
spot-checked for modality and caption quality.

(a) Dataset Breakdown per Specialty

Specialty                           Number of Pairs
Radiology                           3,106,264
Dermatology                         79,209
Cardiology                          177,275
Gastroenterology                    7,568
Pathology                           197,310
General Surgery                     2,150,757
Ophthalmology                       174,246
Pulmonology                         920

(b) Dataset Breakdown per Modality

Modality Type                       Number of Pairs
X-rays (XR)                         856,682
Computerized Tomography (CT)        1,150,206
Magnetic Resonance Imaging (MRI)    699,105
Ultrasound (US)                     435,900
Electrocardiogram (ECG)             141,566
Microscopy Images                   197,310
Light Photography                   174,246
Spectrograms                        920
Surgical Footage                    2,150,757

Table 1: Dataset breakdown summary
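
As a quick consistency check on Table 1(a), the per-specialty counts can be summed and compared with the figures quoted in Section 3.1. The sketch below assumes, as one possible reading that the text does not state explicitly, that the specialty breakdown covers the 2D training pairs, the 1,000 held-out image pairs, and the video pairs together.

```python
# Counts transcribed from Table 1(a) and Section 3.1.
PAIRS_PER_SPECIALTY = {
    "Radiology": 3_106_264,
    "Dermatology": 79_209,
    "Cardiology": 177_275,
    "Gastroenterology": 7_568,
    "Pathology": 197_310,
    "General Surgery": 2_150_757,
    "Ophthalmology": 174_246,
    "Pulmonology": 920,
}
TRAIN_IMAGE_PAIRS = 5_785_333  # 2D training set
EVAL_IMAGE_PAIRS = 1_000       # 2D evaluation set
VIDEO_PAIRS = 107_216          # 3D video-caption pairs


def specialty_total(counts):
    """Sum the per-specialty pair counts."""
    return sum(counts.values())


def stated_total():
    """Assumed grouping: training images + evaluation images + videos."""
    return TRAIN_IMAGE_PAIRS + EVAL_IMAGE_PAIRS + VIDEO_PAIRS


if __name__ == "__main__":
    print(specialty_total(PAIRS_PER_SPECIALTY), stated_total())
```

Under that assumed grouping the two totals agree.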

3.2 Data Pre-processing

We apply the same sequence of processing steps to our collected images. First, we resize
the longer dimension of each image to 1,152 pixels while proportionally scaling the
shorter edge to maintain the original aspect ratio. Next, we pad the shorter edge with
black pixels, thus obtaining a square image of 1,152x1,152 pixels. Last, we perform a
center crop on every image for a resulting size of 1,024x1,024 pixels.
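
The resize, pad, and crop sequence for 2D images can be sketched as pure geometry. This is a minimal illustration rather than the authors' pipeline: the function name `plan_2d_preprocess` is hypothetical, rounding of the shorter edge is an assumption, and only the total padding is reported because the text does not say how it is split between the two sides.

```python
def plan_2d_preprocess(width, height, long_side=1152, crop=1024):
    """Return the sizes produced by resize -> pad to square -> center crop."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # Resize so the longer dimension equals long_side, keeping the aspect ratio.
    if width >= height:
        new_w = long_side
        new_h = max(1, round(height * long_side / width))
    else:
        new_h = long_side
        new_w = max(1, round(width * long_side / height))
    # Center crop box inside the padded long_side x long_side square.
    offset = (long_side - crop) // 2
    return {
        "resized": (new_w, new_h),
        # Total black padding needed along each axis to reach the square.
        "padding": (long_side - new_w, long_side - new_h),
        "crop_box": (offset, offset, offset + crop, offset + crop),
    }


if __name__ == "__main__":
    print(plan_2d_preprocess(2000, 1000))
```

With these defaults the center crop trims 64 pixels from every side of the padded square.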
For Medisyn’s 3D model, we randomly selected 32 consecutive images or slices from
each video and appended black frames to those short of 32 frames. Next, we resize
each frame to a height of 256 pixels, while scaling the width to maintain aspect ratio.
Last, we either perform cropping or padding with black pixels to achieve a width of
448 pixels.
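
The video sampling and frame sizing steps can be sketched in the same way. The helper names below are hypothetical, `None` is used as a stand-in for an appended black frame, and the rounding of the scaled width is an assumption; the text does not specify where cropping or padding is applied along the width.

```python
import random


def select_clip(num_frames, clip_len=32, rng=None):
    """Pick clip_len consecutive frame indices; None marks a black padding frame."""
    rng = rng or random.Random()
    if num_frames >= clip_len:
        start = rng.randint(0, num_frames - clip_len)  # inclusive bounds
        return list(range(start, start + clip_len))
    # Short video: keep every frame and append black frames up to clip_len.
    return list(range(num_frames)) + [None] * (clip_len - num_frames)


def plan_frame_resize(width, height, target_h=256, target_w=448):
    """Scale to target_h, then report whether the width must be cropped or padded."""
    new_w = max(1, round(width * target_h / height))
    delta = target_w - new_w
    action = "pad" if delta > 0 else "crop" if delta < 0 else "none"
    return {"resized": (new_w, target_h), "action": action, "pixels": abs(delta)}


if __name__ == "__main__":
    print(select_clip(10))
    print(plan_frame_resize(640, 480))
```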

3.3 Image Generation

Despite the prevalent use of Stable Diffusion for medical text-to-image tasks [29,
32, 33], we sought a more computationally efficient architecture. We thus adopted
Würstchen v2, a text-to-image LDM with a 42x spatial compression rate [34]. This
image compression leads to an eightfold reduction in GPU training time relative to
Stable Diffusion 2.1, while maintaining comparable image fidelity. Würstchen’s model
architecture is structured as a three-stage pipeline. Stages A and B encode the images
into highly compressed latent representations, while Stage C synthesizes (denoises)
them with text-conditioning. All textual prompts were embedded using a frozen CLIP
text encoder.
To prevent unnecessary training, we conducted ablative experiments to assess
which stages of Würstchen required fine-tuning on our medical dataset. Details of
these studies, along with their architectures, are found in Appendix A. Our
findings revealed that both Stages A and B were effective in compressing and
upsampling images from our medical dataset. In contrast, Stage C failed to synthesize
latent samples that align with real medical images, so we proceeded with fine-tuning its
text-conditional LDM on our medical image-text pairs.
For fine-tuning the pretrained LDM, we closely followed the original training
specifications (learning rate, optimizer, loss function, etc.) with a few adjustments. We
trained the model for roughly three epochs of the training set with a learning rate of
1e-4 and an effective batch size of 256. We initiated an exponential moving average
(EMA) of the model’s parameters at 1500 steps and subsequently updated it every
100 steps. Text captions were dropped 5 percent of the time to enable Classifier-Free
Guidance (CFG) [35]. This method allows us to trade off between image diversity
and fidelity through a single parameter, the CFG scale. Higher values increase
alignment to textual prompts, while lower values lead to stronger mode coverage. We
conducted training on 4 A100 GPUs using PyTorch’s Distributed Data Parallel (DDP)
[36], spanning roughly 4 days.
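
The two halves of classifier-free guidance described above, caption dropout during training and guided combination at sampling time, can be illustrated with a short sketch. It uses the commonly used formulation eps = eps_uncond + scale * (eps_cond - eps_uncond); whether the fine-tuning code uses exactly this form, and an empty string as the dropped caption, are assumptions. Noise predictions are plain lists of floats here rather than tensors.

```python
import random


def maybe_drop_caption(caption, drop_prob=0.05, rng=random):
    """Training-time step: swap the caption for an empty prompt with probability drop_prob."""
    return "" if rng.random() < drop_prob else caption


def cfg_combine(eps_uncond, eps_cond, scale):
    """Sampling-time step: push the unconditional prediction toward the conditional one."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


if __name__ == "__main__":
    rng = random.Random(0)
    kept = sum(maybe_drop_caption("normal head CT", rng=rng) != "" for _ in range(10_000))
    print("captions kept:", kept)
    print(cfg_combine([0.0, 1.0], [1.0, 3.0], scale=6.0))
```

A scale of 1 returns the conditional prediction unchanged, while larger scales extrapolate past it, which is the fidelity versus diversity trade-off mentioned in the text.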

3.4 Video Generation

Video synthesis is notably more difficult than image synthesis primarily due to the
need for both spatial and temporal coherence [37]. To help overcome this, we pro-
ceed with HiGen, a text-to-video LDM that decouples spatial and temporal processing
across two distinct stages: structure and content [38]. HiGen incorporates a Varia-
tional Autoencoder (VAE) for encoding and decoding frames, a 3D U-Net for noise
estimation (synthesis), and a frozen CLIP encoder for embedding textual prompts.
In the structure phase, the U-Net incorporates the middle frames of the videos as
spatial priors, alongside the corresponding embedded texts. To further increase tem-
poral consistency, the content level introduces embeddings of motion and appearance
variations from the corresponding videos. Motion vectors are derived from pixel-wise
differences between consecutive frames, while appearance values are obtained from the
image encoder of DINOv2 [39].
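
The idea of deriving a motion signal from pixel-wise differences between consecutive frames can be sketched as follows. This is an illustration only: reducing each frame pair to a single mean absolute difference is an assumption, and HiGen's actual reduction and normalization are not described here. Frames are nested lists of grayscale values.

```python
def frame_differences(frames):
    """Mean absolute pixel difference between each pair of consecutive frames."""
    scores = []
    for prev, curr in zip(frames, frames[1:]):
        total = 0.0
        count = 0
        for row_prev, row_curr in zip(prev, curr):
            for p, c in zip(row_prev, row_curr):
                total += abs(c - p)
                count += 1
        scores.append(total / count if count else 0.0)
    return scores


if __name__ == "__main__":
    still = [[0, 0], [0, 0]]
    moved = [[10, 10], [10, 10]]
    print(frame_differences([still, still, moved]))
```

A clip of N frames yields N - 1 scores, with zeros wherever consecutive frames are identical.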
Similarly to Würstchen, we examined which components of HiGen required
fine-tuning on our medical dataset, with the exception of the CLIP text encoder, which
remained frozen (Appendix B). Although the VAE performed well at encoding and
decoding our medical videos, the 3D U-Net failed to synthesize samples that
resemble real medical data. Thus, we proceeded with fine-tuning the U-Net on our
medical video-text pairs.
We loosely follow the original training method, incorporating several modifications.
With equivalent compute (8 A100 80GB GPUs), we allocate half of the GPUs for
video training and the other half for images, since neither the spatial nor the temporal
layers of the U-Net are aligned to medical data. Prior to training, we preprocess the
videos and compute their appearance changes using the ViT-B/8 version of DINOv2. For
image training, we use the middle frames of the medical videos as the model inputs.
We trained for a total of 10053 steps with a learning rate of 5e-5, covering roughly 3
epochs of the video dataset. We set the effective batch size for videos and images to 32
and 512, respectively. We dropped text captions 10% of the time. The entire training
process was completed in roughly 12 hours.
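
The statement that 10053 steps cover roughly 3 epochs can be checked from the video batch size and dataset size given in the text. The sketch below assumes one optimizer step consumes one effective batch of 32 videos; the function name is hypothetical.

```python
VIDEO_PAIRS = 107_216   # video-caption pairs (Section 3.1)
VIDEO_BATCH = 32        # effective batch size for videos
TOTAL_STEPS = 10_053    # total training steps


def epochs_covered(steps, batch_size, dataset_size):
    """Number of passes over the dataset after the given number of steps."""
    return steps * batch_size / dataset_size


if __name__ == "__main__":
    for steps in (3_351, 6_702, TOTAL_STEPS):
        print(steps, round(epochs_covered(steps, VIDEO_BATCH, VIDEO_PAIRS), 3))
```

Under this assumption the 3351-step spacing of the checkpoints reported in Table 2 corresponds to about one epoch each.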

3.5 Evaluation
Due to the lack of text-to-image models trained on broad medical data, we opted to use
the pretrained Würstchen v2 as our baseline model. For each textual prompt from our
test set, we generated corresponding images using Würstchen v2 and our fine-tuned
version at all three checkpoints. We set the CFG scale to 6.0 across all models, and
fixed the image generation size at 1024x1024 pixels. For our metric, we chose Fréchet
Inception Distance (FID) [40], which measures the fidelity and diversity of the set of
generated images relative to the original ones. For HiGen, we similarly benchmark our
fine-tuned versions against the pretrained model. We set the CFG scale to 12.0 for all
models, and fixed the dimensions of the generated videos to 448 pixels in width and 256
pixels in height. The video frame length was set to 32. We set the appearance factor
to 0.7 and motion factor to 300. We used the Fréchet Video Distance (FVD) [41] as
our quantitative metric. FVD builds on FID by incorporating temporal information:
embeddings are computed using Inflated 3D ConvNet, a video classification model.

$$\mathrm{FID}(x, y) = \lVert \mu_x - \mu_y \rVert^2 + \mathrm{Tr}\left(\Sigma_x + \Sigma_y - 2\,(\Sigma_x \Sigma_y)^{1/2}\right)$$

Fig. 3: In the FID formula, $x$ and $y$ represent the feature vectors of the generated and
original images respectively (extracted via the Inception v3 model), $\mu_x$ and $\mu_y$ denote
their mean vectors, $\Sigma_x$ and $\Sigma_y$ refer to their covariance matrices, and
$\mathrm{Tr}$ is the trace of a matrix (sum of its diagonal entries). Low FID values indicate
similar distributions whereas high values suggest dissimilar ones.
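
To make the two terms of the FID formula concrete, the sketch below evaluates it in a simplified special case where both covariance matrices are treated as diagonal. In that case the matrices commute and the trace term reduces to a per-feature sum of (sigma_x - sigma_y)^2. This is not the full FID: cross-feature covariances are ignored, the use of sample variance is an assumption, and real evaluations use Inception v3 features with a matrix square root.

```python
from math import sqrt
from statistics import fmean, variance


def diagonal_fid(features_x, features_y):
    """FID restricted to diagonal covariances (cross-feature terms ignored)."""
    dims = len(features_x[0])
    total = 0.0
    for d in range(dims):
        xs = [f[d] for f in features_x]
        ys = [f[d] for f in features_y]
        # Squared distance between the means along this feature.
        mean_term = (fmean(xs) - fmean(ys)) ** 2
        # Tr(S_x + S_y - 2 (S_x S_y)^(1/2)) for diagonal S_x, S_y.
        var_x, var_y = variance(xs), variance(ys)
        cov_term = var_x + var_y - 2.0 * sqrt(var_x * var_y)
        total += mean_term + cov_term
    return total


if __name__ == "__main__":
    real = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
    shifted = [[a + 3.0, b] for a, b in real]
    print(diagonal_fid(real, real), diagonal_fid(real, shifted))
```

Identical feature sets give a distance of zero, and shifting one feature by a constant adds the square of that shift.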

4 Results
For both the 2D and 3D models, we observed that all of our fine-tuned versions
significantly outperformed their pretrained counterparts. The FID of our
2D model, both with and without EMA, showed substantial decreases of 55.6% and
54.4% respectively by the end of the first epoch, far exceeding the performance of base
Würstchen. However, there was no further improvement after epoch 1, suggesting fast
overfitting to the medical images. Additionally, performance between the EMA and
non-EMA versions was nearly identical (differing by no more than 8.7%), suggesting
that incorporating EMA may not be necessary.

Model                         FID (Non-EMA)   FID (EMA)   FVD
Pretrained Würstchen v2       -               167.6916    -
Medisyn 2D, end of epoch 1    76.4656         74.4487     -
Medisyn 2D, end of epoch 2    76.3751         74.6191     -
Medisyn 2D, end of epoch 3    84.1539         77.1361     -
Pretrained HiGen              -               -           5046.6630
Medisyn 3D, 3351 steps        -               -           645.7636
Medisyn 3D, 6702 steps        -               -           573.5518
Medisyn 3D, 10053 steps       -               -           472.9926
Table 2: Evaluation results for Medisyn and its pretrained counterparts.
Lower FID/FVD values indicate superior performance (more similar to
distribution of original medical images/videos).

For our 3D model, we observe that the FVD notably decreased by 87.2% by the end
of epoch 1, far surpassing the pretrained version. Unlike Würstchen, HiGen displayed
consistent improvement after the first epoch (an average decrease of 14.4%), implying
a capacity for further training.
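
The percentages quoted in this section can be reproduced from the values in Table 2. The sketch below uses the relative decrease from the pretrained baseline for the headline figures; for the EMA versus non-EMA gap it assumes a symmetric percent difference (relative to the mean of the two values), since that definition is the one that matches the quoted 8.7% and the text does not state which was used.

```python
FID_BASE = 167.6916                       # pretrained Würstchen v2
FID_NON_EMA = [76.4656, 76.3751, 84.1539]  # epochs 1-3
FID_EMA = [74.4487, 74.6191, 77.1361]      # epochs 1-3
FVD_BASE = 5046.6630                       # pretrained HiGen
FVD = [645.7636, 573.5518, 472.9926]       # 3351, 6702, 10053 steps


def pct_decrease(before, after):
    """Relative decrease from before to after, in percent."""
    return 100.0 * (before - after) / before


def symmetric_pct_diff(a, b):
    """Absolute difference relative to the mean of the two values, in percent."""
    return 100.0 * abs(a - b) / ((a + b) / 2.0)


def average_later_fvd_decrease():
    """Mean step-to-step FVD decrease after the first checkpoint."""
    drops = [pct_decrease(FVD[i], FVD[i + 1]) for i in range(len(FVD) - 1)]
    return sum(drops) / len(drops)


if __name__ == "__main__":
    print(round(pct_decrease(FID_BASE, FID_EMA[0]), 1))
    print(round(pct_decrease(FID_BASE, FID_NON_EMA[0]), 1))
    print(round(pct_decrease(FVD_BASE, FVD[0]), 1))
    print(round(average_later_fvd_decrease(), 1))
    print(round(max(symmetric_pct_diff(a, b) for a, b in zip(FID_NON_EMA, FID_EMA)), 1))
```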

5 Discussion
Our findings demonstrate Medisyn’s remarkable ability to generate high-fidelity and
diverse medical images, image sequences and volumetric scans across various medi-
cal subspecialties and imaging modalities. Other medical text-driven diffusion models,
such as TauPETGen [42] for tau PET images and GenerateCT [43] for chest CT vol-
umes, have proven successful in generating high-quality images that accurately depict
anatomical features and clinical conditions. However, these models are constrained to a
single imaging modality and anatomical region, thereby restricting their applicability.
Moreover, they were trained on relatively small datasets sourced from a limited number
of institutions, which could lead to more biased outputs. In contrast, Medisyn, having
been trained on one of the largest publicly accessible medical image and video datasets
to date, is equipped to synthesize data that cover numerous medical disciplines, pop-
ulation groups, and disease states. Leveraging our two models, we can synthesize new
medical datasets as well as augment existing ones, potentially improving a wide array
of medical machine learning tools, both general and specialized. Additionally, our mod-
els can minimize the need to repeatedly fine-tune on specific datasets for generating
different imaging modalities, thus reducing computational costs for academic labs.
Our study had several limitations. First, we relied solely on standard quantitative
metrics, which fail to measure the clinical relevance of the generated data, specifically
their anatomical and pathological accuracy. To address this, we suggest a qualitative
evaluation by a team of clinical experts. Second, both our 2D and 3D models face
challenges in generating high-fidelity images for certain medical subspecialties and
imaging modalities, such as electrocardiograms and brain MRIs, respectively. Third,
our 3D model was limited to generating a fixed number of frames (32 for optimal
quality). This poses a challenge within medical contexts, where the number of slices or
images in medical scans can vary widely. Fourth, we employed frozen, general-domain
text encoders, which may not fully capture the subtleties present in medical text.
We suggest adapting encoders pretrained specifically on medical corpora to further
improve results.
Future research should focus on finding ways image and video generative models
can more accurately capture and exhibit the anatomical and pathological details of the
medical data they’re trained on. This could be achieved through augmenting existing
model architectures, using specialized loss functions, among other approaches.
In summary, we introduced a pair of text-conditional LDMs trained on an extensive
medical image and video dataset covering various medical subspecialties and imag-
ing modalities. By generating high-fidelity and diverse medical 2D and 3D images,
Medisyn illustrates the potential for a singular framework to broadly address the
challenge of data scarcity in healthcare.

6 Acknowledgments
We would like to thank Stanford Sherlock for their continuous support with GPU
access. We would also like to thank John Ng, Rajiv Gandhi, and the rest of the Oracle
team for their generous support with GPU access.

Declarations
6.1 Funding
This project was supported in part by a National Heart, Lung, and Blood Institute
(NIH NHLBI) grant (1R01HL157235-01A1) (W.H.).

6.2 Competing interests

The authors declare no competing interests.

6.3 Authors’ contributions

J.C., C.Z., and W.H. designed the experiments. J.C. and C.Z. wrote the manuscript. The
code-base was authored by J.C. and C.Z. Computational experiments were performed
by J.C. and C.Z.

Appendix
A Würstchen Stages
A.1 Stage A
In Stage A, a Vector Quantized Generative Adversarial Network (VQGAN) encodes a
latent representation of the original image. Our objective was to evaluate the perfor-
mance of the pretrained VQGAN in accurately encoding and reconstructing images
from our medical dataset. To examine the fidelity of the reconstructed images, we
used the structural similarity index measure (SSIM), a metric more closely aligned
to the human visual system by considering structural information [44]. We selected a
random sample of 1,000 images from our training set, processed each one through the
VQGAN, and calculated the SSIM between each pair of original and reconstructed
images. We obtained an average SSIM of 0.9842, so we skipped fine-tuning this stage.

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$$

Fig. 4: In the SSIM formula, $x$ and $y$ are the two images being compared, $\mu_x$ and
$\mu_y$ represent their mean brightness, $\sigma_x^2$ and $\sigma_y^2$ denote their variances,
$\sigma_{xy}$ refers to the covariance between the two images, and $c_1$ and $c_2$ are
constants used to stabilize the division (preventing division by zero). SSIM values range
from -1 to 1: 1 indicates perfect similarity, values near 0 indicate no structural similarity,
and negative values indicate inverted (anti-correlated) structure.
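
The SSIM formula can be evaluated directly on two small images. The sketch below is a simplification: it computes a single global SSIM over the whole image, whereas standard implementations average the statistic over local windows, and it assumes the conventional constants c1 = (0.01 L)^2 and c2 = (0.03 L)^2 with L the dynamic range. Images are flat lists of grayscale values.

```python
from statistics import fmean


def global_ssim(x, y, dynamic_range=255.0):
    """Single-window SSIM over two equally sized grayscale images (flat lists)."""
    if len(x) != len(y) or not x:
        raise ValueError("images must be non-empty and the same size")
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    mu_x, mu_y = fmean(x), fmean(y)
    var_x = fmean([(a - mu_x) ** 2 for a in x])
    var_y = fmean([(b - mu_y) ** 2 for b in y])
    cov_xy = fmean([(a - mu_x) * (b - mu_y) for a, b in zip(x, y)])
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


if __name__ == "__main__":
    image = [0.0, 64.0, 128.0, 255.0]
    inverted = [255.0 - v for v in image]
    print(global_ssim(image, image), global_ssim(image, inverted))
```

An image compared with itself scores 1, while comparing it with its intensity-inverted copy gives a negative score because the covariance term changes sign.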

A.2 Stage B
Stage B uses an LDM conditioned on an EfficientNet encoding of the original image
and the corresponding text to recreate the VQGAN representation. Since its primary
role is to refine Stage C’s latent samples, we continued employing the SSIM to allow
direct comparison to the given reference image. We passed the same images used in
Stage A to the LDM, and subsequently processed the model’s outputs through the
VQGAN-decoder to obtain the final images. We calculated the SSIM statistics between
the reconstructed images and the original ones. As the average SSIM was 0.8328, we
also skipped fine-tuning this stage.

Fig. 5: Würstchen’s two compression stages: comparison of reconstructed images with
their original counterparts, shown as three generated/original pairs. The top row shows
reconstructions from Stage A, while the bottom row shows images reconstructed starting
from Stage B.

A.3 Stage C
Stage C involves training a separate LDM, conditioned on text, in the latent space of
the EfficientNet encoder. As this stage is tasked with the actual generation (denoising)
of latent samples, we now examine Würstchen as a whole, prompting the model with
the same text captions used in the Stage B assessment. Predictably, we observed that
none of its generated images bore resemblance to real medical imagery. Thus, we
assessed that Stage C, specifically its text-conditional LDM, required fine-tuning on
our medical image-text pairs.

Model      Average   Median   Standard Deviation   Range
Stage A    0.9842    0.9899   0.0242               0.4545
Stage B    0.8328    0.8649   0.1291               0.9071
VAE        0.9182    0.9502   0.0979               0.3547
Table 3: SSIM statistics across both 2D and 3D LDM components. SSIM values range
from -1 to 1: 1 indicates perfect similarity, values near 0 indicate no structural
similarity, and negative values indicate inverted structure.

B HiGen components
B.1 VAE
We processed a random set of 50 videos from our training data through the VAE to
evaluate its reconstruction abilities. Following the rationale outlined in section A.1,
we employed the SSIM on a frame-by-frame basis between the original videos and
reconstructed ones. Specifically, we averaged the SSIM values across all frames for
each video pair. Noting an average SSIM of 0.9182, we skipped fine-tuning the VAE.

Fig. 6: Comparison of original (top-row) and VAE-reconstructed (bottom-row) images.
B.2 3D U-Net
Akin to Würstchen’s Stage C, this U-Net is responsible for the initial synthesis (denois-
ing) of latents. Consequently, we evaluated the entire HiGen model by using a random
set of 200 text captions from our training data for video synthesis, and observed that
the generated videos did not resemble their real counterparts. Thus, we proceeded to
fine-tune the U-Net on our medical video-text pairs.

References
[1] Tang, Y., Tang, Y., Peng, Y., et al.: Automated abnormality classification of chest
radiographs using deep convolutional neural networks. npj Digit. Med. 3, 70
(2020)

[2] Placido, D., Yuan, B., Hjaltelin, J.X., et al.: A deep learning algorithm to predict
risk of pancreatic cancer from disease trajectories. Nat Med 29, 1113–1122 (2023)

[3] Dai, L., Sheng, B., Chen, T., et al.: A deep learning system for predicting time to
progression of diabetic retinopathy. Nat Med 30, 584–594 (2024)

[4] Amgad, M., Hodge, J.M., Elsebaie, M.A.T., et al.: A population-level digital
histologic biomarker for enhanced prognosis of invasive breast cancer. Nat Med 30,
85–97 (2024)

[5] Landi, I., Glicksberg, B.S., Lee, H.C., et al.: Deep representation learning of
electronic health records to unlock patient stratification at scale. npj Digit. Med. 3,
96 (2020) https://doi.org/10.1038/s41746-020-0301-z

[6] Lu, J., Bender, B., Jin, J.Y., et al.: Deep learning prediction of patient response time
course from early data via neural-pharmacokinetic/pharmacodynamic modelling.
Nat Mach Intell 3, 696–704 (2021)

[7] Lyu, M., Mei, L., Huang, S., et al.: M4raw: A multi-contrast, multi-repetition,
multi-channel mri k-space dataset for low-field mri research. Sci Data 10, 264 (2023)

[8] Liu, C., Leigh, R., Johnson, B., et al.: A large public dataset of annotated clinical
mris and metadata of patients with acute stroke. Sci Data 10, 548 (2023)

[9] Kohli, M.D., Summers, R.M., Geis, J.: Medical image data and datasets in the era
of machine learning—whitepaper from the 2016 c-mimi meeting dataset session.
J Digit Imaging 30, 392–399 (2017) https://doi.org/10.1007/s10278-017-9976-3

[10] Daneshjou, R., Vodrahalli, K., Novoa, R., Jenkins, M., Liang, W., Rotemberg, V.,
Ko, J., Swetter, S., Bailey, E., Gevaert, O., Mukherjee, P., Phung, M., Yekrang,
K., Fong, B., Sahasrabudhe, R., Allerup, J., Okata-Karigane, U., Zou, J., Chiou,
A.: Disparities in dermatology ai performance on a diverse, curated clinical image
set. Sci Adv 8(32), 6147 (2022)

[11] Acosta, J.N., Falcone, G.J., Rajpurkar, P., et al.: Multimodal biomedical ai. Nat
Med 28, 1773–1784 (2022) https://doi.org/10.1038/s41591-022-01981-2

[12] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models (2020)
arXiv:2006.11239 [cs.LG]

[13] Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis (2021)
arXiv:2105.05233 [cs.LG]

[14] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry,
G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning
transferable visual models from natural language supervision (2021)
https://doi.org/10.48550/arXiv.2103.00020 [cs.CV]

[15] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li,
W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text
transformer (2019) https://doi.org/10.48550/arXiv.1910.10683 [cs.LG]

[16] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution
image synthesis with latent diffusion models (2021) arXiv:2112.10752 [cs.CV]

[17] Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna,
J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image
synthesis (2023) arXiv:2307.01952 [cs.CV]

[18] Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-
conditional image generation with clip latents (2022) arXiv:2204.06125 [cs.CV]

[19] Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S.W., Fidler, S.,
Kreis, K.: Align your latents: High-resolution video synthesis with latent diffusion
models (2023) arXiv:2304.08818 [cs.CV]

[20] Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz,
D., Levi, Y., English, Z., Voleti, V., Letts, A., Jampani, V., Rombach, R.: Stable
video diffusion: Scaling latent video diffusion models to large datasets (2023)
arXiv:2311.15127 [cs.CV]

[21] Schlegl, T., Seeböck, P., Waldstein, S.M., Schmidt-Erfurth, U., Langs, G.:
Unsupervised anomaly detection with generative adversarial networks to guide marker
discovery (2017) arXiv:1703.05921

[22] Gondara, L.: Medical image denoising using convolutional denoising autoencoders
(2016) arXiv:1608.04667 [cs.CV]

[23] Bhadra, S., Zhou, W., Anastasio, M.A.: Medical image reconstruction with
image-adaptive priors learned by use of generative adversarial networks (2020)
arXiv:2001.10830 [eess.IV]

[24] Wu, J., Fu, R., Fang, H., Zhang, Y., Yang, Y., Xiong, H., Liu, H., Xu, Y.:
Medsegdiff: Medical image segmentation with diffusion probabilistic model (2022)
arXiv:2211.00611 [cs.CV]

[25] Lyu, Q., Wang, G.: Conversion between ct and mri images using diffusion and
score-matching models (2022) arXiv:2209.12104 [eess.IV]

[26] Akbar, U.M., Larsson, M., Blystad, I., et al.: Brain tumor segmentation using
synthetic mr images - a comparison of gans and diffusion models. Scientific Data 11,
259 (2024) https://doi.org/10.1038/s41597-024-03073-x

[27] Akter, N., Perry, S., Fletcher, J., Simunovic, M., Roy, M.: Automated artifacts
and noise removal from optical coherence tomography images using deep learning
technique. In: 2020 IEEE Symposium Series on Computational Intelligence
(SSCI), pp. 2536–2542 (2020). https://doi.org/10.1109/SSCI47803.2020.9308336

[28] Chung, H., Ye, J.C.: Score-based diffusion models for accelerated mri. Medical
Image Analysis 80, 102479 (2022) https://doi.org/10.1016/j.media.2022.102479

[29] Chambon, P., Bluethgen, C., Delbrouck, J.-B., Sluijs, R.V., Polacin, M.,
Chaves, J.M.Z., Abraham, T.M., Purohit, S., Langlotz, C.P., Chaudhari, A.:
Roentgen: Vision-language foundation model for chest x-ray generation (2022)
arXiv:2211.12737 [cs.CV]

[30] Sagers, L.W., Diao, J.A., Melas-Kyriazi, L., Groh, M., Rajpurkar, P., Adamson,
A.S., Rotemberg, V., Daneshjou, R., Manrai, A.K.: Augmenting medical image
classifiers with synthetic data from latent diffusion models (2023)
arXiv:2308.12453 [cs.CV]

[31] Sagers, L.W., Diao, J.A., Melas-Kyriazi, L., Groh, M., Rajpurkar, P., Adamson,
A.S., Rotemberg, V., Daneshjou, R., Manrai, A.K.: Augmenting medical image
classifiers with synthetic data from latent diffusion models (2023)
arXiv:2308.12453 [cs.CV]

[32] Wilde, B., Saha, A., Broek, R.P.G., Huisman, H.: Medical diffusion on a budget:
textual inversion for medical image generation (2023) arXiv:2303.13430 [cs.CV]

[33] Akrout, M., Gyepesi, B., Holló, P., Poór, A., Kincső, B., Solis, S., Cirone,
K., Kawahara, J., Slade, D., Abid, L., Kovács, M., Fazekas, I.: Diffusion-based
data augmentation for skin disease classification: Impact across original medical
datasets to fully synthetic images (2023) arXiv:2301.04802 [cs.LG]

[34] Pernias, P., Rampas, D., Richter, M.L., Pal, C.J., Aubreville, M.: Wuerstchen:
An efficient architecture for large-scale text-to-image diffusion models (2023)
arXiv:2306.00637 [cs.CV]

[35] Ho, J., Salimans, T.: Classifier-free diffusion guidance (2022) arXiv:2207.12598
[cs.LG]

[36] Li, S., Zhao, Y., Varma, R., Salpekar, O., Noordhuis, P., Li, T., Paszke, A., Smith,
J., Vaughan, B., Damania, P., Chintala, S.: Pytorch distributed: Experiences on
accelerating data parallel training (2020) arXiv:2006.15704 [cs.DC]

[37] Bar-Tal, O., Chefer, H., Tov, O., Herrmann, C., Paiss, R., Zada, S., Ephrat, A.,
Hur, J., Liu, G., Raj, A., Li, Y., Rubinstein, M., Michaeli, T., Wang, O., Sun, D.,
Dekel, T., Mosseri, I.: Lumiere: A space-time diffusion model for video generation
(2024) arXiv:2401.12945 [cs.CV]

[38] Qing, Z., Zhang, S., Wang, J., Wang, X., Wei, Y., Zhang, Y., Gao, C., Sang,
N.: Hierarchical spatio-temporal decoupling for text-to-video generation (2023)
arXiv:2312.04483 [cs.CV]

[39] Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V.,
Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba,
W., Howes, R., Huang, P.-Y., Li, S.-W., Misra, I., Rabbat, M., Sharma, V.,
Synnaeve, G., Xu, H., Jegou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P.:
DINOv2: Learning robust visual features without supervision (2023)
arXiv:2304.07193 [cs.CV]

[40] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans
trained by a two time-scale update rule converge to a local nash equilibrium
(2018) arXiv:1706.08500 [cs.LG]

[41] Unterthiner, T., Steenkiste, S., Kurach, K., Marinier, R., Michalski, M., Gelly, S.:
Towards accurate generative models of video: A new metric & challenges (2018)
arXiv:1812.01717 [cs.CV]

[42] Jang, S.-I., Lois, C., Thibault, E., Becker, J.A., Dong, Y., Normandin, M.D.,
Price, J.C., Johnson, K.A., Fakhri, G.E., Gong, K.: Taupetgen: Text-conditional
tau pet image synthesis based on latent diffusion models (2023)
arXiv:2306.11984 [cs.CV]

[43] Hamamci, I.E., Er, S., Sekuboyina, A., Simsar, E., Tezcan, A., Simsek, A.G.,
Esirgun, S.N., Almas, F., Dogan, I., Dasdelen, M.F., Prabhakar, C., Reynaud, H.,
Pati, S., Bluethgen, C., Ozdemir, M.K., Menze, B.: Generatect: Text-conditional
generation of 3d chest ct volumes (2024) arXiv:2305.16037 [cs.CV]

[44] Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality
assessment: From error visibility to structural similarity. IEEE Transactions on
Image Processing 13(4), 600–612 (2004)

