Segment Anything in Medical Images
Jun Ma1,2,3 , Yuting He4 , Feifei Li1 , Lin Han5 , Chenyu You6 ,
Bo Wang1,2,3,7,8*
1 Peter Munk Cardiac Centre, University Health Network, Toronto,
Canada.
2 Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, Canada.
8 UHN AI Hub, Toronto, Canada.
Abstract
Medical image segmentation is a critical component in clinical practice, facili-
tating accurate diagnosis, treatment planning, and disease monitoring. However,
existing methods, often tailored to specific modalities or disease types, lack gen-
eralizability across the diverse spectrum of medical image segmentation tasks.
Here we present MedSAM, a foundation model designed for bridging this gap
by enabling universal medical image segmentation. The model is developed on
a large-scale medical image dataset with 1,570,263 image-mask pairs, covering
10 imaging modalities and over 30 cancer types. We conduct a comprehensive
evaluation on 86 internal validation tasks and 60 external validation tasks, demon-
strating better accuracy and robustness than modality-wise specialist models. By
delivering accurate and efficient segmentation across a wide spectrum of tasks,
MedSAM holds significant potential to expedite the evolution of diagnostic tools
and the personalization of treatment plans.
Introduction
Segmentation is a fundamental task in medical imaging analysis, which involves iden-
tifying and delineating regions of interest (ROI) in various medical images, such as
organs, lesions, and tissues [1]. Accurate segmentation is essential for many clinical
applications, including disease diagnosis, treatment planning, and monitoring of dis-
ease progression [2, 3]. Manual segmentation has long been the gold standard for
delineating anatomical structures and pathological regions, but this process is time-
consuming, labor-intensive, and often requires a high degree of expertise. Semi- or
fully-automatic segmentation methods can significantly reduce the time and labor
required, increase consistency, and enable the analysis of large-scale datasets [4].
Deep learning-based models have shown great promise in medical image segmen-
tation due to their ability to learn intricate image features and deliver accurate
segmentation results across a diverse range of tasks, from segmenting specific anatom-
ical structures to identifying pathological regions [5]. However, a significant limitation
of many current medical image segmentation models is their task-specific nature. These
models are typically designed and trained for a specific segmentation task, and their
performance can degrade significantly when applied to new tasks or different types
of imaging data [6]. This lack of generality poses a substantial obstacle to the wider
application of these models in clinical practice. In contrast, recent advances in the
field of natural image segmentation have witnessed the emergence of segmentation
foundation models, such as Segment Anything Model (SAM) [7] and Segment Every-
thing Everywhere with Multi-modal prompts all at once [8], showcasing remarkable
versatility and performance across various segmentation tasks.
There is a growing demand for universal models in medical image segmentation:
models that can be trained once and then applied to a wide range of segmentation
tasks. Such models would not only exhibit heightened versatility in terms of model
capacity, but also potentially lead to more consistent results across different tasks.
However, the applicability of the segmentation foundation models (e.g., SAM [7])
to medical image segmentation remains limited due to the significant differences
between natural images and medical images. Essentially, SAM is a promptable seg-
mentation method that requires points or bounding boxes to specify the segmentation
targets. This resembles conventional interactive segmentation methods [4, 9–11] but
SAM has better generalization ability, while existing deep learning-based interactive
segmentation methods focus mainly on limited tasks and image modalities.
Many studies have applied the out-of-the-box SAM models to typical medical image
segmentation tasks [12–17] and other challenging scenarios [18–21]. For example, the
concurrent studies [22, 23] conducted a comprehensive assessment of SAM across a
diverse array of medical images, underscoring that SAM achieved satisfactory segmen-
tation outcomes primarily on targets characterized by distinct boundaries. However,
the model exhibited substantial limitations in segmenting typical medical targets with
weak boundaries or low contrast. In congruence with these observations, we further
introduce MedSAM, a refined foundation model that significantly enhances the seg-
mentation performance of SAM on medical images. MedSAM accomplishes this by
fine-tuning SAM on an unprecedented dataset with more than one million medical
image-mask pairs.
We thoroughly evaluate MedSAM through comprehensive experiments on 86 inter-
nal validation tasks and 60 external validation tasks, spanning a variety of anatomical
structures, pathological conditions, and medical imaging modalities. Experimen-
tal results demonstrate that MedSAM consistently outperforms the state-of-the-art
(SOTA) segmentation foundation model [7], while achieving performance on par with,
or even surpassing specialist models [1, 24] that were trained on the images from the
same modality. These results highlight the potential of MedSAM as a new paradigm
for versatile medical image segmentation.
Fig. 1 MedSAM is trained on a large-scale dataset and can handle diverse segmentation
tasks. The dataset covers a variety of anatomical structures, pathological conditions, and medical
imaging modalities. The magenta contours and mask overlays denote the expert annotations and
MedSAM segmentation results, respectively.
Results
MedSAM: a foundation model for promptable medical image
segmentation
MedSAM aims to fulfill the role of a foundation model for universal medical image
segmentation. A crucial aspect of constructing such a model is the capacity to accom-
modate a wide range of variations in imaging conditions, anatomical structures, and
pathological conditions. To address this challenge, we curated a diverse and large-scale
medical image segmentation dataset with 1,570,263 medical image-mask pairs, cover-
ing 10 imaging modalities, over 30 cancer types, and a multitude of imaging protocols
(Fig. 1, Supplementary Table 1-4). This large-scale dataset allows MedSAM to learn
a rich representation of medical images, capturing a broad spectrum of anatomies and
lesions across different modalities. Fig. 2a provides an overview of the distribution of
images across different medical imaging modalities in the dataset, ranked by their total
numbers. It is evident that Computed Tomography (CT), Magnetic Resonance Imag-
ing (MRI), and endoscopy are the dominant modalities, reflecting their ubiquity in
clinical practice. CT and MRI images provide detailed cross-sectional views of 3D body
structures, making them indispensable for non-invasive diagnostic imaging. Endoscopy,
albeit more invasive, enables direct visual inspection of organ interiors, proving invalu-
able for diagnosing gastrointestinal and urological conditions. Despite the prevalence
of these modalities, others such as ultrasound, pathology, fundus, dermoscopy, mam-
mography, and Optical Coherence Tomography (OCT) also hold significant roles in
clinical practice. The diversity of these modalities and their corresponding segmenta-
tion targets underscores the necessity for universal and effective segmentation models
capable of handling the unique characteristics associated with each modality.
Another critical consideration is the selection of the appropriate segmentation
prompt and network architecture. While the concept of fully automatic segmentation
foundation models is enticing, it is fraught with challenges that make it impractical.
One of the primary challenges is the variability inherent in segmentation tasks. For
example, given a liver cancer CT image, the segmentation task can vary depending
on the specific clinical scenario. One clinician might be interested in segmenting the
liver tumor, while another might need to segment the entire liver and surrounding
organs. Additionally, the variability in imaging modalities presents another challenge.
Modalities such as CT and MR generate 3D images, whereas others like X-Ray and
ultrasound yield 2D images. These variabilities in task definition and imaging modali-
ties complicate the design of a fully automatic model capable of accurately anticipating
and addressing the diverse requirements of different users.
Considering these challenges, we argue that a more practical approach is to develop
a promptable 2D segmentation model. The model can be easily adapted to specific
tasks based on user-provided prompts, offering enhanced flexibility and adaptability.
It is also able to handle both 2D and 3D images by processing 3D images as a series
of 2D slices. Typical user prompts include points and bounding boxes, and we show some segmentation examples with the different prompts in Supplementary Fig. 1. It can be seen that bounding boxes provide a less ambiguous spatial context for
the region of interest, enabling the algorithm to more precisely discern the target
Fig. 2 a, The number of medical image-mask pairs in each modality. b, MedSAM is a promptable
segmentation method where users can use bounding boxes to specify the segmentation targets. Source
data are provided as a Source Data file.
area. This stands in contrast to point-based prompts, which can introduce ambigu-
ity, particularly when proximate structures resemble each other. Moreover, drawing a
bounding box is efficient, especially in scenarios involving multi-object segmentation.
We follow the network architecture in SAM [7], including an image encoder, a prompt
encoder, and a mask decoder (Fig. 2b). The image encoder [25] maps the input image
into a high-dimensional image embedding space. The prompt encoder transforms the
user-drawn bounding boxes into feature representations via positional encoding [26].
Finally, the mask decoder fuses the image embedding and prompt features using
cross-attention [27] (Methods).
Fig. 3 Quantitative and qualitative evaluation results on the internal validation set. a,
Performance distribution of 86 internal validation tasks in terms of median Dice Similarity Coefficient
(DSC) score. The center line within the box represents the median value, with the bottom and top
bounds of the box delineating the 25th and 75th percentiles, respectively. Whiskers extend to 1.5 times the interquartile range. Up-triangles denote the minima and down-triangles denote
the maxima. b, Podium plots for visualizing the performance correspondence of 86 internal validation
tasks. Upper part: each colored dot denotes the median DSC achieved with the respective method
on one task. Dots corresponding to identical tasks are connected by a line. Lower part: bar charts
represent the frequency of achieved ranks for each method. MedSAM ranks in the first place on most
tasks. c, Visualized segmentation examples on the internal validation set. The four examples are liver
cancer, brain cancer, breast cancer, and polyp in Computed Tomography (CT), Magnetic Resonance Imaging (MRI), ultrasound, and endoscopy images, respectively. Blue: bounding box prompts; Yellow:
segmentation results. Magenta: expert annotations. Source data are provided as a Source Data file.
Fig. 3a summarizes the performance distribution of the 86 internal validation tasks, and Fig. 3b visualizes the performance correspondence across tasks with podium plots. In the upper part, each colored dot denotes the median DSC achieved with the respective method on one task, and dots corresponding to identical tasks are connected by a line. In the lower part, the frequency of achieved ranks for each method is presented with bar charts. It can be found that MedSAM ranked in first place on most tasks, surpassing the U-Net and DeepLabV3+ specialist models, which most frequently ranked second and third, respectively. In contrast, SAM ranked last on almost all tasks. Fig. 3c (and Supplementary Fig. 9) visualizes some randomly selected segmentation examples where MedSAM obtained a median DSC score, including liver tumor in CT images, brain tumor in MR images, breast tumor in ultrasound images, and polyp in endoscopy images. SAM struggles with targets that have weak boundaries and is prone to under- or over-segmentation errors. In contrast, MedSAM can accurately segment a wide range of targets across various imaging conditions, achieving performance comparable to or even better than the specialist U-Net and DeepLabV3+ models.
The external validation included 60 segmentation tasks, all of which were either from new datasets or involved unseen segmentation targets (Supplementary Table 9-11, Fig. 10-12). Fig. 4a and b show the task-wise median DSC score distribution and the performance correspondence across the 60 tasks, respectively. Although SAM continued
exhibiting lower performance on most CT and MR segmentation tasks, the specialist
Fig. 4 Quantitative and qualitative evaluation results on the external validation set. a,
Performance distribution of 60 external validation tasks in terms of median Dice Similarity Coefficient
(DSC) score. The center line within the box represents the median value, with the bottom and top
bounds of the box delineating the 25th and 75th percentiles, respectively. Whiskers extend to 1.5 times the interquartile range. Up-triangles denote the minima and down-triangles denote
the maxima. b, Podium plots for visualizing the performance correspondence of 60 external validation
tasks. Upper part: each colored dot denotes the median DSC achieved with the respective method
on one task. Dots corresponding to identical tasks are connected by a line. Lower part: bar charts
represent the frequency of achieved ranks for each method. MedSAM ranks in the first place on most
tasks. c, Visualized segmentation examples on the external validation set. The four examples are the
lymph node, cervical cancer, fetal head, and polyp in CT, MR, ultrasound, and endoscopy images,
respectively. Source data are provided as a Source Data file.
superior performance compared to SAM (Supplementary Fig. 14), highlighting its remarkable generalization ability.
Fig. 5 a, Scaling up the number of training images to one million can significantly improve the model
performance on both internal and external validation sets. b, MedSAM can be used to substantially
reduce the annotation time cost. Source data are provided as a Source Data file.
Discussion
We introduce MedSAM, a deep learning-powered foundation model designed for the
segmentation of a wide array of anatomical structures and lesions across diverse med-
ical imaging modalities. MedSAM is trained on a meticulously assembled large-scale
dataset comprised of over one million medical image-mask pairs. Its promptable config-
uration strikes an optimal balance between automation and customization, rendering
MedSAM a versatile tool for universal medical image segmentation.
Through comprehensive evaluations encompassing both internal and external val-
idation, MedSAM has demonstrated substantial capabilities in segmenting a diverse
array of targets and robust generalization abilities to manage new data and tasks.
Its performance not only significantly exceeds that of the existing state-of-the-art
segmentation foundation model, but also rivals or even surpasses specialist models.
By providing precise delineation of anatomical structures and pathological regions,
MedSAM facilitates the computation of various quantitative measures that serve as
biomarkers. For instance, in the field of oncology, MedSAM could play a crucial role
in accelerating the 3D tumor annotation process, enabling subsequent calculations
of tumor volume, which is a critical biomarker [29] for assessing disease progression
and response to treatment. Additionally, MedSAM provides a successful paradigm
for adapting natural image foundation models to new domains, which can be fur-
ther extended to biological image segmentation [30], such as cell segmentation in light
microscopy images [31] and organelle segmentation in electron microscopy images [32].
While MedSAM boasts strong capabilities, it does present certain limitations. One
such limitation is the modality imbalance in the training set, with CT, MRI, and
endoscopy images dominating the dataset. This could potentially impact the model’s
performance on less-represented modalities, such as mammography. Another limita-
tion is its difficulty in the segmentation of vessel-like branching structures because
the bounding box prompt can be ambiguous in this setting. For example, arteries and
veins share the same bounding box in eye fundus images. However, these limitations
do not diminish MedSAM’s utility. Since MedSAM has learned rich and representa-
tive medical image features from the large-scale training set, it can be fine-tuned to
effectively segment new tasks from less-represented modalities or intricate structures
like vessels.
In conclusion, this study highlights the feasibility of constructing a single founda-
tion model capable of managing a multitude of segmentation tasks, thereby eliminating
the need for task-specific models. MedSAM, as the inaugural foundation model in
medical image segmentation, holds great potential to accelerate the advancement of
new diagnostic and therapeutic tools, and ultimately contribute to improved patient
care [33].
Methods
Dataset curation and pre-processing
We curated a comprehensive dataset by collating images from publicly available medi-
cal image segmentation datasets, which were obtained from various sources across the
internet, including the Cancer Imaging Archive (TCIA) [34], Kaggle, Grand-Challenge,
Scientific Data, CodaLab, and segmentation challenges in the Medical Image Comput-
ing and Computer Assisted Intervention Society (MICCAI). All the datasets provided
segmentation annotations by human experts, which have been widely used in existing
literature (Supplementary Table 1-4). We incorporated these annotations directly for
both model development and validation.
The original 3D datasets consisted of Computed Tomography (CT) and Magnetic
Resonance (MR) images in DICOM, nrrd, or mhd formats. To ensure uniformity and compatibility for developing medical image deep learning models, we converted the images to the widely used NIfTI format. Additionally, grayscale images (such as X-Ray and ultrasound) as well as RGB images (including endoscopy, dermoscopy, fundus, and pathology images) were converted to the png format. Several exclusion criteria were applied to improve dataset quality and consistency, removing incomplete images, segmentation targets with branching structures, inaccurate annotations, and targets with tiny volumes. Notably, image intensities varied significantly across different modalities. For
instance, CT images had intensity values ranging from -2000 to 2000, while MR images
exhibited a range of 0 to 3000. In endoscopy and ultrasound images, intensity values
typically spanned from 0 to 255. To facilitate stable training, we performed intensity
normalization across all images, ensuring they shared the same intensity range.
For CT images, we initially normalized the Hounsfield units using typical window
width and level values. The employed window width and level values for soft tissues,
lung, and brain are (W:400, L:40), (W:1500, L:-160), and (W:80, L:40), respectively.
Subsequently, the intensity values were rescaled to the range of [0, 255]. For MR, X-
Ray, ultrasound, mammography, and Optical Coherence Tomography (OCT) images,
we clipped the intensity values to the range between the 0.5th and 99.5th percentiles
before rescaling them to the range of [0, 255]. Regarding RGB images (e.g., endoscopy,
dermoscopy, fundus, and pathology images), if they were already within the expected
intensity range of [0, 255], their intensities remained unchanged. However, if they fell
outside this range, we utilized max-min normalization to rescale the intensity values
to [0, 255]. Finally, to meet the model’s input requirements, all images were resized to
a uniform size of 1024 × 1024 × 3. In the case of whole-slide pathology images, patches
were extracted using a sliding window approach without overlap. Patches located on boundaries were zero-padded to this size. As for 3D CT and MR images, each 2D
slice was resized to 1024 × 1024, and the channel was repeated three times to maintain
consistency. The remaining 2D images were directly resized to 1024 × 1024 × 3. Bi-
cubic interpolation was used for resizing images, while nearest-neighbor interpolation
was applied for resizing masks to preserve their precise boundaries and avoid intro-
ducing unwanted artifacts. These standardization procedures ensured uniformity and
compatibility across all images and facilitated seamless integration into the subsequent
stages of the model training and evaluation pipeline.
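As a concrete illustration of these pre-processing rules, the following minimal sketch implements the intensity normalization and resizing steps with NumPy and OpenCV. It is an assumption-laden sketch rather than the authors' released pre-processing script; the function names are ours, and only the soft-tissue CT window is shown as a default.

```python
# Minimal pre-processing sketch (not the released script); window/percentile values
# follow the Methods text, function names are illustrative.
import numpy as np
import cv2

TARGET_SIZE = 1024

def normalize_ct(img, window_level=40, window_width=400):
    """Clip CT Hounsfield units to a display window, then rescale to [0, 255]."""
    lower = window_level - window_width / 2.0
    upper = window_level + window_width / 2.0
    img = np.clip(img.astype(np.float32), lower, upper)
    return (img - lower) / (upper - lower) * 255.0

def normalize_percentile(img, low=0.5, high=99.5):
    """Clip to the 0.5th-99.5th percentiles (MR, X-ray, ultrasound, ...) and rescale to [0, 255]."""
    lo, hi = np.percentile(img, [low, high])
    img = np.clip(img.astype(np.float32), lo, hi)
    return (img - lo) / max(hi - lo, 1e-8) * 255.0

def resize_pair(image_2d, mask_2d):
    """Bicubic resize for images, nearest-neighbor for masks; repeat grayscale to 3 channels."""
    image = cv2.resize(image_2d, (TARGET_SIZE, TARGET_SIZE), interpolation=cv2.INTER_CUBIC)
    mask = cv2.resize(mask_2d, (TARGET_SIZE, TARGET_SIZE), interpolation=cv2.INTER_NEAREST)
    if image.ndim == 2:  # grayscale slice -> three identical channels
        image = np.repeat(image[:, :, None], 3, axis=2)
    return np.clip(image, 0, 255).astype(np.uint8), mask.astype(np.uint8)
```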
Network architecture
The network utilized in this study was built on transformer architecture [27], which has
demonstrated remarkable effectiveness in various domains such as natural language
processing and image recognition tasks [25]. Specifically, the network incorporated a
vision transformer (ViT)-based image encoder responsible for extracting image fea-
tures, a prompt encoder for integrating user interactions (bounding boxes), and a mask
decoder that generated segmentation results and confidence scores using the image
embedding, prompt embedding, and output token.
To strike a balance between segmentation performance and computational effi-
ciency, we employed the base ViT model as the image encoder since extensive
evaluation indicated that larger ViT models, such as ViT Large and ViT Huge, offered
only marginal improvements in accuracy [7] while significantly increasing computa-
tional demands. Specifically, the base ViT model consists of 12 transformer layers [27],
with each block comprising a multi-head self-attention block and a Multilayer Percep-
tron (MLP) block incorporating layer normalization [35]. Pre-training was performed
using masked auto-encoder modeling [36], followed by fully supervised training on the
SAM dataset [7]. The input image (1024 × 1024 × 3) was reshaped into a sequence of flattened 2D patches with a patch size of 16 × 16 × 3, yielding an image embedding with a spatial size of 64 × 64 (a 16× downscaling) after passing through the image encoder. The prompt encoder mapped the corner points of the bounding box prompt to 256-dimensional vectorial embeddings [26]. In particular, each bounding box was
represented by an embedding pair of the top-left corner point and the bottom-right
corner point. To facilitate real-time user interactions once the image embedding had
been computed, a lightweight mask decoder architecture was employed. It consists
of two transformer layers [27] for fusing the image embedding and prompt encod-
ing, and two transposed convolutional layers to enhance the embedding resolution to
256 × 256. Subsequently, the embedding underwent sigmoid activation, followed by
bi-linear interpolation to match the original input size.
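To connect this description to code, the sketch below assembles a single promptable forward pass from the image encoder, prompt encoder, and mask decoder of the open-source segment_anything package. The checkpoint path is a placeholder and the pre-/post-processing is simplified, so this should be read as an illustrative sketch of the data flow rather than the exact released inference code.

```python
# Hedged inference sketch built on the segment_anything components; checkpoint path is illustrative.
import torch
import torch.nn.functional as F
from segment_anything import sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="medsam_vit_b.pth")  # placeholder checkpoint
sam.eval()

@torch.no_grad()
def segment_with_box(image_1024, box_xyxy, out_size):
    """image_1024: (1, 3, 1024, 1024) float tensor; box_xyxy: (1, 4) in 1024-scale coordinates."""
    image_embedding = sam.image_encoder(image_1024)              # (1, 256, 64, 64)
    sparse_emb, dense_emb = sam.prompt_encoder(
        points=None, boxes=box_xyxy[:, None, :], masks=None      # box -> two corner-point embeddings
    )
    low_res_logits, _ = sam.mask_decoder(
        image_embeddings=image_embedding,
        image_pe=sam.prompt_encoder.get_dense_pe(),
        sparse_prompt_embeddings=sparse_emb,
        dense_prompt_embeddings=dense_emb,
        multimask_output=False,
    )                                                            # (1, 1, 256, 256)
    prob = torch.sigmoid(low_res_logits)
    prob = F.interpolate(prob, size=out_size, mode="bilinear", align_corners=False)
    return (prob.squeeze().cpu().numpy() > 0.5).astype("uint8")
```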
The model was initialized with the pre-trained SAM model with the ViT-Base
model. We fixed the prompt encoder since it can already encode the bounding box
prompt. All the trainable parameters in the image encoder and mask decoder were
updated during training. Specifically, the number of trainable parameters for the image
encoder and mask decoder are 89,670,912 and 4,058,340, respectively. The bounding
box prompt was simulated from the expert annotations with a random perturbation
of 0-20 pixels. The loss function was the unweighted sum of Dice loss and cross-entropy loss, which has been proven to be robust in various segmentation tasks [1]. The network was optimized with the AdamW [37] optimizer (β1 = 0.9, β2 = 0.999), an initial learning rate of 1e-4, and a weight decay of 0.01. The global batch size was 160 and data augmentation was not used. The model was trained on 20 A100 (80G) GPUs for 150 epochs, and the last checkpoint was selected as the final model.
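The snippet below sketches two of these details: simulating a perturbed bounding-box prompt from an expert mask and configuring the AdamW optimizer over the trainable image-encoder and mask-decoder parameters (reusing the `sam` model from the earlier sketch). The helper name is ours and the training loop itself is omitted.

```python
# Illustrative box-prompt simulation and optimizer setup; not the released training script.
import numpy as np
import torch

def box_from_mask(mask_2d, max_shift=20):
    """Tight bounding box of a binary mask with a random 0-20 pixel perturbation per side."""
    ys, xs = np.where(mask_2d > 0)
    h, w = mask_2d.shape
    x_min = max(0, xs.min() - np.random.randint(0, max_shift + 1))
    y_min = max(0, ys.min() - np.random.randint(0, max_shift + 1))
    x_max = min(w - 1, xs.max() + np.random.randint(0, max_shift + 1))
    y_max = min(h - 1, ys.max() + np.random.randint(0, max_shift + 1))
    return np.array([x_min, y_min, x_max, y_max], dtype=np.float32)

# Prompt encoder stays frozen; only image encoder and mask decoder are updated.
trainable_params = list(sam.image_encoder.parameters()) + list(sam.mask_decoder.parameters())
optimizer = torch.optim.AdamW(trainable_params, lr=1e-4, betas=(0.9, 0.999), weight_decay=0.01)
```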
Furthermore, to thoroughly evaluate the performance of MedSAM, we conducted
comparative analyses against both the state-of-the-art segmentation foundation model
SAM [7] and specialist models (i.e., U-Net [1] and DeepLabV3+ [24]). The training
images covered 10 modalities: CT, MR, chest X-Ray (CXR), dermoscopy, endoscopy, fundus, ultrasound, mammography, OCT, and pathology, and we trained U-Net and DeepLabV3+ specialist models for each modality. There were 20 specialist models in total, and the number of corresponding training images is presented in Supplementary Table 5. We employed nnU-Net to conduct all U-Net experiments, which can automatically configure the network architecture based on the dataset properties. In
order to incorporate the bounding box prompt into the model, we transformed the
bounding box into a binary mask and concatenated it with the image as the model
input. This function was originally supported by nnU-Net in the cascaded pipeline,
which has demonstrated increased performance in many segmentation tasks by using
the binary mask as an additional channel to specify the target location. The training
settings followed the default configurations of 2D nnU-Net. Each model was trained on
one A100 GPU with 1000 epochs and the last checkpoint was used as the final model.
The DeepLabV3+ specialist models used ResNet50 [38] as the encoder. Similar to [3],
the input images were resized to 224×224×3. The bounding box was transformed into
a binary mask as an additional input channel to provide the object location prompt.
Segmentation Models Pytorch (0.3.3) [39] was used to perform training and inference
for all the modality-wise specialist DeepLabV3+ models. Each modality-wise model
was trained on one A100 GPU with 500 epochs and the last checkpoint was used as
the final model. During the inference phase, SAM and MedSAM were used to per-
form segmentation across all modalities with a single model. In contrast, the U-Net
and DeepLabV3+ specialist models were used to individually segment the respective
corresponding modalities.
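To make the prompt-injection scheme used for the specialist baselines concrete, the sketch below converts a bounding box into a binary mask and appends it to the image as an extra input channel; the function name is illustrative, and the exact nnU-Net/DeepLabV3+ data pipelines are not reproduced here.

```python
# Hedged sketch: encode the box prompt as an additional binary input channel.
import numpy as np

def add_box_channel(image_hwc, box_xyxy):
    """image_hwc: (H, W, C) array; box_xyxy: [x_min, y_min, x_max, y_max] in pixel coordinates."""
    h, w = image_hwc.shape[:2]
    box_mask = np.zeros((h, w, 1), dtype=image_hwc.dtype)
    x_min, y_min, x_max, y_max = [int(round(v)) for v in box_xyxy]
    box_mask[y_min:y_max + 1, x_min:x_max + 1] = 1
    return np.concatenate([image_hwc, box_mask], axis=2)  # (H, W, C + 1)
```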
A task-specific segmentation model might outperform a modality-based one for
certain applications. Since U-Net obtained better performance than DeepLabV3+ on
most tasks, we further conducted a comparison study by training task-specific U-Net
models on four representative tasks, including liver cancer segmentation in CT scans,
abdominal organ segmentation in MR scans, nerve cancer segmentation in ultrasound,
and polyp segmentation in endoscopy images. The experiments included both internal
validation and external validation. For internal validation, we adhered to the default
data splits, using them to train the task-specific U-Net models and then evaluate
their performance on the corresponding validation set. For external validation, the
trained U-Net models were evaluated on new datasets from the same modality or
segmentation targets. In all these experiments, MedSAM was directly applied to the
validation sets without additional fine-tuning. As shown in Supplementary Fig. 15,
while task-specific U-Net models often achieved great results on internal validation
sets, their performance diminished significantly for external sets. In contrast, MedSAM
maintained consistent performance across both internal and external validation sets.
This underscores MedSAM’s superior generalization ability, making it a versatile tool
in a variety of medical image segmentation tasks.
Loss function
We used the unweighted sum between cross-entropy loss and Dice loss [40] as the final
loss function since it has been proven to be robust across different medical image seg-
mentation tasks [41]. Specifically, let S, G denote the segmentation result and ground
truth, respectively. si , gi denote the predicted segmentation and ground truth of voxel
i, respectively. N is the number of voxels in the image I. Binary cross-entropy loss is
defined by
$$
L_{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[g_i \log s_i + (1-g_i)\log(1-s_i)\right], \quad (1)
$$
and Dice loss is defined by
$$
L_{Dice} = 1 - \frac{2\sum_{i=1}^{N} g_i s_i}{\sum_{i=1}^{N} g_i^2 + \sum_{i=1}^{N} s_i^2}. \quad (2)
$$
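A minimal PyTorch rendering of the unweighted sum of Eqs. (1) and (2) is given below, assuming logits and binary target masks of shape (B, 1, H, W); it is a sketch, not the authors' exact implementation.

```python
# Unweighted Dice + binary cross-entropy loss (sketch of Eqs. 1-2).
import torch
import torch.nn.functional as F

def dice_bce_loss(logits, target, eps=1e-6):
    """logits, target: (B, 1, H, W); target is a binary mask in {0, 1}."""
    target = target.float()
    bce = F.binary_cross_entropy_with_logits(logits, target)
    prob = torch.sigmoid(logits)
    intersection = (prob * target).sum(dim=(1, 2, 3))
    denominator = (prob ** 2).sum(dim=(1, 2, 3)) + (target ** 2).sum(dim=(1, 2, 3))
    dice = 1.0 - (2.0 * intersection + eps) / (denominator + eps)
    return bce + dice.mean()
```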
experts independently drew the long and short tumor axes as initial markers, which is
a common practice in tumor response evaluation. This process was executed every 3-10
slices from the top slice to the bottom slice of the tumor. Then, we applied MedSAM to segment the tumors based on these sparse linear annotations in three steps (a code sketch follows the list).
• Step 1. For each annotated slice, a rectangular binary mask that completely covers the linear label was generated.
• Step 2. For the unlabeled slices, rectangular binary masks were created by interpolating between the surrounding labeled slices.
• Step 3. We transformed the binary masks into bounding boxes and then fed them
along with the images into MedSAM to generate segmentation results.
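A hedged Python sketch of Steps 1-3 is given below: boxes are derived from the drawn axis endpoints on annotated slices and linearly interpolated for the slices in between before being passed to MedSAM as prompts; variable and function names are illustrative.

```python
# Sketch of turning sparse linear annotations into per-slice box prompts.
import numpy as np

def box_from_axes(axis_points):
    """axis_points: (N, 2) array of xy endpoints of the long/short axes on one slice."""
    x_min, y_min = axis_points.min(axis=0)
    x_max, y_max = axis_points.max(axis=0)
    return np.array([x_min, y_min, x_max, y_max], dtype=np.float32)

def interpolate_boxes(labeled_boxes):
    """labeled_boxes: {slice_index: box}; returns a box for every slice between the extremes."""
    indices = sorted(labeled_boxes)
    boxes = {}
    for lo, hi in zip(indices[:-1], indices[1:]):
        for z in range(lo, hi + 1):
            t = (z - lo) / max(hi - lo, 1)
            boxes[z] = (1 - t) * labeled_boxes[lo] + t * labeled_boxes[hi]
    return boxes  # each box is then fed to MedSAM with the corresponding slice
```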
All these steps were fully automatic, and the model running time was recorded for each case. Finally, human experts manually refined the segmentation results until they were satisfied. To summarize, the time cost of the second
group of annotations contained three parts: initial markers, MedSAM inference, and
refinement. All the manual annotation processes were based on ITK-SNAP [44], an
open-source software designed for medical image visualization and annotation.
Evaluation metrics
We followed the recommendations in Metrics Reloaded [45] and used Dice Similarity
Coefficient (DSC) and Normalized Surface Distance (NSD) to quantitatively evaluate the
segmentation results. DSC is a region-based segmentation metric, aiming to evaluate
the region overlap between expert annotation masks and segmentation results, which
is defined by
$$
DSC(G, S) = \frac{2|G \cap S|}{|G| + |S|},
$$
NSD [46] is a boundary-based metric, aiming to evaluate the boundary consensus
between expert annotation masks and segmentation results at a given tolerance, which
is defined by
$$
NSD(G, S) = \frac{|\partial G \cap B_{\partial S}^{(\tau)}| + |\partial S \cap B_{\partial G}^{(\tau)}|}{|\partial G| + |\partial S|},
$$
where $B_{\partial G}^{(\tau)} = \{x \in \mathbb{R}^3 \mid \exists\, \tilde{x} \in \partial G, \|x - \tilde{x}\| \le \tau\}$ and $B_{\partial S}^{(\tau)} = \{x \in \mathbb{R}^3 \mid \exists\, \tilde{x} \in \partial S, \|x - \tilde{x}\| \le \tau\}$ denote the border regions of the expert annotation surface and the segmentation surface at tolerance $\tau$, respectively. In this paper, we set the tolerance $\tau$ to 2.
Statistical analysis
To statistically analyze and compare the performance of the aforementioned four
methods (MedSAM, SAM, U-Net, and DeepLabV3+ specialist models), we employed
the Wilcoxon signed-rank test. This non-parametric test is well-suited for comparing
paired samples and is particularly useful when the data does not meet the assump-
tions of normal distribution. This analysis allowed us to determine if any method
demonstrated statistically superior segmentation performance compared to the others,
providing valuable insights into the comparative effectiveness of the evaluated meth-
ods. The Wilcoxon signed-rank test results are marked on the DSC and NSD score
tables (Supplementary Table 6-11).
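As an illustration, the paired test can be run per task with SciPy as below; the DSC arrays are random placeholders standing in for the per-case scores of two methods on the same cases.

```python
# Hedged example of the Wilcoxon signed-rank test on paired per-case DSC scores.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
medsam_dsc = rng.uniform(0.80, 0.95, size=100)  # placeholder per-case scores
sam_dsc = rng.uniform(0.60, 0.90, size=100)     # placeholder per-case scores

stat, p_value = wilcoxon(medsam_dsc, sam_dsc, alternative="greater")
print(f"Wilcoxon statistic = {stat:.1f}, p = {p_value:.2e}")
```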
Software Utilized
All code was implemented in Python (3.10) using PyTorch (2.0) as the base deep learning framework. We also used several Python packages for data analysis and results visualization, including connected-components-3d (3.10.3), SimpleITK (2.2.1), nibabel (5.1.0), torchvision (0.15.2), numpy (1.24.3), scikit-image (0.20.0), scipy (1.10.1), pandas (2.0.2), matplotlib (3.7.1), opencv-python (4.8.0), ChallengeR (1.0.5), and plotly (5.15.0). BioRender was used to create Fig. 1.
Data availability
The training and validation datasets used in this study are available in the public domain and can be downloaded via the links provided in Supplementary Table 16-17. Source data are provided with this paper in the Source Data file. We confirmed that all the image datasets in this study are publicly accessible and permitted for research purposes.
Code availability
The training script, inference script, and trained model are publicly available at https://github.com/bowang-lab/MedSAM. A permanent version is released
on Zenodo [? ].
Acknowledgments. This work was supported by the Natural Sciences and Engi-
neering Research Council of Canada (NSERC, RGPIN-2020-06189 and DGECR-2020-
00294) and CIFAR AI Chair programs. The authors of this paper highly appreciate
all the data owners for providing public medical images to the community. We also
thank Meta AI for making the source code of segment anything publicly available to
the community. This research was enabled in part by computing resources provided
by the Digital Research Alliance of Canada.
Author Contributions
Conceived and designed the experiments: J.M., Y.H., C.Y., B.W. Performed the experiments: J.M., Y.H., F.L., L.H., C.Y. Analyzed the data: J.M., Y.H., F.L., L.H., C.Y., B.W. Wrote the paper: J.M., Y.H., F.L., L.H., C.Y., B.W. All authors have read and agreed to the published version of the manuscript.
Competing Interests
The authors declare no competing interests.
References
[1] Isensee, F., Jaeger, P.F., Kohl, S.A., Petersen, J., Maier-Hein, K.H.: nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation.
Nature Methods 18(2), 203–211 (2021)
[2] De Fauw, J., Ledsam, J.R., Romera-Paredes, B., Nikolov, S., Tomasev, N., Black-
well, S., Askham, H., Glorot, X., O’Donoghue, B., Visentin, D., et al.: Clinically
applicable deep learning for diagnosis and referral in retinal disease. Nature
Medicine 24(9), 1342–1350 (2018)
[3] Ouyang, D., He, B., Ghorbani, A., Yuan, N., Ebinger, J., Langlotz, C.P., Heiden-
reich, P.A., Harrington, R.A., Liang, D.H., Ashley, E.A., et al.: Video-based ai for
beat-to-beat assessment of cardiac function. Nature 580(7802), 252–256 (2020)
[4] Wang, G., Zuluaga, M.A., Li, W., Pratt, R., Patel, P.A., Aertsen, M., Doel,
T., David, A.L., Deprest, J., Ourselin, S., et al.: Deepigeos: a deep interac-
tive geodesic framework for medical image segmentation. IEEE Transactions on
Pattern Analysis and Machine Intelligence 41(7), 1559–1572 (2018)
[5] Antonelli, M., Reinke, A., Bakas, S., Farahani, K., Kopp-Schneider, A., Landman,
B.A., Litjens, G., Menze, B., Ronneberger, O., Summers, R.M., et al.: The medical
segmentation decathlon. Nature Communications 13(1), 4128 (2022)
[6] Minaee, S., Boykov, Y., Porikli, F., Plaza, A., Kehtarnavaz, N., Terzopoulos, D.:
Image segmentation using deep learning: A survey. IEEE Transactions on Pattern
Analysis and Machine Intelligence 44(7), 3523–3542 (2021)
[7] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T.,
Whitehead, S., Berg, A.C., Lo, W.-Y., Dollar, P., Girshick, R.: Segment anything.
In: IEEE International Conference on Computer Vision, pp. 4015–4026 (2023)
[8] Zou, X., Yang, J., Zhang, H., Li, F., Li, L., Gao, J., Lee, Y.J.: Segment everything
everywhere all at once. In: Advances in Neural Information Processing Systems
(2023)
[9] Wang, G., Li, W., Zuluaga, M.A., Pratt, R., Patel, P.A., Aertsen, M., Doel, T.,
David, A.L., Deprest, J., Ourselin, S., et al.: Interactive medical image segmen-
tation using deep learning with image-specific fine tuning. IEEE Transactions on
Medical Imaging 37(7), 1562–1573 (2018)
[10] Zhou, T., Li, L., Bredell, G., Li, J., Unkelbach, J., Konukoglu, E.: Volumet-
ric memory network for interactive medical image segmentation. Medical Image
Analysis 83, 102599 (2023)
[11] Luo, X., Wang, G., Song, T., Zhang, J., Aertsen, M., Deprest, J., Ourselin, S.,
Vercauteren, T., Zhang, S.: Mideepseg: Minimally interactive segmentation of
unseen objects from medical images using deep learning. Medical image analysis
72, 102102 (2021)
[12] Deng, R., Cui, C., Liu, Q., Yao, T., Remedios, L.W., Bao, S., Landman, B.A.,
Tang, Y., Wheless, L.E., Coburn, L.A., Wilson, K.T., Wang, Y., Fogo, A.B.,
Yang, H., Huo, Y.: Segment anything model (SAM) for digital pathology: Assess
zero-shot segmentation on whole slide imaging. In: Medical Imaging with Deep
Learning, Short Paper Track (2023)
[13] Hu, C., Li, X.: When sam meets medical images: An investigation of segment
anything model (sam) on multi-phase liver tumor segmentation. arXiv preprint
arXiv:2304.08506 (2023)
[14] He, S., Bao, R., Li, J., Grant, P.E., Ou, Y.: Accuracy of segment-anything model
(sam) in medical image segmentation tasks. arXiv preprint arXiv:2304.09324
(2023)
[15] Wald, T., Roy, S., Koehler, G., Disch, N., Rokuss, M.R., Holzschuh, J., Zimmerer,
D., Maier-Hein, K.: SAM.MD: Zero-shot medical image segmentation capabilities
of the segment anything model. In: Medical Imaging with Deep Learning, Short
Paper Track (2023)
[16] Zhou, T., Zhang, Y., Zhou, Y., Wu, Y., Gong, C.: Can sam segment polyps?
arXiv preprint arXiv:2304.07583 (2023)
[17] Mohapatra, S., Gosai, A., Schlaug, G.: Sam vs bet: A comparative study for brain
extraction and segmentation of magnetic resonance images using deep learning.
arXiv preprint arXiv:2304.04738 (2023)
[18] Chen, J., Bai, X.: Learning to "segment anything" in thermal infrared images through knowledge distillation with a large scale dataset SATIR. arXiv preprint arXiv:2304.07969 (2023)
[19] Tang, L., Xiao, H., Li, B.: Can sam segment anything? when sam meets
camouflaged object detection. arXiv preprint arXiv:2304.04709 (2023)
[20] Ji, G.-P., Fan, D.-P., Xu, P., Zhou, B., Cheng, M.-M., Van Gool, L.: SAM struggles in concealed scenes – empirical study on "segment anything". Science China Information Sciences 66 (2023)
[21] Ji, W., Li, J., Bi, Q., Li, W., Cheng, L.: Segment anything is not always per-
fect: An investigation of sam on different real-world applications. arXiv preprint
arXiv:2304.05750 (2023)
[22] Mazurowski, M.A., Dong, H., Gu, H., Yang, J., Konz, N., Zhang, Y.: Segment
anything model for medical image analysis: an experimental study. Medical Image
Analysis 89, 102918 (2023)
[23] Huang, Y., Yang, X., Liu, L., Zhou, H., Chang, A., Zhou, X., Chen, R., Yu, J.,
Chen, J., Chen, C., Liu, S., Chi, H., Hu, X., Yue, K., Li, L., Grau, V., Fan, D.-P.,
Dong, F., Ni, D.: Segment anything model for medical images? Medical Image
Analysis 92, 103061 (2024)
[24] Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder
with atrous separable convolution for semantic image segmentation. In: Proceed-
ings of the European Conference on Computer Vision, pp. 801–818 (2018)
[25] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner,
T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is
worth 16x16 words: Transformers for image recognition at scale. In: International
Conference on Learning Representations (2020)
[26] Tancik, M., Srinivasan, P., Mildenhall, B., Fridovich-Keil, S., Raghavan, N., Sing-
hal, U., Ramamoorthi, R., Barron, J., Ng, R.: Fourier features let networks
learn high frequency functions in low dimensional domains. Advances in Neural
Information Processing Systems 33, 7537–7547 (2020)
[27] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N.,
Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural
Information Processing Systems, vol. 30 (2017)
[28] He, B., Kwan, A.C., Cho, J.H., Yuan, N., Pollick, C., Shiota, T., Ebinger, J.,
Bello, N.A., Wei, J., Josan, K., Duffy, G., Jujjavarapu, M., Siegel, R., Cheng,
S., Zou, J.Y., Ouyang, D.: Blinded, randomized trial of sonographer versus AI
cardiac function assessment. Nature 616(7957), 520–524 (2023)
[29] Eisenhauer, E.A., Therasse, P., Bogaerts, J., Schwartz, L.H., Sargent, D., Ford, R.,
Dancey, J., Arbuck, S., Gwyther, S., Mooney, M., et al.: New response evaluation
criteria in solid tumours: revised recist guideline (version 1.1). European Journal
of Cancer 45(2), 228–247 (2009)
[30] Ma, J., Wang, B.: Towards foundation models of biological image segmentation.
Nature Methods 20(7), 953–955 (2023)
[31] Ma, J., Xie, R., Ayyadhury, S., Ge, C., Gupta, A., Gupta, R., Gu, S., Zhang, Y.,
Lee, G., Kim, J., Lou, W., Li, H., Upschulte, E., Dickscheid, T., Almeida, J.G.,
Wang, Y., Han, L., Yang, X., Labagnara, M., Rahi, S.J., Kempster, C., Pollitt,
A., Espinosa, L., Mignot, T., Middeke, J.M., Eckardt, J.-N., Li, W., Li, Z., Cai,
X., Bai, B., Greenwald, N.F., Valen, D.V., Weisbart, E., Cimini, B.A., Li, Z.,
Zuo, C., Brück, O., Bader, G.D., Wang, B.: The multi-modality cell segmentation
challenge: Towards universal solutions. arXiv:2308.05864 (2023)
[32] Xie, R., Pang, K., Bader, G.D., Wang, B.: Maester: Masked autoencoder guided
segmentation at pixel resolution for accurate, self-supervised subcellular structure
recognition. In: IEEE Conference on Computer Vision and Pattern Recognition,
pp. 3292–3301 (2023)
[33] Bera, K., Braman, N., Gupta, A., Velcheti, V., Madabhushi, A.: Predicting cancer
outcomes with radiomics and artificial intelligence in radiology. Nature Reviews
Clinical Oncology 19(2), 132–146 (2022)
[34] Clark, K., Vendt, B., Smith, K., Freymann, J., Kirby, J., Koppel, P., Moore, S.,
Phillips, S., Maffitt, D., Pringle, M., et al.: The cancer imaging archive (TCIA):
maintaining and operating a public information repository. Journal of Digital
Imaging 26(6), 1045–1057 (2013)
[35] Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint
arXiv:1607.06450 (2016)
[36] He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders
are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
[37] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: Interna-
tional Conference on Learning Representations (2019)
[38] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recogni-
tion. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 770–778 (2016)
[40] Milletari, F., Navab, N., Ahmadi, S.-A.: V-net: Fully convolutional neural net-
works for volumetric medical image segmentation. In: International Conference
on 3D Vision (3DV), pp. 565–571 (2016)
[41] Ma, J., Chen, J., Ng, M., Huang, R., Li, Y., Li, C., Yang, X., Martel, A.:
Loss odyssey in medical image segmentation. Medical Image Analysis 71, 102035
(2021)
[42] Ahmed, A., Elmohr, M., Fuentes, D., Habra, M., Fisher, S., Perrier, N., Zhang,
M., Elsayes, K.: Radiomic mapping model for prediction of ki-67 expression in
adrenocortical carcinoma. Clinical Radiology 75(6), 479–17 (2020)
[43] Moawad, A.W., Ahmed, A.A., ElMohr, M., Eltaher, M., Habra, M.A., Fisher,
S., Perrier, N., Zhang, M., Fuentes, D., Elsayes, K.: Voxel-level segmentation of
pathologically-proven Adrenocortical carcinoma with Ki-67 expression (Adrenal-
ACC-Ki67-Seg) [Data set]. https://doi.org/10.7937/1FPG-VM46 (2023)
[44] Yushkevich, P.A., Gao, Y., Gerig, G.: Itk-snap: An interactive tool for semi-
automatic segmentation of multi-modality biomedical images. In: International
Conference of the IEEE Engineering in Medicine and Biology Society (EMBC),
pp. 3342–3345 (2016)
[45] Maier-Hein, L., Reinke, A., Godau, P., Tizabi, M.D., Büttner, F., et al.: Met-
rics reloaded: Pitfalls and recommendations for image analysis validation. arXiv
preprint arXiv:2206.01653 (2022)