Survey on Segment Anything Model
Abstract—Artificial intelligence (AI) is evolving towards artificial general intelligence, which refers to the ability of an AI system to
perform a wide range of tasks and exhibit a level of intelligence similar to that of a human being. This is in contrast to narrow or
specialized AI, which is designed to perform specific tasks with a high degree of efficiency. Therefore, it is urgent to design a general
class of models, which we term foundation models, trained on broad data that can be adapted to various downstream tasks. The
recently proposed segment anything model (SAM) has made significant progress in breaking the boundaries of segmentation, greatly
promoting the development of foundation models for computer vision. To fully comprehend SAM, we conduct a survey study. As the first to comprehensively review the progress of the segment anything task for vision and beyond based on the foundation model of SAM, this work focuses on its applications to various tasks and data types by discussing its historical development, recent progress, and profound impact on broad applications. We first introduce the background and terminology for foundation models including SAM, as well as state-of-the-art methods contemporaneous with SAM that are significant for the segment anything task. Then, we analyze and summarize the advantages and limitations of SAM across various image processing applications, including software scenes, real-world scenes, and complex scenes. Importantly, many insights are drawn to guide future research to develop more versatile foundation models and improve the architecture of SAM. We also summarize many other notable applications of SAM in vision and beyond. Finally, we maintain a continuously updated paper list and an open-source project summary for the foundation model SAM here.
Index Terms—Survey, Artificial General Intelligence, Foundation Models, Segment Anything, Open Source Projects.
1 INTRODUCTION
learned the general concept of what an object is, even for unknown objects, unfamiliar scenes (e.g., underwater and cell microscopy) and ambiguous cases" and demonstrated great potential as a fundamental model for CV [22], [28], [29].
Recently, a large number of extended works have been proposed by the community to explore the capability boundaries of SAM and apply it to various tasks, e.g., medical image analysis [30], [31], [32], [33], [34], [35], [36], [37], [38], image inpainting [39], image editing [40], style transfer [41], infrastructure detection [42], camouflaged object detection [29], mirror and transparent object detection [43], image captioning [44], audio-visual localization [45], video object tracking [46], 3D reconstruction [47], few-shot object counting [48], and adversarial attacks [49], [50]. Concurrent to SAM, Wang et al. [23] proposed a generalist model, namely SegGPT, to unify various segmentation tasks in an in-context learning framework, which has demonstrated strong zero-shot capabilities. Furthermore, Zou et al. [24] proposed a more general segmentation system, SEEM, by introducing more diverse prompts than SAM, including visual prompts (points, boxes, scribbles, masks), text prompts, and referring prompts (referred regions of another image). The authors claim that the unified prompt scheme in SEEM can encode different prompts into a joint visual-semantic space, producing strong zero-shot generalization to unseen user prompts for segmentation. Additionally, some pioneering works explore general AI methods for detecting or segmenting anything in open-vocabulary scenarios, e.g., Grounding DINO [51], OVSeg [52], V3Det [53], and OpenSeg [54]. These advancements have led many researchers to believe that versatile foundation models are a critical step towards artificial general intelligence (AGI) [55], [56], [57], [58].
To this end, this work provides a comprehensive survey of these works with the goal of helping researchers understand the latest developments related to SAM. This survey mainly focuses on various foundation models since SAM, especially the applications of SAM to various tasks and data types. Readers are referred to existing surveys [1], [2], [3], [59], [60], [61], [62], [63], [64], [65], [66], [67] for language, vision, and multimodal foundation models. To the best of our knowledge, this survey is the first to comprehensively review the recent progress of the segment anything task for vision and beyond based on the foundation model of SAM. Concurrent to our work, [33], [68] briefly summarized recent efforts to extend SAM to vision and medical image segmentation tasks; however, we provide a more comprehensive review with many new insights from a broader perspective. Furthermore, we maintain a continuously updated paper list and a project summary to reflect the dynamic progress of the foundation model of SAM during its development.
The remainder of this survey is organized as follows: Section 2 introduces the background and terminology for foundation models including SAM, as well as methods contemporaneous with SAM that are important for the segment anything task. Section 3 discusses works based on SAM for various image processing applications, including software scenes, real-world scenes, and complex scenes. Section 4 further discusses the follow-up works of SAM that extend SAM to vision-related applications, beyond vision, and in more directions. Finally, we conclude the survey in Section 5. This survey will be regularly updated to reflect the dynamic progress of the foundation model of SAM, as this is a rapidly evolving and promising field towards AGI.

2 BACKGROUND AND TERMINOLOGY

2.1 Image Segmentation

2.1.1 Classic Segmentation
Image segmentation is a fundamental computer vision task that separates a digital image into multiple parts by assigning each pixel to a class or object. Traditionally, segmentation includes three major tasks: semantic, instance, and panoptic. Semantic segmentation [69], [70], [71], [72] assigns each pixel a predefined semantic class label. Instance segmentation [73], [74], [75] further separates instances of the same class. Panoptic segmentation, proposed by [76], combines semantic and instance segmentation to understand scenes comprehensively. Researchers have fully explored the above tasks in past studies. Because these tasks all operate consistently at the pixel level, many studies have tried to use a unified framework to solve the three segmentation tasks simultaneously, such as K-Net [77], MaskFormer [78], and Mask2Former [79].

2.1.2 Interactive Segmentation
Interactive segmentation [80] is a particular segmentation task characterized by leveraging guidance from user interaction. Despite being a longstanding challenge, the problem has seen considerable improvement. Usually, the user provides some initial input, such as points, strokes, or bounding boxes, to indicate the rough location and shape of the object. Then, the algorithm iteratively refines the segmentation based on the user feedback, such as correcting mislabeled regions or adding missing parts. Interactive segmentation is useful for many applications that require precise object extraction, such as medical image analysis [81], [82], [83], photo editing [84], and data annotation [85], [86].

2.2 Foundation Models
Foundation models are a new paradigm for building artificial intelligence systems that can be adapted to various downstream tasks. They are based on training large neural networks on massive amounts of data, often using self-supervised learning techniques. This allows them to learn general representations and capabilities that can be transferred to different domains and applications. The term was coined by the Stanford Center for Research on Foundation Models (CRFM) in 2021 to capture the significance and challenges of this paradigm [1].
The development of foundation models can be traced back to the rise of deep learning and self-supervised learning in the NLP field, which enabled learning powerful representations from raw text data. Early examples of foundation models were pre-trained LLMs, such as BERT [4], T5 [5], and the GPT-n series [6], [7], [8], which demonstrated impressive capabilities and performance on a wide range of NLP tasks.
Fig. 1: Overview of the SA project, including the task, model, and data. The figure is borrowed from the original paper [20].

Fig. 2: Overall structure of SAM from the original paper [20].
In CV research, current foundation models try to take advantage of LLMs, which are trained on large-scale data and show superb performance in learning universal visual representations from diverse, large-scale image-text data. Representatives include CLIP [13], ALIGN [14], Florence [87], VLBERT [88], X-LXMERT [89], and DALL-E [19], which try to capture the cross-modal interactions between vision and language. They can be transferred to, or directly act on, classification, retrieval, object detection, video understanding, visual question-answering, image captioning, and image generation tasks. Recently, ImageBind [90] attempted to align the information of different modalities around image/video information and learn a unified embedding space, opening up further research on multimodal foundation models. Foundation models in computer vision and multimodal learning are still an active area of research, with many challenges and opportunities for improving their performance, robustness, interpretability, and social impact.

2.3 Segment Anything Model
SAM comes from the Segment Anything (SA) project of Meta in 2023 [20]. Observing that foundation models in the NLP and CV fields show strong performance, researchers tried to build a similar model that can unify the whole image segmentation task. However, the available data in the segmentation field is insufficient and differs from their design purpose. Therefore, as shown in Fig. 1, they divide the pathway into three steps, namely Task, Model, and Data. Correspondingly, a project for segmentation tasks is proposed, including the promptable segmentation task (prompts include providing a location, a range, a mask, or a text description of the segmentation target), the SAM model that can accept multiple prompt inputs and realize interactive use, and the dataset SA-1B, formed using the data engine of an interactive train-annotate loop and containing over one billion masks.

2.3.1 Task
The ultimate goal of the SA project is to provide a model with a wide range of functions that can be quickly adapted to many existing and new segmentation tasks (such as edge detection, object proposal generation, instance segmentation, and segmenting objects from free-form text), and many complex functions can be realized through simple combinations of existing tools. For example, if there is a bounding box detector for humans, human instance segmentation can be solved by providing the box output of the detector as a prompt to the model. Researchers take inspiration from LLMs to achieve this, using prompt engineering [6] to cover both pre-training and downstream tasks. More specifically, the concept of interactive segmentation is introduced to form the promptable task and realize the training of the model. A unique characteristic of the promptable task is returning a valid segmentation mask when given any segmentation prompt. A prompt can be anything indicating what to segment. A valid segmentation mask means that even if the input prompt is ambiguous (for example, a prompt point placed on the T-shirt of a person wearing a T-shirt), the output should be a reasonable mask for at least one object (returning the mask of either the person or the T-shirt is reasonable).

2.3.2 Model
The structure of SAM is shown in Fig. 2. It mainly consists of three parts: a powerful image encoder (an MAE [91] pre-trained ViT [92]); a prompt encoder, which is divided into sparse input (CLIP's [13] text encoder is used as a position encoder to process points, boxes, and text-form prompts) and dense input (convolutions process mask inputs); and a mask decoder (a prompt-image bidirectional Transformer decoder using self-attention and cross-attention). In addition, when the input prompts are ambiguous, the network will rank the three possible mask outputs based on confidence. The loss functions used in training include focal loss [93] and dice loss [94].
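For illustration, the prompting interface described above can be exercised with the officially released segment-anything Python package. The following minimal sketch assumes a locally downloaded ViT-H checkpoint and an example image; the file names are placeholders.

    import numpy as np
    import cv2
    from segment_anything import sam_model_registry, SamPredictor

    # Load a SAM checkpoint (path is a placeholder) and build the predictor.
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    predictor = SamPredictor(sam)

    # The image is embedded once; prompts can then be issued interactively.
    image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
    predictor.set_image(image)

    # A single foreground point serves as a sparse prompt. With
    # multimask_output=True, SAM returns three candidate masks with
    # predicted IoU scores, so an ambiguous prompt still yields a valid mask.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
        multimask_output=True,
    )
    best_mask = masks[np.argmax(scores)]

Box prompts are passed analogously via the box argument, which is how detector outputs can be chained into SAM as described in Section 2.3.1.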
2.3.3 Data
Since there is insufficient public data for training, researchers use a training-annotation iterative process to form a data engine that achieves model training and dataset construction simultaneously. The specific process can be divided into three stages.
1) Assisted-manual stage. Professional annotators use an interactive labeling tool in the browser, combined with SAM, for manual labeling. SAM is first trained on public datasets. As the data gradually increases, the size of the image encoder of SAM also increases. At the end of this stage, 4.3M masks and 120k images were collected.
2) Semi-automatic stage. To increase the diversity of the masks and improve the model's performance, the researchers first pre-filled the masks for which the model could make high-confidence predictions. Then they asked the annotators to annotate the remaining parts interactively. At the end of this stage, an image could provide an average of 72 masks.
3) Fully automatic stage. In this stage, thanks to the collection of enough masks and the introduction of the ambiguity-aware model, the final training of SAM and the acquisition of the SA-1B dataset could be performed. The ambiguity-aware model enables SAM to predict valid masks even when the prompt is ambiguous. Specifically, the researchers use a 32x32 grid to obtain prompt points uniformly on each image. If a prompt point is located on a part or sub-part structure of the target, the model returns the mask of the sub-part, the part, or the whole object, and the outputs are filtered and sorted based on confidence. At the end of this stage, the final SA-1B dataset contains 11M images and 1.1B masks.
With the advantages of a well-designed task, model structure, and massive high-quality training data, experiments show that the zero-shot transfer capability of SAM performs excellently in single-point-prompt segmentation, edge detection, object proposal generation, instance segmentation, interactive segmentation, and multimodal (text-to-mask) segmentation tasks. It even outperforms supervised models in some respects.
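The fully automatic stage can be mirrored at inference time with the automatic mask generator shipped in the same package. The sketch below (checkpoint path assumed as before) samples a uniform 32x32 point grid, keeps only confident and stable masks, and suppresses duplicates, which is conceptually the procedure used to build SA-1B.

    from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    mask_generator = SamAutomaticMaskGenerator(
        sam,
        points_per_side=32,           # uniform 32x32 prompt grid
        pred_iou_thresh=0.88,         # keep confident masks only
        stability_score_thresh=0.95,  # drop unstable predictions
        box_nms_thresh=0.7,           # suppress duplicate masks
    )
    # Each returned entry holds a binary mask plus its predicted IoU,
    # stability score, bounding box, and area.
    masks = mask_generator.generate(image)  # image: HxWx3 uint8 RGB array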
2.4 Concurrent Works
Parallel to SAM research, many efforts have been made to solve segmentation tasks with other general methods.
OneFormer [25] leverages a task-conditioned joint training strategy, a task token, and a query-text contrastive loss to form a universal image segmentation framework. OneFormer enables training on all three traditional segmentation tasks within a single universal model and a multi-task training process. With different backbones, it outperforms specialized models on the datasets of [95], Cityscapes [96], and COCO [97], while costing much less training time and resources.
Meanwhile, SegGPT [23], standing for Segmenting everything with a Generalist Painter, explores an in-context training and inference scheme. It forms a generalist in-context learning framework [98] that unifies different segmentation data formats and treats the training process as an in-context coloring problem with a random coloring scheme instead of a predefined color space. This training process requires the model to focus on contextual information to accomplish the specific task. Based on these improvements, the model can perform arbitrary segmentation tasks on input images or videos through in-context inference.
Also, SEEM [24] further broadens the scope of task applicability of a single segmentation model. It further expands the types of supported prompts, including points, boxes, scribbles, masks, texts, and referred regions of another image. With the proposed joint visual-semantic space, the model can compose flexible multi-prompt inputs. SEEM can also act as a classic segmentation model when no prompt is provided. However, it still suffers from limited training data and the absence of support for part-based segmentation.

3 SAM FOR IMAGE PROCESSING

3.1 Software Scenes

3.1.1 Image Editing
Modern software scenes require operations for image editing and inpainting, e.g., removing objects, filling objects, and replacing objects. However, existing inpainting works, like [99], [100], [101], [102], need fine annotations for each mask to achieve good performance, which is labor-intensive. SAM [20], which can generate accurate masks with simple prompts such as points or boxes, can help assist image editing scenes.

Fig. 3: Overall pipeline of Inpaint Anything (IA). The input image is segmented by SAM and the targeted segment is replaced by the output of the inpainting models to achieve different tasks. The figure is borrowed from the original paper [39].

Inpaint Anything (IA) [39] designs a pipeline to solve inpainting-related problems by combining the advantages of SAM, state-of-the-art (SOTA) image inpainters [99], and AI-generated content (AIGC) models [103]. The pipeline is illustrated in Fig. 3. For object removal, the pipeline is composed of SAM and SOTA inpainters, like LaMa [99]. The clicking action from the user is used as a prompt in SAM
to generate a mask for the object area; after corrosion and dilation operations on the mask, LaMa fills the region. For object filling and replacing, AIGC models, like Stable Diffusion (SD) [103], are used in the second step to fill the selected object area with newly generated objects driven by text prompts.
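A minimal sketch of this click-to-remove loop is given below. SamPredictor is the real SAM interface, whereas inpaint_with_lama stands in for any off-the-shelf inpainting model such as LaMa; it is a hypothetical wrapper, not IA's actual implementation.

    import numpy as np
    import cv2

    def remove_object(predictor, image, click_xy, inpaint_with_lama):
        # 1) One user click is the only prompt SAM needs to produce a mask.
        predictor.set_image(image)
        masks, scores, _ = predictor.predict(
            point_coords=np.array([click_xy]),
            point_labels=np.array([1]),
            multimask_output=True,
        )
        mask = masks[np.argmax(scores)].astype(np.uint8)

        # 2) Erode/dilate the mask so the hole fully covers the object border.
        kernel = np.ones((15, 15), np.uint8)
        mask = cv2.dilate(cv2.erode(mask, kernel), kernel, iterations=2)

        # 3) Hand the image and the hole to the inpainter.
        return inpaint_with_lama(image, mask)

For object filling or replacing, the same mask would instead be passed to a text-conditioned generative model such as Stable Diffusion.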
Fig. 4: Overall pipeline of Edit Everything from the original paper [40].

A similar idea can also be seen in Edit Everything [40]. As shown in Fig. 4, it allows users to edit images using simple text instructions. Specifically, given an input image, SAM first separates it into several segments without prompts; a source prompt then instructs CLIP to rank the received segments. Only the segment with the highest score is selected as the target to be replaced with a newly generated object produced by SD from the target prompt. Compared with the object-replacing solution in IA, the authors train a CLIP with 400 million parameters and an SD with 1 billion parameters for Chinese scenarios to make the pipeline more reliable with Chinese text prompts. Furthermore, the paper improves the realism of the edited image by breaking complex prompts down into smaller entities that are replaced in a sequential manner. Although it performs well as a novel tool, the paper points out that it still needs specific enhancements in different scenarios.
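The CLIP-based ranking of SAM segments can be sketched as follows with the OpenAI CLIP package (Edit Everything itself uses a CLIP variant trained for Chinese prompts, so this is only an approximation); segment_crops is assumed to be a list of PIL image crops produced from SAM's masks.

    import torch
    import clip  # OpenAI CLIP; Edit Everything trains its own Chinese variant

    def rank_segments_by_prompt(segment_crops, source_prompt, device="cuda"):
        """Return SAM segment crops sorted by CLIP similarity to the prompt."""
        model, preprocess = clip.load("ViT-B/32", device=device)
        text = clip.tokenize([source_prompt]).to(device)
        images = torch.stack([preprocess(c) for c in segment_crops]).to(device)
        with torch.no_grad():
            img_feat = model.encode_image(images)
            txt_feat = model.encode_text(text)
            img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
            txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
            scores = (img_feat @ txt_feat.T).squeeze(1)
        order = scores.argsort(descending=True)
        return [segment_crops[i] for i in order], scores[order]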
2) Obtain the style and content masks with SAM and input prompts.
3) Fuse the attention map with the mask-controlling signals from the last step.
4) Compute the stylized feature with the updated attention map and derive the final result.
With the defined pipeline, the paper proves that the proposed method is a plug-and-play component for existing style transfer methods, including local-transformation-based style transfer [104], [105], global-transformation-based style transfer [106], and diffusion-based style transfer [107], which shows its great potential for broad applications.

3.2 Real-World Scenes

3.2.1 Detection
SAM holds the ability to assist applications in many real-world scenes such as real-world object detection, object counting, and moving object detection. Recently, [108] evaluated the performance of SAM across a diverse range of real-world segmentation scenarios, e.g., natural image, agriculture, manufacturing, remote sensing, and healthcare scenes. The paper finds that it has excellent generalization on common scenes like natural images, while it shows less effectiveness in low-contrast scenes and requires strong prior knowledge in complex scenes.
tool, as the crater shapes focus on circular or elliptical forms. Craters are one of the most important morphological features in planetary exploration, and detecting and counting them is an important but time-consuming task in planetary science. Although existing works in machine learning and computer vision successfully solve some specific problems in crater detection, they rely on specific types of data and thus fail to work well on a different data source.
In [110], the authors propose a universal crater detection scheme based on the zero-shot generalization of SAM to unfamiliar objects. The pipeline uses SAM to segment the input images, with no restrictions on the data type and resolution. Then, it utilizes circular-elliptical indexes to filter out the segmentation masks that are not of circular-elliptical shape. Finally, a post-processing filter is employed to get rid of duplicates, artifacts, and false positives. The pipeline shows great potential to become a general tool in this field, and the authors also discuss the drawback that only specific shapes can be recognized.
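A circular-elliptical filter of the kind described can be approximated from each SAM mask's area and perimeter. The 4*pi*area/perimeter^2 circularity measure and the 0.8 threshold below are illustrative choices rather than the exact indexes used in [110].

    import cv2
    import numpy as np

    def keep_circular_masks(masks, circularity_thresh=0.8):
        """Keep only SAM masks whose largest contour is roughly circular."""
        kept = []
        for m in masks:
            seg = m["segmentation"].astype(np.uint8)
            contours, _ = cv2.findContours(seg, cv2.RETR_EXTERNAL,
                                           cv2.CHAIN_APPROX_SIMPLE)
            if not contours:
                continue
            c = max(contours, key=cv2.contourArea)
            area, perim = cv2.contourArea(c), cv2.arcLength(c, True)
            if perim == 0:
                continue
            circularity = 4.0 * np.pi * area / (perim * perim)  # 1.0 = circle
            if circularity >= circularity_thresh:
                kept.append(m)
        return kept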
3.2.2 Counting
Few-shot object counting is an important application of computer vision in the real world, which counts objects of an unseen category given only a few bounding boxes of exemplars. Since SAM shows impressive performance and great generalization to unseen objects, it shows potential to be used for few-shot object counting. [48] is the first paper testing SAM on this task and comparing it with other baseline few-shot counting methods. The goal is to find out whether SAM can segment and distinguish the target objects using the reference exemplars.
To realize this, the authors define a pipeline. Firstly, they calculate the dense image feature with an image encoder, i.e., ViT-H. Secondly, they use bounding boxes as prompts to generate segmentation masks for the reference exemplars, which are then combined with the dense image feature to obtain feature vectors of the reference objects. Thirdly, they use point grids as prompts to segment everything and generate feature vectors for all masks. After that, they compute the cosine similarity between the feature vectors of the predicted masks and those of the reference exemplars; only the masks with similarity larger than a pre-defined threshold are considered target objects. The paper compares the proposed SAM-based method with other few-shot counting methods on two datasets, FSC-147 [111] and MS-COCO [112], and finds that it falls behind the SOTA baselines, especially for small and congested objects. Thus, further improvement of SAM in such special scenes is still needed.
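The matching step of this counting pipeline can be sketched as a cosine-similarity test between mask-pooled features; the feature extraction itself is omitted, and both inputs as well as the threshold are assumptions for illustration.

    import torch
    import torch.nn.functional as F

    def count_by_reference(all_mask_feats, ref_feats, sim_thresh=0.8):
        """Count masks whose pooled feature matches any reference exemplar.

        all_mask_feats: (N, C) feature vectors, one per grid-prompted mask.
        ref_feats:      (K, C) feature vectors of the box-prompted exemplars.
        """
        all_mask_feats = F.normalize(all_mask_feats, dim=-1)
        ref_feats = F.normalize(ref_feats, dim=-1)
        sim = all_mask_feats @ ref_feats.T    # (N, K) cosine similarities
        best = sim.max(dim=1).values          # best exemplar match per mask
        is_target = best > sim_thresh         # threshold is illustrative
        return int(is_target.sum()), is_target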
3.2.3 Moving Object
Moving object segmentation (MOS) is a crucial task for the application of computer vision in many real-world scenarios, such as autonomous driving. The existing datasets for this research are mainly RGB or LiDAR videos, which lack the event information that can help understand dynamic scenes better. To fill this gap, DSEC-MOS [113] is proposed with the frames of moving vehicles and the corresponding event data, which can facilitate the development of more accurate, robust, and efficient algorithms for autonomous driving. The contribution of SAM to the DSEC-MOS annotation is that it provides a promptable segmentation way: the authors apply the moving object bounding boxes in DSEC-MOD [114] as prompts to generate a large number of preparatory masks with SAM, which is accurate and reliable. The dataset contains 16 sequences with 13,314 frames in total and provides event-based data with pixel-level annotations, which can be a valuable resource for the MOS field.

Fig. 7: Framework of SAA+ [115], which achieves zero-shot anomaly segmentation by introducing hybrid prompts as a regularization technique. The figure is borrowed from the original paper [115].

3.3 Complex Scenes

3.3.1 Low-Contrast Scene
Beyond the normal scenes mentioned above, whether SAM can solve the segmentation issue in complex scenes, like low-contrast scenes, is also a meaningful question for broadening its applications. To find out the generalization ability of SAM in more complex scenes, Ji et al. [22] quantitatively compare it with cutting-edge models in three concealed scenes, i.e., camouflaged animals, industrial defects, and medical lesions. They conduct experiments on three camouflaged object segmentation (COS) datasets, i.e., CAMO [116] with 250 samples, COD10K [117] with 2026 samples, and NC4K [118] with 4121 samples, and compare SAM with the outstanding transformer-based models CamoFormer-P/S [119] and HitNet [120]. They observe from the results that SAM does not look skillful in concealed scenes and point out that a potential solution may rely on the support of prior knowledge in the specific fields. The same conclusion can also be drawn from [29], where the authors compare SAM with 22 SOTA methods for camouflaged object detection on the same three datasets mentioned above.
Dominic et al. [121] utilize SAM in the field of plant phenotyping, e.g., potato leaf segmentation, by proposing a method named Leaf Only SAM. Specifically, they combine SAM with four post-processing steps to identify only leaf objects without any training data. After receiving the segmentation masks from SAM, they first check the color by finding the green masks. To reduce ambiguity, they further check for the main object by keeping masks with an IoU of more than 90% and removing the rest. Then, they compare the area of the minimum enclosing circle to filter out masks with incorrect shapes. Finally, they remove multi-leaf masks by summing all the mask objects in the image and labeling each pixel with the number of masks it belongs to; a mask with a mean score over 1.5 is assumed to be a duplicate. The experiments compare Leaf Only SAM with a fine-tuned Mask R-CNN and find that Leaf Only SAM does not outperform the latter, with just a 10% performance drop, but its training-free ability shows its potential for this field.
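The post-processing of Leaf Only SAM can be sketched as below; the thresholds follow the description above, while helper details such as the HSV green range are assumptions, and the main-object IoU check is omitted for brevity.

    import cv2
    import numpy as np

    def leaf_only_filter(image_bgr, masks, dup_score=1.5):
        """Keep leaf-like SAM masks: green, compactly shaped, non-duplicate."""
        hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
        green = cv2.inRange(hsv, (35, 40, 40), (85, 255, 255)) > 0  # assumed range

        kept = []
        for m in masks:
            seg = m["segmentation"]
            ys, xs = np.nonzero(seg)
            if len(xs) == 0:
                continue
            # 1) Colour check: the mask should be mostly green pixels.
            if green[seg].mean() < 0.5:
                continue
            # 2) Shape check: compare mask area with its minimum enclosing circle.
            pts = np.stack([xs, ys], axis=1).astype(np.float32)
            _, radius = cv2.minEnclosingCircle(pts)
            if seg.sum() / (np.pi * radius ** 2 + 1e-6) < 0.3:  # illustrative ratio
                continue
            kept.append(seg)

        # 3) Duplicate check: drop masks whose pixels are covered by more than
        #    dup_score masks on average, i.e., likely multi-leaf or repeated masks.
        if kept:
            coverage = np.sum(np.stack(kept), axis=0)
            kept = [s for s in kept if coverage[s].mean() <= dup_score]
        return kept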
and can be addressed through modest changes in SAM’s computer processing to create cross-sectional images (slices)
design. Similar observations and ideas are proposed in [141]. of the bones, blood vessels, and soft tissues inside your
The authors adopt SAM to the geological mapping task with body. This paper [145] presents a preliminary examination
remote sensing images of the Mars surface and find that it of SAM as an annotation tool for medical image analysis,
cannot be directly applied to domain-specific tasks due to a specifically for the segmentation of multi-phase liver tumors
lack of problem-specific bias. Thus, they change the SAM’s (MPLiTS). Their investigation focused on the prompts used,
design by introducing a domain-specific decoder, which can data resolution, and phases. The experimental findings
learn the problem-specific semantics by fine-tuning through demonstrate the effectiveness of SAM in this context, while
knowledge distillation with only five labeled images and also highlighting areas where MPLiTS could be improved.
achieve the same performance as the Mask-RCNN [142] To provide comprehensive guidance to the MPLiTS commu-
trained with over 400 labeled images. nity, the authors plan to conduct further investigations that
The unlabeled remote sensing dataset issue can also be encompass a wider range of aspects. The purpose of this
solved with SAM. With the small and intensive characters paper [146] is to perform an initial assessment of SAM’s out-
on objects in remote sensing images, they are hard and of-the-box zero-shot capabilities for medical image segmen-
cost-inefficient to be labeled by human experts. Thus, the tation by evaluating its performance on an abdominal CT
paper [143] develops an efficient pipeline with SAM to organ segmentation task using point or bounding box-based
generate a large-scale remote sensing segmentation dataset prompting. The findings indicate that SAM can generalize
named SAMRS. It will be detailed in the data annotations effectively to CT data, which could potentially accelerate
section. The paper [141] mentioned above also explores the the development of semi-automatic segmentation tools for
applicability of their proposed SAMs decoder for annotation clinicians. SAMed [147] is the proposed solution for medical
and finds that the bounding box-based segmentation mode image segmentation, which differs from previous methods
is more suitable for rapid annotation. Apart from that, some in that it leverages the SAM, a large-scale image segmen-
researchers combine the advantages of different foundation tation model. This approach involves customizing the SAM
models [144], like SAM and Grounding DINO, to achieve model for medical image segmentation by applying the low-
text-prompt guiding segmentation on remote sensing im- rank-based (LoRA) finetuning strategy to the SAM image
ages and prove its effectiveness in this field. The details are encoder. Unlike SAM, SAMed can perform better for se-
illustrated in the vision and language section. mantic segmentation tasks on medical images. The trained
SAMed model achieves comparable performance to SOTA
4 OTHER A PPLICATIONS : V ISION AND B EYOND methods. Additionally, since SAMed only updates a small
fraction of the SAM parameters, its deployment and storage
4.1 Vision Related Applications
costs are minimal in practical usage.
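SAMed's central idea, low-rank adaptation of the frozen SAM image encoder, can be illustrated with a generic LoRA wrapper around the attention projections of the encoder's ViT blocks. This is a schematic re-implementation under the standard segment-anything module layout, not the authors' code.

    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Wrap a frozen linear layer with a trainable low-rank update."""
        def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False          # original SAM weights stay frozen
            self.lora_a = nn.Linear(base.in_features, rank, bias=False)
            self.lora_b = nn.Linear(rank, base.out_features, bias=False)
            nn.init.zeros_(self.lora_b.weight)   # start as a zero update
            self.scale = alpha / rank

        def forward(self, x):
            return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

    def add_lora_to_sam_encoder(sam, rank=4):
        # Replace the qkv projection of every ViT block in the image encoder,
        # so that only the small LoRA matrices (and the decoder) are trained.
        for blk in sam.image_encoder.blocks:
            blk.attn.qkv = LoRALinear(blk.attn.qkv, rank=rank)
        return sam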
For MRI images. MRI is a non-invasive diagnostic imaging technique that uses a powerful magnetic field, radio waves, and a computer to produce detailed images of internal structures in the body. MRI is commonly used to visualize the brain, spine, joints, and other soft tissues. The study [148] compares SAM with FSL's Brain Extraction Tool (BET), a widely used and current gold-standard brain extraction technique, on a variety of brain scans with varying image qualities, MR sequences, and brain lesions affecting different brain regions. The findings show that SAM outperforms BET based on average Dice coefficient, IoU, and accuracy metrics, particularly in cases where image quality is compromised by signal inhomogeneities, non-isotropic voxel resolutions, or the presence of brain lesions near or involving the outer regions of the brain and meninges. Furthermore, SAM has superior segmentation properties, enabling a fine-grained separation of different tissue compartments and brain structures. These results suggest that SAM has the potential to become a more accurate, robust, and versatile tool for a broad range of brain extraction and segmentation applications. The paper [149] demonstrates that SAM can achieve high segmentation accuracy for brain tumor MRI datasets in a point-to-mask setting and effectively generalizes to brain tumor MRI datasets, achieving segmentation accuracy similar to what was observed in the 2D photographs where it was previously evaluated. Furthermore, the authors identify the challenges encountered when using SAM for tumor segmentation in MRI datasets and propose strategies to address them, which can also be applied in the context of clinical implementation.
Fig. 10: Outline of SAM for medical images, including Computerized Tomography (CT) images, Magnetic Resonance
Imaging (MRI) images, colonoscopy images, multiple format images, H&E stained histological sections images, and others.
not have a predefined structure like grids or lattices. These graphs can represent a wide range of data, including social networks, citation networks, e-commerce product graphs, and molecule graphs. Due to the complexity and heterogeneity of these graphs, developing a foundation model for universal graph analysis has become a challenging task. In recent years, several research works have been con-
[187] proposes the Graph Convolutional Network (GCN),
SAGE method proposed by Hamilton et al. [188] enables the
programs that form a comprehensive perception, planning, and action loop for robotic tasks.
come from existing annotations or be derived from a scene text detection model.
pixel-level segmentation masks for text instances, resulting in more accurate and fine-grained annotations.
In this context, the SAMText [180] pipeline shown in Fig. 18 offers a scalable and efficient solution for generating mask annotations for video text spotting tasks. By applying the SAM model to bounding box annotations, SAMText is able to generate mask annotations for large-scale video text datasets, as demonstrated by the SAMText-9M dataset.
While SAMText is a novel approach for generating mask annotations for video text spotting tasks, it builds upon the foundation laid by the SAM model. Specifically, the SAM model's ability to generate high-quality pixel-level masks for objects in images has been adapted to the specific task of generating masks for text instances in video frames.
Given an input scene text image or video frame, SAMText begins by extracting the bounding box coordinates from existing annotations or deriving them from a scene text detection model. If the boxes are oriented, SAMText calculates their minimum bounding rectangle to obtain the horizontal bounding boxes (HBB), which are then used as the input prompt for the SAM model to obtain mask labels. The SAM model is a segmentation model that is pre-trained on natural images and fine-tuned on the COCO-Text dataset to generate mask annotations for text instances.
After obtaining the mask for each text instance, post-processing may be necessary to ensure its connectivity. In particular, if a mask comprises several segments, it may be desirable to derive a minimum enclosing mask as an optional step in order to achieve a more cohesive representation. Furthermore, optical flow estimation can also be utilized to enhance the accuracy of the generated masks and ensure temporal consistency.
The SAMText pipeline provides an exciting avenue for future research in video text spotting. By providing fine-grained mask annotations for large-scale datasets, SAMText enables the development and evaluation of more accurate and effective video text spotting models. Additionally, the SAMText approach may inspire the development of new segmentation-based approaches for other computer vision tasks.
Overall, the SAMText pipeline represents an important contribution to the field of video text spotting, offering a scalable and efficient solution for generating fine-grained mask annotations. The approach holds promise for advancing the accuracy and effectiveness of video text spotting.
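The box-to-mask step of SAMText can be sketched as follows, where predictor is a SamPredictor that has already embedded the current frame and oriented_boxes is a list of quadrilateral corner arrays; the conversion shown is the generic axis-aligned bounding rectangle.

    import numpy as np
    import cv2

    def text_boxes_to_masks(predictor, oriented_boxes):
        """oriented_boxes: list of (4, 2) arrays of quadrilateral corners."""
        masks = []
        for quad in oriented_boxes:
            # Minimum bounding rectangle -> horizontal bounding box (HBB).
            pts = np.round(np.asarray(quad)).astype(np.int32)
            x, y, w, h = cv2.boundingRect(pts)
            box = np.array([x, y, x + w, y + h])
            # The HBB serves as the box prompt for SAM.
            m, _, _ = predictor.predict(box=box, multimask_output=False)
            masks.append(m[0])
        return masks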
Fig. 19: The pipeline begins by using a text prompt as input for Grounding DINO, which generates bounding boxes. These bounding boxes are then passed to SAM, which produces a segmentation map. In the second step, a text prompt is given to CLIP Surgery, resulting in a heatmap. This heatmap is sampled to create point prompts for SAM, which generates segmentation masks. Finally, SAM is used to generate segmentation maps, and CLIP is employed to compare the semantic similarity between these maps and the text prompt. The figure is borrowed from the original paper [144].

SAM also benefits vision and language tasks, such as image captioning and text-based segmentation. InternGPT [191] is a system that combines chatbots capable of planning and reasoning with non-verbal cues like pointing movements, allowing users to manipulate images and videos on a screen directly. The system uses SAM to segment objects. In the study [144], the authors propose a pipeline called Text2Seg, which leverages multiple visual foundation models to facilitate remote sensing image semantic segmentation tasks guided by text prompts. The authors focus on the remote sensing domain, where images are notably dissimilar from those in conventional scenarios, and traditional models often underperform when confronted with testing data collected under different scenarios.
The authors incorporate multiple visual foundation models into their pipeline, including SAM, which was introduced by Meta AI Research as the first foundation model for the object segmentation task. SAM is capable of performing zero-shot segmentation using various visual prompts as guidance, making it particularly suitable for remote sensing imagery processing, where labeled datasets are often sparse and inherently heterogeneous.
However, adapting SAM for specific tasks remains non-trivial, as the model segments all objects into distinct masks, making it impractical to use directly for semantic segmentation tasks. To address this issue, the authors propose leveraging other foundation models that focus on different aspects to generate visual prompts as guidance for the SAM model. These models can generate points or bounding boxes that help narrow down the areas where SAM needs to make predictions or assist in filtering SAM's predictions based on specific text prompts.
The authors evaluated their pipeline (shown in Fig. 19) on four commonly used remote sensing datasets and achieved promising results. They showed that although SAM is effective in segmenting instances within the provided frame, generating segmentation masks for a specific
category remains challenging. In contrast, other visual foundation models, like Grounding DINO [192] and CLIP [13], exhibit a superior capacity to understand the semantic features of images, thus generating coarse-grained visual prompts. Integrating these models into a unified pipeline allows us to exploit their combined strengths.
Overall, this work provides insights into maximizing the applicability of visual foundation models in specific contexts with minimal model tuning. The authors' proposed pipeline is not limited to any specific dataset and can be applied to various scenarios with minimal adjustments to the prompt tuning process. The authors hope that their work will encourage additional research on the application of visual foundation models for diverse downstream tasks and stimulate the development of increasingly powerful visual foundation models.
according to user preferences.
The CAT approach leverages SAM [20] to perform the segmentation task. SAM is a promptable segmentation model that achieves strong performance across various image domains. Specifically, SAM adapts interactive segmentation to achieve promptable ability, where a prompt, i.e., any interaction (e.g., points, boxes) indicating what to segment in an image, is used to prompt SAM to return a valid segmentation mask. Once the user-specified segmentation mask is obtained, it is easy to generate the desired caption according to the original image and the mask prompt.
The CAT approach is a training-free and adaptable solution for controllable image captioning tasks. It expands the range of supported control signals, enhances the model's flexibility and scalability, and offers strong user-interactive capabilities. The work highlights the importance of multimodal controls and prompts in controllable image captioning and provides insights into potential future research directions in this field.
In conclusion, vision and language tasks may not need training models anymore with the help of SAM and LLMs. We are about to see more powerful tools utilizing large pre-trained models or APIs to solve various kinds of tasks.
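A rough approximation of this training-free chaining is sketched below; sam_predictor is the real SAM interface, while caption_model is a placeholder for any off-the-shelf image captioner, and the masking strategy is an assumption rather than CAT's exact design.

    import numpy as np

    def caption_clicked_object(sam_predictor, image, click_xy, caption_model):
        # 1) Segmenter: the user click becomes a SAM point prompt.
        sam_predictor.set_image(image)
        masks, scores, _ = sam_predictor.predict(
            point_coords=np.array([click_xy]),
            point_labels=np.array([1]),
            multimask_output=True,
        )
        mask = masks[np.argmax(scores)]

        # 2) Captioner: feed only the masked region (background blanked out)
        #    so that the caption focuses on the user-selected object.
        focused = image.copy()
        focused[~mask] = 255
        return caption_model(focused)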
Fig. 20: Overview of the Caption Anything [44] framework, which enhances image captioning by incorporating multimodal controls that align with human intention, allowing for a diverse range of visual focuses and language styles. The visual prompt is initially converted into a mask prompt using the segmenter. Then, the captioner generates a raw caption, which is further polished by the text refiner.

4.2.6 Audio and Vision
with all objects that exist in the video. However, with the recent advances in deep learning, researchers have developed many effective methods for this task.
One approach [45] to audio-visual localization and segmentation is to learn cross-modal representations that can align audio and visual information. In Fig. 21, AV-SAM leverages pixel-wise audio-visual fusion across audio and visual features from the pre-trained audio encoder and image encoder.
Another approach to audio-visual localization and segmentation is to use contrastive learning to learn cross-
satisfactory results without requiring model fine-tuning, which is a cumbersome requirement in current WSSS techniques such as [207], [208] that involve classification re-training and pseudo-label generation. By using SAM as a foundation model in WSSS, the process can be made more straightforward and less complex.
For example, [209] aimed to investigate the suitability of SAM for WSSS and adapt it to generate pseudo-labels using only image-level class labels. The study conducted an initial analysis of SAM as a foundation model in WSSS and discovered that it can achieve comparable performance without requiring fine-tuning. The study also revealed that SAM could generate high-quality segmentation masks and even surpass human annotations in some cases.
In [209], the performance of SAM was evaluated on the PASCAL VOC and MS-COCO datasets, where it demonstrated significant improvements over the latest SOTA methods. However, SAM encountered difficulties in certain situations due to the issue of semantic obscurity. While SAM performed well in most unambiguous settings, addressing semantic obscurity may require investigating the use of hierarchical-structured semantic classes and better prompts. Additionally, the study suggests exploring SAM's ability to segment "stuff" classes like "sky," "sea," and "road" to enhance overall scene understanding.
In [181], the authors presented a WSSS method that utilizes SAM as a pseudo-label generator. They employed diverse weak labels, such as image-level labels, points, scribbles, and bounding boxes, as prompts for SAM to produce precise class masks. These masks were then used to generate pseudo labels for training segmentation networks. The method's effectiveness was evaluated on the PASCAL VOC 2012 dataset, and the results indicate that SAM can serve as a reliable pseudo-label generator, with scribbles as prompts achieving an 89.7% mIoU score on the training set. The final segmentation model obtained a 76.6% mIoU score on the test set. This suggests that the method can be valuable for training segmentation networks with weak supervision, especially in scenarios where pixel-level annotations are unavailable.
However, in [181], the authors also pointed out some of SAM's limitations in handling WSSS. For instance, when using point prompts located by CAMs [210], [211], SAM may generate incorrect object masks due to the CAMs' coarse locations. Additionally, in the case of bounding box prompts, SAM may struggle to generate precise object masks when multiple objects are placed on the same table. These limitations imply that SAM may not always be capable of producing accurate object masks when dealing with WSSS.
In addition, current methods that rely on CAM to generate pseudo labels suffer from limitations such as partial and false activation. To overcome this, a new approach in [212] utilizes SAM to generate high-quality pseudo labels by selecting relevant masks and labeling them based on initial CAM seed masks or post-processed pseudo labels. The results show that this approach significantly improves the precision of segmentation while also reducing computational costs compared to existing post-processing modules. The approach is highly versatile and compatible with existing WSSS models without modification to base networks or pipelines, and it has been shown to improve the mIoU of pseudo labels from five SOTA WSSS methods by an average of 6.2% on the train set of the PASCAL VOC 2012 dataset.
In addition, current methods that rely on CAMs to generate pseudo labels suffer from limitations such as partial and false activation. To overcome this, a new approach was introduced in [212] that utilizes SAM to generate high-quality pseudo labels by selecting relevant masks and labeling them based on initial CAM seed masks or post-processed pseudo labels. The results show that this approach significantly improves the precision of segmentation while also reducing computational costs compared to existing post-processing modules. The approach is highly versatile and compatible with existing WSSS models, requiring no modification to base networks or pipelines.
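The core selection step can be sketched as follows, assuming SAM's automatic mask generator and a dictionary of per-class CAM seed masks; the overlap rule and threshold are illustrative choices rather than the exact criterion of [212].

# Sketch of the mask-selection idea: label SAM's class-agnostic masks with
# CAM seeds by overlap, then merge them into a refined pseudo label.
# The overlap rule and threshold below are illustrative assumptions.
import numpy as np
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

def refine_with_sam(image_rgb, cam_seeds, overlap_thresh=0.5):
    """cam_seeds: dict {class_id: HxW bool seed mask from CAM post-processing}."""
    sam_masks = [m["segmentation"] for m in mask_generator.generate(image_rgb)]
    refined = np.zeros(image_rgb.shape[:2], dtype=np.int64)
    for class_id, seed in cam_seeds.items():
        for mask in sam_masks:
            # Fraction of the SAM mask covered by this class's CAM seed.
            overlap = np.logical_and(mask, seed).sum() / max(mask.sum(), 1)
            if overlap >= overlap_thresh:   # the mask is "claimed" by this class
                refined[mask] = class_id
    return refined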
There is a need to understand and research adversarial attacks. Adversarial attacks involve adding minor, undetectable perturbations to an input image to deceive the computer vision system into misclassifying the image [214], [215]. This poses a significant concern because it may result in security breaches and safety risks in applications like autonomous vehicles [216] and facial recognition systems [217].

Recently, [50] assessed the adversarial robustness of the SAM model in the context of prompt-based segmentation in computer vision. The authors developed a framework called Attack-SAM to evaluate SAM's vulnerability to adversarial attacks in the prompt-based mask prediction task. This is the first comprehensive investigation of SAM's susceptibility to adversarial attacks, and the findings indicate that while SAM is vulnerable to white-box attacks, it remains somewhat robust in the black-box setting. Understanding the robustness of SAM in the face of adversarial attacks is crucial for developing more secure computer vision systems. The study provides the following insights:
• Firstly, the research reveals that SAM is vulnerable to white-box attacks, meaning that an attacker with full knowledge of the model's architecture and parameters can generate adversarial examples that deceive the model with ease. However, SAM is relatively robust in the black-box setting, where the attacker's knowledge of the model is limited.
• Secondly, the study shows that small objects tend to be more resistant to adversarial attacks in the black-box setting, possibly due to the limited perturbation of the small object region.
• Thirdly, the paper offers insights into the transferability of adversarial attacks among prompts in the segment everything mode of SAM. Additionally, the study provides a list of recent research papers and surveys on various topics related to generative AI, which can serve as a useful resource for researchers in the field.
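The white-box setting described above can be illustrated with a generic PGD-style sketch that suppresses the predicted mask logits within an L-infinity budget. Here, model abstracts any differentiable module mapping an image batch to mask logits, and the objective is a simplification rather than the exact Attack-SAM formulation.

# Generic PGD-style sketch of a white-box mask-removal attack.
# `model` is assumed to be any differentiable module that maps images (B,3,H,W)
# in [0,1] to mask logits; this illustrates the principle, not Attack-SAM itself.
import torch

def pgd_mask_removal(model, images, eps=8 / 255, alpha=2 / 255, steps=10):
    adv = images.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        mask_logits = model(adv)                  # logits > 0 are treated as "object"
        # Push every pixel's logit toward "no object": minimize positive logits.
        loss = torch.clamp(mask_logits, min=0).mean()
        grad = torch.autograd.grad(loss, adv)[0]
        with torch.no_grad():
            adv = adv - alpha * grad.sign()                  # descend on the loss
            adv = images + (adv - images).clamp(-eps, eps)   # project to the eps-ball
            adv = adv.clamp(0, 1)                            # keep a valid image
        adv = adv.detach()
    return adv

The choice of a clamped-logit objective is one simple way to encode "remove the mask"; stronger or targeted objectives follow the same loop.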
Besides, [49] presented BadSAM, which applied backdoor attacks to the SAM model for image segmentation. This paper demonstrated the potential security risks associated with such attacks and emphasized the need for further investigation into defense strategies in this area. The experiments conducted in the study demonstrate that the SAM model can be exploited by attackers, presenting a significant risk to end-users. The paper also outlines a process for initiating backdoor attacks on SAM and provides insights into the attacker's knowledge and methodology.

In [218], the researchers carried out a thorough investigation into the robustness of SAM in various real-world scenarios. They conducted experiments that involved a wide range of image perturbations and domains, which revealed that SAM's performance typically deteriorates when dealing with perturbed images. However, the degree of vulnerability varies across different types of perturbations. By tailoring the prompting techniques and utilizing domain-specific knowledge that takes into account the unique characteristics of each dataset, it is possible to enhance the model's resilience to these perturbations and tackle challenges that are specific to each dataset. This work further highlights certain types of perturbations that have a significant impact on the model's robustness, identifying areas where improvements can be made.

Another study explored SAM for explainable AI (XAI): since SAM performs instance segmentation, integrating it into XAI pipelines can be computationally demanding. To address this concern, the authors proposed a lightweight per-input equivalent (PIE) scheme, enabling efficient explanation using a surrogate model. Evaluation conducted on ImageNet and COCO datasets showcased the promising performance of EAC compared to commonly utilized XAI methods.
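A simple way to probe this kind of robustness is to corrupt an image, re-run the same prompt, and measure how much the predicted mask drifts from the clean prediction. The sketch below does this for two illustrative corruptions; predict_mask stands in for any SAM-style point-prompted predictor, and the corruption set and protocol are assumptions rather than those of [218].

# Sketch of a perturbation-robustness check for a point-prompted predictor.
# `predict_mask(image, point) -> HxW bool` abstracts any SAM-style predictor.
import numpy as np

def iou(a, b):
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 1.0

def gaussian_noise(img, sigma=25):
    noisy = img.astype(np.float32) + np.random.normal(0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def pixelate(img, factor=8):
    small = img[::factor, ::factor]  # crude downsample, then nearest upsample
    up = np.repeat(np.repeat(small, factor, axis=0), factor, axis=1)
    return up[: img.shape[0], : img.shape[1]]

def robustness_report(predict_mask, image, point):
    clean = predict_mask(image, point)
    report = {}
    for name, corrupt in {"gaussian_noise": gaussian_noise, "pixelate": pixelate}.items():
        perturbed = predict_mask(corrupt(image), point)
        report[name] = iou(clean, perturbed)  # 1.0 = unaffected, 0.0 = destroyed
    return report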
We have analyzed and summarized the advantages and limitations of SAM across various applications. These observations can provide some insights to guide future research to develop stronger foundation models and further improve the robustness and generalization capabilities of SAM. Finally, we summarize massive other amazing applications of SAM in vision and beyond. The appendix provides a preliminary summary of open-source projects on SAM in table format.

REFERENCES

[19] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, "Hierarchical text-conditional image generation with clip latents," arXiv preprint arXiv:2204.06125, 2022.
[20] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., "Segment anything," arXiv preprint arXiv:2304.02643, 2023.
[21] R. Zhang, Z. Jiang, Z. Guo, S. Yan, J. Pan, H. Dong, P. Gao, and H. Li, "Personalize segment anything model with one shot," arXiv preprint arXiv:2305.03048, 2023.
[22] G.-P. Ji, D.-P. Fan, P. Xu, M.-M. Cheng, B. Zhou, and L. Van Gool, "Sam struggles in concealed scenes–empirical study on "segment anything"," arXiv preprint arXiv:2304.06022, 2023.
[23] X. Wang, X. Zhang, Y. Cao, W. Wang, C. Shen, and T. Huang, "Seggpt: Segmenting everything in context," arXiv preprint arXiv:2304.03284, 2023.
[1] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora,
S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill
[24] X. Zou, J. Yang, H. Zhang, F. Li, L. Li, J. Gao, and Y. J. Lee,
et al., “On the opportunities and risks of foundation models,”
“Segment everything everywhere all at once,” arXiv preprint
arXiv preprint arXiv:2108.07258, 2021.
arXiv:2304.06718, 2023.
[2] X. Wang, G. Chen, G. Qian, P. Gao, X.-Y. Wei, Y. Wang, Y. Tian,
and W. Gao, “Large-scale multi-modal pre-trained models: A [25] J. Jain, J. Li, M. Chiu, A. Hassani, N. Orlov, and H. Shi, “One-
comprehensive survey,” arXiv preprint arXiv:2302.10035, 2023. former: One transformer to rule universal image segmentation,”
[3] P. P. Liang, A. Zadeh, and L.-P. Morency, “Foundations and recent arXiv preprint arXiv:2211.06220, 2022.
trends in multimodal machine learning: Principles, challenges, [26] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, “Pre-
and open questions,” arXiv preprint arXiv:2209.03430, 2022. train, prompt, and predict: A systematic survey of prompting
[4] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre- methods in natural language processing,” ACM Computing Sur-
training of deep bidirectional transformers for language under- veys, vol. 55, no. 9, pp. 1–35, 2023.
standing,” arXiv preprint arXiv:1810.04805, 2018. [27] J. Fan, “Gpt-3 moment in computer vision,” https://twitter.com/
[5] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, DrJimFan, 2023.
Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer [28] Y. Jing, X. Wang, and D. Tao, “Segment anything in non-
learning with a unified text-to-text transformer,” The Journal of euclidean domains: Challenges and opportunities,” arXiv preprint
Machine Learning Research, vol. 21, no. 1, pp. 5485–5551, 2020. arXiv:2304.11595, 2023.
[6] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhari- [29] L. Tang, H. Xiao, and B. Li, “Can sam segment anything?
wal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., when sam meets camouflaged object detection,” arXiv preprint
“Language models are few-shot learners,” Advances in neural arXiv:2304.04709, 2023.
information processing systems, vol. 33, pp. 1877–1901, 2020. [30] J. Ma and B. Wang, “Segment anything in medical images,” arXiv
[7] OpenAI, “Gpt-4,” https://cdn.openai.com/papers/gpt-4.pdf, preprint arXiv:2304.12306, 2023.
2023. [31] T. Zhou, Y. Zhang, Y. Zhou, Y. Wu, and C. Gong, “Can sam
[8] ——, “Introducing chatgpt,” https://openai.com/blog/chatgpt, segment polyps?” arXiv preprint arXiv:2304.07583, 2023.
Accessed: 2023-05-10. [32] P. Shi, J. Qiu, S. M. D. Abaxi, H. Wei, F. P.-W. Lo, and W. Yuan,
[9] X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer, “Scaling vi- “Generalist vision foundation models for medical imaging: A
sion transformers,” in Proceedings of the IEEE/CVF Conference on case study of segment anything model on zero-shot medical
Computer Vision and Pattern Recognition, 2022, pp. 12 104–12 113. segmentation,” arXiv preprint arXiv:2304.12637, 2023.
[10] M. Dehghani, J. Djolonga, B. Mustafa, P. Padlewski, J. Heek, [33] Y. Zhang and R. Jiao, “How segment anything model (sam) boost
J. Gilmer, A. Steiner, M. Caron, R. Geirhos, I. Alabdulmohsin medical image segmentation?” arXiv preprint arXiv:2305.03678,
et al., “Scaling vision transformers to 22 billion parameters,” arXiv 2023.
preprint arXiv:2302.05442, 2023. [34] J. Wu, R. Fu, H. Fang, Y. Liu, Z. Wang, Y. Xu, Y. Jin, and T. Arbel,
[11] Z. Liu, H. Hu, Y. Lin, Z. Yao, Z. Xie, Y. Wei, J. Ning, Y. Cao, “Medical sam adapter: Adapting segment anything model for
Z. Zhang, L. Dong et al., “Swin transformer v2: Scaling up medical image segmentation,” arXiv preprint arXiv:2304.12620,
capacity and resolution,” in Proceedings of the IEEE/CVF conference 2023.
on computer vision and pattern recognition, 2022, pp. 12 009–12 019. [35] T. Chen, L. Zhu, C. Ding, R. Cao, S. Zhang, Y. Wang, Z. Li, L. Sun,
[12] L. Wang, B. Huang, Z. Zhao, Z. Tong, Y. He, Y. Wang, Y. Wang, P. Mao, and Y. Zang, “Sam fails to segment anything?–sam-
and Y. Qiao, “Videomae v2: Scaling video masked autoencoders adapter: Adapting sam in underperformed scenes: Camouflage,
with dual masking,” arXiv preprint arXiv:2303.16727, 2023. shadow, and more,” arXiv preprint arXiv:2304.09148, 2023.
[13] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agar-
[36] D. Cheng, Z. Qin, Z. Jiang, S. Zhang, Q. Lao, and K. Li, “Sam
wal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning
on medical images: A comprehensive study on three prompt
transferable visual models from natural language supervision,”
modes,” arXiv preprint arXiv:2305.00035, 2023.
in International conference on machine learning. PMLR, 2021, pp.
[37] M. A. Mazurowski, H. Dong, H. Gu, J. Yang, N. Konz, and
8748–8763.
Y. Zhang, “Segment anything model for medical image analysis:
[14] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le,
an experimental study,” arXiv preprint arXiv:2304.10517, 2023.
Y.-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-
language representation learning with noisy text supervision,” in [38] Y. Huang, X. Yang, L. Liu, H. Zhou, A. Chang, X. Zhou, R. Chen,
International Conference on Machine Learning. PMLR, 2021, pp. J. Yu, J. Chen, C. Chen et al., “Segment anything model for
4904–4916. medical images?” arXiv preprint arXiv:2304.14660, 2023.
[15] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with [39] T. Yu, R. Feng, R. Feng, J. Liu, X. Jin, W. Zeng, and Z. Chen,
contrastive predictive coding,” arXiv preprint arXiv:1807.03748, “Inpaint anything: Segment anything meets image inpainting,”
2018. arXiv preprint arXiv:2304.06790, 2023.
[16] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hock- [40] D. Xie, R. Wang, J. Ma, C. Chen, H. Lu, D. Yang, F. Shi, and X. Lin,
enmaier, and S. Lazebnik, “Flickr30k entities: Collecting region- “Edit everything: A text-guided generative system for images
to-phrase correspondences for richer image-to-sentence models,” editing,” arXiv preprint arXiv:2304.14006, 2023.
in Proceedings of the IEEE international conference on computer vision, [41] S. Liu, J. Ye, and X. Wang, “Any-to-any style transfer: Making
2015, pp. 2641–2649. picasso and da vinci collaborate,” arXiv e-prints, pp. arXiv–2304,
[17] X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollár, 2023.
and C. L. Zitnick, “Microsoft coco captions: Data collection and [42] M. Ahmadi, A. G. Lonbar, A. Sharifi, A. T. Beris, M. Nouri, and
evaluation server,” arXiv preprint arXiv:1504.00325, 2015. A. S. Javidi, “Application of segment anything model for civil
[18] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, infrastructure defect assessment,” arXiv preprint arXiv:2304.12600,
M. Chen, and I. Sutskever, “Zero-shot text-to-image generation,” 2023.
in International Conference on Machine Learning. PMLR, 2021, pp. [43] D. Han, C. Zhang, Y. Qiao, M. Qamar, Y. Jung, S. Lee, S.-H. Bae,
8821–8831. and C. S. Hong, “Segment anything model (sam) meets glass:
Mirror and transparent objects cannot be easily detected,” arXiv and methods,” Journal of Artificial Intelligence Research, vol. 71, pp.
preprint arXiv:2305.00278, 2023. 1183–1317, 2021.
[44] T. Wang, J. Zhang, J. Fei, Y. Ge, H. Zheng, Y. Tang, Z. Li, M. Gao, [68] C. Zhang, S. Zheng, C. Li, Y. Qiao, T. Kang, X. Shan, C. Zhang,
S. Zhao, Y. Shan et al., “Caption anything: Interactive image C. Qin, F. Rameau, S.-H. Bae et al., “Asurvey on segment anything
description with diverse multimodal controls,” arXiv preprint model (sam): Vision foundation model meets prompt engineer-
arXiv:2305.02677, 2023. ing,” 2023.
[45] S. Mo and Y. Tian, “Av-sam: Segment anything model meets [69] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L.
audio-visual localization and segmentation,” arXiv preprint Yuille, “Semantic image segmentation with deep convolutional
arXiv:2305.01836, 2023. nets and fully connected crfs,” arXiv preprint arXiv:1412.7062,
[46] J. Yang, M. Gao, Z. Li, S. Gao, F. Wang, and F. Zheng, 2014.
“Track anything: Segment anything meets videos,” arXiv preprint [70] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethink-
arXiv:2304.11968, 2023. ing atrous convolution for semantic image segmentation,” arXiv
[47] Q. Shen, X. Yang, and X. Wang, “Anything-3d: Towards preprint arXiv:1706.05587, 2017.
single-view anything reconstruction in the wild,” arXiv preprint [71] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L.
arXiv:2304.10261, 2023. Yuille, “Deeplab: Semantic image segmentation with deep convo-
[48] Z. Ma, X. Hong, and Q. Shangguan, “Can sam count any- lutional nets, atrous convolution, and fully connected crfs,” IEEE
thing? an empirical study on sam counting,” arXiv preprint transactions on pattern analysis and machine intelligence, vol. 40,
arXiv:2304.10817, 2023. no. 4, pp. 834–848, 2017.
[49] Z. Guan, M. Hu, Z. Zhou, J. Zhang, S. Li, and N. Liu, “Badsam: [72] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam,
Exploring security vulnerabilities of sam via backdoor attacks,” “Encoder-decoder with atrous separable convolution for seman-
arXiv preprint arXiv:2305.03289, 2023. tic image segmentation,” in Proceedings of the European conference
[50] C. Zhang, C. Zhang, T. Kang, D. Kim, S.-H. Bae, and I. S. Kweon, on computer vision (ECCV), 2018, pp. 801–818.
“Attack-sam: Towards evaluating adversarial robustness of seg- [73] A. M. Hafiz and G. M. Bhat, “A survey on instance segmentation:
ment anything model,” arXiv preprint arXiv:2305.00866, 2023. state of the art,” International journal of multimedia information
[51] S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, C. Li, retrieval, vol. 9, no. 3, pp. 171–189, 2020.
J. Yang, H. Su, J. Zhu et al., “Grounding dino: Marrying dino [74] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, “Path aggregation network
with grounded pre-training for open-set object detection,” arXiv for instance segmentation,” in Proceedings of the IEEE conference
preprint arXiv:2303.05499, 2023. on computer vision and pattern recognition, 2018, pp. 8759–8768.
[52] F. Liang, B. Wu, X. Dai, K. Li, Y. Zhao, H. Zhang, P. Zhang, P. Va- [75] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, “Yolact: Real-time in-
jda, and D. Marculescu, “Open-vocabulary semantic segmenta- stance segmentation,” in Proceedings of the IEEE/CVF international
tion with mask-adapted clip,” arXiv preprint arXiv:2210.04150, conference on computer vision, 2019, pp. 9157–9166.
2022. [76] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollar, “Panop-
[53] J. Wang, P. Zhang, T. Chu, Y. Cao, Y. Zhou, T. Wu, B. Wang, C. He, tic segmentation,” in Proceedings of the IEEE/CVF Conference on
and D. Lin, “V3det: Vast vocabulary visual detection dataset,” Computer Vision and Pattern Recognition (CVPR), June 2019.
arXiv preprint arXiv:2304.03752, 2023. [77] W. Zhang, J. Pang, K. Chen, and C. C. Loy, “K-net: Towards
[54] G. Ghiasi, X. Gu, Y. Cui, and T.-Y. Lin, “Open-vocabulary image unified image segmentation,” Advances in Neural Information Pro-
segmentation,” arXiv preprint arXiv:2112.12143, 2021. cessing Systems, vol. 34, pp. 10 326–10 338, 2021.
[78] B. Cheng, A. G. Schwing, and A. Kirillov, “Per-pixel classification
[55] J. Pei, L. Deng, S. Song, M. Zhao, Y. Zhang, S. Wu, G. Wang,
is not all you need for semantic segmentation,” in NeurIPS, 2021.
Z. Zou, Z. Wu, W. He et al., “Towards artificial general intelli-
gence with hybrid tianjic chip architecture,” Nature, vol. 572, no. [79] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar,
7767, pp. 106–111, 2019. “Masked-attention mask transformer for universal image seg-
mentation,” in Proceedings of the IEEE/CVF Conference on Computer
[56] B. Goertzel, “Artificial general intelligence: concept, state of the
Vision and Pattern Recognition, 2022, pp. 1290–1299.
art, and future prospects,” Journal of Artificial General Intelligence,
[80] N. Xu, B. Price, S. Cohen, J. Yang, and T. S. Huang, “Deep
vol. 5, no. 1, p. 1, 2014.
interactive object selection,” in Proceedings of the IEEE conference
[57] J. Grudin and R. Jacques, “Chatbots, humbots, and the quest
on computer vision and pattern recognition, 2016, pp. 373–381.
for artificial general intelligence,” in Proceedings of the 2019 CHI
[81] G. Wang, M. A. Zuluaga, W. Li, R. Pratt, P. A. Patel, M. Aertsen,
conference on human factors in computing systems, 2019, pp. 1–11.
T. Doel, A. L. David, J. Deprest, S. Ourselin, and T. Vercauteren,
[58] B. Goertzel and P. Wang, “A foundational architecture for artifi- “DeepIGeoS: A deep interactive geodesic framework for medical
cial general intelligence,” Advances in artificial general intelligence: image segmentation,” IEEE Transactions on Pattern Analysis and
Concepts, architectures and algorithms, vol. 6, p. 36, 2007. Machine Intelligence, vol. 41, no. 7, pp. 1559–1572, jul 2019.
[59] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, [82] G. Mohan and M. M. Subashini, “Mri based medical image
B. Zhang, J. Zhang, Z. Dong et al., “A survey of large language analysis: Survey on brain tumor grade classification,” Biomedical
models,” arXiv preprint arXiv:2303.18223, 2023. Signal Processing and Control, vol. 39, pp. 139–161, 2018.
[60] G. Mialon, R. Dessı̀, M. Lomeli, C. Nalmpantis, R. Pasunuru, [83] P. A. Yushkevich, Y. Gao, and G. Gerig, “Itk-snap: An interactive
R. Raileanu, B. Rozière, T. Schick, J. Dwivedi-Yu, A. Celikyilmaz tool for semi-automatic segmentation of multi-modality biomedi-
et al., “Augmented language models: a survey,” arXiv preprint cal images,” in 2016 38th annual international conference of the IEEE
arXiv:2302.07842, 2023. engineering in medicine and biology society (EMBC). IEEE, 2016,
[61] L. Fan, L. Li, Z. Ma, S. Lee, H. Yu, and L. Hemphill, “A bibliomet- pp. 3342–3345.
ric review of large language models research from 2017 to 2023,” [84] C. Rupprecht, I. Laina, N. Navab, G. D. Hager, and F. Tombari,
arXiv preprint arXiv:2304.02020, 2023. “Guide me: Interacting with deep networks,” in Proceedings of the
[62] C. Zhang, C. Zhang, M. Zhang, and I. S. Kweon, “Text-to- IEEE Conference on Computer Vision and Pattern Recognition, 2018,
image diffusion model in generative ai: A survey,” arXiv preprint pp. 8551–8561.
arXiv:2303.07909, 2023. [85] L. Castrejon, K. Kundu, R. Urtasun, and S. Fidler, “Annotating
[63] S. L. Blodgett, S. Barocas, H. Daumé III, and H. Wallach, “Lan- object instances with a polygon-rnn,” in Proceedings of the IEEE
guage (technology) is power: A critical survey of” bias” in nlp,” conference on computer vision and pattern recognition, 2017, pp.
arXiv preprint arXiv:2005.14050, 2020. 5230–5238.
[64] F.-L. Chen, D.-Z. Zhang, M.-L. Han, X.-Y. Chen, J. Shi, S. Xu, and [86] D. Acuna, H. Ling, A. Kar, and S. Fidler, “Efficient interactive
B. Xu, “Vlp: A survey on vision-language pre-training,” Machine annotation of segmentation datasets with polygon-rnn++,” in
Intelligence Research, vol. 20, no. 1, pp. 38–56, 2023. Proceedings of the IEEE conference on Computer Vision and Pattern
[65] Y. Du, Z. Liu, J. Li, and W. X. Zhao, “A survey of vision-language Recognition, 2018, pp. 859–868.
pre-trained models,” arXiv preprint arXiv:2202.10936, 2022. [87] L. Yuan, D. Chen, Y.-L. Chen, N. Codella, X. Dai, J. Gao, H. Hu,
[66] S. Long, F. Cao, S. C. Han, and H. Yang, “Vision-and-language X. Huang, B. Li, C. Li et al., “Florence: A new foundation model
pretrained models: A survey,” arXiv preprint arXiv:2204.07356, for computer vision,” arXiv preprint arXiv:2111.11432, 2021.
2022. [88] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai, “Vl-bert:
[67] A. Mogadala, M. Kalimuthu, and D. Klakow, “Trends in integra- Pre-training of generic visual-linguistic representations,” arXiv
tion of vision and language research: A survey of tasks, datasets, preprint arXiv:1908.08530, 2019.
[89] J. Cho, J. Lu, D. Schwenk, H. Hajishirzi, and A. Kembhavi, “X- International Conference, Munich, Germany, October 5-9, 2015, Pro-
lxmert: Paint, caption and answer questions with multi-modal ceedings, Part III 18. Springer, 2015, pp. 234–241.
transformers,” arXiv preprint arXiv:2009.11278, 2020. [110] I. Giannakis, A. Bhardwaj, L. Sam, and G. Leontidis, “Deep learn-
[90] R. Girdhar, A. El-Nouby, Z. Liu, M. Singh, K. V. Alwala, A. Joulin, ing universal crater detection using segment anything model
and I. Misra, “Imagebind: One embedding space to bind them (sam),” arXiv preprint arXiv:2304.07764, 2023.
all,” in CVPR, 2023. [111] V. Ranjan, U. Sharma, T. Nguyen, and M. Hoai, “Learning to
[91] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked count everything,” in Proceedings of the IEEE/CVF Conference on
autoencoders are scalable vision learners,” in Proceedings of the Computer Vision and Pattern Recognition, 2021, pp. 3394–3403.
IEEE/CVF Conference on Computer Vision and Pattern Recognition, [112] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan,
2022, pp. 16 000–16 009. P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in
[92] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, context,” in Computer Vision–ECCV 2014: 13th European Confer-
T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly ence, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V
et al., “An image is worth 16x16 words: Transformers for image 13. Springer, 2014, pp. 740–755.
recognition at scale,” in International Conference on Learning Repre- [113] Z. Zhou, Z. Wu, R. Boutteau, F. Yang, and D. Ginhac, “Dsec-
sentations, 2021. mos: Segment any moving object with moving ego vehicle,” arXiv
[93] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss preprint arXiv:2305.00126, 2023.
for dense object detection,” in Proceedings of the IEEE international [114] Z. Zhou, Z. Wu, R. Boutteau, F. Yang, C. Demonceaux, and
conference on computer vision, 2017, pp. 2980–2988. D. Ginhac, “Rgb-event fusion for moving object detection in
[94] F. Milletari, N. Navab, and S.-A. Ahmadi, “V-net: Fully convo- autonomous driving,” arXiv preprint arXiv:2209.08323, 2022.
lutional neural networks for volumetric medical image segmen- [115] Y. Cao, X. Xu, C. Sun, Y. Cheng, Z. Du, L. Gao, and W. Shen,
tation,” in 2016 fourth international conference on 3D vision (3DV). “Segment any anomaly without training via hybrid prompt
Ieee, 2016, pp. 565–571. regularization,” arXiv preprint arXiv:2305.10724, 2023.
[95] B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso, and [116] T.-N. Le, T. V. Nguyen, Z. Nie, M.-T. Tran, and A. Sugimoto,
A. Torralba, “Semantic understanding of scenes through the “Anabranch network for camouflaged object segmentation,”
ade20k dataset,” International Journal of Computer Vision, vol. 127, Computer vision and image understanding, vol. 184, pp. 45–56, 2019.
pp. 302–321, 2019. [117] D.-P. Fan, G.-P. Ji, G. Sun, M.-M. Cheng, J. Shen, and L. Shao,
[96] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, “Camouflaged object detection,” in Proceedings of the IEEE/CVF
R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes conference on computer vision and pattern recognition, 2020, pp.
dataset for semantic urban scene understanding,” in Proc. of 2777–2787.
the IEEE Conference on Computer Vision and Pattern Recognition
[118] Y. Lv, J. Zhang, Y. Dai, A. Li, B. Liu, N. Barnes, and D.-P. Fan,
(CVPR), 2016.
“Simultaneously localize, segment and rank the camouflaged
[97] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, objects,” in Proceedings of the IEEE/CVF Conference on Computer
P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in Vision and Pattern Recognition, 2021, pp. 11 591–11 601.
context,” in Computer Vision–ECCV 2014: 13th European Confer-
[119] B. Yin, X. Zhang, Q. Hou, B.-Y. Sun, D.-P. Fan, and L. Van Gool,
ence, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V
“Camoformer: Masked separable attention for camouflaged ob-
13. Springer, 2014, pp. 740–755.
ject detection,” arXiv preprint arXiv:2212.06570, 2022.
[98] X. Wang, W. Wang, Y. Cao, C. Shen, and T. Huang, “Images speak
[120] X. Hu, D.-P. Fan, X. Qin, H. Dai, W. Ren, Y. Tai, C. Wang,
in images: A generalist painter for in-context visual learning,”
and L. Shao, “High-resolution iterative feedback network for
arXiv preprint arXiv:2212.02499, 2022.
camouflaged object detection,” arXiv preprint arXiv:2203.11624,
[99] R. Suvorov, E. Logacheva, A. Mashikhin, A. Remizova,
2022.
A. Ashukha, A. Silvestrov, N. Kong, H. Goka, K. Park, and
V. Lempitsky, “Resolution-robust large mask inpainting with [121] D. Williams, F. MacFarlane, and A. Britten, “Leaf only sam: A
fourier convolutions,” in Proceedings of the IEEE/CVF winter con- segment anything pipeline for zero-shot automated leaf segmen-
ference on applications of computer vision, 2022, pp. 2149–2159. tation,” 2023.
[100] A. Lugmayr, M. Danelljan, A. Romero, F. Yu, R. Timofte, and [122] Y. Zou, J. Jeong, L. Pemula, D. Zhang, and O. Dabeer, “Spot-
L. Van Gool, “Repaint: Inpainting using denoising diffusion the-difference self-supervised pre-training for anomaly detection
probabilistic models,” in Proceedings of the IEEE/CVF Conference on and segmentation,” in Computer Vision–ECCV 2022: 17th European
Computer Vision and Pattern Recognition, 2022, pp. 11 461–11 471. Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part
[101] W. Li, Z. Lin, K. Zhou, L. Qi, Y. Wang, and J. Jia, “Mat: Mask- XXX. Springer, 2022, pp. 392–408.
aware transformer for large hole image inpainting,” in Proceed- [123] P. Bergmann, M. Fauser, D. Sattlegger, and C. Steger, “Mvtec ad–
ings of the IEEE/CVF conference on computer vision and pattern a comprehensive real-world dataset for unsupervised anomaly
recognition, 2022, pp. 10 758–10 768. detection,” in Proceedings of the IEEE/CVF conference on computer
[102] Q. Dong, C. Cao, and Y. Fu, “Incremental transformer structure vision and pattern recognition, 2019, pp. 9592–9600.
enhanced image inpainting with masking positional encoding,” [124] Y. Huang, C. Qiu, and K. Yuan, “Surface defect saliency of
in Proceedings of the IEEE/CVF Conference on Computer Vision and magnetic tile,” The Visual Computer, vol. 36, pp. 85–96, 2020.
Pattern Recognition, 2022, pp. 11 358–11 368. [125] J. Božič, D. Tabernik, and D. Skočaj, “Mixed supervision for
[103] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, surface-defect detection: From weakly to fully supervised learn-
“High-resolution image synthesis with latent diffusion models,” ing,” Computers in Industry, vol. 129, p. 103459, 2021.
in Proceedings of the IEEE/CVF Conference on Computer Vision and [126] C. He, K. Li, Y. Zhang, G. Xu, L. Tang, Y. Zhang, Z. Guo, and
Pattern Recognition, 2022, pp. 10 684–10 695. X. Li, “Weakly-supervised concealed object segmentation with
[104] T. Q. Chen and M. Schmidt, “Fast patch-based style transfer of sam-based pseudo labeling and multi-scale feature grouping,”
arbitrary style,” arXiv preprint arXiv:1612.04337, 2016. arXiv preprint arXiv: arXiv:2305.11003, 2023.
[105] D. Y. Park and K. H. Lee, “Arbitrary style transfer with style- [127] S. Gao, W. Zhang, Y. Wang, Q. Guo, C. Zhang, Y. He, and
attentional networks,” in proceedings of the IEEE/CVF conference on W. Zhang, “Weakly-supervised salient object detection using
computer vision and pattern recognition, 2019, pp. 5880–5888. point supervison,” in Proceedings of the AAAI Conference on Ar-
[106] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang, “Uni- tificial Intelligence, vol. 36, no. 1, 2022, pp. 670–678.
versal style transfer via feature transforms,” Advances in neural [128] J. Chen and X. Bai, “Learning to” segment anything” in thermal
information processing systems, vol. 30, 2017. infrared images through knowledge distillation with a large scale
[107] Z. Xu, E. Sangineto, and N. Sebe, “Stylerdalle: Language-guided dataset satir,” arXiv preprint arXiv:2304.07969, 2023.
style transfer using a vector-quantized tokenizer of a large-scale [129] C. Li, W. Xia, Y. Yan, B. Luo, and J. Tang, “Segmenting objects in
generative model,” arXiv preprint arXiv:2303.09268, 2023. day and night: Edge-conditioned cnn for thermal image semantic
[108] W. Ji, J. Li, Q. Bi, W. Li, and L. Cheng, “Segment anything is not segmentation,” IEEE Transactions on Neural Networks and Learning
always perfect: An investigation of sam on different real-world Systems, vol. 32, no. 7, pp. 3069–3082, 2020.
applications,” arXiv preprint arXiv:2304.05750, 2023. [130] X. Yang, H. Dai, Z. Wu, R. Bist, S. Subedi, J. Sun, G. Lu, C. Li,
[109] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional T. Liu, and L. Chai, “Sam for poultry science,” 2023.
networks for biomedical image segmentation,” in Medical Image [131] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and
Computing and Computer-Assisted Intervention–MICCAI 2015: 18th P. Luo, “Segformer: Simple and efficient design for semantic
segmentation with transformers,” Advances in Neural Information [153] Y. Liu, J. Zhang, Z. She, A. Kheradmand, and M. Armand, “Samm
Processing Systems, vol. 34, pp. 12 077–12 090, 2021. (segment any medical model): A 3d slicer integration to sam,”
[132] Y. Liu, Y. Wang, Y. Li, Q. Li, and J. Wang, “Setr-yolov5n: A arXiv preprint arXiv:2304.05622, 2023.
lightweight low-light lane curvature detection method based on [154] Z. Qiu, Y. Hu, H. Li, and J. Liu, “Learnable ophthalmology sam,”
fractional-order fusion model,” IEEE Access, vol. 10, pp. 93 003– arXiv preprint arXiv:2304.13425, 2023.
93 016, 2022. [155] J. Wu, “Promptunet: Toward interactive medical image segmen-
[133] Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, “Yolox: Exceeding yolo tation,” arXiv preprint arXiv:2305.10300, 2023.
series in 2021,” arXiv preprint arXiv:2107.08430, 2021. [156] R. Deng, C. Cui, Q. Liu, T. Yao, L. W. Remedios, S. Bao, B. A.
[134] Y. Zhang, P. Sun, Y. Jiang, D. Yu, F. Weng, Z. Yuan, P. Luo, W. Liu, Landman, L. E. Wheless, L. A. Coburn, K. T. Wilson et al.,
and X. Wang, “Bytetrack: Multi-object tracking by associating ev- “Segment anything model (sam) for digital pathology: Assess
ery detection box,” in Computer Vision–ECCV 2022: 17th European zero-shot segmentation on whole slide imaging,” arXiv preprint
Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part arXiv:2304.04155, 2023.
XXII. Springer, 2022, pp. 1–21. [157] M. Hu, Y. Li, and X. Yang, “Skinsam: Empowering skin can-
[135] S. Ren, F. Luzi, S. Lahrichi, K. Kassaw, L. M. Collins, K. Bradbury, cer segmentation with segment anything model,” arXiv preprint
and J. M. Malof, “Segment anything, from space?” arXiv preprint arXiv:2304.13973, 2023.
arXiv:2304.13000, 2023. [158] Y. Zhang, T. Zhou, P. Liang, and D. Z. Chen, “Input augmentation
[136] K. Bradbury, R. Saboo, T. L Johnson, J. M. Malof, A. Devarajan, with sam: Boosting medical image segmentation with segmenta-
W. Zhang, L. M Collins, and R. G Newell, “Distributed solar tion foundation model,” arXiv preprint arXiv:2304.11332, 2023.
photovoltaic array location and extent dataset for remote sensing [159] A. Wang, M. Islam, M. Xu, Y. Zhang, and H. Ren, “Sam meets
object identification,” Scientific data, vol. 3, no. 1, pp. 1–9, 2016. robotic surgery: An empirical study in robustness perspective,”
[137] E. Maggiori, Y. Tarabalka, G. Charpiat, and P. Alliez, “Can se- arXiv preprint arXiv:2304.14674, 2023.
mantic labeling methods generalize to any city? the inria aerial [160] B. Wang, A. Aboah, Z. Zhang, and U. Bagci, “Gazesam: What you
image labeling benchmark,” in 2017 IEEE International Geoscience see is what you segment,” arXiv preprint arXiv:2304.13844, 2023.
and Remote Sensing Symposium (IGARSS). IEEE, 2017, pp. 3226– [161] Z. Liu, H. Wen, Z. Zhu, Q. Li, L. Liu, T. Li, W. Xu, C. Hou,
3229. B. Huang, Z. Li et al., “Diagnosis of significant liver fibrosis in
[138] I. Demir, K. Koperski, D. Lindenbaum, G. Pang, J. Huang, S. Basu, patients with chronic hepatitis b using a deep learning-based
F. Hughes, D. Tuia, and R. Raskar, “Deepglobe 2018: A challenge data integration network,” Hepatology International, vol. 16, no. 3,
to parse the earth through satellite images,” in Proceedings of p. 526–536, 2022.
the IEEE Conference on Computer Vision and Pattern Recognition [162] K. Huang, Q. Li, W. Zeng, X. Chen, L. Liu, X. Wan, C. Feng,
Workshops, 2018, pp. 172–181. Z. Li, Z. Liu, and C. Dong, “Ultrasound score combined with
[139] S. Mohajerani, T. A. Krammer, and P. Saeedi, “Cloud detection liver stiffness measurement by sound touch elastography for
algorithm for remote sensing images using fully convolutional staging liver fibrosis in patients with chronic hepatitis b: a clinical
neural networks,” arXiv preprint arXiv:1810.05782, 2018. prospective study,” Annals of Translational Medicine, vol. 10, no. 6,
[140] H. L. Aung, B. Uzkent, M. Burke, D. Lobell, and S. Ermon, 2022.
“Farm parcel delineation using spatio-temporal convolutional [163] Y. Chen, C. Zhang, L. Liu, C. Feng, C. Dong, Y. Luo, and X. Wan,
networks,” in Proceedings of the IEEE/CVF Conference on Computer “Uscl: pretraining deep ultrasound image diagnosis model
Vision and Pattern Recognition Workshops, 2020, pp. 76–77. through video contrastive representation learning,” in Medical
[141] S. Julka and M. Granitzer, “Knowledge distillation with segment Image Computing and Computer Assisted Intervention—MICCAI
anything (sam) model for planetary geological mapping,” arXiv 2021: 24th International Conference, Strasbourg, France, September
preprint arXiv:2305.07586, 2023. 27—October 1, 2021, Proceedings, Part VIII 24. Springer, 2021, p.
[142] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in 627–637.
Proceedings of the IEEE international conference on computer vision, [164] L. Gao, R. Zhou, C. Dong, C. Feng, Z. Li, X. Wan, and L. Liu,
2017, pp. 2961–2969. “Multi-modal active learning for automatic liver fibrosis diagno-
[143] D. Wang, J. Zhang, B. Du, D. Tao, and L. Zhang, “Scaling- sis based on ultrasound shear wave elastography,” in 2021 IEEE
up remote sensing segmentation dataset with segment anything 18th International Symposium on Biomedical Imaging (ISBI). IEEE,
model,” arXiv preprint arXiv:2305.02034, 2023. 2021, p. 410–414.
[144] J. Zhang, Z. Zhou, G. Mai, L. Mu, M. Hu, and S. Li, “Text2seg: [165] J. Chen, Y. Lu, Q. Yu, X. Luo, E. Adeli, Y. Wang, L. Lu, A. L. Yuille,
Remote sensing image semantic segmentation via text-guided and Y. Zhou, “Transunet: Transformers make strong encoders
visual foundation models,” arXiv preprint arXiv:2304.10597, 2023. for medical image segmentation,” arXiv preprint arXiv:2102.04306,
[145] C. Hu and X. Li, “When sam meets medical images: An inves- 2021.
tigation of segment anything model (sam) on multi-phase liver [166] H. Cao, Y. Wang, J. Chen, D. Jiang, X. Zhang, Q. Tian, and
tumor segmentation,” arXiv preprint arXiv:2304.08506, 2023. M. Wang, “Swin-unet: Unet-like pure transformer for medical
[146] S. Roy, T. Wald, G. Koehler, M. R. Rokuss, N. Disch, J. Holzschuh, image segmentation,” in Computer Vision–ECCV 2022 Workshops:
D. Zimmerer, and K. H. Maier-Hein, “Sam. md: Zero-shot med- Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part III. Springer,
ical image segmentation capabilities of the segment anything 2023, pp. 205–218.
model,” arXiv preprint arXiv:2304.05396, 2023. [167] R. Azad, R. Arimond, E. K. Aghdam, A. Kazerouni, and D. Mer-
[147] K. Zhang and D. Liu, “Customized segment anything model hof, “Dae-former: Dual attention-guided efficient transformer for
for medical image segmentation,” arXiv preprint arXiv:2304.13785, medical image segmentation,” arXiv preprint arXiv:2212.13504,
2023. 2022.
[148] S. Mohapatra, A. Gosai, and G. Schlaug, “Brain extraction com- [168] R. Azad, M. Heidari, M. Shariatnia, E. K. Aghdam, S. Karimija-
paring segment anything model (sam) and fsl brain extraction farbigloo, E. Adeli, and D. Merhof, “Transdeeplab: Convolution-
tool,” arXiv preprint arXiv:2304.04738, 2023. free transformer-based deeplab v3+ for medical image segmen-
[149] F. Putz, J. Grigo, T. Weissmann, P. Schubert, D. Hoefler, A. Gomaa, tation,” in Predictive Intelligence in Medicine: 5th International
H. B. Tkhayat, A. Hagag, S. Lettmaier, B. Frey et al., “The Workshop, PRIME 2022, Held in Conjunction with MICCAI 2022,
segment anything foundation model achieves favorable brain tu- Singapore, September 22, 2022, Proceedings. Springer, 2022, pp.
mor autosegmentation accuracy on mri to support radiotherapy 91–102.
treatment planning,” arXiv preprint arXiv:2304.07875, 2023. [169] M. Kristan, J. Matas, A. Leonardis, M. Felsberg, R. Pflugfelder, J.-
[150] Y. Li, M. Hu, and X. Yang, “Polyp-sam: Transfer sam for polyp K. Kämäräinen, H. J. Chang, M. Danelljan, L. Cehovin, A. Lukežič
segmentation,” arXiv preprint arXiv:2305.00293, 2023. et al., “The ninth visual object tracking vot2021 challenge results,”
[151] S. He, R. Bao, J. Li, P. E. Grant, and Y. Ou, “Accuracy of segment- in Proceedings of the IEEE/CVF International Conference on Computer
anything model (sam) in medical image segmentation tasks,” Vision, 2021, pp. 2711–2738.
arXiv preprint arXiv:2304.09324, 2023. [170] C. Zhang, G. Huang, L. Liu, S. Huang, Y. Yang, X. Wan, S. Ge,
[152] C. Mattjie, L. V. de Moura, R. C. Ravazio, L. S. Kupssinskü, and D. Tao, “Webuav-3 m: A benchmark for unveiling the power
O. Parraga, M. M. Delucis, and R. C. Barros, “Exploring the of million-scale deep uav tracking,” IEEE Transactions on Pattern
zero-shot capabilities of the segment anything model (sam) in Analysis and Machine Intelligence, 2022.
2d medical imaging: A comprehensive evaluation and practical [171] H. K. Cheng and A. G. Schwing, “Xmem: Long-term video
guideline,” arXiv preprint arXiv:2305.00109, 2023. object segmentation with an atkinson-shiffrin memory model,”
in Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, modal fusion in automatic continuous cued speech recognition,”
Israel, October 23–27, 2022, Proceedings, Part XXVIII. Springer, IEEE Transactions on Multimedia, vol. 23, p. 292–305, 2020.
2022, pp. 640–658. [195] L. Liu, T. Hueber, G. Feng, and D. Beautemps, “Visual recognition
[172] Y. Cheng, L. Li, Y. Xu, X. Li, Z. Yang, W. Wang, and Y. Yang, of continuous cued speech using a tandem cnn-hmm approach.”
“Segment and track anything,” in arXiv preprint arXiv:2305.06558, in Interspeech, 2018, p. 2643–2647.
2023. [196] L. Liu, J. Li, G. Feng, and X.-P. S. Zhang, “Automatic detection of
[173] Z. Yang and Y. Yang, “Decoupling features in hierarchical the temporal segmentation of hand movements in british english
propagation for video object segmentation,” arXiv preprint cued speech.” in INTERSPEECH, 2019, p. 2285–2289.
arXiv:2210.09782, 2022. [197] L. Liu, G. Feng, and D. Beautemps, “Automatic temporal seg-
[174] D. Fuoli, S. Gu, and R. Timofte, “Efficient video super-resolution mentation of hand movements for hand positions recognition
through recurrent latent space propagation,” in 2019 IEEE/CVF in french cued speech,” in 2018 IEEE International Conference on
International Conference on Computer Vision Workshop (ICCVW). Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp.
IEEE, 2019, pp. 3476–3485. 3061–3065.
[175] M. Haris, G. Shakhnarovich, and N. Ukita, “Recurrent back- [198] S. Mo and P. Morgado, “Localizing visual sounds the easy way,”
projection network for video super-resolution,” in Proceedings of in ECCV, 2022, p. 218–234.
the IEEE/CVF conference on computer vision and pattern recognition, [199] ——, “A closer look at weakly-supervised audio-visual source
2019, pp. 3897–3906. localization,” in NeurIPS, 2022.
[176] Y. Huang, W. Wang, and L. Wang, “Bidirectional recurrent con- [200] P. Morgado, N. Nvasconcelos, T. Langlois, and O. Wang,
volutional networks for multi-frame super-resolution,” Advances “Self-supervised generation of spatial audio for 360°video,” in
in neural information processing systems, vol. 28, 2015. NeurIPS, 2018.
[177] ——, “Video super-resolution via bidirectional recurrent con- [201] P. Morgado, Y. Li, and N. Nvasconcelos, “Learning representa-
volutional networks,” IEEE transactions on pattern analysis and tions from audio-visual spatial alignment,” in NeurIPS, 2020, pp.
machine intelligence, vol. 40, no. 4, pp. 1015–1028, 2017. 4733–4744.
[178] T. Hui, X. Tang, and C. L. Change Loy, “A lightweight convolu- [202] Y. Tian, J. Shi, B. Li, Z. Duan, and C. Xu, “Audio-visual event
tional neural network for optical flow estimation,” in Proceedings localization in unconstrained videos,” in ECCV, 2018.
of the IEEE Conference on Computer Vision and Pattern Recognition
[203] L. Liu, G. Feng, and D. Beautemps, “Inner lips feature extraction
(CVPR), 2018, pp. 8981–8989.
based on clnf with hybrid dynamic template for cued speech,”
[179] Z. Lu, Z. Xiao, J. Bai, Z. Xiong, and X. Wang, “Can sam boost
EURASIP Journal on Image and Video Processing, vol. 2017, p. 1–15,
video super-resolution?” in arXiv preprint arXiv:2305.06524, 2023.
2017.
[180] H. He, J. Zhang, M. Xu, J. Liu, B. Du, and D. Tao, “Scal-
[204] Y. Tian, D. Li, and C. Xu, “Unified multisensory perception:
able mask annotation for video text spotting,” arXiv preprint
Weakly-supervised audio-visual video parsing,” in ECCV, 2020,
arXiv:2305.01443, 2023.
p. 436–454.
[181] P.-T. Jiang and Y. Yang, “Segment anything is a good pseudo-label
generator for weakly supervised semantic segmentation,” arXiv [205] S. Mo and Y. Tian, “Multi-modal grouping network for weakly-
preprint arXiv:2305.01275, 2023. supervised audio-visual video parsing,” in NeurIPS, 2022.
[182] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, [206] Y. Li, H. Wang, Y. Duan, and X. Li, “Clip surgery for better ex-
J. Winn, and A. Zisserman, “The pascal visual object classes plainability with enhancement in open-vocabulary tasks,” arXiv
challenge: A retrospective,” International journal of computer vision, preprint arXiv:2304.05653, 2023.
vol. 111, pp. 98–136, 2015. [207] C. Song, Y. Huang, W. Ouyang, and L. Wang, “Box-driven class-
[183] J. Cen, Z. Zhou, J. Fang, W. Shen, L. Xie, X. Zhang, and wise region masking and filling rate guided loss for weakly su-
Q. Tian, “Segment anything in 3d with nerfs,” arXiv preprint pervised semantic segmentation,” in Proceedings of the IEEE/CVF
arXiv:2304.12308, 2023. Conference on Computer Vision and Pattern Recognition, 2019, pp.
[184] S. Saito, Z. Huang, R. Natsume, S. Morishima, A. Kanazawa, and 3136–3145.
H. Li, “Pifu: Pixel-aligned implicit function for high-resolution [208] J. Lee, J. Yi, C. Shin, and S. Yoon, “Bbam: Bounding box attribu-
clothed human digitization,” in Proceedings of the IEEE/CVF inter- tion map for weakly supervised semantic and instance segmenta-
national conference on computer vision, 2019, pp. 2304–2314. tion,” in Proceedings of the IEEE/CVF conference on computer vision
[185] K. Zhang, G. Riegler, N. Snavely, and V. Koltun, “Nerf++: An- and pattern recognition, 2021, pp. 2643–2652.
alyzing and improving neural radiance fields,” arXiv preprint [209] W. Sun, Z. Liu, Y. Zhang, Y. Zhong, and N. Barnes, “An alterna-
arXiv:2010.07492, 2020. tive to wsss? an empirical study of the segment anything model
[186] Y. Yin, Z. Fu, F. Yang, and G. Lin, “Or-nerf: Object removing (sam) on weakly-supervised semantic segmentation problems,”
from 3d scenes guided by multiview segmentation with neural arXiv preprint arXiv:2305.01586, 2023.
radiance fields,” 2023. [210] W. Yang, H. Huang, Z. Zhang, X. Chen, K. Huang, and S. Zhang,
[187] T. N. Kipf and M. Welling, “Semi-supervised classification with “Towards rich feature discovery with class activation maps
graph convolutional networks,” arXiv preprint arXiv:1609.02907, augmentation for person re-identification,” in Proceedings of the
2016. IEEE/CVF conference on computer vision and pattern recognition,
[188] W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation 2019, pp. 1389–1398.
learning on large graphs,” Advances in neural information process- [211] L. Liu, W. Lei, X. Wan, L. Liu, Y. Luo, and C. Feng, “Semi-
ing systems, vol. 30, 2017. supervised active learning for covid-19 lung ultrasound multi-
[189] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, symptom classification,” in 2020 IEEE 32nd International Confer-
and Y. Bengio, “Graph attention networks,” arXiv preprint ence on Tools with Artificial Intelligence (ICTAI). IEEE, 2020, pp.
arXiv:1710.10903, 2017. 1268–1273.
[190] S. Huang, Z. Jiang, H. Dong, Y. Qiao, P. Gao, and H. Li, [212] T. Chen, Z. Mai, R. Li, and W.-l. Chao, “Segment anything model
“Instruct2act: Mapping multi-modality instructions to robotic (sam) enhanced pseudo labels for weakly supervised semantic
actions with large language model,” 2023. segmentation,” arXiv preprint arXiv:2305.05803, 2023.
[191] Z. Liu, Y. He, W. Wang, W. Wang, Y. Wang, S. Chen, Q. Zhang, [213] B. Wu, L. Liu, Z. Zhu, Q. Liu, Z. He, and S. Lyu, “Adversarial ma-
Y. Yang, Q. Li, J. Yu, K. Li, Z. Chen, X. Yang, X. Zhu, Y. Wang, chine learning: A systematic survey of backdoor attack, weight
L. Wang, P. Luo, J. Dai, and Y. Qiao, “Interngpt: Solving vision- attack and adversarial example,” arXiv preprint arXiv:2302.09457,
centric tasks by interacting with chatgpt beyond language,” 2023. 2023.
[192] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, [214] R. Zheng, R. Tang, J. Li, and L. Liu, “Data-free backdoor removal
and A. Joulin, “Emerging properties in self-supervised vision based on channel lipschitzness,” in Computer Vision–ECCV 2022:
transformers,” in Proceedings of the IEEE/CVF international con- 17th European Conference, Tel Aviv, Israel, October 23–27, 2022,
ference on computer vision, 2021, pp. 9650–9660. Proceedings, Part V. Springer, 2022, pp. 175–191.
[193] J. Wang, Y. Zhao, L. Liu, H. Fan, T. Xu, Q. Li, and S. Li, “Memory- [215] ——, “Pre-activation distributions expose backdoor neurons,”
augmented contrastive learning for talking head generation,” Advances in Neural Information Processing Systems, vol. 35, pp.
arXiv preprint arXiv:2302.13469, 2023. 18 667–18 680, 2022.
[194] L. Liu, G. Feng, D. Beautemps, and X.-P. Zhang, “Re- [216] J. Cui, L. S. Liew, G. Sabaliauskaite, and F. Zhou, “A review on
synchronization using the hand preceding model for multi- safety failures, security attacks, and available countermeasures
APPENDIX A
A PRELIMINARY SUMMARY OF OPEN SOURCE PROJECTS ON SAM
TABLE 1: Summary of Open Source Projects on SAM.
No. Project Title Project page Code base Affiliation Description
https://github.com/
Segment https://segment-anything. A foundation model for
001 SAM facebookresearch/ Meta
Anything com/ general segmentation.
segment-anything
A project dedicated to
https://colab.research.
Segment tracking and segmenting
google.com/drive/ https://github.com/z-x-yang/ Zhejiang
002 SAM-Track and Track any objects in videos, ei-
1R10N70AJaslzADFqb-a5OihYkllWEVxB?
Segment-and-Track-Anything University
Anything ther automatically or inter-
usp=sharing
actively.
A project by combining
https://github.
Grounded- https://github.com/ Grounding DINO and
Grounded- com/camenduru/ IDEA-
003 Segment- IDEA-Research/ SAM which aims to detect
SAM grounded-segment-anything-colab Research
Anything Grounded-Segment-Anything and segment Anything
with text inputs.
A new way of instance
segmentation by combin-
https://github.com/ ing SAM with Closed-
004 MMDet-SAM - - open-mmlab/playground/ OpenMMLab Set Object Detection, Open-
tree/main/mmdet sam Vocabulary Object Detec-
tion, Grounding Object De-
tection.
Zero-shot A project joins SAM and
Oriented https://github.com/ weakly supervised hori-
MMRotate-
005 Object - open-mmlab/playground/ OpenMMLab zontal box detection to
SAM
Detection tree/main/mmrotate sam achieve rotated box detec-
with SAM tion.
A solution of Text Detec-
tion/Recognition and SAM
https://github.com/ that segments every text
MMOCR-
006 - - open-mmlab/playground/ OpenMMLab character, with striking text
SAM
tree/main/mmocr sam removal and text inpaint-
ing demos driven by diffu-
sion models and Gradio.
A project join SAM and
https://github.com/
MMEditing- image generation to create
007 - - open-mmlab/playground/ OpenMMLab
SAM awesome images and edit
tree/main/mmagic sam
any part of them.
OpenMMLab
PlayGround:
Semi- A project combining
https://github.com/
Label-Studio- Automated Label-Studio and SAM to
008 - open-mmlab/playground/ OpenMMLab
SAM Annotation achieve semi-automated
tree/main/label anything
with Label- annotation.
Studio and
SAM
Segment https://github.com/
A pretrained model param-
Anything PaddlePaddle/PaddleSeg/ PaddlePaddle
009 PaddleSeg - eters of PaddlePaddle for-
with tree/release/2.8/contrib/
mat.
PaddleSeg SegmentAnything
Segmenting
https://huggingface.co/ https://github.com/ SAM In Context based on
010 SegGPT Everything BAAI-Vision
spaces/BAAI/SegGPT baaivision/Painter Painter.
In Context
Segment Ev- https://github. A project can Segment Ev-
erything Ev- https://huggingface.co/ com/UX-Decoder/ erything Everywhere with
011 SEEM Microsoft
erywhere All spaces/xdecoder/SEEM Segment-Everything-Everywhere-All-At-Once Multimodal prompts all at
at Once once.
CLIP Surgery
for Better
Explain- A work about SAM based
https://github.com/
ability with https://github.com/ on CLIP’s explainability to
012 CLIP Surgery xmed-lab/CLIP Surgery/ HKUST
Enhancement xmed-lab/CLIP Surgery achieve text to mask with-
blob/master/demo.ipynb
in Open out manual points.
Vocabulary
Tasks
Can SAM
Segment
Anything?
When SAM https://github.com/ SAM + Camouflaged object
013 SAMCOD - -
Meets luckybird1994/SAMCOD detection (COD) task.
Camouflaged
Object
Detection
Segment
https://huggingface. SAM combines Inpainting,
Inpaint Any- Anything https://github.com/ USTC and
014 co/spaces/InpaintAI/ which is able to remove the
thing Meets Image geekyutao/Inpaint-Anything EIT
Inpaint-Anything object smoothly.
Inpainting
Personalize
Segment https://github.
https://huggingface.co/ SAM with specific con-
015 PerSAM Anything com/ZrrSkywalker/ -
papers/2305.03048 cepts.
Model with Personalize-SAM
One Shot
Segment
A step-by-step tutorial
Anything https://github.com/
016 MedSAM - - with a small dataset to help
in Medical bowang-lab/MedSAM
you quickly utilize SAM.
Images
https://colab.research.
Segment- GroundedSAM google.com/drive/ https://github.
Grounding DINO + SAM
017 Any- Anomaly 1Rwio KfziuLp79Qh com/caoyunkang/ HUST
to segment any anomaly.
Anomaly Detection ugum64Hjnq4ZwsE?usp= Segment-Any-Anomaly
sharing
Semantic https://github.
Fudan A dense category annota-
018 SSA Segment - com/fudan-zvg/
University tion engine.
Anything Semantic-Segment-Anything
Magic Copy is a Chrome
extension that uses SAM to
https://github.com/
019 Magic Copy - - - extract a foreground object
kevmo314/magic-copy
from an image and copy it
to the clipboard.
Segment Segment https://huggingface. https://github.
020 Anything Anything co/spaces/curt-park/ com/Curt-Park/ - SAM combined with CLIP.
with Clip with Clip segment-anything-with-clip segment-anything-with-clip
Segment https://huggingface.
https://github.com/kadirnar/ Packaged version of the
021 MetaSeg Anything co/spaces/ArtGAN/ -
segment-anything-video SAM.
Video Segment-Anything-Video
Applied
Extended SAM’s click-
Computer
Segment based foreground
Vision Lab
SAM in Na- Anything https://www.napari-hub.org/ https://github.com/ separation to full
022 and German
pari Model (SAM) plugins/napari-sam MIC-DKFZ/napari-sam click-based semantic
Cancer
in Napari segmentation and instance
Research
segmentation.
Center
https://github.
SAM Medical SAM Medical
023 - com/amine0110/ - SAM for Medical Imaging.
Imaging Imaging
SAM-Medical-Imaging
3D-Box via https://github.com/ SAM is extended to 3D
024 3D-Box Segment - dvlab-research/ - perception by combining it
Anything 3D-Box-Segment-Anything with VoxelNeXt.
https://github.com/ Anything 3DNovel View,
025 Anything-3D - - Anything-of-anything/ - Anything-NeRF, Any
Anything-3D 3DFace.
Learning A new partially supervised
https://github.com/ UC Berkeley,
026 L2SET to Segment - training paradigm for in-
ronghanghu/seg every thing FAIR
EveryThing stance segmentation.
Edit
Edit anything in images
Edit Anything https://github.com/sail-sg/
027 - - powered by SAM, Control-
Anything by Segment- EditAnything
Net, StableDiffusion, etc.
Anything
IEA: Image
Image Edit Using stable diffusion and
028 Editing - https://github.com/feizc/IEA -
Anything SAM for image editing.
Anything
This extension aim for con-
necting AUTOMATIC1111
Segment Stable Diffusion WebUI
SAM for Sta- Anything https://github.com/ and Mikubill ControlNet
029 ble Diffusion for Stable - continue-revolution/ - Extension with SAM
Webui Diffusion sd-webui-segment-anything and GroundingDINO
WebUI to enhance Stable
Diffusion/ControlNet
inpainting.
https://colab.research.
Segment https://github.
Earth Obser- google.com/drive/ An earth observation tools
030 Anything EO com/aliaksandr960/ -
vation Tools 1RC1V68tD1O-YissBq9nOvS2PHEjAsFkA? for SAM.
tools segment-anything-eo
usp=share link
Towards
https://github.
Moving Ob- Segmenting A project about SAM +
031 - com/achalddave/ -
ject Detection Anything Moving Object Detection.
segment-any-moving
That Moves
Optical
Character https://www.zhihu.com/
https://github.com/ Combining MMOCR with
032 OCR-SAM Recognition question/593914819/answer/ -
yeungchenwa/OCR-SAM SAM and Stable Diffusion.
with Segment 2976012032
Anything
A project uses the SAM
Segment https://github.com/
Model and adds a bare-
Anything anuragxel/salt#
033 SALT - - bones interface to label im-
Labelling segment-anything-labelling-tool-salt
ages and saves the masks in
Tool
the COCO format.
Prompt Prompt https://github. An implementation
034 Segment Segment - com/RockeyCoss/ - of zero-shot instance
Anything Anything Prompt-Segment-Anything segmentation using SAM.
035 | SAM-RBox | - | - | https://github.com/Li-Qingyun/sam-mmrotate | - | Uses SAM to generate rotated bounding boxes with MMRotate, serving as a comparison method for H2RBox-v2.
036 | VISAM | MOTRv2: Bootstrapping End-to-End Multi-Object Tracking by Pretrained Object Detectors | - | https://github.com/BingfengYan/VISAM | - | Combining SAM with MOT, it creates the era of "MOTS".
037 | SegEO | Segment Anything EO tools | - | https://github.com/aliaksandr960/segment-anything-eo | - | Tools developed to ease the processing of spatial data (GeoTIFF and TMS) with SAM, using a sliding-window algorithm for large files.
038 | Napari Segment Anything | Napari Segment Anything | https://app.codecov.io/gh/jookuma/napari-segment-anything | https://github.com/JoOkuma/napari-segment-anything | - | SAM native Qt UI.
039 | Segment-Anything-U-Specify | Segment-Anything-U-Specify | - | https://github.com/MaybeShewill-CV/segment-anything-u-specify | - | Uses CLIP and SAM to segment any instance you specify with a text prompt of any instance name.
040 | SegDrawer | Simple static web-based mask drawer | https://colab.research.google.com/drive/1PdWCpBgYwiQtvkdTBnW-y2T-s_Fc-2iI?usp=sharing | https://github.com/lujiazho/SegDrawer | - | Simple static web-based mask drawer, supporting semantic segmentation with SAM.
041 | Track Anything | Segment Anything Meets Videos | https://huggingface.co/spaces/VIPLab/Track-Anything | https://github.com/gaomingqi/Track-Anything | SUSTech | Track-Anything is a flexible and interactive tool for video object tracking and segmentation.
042 | Count Anything | - | - | https://github.com/ylqi/Count-Anything | - | A method that uses SAM and CLIP to ground and count any object matching a custom text prompt, without requiring any point or box annotation.
043 | RAM | Relate Anything Model | https://huggingface.co/spaces/mmlab-ntu/relate-anything-model | https://github.com/Luodian/RelateAnything | MMLab, NTU and VisCom Lab, KCL/TongJi | Relate Anything Model is capable of taking an image as input and utilizing SAM to identify the corresponding masks within the image.
044 | Segment Any RGBD | Segment Any RGBD | https://huggingface.co/spaces/jcenaa/Segment-Any-RGBD | https://github.com/Jun-CEN/SegmentAnyRGBD | - | Segment AnyRGBD is a toolbox to segment rendered depth images based on SAM.
045 | Show Anything | Show Anything | https://huggingface.co/spaces/weijiawu/ImageEditAnything | https://github.com/showlab/ShowAnything | Showlab, NUS | Applications that are compatible with both SAM and generation models.
046 | Transfer Any Style | Any-to-Any Style Transfer: Making Picasso and Da Vinci Collaborate | - | https://github.com/Huage001/Transfer-Any-Style | LV-lab, NUS | An interactive demo based on Segment-Anything for style transfer, enabling different content regions to apply different styles.
047 | Caption Anything | - | https://colab.research.google.com/github/ttengwang/Caption-Anything/blob/main/notebooks/tutorial.ipynb | https://github.com/ttengwang/Caption-Anything | VIP lab, SUSTech | Caption-Anything is a versatile image processing tool that combines the capabilities of SAM, visual captioning, and ChatGPT.
048 | Image2Paragraph | Transform Image Into Unique Paragraph | https://zhaohengyuan1.github.io/image2paragraph.github.io/ | https://github.com/showlab/Image2Paragraph | - | Transform an image into a unique paragraph with ChatGPT, BLIP2, OFA, GRIT, Segment Anything, and ControlNet.
049 | LIME SAM | Local Interpretable Model-agnostic Explanations Segment Anything | https://colab.research.google.com/drive/1bj6B-O47NHpqsWovOrVZcpWNhIfO56sj?usp=sharing | https://github.com/jaydeep-work/LIME-SAM | - | LIME-SAM aims to create an Explainable Artificial Intelligence (XAI) framework for image classification using LIME (Local Interpretable Model-agnostic Explanations) as the base algorithm, with the superpixel method replaced by SAM.
050 | Paint Anything | - | - | https://github.com/Huage001/Paint-Anything | - | An interactive demo based on SAM for stroke-based painting which enables human-like painting.
051 | SAMed | Customized Segment Anything Model for Medical Image Segmentation | https://colab.research.google.com/drive/1KCS5ulpZasYl9DgJJn59WsGEB8vwSI_m?usp=sharing | https://github.com/hitachinsk/SAMed | USTC | SAMed is built upon the large-scale image segmentation model SAM to explore the new research paradigm of customizing large-scale models for medical image segmentation.
052 | Personalize SAM | Personalize Segment Anything with 1 Shot in 10 Seconds | https://huggingface.co/spaces/justin-zk/Personalize-SAM | https://github.com/ZrrSkywalker/Personalize-SAM | MMLab, CUHK | A training-free personalization approach for SAM, termed PerSAM. Given only a single image with a reference mask, PerSAM can segment specific visual concepts.
053 | Open-vocabulary-Segment-Anything | Open-vocabulary-Segment-Anything | - | https://github.com/ngthanhtin/owlvit_segment_anything | - | Combining OwlViT with Segment Anything: open-vocabulary detection and segmentation (text-conditioned and image-conditioned).
054 | Labal-Anything-Pipeline | Label-Anything-Pipeline | - | https://github.com/Yuqifan1117/Labal-Anything-Pipeline | ZJU | Annotate anything in visual tasks, all in one pipeline with GPT-4 and SAM.
055 | Grounded-Segment-Any-Parts | Grounded Segment Anything: From Objects to Parts | https://cheems-seminar.github.io/ | https://github.com/Cheems-Seminar/grounded-segment-any-parts | HKU | Expands the Segment Anything Model (SAM) to support text prompt input; the text prompt can be object-level (e.g., dog) or part-level (e.g., dog head).
056 | AnyLabeling | AnyLabeling | https://www.youtube.com/watch?v=5qVJiYNX5Kk | https://github.com/vietanhdev/anylabeling | - | Effortless AI-assisted data labeling with AI support from Segment Anything and YOLO.
057 | SSA | Semantic-Segment-Anything | https://replicate.com/cjwbw/semantic-segment-anything | https://github.com/fudan-zvg/Semantic-Segment-Anything | - | Automated dense category annotation engine that serves as the initial semantic labeling for the Segment Anything dataset (SA-1B).
058 | RefSAM | Label Data with Segment Anything in Roboflow | https://blog.roboflow.com/label-data-segment-anything-model-sam/ | https://github.com/helblazer811/RefSAM | - | Referring image segmentation benchmarking with the Segment Anything Model (SAM).
059 | Roboflow Annotate | Launch: Label Data with Segment Anything in Roboflow | https://blog.roboflow.com/label-data-segment-anything-model-sam/ | https://app.roboflow.com/ | Roboflow | SAM-assisted labeling for training computer vision models.
060 | ImageBind SAM | - | - | https://github.com/IDEA-Research/Grounded-Segment-Anything/tree/main/playground/ImageBind_SAM | IDEA-Research | An experimental demo that aims to combine ImageBind and SAM to generate masks from different modalities.
The Segment Anything Model (SAM) focuses on the ability to generate masks without annotator input and incorporates a variety of prompts, including visual and text prompts, to guide segmentation. It is designed to handle a wide range of segmentation problems through the promptable segmentation task, and its training dataset was built through a train-annotate (data engine) loop. SegGPT, on the other hand, unifies various segmentation tasks into an in-context learning framework, emphasizing strong zero-shot capability through a generalist-model approach.
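To make the promptable interface concrete, the following minimal sketch queries SAM with a single foreground point using the official segment-anything Python package; the checkpoint path, image file, and prompt coordinates are placeholders, not values taken from this survey.

    import numpy as np
    import cv2
    from segment_anything import sam_model_registry, SamPredictor

    # Load a SAM checkpoint (model type and path are placeholders).
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    predictor = SamPredictor(sam)

    # Embed the image once; prompts can then be issued interactively.
    image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
    predictor.set_image(image)

    # One foreground click (label 1); multimask_output returns several candidates for ambiguous prompts.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
        multimask_output=True,
    )
    best_mask = masks[np.argmax(scores)]

The same predictor also accepts a box (box argument) or a coarse mask (mask_input argument), which is what lets one model serve many downstream prompting styles.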
Eye-tracking technologies, exemplified by GazeSAM, have the potential to significantly enhance real-time interaction with segmentation models in clinical settings by using eye movements as input prompts. This allows radiologists to collect segmentation masks during image diagnosis simply by looking at regions of interest. Integrating eye tracking in this way streamlines the segmentation process, reducing the time and effort required from clinicians while maintaining accuracy, and ultimately improves the efficiency of real-time clinical workflows and decision making.
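GazeSAM's actual pipeline is not reproduced here; the sketch below only illustrates the underlying idea of turning gaze fixations into SAM point prompts, reusing the predictor from the previous sketch. The gaze-point format and the dwell-time threshold are hypothetical; real eye-tracking SDKs differ.

    import numpy as np

    def segment_from_gaze(predictor, gaze_points, min_dwell_ms=300):
        # gaze_points: list of (x, y, dwell_ms) tuples from an eye tracker (hypothetical format).
        fixations = [(x, y) for x, y, dwell in gaze_points if dwell >= min_dwell_ms]
        if not fixations:
            return None
        coords = np.array(fixations, dtype=np.float32)
        labels = np.ones(len(fixations), dtype=np.int32)  # treat every fixation as a foreground click
        masks, scores, _ = predictor.predict(
            point_coords=coords, point_labels=labels, multimask_output=True
        )
        return masks[np.argmax(scores)]  # keep the highest-scoring candidate mask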
The Medical SAM Adapter (MSA) integrates medical-specific domain knowledge into SAM, significantly improving its performance across 19 medical image segmentation tasks spanning modalities such as CT and MRI, and demonstrates how domain adaptation can extend a model's applicability to specialized fields. Similarly, SAM-Adapter injects domain-specific information or visual prompts into the segmentation network, combining task-specific knowledge with the general knowledge learned by the large model and yielding notably improved performance on challenging medical imaging tasks. Such adaptations are crucial for tailoring generalist models like SAM to the nuances of medical imaging, where applicability and accuracy are critical.
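MSA's exact adapter design is not reproduced here; the PyTorch sketch below only shows the generic bottleneck-adapter pattern that such methods insert into frozen transformer blocks, with illustrative dimensions.

    import torch
    import torch.nn as nn

    class BottleneckAdapter(nn.Module):
        # Down-project, non-linearity, up-project, residual connection.
        def __init__(self, dim=768, bottleneck=64):
            super().__init__()
            self.down = nn.Linear(dim, bottleneck)
            self.act = nn.GELU()
            self.up = nn.Linear(bottleneck, dim)
            nn.init.zeros_(self.up.weight)  # start as identity so the frozen backbone is initially unchanged
            nn.init.zeros_(self.up.bias)

        def forward(self, x):
            return x + self.up(self.act(self.down(x)))

    # During fine-tuning only adapter parameters are trained; the SAM backbone stays frozen.
    tokens = torch.randn(1, 196, 768)        # placeholder patch tokens from a ViT block
    adapted = BottleneckAdapter()(tokens)

Because only the small adapter is trained, domain knowledge from modalities such as CT and MRI can be injected at a fraction of the cost of full fine-tuning.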
Visual prompts play a crucial role in SAM by enabling the model to respond effectively to unseen user queries, enhancing its zero-shot generalization ability. The model uses a unified prompt scheme to encode different visual inputs, such as points, boxes, scribbles, and masks, into a joint visual-semantic space, allowing it to handle diverse segmentation tasks without prior training on those specific tasks. This ability to integrate various prompt types contributes significantly to SAM's adaptability and effectiveness across different contexts and application scenarios.
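SAM's real prompt encoder is not reproduced here; the toy sketch below only illustrates the idea of mapping heterogeneous prompts (clicks, box corners, coarse masks) into one token space that a shared mask decoder can consume. All layer choices and dimensions are illustrative assumptions.

    import torch
    import torch.nn as nn

    class ToyPromptEncoder(nn.Module):
        # Sparse prompts (clicks, box corners) become tokens; a coarse mask becomes a dense embedding.
        def __init__(self, dim=256):
            super().__init__()
            self.xy_proj = nn.Linear(2, dim)       # stand-in for a Fourier positional encoding
            self.label_emb = nn.Embedding(4, dim)  # 0/1 = background/foreground click, 2/3 = box corners
            self.mask_proj = nn.Conv2d(1, dim, kernel_size=16, stride=16)

        def forward(self, xy, labels, mask):
            sparse = self.xy_proj(xy) + self.label_emb(labels)  # (N, dim) prompt tokens
            dense = self.mask_proj(mask)                        # (1, dim, H/16, W/16) mask embedding
            return sparse, dense

    enc = ToyPromptEncoder()
    xy = torch.tensor([[0.4, 0.5], [0.2, 0.2], [0.8, 0.9]])     # one click plus two box corners (normalized)
    labels = torch.tensor([1, 2, 3])
    sparse, dense = enc(xy, labels, torch.zeros(1, 1, 256, 256))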
Innovations in multimodal learning, such as ImageBind, which aligns six different modalities around image/video data, aim to enhance model robustness and social impact by creating more cohesive, context-aware learning frameworks. These models are designed to capture and exploit the interdependencies among data modalities, addressing challenges related to performance, robustness, and interpretability; by offering a unified approach to handling diverse data types, they show promise for improving model adaptability and reliability in diverse real-world applications.
Foundation models like SAM contribute significantly to the pursuit of artificial general intelligence (AGI) by providing versatile models capable of performing varied tasks through zero-shot generalization. They are regarded as a critical step towards AGI because they can transfer pre-trained knowledge to new scenarios without extensive retraining, reflecting a more human-like adaptability and learning capability in artificial systems.
SAM achieves its flexibility through the promptable segmentation task, in which prompts such as locations (points), ranges (boxes), masks, or text descriptions indicate the segmentation target. This is realized by three components: the task definition, which specifies the data and objective; the model architecture, designed to process an image together with various prompts; and the extensive SA-1B dataset, created through an interactive train-annotate loop and containing over a billion masks. With this setup, SAM can adapt to a wide range of existing and novel segmentation tasks, enabling zero-shot application across different data distributions.
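As a sketch of how the same promptable model also supports a fully automatic "segment everything" mode, in the spirit of the final stage of the SA-1B data engine, the snippet below uses the segment-anything package's automatic mask generator; the checkpoint and image paths are placeholders.

    import cv2
    from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

    sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")  # placeholder checkpoint
    mask_generator = SamAutomaticMaskGenerator(sam, points_per_side=32)   # prompts SAM with a point grid

    image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
    masks = mask_generator.generate(image)  # list of dicts: 'segmentation', 'area', 'bbox', 'predicted_iou', ...
    print(len(masks), "masks; largest area:", max(m["area"] for m in masks))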
SAM has been found to struggle with ambiguous or concealed objects, as evidenced by its performance on camouflaged object detection tasks. Studies report that segmentation accuracy drops when objects are not distinctly visible or identifiable, indicating room for improvement in its foundational training or prompt handling. This is a key insight into the limitations that must be addressed to make SAM more versatile and reliable in less straightforward segmentation settings.
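Such studies typically quantify the failure by comparing SAM's predicted masks with ground-truth masks from camouflaged object detection benchmarks; a minimal IoU sketch is given below (dataset loading is omitted and the variable names are hypothetical).

    import numpy as np

    def mask_iou(pred, gt):
        # Intersection over union between two boolean masks of identical shape.
        pred, gt = pred.astype(bool), gt.astype(bool)
        union = np.logical_or(pred, gt).sum()
        return np.logical_and(pred, gt).sum() / union if union > 0 else 1.0

    # `pred_mask` could be SAM's best candidate from the predictor sketch above,
    # and `gt_mask` a ground-truth camouflaged-object mask from a COD benchmark.
    # iou = mask_iou(pred_mask, gt_mask)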
Advances in multimodal foundation models enhance vision-language alignment by capturing cross-modal interactions and learning unified embedding spaces. Models such as CLIP, ALIGN, Florence, VLBERT, X-LXMERT, and DALL-E handle classification, retrieval, object detection, and image captioning by aligning image-text data and learning universal visual representations. ImageBind extends these capabilities further by aligning six different modalities around image/video information, pushing research towards effectively aligning multiple modalities.
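As a concrete example of this kind of image-text alignment, the sketch below scores an image against candidate captions with OpenAI's CLIP package; the image file and captions are placeholders.

    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
    texts = clip.tokenize(["a photo of a dog", "a photo of a cat"]).to(device)

    with torch.no_grad():
        image_feat = model.encode_image(image)
        text_feat = model.encode_text(texts)
        image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

    print(probs)  # per-caption match probabilities in the shared embedding space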
SAM also faces challenges in complex scenarios, such as misclassifying parts of indistinct objects and failing to predict certain elements in overlapping scenes. It struggles to identify instruments in surgical scenes involving blood, reflection, blur, and shade, revealing insufficient robustness to various forms of data corruption. These challenges highlight SAM's limitations in maintaining performance across diverse and complex real-world applications.