

A Comprehensive Survey on Segment Anything Model for Vision and Beyond

arXiv:2305.08196v2 [cs.CV] 19 May 2023

Chunhui Zhang, Li Liu*, Member, IEEE, Yawen Cui, Guanjie Huang, Weilin Lin, Yiqian Yang, Yuehong Hu

• Chunhui Zhang is with the Hong Kong University of Science and Technology (Guangzhou), Guangzhou 511458, China, the Cooperative Medianet Innovation Center, Shanghai Jiao Tong University, Shanghai 200240, China, and also with CloudWalk Technology Co., Ltd, 201203, China. E-mail: [email protected].
• Li Liu is with the Hong Kong University of Science and Technology (Guangzhou), Guangzhou 511458, China. E-mail: [email protected].
• Yawen Cui is with the Hong Kong University of Science and Technology (Guangzhou), Guangzhou 511458, China and also with the University of Oulu. E-mail: [email protected].
• Guanjie Huang is with the Hong Kong University of Science and Technology (Guangzhou), Guangzhou 511458, China. E-mail: [email protected].
• Weilin Lin is with the Hong Kong University of Science and Technology (Guangzhou), Guangzhou 511458, China. E-mail: [email protected].
• Yiqian Yang is with the Northwestern Polytechnical University, Xi'an 710072, China. E-mail: [email protected].
• Yuehong Hu is with the Central South University, Changsha 410083, China. E-mail: [email protected].
* Corresponding author. This work was done at the Hong Kong University of Science and Technology (Guangzhou).

Abstract—Artificial intelligence (AI) is evolving towards artificial general intelligence, which refers to the ability of an AI system to
perform a wide range of tasks and exhibit a level of intelligence similar to that of a human being. This is in contrast to narrow or
specialized AI, which is designed to perform specific tasks with a high degree of efficiency. It is therefore urgent to design a general
class of models, which we term foundation models, trained on broad data such that they can be adapted to various downstream tasks. The
recently proposed segment anything model (SAM) has made significant progress in breaking the boundaries of segmentation, greatly
promoting the development of foundation models for computer vision. To fully comprehend SAM, we conduct a survey study. As the
first to comprehensively review the progress of the segmenting anything task for vision and beyond based on the foundation model of SAM,
this work focuses on its applications to various tasks and data types by discussing its historical development, recent progress, and
profound impact on broad applications. We first introduce the background and terminology for foundation models including SAM, as
well as state-of-the-art methods contemporaneous with SAM that are significant for the segmenting anything task. Then, we analyze and
summarize the advantages and limitations of SAM across various image processing applications, including software scenes, real-world
scenes, and complex scenes. Importantly, many insights are drawn to guide future research to develop more versatile foundation
models and improve the architecture of SAM. We also summarize numerous other applications of SAM in vision and beyond.
Finally, we maintain a continuously updated paper list and an open-source project summary for the foundation model SAM here.

Index Terms—Survey, Artificial General Intelligence, Foundation Models, Segment Anything, Open Source Projects.

1 INTRODUCTION

FOUNDATION models [1], [2], [3] have revolutionized artificial intelligence (AI) in the past few years, thanks to their thorough pre-training on web-scale datasets and powerful zero-shot generalization across a wide range of downstream tasks. More recently, the natural language processing (NLP) community has undergone a significant shift towards the development of large language models (LLMs), resulting in a series of ground-breaking works, e.g., BERT [4], T5 [5], GPT-3 [6], and GPT-4 [7]. One of the most amazing applications of these models is ChatGPT [8], an AI chatbot developed by OpenAI that leverages a large language model called GPT-3.5 to generate human-like responses to user inputs.

Due to the great success of foundation models in NLP, researchers have been inspired to explore large visual models (LVMs) in the computer vision (CV) community. One line of research is to scale vision transformers to a huge size, pursuing the emergent capabilities exhibited in LLMs, such as ViT-G [9], ViT-22B [10], Swin Transformer V2 [11], and VideoMAE V2 [12]. Besides, a large body of works is devoted to adding knowledge of additional modalities to enhance the capabilities of LVMs. Some notable examples include CLIP [13] and ALIGN [14], which adopt a text encoder and an image encoder to learn visual and language representations of image and text pairs from massive noisy image-text data using contrastive learning [15]. After pre-training, the learned semantic knowledge can be used to reference novel visual concepts on new data distributions, enabling the model with zero-shot transfer capability in various downstream tasks, such as image-text retrieval [16], [17], and image generation [18], [19].

Although this progress brings new impetus to the development of CV, the generalization ability of the obtained deep models remains limited [2], [20]. Recently, the CV community has been witnessing a surge in exploring task-agnostic foundation models [20], [21], [22], [23], [24], [25]. A common characteristic of these models is to rely on a foundation model pre-trained on a broad dataset with a task that can solve a wide range of downstream tasks using prompt learning [26]. This new research trend of developing task-agnostic foundation models was recently sparked by a model called the segment anything model (SAM) [20], designed for general image segmentation. SAM is a promptable model trained on over 1 billion masks from 11 million images using a promptable segmentation task that enables powerful zero-shot generalization. Many researchers, such as Jim Fan [27], consider this as "the GPT-3 moment for CV, as SAM has learned the general concept of what an object is, even for unknown objects, unfamiliar scenes (e.g., underwater and cell microscopy) and ambiguous cases", and SAM has demonstrated great potential as a fundamental model for CV [22], [28], [29].
Recently, a large number of extended works have been proposed by the community to explore the capability boundaries of SAM and apply it to various tasks, e.g., medical image analysis [30], [31], [32], [33], [34], [35], [36], [37], [38], image inpainting [39], image editing [40], style transfer [41], infrastructure detection [42], camouflaged object detection [29], mirror and transparent object detection [43], image captioning [44], audio-visual localization [45], video object tracking [46], 3D reconstruction [47], few-shot object counting [48], and adversarial attacks [49], [50]. Concurrent to SAM, Wang et al. [23] proposed a generalist model, namely SegGPT, to unify various segmentation tasks into an in-context learning framework, which has demonstrated strong zero-shot capabilities. Furthermore, Zou et al. [24] proposed a more general segmentation system, SEEM, by introducing more diverse prompts than SAM, including visual prompts (points, boxes, scribbles, masks), text prompts, and referring prompts (referred regions of another image). The authors claim that the unified prompt scheme introduced in SEEM can encode different prompts into a joint visual-semantic space, producing strong zero-shot generalization to unseen user prompts for segmentation. Additionally, some pioneering works explore general AI methods for detecting/segmenting anything in open-vocabulary scenarios, e.g., Grounding DINO [51], OVSeg [52], V3Det [53], and OpenSeg [54]. These advancements have led many researchers to believe that versatile foundation models are a critical step towards artificial general intelligence (AGI) [55], [56], [57], [58].

To this end, this work provides a comprehensive survey of these works with the goal of helping researchers understand the latest developments related to SAM. This survey mainly focuses on various foundation models since SAM, especially the applications of SAM to various tasks and data types. Readers are referred to existing surveys [1], [2], [3], [59], [60], [61], [62], [63], [64], [65], [66], [67] for language, vision, and multimodal foundation models. To the best of our knowledge, this survey is the first to comprehensively review the recent progress of the segmenting anything task for vision and beyond based on the foundation model of SAM. Concurrent to our work, [33], [68] briefly summarized recent efforts to extend SAM to vision and medical image segmentation tasks; however, we provide a more comprehensive review with many new insights from a broader perspective. Furthermore, we maintain a continuously updated paper list and a project summary to reflect the dynamic progress of the foundation model SAM during its development.

The remainder of this survey is organized as follows: Section 2 introduces the background and terminology for foundation models including SAM, as well as methods contemporaneous with SAM that are important for the segmenting anything task. Section 3 discusses works based on SAM for various image processing applications, including software scenes, real-world scenes, and complex scenes. Section 4 further discusses the follow-up works that extend SAM to vision-related applications, beyond vision, and in more directions. Finally, we conclude the survey in Section 5. This survey will be regularly updated to reflect the dynamic progress of the foundation model SAM, as this is a rapidly evolving and promising field towards AGI.

2 BACKGROUND AND TERMINOLOGY

2.1 Image Segmentation

2.1.1 Classic Segmentation
Image segmentation is a fundamental computer vision task that separates a digital image into multiple parts by assigning each pixel to a class or object. Traditionally, segmentation includes three major tasks: semantic, instance, and panoptic. Semantic segmentation [69], [70], [71], [72] assigns each pixel to a predefined semantic class label. Instance segmentation [73], [74], [75] further separates instances of the same class. Panoptic segmentation, proposed by [76], combines semantic and instance segmentation to understand scenes comprehensively. Researchers have fully explored the above tasks in past studies. Due to the operational consistency of these tasks at the pixel level, many studies have tried to use a unified framework to provide solutions for the three segmentation tasks simultaneously, such as K-Net [77], MaskFormer [78], and Mask2Former [79].

2.1.2 Interactive Segmentation
Interactive segmentation [80] is a particular segmentation task characterized by leveraging guidance from user interaction. Despite being a longstanding challenge, the problem has seen considerable improvement. Usually, the user provides some initial input, such as points, strokes, or bounding boxes, to indicate the rough location and shape of the object. Then, the algorithm iteratively refines the segmentation based on user feedback, such as correcting mislabeled regions or adding missing parts. Interactive segmentation is useful for many applications that require precise object extraction, such as medical image analysis [81], [82], [83], photo editing [84], and data annotation [85], [86].

2.2 Foundation Models
Foundation models are a new paradigm for building artificial intelligence systems that can be adapted to various downstream tasks. They are based on training large neural networks on massive amounts of data, often using self-supervised learning techniques. This allows them to learn general representations and capabilities that can be transferred to different domains and applications. The term was coined by the Stanford Center for Research on Foundation Models (CRFM) in 2021 to capture the significance and challenges of this paradigm [1].

The development of foundation models can be traced back to the rise of deep learning and self-supervised learning in the NLP field, which enabled learning powerful representations from raw text data. Early examples of foundation models were pre-trained LLMs, such as BERT [4], T5 [5], and the GPT-n series [6], [8], [7], which demonstrated impressive capabilities and performance on a wide range of NLP tasks. In CV research, current foundation models try to take advantage of LLMs, which are trained on large-scale data

and show superb performance in learning universal visual representations from diverse, large-scale image-text data. Representatives include CLIP [13], ALIGN [14], Florence [87], VLBERT [88], X-LXMERT [89], and DALL-E [19], which try to capture the cross-modal interactions between vision and language. They can be transferred to, or directly act on, classification, retrieval, object detection, video understanding, visual question-answering, image captioning, and image generation tasks. Recently, ImageBind [90] attempted to align six different modalities around image/video information and learn a unified embedding space, opening up further research on methods, tasks, and multimodal foundation models. Foundation models in computer vision and multimodal learning are still an active area of research, with many challenges and opportunities for improving their performance, robustness, interpretability, and social impact.

Fig. 1: Overview of the SA project, including a segmentation task, model, and data. The figure is borrowed from the original paper [20].

2.3 Segment Anything Model
SAM comes from the Segment Anything (SA) project of Meta in 2023 [20]. Observing that foundation models in the NLP and CV fields show strong performance, researchers tried to build a similar model to unify the whole image segmentation task. However, the available data in the segmentation field is insufficient and differs from their design purpose. Therefore, as shown in Fig. 1, they divide the pathway into three steps, namely Task, Model, and Data. Correspondingly, a project for segmentation tasks is proposed, including the promptable segmentation task (prompts include providing a location, a range, a mask, or a text description of the segmentation target), the SAM model that can accept multiple prompt inputs and realize interactive use, and the SA-1B dataset with over one billion masks, formed using the data engine of an interactive train-annotate loop.

Fig. 2: Overall structure of SAM from the original paper [20].

2.3.1 Task
The ultimate goal of the SA project is to provide a model with a wide range of functions that can be quickly adapted to many existing and new segmentation tasks (such as edge detection, object proposal generation, instance segmentation, and segmenting objects from free-form text) via prompt engineering. More specifically, the concept of interactive segmentation is introduced to form the promptable task and realize the training of the model. A unique characteristic of the promptable task is returning a valid segmentation mask when given any segmentation prompt. A prompt can be anything indicating what to segment. A valid segmentation mask means that even if the input prompt is ambiguous (for example, in an image of a person wearing a T-shirt, the prompt point lies on the T-shirt), the output should be a reasonable mask for at least one object: returning either the mask of the person or the mask of the T-shirt is acceptable.

2.3.2 Model
The structure of SAM is shown in Fig. 2. It mainly consists of three parts: a powerful image encoder (an MAE [91] pre-trained ViT [92]); a prompt encoder, which handles sparse inputs (points and boxes via positional encodings, and free-form text via CLIP's [13] text encoder) and dense inputs (masks, processed by convolutions); and a mask decoder (a prompt-image bidirectional Transformer decoder using self-attention and cross-attention). In addition, when the input prompt is ambiguous, the network ranks the three possible mask outputs based on confidence. The loss functions used in training are focal loss [93] and dice loss [94].
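As a concrete illustration of the promptable interface described above, the sketch below queries SAM with a point and a box prompt through the official segment_anything Python package. The checkpoint filename and image path are assumptions for illustration, not values taken from the survey.

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM checkpoint and wrap it in a predictor; the heavy image encoder
# runs once per image, after which prompts are cheap to evaluate.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # assumed local path
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)    # assumed input image
predictor.set_image(image)  # computes and caches the image embedding

# Sparse prompts: one foreground point plus one XYXY bounding box.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),            # 1 = foreground click, 0 = background click
    box=np.array([200, 150, 500, 400]),
    multimask_output=True,                 # return three candidate masks for ambiguous prompts
)
best_mask = masks[np.argmax(scores)]       # scores are SAM's predicted mask IoUs
```

The multimask_output flag mirrors the ambiguity handling discussed above: three candidate masks are returned together with their confidence scores, and the caller picks or re-ranks them.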
2.3.3 Data
Since there is insufficient public data for training, the researchers use a training-annotation iterative process to build a data engine that achieves model training and dataset construction simultaneously. The specific process can be divided into three stages.
1) Assisted-manual stage. Professional annotators use an interactive labeling tool in the browser, combined with SAM, for manual labeling. SAM is first trained on public datasets. As the data gradually increases, the size of SAM's image encoder is also increased. At the end of this stage, 4.3M masks and 120k images had been collected.
2) Semi-automatic stage. To increase mask diversity and improve the model's performance, the researchers first pre-fill the masks for which the model can make high-confidence predictions. Then they ask the annotators to annotate the remaining unfilled parts interactively. At the end of this stage, an image provides an average of 72 masks.
3) Fully automatic stage. In this stage, owing to the collection of enough masks and the introduction of the ambiguity-aware model, the final training of SAM and the construction of the SA-1B dataset can be performed. The ambiguity-aware model enables SAM to predict valid masks even when the prompt is ambiguous. Specifically, the researchers use a 32x32 grid to sample prompt points uniformly on each image. If a prompt point is located on a part or sub-part structure, the model returns masks for the sub-part, the part, and the whole object, and the outputs are filtered and sorted by confidence (see the sketch after this list). At the end of this stage, the final SA-1B dataset contains 11M images and 1.1B masks.
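The fully automatic stage can be reproduced at small scale with the package's automatic mask generator, which implements the 32x32 point grid and confidence filtering described above. The checkpoint path and the two threshold values below are assumptions for illustration.

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")  # assumed local path
generator = SamAutomaticMaskGenerator(
    model=sam,
    points_per_side=32,            # 32x32 grid of prompt points, as in the SA-1B pipeline
    pred_iou_thresh=0.88,          # keep masks that SAM itself scores as confident (assumed value)
    stability_score_thresh=0.95,   # drop masks that are unstable under threshold perturbation (assumed value)
)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
masks = generator.generate(image)  # list of dicts: 'segmentation', 'area', 'predicted_iou', ...
masks.sort(key=lambda m: m["predicted_iou"], reverse=True)
```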
With the advantages of a well-designed task, model structure, and massive high-quality training data, experiments show that the zero-shot transfer capability of SAM yields excellent results on single-point segmentation, edge detection, object proposal generation, instance segmentation, interactive segmentation, and multimodal (text-to-mask) segmentation tasks. It even outperforms supervised models in some respects.

2.4 Concurrent Works
Parallel to SAM research, many efforts have been made to solve segmentation tasks with other general methods.
OneFormer [25] leverages a task-conditioned joint training strategy, a task token, and a query-text contrastive loss to form a universal image segmentation framework. OneFormer enables training on all three traditional segmentation tasks within a single universal model and a multi-task training process. With different backbones, it outperforms specialized models on the ADE20K [95], Cityscapes [96], and COCO [97] datasets while costing much less training time and resources.
Meanwhile, SegGPT [23], which stands for Segmenting everything with a Generalist Painter, explores an in-context training and inference scheme. It forms a generalist in-context learning framework [98] that unifies different segmentation data formats and treats training as an in-context coloring problem with a random coloring scheme instead of a predefined color space. This forces the model to focus on contextual information to accomplish the specific task. Based on these improvements, the model can perform arbitrary segmentation tasks on input images or videos through in-context inference.
Also, SEEM [24] further broadens the task applicability of a single segmentation model. It expands the types of supported prompts, including points, boxes, scribbles, masks, texts, and referred regions of another image. With the proposed joint visual-semantic space, the model can compose flexible multi-prompt inputs. SEEM can also act as a classic segmentation model when no prompt is provided. However, it suffers from limited training data and lacks support for part-based segmentation.

3 SAM FOR IMAGE PROCESSING

3.1 Software Scenes

3.1.1 Image Editing
Modern software scenes require operations for image editing and inpainting, e.g., removing, filling, and replacing objects. However, existing inpainting works, like [99], [100], [101], [102], need fine annotations for each mask to achieve good performance, which is labor-intensive. SAM [20], which can generate accurate masks with simple prompts such as points or boxes, can help assist image editing scenes.

Fig. 3: Overall pipeline of Inpaint Anything (IA). The input image is segmented by SAM and the targeted segment is replaced by the output of the inpainting models to achieve different tasks. The figure is borrowed from the original paper [39].

Inpaint Anything (IA) [39] designs a pipeline to solve inpainting-related problems by combining the advantages of SAM, state-of-the-art (SOTA) image inpainters [99], and AI-generated content (AIGC) models [103]. The pipeline is illustrated in Fig. 3. For object removal, the pipeline is composed of SAM and SOTA inpainters, like LaMa [99]. The clicking action from the user is used as a prompt in SAM to generate a mask for the object area, which is refined with corrosion and dilation operations before LaMa fills it. For object filling and replacing, AIGC models, like Stable Diffusion (SD) [103], are used in the second step to fill the selected region with newly generated objects described by text prompts.
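A minimal sketch of this click-to-remove flow is shown below, assuming the public segment_anything API. IA hands the mask to LaMa or Stable Diffusion; here the classical OpenCV inpainter stands in so the sketch stays self-contained, and the function name, checkpoint path, and kernel size are assumptions.

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

def remove_object(image_rgb: np.ndarray, click_xy, checkpoint="sam_vit_h_4b8939.pth") -> np.ndarray:
    """Click-to-remove sketch: SAM mask from one point, dilate it, then inpaint the hole."""
    sam = sam_model_registry["vit_h"](checkpoint=checkpoint)
    predictor = SamPredictor(sam)
    predictor.set_image(image_rgb)

    masks, scores, _ = predictor.predict(
        point_coords=np.array([click_xy], dtype=np.float32),
        point_labels=np.array([1]),       # the click marks foreground
        multimask_output=True,
    )
    mask = masks[np.argmax(scores)].astype(np.uint8)

    # Dilate so the inpainter also covers the object boundary, mirroring IA's mask refinement.
    mask = cv2.dilate(mask, np.ones((15, 15), np.uint8))

    # Stand-in filler; IA instead calls LaMa (removal) or Stable Diffusion (fill/replace).
    return cv2.inpaint(image_rgb, mask * 255, 3, cv2.INPAINT_TELEA)
```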
Fig. 4: Overall pipeline of Edit Everything from the original paper [40].

A similar idea can also be seen in Edit Everything [40]. As shown in Fig. 4, it allows users to edit images using simple text instructions. Specifically, given an input image, SAM first separates it into several segments without prompts, followed by a source prompt instructing CLIP to rank the received segments. Only the segment with the highest score is selected as the target to be replaced with a newly generated object produced by SD from the target prompt. Compared with the object-replacing solution in IA, the authors train a CLIP with 400 million parameters and an SD with 1 billion parameters on Chinese scenarios to make the system more reliable for Chinese text prompts. Furthermore, the paper improves the realism of the results by breaking complex prompts into smaller entities that are replaced in a sequential manner. Although it performs well as a novel tool, the paper points out that it still needs specific enhancement in different scenarios.

3.1.2 Style Transfer

Fig. 5: Illustrations of Any-to-Any Style Transfer from the original paper [41].

Style transfer aims to transfer the style of a given image (style image) to another given image (content image). Typically, the transferred style is represented by the holistic style of the style image or its local colors and textures, and only one result is generated for the content image, which lacks flexibility for users to interact with. With the promptable region selection ability of SAM, Any-to-Any Style Transfer [41] enables users to specify which style regions to select and which content regions to apply them to during style transfer. The pipeline, shown in Fig. 5, is organized as follows:
1) Encode the style and content images with a pre-trained VGG-19 and calculate the content-style attention map.
2) Obtain the style and content masks with SAM and input prompts.
3) Fuse the attention map with the mask-controlling signals from the last step.
4) Compute the stylized feature with the updated attention map and derive the final result.
With the defined pipeline, the paper shows that the proposed method is a plug-and-play component for existing style transfer methods, including local-transformation-based style transfer [104], [105], global-transformation-based style transfer [106], and diffusion-based style transfer [107], which shows its great potential for broad applications.
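The mask-controlled fusion in steps 2-4 can be pictured with the schematic PyTorch sketch below. It only illustrates the idea of restricting content-to-style attention to the SAM-selected regions; the tensor shapes, normalization, and blending rule are assumptions, not the authors' implementation.

```python
import torch

def masked_style_attention(content_feat, style_feat, content_mask, style_mask):
    """content_feat: (B,C,Hc,Wc); style_feat: (B,C,Hs,Ws); masks: float {0,1}, (B,Hc,Wc)/(B,Hs,Ws)."""
    B, C, Hc, Wc = content_feat.shape
    q = content_feat.flatten(2).transpose(1, 2)            # (B, Nc, C) content queries
    k = style_feat.flatten(2)                              # (B, C, Ns) style keys
    attn = torch.softmax(q @ k / C ** 0.5, dim=-1)         # (B, Nc, Ns) content-style attention map

    # Step 3: zero attention towards style pixels outside the user-selected style region.
    attn = attn * style_mask.flatten(1).unsqueeze(1)
    attn = attn / attn.sum(dim=-1, keepdim=True).clamp(min=1e-6)

    # Step 4: compute the stylized feature from the updated attention map.
    v = style_feat.flatten(2).transpose(1, 2)              # (B, Ns, C)
    stylized = (attn @ v).transpose(1, 2).reshape(B, C, Hc, Wc)

    # Only the user-selected content region receives the new style; the rest is untouched.
    cm = content_mask.unsqueeze(1)
    return cm * stylized + (1 - cm) * content_feat
```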
3.2 Real-World Scenes

3.2.1 Detection
SAM can assist applications in many real-world scenarios such as object detection, object counting, and moving object detection. Recently, [108] evaluated the performance of SAM across a diverse range of real-world segmentation scenarios, e.g., natural image, agriculture, manufacturing, remote sensing, and healthcare scenes. The paper finds that SAM generalizes excellently to common scenes such as natural images, while it is less effective in low-contrast scenes and requires strong prior knowledge in complex scenes.

Fig. 6: Process of crack detection using SAM and U-Net. The figure is borrowed from the original paper [42].

For example, in the application of civil infrastructure defect assessment, [42] utilizes SAM to detect cracks in concrete structures and compares its performance with the baseline U-Net [109]. The crack detection process is illustrated in Fig. 6. The results show that SAM outperforms U-Net in the detection of longitudinal cracks, which are more likely to resemble the training images of normal scenes, while in the unusual case of spalling cracks, SAM is not as good as U-Net.
Unlike the complex image cases in crack detection, crater detection is more suitable for using SAM as the detection tool, as crater shapes are mostly circular or elliptical. Craters are one of the most important morphological features in planetary exploration, and detecting and counting them is an important but time-consuming task in planetary science. Although existing works in machine learning and computer vision successfully solve some specific problems in crater detection, they rely on specific types of data and thus fail to work well on a different data source.
In [110], the authors propose a universal crater detection scheme based on the zero-shot generalization of SAM to unfamiliar objects. The pipeline uses SAM to segment the input images, with no restrictions on data type and resolution. Then, it utilizes circular-elliptical indexes to filter out segmentation masks that are not of circular-elliptical shape. Finally, a post-processing filter is employed to get rid of duplicates, artifacts, and false positives. The pipeline shows great potential to become a general tool in this field, and the authors also discuss the drawback that only specific shapes can be recognized.

3.2.2 Counting
Few-shot object counting is an important application of computer vision in the real world, which counts objects of an unseen category given only a few bounding-box examples. Since SAM shows impressive performance and great generalization to unseen objects, it has potential for few-shot object counting. [48] is the first paper testing SAM on this task and comparing it with other baseline few-shot counting methods. The goal is to find out whether SAM can segment and distinguish the target objects using the reference examples.
To realize this, the authors define a pipeline. Firstly, they calculate the dense image feature with an image encoder, i.e., ViT-H. Secondly, they use bounding boxes as prompts to generate segmentation masks for the reference examples, which are then combined with the dense image feature to obtain feature vectors of the reference objects. Thirdly, they use point grids as prompts to segment everything and generate feature vectors for all masks. After that, they compute the cosine similarity between the feature vectors of the predicted masks and the reference examples; only the ones above a pre-defined threshold are considered target objects. The paper compares the proposed SAM-based method with other few-shot counting methods on two datasets, FSC-147 [111] and MS-COCO [112], and finds that it falls behind the SOTA baselines, especially for small and congested objects. Thus, further improvement of SAM in such special scenes is still needed.

3.2.3 Moving Object
Moving object segmentation (MOS) is a crucial task for applying computer vision in many real-world scenarios, such as autonomous driving. The existing datasets for this research are mainly RGB or Lidar videos, which lack the event information that can help understand dynamic scenes better. To fill this gap, DSEC-MOS [113] is proposed with frames of moving vehicles and the corresponding event data, which can facilitate the development of more accurate, robust, and efficient algorithms for autonomous driving. The contribution of SAM to DSEC-MOS annotation is that it provides a promptable segmentation scheme: the authors apply the moving object bounding boxes from DSEC-MOD [114] as prompts to generate a large number of preparatory masks with SAM, which are accurate and reliable. The dataset contains 16 sequences with 13,314 frames in total and provides event-based data with pixel-level annotations, which can be a valuable resource for the MOS field.

Fig. 7: Framework of SAA+ [115], which achieves zero-shot anomaly segmentation by introducing hybrid prompts as a regularization technique. The figure is borrowed from the original paper [115].

3.3 Complex Scenes

3.3.1 Low-Contrast Scene
Beyond the normal scenes mentioned above, whether SAM can solve segmentation in complex scenes, such as low-contrast scenes, is also a meaningful question for broadening its application. To examine the generalization ability of SAM in more complex scenes, Ji et al. [22] quantitatively compare it with cutting-edge models in three concealed scenes, i.e., camouflaged animals, industrial defects, and medical lesions. They conduct experiments on three camouflaged object segmentation (COS) datasets, i.e., CAMO [116] with 250 samples, COD10K [117] with 2,026 samples, and NC4K [118] with 4,121 samples, and compare SAM with outstanding transformer-based models, CamoFormer-P/S [119] and HitNet [120]. The results show that SAM is not skillful in concealed scenes, and the authors point out that a potential solution may rely on the support of prior knowledge in the specific fields. The same conclusion can also be drawn from [29], where the authors compare SAM with 22 SOTA methods for camouflaged object detection on the same three datasets.
Dominic et al. [121] utilize SAM in the field of plant phenotyping, e.g., potato leaf segmentation, by proposing a method named Leaf Only SAM. Specifically, they combine SAM with four post-processing steps to identify only leaf objects without any training data. After receiving the segmentation masks from SAM, they first check the color by finding the green masks. To reduce ambiguity, they further check the main object by keeping the mask with an IoU of more than 90% and removing the rest. Then, they compare the area of the minimum enclosing circle to filter out masks with incorrect shapes. Finally, they remove multi-leaf masks by summing all the mask objects in the image and labeling each pixel with the number of masks it belongs to; a mask with a mean score over 1.5 is assumed to be a duplicate. The experiments compare Leaf Only SAM with a fine-tuned Mask R-CNN and find that Leaf Only SAM does not outperform the latter, with roughly a 10% performance drop, but its training-free ability shows its potential for this field.
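The kind of training-free mask filtering used by Leaf Only SAM can be sketched roughly as follows. The HSV green range, the compactness ratio, and the 1.5 overlap threshold echo the description above but their exact values are assumptions, and the main-object IoU check is omitted for brevity.

```python
import numpy as np
import cv2

def filter_leaf_masks(image_rgb: np.ndarray, masks: list) -> list:
    """Rough sketch of Leaf-Only-SAM-style post-processing on binary SAM masks."""
    hsv = cv2.cvtColor(image_rgb, cv2.COLOR_RGB2HSV)
    green = cv2.inRange(hsv, (35, 40, 40), (85, 255, 255)) > 0   # assumed "green" range

    kept = []
    for m in masks:
        m = m.astype(bool)
        if green[m].mean() < 0.5:                    # 1) keep predominantly green masks
            continue
        ys, xs = np.nonzero(m)
        pts = np.column_stack([xs, ys]).astype(np.float32).reshape(-1, 1, 2)
        (_, _), radius = cv2.minEnclosingCircle(pts)
        if m.sum() / (np.pi * radius ** 2 + 1e-6) < 0.1:   # 3) drop very non-compact shapes (assumed ratio)
            continue
        kept.append(m)

    # 4) remove multi-leaf duplicates: masks whose pixels are covered by ~1.5+ masks on average.
    if kept:
        coverage = np.sum(np.stack(kept).astype(np.int32), axis=0)
        kept = [m for m in kept if coverage[m].mean() < 1.5]
    return kept
```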
Cao et al. [115] propose a new framework called Segment Any Anomaly + (SAA+) for zero-shot anomaly segmentation, as shown in Fig. 7. The framework utilizes hybrid prompt regularization to improve the adaptability of modern foundation models, allowing for more accurate anomaly segmentation without the need for domain-specific fine-tuning. The authors conduct thorough experiments on four anomaly segmentation benchmarks, i.e., VisA [122], MVTec-AD [123], MTD [124], and KSDD2 [125], and achieve SOTA performance. He et al. [126] propose the first method (WS-SAM) that leverages SAM for weakly-supervised concealed object segmentation, addressing the challenge of segmenting objects that are well blended with their surroundings using sparsely-annotated data (see Fig. 8). The proposed WS-SAM involves SAM-based pseudo labeling and multi-scale feature grouping to improve model learning and distinguish concealed objects from the background. The authors found that, using only scribble supervision [127], SAM can generate segmentation masks good enough to train a segmenter.

Fig. 8: Framework of WS-SAM [126] with scribble supervision for weakly-supervised concealed object segmentation. The figure is borrowed from the original paper [126].

The glass scene is even more challenging among concealed scenes than camouflaged animals and the others. It is closely related to safety issues when applying machine learning algorithms in the real world. For example, autonomous mobile robots may easily crash into a transparent front door without a reliable transparent object detection algorithm. In this scene, the detected objects often have mirror surfaces or are simply transparent, which easily makes traditional detection algorithms fail. As SAM has a strong capability for zero-shot segmentation of unseen objects, [43] evaluates it on both mirror and transparent object scenes and compares it with SOTA methods from semantic segmentation, shadow detection, salient object detection, and glass segmentation. The experimental results show that SAM can successfully identify objects behind transparent ones but fails to recognize the glass objects themselves. The overall performance of SAM is significantly worse than that of methods trained specifically on transparent objects, which proves that SAM is not ready to be deployed in safety-critical situations that contain glass.

3.3.2 Thermal Infrared Image
The thermal infrared image scene is another kind of complex scene where the images are always dark and hard to annotate. Therefore, a large amount of unlabeled data is wasted, and models for this field fail to reach high accuracy in a reliable way. To solve this problem, [128] utilizes SAM to generate pseudo labels and builds a large-scale thermal infrared segmentation dataset, SATIR, for model pretraining, which contains over 100,000 images with pixel-level annotations. To further improve the performance of models in this field, the authors propose a three-step framework, shown in Fig. 9: they construct the aforementioned dataset with SAM, pretrain the model with it, and then finetune the pre-trained model on the target task. An experiment on the public thermal infrared semantic segmentation dataset SODA [129] validates its effectiveness, where the backbone model pre-trained on SATIR outperforms others by approximately 1.3% in mean Intersection over Union (mIoU).

Fig. 9: Framework of creating and using SATIR in three steps. (a) Construct a pre-training dataset with SAM. (b) Pretrain models with the dataset. (c) Finetune the pre-trained models on the target task. The figure is borrowed from the original paper [128].

In the poultry industry, thermal infrared images are also used for chicken segmentation, an important agricultural application in the context of cage-free hens. To find out SAM's capacity in this field, the paper [130] uses SAM for the chicken segmentation task and compares its performance with other SOTA baseline methods, i.e., SegFormer [131] and SETR [132]. The authors find that SAM has superior performance compared with the two baseline methods, especially when using all point prompts. However, with thermal images as input, there are still some limitations in recognizing arbitrary parts, e.g., the chicken tail. Beyond chicken segmentation, they also explore the ability of SAM in chicken-tracking tasks by combining SAM, YOLOX [133], and ByteTracker [134] into a tracker. They realize real-time tracking with this combination and show its potential in this field.

3.3.3 Overhead Image
Overhead imagery problems are related to a set of widely-studied tasks where the objects are usually small and dense, e.g., remote sensing images. In [135], the authors test whether the impressive generalization ability of SAM can cover the scenes in this field. Specifically, they examine the out-of-the-box accuracy of SAM on six existing benchmark datasets of overhead imagery, including Solar [136], Inria [137], DeepGlobe [138], 38-Cloud [139], DeepGlobe Roads [138], and Parcel Delineation [140], which encompass 5 million square kilometers of surface area. The results demonstrate that SAM often generalizes well to overhead imagery while failing in some cases with unique characteristics of the target objects, which is considered a systemic failure
and can be addressed through modest changes in SAM's design. Similar observations and ideas are proposed in [141]. The authors adopt SAM for the geological mapping task on remote sensing images of the Mars surface and find that it cannot be directly applied to domain-specific tasks due to a lack of problem-specific bias. Thus, they change SAM's design by introducing a domain-specific decoder, which learns the problem-specific semantics by fine-tuning through knowledge distillation with only five labeled images and achieves the same performance as a Mask R-CNN [142] trained with over 400 labeled images.
The issue of unlabeled remote sensing datasets can also be addressed with SAM. Because objects in remote sensing images are small and dense, they are hard and cost-inefficient for human experts to label. Thus, the paper [143] develops an efficient pipeline with SAM to generate a large-scale remote sensing segmentation dataset named SAMRS, which will be detailed in the data annotation section. The paper [141] mentioned above also explores the applicability of its proposed decoder for annotation and finds that the bounding box-based segmentation mode is more suitable for rapid annotation. Apart from that, some researchers combine the advantages of different foundation models [144], like SAM and Grounding DINO, to achieve text-prompt-guided segmentation on remote sensing images and prove its effectiveness in this field. The details are illustrated in the vision and language section.

4 OTHER APPLICATIONS: VISION AND BEYOND

4.1 Vision Related Applications

4.1.1 Medical Imaging
The goal of medical image segmentation is to reveal the anatomical or pathological structure of the corresponding tissues, which can assist computer-aided diagnosis and intelligent clinical surgery [161], [162]. Due to the rapid development of computation power and medical data resources, deep learning-based medical image segmentation has achieved massive progress in accuracy and speed against traditional counterparts [163], [164]. With the emerging Vision Transformer (ViT), ViT-based medical image methods [165], [166], [167], [168] have achieved surpassing performance in medical image segmentation. However, such networks are designed for a specific task and lack generalization ability to other tasks. Recently, SAM has made it possible to solve multiple kinds of segmentation tasks in a unified framework. In this context, researchers have paid attention to customizing SAM for medical image segmentation and have concluded some useful strategies to improve its performance. The work [33] briefly summarizes recent efforts to extend the success of SAM to medical image segmentation tasks, while we provide a more comprehensive summary with deeper insight in this section.
According to the imaging format of medical images, the usage of SAM in medical image segmentation can be categorized into six series: Computerized Tomography (CT) images, Magnetic Resonance Imaging (MRI) images, colonoscopy images, H&E stained histological section images, multiple format images, and others.
For CT images. A CT scan combines a series of X-ray images taken from different angles around the body and uses computer processing to create cross-sectional images (slices) of the bones, blood vessels, and soft tissues inside the body. The paper [145] presents a preliminary examination of SAM as an annotation tool for medical image analysis, specifically for the segmentation of multi-phase liver tumors (MPLiTS). The investigation focuses on the prompts used, data resolution, and phases. The experimental findings demonstrate the effectiveness of SAM in this context, while also highlighting areas where MPLiTS could be improved. To provide comprehensive guidance to the MPLiTS community, the authors plan to conduct further investigations that encompass a wider range of aspects. The purpose of the paper [146] is to perform an initial assessment of SAM's out-of-the-box zero-shot capabilities for medical image segmentation by evaluating its performance on an abdominal CT organ segmentation task using point- or bounding box-based prompting. The findings indicate that SAM can generalize effectively to CT data, which could potentially accelerate the development of semi-automatic segmentation tools for clinicians. SAMed [147] is a solution for medical image segmentation that differs from previous methods in that it builds on SAM, a large-scale image segmentation model. This approach customizes SAM for medical image segmentation by applying the low-rank-based (LoRA) finetuning strategy to the SAM image encoder. Unlike SAM, SAMed performs well on semantic segmentation tasks for medical images, and the trained SAMed model achieves performance comparable to SOTA methods. Additionally, since SAMed only updates a small fraction of the SAM parameters, its deployment and storage costs are minimal in practical usage.
For MRI images. MRI is a non-invasive diagnostic imaging technique that uses a powerful magnetic field, radio waves, and a computer to produce detailed images of internal structures in the body. MRI is commonly used to visualize the brain, spine, joints, and other soft tissues. The study [148] compares SAM with FSL's Brain Extraction Tool (BET), a widely used and current gold-standard brain extraction technique, on a variety of brain scans with varying image qualities, MR sequences, and brain lesions affecting different brain regions. The findings show that SAM outperforms BET based on average Dice coefficient, IoU, and accuracy metrics, particularly in cases where image quality is compromised by signal inhomogeneities, non-isotropic voxel resolutions, or the presence of brain lesions near or involving the outer regions of the brain and meninges. Furthermore, SAM has superior segmentation properties, enabling a fine-grained separation of different tissue compartments and brain structures. These results suggest that SAM has the potential to become a more accurate, robust, and versatile tool for a broad range of brain extraction and segmentation applications. The paper [149] demonstrates that SAM can achieve high segmentation accuracy for brain tumor MRI datasets in a point-to-mask setting and effectively generalizes to such datasets, achieving segmentation accuracy similar to what was observed on the 2D photographs where it was previously evaluated. Furthermore, the authors identify the challenges encountered when using SAM for tumor segmentation in MRI datasets and propose strategies to address them, which can also be applied in the context of clinical implementation.
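A typical zero-shot protocol behind such CT/MRI evaluations (prompt SAM with a box derived from the ground-truth mask, then score the prediction with Dice) can be sketched as follows. The function names, checkpoint path, and data iterable are placeholders introduced for illustration only.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + 1e-6)

def box_from_mask(gt: np.ndarray, pad: int = 5) -> np.ndarray:
    """Loose XYXY box around the ground-truth region, used as the prompt."""
    ys, xs = np.nonzero(gt)
    return np.array([xs.min() - pad, ys.min() - pad, xs.max() + pad, ys.max() + pad])

def evaluate_box_prompts(predictor: SamPredictor, slices) -> float:
    """slices: iterable of (HxWx3 uint8 image, HxW boolean ground-truth mask) pairs."""
    scores = []
    for image_rgb, gt_mask in slices:
        predictor.set_image(image_rgb)
        masks, _, _ = predictor.predict(box=box_from_mask(gt_mask), multimask_output=False)
        scores.append(dice(masks[0], gt_mask))
    return float(np.mean(scores))

# Example wiring; the checkpoint path and data loader are assumptions:
# sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
# print(evaluate_box_prompts(SamPredictor(sam), my_ct_slices))
```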
Fig. 10: Outline of SAM for medical images, including Computerized Tomography (CT) images, Magnetic Resonance Imaging (MRI) images, colonoscopy images, multiple format images, H&E stained histological sections images, and others.

For colonoscopy images. A colonoscopy is a test to check inside the bowels and can help find what is causing bowel symptoms. This report [31] evaluates the performance of SAM in segmenting polyps under unprompted settings. Polyp-SAM [150] is a finetuned SAM model designed for polyp segmentation. The authors evaluate its performance against various SOTA polyp segmentation models and compare two transfer learning strategies: one involving finetuning of the encoders and the other without. Experimental results on five public datasets illustrate SOTA performance on two datasets and impressive performance on the remaining three.

For H&E stained histological sections images. H&E stained histological sections refer to tissue samples that have been processed and stained with hematoxylin and eosin (H&E) dyes for microscopic examination. This staining technique is commonly used in histology and pathology to highlight different structures and cellular components within the tissue sample. This work [156] assesses the zero-shot segmentation performance of the SAM model on representative segmentation tasks in whole slide imaging, including tumor segmentation, non-tumor tissue segmentation, and cell nuclei segmentation. The core results indicate that the zero-shot SAM model achieves remarkable segmentation performance for large connected objects. The authors also identify several limitations for digital pathology, including image resolution, multiple scales, prompt selection, and model fine-tuning. To address these limitations, few-shot fine-tuning with images from downstream pathological segmentation tasks may help the model achieve better performance in dense object segmentation in the future. This paper [158] demonstrates that SAM's generated masks, features, and stability scores can be leveraged to build and train more advanced medical image segmentation models. Specifically, it showcases how SAM can be used to enhance image inputs for a commonly used medical image segmentation model, such as U-Net. The proposed method is tested on two datasets, and the experiments demonstrate its effectiveness. SkinSAM [157] presents a fine-tuned model based on SAM for skin cancer segmentation, which demonstrates remarkable segmentation performance. Experimental results also illustrate that larger backbones, such as ViT_L and ViT_H, perform better than the smaller ViT_b, while the fine-tuned model (ViT_b finetuned) shows the greatest improvement.
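SkinSAM, Polyp-SAM, and the MedSAM-style methods discussed below largely share one recipe: keep SAM's heavy image encoder frozen and fine-tune only the lightweight mask decoder (and sometimes the prompt encoder) on task-specific data. The snippet below is a minimal sketch of that setup with the official segment_anything model; the checkpoint path is a placeholder, and the training loop itself is only indicated because it depends on the dataset and the prompt-sampling strategy.

import torch
from segment_anything import sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder path

# Freeze the image encoder and prompt encoder; train only the mask decoder.
for p in sam.image_encoder.parameters():
    p.requires_grad = False
for p in sam.prompt_encoder.parameters():
    p.requires_grad = False

trainable = [p for p in sam.mask_decoder.parameters() if p.requires_grad]
print(f"trainable parameters: {sum(p.numel() for p in trainable):,}")

optimizer = torch.optim.AdamW(trainable, lr=1e-4, weight_decay=1e-4)
loss_fn = torch.nn.BCEWithLogitsLoss()  # usually combined with a Dice loss in practice

# for images, boxes, gt_masks in dataloader:        # dataset-specific, omitted here
#     embeddings = sam.image_encoder(images)        # can be precomputed once and cached
#     ... run sam.prompt_encoder / sam.mask_decoder, compute loss, optimizer.step()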
For multiple format images. Works in this part evaluate SAM or propose SAM-based segmentation methods for more than one segmentation task with multiple kinds of medical images.

• Evaluation of SAM's ability to segment medical images. This paper [30] is the first work to extend the success of SAM to medical images. It compiles an extensive medical image dataset comprising 11 distinct modalities and containing more than 200,000 masks. This work [37] is an extensive evaluation of SAM's ability to segment medical images on a collection of 11 medical imaging datasets from various modalities and anatomies. This paper [151] explores the accuracy of SAM on 12 public medical image segmentation datasets which cover various organs (brain, breast, chest, lung, skin, liver, bowel, pancreas, and prostate), image modalities (2D X-ray, histology, endoscopy, 3D MRI, and CT), and health conditions (normal, lesioned). This work [32] evaluates SAM on medical images and presents both quantitative and qualitative results of zero-shot segmentation on nine benchmarks for medical image segmentation. These benchmarks encompass a variety of imaging modalities, including optical coherence tomography (OCT), magnetic resonance imaging (MRI), and computed tomography (CT), and different applications such as dermatology, ophthalmology, and radiology. This paper [36] evaluates SAM's zero-shot generalization on medical images by collecting more than 12 public medical image datasets that cover various organs and modalities. This paper [152] evaluates the zero-shot performance of SAM 2D in medical imaging by testing it on six datasets from four different imaging modalities, including X-ray, ultrasound, dermatoscopy, and colonoscopy. The findings indicate that SAM 2D's zero-shot performance is either comparable or superior to the existing SOTA models. Huang et al. [38] collect and sort 52 open-source datasets and build a large medical segmentation dataset with 16 modalities, 68 objects, and 553K slices. A comprehensive analysis of different SAM testing strategies on the so-called COSMOS 553K dataset is conducted.
• SAM-based segmentation methods for medical images. This paper [30] proposes a straightforward fine-tuning approach to tailor the SAM model for general medical image segmentation. Rigorous experimentation on 21 3D segmentation tasks and 9 2D segmentation tasks illustrates that MedSAM surpasses the default SAM model. Segment Any Medical Model (SAMM) [153] is an extension of SAM on 3D Slicer, a widely-used open-source image processing and visualization software that has been extensively used in the medical imaging community. Medical SAM Adapter (MSA) [34] first integrates medical-specific domain knowledge into the segmentation model and shows superior performance on 19 medical image segmentation tasks with various image modalities, including CT, MRI, ultrasound, fundus, and dermoscopic images. This paper [154] introduces Learnable Ophthalmology SAM, a novel approach to multiple target segmentation in ophthalmology multimodal images. The proposed method incorporates a learnable prompt layer that extracts medical prior knowledge from each transformer layer. Using a one-shot mechanism during training, this work only trains the prompt layer and task head. Experiments demonstrate the effectiveness of the proposed method on four medical segmentation tasks (including blood vessel segmentation, lesion segmentation, and retinal layer segmentation) across nine publicly available datasets. SAM-Adapter [35] leverages domain-specific information or visual prompts to enhance the segmentation network through the use of simple yet effective adapters. By combining task-specific knowledge with general knowledge learned by the large model, SAM-Adapter can notably improve the performance of SAM in challenging tasks, as demonstrated by extensive experiments. PromptUNet [155] is proposed by extending the existing prompt types in SAM to include novel Supportive Prompts and En-face Prompts. The authors evaluate its capabilities on 19 medical image segmentation tasks using a variety of image modalities, and PromptUNet exceeds a wide range of state-of-the-art (SOTA) medical image segmentation methods.

For other medical images. This paper [159] evaluates SAM's performance using two established robotic instrument segmentation datasets from the MICCAI EndoVis 2017 and 2018 challenges. The extensive evaluation indicates that SAM displays remarkable zero-shot generalization ability with bounding box prompts. However, the qualitative figures demonstrate that the model either fails to predict certain parts of the instrument mask (such as the jaws and wrist) or misclassifies parts of the instrument in scenarios where instruments overlap within the same bounding box or when point-based prompts are used. In some complex surgical scenarios involving blood, reflection, blur, and shade, SAM is unable to identify the instruments, and its performance is not robust enough to withstand various forms of data corruption. GazeSAM [160] is the first work to leverage the power of eye-tracking technology and SAM to enhance the efficiency of daily clinical practice. It enables radiologists to collect segmentation masks by simply looking at the region of interest during image diagnosis. The system tracks the radiologists' gaze and uses the resulting data as input prompts for SAM, which then generates the segmentation mask automatically in real time.

4.1.2 Video
In the field of computer vision, Video Object Tracking (VOT) [169], [170] and video segmentation are recognized as crucial and indispensable tasks: they involve locating a particular object in a video frame and subsequently tracking it throughout the rest of the video. As such, they have various practical applications, such as surveillance and robotics.

Fig. 11: Pipeline of TAM. The figure is borrowed from the original paper [46].

SAM [20] has made outstanding contributions in the field of VOT. The Track Anything Model (TAM) [46] achieves outstanding interactive tracking and segmentation in videos with remarkable performance. The paper proposes a highly effective toolkit named Track-Anything for high-performance object tracking and segmentation in videos. Unlike existing methods, it adopts an interactive approach for initialization and incorporates SAM and XMem [171]. The pipeline of TAM is illustrated in Fig. 11. SAM is a large segmentation model with powerful segmentation abilities, while XMem is an advanced semi-supervised VOS model. The proposed method exhibits exceptional performance and user-friendliness in complex situations, with potential applications in many fields. Additionally, the paper analyzes failure cases and recommends solutions, showing that the method can tackle challenging situations in video object perception efficiently.

Furthermore, another tracking model, SAM-Track [172], has been proposed. SAM-Track is a video segmentation framework that enables object tracking and segmentation through both interactive and automatic methods. The pipeline of SAM-Track is illustrated in Fig. 12. SAM-Track combines Grounding-DINO [51], DeAOT [173], and SAM to achieve these capabilities. By employing both interaction and automation, SAM-Track becomes a highly versatile tool with potential beyond object tracking and segmentation alone, and its combination of features has allowed it to achieve strong results.

Fig. 12: Pipeline of SAM-Track. The figure is borrowed from the original paper [172].
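TAM and SAM-Track share the same basic pattern: SAM turns a single user interaction into a high-quality mask on the first frame, and a video object segmentation (VOS) model propagates that mask through the remaining frames. The sketch below illustrates this division of labor; the VOSTracker class and its initialize/track methods are hypothetical stand-ins for a propagation model such as XMem or DeAOT, not an actual API of those projects.

import numpy as np
from segment_anything import sam_model_registry, SamPredictor

class VOSTracker:
    """Hypothetical wrapper around a mask-propagation model (e.g., XMem or DeAOT)."""
    def initialize(self, frame: np.ndarray, mask: np.ndarray) -> None: ...
    def track(self, frame: np.ndarray) -> np.ndarray: ...

def interactive_track(frames: list, click_xy: tuple) -> list:
    sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder path
    predictor = SamPredictor(sam)

    # 1) SAM: one click on the first frame -> initial object mask.
    predictor.set_image(frames[0])
    masks, scores, _ = predictor.predict(
        point_coords=np.array([click_xy]), point_labels=np.array([1]), multimask_output=True
    )
    init_mask = masks[np.argmax(scores)]

    # 2) VOS model: propagate the mask through the rest of the video.
    tracker = VOSTracker()
    tracker.initialize(frames[0], init_mask)
    return [init_mask] + [tracker.track(f) for f in frames[1:]]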
In addition to its applications in tracking tasks, SAM has also shown promise in enhancing the resolution of video. Video super-resolution (VSR) [174], [175], [176], [177], [178] can accurately aggregate information from multiple frames, but its primary challenge is handling large motions in the input frames, which requires accurate displacement estimation in complex scenes and is demanding in terms of both real-time processing and computational complexity. The work in [179] proposes a method to enhance the quality of video super-resolution by utilizing SAM to build a more robust and semantically-aware prior. Moreover, the authors devise a lightweight SAM-guided refinement module (SEEM) to boost the performance of existing methods. The framework of SEEM is illustrated in Fig. 13. The SEEM module is easy to integrate into existing methods, and experimental results show that it provides superior performance with both comprehensive fine-tuning and light parameter adjustment.

Fig. 13: Overview of the proposed framework from the original paper [179]. (a) How to obtain the SAM-based representation. (b) Applying the proposed SAM-guided refinement module (SEEM) to the sliding-window-based method. (c) Applying SEEM to the bidirectional recurrent structure-based method.

4.1.3 Data Annotations
Data annotation in AI involves the process of labeling data for machine learning algorithms to help them learn to recognize specific patterns, objects, or features. Accurate data annotation is crucial to the development of effective machine learning models that can successfully perform tasks like object detection, classification, and natural language processing. Due to the high cost associated with annotating images and videos in certain fields, many datasets have not been labeled effectively, particularly at the pixel level. However, the emergence of SAM [20] will facilitate the effective labeling of such datasets.

SAMText [180] is a scalable pipeline for mask annotation of scene text in videos. The pipeline utilizes SAM to generate mask annotations for a large-scale dataset, SAMText-9M, which contains over 2,400 video clips and more than 9 million mask annotations. The authors argue that annotating scene text at a finer level can lead to remarkable improvements in detection and recognition performance, even for curved text. Furthermore, the paper identifies several potential research directions, such as examining the effect of mask annotations, enhancing data and model scalability, and generating character-level mask annotations.

The work in [143] presents a method to construct a large-scale remote sensing image segmentation dataset named SAMRS by utilizing existing remote sensing object detection datasets and the data-centric machine learning model SAM. SAMRS includes object category, location, and instance information, which can be utilized for semantic segmentation, instance segmentation, and object detection research. The authors conduct a comprehensive analysis of the dataset and discuss SAM's performance in remote sensing image segmentation, demonstrating its potential to improve annotation efficiency. SAMRS surpasses previously existing high-resolution remote sensing image segmentation datasets in size and is a valuable resource for large-scale model pre-training research in the field of remote sensing image segmentation.

Generating high-quality pseudo-labels is a crucial step towards achieving accurate and reliable results in various computer vision applications. These pseudo-labels can be used to train and test various computer vision models, enabling researchers and practitioners to achieve state-of-the-art results in object recognition, semantic segmentation, and other related tasks. SAM has made it easy, fast, and highly effective to generate high-quality pseudo-labels [126], [128], [181], thus facilitating new breakthroughs in the field of computer vision research and applications.

He et al. [126] propose a new method for weakly-supervised multi-object semantic segmentation. The WS-SAM framework utilizes the recently proposed vision foundation model, SAM, to generate segmentation masks, and proposes several techniques, including multi-augmentation result fusion, pixel-level uncertainty weighting, and image-level uncertainty filtration, to obtain reliable pseudo labels for training a segmentation model. Chen et al. [128] introduce a framework for pretraining thermal infrared image segmentation models using pseudo-labels generated with the SAM model. The framework can effectively improve the accuracy of segmentation results for specific categories, exceeding the SOTA ImageNet pretrained model; the authors generate a thermal infrared segmentation dataset containing over 100,000 images for pretraining. Jiang et al. [181] present an approach to weakly supervised semantic segmentation that utilizes cheap annotations such as image labels, points, and scribbles as prompts for SAM. The model outputs object masks with precise boundaries, which are used to generate pseudo labels for training segmentation networks. Experiments on PASCAL VOC [182] show that SAM can serve as an effective pseudo-label generator.
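The pseudo-labeling recipe used by these works is straightforward to reproduce: convert whatever cheap annotation is available (a point, a scribble sample, or a box) into a SAM prompt and store the resulting mask as a training label. The sketch below shows the box- and point-prompt variants with the official segment_anything predictor; image loading and label writing are omitted, and the checkpoint path is a placeholder.

import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder path
predictor = SamPredictor(sam)

def pseudo_label_from_box(image: np.ndarray, box_xyxy) -> np.ndarray:
    """Weak box annotation -> pixel-level pseudo label."""
    predictor.set_image(image)
    masks, scores, _ = predictor.predict(box=np.asarray(box_xyxy), multimask_output=True)
    return masks[np.argmax(scores)]

def pseudo_label_from_points(image: np.ndarray, points_xy: np.ndarray) -> np.ndarray:
    """Weak point/scribble annotation (all foreground) -> pixel-level pseudo label."""
    predictor.set_image(image)
    labels = np.ones(len(points_xy), dtype=np.int64)
    masks, scores, _ = predictor.predict(
        point_coords=points_xy, point_labels=labels, multimask_output=True
    )
    return masks[np.argmax(scores)]

# The resulting boolean masks are then saved as ordinary segmentation targets
# and used to train a standard segmentation network (e.g., DeepLab or U-Net).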
4.2 Beyond Vision
4.2.1 3D Reconstruction
In addition to achieving fine-grained 3D segmentation, SA3D [183] can also be used for 3D reconstruction. With the 3D mask grid obtained from the previous section, we can determine the occupied space of objects in 3D and reconstruct them in various ways.

However, due to the high memory requirements and computational complexity of NeRF-based methods, they are currently limited to relatively small scenes and cannot handle large-scale outdoor scenes. To tackle this challenge, several studies [184], [185] have proposed to use additional input modalities such as depth maps and surface normals to improve the efficiency and accuracy of NeRF-based 3D reconstruction.

SAM [20] is a SOTA method for 2D image segmentation that can segment anything with user-specified prompts and can be used in various applications such as object detection, image retrieval, and image synthesis. However, SAM is limited to 2D image data and cannot be directly applied to 3D scene understanding.

The proposed SA3D framework, as shown in Fig. 14, extends the segmentation ability of SAM to 3D scenes by leveraging NeRFs. SA3D can segment any object in a 3D scene with one-shot manual prompting in a single rendered view. SA3D utilizes mask inverse rendering and cross-view self-prompting to project 2D masks onto 3D mask grids and to generate new prompts for different views, respectively. Compared to previous approaches based on NeRFs, SA3D can easily adapt to any pre-trained NeRF without any changes or re-training.

Fig. 14: Overall pipeline of SA3D [183]. Initially, the user provides prompts for the target object, and SAM generates a segmentation mask for it. This 2D mask is then projected onto a 3D mask grid using mask inverse rendering, taking into account the density distribution. By rendering incomplete 2D masks from novel views, SA3D extracts reliable prompts and queries SAM to obtain accurate segmentation masks. This iterative process, called cross-view self-prompting, alternates between mask inverse rendering and obtaining segmentation masks, ultimately resulting in the final 3D segmentation results.
In contrast to reconstructing objects, the paper [186] presents a new object-removal pipeline called OR-NeRF that removes objects from 3D scenes using either point or text prompts on a single view. The method achieves better performance in less time than previous works by using a points projection strategy to rapidly spread user annotations to all views. The algorithm, shown in Fig. 15, leverages the recent 2D segmentation model SAM [20] to predict masks with improved precision and efficiency, and obtains color and depth priors through 2D inpainting methods. Finally, the algorithm employs depth supervision and perceptual loss for scene reconstruction to maintain consistency in appearance after object removal.

Fig. 15: Framework of OR-NeRF [186].

Although the above methods have demonstrated promising performance in 3D segmentation, there is still much room for improvement. Here, we discuss several potential future directions for 3D reconstruction:
• Improving the efficiency of NeRF-based methods: NeRF-based methods are computationally expensive and memory-intensive, limiting their scalability to large-scale scenes. Future research can focus on improving the efficiency and scalability of NeRF-based methods to enable real-time 3D scene understanding.
• Incorporating additional input modalities: In addition to multi-view images, incorporating additional input modalities such as depth maps and surface normals can improve the accuracy and efficiency of NeRF-based 3D reconstruction and segmentation.
• Integrating with other 3D perception tasks: 3D segmentation is an important task in 3D scene understanding. Integrating SA3D with other 3D perception tasks, such as 3D object detection, 3D scene reconstruction, and 3D pose estimation, can lead to more comprehensive 3D scene understanding.
• Expanding the scope of applications: SA3D can be applied to various applications such as robotics, augmented reality, and virtual reality. Future research can explore the potential of SA3D in these fields and develop novel applications.

4.2.2 Non-Euclidean Domain

Fig. 16: Illustrations of SNA from the original paper [28].

In the context of graph neural networks, the Non-Euclidean domain refers to graphs that are irregular and do not have a predefined structure like grids or lattices. These graphs can represent a wide range of data, including social networks, citation networks, e-commerce product graphs, and molecule graphs. Due to the complexity and heterogeneity of these graphs, developing a foundation model for universal graph analysis has become a challenging task.

In recent years, several research works have been conducted to address the challenges of graph analysis in the Non-Euclidean domain. For example, the work presented in [187] proposes the Graph Convolutional Network (GCN), which is based on spectral graph convolutions and can be applied to a wide range of graph-related tasks. The GraphSAGE method proposed by Hamilton et al. [188] enables the scalable processing of large graphs by sampling neighbors instead of using all of them. The Graph Attention Network (GAT) introduced by Veličković et al. [189] utilizes the attention mechanism to learn the weights for each neighbor automatically.

Despite the progress made in Non-Euclidean graph analysis, there is still a need for more versatile and adaptable foundation models. In particular, the recent work presented in [20] introduces SAM, a prompt-based framework for general image analysis that enables users to provide prompts to perform a range of image-related tasks, including object detection, segmentation, and classification. Inspired by the success of SAM, the Segment Anything in Non-Euclidean domains (SNA) [28] paradigm aims to develop a foundation model for universal graph analysis that is flexible, adaptable, and capable of handling diverse graph samples and tasks.

To address the challenge of handling diverse feature dimensions for varying tasks, the SNA approach shown in Fig. 16 introduces a specialized slimmable graph convolutional layer. This layer enables the dynamic activation or deactivation of its channels based on the input feature dimensions. Furthermore, the approach employs a meta-learning strategy that learns to select the optimal neurons based on the downstream tasks rather than relying on manual selection. This enables the handling of a broader range of unseen tasks and represents a step towards developing more versatile and adaptable foundation models for universal graph analysis.

In summary, the SNA paradigm aims to establish a foundation model for universal graph analysis in the Non-Euclidean domain. It leverages the prompt-based framework introduced in SAM and introduces specialized slimmable graph convolutional layers and a meta-learning strategy to handle the challenges of diverse graph samples and tasks. The proposed approach has the potential to inspire future research in developing more versatile and adaptable foundation models for graph analysis in the Non-Euclidean domain.

4.2.3 Robotics
In recent years, the use of foundation models has advanced various applications such as image segmentation and natural language processing. The paper [190] presents Instruct2Act, a framework that utilizes Large Language Models (LLMs) to map multimodal instructions to sequential actions for robotic manipulation tasks. The framework employs the LLM to generate Python programs that form a comprehensive perception, planning, and action loop for robotic tasks.

Fig. 17: Paradigm of the Instruct2Act [190] framework. Best viewed by zooming in.

Fig. 17 shows the overall pipeline of Instruct2Act [190]. In the perception section, pre-defined APIs are used to access multiple foundation models. SAM [20] accurately locates candidate objects, and CLIP [13] classifies them. The framework leverages the expertise of foundation models and robotic abilities to convert complex high-level instructions into precise policy code. In this way, Instruct2Act [190] is adjustable and flexible in accommodating various instruction modalities and input types, catering to specific task demands.

The practicality and efficiency of Instruct2Act [190] were validated by assessing it on robotic tasks in different scenarios within tabletop manipulation domains. Moreover, the zero-shot method outperformed many state-of-the-art learning-based policies in several tasks. The code for Instruct2Act [190] is available at https://github.com/OpenGVLab/Instruct2Act, serving as a robust benchmark for high-level robotic instruction tasks with assorted modality inputs. This framework provides a promising approach for enabling robots to perform complex tasks by leveraging the power of foundation models and large language models.

In conclusion, Instruct2Act [190] is an early step for such powerful tools into the robotics area and may be a footprint of humanity stepping toward AGI. We will see more research in this field, paving the way for a more productive society.
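The Instruct2Act figure includes an example of the policy code its LLM emits. Below is a cleaned-up rendering of that pattern, assembled from the API names shown in the paper's figure (GetObsImage, SAM, ImageCrop, CLIPRetrieval, Pixel2Loc, PickPlace, RobotExecution); these are the paper's own wrapper functions rather than importable libraries, so treat the snippet as an illustration of the generated-policy structure, not standalone runnable code.

def main() -> dict:
    """LLM-generated policy: 'Put the green and purple polka dot block into the green container.'"""
    image = GetObsImage()                               # observe the tabletop scene
    masks = SAM(image=image)                            # SAM proposes candidate object masks
    objs, masks = ImageCrop(image=image, masks=masks)   # crop each masked object

    # CLIP matches the language query against the cropped objects.
    block = CLIPRetrieval(objs, "the polka dot block")
    container = CLIPRetrieval(objs, "the green container")

    # Map the selected masks from pixel space to robot workspace coordinates.
    block_loc = Pixel2Loc(block, masks)
    container_loc = Pixel2Loc(container, masks)

    # Compose and execute the pick-and-place skill primitive.
    action = PickPlace(pick=block_loc, place=container_loc, bounds=BOUNDS)
    return RobotExecution(action=action)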
4.2.4 Video Text Spotting
Video text spotting is a challenging task that involves localizing and recognizing text instances in video frames or sequences. Traditional methods for video text spotting have relied on the detection of bounding boxes and the subsequent recognition of text instances within those boxes. However, these methods are limited in their ability to accurately localize text instances, particularly those with irregular shapes or orientations.

Recent advances in computer vision, such as the segmentation-based approach proposed by SAM [20], have shown promise in addressing these limitations. The SAM approach utilizes a deep neural network to generate pixel-level segmentation masks for text instances, resulting in more accurate and fine-grained annotations.

In this context, the SAMText [180] pipeline, shown in Fig. 18, offers a scalable and efficient solution for generating mask annotations for video text spotting tasks. By applying the SAM model to bounding box annotations, SAMText is able to generate mask annotations for large-scale video text datasets, as demonstrated by the SAMText-9M dataset.

Fig. 18: Overview of the SAMText [180] pipeline, which extends the SAM approach to generate mask annotations for scene text images or video frames in a scalable manner. The pipeline takes an input bounding box, which can either come from existing annotations or be derived from a scene text detection model.

While SAMText is a novel approach for generating mask annotations for video text spotting tasks, it builds upon the foundation laid by the SAM model. Specifically, the SAM model's ability to generate high-quality pixel-level masks for objects in images has been adapted to the specific task of generating masks for text instances in video frames.

Given an input scene text image or video frame, SAMText begins by extracting the bounding box coordinates from existing annotations or from a scene text detection model. If the boxes are oriented, SAMText calculates their minimum bounding rectangle to obtain horizontal bounding boxes (HBB), which are then used as the input prompt for the SAM model to obtain mask labels. The SAM model is a segmentation model pre-trained on natural images and fine-tuned on the COCO-Text dataset to generate mask annotations for text instances.

After obtaining the mask for each text instance, post-processing may be necessary to ensure its connectivity. In particular, if a mask comprises several segments, it may be desirable to derive a minimum enclosing mask as an optional step in order to achieve a more cohesive representation. Furthermore, optical flow estimation can also be utilized to enhance the accuracy of the generated masks and ensure temporal consistency.

The SAMText pipeline provides an exciting avenue for future research in video text spotting. By providing fine-grained mask annotations for large-scale datasets, SAMText enables the development and evaluation of more accurate and effective video text spotting models. Additionally, the SAMText approach may inspire the development of new segmentation-based approaches for other computer vision tasks.

Overall, the SAMText pipeline represents an important contribution to the field of video text spotting, offering a scalable and efficient solution for generating fine-grained mask annotations. The approach holds promise for advancing the accuracy and effectiveness of video text spotting models and may inspire further research in the field of computer vision.
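The core SAMText conversion step, turning an oriented text box into a horizontal box and then into a SAM mask, can be sketched with OpenCV and the official SAM predictor as below. The checkpoint path and the example quadrilateral are placeholders, and the COCO-Text fine-tuning, connectivity post-processing, and optical-flow smoothing described above are omitted.

import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder path
predictor = SamPredictor(sam)

def text_mask_from_oriented_box(frame: np.ndarray, quad: np.ndarray) -> np.ndarray:
    """quad: (4, 2) array holding the corner points of an oriented text box."""
    # Oriented box -> axis-aligned (horizontal) bounding box, i.e., the HBB prompt.
    x, y, w, h = cv2.boundingRect(quad.astype(np.int32))
    hbb = np.array([x, y, x + w, y + h])

    predictor.set_image(frame)
    masks, scores, _ = predictor.predict(box=hbb, multimask_output=True)
    return masks[np.argmax(scores)]

# Example with a dummy frame and a placeholder quadrilateral.
frame = np.zeros((720, 1280, 3), dtype=np.uint8)
quad = np.array([[100, 200], [300, 180], [310, 240], [110, 260]])
mask = text_mask_from_oriented_box(frame, quad)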
4.2.5 Vision and Language
SAM also benefits vision and language tasks, such as image captioning and text-based segmentation. InternGPT [191] is a system that combines chatbots capable of planning and reasoning with non-verbal cues like pointing movements, allowing users to manipulate images and videos on a screen directly; the system uses SAM to segment objects. In the study [144], the authors propose a pipeline called Text2Seg, which leverages multiple visual foundation models to facilitate remote sensing image semantic segmentation tasks guided by text prompts. The authors focus on the remote sensing domain, where images are notably dissimilar from those in conventional scenarios and traditional models often underperform when confronted with testing data collected under different conditions.

Fig. 19: Overall structure of the Text2Seg [144] pipeline, which consists of three methods for guiding the SAM model. The pipeline begins by using a text prompt as input for Grounding DINO, which generates bounding boxes. These bounding boxes are then passed to SAM, which produces a segmentation map. In the second step, a text prompt is given to CLIP Surgery, resulting in a heatmap. This heatmap is sampled to create point prompts for SAM, which generates segmentation masks. Finally, SAM is used to generate segmentation maps, and CLIP is employed to compare the semantic similarity between these maps and the text prompt. The figure is borrowed from the original paper [144].

The authors incorporate multiple visual foundation models into their pipeline, including SAM, which was introduced by Meta AI Research as the first foundation model for the object segmentation task. SAM is capable of performing zero-shot segmentation using various visual prompts as guidance, making it particularly suitable for remote sensing imagery processing, where labeled datasets are often sparse and inherently heterogeneous.

However, adapting SAM for specific tasks remains non-trivial, as the model segments all objects into distinct masks, making it impractical to use directly for semantic segmentation tasks. To address this issue, the authors propose leveraging other foundation models that focus on different aspects to generate visual prompts as guidance for the SAM model. These models can generate points or bounding boxes that help narrow down the areas where SAM needs to make predictions or assist in filtering SAM's predictions based on specific text prompts.

The authors evaluated their pipeline (shown in Fig. 19) on four commonly used remote sensing datasets and achieved promising results. They showed that although SAM is effective in segmenting instances within the provided frame, generating segmentation masks for a specific category remains challenging. In contrast, other visual foundation models, like Grounding DINO [192] and CLIP [13], exhibit a superior capacity to understand the semantic features of images, thus generating coarse-grained visual prompts. Integrating these models into a unified pipeline allows us to exploit their combined strengths.

Overall, this work provides insights into maximizing the applicability of visual foundation models in specific contexts with minimal model tuning. The proposed pipeline is not limited to any specific dataset and can be applied to various scenarios with minimal adjustments to the prompt tuning process. The authors hope that their work will encourage additional research on the application of visual foundation models for diverse downstream tasks and stimulate the development of increasingly powerful visual foundation models.
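The first Text2Seg guidance mode (text to boxes to masks) is easy to sketch. In the snippet below the grounded_detect function is a hypothetical stand-in for a text-conditioned detector such as Grounding DINO; only the SAM calls use the real segment_anything API, and the checkpoint path is a placeholder.

import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def grounded_detect(image: np.ndarray, text: str) -> list:
    """Hypothetical wrapper: returns [x0, y0, x1, y1] boxes for regions matching `text`."""
    raise NotImplementedError("plug in a text-conditioned detector, e.g., Grounding DINO")

def text_to_masks(image: np.ndarray, text: str, checkpoint: str = "sam_vit_b.pth") -> np.ndarray:
    sam = sam_model_registry["vit_b"](checkpoint=checkpoint)
    predictor = SamPredictor(sam)
    predictor.set_image(image)

    semantic = np.zeros(image.shape[:2], dtype=bool)
    for box in grounded_detect(image, text):             # one box per detected region
        masks, scores, _ = predictor.predict(box=np.asarray(box), multimask_output=True)
        semantic |= masks[np.argmax(scores)]              # union of instance masks -> class map
    return semantic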
Image captioning is the task of generating natural language descriptions for given images. It is a fundamental problem in computer vision and natural language processing, with various applications in robotics and content-based image retrieval [193]. In recent years, significant progress has been made in this field, especially with the development of deep learning techniques.

One recent approach for controllable image captioning is Caption Anything (CAT), proposed by Wang et al. [44]. The framework, shown in Fig. 20, introduces multimodal controls to image captioning, rendering a variety of visual focuses and language styles aligned with human intention.

Fig. 20: Overview of the Caption Anything [44] framework, which enhances image captioning by incorporating multimodal controls that align with human intention, allowing for a diverse range of visual focuses and language styles. The visual prompt is initially converted into a mask prompt using the segmenter. Then, the captioner generates a raw caption specifically for the region outlined by the mask. To ensure that the captioner emphasizes the user's intended object, a straightforward visual chain-of-thought technique is employed for step-by-step inference. Finally, both the text prompt and the raw caption are input into the text refiner, which generates a user-preferred caption that aligns with the desired genre.

The approach is formulated as a triplet solver, consisting of a segmenter, a captioner, and a text refiner. The segmenter takes the interactive visual controls and represents the user-interested regions via pixel-level masks, which are subsequently used by the captioner to generate raw descriptions for the specified region based on the original image and the provided mask. To help the captioner focus on the user-interested object, the authors design a visual chain-of-thought technique with step-by-step inference. Finally, the text refiner refines the raw descriptions by incorporating user-defined language controls, tailoring the language style according to user preferences.

The CAT approach leverages SAM [20] to perform the segmentation task. SAM is a promptable segmentation model that achieves strong performance across various image domains. Specifically, SAM adapts interactive segmentation to achieve promptable ability, where a prompt, i.e., any interaction (e.g., points, boxes) indicating what to segment in an image, is used to prompt SAM to return a valid segmentation mask. Once the user-specified segmentation mask is obtained, it is easy to generate the desired caption according to the original image and the mask prompt.

The CAT approach is a training-free and adaptable solution for controllable image captioning tasks. It expands the range of supported control signals, enhances the model's flexibility and scalability, and offers strong user-interactive capabilities. The work highlights the importance of multimodal controls and prompts in controllable image captioning and provides insights into potential future research directions in this field.

In conclusion, with the help of SAM and LLMs, vision and language tasks may no longer require training dedicated models. We are about to see more powerful tools utilizing large pre-trained models or APIs to solve various kinds of tasks.

4.2.6 Audio and Vision

Fig. 21: Illustration of the proposed Segment Anything Model for Audio-Visual localization and segmentation (AV-SAM) from the original paper [45]. The pixel-wise audio-visual fusion module combines audio features and visual features from pre-trained audio and image encoders. These cross-modal representations are aggregated, and along with prompt embeddings from the prompt encoder, they are fed into the mask decoder. The mask decoder utilizes this information to generate the final segmentation mask.

Audio and vision are two modalities that are closely related and can provide complementary information to solve many problems. In recent years, there has been an increasing interest in joint audio-visual learning [194], [195], which aims to learn the correlation between the two modalities and leverage the complementary information for better performance in various tasks.

One of the most popular applications of audio-visual learning is sound localization and segmentation [196], [197]. This task aims to predict the spatial location of individual sound sources in a video. Audio-visual localization and segmentation can be challenging due to the complex nature of the problem, as audio is not naturally aligned with all objects that exist in the video. However, with the recent advances in deep learning, researchers have developed many effective methods for this task.

One approach [45] to audio-visual localization and segmentation is to learn cross-modal representations that align audio and visual information. As shown in Fig. 21, AV-SAM leverages pixel-wise audio-visual fusion across audio and visual features from the pre-trained audio encoder and image encoder to aggregate cross-modal representations. Then, the aggregated cross-modal features are fed into the prompt encoder and mask decoder to generate the final audio-visual segmentation masks.

Another approach to audio-visual localization and segmentation is to use contrastive learning to learn cross-modal correspondences. For example, EZ-VSL [198] and SLAVC [199] both use contrastive learning to learn the correspondence between audio and visual features. EZ-VSL uses a two-stream network to learn audio-visual alignment, while SLAVC uses a self-supervised contrastive learning approach to learn the correspondence between audio and visual features.

In addition to sound localization and segmentation, many other applications of audio-visual learning, such as audio-visual spatialization [200], [201], audio-event localization [202], [203], and audio-visual parsing [204], [205], have been addressed in previous studies.

Overall, audio-visual learning has become an increasingly important field in deep learning and has many applications in various domains. With the recent advances in deep learning, we can expect to see more innovative methods for joint audio-visual learning in the future.

4.2.7 Multimodal Visualization and Open-Vocabulary Interactive Segmentation

Fig. 22: Overview of CLIP Surgery [206], which introduces dual paths for open-vocabulary interactive segmentation. The original path consists of the original block with q-k self-attention, while the new path incorporates the proposed v-v self-attention without a feed-forward network (FFN), starting from a certain depth, denoted as d. It is important to note that the parameters of the new block are copied from the original block to ensure consistency. Additionally, feature surgery is employed to combine the features of image tokens Fi with the text features Ft for the explainability task. Furthermore, feature surgery is applied to the feature of the class token Fc from the original path for multi-label recognition. The figure is borrowed from the original paper [206].

Recent work has shown that CLIP [13] can achieve impressive performance on a variety of vision tasks with minimal or no task-specific training. However, its internal mechanisms are not well understood. In this section, we discuss a recent work [206] that applies CLIP to the task of open-vocabulary interactive segmentation, which is closely related to the concept of multimodal visualization.

Interactive segmentation is a computer vision task that involves segmenting a target object from an image with user guidance in the form of points, scribbles, or boxes during the inference phase. SAM [20] is a recent work that enables interactive segmentation in an open-vocabulary manner but requires manual points to guide the segmentation process.

The proposed method [206], illustrated in Fig. 22, aims to replace the need for manual points entirely by using CLIP Surgery with text-only inputs. This approach provides pixel-level results from text input, which can be readily converted to point prompts for the SAM model. Specifically, the authors select foreground points ranked ahead in the similarity map and use the same number of points ranked last as background points. The authors show that their method outperforms other explainability methods in terms of both point accuracy and mIoU with SAM on four datasets.

The proposed method offers several advantages over other prompt formats in SAM. First, the method requires text input only, without the annotation cost of the manual points suggested in the SAM paper. Second, point prompts are superior to mask prompts because SAM's mask prompt is designed for its own output logits, and the generated points are more suitable than masks from another model. Finally, text-to-points is more readily achievable than text-to-boxes, which requires fine-tuning or additional supervision.

The proposed method also has implications for the explainability of CLIP in multimodal settings. Multimodal visualization is a promising direction for exploring CLIP's internal mechanisms. By visualizing the image-text pairs during training, the authors were able to observe interesting phenomena related to CLIP's learning process. However, the proposed method does not fully explain how CLIP is able to generate pixel-level results from text input. This suggests that further research is needed to better understand the mechanisms behind CLIP's impressive performance on open-vocabulary tasks.

In summary, the proposed method offers a promising solution for open-vocabulary interactive segmentation and has implications for the explainability of CLIP in multimodal settings. Future work could explore the potential of combining the proposed method with other SOTA models to achieve even better results.
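The text-to-points conversion described above is simple to emulate once a text-image similarity map is available: rank pixels by similarity, take the top-ranked locations as foreground points and an equal number of bottom-ranked locations as background points, and hand both to SAM. The sketch below assumes the similarity map has already been computed (e.g., by CLIP Surgery, which is not reproduced here); only the SAM calls use the real API, and the checkpoint path is a placeholder.

import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def points_from_similarity(sim: np.ndarray, k: int = 8):
    """sim: (H, W) text-image similarity map. Returns (points_xy, labels)."""
    flat = sim.ravel()
    fg_idx = np.argsort(flat)[-k:]          # highest-similarity pixels -> foreground
    bg_idx = np.argsort(flat)[:k]           # lowest-similarity pixels  -> background
    ys, xs = np.unravel_index(np.concatenate([fg_idx, bg_idx]), sim.shape)
    points = np.stack([xs, ys], axis=1)     # SAM expects (x, y) order
    labels = np.array([1] * k + [0] * k)
    return points, labels

def segment_from_text(image: np.ndarray, sim: np.ndarray) -> np.ndarray:
    sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder path
    predictor = SamPredictor(sam)
    predictor.set_image(image)
    points, labels = points_from_similarity(sim)
    masks, scores, _ = predictor.predict(point_coords=points, point_labels=labels,
                                         multimask_output=True)
    return masks[np.argmax(scores)]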
4.3 More Directions
4.3.1 Weakly-Supervised Semantic Segmentation
Recently, some works have explored SAM in Weakly-Supervised Semantic Segmentation (WSSS). The advantage of using SAM in WSSS is that it can produce satisfactory results without requiring model fine-tuning, which is a cumbersome requirement in current WSSS techniques such as [207], [208] that involve classification re-training and pseudo-label generation. By using SAM as a foundation model in WSSS, the process can be made more straightforward and less complex.

For example, [209] aimed to investigate the suitability of SAM for WSSS and adapt it to generate pseudo-labels using only image-level class labels. The study conducted an initial analysis of SAM as a foundation model in WSSS and discovered that it can achieve comparable performance without requiring fine-tuning. The study also revealed that SAM can generate high-quality segmentation masks and even surpass human annotations in some cases.

In [209], the performance of SAM was evaluated on the PASCAL VOC and MS-COCO datasets, where it demonstrated significant improvements over the latest SOTA methods. However, SAM encountered difficulties in certain situations due to the issue of semantic obscurity. While SAM performed well in most unambiguous settings, addressing semantic obscurity may require investigating the use of hierarchically-structured semantic classes and better prompts. Additionally, the study suggests exploring SAM's ability to segment "stuff" classes like "sky," "sea," and "road" to enhance overall scene understanding.

In [181], the authors presented a WSSS method that utilizes SAM as a pseudo-label generator. They employed diverse weak labels, such as image-level labels, points, scribbles, and bounding boxes, as prompts for SAM to produce precise class masks. These masks were then used to generate pseudo labels for training segmentation networks. The method's effectiveness was evaluated on the PASCAL VOC 2012 dataset, and the results indicate that SAM can serve as a reliable pseudo-label generator, with scribbles as prompts achieving an 89.7% mIoU score on the training set. The final segmentation model obtained a 76.6% mIoU score on the test set. This suggests that the method can be valuable for training segmentation networks with weak supervision, especially in scenarios where pixel-level annotations are unavailable.

However, in [181], the authors also pointed out some of SAM's limitations in handling WSSS. For instance, when using point prompts located by CAMs [210], [211], SAM may generate incorrect object masks due to the CAMs' coarse locations. Additionally, in the case of bounding box prompts, SAM may struggle to generate precise object masks when multiple objects are placed on the same table. These limitations imply that SAM may not always be capable of producing accurate object masks when dealing with WSSS.

In addition, current methods that rely on CAM to generate pseudo labels suffer from limitations such as partial and false activation. To overcome this, a new approach in [212] utilizes SAM to generate high-quality pseudo labels by selecting relevant masks and labeling them based on initial CAM seed masks or post-processed pseudo labels. The results show that this approach significantly improves the precision of segmentation while also reducing computational costs compared to existing post-processing modules. The approach is highly versatile and compatible with existing WSSS models without modification to base networks or pipelines, and it has been shown to improve the mIoU of pseudo labels from five SOTA WSSS methods by an average of 6.2% on the train set of the PASCAL VOC 2012 dataset.

4.3.2 Adversarial Robustness
As the concern regarding the susceptibility of computer vision systems to adversarial attacks grows [213], there is a need to understand and research adversarial attacks. Adversarial attacks involve adding minor, undetectable perturbations to an input image to deceive the computer vision system into misclassifying the image [214], [215]. This poses a significant concern because it may result in security breaches and safety risks in applications like autonomous vehicles [216] and facial recognition systems [217].

Recently, [50] assessed the adversarial robustness of the SAM model in the context of prompt-based segmentation in computer vision. The authors developed a framework called Attack-SAM to evaluate SAM's vulnerability to adversarial attacks in the prompt-based mask prediction task. This is the first comprehensive investigation of SAM's susceptibility to adversarial attacks, and the findings indicate that while SAM is vulnerable to white-box attacks, it remains somewhat robust in the black-box setting. Understanding the robustness of SAM in the face of adversarial attacks is crucial for developing more secure computer vision systems. The study provides the following insights:
• Firstly, the research reveals that SAM is vulnerable to white-box attacks, meaning that an attacker with full knowledge of the model's architecture and parameters can generate adversarial examples that deceive the model with ease. However, SAM is relatively robust in the black-box setting, where the attacker's knowledge of the model is limited.
• Secondly, the study shows that small objects tend to be more resistant to adversarial attacks in the black-box setting, possibly due to the limited perturbation of the small object region.
• Thirdly, the paper offers insights into the transferability of adversarial attacks among prompts in the segment-everything mode of SAM. Additionally, the study provides a list of recent research papers and surveys on various topics related to generative AI, which can serve as a useful resource for researchers in the field.

Besides, [49] presented BadSAM, which applies backdoor attacks to the SAM model for image segmentation. This paper demonstrated the potential security risks associated with such attacks and emphasized the need for further investigation into defense strategies in this area. The experiments conducted in the study demonstrate that the SAM model can be exploited by attackers, presenting a significant risk to end-users. The paper also outlines a process for initiating backdoor attacks on SAM and provides insights into the attacker's knowledge and methodology.
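White-box attacks of the kind studied in Attack-SAM perturb the input so that the predicted mask degrades, typically via FGSM- or PGD-style gradient steps against the mask logits. The snippet below shows the generic single-step FGSM pattern on a small stand-in segmentation network; it only illustrates the attack mechanics, since attacking SAM itself requires differentiating through its image encoder and promptable mask decoder rather than this toy model.

import torch
import torch.nn as nn

# Toy stand-in for a differentiable mask-prediction model (NOT SAM itself).
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 1, 1))
model.eval()

def fgsm_mask_attack(image: torch.Tensor, target_mask: torch.Tensor, eps: float = 8 / 255):
    """One FGSM step that pushes the predicted mask logits away from the clean mask."""
    image = image.clone().requires_grad_(True)
    logits = model(image)
    # Maximize the loss w.r.t. the clean mask -> degrade the segmentation.
    loss = nn.functional.binary_cross_entropy_with_logits(logits, target_mask)
    loss.backward()
    adv = image + eps * image.grad.sign()
    return adv.clamp(0, 1).detach()

x = torch.rand(1, 3, 64, 64)
with torch.no_grad():
    clean_mask = (model(x) > 0).float()
x_adv = fgsm_mask_attack(x, clean_mask)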
In [218], the researchers carried out a thorough investigation into the robustness of SAM in various real-world scenarios. They conducted experiments involving a wide range of image perturbations and domains, which revealed that SAM's performance typically deteriorates when dealing with perturbed images, although the degree of vulnerability varies across different types of perturbations. By tailoring the prompting techniques and utilizing domain-specific knowledge that takes into account the unique characteristics of each dataset, it is possible to enhance the model's resilience to these perturbations and tackle challenges that are specific to each dataset. The work further highlights certain types of perturbations that have a significant impact on the model's robustness, identifying areas where improvements can be made.

4.3.3 One Shot
Training the SAM model requires large datasets, which may be unavailable in some practical scenarios, such as medical imaging segmentation [219] or acoustic-to-articulatory analysis [220]. As a result, there is a growing interest in exploring segmentation techniques that require minimal data and do not rely heavily on training. One of the benefits of utilizing SAM in one-shot learning is that it allows personalized segmentation models to be efficiently created using only one-shot data, which consists of a user-provided image and a rough mask indicating the target object. To accomplish this, SAM's image encoder and the given mask are employed to encode the reference image's target object embedding. The feature similarity between the object and all the pixels of the new test image is then computed, and a positive-negative pair of points is selected, encoded as prompt tokens, and used as a location prior for SAM.

Recently, a novel training-free technique known as PerSAM was presented in [21] for creating personalized segmentation models for specific visual concepts using only one-shot data. PerSAM accomplishes this by localizing the target object in a reference image and utilizing target-guided attention, target-semantic prompting, and cascaded post-refinement to segment it in other images or videos. The study also introduces a variant of the approach, PerSAM-F, which fine-tunes only two parameters to enhance performance. The technique is assessed on a new segmentation dataset, PerSeg, and is found to perform competitively in video object segmentation. The study also demonstrates that PerSAM can assist in improving personalized text-to-image generation.

4.3.4 Explainable AI
Explainable AI (XAI) plays a vital role in enhancing human comprehension of DNNs and has gained significant attention from researchers [221]. Pixel-based XAI methods currently in use explain DNN decisions by identifying key pixels, while concept-based XAI methods aim to provide explanations using concepts. However, the interpretation of pixels can be challenging, and the imprecision of XAI methods can impact their reliability. Additionally, prior concept-based approaches either require human annotation or are limited to predefined concept sets.

A recent study [222] proposed an innovative approach that leverages SAM to enhance concept-based XAI. The authors introduced Explain Any Concept (EAC), an effective and adaptable explanation method capable of elucidating DNN decisions using any concept. An overview of EAC can be seen in Fig. 23. Although SAM excels in instance segmentation, integrating it into XAI pipelines can be computationally demanding. To address this concern, the authors proposed a lightweight per-input equivalent (PIE) scheme, enabling efficient explanation using a surrogate model. Evaluation conducted on the ImageNet and COCO datasets showcased the promising performance of EAC compared to commonly utilized XAI methods.

Fig. 23: Overview of the EAC approach, which employs a three-phase pipeline to provide explanations for a DNN's prediction concerning an input image. More precisely, in the initial phase, it utilizes the widely accepted instance segmentation model, SAM, to partition the input image into a collection of visual concepts. Moving on to the second phase, a per-input equivalent (PIE) surrogate model is trained to approximate the behavior of the target DNN. Finally, in the third phase, the surrogate model is employed to efficiently explain the model's prediction using the concepts obtained in the first phase. The figure is borrowed from the original paper [222].

However, the paper raises a significant concern regarding the potential negative social impacts associated with using EAC in specific domains. It suggests that EAC could be misapplied to explain DNN predictions for medical images using unrelated concepts such as cats or trains, or with subtle errors. Such misleading explanations could have severe consequences and misguide medical professionals in their decision-making process. Consequently, deploying EAC without adequate safety checks on its outputs in sensitive domains may pose significant risks. Future research efforts could focus on minimizing these negative societal impacts while exploring further applications of EAC in diverse visual domains and tasks.
5 CONCLUSION
This survey is the first to comprehensively review the recent progress on the foundation model of SAM for computer vision and beyond. Firstly, we summarize the development history of foundation models, covering large language models, large visual models, and large multimodal models, as well as the essential terminology about SAM. With a special focus on the applications of SAM to various tasks and data types, we summarize and compare the works concurrent with SAM and its follow-up works. Then, the great potential of SAM in a wide range of image processing applications is discussed, including software scenes, real-world scenes, and complex scenes. We also analyze and summarize the advantages and limitations of SAM across various applications. These observations can provide some insights to guide future research toward developing stronger foundation models and further improving the robustness and generalization capabilities of SAM. Finally, we summarize many other amazing applications of SAM in vision and beyond. The appendix provides a preliminary summary of open-source projects on SAM in table format.
REFERENCES
[19] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, "Hierarchical text-conditional image generation with clip latents," arXiv preprint arXiv:2204.06125, 2022.
[20] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., "Segment anything," arXiv preprint arXiv:2304.02643, 2023.
[21] R. Zhang, Z. Jiang, Z. Guo, S. Yan, J. Pan, H. Dong, P. Gao, and H. Li, "Personalize segment anything model with one shot," arXiv preprint arXiv:2305.03048, 2023.
[22] G.-P. Ji, D.-P. Fan, P. Xu, M.-M. Cheng, B. Zhou, and L. Van Gool, "Sam struggles in concealed scenes–empirical study on "segment anything"," arXiv preprint arXiv:2304.06022, 2023.
“Seggpt: Segmenting everything in context,” arXiv preprint
[1] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, arXiv:2304.03284, 2023.
S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill
[24] X. Zou, J. Yang, H. Zhang, F. Li, L. Li, J. Gao, and Y. J. Lee,
et al., “On the opportunities and risks of foundation models,”
“Segment everything everywhere all at once,” arXiv preprint
arXiv preprint arXiv:2108.07258, 2021.
arXiv:2304.06718, 2023.
[2] X. Wang, G. Chen, G. Qian, P. Gao, X.-Y. Wei, Y. Wang, Y. Tian,
and W. Gao, “Large-scale multi-modal pre-trained models: A [25] J. Jain, J. Li, M. Chiu, A. Hassani, N. Orlov, and H. Shi, “One-
comprehensive survey,” arXiv preprint arXiv:2302.10035, 2023. former: One transformer to rule universal image segmentation,”
[3] P. P. Liang, A. Zadeh, and L.-P. Morency, “Foundations and recent arXiv preprint arXiv:2211.06220, 2022.
trends in multimodal machine learning: Principles, challenges, [26] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, “Pre-
and open questions,” arXiv preprint arXiv:2209.03430, 2022. train, prompt, and predict: A systematic survey of prompting
[4] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre- methods in natural language processing,” ACM Computing Sur-
training of deep bidirectional transformers for language under- veys, vol. 55, no. 9, pp. 1–35, 2023.
standing,” arXiv preprint arXiv:1810.04805, 2018. [27] J. Fan, “Gpt-3 moment in computer vision,” https://twitter.com/
[5] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, DrJimFan, 2023.
Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer [28] Y. Jing, X. Wang, and D. Tao, “Segment anything in non-
learning with a unified text-to-text transformer,” The Journal of euclidean domains: Challenges and opportunities,” arXiv preprint
Machine Learning Research, vol. 21, no. 1, pp. 5485–5551, 2020. arXiv:2304.11595, 2023.
[6] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhari- [29] L. Tang, H. Xiao, and B. Li, “Can sam segment anything?
wal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., when sam meets camouflaged object detection,” arXiv preprint
“Language models are few-shot learners,” Advances in neural arXiv:2304.04709, 2023.
information processing systems, vol. 33, pp. 1877–1901, 2020. [30] J. Ma and B. Wang, “Segment anything in medical images,” arXiv
[7] OpenAI, “Gpt-4,” https://cdn.openai.com/papers/gpt-4.pdf, preprint arXiv:2304.12306, 2023.
2023. [31] T. Zhou, Y. Zhang, Y. Zhou, Y. Wu, and C. Gong, “Can sam
[8] ——, “Introducing chatgpt,” https://openai.com/blog/chatgpt, segment polyps?” arXiv preprint arXiv:2304.07583, 2023.
Accessed: 2023-05-10. [32] P. Shi, J. Qiu, S. M. D. Abaxi, H. Wei, F. P.-W. Lo, and W. Yuan,
[9] X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer, “Scaling vi- “Generalist vision foundation models for medical imaging: A
sion transformers,” in Proceedings of the IEEE/CVF Conference on case study of segment anything model on zero-shot medical
Computer Vision and Pattern Recognition, 2022, pp. 12 104–12 113. segmentation,” arXiv preprint arXiv:2304.12637, 2023.
[10] M. Dehghani, J. Djolonga, B. Mustafa, P. Padlewski, J. Heek, [33] Y. Zhang and R. Jiao, “How segment anything model (sam) boost
J. Gilmer, A. Steiner, M. Caron, R. Geirhos, I. Alabdulmohsin medical image segmentation?” arXiv preprint arXiv:2305.03678,
et al., “Scaling vision transformers to 22 billion parameters,” arXiv 2023.
preprint arXiv:2302.05442, 2023. [34] J. Wu, R. Fu, H. Fang, Y. Liu, Z. Wang, Y. Xu, Y. Jin, and T. Arbel,
[11] Z. Liu, H. Hu, Y. Lin, Z. Yao, Z. Xie, Y. Wei, J. Ning, Y. Cao, “Medical sam adapter: Adapting segment anything model for
Z. Zhang, L. Dong et al., “Swin transformer v2: Scaling up medical image segmentation,” arXiv preprint arXiv:2304.12620,
capacity and resolution,” in Proceedings of the IEEE/CVF conference 2023.
on computer vision and pattern recognition, 2022, pp. 12 009–12 019. [35] T. Chen, L. Zhu, C. Ding, R. Cao, S. Zhang, Y. Wang, Z. Li, L. Sun,
[12] L. Wang, B. Huang, Z. Zhao, Z. Tong, Y. He, Y. Wang, Y. Wang, P. Mao, and Y. Zang, “Sam fails to segment anything?–sam-
and Y. Qiao, “Videomae v2: Scaling video masked autoencoders adapter: Adapting sam in underperformed scenes: Camouflage,
with dual masking,” arXiv preprint arXiv:2303.16727, 2023. shadow, and more,” arXiv preprint arXiv:2304.09148, 2023.
[13] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agar-
[36] D. Cheng, Z. Qin, Z. Jiang, S. Zhang, Q. Lao, and K. Li, “Sam
wal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning
on medical images: A comprehensive study on three prompt
transferable visual models from natural language supervision,”
modes,” arXiv preprint arXiv:2305.00035, 2023.
in International conference on machine learning. PMLR, 2021, pp.
[37] M. A. Mazurowski, H. Dong, H. Gu, J. Yang, N. Konz, and
8748–8763.
Y. Zhang, “Segment anything model for medical image analysis:
[14] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le,
an experimental study,” arXiv preprint arXiv:2304.10517, 2023.
Y.-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-
language representation learning with noisy text supervision,” in [38] Y. Huang, X. Yang, L. Liu, H. Zhou, A. Chang, X. Zhou, R. Chen,
International Conference on Machine Learning. PMLR, 2021, pp. J. Yu, J. Chen, C. Chen et al., “Segment anything model for
4904–4916. medical images?” arXiv preprint arXiv:2304.14660, 2023.
[15] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with [39] T. Yu, R. Feng, R. Feng, J. Liu, X. Jin, W. Zeng, and Z. Chen,
contrastive predictive coding,” arXiv preprint arXiv:1807.03748, “Inpaint anything: Segment anything meets image inpainting,”
2018. arXiv preprint arXiv:2304.06790, 2023.
[16] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hock- [40] D. Xie, R. Wang, J. Ma, C. Chen, H. Lu, D. Yang, F. Shi, and X. Lin,
enmaier, and S. Lazebnik, “Flickr30k entities: Collecting region- “Edit everything: A text-guided generative system for images
to-phrase correspondences for richer image-to-sentence models,” editing,” arXiv preprint arXiv:2304.14006, 2023.
in Proceedings of the IEEE international conference on computer vision, [41] S. Liu, J. Ye, and X. Wang, “Any-to-any style transfer: Making
2015, pp. 2641–2649. picasso and da vinci collaborate,” arXiv e-prints, pp. arXiv–2304,
[17] X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollár, 2023.
and C. L. Zitnick, “Microsoft coco captions: Data collection and [42] M. Ahmadi, A. G. Lonbar, A. Sharifi, A. T. Beris, M. Nouri, and
evaluation server,” arXiv preprint arXiv:1504.00325, 2015. A. S. Javidi, “Application of segment anything model for civil
[18] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, infrastructure defect assessment,” arXiv preprint arXiv:2304.12600,
M. Chen, and I. Sutskever, “Zero-shot text-to-image generation,” 2023.
in International Conference on Machine Learning. PMLR, 2021, pp. [43] D. Han, C. Zhang, Y. Qiao, M. Qamar, Y. Jung, S. Lee, S.-H. Bae,
8821–8831. and C. S. Hong, “Segment anything model (sam) meets glass:
Mirror and transparent objects cannot be easily detected,” arXiv and methods,” Journal of Artificial Intelligence Research, vol. 71, pp.
preprint arXiv:2305.00278, 2023. 1183–1317, 2021.
[44] T. Wang, J. Zhang, J. Fei, Y. Ge, H. Zheng, Y. Tang, Z. Li, M. Gao, [68] C. Zhang, S. Zheng, C. Li, Y. Qiao, T. Kang, X. Shan, C. Zhang,
S. Zhao, Y. Shan et al., “Caption anything: Interactive image C. Qin, F. Rameau, S.-H. Bae et al., “Asurvey on segment anything
description with diverse multimodal controls,” arXiv preprint model (sam): Vision foundation model meets prompt engineer-
arXiv:2305.02677, 2023. ing,” 2023.
[45] S. Mo and Y. Tian, “Av-sam: Segment anything model meets [69] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L.
audio-visual localization and segmentation,” arXiv preprint Yuille, “Semantic image segmentation with deep convolutional
arXiv:2305.01836, 2023. nets and fully connected crfs,” arXiv preprint arXiv:1412.7062,
[46] J. Yang, M. Gao, Z. Li, S. Gao, F. Wang, and F. Zheng, 2014.
“Track anything: Segment anything meets videos,” arXiv preprint [70] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethink-
arXiv:2304.11968, 2023. ing atrous convolution for semantic image segmentation,” arXiv
[47] Q. Shen, X. Yang, and X. Wang, “Anything-3d: Towards preprint arXiv:1706.05587, 2017.
single-view anything reconstruction in the wild,” arXiv preprint [71] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L.
arXiv:2304.10261, 2023. Yuille, “Deeplab: Semantic image segmentation with deep convo-
[48] Z. Ma, X. Hong, and Q. Shangguan, “Can sam count any- lutional nets, atrous convolution, and fully connected crfs,” IEEE
thing? an empirical study on sam counting,” arXiv preprint transactions on pattern analysis and machine intelligence, vol. 40,
arXiv:2304.10817, 2023. no. 4, pp. 834–848, 2017.
[49] Z. Guan, M. Hu, Z. Zhou, J. Zhang, S. Li, and N. Liu, “Badsam: [72] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam,
Exploring security vulnerabilities of sam via backdoor attacks,” “Encoder-decoder with atrous separable convolution for seman-
arXiv preprint arXiv:2305.03289, 2023. tic image segmentation,” in Proceedings of the European conference
[50] C. Zhang, C. Zhang, T. Kang, D. Kim, S.-H. Bae, and I. S. Kweon, on computer vision (ECCV), 2018, pp. 801–818.
“Attack-sam: Towards evaluating adversarial robustness of seg- [73] A. M. Hafiz and G. M. Bhat, “A survey on instance segmentation:
ment anything model,” arXiv preprint arXiv:2305.00866, 2023. state of the art,” International journal of multimedia information
[51] S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, C. Li, retrieval, vol. 9, no. 3, pp. 171–189, 2020.
J. Yang, H. Su, J. Zhu et al., “Grounding dino: Marrying dino [74] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, “Path aggregation network
with grounded pre-training for open-set object detection,” arXiv for instance segmentation,” in Proceedings of the IEEE conference
preprint arXiv:2303.05499, 2023. on computer vision and pattern recognition, 2018, pp. 8759–8768.
[52] F. Liang, B. Wu, X. Dai, K. Li, Y. Zhao, H. Zhang, P. Zhang, P. Va- [75] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, “Yolact: Real-time in-
jda, and D. Marculescu, “Open-vocabulary semantic segmenta- stance segmentation,” in Proceedings of the IEEE/CVF international
tion with mask-adapted clip,” arXiv preprint arXiv:2210.04150, conference on computer vision, 2019, pp. 9157–9166.
2022. [76] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollar, “Panop-
[53] J. Wang, P. Zhang, T. Chu, Y. Cao, Y. Zhou, T. Wu, B. Wang, C. He, tic segmentation,” in Proceedings of the IEEE/CVF Conference on
and D. Lin, “V3det: Vast vocabulary visual detection dataset,” Computer Vision and Pattern Recognition (CVPR), June 2019.
arXiv preprint arXiv:2304.03752, 2023. [77] W. Zhang, J. Pang, K. Chen, and C. C. Loy, “K-net: Towards
[54] G. Ghiasi, X. Gu, Y. Cui, and T.-Y. Lin, “Open-vocabulary image unified image segmentation,” Advances in Neural Information Pro-
segmentation,” arXiv preprint arXiv:2112.12143, 2021. cessing Systems, vol. 34, pp. 10 326–10 338, 2021.
[78] B. Cheng, A. G. Schwing, and A. Kirillov, “Per-pixel classification
[55] J. Pei, L. Deng, S. Song, M. Zhao, Y. Zhang, S. Wu, G. Wang,
is not all you need for semantic segmentation,” in NeurIPS, 2021.
Z. Zou, Z. Wu, W. He et al., “Towards artificial general intelli-
gence with hybrid tianjic chip architecture,” Nature, vol. 572, no. [79] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar,
7767, pp. 106–111, 2019. “Masked-attention mask transformer for universal image seg-
mentation,” in Proceedings of the IEEE/CVF Conference on Computer
[56] B. Goertzel, “Artificial general intelligence: concept, state of the
Vision and Pattern Recognition, 2022, pp. 1290–1299.
art, and future prospects,” Journal of Artificial General Intelligence,
[80] N. Xu, B. Price, S. Cohen, J. Yang, and T. S. Huang, “Deep
vol. 5, no. 1, p. 1, 2014.
interactive object selection,” in Proceedings of the IEEE conference
[57] J. Grudin and R. Jacques, “Chatbots, humbots, and the quest
on computer vision and pattern recognition, 2016, pp. 373–381.
for artificial general intelligence,” in Proceedings of the 2019 CHI
[81] G. Wang, M. A. Zuluaga, W. Li, R. Pratt, P. A. Patel, M. Aertsen,
conference on human factors in computing systems, 2019, pp. 1–11.
T. Doel, A. L. David, J. Deprest, S. Ourselin, and T. Vercauteren,
[58] B. Goertzel and P. Wang, “A foundational architecture for artifi- “DeepIGeoS: A deep interactive geodesic framework for medical
cial general intelligence,” Advances in artificial general intelligence: image segmentation,” IEEE Transactions on Pattern Analysis and
Concepts, architectures and algorithms, vol. 6, p. 36, 2007. Machine Intelligence, vol. 41, no. 7, pp. 1559–1572, jul 2019.
[59] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, [82] G. Mohan and M. M. Subashini, “Mri based medical image
B. Zhang, J. Zhang, Z. Dong et al., “A survey of large language analysis: Survey on brain tumor grade classification,” Biomedical
models,” arXiv preprint arXiv:2303.18223, 2023. Signal Processing and Control, vol. 39, pp. 139–161, 2018.
[60] G. Mialon, R. Dessı̀, M. Lomeli, C. Nalmpantis, R. Pasunuru, [83] P. A. Yushkevich, Y. Gao, and G. Gerig, “Itk-snap: An interactive
R. Raileanu, B. Rozière, T. Schick, J. Dwivedi-Yu, A. Celikyilmaz tool for semi-automatic segmentation of multi-modality biomedi-
et al., “Augmented language models: a survey,” arXiv preprint cal images,” in 2016 38th annual international conference of the IEEE
arXiv:2302.07842, 2023. engineering in medicine and biology society (EMBC). IEEE, 2016,
[61] L. Fan, L. Li, Z. Ma, S. Lee, H. Yu, and L. Hemphill, “A bibliomet- pp. 3342–3345.
ric review of large language models research from 2017 to 2023,” [84] C. Rupprecht, I. Laina, N. Navab, G. D. Hager, and F. Tombari,
arXiv preprint arXiv:2304.02020, 2023. “Guide me: Interacting with deep networks,” in Proceedings of the
[62] C. Zhang, C. Zhang, M. Zhang, and I. S. Kweon, “Text-to- IEEE Conference on Computer Vision and Pattern Recognition, 2018,
image diffusion model in generative ai: A survey,” arXiv preprint pp. 8551–8561.
arXiv:2303.07909, 2023. [85] L. Castrejon, K. Kundu, R. Urtasun, and S. Fidler, “Annotating
[63] S. L. Blodgett, S. Barocas, H. Daumé III, and H. Wallach, “Lan- object instances with a polygon-rnn,” in Proceedings of the IEEE
guage (technology) is power: A critical survey of” bias” in nlp,” conference on computer vision and pattern recognition, 2017, pp.
arXiv preprint arXiv:2005.14050, 2020. 5230–5238.
[64] F.-L. Chen, D.-Z. Zhang, M.-L. Han, X.-Y. Chen, J. Shi, S. Xu, and [86] D. Acuna, H. Ling, A. Kar, and S. Fidler, “Efficient interactive
B. Xu, “Vlp: A survey on vision-language pre-training,” Machine annotation of segmentation datasets with polygon-rnn++,” in
Intelligence Research, vol. 20, no. 1, pp. 38–56, 2023. Proceedings of the IEEE conference on Computer Vision and Pattern
[65] Y. Du, Z. Liu, J. Li, and W. X. Zhao, “A survey of vision-language Recognition, 2018, pp. 859–868.
pre-trained models,” arXiv preprint arXiv:2202.10936, 2022. [87] L. Yuan, D. Chen, Y.-L. Chen, N. Codella, X. Dai, J. Gao, H. Hu,
[66] S. Long, F. Cao, S. C. Han, and H. Yang, “Vision-and-language X. Huang, B. Li, C. Li et al., “Florence: A new foundation model
pretrained models: A survey,” arXiv preprint arXiv:2204.07356, for computer vision,” arXiv preprint arXiv:2111.11432, 2021.
2022. [88] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai, “Vl-bert:
[67] A. Mogadala, M. Kalimuthu, and D. Klakow, “Trends in integra- Pre-training of generic visual-linguistic representations,” arXiv
tion of vision and language research: A survey of tasks, datasets, preprint arXiv:1908.08530, 2019.
[89] J. Cho, J. Lu, D. Schwenk, H. Hajishirzi, and A. Kembhavi, “X- International Conference, Munich, Germany, October 5-9, 2015, Pro-
lxmert: Paint, caption and answer questions with multi-modal ceedings, Part III 18. Springer, 2015, pp. 234–241.
transformers,” arXiv preprint arXiv:2009.11278, 2020. [110] I. Giannakis, A. Bhardwaj, L. Sam, and G. Leontidis, “Deep learn-
[90] R. Girdhar, A. El-Nouby, Z. Liu, M. Singh, K. V. Alwala, A. Joulin, ing universal crater detection using segment anything model
and I. Misra, “Imagebind: One embedding space to bind them (sam),” arXiv preprint arXiv:2304.07764, 2023.
all,” in CVPR, 2023. [111] V. Ranjan, U. Sharma, T. Nguyen, and M. Hoai, “Learning to
[91] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked count everything,” in Proceedings of the IEEE/CVF Conference on
autoencoders are scalable vision learners,” in Proceedings of the Computer Vision and Pattern Recognition, 2021, pp. 3394–3403.
IEEE/CVF Conference on Computer Vision and Pattern Recognition, [112] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan,
2022, pp. 16 000–16 009. P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in
[92] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, context,” in Computer Vision–ECCV 2014: 13th European Confer-
T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly ence, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V
et al., “An image is worth 16x16 words: Transformers for image 13. Springer, 2014, pp. 740–755.
recognition at scale,” in International Conference on Learning Repre- [113] Z. Zhou, Z. Wu, R. Boutteau, F. Yang, and D. Ginhac, “Dsec-
sentations, 2021. mos: Segment any moving object with moving ego vehicle,” arXiv
[93] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss preprint arXiv:2305.00126, 2023.
for dense object detection,” in Proceedings of the IEEE international [114] Z. Zhou, Z. Wu, R. Boutteau, F. Yang, C. Demonceaux, and
conference on computer vision, 2017, pp. 2980–2988. D. Ginhac, “Rgb-event fusion for moving object detection in
[94] F. Milletari, N. Navab, and S.-A. Ahmadi, “V-net: Fully convo- autonomous driving,” arXiv preprint arXiv:2209.08323, 2022.
lutional neural networks for volumetric medical image segmen- [115] Y. Cao, X. Xu, C. Sun, Y. Cheng, Z. Du, L. Gao, and W. Shen,
tation,” in 2016 fourth international conference on 3D vision (3DV). “Segment any anomaly without training via hybrid prompt
Ieee, 2016, pp. 565–571. regularization,” arXiv preprint arXiv:2305.10724, 2023.
[95] B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso, and [116] T.-N. Le, T. V. Nguyen, Z. Nie, M.-T. Tran, and A. Sugimoto,
A. Torralba, “Semantic understanding of scenes through the “Anabranch network for camouflaged object segmentation,”
ade20k dataset,” International Journal of Computer Vision, vol. 127, Computer vision and image understanding, vol. 184, pp. 45–56, 2019.
pp. 302–321, 2019. [117] D.-P. Fan, G.-P. Ji, G. Sun, M.-M. Cheng, J. Shen, and L. Shao,
[96] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, “Camouflaged object detection,” in Proceedings of the IEEE/CVF
R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes conference on computer vision and pattern recognition, 2020, pp.
dataset for semantic urban scene understanding,” in Proc. of 2777–2787.
the IEEE Conference on Computer Vision and Pattern Recognition
[118] Y. Lv, J. Zhang, Y. Dai, A. Li, B. Liu, N. Barnes, and D.-P. Fan,
(CVPR), 2016.
“Simultaneously localize, segment and rank the camouflaged
[97] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, objects,” in Proceedings of the IEEE/CVF Conference on Computer
P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in Vision and Pattern Recognition, 2021, pp. 11 591–11 601.
context,” in Computer Vision–ECCV 2014: 13th European Confer-
[119] B. Yin, X. Zhang, Q. Hou, B.-Y. Sun, D.-P. Fan, and L. Van Gool,
ence, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V
“Camoformer: Masked separable attention for camouflaged ob-
13. Springer, 2014, pp. 740–755.
ject detection,” arXiv preprint arXiv:2212.06570, 2022.
[98] X. Wang, W. Wang, Y. Cao, C. Shen, and T. Huang, “Images speak
[120] X. Hu, D.-P. Fan, X. Qin, H. Dai, W. Ren, Y. Tai, C. Wang,
in images: A generalist painter for in-context visual learning,”
and L. Shao, “High-resolution iterative feedback network for
arXiv preprint arXiv:2212.02499, 2022.
camouflaged object detection,” arXiv preprint arXiv:2203.11624,
[99] R. Suvorov, E. Logacheva, A. Mashikhin, A. Remizova,
2022.
A. Ashukha, A. Silvestrov, N. Kong, H. Goka, K. Park, and
V. Lempitsky, “Resolution-robust large mask inpainting with [121] D. Williams, F. MacFarlane, and A. Britten, “Leaf only sam: A
fourier convolutions,” in Proceedings of the IEEE/CVF winter con- segment anything pipeline for zero-shot automated leaf segmen-
ference on applications of computer vision, 2022, pp. 2149–2159. tation,” 2023.
[100] A. Lugmayr, M. Danelljan, A. Romero, F. Yu, R. Timofte, and [122] Y. Zou, J. Jeong, L. Pemula, D. Zhang, and O. Dabeer, “Spot-
L. Van Gool, “Repaint: Inpainting using denoising diffusion the-difference self-supervised pre-training for anomaly detection
probabilistic models,” in Proceedings of the IEEE/CVF Conference on and segmentation,” in Computer Vision–ECCV 2022: 17th European
Computer Vision and Pattern Recognition, 2022, pp. 11 461–11 471. Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part
[101] W. Li, Z. Lin, K. Zhou, L. Qi, Y. Wang, and J. Jia, “Mat: Mask- XXX. Springer, 2022, pp. 392–408.
aware transformer for large hole image inpainting,” in Proceed- [123] P. Bergmann, M. Fauser, D. Sattlegger, and C. Steger, “Mvtec ad–
ings of the IEEE/CVF conference on computer vision and pattern a comprehensive real-world dataset for unsupervised anomaly
recognition, 2022, pp. 10 758–10 768. detection,” in Proceedings of the IEEE/CVF conference on computer
[102] Q. Dong, C. Cao, and Y. Fu, “Incremental transformer structure vision and pattern recognition, 2019, pp. 9592–9600.
enhanced image inpainting with masking positional encoding,” [124] Y. Huang, C. Qiu, and K. Yuan, “Surface defect saliency of
in Proceedings of the IEEE/CVF Conference on Computer Vision and magnetic tile,” The Visual Computer, vol. 36, pp. 85–96, 2020.
Pattern Recognition, 2022, pp. 11 358–11 368. [125] J. Božič, D. Tabernik, and D. Skočaj, “Mixed supervision for
[103] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, surface-defect detection: From weakly to fully supervised learn-
“High-resolution image synthesis with latent diffusion models,” ing,” Computers in Industry, vol. 129, p. 103459, 2021.
in Proceedings of the IEEE/CVF Conference on Computer Vision and [126] C. He, K. Li, Y. Zhang, G. Xu, L. Tang, Y. Zhang, Z. Guo, and
Pattern Recognition, 2022, pp. 10 684–10 695. X. Li, “Weakly-supervised concealed object segmentation with
[104] T. Q. Chen and M. Schmidt, “Fast patch-based style transfer of sam-based pseudo labeling and multi-scale feature grouping,”
arbitrary style,” arXiv preprint arXiv:1612.04337, 2016. arXiv preprint arXiv: arXiv:2305.11003, 2023.
[105] D. Y. Park and K. H. Lee, “Arbitrary style transfer with style- [127] S. Gao, W. Zhang, Y. Wang, Q. Guo, C. Zhang, Y. He, and
attentional networks,” in proceedings of the IEEE/CVF conference on W. Zhang, “Weakly-supervised salient object detection using
computer vision and pattern recognition, 2019, pp. 5880–5888. point supervison,” in Proceedings of the AAAI Conference on Ar-
[106] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang, “Uni- tificial Intelligence, vol. 36, no. 1, 2022, pp. 670–678.
versal style transfer via feature transforms,” Advances in neural [128] J. Chen and X. Bai, “Learning to” segment anything” in thermal
information processing systems, vol. 30, 2017. infrared images through knowledge distillation with a large scale
[107] Z. Xu, E. Sangineto, and N. Sebe, “Stylerdalle: Language-guided dataset satir,” arXiv preprint arXiv:2304.07969, 2023.
style transfer using a vector-quantized tokenizer of a large-scale [129] C. Li, W. Xia, Y. Yan, B. Luo, and J. Tang, “Segmenting objects in
generative model,” arXiv preprint arXiv:2303.09268, 2023. day and night: Edge-conditioned cnn for thermal image semantic
[108] W. Ji, J. Li, Q. Bi, W. Li, and L. Cheng, “Segment anything is not segmentation,” IEEE Transactions on Neural Networks and Learning
always perfect: An investigation of sam on different real-world Systems, vol. 32, no. 7, pp. 3069–3082, 2020.
applications,” arXiv preprint arXiv:2304.05750, 2023. [130] X. Yang, H. Dai, Z. Wu, R. Bist, S. Subedi, J. Sun, G. Lu, C. Li,
[109] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional T. Liu, and L. Chai, “Sam for poultry science,” 2023.
networks for biomedical image segmentation,” in Medical Image [131] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and
Computing and Computer-Assisted Intervention–MICCAI 2015: 18th P. Luo, “Segformer: Simple and efficient design for semantic
segmentation with transformers,” Advances in Neural Information [153] Y. Liu, J. Zhang, Z. She, A. Kheradmand, and M. Armand, “Samm
Processing Systems, vol. 34, pp. 12 077–12 090, 2021. (segment any medical model): A 3d slicer integration to sam,”
[132] Y. Liu, Y. Wang, Y. Li, Q. Li, and J. Wang, “Setr-yolov5n: A arXiv preprint arXiv:2304.05622, 2023.
lightweight low-light lane curvature detection method based on [154] Z. Qiu, Y. Hu, H. Li, and J. Liu, “Learnable ophthalmology sam,”
fractional-order fusion model,” IEEE Access, vol. 10, pp. 93 003– arXiv preprint arXiv:2304.13425, 2023.
93 016, 2022. [155] J. Wu, “Promptunet: Toward interactive medical image segmen-
[133] Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, “Yolox: Exceeding yolo tation,” arXiv preprint arXiv:2305.10300, 2023.
series in 2021,” arXiv preprint arXiv:2107.08430, 2021. [156] R. Deng, C. Cui, Q. Liu, T. Yao, L. W. Remedios, S. Bao, B. A.
[134] Y. Zhang, P. Sun, Y. Jiang, D. Yu, F. Weng, Z. Yuan, P. Luo, W. Liu, Landman, L. E. Wheless, L. A. Coburn, K. T. Wilson et al.,
and X. Wang, “Bytetrack: Multi-object tracking by associating ev- “Segment anything model (sam) for digital pathology: Assess
ery detection box,” in Computer Vision–ECCV 2022: 17th European zero-shot segmentation on whole slide imaging,” arXiv preprint
Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part arXiv:2304.04155, 2023.
XXII. Springer, 2022, pp. 1–21. [157] M. Hu, Y. Li, and X. Yang, “Skinsam: Empowering skin can-
[135] S. Ren, F. Luzi, S. Lahrichi, K. Kassaw, L. M. Collins, K. Bradbury, cer segmentation with segment anything model,” arXiv preprint
and J. M. Malof, “Segment anything, from space?” arXiv preprint arXiv:2304.13973, 2023.
arXiv:2304.13000, 2023. [158] Y. Zhang, T. Zhou, P. Liang, and D. Z. Chen, “Input augmentation
[136] K. Bradbury, R. Saboo, T. L Johnson, J. M. Malof, A. Devarajan, with sam: Boosting medical image segmentation with segmenta-
W. Zhang, L. M Collins, and R. G Newell, “Distributed solar tion foundation model,” arXiv preprint arXiv:2304.11332, 2023.
photovoltaic array location and extent dataset for remote sensing [159] A. Wang, M. Islam, M. Xu, Y. Zhang, and H. Ren, “Sam meets
object identification,” Scientific data, vol. 3, no. 1, pp. 1–9, 2016. robotic surgery: An empirical study in robustness perspective,”
[137] E. Maggiori, Y. Tarabalka, G. Charpiat, and P. Alliez, “Can se- arXiv preprint arXiv:2304.14674, 2023.
mantic labeling methods generalize to any city? the inria aerial [160] B. Wang, A. Aboah, Z. Zhang, and U. Bagci, “Gazesam: What you
image labeling benchmark,” in 2017 IEEE International Geoscience see is what you segment,” arXiv preprint arXiv:2304.13844, 2023.
and Remote Sensing Symposium (IGARSS). IEEE, 2017, pp. 3226– [161] Z. Liu, H. Wen, Z. Zhu, Q. Li, L. Liu, T. Li, W. Xu, C. Hou,
3229. B. Huang, Z. Li et al., “Diagnosis of significant liver fibrosis in
[138] I. Demir, K. Koperski, D. Lindenbaum, G. Pang, J. Huang, S. Basu, patients with chronic hepatitis b using a deep learning-based
F. Hughes, D. Tuia, and R. Raskar, “Deepglobe 2018: A challenge data integration network,” Hepatology International, vol. 16, no. 3,
to parse the earth through satellite images,” in Proceedings of p. 526–536, 2022.
the IEEE Conference on Computer Vision and Pattern Recognition [162] K. Huang, Q. Li, W. Zeng, X. Chen, L. Liu, X. Wan, C. Feng,
Workshops, 2018, pp. 172–181. Z. Li, Z. Liu, and C. Dong, “Ultrasound score combined with
[139] S. Mohajerani, T. A. Krammer, and P. Saeedi, “Cloud detection liver stiffness measurement by sound touch elastography for
algorithm for remote sensing images using fully convolutional staging liver fibrosis in patients with chronic hepatitis b: a clinical
neural networks,” arXiv preprint arXiv:1810.05782, 2018. prospective study,” Annals of Translational Medicine, vol. 10, no. 6,
[140] H. L. Aung, B. Uzkent, M. Burke, D. Lobell, and S. Ermon, 2022.
“Farm parcel delineation using spatio-temporal convolutional [163] Y. Chen, C. Zhang, L. Liu, C. Feng, C. Dong, Y. Luo, and X. Wan,
networks,” in Proceedings of the IEEE/CVF Conference on Computer “Uscl: pretraining deep ultrasound image diagnosis model
Vision and Pattern Recognition Workshops, 2020, pp. 76–77. through video contrastive representation learning,” in Medical
[141] S. Julka and M. Granitzer, “Knowledge distillation with segment Image Computing and Computer Assisted Intervention—MICCAI
anything (sam) model for planetary geological mapping,” arXiv 2021: 24th International Conference, Strasbourg, France, September
preprint arXiv:2305.07586, 2023. 27—October 1, 2021, Proceedings, Part VIII 24. Springer, 2021, p.
[142] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in 627–637.
Proceedings of the IEEE international conference on computer vision, [164] L. Gao, R. Zhou, C. Dong, C. Feng, Z. Li, X. Wan, and L. Liu,
2017, pp. 2961–2969. “Multi-modal active learning for automatic liver fibrosis diagno-
[143] D. Wang, J. Zhang, B. Du, D. Tao, and L. Zhang, “Scaling- sis based on ultrasound shear wave elastography,” in 2021 IEEE
up remote sensing segmentation dataset with segment anything 18th International Symposium on Biomedical Imaging (ISBI). IEEE,
model,” arXiv preprint arXiv:2305.02034, 2023. 2021, p. 410–414.
[144] J. Zhang, Z. Zhou, G. Mai, L. Mu, M. Hu, and S. Li, “Text2seg: [165] J. Chen, Y. Lu, Q. Yu, X. Luo, E. Adeli, Y. Wang, L. Lu, A. L. Yuille,
Remote sensing image semantic segmentation via text-guided and Y. Zhou, “Transunet: Transformers make strong encoders
visual foundation models,” arXiv preprint arXiv:2304.10597, 2023. for medical image segmentation,” arXiv preprint arXiv:2102.04306,
[145] C. Hu and X. Li, “When sam meets medical images: An inves- 2021.
tigation of segment anything model (sam) on multi-phase liver [166] H. Cao, Y. Wang, J. Chen, D. Jiang, X. Zhang, Q. Tian, and
tumor segmentation,” arXiv preprint arXiv:2304.08506, 2023. M. Wang, “Swin-unet: Unet-like pure transformer for medical
[146] S. Roy, T. Wald, G. Koehler, M. R. Rokuss, N. Disch, J. Holzschuh, image segmentation,” in Computer Vision–ECCV 2022 Workshops:
D. Zimmerer, and K. H. Maier-Hein, “Sam. md: Zero-shot med- Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part III. Springer,
ical image segmentation capabilities of the segment anything 2023, pp. 205–218.
model,” arXiv preprint arXiv:2304.05396, 2023. [167] R. Azad, R. Arimond, E. K. Aghdam, A. Kazerouni, and D. Mer-
[147] K. Zhang and D. Liu, “Customized segment anything model hof, “Dae-former: Dual attention-guided efficient transformer for
for medical image segmentation,” arXiv preprint arXiv:2304.13785, medical image segmentation,” arXiv preprint arXiv:2212.13504,
2023. 2022.
[148] S. Mohapatra, A. Gosai, and G. Schlaug, “Brain extraction com- [168] R. Azad, M. Heidari, M. Shariatnia, E. K. Aghdam, S. Karimija-
paring segment anything model (sam) and fsl brain extraction farbigloo, E. Adeli, and D. Merhof, “Transdeeplab: Convolution-
tool,” arXiv preprint arXiv:2304.04738, 2023. free transformer-based deeplab v3+ for medical image segmen-
[149] F. Putz, J. Grigo, T. Weissmann, P. Schubert, D. Hoefler, A. Gomaa, tation,” in Predictive Intelligence in Medicine: 5th International
H. B. Tkhayat, A. Hagag, S. Lettmaier, B. Frey et al., “The Workshop, PRIME 2022, Held in Conjunction with MICCAI 2022,
segment anything foundation model achieves favorable brain tu- Singapore, September 22, 2022, Proceedings. Springer, 2022, pp.
mor autosegmentation accuracy on mri to support radiotherapy 91–102.
treatment planning,” arXiv preprint arXiv:2304.07875, 2023. [169] M. Kristan, J. Matas, A. Leonardis, M. Felsberg, R. Pflugfelder, J.-
[150] Y. Li, M. Hu, and X. Yang, “Polyp-sam: Transfer sam for polyp K. Kämäräinen, H. J. Chang, M. Danelljan, L. Cehovin, A. Lukežič
segmentation,” arXiv preprint arXiv:2305.00293, 2023. et al., “The ninth visual object tracking vot2021 challenge results,”
[151] S. He, R. Bao, J. Li, P. E. Grant, and Y. Ou, “Accuracy of segment- in Proceedings of the IEEE/CVF International Conference on Computer
anything model (sam) in medical image segmentation tasks,” Vision, 2021, pp. 2711–2738.
arXiv preprint arXiv:2304.09324, 2023. [170] C. Zhang, G. Huang, L. Liu, S. Huang, Y. Yang, X. Wan, S. Ge,
[152] C. Mattjie, L. V. de Moura, R. C. Ravazio, L. S. Kupssinskü, and D. Tao, “Webuav-3 m: A benchmark for unveiling the power
O. Parraga, M. M. Delucis, and R. C. Barros, “Exploring the of million-scale deep uav tracking,” IEEE Transactions on Pattern
zero-shot capabilities of the segment anything model (sam) in Analysis and Machine Intelligence, 2022.
2d medical imaging: A comprehensive evaluation and practical [171] H. K. Cheng and A. G. Schwing, “Xmem: Long-term video
guideline,” arXiv preprint arXiv:2305.00109, 2023. object segmentation with an atkinson-shiffrin memory model,”
in Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, modal fusion in automatic continuous cued speech recognition,”
Israel, October 23–27, 2022, Proceedings, Part XXVIII. Springer, IEEE Transactions on Multimedia, vol. 23, p. 292–305, 2020.
2022, pp. 640–658. [195] L. Liu, T. Hueber, G. Feng, and D. Beautemps, “Visual recognition
[172] Y. Cheng, L. Li, Y. Xu, X. Li, Z. Yang, W. Wang, and Y. Yang, of continuous cued speech using a tandem cnn-hmm approach.”
“Segment and track anything,” in arXiv preprint arXiv:2305.06558, in Interspeech, 2018, p. 2643–2647.
2023. [196] L. Liu, J. Li, G. Feng, and X.-P. S. Zhang, “Automatic detection of
[173] Z. Yang and Y. Yang, “Decoupling features in hierarchical the temporal segmentation of hand movements in british english
propagation for video object segmentation,” arXiv preprint cued speech.” in INTERSPEECH, 2019, p. 2285–2289.
arXiv:2210.09782, 2022. [197] L. Liu, G. Feng, and D. Beautemps, “Automatic temporal seg-
[174] D. Fuoli, S. Gu, and R. Timofte, “Efficient video super-resolution mentation of hand movements for hand positions recognition
through recurrent latent space propagation,” in 2019 IEEE/CVF in french cued speech,” in 2018 IEEE International Conference on
International Conference on Computer Vision Workshop (ICCVW). Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp.
IEEE, 2019, pp. 3476–3485. 3061–3065.
[175] M. Haris, G. Shakhnarovich, and N. Ukita, “Recurrent back- [198] S. Mo and P. Morgado, “Localizing visual sounds the easy way,”
projection network for video super-resolution,” in Proceedings of in ECCV, 2022, p. 218–234.
the IEEE/CVF conference on computer vision and pattern recognition, [199] ——, “A closer look at weakly-supervised audio-visual source
2019, pp. 3897–3906. localization,” in NeurIPS, 2022.
[176] Y. Huang, W. Wang, and L. Wang, “Bidirectional recurrent con- [200] P. Morgado, N. Nvasconcelos, T. Langlois, and O. Wang,
volutional networks for multi-frame super-resolution,” Advances “Self-supervised generation of spatial audio for 360°video,” in
in neural information processing systems, vol. 28, 2015. NeurIPS, 2018.
[177] ——, “Video super-resolution via bidirectional recurrent con- [201] P. Morgado, Y. Li, and N. Nvasconcelos, “Learning representa-
volutional networks,” IEEE transactions on pattern analysis and tions from audio-visual spatial alignment,” in NeurIPS, 2020, pp.
machine intelligence, vol. 40, no. 4, pp. 1015–1028, 2017. 4733–4744.
[178] T. Hui, X. Tang, and C. L. Change Loy, “A lightweight convolu- [202] Y. Tian, J. Shi, B. Li, Z. Duan, and C. Xu, “Audio-visual event
tional neural network for optical flow estimation,” in Proceedings localization in unconstrained videos,” in ECCV, 2018.
of the IEEE Conference on Computer Vision and Pattern Recognition
[203] L. Liu, G. Feng, and D. Beautemps, “Inner lips feature extraction
(CVPR), 2018, pp. 8981–8989.
based on clnf with hybrid dynamic template for cued speech,”
[179] Z. Lu, Z. Xiao, J. Bai, Z. Xiong, and X. Wang, “Can sam boost
EURASIP Journal on Image and Video Processing, vol. 2017, p. 1–15,
video super-resolution?” in arXiv preprint arXiv:2305.06524, 2023.
2017.
[180] H. He, J. Zhang, M. Xu, J. Liu, B. Du, and D. Tao, “Scal-
[204] Y. Tian, D. Li, and C. Xu, “Unified multisensory perception:
able mask annotation for video text spotting,” arXiv preprint
Weakly-supervised audio-visual video parsing,” in ECCV, 2020,
arXiv:2305.01443, 2023.
p. 436–454.
[181] P.-T. Jiang and Y. Yang, “Segment anything is a good pseudo-label
generator for weakly supervised semantic segmentation,” arXiv [205] S. Mo and Y. Tian, “Multi-modal grouping network for weakly-
preprint arXiv:2305.01275, 2023. supervised audio-visual video parsing,” in NeurIPS, 2022.
[182] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, [206] Y. Li, H. Wang, Y. Duan, and X. Li, “Clip surgery for better ex-
J. Winn, and A. Zisserman, “The pascal visual object classes plainability with enhancement in open-vocabulary tasks,” arXiv
challenge: A retrospective,” International journal of computer vision, preprint arXiv:2304.05653, 2023.
vol. 111, pp. 98–136, 2015. [207] C. Song, Y. Huang, W. Ouyang, and L. Wang, “Box-driven class-
[183] J. Cen, Z. Zhou, J. Fang, W. Shen, L. Xie, X. Zhang, and wise region masking and filling rate guided loss for weakly su-
Q. Tian, “Segment anything in 3d with nerfs,” arXiv preprint pervised semantic segmentation,” in Proceedings of the IEEE/CVF
arXiv:2304.12308, 2023. Conference on Computer Vision and Pattern Recognition, 2019, pp.
[184] S. Saito, Z. Huang, R. Natsume, S. Morishima, A. Kanazawa, and 3136–3145.
H. Li, “Pifu: Pixel-aligned implicit function for high-resolution [208] J. Lee, J. Yi, C. Shin, and S. Yoon, “Bbam: Bounding box attribu-
clothed human digitization,” in Proceedings of the IEEE/CVF inter- tion map for weakly supervised semantic and instance segmenta-
national conference on computer vision, 2019, pp. 2304–2314. tion,” in Proceedings of the IEEE/CVF conference on computer vision
[185] K. Zhang, G. Riegler, N. Snavely, and V. Koltun, “Nerf++: An- and pattern recognition, 2021, pp. 2643–2652.
alyzing and improving neural radiance fields,” arXiv preprint [209] W. Sun, Z. Liu, Y. Zhang, Y. Zhong, and N. Barnes, “An alterna-
arXiv:2010.07492, 2020. tive to wsss? an empirical study of the segment anything model
[186] Y. Yin, Z. Fu, F. Yang, and G. Lin, “Or-nerf: Object removing (sam) on weakly-supervised semantic segmentation problems,”
from 3d scenes guided by multiview segmentation with neural arXiv preprint arXiv:2305.01586, 2023.
radiance fields,” 2023. [210] W. Yang, H. Huang, Z. Zhang, X. Chen, K. Huang, and S. Zhang,
[187] T. N. Kipf and M. Welling, “Semi-supervised classification with “Towards rich feature discovery with class activation maps
graph convolutional networks,” arXiv preprint arXiv:1609.02907, augmentation for person re-identification,” in Proceedings of the
2016. IEEE/CVF conference on computer vision and pattern recognition,
[188] W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation 2019, pp. 1389–1398.
learning on large graphs,” Advances in neural information process- [211] L. Liu, W. Lei, X. Wan, L. Liu, Y. Luo, and C. Feng, “Semi-
ing systems, vol. 30, 2017. supervised active learning for covid-19 lung ultrasound multi-
[189] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, symptom classification,” in 2020 IEEE 32nd International Confer-
and Y. Bengio, “Graph attention networks,” arXiv preprint ence on Tools with Artificial Intelligence (ICTAI). IEEE, 2020, pp.
arXiv:1710.10903, 2017. 1268–1273.
[190] S. Huang, Z. Jiang, H. Dong, Y. Qiao, P. Gao, and H. Li, [212] T. Chen, Z. Mai, R. Li, and W.-l. Chao, “Segment anything model
“Instruct2act: Mapping multi-modality instructions to robotic (sam) enhanced pseudo labels for weakly supervised semantic
actions with large language model,” 2023. segmentation,” arXiv preprint arXiv:2305.05803, 2023.
[191] Z. Liu, Y. He, W. Wang, W. Wang, Y. Wang, S. Chen, Q. Zhang, [213] B. Wu, L. Liu, Z. Zhu, Q. Liu, Z. He, and S. Lyu, “Adversarial ma-
Y. Yang, Q. Li, J. Yu, K. Li, Z. Chen, X. Yang, X. Zhu, Y. Wang, chine learning: A systematic survey of backdoor attack, weight
L. Wang, P. Luo, J. Dai, and Y. Qiao, “Interngpt: Solving vision- attack and adversarial example,” arXiv preprint arXiv:2302.09457,
centric tasks by interacting with chatgpt beyond language,” 2023. 2023.
[192] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, [214] R. Zheng, R. Tang, J. Li, and L. Liu, “Data-free backdoor removal
and A. Joulin, “Emerging properties in self-supervised vision based on channel lipschitzness,” in Computer Vision–ECCV 2022:
transformers,” in Proceedings of the IEEE/CVF international con- 17th European Conference, Tel Aviv, Israel, October 23–27, 2022,
ference on computer vision, 2021, pp. 9650–9660. Proceedings, Part V. Springer, 2022, pp. 175–191.
[193] J. Wang, Y. Zhao, L. Liu, H. Fan, T. Xu, Q. Li, and S. Li, “Memory- [215] ——, “Pre-activation distributions expose backdoor neurons,”
augmented contrastive learning for talking head generation,” Advances in Neural Information Processing Systems, vol. 35, pp.
arXiv preprint arXiv:2302.13469, 2023. 18 667–18 680, 2022.
[194] L. Liu, G. Feng, D. Beautemps, and X.-P. Zhang, “Re- [216] J. Cui, L. S. Liew, G. Sabaliauskaite, and F. Zhou, “A review on
synchronization using the hand preceding model for multi- safety failures, security attacks, and available countermeasures
for autonomous vehicles,” Ad Hoc Networks, vol. 90, p. 101823,
2019.
[217] A. Anjos and S. Marcel, “Counter-measures to photo attacks
in face recognition: a public database and a baseline,” in 2011
international joint conference on Biometrics (IJCB). IEEE, 2011, pp.
1–7.
[218] Y. Wang, Y. Zhao, and L. Petzold, “An empirical study on the
robustness of the segment anything model (sam),” arXiv preprint
arXiv:2305.06422, 2023.
[219] C. Zhang, Y. Chen, L. Liu, Q. Liu, and X. Zhou, “Hico: Hierarchi-
cal contrastive learning for ultrasound video model pretraining,”
in Proceedings of the Asian Conference on Computer Vision, 2022, p.
229–246.
[220] J. Wang, J. Liu, X. Li, M. Yu, J. Gao, Q. Fang, and L. Liu,
“Two-stream joint-training for speaker independent acoustic-to-
articulatory inversion,” in ICASSP 2023-2023 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 2023, p. 1–5.
[221] A. Holzinger, A. Saranti, C. Molnar, P. Biecek, and W. Samek,
“Explainable ai methods-a brief overview,” in xxAI-Beyond Ex-
plainable AI: International Workshop, Held in Conjunction with ICML
2020, July 18, 2020, Vienna, Austria, Revised and Extended Papers.
Springer, 2022, pp. 13–38.
[222] A. Sun, P. Ma, Y. Yuan, and S. Wang, “Explain any concept:
Segment anything meets concept-based explanation,” 2023.
APPENDIX A
A PRELIMINARY SUMMARY OF OPEN SOURCE PROJECTS ON SAM
TABLE 1: Summary of Open Source Projects on SAM.
No. Project Title Project page Code base Affiliation Description
https://github.com/
Segment https://segment-anything. A foundation model for
001 SAM facebookresearch/ Meta
Anything com/ general segmentation.
segment-anything
A project dedicated to
https://colab.research.
Segment tracking and segmenting
google.com/drive/ https://github.com/z-x-yang/ Zhejiang
002 SAM-Track and Track any objects in videos, ei-
1R10N70AJaslzADFqb-a5OihYkllWEVxB?
Segment-and-Track-Anything University
Anything ther automatically or inter-
usp=sharing
actively.
A project by combining
https://github.
Grounded- https://github.com/ Grounding DINO and
Grounded- com/camenduru/ IDEA-
003 Segment- IDEA-Research/ SAM which aims to detect
SAM grounded-segment-anything-colab Research
Anything Grounded-Segment-Anything and segment Anything
with text inputs.
A new way of instance
segmentation by combin-
https://github.com/ ing SAM with Closed-
004 MMDet-SAM - - open-mmlab/playground/ OpenMMLab Set Object Detection, Open-
tree/main/mmdet sam Vocabulary Object Detec-
tion, Grounding Object De-
tection.
Zero-shot A project joins SAM and
Oriented https://github.com/ weakly supervised hori-
MMRotate-
005 Object - open-mmlab/playground/ OpenMMLab zontal box detection to
SAM
Detection tree/main/mmrotate sam achieve rotated box detec-
with SAM tion.
A solution of Text Detec-
tion/Recognition and SAM
https://github.com/ that segments every text
MMOCR-
006 - - open-mmlab/playground/ OpenMMLab character, with striking text
SAM
tree/main/mmocr sam removal and text inpaint-
ing demos driven by diffu-
sion models and Gradio.
A project join SAM and
https://github.com/
MMEditing- image generation to create
007 - - open-mmlab/playground/ OpenMMLab
SAM awesome images and edit
tree/main/mmagic sam
any part of them.
OpenMMLab
PlayGround:
Semi- A project combining
https://github.com/
Label-Studio- Automated Label-Studio and SAM to
008 - open-mmlab/playground/ OpenMMLab
SAM Annotation achieve semi-automated
tree/main/label anything
with Label- annotation.
Studio and
SAM
Segment https://github.com/
A pretrained model param-
Anything PaddlePaddle/PaddleSeg/ PaddlePaddle
009 PaddleSeg - eters of PaddlePaddle for-
with tree/release/2.8/contrib/
mat.
PaddleSeg SegmentAnything
Segmenting
https://huggingface.co/ https://github.com/ SAM In Context based on
010 SegGPT Everything BAAI-Vision
spaces/BAAI/SegGPT baaivision/Painter Painter.
In Context
Segment Ev- https://github. A project can Segment Ev-
erything Ev- https://huggingface.co/ com/UX-Decoder/ erything Everywhere with
011 SEEM Microsoft
erywhere All spaces/xdecoder/SEEM Segment-Everything-Everywhere-All-At-Once Multimodal prompts all at
at Once once.
CLIP Surgery
for Better
Explain- A work about SAM based
https://github.com/
ability with https://github.com/ on CLIP’s explainability to
012 CLIP Surgery xmed-lab/CLIP Surgery/ HKUST
Enhancement xmed-lab/CLIP Surgery achieve text to mask with-
blob/master/demo.ipynb
in Open out manual points.
Vocabulary
Tasks
Can SAM
Segment
Anything?
When SAM https://github.com/ SAM + Camouflaged object
013 SAMCOD - -
Meets luckybird1994/SAMCOD detection (COD) task.
Camouflaged
Object
Detection
Segment
https://huggingface. SAM combines Inpainting,
Inpaint Any- Anything https://github.com/ USTC and
014 co/spaces/InpaintAI/ which is able to remove the
thing Meets Image geekyutao/Inpaint-Anything EIT
Inpaint-Anything object smoothly.
Inpainting
Personalize
Segment https://github.
https://huggingface.co/ SAM with specific con-
015 PerSAM Anything com/ZrrSkywalker/ -
papers/2305.03048 cepts.
Model with Personalize-SAM
One Shot
Segment
A step-by-step tutorial
Anything https://github.com/
016 MedSAM - - with a small dataset to help
in Medical bowang-lab/MedSAM
you quickly utilize SAM.
Images
https://colab.research.
Segment- GroundedSAM google.com/drive/ https://github.
Grounding DINO + SAM
017 Any- Anomaly 1Rwio KfziuLp79Qh com/caoyunkang/ HUST
to segment any anomaly.
Anomaly Detection ugum64Hjnq4ZwsE?usp= Segment-Any-Anomaly
sharing
Semantic https://github.
Fudan A dense category annota-
018 SSA Segment - com/fudan-zvg/
University tion engine.
Anything Semantic-Segment-Anything
Magic Copy is a Chrome
extension that uses SAM to
https://github.com/
019 Magic Copy - - - extract a foreground object
kevmo314/magic-copy
from an image and copy it
to the clipboard.
Segment Segment https://huggingface. https://github.
020 Anything Anything co/spaces/curt-park/ com/Curt-Park/ - SAM combined with CLIP.
with Clip with Clip segment-anything-with-clip segment-anything-with-clip
Segment https://huggingface.
https://github.com/kadirnar/ Packaged version of the
021 MetaSeg Anything co/spaces/ArtGAN/ -
segment-anything-video SAM.
Video Segment-Anything-Video
Applied
Extended SAM’s click-
Computer
Segment based foreground
Vision Lab
SAM in Na- Anything https://www.napari-hub.org/ https://github.com/ separation to full
022 and German
pari Model (SAM) plugins/napari-sam MIC-DKFZ/napari-sam click-based semantic
Cancer
in Napari segmentation and instance
Research
segmentation.
Center
https://github.
SAM Medical SAM Medical
023 - com/amine0110/ - SAM for Medical Imaging.
Imaging Imaging
SAM-Medical-Imaging
3D-Box via https://github.com/ SAM is extended to 3D
024 3D-Box Segment - dvlab-research/ - perception by combining it
Anything 3D-Box-Segment-Anything with VoxelNeXt.
https://github.com/ Anything 3DNovel View,
025 Anything-3D - - Anything-of-anything/ - Anything-NeRF, Any
Anything-3D 3DFace.
Learning A new partially supervised
https://github.com/ UC Berkeley,
026 L2SET to Segment - training paradigm for in-
ronghanghu/seg every thing FAIR
EveryThing stance segmentation.
Edit
Edit anything in images
Edit Anything https://github.com/sail-sg/
027 - - powered by SAM, Control-
Anything by Segment- EditAnything
Net, StableDiffusion, etc.
Anything
IEA: Image
Image Edit Using stable diffusion and
028 Editing - https://github.com/feizc/IEA -
Anything SAM for image editing.
Anything
This extension aim for con-
necting AUTOMATIC1111
Segment Stable Diffusion WebUI
SAM for Sta- Anything https://github.com/ and Mikubill ControlNet
029 ble Diffusion for Stable - continue-revolution/ - Extension with SAM
Webui Diffusion sd-webui-segment-anything and GroundingDINO
WebUI to enhance Stable
Diffusion/ControlNet
inpainting.
https://colab.research.
Segment https://github.
Earth Obser- google.com/drive/ An earth observation tools
030 Anything EO com/aliaksandr960/ -
vation Tools 1RC1V68tD1O-YissBq9nOvS2PHEjAsFkA? for SAM.
tools segment-anything-eo
usp=share link
Towards
https://github.
Moving Ob- Segmenting A project about SAM +
031 - com/achalddave/ -
ject Detection Anything Moving Object Detection.
segment-any-moving
That Moves
Optical
Character https://www.zhihu.com/
https://github.com/ Combining MMOCR with
032 OCR-SAM Recognition question/593914819/answer/ -
yeungchenwa/OCR-SAM SAM and Stable Diffusion.
with Segment 2976012032
Anything
A project uses the SAM
Segment https://github.com/
Model and adds a bare-
Anything anuragxel/salt#
033 SALT - - bones interface to label im-
Labelling segment-anything-labelling-tool-salt
ages and saves the masks in
Tool
the COCO format.
Prompt Prompt https://github. An implementation
034 Segment Segment - com/RockeyCoss/ - of zero-shot instance
Anything Anything Prompt-Segment-Anything segmentation using SAM.
035 | SAM-RBox | - | - | https://github.com/Li-Qingyun/sam-mmrotate | - | A project that uses SAM to generate rotated bounding boxes with MMRotate; it serves as a comparison method for H2RBox-v2.
036 | VISAM | MOTRv2: Bootstrapping End-to-End Multi-Object Tracking by Pretrained Object Detectors | - | https://github.com/BingfengYan/VISAM | - | Combining SAM with MOT, it creates the era of "MOTS".
037 | SegEO | Segment Anything EO tools | - | https://github.com/aliaksandr960/segment-anything-eo | - | Tools developed to ease the processing of spatial data (GeoTIFF and TMS) with SAM, using a sliding-window algorithm for big files.
038 | Napari Segment Anything | Napari Segment Anything | https://app.codecov.io/gh/jookuma/napari-segment-anything | https://github.com/JoOkuma/napari-segment-anything | - | SAM native Qt UI.
039 | Segment-Anything-U-Specify | Segment-Anything-U-Specify | - | https://github.com/MaybeShewill-CV/segment-anything-u-specify | - | Using CLIP and SAM to segment any instance you specify with a text prompt of any instance name.
040 | SegDrawer | Simple static web-based mask drawer | https://colab.research.google.com/drive/1PdWCpBgYwiQtvkdTBnW-y2T-sFc-2iI?usp=sharing | https://github.com/lujiazho/SegDrawer | - | Simple static web-based mask drawer, supporting semantic segmentation with SAM.
041 | Track Anything | Segment Anything Meets Videos | https://huggingface.co/spaces/VIPLab/Track-Anything | https://github.com/gaomingqi/Track-Anything | SUSTech | Track-Anything is a flexible and interactive tool for video object tracking and segmentation.
042 | Count Anything | - | - | https://github.com/ylqi/Count-Anything | - | A method that uses SAM and CLIP to ground and count any object matching a custom text prompt, without requiring any point or box annotation.
043 | RAM | Relate Anything Model | https://huggingface.co/spaces/mmlab-ntu/relate-anything-model | https://github.com/Luodian/RelateAnything | MMLab, NTU and VisCom Lab, KCL/TongJi | Relate Anything Model is capable of taking an image as input and utilizing SAM to identify the corresponding mask within the image.
044 | Segment Any RGBD | Segment Any RGBD | https://huggingface.co/spaces/jcenaa/Segment-Any-RGBD | https://github.com/Jun-CEN/SegmentAnyRGBD | - | Segment AnyRGBD is a toolbox to segment rendered depth images based on SAM.
045 | Show Anything | Show Anything | https://huggingface.co/spaces/weijiawu/ImageEditAnything | https://github.com/showlab/ShowAnything | Showlab, NUS | Some applications that are compatible with both SAM and generation.
046 | Transfer Any Style | Any-to-Any Style Transfer: Making Picasso and Da Vinci Collaborate | - | https://github.com/Huage001/Transfer-Any-Style | LV-lab, NUS | An interactive demo based on Segment-Anything for style transfer, which enables different content regions to apply different styles.
047 | Caption Anything | - | https://colab.research.google.com/github/ttengwang/Caption-Anything/blob/main/notebooks/tutorial.ipynb | https://github.com/ttengwang/Caption-Anything | VIP lab, SUSTech | Caption-Anything is a versatile image processing tool that combines the capabilities of SAM, visual captioning, and ChatGPT.
048 | Image2Paragraph | Transform Image Into Unique Paragraph | https://zhaohengyuan1.github.io/image2paragraph.github.io/ | https://github.com/showlab/Image2Paragraph | - | Transform an image into a unique paragraph with ChatGPT, BLIP2, OFA, GRIT, Segment Anything, and ControlNet.
049 | LIME SAM | Local Interpretable Model-agnostic Explanations Segment Anything | https://colab.research.google.com/drive/1bj6B-O47NHpqsWovOrVZcpWNhIfO56sj?usp=sharing | https://github.com/jaydeep-work/LIME-SAM | - | LIME-SAM aims to create an Explainable Artificial Intelligence (XAI) framework for image classification using LIME (Local Interpretable Model-agnostic Explanations) as the base algorithm, with the superpixel method replaced by SAM.
050 | Paint Anything | - | - | https://github.com/Huage001/Paint-Anything | - | An interactive demo based on SAM for stroke-based painting, which enables human-like painting.
051 | SAMed | Customized Segment Anything Model for Medical Image Segmentation | https://colab.research.google.com/drive/1KCS5ulpZasYl9DgJJn59WsGEB8vwSIm?usp=sharing | https://github.com/hitachinsk/SAMed | USTC | SAMed is built upon the large-scale image segmentation model, SAM, to explore the new research paradigm of customizing large-scale models for medical image segmentation.
052 | Personalize SAM | Personalize Segment Anything with 1 Shot in 10 Seconds | https://huggingface.co/spaces/justin-zk/Personalize-SAM | https://github.com/ZrrSkywalker/Personalize-SAM | MMLab, CUHK | A training-free personalization approach for SAM, termed PerSAM. Given only a single image with a reference mask, PerSAM can segment specific visual concepts.
053 | Open-vocabulary-Segment-Anything | Open-vocabulary-Segment-Anything | - | https://github.com/ngthanhtin/owlvit_segment_anything | - | Combining OwlViT with Segment Anything: open-vocabulary detection and segmentation (text-conditioned and image-conditioned).
054 | Labal-Anything-Pipeline | Label-Anything-Pipeline | - | https://github.com/Yuqifan1117/Labal-Anything-Pipeline | ZJU | Annotate anything in visual tasks, all in one pipeline with GPT-4 and SAM.
055 | Grounded-Segment-Any-Parts | Grounded Segment Anything: From Objects to Parts | https://cheems-seminar.github.io/ | https://github.com/Cheems-Seminar/grounded-segment-any-parts | HKU | Expands the Segment Anything Model (SAM) to support text prompt input; the text prompt can be object-level (e.g., dog) or part-level (e.g., dog head).
056 | AnyLabeling | AnyLabeling | https://www.youtube.com/watch?v=5qVJiYNX5Kk | https://github.com/vietanhdev/anylabeling | - | Effortless AI-assisted data labeling with AI support from Segment Anything and YOLO.
057 | SSA | Semantic-Segment-Anything | https://replicate.com/cjwbw/semantic-segment-anything | https://github.com/fudan-zvg/Semantic-Segment-Anything | - | An automated dense category annotation engine that serves as the initial semantic labeling for the Segment Anything dataset (SA-1B).
058 | RefSAM | Label Data with Segment Anything in Roboflow | https://blog.roboflow.com/label-data-segment-anything-model-sam/ | https://github.com/helblazer811/RefSAM | - | Referring image segmentation benchmarking with the Segment Anything Model (SAM).
059 | Roboflow Annotate | Launch: Label Data with Segment Anything in Roboflow | https://blog.roboflow.com/label-data-segment-anything-model-sam/ | https://app.roboflow.com/ | Roboflow | SAM-assisted labeling for training computer vision models.
060 | ImageBind SAM | - | - | https://github.com/IDEA-Research/Grounded-Segment-Anything/tree/main/playground/ImageBind_SAM | IDEA-Research | An experimental demo that aims to combine ImageBind and SAM to generate masks with different modalities.
Common questions
The Segment Anything Model (SAM) focuses on generating masks without annotator input and accepts a variety of prompts, including visual and text prompts, to guide segmentation. It is designed to handle a wide range of segmentation tasks through its promptable segmentation formulation and is trained on a large dataset built through a train-annotate loop. SegGPT, in contrast, unifies various segmentation tasks into an in-context learning framework, emphasizing strong zero-shot capabilities through a generalist model approach.

Eye-tracking technologies, exemplified by GazeSAM, have the potential to significantly enhance real-time interaction with segmentation models in clinical settings by using eye movements as input prompts. This approach allows radiologists to collect segmentation masks efficiently during image diagnosis simply by looking at regions of interest. Integrating eye tracking in this way streamlines the segmentation process, reducing the time and effort required from clinicians while maintaining accuracy, and ultimately improves the efficiency of real-time clinical workflows and decision-making.
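
To make this concrete, the sketch below treats a gaze fixation as an ordinary foreground point prompt for SAM. The `read_gaze_fixation` helper is a hypothetical stand-in for an eye-tracker SDK, the checkpoint path is a placeholder, and only the `segment_anything` calls follow the official API; this is an illustrative sketch, not GazeSAM's actual implementation.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor


def read_gaze_fixation() -> tuple[int, int]:
    """Hypothetical stand-in for an eye-tracker SDK call.

    A real system would return the current fixation point in image pixel
    coordinates; here a fixed location is returned for illustration.
    """
    return 256, 192


def segment_at_gaze(predictor: SamPredictor) -> np.ndarray:
    """Use the current gaze fixation as a single foreground point prompt."""
    x, y = read_gaze_fixation()
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[x, y]]),
        point_labels=np.array([1]),  # 1 marks a foreground click
        multimask_output=True,
    )
    return masks[np.argmax(scores)]  # keep the highest-scoring candidate


if __name__ == "__main__":
    # Checkpoint path is a placeholder; the dummy array stands in for a scan slice.
    sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
    predictor = SamPredictor(sam)
    image = np.zeros((480, 640, 3), dtype=np.uint8)
    predictor.set_image(image)
    print(segment_at_gaze(predictor).shape)
```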

The Medical SAM Adapter (MSA) integrates medical domain knowledge into SAM, significantly improving its performance across 19 medical image segmentation tasks spanning modalities such as CT and MRI. It demonstrates how domain adaptation can enhance a model's applicability in specialized fields. Similarly, the SAM-Adapter leverages domain-specific visual cues or prompts to boost the model's capability, combining task-specific knowledge with the general knowledge learned by the larger model and yielding notably improved performance on challenging medical imaging tasks. These adaptations are crucial for tailoring generalist models like SAM to the nuances of medical imaging, ensuring better applicability and accuracy in a critical domain.
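
The underlying adapter technique is straightforward: small bottleneck modules are inserted into the frozen backbone and are the only parameters updated during fine-tuning. The PyTorch sketch below shows a generic residual bottleneck adapter wrapped around a frozen block; it illustrates the general idea rather than the exact module placement used by MSA or SAM-Adapter.

```python
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """A small residual adapter: down-project, non-linearity, up-project."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


class AdaptedBlock(nn.Module):
    """Wrap a pretrained (frozen) block and train only the adapter."""

    def __init__(self, block: nn.Module, dim: int):
        super().__init__()
        self.block = block
        for p in self.block.parameters():
            p.requires_grad = False  # keep the pretrained block frozen
        self.adapter = BottleneckAdapter(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.adapter(self.block(x))


tokens = torch.randn(2, 196, 768)               # (batch, tokens, dim)
adapted = AdaptedBlock(nn.Identity(), dim=768)  # nn.Identity stands in for a ViT block
print(adapted(tokens).shape)                    # torch.Size([2, 196, 768])
```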

Visual prompts play a crucial role in SAM by enabling the model to respond effectively to unseen user queries, thus enhancing its zero-shot generalization ability. The model uses a unified prompt scheme to encode different visual inputs, such as points, boxes, scribbles, and masks, into a joint visual-semantic space, allowing it to handle diverse segmentation tasks without prior training on those specific tasks. This ability to integrate various prompt types contributes significantly to SAM's adaptability and effectiveness across different contexts, supporting versatile application scenarios.
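
For a concrete view of point- and box-style prompting, the sketch below uses the official `segment-anything` Python package; the image path, click coordinates, box, and checkpoint file are placeholders, and the ViT-H checkpoint is assumed to have been downloaded separately.

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load SAM with a ViT-H backbone (checkpoint path is a placeholder).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# Read an RGB image (path is a placeholder) and compute its embedding once.
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# A single foreground click (label 1) plus an optional box prompt.
point_coords = np.array([[320, 240]])
point_labels = np.array([1])
box = np.array([100, 80, 540, 400])

# SAM returns several candidate masks together with predicted IoU scores.
masks, scores, _ = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    box=box,
    multimask_output=True,
)
best_mask = masks[np.argmax(scores)]
print(best_mask.shape, scores)
```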

Innovations in multimodal learning, such as models like ImageBind that align six different modalities around image/video information, aim to enhance model robustness and social impact by creating more cohesive and context-aware learning frameworks. These models are designed to better capture and exploit the interdependencies among data modalities, addressing challenges related to performance, robustness, and interpretability. By offering a unified approach to handling diverse data types, they show promise for improving model adaptability and reliability in diverse real-world applications.

Foundation models like SAM contribute significantly to progress toward artificial general intelligence (AGI) by providing versatile models capable of performing varied tasks through zero-shot generalization. They are seen as a critical step toward AGI because they can generalize from pretrained knowledge to new scenarios without extensive retraining, reflecting a more human-like adaptability and learning capability in artificial systems.

The Segment Anything Model (SAM) achieves flexibility in segmentation applications through the promptable segmentation task, in which prompts, such as a location, a region, a mask, or a text description, indicate what to segment. This flexibility rests on three components: the task definition, which specifies the data and purpose; the model architecture, designed to process and respond to various prompts; and the extensive SA-1B dataset, created through an interactive train-annotate loop and containing over a billion masks. With this setup, SAM can adapt to a wide range of existing and novel segmentation tasks, enabling zero-shot application across different data distributions.
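
When no user prompt is given, the same model can be driven by a regular grid of point prompts to segment an entire image, mirroring the fully automatic stage of the train-annotate loop. A minimal sketch with the official `segment-anything` package follows; the checkpoint and image paths are placeholders, and the thresholds shown are the library's illustrative defaults.

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Checkpoint and image paths are placeholders.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
mask_generator = SamAutomaticMaskGenerator(
    sam,
    points_per_side=32,           # density of the point grid used as prompts
    pred_iou_thresh=0.88,         # filter masks by predicted quality
    stability_score_thresh=0.95,  # filter masks by stability under thresholding
)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)

# Each record holds a binary mask plus metadata such as area and bounding box.
for m in sorted(masks, key=lambda m: m["area"], reverse=True)[:5]:
    print(m["area"], m["bbox"], round(m["predicted_iou"], 3))
```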

SAM has been found to struggle in scenarios involving ambiguous or concealed objects, as evidenced by its performance on camouflaged object detection tasks. Studies highlight the model's difficulty in maintaining segmentation accuracy when objects are not distinctly visible or identifiable, pointing to potential improvements in its foundational training or prompt handling. This is a key insight into the limitations that must be addressed to enhance SAM's versatility and reliability in less straightforward segmentation tasks.

Advances in multimodal foundation models enhance vision-language alignment by capturing cross-modal interactions and learning unified embedding spaces. Models such as CLIP, ALIGN, Florence, VLBERT, X-LXMERT, and DALL-E handle tasks including classification, retrieval, object detection, and image captioning by aligning image-text data to learn universal visual representations. ImageBind further extends these capabilities by aligning six different modalities around image/video information, pushing the field toward effectively aligning multiple modalities.
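
As a small, concrete example of image-text alignment, the sketch below scores one image against a handful of text prompts with a public CLIP checkpoint via the Hugging Face `transformers` library; the image path and prompt list are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # placeholder path
texts = ["a photo of a cat", "a photo of a dog", "a photo of a street"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Scaled image-text similarities, normalized into probabilities over the prompts.
probs = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
for text, p in zip(texts, probs.tolist()):
    print(f"{p:.3f}  {text}")
```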

SAM faces challenges in complex scenarios, such as misclassifying parts of indistinct objects and failing to predict certain elements in overlapping scenes. It also struggles to identify instruments in surgical scenes involving blood, reflection, blur, and shade, revealing insufficient robustness against various forms of data corruption. These challenges highlight SAM's limitations in maintaining performance across diverse and complex real-world applications.
