PII: S0925-2312(25)01785-0
DOI: https://doi.org/10.1016/j.neucom.2025.131113
Reference: NEUCOM 131113
PONet: Prototype Optimization Network for Few-shot Medical
Image Segmentation
Wang Siqi, Yu Xiaosheng, Chi Jianning, Wu Chengdong, Gao Xiujing
Abstract
Although deep learning algorithms have achieved remarkable success in med-
ical image segmentation, their reliance on extensive manually annotated
datasets restricts the generalization to scenarios with limited training data.
Few-shot segmentation has emerged as a promising solution to this problem
by enabling effective segmentation with limited data. However, when applied
directly to medical images, there are often two primary challenges: inter-class
and intra-class inconsistency. First, inter-class inconsistency refers to the
high similarity between different pathological states or tissue structures in
pathological images, which negatively impacts the quality of support proto-
types. Second, intra-class inconsistency stems from the varying appearances
of the same organ across different samples, thereby disrupting the support-
query alignment process. To counter these challenges, we propose a proto-
type optimization network to achieve accurate few-shot medical image seg-
mentation. Specifically, we first employ the simple linear iterative clustering
(SLIC) method to generate multiple foreground and background sub-regions
within the support image. Subsequently, we introduce a boundary prototype contrastive learning (BPCL) module that applies contrastive learning to improve the quality of the support prototypes, and a query guidance prototype optimization (QGPO) module that refines the support prototypes according to the current query image. Extensive experiments on abdominal MRI, abdominal CT, and cardiac MRI datasets demonstrate that the proposed method achieves state-of-the-art few-shot medical image segmentation performance.
* Corresponding author.
Email addresses: [email protected] (Wang Siqi), [email protected] (Yu Xiaosheng), [email protected] (Chi Jianning), [email protected] (Wu Chengdong), [email protected] (Gao Xiujing)
1. Introduction
Medical image segmentation aims to accurately identify and depict anatom-
ical structures or pathological regions in various medical images, including
ultrasound, computed tomography (CT), and magnetic resonance imaging
(MRI) scans [1]. The resulting segmentations are widely used in clinical processes such as disease diagnosis [2], treatment planning [3], and
computer-assisted intervention [4]. In recent years, deep learning algorithms
based on the convolutional neural network (CNN) [5, 6], Transformer [7, 8],
and Mamba [9, 10], have made significant progress in medical image process-
ing, thereby effectively reducing the workload of healthcare professionals. In
addition, the Segment Anything Model (SAM) [11] has gained considerable
attention due to its outstanding performance in conventional image segmen-
tation tasks [12]. However, when applied to medical image segmentation, its
effectiveness is diminished due to the model’s lack of training on medical-
specific data. To address this limitation, Ma et al. presented MedSAM
[13], a foundational model trained on a large-scale annotated medical image
dataset, designed to enable universal medical image segmentation. Zhou et
al. introduced MIT-SAM [14], a novel model for text-assisted medical im-
age segmentation, which incorporates a SAM-enhanced image encoder and
a Bert-based text encoder. Despite advancements, these methods continue
to face significant challenges. For instance, their training processes typically
require large amounts of annotated data, which is challenging to obtain due
to high costs, variations in imaging equipment, and privacy concerns. These
factors hinder the practical application of these deep learning approaches.
Furthermore, due to the scarcity of abnormal organs and rare lesion samples,
models trained on such medical datasets tend to recognize only the anatom-
ical structures they have seen, struggling to generalize to unseen semantic
categories.
To tackle the above challenges, few-shot segmentation (FSS) has emerged as a promising solution that enables deep learning algorithms to achieve superior performance with limited data [15]. FSS methods enable the segmentation of unseen anatomical structures from only a few labeled images, a capability that is particularly valuable in clinical contexts such as diagnosing rare or complex diseases. Specifically, FSS methods first extract representative semantic information for a specific class from a limited set of labeled images (i.e., the support set), then use the learned knowledge to guide the segmentation of unlabeled images (i.e., the query set).
The development of FSS has given rise to numerous effective methods for
medical image segmentation, typically categorized into two-branch interac-
tive methods [16, 17, 18] and prototypical methods [19, 20]. Interaction-
based approaches often integrate techniques such as attention mechanisms
and contrastive learning to facilitate interactions between support and query
branches, enabling the propagation of learned knowledge from annotated sup-
port images to unannotated query images. In contrast, prototypical methods
are more prevalent in medical image segmentation. Their main workflow can be
summarized as mining foreground features from support images to generate
prototypes, and then using the prototypes for feature matching to segment
the target region in the query image, as depicted in Figure 1(a).
Although prototype-based FSS methods have made progress in medical
image segmentation, there remains room for improvement. We comprehen-
sively analyze the factors that lead to incorrect prediction masks, summarized
in two aspects: (1) Inter-class inconsistency leads to the generation of poor
support prototypes. As illustrated in Figure 2(a), inter-class inconsistency
occurs when different pathological states or tissue structures within the same
image are highly similar, resulting in the generation of low-quality support
prototypes. These inferior prototypes may fail to regenerate their corre-
sponding support masks, thereby hindering the accurate capture of specific
Figure 1: Workflow for generating foreground and background prototypes.
Figure 2: Visual segmentation results of the prototype-based method [19] under inter-class
and intra-class inconsistency: (a) The first column represents the inter-class inconsistency
that results in poor quality of the support prototype, which may even fail to restore
the segmentation mask of the supporting image. The second column illustrates cases of
query mask segmentation failure caused by poor support prototypes. (b) Query image
segmentation results guided by different support images.
features during the guided query image segmentation process and negatively
impacting segmentation accuracy. (2) Intra-class inconsistency leads to poor
generalization ability in the support-query guided process. Intra-class in-
consistency refers to the variation among samples within the same category.
Figure 3: Workflow for generating foreground and background prototypes.
Since the support and query images are selected randomly during training, they inevitably differ in appearance, such as organ shape, size, and gray level. Common prototype-based methods generate support prototypes mainly from support features and design a series of modules to enhance the expressiveness of these prototypes, thereby reducing the difference between support and query. We consider this strategy to be suboptimal, as
it overlooks the need for customization specific to the query samples. As
shown in Figure 2(b), there is a noticeable variation in the query segmen-
tation quality when guided by prototypes generated from different support
images.
To mitigate the above issues, we propose the prototype optimization net-
work (PONet), which is composed of the boundary prototype contrastive
learning (BPCL) module and query guidance prototype optimization (QGPO)
module for accurate few-shot medical image segmentation. As shown in Figure 3, we first use the simple linear iterative clustering (SLIC) method [21] to over-segment the support image, and combine the result with the support mask to generate
three independent sets of region-level prototypes from the support feature:
foreground prototypes, adjacent-boundary background prototypes, and non-
adjacent-boundary background prototypes. Instead of using a single sup-
port prototype, we guide segmentation by optimizing multiple support fore-
ground and background prototypes. Then, to enhance the quality of the
support prototype, we design the BPCL module to emphasize the impor-
tance of adjacent-boundary background prototypes, as these often exhibit
features similar to the foreground. Specifically, the BPCL module primar-
ily optimizes the background prototypes by introducing contrastive learning
to increase the distance between adjacent-boundary background prototypes
and foreground prototypes. This optimization of the background prototype
helps alleviate inter-class inconsistency during the subsequent guided query
segmentation process. To mitigate intra-class inconsistency, we introduce the
QGPO module, which employs a query-guided support prototype optimiza-
tion strategy that generates customized support prototypes based on the
current query image. Notably, the QGPO module incorporates the Mamba
model [22], which captures long-range dependencies while maintaining linear
complexity, to enhance the interaction between support and query, refining
the support prototypes.
In short, the main contributions of our work are:
1) We propose a new FSS method, PONet, to solve the data scarcity
problem in medical segmentation applications.
2) We propose the BPCL module, which introduces contrastive learning
to enhance the quality of support prototypes, thereby addressing inter-class
inconsistency.
3) We propose the QGPO module, which introduces the query-guided
optimization strategy to enhance the connection between support features
and the query image, thereby alleviating intra-class inconsistency.
4) Extensive experiments on Abdominal MRI (Abd-MRI), Abdominal
CT (Abd-CT), and Cardiac MRI (Card-MRI) datasets demonstrate that our
PONet achieves state-of-the-art (SOTA) segmentation performance.
2. Related Works
2.1. Supervised Medical Image Segmentation
Medical image segmentation plays a vital role in many clinical studies and
practical applications by automatically identifying and depicting regions of
interest. In recent years, convolutional neural networks (CNN) have shown
remarkable performance in medical image segmentation, successfully solving
a variety of segmentation tasks such as tissue structure, lesion region, and or-
gan delineation. As one of the most prominent CNN architectures, U-Net [23]
employs an encoder-decoder structure and incorporates skip connections to
map the features from the encoder stage to the decoder stage, facilitating the
interaction between texture and semantic features, which proves effective in
various medical segmentation tasks. Subsequently, various U-Net-based vari-
ants have been proposed, including U-Net++ [24], nnU-Net [25], and atten-
tion U-Net [26], each with representative contributions. Additionally, several
supervised medical segmentation methods based on Transformer [7, 8] and
Mamba [9, 10] architectures have also shown promising performance. For in-
stance, Patil et al. [7] proposed the permutation invariant multi-headed self-
attention module integrated into a U-shaped transformer architecture, which
enhances segmentation performance by improving the robustness across dif-
ferent spatial locations in medical images. Gu et al. [8] proposed RAMIS, a
novel hybrid architecture combining CNN and Vision Transformer for med-
ical image segmentation. Cheng et al. [9] proposed Mamba-Sea, a novel
Mamba-based framework for domain generalization in medical image seg-
mentation, which incorporates global-to-local sequence augmentation to im-
prove performance. While these medical image segmentation methods have
achieved notable success, their efficacy often depends on the availability of extensively annotated datasets, which poses a potential limitation to their appli-
cation in real-world medical scenarios.
2.2. Few-shot Semantic Segmentation
Yang et al. [35] proposed MASNet, a multi-scale and attention-based few-shot semantic segmentation
network that enhances feature representation through a multi-scale feature
enhancement module and channel attention, achieving improved accuracy in
segmentation tasks. Wang et al. [36] proposed ESSNet, an Embedded-Self-
Supplementing Network that combines semantic word embedding and query
set self-supplementing information to address inter-class inconsistency and
information loss in FSS tasks. Wu et al. [37] proposed DefectSAM, which
fine-tunes SAM by incorporating few-shot learning and low-rank adaptation
to achieve effective industrial defect segmentation with limited training data.
Wang et al. [38] proposed LPFS, a meta-learning-based few-shot segmenta-
tion method that leverages a learnable prototype module and global-attention
correlation map to effectively adapt models to unseen geographic categories
with minimal support examples.
Although FSS techniques have made significant strides in natural image
segmentation, their direct application to medical images presents several chal-
lenges. For instance, medical images often have limited grayscale variation
and unclear boundaries, making segmentation more difficult. Furthermore,
ethical concerns related to personal privacy further restrict the availability
of labeled data, thereby limiting the effectiveness of FSS methods. These
challenges underscore the necessity of developing specialized FSS techniques
for medical tasks.
2.3. Few-shot Medical Image Segmentation
Lin et al. [18] proposed a few-shot medical image segmentation framework based on cross attention transformer, termed CAT, which mines
correlations between support and query images, and eliminates useless pixel
information to enhance query feature purity. Gong et al. [41] proposed a
CGNet model that incorporates a cross feature module (CFM) to enhance lesion detail understanding by facilitating interaction between the query and support sets, and a support guide query (SGQ) module that integrates features at different scales to refine segmentation of intracranial hemorrhage. Although interactive methods have shown some
progress, they still face significant challenges. Attention-based models typ-
ically exhibit high computational complexity and require a larger number
of labeled images for training, which may lead to overfitting if insufficient
data is available. Contrastive learning or cross-feature guidance generally
involves two stages, resulting in increased computational demands. Further-
more, cross-feature guidance methods are often affected by blurry boundaries,
which may cause the background features of the query image to mix with
the foreground features of the support image, leading to suboptimal query
segmentation performance.
In contrast, the approach based on prototypical network structures is
more commonly used. Ouyang et al. [19] proposed a self-supervised few-shot
medical image segmentation model, SSL-ALPNet, which laid a solid foun-
dation for subsequent research on prototype learning. Subsequently, a series
of outstanding FSS works [20, 43, 44, 45, 46, 47, 48] combine self-supervision
with prototype learning. For instance, Zhu et al. [44] proposed a search and
filtering (S&F) module designed based on the self-selection mechanism to al-
leviate the impact of intra-class differences during the support-guided query
segmentation process. Rashid et al. [48] proposed the ViT-CAPS model,
which leverages Vision Transformers, the adaptive context embedding mod-
ule, and the meta prompt generator to improve few-shot segmentation per-
formance in dynamic, low-annotation settings. Although the aforementioned
methods have made some progress, they still face certain limitations. For
example, they typically rely on a single support prototype to guide query
segmentation. When the support image itself contains significant inter-class
inconsistency, it leads to the generation of low-quality support prototypes,
which severely impact the performance of query segmentation. Additionally,
they only consider support features during the prototype generation process,
overlooking potential appearance differences between the target objects in
the support and query images, thus hindering the generalization ability of
the support prototype. In contrast, our proposed PONet optimizes multi-
ple local support prototypes in a self-supervised manner through the BPCL
module and establishes relationships between support and query prototypes
using the QGPO module, generating customized prototypes suited for the
current query and enhancing medical FSS segmentation performance.
Recent advancements in computer vision have introduced the Segment
Anything Model (SAM) [11], which has been pre-trained on 11 million im-
ages and 1 billion masks, demonstrating exceptional performance across a
wide range of general image segmentation tasks. SAM has attracted a great
deal of attention in the field of medical segmentation, but it performs poorly
when it is applied directly to medical images due to the huge differences
between natural and medical images [12]. Consequently, SAM-inspired mod-
els such as MedSAM [13], MIT-SAM [14], BiASAM [49], MASG-SAM [50],
have demonstrated the potential of foundational segmentation models for 3D
medical image segmentation. For instance, Zhou et al. proposed BiASAM
[49], which uniquely incorporates two bidirectional attention mechanisms into
SAM for medical image few-shot segmentation. These models typically re-
quire input prompts, such as click points or bounding boxes, for effective
operation. Although SAM-based medical segmentation methods have pro-
gressed, they still face significant limitations due to the scarcity of labeled
medical data, compounded by privacy concerns and the high cost of data
acquisition.
3. Methods
3.1. Problem Definition
In the FSS task, the complete dataset $\mathcal{D}$ is divided into two separate subsets: the training set $\mathcal{D}_{train}$ and the testing set $\mathcal{D}_{test}$. The segmentation model is trained on $\mathcal{D}_{train}$ and then evaluated on $\mathcal{D}_{test}$. $\mathcal{D}_{train}$ is labeled by the class set $\mathcal{C}_{train}$ and $\mathcal{D}_{test}$ by $\mathcal{C}_{test}$, where $\mathcal{C}_{train}$ and $\mathcal{C}_{test}$ are disjoint, i.e., $\mathcal{C}_{train} \cap \mathcal{C}_{test} = \emptyset$.
FSS aims to segment query objects of a specific semantic class based on extremely few labeled support images. Following the current approach [32], we combine training and testing with meta-learning [51], also called episodic learning. Specifically, we use the episode mode to set up the N-way K-shot segmentation task, where N represents the number of classes to be segmented in each episode and K represents the number of images contained in each class. In each episode for a specific class c, the input to the model $F(S, Q)$ consists of the support set $S = \{(I_s^i, M_s^i)\}_{i=1}^{K}$ and the query set $Q = \{I_q, M_q\}$.
Table 1: A summary of the segmentation methods involved in related works.
Figure 4: Overview of the proposed PONet.
$I_s^i$ and $I_q$ denote the $i$-th support image and the query image, and their masks are denoted $M_s^i$ and $M_q$, respectively. The model takes the support set $S$ and the query image $I_q$ as input, leveraging the semantic knowledge extracted from the labeled support images to guide the segmentation of the query image and output its predicted mask. The prediction mask is then supervised using the ground-truth mask $M_q$. Following previous works on medical FSS [19, 20], we set N = K = 1.
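As a concrete illustration of this episodic protocol, a 1-way 1-shot episode can be sampled as in the minimal Python sketch below; the `dataset` object and its per-class indexing are hypothetical stand-ins, not the authors' actual data pipeline.

```python
# Minimal sketch of 1-way 1-shot episode construction. `dataset[c]` is assumed
# to be a list of (image, mask) pairs for class c; this is an illustrative
# stand-in, not the authors' data loader.
import random

def sample_episode(dataset, classes):
    c = random.choice(classes)                       # class for this episode
    i, j = random.sample(range(len(dataset[c])), 2)  # disjoint support/query indices
    support = dataset[c][i]                          # labeled support pair (I_s, M_s)
    query = dataset[c][j]                            # query pair; M_q used only for the loss
    return support, query
```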
The extracted support feature $F_s$ and query feature $F_q$ are used as inputs to the BPCL and QGPO modules to optimize the regional support background prototypes $P_s^b$ and foreground prototypes $P_s^f$, respectively. Finally, the optimized support prototype set $P_s = P_s^b \cup P_s^f$ and the query feature $F_q$ are used to generate the final query image prediction.
Combining the SLIC sub-regions with the support mask, we generate the foreground prototypes $P_s^f = \{p_{s,n}^f\}_{n=1}^{K_1}$, the adjacent-boundary background prototypes $P_s^{b\_a} = \{p_{s,n}^{b\_a}\}_{n=1}^{K_2}$, and the non-adjacent-boundary background prototypes $P_s^{b\_n} = \{p_{s,n}^{b\_n}\}_{n=1}^{K_3}$. This process employs the mask average pooling (MAP) strategy, guided by the following formulations:

$$p_{s,n}^{f} = \mathrm{MAP}(F_s, X_n) = \frac{1}{|X_n|}\sum_{i=1}^{HW} F_{s,i} X_{n,i} \qquad (1)$$

$$p_{s,n}^{b\_a} = \mathrm{MAP}(F_s, Y_n) = \frac{1}{|Y_n|}\sum_{i=1}^{HW} F_{s,i} Y_{n,i} \qquad (2)$$

$$p_{s,n}^{b\_n} = \mathrm{MAP}(F_s, Z_n) = \frac{1}{|Z_n|}\sum_{i=1}^{HW} F_{s,i} Z_{n,i} \qquad (3)$$

where $X_n$, $Y_n$, and $Z_n$ denote the binary masks of the $n$-th foreground, adjacent-boundary background, and non-adjacent-boundary background sub-regions, respectively.
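To make the MAP operation concrete, the sketch below implements Eqs. (1)-(3) in PyTorch under assumed tensor shapes; it is a reading of the formulas, not the authors' released code.

```python
# Minimal PyTorch sketch of mask average pooling (MAP), Eqs. (1)-(3).
# feat: (C, H, W) support feature map; masks: (N, H, W) binary sub-region
# masks (the X_n, Y_n, or Z_n produced from SLIC plus the support mask).
import torch

def masked_average_pooling(feat, masks, eps=1e-5):
    C, H, W = feat.shape
    feat = feat.view(C, H * W)                             # (C, HW)
    masks = masks.view(masks.shape[0], H * W).float()      # (N, HW)
    protos = masks @ feat.t()                              # per-region feature sums, (N, C)
    sizes = masks.sum(dim=1, keepdim=True).clamp(min=eps)  # region sizes |X_n|
    return protos / sizes                                  # (N, C) region-level prototypes
```

Applying the same routine to the foreground, adjacent-boundary background, and non-adjacent-boundary background masks yields the three prototype sets.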
As mentioned, inter-class inconsistency usually occurs in adjacent-boundary background regions. We mitigate this inconsistency by introducing a contrastive learning loss $\mathcal{L}_{contrast}$. Specifically, we obtain the positive sample $P_{positive}$ and the negative sample $P_{negative}$ by performing global average pooling (GAP) on $P_s^{b\_n}$ and $P_s^f$, respectively. Then, we set $P_s^{b\_a}$ as the anchor and utilize the triplet loss to update each prototype $p_n^{b\_a}$ in $P_s^{b\_a}$ to be closer to the positive sample $P_{positive}$ and farther from the negative sample $P_{negative}$:

$$\mathcal{L}_{contrast} = \sum_{n=1}^{K_2} \max\left( L2\left(p_n^{b\_a}, P_{positive}\right) - L2\left(p_n^{b\_a}, P_{negative}\right) + M, 0 \right) \qquad (6)$$

where $L2(\cdot,\cdot)$ denotes the Euclidean distance and $M$ is the margin.
The optimized background prototypes are then used to guide the subsequent QGPO module and the final segmentation of the query image.
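A minimal PyTorch sketch of this triplet-style objective is given below, assuming prototype tensors of shape (K, C) and treating the margin as a free hyperparameter:

```python
# Minimal sketch of the BPCL loss in Eq. (6). p_ba: (K2, C) adjacent-boundary
# background prototypes (anchors); p_bn: (K3, C) non-adjacent background;
# p_f: (K1, C) foreground. The margin value is a placeholder.
import torch
import torch.nn.functional as F

def bpcl_loss(p_ba, p_bn, p_f, margin=1.0):
    pos = p_bn.mean(dim=0)                       # GAP -> positive sample
    neg = p_f.mean(dim=0)                        # GAP -> negative sample
    d_pos = torch.norm(p_ba - pos, p=2, dim=1)   # L2 distance anchor -> positive
    d_neg = torch.norm(p_ba - neg, p=2, dim=1)   # L2 distance anchor -> negative
    return F.relu(d_pos - d_neg + margin).sum()  # hinge summed over all anchors
```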
Figure 5: The structural diagram of the query prototype generation module.
$$M_q^b = 1 - \sigma\left(0.5\left(S(\mathrm{GAP}(P_s^b), F_q) - \tau_2\right)\right) \qquad (8)$$

where $M_q^f$ and $M_q^b$ denote the query foreground and background masks, respectively. $S(x, y) = -\alpha \mathrm{Cos}(x, y)$ denotes the scaled cosine similarity, where $\alpha = 20$ is the scaling factor. $\sigma$ denotes the Sigmoid function, and $\tau_1$, $\tau_2$ are learnable thresholds derived from ResNet-50 through two distinct fully connected layers. The query prediction mask $M_q$ is then obtained as:

$$M_q = \lambda M_q^f + (1 - \lambda)\left(1 - M_q^b\right) \qquad (9)$$
Based on this, the query prototype $P_q$ can be generated using the MAP operation:

$$P_q = \mathrm{MAP}(F_q, M_q) = \frac{1}{|M_q|}\sum_{i=1}^{HW} F_{q,i} M_{q,i} \qquad (10)$$
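The sketch below illustrates Eqs. (8)-(10); since Eq. (7) is not reproduced in this text, the foreground mask is assumed to mirror Eq. (8), and λ, τ1, τ2 are treated as given scalars.

```python
# Minimal sketch of query mask prediction (Eqs. (8)-(9)) and query prototype
# generation (Eq. (10)). P_f, P_b: (C,) pooled foreground/background
# prototypes; F_q: (C, H, W) query feature. The foreground branch form is an
# assumption, mirroring Eq. (8).
import torch
import torch.nn.functional as F

def predict_query(P_f, P_b, F_q, tau1, tau2, lam=0.5, alpha=20.0):
    def S(p, feat):                                # S(x, y) = -alpha * Cos(x, y)
        return -alpha * F.cosine_similarity(feat, p[:, None, None], dim=0)
    M_f = torch.sigmoid(0.5 * (S(P_f, F_q) - tau1))        # assumed Eq. (7) analogue
    M_b = 1.0 - torch.sigmoid(0.5 * (S(P_b, F_q) - tau2))  # Eq. (8)
    M_q = lam * M_f + (1.0 - lam) * (1.0 - M_b)            # Eq. (9)
    P_q = (F_q * M_q).sum(dim=(1, 2)) / M_q.sum().clamp(min=1e-5)  # Eq. (10)
    return M_q, P_q
```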
Finally, we introduce the Mamba model [22], which embeds the prototype $\tilde{P}_s^f$ as input sequence data to capture essential information while maintaining long-range dependencies. To be specific, Mamba uses the Selective State Space Model (SSM) to identify key information from the input $\tilde{P}_s^f$ and generate the refined $\hat{P}_s^f$. As shown in Figure 4(b), the Mamba block consists of layer normalization (LN), a linear layer (Linear), 1D convolution (Conv1d), the selective SSM, SiLU activation [54], and a residual connection. Given an input matrix $x$, the output matrix $y$ is computed by the Mamba block as follows:

$$x_1 = \mathrm{SelectiveSSM}(\mathrm{SiLU}(\mathrm{Conv1d}(\mathrm{Linear}(\mathrm{LN}(x))))) \qquad (13)$$

$$x_2 = \mathrm{SiLU}(\mathrm{Linear}(\mathrm{LN}(x))) \qquad (14)$$

$$y = \mathrm{Linear}(x + (x_1 \odot x_2)) \qquad (15)$$
where $\odot$ denotes the element-wise product. Please refer to Mamba [22] for a more detailed explanation of the selective SSM approach.
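A simplified PyTorch sketch of this gated block is shown below; the selective SSM is stubbed with a placeholder module, since the actual selective scan is defined in Mamba [22], and the layer dimensions are assumptions.

```python
# Simplified sketch of the Mamba block in Eqs. (13)-(15); the selective SSM is
# a placeholder (identity) here and should be replaced by a real selective scan.
import torch.nn as nn

class MambaBlockSketch(nn.Module):
    def __init__(self, dim, ssm=None):
        super().__init__()
        self.norm = nn.LayerNorm(dim)                          # LN(x)
        self.in_proj1 = nn.Linear(dim, dim)                    # Linear in Eq. (13)
        self.in_proj2 = nn.Linear(dim, dim)                    # Linear in Eq. (14)
        self.conv1d = nn.Conv1d(dim, dim, 3, padding=2, groups=dim)  # causal depthwise conv
        self.act = nn.SiLU()
        self.ssm = ssm if ssm is not None else nn.Identity()   # selective SSM stand-in
        self.out_proj = nn.Linear(dim, dim)                    # Linear in Eq. (15)

    def forward(self, x):                                      # x: (B, L, dim) prototype sequence
        h = self.norm(x)
        c = self.conv1d(self.in_proj1(h).transpose(1, 2))[..., : x.shape[1]]
        x1 = self.ssm(self.act(c.transpose(1, 2)))             # Eq. (13)
        x2 = self.act(self.in_proj2(h))                        # Eq. (14)
        return self.out_proj(x + x1 * x2)                      # Eq. (15)
```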
To enhance bias alleviation, we employ a stacked arrangement of QGPO modules to iteratively update the support and query prototypes. Experimental results show that a stack of three modules achieves the optimal effect (see Section 4.3 for details).
In summary, in the QGPO module, the previous support foreground prototype $P_s^f$ is first fed into the QPG module to generate the query prototype $P_q$. Subsequently, both $P_s^f$ and $P_q$ are input into the Bias-alleviation Mamba (BaM) module to generate the updated support foreground prototype $\hat{P}_s^f$. This updated $\hat{P}_s^f$ then replaces $P_s^f$, and the entire process is repeated three times. The update process can be expressed as:

$$P_q = \mathrm{QPG}(P_s^f, P_s^b, F_q) \qquad (16)$$

$$\hat{P}_s^f = \mathrm{BaM}(P_s^f, P_q) \qquad (17)$$

Following optimization by the QGPO module, the support foreground prototype $P_s^f$, combined with the support background prototype $P_s^b$, is used to predict the final query image mask $M_q$ through Eqs. (7), (8), and (9).
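The iterative refinement of Eqs. (16)-(17) reduces to a short loop; in the sketch below, `qpg` and `bam` are stand-ins for the QPG and BaM modules described above.

```python
# Minimal sketch of the stacked QGPO update, Eqs. (16)-(17), repeated three
# times per the ablation in Section 4.3.
def qgpo_refine(P_f, P_b, F_q, qpg, bam, rounds=3):
    for _ in range(rounds):
        P_q = qpg(P_f, P_b, F_q)   # Eq. (16): query prototype from current supports
        P_f = bam(P_f, P_q)        # Eq. (17): bias-alleviated support prototype
    return P_f
```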
where BCE denotes the binary cross-entropy loss. Furthermore, for the support loss $\mathcal{L}_{support}$, we first generate the support mask $M_s$ by combining the query prediction mask $M_q$ with the support foreground prototype $P_s^f$ and background prototype $P_s^b$, as described in Eqs. (7), (8), and (9). Subsequently, $\mathcal{L}_{support}$ measures the error between the corresponding ground truth $\tilde{M}_s$ and the predicted mask $M_s$.
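A minimal sketch of this two-term objective is given below; the equal weighting of the two terms is an assumption, since the corresponding equations are not reproduced in this text.

```python
# Minimal sketch of the training objective: BCE on the query prediction plus a
# BCE-based support regeneration term. Equal weighting is assumed.
import torch.nn.functional as F

def total_loss(M_q_pred, M_q_gt, M_s_pred, M_s_gt):
    L_query = F.binary_cross_entropy(M_q_pred, M_q_gt)    # query supervision
    L_support = F.binary_cross_entropy(M_s_pred, M_s_gt)  # support regeneration
    return L_query + L_support
```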
4. Experiments
4.1. Experimental Setup
1) Benchmark Datasets: We evaluated the proposed PONet on three pub-
licly available datasets, including Abd-MRI [55], Abd-CT [56], and Card-MRI
[57]. Specifically, Abd-MRI is an abdominal MRI dataset from the ISBI 2019
Combined Healthy Abdominal Organ Segmentation Challenge (CHAOS). It
includes 20 3D T2-SPIR MRI scans, each with an average of 36 slices. Abd-
CT is an abdominal CT dataset from the MICCAI 2015 Multi-Atlas Abdom-
inal Labeling challenge, consisting of 30 3D abdominal CT scans. Abd-MRI
and Abd-CT share the same annotation classes, which include the liver, left
kidney (LK), right kidney (RK), and spleen. Card-MRI, from the MICCAI
2019 multi-sequence cardiac MRI segmentation challenge, includes 35 3D car-
diac MRI scans with an average of 13 slices per scan. The annotation labels
for Card-MRI include left ventricle myocardium (LV-MYO), left ventricular
blood pool (LV-BP), and right ventricle (RV).
2) Implementation Details: Our PONet is implemented using PyTorch (v1.10.2) on an NVIDIA RTX 3090 GPU. During training, we utilize the
standard feature extraction approach, employing the ResNet-50 pre-trained
on the MS-COCO dataset as the backbone of the feature extractor. Refer to
[19] and [58] for the data pre-processing pipeline. Adopting the meta-learning
strategy, we configure the network for 1-way 1-shot learning, performing 50K
iterations with a batch size of 1. The network is optimized using the SGD
[59] optimizer, with an initial learning rate of 0.001 and a step decay of 0.8 every 1000 iterations. We select support and query slices according to
the strategy in [43] and use five-fold cross-validation for training and testing,
recording the final average values.
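In PyTorch, the stated schedule corresponds roughly to the sketch below; `model`, `train_step`, and `episode_loader` are hypothetical placeholders for the PONet instance and the episodic data pipeline.

```python
# Sketch of the stated optimization schedule: SGD, lr 0.001, decayed by 0.8
# every 1000 iterations, 50K iterations with batch size 1. `model`,
# `train_step`, and `episode_loader` are hypothetical placeholders.
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1000, gamma=0.8)

for it in range(50_000):
    loss = train_step(model, next(episode_loader))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```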
3) Experiment Settings: In the evaluation phase, we repeat the training
for each experimental scenario 5 times and use the Dice Similarity Coefficient
(DSC) and the Boundary F1 (BF1) score to record the mean experimental
results, thereby assessing the similarity between the predicted mask and the
ground truth. The BF1 score is designed to assess the segmentation quality
on the boundaries. It evaluates the alignment between the dilated predicted
boundaries and the ground truth boundaries, disregarding the performance
on interior pixels and concentrating solely on the precision of the boundary
delineation. We set the dilation parameters as 0.75% of the image diagonal
length pixels to calculate the BF1 score. Following [19, 46], we also use
two different supervision settings to evaluate the generalization ability of the
proposed PONet to novel samples. Specifically, in Setting 1, test classes are allowed to appear in the background of training slices; test classes may thus participate in training implicitly rather than being treated as truly unseen novel classes. In Setting 2, slices containing test classes are forcibly removed from the training data to ensure that the test classes are truly unseen. Notably, Setting 2 does not apply to Card-MRI because all organ classes are usually present simultaneously on one slice. Therefore, we only consider Setting 1 for Card-MRI.
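A minimal sketch of such a boundary-tolerant F1 computation, under our reading of the 0.75%-of-diagonal tolerance and using SciPy/scikit-image utilities:

```python
# Minimal sketch of a BF1-style score: boundary precision/recall within a
# dilation tolerance of 0.75% of the image diagonal. The paper's exact
# matching rule may differ; this follows the description above.
import numpy as np
from scipy.ndimage import binary_dilation
from skimage.segmentation import find_boundaries

def boundary_f1(pred, gt, tol_ratio=0.0075):
    h, w = gt.shape
    radius = max(1, int(round(tol_ratio * np.hypot(h, w))))
    struct = np.ones((2 * radius + 1, 2 * radius + 1), dtype=bool)
    bp = find_boundaries(pred.astype(bool))
    bg = find_boundaries(gt.astype(bool))
    precision = (bp & binary_dilation(bg, struct)).sum() / max(bp.sum(), 1)
    recall = (bg & binary_dilation(bp, struct)).sum() / max(bg.sum(), 1)
    return 2 * precision * recall / max(precision + recall, 1e-8)
```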
Table 2: Experimental comparison results on the Abd-MRI and Abd-CT datasets. Each row lists Method, Source, Year, then the Abd-MRI scores (LK, RK, Spleen, Liver, mDSC, mBF1) followed by the Abd-CT scores (LK, RK, Spleen, Liver, mDSC, mBF1).

Setting 1:
SSL-ALPNet[19] TMI 2022 81.92 85.18 72.18 76.10 78.84 65.26 72.36 71.81 70.96 78.29 73.35 61.43
SR&CL[42] MICCAI 2022 79.34 87.42 76.01 80.23 80.77 67.35 73.45 71.22 73.41 76.06 73.53 64.22
ADNet++[20] MIA 2023 86.80 86.62 75.69 74.85 80.99 66.35 53.47 50.29 65.76 74.24 60.94 51.27
Q-Net[43] IntelliSys 2023 78.36 87.98 75.99 81.74 81.02 68.22 76.89 71.87 76.31 77.08 75.54 65.31
CRAPNet[17] WACV 2023 81.95 86.42 74.32 76.46 79.79 67.63 74.69 74.18 70.37 75.41 73.66 63.91
CAT[18] MICCAI 2023 74.01 78.90 68.83 78.98 75.18 64.36 63.36 60.05 67.65 75.31 66.59 58.78
RPT[44] MICCAI 2023 81.83 88.73 76.37 82.59 82.38 70.77 76.52 80.57 72.38 81.32 77.69 69.48
PFMNet[45] CMIG 2024 77.48 81.35 72.33 73.55 76.17 65.88 70.32 75.48 68.52 69.36 70.92 62.81
DGPANet[46] TIM 2024 85.84 86.99 79.62 81.31 83.70 75.45 82.67 79.56 83.28 65.59 77.77 71.52
CGNet[41] CMIG 2025 80.43 82.69 76.33 77.31 79.19 70.62 75.26 70.38 73.35 73.22 73.05 66.17
PGRNet[47] TMI 2025 81.44 87.44 81.72 83.27 83.47 74.96 74.23 79.88 72.09 82.48 77.17 71.37
ViT-CAPS[48] Neuro. 2025 80.59 85.82 77.38 78.93 80.68 70.63 78.69 76.38 76.60 78.82 70.52 61.77
BiASAM[49] ISPL 2025 82.35 83.50 76.89 78.82 80.39 72.21 76.58 75.36 76.21 77.35 76.37 70.73
MASG-SAM[50] JBHI 2025 84.36 86.58 77.65 82.35 82.73 74.36 76.85 77.16 78.09 77.82 77.48 72.61
U-Net[23] MICCAI 2015 80.64 80.35 75.39 77.22 78.40 71.33 76.33 74.56 76.36 78.25 76.37 69.67
ISONet[60] ESWA 2025 81.28 84.52 76.58 78.63 80.25 71.67 77.25 76.85 74.44 76.89 76.35 70.25
Ours - - 88.66 90.26 80.32 84.60 85.96 77.49 80.65 80.89 79.19 83.63 81.39 75.45
Setting 2:
SSL-ALPNet[19] TMI 2022 73.63 78.39 67.02 73.05 73.02 61.22 63.34 54.82 60.25 73.65 63.02 54.43
SR&CL[42] MICCAI 2022 77.07 84.24 73.73 75.55 77.65 72.41 67.39 63.37 67.36 73.63 67.94 62.77
ADNet++[20] MIA 2023 76.25 77.82 69.88 70.65 73.65 67.21 45.62 45.36 61.76 68.42 55.30 48.07
Q-Net[43] IntelliSys 2023 64.81 65.94 65.37 78.25 68.59 60.88 65.67 51.47 63.38 77.07 64.40 57.21
CRAPNet[17] WACV 2023 74.66 82.77 70.82 73.82 75.52 68.58 70.91 67.33 70.17 70.45 69.72 63.35
CAT[18] MICCAI 2023 75.31 83.23 67.31 75.02 75.22 70.31 68.82 64.56 66.02 80.51 70.88 63.74
RPT[44] MICCAI 2023 74.51 86.73 75.80 81.09 79.53 72.32 72.36 67.54 71.95 74.13 71.49 66.21
PFMNet[45] CMIG 2024 72.11 74.35 68.35 69.31 71.03 66.35 66.32 69.35 65.30 65.21 66.54 58.28
DGPANet[46] TIM 2024 73.76 75.96 74.10 69.21 73.72 68.21 74.10 68.06 65.91 65.56 68.41 62.33
CGNet[41] CMIG 2025 76.38 77.25 74.36 72.09 75.02 69.28 70.23 66.36 69.39 70.82 69.20 62.33
PGRNet[47] TMI 2025 77.38 81.25 73.58 78.48 77.67 70.58 69.72 67.88 68.36 75.35 70.32 64.13
ViT-CAPS[48] Neuro. 2025 74.63 79.36 74.21 75.06 75.81 70.88 73.48 69.14 67.25 70.21 70.02 64.11
BiASAM[49] ISPL 2025 73.55 77.43 72.80 73.98 74.44 67.59 70.66 68.17 67.89 73.16 70.47 64.33
MASG-SAM[50] JBHI 2025 76.82 82.58 73.55 79.09 78.01 72.43 70.85 69.16 70.38 75.08 71.36 65.12
U-Net[23] MICCAI 2015 74.35 77.31 68.31 70.02 72.49 66.36 68.20 66.31 69.37 71.25 68.78 62.57
ISONet[60] ESWA 2025 73.06 76.58 71.35 72.33 73.33 66.21 69.33 67.39 68.21 70.25 68.79 63.11
Ours - - 78.83 87.23 76.62 81.66 81.83 75.53 76.21 72.12 74.53 77.56 75.10 68.33
* Best results are shown in bold; second-best results are underlined.
Table 3: Experimental comparison results on the Card-MRI dataset (Setting 1). Each row lists Method, Source, Year, then LV-BP, LV-MYO, RV, mDSC, and mBF1.
SSL-ALPNet[19] TMI 2022 83.99 66.74 79.96 76.90 68.35
SR&CL[42] MICCAI 2022 84.74 65.83 78.41 76.32 69.22
ADNet++[20] MIA 2023 82.79 58.67 67.57 69.68 64.43
Q-Net[43] IntelliSys 2023 90.25 65.92 78.19 78.15 71.69
CRAPNet[17] WACV 2023 83.02 65.48 78.27 75.59 69.08
CAT[18] MICCAI 2023 90.54 66.85 79.71 79.03 73.36
RPT[44] MICCAI 2023 89.57 66.82 80.17 78.85 74.66
PFMNet[45] CMIG 2024 86.35 61.58 74.38 74.10 69.31
DGPANet[46] TIM 2024 89.82 67.62 80.09 79.18 74.13
CGNet[41] CMIG 2025 87.82 64.28 75.33 75.81 68.19
PGRNet[47] TMI 2025 88.52 62.59 77.47 76.52 70.22
ViT-CAPS[48] Neuro. 2025 86.53 60.86 76.57 74.65 67.39
BiASAM[49] ISPL 2025 88.12 63.59 77.23 76.31 70.28
MASG-SAM[50] JBHI 2025 89.35 65.93 78.88 78.05 73.36
U-Net[23] MICCAI 2015 84.28 61.39 75.48 73.71 67.77
ISONet[60] ESWA 2025 86.86 61.83 75.85 74.84 65.25
Ours - - 91.44 67.85 80.65 79.98 75.45
* Best results are shown in bold; second-best results are underlined.
As shown in the qualitative results in Figure 6, our method accurately identifies four organ regions and delineates the foreground edge
regions. Specifically, our method can accurately predict the intact bound-
ary of the liver and spleen organ compared with other methods, effectively
mitigating inconsistency caused by background similarity. Additionally, it
exhibits superior performance in capturing edge details for the LK and RK
organs. For the Abd-CT and Card-MRI datasets, our method can provide a
more comprehensive depiction of organ boundaries compared to other base-
line methods, effectively mitigating false segmentation issues caused by am-
biguous boundaries.
Figure 6: Qualitative results on the Abd-MRI and Abd-CT datasets under Setting 1.
1) Combined Ablation Analysis for BPCL and QGPO Modules: In the
proposed PONet, we introduce the BPCL and QGPO modules to enhance
segmentation performance for medical FSS tasks. To evaluate the effective-
ness of these modules, we conduct a joint ablation study. Specifically, we
design three different experimental settings. Setting (a) removes the BPCL
and QGPO modules from the PONet architecture, directly guiding the query
segmentation with a prototype generated directly from the support image. Settings
(b) and (c) build upon setting (a) by incorporating the proposed BPCL and
QGPO modules, respectively.
The experimental results are shown in Table 4. By comparing setting (b)
with setting (a) and our method with setting (c), we evaluate the network’s
performance with and without the BPCL module. The quantitative results
show that the mean DSC score improves by 6.53% and 4.75%, respectively.
Furthermore, as shown in the visualization results in Figure 8, the LK and
spleen organs often experience false segmentation due to boundary confusion
and organ similarity. Our method introduces the BPCL module to reduce
the class inconsistency between LK and the spleen, thereby achieving more
accurate segmentation results. For instance, in setting (a), the DSC for LK
is 72.55%, with 822 false positive (FP) pixels caused by boundary confusion.
In setting (b), the DSC for LK increases to 80.55%, with the FP pixels re-
duced to 283, demonstrating that the introduction of BPCL leads to a 65%
reduction in FP. By comparing setting (c) with setting (a) and our method
with setting (b), we evaluate the network’s performance with and without
the QGPO module. The quantitative results show that the mean DSC score
improves by 8.22% and 4.25%, respectively, demonstrating that the integra-
tion of the QGPO module effectively enhances segmentation performance.
Furthermore, the segmentation results in Figure 8 further illustrate that in-
corporating the QGPO module leads to more complete segmentation results.
In addition to the visual comparisons, the DSC box plot clearly shows the
distribution across different settings. As shown in Figure 9, the box plots of
mean DSC under different settings further highlight that the integration of
the BPCL and QGPO modules not only improves accuracy but also enhances
robustness.
To further verify the effectiveness of the proposed BPCL and QGPO
modules, we apply the t-SNE [61] to visualize the feature distributions under
different settings. The t-SNE is performed during the testing phase, and
each point represents a query feature. As shown in Figure 10, in setting
(a), we observe that the inter-class distance between features is small, while
the intra-class distance is large, resulting in insufficient distinguishability be-
tween feature samples. In contrast, after integrating BPCL and QGPO, the
inter-class distance between features increases, and the intra-class distance
decreases, thereby enhancing the distinguishability between feature samples.
The visualization clearly shows that query features of our method exhibit
intra-class cohesion and inter-class separation, which contributes to improved
segmentation performance.
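For reference, such a visualization can be produced with scikit-learn as sketched below; `feats` and `labels` are hypothetical arrays of query feature vectors and their class indices.

```python
# Sketch of the t-SNE visualization: project (N, C) query features to 2D and
# color by class. `feats` and `labels` are hypothetical inputs.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

emb = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(feats)
plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=5)
plt.show()
```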
Table 4: Ablation results for the BPCL and QGPO modules on the Abd-MRI dataset (Setting 1). ✗ indicates the module is removed; ✓ indicates it is included.

Setting | BPCL | QGPO | LK | RK | Spleen | Liver | mDSC
(a) | ✗ | ✗ | 76.23 | 75.63 | 68.58 | 73.54 | 73.49 ± 4.64
(b) | ✓ | ✗ | 82.38 | 83.21 | 74.58 | 79.83 | 80.02 ± 4.18
(c) | ✗ | ✓ | 85.14 | 84.85 | 72.63 | 80.24 | 81.21 ± 3.70
Our method | ✓ | ✓ | 88.66 | 90.26 | 80.32 | 84.60 | 85.96 ± 2.64
Figure 9: Box plots of mean DSC under different settings on the Abd-MRI dataset.
Figure 10: Visualization of t-SNE embedding for the query features. The colors purple,
blue, green, red, and yellow correspond to the background, LK, RK, liver, and spleen,
respectively.
2) Ablation on the Number of Superpixel Blocks N: To make full use of support information, we propose using the SLIC method to over-segment the support image and then extract multiple support foreground and
background sub-region prototypes. An ablation study is conducted to assess
the impact of the number of superpixel blocks N on PONet performance.
Specifically, we assess the DSC scores for four organs across various settings of
N. As shown in Figure 11, the DSC scores for each organ increased to varying
degrees as the number of regions N increased. More specifically, from N = 10, the DSC scores
of the four organs increased significantly, reaching the peak at N = 100,
whereas the DSC scores of the RK and LK organs began to decrease at N =
140. This trend suggests that increasing the number of partitions can enhance
the representativeness of subregions, thereby improving the refinement of
support prototypes by eliminating less relevant parts. However, for smaller-
shaped organs such as the LK and RK, an excessive number of subregions
can result in insufficient feature information within each block, potentially
introducing noise and degrading segmentation performance.
3) Ablation on the Number of Stacked QGPO Modules M: We vary the number of stacked QGPO modules M and evaluate the DSC scores for each organ. As shown in Figure 12, the highest mean DSC score of 85.96 was achieved with M = 3. Beyond this
configuration, additional stacking does not result in further improvements.
While increasing the number of modules helps accelerate model convergence,
too many stacked QGPO modules tend to overfit the model and thus fail to
achieve superior segmentation performance.
To further explore the QGPO module’s impact on performance across
networks with varying parameters, we introduce the parameter quantity
shifting-fitting performance (PQS-FP) coordinate system proposed by Xi-
ang et al. [62]. This framework categorizes models into two regions based
on their behavior as the number of parameters increases: the underfitting
decay region (UAR) and the overfitting deterioration region (OER). Specif-
ically, we begin by reducing the number of channels in the ResNet-50 back-
bone, changing the default configuration from C1 = [256, 512, 1024, 2048] to
C2 = [128, 256, 512, 1024]. We then compare the impact of varying the num-
ber of QGPO blocks on the network performance under these two configu-
rations. The quantitative experimental results are shown in Table 5. Then,
based on the network performance under different settings, we further plotted
the localization of these settings in the PQS-FP coordinate system. As shown
in Figure 13, the configurations with C1 and M ∈ {1, 2, 3, 5} fall into the UAR category. In this regime, increasing the number of parameters helps to reduce underfitting, thereby improving segmentation accuracy. However, with C1 and M ∈ {7, 11}, the configurations fall into the OER category, where increasing the number
of parameters exacerbates overfitting. Although the complexity of the model
is enhanced, segmentation performance is degraded. This trend is consistent
with the observed DSC mean scores in Table 5. The introduction of the
PQS-FP coordinate system effectively captures the relationship between the
number of parameters and segmentation performance when stacking different
quantities of QGPO models.
Figure 12: Mean DSC scores for different settings of M.
Table 5: DSC scores for different M module settings under varying channel numbers.
Figure 13: The positioning of different settings in the PQS-FP coordinate system.
Table 6: The quantitative experimental comparison results of different optimizers.
Figure 14: Comparison of qualitative segmentation results under the 1-shot and 5-shot
settings.
6) Complexity analysis:
The proposed PONet uses the Mamba module to refine support proto-
types. Mamba’s success is mainly attributed to its ability to capture long-
range dependencies while maintaining linear complexity with respect to the input sequence length, making it a promising alternative to CNNs and Transform-
ers. To validate its effectiveness, we conducted an ablation study, evaluating
the accuracy and complexity of the proposed method under different con-
figurations, including the Convolutional Block Attention Module (CBAM)
[65], Transformer [66], Gated Axial-Attention (GAA) [67], and Mamba. To
evaluate the trade-off between performance and cost, we report the results
of inference runtime, number of parameters (Params), floating-point oper-
ations (FLOPs), frames processed per second (FPS), and mean DSC. As
shown in Table 8, integrating Mamba into the QGPO module delivers opti-
mal adaptation performance, yielding the best DSC. Furthermore, in terms
of model efficiency, the Mamba-based model outperforms the other configu-
rations. These findings indicate that the introduction of Mamba effectively
balances performance and efficiency, making it suitable for resource-limited
clinical settings. Additionally, we believe that improving the efficiency of
the FSS model without compromising segmentation performance could be a
promising direction for future research.
Table 8: Comparison of inference runtime [s], number of parameters (Params) [M], floating-point operations (FLOPs) [G], frames processed per second (FPS), and mean DSC across different configurations.
5. Conclusion
In this work, we introduce PONet, a model specifically designed for med-
ical image few-shot segmentation. PONet aims to address common issues of
inter-class and intra-class inconsistency in medical image FSS tasks. Specif-
ically, we use the SLIC method to over-segment the support images and
generate multiple support foreground and background sub-regions using the
support mask. We then propose the BPCL module, which employs con-
trastive learning to reduce inconsistency from adjacent-boundary background
regions, decreasing the likelihood of them being activated as foreground and
thus minimizing inter-class inconsistency. Next, we introduce the QGPO
module, which employs a query-guided support foreground prototype opti-
mization strategy to make support foreground prototypes more adaptable to
the content of the query image, thereby minimizing intra-class inconsistency.
Extensive experimental results show that the proposed PONet outperforms
other SOTA methods.
Although our FSS segmentation method has made progress, it still faces
challenges with limited labeled data. In the future, we aim to explore more
effective approaches, such as very few-shot or zero-shot learning, to improve
the model’s performance in real-world scenarios with scarce annotations. In
addition, we will focus on addressing real-time performance in clinical envi-
ronments, optimizing the model to achieve instant response without compro-
mising accuracy.
Abbreviations
Computed tomography (CT); Magnetic resonance imaging (MRI); Few-
shot segmentation (FSS); Prototype optimization network (PONet); Bound-
ary prototype contrastive learning (BPCL); Query guidance prototype op-
timization (QGPO); Simple linear iterative clustering (SLIC); State-of-the-
art (SOTA); Convolutional neural networks (CNN); Abdominal MRI (Abd-
MRI); Abdominal CT (Abd-CT); Cardiac MRI (Card-MRI); Mask average
pooling (MAP); Class-agnostic segmentation network (CANet); Large lan-
guage models (LLMs); Cross feature module (CFM); Support guide query
(SGQ); Search and filtering (S&F); Segment anything model (SAM); Global
averaging pooling (GAP); Query prototype generation (QPG); Selective state
space model (SSM); Layer normalization (LN); 1d convolution (Conv1d);
Ground truth (GT); Binary cross-entropy (BCE); Combined Healthy Ab-
dominal Organ Segmentation Challenge (CHAOS); Left kidney (LK); Right
kidney (RK); Left ventricle myocardium (LV-MYO); Left ventricular blood
pool (LV-BP); Right ventricle (RV); Dice Similarity Coefficient (DSC); Triplet
loss (TL); Parameter quantity shifting-fitting performance (PQS-FP); Under-
fitting decay region (UAR); Overfitting deterioration region (OER); Convo-
lutional Block Attention Module (CBAM); Gated Axial-Attention (GAA);
Parameters (Params); Floating-point operations (FLOPs).
Declarations
This work was supported in part by the National Natural Science Foun-
dation of China under Grant nos. 62306187, 62403108, the Ministry of In-
dustry and Information Technology Project TC220H05X-04, the Liaoning
Provincial Natural Science Foundation Joint Fund 2023-MSBA-075 and the
Fundamental Research Funds for the Central Universities N2426005.
References
[1] B. H. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, K. Farahani,
J. Kirby, Y. Burren, N. Porz, J. Slotboom, R. Wiest, et al., The multi-
modal brain tumor image segmentation benchmark (BRATS), IEEE Transactions on Medical Imaging 34 (10) (2014) 1993–2024.
segmentation for radiation treatment planning: A critical review, Ra-
diotherapy and Oncology 160 (2021) 185–191.
[8] J. Gu, F. Tian, I.-S. Oh, RAMIS: Increasing robustness and accuracy in medical image segmentation with hybrid CNN-transformer synergy, Neurocomputing 618 (2025). doi:10.1016/j.neucom.2024.129009.
[11] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollár, R. Girshick, Segment anything, in: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2023, pp. 3992–4003. doi:10.1109/ICCV51070.2023.00371.
[18] Y. Lin, Y. Chen, K.-T. Cheng, H. Chen, Few shot medical image seg-
mentation with cross attention transformer, in: H. Greenspan, A. Mad-
abhushi, P. Mousavi, S. Salcudean, J. Duncan, T. Syeda-Mahmood,
R. Taylor (Eds.), Medical Image Computing and Computer Assisted In-
tervention – MICCAI 2023, Springer Nature Switzerland, Cham, 2023,
pp. 233–243.
mentation, IEEE Transactions on Medical Imaging 39 (6) (2020) 1856–1867. doi:10.1109/TMI.2019.2959609.
[31] B. Yang, C. Liu, B. Li, J. Jiao, Q. Ye, Prototype mixture models for few-
shot semantic segmentation, in: Computer Vision–ECCV 2020: 16th
European Conference, Glasgow, UK, August 23–28, 2020, Proceedings,
Part VIII 16, Springer, 2020, pp. 763–778.
in: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, 2023, pp. 7131–7140.
[34] L. Zhu, T. Chen, D. Ji, J. Ye, J. Liu, Llafs: When large language
models meet few-shot segmentation, in: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, 2024, pp.
3065–3075.
[35] Y. Yang, Y. Gao, L. Wei, M. He, Y. Shi, H. Wang, Q. Li, Z. Zhu,
Self-support matching networks with multiscale attention for few-shot
semantic segmentation, Neurocomputing 594 (2024). doi:10.1016/j.neucom.2024.127811.
[36] X. Wang, Q. Chen, Y. Yang, Word vector embedding and self-
supplementing network for generalized few-shot semantic segmentation,
Neurocomputing 613 (2025). doi:10.1016/j.neucom.2024.128737.
[37] Z. Wu, S. Zhao, Y. Zhang, Y. Jin, DefectSAM: Prototype prompt guided SAM for few-shot defect segmentation, IEEE Transactions on Instrumentation and Measurement 74 (2025). doi:10.1109/TIM.2025.3548183.
[38] J. Wang, Y. Liu, Q. Zhou, Z. Wang, F. Wang, Few-shot semantic seg-
mentation on remote sensing images with learnable prototype, IEEE Transactions on Geoscience and Remote Sensing 63 (2025). doi:10.1109/TGRS.2025.3568475.
[39] P. Teng, W. Liu, X. Wang, D. Wu, C. Yuan, Y. Cheng, D.-S. Huang,
Beyond singular prototype: A prototype splitting strategy for few-shot
medical image segmentation, Neurocomputing 597 (2024). doi:10.1016/j.neucom.2024.127990.
[40] K. Tang, S. Wang, Y. Chen, Cross modulation and region contrast learn-
ing network for few-shot medical image segmentation, IEEE Signal Pro-
cessing Letters (2024).
[41] W. Gong, Y. Luo, F. Yang, H. Zhou, Z. Lin, C. Cai, Y. Lin, J. Chen,
Cgnet: Few-shot learning for intracranial hemorrhage segmentation,
Computerized Medical Imaging and Graphics 121 (2025). doi:10.1016/j.compmedimag.2025.102505.
[42] R. Wang, Q. Zhou, G. Zheng, Few-shot medical image segmentation
regularized with self-reference and contrastive learning, in: L. Wang,
Q. Dou, P. T. Fletcher, S. Speidel, S. Li (Eds.), Medical Image Com-
puting and Computer Assisted Intervention – MICCAI 2022, Springer
Nature Switzerland, Cham, 2022, pp. 514–523.
[50] W. Zhou, G. Guan, Y. Gao, P. Si, M. Xu, Q. Yan, Masg-sam: Enhanc-
ing few-shot medical image segmentation with multi-scale attention and
semantic guidance, IEEE Journal of Biomedical and Health Informatics
(2025) 1–12. doi:10.1109/JBHI.2025.3571430.
[52] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, 2016, pp. 770–778. doi:10.1109/CVPR.2016.90.
[56] B. Landman, Z. Xu, J. Igelsias, M. Styner, T. Langerak, A. Klein,
Miccai multi-atlas labeling beyond the cranial vault–workshop and
challenge, in: Proc. MICCAI Multi-Atlas Labeling Beyond Cranial
Vault—Workshop Challenge, Vol. 5, 2015, p. 12.
[61] L. van der Maaten, G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research 9 (2008) 2579–2605.
Conference on Computer Vision (ECCV), Munich, Germany, 2018. doi:10.1007/978-3-030-01234-2_1.
Jianning Chi received the Ph.D. degree in Computer Sci-
ence from the University of Saskatchewan, Canada, in 2017.
He is an associate professor with the Faculty of Robot Sci-
ence and Engineering, Northeastern University, Shenyang,
China. His research interests include image quality enhance-
ment, object recognition, and scene understanding.
Declaration of interests
☒ The authors declare that they have no known competing financial interests or personal relationships
that could have appeared to influence the work reported in this paper.
☐ The authors declare the following financial interests/personal relationships which may be considered
as potential competing interests:
Siqi Wang received the M.S. degree from Northeastern Uni-
versity in 2021. He is currently pursuing the Ph.D. degree in
the Faculty of Robot Science and Engineering, Northeastern
University, Shenyang, China. His research interests include
medical image segmentation, deep learning, and feature fu-
sion.
Xiujing Gao received the Ph.D. degree from the Tokyo
University of Marine Science and Technology, Japan. He
is currently a professor with the College of Intelligent Ma-
rine Science and Technology, Fujian University of Technol-
ogy, Fuzhou, China. His research interests include underwa-
ter robotics, unmanned vessels, and autonomous navigation.