
Pattern Recognition 152 (2024) 110491


UCTNet: Uncertainty-guided CNN-Transformer hybrid networks for medical image segmentation

Xiayu Guo a, Xian Lin a, Xin Yang a, Li Yu a, Kwang-Ting Cheng b, Zengqiang Yan a,∗

a School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan, 430074, China
b School of Engineering, Hong Kong University of Science and Technology, Kowloon, Hong Kong, China

ARTICLE INFO

Keywords:
CNN-Transformer hybrid
Uncertainty
Functional overlap
Masked self-attention
Medical image segmentation

ABSTRACT

Transformer, born for long-range dependency establishment, has been widely studied as a complement to convolutional neural networks (CNNs) in medical image segmentation. However, existing CNN-Transformer hybrid approaches simply pursue implicit feature fusion without considering the underlying functional overlap between the two. Medical images typically follow stable anatomical structures, making convolution capable of handling most segmentation targets. Without differentiation, forcing transformers to operate self-attention over all image patches would result in severe redundancy, hindering global feature extraction. In this paper, we propose a simple yet effective hybrid network named UCTNet, in which transformers focus only on establishing global dependency for the CNN's unreliable regions, identified through uncertainty estimation. In this way, CNN and transformer are explicitly fused to minimize functional overlap. More importantly, with fewer regions to handle, UCTNet converges better and learns more robust feature representations for hard examples. Extensive experiments on publicly-available datasets demonstrate the superiority of UCTNet over state-of-the-art approaches, achieving Dice similarity coefficients of 89.44%, 92.91%, and 91.15% on Synapse, ACDC, and ISIC2018 respectively. Furthermore, the proposed CNN-Transformer hybrid strategy is highly extendable to other frameworks without introducing additional computational burdens. Code is available at https://github.com/innocence0206/UCTNet.

1. Introduction

With the development and popularization of medical imaging equipment, there is a growing demand for automatic image segmentation to assist clinical diagnosis [1,2]. Despite the great success of convolutional neural networks (CNNs) [3–5], the inductive bias of locality and weight sharing in convolution has become a major concern, and has made the vision transformer (ViT) [6] revolutionary in medical image analysis thanks to its impressive performance in capturing long-range/global dependency. Nonetheless, ViT's data-hungry nature makes it hard to fully unleash its potential given relatively limited medical image data, and its daunting computational complexity may bring unacceptable training costs, especially for volumetric medical image segmentation.

To alleviate the deployment burdens of transformers, one straightforward way is to replace the vanilla ViT with lightweight architectures such as the Swin transformer [11], while another, more commonly adopted way in medical image segmentation is to develop CNN-Transformer hybrid architectures that capture both local and global information. Representative CNN-Transformer hybrid architectures are illustrated in Fig. 1, including TransUNet [7] (Fig. 1(a)), which inserts sequential transformer layers in the deepest stage of UNet [3] and operates self-attention at lower spatial resolutions to reduce computation cost; UNETR [8] (Fig. 1(b)), which stacks transformer and CNN blocks as the encoder and the decoder respectively; nnFormer [9] (Fig. 1(c)), which adopts an interleaved combination of convolution and self-attention operations; and PHTrans [10] (Fig. 1(d)), which parallelizes Swin transformer and CNN blocks to simultaneously aggregate global and local representations. Despite the competitive results achieved by these hybrid architectures, one fundamental issue has been overlooked: not all patches require global contextual information equally. In other words, most image patches rely only on local information for segmentation, which can be well captured by CNNs, and treating all patches equally for long-range dependency establishment in transformers may cause severe redundancy and distract convergence. This is consistent with the experimental observation that transformers in hybrid models may degenerate into by-pass modules (i.e., attention collapse [12–14]). How to minimize the functional overlap between CNN and transformer remains rarely explored in CNN-Transformer hybrid architectures.

∗ Corresponding author.
E-mail address: [email protected] (Z. Yan).

https://doi.org/10.1016/j.patcog.2024.110491
Received 6 November 2023; Received in revised form 1 April 2024; Accepted 7 April 2024
Available online 10 April 2024
0031-3203/© 2024 Elsevier Ltd. All rights reserved.

Fig. 1. Illustration of representative CNN-Transformer hybrid architectures, including TransUNet [7], UNETR [8], nnFormer [9], PHTrans [10], and the proposed UCTNet. The
main difference between UCTNet and others is the uncertainty-guided feature flow (colored in red) to explicitly minimize the functional overlap between CNNs and transformers.
(For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 2. Exemplar uncertainties in CNN predictions. In (b), CNN misidentifies the right kidney as the gallbladder and liver due to its inability to capture global dependency (e.g.,
location relationship) among organs. In (c), the uncertainty map reflects segmentation errors, indicating the potential need for more global information through transformers. (For
interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
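The uncertainty map in Fig. 2(c) is derived from the CNN's softmax output via normalized entropy, as formalized later in Section 3.2 (Eq. (1)). A minimal NumPy sketch of that computation (array shapes and the helper name are illustrative, not taken from the released code):

```python
import numpy as np

def uncertainty_map(probs: np.ndarray) -> np.ndarray:
    """Voxel-wise normalized softmax entropy, in [0, 1].

    probs: (L, H, W, D) softmax probabilities over L classes.
    Higher values mark voxels the CNN is less confident about.
    """
    eps = 1e-12  # guards log(0)
    entropy = -(probs * np.log(probs + eps)).sum(axis=0)
    return entropy / np.log(probs.shape[0])  # divide by log|L| to normalize

# Uniform probabilities give maximal uncertainty (~1.0); a near-one-hot
# distribution gives a value close to 0.
L = 8
uniform = np.full((L, 2, 2, 2), 1.0 / L)
u = uncertainty_map(uniform)
mask = u > 0.001  # binarization with a small threshold T, as in Section 4.2
```

Thresholding the map with a small T then yields the binary uncertainty map that guides where the transformer attends.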

In this work, we, for the first time, identify the functional overlap in existing CNN-Transformer architectures and propose a simple yet effective uncertainty-guided CNN-Transformer hybrid network named UCTNet for medical image segmentation. Specifically, transformer blocks are trained only on the low-confidence/uncertain regions of CNN predictions. The motivation for adopting uncertainty as guidance for ViT is derived from two observations. One is that CNN is capable of handling most segmentation targets, so building global dependencies for well-segmented regions through self-attention would result in severe redundancy given its quadratic computational complexity. The other is that global semantic information is not equally demanded across the image space. Only low-confidence or high-uncertainty segmentation regions are in great need of global contextual information, as the features captured there by CNN are not sufficiently discriminative. Exemplar CNN predictions from a hybrid model are presented in Fig. 2(b). Compared to the ground truth in Fig. 2(a), most regions are correctly segmented except for the right kidney, a portion of which is misidentified as gallbladder due to feature similarity learned by CNN. However, from the anatomical perspective, the gallbladder would never appear at the symmetrical position of the left kidney and is predominantly situated between the liver and stomach within the scan. This discrepancy highlights CNN's failure to establish the global spatial relationship across organs, which is crucial for segmenting regions lacking distinguishable features. As illustrated in the corresponding uncertainty map in Fig. 2(c), the region exhibiting the highest uncertainty, colored in deep red, precisely coincides with the right kidney, validating the potential of using uncertainty as a guide for transformers to perform self-attention selectively. In this way, in addition to lowering computation cost, the reduced number of patches (i.e., tokens) eases the demand for extensive training data, leading to better convergence and the capability to capture more discriminative global information. Moreover, considering the poor performance of transformers on boundary detection (due to rigid patch partitioning), uncertain regions are further refined in a hierarchical way to remove blurred boundaries. Through such designs, CNN and transformer are explicitly fused with minimal functional overlap. The main contributions are summarized as follows:

1. A simple yet effective CNN-Transformer hybrid architecture, UCTNet, for medical image segmentation, aiming to fully combine the strengths of CNN and transformer while minimizing feature redundancy.
2. A plug-and-play uncertainty-guided vision transformer (UgViT) block where transformers build long-range dependencies only for CNN's uncertain regions, reducing training difficulty and thereby improving feature representation robustness.
3. Superior segmentation performance against state-of-the-art approaches on multiple publicly-available datasets.

The rest of this paper is organized as follows. Section 2 reviews related work on medical image segmentation with transformer-based and hybrid architectures. Section 3 describes the proposed UCTNet in detail. We present a thorough evaluation against state-of-the-art methods on publicly-available datasets in Section 4 and ablation studies in Section 5. Section 6 concludes this paper.

2. Related work

2.1. Transformer-based architectures

Despite the great success of CNNs in medical image analysis, their inductive bias has been a rising concern in performance improvement. Transformer, capable of capturing long-range dependency through self-attention, is supposed to be a more powerful tool than CNNs. Unfortunately, the data-hungry nature of ViT usually makes it sub-optimal when trained on small-scale medical image datasets [15,16]. Therefore, the Swin transformer [11], with its introduction of local bias, has become a more popular baseline module than the naive vision transformer in pure transformer-based approaches. For instance, Cao et al. [17] proposed a UNet-like segmentation model


Fig. 3. Overview of the proposed UCTNet and its core module UCTBlock. EConvBlock and DConvBlock denote the convolution blocks in the encoder and decoder of UCTNet respectively, sharing the same structure comprising two 3 × 3 × 3 convolutional layers, each followed by a GELU non-linear activation layer and an instance normalization layer. The feature maps are halved (doubled) after each encoder (decoder) stage.
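The EConvBlock/DConvBlock structure described in the caption can be sketched in PyTorch as follows. This is a direct reading of the caption text (two 3 × 3 × 3 convolutions, each followed by GELU and instance normalization); channel counts are illustrative, and the released code may order or parameterize the layers differently:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """EConvBlock/DConvBlock per the Fig. 3 caption: two 3x3x3
    convolutions, each followed by GELU and instance normalization.
    Spatial size is preserved; inter-stage resampling happens elsewhere."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.GELU(),
            nn.InstanceNorm3d(out_ch),
            nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.GELU(),
            nn.InstanceNorm3d(out_ch),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)

# (batch, channels, H, W, D) in -> same spatial size, new channel count out.
x = torch.randn(1, 32, 16, 32, 32)
y = ConvBlock(32, 64)(x)  # shape (1, 64, 16, 32, 32)
```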

with pure Swin transformers, achieving results competitive with CNNs. DS-TransUNet [18] designed an encoder comprising dual-size Swin transformer blocks to process multi-scale feature representations, and a fusion module to build global interactions among multi-scale features through a self-attention mechanism. VT-UNet [19] employed window-based cross-attention and self-attention in parallel in the up-sampling path to preserve full global contextual information during feature decoding. Though resorting to the Swin transformer helps alleviate data dependence, we argue that a model built entirely on transformers is computationally expensive and inferior to CNNs in capturing local features. This explains why most existing approaches adopt CNN-Transformer hybrid architectures for medical image segmentation.

2.2. CNN-Transformer hybrid architectures

One straightforward way to complement global contextual information is to use CNNs and transformers jointly. TransUNet [7] and TransBTS [20] deployed a cascade of transformer layers at the lowest-resolution stage of UNet [3] to encode global features. Transformers in MISSFormer [21] and CoTr [22] used multi-scale features from the CNN encoder to model long-range dependencies. To condition up-sampling features on local and detailed features from the down-sampling path, Li et al. [23] introduced transformer blocks into the decoder of UNet [3]. Wang et al. [24] proposed to bridge the semantic and resolution gap between low-level and high-level features through effective feature fusion with multi-scale channel-wise cross-attention in skip connections. In nnFormer [9], convolution and self-attention operations were interleaved, responsible for precise spatial information and global interaction respectively. To continuously aggregate hierarchical representations from global and local features, PHTrans [10] designed a parallel hybrid architecture based on Swin transformers and CNNs. CiT-Net [25] adopted a dual-branch structure consisting of a dynamically adaptive CNN branch and a cross-dimensional fusion transformer branch, both following the encoder–decoder design to maximize the preservation of local and global features. CTC-Net [26] adopted a multi-encoder-single-decoder architecture and a cross-domain fusion block to fuse CNN-Transformer hybrid features. TransHRNet [27] used parallel transformers to connect CNN-Transformer features from different resolution streams. Despite the promising results achieved by these hybrid architectures, they fail to explicitly differentiate the functions of CNN and transformer. In other words, the two sub-networks strive to segment every target equally, resulting in feature overlap and poor convergence in global feature extraction.

3. Method

3.1. Overview

UCTNet follows a simple encoder–decoder architecture, where the encoder consists of pure convolution blocks denoted as EConvBlocks and the decoder is comprised of uncertainty-guided CNN-Transformer blocks (UCTBlocks), as illustrated in Fig. 3. The encoder and decoder are organized in a U-shape and connected by skip connections. To minimize the functional overlap between CNN and transformer in each UCTBlock, in addition to the typical joint cross-entropy and Dice loss imposed on DConvBlock's output, a separate cross-entropy loss over uncertain regions is imposed on the corresponding uncertainty-guided vision transformer (UgViT) to accelerate its convergence.

The complete workflow of UCTBlock is described in Fig. 3. Given a 3D input x, the encoder based on EConvBlocks first extracts hierarchical feature maps {f_0^E, f_1^E, …, f_{S−1}^E} (i.e., from high to low resolution), where S denotes the number of stages, before they are fed into the decoder. For any stage i ∈ S of the decoder, f_i^E and the feature maps f_{i+1}^D from the previous decoder stage (i.e., the (i+1)th UCTBlock) are combined and fed to its DConvBlock, constructing the input of UgViT, denoted as z_i. Then, z_i is forwarded for uncertainty estimation and boundary removal to produce an uncertainty mask M_i with the same spatial size as z_i. z_i is split into 3D patches and fed to UgViT together with M_i for training. For the uncertain regions in M_i, the contextual information encoded by DConvBlock may not be sufficient for accurate segmentation, so building long-range dependency based on M_i through transformers is necessary. After refinement through global interactions across image patches by self-attention, the patch sequence is reshaped to the volumetric space and then added to DConvBlock's output via a residual connection. In this way, the features contain both fine-grained spatial information and enriched global information for uncertain regions. The fused feature representations are then passed to the next UCTBlock, and mapped to a precise prediction mask in the final stage. Each component is described in the following.

3.2. Uncertainty map calculation

Based on the observation that CNN can already segment most regions accurately while the remaining hard-to-segment regions usually bear high uncertainty, we make the transformers encode global contextual information not uniformly but selectively, under the guidance of uncertainty estimation, to minimize functional overlap. Next, we explain how to localize and refine uncertain regions (see Fig. 4). Uncertainty estimation is realized via normalized softmax entropy, whose magnitude indicates the degree of uncertainty. To be specific, the output z_i ∈ R^{H×W×D×C} of DConvBlock in the ith UCTBlock is first mapped to z'_i ∈ R^{H×W×D×L}, where (H, W, D) represents the spatial shape of z_i, C is the channel dimension of z_i, and L is the number of categories; then a voxel-wise uncertainty map U_i ∈ [0, 1]^{H×W×D} is formulated as:

U_i(h, w, d) = − ( Σ_{l∈L} z'_i(h, w, d, l) log z'_i(h, w, d, l) ) / log |L|,   (1)

where z'_i(h, w, d, l) denotes the prediction probability of class l at position (h, w, d). To tell UgViT in UCTBlock where to focus or neglect, an


Fig. 4. Detailed training visualization of the proposed uncertainty-guided transformer.

appropriate threshold T is selected to perform binarization, producing a binary map U_i^B ∈ {0, 1}^{H×W×D}. Voxels with normalized entropy higher than T are set to 1 and fed to UgViT to perform self-attention, while the others are filtered out. Ideally, the worse z'_i is, the higher U_i becomes. However, models might be over-confident and assign a large prediction probability to a wrong class/label, resulting in quite low entropy. Hence, we suggest choosing T as small as possible, aiming to preserve wrongly-segmented regions while avoiding the inclusion of too many well-segmented regions.

In addition to the interference from certain regions, boundaries, usually the most challenging regions in most cases, would be counted as uncertainties. In other words, a vast number of pixels in U_i^B would belong to boundaries. Unfortunately, vanilla vision transformers struggle to preserve accurate boundary information due to rigid patch partitioning [28]. That is, if transformers are forced to learn representations for boundaries, the corresponding losses are likely to dominate the training process, which in turn hinders transformers from exploring truly valuable global information. To address this, we propose to remove boundaries in a hierarchical way across different stages, where a boundary map B_i ∈ {0, 1}^{H×W×D} is calculated by determining whether any position (h, w, d) has the same class label as its adjacent voxels in an r_i × r_i neighborhood. Across different stages, r_i is dynamically adjusted according to the feature resolution. Then, a mask of uncertain regions M_i ∈ {0, 1}^{H×W×D} is obtained by

M_i = U_i^B ∩ B̄_i,   (2)

where B̄_i ∈ {0, 1}^{H×W×D} is the complement of B_i. After removing boundary areas, the proportion of addressable uncertain regions in M_i is further increased, benefiting the convergence of UgViT for performance improvement.

3.3. Uncertainty-guided vision transformer

Leveraging the computed uncertainty map, UgViT takes z_i ∈ R^{H×W×D×C} and M_i ∈ {0, 1}^{H×W×D} as inputs; its basic structure is the same as the vanilla ViT but with masked multi-head self-attention (MMSA). In UgViT, the volumetric input z_i is first evenly divided into image patches, then reshaped and embedded into a 1D sequence x_i ∈ R^{N×d_m}, where d_m = (P_h × P_w × P_d) × C, (P_h, P_w, P_d) is the resolution of each patch, and N = (H × W × D)/(P_h × P_w × P_d) is the number of patches. In MMSA, a linear layer subsequently projects x_i to construct the triplet (Q, K, V) ∈ R^{N×d'_m}, where d'_m is the dimension of the feature embeddings. The core idea of UgViT is to erase the attention relationships established for low-uncertainty regions. In implementation, we convert M_i to a dimension-matched mask and set the target rows in the attention matrices to zero. Specifically, we split M_i into P_h × P_w × P_d patches and transform them into a sequence of length N through max pooling. That is, a patch is preserved for computation as long as it contains uncertain pixels. It is noteworthy that, since the segmentation loss imposed on UgViT ignores certain regions, UgViT remains focused solely on addressing uncertainties within the segmentation task. As the size of the attention matrices is N × N, the sequence is expanded to a mask denoted as M'_i ∈ R^{N×N}. Then, MMSA is calculated as

MMSA(Q, K, V, M'_i) = M'_i ⋅ Softmax(QK^T / √d'_m) V.   (3)

Features of UgViT and DConvBlock are then fused before being fed to the next UCTBlock. At this point, the overall framework has been illustrated completely, and the calculation procedure in the ith UCTBlock can be summarized as:

z_i = DConvBlock([f_i^E, f_{i+1}^D]),   (4)
M_i = RemBound(UncerEst(z_i)),   (5)
f_i^D = UgViT(M_i, z_i) + z_i,   (6)

where f_i^E is the output of the ith EConvBlock in the encoder, f_{i+1}^D is the output of the previous decoder stage (i.e., the (i+1)th UCTBlock), and [⋅] represents concatenation supplementing low-level spatial information from the encoder. UncerEst(⋅) and RemBound(⋅) denote the uncertainty estimation and boundary removal operations respectively. According to Eq. (3), the computational complexity of the vanilla ViT is

Ω_ViT = 4N d_m d'_m + 2N² d'_m,   (7)

while for UgViT the computational complexity is reduced to

Ω_UgViT = 4N d_m d'_m + (1 + γ)N² d'_m,   (8)

where γ = (Σ_{j=1}^{N} M'_{j,0}) / N < 1 is the ratio of patches with established self-attention relationships. It should be noted that we introduce UgViT only to the decoder, in consideration of the quality of uncertainty estimation. Lacking high-level semantic information, the encoder has difficulty producing high-quality segmentation, resulting in a large portion of uncertain regions and degenerating UgViT into a vanilla vision transformer. Comparatively, after multi-layer feature extraction, rich semantic information has been included in the decoder, enabling the generation of applicable uncertainty estimates and the establishment of reliable long-range dependencies.

3.4. Loss function

As discussed before, ViT easily runs into attention collapse without extra training constraints when trained on small-scale datasets. Therefore, we design an effective supervision strategy for UgViT in UCTBlock. Since it only learns to segment uncertain regions, we take the cross-entropy loss over these regions, L_TSeg^i, as the training constraint of UgViT in the ith UCTBlock, defined as

L_TSeg^i = CE(P_T^i, Y) ⋅ M_i,   (9)

where CE is short for the cross-entropy loss and P_T^i denotes the segmentation results of UgViT. Particularly, we cut off the gradient flow


between UgViT and DConvBlock; that is, L_TSeg^i updates the parameters of UgViT only. In this way, UgViT is forced to learn representations different from and more robust than DConvBlock's in order to minimize L_TSeg^i.

The overall loss function of UCTNet is written as

L = Σ_{i=0}^{S} (α_i L_TSeg^i + β_i L_CSeg^i),   (10)

where L_CSeg^i is the joint loss consisting of both the cross-entropy loss and the Dice loss for the ith DConvBlock, and {α_i, β_i} are balancing hyper-parameters.

4. Evaluation

In this section, we conduct extensive comparison experiments against state-of-the-art CNN-based and transformer-based/-hybrid approaches on publicly-available datasets.

4.1. Datasets

The following three datasets covering 2D and 3D medical image data were adopted for evaluation:

4.1.1. Synapse

The Synapse dataset¹ includes 30 abdominal CT scans with 13 organs annotated by clinical radiologists. The scans vary in the number of slices, ranging from 85 to 198, but each slice has the same spatial resolution of 512 × 512 pixels. Following the data split in [7] and the subsequent transformer-based approaches [9,10,21], 18 scans are used for training, the remaining 12 scans for validation and test, and 8 out of the 13 abdominal organs are selected for evaluation by Dice Similarity Coefficient (DSC).

4.1.2. ACDC

The ACDC dataset² contains end-diastolic and end-systolic MRI images of 100 patients, with pixel-level annotations of the left ventricle (LV), right ventricle (RV), and myocardium (MYO). We split the dataset into three subsets following [7] and the latest works [9,29,30], with 70, 10, and 20 samples for training, validation, and test respectively, and DSC is used for evaluation.

4.1.3. ISIC2018

The ISIC2018 dataset³ consists of 2594 dermoscopic skin images. To be consistent with the latest results reported in [12], we follow the same data split: 2074 images are training samples and 520 images are validation/test samples. The evaluation metrics include DSC, sensitivity (Se), specificity (Sp), accuracy (Acc), and intersection over union (IoU).

4.2. Implementation details

UCTNet is implemented under the framework of nnU-Net [5], and all experiments are conducted on one 24 GB NVIDIA GeForce RTX 3090 GPU with PyTorch 1.10.2. For Synapse, we randomly crop the original volumetric images into sub-volumes of 48 × 192 × 192 voxels for training with a batch size of 2. For ACDC and ISIC2018, we re-design UCTNet from 3D to 2D, and crop the input data into patches of 256 × 224 pixels and 256 × 256 pixels respectively, with the same batch size of 4 for training. The number of stages S is set to 5 for the encoder and 6 for the decoder of UCTNet. Empirically, we take 0.001 as the uncertainty binarization threshold T in all UCTBlocks. Given the poor accuracy of uncertainty estimation in the early training phase, we utilize a warm-up strategy. Specifically, the training process is divided into two stages of 1000 epochs each, and the parameters of UgViT are frozen until the second stage. The initial learning rates are set to 0.01 and 0.001 for the two stages respectively and are decreased gradually following a poly decay strategy. Without accessible ground truth during inference, taking the outputs of DConvBlock as pseudo labels or skipping boundary removal is optional.

4.3. Evaluation on Synapse

4.3.1. Learning frameworks for comparison

State-of-the-art CNN-based and transformer-based/-hybrid architectures have been included for comparison. CNN-based architectures include U-Net [3], Attention U-Net [4], and nnU-Net [5]; 2D transformer-based/-hybrid approaches include TransUNet [7], Swin-UNet [17], TransClaw UNet [31], LeViT-UNet-384 [32], MT-UNet [33], MISSFormer [21], CASTformer [34], CTC-Net [26], TransCASCADE [35], and APFormer [13]; 3D transformer-based/-hybrid approaches include CoTr [22], UNETR [8], UNETR++ [36], PHTrans [10], D-Former [37], and nnFormer [9]. For a fair comparison, all quantitative results are extracted from the original papers or reported by the newest published papers.

4.3.2. Quantitative results

Quantitative comparison results of the different methods on Synapse are summarized in Table 1. Among CNN-based methods, nnU-Net achieves the best performance, significantly outperforming the other comparison approaches. Among transformer-based/-hybrid approaches, D-Former delivers the best segmentation performance, outperforming nnU-Net. There is also a noticeable performance gap between 2D and 3D approaches, indicating the necessity of cross-slice information. Compared to D-Former, UCTNet achieves better segmentation performance on five out of eight organs, leading to an increase of 0.61% in average DSC. Compared to CNN-based approaches, UCTNet outperforms nnU-Net in segmenting six out of eight organs, leading to an increase of 2.45% in average DSC.

4.3.3. Qualitative results

Exemplar qualitative results produced by different approaches on Synapse are provided in Fig. 5. Despite the effectiveness of nnU-Net in segmenting most organs, insufficient global information may affect the segmentation of some small organs, leading to under-segmentation, as illustrated in the first and seventh columns of Fig. 5. Relying solely on transformers, as UNETR does, is even worse than nnU-Net, indicating poor convergence. Combining CNNs and transformers as in PHTrans achieves better segmentation performance than nnU-Net but is still sub-optimal, especially in the third and sixth columns. Comparatively, UCTNet achieves the best segmentation performance, not only in reducing false positives and false negatives but also in shape preservation. These results validate the value of minimizing the functional overlap between CNN and transformer in hybrid architectures.

4.4. Evaluation on ACDC

4.4.1. Learning frameworks for comparison

Both state-of-the-art CNN-based and transformer-based/-hybrid architectures have been included for comparison. 3D approaches include nnU-Net [5], PHTrans [10], D-Former [37], and nnFormer [9], while 2D approaches include TransUNet [7], Swin-UNet [17], LeViT-UNet-384 [32], MISSFormer [21], CTC-Net [26], TransCASCADE [35], MERIT [29], and H2Former [30]. For a fair comparison, all quantitative results are extracted from either the original papers or reported by the newest officially-published papers.

¹ https://www.synapse.org/#!Synapse:syn3193805/wiki/217789.
² https://www.creatis.insa-lyon.fr/Challenge/acdc/.
³ https://challenge.isic-archive.com/data/.


Table 1
Quantitative comparison on Synapse measured in DSC (%). Gall, L-Kid, and R-Kid are the abbreviations of gallbladder, left kidney, and right
kidney. The best and second-best results are marked in bold and underlined.
Method Year Avg. Aorta Gall L-Kid R-Kid Liver Pancreas Spleen Stomach

CNN
UNet [3] 2015 76.85 89.07 69.72 77.77 68.60 93.43 53.98 86.67 75.58
Att-UNet [4] 2019 77.77 89.55 68.88 77.98 71.11 93.57 58.04 87.30 75.75
nnU-Net [5] 2021 86.99 93.01 71.77 85.57 88.18 97.23 83.01 91.86 85.26

2D transformer
TransUNet [7] 2021 77.48 87.23 63.13 81.87 77.02 94.08 55.86 85.08 75.62
Swin-UNet [17] 2021 79.13 85.47 66.53 83.28 79.61 94.29 56.58 90.66 76.60
TransClaw UNet [31] 2021 78.09 85.87 61.38 84.83 79.36 94.28 57.65 87.74 73.55
LeViT-UNet-384 [32] 2021 78.53 87.33 62.23 84.61 80.25 93.11 59.07 88.86 72.76
MT-UNet [33] 2022 78.59 87.92 64.99 81.47 77.29 93.06 59.46 87.75 76.81
MISSFormer [21] 2022 81.96 86.99 68.65 85.21 82.00 94.41 65.67 91.92 80.81
CASTformer [34] 2022 82.55 89.05 67.48 86.05 82.17 95.61 67.49 91.00 81.55
CTC-Net [26] 2023 78.41 86.46 63.53 83.71 80.79 93.78 59.73 86.87 72.39
TransCASCADE [35] 2023 82.68 86.63 68.48 87.66 84.56 94.43 65.33 90.79 83.52
APFormer [13] 2023 83.53 90.84 64.36 90.54 85.99 94.93 72.16 91.88 77.55

3D transformer
CoTr [22] 2021 72.60 83.27 60.41 79.58 73.01 91.93 45.07 82.84 64.67
UNETR [8] 2022 79.56 89.99 60.56 85.66 84.80 94.46 56.25 87.81 73.99
UNETR++ [36] 2022 87.22 92.52 71.25 87.54 87.18 96.42 81.10 95.77 86.01
PHTrans [10] 2022 88.55 92.54 80.89 85.25 91.30 97.04 83.42 91.20 86.75
D-Former [37] 2022 88.83 92.12 80.09 92.60 91.91 96.99 76.67 93.78 86.44
nnFormer [9] 2023 86.57 92.04 70.17 86.57 86.25 96.84 83.35 90.51 86.83
UCTNet (Ours) 2023 89.44 92.86 83.74 85.95 90.97 97.06 84.51 93.28 87.12

Fig. 5. Qualitative results of different approaches on the Synapse dataset. Detailed regions are boxed for better visualization and comparison.

4.4.2. Quantitative results

Quantitative comparison results of different methods on ACDC are summarized in Table 2. Among the comparison approaches, nnU-Net, D-Former, and H2Former achieve the best performance among the CNN-based, 3D, and 2D transformer-based/-hybrid architectures respectively. Compared to CNNs, capturing long-range dependency through well-designed transformers is more beneficial, leading to better results. Different from Synapse, the SOTA 2D approach H2Former is even better than the 3D methods on ACDC, indicating that the data is highly anisotropic. Compared to H2Former, UCTNet achieves consistent performance improvements of 0.36%, 0.66%, and 0.51% in DSC for RV, Myo, and LV respectively, leading to an average increase of 0.51% in Avg. DSC.

4.4.3. Qualitative results

Exemplar qualitative results produced by different approaches on ACDC are provided in Fig. 6. Though nnU-Net can handle most targets, it encounters severe performance instability, with some organs completely missing as illustrated in the third column. Introducing transformers to complement global information may help alleviate under-segmentation, but it may also produce more false positives, as illustrated in the sixth column. Comparatively, UCTNet achieves the best and most stable segmentation performance across different cases. Especially in the third and sixth columns, UCTNet effectively recalls the segmentation of the RV while avoiding over-segmentation where no target organs are present.


Fig. 6. Qualitative results of different approaches on the ACDC dataset.

Table 2
Quantitative comparison on ACDC measured in DSC (%). The best and second-best results are marked in bold and underlined.
Model type / Method Year Avg. RV Myo LV
CNN:
U-Net [3] 2015 89.41 87.77 85.88 94.67
Att-UNet [4] 2019 89.01 87.30 85.07 94.66
nnU-Net [5] 2021 91.62 90.24 89.24 95.36
3D transformer:
PHTrans [10] 2022 91.79 90.13 89.48 95.76
D-Former [37] 2022 92.29 91.33 89.60 95.93
nnFormer [9] 2023 92.06 90.94 89.58 95.65
2D transformer:
TransUNet [7] 2021 89.71 88.86 84.53 95.73
Swin-UNet [17] 2021 90.00 88.55 85.62 95.83
LeViT-UNet-384 [32] 2021 90.32 89.55 87.64 93.76
MISSFormer [21] 2023 91.19 89.85 88.38 95.34
CTC-Net [26] 2023 90.77 90.09 85.52 96.72
TransCASCADE [35] 2023 91.63 89.14 90.25 95.50
MERIT [29] 2023 92.32 90.87 90.00 96.08
H2Former [30] 2023 92.40 91.31 90.12 95.76
UCTNet (Ours) 2023 92.91 91.67 90.78 96.27

4.5. Evaluation on ISIC2018

4.5.1. Learning frameworks for comparison

Both state-of-the-art CNN-based and transformer-based/-hybrid architectures have been included for comparison, including CA-Net [38], CPFNet [39], nnU-Net [5], TransUNet [7], MedT [40], SETR [41], TransFuse [42], FAT-Net [43], Ms RED [44], Patcher [45], and ConvFormer [12]. Among these methods, Ms RED and FAT-Net are specifically designed for skin lesion segmentation. For a fair comparison, all quantitative results of the comparison approaches are reported by the newest published work [12] following the same data split.

4.5.2. Quantitative results

Quantitative comparison results of different methods on ISIC2018 are summarized in Table 3. Ms RED achieves the best performance among the CNN-based methods, while ConvFormer achieves the best performance among the transformer-based methods. Compared to the CNN-based methods, introducing transformers may not necessarily bring performance improvements. Compared to Ms RED, UCTNet achieves consistent performance improvements of 0.90%, 1.26%, 0.31%, and 0.54% in DSC, IoU, Acc, and Sp respectively. As for ConvFormer, UCTNet outperforms it by 0.59%, 1.08%, and 0.51% in DSC, IoU, and Sp respectively.

4.5.3. Qualitative results

Exemplar qualitative results produced by different approaches on ISIC2018 are provided in Fig. 7. Given high-contrast lesions, most comparison approaches effectively localize the target regions with acceptable over-segmentation. For low-contrast lesions, they encounter severe over-segmentation, resulting in extensive false positives. In addition, transformer-based/-hybrid approaches are more likely to encounter boundary distortion, validating the analysis in Section 1. Compared to the state-of-the-art transformer-based methods, UCTNet effectively reduces false positives with better boundary preservation, validating the necessity of boundary removal in UgViT.
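The five metrics reported on ISIC2018 (DSC, IoU, Acc, Se, Sp) all derive from a binary confusion matrix. A minimal sketch, with an illustrative function name and toy masks (not the authors' evaluation code):

```python
import numpy as np

def binary_metrics(pred, gt):
    """Overlap metrics for binary segmentation, returned as fractions in [0, 1].

    DSC = 2TP/(2TP+FP+FN), IoU = TP/(TP+FP+FN), Acc = (TP+TN)/all,
    Se (sensitivity) = TP/(TP+FN), Sp (specificity) = TN/(TN+FP).
    """
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)
    tn = np.sum(~pred & ~gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    return {
        "DSC": 2 * tp / (2 * tp + fp + fn),
        "IoU": tp / (tp + fp + fn),
        "Acc": (tp + tn) / (tp + tn + fp + fn),
        "Se": tp / (tp + fn),
        "Sp": tn / (tn + fp),
    }

# Toy example: tp=1, fp=1, fn=1, tn=1.
pred = np.array([[1, 1, 0, 0]])
gt = np.array([[1, 0, 1, 0]])
m = binary_metrics(pred, gt)
```

For the toy masks above, DSC evaluates to 0.5 and IoU to 1/3, illustrating why DSC values in the tables always sit above the corresponding IoU values.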


Fig. 7. Qualitative results of different approaches on the ISIC2018 dataset.

Table 3
Quantitative comparison on ISIC2018. The best and second-best results are marked in bold and underlined.
Model type / Method Year DSC (%) IoU (%) Acc (%) Se (%) Sp (%)
CNN:
CA-Net [38] 2020 89.63 82.88 96.14 91.31 97.53
CPFNet [39] 2020 89.70 83.02 95.90 90.58 97.00
nnU-Net [5] 2021 90.22 83.76 96.48 90.96 97.98
Ms RED [44] 2022 90.25 83.77 96.44 91.30 97.42
Transformer:
TransUNet [7] 2021 88.75 81.78 95.67 91.47 96.25
MedT [40] 2021 88.75 81.74 95.85 89.32 97.80
SETR [41] 2021 89.03 82.16 96.00 89.83 97.64
TransFuse [42] 2021 89.28 82.23 96.05 90.40 97.28
FAT-Net [43] 2022 89.72 83.07 96.25 92.79 96.89
Patcher [45] 2022 89.11 82.36 96.08 90.13 97.75
ConvFormer [12] 2023 90.56 83.95 96.74 91.31 97.45
UCTNet (Ours) 2023 91.15 85.03 96.75 91.13 97.96

Table 4
Component-wise ablation study of UCTNet on ISIC2018. baseline represents the original CNN architecture. + ViT replaces UgViT with a vanilla vision transformer trained by 𝑇_𝑆𝑒𝑔 without using uncertainty as guidance. + 𝑈_𝐵 ⋅ ViT denotes UgViT without boundary removal in masked attention. UCTNet can be regarded as UgViT ∼ 𝑀 ⋅ ViT.
Components DSC (%) IoU (%) Acc (%) Se (%) Sp (%)
baseline 90.73 84.54 96.47 92.21 97.43
+ ViT 90.88 84.88 96.56 91.62 97.58
+ 𝑈_𝐵 ⋅ ViT 90.78 84.68 96.58 92.42 97.37
UCTNet 91.15 85.03 96.75 91.13 97.96

5. Discussion

In this section, a series of ablation studies is conducted for a comprehensive evaluation of UCTNet.

5.1. Component-wise ablation study of UCTNet

Quantitative results of the component-wise ablation studies on ISIC2018 are summarized in Table 4. Compared to the CNN baseline, though introducing ViT to enrich global information is beneficial, the performance improvements are marginal, validating the necessity of minimizing the functional overlap between CNNs and transformers. Directly using the baseline's uncertainty regions to guide the training of ViT is counter-productive, resulting in performance degradation across almost all metrics. This is because most pixels in 𝑈_𝐵 come from boundary regions, which are quite challenging for ViT to handle and in turn introduce more noise/interference. Comparatively, UCTNet (i.e., UgViT ∼ 𝑀 ⋅ ViT) achieves stable performance improvements with larger margins, proving the necessity of boundary removal in UgViT.

5.2. Effect of neighborhood {𝑟_𝑖} in boundary removal

As described in Section 3.2, we adopt a hierarchical strategy for boundary removal by assigning varying neighborhood sizes 𝑟_𝑖 to the top two stages. To evaluate its potential effects, a series of ablation studies under varying 𝑟_0 and 𝑟_1 is conducted on ISIC2018, as summarized in Table 5. In general, the settings of 𝑟_0 and 𝑟_1 do not significantly affect the overall segmentation performance of UCTNet. Given smaller 𝑟_0 and 𝑟_1, more boundary pixels are likely to be included in the uncertainty mask 𝑀, misleading the convergence of UgViT. Using larger 𝑟_0 and 𝑟_1 may ignore some key uncertain regions, making UgViT less beneficial.
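One hedged reading of the boundary-removal idea is sketched below: pixels whose 𝑟 × 𝑟 neighborhood mixes both predicted labels are treated as boundary and dropped from the uncertainty mask. The window-based boundary test and the function name are assumptions for illustration; the paper's exact operator in Section 3.2 may differ.

```python
import numpy as np

def remove_boundary(unc_mask, seg, r):
    """Drop from the uncertainty mask pixels within an r x r window of
    the predicted segmentation boundary.

    unc_mask, seg: (H, W) binary arrays; r: odd neighborhood size r_i.
    """
    H, W = seg.shape
    pad = r // 2
    padded = np.pad(seg, pad, mode="edge")
    near_boundary = np.zeros((H, W), dtype=bool)
    for y in range(H):
        for x in range(W):
            win = padded[y:y + r, x:x + r]
            # a pixel is near the boundary if its window mixes both labels
            near_boundary[y, x] = win.min() != win.max()
    return unc_mask.astype(bool) & ~near_boundary

# Toy example: with r=3, the two pixels around the 0->1 transition are dropped.
unc = np.ones((1, 4), dtype=int)
seg = np.array([[0, 0, 1, 1]])
kept = remove_boundary(unc, seg, r=3)
```

A larger r widens the dropped band, matching the trade-off discussed above: too small keeps misleading boundary pixels, too large discards genuinely uncertain ones.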


Fig. 8. Visualization on Synapse, including (a) CT slices, (b) ground truth, (c)–(d) segmentation maps produced by DConvBlock and UgViT respectively, (e) uncertainty maps of DConvBlock, (f)/(h) patches falling in highly uncertain regions, and (g)/(i) attention maps calculated by UgViT corresponding to the patches/tokens of (f)/(h).

Table 5
Ablation study on the hyper-parameters of boundary removal on ISIC2018. 𝑟0 and 𝑟1
denote the neighborhood sizes corresponding to the model stages 0 and 1.
𝑟1 𝑟0 DSC (%) IoU (%) Acc (%) Se (%) Sp (%)
3 5 91.01 84.86 96.72 91.10 97.85
5 7 91.07 84.87 96.74 90.94 97.88
7 9 91.15 85.03 96.75 91.13 97.96
9 11 91.07 84.93 96.74 91.07 97.86
11 13 91.05 84.89 96.71 91.15 97.88
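The tuning procedure recommended below (grid searching {𝑟_𝑖} from smaller values) can be sketched as follows. Here `train_and_eval` is a hypothetical placeholder for one training/validation run, and the stub scores merely mirror Table 5's DSC column for illustration.

```python
def grid_search(train_and_eval, pairs=((3, 5), (5, 7), (7, 9), (9, 11), (11, 13))):
    """Sweep (r1, r0) pairs from small to large; keep the best validation DSC."""
    best_pair, best_dsc = None, -1.0
    for r1, r0 in pairs:
        dsc = train_and_eval(r0=r0, r1=r1)
        if dsc > best_dsc:
            best_pair, best_dsc = (r1, r0), dsc
    return best_pair, best_dsc

# Stub scores keyed by (r0, r1), shaped like Table 5's DSC column.
scores = {(5, 3): 91.01, (7, 5): 91.07, (9, 7): 91.15, (11, 9): 91.07, (13, 11): 91.05}
best_pair, best_dsc = grid_search(lambda r0, r1: scores[(r0, r1)])
```

With these stub scores the sweep settles on (r1, r0) = (7, 9), consistent with the best row of Table 5; in practice each evaluation is a full training run, so the search is coarse by necessity.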

Table 6
Ablation studies of UgViT working as a plug-and-play module on different backbones
evaluated on Synapse and ISIC2018. For a fair comparison, we re-implemented the
CE-Net and H2Former, maintaining consistent experimental settings, to evaluate the
sole impact of UgViT.
Methods Synapse ISIC2018
DSC↑ HD↓ DSC↑ HD↓
CE-Net [46] 83.62 21.78 90.13 5.13
CE-Net + UgViT 84.05 12.43 90.92 4.00
H2Former [30] 83.93 25.20 90.35 5.57 Fig. 9. The evolution of 𝛾 over the training period on Synapse, ACDC and ISIC2018
H2Former + UgViT 84.44 15.59 90.66 4.31 datasets.
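The ratio 𝛾 and the masked attention it measures can be illustrated with a toy sketch: only tokens flagged as uncertain attend to each other, so the attention matrix shrinks from N × N to N_u × N_u and the cost scales with 𝛾². This is a simplified stand-in for UgViT's masked self-attention, not the paper's exact formulation.

```python
import numpy as np

def masked_self_attention(tokens, mask):
    """Single-head self-attention restricted to uncertain tokens.

    tokens: (N, d) token embeddings; mask: (N,) binary, 1 = uncertain.
    Certain tokens pass through unchanged; gamma is the ratio of tokens
    entering the attention calculation (the quantity plotted in Fig. 9).
    """
    idx = np.flatnonzero(mask)
    gamma = idx.size / mask.size
    sel = tokens[idx]                                  # (N_u, d)
    scores = sel @ sel.T / np.sqrt(tokens.shape[1])    # (N_u, N_u), not (N, N)
    scores -= scores.max(axis=-1, keepdims=True)       # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    out = tokens.copy()
    out[idx] = attn @ sel
    return out, gamma

rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 4))
mask = np.array([1, 0, 1, 1, 0, 0, 0, 1])
out, gamma = masked_self_attention(tokens, mask)
```

Here half the tokens are uncertain, so 𝛾 = 0.5 and the attention matrix is 4 × 4 instead of 8 × 8; as training sharpens the CNN's predictions, 𝛾 and hence the quadratic cost shrink further.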

Both cases would slightly degrade the segmentation performance. In practice, we suggest tuning {𝑟_𝑖} from smaller values until the best settings are found through grid searching.

5.3. Extendability of hybrid strategy

UgViT is designed to work as a plug-and-play module for effective global information extraction and is highly extendable to other frameworks. To verify this, two state-of-the-art approaches are selected as backbones: CE-Net [46] and H2Former [30]. CE-Net is a pure CNN-based model, while H2Former adopts a CNN-Transformer hybrid architecture. Quantitative results of both backbones with and without UgViT on Synapse and ISIC2018 are summarized in Table 6. In general, UgViT brings consistent performance improvements on both backbones, demonstrating the effectiveness of UgViT for better CNN-Transformer collaboration.

5.4. Interpretability of UgViT

To more intuitively understand the effect of UgViT, we further visualize the segmentation maps, the corresponding uncertainty maps, and the learned attention maps in Fig. 8. It is observed that UgViT excels in rectifying segmentation errors where the CNN erroneously identifies background as foreground. Upon the comparison between Fig. 8(c) and (d) in the first row, UgViT effectively eliminates false positives while reducing misdetections around the pancreas and stomach. Similarly, in the second row of Fig. 8, background areas mislabeled as liver are notably reduced through UgViT. Attention matrices learned by UgViT are visualized by selecting four patches/tokens with high uncertainty values and plotting token-specific dependency. Notably, the learned attention for each patch is distributed across multiple positions within the image, underscoring the indispensability of global information in eliminating such wrongly segmented regions.

5.5. The effect of UgViT on complexity reduction

The indicator 𝛾, representing the ratio of patches included in the self-attention calculation in Eq. (8), reflects how much computational complexity is reduced by UgViT. To evaluate the benefit of UgViT for complexity reduction, we plot the 𝛾-curves across the three datasets, as illustrated in Fig. 9. The observed trends are consistent with our theoretical expectation that the proportion of uncertain regions gradually decreases as training progresses and more patches are recalled by UgViT. It is noticed that the 𝛾-curves vary across datasets, being data- and task-specific. The decrease of 𝛾 demonstrates that UgViT effectively mitigates the quadratic computational cost associated with vanilla ViT, achieving a satisfactory level of complexity reduction.

6. Conclusion

In this paper, we first identify the functional overlap between CNN and transformer in existing CNN-Transformer hybrid architectures and then propose an uncertainty-guided architecture, UCTNet, to explicitly encourage the CNN and the transformer to focus on separate regions for effective medical image segmentation. In UCTNet, through uncertainty estimation of the CNN's predictions and thereafter boundary removal, masked self-attention is constructed to explicitly tell the ViT where to focus. Through such a simple yet effective design, UCTNet


consistently outperforms the state-of-the-art approaches for medical image segmentation on several publicly-available datasets, including both 3D and 2D ones. We believe the core idea of alleviating the functional overlap between CNN and transformer will inspire future work on developing CNN-Transformer hybrid architectures.

CRediT authorship contribution statement

Xiayu Guo: Data curation, Formal analysis, Methodology, Software, Writing – original draft. Xian Lin: Formal analysis, Validation, Visualization. Xin Yang: Formal analysis, Validation, Writing – review & editing. Li Yu: Investigation, Validation, Writing – review & editing. Kwang-Ting Cheng: Formal analysis, Investigation, Validation, Writing – review & editing. Zengqiang Yan: Conceptualization, Formal analysis, Methodology, Project administration, Supervision, Validation, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

All data is publicly available.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grants 62202179 and 62271220, in part by the Natural Science Foundation of Hubei Province of China under Grant 2022CFB585, and in part by the National Natural Science Foundation of China/Research Grants Council Joint Research Scheme under Grant N_HKUST627/20.

References

[1] K. Wang, X. Zhang, X. Zhang, Y. Lu, S. Huang, D. Yang, EANet: Iterative edge attention network for medical image segmentation, Pattern Recognit. 127 (2022) 108636.
[2] J. Chen, C. Chen, W. Huang, J. Zhang, K. Debattista, J. Han, Dynamic contrastive learning guided by class confidence and confusion degree for medical image segmentation, Pattern Recognit. 145 (2024) 109881.
[3] O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, MICCAI, Springer, 2015, pp. 234–241.
[4] J. Schlemper, O. Oktay, M. Schaap, M. Heinrich, B. Kainz, B. Glocker, D. Rueckert, Attention gated networks: Learning to leverage salient regions in medical images, Med. Image Anal. 53 (2019) 197–207.
[5] F. Isensee, P.F. Jaeger, S.A. Kohl, J. Petersen, K.H. Maier-Hein, nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation, Nature Methods 18 (2) (2021) 203–211.
[6] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, 2020, arXiv:2010.11929.
[7] J. Chen, Y. Lu, Q. Yu, X. Luo, E. Adeli, Y. Wang, L. Lu, A.L. Yuille, Y. Zhou, TransUNet: Transformers make strong encoders for medical image segmentation, 2021, arXiv:2102.04306.
[8] A. Hatamizadeh, Y. Tang, V. Nath, D. Yang, A. Myronenko, B. Landman, H.R. Roth, D. Xu, UNETR: Transformers for 3D medical image segmentation, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, WACV, 2022, pp. 574–584.
[9] H. Zhou, J. Guo, Y. Zhang, X. Han, L. Yu, L. Wang, Y. Yu, nnFormer: Interleaved transformer for volumetric segmentation, IEEE Trans. Image Process. 32 (9) (2023) 4036–4045.
[10] W. Liu, T. Tian, W. Xu, H. Yang, X. Pan, PHTrans: Parallelly aggregating global and local representations for medical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, MICCAI, Springer, 2022, pp. 235–244.
[11] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin transformer: Hierarchical vision transformer using shifted windows, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, ICCV, 2021, pp. 10012–10022.
[12] X. Lin, Z. Yan, X. Deng, C. Zheng, L. Yu, ConvFormer: Plug-and-play CNN-style transformers for improving medical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, MICCAI, Springer, 2023, pp. 642–651.
[13] X. Lin, L. Yu, K.-T. Cheng, Z. Yan, The lighter the better: Rethinking transformers in medical image segmentation through adaptive pruning, IEEE Trans. Med. Imaging 42 (8) (2023) 2325–2337.
[14] D. Zhou, B. Kang, X. Jin, L. Yang, X. Lian, Z. Jiang, Q. Hou, J. Feng, DeepViT: Towards deeper vision transformer, 2021, arXiv:2103.11886.
[15] Z. Lu, H. Xie, C. Liu, Y. Zhang, Bridging the gap between vision transformers and convolutional neural networks on small datasets, in: Conference on Neural Information Processing Systems, NeurIPS, 2022, pp. 14663–14677.
[16] T. Chen, Z. Zhang, Y. Cheng, A. Awadallah, Z. Wang, The principle of diversity: Training stronger vision transformers calls for reducing all levels of redundancy, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2022, pp. 12020–12030.
[17] H. Cao, Y. Wang, J. Chen, D. Jiang, X. Zhang, Q. Tian, M. Wang, Swin-UNet: UNet-like pure transformer for medical image segmentation, 2021, arXiv:2105.05537.
[18] A. Lin, B. Chen, J. Xu, Z. Zhang, G. Lu, DS-TransUNet: Dual swin transformer U-Net for medical image segmentation, IEEE Trans. Instrum. Meas. 71 (2022) 1–15.
[19] H. Peiris, M. Hayat, Z. Chen, G. Egan, M. Harandi, A robust volumetric transformer for accurate 3D tumor segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, MICCAI, Springer, 2022, pp. 162–172.
[20] W. Wang, C. Chen, M. Ding, H. Yu, S. Zha, J. Li, TransBTS: Multimodal brain tumor segmentation using transformer, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, MICCAI, Springer, 2021, pp. 109–119.
[21] X. Huang, Z. Deng, D. Li, X. Yuan, Y. Fu, MISSFormer: An effective transformer for 2D medical image segmentation, IEEE Trans. Med. Imaging 42 (5) (2023) 1484–1494.
[22] Y. Xie, J. Zhang, C. Shen, Y. Xia, CoTr: Efficiently bridging CNN and transformer for 3D medical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, MICCAI, Springer, 2021, pp. 171–180.
[23] Y. Li, W. Cai, Y. Gao, X. Hu, More than encoder: Introducing transformer decoder to upsample, in: IEEE International Conference on Bioinformatics and Biomedicine, BIBM, 2022, pp. 1597–1602.
[24] H. Wang, P. Cao, J. Wang, O.R. Zaiane, UCTransNet: Rethinking the skip connections in U-Net from a channel-wise perspective with transformer, in: Association for the Advancement of Artificial Intelligence, AAAI, 2022, pp. 2441–2449.
[25] T. Lei, R. Sun, X. Wang, Y. Wang, X. He, A. Nandi, CiT-Net: Convolutional neural networks hand in hand with vision transformers for medical image segmentation, in: Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI, 2023, pp. 1017–1025.
[26] F. Yuan, Z. Zhang, Z. Fang, An effective CNN and transformer complementary network for medical image segmentation, Pattern Recognit. 136 (2023) 109228.
[27] Q. Yan, S. Liu, S. Xu, C. Dong, Z. Li, J.Q. Shi, Y. Zhang, D. Dai, 3D medical image segmentation using parallel transformers, Pattern Recognit. 138 (2023) 109432.
[28] X. Lin, L. Yu, K.-T. Cheng, Z. Yan, BATFormer: Towards boundary-aware lightweight transformer for efficient medical image segmentation, IEEE J. Biomed. Health Inform. 27 (7) (2023) 3501–3512.
[29] M.M. Rahman, R. Marculescu, Multi-scale hierarchical vision transformer with cascaded attention decoding for medical image segmentation, 2023, arXiv:2303.16892.
[30] A. He, K. Wang, T. Li, C. Du, S. Xia, H. Fu, H2Former: An efficient hierarchical hybrid transformer for medical image segmentation, IEEE Trans. Med. Imaging 42 (9) (2023) 2763–2775.
[31] C. Yao, M. Hu, G. Zhai, X. Zhang, TransClaw U-Net: Claw U-Net with transformers for medical image segmentation, 2021, arXiv:2107.05188.
[32] G. Xu, X. Wu, X. Zhang, X. He, LeViT-UNet: Make faster encoders with transformer for medical image segmentation, 2021, arXiv:2107.08623.
[33] H. Wang, S. Xie, L. Lin, Y. Iwamoto, X. Han, Y. Chen, R. Tong, Mixed transformer U-Net for medical image segmentation, in: IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, 2022, pp. 2390–2394.
[34] C. You, R. Zhao, F. Liu, S. Dong, S. Chinchali, U. Topcu, L. Staib, J.S. Duncan, Class-aware adversarial transformers for medical image segmentation, in: Conference on Neural Information Processing Systems, NeurIPS, 2022, pp. 29582–29596.
[35] M.M. Rahman, R. Marculescu, Medical image segmentation via cascaded attention decoding, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, WACV, 2023, pp. 6222–6231.
[36] A. Shaker, M. Maaz, H. Rasheed, S. Khan, M.-H. Yang, F.S. Khan, UNETR++: Delving into efficient and accurate 3D medical image segmentation, 2022, arXiv:2212.0449.
[37] Y. Wu, K. Liao, J. Chen, J. Wang, D.Z. Chen, H. Gao, J. Wu, D-Former: A U-shaped dilated transformer for 3D medical image segmentation, Neural Comput. Appl. 35 (2023) 1931–1944.
[38] R. Gu, G. Wang, T. Song, R. Huang, M. Aertsen, J. Deprest, S. Ourselin, T. Vercauteren, S. Zhang, CA-Net: Comprehensive attention convolutional neural networks for explainable medical image segmentation, IEEE Trans. Med. Imaging 40 (2) (2020) 699–711.
[39] S. Feng, H. Zhao, F. Shi, X. Cheng, M. Wang, Y. Ma, D. Xiang, W. Zhu, X. Chen, CPFNet: Context pyramid fusion network for medical image segmentation, IEEE Trans. Med. Imaging 39 (10) (2020) 3008–3018.
[40] J.M.J. Valanarasu, P. Oza, I. Hacihaliloglu, V.M. Patel, Medical transformer: Gated axial-attention for medical image segmentation, 2021, arXiv:2102.10662.
[41] S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P.H.S. Torr, et al., Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2021, pp. 6881–6890.
[42] Y. Zhang, H. Liu, Q. Hu, TransFuse: Fusing transformers and CNNs for medical image segmentation, 2021, arXiv:2102.08005.
[43] S. Chen, G. Chen, W. Wang, B. Lei, Z. Wen, H. Wu, FAT-Net: Feature adaptive transformers for automated skin lesion segmentation, Med. Image Anal. 76 (2022) 102327.
[44] D. Dai, C. Dong, S. Xu, Q. Yan, Z. Li, C. Zhang, N. Luo, Ms RED: A novel multi-scale residual encoding and decoding network for skin lesion segmentation, Med. Image Anal. 75 (2022) 102293.
[45] Y. Ou, Y. Yuan, X. Huang, S.T.C. Wong, J. Volpi, J.Z. Wang, K. Wong, Patcher: Patch transformers with mixture of experts for precise medical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, MICCAI, Springer, 2022, pp. 475–484.
[46] Z. Gu, J. Cheng, H. Fu, K. Zhou, H. Hao, Y. Zhao, T. Zhang, S. Gao, J. Liu, CE-Net: Context encoder network for 2D medical image segmentation, IEEE Trans. Med. Imaging 38 (10) (2019) 2281–2292.

Xiayu Guo is an M.D. student at the School of Electronic Information and Communications, Huazhong University of Science and Technology, China. Her research interests include computer vision and medical image analysis.

Xian Lin is a Ph.D. student at the School of Electronic Information and Communications, Huazhong University of Science and Technology, China. She received her B.E. degree from the College of Physical Science and Technology, Central China Normal University, China. Her research interests lie in medical image analysis, efficient transformers, and medical foundation models.

Xin Yang received the Ph.D. degree from the Department of Electrical and Computer Engineering, University of California, Santa Barbara (UCSB), CA, USA, in 2013. She is currently a Professor with the Department of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan, China. Her research interests include medical image analysis and 3-D vision.

Li Yu received her B.Sc., M.Sc., and Ph.D. degrees in communication and information system from HUST in 1992, 1996, and 1999, respectively. Since 1999 she has been with the School of EIC, HUST, where she is now a full professor and the director of the Multimedia and Communication Network Center. She was a visiting professor in the Department of Electrical and Computer Engineering at the University of Waterloo, hosted by Prof. Sherman Shen. Her research interests are multimedia information processing, medical image analysis, and deep learning.

Kwang-Ting Cheng received the Ph.D. degree in electrical engineering and computer sciences from the University of California at Berkeley, Berkeley, in 1988. Before joining HKUST, he was a professor of electrical and computer engineering (ECE) with the University of California, Santa Barbara (UCSB), which he joined in 1993. He is a world authority in the field of electronics testing and design verification, as well as an impactful contributor across a wide range of research areas including design automation of electronic and photonic systems, computer vision, and medical image analysis. He is a Fellow of IEEE.

Zengqiang Yan received the Ph.D. degree from the Department of Computer Science and Engineering, Hong Kong University of Science and Technology, in 2020. He is currently an Associate Professor with the Department of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan, China. His research interests include medical image analysis, federated learning, and computer vision.
