A Novel Deep Learning Model For Medical Image Segmentation With Convolutional Neural Network and Transformer
https://doi.org/10.1007/s12539-023-00585-9
Received: 9 March 2023 / Revised: 26 July 2023 / Accepted: 1 August 2023 / Published online: 4 September 2023
© International Association of Scientists in the Interdisciplinary Areas 2023
Abstract
Accurate segmentation of medical images is essential for clinical decision-making, and deep learning techniques have shown remarkable results in this area. However, existing segmentation models that combine transformer and convolutional neural networks often use skip connections in U-shaped networks, which may limit their ability to capture contextual information in medical images. To address this limitation, we propose a coordinated mobile and residual transformer UNet (MRC-TransUNet) that combines the strengths of transformer and UNet architectures. Our approach uses a lightweight MR-ViT to address the semantic gap and a reciprocal attention module to compensate for the potential loss of details. To better explore long-range contextual information, we use skip connections only in the first layer and add MR-ViT and RPA modules in the subsequent downsampling layers. In our study, we evaluated the effectiveness of our proposed method on three different medical image segmentation datasets, namely, breast, brain, and lung. Our proposed method outperformed state-of-the-art methods in terms of various evaluation metrics, including the Dice coefficient and Hausdorff distance. These results demonstrate that our proposed method can significantly improve the accuracy of medical image segmentation and has the potential for clinical applications.
Graphical Abstract
Illustration of the proposed MRC-TransUNet. For the input medical images, we first subject them to an intrinsic downsampling operation and then replace the original skip connection structure using MR-ViT. The output feature representations at different scales are fused by the RPA module. Finally, an upsampling operation is performed to fuse the features to restore them to the same resolution as the input image.

* Hua Bai
[email protected]
* Baoshan Sun
[email protected]

1 Tianjin Key Laboratory of Optoelectronic Detection Technology and Systems, School of Electronic and Information Engineering, Tiangong University, Tianjin 300387, China
2 School of Computer Science and Technology, Tiangong University, Tianjin 300387, China
3 College of Management and Economics, Tianjin University, Tianjin 300072, China
Keywords Deep learning · Medical image segmentation · Transformer · UNet · Attention mechanism
In computer vision (CV), transformers have been explored as a more flexible alternative to CNNs. Transformers are commonly used in natural language processing (NLP) [11]. Recent advances in deep learning have led to the emergence of vision transformers (ViTs) [12]. Similar to NLP tasks, transformers use 2D image patches with positional embeddings as input sequences in the CNN domain and apply global self-attention to the entire image, which can address issues such as the lack of global information in CNNs.

Recent research has combined transformers with other architectures to improve medical image segmentation. For example, TransUNet [13] combines ViT and UNet, using transformers to extract feature information after convolution through multiple layers. This feature information is then fused with a U-shaped structure of skip connections to recover low-level and high-level information. The Medical Transformer (MedT) model, proposed in [14], introduces a gated axial attention mechanism that enhances existing transformer architectures by incorporating a gating mechanism within the self-attention module. This modification allows the model to effectively handle datasets of any size. Swin-UNet [15] is based entirely on transformers for 2D medical image segmentation, effectively combining the inductive bias of spatial locality, hierarchical structure, and translational invariance. Exploring long-range dependencies and global contextual features through self-attention is at the core of the transformer-based approach. While these developments in transformer-based approaches are encouraging for the development of medical segmentation tasks, some limitations still remain.

The existing models that combine transformer and convolutional neural network for medical image segmentation often overlook a critical issue: the global multilevel modeling problem of skip connections in U-shaped networks. In the current U-shaped network architecture, skip connections are limited in their ability to capture larger-scale contextual information and global dependencies beyond adjacent levels. While they effectively propagate and utilize local features, skip connections struggle to capture long-range dependencies and global contextual information, which can restrict the model's ability to model complex scenes and subtle features. This limitation may prevent U-shaped networks from effectively leveraging global information in some tasks and can negatively impact their performance. Therefore, there is a need to develop novel skip connection schemes that can perform excellent global multilevel modeling to enhance the performance of U-shaped networks in medical image segmentation.

To find a solution that can replace skip connections, Wang et al. [16] proposed UCTransNet, which uses the CTrans module to solve the semantic gap problem arising from skip connections. Specifically, CTrans used the long-dependency modeling advantage of the channel cross-fusion module (CCT) to fuse multi-scale encoder features and a cross-attention module (CCA) to disambiguate with decoder features. To overcome the above challenges and to exploit the ideas of UCTransNet, we propose a coordinated mobile and residual transformer UNet (MRC-TransUNet), a framework designed to optimize U-shaped structures for automatic medical image segmentation.

MRC-TransUNet is a deep learning network for medical image segmentation, based on the U-shaped architecture. Specifically, we first propose the mobile and residual visual transformer (MR-ViT), which combines MobileViT and ResNet [17], to replace skip connections except for the first layer. This method is designed to capture the global information of the low-level features downsampled at each layer. It is noteworthy that other CNNs combined with ViT are large-scale and yield several times the number of model parameters of the benchmark network. In contrast, MR-ViT not only has a low number of parameters but can also effectively decode the local information together with the global information to learn the global representation from different perspectives [18]. On the other hand, we propose a reciprocal attention (RPA) module for solving the problem of detail loss in multi-scale feature fusion to further improve the model segmentation capability of the encoding–decoding process. Through these refinements, MRC-TransUNet effectively enhances performance, resulting in more accurate medical image segmentation. Our main achievements can be summarized as follows.

1. In this paper, we propose a transformer-based MRC-TransUNet framework, which effectively combines the advantages of the transformer as well as the CNN, reduces the number of parameters of a traditional transformer combined with a CNN, and enhances the flexibility of segmentation. The core idea of this method is to replace skip connections with transformers, which results in more accurate segmentation of medical images (a structural sketch in code follows this list).
2. The novel MR-ViT module and RPA module are proposed to build global dependencies between different scales for downsampled feature extraction, which can focus on more directional information as well as location information, thus effectively fusing the contextual information between multi-scale features.
3. Our method is a more appropriate combination of UNet and transformer, with a lower number of parameters and higher performance. Compared with other advanced segmentation methods, it achieves better experimental results on the three datasets.
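The core idea — keeping a plain skip connection only at the first layer and passing every deeper skip feature through a lightweight transformer before fusion — can be illustrated with a minimal PyTorch sketch. The class and module names below are placeholders we introduce for illustration; the encoder, decoder, and skip-refinement modules are assumed to be supplied by the user, and this is not the authors' released implementation.

    import torch
    from torch import nn

    class TransformerSkipUNetSketch(nn.Module):
        """Illustrative U-shaped network in which only the first (highest-resolution)
        skip connection is kept as-is, while deeper skip features are re-encoded by
        transformer-style modules (e.g. MR-ViT-like blocks) before fusion."""

        def __init__(self, encoder_stages, decoder_stages, skip_transformers):
            super().__init__()
            self.encoder_stages = nn.ModuleList(encoder_stages)        # downsampling blocks
            self.decoder_stages = nn.ModuleList(decoder_stages)        # upsample-and-fuse blocks
            self.skip_transformers = nn.ModuleList(skip_transformers)  # one per deep level

        def forward(self, x):
            skips = []
            for stage in self.encoder_stages:
                x = stage(x)
                skips.append(x)
            x = skips.pop()  # bottleneck features
            # plain skip at the first level, transformer-refined skips at deeper levels
            refined = [skips[0]] + [t(f) for t, f in zip(self.skip_transformers, skips[1:])]
            for stage, skip in zip(self.decoder_stages, reversed(refined)):
                x = stage(x, skip)  # each decoder stage upsamples x and fuses it with the skip
            return x

Each decoder stage is assumed to upsample its input and fuse it with the corresponding skip feature, for example through an attention-based fusion such as the RPA module described in Sect. 3.3.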
2 Related Work

This section provides a summary of commonly used CNN-based methods in medical segmentation, as well as the application of transformers in computer vision. Additionally, we discuss the use of skip connections in segmentation models.

2.2 Visual Transformer

Vision transformers are used directly on the whole image through a transformer with global self-attention, which has achieved great success in computer vision. The self-attention mechanism has been demonstrated to significantly improve the ability of these models to capture global context and long-range dependencies in images and videos [27–30]. Especially in medical image processing, vision transformers have shown great potential for improving segmentation performance. Swin-UNet [15] is based entirely on the transformer for 2D medical image segmentation, but the transformer is not sufficient for low-level detail information acquisition. Therefore, the combination of the two has been widely studied. TransUNet [13] combined the transformer with UNet, performed feature extraction by CNN, and then fed it into the transformer module for decoding, proving that the transformer can be employed as a powerful encoder for medical image segmentation. MedT [14] used shallow global branching and deep local branching to operate on the patches of medical images, and this strategy can improve the segmentation results. UCTransNet [16] was proposed in 2022; it uses CTrans instead of UNet's skip connections and achieves multi-scale channel cross-fusion.

2.3 Skip Connections in U-Shaped Structures

Current medical segmentation methods based on UNet [7] focus on the improvement of the encoder and decoder to obtain more accurate feature information. For example, TransUNet [13] modified the encoder in the form of a CNN combined with a transformer, Swin-UNet [15] used a transformer instead of a convolutional layer in a U-shaped structure, and Res-UNet [19] and Attention-UNet [10] highlighted salient local regions by introducing an attention mechanism for specific features. It has since been shown that the skip connections of UNet are a potential factor limiting the segmentation capability (e.g., UNet++ [8] redesigned the skip connections so that the sub-networks of decoders can aggregate features at different scales and the network becomes more flexible, and UCTransNet [16] found experimentally that the features of encoders and decoders are inconsistent). In some cases, the shallow features of encoders and decoders may have a semantic gap that simple skip connections cannot bridge. This can result in impaired final performance due to the shallow features containing less semantic information.

3.1 MRC-TransUNet
3.2 MR-ViT

X_T(p) = \mathrm{Transformer}(X_T(p)), \quad 1 \le p \le P. \quad (1)
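Equation (1) applies a transformer to the token sequence formed at each intra-patch position p, in the spirit of the MobileViT global-representation step [18]. The following sketch shows one way such an unfold–transform–fold operation can be written in PyTorch; the patch size, tensor layout, and the example encoder are illustrative assumptions rather than the exact MR-ViT implementation.

    import torch
    from torch import nn

    def patchwise_transformer(x, transformer, patch=2):
        """Apply `transformer` across patches for every intra-patch position p (Eq. 1).
        `x` is (B, C, H, W) with H and W divisible by `patch`;
        `transformer` maps (batch, sequence, C) -> (batch, sequence, C)."""
        b, c, h, w = x.shape
        ph = pw = patch
        nh, nw = h // ph, w // pw
        # unfold: group pixels by their position inside each patch
        x = x.reshape(b, c, nh, ph, nw, pw)
        x = x.permute(0, 3, 5, 2, 4, 1).reshape(b * ph * pw, nh * nw, c)
        x = transformer(x)  # global self-attention over the patch sequence
        # fold back to a (B, C, H, W) feature map
        x = x.reshape(b, ph, pw, nh, nw, c).permute(0, 5, 3, 1, 4, 2)
        return x.reshape(b, c, h, w)

    # Example (illustrative only): a small encoder with d_model equal to the channel count C
    # encoder_layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
    # transformer = nn.TransformerEncoder(encoder_layer, num_layers=2)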
3.2.3 Information Circulation Block

To capture local features more accurately and reduce the number of parameters as much as possible, we introduce a residual module in the information circulation block and perform dimensionality reduction by point-by-point convolution. After that, the feature map sequentially passes through 3 × 3 convolution layers and uses a 1 × 1 convolution to recover the original number of channels. The obtained tensor is added to the input tensor to achieve the local information fusion of low-level features and high-level features.

3.2.4 Fusion

In the fusion module, we concatenate the output of the information circulation block with the output of the global representations, because we believe that the information circulation block contains richer local features. Afterward, a 3 × 3 convolution is used to reduce the dimensionality to keep the input and output of the same size; unlike MobileViTv3 [38], we add the features of the local representations to the output of the 3 × 3 convolution to optimize the deep architecture.

3.3 RPA Module

As shown in Fig. 3, the RPA module obtains two one-dimensional directional feature weights and one two-dimensional spatial feature weight, after which the two attentional features are fused to obtain the output map F_out ∈ ℝ^{C×H×W}.

The RPA module consists of two branches, each of which decomposes the channel attention into two 1-dimensional features. These features aggregate the input features along two spatial directions [39], enabling the module to capture long-range dependencies along one direction while preserving precise positional information along the other.
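A compact PyTorch sketch of such a reciprocal-attention-style block is given below. The concrete layer choices — coordinate-attention-style directional pooling [39] for the two 1D weights, a 7 × 7 convolution over channel statistics for the 2D spatial weight, and fusion by weighting the input and summing — are our assumptions for illustration and not necessarily the authors' exact RPA design.

    import torch
    from torch import nn

    class RPASketch(nn.Module):
        """Sketch of a reciprocal-attention-style block: two 1D directional weights
        (coordinate-attention style [39]) plus one 2D spatial weight, fused into
        an output of the same shape, F_out in R^{C x H x W}."""

        def __init__(self, channels, reduction=16):
            super().__init__()
            mid = max(8, channels // reduction)
            self.squeeze = nn.Sequential(
                nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
            self.conv_h = nn.Conv2d(mid, channels, 1)                 # weight along height
            self.conv_w = nn.Conv2d(mid, channels, 1)                 # weight along width
            self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)  # 2D spatial weight

        def forward(self, x):
            b, c, h, w = x.shape
            # directional branch: pool along each spatial direction, then split
            x_h = x.mean(dim=3, keepdim=True)                         # B x C x H x 1
            x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)     # B x C x W x 1
            y = self.squeeze(torch.cat([x_h, x_w], dim=2))            # B x mid x (H+W) x 1
            y_h, y_w = torch.split(y, [h, w], dim=2)
            a_h = torch.sigmoid(self.conv_h(y_h))                      # B x C x H x 1
            a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # B x C x 1 x W
            directional = x * a_h * a_w
            # spatial branch: 2D weight from channel-wise mean and max statistics
            stats = torch.cat([x.mean(dim=1, keepdim=True),
                               x.max(dim=1, keepdim=True).values], dim=1)
            spatial = x * torch.sigmoid(self.spatial(stats))
            return directional + spatial                              # fused output F_out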
In addition, we used a private dataset from brain tumor patients who received treatment at Tianjin Huanhu Hospital between 2016 and 2022. Patients were retrospectively selected based on specific inclusion criteria, including pathologically confirmed intracranial meningioma, histopathological grading following WHO guidelines, and absence of cerebral softening and severe or diffuse brain atrophy before the MRI scan. This study was approved by the Ethics Committee of Tianjin Huanhu Hospital. The dataset distribution is shown in Table 3 ([Online 2]. https://www.kaggle.com/datasets/tinashri/brain-tumor-segmentation-datasets).

Table 3 Division of the training and testing set on the brain tumor images dataset

              Public   Private   Total
Training set  2244     646       2890
Testing set   403      211       614

4.1.3 COVID-19 Radiography Database

A collaboration between researchers from Qatar University in Doha, Qatar, and the University of Dhaka in Bangladesh, along with medical doctors from Pakistan and Malaysia, resulted in the creation of a database containing chest X-ray images of individuals who were COVID-19 positive, as well as images of those with normal and viral pneumonia [41, 42]. The dataset was categorized following the distribution outlined in Table 4 ([Online 3]. https://www.kaggle.com/datasets/tawsifurrahman/covid19-radiography-database).

Table 4 Division of the training and testing set on the COVID-19 radiography database

              COVID-19   Normal   Viral pneumonia   Total
Training set  2582       7280     960               10,822
Testing set   1034       2912     385               4331

4.2 Implementation Details

The experimental settings were chosen to balance model performance with the size of our dataset and computational resources. In our experiments, the proposed MRC-TransUNet as well as the other comparison models are trained with a batch size of 16 for 500 epochs using the Adam optimizer [43] with an initial learning rate of 0.001. In addition, our experiments use an early stopping mechanism for tuning. All networks were trained on an NVIDIA Tesla V100 GPU using PyTorch (version 1.7.0). The resulting feature maps were binarized to obtain the final binary mask predictions. It is worth noting that no data augmentation was performed in any of our experiments, for the following reasons:

1. Augmentation techniques are often applied by rotating, flipping, cropping, scaling, and other transformations to the images to expand the dataset. However, in medical images, these transformations may result in loss of important information, such as the location, shape, and size of lesions.
2. Accurate diagnosis and treatment in medical image analysis depend on the accurate identification and analysis of subtle features and structures in the images. Therefore, accuracy is more important than data quantity for medical images. If augmentation techniques introduce noise or other forms of uncertainty, they may affect the accuracy of the model.
3. We used mostly open-source datasets that have been curated and validated by experts in the field, which do not require extensive preprocessing or augmentation. These datasets were already of high quality and contained sufficient diversity to represent the underlying population.

During training, we utilized the Binary Cross-Entropy loss as the loss function. The objective of this loss function is to minimize the discrepancy between the predicted probability distribution and the actual probability distribution of the target. This loss function quantifies the dissimilarity between the two probability distributions. In particular, its mathematical definition is given by

\mathrm{loss} = -\left( y \log(p(x)) + (1 - y) \log(1 - p(x)) \right), \quad (10)

where p(x) is the model output and y is the true label.
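A minimal PyTorch sketch of this training configuration (Adam with learning rate 0.001, batch size 16, up to 500 epochs, BCE loss as in Eq. (10), binarized predictions at inference) is shown below. The model and data-loader objects are placeholders, the model is assumed to output probabilities in [0, 1], and the patience-based early-stopping rule is our assumption, since the exact criterion is not specified.

    import torch
    from torch import nn, optim

    def train_model(model, train_loader, val_loader, device, epochs=500, patience=20):
        criterion = nn.BCELoss()                       # Eq. (10); expects probabilities
        optimizer = optim.Adam(model.parameters(), lr=1e-3)
        best_val, wait = float("inf"), 0
        for epoch in range(epochs):
            model.train()
            for images, masks in train_loader:         # loader built with batch_size=16
                images, masks = images.to(device), masks.to(device)
                optimizer.zero_grad()
                loss = criterion(model(images), masks)
                loss.backward()
                optimizer.step()
            model.eval()                               # early stopping on validation loss
            with torch.no_grad():
                val_loss = sum(criterion(model(x.to(device)), y.to(device)).item()
                               for x, y in val_loader) / len(val_loader)
            if val_loss < best_val:
                best_val, wait = val_loss, 0
            else:
                wait += 1
                if wait >= patience:
                    break

    # At test time, probability maps are thresholded to produce binary masks:
    # binary_mask = (model(images) > 0.5).float()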
The values of the Dice score are in the range of [0, 1], where 0 indicates no overlap between segmentation results and labeled maps and 1 indicates complete overlap. TP represents true positives, FP represents false positives, TN represents true negatives, and FN represents false negatives.

IoU is another evaluation metric in the segmentation model that measures the degree of overlap between the ground truth and the predicted segmentation. In layman's terms, IoU represents the intersection-over-union ratio between two samples, which is defined as

\mathrm{IoU} = \frac{P \cap T}{P \cup T}. \quad (12)

The range interval is still [0, 1], and the larger the value, the better the segmentation effect. P denotes the predicted segmentation result; T denotes the manual labeling result.

The minimum distance of any vertex v to S(A) is defined as

d(v, S(A)) = \min_{s_A \in S(A)} \lVert v - s_A \rVert, \quad (13)

where S(A) represents the set of surface vertices of a 3D volume A and ‖·‖ denotes the Euclidean distance, with a greater value indicating a larger distance.

The Hausdorff distance [44] is a measure that describes the degree of similarity between two sets of points. It is defined as

\mathrm{HD}(G, R) = \max\left\{ \sup_{s_G} d(s_G, S(R)),\ \sup_{s_R} d(s_R, S(G)) \right\}, \quad (14)

where s_G and s_R refer to surface vertices in the automated segmentation result R and the corresponding ground truth segmentation G, respectively, and the symbol "sup" denotes the supremum.

The ASD is a metric that quantifies the similarity between two binary masks (G and R) by computing the average Euclidean distance between their surface vertices, which is defined as

\mathrm{ASD}(G, R) = \frac{\sum_{s_G} d(s_G, S(R)) + \sum_{s_R} d(s_R, S(G))}{\lvert S(G) \rvert + \lvert S(R) \rvert}. \quad (15)

5 Results and Discussion

To verify the superiority of our proposed MRC-TransUNet, we compared it with several state-of-the-art medical image segmentation methods, including SegNet [45], UNet [7], UNet++ [8], PSPNet [46], CCNet [24], UCTransNet [16], and TransUNet [13].

5.1 Results on Breast Ultrasound Images Dataset

5.1.1 Qualitative Comparison

Figure 4 shows a qualitative visual comparison of the segmentation results on the Breast Ultrasound Images Dataset using the proposed method and other state-of-the-art methods. We can observe that our method outperforms the other methods in breast cancer segmentation. Breast cancers in ultrasound images are easily confused with surrounding tissues and structures, as shown in the first row of Fig. 4, where SegNet, UNet, PSPNet, and TransUNet misidentified the excess tissue, producing serious false positives. Segmentation of small targets is a very challenging task, and the visual comparison results in the second column of Fig. 4 show that many segmentation models are error prone, such as UNet, PSPNet, CCNet, and TransUNet. In contrast, our method has a strong capability in identifying and segmenting small targets.

Fig. 4 Visual comparison of the segmentation performance of different models on Breast Ultrasound Images Dataset. To enhance the visualization of differences between segmentation predictions and ground truths, we use suitable boxes to highlight the key regions
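For reference, the evaluation metrics defined in Eqs. (12)–(15), together with the DSC values reported below, can be computed from a predicted binary mask and a ground-truth mask as in the following NumPy/SciPy sketch. Extracting surfaces with a binary erosion is one common convention rather than necessarily the authors' exact procedure, and both masks are assumed to be non-empty.

    import numpy as np
    from scipy import ndimage

    def dice_iou(pred, gt):
        pred, gt = pred.astype(bool), gt.astype(bool)
        inter = np.logical_and(pred, gt).sum()
        dice = 2.0 * inter / (pred.sum() + gt.sum() + 1e-8)
        iou = inter / (np.logical_or(pred, gt).sum() + 1e-8)   # Eq. (12)
        return dice, iou

    def surface_distances(a, b, spacing=1.0):
        # surface voxels = mask minus its erosion; point-to-surface distances as in Eq. (13)
        a_border = a ^ ndimage.binary_erosion(a)
        b_border = b ^ ndimage.binary_erosion(b)
        dt_to_b = ndimage.distance_transform_edt(~b_border, sampling=spacing)
        dt_to_a = ndimage.distance_transform_edt(~a_border, sampling=spacing)
        return dt_to_b[a_border], dt_to_a[b_border]

    def hd95_asd(pred, gt):
        d_ab, d_ba = surface_distances(pred.astype(bool), gt.astype(bool))
        all_d = np.concatenate([d_ab, d_ba])
        hd95 = np.percentile(all_d, 95)   # 95th-percentile variant of Eq. (14)
        asd = all_d.mean()                # Eq. (15)
        return hd95, asd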
Another typical example is the malignant breast cancer with an unclear boundary (shown in the third column), which all models have difficulty recognizing. It can be found from the results that, although some details still cannot be recognized well (red marked part), the proposed method is closest to the ground truth and can segment the target effectively compared with the other methods.

5.1.2 Quantitative Comparison

The quantitative comparison of segmentation results for the Breast Ultrasound Images Dataset, the dataset with the smallest sample size of the three datasets, is shown in Table 5. From the results, we can find that the proposed method, MRC-TransUNet, obtained 78.8% DSC in breast cancer segmentation, which is better than all other state-of-the-art methods. In addition, the proposed method also achieved optimal HD95 and ASD (15.0148 mm and 4.3169 mm, respectively).

Table 5 Results of comparisons on the breast ultrasound images dataset

Method            DSC (%) ↑   IoU (%) ↑   HD95 (mm) ↓   ASD (mm) ↓
SegNet [45]       72.97       57.44       17.8096       5.5794
UNet [7]          78.07       64.02       15.9410       4.6369
UNet++ [8]        75.73       60.94       15.7513       4.6903
PSPNet [46]       74.41       59.25       17.7525       5.8106
CCNet [24]        72.97       57.44       17.7510       6.7503
UCTransNet [16]   76.34       61.74       15.5047       5.0491
TransUNet [13]    71.52       55.66       19.6550       5.5359
Ours              78.80       65.02       15.0148       4.3169

Bold indicates the best data for each column

Compared with the baseline network UNet (78.07% and 64.02% for DSC and IoU, respectively), the proposed method (MRC-TransUNet) achieved improvements of 0.73% and 1% in these two evaluation metrics, respectively. We found that UNet, UNet++, and UCTransNet outperformed other methods (such as SegNet, PSPNet, and CCNet) in all evaluation metrics, indicating that the U-shaped encoding–decoding structure could effectively extract lesion features of breast cancer and achieve accurate reconstruction. However, we observed that TransUNet, which combines the UNet and Transformer structures, showed the worst segmentation results in the three evaluation metrics (71.52% for DSC, 55.66% for IoU, and 19.6550 mm and 5.5359 mm for HD95 and ASD, respectively). This may be attributed to the fact that the transformer model is more difficult to train than CNNs without a pre-trained model and a large amount of training data, as reported in previous studies [18]. In this dataset, we only used 452 images as training data, so simply adding a transformer to UNet may not be sufficient to learn the spatial information of breast cancer, resulting in the poor segmentation results of TransUNet on this dataset.

We conducted a separate experiment to compare the results of different methods on benign and malignant images, and the findings are presented in Table 6. Benign lesions typically have clear boundaries, which enable all models to achieve relatively good results. However, due to the unclear boundaries and irregular shapes of malignant lesions, the overall segmentation results of MRC-TransUNet on malignant lesions (DSC of 74.94% and IoU of 59.92%) were slightly lower than those of UNet by less than 1%. These results suggest that, while MRC-TransUNet uses a lightweight transformer to compensate for the disadvantages of the original structure, such as the large number of parameters and the loss of local details, its performance may be inferior to CNN-based models in segmentation tasks with very small training samples, such as the malignant dataset, which only contains 150 images.

Table 6 Comparative experimental results of different categories on the breast ultrasound images dataset

                  DSC (%) ↑            IoU (%) ↑            HD95 (mm) ↓           ASD (mm) ↓
Method            Benign   Malignant   Benign   Malignant   Benign    Malignant   Benign   Malignant
SegNet [45]       76.04    69.43       61.34    53.17       15.5883   22.3261     4.5062   7.6541
UNet [7]          80.55    75.29       67.44    60.37       14.0255   19.8359     4.1602   5.6060
UNet++ [8]        79.64    71.43       66.18    55.56       13.3265   20.6817     3.5532   7.2197
PSPNet [46]       78.02    70.49       63.96    54.43       16.1365   21.0383     4.8670   7.6497
CCNet [24]        75.36    69.84       60.46    53.66       15.7940   22.7763     5.6343   8.9498
UCTransNet [16]   79.76    72.46       66.34    56.82       12.6012   20.0085     3.9899   7.1494
TransUNet [13]    74.33    68.30       59.14    51.86       17.2406   24.5645     4.4657   7.6038
Ours              82.29    74.94       69.90    59.92       11.7911   18.5696     3.4030   5.8699

Our method preserves the original convolution operation in downsampling to capture rich local detail features, while utilizing a transformer with fewer parameters instead of skip connections to integrate local and global information features for improved segmentation performance without significantly increasing the number of parameters.
The results presented above demonstrate that MRC-TransUNet achieves good segmentation performance on small-sample breast cancer datasets.

Table 7 Results of comparisons on the brain tumor images dataset (Method; DSC (%) ↑; IoU (%) ↑; HD95 (mm) ↓; ASD (mm) ↓)

Fig. 5 Visual comparison of the segmentation performance of different models on Brain Tumor Images Dataset. To enhance the visualization of differences between segmentation predictions and ground truths, we use suitable boxes to highlight the key regions
Table 8 Comparative experimental results of different categories on the brain tumor images dataset

                  DSC (%) ↑          IoU (%) ↑          HD95 (mm) ↓        ASD (mm) ↓
Method            Public   Private   Public   Private   Public   Private   Public   Private
SegNet [45]       75.44    88.70     60.56    79.69     6.7303   3.3744    1.8457   0.9642
UNet [7]          83.36    89.83     71.20    81.54     4.9947   2.9624    1.3919   0.8151
UNet++ [8]        81.06    88.70     68.16    79.69     5.0991   3.2271    1.4899   0.8415
PSPNet [46]       78.96    89.19     65.24    80.49     5.4914   3.8285    1.5280   0.8854
CCNet [24]        75.64    89.36     60.83    80.77     6.3551   4.6679    1.8789   1.0851
UCTransNet [16]   82.95    90.41     70.87    82.49     5.3408   3.0153    1.3890   0.9035
TransUNet [13]    78.08    89.32     64.04    80.70     6.9516   3.7291    1.6902   1.0152
Ours              83.55    90.85     71.74    83.23     4.9677   2.7356    1.3641   0.6704
In the private dataset, compared to UNet, the proposed method achieves a 1.02% improvement in DSC, a 1.69% improvement in IoU, a 0.2268 mm improvement in HD95, and a 0.1447 mm improvement in ASD. These results further validate the good generalization ability of MRC-TransUNet for different medical image segmentation tasks.

5.3 Results on COVID-19 Radiography Database

The COVID-19 Radiography Database used in this paper contains 15,153 sample images of lungs. The large data volume can compensate for the lack of convolutional inductive bias in the transformer structure, while the CNN can obtain better representation learning capability on the large-scale dataset. On this dataset, both CNN and transformer models show excellent segmentation and obtain significant performance improvements. It is hard to see the gap between the segmentation effects of different models in qualitative comparison results, so we only show quantitative comparison results in this subsection.

Table 9 Results of comparisons on the COVID-19 radiography database

Method            DSC (%) ↑   IoU (%) ↑   HD95 (mm) ↓   ASD (mm) ↓
SegNet [45]       95.09       90.64       1.8158        0.7551
UNet [7]          95.48       91.35       1.4601        0.6869
UNet++ [8]        95.48       91.35       1.3846        0.6863
PSPNet [46]       95.26       90.95       2.0291        0.7405
CCNet [24]        95.31       91.03       2.0815        0.7348
UCTransNet [16]   95.48       91.35       1.3931        0.6848
TransUNet [13]    95.46       91.31       1.4667        0.6912
Ours              95.70       91.75       1.3544        0.6840

Bold indicates the best data for each column

Table 9 shows the quantitative comparison results of different methods on the COVID-19 Radiography Database. The proposed MRC-TransUNet outperforms the other models by using a transformer instead of skip connections to fuse multi-scale features and by focusing on feature regions with an effective attention module. The DSC, IoU, HD95, and ASD of the proposed method are 95.7%, 91.75%, 1.3544 mm, and 0.684 mm, respectively. Compared with other algorithms, the proposed method improves the DSC by 0.61% (SegNet), 0.22% (UNet), 0.22% (UNet++), 0.44% (PSPNet), 0.39% (CCNet), 0.22% (UCTransNet), and 0.24% (TransUNet); the IoU by 1.11% (SegNet), 0.4% (UNet), 0.4% (UNet++), 0.8% (PSPNet), 0.72% (CCNet), 0.4% (UCTransNet), and 0.44% (TransUNet); the HD95 by 0.4614 mm (SegNet), 0.1057 mm (UNet), 0.0302 mm (UNet++), 0.6747 mm (PSPNet), 0.7271 mm (CCNet), 0.0387 mm (UCTransNet), and 0.1123 mm (TransUNet); and the ASD by 0.0711 mm (SegNet), 0.0029 mm (UNet), 0.0023 mm (UNet++), 0.0565 mm (PSPNet), 0.0508 mm (CCNet), 0.0008 mm (UCTransNet), and 0.0072 mm (TransUNet), respectively. In addition, the DSC of each model is higher than 95%, which reflects the balanced distribution of this dataset. UCTransNet is only lower than MRC-TransUNet in each evaluation metric, which proves the effectiveness of the transformer instead of skip connections. TransUNet shows good segmentation performance as the sample size of the dataset increases, and its performance on this dataset is better than SegNet, CCNet, and PSPNet, proving that ViTs can reach or even surpass the performance of CNNs on large-scale samples.

5.4 Results of Ablation Experiments

In this section, a further ablation study was performed on the different medical image segmentation datasets to evaluate the performance of each component of the proposed MRC-TransUNet. We used only DSC and HD95 as evaluation metrics. The experimental results are shown in Table 10, with Breast denoting the Breast Ultrasound Images Dataset, Brain denoting the Brain Tumor Images Dataset, and Lung denoting the COVID-19
Radiography Database. MVM and RPA represent the addition of only the MR-ViT module and only the reciprocal attention (RPA) module to the model, respectively. From Table 10, we obtain the following observations.

(1) By adding MR-ViT to replace the original skip connections, "MVM" achieves performance improvements of 0.5%, 0.08%, and 0.04% in DSC and of 0.2615 mm, 0.0489 mm, and 0.086 mm in HD95 on the three datasets, respectively. The method facilitates the fusion of multi-scale feature information compared to the common UNet.
(2) When we use the proposed RPA module, the DSC on the three datasets improves from 78.07%, 85.97%, and 95.48% to 78.19%, 85.98%, and 95.55%, respectively, and the HD95 improves from 15.941 mm, 4.2509 mm, and 1.4601 mm to 14.8684 mm, 4.2213 mm, and 1.3889 mm. This indicates that the proposed attention mechanism can improve the model's ability to focus on the region of interest and thus improve the segmentation performance.
(3) Our proposed method outperforms the other variants on all metrics and all datasets, which demonstrates the effectiveness of combining the two modules. Our results illustrate the effectiveness of using a lightweight transformer instead of skip connections to fuse multi-scale features in a U-shaped segmentation framework.

5.5 Parameters of Transformer-based Methods

We evaluated the segmentation performance and parameter efficiency of three transformer-based medical segmentation methods: UCTransNet, TransUNet, and our proposed MRC-TransUNet. Among the three, MRC-TransUNet had the lowest number of parameters, with only 112 MB, while TransUNet had the largest parameter count of 401 MB, and UCTransNet had 253 MB. Our proposed MRC-TransUNet demonstrated superior performance in medical image segmentation tasks.

MRC-TransUNet outperformed UCTransNet and TransUNet on the Breast Ultrasound Images dataset, achieving a higher DSC of 78.80% and IoU of 65.02%. Additionally, MRC-TransUNet had lower HD95 and ASD values of 15.0148 mm and 4.3169 mm, respectively, compared to UCTransNet and TransUNet. On the Brain Tumor Images dataset, MRC-TransUNet achieved a DSC of 86.14%, an IoU of 75.65%, a HD95 of 4.2007 mm, and an ASD of 1.1252 mm, which were again better than UCTransNet and TransUNet. On the COVID-19 Radiography dataset, MRC-TransUNet achieved a DSC of 95.70%, an IoU of 91.75%, a HD95 of 1.3544 mm, and an ASD of 0.6840 mm, which outperformed UCTransNet and TransUNet.

Overall, our proposed MRC-TransUNet achieved superior segmentation performance while using significantly fewer parameters, demonstrating its potential for practical medical image analysis.

6 Conclusion

In this paper, we propose MRC-TransUNet, a transformer-based framework that enhances the segmentation quality of medical images using a U-shaped architecture. Our approach replaces traditional skip connections with a lightweight transformer (MR-ViT) to enable more feature extraction layers in the encoder–decoder and better model global context. We also introduce a novel attention mechanism before multi-scale fusion, which effectively fuses multi-scale feature representations from the encoder by constructing attention weights for 1D location features and 2D spatial features to model long-range dependencies. Our approach outperforms previous state-of-the-art methods, as demonstrated through extensive experiments on three medical image segmentation tasks. However, we acknowledge the limitations of our study. Specifically, our proposed method did not show a significant improvement over UNet in quantitative evaluation. To improve the performance and applicability of our method, we plan to explore several avenues in future work. These include designing segmentation models based on small-sample medical images, investigating how visual transformers can better model long-range dependencies, and addressing other optimization challenges.
Acknowledgements The authors wish to express our sincere appreciation to Chenzi Zheng (College of Foreign Languages, Nankai University) for her valuable assistance in editing the English language of our research.

Funding This work was supported in part by the National Natural Science Foundation of China (61972456, 61173032) and the Tianjin Research Innovation Project for Postgraduate Students (2022SKY126).

Data availability The data are available from the corresponding author on reasonable request.

Declarations

Conflict of interest The authors declare no conflict of interest.

References

1. Shirokikh B, Dalechina A, Shevtsov A et al (2020) Deep learning for brain tumor segmentation in radiosurgery: prospective clinical evaluation. In: LNIP, BrainLes 2019, vol 11992, Springer, Cham, pp 119–128. https://doi.org/10.1007/978-3-030-46640-4_12
2. Otsu N (2007) A threshold selection method from gray-level histograms. IEEE Trans Syst Man Cybern 9(1):62–66. https://doi.org/10.1109/TSMC.1979.4310076
3. Prastawa M, Bullitt E, Gerig G (2009) Simulation of brain tumors in MR images for evaluation of segmentation efficacy. Med Image Anal 13(2):297–311. https://doi.org/10.1016/j.media.2008.11.002
4. Corso JJ, Sharon E, Dube S et al (2008) Efficient multilevel brain tumor segmentation with integrated Bayesian model classification. IEEE Trans Med Imaging 27(5):629–640. https://doi.org/10.1109/TMI.2007.912817
5. Lin AL, Chen BZ, Xu JY et al (2022) DS-TransUNet: dual swin transformer U-Net for medical image segmentation. IEEE Trans Instrum Meas 71:4005615. https://doi.org/10.1109/TIM.2022.3178991
6. Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. IEEE Trans Pattern Anal Mach Intell 39(4):640–651. https://doi.org/10.1109/TPAMI.2016.2572683
7. Ronneberger O, Fischer P, Brox T (2015) U-Net: convolutional networks for biomedical image segmentation. In: LNIP, MICCAI 2015, vol 9351, Springer, Cham, pp 234–241. https://doi.org/10.1007/978-3-319-24574-4_28
8. Zhou Z, Rahman Siddiquee MM, Tajbakhsh N et al (2018) UNet++: a nested U-Net architecture for medical image segmentation. In: LNIP, DLMIA 2018, vol 11045, Springer, Cham, pp 3–11. https://doi.org/10.1007/978-3-030-00889-5_1
9. Guerrero R, Qin C, Oktay O et al (2018) White matter hyperintensity and stroke lesion segmentation and differentiation using convolutional neural networks. Neuroimage-Clin 17:918–934. https://doi.org/10.1016/j.nicl.2017.12.022
10. Oktay O, Schlemper J, Folgoc LL et al (2018) Attention U-Net: learning where to look for the pancreas. https://doi.org/10.48550/arXiv.1804.03999
11. Vaswani A, Shazeer N, Parmar N et al (2017) Attention is all you need. Adv Neural Inf Process Syst. https://doi.org/10.48550/arXiv.1706.03762
12. Dosovitskiy A, Beyer L, Kolesnikov A et al (2020) An image is worth 16x16 words: transformers for image recognition at scale. https://doi.org/10.48550/arXiv.2010.11929
13. Chen J, Lu Y, Yu Q et al (2021) TransUNet: transformers make strong encoders for medical image segmentation. https://doi.org/10.48550/arXiv.2102.04306
14. Valanarasu J, Oza P, Hacihaliloglu I et al (2021) Medical transformer: gated axial-attention for medical image segmentation. https://doi.org/10.48550/arXiv.2102.10662
15. Cao H, Wang YY, Chen J et al (2021) Swin-Unet: Unet-like pure transformer for medical image segmentation. https://doi.org/10.48550/arXiv.2105.05537
16. Wang H, Cao P, Wang J et al (2022) UCTransNet: rethinking the skip connections in U-Net from a channel-wise perspective with transformer. Proc AAAI Conf Artif Intell 36(3):2441–2449. https://doi.org/10.48550/arXiv.2109.04335
17. He K, Zhang X, Ren S et al (2016) Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016, pp 770–778. https://doi.org/10.1109/CVPR.2016.90
18. Mehta S, Rastegari M (2021) MobileViT: light-weight, general-purpose, and mobile-friendly vision transformer. https://doi.org/10.48550/arXiv.2110.02178
19. Xiao X, Shen L, Luo Z et al (2018) Weighted Res-UNet for high-quality retina vessel segmentation. In: 2018 9th International conference on information technology in medicine and education (ITME), Hangzhou, China, 2018, pp 327–331. https://doi.org/10.1109/itme.2018.00080
20. Alom MZ, Hasan M, Yakopcic C et al (2018) Recurrent residual convolutional neural network based on U-Net (R2U-Net) for medical image segmentation. https://doi.org/10.48550/arXiv.1802.06955
21. Fan D-P, Ji GP, Zhou T et al (2020) PraNet: parallel reverse attention network for polyp segmentation. In: Medical image computing and computer assisted intervention – MICCAI 2020: 23rd international conference, Lima, Peru. Proceedings, Part VI. Springer, Cham, pp 263–273. https://doi.org/10.48550/arXiv.2006.11392
22. Valanarasu JMJ, Sindagi VA, Hacihaliloglu I et al (2020) KiU-Net: towards accurate segmentation of biomedical images using over-complete representations. In: Medical image computing and computer assisted intervention – MICCAI 2020: 23rd international conference, Lima, Peru. Springer, Cham, pp 363–373. https://doi.org/10.1007/978-3-030-59719-1_36
23. Wang X, Girshick R, Gupta A et al (2018) Non-local neural networks. In: 2018 IEEE/CVF conference on computer vision and pattern recognition, pp 7794–7803. https://doi.org/10.1109/CVPR.2018.00813
24. Huang Z, Wang X, Huang L et al (2023) CCNet: criss-cross attention for semantic segmentation. Int Conf Comput Vis 45(6):6896–6908. https://doi.org/10.1109/TPAMI.2020.3007032
25. Li J, Huo HT, Li C et al (2021) Multigrained attention network for infrared and visible image fusion. IEEE Trans Instrum Meas 70:5002412. https://doi.org/10.1109/TIM.2020.3029360
26. Tang JH, Zou B, Li C et al (2021) Plane-wave image reconstruction via generative adversarial network and attention mechanism. IEEE Trans Instrum Meas 70:4505115. https://doi.org/10.1109/TIM.2021.3087819
27. Zhao R, Huang Z, Liu T et al (2021) Structure-enhanced attentive learning for spine segmentation from ultrasound volume projection images. In: IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, New York, pp 1195–1199. https://doi.org/10.1109/ICASSP39728.2021.9414658
28. Liu T, Zhang C, Lam KM et al (2022) Decouple and resolve: transformer-based models for online anomaly detection from weakly labeled videos. IEEE Trans Inf Forensics Secur 18:15–28. https://doi.org/10.1109/TIFS.2022.3216479
29. Li K, Wang Y, Zhang J et al (2023) UniFormer: unifying convolution and self-attention for visual recognition. IEEE Trans Pattern Anal Mach Intell 1–18. https://doi.org/10.1109/TPAMI.2023.3282631
30. Zhang Z, Zhang X, Yang Y et al (2023) Accurate segmentation algorithm of acoustic neuroma in the cerebellopontine angle based on ACP-TransUNet. Front Neurosci 17:1207149. https://doi.org/10.3389/fnins.2023.1207149
31. Drozdzal M, Vorontsov E, Chartrand G et al (2016) The importance of skip connections in biomedical image segmentation. In: LNIP, DLMIA 2016, vol 10008, Springer, Cham, pp 179–187. https://doi.org/10.1007/978-3-319-46976-8_19
32. Huang G, Liu Z, Laurens V et al (2016) Densely connected convolutional networks. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR), Honolulu, HI, USA, 2017, pp 2261–2269. https://doi.org/10.1109/CVPR.2017.243
33. Li X, Hao C, Qi X et al (2017) H-DenseUNet: hybrid densely connected UNet for liver and liver tumor segmentation from CT volumes. IEEE Trans Med Imaging 37(12):2663–2674. https://doi.org/10.1109/TMI.2018.2845918
34. Huang H, Lin L, Tong R et al (2020) UNet 3+: a full-scale connected UNet for medical image segmentation. In: ICASSP 2020 - 2020 IEEE international conference on acoustics, speech and signal processing (ICASSP), Barcelona, Spain, 2020, pp 1055–1059. https://doi.org/10.1109/ICASSP40776.2020.9053405
35. Ibtehaz N, Sohel Rahman M (2019) MultiResUNet: rethinking the U-Net architecture for multimodal biomedical image segmentation. Neural Netw 121:74–87. https://doi.org/10.1016/j.neunet.2019.08.025
36. Xiao T, Singh M, Mintun E et al (2021) Early convolutions help transformers see better. Adv Neural Inf Process Syst. https://doi.org/10.48550/arXiv.2106.14881
37. Graham B, El-Nouby A, Touvron H et al (2021) LeViT: a vision transformer in ConvNet's clothing for faster inference. https://doi.org/10.48550/arXiv.2104.01136
38. Wadekar SN, Chaurasia A (2022) MobileViTv3: mobile-friendly vision transformer with simple and effective fusion of local, global and input features. Preprint at https://arXiv.org/arXiv:2209.15159
39. Hou Q, Zhou D, Feng J (2021) Coordinate attention for efficient mobile network design. Comput Vis Pattern Recogn. https://doi.org/10.48550/arXiv.2103.02907
40. Al-Dhabyani W, Gomaa M, Khaled H et al (2019) Dataset of breast ultrasound images. Data Brief 28:104863. https://doi.org/10.1016/j.dib.2019.104863
41. Rahman T, Amith K, Yazan Q et al (2021) Exploring the effect of image enhancement techniques on COVID-19 detection using chest X-ray images. Comput Biol Med 132:104319. https://doi.org/10.1016/j.compbiomed.2021.104319
42. Chowdhury MEH, Rahman T, Khandakar A et al (2020) Can AI help in screening viral and COVID-19 pneumonia? IEEE Access 8:132665–132676. https://doi.org/10.1109/ACCESS.2020.3010287
43. Kingma D, Ba J (2014) Adam: a method for stochastic optimization. Preprint at https://arXiv.org/arXiv:1412.6980. https://doi.org/10.48550/arXiv.1412.6980
44. Beauchemin M, Thomson KP, Edwards G (1998) On the Hausdorff distance used for the evaluation of segmentation results. Can J Remote Sens 24(1):3–8. https://doi.org/10.1080/07038992.1998.10874685
45. Badrinarayanan V, Kendall A, Cipolla R (2017) SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans Pattern Anal Mach Intell 39(12):2481–2495. https://doi.org/10.1109/TPAMI.2016.2644615
46. Zhao H, Shi J, Qi X et al (2016) Pyramid scene parsing network. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR), Honolulu, HI, USA, 2017, pp 6230–6239. https://doi.org/10.1109/cvpr.2017.660

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.