A Novel Deep Learning Model For Medical Image Segmentation With Convolutional Neural Network and Transformer
https://doi.org/10.1007/s12539-023-00585-9
Received: 9 March 2023 / Revised: 26 July 2023 / Accepted: 1 August 2023 / Published online: 4 September 2023
© International Association of Scientists in the Interdisciplinary Areas 2023
Abstract
Accurate segmentation of medical images is essential for clinical decision-making, and deep learning techniques have shown remarkable results in this area. However, existing segmentation models that combine transformer and convolutional neural networks often use skip connections in U-shaped networks, which may limit their ability to capture contextual information in medical images. To address this limitation, we propose a coordinated mobile and residual transformer UNet (MRC-TransUNet) that combines the strengths of transformer and UNet architectures. Our approach uses a lightweight MR-ViT to address the semantic gap and a reciprocal attention module to compensate for the potential loss of details. To better explore long-range contextual information, we use skip connections only in the first layer and add MR-ViT and RPA modules in the subsequent downsampling layers. In our study, we evaluated the effectiveness of our proposed method on three different medical image segmentation datasets, namely, breast, brain, and lung. Our proposed method outperformed state-of-the-art methods in terms of various evaluation metrics, including the Dice coefficient and Hausdorff distance. These results demonstrate that our proposed method can significantly improve the accuracy of medical image segmentation and has the potential for clinical applications.
Graphical Abstract
Illustration of the proposed MRC-TransUNet. For the input medical images, we first subject them to an intrinsic downsampling operation and then replace the original skip connection structure using MR-ViT. The output feature representations at different scales are fused by the RPA module. Finally, an upsampling operation is performed to fuse the features to restore them to the same resolution as the input image.

* Hua Bai
[email protected]
* Baoshan Sun
[email protected]

1 Tianjin Key Laboratory of Optoelectronic Detection Technology and Systems, School of Electronic and Information Engineering, Tiangong University, Tianjin 300387, China
2 School of Computer Science and Technology, Tiangong University, Tianjin 300387, China
3 College of Management and Economics, Tianjin University, Tianjin 300072, China
Keywords Deep learning · Medical image segmentation · Transformer · UNet · Attention mechanism
In computer vision (CV), transformers have been explored as a more flexible alternative to CNNs. Transformers are commonly used in natural language processing (NLP) [11]. Recent advances in deep learning have led to the emergence of vision transformers (ViTs) [12]. Similar to NLP tasks, transformers use 2D image patches with positional embeddings as input sequences in the CNN domain and apply global self-attention to the entire image, which can address issues such as the lack of global information in CNNs.

Recent research has combined transformers with other architectures to improve medical image segmentation. For example, TransUNet [13] combines ViT and UNet, using transformers to extract feature information after convolution through multiple layers. This feature information is then fused with a U-shaped structure of skip connections to recover low-level and high-level information. The Medical Transformer (MedT) model, proposed in [14], introduces a gated axial attention mechanism that enhances existing transformer architectures by incorporating a gating mechanism within the self-attention module. This modification allows the model to effectively handle datasets of any size. Swin-UNet [15] is based entirely on transformers for 2D medical image segmentation, effectively combining the inductive bias of spatial locality, hierarchical structure, and translational invariance. Exploring long-range dependencies and global contextual features through self-attention is at the core of the transformer-based approach. While these developments in transformer-based approaches are encouraging for the development of medical segmentation tasks, some limitations still remain.

The existing models that combine transformer and convolutional neural network for medical image segmentation often overlook a critical issue: the global multilevel modeling problem of skip connections in U-shaped networks. In the current U-shaped network architecture, skip connections are limited in their ability to capture larger-scale contextual information and global dependencies beyond adjacent levels. While they effectively propagate and utilize local features, skip connections struggle to capture long-range dependencies and global contextual information, which can restrict the model's ability to model complex scenes and subtle features. This limitation may prevent U-shaped networks from effectively leveraging global information in some tasks and can negatively impact their performance. Therefore, there is a need to develop novel skip connection schemes that can perform excellent global multilevel modeling to enhance the performance of U-shaped networks in medical image segmentation.

To find a solution that can replace skip connections, Wang et al. [16] proposed UCTransNet, which uses the CTrans module to solve the semantic gap problem arising from skip connections. Specifically, CTrans used the long-dependency modeling advantage of the channel cross-fusion module (CCT) to fuse multi-scale encoder features and a cross-attention module (CCA) to disambiguate with decoder features. To overcome the above challenges and to exploit the ideas of UCTransNet, we propose a coordinated mobile and residual transformer UNet (MRC-TransUNet), a framework designed to optimize U-shaped structures for automatic medical image segmentation.

MRC-TransUNet is a deep learning network for medical image segmentation, based on the U-shaped architecture. Specifically, we first propose the mobile and residual visual transformer (MR-ViT), which combines MobileViT and ResNet [17], to replace skip connections except for the first layer. This method is designed to capture the global information of the low-level features downsampled at each layer. It is noteworthy that other CNNs combined with ViT are large-scale and yield several times the number of model parameters of the benchmark network. In contrast, MR-ViT not only has a low number of parameters but can also effectively decode the local information together with the global information to learn the global representation from different perspectives [18]. On the other hand, we propose a reciprocal attention (RPA) module for solving the problem of detail loss in multi-scale feature fusion to further improve the model segmentation capability of the encoding–decoding process. Through these refinements, MRC-TransUNet effectively enhances performance, resulting in more accurate medical image segmentation. Our main achievements can be summarized as follows.

1. In this paper, we propose a transformer-based MRC-TransUNet framework, which effectively combines the advantages of the transformer as well as the CNN, reduces the number of parameters of a traditional transformer combined with a CNN, and enhances the flexibility of segmentation. The core idea of this method is to replace skip connections with transformers, which results in more accurate segmentation of medical images (a structural sketch in code follows this list).
2. The novel MR-ViT module and RPA module are proposed to build global dependencies between different scales for downsampled feature extraction, which can focus on more directional information as well as location information, thus effectively fusing the contextual information between multi-scale features.
3. Our method is a more appropriate combination of UNet and transformer, with a lower number of parameters and higher performance. Compared with other advanced segmentation methods, it achieves better experimental results on the three datasets.
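The core idea — keeping a plain skip connection only at the first layer and passing every deeper skip feature through a lightweight transformer before fusion — can be illustrated with a minimal PyTorch sketch. The class and module names below are placeholders we introduce for illustration; the encoder, decoder, and skip-refinement modules are assumed to be supplied by the user, and this is not the authors' released implementation.

    import torch
    from torch import nn

    class TransformerSkipUNetSketch(nn.Module):
        """Illustrative U-shaped network in which only the first (highest-resolution)
        skip connection is kept as-is, while deeper skip features are re-encoded by
        transformer-style modules (e.g. MR-ViT-like blocks) before fusion."""

        def __init__(self, encoder_stages, decoder_stages, skip_transformers):
            super().__init__()
            self.encoder_stages = nn.ModuleList(encoder_stages)        # downsampling blocks
            self.decoder_stages = nn.ModuleList(decoder_stages)        # upsample-and-fuse blocks
            self.skip_transformers = nn.ModuleList(skip_transformers)  # one per deep level

        def forward(self, x):
            skips = []
            for stage in self.encoder_stages:
                x = stage(x)
                skips.append(x)
            x = skips.pop()  # bottleneck features
            # plain skip at the first level, transformer-refined skips at deeper levels
            refined = [skips[0]] + [t(f) for t, f in zip(self.skip_transformers, skips[1:])]
            for stage, skip in zip(self.decoder_stages, reversed(refined)):
                x = stage(x, skip)  # each decoder stage upsamples x and fuses it with the skip
            return x

Each decoder stage is assumed to upsample its input and fuse it with the corresponding skip feature, for example through an attention-based fusion such as the RPA module described in Sect. 3.3.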
2 Related Work

This section provides a summary of commonly used CNN-based methods in medical segmentation, as well as the application of transformers in computer vision. Additionally, we discuss the use of skip connections in segmentation models.

2.2 Visual Transformer

Vision transformers are used directly on the whole image through a transformer with global self-attention, which has achieved great success in computer vision. The self-attention mechanism has been demonstrated to significantly improve the ability of these models to capture global context and long-range dependencies in images and videos [27–30]. Especially in medical image processing, vision transformers have shown great potential for improving segmentation performance. Swin-UNet [15] is based entirely on the transformer for 2D medical image segmentation, but the transformer is not sufficient for low-level detail information acquisition. Therefore, the combination of the two has been widely studied. TransUNet [13] combined the transformer with UNet, performed feature extraction by CNN, and then fed it into the transformer module for decoding, proving that the transformer can be employed as a powerful encoder for medical image segmentation. MedT [14] used shallow global branching and deep local branching to operate on the patches of medical images, and this strategy can improve the segmentation results. UCTransNet [16] was proposed in 2022; it uses CTrans instead of UNet's skip connections and achieves multi-scale channel cross-fusion.

2.3 Skip Connections in U-Shaped Structures

Current medical segmentation methods based on UNet [7] focus on the improvement of the encoder and decoder to obtain more accurate feature information. For example, TransUNet [13] modified the encoder in the form of a CNN combined with a transformer, Swin-UNet [15] used a transformer instead of a convolutional layer in a U-shaped structure, and Res-UNet [19] and Attention-UNet [10] highlighted salient local regions by introducing an attention mechanism for specific features. It has since been shown that the skip connections of UNet are a potential factor limiting the segmentation capability (e.g., UNet++ [8] redesigned the skip connections so that the sub-networks of decoders can aggregate features at different scales and the network becomes more flexible, and UCTransNet [16] found experimentally that the features of encoders and decoders are inconsistent). In some cases, the shallow features of encoders and decoders may have a semantic gap that simple skip connections cannot bridge. This can result in impaired final performance due to the shallow features containing less semantic information.

3.1 MRC-TransUNet
3.2 MR-ViT

X_T(p) = \mathrm{Transformer}(X_T(p)), \quad 1 \le p \le P. \quad (1)
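Equation (1) applies a transformer to the token sequence formed at each intra-patch position p, in the spirit of the MobileViT global-representation step [18]. The following sketch shows one way such an unfold–transform–fold operation can be written in PyTorch; the patch size, tensor layout, and the example encoder are illustrative assumptions rather than the exact MR-ViT implementation.

    import torch
    from torch import nn

    def patchwise_transformer(x, transformer, patch=2):
        """Apply `transformer` across patches for every intra-patch position p (Eq. 1).
        `x` is (B, C, H, W) with H and W divisible by `patch`;
        `transformer` maps (batch, sequence, C) -> (batch, sequence, C)."""
        b, c, h, w = x.shape
        ph = pw = patch
        nh, nw = h // ph, w // pw
        # unfold: group pixels by their position inside each patch
        x = x.reshape(b, c, nh, ph, nw, pw)
        x = x.permute(0, 3, 5, 2, 4, 1).reshape(b * ph * pw, nh * nw, c)
        x = transformer(x)  # global self-attention over the patch sequence
        # fold back to a (B, C, H, W) feature map
        x = x.reshape(b, ph, pw, nh, nw, c).permute(0, 5, 3, 1, 4, 2)
        return x.reshape(b, c, h, w)

    # Example (illustrative only): a small encoder with d_model equal to the channel count C
    # encoder_layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
    # transformer = nn.TransformerEncoder(encoder_layer, num_layers=2)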
3.2.3 Information Circulation Block

To capture local features more accurately and reduce the number of parameters as much as possible, we introduce a residual module in the information circulation block and perform dimensionality reduction by point-by-point convolution. After that, the feature map sequentially passes through 3 × 3 convolution layers and uses a 1 × 1 convolution to recover the original number of channels. The obtained tensor is added to the input tensor to achieve the local information fusion of low-level features and high-level features.

3.2.4 Fusion

In the fusion module, we concatenate the output of the information circulation block with the output of the global representations, because we believe that the information circulation block contains richer local features. Afterward, a 3 × 3 convolution is used to reduce the dimensionality to keep the input and output of the same size; unlike MobileViTv3 [38], we add the features of the local representations to the output of the 3 × 3 convolution to optimize the deep architecture.

3.3 RPA Module

As shown in Fig. 3, the RPA module obtains two one-dimensional directional feature weights and one two-dimensional spatial feature weight, after which the two attentional features are fused to obtain the output map F_out ∈ ℝ^{C×H×W}.

The RPA module consists of two branches, each of which decomposes the channel attention into two 1-dimensional features. These features aggregate the input features along two spatial directions [39], enabling the module to capture long-range dependencies along one direction while preserving precise positional information along the other.
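A compact PyTorch sketch of such a reciprocal-attention-style block is given below. The concrete layer choices — coordinate-attention-style directional pooling [39] for the two 1D weights, a 7 × 7 convolution over channel statistics for the 2D spatial weight, and fusion by weighting the input and summing — are our assumptions for illustration and not necessarily the authors' exact RPA design.

    import torch
    from torch import nn

    class RPASketch(nn.Module):
        """Sketch of a reciprocal-attention-style block: two 1D directional weights
        (coordinate-attention style [39]) plus one 2D spatial weight, fused into
        an output of the same shape, F_out in R^{C x H x W}."""

        def __init__(self, channels, reduction=16):
            super().__init__()
            mid = max(8, channels // reduction)
            self.squeeze = nn.Sequential(
                nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
            self.conv_h = nn.Conv2d(mid, channels, 1)                 # weight along height
            self.conv_w = nn.Conv2d(mid, channels, 1)                 # weight along width
            self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)  # 2D spatial weight

        def forward(self, x):
            b, c, h, w = x.shape
            # directional branch: pool along each spatial direction, then split
            x_h = x.mean(dim=3, keepdim=True)                         # B x C x H x 1
            x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)     # B x C x W x 1
            y = self.squeeze(torch.cat([x_h, x_w], dim=2))            # B x mid x (H+W) x 1
            y_h, y_w = torch.split(y, [h, w], dim=2)
            a_h = torch.sigmoid(self.conv_h(y_h))                      # B x C x H x 1
            a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # B x C x 1 x W
            directional = x * a_h * a_w
            # spatial branch: 2D weight from channel-wise mean and max statistics
            stats = torch.cat([x.mean(dim=1, keepdim=True),
                               x.max(dim=1, keepdim=True).values], dim=1)
            spatial = x * torch.sigmoid(self.spatial(stats))
            return directional + spatial                              # fused output F_out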
In addition, we used a private dataset from brain tumor patients who received treatment at Tianjin Huanhu Hospital between 2016 and 2022. Patients were retrospectively selected based on specific inclusion criteria, including pathologically confirmed intracranial meningioma, histopathological grading following WHO guidelines, and absence of cerebral softening and severe or diffuse brain atrophy before the MRI scan. This study was approved by the Ethics Committee of Tianjin Huanhu Hospital. The dataset distribution is shown in Table 3 ([Online 2]. https://www.kaggle.com/datasets/tinashri/brain-tumor-segmentation-datasets).

Table 3 Division of the training and testing set on the brain tumor images dataset

              Public   Private   Total
Training set  2244     646       2890
Testing set   403      211       614

4.1.3 COVID-19 Radiography Database

A collaboration between researchers from Qatar University in Doha, Qatar, and the University of Dhaka in Bangladesh, along with medical doctors from Pakistan and Malaysia, resulted in the creation of a database containing chest X-ray images of individuals who were COVID-19 positive, as well as images of those with normal and viral pneumonia [41, 42]. The dataset was categorized following the distribution outlined in Table 4 ([Online 3]. https://www.kaggle.com/datasets/tawsifurrahman/covid19-radiography-database).

Table 4 Division of the training and testing set on the COVID-19 radiography database

              COVID-19   Normal   Viral pneumonia   Total
Training set  2582       7280     960               10,822
Testing set   1034       2912     385               4331

4.2 Implementation Details

The experimental settings were chosen to balance model performance with the size of our dataset and computational resources. In our experiments, the proposed MRC-TransUNet as well as the other comparison models are trained with a batch size of 16 for 500 epochs using the Adam optimizer [43] with an initial learning rate of 0.001. In addition, our experiments use an early stopping mechanism for tuning. All networks were trained on an NVIDIA Tesla V100 GPU using PyTorch (version 1.7.0). The resulting feature maps were binarized to obtain the final binary mask predictions. It is worth noting that no data augmentation was performed in any of our experiments, for the following reasons:

1. Augmentation techniques are often applied by rotating, flipping, cropping, scaling, and other transformations to the images to expand the dataset. However, in medical images, these transformations may result in loss of important information, such as the location, shape, and size of lesions.
2. Accurate diagnosis and treatment in medical image analysis depend on the accurate identification and analysis of subtle features and structures in the images. Therefore, accuracy is more important than data quantity for medical images. If augmentation techniques introduce noise or other forms of uncertainty, they may affect the accuracy of the model.
3. We used mostly open-source datasets that have been curated and validated by experts in the field, which do not require extensive preprocessing or augmentation. These datasets were already of high quality and contained sufficient diversity to represent the underlying population.

During training, we utilized the Binary Cross-Entropy loss as the loss function. The objective of this loss function is to minimize the discrepancy between the predicted probability distribution and the actual probability distribution of the target. This loss function quantifies the dissimilarity between the two probability distributions. In particular, its mathematical definition is given by

\mathrm{loss} = -\left( y \log(p(x)) + (1 - y) \log(1 - p(x)) \right), \quad (10)

where p(x) is the model output and y is the true label.
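A minimal PyTorch sketch of this training configuration (Adam with learning rate 0.001, batch size 16, up to 500 epochs, BCE loss as in Eq. (10), binarized predictions at inference) is shown below. The model and data-loader objects are placeholders, the model is assumed to output probabilities in [0, 1], and the patience-based early-stopping rule is our assumption, since the exact criterion is not specified.

    import torch
    from torch import nn, optim

    def train_model(model, train_loader, val_loader, device, epochs=500, patience=20):
        criterion = nn.BCELoss()                       # Eq. (10); expects probabilities
        optimizer = optim.Adam(model.parameters(), lr=1e-3)
        best_val, wait = float("inf"), 0
        for epoch in range(epochs):
            model.train()
            for images, masks in train_loader:         # loader built with batch_size=16
                images, masks = images.to(device), masks.to(device)
                optimizer.zero_grad()
                loss = criterion(model(images), masks)
                loss.backward()
                optimizer.step()
            model.eval()                               # early stopping on validation loss
            with torch.no_grad():
                val_loss = sum(criterion(model(x.to(device)), y.to(device)).item()
                               for x, y in val_loader) / len(val_loader)
            if val_loss < best_val:
                best_val, wait = val_loss, 0
            else:
                wait += 1
                if wait >= patience:
                    break

    # At test time, probability maps are thresholded to produce binary masks:
    # binary_mask = (model(images) > 0.5).float()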
The values of the Dice score are in the range of [0, 1], where 0 indicates no overlap between segmentation results and labeled maps and 1 indicates complete overlap. TP represents true positives, FP represents false positives, TN represents true negatives, and FN represents false negatives.

IoU is another evaluation metric in the segmentation model that measures the degree of overlap between the ground truth and the predicted segmentation. In layman's terms, IoU represents the intersection-over-union ratio between two samples, which is defined as

\mathrm{IoU} = \frac{P \cap T}{P \cup T}. \quad (12)

The range interval is still [0, 1], and the larger the value, the better the segmentation effect. P denotes the predicted segmentation result; T denotes the manual labeling result.

The minimum distance of any vertex v to S(A) is defined as

d(v, S(A)) = \min_{s_A \in S(A)} \lVert v - s_A \rVert, \quad (13)

where S(A) represents the set of surface vertices of a 3D volume A and ‖·‖ denotes the Euclidean distance, with a greater value indicating a larger distance.

The Hausdorff distance [44] is a measure that describes the degree of similarity between two sets of points. It is defined as

\mathrm{HD}(G, R) = \max\left\{ \sup_{s_G} d(s_G, S(R)),\ \sup_{s_R} d(s_R, S(G)) \right\}, \quad (14)

where s_G and s_R refer to surface vertices in the automated segmentation result R and the corresponding ground truth segmentation G, respectively, and the symbol "sup" denotes the supremum.

The ASD is a metric that quantifies the similarity between two binary masks (G and R) by computing the average Euclidean distance between their surface vertices, which is defined as

\mathrm{ASD}(G, R) = \frac{\sum_{s_G} d(s_G, S(R)) + \sum_{s_R} d(s_R, S(G))}{\lvert S(G) \rvert + \lvert S(R) \rvert}. \quad (15)

5 Results and Discussion

To verify the superiority of our proposed MRC-TransUNet, we compared it with several state-of-the-art medical image segmentation methods, including SegNet [45], UNet [7], UNet++ [8], PSPNet [46], CCNet [24], UCTransNet [16], and TransUNet [13].

5.1 Results on Breast Ultrasound Images Dataset

5.1.1 Qualitative Comparison

Figure 4 shows a qualitative visual comparison of the segmentation results on the Breast Ultrasound Images Dataset using the proposed method and other state-of-the-art methods. We can observe that our method outperforms the other methods in breast cancer segmentation. Breast cancers in ultrasound images are easily confused with surrounding tissues and structures, as shown in the first row of Fig. 4, where SegNet, UNet, PSPNet, and TransUNet misidentified the excess tissue, producing serious false positives. Segmentation of small targets is a very challenging task, and the visual comparison results in the second column of Fig. 4 show that many segmentation models are error prone, such as UNet, PSPNet, CCNet, and TransUNet. In contrast, our method has a strong capability in identifying and segmenting small targets.

Fig. 4 Visual comparison of the segmentation performance of different models on Breast Ultrasound Images Dataset. To enhance the visualization of differences between segmentation predictions and ground truths, we use suitable boxes to highlight the key regions
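For reference, the evaluation metrics defined in Eqs. (12)–(15), together with the DSC values reported below, can be computed from a predicted binary mask and a ground-truth mask as in the following NumPy/SciPy sketch. Extracting surfaces with a binary erosion is one common convention rather than necessarily the authors' exact procedure, and both masks are assumed to be non-empty.

    import numpy as np
    from scipy import ndimage

    def dice_iou(pred, gt):
        pred, gt = pred.astype(bool), gt.astype(bool)
        inter = np.logical_and(pred, gt).sum()
        dice = 2.0 * inter / (pred.sum() + gt.sum() + 1e-8)
        iou = inter / (np.logical_or(pred, gt).sum() + 1e-8)   # Eq. (12)
        return dice, iou

    def surface_distances(a, b, spacing=1.0):
        # surface voxels = mask minus its erosion; point-to-surface distances as in Eq. (13)
        a_border = a ^ ndimage.binary_erosion(a)
        b_border = b ^ ndimage.binary_erosion(b)
        dt_to_b = ndimage.distance_transform_edt(~b_border, sampling=spacing)
        dt_to_a = ndimage.distance_transform_edt(~a_border, sampling=spacing)
        return dt_to_b[a_border], dt_to_a[b_border]

    def hd95_asd(pred, gt):
        d_ab, d_ba = surface_distances(pred.astype(bool), gt.astype(bool))
        all_d = np.concatenate([d_ab, d_ba])
        hd95 = np.percentile(all_d, 95)   # 95th-percentile variant of Eq. (14)
        asd = all_d.mean()                # Eq. (15)
        return hd95, asd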
Another typical example is the malignant breast cancer with an unclear boundary (shown in the third column), which all models have difficulty recognizing. It can be found from the results that, although some details still cannot be recognized well (red marked part), the proposed method is closest to the ground truth and can segment the target effectively compared with the other methods.

5.1.2 Quantitative Comparison

The quantitative comparison of segmentation results for the Breast Ultrasound Images Dataset, the dataset with the smallest sample size of the three datasets, is shown in Table 5. From the results, we can find that the proposed method, MRC-TransUNet, obtained 78.8% DSC in breast cancer segmentation, which is better than all other state-of-the-art methods. In addition, the proposed method also achieved optimal HD95 and ASD (15.0148 mm and 4.3169 mm, respectively).

Table 5 Results of comparisons on the breast ultrasound images dataset

Method            DSC (%) ↑   IoU (%) ↑   HD95 (mm) ↓   ASD (mm) ↓
SegNet [45]       72.97       57.44       17.8096       5.5794
UNet [7]          78.07       64.02       15.9410       4.6369
UNet++ [8]        75.73       60.94       15.7513       4.6903
PSPNet [46]       74.41       59.25       17.7525       5.8106
CCNet [24]        72.97       57.44       17.7510       6.7503
UCTransNet [16]   76.34       61.74       15.5047       5.0491
TransUNet [13]    71.52       55.66       19.6550       5.5359
Ours              78.80       65.02       15.0148       4.3169

Bold indicates the best data for each column

Compared with the baseline network UNet (78.07% and 64.02% for DSC and IoU, respectively), the proposed method (MRC-TransUNet) achieved improvements of 0.73% and 1% in these two evaluation metrics, respectively. We found that UNet, UNet++, and UCTransNet outperformed other methods (such as SegNet, PSPNet, and CCNet) in all evaluation metrics, indicating that the U-shaped encoding–decoding structure could effectively extract lesion features of breast cancer and achieve accurate reconstruction. However, we observed that TransUNet, which combines the UNet and Transformer structures, showed the worst segmentation results in the three evaluation metrics (71.52% for DSC, 55.66% for IoU, and 19.6550 mm and 5.5359 mm for HD95 and ASD, respectively). This may be attributed to the fact that the transformer model is more difficult to train than CNNs without a pre-trained model and a large amount of training data, as reported in previous studies [18]. In this dataset, we only used 452 images as training data, so simply adding a transformer to UNet may not be sufficient to learn the spatial information of breast cancer, resulting in the poor segmentation results of TransUNet on this dataset.

We conducted a separate experiment to compare the results of different methods on benign and malignant images, and the findings are presented in Table 6. Benign lesions typically have clear boundaries, which enable all models to achieve relatively good results. However, due to the unclear boundaries and irregular shapes of malignant lesions, the overall segmentation results of MRC-TransUNet on malignant lesions (DSC of 74.94% and IoU of 59.92%) were slightly lower than those of UNet by less than 1%. These results suggest that, while MRC-TransUNet uses a lightweight transformer to compensate for the disadvantages of the original structure, such as the large number of parameters and the loss of local details, its performance may be inferior to CNN-based models in segmentation tasks with very small training samples, such as the malignant dataset, which only contains 150 images.

Table 6 Comparative experimental results of different categories on the breast ultrasound images dataset

                  DSC (%) ↑            IoU (%) ↑            HD95 (mm) ↓           ASD (mm) ↓
Method            Benign   Malignant   Benign   Malignant   Benign    Malignant   Benign   Malignant
SegNet [45]       76.04    69.43       61.34    53.17       15.5883   22.3261     4.5062   7.6541
UNet [7]          80.55    75.29       67.44    60.37       14.0255   19.8359     4.1602   5.6060
UNet++ [8]        79.64    71.43       66.18    55.56       13.3265   20.6817     3.5532   7.2197
PSPNet [46]       78.02    70.49       63.96    54.43       16.1365   21.0383     4.8670   7.6497
CCNet [24]        75.36    69.84       60.46    53.66       15.7940   22.7763     5.6343   8.9498
UCTransNet [16]   79.76    72.46       66.34    56.82       12.6012   20.0085     3.9899   7.1494
TransUNet [13]    74.33    68.30       59.14    51.86       17.2406   24.5645     4.4657   7.6038
Ours              82.29    74.94       69.90    59.92       11.7911   18.5696     3.4030   5.8699

Our method preserves the original convolution operation in downsampling to capture rich local detail features, while utilizing a transformer with fewer parameters instead of skip connections to integrate local and global information features for improved segmentation performance without significantly increasing the number of parameters.
The results presented above demonstrate that MRC-TransUNet achieves good segmentation performance on small-sample breast cancer datasets.

Table 7 Results of comparisons on the brain tumor images dataset (Method; DSC (%) ↑; IoU (%) ↑; HD95 (mm) ↓; ASD (mm) ↓)

Fig. 5 Visual comparison of the segmentation performance of different models on Brain Tumor Images Dataset. To enhance the visualization of differences between segmentation predictions and ground truths, we use suitable boxes to highlight the key regions
Table 8 Comparative experimental results of different categories on the brain tumor images dataset

                  DSC (%) ↑          IoU (%) ↑          HD95 (mm) ↓        ASD (mm) ↓
Method            Public   Private   Public   Private   Public   Private   Public   Private
SegNet [45]       75.44    88.70     60.56    79.69     6.7303   3.3744    1.8457   0.9642
UNet [7]          83.36    89.83     71.20    81.54     4.9947   2.9624    1.3919   0.8151
UNet++ [8]        81.06    88.70     68.16    79.69     5.0991   3.2271    1.4899   0.8415
PSPNet [46]       78.96    89.19     65.24    80.49     5.4914   3.8285    1.5280   0.8854
CCNet [24]        75.64    89.36     60.83    80.77     6.3551   4.6679    1.8789   1.0851
UCTransNet [16]   82.95    90.41     70.87    82.49     5.3408   3.0153    1.3890   0.9035
TransUNet [13]    78.08    89.32     64.04    80.70     6.9516   3.7291    1.6902   1.0152
Ours              83.55    90.85     71.74    83.23     4.9677   2.7356    1.3641   0.6704
In the private dataset, compared to UNet, the proposed method achieves a 1.02% improvement in DSC, a 1.69% improvement in IoU, a 0.2268 mm improvement in HD95, and a 0.1447 mm improvement in ASD. These results further validate the good generalization ability of MRC-TransUNet for different medical image segmentation tasks.

5.3 Results on COVID-19 Radiography Database

The COVID-19 Radiography Database used in this paper contains 15,153 sample images of lungs. The large data volume can compensate for the lack of convolutional inductive bias in the transformer structure, while the CNN can obtain better representation learning capability on the large-scale dataset. On this dataset, both CNN and transformer models show excellent segmentation and obtain significant performance improvements. It is hard to see the gap between the segmentation effects of different models in qualitative comparison results, so we only show quantitative comparison results in this subsection.

Table 9 Results of comparisons on the COVID-19 radiography database

Method            DSC (%) ↑   IoU (%) ↑   HD95 (mm) ↓   ASD (mm) ↓
SegNet [45]       95.09       90.64       1.8158        0.7551
UNet [7]          95.48       91.35       1.4601        0.6869
UNet++ [8]        95.48       91.35       1.3846        0.6863
PSPNet [46]       95.26       90.95       2.0291        0.7405
CCNet [24]        95.31       91.03       2.0815        0.7348
UCTransNet [16]   95.48       91.35       1.3931        0.6848
TransUNet [13]    95.46       91.31       1.4667        0.6912
Ours              95.70       91.75       1.3544        0.6840

Bold indicates the best data for each column

Table 9 shows the quantitative comparison results of different methods on the COVID-19 Radiography Database. The proposed MRC-TransUNet outperforms the other models by using a transformer instead of skip connections to fuse multi-scale features and by focusing on feature regions with an effective attention module. The DSC, IoU, HD95, and ASD of the proposed method are 95.7%, 91.75%, 1.3544 mm, and 0.684 mm, respectively. Compared with other algorithms, the proposed method improves the DSC by 0.61% (SegNet), 0.22% (UNet), 0.22% (UNet++), 0.44% (PSPNet), 0.39% (CCNet), 0.22% (UCTransNet), and 0.24% (TransUNet); the IoU by 1.11% (SegNet), 0.4% (UNet), 0.4% (UNet++), 0.8% (PSPNet), 0.72% (CCNet), 0.4% (UCTransNet), and 0.44% (TransUNet); the HD95 by 0.4614 mm (SegNet), 0.1057 mm (UNet), 0.0302 mm (UNet++), 0.6747 mm (PSPNet), 0.7271 mm (CCNet), 0.0387 mm (UCTransNet), and 0.1123 mm (TransUNet); and the ASD by 0.0711 mm (SegNet), 0.0029 mm (UNet), 0.0023 mm (UNet++), 0.0565 mm (PSPNet), 0.0508 mm (CCNet), 0.0008 mm (UCTransNet), and 0.0072 mm (TransUNet), respectively. In addition, the DSC of each model is higher than 95%, which reflects the balanced distribution of this dataset. UCTransNet is only lower than MRC-TransUNet in each evaluation metric, which proves the effectiveness of the transformer instead of skip connections. TransUNet shows good segmentation performance as the sample size of the dataset increases, and its performance on this dataset is better than SegNet, CCNet, and PSPNet, proving that ViTs can reach or even surpass the performance of CNNs on large-scale samples.

5.4 Results of Ablation Experiments

In this section, a further ablation study was performed on the different medical image segmentation datasets to evaluate the performance of each component of the proposed MRC-TransUNet. We used only DSC and HD95 as evaluation metrics. The experimental results are shown in Table 10, with Breast denoting the Breast Ultrasound Images Dataset, Brain denoting the Brain Tumor Images Dataset, and Lung denoting the COVID-19
Radiography Database. MVM and RPA represent the addition of only the MR-ViT module and only the reciprocal attention (RPA) module to the model, respectively. From Table 10, we obtain the following observations.

(1) By adding MR-ViT to replace the original skip connections, "MVM" achieves performance improvements of 0.5%, 0.08%, and 0.04% in DSC and of 0.2615 mm, 0.0489 mm, and 0.086 mm in HD95 on the three datasets, respectively. The method facilitates the fusion of multi-scale feature information compared to the common UNet.
(2) When we use the proposed RPA module, the DSC on the three datasets improves from 78.07%, 85.97%, and 95.48% to 78.19%, 85.98%, and 95.55%, respectively, and the HD95 improves from 15.941 mm, 4.2509 mm, and 1.4601 mm to 14.8684 mm, 4.2213 mm, and 1.3889 mm. This indicates that the proposed attention mechanism can improve the model's ability to focus on the region of interest and thus improve the segmentation performance.
(3) Our proposed method outperforms the other variants on all metrics and all datasets, which demonstrates the effectiveness of combining the two modules. Our results illustrate the effectiveness of using a lightweight transformer instead of skip connections to fuse multi-scale features in a U-shaped segmentation framework.

5.5 Parameters of Transformer-based Methods

We evaluated the segmentation performance and parameter efficiency of three transformer-based medical segmentation methods: UCTransNet, TransUNet, and our proposed MRC-TransUNet. Among the three, MRC-TransUNet had the lowest number of parameters, with only 112 MB, while TransUNet had the largest parameter count of 401 MB, and UCTransNet had 253 MB. Our proposed MRC-TransUNet demonstrated superior performance in medical image segmentation tasks.

MRC-TransUNet outperformed UCTransNet and TransUNet on the Breast Ultrasound Images dataset, achieving a higher DSC of 78.80% and IoU of 65.02%. Additionally, MRC-TransUNet had lower HD95 and ASD values of 15.0148 mm and 4.3169 mm, respectively, compared to UCTransNet and TransUNet. On the Brain Tumor Images dataset, MRC-TransUNet achieved a DSC of 86.14%, an IoU of 75.65%, a HD95 of 4.2007 mm, and an ASD of 1.1252 mm, which were again better than UCTransNet and TransUNet. On the COVID-19 Radiography dataset, MRC-TransUNet achieved a DSC of 95.70%, an IoU of 91.75%, a HD95 of 1.3544 mm, and an ASD of 0.6840 mm, which outperformed UCTransNet and TransUNet.

Overall, our proposed MRC-TransUNet achieved superior segmentation performance while using significantly fewer parameters, demonstrating its potential for practical medical image analysis.

6 Conclusion

In this paper, we propose MRC-TransUNet, a transformer-based framework that enhances the segmentation quality of medical images using a U-shaped architecture. Our approach replaces traditional skip connections with a lightweight transformer (MR-ViT) to enable more feature extraction layers in the encoder–decoder and better model global context. We also introduce a novel attention mechanism before multi-scale fusion, which effectively fuses multi-scale feature representations from the encoder by constructing attention weights for 1D location features and 2D spatial features to model long-range dependencies. Our approach outperforms previous state-of-the-art methods, as demonstrated through extensive experiments on three medical image segmentation tasks. However, we acknowledge the limitations of our study. Specifically, our proposed method did not show a significant improvement over UNet in quantitative evaluation. To improve the performance and applicability of our method, we plan to explore several avenues in future work. These include designing segmentation models based on small-sample medical images, investigating how visual transformers can better model long-range dependencies, and addressing other optimization challenges.
Acknowledgements The authors wish to express our sincere appreciation to Chenzi Zheng (College of Foreign Languages, Nankai University) for her valuable assistance in editing the English language of our research.

Funding This work was supported in part by the National Natural Science Foundation of China (61972456, 61173032) and the Tianjin Research Innovation Project for Postgraduate Students (2022SKY126).

Data availability The data are available from the corresponding author on reasonable request.

Declarations

Conflict of interest The authors declare no conflict of interest.

References

1. Shirokikh B, Dalechina A, Shevtsov A et al (2020) Deep learning for brain tumor segmentation in radiosurgery: prospective clinical evaluation. In: LNIP, BrainLes 2019, vol 11992, Springer, Cham, pp 119–128. https://doi.org/10.1007/978-3-030-46640-4_12
2. Otsu N (2007) A threshold selection method from gray-level histograms. IEEE Trans Syst Man Cybern 9(1):62–66. https://doi.org/10.1109/TSMC.1979.4310076
3. Prastawa M, Bullitt E, Gerig G (2009) Simulation of brain tumors in MR images for evaluation of segmentation efficacy. Med Image Anal 13(2):297–311. https://doi.org/10.1016/j.media.2008.11.002
4. Corso JJ, Sharon E, Dube S et al (2008) Efficient multilevel brain tumor segmentation with integrated Bayesian model classification. IEEE Trans Med Imaging 27(5):629–640. https://doi.org/10.1109/TMI.2007.912817
5. Lin AL, Chen BZ, Xu JY et al (2022) DS-TransUNet: dual swin transformer U-Net for medical image segmentation. IEEE Trans Instrum Meas 71:4005615. https://doi.org/10.1109/TIM.2022.3178991
6. Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. IEEE Trans Pattern Anal Mach Intell 39(4):640–651. https://doi.org/10.1109/TPAMI.2016.2572683
7. Ronneberger O, Fischer P, Brox T (2015) U-Net: convolutional networks for biomedical image segmentation. In: LNIP, MICCAI 2015, vol 9351, Springer, Cham, pp 234–241. https://doi.org/10.1007/978-3-319-24574-4_28
8. Zhou Z, Rahman Siddiquee MM, Tajbakhsh N et al (2018) UNet++: a nested U-Net architecture for medical image segmentation. In: LNIP, DLMIA 2018, vol 11045, Springer, Cham, pp 3–11. https://doi.org/10.1007/978-3-030-00889-5_1
9. Guerrero R, Qin C, Oktay O et al (2018) White matter hyperintensity and stroke lesion segmentation and differentiation using convolutional neural networks. Neuroimage-Clin 17:918–934. https://doi.org/10.1016/j.nicl.2017.12.022
10. Oktay O, Schlemper J, Folgoc LL et al (2018) Attention U-Net: learning where to look for the pancreas. https://doi.org/10.48550/arXiv.1804.03999
11. Vaswani A, Shazeer N, Parmar N et al (2017) Attention is all you need. Adv Neural Inf Process Syst. https://doi.org/10.48550/arXiv.1706.03762
12. Dosovitskiy A, Beyer L, Kolesnikov A et al (2020) An image is worth 16x16 words: transformers for image recognition at scale. https://doi.org/10.48550/arXiv.2010.11929
13. Chen J, Lu Y, Yu Q et al (2021) TransUNet: transformers make strong encoders for medical image segmentation. https://doi.org/10.48550/arXiv.2102.04306
14. Valanarasu J, Oza P, Hacihaliloglu I et al (2021) Medical transformer: gated axial-attention for medical image segmentation. https://doi.org/10.48550/arXiv.2102.10662
15. Cao H, Wang YY, Chen J et al (2021) Swin-Unet: Unet-like pure transformer for medical image segmentation. https://doi.org/10.48550/arXiv.2105.05537
16. Wang H, Cao P, Wang J et al (2022) UCTransNet: rethinking the skip connections in U-Net from a channel-wise perspective with transformer. Proc AAAI Conf Artif Intell 36(3):2441–2449. https://doi.org/10.48550/arXiv.2109.04335
17. He K, Zhang X, Ren S et al (2016) Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016, pp 770–778. https://doi.org/10.1109/CVPR.2016.90
18. Mehta S, Rastegari M (2021) MobileViT: light-weight, general-purpose, and mobile-friendly vision transformer. https://doi.org/10.48550/arXiv.2110.02178
19. Xiao X, Shen L, Luo Z et al (2018) Weighted Res-UNet for high-quality retina vessel segmentation. In: 2018 9th International conference on information technology in medicine and education (ITME), Hangzhou, China, 2018, pp 327–331. https://doi.org/10.1109/itme.2018.00080
20. Alom MZ, Hasan M, Yakopcic C et al (2018) Recurrent residual convolutional neural network based on U-Net (R2U-Net) for medical image segmentation. https://doi.org/10.48550/arXiv.1802.06955
21. Fan D-P, Ji GP, Zhou T et al (2020) PraNet: parallel reverse attention network for polyp segmentation. In: Medical image computing and computer assisted intervention – MICCAI 2020: 23rd international conference, Lima, Peru. Proceedings, Part VI. Springer, Cham, pp 263–273. https://doi.org/10.48550/arXiv.2006.11392
22. Valanarasu JMJ, Sindagi VA, Hacihaliloglu I et al (2020) KiU-Net: towards accurate segmentation of biomedical images using over-complete representations. In: Medical image computing and computer assisted intervention – MICCAI 2020: 23rd international conference, Lima, Peru. Springer, Cham, pp 363–373. https://doi.org/10.1007/978-3-030-59719-1_36
23. Wang X, Girshick R, Gupta A et al (2018) Non-local neural networks. In: 2018 IEEE/CVF conference on computer vision and pattern recognition, pp 7794–7803. https://doi.org/10.1109/CVPR.2018.00813
24. Huang Z, Wang X, Huang L et al (2023) CCNet: criss-cross attention for semantic segmentation. Int Conf Comput Vis 45(6):6896–6908. https://doi.org/10.1109/TPAMI.2020.3007032
25. Li J, Huo HT, Li C et al (2021) Multigrained attention network for infrared and visible image fusion. IEEE Trans Instrum Meas 70:5002412. https://doi.org/10.1109/TIM.2020.3029360
26. Tang JH, Zou B, Li C et al (2021) Plane-wave image reconstruction via generative adversarial network and attention mechanism. IEEE Trans Instrum Meas 70:4505115. https://doi.org/10.1109/TIM.2021.3087819
27. Zhao R, Huang Z, Liu T et al (2021) Structure-enhanced attentive learning for spine segmentation from ultrasound volume projection images. In: IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, New York, pp 1195–1199. https://doi.org/10.1109/ICASSP39728.2021.9414658
28. Liu T, Zhang C, Lam KM et al (2022) Decouple and resolve: transformer-based models for online anomaly detection from weakly labeled videos. IEEE Trans Inf Forensics Secur 18:15–28. https://doi.org/10.1109/TIFS.2022.3216479
29. Li K, Wang Y, Zhang J et al (2023) UniFormer: unifying convolution and self-attention for visual recognition. IEEE Trans Pattern Anal Mach Intell 1–18. https://doi.org/10.1109/TPAMI.2023.3282631
30. Zhang Z, Zhang X, Yang Y et al (2023) Accurate segmentation algorithm of acoustic neuroma in the cerebellopontine angle based on ACP-TransUNet. Front Neurosci 17:1207149. https://doi.org/10.3389/fnins.2023.1207149
31. Drozdzal M, Vorontsov E, Chartrand G et al (2016) The importance of skip connections in biomedical image segmentation. In: LNIP, DLMIA 2016, vol 10008, Springer, Cham, pp 179–187. https://doi.org/10.1007/978-3-319-46976-8_19
32. Huang G, Liu Z, Laurens V et al (2016) Densely connected convolutional networks. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR), Honolulu, HI, USA, 2017, pp 2261–2269. https://doi.org/10.1109/CVPR.2017.243
33. Li X, Hao C, Qi X et al (2017) H-DenseUNet: hybrid densely connected UNet for liver and liver tumor segmentation from CT volumes. IEEE Trans Med Imaging 37(12):2663–2674. https://doi.org/10.1109/TMI.2018.2845918
34. Huang H, Lin L, Tong R et al (2020) UNet 3+: a full-scale connected UNet for medical image segmentation. In: ICASSP 2020 - 2020 IEEE international conference on acoustics, speech and signal processing (ICASSP), Barcelona, Spain, 2020, pp 1055–1059. https://doi.org/10.1109/ICASSP40776.2020.9053405
35. Ibtehaz N, Sohel Rahman M (2019) MultiResUNet: rethinking the U-Net architecture for multimodal biomedical image segmentation. Neural Netw 121:74–87. https://doi.org/10.1016/j.neunet.2019.08.025
36. Xiao T, Singh M, Mintun E et al (2021) Early convolutions help transformers see better. Adv Neural Inf Process Syst. https://doi.org/10.48550/arXiv.2106.14881
37. Graham B, El-Nouby A, Touvron H et al (2021) LeViT: a vision transformer in ConvNet's clothing for faster inference. https://doi.org/10.48550/arXiv.2104.01136
38. Wadekar SN, Chaurasia A (2022) MobileViTv3: mobile-friendly vision transformer with simple and effective fusion of local, global and input features. Preprint at https://arXiv.org/arXiv:2209.15159
39. Hou Q, Zhou D, Feng J (2021) Coordinate attention for efficient mobile network design. Comput Vis Pattern Recogn. https://doi.org/10.48550/arXiv.2103.02907
40. Al-Dhabyani W, Gomaa M, Khaled H et al (2019) Dataset of breast ultrasound images. Data Brief 28:104863. https://doi.org/10.1016/j.dib.2019.104863
41. Rahman T, Amith K, Yazan Q et al (2021) Exploring the effect of image enhancement techniques on COVID-19 detection using chest X-ray images. Comput Biol Med 132:104319. https://doi.org/10.1016/j.compbiomed.2021.104319
42. Chowdhury MEH, Rahman T, Khandakar A et al (2020) Can AI help in screening viral and COVID-19 pneumonia? IEEE Access 8:132665–132676. https://doi.org/10.1109/ACCESS.2020.3010287
43. Kingma D, Ba J (2014) Adam: a method for stochastic optimization. Preprint at https://arXiv.org/arXiv:1412.6980. https://doi.org/10.48550/arXiv.1412.6980
44. Beauchemin M, Thomson KP, Edwards G (1998) On the Hausdorff distance used for the evaluation of segmentation results. Can J Remote Sens 24(1):3–8. https://doi.org/10.1080/07038992.1998.10874685
45. Badrinarayanan V, Kendall A, Cipolla R (2017) SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans Pattern Anal Mach Intell 39(12):2481–2495. https://doi.org/10.1109/TPAMI.2016.2644615
46. Zhao H, Shi J, Qi X et al (2016) Pyramid scene parsing network. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR), Honolulu, HI, USA, 2017, pp 6230–6239. https://doi.org/10.1109/cvpr.2017.660

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.