
This article has been accepted for publication in IEEE Transactions on Radiation and Plasma Medical Sciences.

This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/TRPMS.2024.3477528

MaS-TransUNet: A Multi-attention Swin Transformer U-Net for Medical Image Segmentation
Ashwini Kumar Upadhyay and Ashish Kumar Bhandari

Abstract—U-shaped encoder-decoder models have excelled in automatic medical image segmentation due to their hierarchical feature learning capabilities, robustness, and upgradability. Purely CNN-based models are excellent at extracting local details but struggle with long-range dependencies, whereas transformer-based models excel in global context modeling but have higher data and computational requirements. Self-attention-based transformers and other attention mechanisms have been shown to enhance segmentation accuracy in the encoder-decoder framework. Drawing from these challenges and opportunities, we propose a novel Multi-attention Swin Transformer U-Net (MaS-TransUNet) model, incorporating self-attention, edge attention, channel attention, and feedback attention. MaS-TransUNet leverages the strengths of both CNNs and transformers within a U-shaped encoder-decoder framework. For self-attention, we developed modules using Swin Transformer blocks, offering hierarchical feature representations. We designed specialized modules, including an Edge Attention Module (EAM) to guide the network with edge information, a Feedback Attention Module (FAM) to utilize previous epoch segmentation masks for refining subsequent predictions, and a Channel Attention Module (CAM) to focus on relevant feature channels. We also introduced advanced data augmentation, regularizations, and an optimal training scheme for enhanced training. Comprehensive experiments across five diverse medical image segmentation datasets demonstrate that MaS-TransUNet significantly outperforms existing state-of-the-art methods while maintaining computational efficiency. It achieves the highest Dice scores of 0.903, 0.841, 0.908, 0.906, and 0.906 on TCGA-LGG Brain MRI, COVID-19 Lung CT, DSB-2018, Kvasir-SEG, and ISIC-2018 datasets, respectively. These results highlight the model's robustness and versatility, consistently delivering exceptional performance without modality-specific adaptations.

Index Terms—Attention, Deep Learning, Iterative Refinement, Medical Image Segmentation, Swin Transformer

I. INTRODUCTION

PRECISE lesion segmentation gives doctors the quantitative data needed to predict disease progression, evaluate treatment success, and make informed decisions with computer-based analysis [1]. Manual segmentation by experts is precise but time-consuming and costly for regular clinical use. Automated segmentation methods provide consistent, repeatable results, enhancing accuracy and efficiency in classifying and segmenting target regions. Therefore, automated medical image segmentation is highly valued in clinical practice and advanced medical research [2].

In the last decade, developments in CNNs have significantly advanced the field, with the U-Net based architectures being particularly successful in medical image segmentation [3], [4]. These architectures use an encoder-decoder framework, merging hierarchical features. However, convolution operations' locality biases and fixed receptive fields limit global context capture. Additionally, deeper CNN models require large datasets and compute power, making segmentation accuracy improvements challenging due to the scarcity of high-quality labeled medical image datasets [5]. The diverse nature of lesions in medical images challenges CNN-based methods, complicating the creation of adaptable guidelines for varying sizes, shapes, and textures [6].

However, integrating advanced attention mechanisms like self-attention, edge-attention, spatial attention, channel attention, and feedback attention can enhance these models' performance in medical image segmentation tasks [7], [8], [9], [10], [11]. Transformer [12] based models using self-attention have sparked significant discussion in the computer vision community. Initially designed for Natural Language Processing (NLP), transformers are now seen as versatile alternatives to CNNs for various vision tasks [6]. The Swin Transformer [13], designed specifically for vision, reduces the quadratic complexity of standard transformers and creates hierarchical feature maps through patch merging, improving pixel-level dense predictions.

To address the challenges faced by CNN-based models and leverage various attention mechanisms, we introduce MaS-TransUNet, a multi-attention model within a U-shaped encoder-decoder framework. This model combines self-attention (via Swin Transformers), edge attention, channel attention, and feedback attention in a CNN-Transformer hybrid. It aims to enhance the conventional U-shaped network for more precise and efficient medical image segmentation. The encoder of the MaS-TransUNet has a ResNet-50 [14] based backbone, modified by integrating newly developed modules for

This work did not involve human subjects or animals in its research.
Ashwini Kumar Upadhyay is with the Department of Electronics & Communication Engineering, National Institute of Technology Patna, and with the Department of Electronics Engineering, Rajkiya Engineering College Kannauj (E-mail: [email protected]).
Ashish Kumar Bhandari is with the Department of Electronics & Communication Engineering, National Institute of Technology Patna (E-mail: [email protected]).

Authorized licensed use limited to: National Institute of Technology Patna. Downloaded on October 24,2024 at 19:36:39 UTC from IEEE Xplore. Restrictions apply.
© 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

implementing different types of attention mechanisms. A Bottleneck Swin Transformer Module (BSTM) is developed to add self-attention to the feature representations from the encoder. A decoder is designed to reconstruct the segmented image from the encoded representations from the BSTM through a series of operations, including convolution, upsampling, skip-connection concatenation, and attention-based mechanisms.

We trained our model from scratch and tested it on five different medical image segmentation challenges, each represented by a dataset from a different modality. The challenges include early brain tumor segmentation in The Cancer Genome Atlas Low Grade Glioma (TCGA-LGG) dataset [15], lung infection segmentation in the COVID-19 CT lung lesion segmentation dataset [8], nuclei segmentation in the Data Science Bowl (DSB) 2018 dataset [16], polyp segmentation in the Kvasir-SEG endoscopic image dataset [17], and skin lesion segmentation in the International Skin Imaging Collaboration (ISIC) 2018 dataset [18]. Key contributions of this work are:
1) A novel Multi-attention U-shaped model called MaS-TransUNet is proposed that integrates four distinct attention mechanisms (self-attention, edge-attention, channel attention, and feedback-attention) through specially designed modules within a CNN-Swin Transformer hybrid setup. This network leverages the strengths of both CNNs and Swin Transformers, along with different attention mechanisms, to enhance segmentation accuracy by focusing on the most relevant parts of medical images.
2) To enhance training efficacy, we employed advanced data augmentation techniques to generate diverse examples, thereby reducing the risk of overfitting due to limited labeled medical data. Furthermore, we incorporated deep supervision with optimized training settings, and applied regularization techniques such as normalization, learning rate scheduling, dropout, and weight decay to improve generalizability and robustness.
3) We conducted extensive experiments on five different medical image segmentation challenges to demonstrate the superiority, generalizability and robustness of our MaS-TransUNet when compared to the other state-of-the-art models.

II. RELATED WORKS
We begin with reviewing some of the most common CNN-based methods used to segment medical images. Following this, we summarize recent research on how the use of attention mechanisms is transforming image segmentation, especially within the medical field. Here, we highlight the latest advancements in self-attention-based transformers, edge-attention, channel attention and feedback attention-based techniques for image segmentation.

A. CNN-based Methods
CNNs have been widely used in image analysis tasks in recent years [19], [20]. Modern image segmentation models are typically based on encoder-decoder structures similar to U-Net [21]. Nested U-Net [22] improved upon the popular U-Net architecture to enhance the segmentation accuracy by incorporating nested and dense skip pathways. R2-UNet [23] replaces regular convolution layers in U-Net with recurrent residual convolution layers that enable better feature representations by accumulating features over multiple recurrences. Jha et al. [24] stacked two slightly modified U-Nets in sequence, where the first U-Net's output is concatenated with the original input image and passed to the second U-Net, allowing for sequential refinement of segmentation results. Isensee et al. [25] developed an adaptive U-Net based model configuration that analyzes the given dataset and selects the most appropriate network architecture, data preprocessing and training setup. Their nnU-Net model consistently ranks among the top performers in medical image segmentation challenges. Tomar et al. [10] proposed a CNN based U-shaped network that used a feedback mechanism for biomedical image segmentation tasks.
CNNs are proficient at modelling spatial hierarchies and local patterns, yet they face significant challenges in capturing long-range contextual dependencies effectively.

B. Attention Mechanisms
A diverse range of attention mechanisms is employed in deep learning approaches for medical image segmentation because they can emphasize the most informative channels, highlight the most relevant parts, or capture long-range dependencies within images or feature representations [26]. For example, a criss-cross attention network in CCNet [27] efficiently extracts contextual information in images for improved segmentation. Reverse attention and edge attention modules are utilised in Inf-Net [8] for focusing on the boundaries of infected areas in the lung CT scans of coronavirus disease 2019 (COVID-19) patients. Oktay et al. [28] introduced attention gates in the skip-paths of U-Net to focus on the relevant regions within an image. Reverse attention is used in PraNet [29] to mine detailed boundary information for accurately segmenting polyps in colonoscopy images. MTANet [30] used a reverse addition attention module to pinpoint relevant regions for boundary detection in medical image segmentation tasks.
In our work, we employed a self-attention based Swin Transformer [13] block, an edge attention module, channel attention, and a feedback attention network. Consequently, related works pertaining to these methods are discussed below.

1) Transformer Based Methods
Motivated by the efficacy of the transformer [12] in numerous natural language processing (NLP) problems, there is a growing trend of employing transformers in computer vision tasks. This shift is attributed to the transformer's proficiency in capturing contextual cues through multihead self-attention (MSA), which makes it ideal for semantic segmentation tasks [31]. Vision Transformer (ViT) [32] takes images divided into patches (similar to words in a sentence) as inputs and applies self-attention to find relations within and between patches using a standard transformer. TransUNet [33] is the first model that uses transformers in medical image segmentation. It merged the strengths of CNNs and standard transformers in the encoder for extracting global contexts. Standard transformers suffer from quadratic computational complexity with respect to image size


due to their global self-attention calculations. To address this challenge, the Swin Transformer [13], designed specifically for vision tasks, introduced hierarchical feature maps and window-based self-attention, facilitating the incorporation of both local and broader contextual information. Swin-Unet [34] adopts the U-Net encoder-decoder framework but replaces convolutional layers with Swin Transformer blocks in a very light-weight configuration. DS-TransUNet [35] utilizes two Swin Transformer encoders with different patch sizes for dual-scale encoding and standard transformer-based interactive fusion modules to merge dual-scale features from the encoder layers and deliver them to the corresponding Swin Transformer-based decoder layers. SwinPA-Net [11] is a Swin Transformer-based network incorporating dense multiplicative connections and local pyramid attention modules to enhance multiscale feature aggregation for robust medical image segmentation across various tasks.
Xiao et al. [36] utilized a vision transformer (ViT) based encoder, along with a contrastive module for enhancing polyp segmentation. Through the analysis of recent works related to the use of transformers in medical image segmentation, it is evident that the use of transformers has been highly effective in a U-Net based framework, especially when they are used in the encoder part [37]. Models combining CNNs and Swin transformers have been found to be very effective in medical image segmentation tasks as they combine the advantages of CNNs and Transformers [33], [37], [38].

2) Edge Attention and Channel Attention
Numerous studies have demonstrated that information on boundaries and edges of targets can offer valuable constraints to direct feature extraction for segmentation tasks [8], [29], [39], [40]. ET-Net [39] extracts edge features from early layers of the encoding network for precise localization of boundaries and regions of interest (ROI) in medical images. Inf-Net [8] introduced an edge-attention module at the initial layers of its encoder network for learning edge-attention representations. PraNet [29] includes a reverse attention mechanism to highlight boundary and edge information, boosting polyp segmentation performance. MEA-Net [40] proposes a multilayer edge attention module designed to refine boundary delineation progressively within a U-Net-like architecture.
Several works pertaining to medical image segmentation have implemented channel attention to focus on more relevant feature channels. Lei et al. [7] developed a self-attention mechanism which used spatial and channel attention modules for improved segmentation in breast ultrasound. Hu et al. [9] applied a combination of spatial attention and channel attention in the segmentation of lung tumors. Yuan et al. [38] applied channel attention to the features obtained from transformers in a CNN-transformer hybrid model for medical image segmentation.

3) Feedback Attention and Iterative Refinement
A feedback mechanism allows the model to refine its segmentation predictions iteratively by incorporating feedback from previous iterations, enabling it to gradually improve the accuracy of segmentation results [10], [41], [42], [43]. G-FRNet [41] progressively improves its segmentation prediction over multiple refinement stages through feedback gate units. Shibuya et al. [42] introduced a feedback loop to a standard U-Net architecture. This feedback loop enhances the segmentation process with iterative refinement. Mosinska et al. [43] proposed an iterative refinement pipeline with the introduction of a new loss term specifically developed to calculate the topological resemblance between the predicted and ground truth delineations. Output from the previous step is used as input, allowing the model to progressively improve its predictions. FANet [10] introduces a feedback attention network that utilizes predicted masks from previous epochs to enhance the model's segmentation performance. Additionally, the feedback network enables iterative refinement of masks during testing.

III. THE PROPOSED METHOD
Here, we offer a comprehensive description of the novel Multi-attention Swin Transformer U-Net (MaS-TransUNet). We begin with a concise overview of the MaS-TransUNet architecture, followed by a description of the encoder, Swin transformer-based modules, and various attention modules developed and seamlessly integrated into the model to focus on the pertinent aspects of the feature representations. Finally, we discuss the decoder network and an optimal training configuration adopted to enhance feature learning.

A. Overview of the MaS-TransUNet
The MaS-TransUNet, shown in Fig. 1, is a novel U-shaped model incorporating four attention mechanisms: self-attention, edge-attention, feedback-attention, and channel-attention within a hybrid CNN-Transformer setup. Self-attention is applied to various parts of the network using specially designed modules that utilize Swin Transformer Blocks (STBs) [13], which are designed to be computationally efficient compared to traditional transformers. STBs utilize hierarchical feature representation and a window-based approach that significantly reduces the overall computational cost. Residual Swin Transformer Modules (RSTMs) are integrated into the encoder, while a Bottleneck Swin Transformer Module (BSTM) enhances encoder representations before passing them to the decoder. A Swin Decoder Module (SDM) further refines features in the decoder. Various modules are designed to provide distinct attention types: an Edge Attention Module (EAM) guides initial encoder layers with edge information, each ResNet-50 [14] unit in the encoder includes a Channel Attention Module (CAM) to highlight relevant channels, and Feedback Attention Modules (FAMs) in both contracting and expansive paths use previous epoch segmentation masks to prune subsequent ones. The model's training effectiveness is bolstered by advanced data augmentation techniques, deep supervision, regularizations, and an optimal combination of optimizer and loss functions.

B. Encoder
The encoder is a hybrid network, combining Swin Transformer [13] based modules and a CNN based ResNet-50 [14] backbone, enhanced with newly designed attention modules. RSTMs, CAMs, and FAM are integrated into the ResNet-50 structure, while an external EAM extracts low-level, moderate-resolution features from the encoder to produce edge maps as shown in Fig. 1. Features of different scales from the different encoder layers are passed through skip connections to the corresponding decoder layers to preserve spatial information, improve gradient flow, facilitate feature fusion,


and enhance overall network performance. The BSTM, located at the bottom of the network, enhances the features from the encoder and passes them to the decoder.

Fig. 1. Overview of the MaS-TransUNet architecture. The model incorporates a modified ResNet-50-based encoder, Swin transformer-based modules (RSTM, BSTM, SDM), attention modules (EAM, CAM, FAM), and a decoder. Deep supervision signals (DS1, DS2) are extracted from BSTM and SDM outputs for enhanced training.

C. Swin Transformer Modules
Swin Transformers [13] excel in capturing both local and global context due to their hierarchical feature extraction and shifted window mechanisms. They maintain computational efficiency by reducing the complexity of self-attention through localized window operations and periodic window shifting. These characteristics make Swin Transformers a powerful and efficient choice for tasks requiring detailed and context-aware feature extraction. The careful placement of Swin Transformer-based modules within the CNN framework ensures that the overall model complexity remains manageable. We developed three types of modules based on the Swin transformer to introduce self-attention at different parts of the network. These modules include several Swin transformer blocks (STBs) and other components for seamless integration into the network. The STB and the developed modules are described below:

1) Swin Transformer Block (STB)
The Swin Transformer block (STB), a key component of the Swin Transformer [13] architecture, calculates attention using a mechanism called "Shifted Window" attention. It is an advanced technique designed to improve the efficiency and performance of the self-attention mechanism in traditional transformers. Instead of computing self-attention globally (as in traditional transformers), Swin Transformers divide the input image into non-overlapping windows. Self-attention is computed within each window, which reduces the cost of attention from quadratic to linear in the number of patches for a fixed window size and lowers memory usage accordingly. In the next layer, the windows are shifted by a fixed amount. This shift allows for interactions between adjacent windows that were previously isolated. The shifted windows straddle the boundaries of the windows in the preceding layer, enabling the model to capture cross-window dependencies and maintain global information flow throughout the network. Each STB is structured with Layer Normalization (LN), regular Windowed Multi-head Self-Attention (W-MSA), shifted Windowed Multi-head Self-Attention (SW-MSA), and Multi-Layer Perceptron (MLP) as depicted in Fig. 3. The input feature map $X$ of shape $(H, W, C)$, where $H$, $W$, and $C$ are height, width and number of


channels, respectively, given to the Swin transformer, is segmented into non-overlapping windows of size $M \times M$ ($M$ is set to 7 by default). For each window, linear projections are applied to obtain the query $Q$, key $K$, and value $V$ matrices:
$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V \tag{1}$$
where $W_Q$, $W_K$, and $W_V$ are learnable projection matrices. The projected $Q$, $K$, and $V$ are split into multiple heads. If there are $h$ heads, the dimension of each head will be $d = C/h$.
$$Q_i = \mathrm{split}(Q, h), \quad K_i = \mathrm{split}(K, h), \quad V_i = \mathrm{split}(V, h) \tag{2}$$
Self-attention is performed within each local window in W-MSA. For each head, we calculate the scaled dot-product attention. For the $i$-th head:
$$\mathrm{Attention}_i(Q_i, K_i, V_i) = \mathrm{Softmax}\left(\frac{Q_i K_i^{T}}{\sqrt{d}} + B\right) V_i \tag{3}$$
where $B$ denotes the relative position bias. Outputs of all heads are concatenated and projected back to the original dimension. To capture cross-window dependencies, the windows are shifted cyclically in both horizontal and vertical directions in SW-MSA. The same multi-head self-attention mechanisms described above are applied to these shifted windows.
The outputs from both W-MSA and SW-MSA are merged and reshaped back to the original input shape. A residual connection and LN are applied after each attention operation. Each attention block is followed by a feed-forward network consisting of two linear layers with a GELU activation in between. By using both window-based and shifted window-based multi-head self-attention, the Swin Transformer effectively captures both local and global dependencies in the input feature map, making it suitable for various vision tasks.

Fig. 2. Illustration of (a) Edge Attention Module (EAM), (b) Channel Attention Module (CAM), and (c) Feedback Attention Module (FAM).

2) Residual Swin Transformer Module (RSTM)
We designed RSTM with 6 STBs and a residual connection, as shown in Fig. 1, and integrated it into the ResNet-50 [14] based encoder to add self-attention to the input feature representations from different stages of ResNet-50. Patch embeddings are computed for features from selected layers in the encoder using a convolution operation. These embeddings are then flattened and added with learnable position embeddings to create a linear projection before being given to the RSTM blocks. Enhanced features from the RSTMs are then passed to the next encoder layers.
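As an illustration of the window mechanics that the STBs inside each RSTM rely on, the following minimal Python sketch partitions a small grid of token indices into non-overlapping M×M windows and applies the cyclic shift that precedes SW-MSA. It is not the authors' implementation: the helper names `window_partition` and `cyclic_shift`, the 4×4 grid, and M = 2 are assumptions chosen for readability, and the shift of M // 2 follows the convention of the Swin Transformer [13], since the text above only states that the windows are shifted by a fixed amount.

```python
def window_partition(x, m):
    """Split an H x W grid (list of rows) into non-overlapping m x m windows."""
    h, w = len(x), len(x[0])
    assert h % m == 0 and w % m == 0, "H and W must be multiples of m"
    windows = []
    for r in range(0, h, m):
        for c in range(0, w, m):
            windows.append([row[c:c + m] for row in x[r:r + m]])
    return windows


def cyclic_shift(x, s):
    """Roll the grid up and to the left by s positions, wrapping at the borders."""
    rolled_rows = x[s:] + x[:s]
    return [row[s:] + row[:s] for row in rolled_rows]


# Hypothetical 4 x 4 grid whose entries are token indices 0..15.
M = 2
grid = [[r * 4 + c for c in range(4)] for r in range(4)]

regular = window_partition(grid, M)                         # layout used by W-MSA
shifted = window_partition(cyclic_shift(grid, M // 2), M)   # layout used by SW-MSA

print(regular[0])  # [[0, 1], [4, 5]]
print(shifted[0])  # [[5, 6], [9, 10]]
```

After the shift, the first window holds tokens 5, 6, 9, and 10, which belonged to four different windows before the shift, so attention inside it links regions that were previously isolated. The original Swin Transformer additionally masks attention between tokens that become neighbours only through the wrap-around; that mask is omitted here.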
3) Bottleneck Swin Transformer Module (BSTM)
The encoder output is passed to BSTM, which is designed
with 12 STBs, a convolution and an upsampling layer, as shown
in Fig. 1. Output features from the encoder are transformed into
linear projections and are given to the BSTM, which produces
enhanced feature representations through self-attention. The
BSTM output is also used for deep supervision.
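Every STB stacked in the RSTM and BSTM evaluates the scaled dot-product attention with relative position bias of Eq. (3) inside each window. The sketch below shows that computation for a single head and a single window using plain Python lists. The function names `softmax` and `window_attention` are hypothetical, and the bias matrix is passed in directly, whereas in the model it is a learned parameter; this is a sketch of the formula under those assumptions, not the authors' code.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    peak = max(row)
    exps = [math.exp(score - peak) for score in row]
    total = sum(exps)
    return [e / total for e in exps]


def window_attention(q, k, v, bias):
    """Softmax(Q K^T / sqrt(d) + B) V for one head in one window.

    q, k, v: lists of n token vectors (n = M * M tokens, each of length d).
    bias:    n x n relative position bias B (assumed given here).
    """
    d = len(q[0])
    out = []
    for i, q_i in enumerate(q):
        # Scaled dot product of token i with every key, plus the bias term.
        scores = [
            sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d) + bias[i][j]
            for j, k_j in enumerate(k)
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([
            sum(w * v_j[c] for w, v_j in zip(weights, v))
            for c in range(len(v[0]))
        ])
    return out


# Two tokens with d = 2. Zero queries give uniform weights when the bias is zero.
q = [[0.0, 0.0], [0.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
uniform = window_attention(q, k, v, [[0.0, 0.0], [0.0, 0.0]])
# A strong diagonal bias makes each token attend almost only to itself.
biased = window_attention(q, k, v, [[50.0, 0.0], [0.0, 50.0]])
```

With zero bias both output rows equal the mean of the value vectors, whereas the diagonal bias returns the value vectors almost unchanged, which shows how B can steer attention independently of the token content.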
Fig. 3. Illustration of a Swin Transformer Block (STB).

4) Swin Decoder Module (SDM)
SDM consists of reshaping operations, 4 STBs and two convolution operations, as shown in Fig. 1. Representations from the preceding decoder layer are reshaped and enhanced in


SDM before they are given to the segmentation head. SDM 3) Feedback Attention Module (FAM)
output is also used to produce a deep supervision signal. The proposed FAM, as depicted in Fig. 2(c), is designed to
use the previous epoch segmentation predictions to focus on
D. Attention Modules
important features, reducing undesired feature clutter and
To enhance boundary detection, emphasize the most relevant pruning subsequent mask predictions. It computes attention
feature channels, and refine predicted masks, we designed and maps that highlight important regions in the feature maps based
integrated three distinct attention modules into our model. on the feedback from the previous segmentation map. These
These modules are integrated at different stages of the network, attention maps modulate the input feature maps to enhance
allowing their contributions to be sequential and additive. An relevant features, suppress irrelevant ones, and produce a
Edge Attention Module (EAM) focuses on capturing edge refined feature map. The FAM is embedded into the initial
information to enhance boundary detection, the Channel layers of the encoder and the final layers of the decoder in our
Attention Module (CAM) emphasizes the most informative proposed model, as shown in Fig. 1. The full procedure of
feature channels, and the Feedback Attention Module (FAM) obtaining refined feature maps by using previous epoch mask
uses previous epoch segmentation predictions for pruning predictions and input features as input to the FAM is described
subsequent mask predictions. These modules are detailed in Algorithm 1.
below: The feedback attention network also enables iterative
1) Edge Attention Module (EAM) refinement of predictions during the testing phase. Throughout
Numerous studies have demonstrated the utility of edge testing, we iterate through the input images for a maximum of
information in guiding feature extraction for segmentation tasks 10 iterations (determined empirically), continuously updating
[8]. Therefore, taking into account that low-level features inherently retain significant edge information, we provide the low-level, moderately resolved features $F$ from the encoder to the EAM to produce edge maps, as shown in Fig. 1. Fig. 2(a) illustrates the EAM, which consists of four layers that include convolution and batch normalization (BN), followed by interpolation. We assess the dissimilarity between the predicted edge maps $E_p$ and the edge maps obtained from the ground truth masks, $E_{GT}$, using the Boundary loss function $\mathcal{L}_{B}$ [44]. $\mathcal{L}_{B}$ is designed particularly for extremely unbalanced segmentations and focuses on the discrepancy between the predicted and ground truth boundaries rather than the entire segmentation region. This enables the explicit learning of object boundary representations, which helps improve segmentation accuracy.
2) Channel Attention Module (CAM)
CAM can effectively model the relationships between different channels, recognising that various semantic responses are interconnected. It allows the network to focus on the most informative channels in the feature map, enhancing the representational power of the features. Global average pooling and global max pooling are both applied to the input feature map $F$ to aggregate spatial information and generate spatial context. As shown in Fig. 2(b), the max-pooled feature maps $F_{max}$ and average-pooled feature maps $F_{avg}$ are squeezed by a convolution layer, followed by a rectified linear unit (ReLU), and expanded by another convolution layer before they are added and passed through the sigmoid function. The final channel attention map $M_c$ for feature map $F$ can be formulated as:
$M_c(F) = \sigma[SE(F_{avg}) + SE(F_{max})]$ (4)
where $\sigma$ is the sigmoid activation function, and SE represents channel squeezing and expansion through two convolution operations.
Finally, as shown in Fig. 1, the input feature map $F$ is scaled by the channel attention map $M_c(F)$ through element-wise multiplication:
$F' = M_c(F) \otimes F$ (5)
where $\otimes$ denotes element-wise multiplication.

the input masks with the predicted masks.

Algorithm 1 For Feedback Attention Module
Input: Previous epoch input mask and input feature map
Output: Refined feature map
1. Initial mask generation: Generate initial mask $m_p$ using Otsu thresholding (acts as input mask for the first epoch)
2. Training Phase (Repeat for each epoch):
   1) Perform a forward pass through the network to obtain the feature map $f_l$ at each layer $l$.
   2) Generate binary mask $m_l$ by:
      • Applying 3x3 Conv, BN, and ReLU.
      • Applying 1x1 Conv, sigmoid activation, and thresholding on the feature map $f_l$.
   3) Compress $m_p$ using Run Length Encoding (RLE) and match the size of $m_l$ using max-pooling.
   4) Combine the resized mask with $m_l$ using a union operation to create a unified mask.
   5) Apply element-wise multiplication between the unified mask and original $f_l$ to enhance important features.
   6) Pass the enhanced feature map and original $f_l$ through 3x3 Conv, BN, and ReLU, then concatenate to obtain the refined feature map.
   7) Update the input mask with the newly predicted mask.
3. Testing Phase: Iteratively update the input mask with the predicted mask for up to 10 iterations.

E. Decoder
The decoder is designed to reconstruct the segmented image from the encoded representations from the BSTM through a series of operations, including convolution, upsampling, skip-connection concatenation, and attention-based mechanisms. As illustrated in Fig. 1, the decoder comprises multiple blocks, each containing specific layers and operations. The initial three blocks consist of skip-connection concatenation, a pair of convolutions that are followed by ReLU activation, and upsampling using bilinear interpolation. The fourth block incorporates the Feedback Attention Module (FAM) and Swin Decoder Module (SDM) to introduce feedback attention and self-attention mechanisms to the representations in the decoder.
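Steps 3) to 5) of Algorithm 1 can be illustrated on toy data. The following is a minimal pure-Python sketch under stated assumptions, not the authors' implementation: the helper names (`rle_encode`, `rle_decode`, `max_pool`, `fuse`) and the nested-list layout are hypothetical, and the learned Conv/BN/ReLU layers of the FAM are omitted.

```python
def rle_encode(mask):
    """Run-length encode a binary 2D mask (row-major) as [value, run_length] pairs."""
    runs = []
    for row in mask:
        for v in row:
            if runs and runs[-1][0] == v:
                runs[-1][1] += 1
            else:
                runs.append([v, 1])
    return runs


def rle_decode(runs, height, width):
    """Invert rle_encode back into a height x width nested list."""
    flat = [v for v, n in runs for _ in range(n)]
    return [flat[r * width:(r + 1) * width] for r in range(height)]


def max_pool(mask, k):
    """Non-overlapping k x k max-pooling (assumes both sides are divisible by k)."""
    h, w = len(mask), len(mask[0])
    return [[max(mask[i + di][j + dj] for di in range(k) for dj in range(k))
             for j in range(0, w, k)]
            for i in range(0, h, k)]


def fuse(prev_mask, layer_mask, feature):
    """Resize the previous mask, take the union with the layer mask, gate the feature map."""
    k = len(prev_mask) // len(layer_mask)
    resized = max_pool(prev_mask, k) if k > 1 else prev_mask
    unified = [[a | b for a, b in zip(ra, rb)] for ra, rb in zip(resized, layer_mask)]
    return [[u * f for u, f in zip(ru, rf)] for ru, rf in zip(unified, feature)]


# Toy data: a 4x4 previous-epoch mask m_p, a 2x2 layer mask m_l, a 2x2 feature map f_l.
m_p = [[0, 0, 0, 0],
       [0, 1, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]]
m_l = [[0, 1],
       [0, 0]]
f_l = [[0.5, 2.0],
       [3.0, 4.0]]

compressed = rle_encode(m_p)                 # compact storage between epochs
restored = rle_decode(compressed, 4, 4)      # recover the mask before resizing
enhanced = fuse(restored, m_l, f_l)          # features outside the unified mask are zeroed
```

Here the union keeps a location active if either the previous-epoch mask or the current layer mask marks it, so the gated feature map retains the top row and suppresses the bottom row.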

Authorized licensed use limited to: National Institute of Technology Patna. Downloaded on October 24,2024 at 19:36:39 UTC from IEEE Xplore. Restrictions apply.
© 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See https://www.ieee.org/publications/rights/index.html for more information.
This article has been accepted for publication in IEEE Transactions on Radiation and Plasma Medical Sciences. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/TRPMS.2024.3477528

The final decoder block, known as the segmentation head, includes concatenation with the previous epoch mask for feedback and a convolution layer. The output of this convolution layer yields the final predicted mask.
In summary, MaS-TransUNet achieves a harmonious integration of CNN and Swin Transformer components through a well-structured hybrid architecture, enhanced attention mechanisms, and efficient feature fusion strategies. This design ensures that the model leverages the advantages of both architectures while maintaining high performance and efficiency.

F. Components of Training Configuration
To enhance the training effectiveness of MaS-TransUNet, we employed a well-tailored training scheme. This involved selecting an optimal combination of loss functions, optimizer, data augmentation techniques, various regularization methods, and a deep supervision scheme to maximize performance.
1) Data Augmentation and Regularization Techniques:
Medical datasets are often small due to the difficulty in acquiring labeled data. Data augmentation generates diverse training examples, preventing overfitting and improving the model's robustness, performance, and generalization to new, unseen images. We employed the open-source Python library "Albumentations" [45] to implement a wide range of image augmentation techniques. This included basic operations like padding and cropping, as well as more sophisticated methods such as resizing, rotation, flipping, and various distortions (elastic, grid, and optical). Furthermore, we randomly altered image properties like brightness, contrast, and gamma, and applied contrast-limited adaptive histogram equalization (CLAHE) to enhance contrast in a localized manner. These non-spatial modifications were exclusive to the image data, whereas all other transformations were uniformly applied to both images and their corresponding masks.
Regularization techniques are also essential in managing the model's complexity, ensuring its generalizability, robustness, and reliability. We have adopted optimised training-related settings along with a set of well-suited regularisation techniques, including normalisation, learning rate (LR) scheduler, dropout, and weight decay. These settings are summarised in Table I, along with remarks about their effectiveness with respect to our model.

TABLE I
TRAINING AND REGULARIZATION SETTINGS USED IN MAS-TRANSUNET.
Setting | Value | Remarks
Activation Function | ReLU | Computationally inexpensive and faster convergence.
Optimiser | Stochastic Gradient Descent with Nesterov Momentum (μ = 0.99) | Faster convergence and improved generalization.
Learning Rate | Initial LR = 0.01 | Optimal results with SGD.
LR Scheduler | Cosine Annealing | Improved convergence and better generalization.
Normalization | Group Normalisation (GN) | Ability to handle small batch sizes and stabilize training.
Weight Decay | L2 Regularization (0.0001) | Prevents overfitting and improves generalization.
Other | Dropouts (p=0.2) | Dropping some input tokens ensures Swin Transformers do not overly depend on input sequences, preventing overfitting.

2) Deep Supervision and Loss Functions:
MaS-TransUNet is trained in an end-to-end manner with an optimal combination of loss functions. The primary loss function $\mathcal{L}_{seg}$ is a composite of weighted intersection over union (IoU) loss $\mathcal{L}_{IoU}^{w}$ and weighted binary cross-entropy (BCE) loss $\mathcal{L}_{BCE}^{w}$ for each segmentation supervision and is given as:
$\mathcal{L}_{seg} = \mathcal{L}_{IoU}^{w} + \lambda \mathcal{L}_{BCE}^{w}$ (6)
where $\lambda$ is a weight, empirically set to 1 in our experiments.
Drawing inspiration from [29], we incorporated deep supervision to aid the training process through additional supervision signals using the output features $DS_1$ from BSTM and $DS_2$ from SDM, which are passed through a convolution layer followed by ReLU. $DS_1$ is also up-sampled to match the dimensions of the segmentation ground-truth map $S_{GT}$. Supervising the BSTM output mitigates the vanishing gradient problem and promotes better learning in the initial layers of the encoder, leading to improved feature extraction. Supervision at the SDM output helps fine-tune the reconstructed features, ensuring that the final output is more accurate and precise. As mentioned earlier, since edge-map prediction is a highly unbalanced segmentation problem, we employed the Boundary loss function $\mathcal{L}_{B}$ [44] for edge attention, which was specifically developed for unbalanced segmentation problems. $\mathcal{L}_{B}$ is formulated based on the distances $D$ between the boundary pixels of the predicted segmentation and the ground truth segmentation. The general formulation of the boundary loss can be expressed as:
$\mathcal{L}_{B} = D(E_{GT}, E_p)$ (7)
where the distance $D$ is based on the non-symmetric $L_2$ distance, $E_{GT}$ is the edge map obtained from the ground truth masks $S_{GT}$ using Canny edge detection, and $E_p$ is the predicted edge map. Consequently, the final loss function $\mathcal{L}_{total}$ is represented as:
$\mathcal{L}_{total} = \mathcal{L}_{seg}(S_{GT}, S_p) + \mathcal{L}_{B}(E_{GT}, E_p) + \mathcal{L}_{seg}(S_{GT}, DS_1) + \mathcal{L}_{seg}(S_{GT}, DS_2)$ (8)
where $S_p$ is the predicted segmentation map.

IV. EXPERIMENTS
We evaluate the effectiveness of MaS-TransUNet on challenging medical image segmentation tasks, comparing it with leading methods. Ablation studies are conducted to assess the impact of each MaS-TransUNet component on performance. Additionally, we measure the computational efficiency of our model in terms of parameter count (PRM), Floating Point Operations (FLOPs), and Frames Per Second (FPS).

A. Datasets
The applicability and robustness of the proposed MaS-TransUNet model are evaluated on multiple medical image datasets. Below are the specifics of these datasets:
1) The Cancer Genome Atlas Low Grade Glioma (TCGA-LGG) dataset:
Originally from The Cancer Imaging Archive (TCIA) [15], the dataset comprises 3929 2D fluid-attenuated inversion

recovery (FLAIR) MRI images taken from 110 patients. The MRI images and their associated ground truth (GT) masks are sourced from Kaggle [46]. After preprocessing to select 1060 image-mask pairs with at least 1% abnormal pixels in each GT mask, the dataset is split into 960 training pairs and 100 testing pairs in a 9:1 ratio. This dataset is also used for ablation studies.
2) Covid-19 CT lung lesion segmentation dataset:
A large dataset [8] of chest CT scans for COVID-19, combining 2729 image-mask pairs from three public datasets, is used. After eliminating pairs with less than 1% lesion pixels, 1131 pairs remain. These are split into 1017 training pairs and 114 testing pairs.
3) Nuclei segmentation dataset (Data Science Bowl 2018):
The Data Science Bowl (DSB) 2018 dataset [16] includes 670 segmented nuclei images captured under diverse conditions to test model generalizability. It is split into 603 training images and 67 testing images in a 9:1 ratio.
4) Kvasir-SEG polyp segmentation endoscopic dataset:
A challenging and widely used endoscopic image dataset [17] containing 1000 segmented images of gastrointestinal polyps, which are growths in the lining of the colon that can be precursors to cancer. It is divided into 900 training images and 100 testing images.
5) Skin lesion segmentation dataset:
The Skin Lesion Segmentation dataset from the International Skin Imaging Collaboration (ISIC) 2018 challenge [18] is a substantial collection of 2594 dermoscopic images, each with a corresponding segmentation ground truth mask. It includes high-quality, standardized images of various skin lesions, such as melanomas, nevi, and seborrheic keratoses. This dataset is also split in a 9:1 ratio for training and testing purposes.

B. Experimental Settings
1) Compared Models: We compared our MaS-TransUNet with approaches broadly divided into two categories, i.e., CNN-based models and models employing transformers. Among CNN-based models, we selected the standard U-Net [21], Nested U-Net (U-Net++) [22], Attention U-Net [28], Recurrent Residual U-Net (R2U-Net) [23], Attention Recurrent Residual U-Net (R2AU-Net) [47], No New Net (nnU-Net) [25], Inf-Net [8] and FANet [10]. For Transformer-based models, we employed Swin-UNet [34], TransUNet [33], DS-TransUNet [35] and CTNet [36]. Inf-Net was used solely for comparison on the COVID-19 Lung CT dataset, as it was specifically developed for COVID-19 lung infection segmentation. Similarly, CTNet was used only for comparison on the Kvasir-SEG polyp dataset, given its design for polyp segmentation tasks.
2) Implementation Details
In our experiments, MaS-TransUNet is trained with the Stochastic Gradient Descent (SGD) optimizer with Nesterov momentum (μ=0.99), an initial learning rate of 0.01, and a weight decay rate of $10^{-4}$. We trained our model from scratch for 100 epochs with a cosine annealing schedule for learning rate adjustment. Training from scratch provides deeper insights into the learning process and behaviour of the model and allows for clear benchmarking against other models.
The implementation and assessment of MaS-TransUNet were carried out using the Python programming language in Visual Studio Code. We utilized the open-source machine learning library PyTorch for model building, training, and testing. The computer setup employed for this implementation featured the following specifications: an Intel Xeon 2278G CPU, 64 GB of RAM, and an Nvidia A5000 GPU with 64 cores and 24 GB of memory.
3) Evaluation Metrics
We employed a comprehensive set of metrics to evaluate the performance of MaS-TransUNet, focusing on both pixel-level details and global structural aspects. For pixel-level evaluation, some of the most widely used metrics are applied to assess model performance. Accuracy measures the proportion of correctly predicted pixels. The Dice Similarity Coefficient evaluates the overlap between predicted and true masks, emphasizing the correct detection of regions. Mean Absolute Error (MAE) quantifies the average absolute differences between predicted and actual pixel values. Sensitivity (or recall) assesses the model's ability to correctly identify positive pixels, while Specificity measures its ability to identify negative pixels. Precision determines the proportion of true positive predictions among all positive predictions made. The Mean Hausdorff Distance (MHD) gauges the average distance between the predicted and true boundary pixels, highlighting boundary accuracy. Finally, the Jaccard Similarity Index (IoU) calculates the intersection over union of the predicted and true regions, providing a robust measure of segmentation quality. To assess the global structural aspects of the segmented regions, we utilized the Structural Similarity Measure (SSM) [48] and Enhanced-alignment Measure (EM) [49], supplemented by MAE values. The formulations for calculating SSM and EM are given below:
$SSM = \alpha \times S_o(S_p, S_{GT}) + (1 - \alpha) \times S_r(S_p, S_{GT})$ (9)
where $S_o$ and $S_r$ are the object-aware and locality-aware resemblances, respectively, and $\alpha \in [0,1]$ is a constant that balances $S_o$ and $S_r$ ($\alpha = 0.5$, as per [48]).
$EM = \frac{1}{w \times h} \sum_{x=1}^{w} \sum_{y=1}^{h} \varphi(S_p(x, y), S_{GT}(x, y))$ (10)
where $\varphi$ is the enhanced alignment matrix that captures both pixel-level and image-level information to provide a comprehensive assessment of the segmentation quality, $w$ and $h$ are the width and height of $S_{GT}$, and $(x, y)$ is the coordinate of a pixel in $S_{GT}$.

C. Segmentation Results and Discussions
To build a more generalized model, we trained and tested the proposed MaS-TransUNet on five datasets from various modalities. To ensure a fair and comprehensive comparison, we trained our model and other benchmark models from scratch on each dataset for 100 epochs. The segmentation results and discussions for the given datasets are as follows:
1) Results on LGG Segmentation: Table II presents the segmentation performance metrics of MaS-TransUNet compared to other state-of-the-art techniques on Low-Grade Glioma brain MRI scans. The qualitative results are illustrated in Fig. 4(a). Key observations from Table II and Fig. 4(a) are discussed below.
MaS-TransUNet achieved a dice score of 0.9012 and an IoU of 0.8281. These values exceed those of the other models, indicating that our model has a higher overlap between the predicted and actual segmentation, thus delivering more accurate segmentation results. Our model demonstrated the


TABLE II
SEGMENTATION RESULTS OF COMPARED MODELS ON TCGA-LGG BRAIN MRI DATASET. 1ST & 2ND MOST FAVOURABLE
OUTCOMES ARE HIGHLIGHTED IN BOLD AND UNDERLINED FONTS, RESPECTIVELY.
Models Accuracy Dice Sensitivity Specificity Precision MHD IoU SSM EM MAE
U-Net [21] 0.9358 0.4504 0.7988 0.9401 0.3359 114.17 0.3197 0.6460 0.7356 0.0536
U-Net++ [22] 0.9906 0.8619 0.8921 0.9949 0.8579 20.053 0.7715 0.9031 0.9530 0.0093
Attention-UNet [28] 0.9893 0.8375 0.8704 0.9943 0.8333 23.522 0.7407 0.8888 0.9493 0.0105
R2-UNet [23] 0.8631 0.2450 0.3519 0.8855 0.5403 59.415 0.1536 0.5188 0.4036 0.1049
R2-Attention-UNet [47] 0.9736 0.4202 0.3567 0.9973 0.6292 55.945 0.3286 0.6345 0.5653 0.0295
nnUNet [25] 0.9931 0.8888 0.8903 0.9969 0.9038 10.847 0.8122 0.9158 0.9729 0.0069
FANet [10] 0.9919 0.8787 0.9375 0.9940 0.8459 16.318 0.8048 0.9094 0.9615 0.0081
Swin-UNet [34] 0.9832 0.7357 0.8099 0.9890 0.7143 35.149 0.6231 0.8360 0.8776 0.0168
TransUNet [33] 0.9925 0.8865 0.9304 0.9950 0.8633 9.5910 0.8081 0.9223 0.9669 0.0076
DS-TransUNet [35] 0.9923 0.8809 0.9075 0.9958 0.8668 15.108 0.8017 0.9107 0.9768 0.0076
MaS-TransUNet (ours) 0.9936 0.9031 0.9230 0.9963 0.8940 8.6138 0.8309 0.9259 0.9802 0.0064

TABLE III
COMPARISON OF LUNG INFECTION SEGMENTATION RESULTS ON 114 TEST IMAGES FROM THE COVID-19 LUNG CT DATASET. THE
TOP TWO PERFORMING RESULTS ARE HIGHLIGHTED USING BOLD AND UNDERLINED FONTS, RESPECTIVELY.
Methods Accuracy Dice Sensitivity Specificity Precision MHD IoU SSM EM MAE
Inf-Net [8] 0.9583 0.490 0.576 0.969 0.4556 103.99 0.381 0.862 0.9424 0.012
U-Net [21] 0.9116 0.146 0.288 0.928 0.1078 286.42 0.083 0.506 0.6194 0.074
U-Net++ [22] 0.9884 0.779 0.752 0.996 0.8429 24.67 0.657 0.817 0.9036 0.012
Att-UNet [28] 0.9882 0.779 0.785 0.994 0.8026 32.10 0.653 0.826 0.9195 0.011
R2-UNet [23] 0.9629 0.404 0.463 0.976 0.3915 65.77 0.277 0.620 0.7911 0.034
R2-Att-UNet [47] 0.9233 0.431 0.669 0.928 0.3910 58.76 0.298 0.633 0.5583 0.056
FANet [10] 0.9892 0.817 0.873 0.992 0.7838 57.95 0.707 0.856 0.9581 0.010
Swin-UNet [34] 0.9782 0.625 0.773 0.984 0.5536 98.73 0.477 0.738 0.8636 0.020
TransUNet [33] 0.9897 0.822 0.872 0.993 0.7909 56.28 0.712 0.859 0.9542 0.010
DS-TransUNet [35] 0.9902 0.836 0.859 0.994 0.8251 29.80 0.734 0.859 0.9728 0.010
MaS-TransUNet 0.9907 0.841 0.884 0.994 0.8110 43.71 0.736 0.880 0.9756 0.009

highest sensitivity (0.9527) and lowest MHD (9.3669), which are critical for medical image segmentation. The high sensitivity ensures that the model effectively identifies the presence of the glioma, while the low MHD indicates that the model's delineated regions closely match the ground truth. These metrics highlight MaS-TransUNet's reliability in accurately detecting and delineating gliomas. MaS-TransUNet scored highly on SSM (0.9335), EM (0.9765), and MAE (0.0068), suggesting that our model preserves the structural integrity and alignment of the segmented regions more effectively than other models, contributing to better visual and clinical relevance of the segmentation outputs as evident in Fig. 4(a).
2) Results on Covid-19 CT lung lesion segmentation dataset:
The efficacy of MaS-TransUNet on a CT lung lesion segmentation dataset from COVID-19 patients is comprehensively compared with other models, both quantitatively in Table III and qualitatively in Fig. 4(b). The observations from these comparisons are as follows:
Across various metrics such as Dice (0.841), IoU (0.736), SSM (0.880), and MAE (0.009), MaS-TransUNet demonstrates superior performance compared to other models. This suggests its effectiveness in accurately delineating lung lesions in COVID-19 CT scans. While DS-TransUNet achieves a quantitative performance that closely rivals our model, its qualitative assessment reveals a substantial disparity. Our model demonstrates clear superiority in the visual inspection of segmentation results, indicating a finer delineation of lesion boundaries and superior handling of image artefacts, ensuring smooth and precise lesion segmentation even in challenging regions. Overall, the superior quantitative performance of MaS-TransUNet aligns with qualitative observations, indicating a comprehensive evaluation framework.
3) Results on Data Science Bowl Nuclei segmentation dataset:
We evaluated our model on the challenging DSB-2018 dataset, comparing its performance metrics with other state-of-the-art models in Table IV. Visual comparisons are provided in Fig. 5. The results show that MaS-TransUNet ranks highest in Dice, IoU, and MAE and second in Sensitivity and SSM, demonstrating reliable segmentation performance. While DS-TransUNet and nnUNet perform well on the DSB 2018 dataset, closely trailing MaS-TransUNet, our model excels across a broader range of datasets, highlighting its superior generalizability and robustness.

TABLE IV
COMPARATIVE RESULTS OF NUCLEI SEGMENTATION ON 67 TEST IMAGES FROM DSB 2018 DATASET. 1ST & 2ND BEST RESULTS ARE IN BOLD AND UNDERLINED FONTS RESPECTIVELY.
Methods Dice Sen. Spec. IoU SSM MAE
U-Net [21] 0.776 0.692 0.979 0.644 0.903 0.0739
U-Net++ [22] 0.875 0.923 0.976 0.786 0.906 0.0290
Att-UNet [28] 0.861 0.920 0.975 0.763 0.896 0.0301
R2-UNet [23] 0.720 0.703 0.973 0.605 0.718 0.0850
R2-Att-UNet [47] 0.408 0.432 0.761 0.304 0.483 0.2003
nnUNet [25] 0.905 0.916 0.988 0.838 0.909 0.0227
FANet [10] 0.898 0.911 0.986 0.833 0.905 0.0231
Swin-UNet [34] 0.758 0.849 0.963 0.628 0.811 0.0650
TransUNet [33] 0.887 0.959 0.970 0.808 0.917 0.0266
DS-TransUNet [35] 0.906 0.933 0.985 0.839 0.915 0.0227
MaS-TransUNet 0.908 0.941 0.983 0.841 0.916 0.0226
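The pixel-level metrics reported in Tables II to IV follow from the confusion counts of a predicted binary mask against its ground truth. The sketch below uses the standard definitions of these metrics; the function name `pixel_metrics`, the helper `safe_div`, and the flat-list mask format are assumptions for illustration, and this is not the authors' evaluation code.

```python
def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero (e.g. an empty mask)."""
    return num / den if den else 0.0


def pixel_metrics(pred, gt):
    """Compute standard binary segmentation metrics from two flat 0/1 sequences."""
    tp = sum(1 for p, g in zip(pred, gt) if p == 1 and g == 1)
    tn = sum(1 for p, g in zip(pred, gt) if p == 0 and g == 0)
    fp = sum(1 for p, g in zip(pred, gt) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(pred, gt) if p == 0 and g == 1)
    n = tp + tn + fp + fn
    return {
        "accuracy": safe_div(tp + tn, n),
        "dice": safe_div(2 * tp, 2 * tp + fp + fn),
        "iou": safe_div(tp, tp + fp + fn),
        "sensitivity": safe_div(tp, tp + fn),
        "specificity": safe_div(tn, tn + fp),
        "precision": safe_div(tp, tp + fp),
        # For binary masks, MAE reduces to the fraction of mislabelled pixels.
        "mae": safe_div(fp + fn, n),
    }


# Toy example: 8 pixels, two true positives, one false positive, one false negative.
pred = [1, 1, 0, 0, 1, 0, 0, 0]
gt = [1, 0, 0, 0, 1, 1, 0, 0]
metrics = pixel_metrics(pred, gt)
```

Dice and IoU are monotonically related for a single mask pair (Dice = 2·IoU / (1 + IoU)), which is why the two columns tend to rank models in the same order.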


Fig. 4. Qualitative comparison of MaS-TransUNet with existing models in (a) tumor segmentation and (b) lung infection segmentation using
TCGA-LGG brain MRI and COVID lung CT datasets, respectively. Bounding boxes highlight key disparities between segmented masks and
ground truth.

4) Results on Kvasir-SEG polyp segmentation dataset:
Polyp segmentation is challenging due to polyps' camouflage properties and size variability. Table V presents model evaluations on the Kvasir-SEG polyp segmentation dataset, with visual comparisons shown in Fig. 5. MaS-TransUNet excels in nearly all metrics, significantly outperforming other models, including the recently proposed CTNet, which was specifically developed for polyp segmentation. FANet and nnU-Net have the highest sensitivity scores, indicating strong true positive detection, while DS-TransUNet has the highest specificity, minimizing false positives. Although MaS-TransUNet doesn't achieve the top scores in these individual metrics, it performs very well and is close to the best, showcasing its balanced and robust performance.

TABLE V
TESTING ON 100 IMAGES FROM KVASIR POLYP SEGMENTATION DATASET. 1ST & 2ND BEST OUTCOMES ARE SHOWN IN BOLD AND UNDERLINED FONTS RESPECTIVELY.
Models Dice Sen. Spec. IoU SSM MAE
U-Net [21] 0.392 0.789 0.679 0.271 0.500 0.309
U-Net++ [22] 0.770 0.831 0.962 0.666 0.821 0.068
Att-UNet [28] 0.659 0.846 0.901 0.535 0.756 0.110
R2-UNet [23] 0.412 0.878 0.668 0.285 0.584 0.233
R2-Att-UNet [47] 0.448 0.765 0.784 0.314 0.593 0.197
nnU-Net [25] 0.848 0.936 0.958 0.768 0.864 0.051
FANet [10] 0.763 0.945 0.838 0.693 0.783 0.149
Swin-UNet [34] 0.487 0.692 0.863 0.369 0.661 0.153
TransUNet [33] 0.875 0.905 0.972 0.806 0.898 0.047
DS-TransUNet [35] 0.868 0.876 0.985 0.796 0.883 0.041
CTNet [36] 0.869 0.864 0.982 0.796 0.886 0.043
MaS-TransUNet 0.906 0.930 0.982 0.843 0.917 0.029
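Because the bold and underline markers of Table V do not survive in plain text, the best and second-best entries per metric can be recomputed from the listed values. A small sketch follows; the dictionary layout and the helper `top_two` are illustrative assumptions, and ties are broken by row order, which the table itself does not specify.

```python
# Values copied from Table V (Kvasir-SEG). Column order: Dice, Sen., Spec., IoU, SSM, MAE.
COLUMNS = ["Dice", "Sen.", "Spec.", "IoU", "SSM", "MAE"]
TABLE_V = {
    "U-Net": (0.392, 0.789, 0.679, 0.271, 0.500, 0.309),
    "U-Net++": (0.770, 0.831, 0.962, 0.666, 0.821, 0.068),
    "Att-UNet": (0.659, 0.846, 0.901, 0.535, 0.756, 0.110),
    "R2-UNet": (0.412, 0.878, 0.668, 0.285, 0.584, 0.233),
    "R2-Att-UNet": (0.448, 0.765, 0.784, 0.314, 0.593, 0.197),
    "nnU-Net": (0.848, 0.936, 0.958, 0.768, 0.864, 0.051),
    "FANet": (0.763, 0.945, 0.838, 0.693, 0.783, 0.149),
    "Swin-UNet": (0.487, 0.692, 0.863, 0.369, 0.661, 0.153),
    "TransUNet": (0.875, 0.905, 0.972, 0.806, 0.898, 0.047),
    "DS-TransUNet": (0.868, 0.876, 0.985, 0.796, 0.883, 0.041),
    "CTNet": (0.869, 0.864, 0.982, 0.796, 0.886, 0.043),
    "MaS-TransUNet": (0.906, 0.930, 0.982, 0.843, 0.917, 0.029),
}
LOWER_IS_BETTER = {"MAE"}  # every other column is higher-is-better


def top_two(metric):
    """Return the names of the best and second-best models for one metric."""
    idx = COLUMNS.index(metric)
    ranked = sorted(
        TABLE_V,
        key=lambda name: TABLE_V[name][idx],
        reverse=metric not in LOWER_IS_BETTER,
    )
    return ranked[:2]


ranking = {metric: top_two(metric) for metric in COLUMNS}
```

Under these values, MaS-TransUNet leads Dice, IoU, SSM, and MAE, while FANet and DS-TransUNet lead Sensitivity and Specificity respectively, which matches the discussion of Table V.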


Fig. 5. Qualitative results of MaS-TransUNet compared to other models on DSB 2018, KVASIR-SEG and ISIC 2018 datasets. Key regions are
highlighted with bounding boxes to show differences between segmented and ground truth masks.

5) Results on ISIC 2018 Dataset:
The ISIC-2018 dataset presents a significant challenge in segmenting skin lesions. Table VI and Fig. 5 show the quantitative and qualitative performances of various models. MaS-TransUNet stands out as the top performer across most metrics, indicating its robustness and accuracy. It surpasses the second-best TransUNet by 3.3% in Dice score, 5.4% in IoU, and 37.5% in MAE. However, its slightly lower sensitivity and specificity suggest a weaker ability to detect true positives and minimize false positives. Overall, all the transformer-based models, including our MaS-TransUNet, outperform CNN-based models in most metrics, highlighting their superiority for this segmentation task.

TABLE VI
TESTING ON 260 IMAGES FROM ISIC-2018 SKIN LESION SEGMENTATION DATASET. 1ST & 2ND BEST OUTCOMES ARE SHOWN IN BOLD AND UNDERLINED FONTS RESPECTIVELY.
Models Dice Sen. Spec. IoU SSM MAE
U-Net [21] 0.607 0.943 0.693 0.487 0.646 0.231
U-Net++ [22] 0.809 0.932 0.928 0.714 0.858 0.072
Att-UNet [28] 0.771 0.948 0.908 0.667 0.835 0.080
R2-UNet [23] 0.572 0.523 0.980 0.465 0.634 0.154
R2-Att-UNet [47] 0.538 0.970 0.702 0.417 0.682 0.191
FANet [10] 0.572 0.738 0.921 0.429 0.627 0.169
Swin-UNet [34] 0.809 0.927 0.913 0.708 0.853 0.078
TransUNet [33] 0.877 0.950 0.932 0.797 0.900 0.056
DS-TransUNet [35] 0.848 0.836 0.987 0.768 0.852 0.069
MaS-TransUNet 0.906 0.931 0.967 0.840 0.900 0.035

The comprehensive analysis performed across five diverse datasets confirms that MaS-TransUNet consistently ranks among the top models in various metrics, including Dice score, Sensitivity, Intersection over Union (IoU), Structural Similarity Measure (SSM), and Mean Absolute Error (MAE). This consistently high ranking highlights MaS-TransUNet's reliability and generalizability, making it a highly effective tool for precise medical image segmentation. Among the metrics used, the Dice score is crucial for pixel-level evaluation, while the SSM is key for assessing global structural aspects. Fig. 7 aggregates the Dice and SSM scores for all compared models across the datasets. In Tables II to VI, MaS-TransUNet attains the highest Dice score on all five datasets and the highest or joint-highest SSM on four of them, trailing TransUNet by 0.001 in SSM on DSB 2018.

D. Ablation Study
To assess the contributions of key components within our MaS-TransUNet model for LGG segmentation, we conducted a series of experiments incorporating different combinations of Swin transformer modules and attention mechanisms. Table VII summarizes these experimental results. Our baseline model (B) is a U-shaped CNN architecture employing a ResNet-50 encoder, three skip connections, and a decoder with multiple upsampling blocks. B+RST introduces RSTMs to the encoder of the baseline. B+RST+BST further integrates BSTM between the encoder and decoder. The hybrid model (H) incorporates all Swin transformer modules (RSTM, BSTM, and SDM) into the baseline. H+EA augments the hybrid model with EAM. H+EA+FA adds FAM to the previous configuration, and finally, H+EA+FA+CA represents the complete MaS-TransUNet model with the inclusion of CAM.
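The cumulative ablation variants described above can be written down as module sets, which makes the nesting explicit. This is only an organizational sketch: the keys of `ABLATION_CONFIGS` mirror the labels used in the text, while the tuple representation, the "CNN" placeholder for the ResNet-50 baseline, and the helper `added_module` are assumptions for illustration.

```python
# Each variant lists the modules it contains, in the order they are introduced.
ABLATION_CONFIGS = {
    "B": ("CNN",),
    "B+RST": ("CNN", "RSTM"),
    "B+RST+BST": ("CNN", "RSTM", "BSTM"),
    "H": ("CNN", "RSTM", "BSTM", "SDM"),
    "H+EA": ("CNN", "RSTM", "BSTM", "SDM", "EAM"),
    "H+EA+FA": ("CNN", "RSTM", "BSTM", "SDM", "EAM", "FAM"),
    "H+EA+FA+CA": ("CNN", "RSTM", "BSTM", "SDM", "EAM", "FAM", "CAM"),
}


def added_module(previous, current):
    """Return the single module that `current` adds on top of `previous`."""
    extra = set(ABLATION_CONFIGS[current]) - set(ABLATION_CONFIGS[previous])
    if len(extra) != 1:
        raise ValueError(f"{current} does not extend {previous} by exactly one module")
    return extra.pop()


names = list(ABLATION_CONFIGS)
# Pair each variant with its predecessor to list what every step contributes.
steps = [(cur, added_module(prev, cur)) for prev, cur in zip(names, names[1:])]
```

Since each variant differs from its predecessor by exactly one module, any metric change between adjacent rows of the ablation table can be attributed to that module alone.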


Fig. 6. Ablation results, shown in the form of heat maps. (a) Image. Results from (b) base model (B). (c) B+RST. (d) B+RST+BST. (e)
Hybrid base model (H). (f) H+EA. (g) H+EA+FA. (h) H+EA+FA+CA (MaS-TransUNet). (i) GT heat map.
Fig. 7. Dice and SSM scores of MaS-TransUNet compared to other models. Scores are aggregated across the five datasets.

Each version builds on the previous one, adding more complex attention and transformer modules to improve performance on the LGG segmentation task. The following observations can be made from Table VII and Fig. 6:
Adding RSTMs and BSTMs to the base CNN model (B) enhances performance. B+RST shows improvements in Dice (0.873 to 0.878), IoU (0.784 to 0.794), and a reduction in MAE (0.0096 to 0.0083). B+RST+BST further improves Dice to 0.883 and IoU to 0.801, despite a slight increase in MHD. The Hybrid model (H), incorporating all transformer modules, yields further gains. It achieves a Dice of 0.887, IoU of 0.805, and MAE of 0.0079, outperforming the base model and partial integrations on Dice and IoU. Adding EA to the Hybrid model (H+EA) enhances performance further. Dice increases to 0.890, and MHD decreases from 10.34 to 9.64, indicating refined segmentation boundaries.
Incorporating the FAM into the model (H+EA+FA) leads to improvements in most metrics. The Dice score rises to 0.897, IoU improves to 0.822, and MAE reduces further to 0.0069. This indicates that the FAM effectively enhances the model's ability to learn from its own predictions, leading to more accurate segmentation. The final model (H+EA+FA+CA), which includes the CAM, demonstrates the best performance on every metric except SSM, where the base model scores higher (0.932 vs. 0.926). With a Dice score of 0.903, IoU of 0.831, and MAE of 0.0064, the full MaS-TransUNet model shows substantial improvements over the base and intermediate models, making it the most robust model for the given task.

TABLE VII
ABLATION STUDY OF PROPOSED MODEL ON 100 IMAGES FROM TCGA-LGG DATASET. BEST OUTCOMES ARE SHOWN IN BOLD.
Models Dice MHD IoU EM SSM MAE
Base (B) 0.873 11.15 0.784 0.948 0.932 0.0096
B+RST 0.878 10.46 0.794 0.968 0.916 0.0083
B+RST+BST 0.883 11.28 0.801 0.970 0.921 0.0078
Hybrid Base (H) 0.887 10.34 0.805 0.972 0.923 0.0079
H+EA 0.890 9.64 0.810 0.973 0.924 0.0076
H+EA+FA 0.897 8.78 0.822 0.976 0.921 0.0069
H+EA+FA+CA 0.903 8.61 0.831 0.980 0.926 0.0064

The ablation results, depicted as heat maps in Fig. 6, show that each added model component progressively enhances segmentation accuracy. These findings from the ablation study indicate that each designed model component contributes to effective medical image segmentation.

E. Model Efficiency
To assess the computational efficiency of our model compared to state-of-the-art models, we evaluated several metrics: parameter count (PRM) in millions (M), Floating Point Operations (FLOPs) in giga-operations (G), Frames Per Second (FPS), as well as Dice and SSM scores on the brain dataset. PRM reflects the total number of trainable parameters, indicating the model's complexity. FLOPs measure the computational load by counting the number of floating-point operations required to process a single image. FPS, which measures the number of images processed per second, highlights the model's speed and efficiency. The results are shown in Table VIII.
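FPS as defined above can be estimated by timing repeated inference calls. The sketch below is a generic measurement pattern rather than the authors' benchmarking script: `measure_fps`, the warm-up count, and the stand-in `dummy_infer` callable are assumptions. On a GPU, asynchronous execution would additionally require synchronizing the device before reading the clock.

```python
import time


def measure_fps(infer, images, warmup=2):
    """Estimate frames per second of `infer` over `images` after a short warm-up."""
    for image in images[:warmup]:
        infer(image)  # warm-up calls are excluded from the timing
    start = time.perf_counter()
    for image in images:
        infer(image)
    elapsed = time.perf_counter() - start
    return len(images) / elapsed if elapsed > 0 else float("inf")


def dummy_infer(image):
    """Stand-in for a model forward pass: threshold each pixel at 0.5."""
    return [[1 if value > 0.5 else 0 for value in row] for row in image]


# Twenty synthetic 32x32 "images" with values in [0, 0.9].
images = [[[0.1 * ((i + j + k) % 10) for j in range(32)] for i in range(32)]
          for k in range(20)]
fps = measure_fps(dummy_infer, images)
```

Averaging over many images after a warm-up reduces the influence of one-off costs such as memory allocation, so the figure is closer to steady-state throughput.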


TABLE VIII
COMPARISON OF COMPUTATIONAL EFFICIENCY AND SEGMENTATION PERFORMANCE ON BRAIN DATASET.

Models                   PRM      FLOPs     FPS   Dice    SSM
U-Net [21]               34.53M   65.44G    90    0.450   0.646
U-Net++ [22]             36.63M   138.58G   41    0.862   0.903
Attention-UNet [28]      34.88M   66.56G    80    0.838   0.889
R2-UNet [23]             39.09M   152.91G   36    0.245   0.519
R2-Attention-UNet [47]   39.44M   154.02G   34    0.420   0.635
FANet [10]               7.72M    23.65G    51    0.879   0.909
Swin-UNet [34]           27.12M   5.86G     109   0.736   0.836
TransUNet [33]           93.23M   32.23G    49    0.887   0.922
DS-TransUNet [35]        171.3M   51.15G    18    0.881   0.911
MaS-TransUNet            117.7M   87.34G    23    0.901   0.934

Table VIII reveals that transformer-based methods generally have more parameters than CNN-based methods, with only DS-TransUNet exceeding our model's parameter count. Compared with DS-TransUNet, however, our model attains higher FPS, Dice, and SSM scores despite requiring more FLOPs, indicating faster inference and improved segmentation accuracy. While other models may have better FLOPs and FPS metrics, they have significantly fewer parameters and lower segmentation accuracies. Overall, MaS-TransUNet achieves excellent performance with good computational efficiency, reducing overfitting risks and making it a promising solution for applications prioritizing segmentation accuracy and resource efficiency.

Overall, the MaS-TransUNet addresses key challenges in medical image segmentation by integrating CNNs and transformers to capture both local and global features effectively. It enhances hierarchical feature representation through Swin Transformer blocks, improves boundary precision with an edge attention module, refines predictions using feedback attention modules, and focuses on important feature channels with channel attention modules. This approach not only achieves high accuracy across diverse medical imaging datasets but also maintains computational efficiency, making it suitable for a variety of medical applications. The model’s robust performance and versatility promise improved diagnostic accuracy and better clinical decision-making without needing modality-specific adaptations.

V. CONCLUSION

We propose the Multi-attention Swin Transformer UNet (MaS-TransUNet), a CNN-Swin Transformer hybrid U-shaped network. It incorporates four distinct attention mechanisms: self-attention, edge-attention, feedback-attention, and channel attention. The self-attention feature, facilitated by specially designed modules using Swin transformer blocks, operates at different segments of the network. An edge-attention module, attached to the initial layers of the encoder, guides the network with edge information. Predicted masks are iteratively pruned by utilizing previous masks through feedback attention modules embedded into the encoder and decoder. The channel attention module, integrated into the encoder, focuses on the most informative channels in the feature maps. The model's training effectiveness is enhanced by advanced data augmentation techniques, deep supervision, regularizations, and an optimal combination of optimizer and loss functions. Extensive experiments conducted on five medical image segmentation datasets across distinct modalities, including Brain MRI, Covid-19 Lung CT, DSB-2018, Kvasir-SEG and ISIC-2018, have shown that our MaS-TransUNet notably outperforms prior state-of-the-art methods. These results validate the model's strengths and generalizability, achieving impressive outcomes with good computational efficiency without needing modality-specific adjustments. Our future work will focus on developing more lightweight models that effectively capture both intrinsic local features and global context, enhancing computational efficiency while maintaining attention to important details in medical images.

Acknowledgment Statement for authors with no COI: All authors declare that they have no known conflicts of interest in terms of competing financial interests or personal relationships that could have an influence or are relevant to the work reported in this paper.

REFERENCES
[1] A. K. Upadhyay and A. K. Bhandari, “Semi-Supervised Modified-UNet for Lung Infection Image Segmentation,” IEEE Trans. Radiat. Plasma Med. Sci., vol. 7, no. 6, pp. 638–649, 2023, doi: 10.1109/TRPMS.2023.3272209.
[2] W. Li et al., “Accurate Whole-Brain Segmentation for Bimodal PET/MR Images Via a Cross-Attention Mechanism,” IEEE Trans. Radiat. Plasma Med. Sci., vol. PP, p. 1, 2024, doi: 10.1109/TRPMS.2024.3413862.
[3] S. D. Deb and R. K. Jha, “Modified Double U-Net Architecture for Medical Image Segmentation,” IEEE Trans. Radiat. Plasma Med. Sci., vol. 7, no. 2, pp. 151–162, 2023, doi: 10.1109/TRPMS.2022.3221471.
[4] T. Hussain and H. Shouno, “MAGRes-UNet: Improved Medical Image Segmentation Through a Deep Learning Paradigm of Multi-Attention Gated Residual U-Net,” IEEE Access, vol. 12, no. February, pp. 40290–40310, 2024, doi: 10.1109/ACCESS.2024.3374108.
[5] A. K. Upadhyay and A. K. Bhandari, “Advances in Deep Learning Models for Resolving Medical Image Segmentation Data Scarcity Problem: A Topical Review,” Arch. Comput. Methods Eng., vol. 31, no. 3, pp. 1701–1719, 2024, doi: 10.1007/s11831-023-10028-9.
[6] P. H. Conze, G. Andrade-Miranda, V. K. Singh, V. Jaouen, and D. Visvikis, “Current and Emerging Trends in Medical Image Segmentation With Deep Learning,” IEEE Trans. Radiat. Plasma Med. Sci., vol. 7, no. 6, pp. 545–569, 2023, doi: 10.1109/TRPMS.2023.3265863.
[7] B. Lei et al., “Self-co-attention neural network for anatomy segmentation in whole breast ultrasound,” Med. Image Anal., vol. 64, p. 101753, 2020, doi: 10.1016/j.media.2020.101753.
[8] D. P. Fan et al., “Inf-Net: Automatic COVID-19 Lung Infection Segmentation from CT Images,” IEEE Trans. Med. Imaging, vol. 39, no. 8, pp. 2626–2637, Aug. 2020, doi: 10.1109/TMI.2020.2996645.
[9] H. Hu, Q. Li, Y. Zhao, and Y. Zhang, “Parallel Deep Learning Algorithms with Hybrid Attention Mechanism for Image Segmentation of Lung Tumors,” IEEE Trans. Ind. Informatics, vol. 17, no. 4, pp. 2880–2889, 2021, doi: 10.1109/TII.2020.3022912.
[10] N. K. Tomar et al., “FANet: A Feedback Attention Network for Improved Biomedical Image Segmentation,” IEEE Trans. Neural Networks Learn. Syst., vol. 34, no. 11, pp. 9375–9388, 2023, doi: 10.1109/TNNLS.2022.3159394.
[11] H. Du, J. Wang, M. Liu, Y. Wang, and E. Meijering, “SwinPA-Net: Swin Transformer-Based Multiscale Feature Pyramid Aggregation Network for Medical Image Segmentation,” IEEE Trans. Neural Networks Learn. Syst., vol. 35, no. 4, pp. 5355–5366, 2024, doi: 10.1109/TNNLS.2022.3204090.
[12] A. Vaswani et al., “Attention is all you need,” Adv. Neural Inf. Process. Syst., vol. 2017–Decem, no. Nips, pp. 5999–6009, 2017.
[13] Z. Liu et al., “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows,” Proc. IEEE Int. Conf. Comput. Vis., pp. 9992–10002, 2021, doi: 10.1109/ICCV48922.2021.00986.
[14] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” Proc. IEEE Comput. Soc. Conf. Comput. Vis.


Pattern Recognit., vol. 2016–Decem, pp. 770–778, 2016, doi: 10.1109/CVPR.2016.90.
[15] N. Pedano et al., “The Cancer Genome Atlas Low Grade Glioma (TCGA-LGG) Dataset,” The Cancer Imaging Archive (TCIA).
[16] J. C. Caicedo et al., “Nucleus segmentation across imaging experiments: the 2018 Data Science Bowl,” Nat. Methods, vol. 16, no. 12, pp. 1247–1253, 2019, doi: 10.1038/s41592-019-0612-7.
[17] D. Jha, P. H. Smedsrud, and M. A. Riegler, “Kvasir-SEG: A Segmented Polyp Dataset,” vol. 2, pp. 451–462, doi: 10.1007/978-3-030-37734-2.
[18] N. Codella et al., “Skin Lesion Analysis Toward Melanoma Detection 2018: A Challenge Hosted by the International Skin Imaging Collaboration (ISIC),” pp. 1–12, 2019, [Online]. Available: http://arxiv.org/abs/1902.03368
[19] T. Hussain and H. Shouno, “Explainable Deep Learning Approach for Multi-Class Brain Magnetic Resonance Imaging Tumor Classification and Localization Using Gradient-Weighted Class Activation Mapping,” Inf., vol. 14, no. 12, 2023, doi: 10.3390/info14120642.
[20] D. Hussain, T. Hussain, A. A. Khan, S. A. A. Naqvi, and A. Jamil, “A deep learning approach for hydrological time-series prediction: A case study of Gilgit river basin,” Earth Sci. Informatics, vol. 13, no. 3, pp. 915–927, 2020, doi: 10.1007/s12145-020-00477-2.
[21] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” Proc. MICCAI. Cham, Switz. Springer, pp. 232–241, 2015, doi: 10.1007/978-3-319-24574-4.
[22] Z. Zhou, M. M. Rahman Siddiquee, N. Tajbakhsh, and J. Liang, “Unet++: A nested u-net architecture for medical image segmentation,” Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 11045 LNCS, pp. 3–11, 2018, doi: 10.1007/978-3-030-00889-5_1.
[23] M. Z. Alom, M. Hasan, C. Yakopcic, T. M. Taha, and V. K. Asari, “Recurrent residual convolutional neural network based on u-net (r2u-net) for medical image segmentation,” Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 12085 LNAI, pp. 207–219, 2020, doi: 10.1007/978-3-030-47436-2_16.
[24] D. Jha, M. A. Riegler, D. Johansen, P. Halvorsen, and H. D. Johansen, “DoubleU-Net: A deep convolutional neural network for medical image segmentation,” Proc. - IEEE Symp. Comput. Med. Syst., vol. 2020–July, no. 1, pp. 558–564, 2020, doi: 10.1109/CBMS49503.2020.00111.
[25] F. Isensee, P. F. Jaeger, S. A. A. Kohl, J. Petersen, and K. H. Maier-Hein, “nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation,” Nat. Methods, vol. 18, no. 2, pp. 203–211, 2021, doi: 10.1038/s41592-020-01008-z.
[26] M. Liu, Y. Han, J. Wang, C. Wang, Y. Wang, and E. Meijering, “LSKANet: Long Strip Kernel Attention Network for Robotic Surgical Scene Segmentation,” IEEE Trans. Med. Imaging, vol. 43, no. 4, pp. 1308–1322, 2024, doi: 10.1109/TMI.2023.3335406.
[27] Z. Huang et al., “CCNet: Criss-Cross Attention for Semantic Segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 6, pp. 6896–6908, 2023, doi: 10.1109/TPAMI.2020.3007032.
[28] O. Oktay et al., “Attention U-Net: Learning Where to Look for the Pancreas,” arXiv Prepr. arXiv1804.03999, no. Midl, 2018, [Online]. Available: http://arxiv.org/abs/1804.03999
[29] D. P. Fan et al., “PraNet: Parallel Reverse Attention Network for Polyp Segmentation,” Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 12266 LNCS, pp. 263–273, 2020, doi: 10.1007/978-3-030-59725-2_26.
[30] Y. Ling, Y. Wang, W. Dai, J. Yu, P. Liang, and D. Kong, “MTANet: Multi-Task Attention Network for Automatic Medical Image Segmentation and Classification,” IEEE Trans. Med. Imaging, vol. 43, no. 2, pp. 1–1, 2023, doi: 10.1109/tmi.2023.3317088.
[31] X. Gao, Y. Jin, Y. Long, Q. Dou, and P. A. Heng, “Trans-SVNet: Accurate Phase Recognition from Surgical Videos via Hybrid Embedding Aggregation Transformer,” Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 12904 LNCS, pp. 593–603, 2021, doi: 10.1007/978-3-030-87202-1_57.
[32] A. Dosovitskiy et al., “An Image Is Worth 16X16 Words: Transformers for Image Recognition At Scale,” ICLR 2021 - 9th Int. Conf. Learn. Represent., 2021.
[33] J. Chen et al., “TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation,” pp. 1–13, 2021, [Online]. Available: http://arxiv.org/abs/2102.04306
[34] H. Cao et al., “Swin-Unet: Unet-Like Pure Transformer for Medical Image Segmentation,” Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 13803 LNCS, pp. 205–218, 2023, doi: 10.1007/978-3-031-25066-8_9.
[35] A. Lin, B. Chen, J. Xu, Z. Zhang, G. Lu, and D. Zhang, “DS-TransUNet: Dual Swin Transformer U-Net for Medical Image Segmentation,” IEEE Trans. Instrum. Meas., vol. 71, pp. 1–15, 2022, doi: 10.1109/tim.2022.3178991.
[36] B. Xiao, J. Hu, W. Li, C. M. Pun, and X. Bi, “CTNet: Contrastive Transformer Network for Polyp Segmentation,” IEEE Trans. Cybern., vol. PP, pp. 1–14, 2024, doi: 10.1109/TCYB.2024.3368154.
[37] A. He, K. Wang, T. Li, C. Du, S. Xia, and H. Fu, “H2Former: An Efficient Hierarchical Hybrid Transformer for Medical Image Segmentation,” IEEE Trans. Med. Imaging, vol. 42, no. 9, pp. 2763–2775, 2023, doi: 10.1109/TMI.2023.3264513.
[38] F. Yuan, Z. Zhang, and Z. Fang, “An effective CNN and Transformer complementary network for medical image segmentation,” Pattern Recognit., vol. 136, p. 109228, 2023, doi: 10.1016/j.patcog.2022.109228.
[39] Z. Zhang, H. Fu, H. Dai, J. Shen, Y. Pang, and L. Shao, “ET-net: A generic Edge-aTtention guidance network for medical image segmentation,” Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 11764 LNCS, pp. 442–450, 2019, doi: 10.1007/978-3-030-32239-7_49.
[40] H. Liu et al., “MEA-Net: multilayer edge attention network for medical image segmentation,” Sci. Rep., vol. 12, no. 1, pp. 1–15, 2022, doi: 10.1038/s41598-022-11852-y.
[41] M. A. Islam, M. Rochan, N. D. B. Bruce, and Y. Wang, “Gated feedback refinement network for dense image labeling,” Proc. - 30th IEEE Conf. Comput. Vis. Pattern Recognition, CVPR 2017, vol. 2017–Janua, pp. 4877–4885, 2017, doi: 10.1109/CVPR.2017.518.
[42] E. Shibuya and K. Hotta, “Feedback u-net for cell image segmentation,” IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Work., vol. 2020–June, pp. 4195–4203, 2020, doi: 10.1109/CVPRW50498.2020.00495.
[43] A. Mosinska, P. Marquez-Neila, M. Kozinski, and P. Fua, “Beyond the Pixel-Wise Loss for Topology-Aware Delineation,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., vol. 1, pp. 3136–3145, 2018, doi: 10.1109/CVPR.2018.00331.
[44] H. Kervadec, J. Bouchtiba, C. Desrosiers, E. Granger, J. Dolz, and I. Ben Ayed, “Boundary loss for highly unbalanced segmentation,” Med. Image Anal., vol. 67, pp. 1–21, 2021, doi: 10.1016/j.media.2020.101851.
[45] A. Buslaev, V. I. Iglovikov, E. Khvedchenya, A. Parinov, M. Druzhinin, and A. A. Kalinin, “Albumentations: Fast and flexible image augmentations,” Inf., vol. 11, no. 2, pp. 1–20, 2020, doi: 10.3390/info11020125.
[46] M. Buda, “Brain MRI Segmentation: LGG Segmentation Dataset,” Kaggle.
[47] Q. Zuo, S. Chen, and Z. Wang, “R2AU-Net: Attention Recurrent Residual Convolutional Neural Network for Multimodal Medical Image Segmentation,” Secur. Commun. Networks, vol. 2021, 2021, doi: 10.1155/2021/6625688.
[48] M. M. Cheng and D. P. Fan, “Structure-Measure: A New Way to Evaluate Foreground Maps,” Int. J. Comput. Vis., vol. 129, no. 9, pp. 2622–2638, 2021, doi: 10.1007/s11263-021-01490-8.
[49] D. P. Fan, C. Gong, Y. Cao, B. Ren, M. M. Cheng, and A. Borji, “Enhanced-alignment measure for binary foreground map evaluation,” IJCAI Int. Jt. Conf. Artif. Intell., vol. 2018–July, pp. 698–704, 2018, doi: 10.24963/ijcai.2018/97.
