Abstract
Background
Breast cancer is one of the most prevalent malignancies among women worldwide and remains a major public health concern. Accurate classification of breast tumor subtypes is essential for guiding treatment decisions and improving patient outcomes. However, existing deep learning methods for histopathological image analysis often face limitations in balancing classification accuracy with computational efficiency, while failing to fully exploit the deep semantic features in complex tumor images.
Methods
We developed 3DSN-net, a dual-attention interaction network for multiclass breast tumor classification. The model combines two complementary strategies: (i) spatial–channel attention mechanisms to strengthen the representation of discriminative features, and (ii) deformable convolutional layers to capture fine-grained structural variations in histopathological images. To further improve efficiency, a lightweight attention component was introduced to support stable gradient propagation and multi-scale feature fusion. The model was trained and evaluated on two histopathological datasets, BreakHis and BCPSD, and benchmarked against several state-of-the-art CNN and Transformer-based approaches under identical experimental conditions. Experimental findings demonstrate that 3DSN-net consistently outperforms competing methods in both accuracy and robustness while maintaining favorable computational efficiency: it effectively distinguishes benign from malignant tumors as well as multiple subtypes, highlighting the advantages of combining spatial–channel attention with deformable feature modeling.
Results
Experimental results show that 3DSN-net consistently outperforms baseline CNN and Transformer models, achieving 92%–100% accuracy for benign tumors and 86%–99% for malignant tumors, with error rates below 8%. On average, it improves classification accuracy by 3%–5% and ROC-AUC by 0.02 to 0.04 compared with state-of-the-art methods, while maintaining competitive computational efficiency. By enhancing the interaction between spatial and channel attention mechanisms, the model effectively distinguishes breast cancer subtypes, with only a slight reduction in classification speed on larger datasets due to increased data complexity.
Conclusions
This study presents 3DSN-net as a reliable and effective framework for breast tumor classification from histopathological images. Beyond methodological improvements, the enhanced diagnostic performance has direct clinical implications, offering potential to reduce misclassification, assist pathologists in decision-making, and improve patient outcomes. The approach can also be extended to other medical imaging tasks. Future work will focus on optimizing computational efficiency and validating generalizability across larger, multi-center datasets.
Introduction
In recent years, breast cancer has become one of the most common malignant tumors in women worldwide [1,2,3]. Early diagnosis and accurate classification of breast cancer are crucial for developing effective treatment plans and for prognostic assessment [4]. According to the World Health Organization's World Cancer Report, breast cancer is the cancer with the highest morbidity and mortality among women worldwide [5]. The 2020 cancer burden data showed that there were as many as 2.26 million new cases of breast cancer globally, accounting for about 11.7% of all new cancer cases and surpassing lung cancer (2.2 million) for the first time as the most commonly diagnosed cancer worldwide [6, 7]. The automatic classification of breast cancer histopathologic images therefore has important clinical implications [8]. First, it can improve the early diagnosis rate and accuracy of breast cancer and help doctors determine treatment options and develop personalized treatment strategies. Second, automated classification can improve diagnostic efficiency, reduce human error, and save valuable time and resources. In addition, accurate classification results can help predict patient prognosis and survival, providing patients with a more accurate prognostic assessment.
In the pathological diagnosis of breast cancer, histopathological images [9] are an important aid in determining tumor type and grade through observation and analysis of the morphological features of tumor tissue [10,11,12]. However, traditional manual pathology assessment is subjective, time-consuming, and dependent on individual experience [13]. An automatic classification method for breast tumors based on multimodal images is therefore essential to reduce operator dependence and improve classification accuracy. Multimodal classification of breast tumors, however, still faces two major challenges, because most existing algorithms cannot fully exploit and fuse multimodal information. Specifically: (1) model complexity and interpretability: classification using multimodal data often requires more complex models and algorithms, which increases the difficulty of model training and tuning and reduces the interpretability and understandability of the results. (2) Existing classification methods consider only intra-modal or inter-modal interactions and cannot fully exploit multimodal information. Intra-modal fusion explores each modality independently and uses only the relationships within that modality, ignoring the correlations between modalities; conversely, when only inter-modal fusion is considered, the dependencies of local information cannot be captured and the multimodal information is again not fully exploited.
To address these two problems of multimodal image classification, we propose 3DSN-net, a novel dual-tandem attention interaction network.
In this network, we summarize our main contributions as follows.
(1) We propose a novel multi-variable interaction mechanism built around a dual-tandem attention module (DTAM), comprising multi-variable relational interaction and multilevel attention-fusion interaction, which integrates a spatial attention unit and a channel attention unit to fully exploit the multi-variable information.
(2) We developed a parameter-free attentional crosstalk mechanism, 3DsimAM combined with NEWCA (3DSN-net), which focuses attention on regions or features of interest and can be regarded as a useful form of a-priori feature learning. DCNv2 is used in place of normal convolution to integrate the two modalities, so that each modality provides complementary information to the other, and the DCNv2 formulation is extended to enhance the modeling capability.
(3) In multilevel attention fusion, special attention is paid to enhancing the discriminative ability of features: spatial attention units with high weights act as flags to distinguish regional features. We construct an attention module named C2f_Faster_EMA (CFE) to extract the features of the target image, and the outputs of the two attention modules are then fused to further improve the discriminative power of the features. By introducing this tandem interaction mechanism, the useful modal information is fully exploited to improve classification performance.
The rest of the paper is organized as follows. Section "Related Work" reviews prior work relevant to our study. Section "Methodology" details our research methodology. Section "Data and Experiments" describes the experiments used to evaluate the proposed algorithm. A comprehensive discussion of the experimental results is presented in Section "Discussion". Finally, we summarize our conclusions in Section "Conclusion".
Related work
Researchers have begun to explore the use of computer-aided diagnosis (CAD) techniques [14] for automated classification of breast cancer histopathology images. This approach trains machine learning and deep learning algorithms on large numbers of histopathological images to perform automated classification and diagnosis [15,16,17,18]. Flores et al. [19] tested different morphological and textural feature sets, selected with mutual-information-based and statistical feature selection methods, to evaluate benign versus malignant classification performance in breast ultrasound images. Shan et al. [20] compared multiple machine learning methods, such as Support Vector Machine (SVM), Random Forest (RF), and Convolutional Neural Network (CNN), on a breast ultrasound dataset; their results showed that the Random Forest algorithm achieved the highest classification accuracy of 78.5%. Jebarani et al. [21] fused K-means with a Gaussian Mixture Model (GMM) to build a new hybrid model that improves classification accuracy. Shia et al. [22] classified 2D feature information in breast ultrasound images using a histogram-of-oriented-gradients pyramid description and used a correlation-based feature selection method to select and evaluate the main feature set; this method achieved 81.64% sensitivity and 87.76% specificity in classifying benign and malignant tumor images. Although these traditional machine learning methods improve classification accuracy, they require excessive human intervention and cost, which is not conducive to practical application. Therefore, some scholars have turned to deep learning methods for classifying breast tumor images, and many research teams have made significant progress. They have employed various algorithms based on feature extraction and classifier design, and in recent years many model architectures have been successfully used for target detection and classification [23], such as improved R-CNN, Faster R-CNN, and YOLO (Shi et al. [24], 2022; Chan et al. [25], 2023; Cai et al. [26], 2024; Umirzakova et al. [27], 2025). Byra [28] introduced a deep learning framework that classifies regions of interest in ultrasound images by adding a deep representation scaling (DRS) layer to a convolutional neural network via transfer learning, analyzing breast masses from input images and improving information flow by updating the DRS layer parameters; the results showed that the DRS layer can improve classification. Cui et al. [29] fused a network with two modules to enhance the joint tumor features and then used a three-branch module to extract and fuse features within the tumor, around the tumor, and in the combined tumor region; this approach achieved accuracies of 96.3% and 94.8% on two publicly available breast image datasets (BUS and UDIAT). Sahu et al. [30] proposed five hybrid convolutional neural network deep models for breast cancer detection; experiments on both small and large datasets showed good classification results, and the proposed hybrid classifiers outperformed the individual classifiers.
Real-time target classification continues to progress with the release of YOLO. YOLOv8 takes target detection and classification to new heights, with faster inference and higher accuracy than previous versions (e.g., YOLOv5). Features are combined and blended at the neck and then passed to the head of the network, where YOLO predicts the location and class of objects around which a bounding box should be drawn [31,32,33,34]. YOLO deep learning (DL) models have been applied to classify pathological images of breast cancer, lung cancer, bone tumors, and brain tumors, but they have not given good results on breast cancer pathology images [35]. Hassan et al. [36] combined YOLOv4 with a Vision Transformer (ViT) to detect and classify breast masses; by introducing the ViT to improve the model's ability to learn global and semantic features, computational efficiency and performance were greatly improved compared with traditional CNN models. Toa et al. [37] combined residual learning with an attention mechanism to automatically classify breast cancer and to highlight important feature regions in whole-slide images (WSI), especially for more complex breast cancer lesions. Aly et al. [38] proposed an end-to-end YOLO-based computer-aided diagnosis system, using ResNet and Inception to extract tumor features and comparing their classification performance. These methods show excellent performance on different datasets and achieve high accuracy and sensitivity. However, despite YOLO's remarkable success, its deployment has not been without challenges, including sensitivity to target scale and considerable computational resource requirements. In addition, ethical considerations such as data privacy and algorithmic bias need to be taken into account when developing YOLO-based systems, especially in healthcare. To realize the full potential of YOLO, these issues need to be addressed through ongoing and future research.
Methodology
Figure 1 illustrates the overall structure of 3DSN-net for benign and malignant classification of breast tumors. The overall construction of the network is discussed first, then the four modular parts of 3DSN-net are described, and finally the classification principle of the 3DsimAM and NEWCA dual tandem mechanism is explained. The study protocol was approved by the Institutional Review Board of Ningbo Li Huili Hospital and was conducted in accordance with the Declaration of Helsinki. A total of 5,965 microscopic images were assembled from a representative subset of the BreakHis dataset [39] (4,965 images, with a benign-to-malignant ratio of 1:2) combined with breast tumor tissues collected from 98 patients by clinicians at Ningbo Li Huili Hospital, and the proposed 3DSN-net was evaluated on these images. The institutional review board exempted participants from informed consent because the study was retrospective and used completely anonymized data.
Overview
3DSN-net consists of four main components: the feature extraction module, the Channel Feature Extraction (CFE) module, 3DsimAM, and NEWCA. An input histopathology image is first processed by a two-layer DCNv2 backbone, which captures spatially deformable patterns and facilitates information interaction across feature maps, generating multi-scale feature maps. These feature maps are then fed into the CFE module, where channel-wise features are evaluated and fused to achieve the first inter-modal 3D interaction fusion while efficiently reducing redundant computations and memory accesses for spatial feature extraction. The fused features are subsequently passed through the Dual Tandem Attention Module (DTAM), where 3DsimAM focuses on salient spatial regions and NEWCA performs channel recalibration to enhance inter-modal feature representation. Finally, the refined features are aggregated and sent to the classifier to produce the predicted breast tumor subtype. This sequential processing ensures that each module contributes complementary contextual information and high-resolution features, enabling the network to capture distinctive patterns for accurate histopathological classification.
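To make the data flow described above concrete, the following minimal PyTorch sketch mirrors the four-stage pipeline. It is an illustrative skeleton only: the component modules are passed in as arguments, and the feature width, pooling head, and class count are our assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn

class DSN3Net(nn.Module):
    """Illustrative skeleton of the 3DSN-net data flow (not the authors' code)."""
    def __init__(self, backbone, cfe, simam, newca, num_classes=8, feat_dim=256):
        super().__init__()
        self.backbone = backbone   # two-layer DCNv2 feature extractor
        self.cfe = cfe             # channel feature extraction / fusion (C2f_Faster_EMA)
        self.simam = simam         # 3DsimAM spatial attention (parameter-free)
        self.newca = newca         # NEWCA channel/coordinate attention
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(feat_dim, num_classes),
        )

    def forward(self, x):
        feats = self.backbone(x)   # multi-scale, deformable features
        fused = self.cfe(feats)    # first inter-modal fusion, redundancy reduction
        fused = self.simam(fused)  # emphasize salient spatial regions
        fused = self.newca(fused)  # recalibrate channels with positional cues
        return self.head(fused)    # predicted tumor subtype logits
```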
Feature extraction module
3DSN-net uses two DCNv2 [40] modulated deformable convolution layers in place of normal convolutions in the CSPDarkNet-53 feature-extraction backbone. To reduce the number of trainable parameters and prevent overfitting in the image classification task, the convolutional layers apply filters to the input image and detect similar features at different locations; the filter parameters are shared, i.e., the same weights are used at different locations. This greatly reduces the number of parameters and improves the computational efficiency and generalization of the model. The weights applied to each sampling point are:
where \(W_k\) and \(X_k\) denote the weight and the pre-specified offset at the k-th sampling location, \(\Delta x_k\) is the learned offset and \(\Delta m_k\) is the learned modulation weight, and \(a(x)\) and \(y(x)\) represent the features at position x on the input feature map x and the output feature map y, respectively.
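As a hedged illustration of modulated deformable convolution, the sketch below uses torchvision's `deform_conv2d` operator (available in recent torchvision releases), which accepts per-location offsets \(\Delta x_k\) and a modulation mask \(\Delta m_k\); the layer sizes and initialization are illustrative assumptions rather than the exact configuration used in 3DSN-net.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class ModulatedDeformConv(nn.Module):
    """DCNv2-style layer: offsets and modulation scalars are predicted from the
    input and applied during sampling (illustrative settings)."""
    def __init__(self, in_ch, out_ch, k=3, padding=1):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_ch))
        # 2*k*k offset channels (x and y per sampling point) + k*k mask channels
        self.offset_mask = nn.Conv2d(in_ch, 3 * k * k, kernel_size=k, padding=padding)
        nn.init.zeros_(self.offset_mask.weight)
        nn.init.zeros_(self.offset_mask.bias)
        self.k, self.padding = k, padding

    def forward(self, x):
        om = self.offset_mask(x)
        offset, mask = torch.split(om, [2 * self.k * self.k, self.k * self.k], dim=1)
        mask = torch.sigmoid(mask)  # modulation weights in [0, 1], as in DCNv2
        return deform_conv2d(x, offset, self.weight, self.bias,
                             padding=self.padding, mask=mask)
```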
For deformable RoIPooling, the branch features \(y(k)\) are computed as:
where the values of \(\Delta x_k\) and \(\Delta m_k\) are generated by a branch on the input feature map, \(X_{kj}\) is the sampling location of the j-th grid cell in the k-th bin, and \(n_k\) denotes the number of sampled grid cells. In this branch, RoIPooling generates features on the RoI, followed by two 1024-D fully connected (\(F_c\)) layers. An additional \(F_c\) layer generates an output of 3K channels (with weights initialized to zero): the first 2K channels are the learnable offsets \(\Delta x_k\), and the remaining K channels are passed through a sigmoid layer to produce \(\Delta m_k\).
For the encoder-decoder module, this paper adopts an improved multi-resolution feature fusion based on the ResNet50 network to eliminate redundant information; the overall framework of the network is shown in Fig. 2. The three input feature maps are features of three different sizes extracted by the ResNet network, and each is resized by a two-dimensional convolution module matched to its feature-map size. A parallel feature fusion strategy is used, and the output is given by:
where x and y denote the input and output features of the i-th sample, and \(K_i\) is the i-th convolution kernel.
For the weighted network, the output feature matrices of the Conv_1, Conv3_x, and Conv5_x layers of ResNet50 are selected as the inputs to 3DSN-net, corresponding to feature input 1, feature input 2, and feature input 3, respectively. The sizes of the three levels of feature matrices are then adjusted on the fly: convolutional layers, pooling layers, batch normalization layers, and activation functions are used ahead of the concat operation to keep the number of channels of the different feature maps consistent. At its core, DCNv2 is applied to deformably convolve the previously obtained features to complete the information interaction and obtain the three weight matrices.
Without adding DCNv2, the gradient of a pixel (m, n) in the feature map is calculated as:
where L is the loss function, \(\partial x_{ij}\) denotes the gradient at a pixel of the feature map, and \(\partial y_{ij}^{1}\), \(\partial y_{ij}^{2}\), \(\partial y_{ij}^{3}\) are the gradients of the three levels of features, respectively. This initial formulation ignores the relative importance of the different feature levels, so the weights of different features are distributed unevenly. After introducing DCNv2, the gradient is calculated as:
In the equation, \(\alpha_{ij}\), \(\beta_{ij}\), and \(\gamma_{ij}\) are the feature weight parameters for the different levels, which ensures that the weight parameters are assigned automatically.
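The sketch below illustrates the idea of aligning the three ResNet50 feature levels and fusing them with learnable level weights (playing the role of \(\alpha\), \(\beta\), \(\gamma\)). The channel widths, the softmax normalization, and the use of a weighted sum instead of the full concat-plus-DCNv2 pathway are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedLevelFusion(nn.Module):
    """Aligns three feature levels to a common size/width and fuses them with
    learnable per-level weights; softmax normalization is an assumption."""
    def __init__(self, in_chs=(64, 512, 2048), out_ch=256, out_size=(56, 56)):
        super().__init__()
        self.align = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c, out_ch, 1), nn.BatchNorm2d(out_ch),
                          nn.ReLU(inplace=True))
            for c in in_chs
        ])
        self.level_weights = nn.Parameter(torch.ones(3))  # alpha, beta, gamma
        self.out_size = out_size

    def forward(self, f1, f2, f3):
        feats = [F.interpolate(a(f), size=self.out_size, mode="bilinear",
                               align_corners=False)
                 for a, f in zip(self.align, (f1, f2, f3))]
        w = torch.softmax(self.level_weights, dim=0)       # automatic weight assignment
        return w[0] * feats[0] + w[1] * feats[1] + w[2] * feats[2]
```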
Lightweight feature classification module
To design a fast and lightweight neural network, many works have focused on reducing FLOPs [41,42,43]; however, preliminary experiments show that reducing FLOPs does not necessarily reduce latency. In this paper, we propose a CFE module for vision classification that mainly targets classification tasks with large amounts of data and can provide richer gradient-flow information while remaining lightweight. The general framework of the CFE module is shown in Fig. 3.
Since FasterNet is a new family of neural networks that achieves faster operation than other networks on various devices without hurting accuracy on classification tasks, we replace all of the Bottleneck blocks in the C2f module used for feature fusion and classification with FasterNet blocks. After the residual structure in C2f splits the tumor modality features extracted in the previous section, the branches can be convolved, upsampled, and spliced with feature maps from different layers, so the model obtains more contextual and high-resolution information at the same time, although the computation speed is reduced. The PConv operator in FasterNet, on the other hand, reduces redundant computation (as shown in Eq. 6) and operator memory accesses so that spatial features are extracted more efficiently.
where h and w denote the spatial dimensions of the output feature map, k is the kernel size, c is the number of input channels, and \(c_p\) is the reduced channel dimension introduced in PConv. The variable r quantifies the computational reduction ratio, highlighting the efficiency gain of PConv over standard convolution. At the typical partial rate \(r = c_p / c = 1/4\), the FLOPs are only 1/16 of those of a regular convolution. On top of that, PConv also has a smaller memory access cost:
where the memory access is only 1/4 of that of a conventional convolution when r is equal to 1/4.
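A minimal sketch of the PConv idea described by Eqs. (6)–(7) is shown below: a regular convolution is applied to only the first \(c_p = r \cdot c\) channels, while the remaining channels pass through unchanged. The kernel size and channel split are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    """Partial convolution: a k x k conv is applied to only c_p = r*c channels;
    the remaining channels pass through untouched (r = 1/4 as in the text)."""
    def __init__(self, channels, k=3, r=0.25):
        super().__init__()
        self.cp = max(1, int(channels * r))
        self.conv = nn.Conv2d(self.cp, self.cp, k, padding=k // 2, bias=False)

    def forward(self, x):
        xa, xb = x[:, :self.cp], x[:, self.cp:]
        return torch.cat([self.conv(xa), xb], dim=1)

# FLOPs check: the conv over c_p channels costs h*w*k^2*c_p^2, i.e. (c_p/c)^2 = 1/16
# of a full convolution when r = 1/4, matching the ratio stated for Eq. (6).
```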
Finally, we add the EMA attention mechanism to the model. The EMA attention mechanism [44] calculates attention weights by an exponential moving average, adaptively learning the importance of different feature maps and adjusting the weighting parameters used during feature fusion, as shown in Eq. 8.
where \(\nu_t\) is the EMA result at time t, \(\alpha\) is the weight parameter, which we set to 0.85, and \(\theta_t\) is the sample value at time t.
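For illustration, the exponential moving average weighting can be written as the small routine below, assuming the standard recursive form \(v_t = \alpha v_{t-1} + (1-\alpha)\theta_t\) with \(\alpha = 0.85\); the exact form of Eq. 8 in the model may differ.

```python
def ema(samples, alpha=0.85):
    """Exponential moving average v_t = alpha * v_{t-1} + (1 - alpha) * theta_t."""
    v = samples[0]
    out = [v]
    for theta in samples[1:]:
        v = alpha * v + (1 - alpha) * theta
        out.append(v)
    return out

# The influence of older samples decays geometrically with factor alpha:
print(ema([1.0, 0.0, 0.0, 0.0]))  # [1.0, 0.85, 0.7225, 0.614125]
```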
Double tandem attention module
The DTAM comprises the improved channel attention module NEWCA and the spatial attention module 3DsimAM; its overall structure is shown in Fig. 4. An attention mechanism focuses on important information and ignores secondary information, thereby extracting information effectively. Studies to date, such as BAM and CBAM, combine spatial attention and channel attention in parallel or in series, respectively. However, the two kinds of attention in the human brain tend to work together, so we propose a serial attention module with unified weights.
Neurons need to be assessed in order to realize attention effectively. In neuroscience, neurons carrying rich information usually exhibit firing patterns different from those of other neurons. Moreover, activated neurons inhibit surrounding neurons, i.e., spatial suppression; in other words, neurons exhibiting such suppression effects should be given higher importance. The simplest way to find important neurons is to measure the linear separability between neurons. We therefore define the energy function:
where t and \(x_i\) denote the target neuron and the other neurons on a single channel of the input \(X \in \mathbb{R}^{C \times H \times W \times Z}\), \(\hat{t}\) and \(\hat{x}_i\) are the linear transformations of t and \(x_i\), respectively, \(w_i\) and \(b_t\) represent the weight and bias of the linear transformation, i is the index along the spatial dimension, and M = H×W denotes the total number of neurons on that channel. Minimizing the above equation is equivalent to training the linear separability between neuron t and the other neurons in the same channel. Using binary labels [45,46,47] and adding a regularization term, the final energy function is defined as follows:
where \(w_i x_i + b_t\) and \(w_i t + b_t\) are the linear transformations \(\hat{x}_i\) and \(\hat{t}\), respectively, and \(\lambda\) is the regularization weight coefficient. The analytic solution of the above equation is:
where \(\mu_t\) and \(\sigma_t^2\) are the mean and variance computed over all neurons in the channel except t. Since all neurons on each channel follow the same distribution, the mean and variance can be computed once over the H and W dimensions of the input features to avoid repeated computation:
Equation (13) indicates that the lower the energy \(e_t^*\), the more distinct neuron t is from its surrounding neurons and the more important it is for visual processing. Therefore, the importance of each neuron can be obtained as \(1/e_t^*\). The whole process can be represented as:
where E aggregates all \(e_t^*\) across the channel and spatial dimensions, and a sigmoid function is applied to constrain excessively large values in E; since the sigmoid is monotonic, it does not affect the relative importance of the neurons. Channel attention is effective in improving model performance, but it usually ignores location information, which is important for generating spatially selective attention maps. Therefore, in this paper we propose a novel attention mechanism for mobile networks that embeds location information into the channel attention, called the "NEWCA" attention mechanism (the module structure is shown in Fig. 4). The proposed multipath coordinate attention is simple and can be flexibly inserted into the network; it has one more channel branch than the previously proposed CA attention and afterwards fuses the X and Y weights. To encourage the attention blocks to capture spatially remote interactions with precise location information, the global pooling of Eq. (14) is decomposed into a pair of one-dimensional feature encoding operations. Specifically, given the input X, we encode each channel along the horizontal and vertical coordinates using pooling kernels of spatial extent (H, 1) and (1, W), respectively. The output of the c-th channel at height h can be expressed as:
Similarly, the output of the c-th channel at width \(\omega\) is expressed as:
The above two transformations aggregate features along the two spatial directions, producing a pair of direction-aware feature maps. With the aggregated feature maps produced by Eq. (15) and Eq. (16), we concatenate them and feed them to the shared convolutional transform function \(g^1\):
where \([*,*]\) denotes concatenation along the spatial dimension, \(\delta\) is the nonlinear activation function, and \(\mathrm{f} \in \mathbb{R}^{C/r \times (H+W)}\) is the intermediate feature map that encodes spatial information in both the horizontal and vertical directions, with r the reduction ratio used to control the module size. We then split f along the spatial dimension into two independent tensors \(\mathrm{f}^h \in \mathbb{R}^{C/r \times H}\) and \(\mathrm{f}^\omega \in \mathbb{R}^{C/r \times W}\), and use two convolutional transforms \(\mathrm{C_h}\) and \(\mathrm{C_w}\) to convert \(\mathrm{f}^h\) and \(\mathrm{f}^\omega\) into tensors with the same number of channels as the input X, obtaining:
Subsequently, the two branches \(\mathrm{f}^h\) and \(\mathrm{f}^\omega\) are passed through sigmoid functions; their mean responses are computed and the resulting X- and Y-direction weights are multiplied with the input tensor in a coordinated product. This locates the region of interest more accurately and helps the overall model to complete the classification task better.
The NEWCA approach thus captures location information and channel relationships in a more efficient way to enhance the feature representation of the network. By decomposing the two-dimensional global pooling operation into two one-dimensional encoding processes, the proposed method runs better than other lightweight attention methods, as we demonstrate exhaustively in the experimental section.
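To summarize the DTAM computation, the sketch below chains a SimAM-style parameter-free spatial attention (the inverse-energy weighting of Eqs. (12)–(13)) with a coordinate-attention-style channel branch (1-D pooling along H and W, a shared transform, and per-direction sigmoid weights). The reduction ratio, layer choices, and the exact fusion of the directional weights are our assumptions for a 2-D feature map; the authors' 3-D variant and precise configuration are not published.

```python
import torch
import torch.nn as nn

class SimAMAttention(nn.Module):
    """SimAM-style attention: weights each neuron by a quantity proportional to
    the inverse energy 1/e_t* and gates the input through a sigmoid."""
    def __init__(self, lam=1e-4):
        super().__init__()
        self.lam = lam

    def forward(self, x):                                    # x: (B, C, H, W)
        b, c, h, w = x.shape
        n = h * w - 1
        mu = x.mean(dim=(2, 3), keepdim=True)
        d = (x - mu) ** 2
        var = d.sum(dim=(2, 3), keepdim=True) / n
        inv_energy = d / (4 * (var + self.lam)) + 0.5        # proportional to 1/e_t*
        return x * torch.sigmoid(inv_energy)

class CoordChannelAttention(nn.Module):
    """NEWCA-style branch: 1-D pooling along H and W, a shared transform,
    then per-direction sigmoid weights multiplied back onto the input."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        mid = max(8, channels // reduction)
        self.shared = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        self.to_h = nn.Conv2d(mid, channels, 1)
        self.to_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):                                    # x: (B, C, H, W)
        b, c, h, w = x.shape
        ph = x.mean(dim=3, keepdim=True)                      # (B, C, H, 1)
        pw = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # (B, C, W, 1)
        y = self.shared(torch.cat([ph, pw], dim=2))           # concat along spatial dim
        yh, yw = torch.split(y, [h, w], dim=2)
        ah = torch.sigmoid(self.to_h(yh))                          # (B, C, H, 1)
        aw = torch.sigmoid(self.to_w(yw.permute(0, 1, 3, 2)))      # (B, C, 1, W)
        return x * ah * aw

class DTAM(nn.Module):
    """Dual tandem attention: spatial (SimAM-style) followed by channel (NEWCA-style)."""
    def __init__(self, channels):
        super().__init__()
        self.spatial = SimAMAttention()
        self.channel = CoordChannelAttention(channels)

    def forward(self, x):
        return self.channel(self.spatial(x))
```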
Data and experiments
Dataset and experimental setup
Dataset
The breast tumor classification data used in this study combine two sources: the Breast Cancer Histopathological Database (BreakHis) [39], established in collaboration with the P&D Laboratory of Pathological Anatomy and Cytopathology, Paraná, Brazil, and the Breast Cancer Pathology Section Database (BCPSD), collected by physicians from the Joint Laboratory of Breast Oncology and Surgery, Li Huili Hospital, Ningbo Medical Center, Ningbo, China. The combined dataset, referred to as BBD, was collected from 180 patients (aged 24 to 70 years) and comprises 5,965 pathology slides, from which a total of 1,655 benign tumor cases and 3,310 malignant tumor cases were screened; the acquisition period ran from January 1, 2020 to January 1, 2024.
Figure 5 shows sample images for each of the eight types of breast tumor in BBD: adenosis (A), fibroadenoma (F), phyllodes tumor (PT), and tubular adenoma (TA) among benign lesions, and ductal carcinoma (DC), lobular carcinoma (LC), mucinous carcinoma (MC), and papillary carcinoma (PC) among malignant lesions. For the collected images, exclusion criteria included cases with unknown or incompletely documented tumor types, incomplete image reports, unclear classification, and images containing substantial interfering information. This reduces the labor required for image analysis and provides detailed information for the network. Table 1 summarizes the BBD dataset, including the class distribution, image magnifications, and acquisition sources. Five-fold cross-validation was performed for all experiments, with 4,772 images used for training and 1,193 for validation in each split. Data augmentation was applied to the dataset: target images were randomly cropped and randomly flipped horizontally with 75% probability and vertically with 15% probability. A two-sided, two-sample equal-variance t-test was used to calculate 96% confidence intervals for each parameter, and the significance threshold was set at p < 0.05.
Experimental setup
All tumor images are first processed through multi-channel weighted color conversion and feature calculation during training. To reduce color variability across samples, stain normalization was applied to standardize the hematoxylin and eosin (H&E) staining. All images were resized and cropped to an input resolution of 700 × 460 pixels, which is widely adopted in deep learning-based pathology studies. During training, data augmentation was applied, including random cropping, random horizontal flipping with a probability of 0.75, random vertical flipping with a probability of 0.15, and mild color jittering, to improve the robustness of the model and mitigate overfitting. The breast tumor classification experiments were conducted on a 64-bit Windows 10 system with an NVIDIA GeForce RTX 3060 GPU for deep learning training. The 3DSN-net algorithm was implemented using Python 3.8, PyTorch 1.9.1, and CUDA 11.1.
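The augmentation pipeline described above can be expressed with torchvision transforms roughly as follows; the crop scale, jitter strengths, and the placement of resizing are assumptions, and stain normalization (not part of torchvision) is assumed to be applied beforehand.

```python
from torchvision import transforms

# Training-time preprocessing consistent with the setup described above;
# crop scale and jitter strengths are illustrative assumptions.
train_transform = transforms.Compose([
    transforms.Resize((460, 700)),                        # H x W = 460 x 700
    transforms.RandomResizedCrop((460, 700), scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(p=0.75),
    transforms.RandomVerticalFlip(p=0.15),
    transforms.ColorJitter(brightness=0.1, contrast=0.1, saturation=0.1),
    transforms.ToTensor(),
])

eval_transform = transforms.Compose([
    transforms.Resize((460, 700)),
    transforms.ToTensor(),
])
```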
Our proposed classification model was trained with the mean squared error (MSE) loss to quantify the difference between predicted and true labels, and optimization was performed with the Adam optimizer. The initial learning rate was set to 6 × 10⁻⁵ with a cosine annealing schedule for gradual decay, and the weight decay was fixed at 1 × 10⁻⁴ to prevent overfitting. A batch size of 32 was used in all experiments. Training ran for at most 60 epochs for each training/validation split, with early stopping applied if the validation loss did not improve for 10 consecutive epochs. Data batches were loaded sequentially during training, and a checkpointing mechanism was employed to preserve the best-performing model with the minimal validation loss. Importantly, the specific hyperparameters, including the learning rate, batch size, and optimizer settings, were determined empirically by comparing multiple parameter combinations and selecting the configuration that yielded the best validation performance.
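The optimization settings listed above translate into a training loop of roughly the following form; the model, the data loaders, and the use of MSE on one-hot targets with raw network outputs are placeholders and assumptions, not the authors' exact script.

```python
import copy
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, num_classes=8, device="cuda"):
    """Training setup mirroring the reported hyperparameters (illustrative sketch)."""
    model.to(device)
    criterion = nn.MSELoss()                                  # MSE on one-hot labels, as stated
    optimizer = torch.optim.Adam(model.parameters(), lr=6e-5, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=60)

    best_loss, best_state, patience, bad_epochs = float("inf"), None, 10, 0
    for epoch in range(60):                                   # at most 60 epochs
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            targets = nn.functional.one_hot(labels, num_classes).float()
            loss = criterion(model(images), targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()

        model.eval()
        val_loss, n = 0.0, 0
        with torch.no_grad():
            for images, labels in val_loader:
                images, labels = images.to(device), labels.to(device)
                targets = nn.functional.one_hot(labels, num_classes).float()
                val_loss += criterion(model(images), targets).item() * images.size(0)
                n += images.size(0)
        val_loss /= n

        if val_loss < best_loss:                              # checkpoint the best model
            best_loss, best_state = val_loss, copy.deepcopy(model.state_dict())
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:                        # early stopping after 10 epochs
                break
    model.load_state_dict(best_state)
    return model
```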
Evaluation protocols
In this section, note that the distributions of benign and malignant masses are not the same, and this imbalance may reduce precision and classification accuracy for the minority classes. To comprehensively evaluate classification performance on the dataset, we report accuracy (ACC), specificity (SPE), precision (PRE), area under the receiver operating characteristic curve (AUC), and sensitivity (SEN), as well as TOP1_ACC and TOP5_ACC; the definitions of these evaluation metrics are given in Table 2.
Subsequently, the receiver operating characteristic (ROC) curve can be used as an additional evaluation of the classification performance of the network, since it does not depend on a single choice of specificity and sensitivity. The curve is constructed from the test performance at different diagnostic thresholds: the higher the sensitivity, the lower the missed-diagnosis rate, and the higher the specificity, the lower the misdiagnosis (false-positive) rate. An ideal test achieves 100% sensitivity with 0% false positives, corresponding to the upper left corner of the ROC plot. We also computed the area under the ROC curve (AUC); an AUC greater than 0.5 indicates a diagnostically useful classifier, and the larger the AUC, the better the diagnostic performance.
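For reference, the benign/malignant metrics used here (ACC, SEN, SPE, PRE, AUC) can be computed as in the sketch below using scikit-learn; the decision threshold of 0.5 and the toy inputs are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def binary_metrics(y_true, y_score, threshold=0.5):
    """ACC, SEN, SPE, PRE and ROC-AUC for a benign(0)/malignant(1) split."""
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "SEN": tp / (tp + fn),          # sensitivity / recall
        "SPE": tn / (tn + fp),          # specificity
        "PRE": tp / (tp + fp),          # precision
        "AUC": roc_auc_score(y_true, y_score),
    }

print(binary_metrics([0, 0, 1, 1, 1], [0.2, 0.6, 0.7, 0.4, 0.9]))
```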
Ablation experiments
In this section, we experimentally verify the validity of each module of the interaction unit. NEWCA, CFE, DCNv2, and 3DsimAM were removed separately to validate the function of each module. The results of the experimental comparisons are shown in Tables 3 and 4, which summarize both the classification performance and the computational metrics (parameters, FLOPs, and inference time) for each ablation variant. Comparing each individual interaction unit with the full 3DSN-net shows significant improvements in all respects when the complete network is used. These performance gaps show that the proposed algorithm provides accurate classification results through the mutual interaction of its modules; in addition, 3DSN-net's handling of classification errors further improves performance.
Comparison with SOTA methods
To validate the performance of the proposed 3DSN-net in the classification of eight types of breast tumors, we compared it with mainstream networks including NMTNet, AlexNet, ResNet50, VGG19, MobileNet v3, ShuffleNet v3, and GoogLeNet; the detailed comparison of the classification results is given in Table 6 (in the supplementary material). From the data in Table 3, it can be seen that 3DSN-net improves the ACC, SPE, and SEN metrics by about 15% for the F, PT, TA, DC, LC, MC, and PC tumor types. In addition, to reveal the effect of the proposed method more intuitively, the ROC curves of the tested networks are plotted in Fig. 6.
In the figure, the ROC curves compare the proposed algorithm with several state-of-the-art networks. Because the comparison is based on sensitivity and specificity across thresholds, the assessment of diagnostic accuracy is independent of the decision criterion and of disease prevalence. The area under the ROC curve of 3DSN-net is the largest, indicating the best performance; its curve lies closest to the upper left corner, i.e., closest to the ideal operating point, and is superior to the other methods.
Table 5 reports the precision (PRE) performance of a set of representative models on the TA tumor classification task. For each model, we provide the mean ± standard deviation (SD) across folds, the corresponding 95% confidence interval (CI), and the p-value for comparisons against baseline models. This presentation allows a clear and statistically informative comparison of model performance. Notably, our proposed 3DSN-net achieves a PRE of 0.83 ± 0.02 with a 95% CI of 0.815–0.854, showing a statistically significant improvement compared to several other models. The other models’ PRE values and CIs are also reported for completeness, and statistical tests were conducted to assess the significance of differences, ensuring fair and reliable comparisons.
Experimental results and visualization analysis
Figure 7 shows the tumor classification results obtained on the BBD dataset at 40× magnification; additional visualized results are provided in Fig. 10 of the supplementary material. From the experimental results in the figure, the accuracy of our proposed network ranges from 92% to 100% for benign tumor classification and from 86% to 99% for malignant tumor classification, and the classification error rate for both tumor categories is below 8%.
To better understand the limitations of 3DSN-net, we examined a subset of misclassified breast tumor images, as shown in Fig. 8. Most errors occur when tumor morphology is atypical or histopathological features are subtle, causing overlap between benign and malignant patterns. Variations in staining, imaging artifacts, and low contrast also contribute. These cases highlight areas for potential model improvement.
To visualize the overall performance of the proposed 3DSN-net with the tandem attention mechanism more intuitively, the final experimental results are presented as training loss curves and confusion matrices; Fig. 9 shows the loss curves and confusion matrix plots before and after the algorithm improvement.
(a) Confusion matrix showing the performance of the algorithm on the BBD dataset before the improvement. (b) Change in training loss and validation accuracy on the BACH dataset before the improvement. (c) Performance of 3DSN-Net on the BBD dataset. (d) Change in training loss and validation accuracy on the BACH dataset after the 3DSN-Net improvement.
The confusion matrix enables a more detailed analysis of the classification results for benign and malignant tumors, not limited to accuracy. From the confusion matrix, it can be seen that the number of correct predictions for benign and malignant tumors among the eight breast tumor types is significantly improved, and the pre-trained DTAM-based model converges quickly and does not show large amplitude fluctuations at the beginning of the training process due to the loading of the pre-trained weight parameters.
Discussion
Breast cancer has become the leading cancer threatening women's health, and traditional breast cancer screening relies on the experience and subjective judgment of radiologists. In this paper, targeting the automatic classification of benign and malignant breast tumor pathology images and addressing the problems of poor feature extraction and low recognition rates for breast tumor images, we carry out multi-channel and multi-scale classification experiments from the perspective of tandem multi-attention mechanisms and propose a diagnostic classification method based on a dual-attention module in tandem, i.e., NEWCA and 3DsimAM, which is validated in eight-class experiments on real breast images. In this study, we developed 3DSN-net to improve the early diagnosis of breast cancer and reduce the workload of doctors. Traditional classification algorithms do not fully exploit the available information, which becomes a bottleneck for performance improvement: they rely only on correlation-based feature selection methods to select and evaluate the main feature set for classification and cannot mine all of the useful information. To address this problem, we explored multi-model fusion classification methods to make full use of the useful information. Specifically, we developed a dual tandem network model based on two attention modules, i.e., the DTAM model. To address the problems of large parameter counts and model complexity common in previous classification algorithms, we developed the fast and lightweight CFE module to extract target image features and better exploit correlation, complementarity, and discriminative information.
To evaluate the effectiveness of 3DSN-net, a large number of experiments were conducted. First, unlike single-channel classification networks, our algorithm fully utilizes the interaction of multi-channel attention mechanisms to improve accuracy. In addition, our algorithm was compared with various recent common network algorithms and effectively improves the accuracy and robustness of target classification. The ablation experiments, in which DTAM is interconnected with the other attention modules, validate the effectiveness of each module; the best results are obtained when all four modules are used, giving the highest classification performance across the eight breast tumor types: top1_acc and top5_acc reach 86.2% and 99.9%, respectively, and the best PRE, SEN, ACC, and SPE are 90.1% (MC), 87.5% (PC), 97.8% (DC), and 99.2% (DC), respectively. 3DSN-net has thus demonstrated that breast cancer classification can be improved by the DTAM model.
Conclusion
We proposed the novel 3DSN-net for breast tumor classification to fully exploit depth feature information and improve classification performance. We introduced a dual tandem attention mechanism that leverages inter-feature correlation, complementarity, and discriminative informativeness to enhance feature interaction. A spatial attention unit (3DsimAM) and a channel attention unit (NEWCA) were integrated to focus on important regions and channel features, respectively. The CFE module was designed to efficiently fuse features while reducing redundant computation, and DCNv2 was employed for adaptive convolution to better capture deformable structures. Extensive experiments demonstrated that 3DSN-net significantly improves multiclassification of eight breast tumor subtypes, indicating its potential application value in real-world scenarios, such as pathology workflows, where it can assist in diagnostic decisions and improve both speed and accuracy. Future work will focus on multi-center dataset expansion, optimizing inference speed for real-time applications, and extending the framework to other medical image analysis tasks to enhance generalization and clinical applicability.
Data availability
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.
Abbreviations
- DCNv2: Deformable Convolution v2
- DTAM: Dual-tandem Attention Module
- 3DSN-net: 3DsimAM with NEWCA
- CNN: Convolutional Neural Network
- DRS: Deep Representation Scaling
- A: Adenosis
- F: Fibroadenoma
- PT: Phyllodes Tumor
- TA: Tubular Adenoma
- DC: Ductal Carcinoma
- LC: Lobular Carcinoma
- MC: Mucinous Carcinoma
- PC: Papillary Carcinoma
- BreakHis: Breast Cancer Histopathological Database
- ACC: Accuracy
- SPE: Specificity
- PRE: Precision
- AUC: Area Under the Receiver Operating Characteristic Curve
- SEN: Sensitivity
- ROC: Receiver Operating Characteristic Curve
- SD: Standard Deviation
- CI: Confidence Interval
References
Christopher MDOL, Tward J, et al. Malignant phyllodes tumor of the female breast[J]. Cancer. 2006;107:2127–33.
Yu K, Liang Tan, Lin L, et al. Deep-learning-empowered breast cancer auxiliary diagnosis for 5GB remote E-Health. IEEE Wirel Commun. 2021;28:54–61.
Murtaza G, Shuib L, Abdul Wahab AW, et al. Deep learning-based breast cancer classification through medical imaging modalities: state of the art and research challenges[J]. Artif Intell Rev. 2020;53.
Han Z, Wei B, Zheng Y, Yin Y, Li K, Li S. Breast cancer multi-classification from histopathological images with structured deep learning model[J]. Sci Rep. 2017;7.
Qi J, Li M, Wang L, et al. National and subnational trends in cancer burden in china, 2005-20: an analysis of national mortality surveillance data[J]. Lancet Public Health. 2023;8(12):e943–55.
Bizuayehu Habtamu D, Abel HT, Ahmed et al. Global burden of 34 cancers among women in 2020 and projections to 2040: population-based data from 185 countries/territories. Int J Cancer 154. 2023.
Qiu H, Cao S, Xu R. Cancer incidence, mortality, and burden in China: a time-trend analysis and comparison with the United States and United Kingdom based on the global epidemiological data released in 2020[J]. Cancer Communications. 2021.
Shankar K, Dutta A, Kumar S, Joshi GP, Doo IC. Chaotic sparrow search algorithm with deep transfer learning enabled breast cancer classification on histopathological images[J]. Cancers. 2022.
Liu Y, Liu X, Qi Y. Adaptive threshold learning in frequency domain for classification of breast cancer histopathological images[J]. Int J Intell Syst. 2024:1–13.
Spanhol FA, Oliveira LS, Petitjean C, Heutte L. A dataset for breast cancer histopathological image classification[J]. IEEE Trans Biomed Eng. 2016;63(7):1455–62.
Dalle J-R, Leow WK, Racoceanu D, Tutac A, Putti T. Automatic breast cancer grading of histopathological images[C]. Conference proceedings: Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Conference. 2008:3052-5.
Salim S, Sarath R. Breast cancer detection and classification using histopathological images based on optimization-enabled deep learning[J]. Biomedical Engineering: Applications, Basis and Communications. 2024;36(01).
Ye F, Fu B, Li Y, et al. Accurate prognostic prediction for breast cancer based on histopathological images by Artificial Intelligence[J]. Social Science Electronic Publishing; 2024.
Chan HP, Hadjiiski L, Samala RK. Computer-aided diagnosis in the era of deep learning[J]. Med Phys. 2020;47.
Aljuaid H, Alturki N, Alsubaie N, Cavallaro L, Liotta A. Computer-aided diagnosis for breast cancer classification using deep neural networks and transfer learning[J]. Comput Methods Programs Biomed. 2022;223:106951.
Tang J, Rangayyan RM, Xu J, El Naqa I, Yang Y. Computer-aided detection and diagnosis of breast cancer with mammography: recent advances[J]. IEEE Trans Inf Technol Biomed. 2009;13:236–51.
Wang J, Yang X, Cai H, et al. Discrimination of breast cancer with microcalcifications on mammography by deep Learning[J]. Sci Rep. 2016;6:27327.
Houssein EH, Emam MM, Ali AA et al. Deep and machine learning techniques for medical imaging-based breast cancer: a comprehensive review[J]. Expert Syst Appl. 2021;167.
Gómez-Flores W, Pereira WCA, et al. Improving classification performance of breast lesions on ultrasonography[J]. Pattern Recognition. 2015.
Shan J, Alam SK, Garra B, et al. Computer-Aided diagnosis for breast ultrasound using computerized BI-RADS features and machine learning Methods[J]. Ultrasound in Medicine & Biology; 2016.
Jebarani PE, Umadevi N, Dang H, et al. A novel hybrid K-means and GMM machine learning model for breast cancer detection[J]. IEEE Access. 2021;9:146153–62.
Shia WC, Lin LS, Chen DR. Classification of malignant tumours in breast ultrasound using unsupervised machine learning approaches[J]. Sci Rep. 2021;11(1):1–11.
Lin J, Yi, et al. Classifier design with feature selection and feature extraction using layered genetic programming[J]. Expert Syst Appl. 2008;34(2):1384–93.
Shi S, et al. PV-RCNN++: point-voxel feature set abstraction with local vector representation for 3D object detection[J]. Int J Comput Vision. 2022;131(2):531–51.
Chan PH, Anthony SG, Jennings PD et al. Influence of AVC and HEVC compression on detection of vehicles through faster R-CNN[J]. IEEE Trans Intell Transp Syst. 2023:1–11.
Cai YY, Yao ZK, Jiang HB, Qin W, Xiao J, Huang XX, Pan JJ, Feng H. Rapid detection of fish with SVC symptoms based on machine vision combined with a NAM-YOLO v7 hybrid model[J]. Aquaculture. 2024:582.
Umirzakova S, Shakhnoza M, Sevara M, Whangbo TK. Deep learning for multiple sclerosis lesion classification and stratification using MRI[J]. Comput Biol Med. 2025;192:110078. Pt A).
Byra M. Breast mass classification with transfer learning based on scaling of deep representations[J]. Biomed Signal Process Control. 2021;69:102828.
Cui W, Peng Y, Yuan G, et al. FMRNet: a fused network of multiple tumoral regions for breast tumor classification with ultrasound images[J]. Med Phys. 2022;49(1):144–57.
Sahu A, Das PK, Meher S. High accuracy hybrid CNN classifiers for breast cancer detection using mammogram and ultrasound datasets[J]. Biomed Signal Process Control. 2023;80:104292.
Zhongzhen S, Xiangguang L, Yu L, Boli X, Kefeng J, Gangyao K. BiFA-YOLO: a novel YOLO-based method for arbitrary-oriented ship detection in high-resolution SAR images[J]. Remote Sens. 2021;13.
Al-Masni MA, et al. Simultaneous detection and classification of breast masses in digital mammograms via a deep learning YOLO-based CAD System[J]. Comput Methods Programs Biomed. 2018;157:85–94.
Aly GH, Marey M, Amin SE, Tolba MF. YOLO based breast masses detection and classification in full-field digital mammograms[J]. Comput Methods Programs Biomed. 2021;200.
Ragab MG, Abdulkadir SJ, Muneer A, et al. A comprehensive systematic review of YOLO for medical object detection (2018 to 2023). IEEE Access. 2024;12:57815–36.
Hassan NM, Hamad S. K. YOLO-based CAD framework with ViT transformer for breast mass detection and classification in CESM and FFDM images. Neural Comput&Applic. 2024;36:6467–96.
Toa CK, Elsayed M. Deep residual learning with attention mechanism for breast cancer classification. Soft Comput. 2024;28:9025–35.
Aly GH, Marey M, Amin SE, Tolba MF. YOLO based breast masses detection and classification in full-field digital mammograms[J]. Comput Methods Programs Biomed. 2021;200.
World Medical Association. WMA-The World Medical Association-WMA declaration of Helsinki–Ethical principles for medical research involving human subjects. World Medical Association. 2022.
Spanhol FA, Oliveira LS, Petitjean C, Heutte L. A dataset for breast cancer histopathological image classification[J]. IEEE Trans Biomed Eng. 2016;63(7):1455–62.
Zhu X, Hu H, Lin S et al. Deformable convnets v2: more deformable, better results[C]/proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019:9308–9316.
Bochkovskiy A, Wang CY, Liao HYM. Yolov4: optimal speed and accuracy of object detection[J]. 2020. https://www.scirp.org/reference/referencespapers?referenceid=3308955.
He K, Zhang X, Ren S et al. Deep residual learning for image recognition[C]/proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 770–778.
Howard AG et al. MobileNets: Efficient convolutional neural networks for mobile vision applications[J]. 2017. https://arxiv.org/abs/1704.04861.
Zhu X, et al. Lightweight image super-resolution with expectation-maximization attention mechanism[J]. IEEE Trans Circuits Syst Video Technol PP. 2021;99:1–1.
Park J, Woo S, Lee JY, et al. A simple and light-weight attention module for convolutional neural networks[J]. Int J Comput Vision. 2020;128(4):783–98.
Woo S, Park J, Lee JY et al. Cbam: convolutional block attention module[C]/proceedings of the European conference on computer vision (ECCV). 2018:3–19.
Sheng H, Schomaker L. DeepOtsu: document enhancement and binarization using iterative deep Learning[J]. Pattern Recognition; 2019.
Acknowledgements
We sincerely thank the 98 patients who participated in this study for providing their tumor imaging data.
Funding
This research was funded by Heilongjiang Provincial Natural Science Foundation of China under Grant LH2022E087, Heilongjiang Province Key Research and Development Program of China under Grant 2023ZX01A08, Heilongjiang Province Key Research and Development Program of China under Grant JD2023SJ18, and Doctoral Development Fundation of LiHuili Hospital (2023BSKY-LL (B)).
Ethics declarations
Ethics approval and consent to participate
The study protocol was approved by the Institutional Review Board of Ningbo Medical Center LiHuiLi Hospital and was conducted in accordance with the Declaration of Helsinki. The institutional review board exempted participants from informed consent due to the retrospective nature of the study using completely anonymized data.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Li, L., Wang, M., Li, D. et al. 3DSN-net: dual-tandem attention mechanism interaction network for breast tumor classification. BMC Med Imaging 25, 393 (2025). https://doi.org/10.1186/s12880-025-01936-2
Received:
Accepted:
Published:
Version of record:
DOI: https://doi.org/10.1186/s12880-025-01936-2








