Introduction

In recent years, breast cancer has become one of the most common malignant tumors in women worldwide [1,2,3]. Early diagnosis and accurate classification of breast cancer are crucial for developing effective treatment plans and for prognostic assessment [4]. According to the World Health Organization's World Cancer Report, breast cancer is the cancer with the highest morbidity and mortality among women worldwide [5]. The 2020 cancer burden data showed approximately 2.26 million new cases of breast cancer globally, accounting for about 11.7% of all new cancer cases and surpassing lung cancer (2.2 million) for the first time to become the most frequently diagnosed cancer in the world [6, 7]. The study of automatic classification of breast cancer histopathological images therefore has important clinical implications [8]. First, it can improve the early diagnosis rate and accuracy of breast cancer and help doctors determine treatment options and develop personalized treatment strategies. Second, automated classification can improve diagnostic efficiency, reduce human error, and save valuable time and resources. In addition, accurate classification results can help predict patient prognosis and survival, providing patients with a more reliable prognostic assessment.

In the pathological diagnosis of breast cancer, histopathological images [9] are an important aid in determining the type and grade of breast cancer through observation and analysis of the morphological features of tumor tissue [10,11,12]. However, traditional manual pathology assessment is subjective, time-consuming, and dependent on individual experience [13]. Therefore, an automatic classification method for breast tumors based on multimodal images is essential to reduce operator dependence and improve classification accuracy. However, multimodal classification of breast tumors still faces two major challenges, and most existing algorithms cannot fully exploit and fuse multimodal information. Specifically: (1) Model complexity and interpretability: classification using multimodal data often requires more complex models and algorithms, which increases the difficulty of model training and tuning and reduces the interpretability and understandability of the results. (2) Existing classification methods consider only intra-modal or inter-modal interactions and cannot fully utilize multimodal information. Intra-modal fusion explores each modality independently and exploits relationships within a specific modality while ignoring the correlations between modalities; when only inter-modal fusion is considered, the dependencies of local information cannot be captured and the multimodal information is again not fully utilized.

To address these two problems in multimodal image classification, we propose 3DSN-net, a novel interaction network built on a dual-tandem attention mechanism.

The main contributions of this work are summarized as follows.

(1) We propose a novel multi-variable interaction mechanism based on a dual-tandem attention module (DTAM), comprising multi-variable relational interaction and multilevel attention-fusion interaction, which integrates a spatial attention unit and a channel attention unit to fully exploit multi-variable information.

(2) We develop a parameter-free attention crosstalk mechanism that combines 3DsimAM with NEWCA (3DSN-net) to focus attention on regions or features of interest, which can be regarded as a useful form of prior feature learning. DCNv2 is used instead of ordinary convolution to integrate the two modalities, so that each modality provides complementary information for the other, and DCNv2 is extended to enhance the modeling capability.

(3) In multilevel attention fusion, special attention is paid to enhancing the discriminative ability of features. The spatial attention unit assigns high weights that act as flags to distinguish regional features. We construct an attention module named C2f_Faster_EMA (CFE) to obtain the features of the target image. Finally, the outputs of the two attention modules are fused to further improve the discriminative ability of the features. By introducing this tandem interaction mechanism, the useful modal information is fully exploited to improve classification performance.

The rest of the paper is organized as follows. In Sect. “Related Work”, we review prior work relevant to our study. Sect. “Methodology” details our research methodology. Sect. “Data and Experiments” gives the experiments used to evaluate our proposed algorithm. A comprehensive discussion of the experimental results is presented in Sect. “Discussion”. Finally, we summarize our conclusions in Sect. “Conclusion”.

Related work

Researchers have begun to explore the use of computer-aided diagnosis (CAD) techniques [14] for automated classification of breast cancer histopathology images. This approach enables automated classification and diagnosis by training on and analyzing large numbers of histopathological images with machine learning and deep learning algorithms [15,16,17,18]. Flores et al. [19] tested different morphological and textural feature sets, selected with mutual-information and statistical feature selection methods, to evaluate benign versus malignant tumor classification performance in breast ultrasound images. Shan et al. [20] compared several machine learning methods, including Support Vector Machine (SVM), Random Forest (RF), and Convolutional Neural Network (CNN), on a breast ultrasound dataset; their experimental results showed that the Random Forest algorithm achieved the highest classification accuracy of 78.5%. Jebarani et al. [21] fused K-means and the Gaussian Mixture Model (GMM) to build a new hybrid method for improving classification accuracy. Shia et al. [22] used a histogram-of-oriented-gradients pyramid descriptor to characterize 2D features in breast ultrasound images and applied a correlation-based feature selection method to select and evaluate the main feature set for classification; this method achieved 81.64% sensitivity and 87.76% specificity in classifying benign and malignant tumor images. Although these traditional machine learning methods improve classification accuracy, they require excessive human intervention and cost, which is not conducive to practical applications. Therefore, some scholars have turned to deep learning methods for breast tumor image classification, and many research teams have made significant progress. They have employed various algorithms based on feature extraction and classifier design, and in recent years many model architectures have been successfully applied to target detection and classification [23], such as improved R-CNN, Faster R-CNN, and YOLO (Shi et al. [24], 2022; Chan et al. [25], 2023; Cai et al. [26], 2024; Umirzakova et al. [27], 2025). Byra et al. [28] introduced a deep learning framework that classifies regions of interest in ultrasound images by incorporating a deep representation scaling (DRS) layer into a convolutional neural network via transfer learning; breast masses in the input images were analyzed and information flow was improved by updating only the DRS layer parameters, and the results showed that the DRS layer improves classification. Cui et al. [29] fused a network with two modules to enhance joint tumor features and then used a three-branch module to extract and fuse features within the tumor, around the tumor, and in the combined tumor region; this approach achieved accuracies of 96.3% and 94.8% on two publicly available breast image datasets (BUS and UDIAT). Sahu et al. [30] proposed five classes of hybrid convolutional neural network depth models for breast cancer detection; experiments on both small and large datasets showed good classification results, and the proposed hybrid classifiers outperformed the individual classifiers.

Real-time target classification continues to progress with the release of YOLO. YOLOv8 takes target detection and classification to new heights, with faster inference and higher accuracy than its predecessor (YOLOv5). Features are combined and blended at the neck and then passed to the head of the network, where YOLO predicts the location and class of objects around which bounding boxes should be drawn [31,32,33,34]. In current research, YOLO-based deep learning (DL) models have been applied to classify pathological images of breast cancer, lung cancer, bone tumors, and brain tumors, but they have not given good results on breast cancer pathology images [35]. Nada M. Hassan et al. [36] used YOLOv4 in combination with the Vision Transformer (ViT) to detect and classify breast masses; by introducing ViT to improve the model's ability to learn global and semantic features, computational efficiency and performance were greatly improved compared with traditional CNN models. Chean Khim Toa et al. [37] combined residual learning and an attention mechanism to automatically classify breast cancer and distinguish important feature regions in whole-slide images (WSI), especially for more complex breast cancer lesions. Ghada H. A. et al. [38] proposed an end-to-end YOLO-based computer-aided diagnosis system, using ResNet and Inception to extract tumor features and comparing their classification performance. These methods show excellent performance on different datasets and achieve high accuracy and sensitivity. However, despite YOLO's remarkable success, its deployment has not been without challenges, including sensitivity to target scale and considerable computational resource requirements. In addition, ethical considerations such as data privacy and algorithmic bias need to be taken into account when developing YOLO-based systems, especially in healthcare. To realize the full potential of YOLO, these issues need to be addressed through ongoing and future research.

Methodology

Figure 1 illustrates the overall structure of 3DSN-net for benign and malignant classification of breast tumors. The overall construction of the network is discussed first, followed by the four modular parts of 3DSN-net; finally, the classification principle of the 3DsimAM and NEWCA dual tandem mechanism is described. The study protocol was approved by the Institutional Review Board of Ningbo Li Huili Hospital and was conducted in accordance with the Declaration of Helsinki. A total of 5,965 microscopic images were used to evaluate the proposed 3DSN-net, combining a representative subset of the BreakHis dataset [39] (4,965 images, with a benign-to-malignant ratio of 1:2) with breast tumor tissue images collected from 98 patients by clinicians at Ningbo Li Huili Hospital. The institutional review board waived the requirement for informed consent because of the retrospective nature of the study and the use of completely anonymized data.

Fig. 1

Overall workflow diagram of 3DSN-net for breast cancer image classification task

Overview

3DSN-net consists of four main components: the feature extraction module, the Channel Feature Extraction (CFE) module, 3DsimAM, and NEWCA. An input histopathology image is first processed by a two-layer DCNv2 backbone, which captures spatially deformable patterns and facilitates information interaction across feature maps, generating multi-scale feature maps. These feature maps are then fed into the CFE module, where channel-wise features are evaluated and fused to achieve the first inter-modal 3D interaction fusion while efficiently reducing redundant computations and memory accesses for spatial feature extraction. The fused features are subsequently passed through the Dual Tandem Attention Module (DTAM), where 3DsimAM focuses on salient spatial regions and NEWCA performs channel recalibration to enhance inter-modal feature representation. Finally, the refined features are aggregated and sent to the classifier to produce the predicted breast tumor subtype. This sequential processing ensures that each module contributes complementary contextual information and high-resolution features, enabling the network to capture distinctive patterns for accurate histopathological classification.
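To make this data flow concrete, the following PyTorch sketch mirrors the four-stage pipeline with simple stand-ins (ordinary convolutions and identity modules) for the components detailed in the next subsections; the module names, channel width, and classifier head are our illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

# Minimal structural sketch of the 3DSN-net pipeline; the four stages are stand-ins
# for the modules described in the following subsections.
class DSNNetSketch(nn.Module):
    def __init__(self, num_classes: int = 8, channels: int = 256):
        super().__init__()
        # Stage 1: feature extraction backbone (two DCNv2 layers in the paper,
        # approximated here by ordinary strided convolutions).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.BatchNorm2d(channels), nn.SiLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.BatchNorm2d(channels), nn.SiLU(),
        )
        # Stage 2: CFE module (channel-wise evaluation and fusion), stand-in 1x1 conv.
        self.cfe = nn.Conv2d(channels, channels, 1)
        # Stage 3: DTAM = spatial attention (3DsimAM) followed by channel attention (NEWCA);
        # placeholders here, sketched later in this section.
        self.spatial_attn = nn.Identity()
        self.channel_attn = nn.Identity()
        # Stage 4: classifier head producing the tumor-subtype logits.
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(channels, num_classes))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(x)                     # deformation-aware features
        fused = self.cfe(feats)                      # first inter-modal fusion
        attended = self.channel_attn(self.spatial_attn(fused))  # dual tandem attention
        return self.head(attended)                   # predicted subtype logits

if __name__ == "__main__":
    logits = DSNNetSketch()(torch.randn(1, 3, 460, 700))
    print(logits.shape)  # torch.Size([1, 8])
```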

Feature extraction module

3DSN-net uses two DCNv2 [40] modulated deformable convolutions in place of ordinary convolutions to train the layers of the CSPDarkNet-53 modal-feature network. To reduce the number of trainable parameters and prevent overfitting in the image classification task, the convolutional layers apply filters to the input image and detect similar features at different locations. The parameters of these filters are shared, i.e., the same weights are used at different locations, which greatly reduces the number of parameters and improves the computational efficiency and generalization of the model. The weighted sum over the sampling points is:

$$y(x)=\sum\limits_{{k=1}}^{n} {{w_k} \cdot a(X+{X_k}+\Delta {X_k}) \cdot \Delta {m_k}}$$
(1)

where \(w_k\) and \(X_k\) denote the weight and the pre-specified offset at the k-th sampling location, \(\Delta {X_k}\) is the learned offset and \(\Delta {m_k}\) is the learned modulation scalar, and \(a(X)\) and \(y(X)\) denote the features at position X of the input feature map and the output feature map, respectively.
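A minimal PyTorch sketch of a modulated deformable convolution in the spirit of Eq. (1) is given below, built on torchvision.ops.deform_conv2d; the zero-initialized offset/mask branch and the channel layout follow common DCNv2 practice and are assumptions, not the authors' exact code.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class ModulatedDeformConv2d(nn.Module):
    """Sketch of a DCNv2-style modulated deformable convolution (Eq. 1):
    offsets and modulation scalars are predicted from the input itself."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 3, stride: int = 1, padding: int = 1):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_ch, in_ch, k, k))
        nn.init.kaiming_uniform_(self.weight, a=5 ** 0.5)
        # 2*k*k offset channels plus k*k modulation channels, zero-initialized so
        # training starts from a regular convolution.
        self.offset_mask = nn.Conv2d(in_ch, 3 * k * k, k, stride, padding)
        nn.init.zeros_(self.offset_mask.weight)
        nn.init.zeros_(self.offset_mask.bias)
        self.k, self.stride, self.padding = k, stride, padding

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        om = self.offset_mask(x)
        offset = om[:, :2 * self.k ** 2]                     # learned offsets Δx_k
        mask = torch.sigmoid(om[:, 2 * self.k ** 2:])        # learned modulation Δm_k in (0, 1)
        return deform_conv2d(x, offset, self.weight,
                             stride=self.stride, padding=self.padding, mask=mask)

# usage: y = ModulatedDeformConv2d(64, 64)(torch.randn(1, 64, 56, 56))
```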

For deformable RoIPooling, the branch features \(y(k)\) are computed as:

$$y(k)=\sum\limits_{{j=1}}^{{{n_k}}} {a({X_{kj}}+\Delta {X_k}) \cdot \Delta {m_k}/{n_k}} $$
(2)

where the values of \(\Delta {X_k}\) and \(\Delta {m_k}\) are generated by a branch on the input feature map, \(X_{kj}\) is the sampling location of the j-th grid cell in the k-th bin, and \(n_k\) denotes the number of sampled grid cells. In this branch, RoIPooling first generates features on the RoI, which are followed by two 1024-D \({F_c}\) layers. An additional \({F_c}\) layer generates 3K channels of output (its weights are initialized to zero): the first 2K channels are the learnable offsets \(\Delta {X_k}\), and the remaining K channels are passed through a sigmoid layer to produce the modulation scalars \(\Delta {m_k}\).

Fig. 2

Block diagram of feature extraction module network

For the encoder-decoder module, this paper adopts an improved multi-resolution feature fusion based on the ResNet50 network to eliminate redundant information; the overall framework is shown in Fig. 2. The three input feature maps are features of three different sizes extracted by the ResNet network, and each is resized by a two-dimensional convolution module matched to the corresponding feature-map size. A parallel feature fusion strategy is used, and the output is:

$${z_{Add}}=\sum\limits_{{i=1}}^{n} {({X_i}+{y_i}) \cdot {k_i}=} \sum\limits_{{i=1}}^{n} {{X_i} \cdot {k_i}+\sum\limits_{{i=1}}^{n} {{y_i} \cdot {k_i}} } $$
(3)

where \(X_i\) and \(y_i\) denote the input and output features of the i-th branch, and \(k_i\) is the i-th convolution kernel.

For the weighted network, the output feature matrices of the Conv_1, Conv3_x, and Conv5_x layers of ResNet50 are selected as the inputs of 3DSN-net, corresponding to feature input 1, feature input 2, and feature input 3, respectively. The sizes of the feature matrices at the three levels are then adjusted on the fly: convolutional layers, pooling layers, batch normalization layers, and activation functions are used before the concat operation to keep the number of channels of the different feature maps consistent. At its core, DCNv2 applies deformable convolution to the previously obtained features to complete the information interaction and obtain the three weight matrices.
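The following sketch, under assumed ResNet50 channel widths (64, 512, 2048) for the three selected layers, illustrates the parallel fusion idea of Eq. (3): each level is projected to a common channel width, resized to one resolution, and summed with learnable per-branch weights that play the role of the weight matrices discussed above; the softmax normalization is an illustrative simplification rather than the DCNv2-based weighting itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedMultiScaleFusion(nn.Module):
    """Sketch of the parallel fusion of Eq. (3) over three ResNet50 feature maps."""
    def __init__(self, in_chs=(64, 512, 2048), out_ch: int = 256):
        super().__init__()
        # 1x1 projections keep the channel counts consistent before fusion.
        self.proj = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c, out_ch, 1), nn.BatchNorm2d(out_ch), nn.ReLU())
            for c in in_chs
        ])
        self.branch_weights = nn.Parameter(torch.ones(len(in_chs)))  # analogues of alpha/beta/gamma

    def forward(self, feats):
        target = feats[1].shape[-2:]                     # fuse at the middle resolution
        outs = [F.interpolate(p(f), size=target, mode="bilinear", align_corners=False)
                for p, f in zip(self.proj, feats)]
        w = torch.softmax(self.branch_weights, dim=0)    # normalized per-branch weights
        return sum(wi * o for wi, o in zip(w, outs))     # z_Add = sum_i w_i * x_i

# usage with dummy feature maps of the three assumed shapes
fusion = WeightedMultiScaleFusion()
f1, f2, f3 = torch.randn(1, 64, 112, 112), torch.randn(1, 512, 28, 28), torch.randn(1, 2048, 7, 7)
print(fusion([f1, f2, f3]).shape)  # torch.Size([1, 256, 28, 28])
```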

Without adding DCNv2, the gradient at a pixel (i, j) of the feature map is calculated as:

$$\frac{{\partial L}}{{\partial {X_{ij}}}}=\frac{{\partial L}}{{\partial y_{{ij}}^{1}}}+\frac{{\partial L}}{{\partial y_{{ij}}^{2}}}+\frac{{\partial L}}{{\partial y_{{ij}}^{3}}}$$
(4)

where L is the loss function, \(\frac{\partial L}{\partial X_{ij}}\) is the gradient at pixel (i, j) of the feature map, and \(\frac{\partial L}{\partial y_{ij}^{1}}\), \(\frac{\partial L}{\partial y_{ij}^{2}}\), \(\frac{\partial L}{\partial y_{ij}^{3}}\) are the gradients of the three levels of features, respectively. This initial feature model ignores the differing importance of features at different levels, so the weights of the different features are unevenly distributed. After introducing DCNv2, the gradient is calculated as:

$$\frac{{\partial L}}{{\partial {X_{ij}}}}={\alpha _{ij}}\frac{{\partial L}}{{\partial y_{{ij}}^{1}}}+{\beta _{ij}}\frac{{\partial L}}{{\partial y_{{ij}}^{2}}}+{\gamma _{ij}}\frac{{\partial L}}{{\partial y_{{ij}}^{3}}}$$
(5)

where \({\alpha _{ij}}\), \({\beta _{ij}}\), and \({\gamma _{ij}}\) are the feature weight parameters of the different layers, which ensures that the weight parameters are assigned automatically.

Lightweight feature classification module

To design a fast and lightweight neural network, many works have focused on reducing FLOPs [41,42,43]; however, preliminary experiments verify that a reduction in FLOPs does not necessarily translate into a comparable reduction in latency. In this paper, we propose a CFE module for computer-vision classification, which mainly addresses classification tasks with large amounts of data and can provide richer gradient-flow information while remaining lightweight. The general framework of the CFE module is shown in Fig. 3.

Fig. 3

Schematic diagram of CFE module composition

Since FasterNet is a new family of neural networks that achieves faster running speed than other networks on various devices without compromising classification accuracy, we replace all Bottleneck blocks in the C2f module used for feature fusion and classification with FasterNet blocks. After the residual branch in C2f extracts the tumor modality features obtained in the previous section, the split features can be convolved, upsampled, and concatenated with feature maps from different layers, allowing the model to obtain more contextual and high-resolution information at the same time, although this reduces computation speed. PConv in FasterNet, on the other hand, reduces redundant computation (as shown in Eq. 6) and operator memory accesses, extracting spatial features more efficiently.

$${\text{FLOPs}}_{{\text{PConv}}}=h \times \omega \times {k^2} \times c_{p}^{2}$$
(6)

where h and ω denote the spatial dimensions of the output feature map, k is the kernel size, c is the number of input channels, and \(c_p\) is the reduced number of channels actually convolved in PConv. The partial ratio \(r={c_p}/c\) quantifies the computational reduction and highlights the efficiency gain of PConv over standard convolution; at the typical partial rate \(r=1/4\), the FLOPs are only 1/16 of those of a regular Conv. In addition, PConv has a smaller memory access cost:

$$h \times \omega \times 2{c_p}+{k^2} \times c_{p}^{2} \approx h \times \omega \times 2{c_p}$$
(7)

which is only about 1/4 of the memory access of a regular convolution when r equals 1/4.
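A minimal sketch of a FasterNet-style partial convolution (PConv) is shown below, assuming the usual c_p = c/4 split; only the first c_p channels are convolved while the rest pass through untouched, which is the source of the FLOPs and memory-access savings in Eqs. (6) and (7).

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    """FasterNet-style partial convolution: convolve only c_p = channels / n_div channels."""
    def __init__(self, channels: int, n_div: int = 4, k: int = 3):
        super().__init__()
        self.cp = channels // n_div                          # channels actually convolved
        self.conv = nn.Conv2d(self.cp, self.cp, k, padding=k // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.split(x, [self.cp, x.size(1) - self.cp], dim=1)
        return torch.cat((self.conv(x1), x2), dim=1)         # untouched channels are reused as-is

# usage: PConv(64)(torch.randn(1, 64, 56, 56)).shape -> torch.Size([1, 64, 56, 56])
```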

Finally, we add the EMA attention mechanism to the model. The EMA attention mechanism [44] computes attention weights by exponential moving average, adaptively learning the importance of different feature maps and adjusting the weighting parameter used during feature fusion, as shown in Eq. 8.

$${\nu _t}=\alpha {\upsilon _{t - 1}}+(1 - \alpha){\theta _t}$$
(8)

where \({\nu _t}\) is the EMA result at time t, \(\alpha\) is the weight parameter, which we set to 0.85, and \({\theta _t}\) is the sample value at time t.
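As a simple illustration of Eq. (8), the snippet below smooths a sequence of per-feature-map importance scores with α = 0.85; the initial value and the example scores are arbitrary choices for demonstration.

```python
def ema_update(prev: float, sample: float, alpha: float = 0.85) -> float:
    """Exponential moving average of Eq. (8): v_t = alpha * v_{t-1} + (1 - alpha) * theta_t."""
    return alpha * prev + (1.0 - alpha) * sample

# smoothing a toy sequence of per-map importance scores, starting from v = 0
scores, v = [0.2, 0.9, 0.4, 0.7], 0.0
for s in scores:
    v = ema_update(v, s)
print(round(v, 3))  # approximately 0.272
```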

Double tandem attention module

The DTAM contains the improved channel attention module NEWCA and the spatial attention module 3DsimAM; the overall structure is shown in Fig. 4. An attention mechanism can focus on important information and ignore secondary information, thereby extracting features effectively. Existing studies such as BAM and CBAM combine spatial attention and channel attention in parallel or in series, respectively. However, the two kinds of attention tend to work together in the human brain, so we propose a serial attention module with unified weights.

Fig. 4

Block diagram of tandem attention module DTAM

To realize attention effectively, the importance of individual neurons needs to be assessed. In neuroscience, neurons carrying rich information usually exhibit firing patterns different from those of other neurons. Moreover, an activated neuron suppresses its surrounding neurons, a phenomenon known as spatial suppression; in other words, neurons exhibiting spatial suppression should be assigned higher importance. The simplest way to find important neurons is to measure the linear separability between neurons. Therefore, an energy function is defined:

$${e_t}({\omega _t},{b_t},y,{X_i})={({y_t} - \widehat {t})^2}+\frac{1}{{M - 1}}\sum\limits_{{i=1}}^{{M - 1}} {{{({y_0} - {{\widehat {X}}_i})}^2}} $$
(9)

where t and \(X_i\) denote the target neuron and the other neurons on a single channel of the input \(X\in\mathbb{R}^{C \times H \times W \times Z}\), \(\widehat{t}\) and \(\widehat{X}_i\) are the linear transformations of t and \(X_i\), respectively, \(\omega_t\) and \(b_t\) represent the weight and bias of the linear transformation, i is the index along the spatial dimension, and M = H×W denotes the total number of neurons on that channel. Minimizing the above equation is equivalent to training the linear separability between neuron t and the other neurons in the same channel. Using binary labels [45,46,47] and adding a regularization term, the final energy function is defined as follows:

$${e_t}({\omega _t},{b_t},y,{X_i})=\frac{1}{{M - 1}}\sum\limits_{{i=1}}^{{M - 1}} {{{( - 1 - ({\omega _t}{X_i}+{b_t}))}^2}} +{(1 - ({\omega _t}t+{b_t}))^2}+\lambda \omega _{t}^{2}$$
(10)

here, \(\omega_t X_i + b_t\) and \(\omega_t t + b_t\) are the linear transformations giving \(\widehat{X}_i\) and \(\widehat{t}\), respectively, and \(\lambda\) is the regularization weight coefficient. The analytic solution of the above equation is:

$${\omega _t}= - \frac{{2(t - {\mu _t})}}{{{{(t - {\mu _t})}^2}+2\sigma _{t}^{2}+2\lambda }}$$
(11)
$${b_t}= - \frac{1}{2}(t+{\mu _t}){\omega _t}$$
(12)

here, \(\:{\mu\:}_{t}\) and \(\:{\sigma\:}_{t}^{2}\) are the mean and variance calculated over all neurons in the channel except for t. Since all neurons on each channel follow the same distribution, the mean and variance can be computed first for the input features in both H and W dimensions to avoid repeated computations:

$$e_{t}^{*}=\frac{{4({{\widehat {\sigma}}^2}+\lambda)}}{{{{(t - \widehat {\mu})}^2}+2{{\widehat {\sigma}}^2}+2\lambda}}$$
(13)

Equation (13) indicates that the lower the energy \(\:{e}_{t}^{*}\), the more distinctive neuron t is from the surrounding neurons and the more important it is for visual processing. Therefore, the importance of each neuron can be obtained as \(\:{1/e}_{t}^{*}\). The whole process can be represented as:

$$\widetilde {X}=sigmoid(\frac{1}{E}) \odot X$$
(14)

here, E aggregates all \(\:{e}_{t}^{*}\) across the channel and spatial dimensions. A sigmoid function is applied to constrain excessively large values in E; since the sigmoid is monotonic, it does not affect the relative importance of the neurons.

Channel attention mechanisms are effective in improving model performance, but they usually ignore positional information, which is important for generating spatially selective attention maps. Therefore, in this paper we propose a novel attention mechanism for mobile networks, called “NEWCA”, which embeds positional information into channel attention (the module structure is shown in Fig. 4). The multipath coordinate attention proposed in this paper is simple and can be inserted flexibly into the network; it has one more branch than the previously proposed CA attention and subsequently fuses the X and Y weights. To encourage the attention blocks to capture spatially remote interactions with precise positional information, the global pooling is decomposed into the one-dimensional feature encoding operations of Eqs. (15) and (16). Specifically, given the input X, we encode each channel along the horizontal and vertical coordinates using pooling kernels of spatial extent (H, 1) and (1, W), respectively. The output of the c-th channel at height h can be expressed as:

$$Z_{c}^{h}(h)=\frac{1}{\omega }\sum\limits_{{0 \leqslant i \leqslant \omega }} {{X_c}(h,i)} $$
(15)

Similarly, the output of the c-th channel at width \(\omega\) is expressed as:

$$Z_{c}^{\omega}(\omega)=\frac{1}{H}\sum\limits_{{0 \leqslant j \leqslant H}} {{X_c}(j,\omega)} $$
(16)

The above two transformations aggregate features along the two spatial directions, producing a pair of direction-aware feature maps. Given the aggregated feature maps produced by Eq. (15) and Eq. (16), we concatenate them and send them to a shared convolutional transform \({F_1}\):

$${\text{f}}=\delta ({F_1}([{Z^h},{Z^\omega }]))$$
(17)

where \([\cdot,\cdot]\) denotes concatenation along the spatial dimension, \(\delta\) is a nonlinear activation function, and \({\text{f}} \in {\mathbb{R}^{C/r \times (H+W)}}\) is the intermediate feature map that encodes spatial information in both the horizontal and vertical directions; r is the reduction ratio used to control the module size. We then split f along the spatial dimension into two independent tensors \({{\text{f}}^h} \in {\mathbb{R}^{C/r \times H}}\) and \({{\text{f}}^\omega } \in {\mathbb{R}^{C/r \times W}}\), and use two convolutional transforms \({F_h}\) and \({F_\omega }\) to convert \({{\text{f}}^h}\) and \({{\text{f}}^\omega }\) into tensors with the same number of channels as the input X, obtaining:

$$\begin{gathered} {{\text{g}}^h}=\sigma ({F_h}({{\text{f}}^h})) \hfill \\ {{\text{g}}^\omega }=\sigma ({F_\omega }({{\text{f}}^\omega })) \hfill \\ \end{gathered} $$
(18)

Subsequently, the two branches \({{\text{g}}^h}\) and \({{\text{g}}^\omega }\) produced by the sigmoid function are used as attention weights along the two spatial directions; their element-wise mean is also computed and multiplied with the input tensor, so that the X and Y weight parameters act in a coordinated manner. This locates the region of interest more accurately and helps the overall model complete the classification task better.

The NEWCA approach captures positional information and channel relationships in a more efficient way to enhance the feature representation of the network. By decomposing the two-dimensional global pooling operation into two one-dimensional encoding processes, the proposed method outperforms other lightweight attention methods, as we demonstrate exhaustively in the experimental section.
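To summarize the DTAM concretely, the sketch below chains a parameter-free 3DsimAM-style spatial weighting (Eqs. 9–14) with a coordinate-attention-style channel unit in the spirit of NEWCA (Eqs. 15–18); the reduction ratio, the fusion of the X and Y gates by averaging, and the 2D (rather than volumetric) formulation are our assumptions for illustration, not the authors' exact design.

```python
import torch
import torch.nn as nn

def simam(x: torch.Tensor, lam: float = 1e-4) -> torch.Tensor:
    """Parameter-free spatial attention following Eqs. (13)-(14): each neuron is
    weighted by sigmoid(1/e_t*), computed from its deviation from the channel mean."""
    n = x.shape[2] * x.shape[3] - 1
    d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
    var = d.sum(dim=(2, 3), keepdim=True) / n
    inv_energy = d / (4 * (var + lam)) + 0.5          # proportional to 1/e_t*
    return x * torch.sigmoid(inv_energy)

class NEWCASketch(nn.Module):
    """Coordinate-attention-style channel unit in the spirit of NEWCA:
    1D pooling along H and W, a shared 1x1 transform, then per-direction gates."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        mid = max(8, channels // reduction)
        self.shared = nn.Sequential(nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU())
        self.fh = nn.Conv2d(mid, channels, 1)
        self.fw = nn.Conv2d(mid, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        zh = x.mean(dim=3, keepdim=True)                      # (b, c, h, 1)   Eq. (15)
        zw = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # (b, c, w, 1)   Eq. (16)
        y = self.shared(torch.cat([zh, zw], dim=2))           # shared transform, Eq. (17)
        yh, yw = torch.split(y, [h, w], dim=2)
        gh = torch.sigmoid(self.fh(yh))                       # directional gates, Eq. (18)
        gw = torch.sigmoid(self.fw(yw)).permute(0, 1, 3, 2)
        gate = 0.5 * (gh.expand_as(x) + gw.expand_as(x))      # fused X/Y weights (assumed mean fusion)
        return x * gate

class DTAMSketch(nn.Module):
    """Dual tandem attention: spatial unit (3DsimAM) followed by channel unit (NEWCA)."""
    def __init__(self, channels: int):
        super().__init__()
        self.newca = NEWCASketch(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.newca(simam(x))

# usage: DTAMSketch(256)(torch.randn(1, 256, 28, 28)).shape -> torch.Size([1, 256, 28, 28])
```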

Data and experiments

Dataset and experimental setup

Dataset

The breast tumor classification dataset used in this study combines the Breast Cancer Histopathological Database (BreakHis) [39], established in collaboration with the P&D Laboratory of Pathological Anatomy and Cytopathology, Paraná, Brazil, with the Breast Cancer Pathology Section Database (BCPSD) collected by physicians of the Joint Laboratory of Breast Oncology and Surgery, Li Huili Hospital, Ningbo Medical Center, Ningbo, China. The combined dataset, BBD, was collected from 180 patients (aged 24 to 70 years) and comprises 5,965 pathology slides, of which 1,655 benign tumor cases and 3,310 malignant tumor cases were screened; the slides were produced between January 1, 2020 and January 1, 2024.

Figure 5 lists sample images from the BBD database for each of the eight types of breast tumors: adenosis (A), fibroadenoma (F), phyllodes tumor (PT), and tubular adenoma (TA) among benign lesions, and ductal carcinoma (DC), lobular carcinoma (LC), mucinous carcinoma (MC), and papillary carcinoma (PC) among malignant lesions. For the collected images, exclusion criteria included cases with unknown or incompletely documented tumor types, incomplete image reports, unclear classification, and cases with substantial interfering information. This reduces the labor required for image analysis and provides detailed information for 3DSN-net. Table 1 provides a summary of the BBD dataset, including the class distribution, image magnifications, and acquisition sources. Five-fold cross-validation was performed for all experiments, with 4,772 images in the training set and 1,193 images in the validation set per split. We performed data augmentation on the dataset: the target images were randomly cropped and randomly flipped horizontally with 75% probability and vertically with 15% probability. A two-sample equal-variance t-test with a two-sided distribution was used to calculate 96% confidence intervals for each parameter, and the threshold was set at p < 0.05 to determine statistical significance.

Fig. 5

Example image of BBD database

Table 1 Overview of the BBD dataset

Experimental setup

All tumor images were first processed through multi-channel weighted color conversion and feature calculation during training. To reduce color variability across samples, stain normalization was applied to standardize the hematoxylin and eosin (H&E) staining. All images were resized and cropped to an input resolution of 700 × 460 pixels, which is widely adopted in deep learning-based pathology studies. During training, data augmentation was applied, including random cropping, random horizontal flipping with a probability of 0.75, random vertical flipping with a probability of 0.15, and mild color jittering, to improve the robustness of the model and mitigate overfitting. The breast tumor classification experiments were conducted on a 64-bit Windows 10 operating system with an NVIDIA GeForce RTX 3060 GPU for deep learning training. The 3DSN-net algorithm was implemented using Python 3.8, PyTorch 1.9.1, and CUDA 11.1.

Our proposed classification model was trained using the mean squared error (MSE) loss function to quantify the difference between predicted and true labels, and optimization was performed using the Adam optimizer. The initial learning rate was set to 6 × 10⁻⁵ with a cosine annealing schedule for gradual decay, and the weight decay was fixed at 1 × 10⁻⁴ to prevent overfitting. A batch size of 32 was used in all experiments. Training was conducted for a maximum of 60 epochs for each training/validation split, with an early stopping criterion applied if the validation loss did not improve for 10 consecutive epochs. Data batches were loaded sequentially during training, and a checkpointing mechanism was employed to preserve the best-performing model with minimal validation loss. Importantly, the specific hyperparameters, including learning rate, batch size, and optimizer settings, were determined through empirical testing, in which multiple parameter combinations were compared to identify the configuration that yielded the best validation performance.
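A configuration sketch consistent with the reported settings is given below; the model is a trivial stand-in for 3DSN-net, the resize/crop sizes and color-jitter strengths are assumptions, and the MSE-based loss simply mirrors the loss stated above.

```python
import torch
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR
from torchvision import transforms

# Augmentation pipeline matching the probabilities stated in the text.
train_tf = transforms.Compose([
    transforms.Resize((480, 720)),                  # slightly larger than the target size (assumption)
    transforms.RandomCrop((460, 700)),              # final input resolution (H, W)
    transforms.RandomHorizontalFlip(p=0.75),
    transforms.RandomVerticalFlip(p=0.15),
    transforms.ColorJitter(brightness=0.1, contrast=0.1, saturation=0.1),
    transforms.ToTensor(),
])

model = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, 8))  # stand-in for 3DSN-net
criterion = nn.MSELoss()                            # MSE between predictions and one-hot labels, as stated
optimizer = Adam(model.parameters(), lr=6e-5, weight_decay=1e-4)
scheduler = CosineAnnealingLR(optimizer, T_max=60)  # cosine annealing over 60 epochs

best_val, patience, stale = float("inf"), 10, 0
for epoch in range(60):
    # ... iterate over training batches of size 32, apply train_tf, compute loss, backprop ...
    scheduler.step()
    val_loss = float("nan")                         # ... compute validation loss here ...
    if val_loss < best_val:                         # checkpoint the best-performing model
        best_val, stale = val_loss, 0
        torch.save(model.state_dict(), "best_3dsn_net.pt")
    else:
        stale += 1
        if stale >= patience:                       # early stopping after 10 stale epochs
            break
```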

Evaluation protocols

In this section, the distribution of benign and malignant masses is imbalanced, which may lead to reduced precision and poor classification accuracy for the minority classes. To evaluate classification performance on the dataset comprehensively, we assess the experiments using accuracy (ACC), specificity (SPE), precision (PRE), area under the receiver operating characteristic curve (AUC), sensitivity (SEN), TOP1_ACC, and TOP5_ACC; the definitions of these evaluation metrics are given in Table 2.

Table 2 Evaluation indicators

Subsequently, the receiver operating characteristic (ROC) curve can be used on its own to evaluate the classification performance of the network, since it summarizes performance across all decision thresholds rather than at a single operating point. The curve is constructed from the test performance at different diagnostic thresholds: the higher the sensitivity, the lower the missed-diagnosis rate, and the higher the specificity, the lower the false-positive rate. An ideal test has 100% sensitivity with 0% false positives at all thresholds, corresponding to the upper-left corner of the ROC plot. We also computed the area under the ROC curve (AUC); an AUC greater than 0.5 indicates discriminative ability better than chance, and the larger the AUC, the better the diagnostic performance.
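For concreteness, the following snippet (using scikit-learn, an assumption rather than the authors' tooling) computes ACC, SEN, SPE, PRE, and AUC for the binary benign/malignant case; for the eight-class task the same quantities are obtained per class in a one-vs-rest fashion.

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

def binary_metrics(y_true, y_pred, y_score):
    """ACC, SEN, SPE, PRE and AUC for the benign (0) / malignant (1) setting."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "SEN": tp / (tp + fn),        # sensitivity (recall)
        "SPE": tn / (tn + fp),        # specificity
        "PRE": tp / (tp + fp),        # precision
        "AUC": roc_auc_score(y_true, y_score),
    }

# tiny usage example with made-up labels, predictions, and malignancy scores
print(binary_metrics([0, 0, 1, 1], [0, 1, 1, 1], [0.2, 0.6, 0.7, 0.9]))
```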

Ablation experiments

In this section, we experimentally verify the validity of each module of the interaction unit. NEWCA, CFE, DCNv2, and 3DsimAM were removed separately to validate the function of each module. The results of the experimental comparisons are shown in Tables 3 and 4, which summarize both the classification performance and the computational metrics (parameters, FLOPs, and inference time) for each ablation variant. Comparing each individual interaction unit with the full 3DSN-net shows significant improvements in all aspects when the complete network is used. These performance gaps show that the proposed algorithm provides accurate classification results through the interaction of its modules, and 3DSN-net's handling of classification errors further improves performance.

Table 3 Data from ablation experiments
Table 4 Computational and inference metrics of 3DSN-net

Comparison with SOTA methods

To validate the performance of our proposed 3DSN-net in the classification of the eight types of breast tumors, we compared it with several mainstream networks: NMTNet, AlexNet, ResNet50, VGG19, MobileNet v3, ShuffleNet v3, and GoogLeNet; the detailed comparison of the classification results is shown in Table 6 (in the supplementary material). From these data, it is evident that 3DSN-net improves by about 15% in the ACC, SPE, and SEN metrics for the classification of the F, PT, TA, DC, LC, MC, and PC tumor types. In addition, to illustrate the effect of the proposed method more intuitively, the ROC curves of the tested networks are plotted in Fig. 6.

Fig. 6

The ROC of the proposed algorithm and other network algorithms

In the figure, we use the ROC curves to compare the proposed algorithm with several state-of-the-art networks. Because the comparison is based on sensitivity and specificity across thresholds, the assessment of diagnostic performance is independent of a particular decision criterion and of disease prevalence. The area under the ROC curve of 3DSN-net is the largest, indicating the best performance, and its curve lies closest to the upper-left corner, showing that it is superior to the other methods.

Table 5 reports the precision (PRE) performance of a set of representative models on the TA tumor classification task. For each model, we provide the mean ± standard deviation (SD) across folds, the corresponding 95% confidence interval (CI), and the p-value for comparisons against baseline models. This presentation allows a clear and statistically informative comparison of model performance. Notably, our proposed 3DSN-net achieves a PRE of 0.83 ± 0.02 with a 95% CI of 0.815–0.854, showing a statistically significant improvement compared to several other models. The other models’ PRE values and CIs are also reported for completeness, and statistical tests were conducted to assess the significance of differences, ensuring fair and reliable comparisons.

Table 5 PRE of representative models on TA tumor classification with statistical comparison

Experimental results and visualization analysis

Figure 7 shows the tumor classification results obtained from the BBD dataset at 40× magnification. Additional visualized results are provided in Fig. 10 of the supplementary material. From the experimental results in the figure, the accuracy of our proposed network ranges from 92% to 100% for benign tumor classification and from 86% to 99% for malignant tumor classification, and the classification error rate for both types of tumors is less than 8%.

Fig. 7

Tumor classification results with 40x magnification

To better understand the limitations of 3DSN-net, we examined a subset of misclassified breast tumor images, as shown in Fig. 8. Most errors occur when tumor morphology is atypical or histopathological features are subtle, causing overlap between benign and malignant patterns. Variations in staining, imaging artifacts, and low contrast also contribute. These cases highlight areas for potential model improvement.

Fig. 8

Representative misclassified samples

To visualize the overall performance of our proposed 3DSN-net with the tandem attention mechanism more intuitively, the final experimental results are presented as training loss curves and confusion matrices; Fig. 9 shows the loss curves and confusion matrices before and after the algorithm improvement.

Fig. 9

Confusion matrix (a) Corresponds to the performance of the algorithm on the BBD dataset before algorithm improvement. (b) Shows the change in training loss and verification accuracy on the BACH dataset before algorithm improvement. (c) Corresponds to the performance of 3DSN-Net on the BBD dataset. (d) Shows the change in training loss and verification accuracy on the BACH dataset after 3DSN-Net improvement

The confusion matrix enables a more detailed analysis of the classification results for benign and malignant tumors, not limited to accuracy. From the confusion matrix, it can be seen that the number of correct predictions for benign and malignant tumors among the eight breast tumor types is significantly improved, and the pre-trained DTAM-based model converges quickly and does not show large amplitude fluctuations at the beginning of the training process due to the loading of the pre-trained weight parameters.

Discussion

Breast cancer has become the leading cancer threatening women's health, and traditional breast cancer screening relies on the experience and subjectivity of radiologists. In this paper, with the goal of automatic benign and malignant classification of breast tumor pathology images, and to address the problems of poor feature extraction and low recognition rates for breast tumor images, we carry out multi-channel and multi-scale classification experiments from the perspective of tandem multi-attention mechanisms and propose a diagnostic classification method based on a dual tandem attention module, i.e., NEWCA and 3DsimAM, which is validated in actual eight-class breast image classification experiments. In this study, we developed 3DSN-net to improve the early diagnosis of breast cancer and reduce the workload of doctors. Traditional classification algorithms do not take full advantage of the available information, which becomes a bottleneck for performance improvement: they rely only on correlation-based feature selection methods to select and evaluate the main feature set for classification, and cannot mine all the useful information. To address this problem, we explore multi-model fusion classification methods that fully utilize the useful information. Specifically, we developed a dual tandem network model based on two attention modules, i.e., the DTAM model. To address the problems of excessive parameters and model complexity that are common in previous classification algorithms, we developed a fast and lightweight CFE module to obtain the target image features and better exploit correlation, complementarity, and discriminative information.

To evaluate the effectiveness of 3DSN-net, a large number of experimental studies were conducted. First, unlike single-channel classification networks, our algorithm fully exploits the interaction of the multi-channel attention mechanism to improve accuracy. In addition, our algorithm was compared with various recent common network algorithms and effectively improves the accuracy and robustness of target classification. Our ablation study validates the effectiveness of each module by interconnecting DTAM with the other attention modules; the best results are obtained when all four modules are used, with the highest classification performance across the eight breast tumor types: top1_acc and top5_acc reach 86.2% and 99.9%, respectively, and the best PRE, SEN, ACC, and SPE are 90.1% (MC), 87.5% (PC), 97.8% (DC), and 99.2% (DC), respectively. 3DSN-net has thus demonstrated that breast cancer classification can be improved by the DTAM model.

Conclusion

We proposed the novel 3DSN-net for breast tumor classification to fully exploit depth feature information and improve classification performance. We introduced a dual tandem attention mechanism that leverages inter-feature correlation, complementarity, and discriminative informativeness to enhance feature interaction. A spatial attention unit (3DsimAM) and a channel attention unit (NEWCA) were integrated to focus on important regions and channel features, respectively. The CFE module was designed to efficiently fuse features while reducing redundant computation, and DCNv2 was employed for adaptive convolution to better capture deformable structures. Extensive experiments demonstrated that 3DSN-net significantly improves multiclassification of eight breast tumor subtypes, indicating its potential application value in real-world scenarios, such as pathology workflows, where it can assist in diagnostic decisions and improve both speed and accuracy. Future work will focus on multi-center dataset expansion, optimizing inference speed for real-time applications, and extending the framework to other medical image analysis tasks to enhance generalization and clinical applicability.