A novel modality contribution confidence-enhanced multimodal deep learning framework for multiomics data

Zhang, Duoyi; Bashar, Md Abul; Nayak, Richi; Cuttle, Leila

doi:10.1186/s12859-025-06219-9

A novel modality contribution confidence-enhanced multimodal deep learning framework for multiomics data

Research
Open access
Published: 31 October 2025

Volume 26, article number 271, (2025)
Cite this article

You have full access to this open access article

Download PDF

Save article

View saved research

BMC Bioinformatics Aims and scope Submit manuscript

A novel modality contribution confidence-enhanced multimodal deep learning framework for multiomics data

Download PDF

Duoyi Zhang¹,
Md Abul Bashar¹,
Richi Nayak¹ &
…
Leila Cuttle²

1412 Accesses
1 Citation
1 Altmetric
Explore all metrics

Abstract

Multimodal learning for classification tasks has recently gained significant attention in bioinformatics. Current approaches primarily concentrate on devising efficient deep learning architectures to capture features within and across modalities. However, they typically assume that each modality contributes equally to the classification objective, overlooking inherent biases within multimodal learning. This paper presents a modality contribution confidence-enhanced deep learning framework to address this issue, resulting in an improved fusion space and improved classification performance on multiomics data. Specifically, we propose utilising a non-parametric Gaussian Process to assess the unimodal confidence of each modality and learn within-modality features. Additionally, we introduce the use of the Kullback-Leibler divergence to align multiple modalities and learn cross-modality features. Extensive experiments on four multiomics datasets, incorporating modalities such as static information, DNA, mRNA, miRNA, and protein data, validate the effectiveness of the proposed method. Furthermore, a case study on the blister recovery task is included to demonstrate the practical utility of our model.

View this article's peer review reports

Introduction

Utilising multiomics data for biomedical classification has become a growing focus in recent research [1, 2]. Each modality represents a distinct source of patient information, such as demographic details (i.e. static information), genomics sequences, and proteins. Integrating multiomics data enables a more comprehensive patient description, thus enhancing the accuracy of disease classification [3]. Our study investigates multimodal classification tasks (e.g breast cancer prediction, Alzheimer prediction, and wound recovery prediction tasks) using a novel deep learning approach that explores both static and omics modalities.

Prior research has predominantly focused on designing deep learning architectures to extract unimodal features and cross-modal interactions [4]. However, these methods often assume that each modality contributes equally to the classification task, overlooking inherent biases within multimodal data, which results in underperforming training [5]. The biased modality issue arises when different modalities possess varying levels of informativeness to the downstream classification task. For example, variables like wound size and wound depth typically exhibit a direct correlation with recovery time in predicting the burn wound recovery [6]. Conversely, protein data collected from the burn tissues may have less direct relevance to recovery prediction. Failure to address this imbalance in informativeness across modalities leads deep learning models to treat all modalities equally. Consequently, less informative modalities may introduce noise into the joint multimodal representation space, thereby compromising classification performance.

This problem has been partially explored in other multimodal domains [5, 7, 8]. However, the unique challenges of the multiomics domain limit the direct applicability of existing multimodal learning approaches. Firstly, due to the high costs associated with data collection in this domain, labelled multiomics datasets are often small. Integrating multiple modalities substantially increases input dimensionality for the small dataset [9], which exacerbates overfitting in deep learning models due to the curse of dimensionality [2]. Secondly, multiomics data often suffer from modality misalignment for two reasons. (1) Existing multimodal methods typically rely on robust, pre-trained backbone networks to align modalities into a joint feature space. However, such backbone networks for multiomics data are yet to be developed, making effective alignment challenging. (2) Noise characteristics differ across modalities [9], meaning that some modalities may inherently offer more reliable signals for downstream tasks. Poorly aligned multimodal representations amplify these disparities, reducing generalisability and introducing classification bias.

To tackle these challenges, we propose a modality contribution confidence-enhanced learning framework for multiomics data. To address the first challenge, we propose a metric called Modality Contribution Confidence (MCC), which quantifies the predictive reliability of each modality using the Gaussian Process classifiers (GPC). The MCC scores are then applied as weighting factors to combine modality-specific representations, resulting in a more robust joint representation for classification. Concerning the second challenge, we employ a variational deep learning model for multimodal classification. Within this framework, a Kullback–Leibler (KL) divergence term is introduced to align latent distributions across modalities. This alignment encourages consistent and complementary feature representations, allowing each modality to contribute more effectively to the joint space. As a result, the model achieves improved classification performance through both confidence-aware fusion and distributional alignment.

Our key contributions can be summarised as follows:

We introduce a novel MCC-enhanced learning framework for multiomics classification tasks. This framework effectively regulates the multimodal feature learning process, thus enhancing performance in downstream classification tasks.
Our deep learning framework introduces two key innovations to facilitate effective multimodal fusion. Firstly, we propose a novel metric named Modality Contribution Confidence (MCC), which dynamically balances modality-specific learning and effectively captures within-modality features. Secondly, we design a variational deep learning model that utilises the KL divergence to constrain the multimodal feature space and facilitate improved cross-modality alignment.
Extensive experiments across four multiomics datasets show that our proposed method consistently outperforms state-of-the-art multimodal fusion and robust learning frameworks. Ablation studies further validate the contribution of each component in our framework.
We also include a case study on blister recovery period classification. Using a post-hoc explainability method, we highlight the informative biomarkers identified by our model, demonstrating its practical interpretability in real-world biomedical settings.

Related work

In this section, we review recent studies on developing multimodal fusion models for bioinformatic data, as well as explore research into robust multimodal representation learning in general.

Multimodal fusion dealing with multiomics data

Multimodal fusion integrates multiomics data with diverse data types, such as clinical texts, demographic information, and multiple omics, into a joint representation suitable for downstream tasks. Our paper focuses on multiomics classification utilising static information and various omics data. The classification task includes multimodal feature extraction, data fusion, and classification. Recent works have explored various network architectures for feature extraction from each modality independently [1, 4]. Once the features for each modality are obtained, the next step is to create a joint representation space that combines features from multiple modalities.

Existing data fusion frameworks can be classified as early, late, and intermediate [10, 11]. Early fusion methods combine all modalities at the data level through concatenation operations. For example, Liu et al. [12] combine four omics modalities, including transcriptomic, proteomics, metabolomics, and lipidomics, and feed the concatenated omics into an autoencoder for Covid-19 patient analysis. Wang et al. [13] use an attention mechanism on concatenated mRNA, mutations, CNV, and RPPA expression and metabolites for drug response prediction. Though early fusion has shown promising performance on these tasks, its overall effectiveness remains uncertain. An empirical analysis conducted by Hauptmann and Kramer [14] suggest that early fusion shows poor performance on drug response prediction with gene repression, mutation and CNA modalities compared with late and intermediate fusion methods. Early fusion typically fails to consider the within-modality features, resulting in the deterioration of the unimodal manifolds [15].

Most existing late and intermediate fusion methods represent the consensus and complementary information of multiomics data in learning [16]. Late fusion methods generate independent decision scores for each modality and then average these scores to derive predictions [17]. Want et al. [1] combine unimodal prediction confidence with tensor fusion, allowing the full interaction between modalities at the decision level. Results on four disease prediction datasets, utilising DNA, mRNA, and miRNA modalities, have shown promising outcomes. Han et al. [2] design a trustworthy framework for evaluating the unimodal reliability and combining unimodal prediction results weighted by their reliabilities.

Athaya et al use a deep learning-based intermediate fusion framework, enabling the learning of cross-modal interactions [10]. Arya and Saha [4] use a co-attention framework to capture bi-modal interactions for breast cancer prediction across static, CNV, and gene expression modalities. Choi and Chae [18] use biological relationships between DNA, gene expression, and miRNA modalities to learn the interactions for breast cancer subtype detection.

These works show the potential of deep learning for multimodal representation learning in multiomics data. Instead of focusing primarily on effective frameworks for extracting within and across modality features, our study introduces a novel Modality Contribution Confidence-enhanced fusion framework that addresses the inherent biases across different omics modalities, resulting in improved classification performance.

Robust multimodal learning

Robust training strategies for multimodal learning are gaining momentum across domains such as audio-video and text-image. A seminal work [5] posits that multimodal classification often suffers from overfitting due to varying informativeness across modalities. Typically, one modality dominates the learning process while others contribute little, causing the model to overfit to the most informative modality as training progresses. To mitigate this, existing methods often incorporate additional mechanisms to balance the contribution of multiple modalities during training. Wang et al. [5] propose a generalisation to overfitting criterion, while Peng et al. [7] use unimodal classification confidence to mitigate bias. Fan et al. [8] apply prototype learning to optimise the joint feature space, and Mai [19] design a boosting-inspired method to select informative modalities dynamically.

Although these methods show improved performance in general multimodal tasks by encouraging underperforming modalities to contribute more, their assumptions do not fully translate to multiomics data. Prior works [1] show that certain omics modalities are inherently more predictive for specific biomedical tasks. Forcing low-informative modalities to contribute equally may introduce noise rather than improve performance. This paper proposes a novel robust multimodal learning framework that directly addresses the biased modality problem through targeted feature learning. Instead of enforcing uniform contribution, our method adaptively weighs each modality based on its true predictive value, leading to more reliable and generalisable classification outcomes.

Methodology

This section is structured as follows: we first present the motivation for the proposed approach and formally define the multimodal classification problem. Next, we introduce the MCC metric and outline the joint representation learning strategy. We then describe the role of KL divergence in regularising the latent space, followed by a detailed explanation of the proposed network architecture. Fianlly, at the end of this section, we discuss the proposed method.

Problem motivation and core components

A key challenge of multiomics data is that not all modalities contribute equally to downstream prediction [1]. To address this, we propose a deep learning framework enhanced by two core components: MCC and KL divergence regularisation, as illustrated in Fig. 1.

The MCC mechanism promotes modality-aware feature fusion by assigning higher weights to more informative modalities, which means those that provide greater predictive value, while reducing the influence of noisy or less relevant ones. This adaptive weighting results in a joint representation that better reflects the true informativeness of each modality, thereby enhancing classification performance and robustness.

To compute MCC, we estimate each modality’s predictive confidence using a Gaussian Process Classifier (GPC) on a small subset of the training data. GPCs are particularly well-suited for this setting as they are less prone to overfitting than deep models and offer calibrated uncertainty estimates [20]. This is an advantage for small-sample omics datasets where overconfident errors are common. The averaged predictive confidence for each modality serves as its MCC score, reflecting its standalone ability to predict correct labels. These MCC scores are then used to guide the feature fusion process in the main network, allowing the model to focus on more informative modalities.

Complementing MCC, KL divergence regularisation [21] is introduced to align the latent feature distributions across modalities. Since different omics types often follow distinct statistical patterns, unaligned representations can hinder learning. KL divergence encourages consistency across modalities by learning a consistent manifold, ensuring that no single modality dominates the joint space due to distributional imbalance (e.g., differences in scale or variance).

As shown in Fig. 1, our framework comprises a primary network ${\mathcal {M}}$ and the MCC estimation path. The MCC path (shown in red) computes modality-specific confidence using GPCs. The main network (black path) performs unimodal feature embedding, MCC-guided fusion, and final classification. The MCC scores $\omega _m$ are applied during the fusion stage to weight modality contributions accordingly.

Problem formulation

Let $ {\mathcal {D}} = \{{\{x^m_i\}_{m=1}^M,y_i}\}_{i=1}^N$ be a multimodal dataset where M is the number of modalities, N is the number of instances, and $x^m_i \in {\mathbb {R}}^{d_m}$ represents a feature vector of $d_m$ dimension for modality m. The goal of multimodal classification is to learn a mapping from $\{\{x^m_i\}_{m=1}^M\}_{i=1}^N$ to class labels $\{y_i\}_{i=1}^N \in {1,2,\ldots C}$ where C is the number of classes. For clarity, we sometimes omit the instance index $i$ in the following discussion.

The classification process consists of two main steps. First, each input modality $x^m$ is projected into a fixed-length latent representation $f^m \in {\mathbb {R}}^{l}$, where $l$ is the shared feature dimension across modalities. Second, these unimodal features $\{f^m\}_{m=1}^M$ are combined into a joint multimodal latent representation $z \in {\mathbb {R}}^\iota $ with $\iota $ dimension. In some fusion strategies, such as Product-of-Experts, $\iota = l$. Finally, a classifier operates on the joint representation $z$ to produce the predicted labels ${\hat{y}}$. The effectiveness of this process depends on learning a high-quality joint representation $z$, which is especially challenging in multiomics data due to inherent noise and variability in modality informativeness.

The objective function formulation

The objective of multimodal classification is to maximise the log-likelihood of the conditional probability $p(y|\{x^m\}_m^{M})$. Since computing this likelihood directly is intractable [22], we adopt a variational inference approach to estimate the evidence lower bound (ELBO), as detailed in the Appendix. The corresponding loss function can be expressed as in Eq. 1:

$$\begin{aligned} \begin{aligned} L = \frac{1}{N}\sum _{i=1}^N \underbrace{{\mathbb {E}}_{z \sim p(z|{x_i^m}_m^{M})}[-log(q_\phi (y_i|z_i))]}_{\text {Cross-entropy loss}} + \beta \underbrace{KL(q_\theta (z_i|\{x_i^m\}_{m=1}^M)|| r(z_i))}_{\text {KL divergence}} \end{aligned} \end{aligned}$$

(1)

Here, $r(z_i)$ represents the prior distribution, assumed to be a standard Gaussian distribution ${\mathcal {N}}(0,I)$, which regulates the latent representation $z_i$. $KL(q_\theta (z_i|x_i)|| r(z_i))$ denotes the KL divergence, which measures the divergence between the posterior and the prior. $q_\theta (z_i|\{x_i^m\}_{m=1}^M)$ represents the learned multimodal distribution of $z_i$ parameterised by a set of parameters $\theta $. The classifier $q_{\phi }(y_i|z_i)$, parameterised by $\phi $, approximates the true posterior $p(y_i|z_i)$ based on the sampled latent representation $z_i$, since integrating over all possible $z_i$ is computationally intractable.

The first term in Eq. 1 represents the cross-entropy loss for classification and the second term (KL divergence) is a regularisation term that aligns the learned latent distribution $q_\theta (z_i|\{x_i^m\}_{m=1}^M)$ with the standard Gaussian prior $r(z_i)$. In our setting, both the prior and the posterior (i.e. the learned multimodal distribution) are assumed to follow Gaussian distributions (i.e. $r(z) = {\mathcal {N}}(0,I)$ and $q_\theta (z) = {\mathcal {N}}(\mu _z,\sigma _z)$). In this context, $q$ represents the learned probability, $\theta $ represents a set of parameters for the feature representation learning, and $\phi $ is the set of parameters for the classifier.

MCC-enhanced training to learn within-modality features

We propose a confidence-guided strategy that leverages the Gaussian Process classifiers to quantify modality informativeness through a metric called MCC. This section defines MCC and explains how it integrates with the overall learning framework to learn within-modality features. The MCC computation pipeline is shown in Fig. 2.

Definition of MCC: Let ${\mathcal {D}}_r \in {\mathcal {D}}$ be a random subset of the training data, containing $N_r = r\times N$ instances, where $r \in [0,1])$ is a sampling ratio. The MCC for modality m is defined as the confidence score $\omega _m$, representing the probability that modality m can accurately classify samples in $D_r$.

To estimate the confidence score $\omega _m$, we employ a Gaussian Process Classifier (GPC) for each modality m. GP is a non-parametric model where any finite set of variables has a joint Gaussian distribution, defined as ${\mathcal {N}}(\mu (x), {\mathcal {K}}(x_i,x_j))$ with $\mu (x)$ and ${\mathcal {K}}(x_i,x_j)$ representing the mean and covariance (kernel) functions, respectively [23]. $x_i$ and $x_j$ are two random samples from the dataset. For a classification problem with $C$ classes, the GPC models each class c using an independent GP: $GP_c \sim {\mathcal {N}}(\mu _c, {\mathcal {K}}_c(x_i,x_j))$ where $c \in {1,2,\ldots C}$, typically assuming $(\mu _c = 0$. The radial basis function (RBF) kernel [24] is commonly used to learn the covariance between samples.

MCC Computation: At the beginning of each training epoch, we sample a small subset ${\mathcal {D}}_r$ and train a separate GPC for each modality using only its respective unimodal features. This design prevents information leakage by excluding ${\mathcal {D}}_r$ from the main training set (i.e. ${\mathcal {D}}_{main} = {\mathcal {D}} - {\mathcal {D}}_r$) during that epoch. For each instance $x_{i}^m$ in $D_r$ for modality $m \in M$, the GPC outputs a probabilistic prediction for each class. The classification confidence score, $c_i^m$, for the true class $y_i$ is computed as:

$$\begin{aligned} \begin{aligned} c_i^m&= \frac{\exp (GP^m_{c=y_i}(x_{i}^m))}{\sum _{c=1}^{C} \exp (GP^m_c(x_{i}^m))} \end{aligned} \end{aligned}$$

(2)

This scalar value $c_i^m \in [0,1]$ captures the confidence of modality m in correctly predicting $y_i$.

The overall contribution $\omega _m$ of modality m is then computed by averaging the confidence scores across $D_r$

Subsequently, we average the confidence score $c^m_i$ over the dataset $D_m$ to determine the contribution of each modality m towards the downstream task.

$$\begin{aligned} \begin{aligned} \omega _m^{'}&= \sum _{i=1}^{N_r} c_i^m \\ \omega _m&= \frac{\omega _m^{'}}{\sum _{m=1}^M\omega _m^{'}} \end{aligned} \end{aligned}$$

(3)

A higher $\omega _m$ indicates that the modality is more informative and capable of making confident, accurate predictions independently, and should thus contribute more to the joint representation.

Why use GPC: GPC is particularly suited for small-sample multiomics settings due to its probabilistic and non-parametric nature. Unlike deterministic classifiers such as SVMs, random forests, or neural networks, GPC produces calibrated confidence scores without relying on a fixed decision boundary. Neural networks, while capable of producing class probabilities, often suffer from poor calibration due to architectural components like batch normalisation, which can lead to overconfident and misleading predictions [25]. In contrast, GPCs are less prone to overfitting and better reflect predictive uncertainty, making them well-suited for estimating MCC.

Multimodal representation representation

In this section, we illustrate how to combine the learned MCC into multimodal representation learning. Specifically, we decompose the posterior (Eq. 1) as a weighted sum of modality-specific unimodal posteriors ${q_\theta }_m(z_i|x_i^m)$, where the weights are given by the MCC values $\omega _m$, following the mixture of experts formulation [26]:

$$\begin{aligned} \begin{aligned} q_\theta (z_i|\{x_{i}^{m}\}_{m=1}^{M})&= \sum _{m=1}^M \omega _{m} {q_\theta }_m(z_i|x_i^m) \\ s.t. \sum _{m=1}^M \omega _m&= 1 \end{aligned} \end{aligned}$$

(4)

Here, $\theta _m$ represents the parameters associated with modality m and $\omega _m$ represent the contribution of modality m to the joint latent representation.

As expressed in Eq. 4, the fusion of multiple modalities is achieved via a weighted summation of the individual modality posteriors. This formulation comprises two components: the learned unimodal posterior distributions ${q_\theta }_m(z_i|x_i^m)$ and their corresponding MCC weights $\omega _m$. While prior work has primarily focused on learning effective unimodal distributions ${q_\theta }_m(z_i|x_i^m)$, the role of $\omega _m$ has often been unexplored. Intuitively, $\omega _m$ captures the relative importance of modality m in the joint representation. Specifically, $\omega _m = 0$ implies no contribution from modality m whereas $\omega _m = 1$ suggests that the joint representation relies entirely on modality m. This is particularly critical in multiomics data, where the informativeness of modalities can vary widely [2]. Some modalities may contain redundant or noisy information, which can degrade the downstream task. Thus, accurately learning $\omega _m$ is essential for robust and effective multimodal learning.

KL divergence for learning cross-modality features

To effectively learn the posterior $q_\theta (z_i|\{x_i^m\}_{m=1}^M)$ in Eq. 1, it is essential to model the latent representation $z_i$. We contend that the KL divergence term in Eq. 1 plays a pivotal role in multimodal learning. As shown in Fig. 3b, when the unimodal training occurs independently, their joint latent representation may reside in misaligned subspaces due to inherent discrepancies between modalities. This misalignment can lead to prediction bias when the model is applied to unseen data. Conversely, the KL divergence, by penalising the modality latent distributions that do not match a shared prior (i.e. the standard normal distribution), restricts all unimodal representations to a common latent space (Fig. 3a). This regularisation helps align the unimodal features and mitigates the risk of distributional mismatch. An alternative approach to addressing modality misalignment is to apply normalisation techniques (e.g. batch normalisation) to scale unimodal representations into a consistent range. However, such methods focus solely on rescaling and do not account for distributional variance or structural alignment. As a result, they may fail to preserve the informative variability of the original features across modalities.

Network architecture

As shown in Fig. 1, the primary network consists of three sequential steps: within-modality feature learning, feature fusion enhanced by GPC and KL divergence, and classification. The input to the network comprises raw multimodal data $\{x^m\}^M_{m=1}$, and GPC-generated MCC values $\omega _m$. The output is the predicted label ${\hat{y}}$.

Within modality feature learning: Within modality features $f^m$ are obtained as follows:

$$\begin{aligned} f^m = Dropout({NN_\theta }_m(x^m)) \end{aligned}$$

(5)

Here, $\theta _m$ represents the parameters for projecting the raw input $x^m$ into its feature space. NN represents any type of neural network, such as graph neural networks, transformers or fully connected layered networks. In this paper, we employ a fully connected layered network. Each $NN_\theta $ produces a unimodal feature distribution $f^m \in {\mathcal {N}}(\mu _m, \sigma _m)$, representing the feature distribution for one modality. This stochastic behaviour arises from the use of Monte Carlo dropout [27], which randomly deactivates a subset of network parameters during inference. As a result, the output of the neural network is no longer deterministic but instead reflects a Gaussian distribution over possible representations.

Fusion: Given the unimodal distributions $f^m \in {\mathcal {N}}(\mu _m, \sigma _m)$) and their corresponding MCC weights $\omega _m$, the joint representation $z'$ is computed as a weighted sum:

$$\begin{aligned} {z'} = \sum _{m=1}^M \omega _m \times f^m \end{aligned}$$

(6)

Mathematically, the weighted sum of Gaussian-distributed features also yields a Gaussian distribution, characterised by mean $\sum _m^{M} \omega _m \mu _m$ and variance $\sum _m^{M} \omega _{m}\sigma _m^{2}$ [28]. This forms the basis for constructing the joint multimodal distribution,$z$, from which a representation is sampled for the downstream classification task.

$$\begin{aligned} \begin{aligned} \mu _{z}&= {NN_\theta }_\mu ({z'}) \\ \sigma _{z}&= {NN_\theta }_\sigma ({z'}) \\ z&= \mu _{z} + \sigma _{z} \times \epsilon \\ \epsilon&\sim {\mathcal {N}}(0,I) \end{aligned} \end{aligned}$$

(7)

In Eq. 7, we apply the reparameterisation trick [22] to enable gradient-based optimisation with SGD. $\epsilon $ is a random variable sampled from a standard normal distribution ${\mathcal {N}}(0,I)$. The KL divergence in Eq. 1 is applied to the parameters $\mu _{z}$ and $\sigma _{z}$ to regularise the latent space toward the prior.

Classification: Finally, the latent representation z is passed through a fully connected layer classifier to predict the class label:

$$\begin{aligned} {\hat{y}} = NN_{\phi }(z) \end{aligned}$$

(8)

Here, ${\hat{y}}$ denotes the predicted results, generated by a fully connected layer $NN_{\phi }$ followed by a softmax activation. The model is trained using the loss defined Eq. 1, where the first term minimises the negative log-likelihood via cross-entropy between ${\hat{y}}$ and ground truth label y.

Algorithms 1 and 2 summarise the training and testing phases for the proposed framework. In the testing phase, we directly use the ${\hat{\omega }}$ constructed at the final epoch for simplicity.

Improving fusion with MCC: stability and interpretability

Compared to prior attention- or gating-based dynamic fusion methods [2], the proposed MCC framework offers improved robustness and interpretability. By decoupling modality weighting from feature representation learning, our method reduces the risk of overfitting, which is particularly important in small-sample, high-dimensional settings, such as multiomics. Moreover, MCC enhances modality-level interpretability by explicitly quantifying the contribution of each modality through the scalar weight $\omega _m$.

In existing methods, modality importance is typically modelled as $\omega _m = {\mathcal {A}}_\psi (f^m)$, where ${\mathcal {A}}_\psi $ is an attention or gating mechanism parameterised by $\psi $, and operates directly on the modality-specific feature $f^m$. The parameters $\psi $ are jointly optimised with the representation learning pipeline. This joint learning leads to entangled gradients during backpropagation. Specifically, the gradient of the loss with respect to $f^m$ includes a direct and an indirect component:

$$ \frac{\partial {\mathcal {L}}}{\partial f^m} = \underbrace{\frac{\partial {\mathcal {L}}}{\partial z} \cdot {\mathcal {A}}_\psi (f^m)}_{\text {(i) direct term}} + \underbrace{\frac{\partial {\mathcal {L}}}{\partial z} \cdot f^m \cdot \frac{\partial {\mathcal {A}}_\psi (f^m)}{\partial f^m}}_{\text {(ii) indirect term}} $$

The second term introduces a feedback loop in which the feature representation affects its contribution weight, leading to entangled gradients and potential instability during optimisation. This problem is exacerbated in domains like multiomics, where data scarcity and modality imbalance are common. In contrast, our MCC-enhanced framework computes the modality weight $\omega _m$ externally using a GPC trained on a small calibration subset. These weights are fixed during each epoch, eliminating the indirect gradient term and stabilising optimisation. The result is a cleaner, more modular learning process, where feature extraction and modality weighting are separated.

MCC also offers superior interpretability. While prior works typically apply attention weights at the feature level (i.e., assigning element-wise importance within the latent space), they often obscure the global contribution of each modality. In contrast, MCC provides an explicit, scalar-level importance score for each modality $\omega _m$, facilitating a high-level understanding of their respective roles. This is especially valuable in scientific and clinical applications where understanding which data sources drive predictions is crucial for trust and transparency.

Experiments

Extensive experiments have been conducted to validate the efficacy of the proposed MCC-enhanced multimodal learning framework.

Datasets

We conducted experiments on four biomedical datasets spanning various domains as illustrated in Table 1. The ROSMAP and BRCA datasets ^{Footnote 1} contain DNA, mRNA, and miRNA data used for Alzheimer’s and breast cancer classification, respectively [1]. TCGA is another dataset utilised for breast cancer classification, incorporating gene expression, CNA, and demographic modalities [4] .^{Footnote 2} Finally, we include a burn healing dataset that aims to predict the recovery time for burn wounds within one week, two weeks, four weeks, or longer [29]. The Availability of data and materials section provides details on dataset sources and access links for datasets in Table 1.

Table 1 Discription of datasets

Full size table

Experimental settings

All experiments were conducted on a Linux platform using a Python 3.10.4 environment and PyTorch. The scikit-learn library implements the Gaussian Process. For modelling the Gaussian Process classifier, we utilise an RBF kernel with bandwidth 10 plus a white kernel to model the noise in the data. We conducted 5-fold validation to evaluate the performance of all models and report the mean and variance over five folds. Further implementation details are provided in the Appendix.

Baselines

To validate the effectiveness of the proposed method, we compare its classification performance against the four family of methods: (1) Early fusion: we directly concatenate the multimodal data and feed the joint representation to SVM [30], Random Forest(RF) [31], XGBoost [32], KNN [33], and Neural networks (NN), respectively. We utilised the Scikit-learn library to implement the above methods. (2) Late fusion: We obtain the softmax decision score for each modality using a DNN and calculate the average decision score across all modalities. (3) Intermediate fusion: We consider tensor fusion [34], tensor-gate [35], GMU [36], and Cross attention [37] fusion methods. These methods calculate the full interactions between modalities through different strategies. We reimplmented these intermediate fusion methods. (4) Robust multimodal fusion: We consider the following five dynamic fusion methods: Gradient Blend (GB) [5] which controls the overfitting level of unimodal and multimodal models; Gradient modulation (GM) [7] which dynamically modifies the learning rate for each modality; DFTMC [2] which dynamically selects the useful features and modalities using the trustworthy based method; PMR [8] which utilises the unimodal prototype to adjust the learning speed and momentum of each single modality; Multimodal boosting (MMB) [19] that adjusts the modality contributions through boosting algorithm. We utilised and adapted the GM, PMR and DFTMC from their original GitHub repositories. We used the implementation of GB provided in the Multibench benchmark [38]. Due to the absence of code provided, we re-implemented the MMB method.

Evaluation criteria

Classification accuracy, macro F1 (M-F1), and weighted F1 score (W-F1) were reported, considering the imbalanced nature of most datasets.

Results and discussion

In this section, we present the classification performance of baseline methods and the proposed method on four multiomics datasets. Subsequently, we delve deeper into the proposed method, conducting an ablation study, robustness test, and sensitivity test. At the end of this section, we analyse the computational complexity of the proposed method.

Performance on multiomics classification tasks

This section presents the experimental results on four multiomics classification tasks. We first compare the overall performance of the proposed method with baseline approaches to demonstrate its effectiveness. Then, we provide a per-label performance analysis on the two multi-class classification tasks, with particular attention to the model’s ability to accurately classify minority classes.

Performance on baseline datasets

Tables 2, 3, 4, 5 illustrates the classification performance on the four multiomics classification tasks. Overall, the proposed method outperforms all baseline methods regarding accuracy, M-F1, and W-F1, underscoring the efficacy of leveraging GP as guidance and employing KL divergence as a constraint in multimodal representation learning.

Table 2 Results on ROSMAP

Full size table

Table 3 Results on BRCA

Full size table

Table 4 Results on TCGA

Full size table

Table 5 Results on Blister

Full size table

Early fusion methods struggle to capture the intricate non-linear relationship between multimodal input features and class information. By merging modalities at the data level, early fusion disrupts the distribution within modalities, resulting in poorer classification performance than other methods. The proposed framework outperforms these early fusion methods by a margin of at least 2.13% and up to 34.13% in terms of accuracy.

Late fusion and other intermediate fusion methods exhibit an ability to capture complex relationships within and across modalities compared to early fusion methods. However, they are susceptible to the biased modality issue, where the most informative modality dominates the joint representation space. In contrast, the proposed framework effectively balances the learning across multiple modalities, thus outperforming these methods.

Dynamic training frameworks, such as GM and DFTMC, demonstrate competitive performance compared to all other methods by better balancing the learning of joint representation for multimodal data. However, empirical findings indicate a tendency for these methods to overfit on small datasets, such as the blister data with only 144 instances, due to the lack of regularisation. Conversely, our proposed method employs KL divergence to regulate the learning space and better align multiple modalities. Consequently, our method surpasses these dynamic methods by at least 1% and up to 26.16% of the M-F1 score.

Most of the datasets suffer from the class imbalance problem, which makes the machine learning and deep learning methods tend to learn more from the majority of classes. Compared with baseline methods, the proposed method performs better at detecting the minority classes, resulting in higher macro F1 scores. Moreover, the dimensions of the input features are higher than the number of instances for most of the datasets, resulting in the overfitting problem. The proposed method shows better classification performance on the test dataset, indicating its robustness.

Performance of multi-class classification

To further evaluate the model’s ability on class imbalance and fine-grained prediction, we analyse per-class performance on BRCA and Blister, which are two multi-class classification tasks. Figs. 4, 5 present the F1 score for each class for these two tasks. The model demonstrates consistently strong performance on the BRCA dataset across all subtypes, including the minority class, suggesting that it effectively learns stable modality contributions across all the classes, even for the minority classes. For the Blister dataset, although the model achieves the best overall performance, the per-class analysis reveals that the $\le 1$ week recovery class remains difficult to predict. This underperformance is likely due to the limited number of training samples and potential label ambiguity arising from subjective or inconsistent definitions of recovery timelines. Nevertheless, the model performs relatively well on the other three classes, and this result indicates the future improvement on leveraging the data augmentation technique to enhance the minority class learning for the Blister dataset.

Ablation results

This section first presents the ablation results on four multiomics classification tasks to validate the individual contributions of the MCC and the KL divergence components. It then visualises the representation spaces of ablation variants without KL and the proposed method, demonstrating that incorporating KL divergence leads to a more structured multimodal fusion space with improved class separability and a more robust decision boundary.

Ablation results

Table 6 shows the classification performance of three ablation models against the proposed model. They are (1) w$\backslash $o MCC denotes the removal of the MCC during training, with all modalities combined using equal weights. (2) w$\backslash $o KL signifies the removal of the KL restriction in the loss function during training. (3) GP only involves using early fusion to combine data and feed it to the GP classifier with RBF and white kernel.

Table 6 Ablation results

Full size table

A decrease in classification performance of ablation methods indicates the significance of each component within our proposed framework. For the w$\backslash $o MCC model, excluding the MCC leads to over-learning of the informative modality and under-learning of the less informative one. In the case of the w$\backslash $o KL model, the failure to regulate modalities into a joint space results in overfitting. Similarly to other early fusion methods, GP only struggles to model the within-modality distribution. However, leveraging GP’s capability to model variance through kernel-based methods outperforms traditional machine learning baselines. Nevertheless, deep learning-based methods demonstrate superior overall performance.

Analyse the importance of KL divergence

Fig. 6a and b show the t-SNE visualisations of the proposed model without and with KL divergence regularisation, respectively, on the BRCA testing dataset. The KL divergence is applied to the fused latent space and serves to align the learned multimodal representations with a standard Gaussian prior. This regularisation encourages uncertainty-aware learning and improves the structure of the latent space, which in turn enhances the separability of classes in the decision space. As illustrated in Fig. 6a, the absence of KL divergence leads to overlapping clusters, particularly between Luminal A and Luminal B, indicating poor class separation. In contrast, Fig. 6b, which includes KL divergence, shows more compact clusters and clearer decision boundaries. This improvement arises because KL divergence penalises divergence from a well-behaved prior, promoting disentangled and regularised features that reduce overfitting. Notably, the minority class HER2-enriched is also better separated from Luminal A in (b), suggesting that KL divergence contributes to more robust decision boundaries across both major and minority classes. To summarise, these findings confirm the motivation that incorporating KL divergence can improve the robustness and structure of the multimodal fusion space, leading to better decision boundaries in downstream classification.

Robustness evaluation

Robustness evaluation is essential to ensure that the proposed method performs reliably under real-world conditions, where data may be noisy or exhibit imbalanced modality contributions. Thus, we design two complementary experiments: First, we inject random noise into individual modalities to examine the model’s ability to maintain stable performance under perturbation. This simulates scenarios with degraded data quality. Second, we evaluate the model using different modality combinations to assess whether the fusion mechanism effectively balances complementary and conflicting features. These experiments validate the roles of MCC weighting and KL alignment in enhancing robustness.

Noise perturbation

We delve deeper into examining the robustness of the proposed model and other ablation models. For this experiment, we introduce random noise into the training and testing datasets of one of the modalities from the BRCA dataset at varying percentages of 5%, 10%, and 20%. For example, injecting 5% random noise into the DNA modality implies randomly selecting 5% of instances and introducing noise to their DNA modality. In other words, we replaced the DNA modality with the noisy version and kept the other two modalities. This experiment aims to assess the robustness of each model by simulating real-world scenarios where noise may originate from factors like biased data collection devices. The results of this experiment are shown in Fig. 7. Results indicate that the original Gaussian process model is not robust to the noise within the multimodal data. The GPC in the proposed model played an important role in maintaining robustness. Without the GPC, the noisy modality contributes equally to the other noisy-free modalities, resulting in a drop in classification performance. Meanwhile, w$\backslash $o KL also caused a performance drop due to the loss of alignment in the joint representation space. Conversely, the proposed model is relatively more robust than other ablation models to noise due to the robust learning of aligned features over unimodal and multimodal feature spaces.

Modality combinations

To demonstrate the proposed method’s ability to balance modality contributions, we conduct a modality combination analysis on the TCGA dataset. This experiment evaluates classification performance under different modality configurations, including single-modality, bi-modality, and tri-modality setups. Fig. 8 illustrates the influence of modality configuration on the TCGA dataset. Among the unimodal inputs, gene expression exhibits the strongest predictive power, while demographic and CNA profiles alone perform less effectively. However, when combined with gene expression, both modalities contribute to performance gains, demonstrating that the proposed model can balance the learning of less informative modalities without letting them dominate or dilute the overall prediction. Notably, the demographic and CNA modalities, when used together without gene expression, also result in non-trivial improvements, further indicating the model’s capacity to exploit cross-modal interactions. The highest performance is achieved when all three modalities are used, confirming that the proposed fusion strategy, including Gaussian Process weighting and KL alignment, effectively balances modality contributions, avoids over-reliance on any single source, and maximises complementary information.

Sensitivity analysis

We analyse $\beta $ and the effectiveness of bandwidth for the RBF kernel, which are two essential hyperparameters for MCC.

Figure 9 demonstrates the tuning process for the crucial hyperparameter $\beta $, which controls the impact of KL divergence during optimisation. Results also indicate that setting $\beta $ to excessively large values severely constrains the learning of the joint representation space, leading to minimal model updates. Conversely, setting $\beta $ to very small values results in the loss of KL divergence regulation, leading to a performance drop. Given the model’s sensitivity to this hyperparameter, we explore the change of classification performance on the ROSMAP, BRCA, and TCGA datasets with $\beta $ in [1e−1, 1e−2, 1e−3, 1e−4, 1e−5].

Figure 10 illustrates the sensitivity analysis of the RBF kernel bandwidth in the GPC used for computing MCC. Overall, classification performance remains relatively stable across different bandwidth values, with a slight peak observed at a bandwidth of 10, where all evaluation metrics achieve their highest scores on the BRCA dataset.

Complexity analysis

Finally, this subsection will analyse the computational complexity of the proposed MCC model. The computational complexity of our proposed method includes two main components: the Modal Contribution Calibration (MCC) module and the main representation learning network. The MCC module is a data-driven method based on Gaussian Processes, where the computational cost is determined by the number of data points used in training. Its complexity is ${\mathcal {O}}(n_r^3)$, where $n_r \subset n$ is a small subset of the training data. For example, in the blister recovery task, the full dataset contains 148 instances, with 80% (i.e., 118 instances) used for training. Out of these, only 20% (i.e., $n_r = 23$) are selected for MCC, resulting in a complexity of ${\mathcal {O}}(23^3)$. Unlike MCC, the main neural network is not data-driven in complexity, and its cost is determined by its architecture.

The complexity of the multimodal neural network depends on the architecture of the unimodal encoders and the fusion layers. For the unimodal encoders, modality one has an input dimension of 507 and is projected to 50 dimensions, while modality two has an input dimension of 7 and is also projected to 50 dimensions. The complexity for these encoders is ${\mathcal {O}}(507 \times 50 + 7 \times 50)$. The fusion module learns both the mean $\mu $ and variance $\sigma $ representations from the two modalities, each of dimension 50, resulting in a complexity of ${\mathcal {O}}(2 \times 50 \times 50)$. Finally, a classification layer maps the fused features to four output classes, giving an additional complexity of ${\mathcal {O}}(50 \times 4)$.

In total, the full computational complexity of the proposed model can be written as:

$$ {\mathcal {O}}(23^3 + 507 \times 50 + 7 \times 50 + 2 \times 50 \times 50 + 50 \times 4) $$

Given that $n_r = 23$ is small in practice, the Gaussian Process component remains computationally tractable. Importantly, this provides substantial benefits of stable and calibrated modality contribution estimates, improved optimisation stability, and interpretable fusion behaviour. As a result, the proposed framework is practical for multiomics datasets, where model robustness and explainability are often more critical than raw computational throughput.

A case study on the Blister dataset

We present a case study showcasing the application of our model in predicting the recovery period of burn wound patients. In this study, we received a dataset comprising blister data from 144 patients who sustained burn injuries. The dataset includes two modalities: a static modality with 7 features and a proteomic modality with 508 protein features. The primary objective is to predict whether the patient will recover within one week, two weeks, one month, or a longer timeframe. The objectives of building classification models serve two purposes: (1) classification for burning recovery prediction and (2) identifying biomarkers learned by the deep learning method. This case study specifically aims to show the latter aspect.

Figure 11 shows the evolution of MC (i.e. $\omega $ in Eq. 3) during training. The static modality consistently exhibits higher weights than the proteomic modality. Despite weight fluctuations, both modalities generally hover around 0.5, ensuring the overall stability of the training process.

To identify informative biomarkers, we follow a methodology akin to previous research [1]. This involves systematically removing one feature at a time from the dataset and calculating the performance drop without that particular feature during the evaluation phase. Figure 12 illustrates the top 30 most important biomarkers identified through this process. Notably, four out of the seven static features are ranked among the top-influenced biomarkers. This observation aligns with the findings illustrated in Fig. 11, where static features are consistently highlighted as important. Moreover, the identified proteomic aligned with the existing study regarding burn recovery [29].

Conclusion and future works

This study introduces a novel MCC-enhanced multimodal learning framework for multiomics classification tasks. By employing a non-parametric Gaussian Process to assess the confidence of each modality, the framework effectively balances the learning process across multimodal data. Additionally, incorporating KL divergence as a constraint for joint learning space facilitates the acquisition of a generalised joint representation. Extensive experiments validate the efficacy of the proposed method. The case study on biomarker identification underscores the significance of critical features within the blister recovery dataset.

We recognise that GP-based estimation introduces additional computational overhead. Training a GP classifier for every modality in each epoch incurs substantial computational cost and raises concerns about scalability to higher-dimensional or more numerous modalities. Exploring optimisation strategies such as early stopping or weight freezing is a future direction. Enhancing the interpretability of the proposed method is another direction for future research endeavours.

Data availability

The raw ROSMAP dataset is available from the AMP-AD Knowledge Portal at https://adknowledgeportal.synapse.org/. The BRCA and TCGA datasets can be accessed through The Cancer Genome Atlas (TCGA) via the Broad GDAC Firehose at https://gdac.broadinstitute.org/. The pre-processed ROSMAP and BRCA datasets used in this study are available at https://github.com/txWang/MOGONET, while the pre-processed TCGA dataset is available at https://github.com/nikhilaryan92/SiGaAtCNNstackedRF. The Blister dataset is available upon reasonable request.

Notes

References

Wang T, Shao W, Huang Z, Tang H, Zhang J, Ding Z, et al. MOGONET integrates multi-omics data using graph convolutional networks allowing patient classification and biomarker identification. Nat Commun. 2021;12(1):3445. https://doi.org/10.1038/s41467-021-23774-w.
Article CAS PubMed PubMed Central Google Scholar
Han Z, Yang F, Huang J, Zhang C, Yao J. Multimodal dynamics: dynamical fusion for trustworthy multimodal classification. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR); 2022.
Li J, Han X, Qin Y, Tan F, Chen Y, Wang Z, et al. Artificial intelligence accelerates multi-modal biomedical process: a survey. Neurocomputing. 2023;558:126720. https://doi.org/10.1016/j.neucom.2023.126720.
Article Google Scholar
Arya N, Saha S. Multi-modal advanced deep learning architectures for breast cancer survival prediction. Knowl-Based Syst. 2021;221:106965. https://doi.org/10.1016/j.knosys.2021.106965.
Article Google Scholar
Wang W, Tran D, Feiszli M. What makes training multi-modal classification networks hard? In: 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR). IEEE; 2020.
Brown NJ, Kimble RM, Gramotnev G, Rodger S, Cuttle L. Predictors of re-epithelialization in pediatric burn. Burns. 2014;40(4):751–8. https://doi.org/10.1016/j.burns.2013.09.027.
Article PubMed Google Scholar
Peng X, Wei Y, Deng A, Wang D, Hu D. Balanced multimodal learning via on-the-fly gradient modulation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2022.
Fan Y, Xu W, Wang H, Wang J, Guo S. PMR: prototypical modal rebalance for multimodal learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR); 2023. pp. 20029–20038.
Athieniti E, Spyrou GM. A guide to multi-omics data collection and integration for translational medicine. Comput Struct Biotechnol J. 2023;21:134–49. https://doi.org/10.1016/j.csbj.2022.11.050.
Article CAS PubMed Google Scholar
Athaya T, Ripan RC, Li X, Hu H. Multimodal deep learning approaches for single-cell multi-omics data integration. Brief Bioinform. 2023;9(5):24. https://doi.org/10.1093/bib/bbad313. (bbad313).
Article CAS Google Scholar
Nayak R, Luong K. Subspace learning for multi-aspect data. In: Nayak R, Luong K, editors. Multi-aspect learning: methods and applications. Cham: Springer; 2023. p. 77–101. Available from: https://doi.org/10.1007/978-3-031-33560-0_4.
Liu X, Hasan MR, Ahmed KA, Hossain MZ. Machine learning to analyse omic-data for COVID-19 diagnosis and prognosis. BMC Bioinform. 2023;24(1):7. https://doi.org/10.1186/s12859-022-05127-6.
Article Google Scholar
Wang C, Lye X, Kaalia R, Kumar P, Rajapakse JC. Deep learning and multi-omics approach to predict drug responses in cancer. BMC Bioinform. 2022;22(10):632. https://doi.org/10.1186/s12859-022-04964-9.
Article Google Scholar
Hauptmann T, Kramer S. A fair experimental comparison of neural network architectures for latent representations of multi-omics for drug response prediction. BMC Bioinform. 2023;24(1):45. https://doi.org/10.1186/s12859-023-05166-7.
Article Google Scholar
Pingi ST, Zhang D, Bashar MA, Nayak R. Joint representation learning with generative adversarial imputation network for improved classification of longitudinal data (In Press). Data Science and Engineering. 2023.
Zhang D, Nayak R, Bashar MA. Exploring fusion strategies in deep learning models for multi-modal classification. In: Data mining; 2021.
Leng D, Zheng L, Wen Y, Zhang Y, Wu L, Wang J, et al. A benchmark study of deep learning-based multi-omics data fusion methods for cancer. Genome Biol. 2022;23(1):171. https://doi.org/10.1186/s13059-022-02739-2.
Article PubMed PubMed Central Google Scholar
Choi JM, Chae H. MoBRCA-net: a breast cancer subtype classification framework based on multi-omics attention neural networks. BMC Bioinform. 2023;24(1):169. https://doi.org/10.1186/s12859-023-05273-5.
Article Google Scholar
Mai S, Sun Y, Xiong A, Zeng Y, Hu H. Multimodal boosting: addressing noisy modalities and identifying modality contribution. IEEE Trans Multimedia. 2023;1:1–16. https://doi.org/10.1109/TMM.2023.3306489.
Article Google Scholar
Wilson AG, Izmailov P. Bayesian deep learning and a probabilistic perspective of generalization. In: Proceedings of the 34th international conference on neural information processing systems. NIPS ’20. Red Hook: Curran Associates Inc.; 2020.
Kullback S, Leibler RA. On information and sufficiency. Ann Math Stat. 1951;22(1):79–86.
Article Google Scholar
Kingma DP, Welling M. Auto-encoding variational bayes. In: 2nd International conference on learning representations, ICLR; 2014.
Christopher KI Williams, Carl Edward Rasmussen. Gaussian processes for machine learning. vol. 2; 2006.
Scholkopf B, Tsuda K, Vert JP. A primer on kernel methods. Kernel Methods Comput. 2004;35–70.
Guo C, Pleiss G, Sun Y, Weinberger KQ. On calibration of modern neural networks. In: International conference on machine learning; 2017. p. 1321–1330.
Jacobs RA, Jordan MI, Nowlan SJ, Hinton GE. Adaptive mixtures of local experts. Neural Comput. 1991;3(1):79–87. https://doi.org/10.1162/neco.1991.3.1.79.
Article PubMed Google Scholar
Gal Y, Ghahramani Z. Dropout as a bayesian approximation: representing model uncertainty in deep learning. In: Proceedings of the 33rd international conference on international conference on machine learning - Vol. 48. ICML’16. JMLR.org; 2016. p. 1050–1059.
Lemons DS.: An introduction to stochastic processes in physics. American Association of Physics Teachers.
Zang T, Fear MW, Parker TJ, Holland A, Martin L, Langley D, et al. Inflammatory proteins and neutrophil extracellular traps increase in burn blister fluid 24h after burn. Burns. 2024;50(5):1180–91. https://doi.org/10.1016/j.burns.2024.02.026.
Article PubMed Google Scholar
Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20:273–97.
Article Google Scholar
Ho TK. Random decision forests. In: Proceedings of 3rd international conference on document analysis and recognition. 1995;1:278–82.
Chen T, Guestrin C. Xgboost: A scalable tree boosting system. In: Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining; 2016. p. 785–794.
Fix E, Hodges JL. Discriminatory analysis. Nonparametric discrimination: consistency properties. Int Stat Rev Revue Int Stat. 1989;57(3):238–247.
Zadeh A, Chen M, Poria S, Cambria E, Morency LP. Tensor fusion network for multimodal sentiment analysis. In: Proceedings of the 2017 conference on empirical methods in natural language processing; 2017.
Kiela D, Grave E, Joulin A, Mikolov T. Efficient large-scale multi-modal classification. AAAI. 2018.
Arevalo J, Solorio T, Montes-y Gómez M, González FA. Gated multimodal networks. Neural Comput Appl. 2020:3.
Abavisani M, Wu L, Hu S, Tetreault J, Jaimes A. Multimodal categorization of crisis events in social media. In: 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR). IEEE; 2020.
Liang PP, Lyu Y, Fan X, Wu Z, Cheng Y, Wu J, et al. MultiBench: multiscale benchmarks for multimodal representation learning. In: Proceedings of the neural information processing systems track on datasets and benchmarks 1, NeurIPS Datasets and Benchmarks 2021; 2021.

Download references

Acknowledgements

The authors thank the School of Computer Science, Centre for Data Science at QUT, and the ARC Centre of Excellence for the Mathematical Analysis of Cellular Systems (MACSYS) grant (ID: CE230100001) for their valuable discussions, feedback, and for providing partial funding. The authors also appreciate the publication fee discount from Springer Nature.

Funding

Not applicable.

Author information

Authors and Affiliations

School of Computer Science, Queensland University of Technology, 2 George St, Brisbane, Queensland, 4000, Australia
Duoyi Zhang, Md Abul Bashar & Richi Nayak
School of Biomedical Sciences, Queensland University of Technology, 2 George St, Brisbane, Queensland, 4000, Australia
Leila Cuttle

Authors

Duoyi Zhang
View author publications
Search author on:PubMed Google Scholar
Md Abul Bashar
View author publications
Search author on:PubMed Google Scholar
Richi Nayak
View author publications
Search author on:PubMed Google Scholar
Leila Cuttle
View author publications
Search author on:PubMed Google Scholar

Contributions

Conceptualisation was led by DZ, MB, RN, and LC. DZ, MB, and RN developed the methodology, while DZ conducted the experiments. LC prepared the Blister data. The original draft was written by DZ, with contributions from MB, RN, and LC for review and editing. MB and RN supervised the work.

Corresponding author

Correspondence to Duoyi Zhang.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Zhang, D., Bashar, M.A., Nayak, R. et al. A novel modality contribution confidence-enhanced multimodal deep learning framework for multiomics data. BMC Bioinformatics 26, 271 (2025). https://doi.org/10.1186/s12859-025-06219-9

Download citation

Received: 24 March 2025
Accepted: 04 July 2025
Published: 31 October 2025
Version of record: 31 October 2025
DOI: https://doi.org/10.1186/s12859-025-06219-9

Keywords

Profiles

Md Abul Bashar View author profile

A novel modality contribution confidence-enhanced multimodal deep learning framework for multiomics data

Abstract

Explore related subjects

Introduction

Related work

Multimodal fusion dealing with multiomics data

Robust multimodal learning

Methodology

Problem motivation and core components

Problem formulation

The objective function formulation

MCC-enhanced training to learn within-modality features

Multimodal representation representation

KL divergence for learning cross-modality features

Network architecture

Improving fusion with MCC: stability and interpretability

Experiments

Datasets

Experimental settings

Baselines

Evaluation criteria

Results and discussion

Performance on multiomics classification tasks

Performance on baseline datasets

Performance of multi-class classification

Ablation results

Ablation results

Analyse the importance of KL divergence

Robustness evaluation

Noise perturbation

Modality combinations

Sensitivity analysis

Complexity analysis

A case study on the Blister dataset

Conclusion and future works

Data availability

Notes

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Additional information

Publisher's Note

Electronic supplementary material

Supplementary Material 1

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Profiles