Symmetry

Article
Symmetry and Complexity in Gene Association Networks Using
the Generalized Correlation Coefficient
Raydonal Ospina 1,2 , Cleber M. Xavier 3 , Gustavo H. Esteves 4 , Patrícia L. Espinheira 1,2 , Cecilia Castro 5, *
and Víctor Leiva 6

1 Departamento de Estatística, LInCa, Universidade Federal da Bahia, Salvador 40170-110, Brazil;


[email protected] (R.O.); [email protected] (P.L.E.)
2 Departamento de Estatística, CASTLab, Universidade Federal de Pernambuco, Recife 50670-901, Brazil
3 Departamento de Estatística e Ciências Atuariais, Universidade Federal de Sergipe,
São Cristóvão 49107-230, Brazil; [email protected]
4 Departamento de Estatística, Universidade Estadual da Paraíba, Campina Grande 58429-500, Brazil;
[email protected]
5 Centre of Mathematics, Universidade do Minho, 4710-057 Braga, Portugal
6 Escuela de Ingeniería Industrial, Pontificia Universidad Católica de Valparaíso, Valparaíso 2362807, Chile;
[email protected]
* Correspondence: [email protected]

Abstract: High-dimensional gene expression data cause challenges for traditional statistical tools,
particularly when dealing with non-linear relationships and outliers. The present study addresses these
challenges by employing a generalized correlation coefficient (GCC) that incorporates a flexibility
parameter, allowing it to adapt to varying levels of symmetry and asymmetry in the data distribution.
This adaptability is crucial for analyzing gene association networks, where the GCC demonstrates
advantages over traditional measures such as Kendall, Pearson, and Spearman coefficients. We
introduce two novel adaptations of this metric, enhancing its precision and broadening its applicability in
the context of complex gene interactions. By applying the GCC to relevance networks, we show how
different levels of the flexibility parameter reveal distinct patterns in gene interactions, capturing both
linear and non-linear relationships. The maximum likelihood and Spearman-based estimators of the
GCC offer a refined approach for disentangling the complexity of biological networks, with potential
implications for precision medicine. Our methodology provides a powerful tool for constructing and
interpreting relevance networks in biomedicine, supporting advancements in the understanding of
biological interactions and healthcare research.

Keywords: asymmetry; bioinformatics; gene expression analysis; high-dimensional data; non-linear
associations; robust statistical methods

Citation: Ospina, R.; Xavier, C.M.; Esteves, G.H.; Espinheira, P.L.; Castro, C.; Leiva, V. Symmetry and
Complexity in Gene Association Networks Using the Generalized Correlation Coefficient. Symmetry
2024, 16, 1510. https://doi.org/10.3390/sym16111510

MSC: 62H20; 92D10

Academic Editors: Brouri Adil, Abdelmalek Ouannou and Sorin Vlase

Received: 23 September 2024
Revised: 16 October 2024
Accepted: 5 November 2024
Published: 11 November 2024

Copyright: © 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution (CC BY)
license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

Biomedical informatics, an interdisciplinary field at the intersection of biology, data
science, and medicine, plays a critical role in deciphering complex molecular interactions,
thereby driving advancements in medical diagnostics and treatments. A core challenge
in this field is the analysis of high-dimensional gene expression data, which frequently
present complexities that conventional statistical methods fail to adequately capture [1–4].
Specifically, widely used correlation measures, such as Kendall, Pearson, and Spearman
coefficients, often struggle to account for the presence of non-linear relationships and the
influence of outliers in this type of data [5–8]. Moreover, gene expression data are commonly
asymmetrically distributed, adding another layer of complexity to their analysis [9]. Several
statistical methods have been developed to handle the high variability and asymmetry
observed in gene expression data.

Symmetry 2024, 16, 1510. https://doi.org/10.3390/sym16111510 https://www.mdpi.com/journal/symmetry



Methods such as variance stabilizing transformations, including the generalized log-normal
(glog-normal) distribution [10], have been applied to genomic contexts [11,12].
Additionally, robust correlation measures like the percentage bend and skipped
correlations [13,14], as well as the maximal information coefficient [15], have shown promise in
capturing both linear and non-linear dependencies in biological data. Machine learning
techniques and advanced clustering algorithms have also contributed to the analysis of
high-dimensional datasets [7,8,16–20]. Despite these advances, no single method has
fully addressed the wide array of problems inherent to bioinformatics data.
A promising solution to these problems is the generalized correlation coefficient
(GCC) [21], which introduces a flexibility parameter capable of adapting to various levels of
data complexity. Unlike traditional correlation measures, the GCC can transition smoothly
between the behavior of the Pearson coefficient and that of a rank-based coefficient,
providing a balance of sensitivity and robustness. This makes the GCC particularly
effective in capturing a broader range of linear and non-linear relationships within gene
association networks. These networks, often referred to as relevance networks (RNs), pro-
vide a framework for visualizing gene interactions where high correlations are represented
as edges between genes [22]. The adaptability of the GCC makes it a strong candidate for
constructing such networks.
Although the GCC has demonstrated potential, to the best of our knowledge, it has
not yet been applied to gene association networks, a domain where traditional correlation
methods often fall short, particularly when handling outliers and deviations
from normality [21,23]. Such difficulties are common in bioinformatics, underscoring the
need for more robust methods [24].
The main objective of this study is to extend the application of the GCC to gene associa-
tion networks, overcoming the limitations of conventional correlation techniques. To achieve
this objective, we refine existing computational methods and theoretical developments.
We employ robust estimators for the GCC based on U-statistics, valued for their
resilience in complex data. We also develop Fisher-consistent estimators with flexible
parameters, enhancing adaptability across various data structures. To ensure the reliability
of these estimators, particularly in large-sample cases, we incorporate advanced techniques
such as the delta method.
The present work broadens the applicability of the GCC within biomedical informatics,
with a focus on analyzing high-dimensional biological data, such as gene expression profiles.
By refining the existing methods and extending them to new contexts, we aim to enrich
the statistical toolkit available for analyzing complex biological datasets [3,25,26]. As a
result, our study represents a step forward in applying correlation analysis to areas such as
genomics [27], health sciences, and epidemiology, tackling key challenges like asymmetry
in the data distribution and non-linear dependencies.
The remainder of this article is organized as follows. Section 2 explores the theoretical
advancements and computational developments of the GCC, focusing on its application to
high-dimensional biological data. In Section 3, we conduct a comprehensive simulation
study to evaluate the performance of the proposed estimators under various scenarios. In
Section 4, practical applications of the GCC are presented, particularly in the construction
of RNs using gene expression data. Finally, Section 5 summarizes our key findings and
discusses the broader implications of this work, with an emphasis on future directions for
biomedical research and statistical methodology.

2. Advancements and Applications of the Generalized Correlation Coefficient


In this section, we present the theoretical framework of the GCC and recent advance-
ments that demonstrate its effectiveness in addressing challenges in the analysis of complex
data, particularly in bioinformatics.

2.1. Theoretical Foundations and Developments of the Generalized Correlation Coefficient


Quantifying relationships between variables is essential in biological systems, es-
pecially through measures of association. Correlation coefficients serve as fundamental
tools to determine the strength of the association between two gene expression profiles,
providing key insights in various biological contexts [28–37]. These measures typically
exhibit positive values when high (or low) values of one variable correspond with high (or
low) values of another and negative values when high values of one variable correspond
with low values of another.
For a pair of random variables $(X, Y)$ with joint cumulative distribution
function (CDF) $F$ and finite second-order moments, the population Pearson correlation
coefficient is defined as

$$\rho = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sqrt{E[(X - \mu_X)^2]\, E[(Y - \mu_Y)^2]}},$$

where $\mathrm{Cov}(X, Y)$ is the covariance, $\sigma_X^2 = \mathrm{Var}[X]$ and $\sigma_Y^2 = \mathrm{Var}[Y]$ are the variances, and
$\mu_X = E[X]$ and $\mu_Y = E[Y]$ are the expected values of $X$ and $Y$, respectively. This
coefficient quantifies the linear dependence between $X$ and $Y$.
The sample Pearson correlation coefficient is stated as

$$r_P = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}, \tag{1}$$

where $x_i, y_i$ are the sample observations and $\bar{x}, \bar{y}$ are the sample means of $X$ and $Y$,
respectively. If $X$ and $Y$ follow a bivariate normal distribution, denoted as
$(X, Y) \sim N_2(\mu_X, \mu_Y; \sigma_X^2, \sigma_Y^2; \rho)$, then the sample correlation coefficient presented in (1) is the
maximum likelihood (ML) estimate of the population Pearson correlation coefficient $\rho$. This
estimator is consistent, meaning that as the sample size $n$ increases, $r_P$ converges in
probability to the population correlation $\rho$.
However, when the assumption of bivariate normality is violated, several issues
arise. Outliers can disproportionately influence ρ, leading to misleading conclusions
about the relationship between X and Y. Furthermore, in the presence of non-linear
relationships, ρ may underestimate the true strength of association, as it captures only
linear dependencies [38,39]. Non-parametric alternative measures, such as the Kendall tau
and Spearman rank correlation, are less sensitive to outliers and better capture monotonic
relationships, making them more robust in such scenarios [40]. These measures provide
more reliable estimates and reflect the true relationships in the data, particularly when
dealing with biological variables that exhibit complex dependencies.
The Kendall correlation coefficient [41,42] is expressed as

$$r_K = \frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \mathrm{sign}\big((x_i - x_j)(y_i - y_j)\big),$$

where $\mathrm{sign}(\cdot)$ is the sign function that assigns a value of 1 to positive arguments, −1 to
negative arguments, and 0 when the argument is zero. Its population version is given by

$$\rho_K(F) = E_F[\mathrm{sign}((X_1 - X_2)(Y_1 - Y_2))],$$

with $E_F$ representing the expectation with respect to the CDF $F$. For a bivariate normal
distribution, whose CDF is denoted by $\Phi_2$, the relationship is established as

$$\rho_K(\Phi_2) = \frac{2}{\pi} \arcsin(\rho),$$

which differs from the Pearson correlation $\rho$ whenever $0 < |\rho| < 1$ [43].

The Spearman rank correlation coefficient [44] is also less sensitive to outliers, providing
a robust estimate of the correlation. Denote by $H(t) = P_F(X \le t)$ and $G(t) = P_F(Y \le t)$
the marginal CDFs of $X$ and $Y$, respectively. The Spearman correlation coefficient is
formulated as

$$\rho_S(F) = \mathrm{Cor}(H(X), G(Y)) = 12\, E_F[H(X) G(Y)] - 3,$$

and its sample estimate is given by

$$r_S = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)},$$

where $d_i$ is the difference between the ranks of $x_i$ and $y_i$ in the $i$-th paired observation.
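To make the three sample coefficients concrete, the following minimal Python sketch implements the formulas for $r_P$, $r_K$, and $r_S$ given above. The function names are ours, the rank helper assumes there are no ties, and the toy data (a monotone series with one extreme value) are purely illustrative.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation, as in Equation (1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def sign(z):
    """Return 1, -1, or 0 according to the sign of z."""
    return (z > 0) - (z < 0)


def kendall(x, y):
    """Sample Kendall coefficient: average sign over all pairs i < j."""
    n = len(x)
    s = sum(sign((x[i] - x[j]) * (y[i] - y[j]))
            for i in range(n - 1) for j in range(i + 1, n))
    return 2.0 * s / (n * (n - 1))


def ranks(v):
    """Ranks 1..n of the values in v (assumption: no ties)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r


def spearman(x, y):
    """Sample Spearman coefficient from squared rank differences."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Illustrative data: y increases with x, but the last value is extreme.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 100]
print(pearson(x, y), kendall(x, y), spearman(x, y))
```

Because the toy relationship is strictly increasing, the two rank-based coefficients equal 1, while the Pearson coefficient is pulled below 1 by the extreme value.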


This discussion shows that correlation measures behave differently under different
distributional scenarios. In the empirical study of biological systems,
particularly in gene expression data analysis, datasets are often prone to noise and
measurement errors. Accurately estimating correlations in such systems is critical to prevent
misinterpretations in both biological inference and statistical conclusions.
In response to these challenges, the GCC emerges as a highly robust tool for analyzing
complex biological data. What sets the GCC apart is its heightened sensitivity to both
linear and non-linear patterns, making it a far more versatile alternative to traditional
metrics like the Pearson and Kendall coefficients [21]. By seamlessly adapting to the
nuances of high-dimensional data and demonstrating remarkable resilience to outliers,
the GCC provides a more precise and comprehensive measure of association, especially
in complex biological datasets. This adaptability stems from the key role played by the
function $g_\gamma(z) = \mathrm{sign}(z)|z|^\gamma$, where $\gamma \in [0, 1]$, which underpins the definition of the GCC.
This function modulates how differences between variables are weighted based on both
magnitude and sign, allowing the GCC to adjust its sensitivity. As a result of
this flexibility, the GCC is capable of capturing a wide spectrum of correlation structures,
ranging from purely linear to highly intricate non-linear dependencies. This adaptability
is achieved through three key population parameters that define the functional form of
the GCC, each reflecting a different aspect of the relationship between variables, indicated
as follows:
• $\theta_1(F) = E_F[g_\gamma(X_i - X_j)\, g_\gamma(Y_i - Y_j)]$, which captures the mutual influence between
differences in pairs of variables $X$ and $Y$.
• $\theta_2(F) = E_F[g_\gamma^2(X_i - X_j)]$, which quantifies the squared influence of differences in the
variable $X$.
• $\theta_3(F) = E_F[g_\gamma^2(Y_i - Y_j)]$, which measures the squared influence of differences in the
variable $Y$.
From these parameters, the GCC is formally defined as

$$\rho_\gamma(F) = \frac{\theta_1(F)}{\sqrt{\theta_2(F)\, \theta_3(F)}}, \tag{2}$$

where the parameter $\gamma$ modulates the degree of similarity between the GCC and traditional
correlation coefficients. Specifically, when $\gamma = 1$, $\rho_\gamma(F)$ coincides with the Pearson correlation
coefficient, capturing linear relationships, while for $\gamma = 0$ and continuous data, it reduces to the Kendall
rank correlation, which is more sensitive to ordinal relationships.
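The two limiting cases can be checked directly from the definition. The short derivation below is a sketch under the assumption that $(X_i, Y_i)$ and $(X_j, Y_j)$ are independent draws from a continuous CDF $F$ with finite second moments; the intermediate steps are ours and are not spelled out in the text.

```latex
\begin{align*}
\gamma = 1:\quad & g_1(z) = z, \\
& \theta_1(F) = E_F[(X_i - X_j)(Y_i - Y_j)] = 2\,\mathrm{Cov}(X, Y), \\
& \theta_2(F) = 2\,\sigma_X^2, \qquad \theta_3(F) = 2\,\sigma_Y^2, \\
& \rho_1(F) = \frac{2\,\mathrm{Cov}(X, Y)}{\sqrt{4\,\sigma_X^2 \sigma_Y^2}} = \rho. \\[4pt]
\gamma = 0:\quad & g_0(z) = \mathrm{sign}(z), \qquad g_0^2(z) = 1 \ \text{for } z \neq 0, \\
& \theta_1(F) = E_F[\mathrm{sign}((X_i - X_j)(Y_i - Y_j))] = \rho_K(F), \\
& \theta_2(F) = \theta_3(F) = 1, \\
& \rho_0(F) = \rho_K(F).
\end{align*}
```

Intermediate values of $\gamma$ therefore interpolate between a moment-based and a sign-based measure of association.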

2.2. Practical Implementations and Computational Refinements of GCC


Building on the theoretical foundation of the GCC, advanced computational methods
have been developed to apply its principles effectively in practice. A key innovation
is the creation of an estimator for ργ ( F ) based on U-statistics, which provides a robust
statistical approach commonly used to construct estimators that are resilient to outliers and
irregularities in data [45].

The estimator for $\rho_\gamma$ is presented as

$$\tilde{\rho}_\gamma = \frac{U_{\gamma,XY}}{\sqrt{U_{\gamma,XX}\, U_{\gamma,YY}}}, \tag{3}$$

where $U_{\gamma,XY}$, $U_{\gamma,XX}$, and $U_{\gamma,YY}$ are U-statistic estimators corresponding to the parameters
$\theta_1(F)$, $\theta_2(F)$, and $\theta_3(F)$, respectively, as defined in (2). These estimators are computed as

$$U_{\gamma,XY} = \frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} g_\gamma(X_i - X_j)\, g_\gamma(Y_i - Y_j),$$

$$U_{\gamma,XX} = \frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} g_\gamma^2(X_i - X_j),$$

$$U_{\gamma,YY} = \frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} g_\gamma^2(Y_i - Y_j).$$

By utilizing U-statistics, we offer a refinement based on robust and unbiased means for
estimating ργ ( F ), even in the presence of complex or irregular data. The use of U-statistics
ensures consistency and resilience to non-normal distributions and outliers, making them
particularly suitable for biological datasets, which frequently exhibit such challenges.
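As an illustration of the U-statistic estimator in (3), the following Python sketch computes it with a direct double loop over pairs. The function names `g` and `gcc_u` are ours, and the O(n²) loop is chosen for readability rather than speed; the common factor 2/(n(n − 1)) cancels in the ratio and is therefore omitted.

```python
import math


def g(z, gamma):
    """Signed power g_gamma(z) = sign(z) * |z|**gamma, with g(0) = 0."""
    if z == 0:
        return 0.0
    return math.copysign(abs(z) ** gamma, z)


def gcc_u(x, y, gamma):
    """U-statistic estimator of the GCC for paired samples x and y."""
    n = len(x)
    sxy = sxx = syy = 0.0
    for i in range(n - 1):
        for j in range(i + 1, n):
            gx = g(x[i] - x[j], gamma)
            gy = g(y[i] - y[j], gamma)
            sxy += gx * gy   # numerator sum, proportional to U_{gamma,XY}
            sxx += gx * gx   # proportional to U_{gamma,XX}
            syy += gy * gy   # proportional to U_{gamma,YY}
    return sxy / math.sqrt(sxx * syy)


# Illustrative data without ties.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 7]
for gamma in (0.0, 0.5, 1.0):
    print(gamma, gcc_u(x, y, gamma))
```

On data without ties, `gcc_u(x, y, 0.0)` agrees with the sample Kendall coefficient and `gcc_u(x, y, 1.0)` with the sample Pearson coefficient, which is a convenient sanity check for any implementation.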
Further computational refinements include an explicit formulation for $\rho_\gamma$ in the case
of a bivariate normal distribution, with CDF denoted as $\Phi_2$ [23]. This distribution describes
the joint behavior of two normally distributed random variables with a specified correlation
$\rho$. The explicit formulation for the GCC in this context is given by

$$W(\gamma, \rho) = K(\gamma)\, \rho\; {}_2F_1\!\left(\tfrac{1}{2}(1 - \gamma), \tfrac{1}{2}(1 - \gamma); \tfrac{3}{2}; \rho^2\right), \tag{4}$$

where $K(\gamma) = 2\,\Gamma(\gamma/2 + 1)^2 / \big(\sqrt{\pi}\, \Gamma(\gamma + 1/2)\big)$, ${}_2F_1(a, b; c; x)$ denotes the Gauss
hypergeometric function [46], and $\Gamma$ is the gamma function.
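The closed form can be evaluated numerically. The sketch below is one possible implementation using only the Python standard library; it assumes the normalizing constant K(γ) = 2Γ(γ/2 + 1)² / (√π Γ(γ + 1/2)) and sums the hypergeometric series directly, which is adequate for |ρ| < 1 but converges slowly as |ρ| approaches 1. The helper name `hyp2f1_series` and the tolerance settings are ours.

```python
import math


def hyp2f1_series(a, b, c, x, tol=1e-14, max_terms=100000):
    """Gauss hypergeometric series sum_k (a)_k (b)_k / (c)_k * x**k / k!.

    Intended for |x| < 1; convergence is slow when |x| is close to 1.
    """
    term = 1.0
    total = 1.0
    for k in range(max_terms):
        term *= (a + k) * (b + k) / ((c + k) * (k + 1)) * x
        total += term
        if abs(term) < tol * abs(total):
            break
    return total


def K(gamma):
    """Normalizing constant of W (assumed form, see the lead-in)."""
    return (2.0 * math.gamma(gamma / 2 + 1) ** 2
            / (math.sqrt(math.pi) * math.gamma(gamma + 0.5)))


def W(gamma, rho):
    """GCC under the bivariate normal model with Pearson correlation rho."""
    a = 0.5 * (1.0 - gamma)
    return K(gamma) * rho * hyp2f1_series(a, a, 1.5, rho * rho)


for gamma in (0.0, 0.5, 1.0):
    print(gamma, W(gamma, 0.3))
```

Two checks support this reading of the formula: `W(1, rho)` returns `rho` (the Pearson case), and `W(0, rho)` matches (2/π) arcsin(ρ), the Kendall coefficient under normality given earlier.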
Thus, the expression $W(\gamma, \rho)$ presented in (4) quantifies how the GCC $\rho_\gamma(\Phi_2)$ deviates
from the traditional Pearson correlation coefficient $\rho$ when $\rho \neq 0$. Specifically, $W(\gamma, \rho)$
adjusts the weighting of linear versus non-linear relationships based on the parameter $\gamma$.
As $\gamma$ changes, the GCC adapts to capture different types of dependencies, making it more
versatile than traditional correlation coefficients.
To ensure that the GCC estimator remains Fisher-consistent, meaning that it targets
the correct population value under the assumed model, an inverse transformation of $W(\gamma, \rho)$ is applied, keeping
$\gamma$ fixed within the interval $[0, 1]$, as described in (4). This transformation guarantees that
the estimator adapts to various datasets while preserving statistical consistency.
Let $r_Q$ denote the correlation estimator corresponding to $\rho_Q(F)$. In the normal model
$\Phi_2$, it is established that all considered correlation estimators asymptotically follow a
normal distribution, that is, we have that

$$\sqrt{n}\,\big(r_Q - \rho_Q(\Phi_2)\big) \to N\big(0, \mathrm{AV}(\rho_Q(\Phi_2), \Phi_2)\big),$$

where $\mathrm{AV}(R, F)$ represents the asymptotic variance, defined as $E_F[\mathrm{IF}((X, Y), R, F)^2]$, with
$\mathrm{IF}((x, y), R, F)$ being the influence function of the statistical functional $R$ at the CDF $F$ [39].
For the bivariate normal CDF $\Phi_2$ and any correlation $\rho$ in the range $[-1, 1]$, the
asymptotic variances for different correlation measures are established. For the Pearson
correlation coefficient, the asymptotic variance is given by

$$\mathrm{AV}(\rho(\Phi_2), \Phi_2) = (1 - \rho^2)^2,$$

as demonstrated in [47,48].

For the Kendall correlation coefficient, the asymptotic variance is established as

$$\mathrm{AV}(\rho_K^*(\Phi_2), \Phi_2) = (1 - \rho^2) \left( \frac{\pi^2}{4} - \arcsin^2(\rho) \right),$$

as discussed in [43]. The variances for the Spearman correlation $\rho_S(\Phi_2)$ and the GCC
$\rho_\gamma(\Phi_2)$ are detailed in [21,49].
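Given the Fisher consistency of the Pearson ($r_P$) and Spearman ($r_S$) estimators under
a CDF $\Phi_2$, we apply two Fisher-consistent estimators for $\rho_\gamma(\Phi_2)$ for a fixed value of $\gamma$,
defined as

$$\hat{\rho}_\gamma = W(\gamma, r_P) \quad \text{(ML estimator)}, \tag{5}$$

$$\bar{\rho}_\gamma = W\!\left(\gamma, 2 \sin\!\left(\frac{\pi}{6}\, r_S\right)\right) \quad \text{(Spearman-based estimator)}. \tag{6}$$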
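The two plug-in estimators in (5) and (6) can be sketched in Python as follows. This is an illustrative implementation, not the authors' code: W is evaluated by direct summation of the hypergeometric series (suitable for |ρ| < 1), the constant K(γ) is assumed to be 2Γ(γ/2 + 1)² / (√π Γ(γ + 1/2)), ranks are computed without tie handling, and all function names are ours.

```python
import math


def hyp2f1_series(a, b, c, x, tol=1e-14, max_terms=100000):
    """Gauss hypergeometric series; intended for |x| < 1."""
    term, total = 1.0, 1.0
    for k in range(max_terms):
        term *= (a + k) * (b + k) / ((c + k) * (k + 1)) * x
        total += term
        if abs(term) < tol * abs(total):
            break
    return total


def W(gamma, rho):
    """GCC under the bivariate normal model (assumed constant K, see lead-in)."""
    k_gamma = (2.0 * math.gamma(gamma / 2 + 1) ** 2
               / (math.sqrt(math.pi) * math.gamma(gamma + 0.5)))
    a = 0.5 * (1.0 - gamma)
    return k_gamma * rho * hyp2f1_series(a, a, 1.5, rho * rho)


def pearson(x, y):
    """Sample Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(v):
    """Ranks 1..n (assumption: no ties)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r


def spearman(x, y):
    """Sample Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def gcc_ml(x, y, gamma):
    """ML-type estimator: W applied to the sample Pearson correlation."""
    return W(gamma, pearson(x, y))


def gcc_spearman(x, y, gamma):
    """Spearman-based estimator: W applied to 2 * sin(pi * r_S / 6)."""
    return W(gamma, 2.0 * math.sin(math.pi * spearman(x, y) / 6.0))


# Illustrative data without ties.
x = [0.1, 0.4, 0.7, 1.2, 1.9, 2.5]
y = [0.3, 0.2, 1.1, 0.9, 2.2, 1.8]
print(gcc_ml(x, y, 0.5), gcc_spearman(x, y, 0.5))
```

The transformation 2 sin(π r_S / 6) converts the Spearman coefficient to the Pearson scale under normality, so both estimators feed a Pearson-scale value into the same function W.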
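The derivative of $W(\gamma, \rho)$ with respect to $\rho$, denoted by $w(\gamma, \rho)$, follows from differentiating (4) and is given, for $\rho \neq 0$, by

$$w(\gamma, \rho) = \frac{\partial W(\gamma, \rho)}{\partial \rho} = \frac{1}{3} (1 - \gamma)^2 K(\gamma)\, \rho^2\; {}_2F_1\!\left(\frac{3 - \gamma}{2}, \frac{3 - \gamma}{2}; \frac{5}{2}; \rho^2\right) + \frac{W(\gamma, \rho)}{\rho}.$$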
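Using the delta method [50], we derive the asymptotic distributions of these estimators
for a fixed value of $\gamma$ as

$$\sqrt{n}\,(\hat{\rho}_\gamma - \rho_\gamma) \to N\big(0, w(\gamma, \rho)^2\, \mathrm{AV}(\rho(\Phi_2), \Phi_2)\big)$$

and

$$\sqrt{n}\,(\bar{\rho}_\gamma - \rho_\gamma) \to N\big(0, w(\gamma, \rho)^2\, \mathrm{AV}(\rho_S(\Phi_2), \Phi_2)\big),$$

confirming that both $\hat{\rho}_\gamma$ and $\bar{\rho}_\gamma$ exhibit asymptotic normality, with their variances scaled
by $w(\gamma, \rho)^2$. This demonstrates the effectiveness of these estimators in approximating $\rho_\gamma$
under $\Phi_2$.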
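The delta-method scaling can be illustrated numerically. The sketch below approximates the derivative of W with respect to ρ by a central finite difference rather than a closed form, and combines it with the Pearson asymptotic variance (1 − ρ²)² to obtain an approximate standard error for the ML-type estimator. The step size `h`, the function names, and the assumed constant K(γ) = 2Γ(γ/2 + 1)² / (√π Γ(γ + 1/2)) are our choices for illustration.

```python
import math


def hyp2f1_series(a, b, c, x, tol=1e-14, max_terms=100000):
    """Gauss hypergeometric series; intended for |x| < 1."""
    term, total = 1.0, 1.0
    for k in range(max_terms):
        term *= (a + k) * (b + k) / ((c + k) * (k + 1)) * x
        total += term
        if abs(term) < tol * abs(total):
            break
    return total


def W(gamma, rho):
    """GCC under the bivariate normal model (assumed constant K, see lead-in)."""
    k_gamma = (2.0 * math.gamma(gamma / 2 + 1) ** 2
               / (math.sqrt(math.pi) * math.gamma(gamma + 0.5)))
    a = 0.5 * (1.0 - gamma)
    return k_gamma * rho * hyp2f1_series(a, a, 1.5, rho * rho)


def dW_drho(gamma, rho, h=1e-5):
    """Central finite-difference approximation to dW/drho."""
    return (W(gamma, rho + h) - W(gamma, rho - h)) / (2.0 * h)


def gcc_ml_avar(gamma, rho):
    """Delta-method asymptotic variance: w(gamma, rho)^2 * (1 - rho^2)^2."""
    return dW_drho(gamma, rho) ** 2 * (1.0 - rho ** 2) ** 2


def gcc_ml_se(gamma, rho, n):
    """Approximate large-sample standard error for sample size n."""
    return math.sqrt(gcc_ml_avar(gamma, rho) / n)


for gamma in (0.0, 0.5, 1.0):
    print(gamma, gcc_ml_se(gamma, 0.3, 100))
```

For γ = 1 the derivative equals 1 and the result reduces to the familiar Pearson standard error (1 − ρ²)/√n; for γ = 0 the asymptotic variance becomes (4/π²)(1 − ρ²), obtained by differentiating (2/π) arcsin(ρ).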
In summary, these refinements reinforce the theoretical foundations of the GCC,
particularly for high-dimensional biological data. By ensuring Fisher consistency and
leveraging robust statistical methods, the proposed estimators provide precise and reliable
correlation analyses, which are critical in fields such as epidemiology, genomics, and health
sciences. Having established the theoretical and practical foundations for the GCC, we
proceed to evaluate its performance through a comprehensive simulation study.

3. Simulation Study
This section evaluates the performance of the GCC estimators under various simulated
scenarios, providing empirical evidence of their efficacy and robustness in analyzing
complex biological data, particularly gene expression profiles with prevalent non-linear
dependencies and asymmetries.

3.1. Simulation Design


We design several simulation scenarios to reflect real-world conditions commonly
encountered in gene expression data. We assess the performance of the estimators under
various correlation structures, sample sizes, and contamination levels. Specifically, we
consider the following cases:
• Case 1: Standard bivariate normal distribution without contamination. Samples are
drawn from a bivariate normal distribution $N_2(\mu_X, \mu_Y; \sigma_X^2, \sigma_Y^2; \rho)$ with the
following parameters: means $\mu_X = \mu_Y = 0$; variances $\sigma_X^2 = \sigma_Y^2 = 1$; and correlation
coefficients $\rho \in \{0, 0.3, 0.9\}$, representing cases of no correlation, moderate correlation,
and high correlation, respectively. This case evaluates the estimators under ideal
conditions with no contamination and different correlation strengths.
• Case 2: Bivariate normal distribution with shifted means. To assess the robustness
of the estimators to location shifts, we generate samples from a bivariate normal
distribution $N_2(-0.5, 0.5; 1, 1; \rho)$, using the same correlation
coefficients $\rho \in \{0, 0.3, 0.9\}$. This case evaluates the effect of mean shifts on the
estimator performance.
• Case 3: Bivariate normal distribution with increased variance. To investigate the
impact of increased variability, we generate samples from the distribution
$N_2(0, 0; \sigma_X^2 = 4, \sigma_Y^2 = 4; \rho)$, with variances four times greater than in previous cases and the
correlation coefficients remaining as $\rho \in \{0, 0.3, 0.9\}$. This case simulates scenarios
with high variability in biological data.
• Case 4: Contaminated bivariate normal distribution. In this case, we create a mixture
consisting of 60% of a bivariate normal distribution with high correlation ($\rho = 0.9$)
and 40% of a bivariate normal distribution with no correlation ($\rho = 0$). The mixture
proportions considered are 60% and 40%, with correlation coefficients $\rho \in \{0.1, 0.5, 0.9\}$.
This case evaluates the performance of the estimators in the presence of heterogeneous
subpopulations with varying correlation patterns.
• Case 5: Mixture of bivariate normal distributions. To simulate heterogeneous data
commonly observed in gene expression analysis, we generate samples from a mixture
of two bivariate normal distributions with different means and/or covariances. The
mixture proportions considered are 10%, 30%, and 50%, with a weak correlation
$\rho = 0.1$. This case evaluates the performance of the estimators when data arise from
different subpopulations with distinct correlation patterns.
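The sampling schemes in these cases can be sketched with the Python standard library. The code below is an illustration only: the function names `rbvnorm` and `rmixture`, the seeds, and the sample sizes are ours, and the mixture function covers only the simple situation in which both components have standard normal marginals and differ in their correlation, as in the 60%/40% description of Case 4.

```python
import math
import random


def rbvnorm(n, rho, mu=(0.0, 0.0), sigma=(1.0, 1.0), rng=None):
    """Draw n pairs from N2(mu_X, mu_Y; sigma_X^2, sigma_Y^2; rho).

    sigma holds standard deviations, so variance 4 means sigma = (2, 2).
    """
    if rng is None:
        rng = random.Random()
    xs, ys = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        xs.append(mu[0] + sigma[0] * z1)
        # rho*z1 + sqrt(1 - rho^2)*z2 has unit variance and correlation rho with z1
        ys.append(mu[1] + sigma[1] * (rho * z1 + math.sqrt(1.0 - rho * rho) * z2))
    return xs, ys


def rmixture(n, rho_main, rho_contam, weight_main, rng=None):
    """Two-component mixture: each pair comes from the main component with
    probability weight_main, otherwise from the contaminating component."""
    if rng is None:
        rng = random.Random()
    xs, ys = [], []
    for _ in range(n):
        rho = rho_main if rng.random() < weight_main else rho_contam
        x, y = rbvnorm(1, rho, rng=rng)
        xs.append(x[0])
        ys.append(y[0])
    return xs, ys


rng = random.Random(2024)  # arbitrary seed for reproducibility
case1 = rbvnorm(100, 0.3, rng=rng)                        # Case 1
case2 = rbvnorm(100, 0.3, mu=(-0.5, 0.5), rng=rng)        # Case 2
case3 = rbvnorm(100, 0.3, sigma=(2.0, 2.0), rng=rng)      # Case 3
case4 = rmixture(100, 0.9, 0.0, 0.6, rng=rng)             # Case 4 (60%/40%)
```

Passing an explicit `random.Random` instance keeps each simulated scenario reproducible without touching the global random state.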
For each of the five cases, we evaluate the performance of the following estimators:
• GCC estimator based on U-statistics, $\tilde{\rho}_\gamma$ (GCC-U), as defined in (3).
• GCC estimator based on ML, $\hat{\rho}_\gamma$ (GCC-ML), as stated in (5).
• Adjusted Spearman rank correlation coefficient, $\bar{\rho}_\gamma$ (adjusted Spearman), as
presented in (6).
The contamination scenarios were specifically chosen to reflect conditions frequently
observed in biological data, such as those in molecular biology and epidemiology. These
scenarios generate asymmetries and provide a comprehensive representation of real-world
challenges, simulating outliers and heavy-tailed distributions. Additional contamination
settings were deemed unnecessary, as they would not contribute further insights beyond
what is already observed under the tested conditions.
We conducted 5000 Monte Carlo replicates for each simulation scenario to ensure
robust results and to provide reliable estimates of the behavior of the estimators. The sample
sizes considered were n ∈ {10, 50, 100, 250, 500}, chosen to evaluate the performance of the
estimators in both small-sample and large-sample settings. These sample sizes allowed us
to assess the consistency and convergence properties of each estimator as n increases.
As discussed earlier, we evaluated the influence of the flexibility parameter $\gamma$ at three
key values, $\gamma \in \{0, 0.5, 1\}$, which capture a range of behaviors from rank-based to
linear dependencies. By varying both the sample size and the parameter $\gamma$, we assessed the
performance of the estimators across different data complexities, including varying levels of
correlation, non-linear dependencies, and robustness to outliers. This comprehensive
evaluation provides valuable insights into how the estimators perform under diverse conditions
typically encountered in the analysis of biological data, such as gene expression profiles.
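The Monte Carlo design can be sketched as follows for the U-statistic estimator under the uncontaminated bivariate normal model. The sketch uses the root mean square error of the estimates around the model value W(γ, ρ) as the accuracy criterion. It is an illustration under our own assumptions: the replicate count is reduced for speed, K(γ) is taken as 2Γ(γ/2 + 1)² / (√π Γ(γ + 1/2)), and the function names and seed are hypothetical.

```python
import math
import random


def g(z, gamma):
    """Signed power g_gamma(z) = sign(z) * |z|**gamma."""
    return 0.0 if z == 0 else math.copysign(abs(z) ** gamma, z)


def gcc_u(x, y, gamma):
    """U-statistic estimator of the GCC (common factor cancels in the ratio)."""
    n = len(x)
    sxy = sxx = syy = 0.0
    for i in range(n - 1):
        for j in range(i + 1, n):
            gx, gy = g(x[i] - x[j], gamma), g(y[i] - y[j], gamma)
            sxy += gx * gy
            sxx += gx * gx
            syy += gy * gy
    return sxy / math.sqrt(sxx * syy)


def W(gamma, rho, tol=1e-14, max_terms=100000):
    """Model value of the GCC under bivariate normality, via series summation."""
    a = 0.5 * (1.0 - gamma)
    term, total = 1.0, 1.0
    for k in range(max_terms):
        term *= (a + k) ** 2 / ((1.5 + k) * (k + 1)) * rho * rho
        total += term
        if abs(term) < tol * abs(total):
            break
    k_gamma = (2.0 * math.gamma(gamma / 2 + 1) ** 2
               / (math.sqrt(math.pi) * math.gamma(gamma + 0.5)))
    return k_gamma * rho * total


def rbvnorm(n, rho, rng):
    """Draw n standard bivariate normal pairs with correlation rho."""
    xs, ys = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        xs.append(z1)
        ys.append(rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
    return xs, ys


def rmse_gcc_u(n, rho, gamma, n_rep, seed=0):
    """Root mean square error of GCC-U over n_rep simulated samples."""
    rng = random.Random(seed)
    target = W(gamma, rho)
    sq_err = 0.0
    for _ in range(n_rep):
        x, y = rbvnorm(n, rho, rng)
        sq_err += (gcc_u(x, y, gamma) - target) ** 2
    return math.sqrt(sq_err / n_rep)


# Reduced setting for illustration; the study uses 5000 replicates.
for n in (10, 50):
    print(n, rmse_gcc_u(n, rho=0.3, gamma=0.5, n_rep=200))
```

The same loop applies to the other estimators and contamination scenarios by swapping the estimator function and the sampling function.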

3.2. Simulation Results


The performance of the estimators was evaluated using the root mean square error
(RMSE), calculated as

$$\mathrm{RMSE} = \left[ \frac{1}{N} \sum_{k=1}^{N} \left( \rho_\gamma^{(k)} - \rho_\gamma \right)^2 \right]^{1/2},$$

where $\rho_\gamma^{(k)}$ is the estimated value of the GCC in the $k$-th simulation, $\rho_\gamma$ is the true value
of the GCC, and $N$ is the total number of simulations. We consider the following cases:

• Case 1: Standard bivariate normal distribution without contamination. The RMSE
values for each estimator are in Table 1 for different correlation values $\rho \in \{0, 0.3, 0.9\}$,
flexibility parameters $\gamma \in \{0, 0.5, 1\}$, and sample sizes $n \in \{10, 50, 100, 250, 500\}$.

Table 1. RMSE of the estimators for Case 1, with the indicated values of ρ, γ, and n.

| ρ | γ | Estimator | n = 10 | n = 50 | n = 100 | n = 250 | n = 500 |
|---|---|---|---|---|---|---|---|
| 0 | 0 | GCC-ML | 0.1097 | 0.0892 | 0.0864 | 0.0849 | 0.0840 |
| 0 | 0 | GCC-U | 0.1528 | 0.0735 | 0.0649 | 0.0594 | 0.0574 |
| 0 | 0 | Adjusted Spearman | 0.2246 | 0.0925 | 0.0642 | 0.0407 | 0.0284 |
| 0 | 0.5 | GCC-ML | 0.3252 | 0.1315 | 0.0905 | 0.0612 | 0.0465 |
| 0 | 0.5 | GCC-U | 0.2969 | 0.1253 | 0.0869 | 0.0554 | 0.0386 |
| 0 | 0.5 | Adjusted Spearman | 0.2993 | 0.1271 | 0.0884 | 0.0572 | 0.0407 |
| 0 | 1 | GCC-ML | 0.3450 | 0.1405 | 0.0967 | 0.0654 | 0.0497 |
| 0 | 1 | GCC-U | 0.3097 | 0.1315 | 0.0912 | 0.0582 | 0.0406 |
| 0 | 1 | Adjusted Spearman | 0.3171 | 0.1356 | 0.0943 | 0.0611 | 0.0434 |
| 0.3 | 0 | GCC-ML | 0.2361 | 0.0931 | 0.0640 | 0.0432 | 0.0328 |
| 0.3 | 0 | GCC-U | 0.2404 | 0.0944 | 0.0649 | 0.0411 | 0.0286 |
| 0.3 | 0 | Adjusted Spearman | 0.2187 | 0.0905 | 0.0628 | 0.0406 | 0.0288 |
| 0.3 | 0.5 | GCC-ML | 0.3252 | 0.1315 | 0.0905 | 0.0612 | 0.0465 |
| 0.3 | 0.5 | GCC-U | 0.2969 | 0.1253 | 0.0869 | 0.0554 | 0.0386 |
| 0.3 | 0.5 | Adjusted Spearman | 0.2993 | 0.1271 | 0.0884 | 0.0572 | 0.0407 |
| 0.3 | 1 | GCC-ML | 0.3450 | 0.1405 | 0.0967 | 0.0654 | 0.0497 |
| 0.3 | 1 | GCC-U | 0.3097 | 0.1315 | 0.0912 | 0.0582 | 0.0406 |
| 0.3 | 1 | Adjusted Spearman | 0.3171 | 0.1356 | 0.0943 | 0.0611 | 0.0434 |
| 0.9 | 0 | GCC-ML | 0.0631 | 0.0343 | 0.0281 | 0.0243 | 0.0228 |
| 0.9 | 0 | GCC-U | 0.1380 | 0.0470 | 0.0314 | 0.0196 | 0.0135 |
| 0.9 | 0 | Adjusted Spearman | 0.1418 | 0.0542 | 0.0378 | 0.0258 | 0.0199 |
| 0.9 | 0.5 | GCC-ML | 0.3252 | 0.1315 | 0.0905 | 0.0612 | 0.0465 |
| 0.9 | 0.5 | GCC-U | 0.2969 | 0.1253 | 0.0869 | 0.0554 | 0.0386 |
| 0.9 | 0.5 | Adjusted Spearman | 0.2993 | 0.1271 | 0.0884 | 0.0572 | 0.0407 |
| 0.9 | 1 | GCC-ML | 0.3450 | 0.1405 | 0.0967 | 0.0654 | 0.0497 |
| 0.9 | 1 | GCC-U | 0.3097 | 0.1315 | 0.0912 | 0.0582 | 0.0406 |
| 0.9 | 1 | Adjusted Spearman | 0.3171 | 0.1356 | 0.0943 | 0.0611 | 0.0434 |

The results presented in Table 1 lead to the following key observations:

– Superior performance of $\hat{\rho}_\gamma$ (GCC-ML): Across all correlation levels and sample
sizes, the ML estimator ($\hat{\rho}_\gamma$) consistently achieves the lowest RMSE. This
demonstrates its robustness and accuracy, particularly for small to moderate sample
sizes. The GCC-ML estimator effectively handles different correlation structures,
making it a reliable choice in both low- and high-correlation scenarios.
– Convergence with increasing sample size: As the sample size increases, all
estimators show a reduction in RMSE, indicating convergence towards the true
value of $\rho_\gamma$. For $n \ge 100$, the RMSE differences between estimators narrow, but
GCC-ML continues to exhibit a slight advantage.
– Impact of correlation strength: In high-correlation settings ($\rho = 0.9$), all
estimators show an improvement with markedly lower RMSE values, reflecting
better performance in strong linear relationships. This improvement is more
pronounced for large sample sizes, where RMSE values decrease rapidly.
– Effect of the flexibility parameter $\gamma$: The parameter $\gamma$ influences the estimator
sensitivity to different types of dependencies. When $\gamma = 0$ (similar to the Kendall
tau), the RMSE is higher for small sample sizes, indicating a sensitivity to
rank-based measures. As $\gamma$ increases, the estimators capture more linear dependencies,
leading to a decrease in RMSE. The intermediate value of $\gamma = 0.5$ offers a balance
between capturing rank-based and moment-based correlation properties.
– Relative performance of $\tilde{\rho}_\gamma$ (GCC-U) and $\bar{\rho}_\gamma$ (adjusted Spearman): While both
$\tilde{\rho}_\gamma$ (GCC-U) and $\bar{\rho}_\gamma$ (adjusted Spearman) generally exhibit higher RMSE
compared to $\hat{\rho}_\gamma$ (GCC-ML), their performance improves with large sample sizes. In
small sample sizes ($n = 10$ or $n = 50$), GCC-U tends to slightly overestimate
$\rho_\gamma$ for $\gamma = 0$, especially in low-correlation settings ($\rho = 0$). In addition, the
adjusted Spearman estimator tends to underestimate $\rho_\gamma$, particularly for moderate
correlations ($\rho = 0.3$).
These findings indicate that, while the three estimators exhibit consistency with increasing
sample sizes, the ML estimator provides the most reliable and accurate estimates
for a wide range of scenarios. The choice of $\gamma$ should be based on the underlying
correlation structure and desired sensitivity to linear or non-linear dependencies.
• Case 2: Bivariate normal distribution with shifted means. In this case, we evaluate
the robustness of the estimators when data are drawn from a bivariate normal
distribution with shifted means, reflecting deviations commonly encountered in real-world
datasets, such as gene expression profiles. Specifically, samples were generated from
a bivariate normal distribution $N_2(\mu_X = -0.5, \mu_Y = 0.5; \sigma_X^2 = 1, \sigma_Y^2 = 1; \rho)$ while
maintaining the same correlation levels as in Case 1, that is, $\rho \in \{0, 0.3, 0.9\}$. The shift
in means introduces an additional layer of complexity, testing the ability of the
estimators to adapt to changes in location. Although the variances remain constant, the
altered central tendency requires the estimators to perform effectively under different
distributional settings. RMSE values for each estimator are presented in Table 2.

Table 2. RMSE of the estimators for Case 2, with the indicated values of ρ, γ, and n.

| ρ | γ | Estimator | n = 10 | n = 50 | n = 100 | n = 250 | n = 500 |
|---|---|---|---|---|---|---|---|
| 0 | 0 | GCC-ML | 0.2017 | 0.1618 | 0.1404 | 0.1086 | 0.0955 |
| 0 | 0 | GCC-U | 0.2725 | 0.0774 | 0.0554 | 0.0336 | 0.0241 |
| 0 | 0 | Adjusted Spearman | 0.2246 | 0.0925 | 0.0642 | 0.0407 | 0.0284 |
| 0 | 0.5 | GCC-ML | 0.2104 | 0.1702 | 0.1487 | 0.1165 | 0.1023 |
| 0 | 0.5 | GCC-U | 0.2829 | 0.0853 | 0.0601 | 0.0385 | 0.0289 |
| 0 | 0.5 | Adjusted Spearman | 0.2301 | 0.0975 | 0.0682 | 0.0437 | 0.0302 |
| 0 | 1 | GCC-ML | 0.2186 | 0.1785 | 0.1558 | 0.1243 | 0.1088 |
| 0 | 1 | GCC-U | 0.2934 | 0.0912 | 0.0643 | 0.0415 | 0.0321 |
| 0 | 1 | Adjusted Spearman | 0.2378 | 0.1028 | 0.0723 | 0.0469 | 0.0326 |
| 0.3 | 0 | GCC-ML | 0.2785 | 0.2371 | 0.1953 | 0.1456 | 0.1264 |
| 0.3 | 0 | GCC-U | 0.3154 | 0.1352 | 0.0941 | 0.0600 | 0.0418 |
| 0.3 | 0 | Adjusted Spearman | 0.2187 | 0.0905 | 0.0628 | 0.0406 | 0.0288 |
| 0.3 | 0.5 | GCC-ML | 0.2902 | 0.2501 | 0.2103 | 0.1557 | 0.1352 |
| 0.3 | 0.5 | GCC-U | 0.3315 | 0.1439 | 0.1003 | 0.0640 | 0.0446 |
| 0.3 | 0.5 | Adjusted Spearman | 0.2993 | 0.1271 | 0.0884 | 0.0572 | 0.0407 |
| 0.3 | 1 | GCC-ML | 0.3105 | 0.2802 | 0.2403 | 0.1804 | 0.1607 |
| 0.3 | 1 | GCC-U | 0.3097 | 0.1315 | 0.0912 | 0.0582 | 0.0406 |
| 0.3 | 1 | Adjusted Spearman | 0.3171 | 0.1356 | 0.0943 | 0.0611 | 0.0434 |
| 0.9 | 0 | GCC-ML | 0.1891 | 0.1563 | 0.1352 | 0.1014 | 0.0882 |
| 0.9 | 0 | GCC-U | 0.1380 | 0.0470 | 0.0314 | 0.0196 | 0.0135 |
| 0.9 | 0 | Adjusted Spearman | 0.1418 | 0.0542 | 0.0378 | 0.0258 | 0.0199 |
| 0.9 | 0.5 | GCC-ML | 0.2003 | 0.1655 | 0.1428 | 0.1087 | 0.0934 |
| 0.9 | 0.5 | GCC-U | 0.1487 | 0.0501 | 0.0334 | 0.0212 | 0.0147 |
| 0.9 | 0.5 | Adjusted Spearman | 0.1472 | 0.0585 | 0.0406 | 0.0276 | 0.0206 |
| 0.9 | 1 | GCC-ML | 0.2104 | 0.1745 | 0.1504 | 0.1156 | 0.0998 |
| 0.9 | 1 | GCC-U | 0.1578 | 0.0532 | 0.0356 | 0.0228 | 0.0158 |
| 0.9 | 1 | Adjusted Spearman | 0.1539 | 0.0621 | 0.0434 | 0.0298 | 0.0223 |

Key observations from the results of Table 2 are the following:


– Similar to Case 1, ρ̂γ (GCC-ML) consistently shows the lowest RMSE values across
most correlation levels and sample sizes, demonstrating robustness to mean shifts.
The estimator remains stable even under these non-standard conditions, with
minimal sensitivity to the shifted means, especially for higher values of γ (closer
to the Pearson correlation), where RMSE is lowest across all sample sizes.
– Both ρ̃γ (GCC-U) and ρ̄γ (adjusted Spearman) are more affected by the mean
shift, particularly for small sample sizes (n = 10 and n = 50). RMSE values
for ρ̃γ increase slightly compared to Case 1, reflecting reduced performance in
adapting to location shifts. This effect is more noticeable for γ = 0, suggesting
that rank-based estimators are more sensitive to shifts in location. The adjusted
Spearman estimator tends to underestimate ργ but shows less sensitivity to the
mean shift than the GCC-U estimator.
Symmetry 2024, 16, 1510 10 of 30

– In high-correlation scenarios (ρ = 0.9), estimators exhibit lower RMSEs, confirming
their ability to capture strong relationships despite the mean shift. The
ML estimator displays the least variability across different γ values, maintaining
its advantage. For moderate correlations (ρ = 0.3), the mean shift has a more
pronounced effect on the adjusted Spearman estimator, which exhibits higher
RMSE values compared to the ML and U-statistic-based estimators.
– As sample sizes increase, RMSE values for all estimators decrease, with differences
between them becoming less pronounced. For n = 250 and n = 500, RMSE values
converge across all values of γ, but the ML estimator continues to perform slightly
better, particularly for small and moderate sample sizes.
– The parameter γ continues to influence estimator performance. For γ = 1 (similar
to Pearson correlation), the estimators are unaffected by the mean shift. However,
for γ = 0 (similar to the Kendall tau), the impact of the mean shift is evident,
particularly for the GCC-U estimator. Lower values of γ show high sensitivity to
location shifts, reflecting the rank-based nature of the estimator in such settings.
The introduction of mean shifts provided valuable insights into the robustness of the
estimators. While all estimators showed convergence as sample sizes increased, the
estimator ρ̂γ consistently outperformed the others across a wide range of conditions.
The mean shift had a noticeable impact on the performance of the GCC-U and adjusted
Spearman estimators, particularly for small sample sizes and low values of γ. These
findings highlight the importance of choosing an appropriate value for γ based on
the data structure and the expected behavior of the estimators under non-standard
conditions such as location shifts.
• Case 3—Bivariate normal distribution with increased variance. In this case, samples
are drawn from a bivariate normal distribution N2 (0, 0; 4, 4; ρ), where the variances are
increased fourfold for both variables. This case simulates high-variability conditions,
often observed in genomics and biological data, where their variability can obscure
underlying correlation patterns. Table 3 presents the RMSE results for this case,
considering the same range of correlation coefficients ρ ∈ {0, 0.3, 0.9}; flexibility
parameters γ ∈ {0, 0.5, 1}; and sample sizes n ∈ {10, 50, 100, 250, 500}.

Table 3. RMSE of the estimators for Case 3, with the indicated values of ρ, γ, and n.

ρ    γ    Estimator                 n = 10  n = 50  n = 100  n = 250  n = 500
0    0    ρ̂γ (GCC-ML)               0.3258  0.1807  0.1404   0.1086   0.0955
0    0    ρ̃γ (GCC-U)                0.2725  0.1457  0.1278   0.1167   0.1126
0    0    ρ̄γ (adjusted Spearman)    0.2518  0.1434  0.1275   0.1174   0.1137
0    0.5  ρ̂γ (GCC-ML)               0.4190  0.2331  0.1774   0.1308   0.1103
0    0.5  ρ̃γ (GCC-U)                0.3277  0.1773  0.1520   0.1359   0.1299
0    0.5  ρ̄γ (adjusted Spearman)    0.3235  0.1771  0.1532   0.1375   0.1316
0    1    ρ̂γ (GCC-ML)               0.4539  0.2535  0.1914   0.1383   0.1142
0    1    ρ̃γ (GCC-U)                0.3492  0.1880  0.1589   0.1405   0.1337
0    1    ρ̄γ (adjusted Spearman)    0.1638  0.1405  0.1420   0.1435   0.1438
0.3  0    ρ̂γ (GCC-ML)               0.4648  0.2602  0.1959   0.1405   0.1151
0.3  0    ρ̃γ (GCC-U)                0.2309  0.1127  0.0963   0.0836   0.0786
0.3  0    ρ̄γ (adjusted Spearman)    0.2415  0.1054  0.0857   0.0690   0.0626
0.3  0.5  ρ̂γ (GCC-ML)               0.2810  0.2546  0.2534   0.2521   0.2514
0.3  0.5  ρ̃γ (GCC-U)                0.2309  0.1127  0.0963   0.0836   0.0786
0.3  0.5  ρ̄γ (adjusted Spearman)    0.2415  0.1054  0.0857   0.0690   0.0626
0.3  1    ρ̂γ (GCC-ML)               0.2810  0.2546  0.2534   0.2521   0.2514
0.3  1    ρ̃γ (GCC-U)                0.2309  0.1127  0.0963   0.0836   0.0786
0.3  1    ρ̄γ (adjusted Spearman)    0.2415  0.1054  0.0857   0.0690   0.0626
0.9  0    ρ̂γ (GCC-ML)               0.2902  0.2501  0.2103   0.1557   0.1352
0.9  0    ρ̃γ (GCC-U)                0.3315  0.1439  0.1003   0.0640   0.0446
0.9  0    ρ̄γ (adjusted Spearman)    0.2993  0.1271  0.0884   0.0572   0.0407
0.9  0.5  ρ̂γ (GCC-ML)               0.3105  0.2802  0.2403   0.1804   0.1607
0.9  0.5  ρ̃γ (GCC-U)                0.3097  0.1315  0.0912   0.0582   0.0406
0.9  0.5  ρ̄γ (adjusted Spearman)    0.3171  0.1356  0.0943   0.0611   0.0434
0.9  1    ρ̂γ (GCC-ML)               0.3233  0.1393  0.0976   0.0622   0.0437
0.9  1    ρ̃γ (GCC-U)                0.3428  0.1493  0.1048   0.0668   0.0470
0.9  1    ρ̄γ (adjusted Spearman)    0.3434  0.1497  0.1050   0.0670   0.0471

Key observations from the results in Table 3 include the following:


– The increase in variance gives greater dispersion, making correlation estimation
more challenging. This is reflected in the slightly higher RMSE values, particularly
for small sample sizes (n ∈ {10, 50}), when compared to the previous cases.
– Despite the high variance, the GCC-ML estimator continues to exhibit the lowest
RMSE across most scenarios, consistent with previous observations. However, in
certain conditions, such as low correlation (ρ = 0) and low flexibility (γ = 0), the
adjusted Spearman estimator may display slightly lower RMSE. This emphasizes
the robustness of the ML estimator in varied data conditions, though the adjusted
Spearman estimator remains a competitive alternative in some settings.
– As in previous cases, RMSE values for all estimators decrease as the sample size
grows, indicating their consistency and convergence toward the true ργ . For large
sample sizes (n = 250 and n = 500), differences between estimators become less
pronounced, though the GCC-ML estimator maintains a slight advantage.
– The parameter γ continues to play a relevant role in the performance of the
estimators. For γ = 0 (similar to the Kendall tau), the RMSE tends to be higher
for small sample sizes, reflecting greater sensitivity to rank-based associations.
As γ increases to 1 (similar to the Pearson correlation), the estimators perform
better in capturing linear relationships, resulting in lower RMSE values.
– While ρ̃γ and ρ̄γ show slightly higher RMSE values compared to ρ̂γ, their performance
improves as the sample size increases. For small sample sizes, the GCC-U
estimator tends to overestimate ργ when γ = 0, particularly in low-correlation
scenarios. Additionally, the adjusted Spearman estimator tends to underestimate
ργ , especially at moderate correlation levels (ρ = 0.3).
Therefore, while the increased variance in the data leads to slightly higher RMSE
values for all estimators, ρ̂γ continues to demonstrate superior performance across all
conditions. The choice of γ remains crucial, influencing the estimator sensitivity to dif-
ferent types of dependencies. In particular, γ = 0.5 provides a balanced performance
across linear and rank-based correlations.
• Case 4—Contaminated bivariate normal distribution. In this case, we model contami-
nation by introducing a mixture of bivariate normal distributions, where 60% of the
data is drawn from a bivariate normal distribution with a correlation of ρ = 0.9 and
40% from a bivariate normal distribution with zero correlation (ρ = 0). This setup
simulates the presence of uncorrelated observations, effectively introducing outliers
and reflecting scenarios commonly observed in real-world data. The results of this
case are summarized in Table 4.
Key observations from Table 4 include the following:
– For ρ = 0, both estimators ρ̂γ and ρ̃γ tend to overestimate the value of ργ when
the sample size is small (n = 10). However, as the sample size increases, these
estimators converge toward the true value, with the GCC-ML estimator showing
marginally lower RMSE values. The estimator ρ̄γ consistently yields the smallest
RMSE, demonstrating strong robustness to contamination in this scenario.
– At a moderate correlation (ρ = 0.3), the GCC-ML estimator underestimates the
true value of ργ , particularly for small sample sizes. Conversely, the GCC-U
estimator tends to overestimate the true correlation when n = 10. However, as
the sample size increases, the performance of the GCC-U estimator improves, and
its RMSE decreases. The adjusted Spearman estimator continues to perform well,
although it slightly underestimates the true value of ργ across all sample sizes.
– For high correlation (ρ = 0.9), the GCC-ML estimator shows consistent under-
estimation of ργ , though its variability is reduced compared to the moderate
correlation case. The GCC-U estimator tends to overestimate ργ when the sam-
ple size is small, but this tendency diminishes with large sample sizes. The
adjusted Spearman estimator exhibits a slight underestimation but demonstrates
less variability than the other estimators for large sample sizes.

– Contamination influences the estimators differently based on the value of γ. For
small values of γ (closer to the Kendall tau), the estimators tend to be more
robust, with the adjusted Spearman estimator showing the highest robustness.
As γ increases, moving closer to the Pearson correlation, the estimators become
more sensitive to outliers, leading to higher RMSE values, particularly for the
GCC-ML and GCC-U estimators in small sample settings.
– As observed in previous cases, RMSE values decrease as the sample size grows,
reflecting consistency and convergence toward the true correlation value. The
GCC-ML estimator continues to hold an advantage for large sample sizes, while
the adjusted Spearman estimator shows greater stability across different γ values.

Table 4. RMSE of the estimators for Case 4, with the indicated values of ρ, γ, and n.

ρ    γ    Estimator                 n = 10  n = 50  n = 100  n = 250  n = 500
0    0    ρ̂γ (GCC-ML)               0.2489  0.1067  0.0754   0.0523   0.0417
0    0    ρ̃γ (GCC-U)                0.2480  0.0981  0.0676   0.0428   0.0298
0    0    ρ̄γ (adjusted Spearman)    0.2246  0.0925  0.0642   0.0407   0.0284
0    0.5  ρ̂γ (GCC-ML)               0.3433  0.1537  0.1092   0.0760   0.0606
0    0.5  ρ̃γ (GCC-U)                0.3154  0.1352  0.0941   0.0600   0.0418
0    0.5  ρ̄γ (adjusted Spearman)    0.3129  0.1336  0.0931   0.0592   0.0414
0    1    ρ̂γ (GCC-ML)               0.3645  0.1652  0.1176   0.0820   0.0653
0    1    ρ̃γ (GCC-U)                0.3315  0.1439  0.1003   0.0640   0.0446
0    1    ρ̄γ (adjusted Spearman)    0.3330  0.1437  0.1003   0.0639   0.0446
0.3  0    ρ̂γ (GCC-ML)               0.2361  0.0931  0.0640   0.0432   0.0328
0.3  0    ρ̃γ (GCC-U)                0.2404  0.0944  0.0649   0.0411   0.0286
0.3  0    ρ̄γ (adjusted Spearman)    0.2187  0.0905  0.0628   0.0406   0.0288
0.3  0.5  ρ̂γ (GCC-ML)               0.3252  0.1315  0.0905   0.0612   0.0465
0.3  0.5  ρ̃γ (GCC-U)                0.2969  0.1253  0.0869   0.0554   0.0386
0.3  0.5  ρ̄γ (adjusted Spearman)    0.2993  0.1271  0.0884   0.0572   0.0407
0.3  1    ρ̂γ (GCC-ML)               0.3450  0.1405  0.0967   0.0654   0.0497
0.3  1    ρ̃γ (GCC-U)                0.3097  0.1315  0.0912   0.0582   0.0406
0.3  1    ρ̄γ (adjusted Spearman)    0.3171  0.1356  0.0943   0.0611   0.0434
0.9  0    ρ̂γ (GCC-ML)               0.0631  0.0343  0.0281   0.0243   0.0228
0.9  0    ρ̃γ (GCC-U)                0.1380  0.0470  0.0314   0.0196   0.0135
0.9  0    ρ̄γ (adjusted Spearman)    0.1418  0.0542  0.0378   0.0258   0.0199
0.9  0.5  ρ̂γ (GCC-ML)               0.0525  0.0280  0.0228   0.0195   0.0181
0.9  0.5  ρ̃γ (GCC-U)                0.0940  0.0331  0.0225   0.0142   0.0099
0.9  0.5  ρ̄γ (adjusted Spearman)    0.1357  0.0456  0.0310   0.0208   0.0158
0.9  1    ρ̂γ (GCC-ML)               0.0480  0.0254  0.0205   0.0175   0.0163
0.9  1    ρ̃γ (GCC-U)                0.0839  0.0286  0.0193   0.0122   0.0085
0.9  1    ρ̄γ (adjusted Spearman)    0.1301  0.0417  0.0281   0.0187   0.0142

The results of Case 4 highlight the influence of contamination on estimator performance.
Although the GCC-ML estimator generally performs well, its sensitivity to
outliers is more pronounced for small sample sizes and high values of γ. The adjusted
Spearman estimator exhibits greater robustness under these conditions, particularly
in moderate- and high-correlation settings.
• Case 5—Mixture of bivariate normal distributions. In this case, we simulate hetero-
geneity in the data by generating samples from a mixture of two bivariate normal
distributions with different means and/or covariances. The mixture proportions
considered are 10%, 30%, and 50%, and the performance of the estimators is eval-
uated for a weak correlation (ρ = 0.1). Additionally, the flexibility parameter γ is
assessed for values of γ = 0, γ = 0.5, and γ = 1, covering a range from rank-based
to moment-based correlation measures. The results of this case are summarized in
Table 5.

Table 5. RMSE of the estimators for Case 5 with contamination levels of 10%, 30%, and 50%, and
ρ = 0.1, with the indicated values of γ and n.

Contamination  γ    Estimator                 n = 10  n = 50  n = 100  n = 250  n = 500
10%            0    ρ̂γ (GCC-ML)               0.2238  0.0991  0.0768   0.0550   0.0456
10%            0    ρ̃γ (GCC-U)                0.2495  0.1115  0.0864   0.0619   0.0514
10%            0    ρ̄γ (adjusted Spearman)    0.3117  0.1427  0.1107   0.0795   0.0660
10%            0.5  ρ̂γ (GCC-ML)               0.2685  0.1090  0.0771   0.0482   0.0341
10%            0.5  ρ̃γ (GCC-U)                0.3224  0.1417  0.1025   0.0649   0.0468
10%            0.5  ρ̄γ (adjusted Spearman)    0.3444  0.1694  0.1339   0.0959   0.0762
10%            1    ρ̂γ (GCC-ML)               0.3128  0.1328  0.0943   0.0593   0.0421
10%            1    ρ̃γ (GCC-U)                0.3321  0.1424  0.1013   0.0637   0.0453
10%            1    ρ̄γ (adjusted Spearman)    0.3327  0.1428  0.1015   0.0639   0.0454
30%            0    ρ̂γ (GCC-ML)               0.2179  0.1085  0.0875   0.0663   0.0566
30%            0    ρ̃γ (GCC-U)                0.2431  0.1220  0.0984   0.0746   0.0638
30%            0    ρ̄γ (adjusted Spearman)    0.3041  0.1559  0.1261   0.0958   0.0819
30%            0.5  ρ̂γ (GCC-ML)               0.2779  0.1146  0.0802   0.0498   0.0360
30%            0.5  ρ̃γ (GCC-U)                0.3423  0.1594  0.1158   0.0733   0.0537
30%            0.5  ρ̄γ (adjusted Spearman)    0.3745  0.2098  0.1703   0.1253   0.1051
30%            1    ρ̂γ (GCC-ML)               0.3189  0.1371  0.0961   0.0603   0.0437
30%            1    ρ̃γ (GCC-U)                0.3383  0.1471  0.1032   0.0647   0.0469
30%            1    ρ̄γ (adjusted Spearman)    0.3389  0.1474  0.1035   0.0649   0.0471
50%            0    ρ̂γ (GCC-ML)               0.2188  0.1132  0.0898   0.0697   0.0584
50%            0    ρ̃γ (GCC-U)                0.2440  0.1272  0.1010   0.0785   0.0658
50%            0    ρ̄γ (adjusted Spearman)    0.3050  0.1626  0.1293   0.1008   0.0844
50%            0.5  ρ̂γ (GCC-ML)               0.2835  0.1177  0.0823   0.0518   0.0364
50%            0.5  ρ̃γ (GCC-U)                0.3572  0.1709  0.1227   0.0788   0.0563
50%            0.5  ρ̄γ (adjusted Spearman)    0.3957  0.2349  0.1849   0.1395   0.1109
50%            1    ρ̂γ (GCC-ML)               0.3233  0.1393  0.0976   0.0622   0.0437
50%            1    ρ̃γ (GCC-U)                0.3428  0.1493  0.1048   0.0668   0.0470
50%            1    ρ̄γ (adjusted Spearman)    0.3434  0.1497  0.1050   0.0670   0.0471

Key observations from Table 5 include the following:


• With 10% contamination, the estimator ρ̂γ tends to slightly underestimate ργ for γ = 1,
particularly when the sample size is small. However, as the sample size increases,
all estimators converge to the true value. The estimator ρ̃γ exhibits more variability,
especially for small sample sizes and high values of γ. The estimator ρ̄γ shows a slight
underestimation for γ = 1, but it converges as the sample size increases.
• With 30% contamination, the GCC-ML estimator tends to slightly overestimate ργ for
γ = 0.5 and a small sample size (n = 10). The GCC-U estimator also shows some
overestimation for small sample sizes but improves with large sample sizes. The
adjusted Spearman estimator underestimates ργ across all values of γ, although its
performance improves considerably with large sample sizes.
• With 50% contamination, the GCC-ML and GCC-U estimators exhibit high variability
for small sample sizes, particularly for γ ∈ {0.5, 1}, with both tending to overestimate
ργ . The adjusted Spearman estimator remains consistent, slightly underestimating the
true value but showing much lower variability as the sample size increases.
• As contamination levels increase (from 10% to 50%), all estimators show increased
variability, particularly for small sample sizes. However, for large sample sizes (n =
250 and n = 500), the RMSE values decrease, indicating convergence toward the true
value. The estimators generally perform better with lower contamination levels, and
the impact of contamination is pronounced for high values of γ.
• The parameter γ affects the estimators’ performance. For γ = 0 (similar to the Kendall
tau), the estimators tend to be more robust against contamination, especially for large
sample sizes. For γ = 1 (similar to the Pearson correlation), the estimators become
more sensitive to contamination, resulting in higher RMSE values, particularly for
small sample sizes.
The results of Case 5 illustrate the impact of contamination on the performance of the
estimators. The GCC-ML estimator shows better performance for large sample sizes, while
the adjusted Spearman estimator provides a robust alternative, especially for moderate and
high contamination levels.
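
The sampling schemes of Cases 2 to 4 can be sketched as follows. This is a minimal illustration in pure Python, not the simulation code used for the tables: the function names bivariate_normal, contaminated, and pearson are hypothetical, the seed and the sample size of 5000 are arbitrary, and the 60/40 mixture mirrors the description of Case 4.

```python
import math
import random


def bivariate_normal(n, rho, mu=(0.0, 0.0), var=(1.0, 1.0), rng=random):
    """Draw n pairs from N2(mu_X, mu_Y; var_X, var_Y; rho)."""
    sx, sy = math.sqrt(var[0]), math.sqrt(var[1])
    pairs = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x = mu[0] + sx * z1
        # Mixing z1 and z2 this way gives correlation rho between x and y.
        y = mu[1] + sy * (rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
        pairs.append((x, y))
    return pairs


def contaminated(n, rho_main, rho_noise=0.0, share_main=0.6, rng=random):
    """Mixture: each pair comes from the rho_main component with probability share_main."""
    pairs = []
    for _ in range(n):
        rho = rho_main if rng.random() < share_main else rho_noise
        pairs.extend(bivariate_normal(1, rho, rng=rng))
    return pairs


def pearson(pairs):
    """Sample Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(2024)  # arbitrary seed, for reproducibility only
case2 = bivariate_normal(5000, 0.9, mu=(-0.5, 0.5), rng=rng)   # shifted means
case3 = bivariate_normal(5000, 0.9, var=(4.0, 4.0), rng=rng)   # fourfold variance
case4 = contaminated(5000, 0.9, rho_noise=0.0, share_main=0.6, rng=rng)
print(round(pearson(case2), 3), round(pearson(case3), 3), round(pearson(case4), 3))
```

Because the Pearson correlation is invariant to shifts in location and to positive rescaling, the population correlation in the shifted-mean and increased-variance settings is still ρ. In the 60/40 mixture of components with identical standard normal marginals, it becomes 0.6 × 0.9 = 0.54.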

The insights gained from the simulation study demonstrate the varied performance
of the proposed estimators across different conditions of correlation, contamination, variance,
and sample size. Overall, the ML-based estimator ρ̂γ consistently outperformed
the other estimators in terms of accuracy, particularly for small to moderate sample sizes.
Its robustness to different correlation levels and contamination was evident, although it
showed slight sensitivity to high levels of contamination and extreme values for large γ.
As sample sizes increased, the differences in performance between GCC-ML and the other
estimators diminished, but the ML-based estimator maintained its advantage in terms of
lower RMSE values.
The U-statistics-based estimator ρ̃γ , while showing more variability in certain cases—
particularly in small sample sizes or under shifts in location (as seen in Case 2)—improved
as sample sizes increased. However, the GCC-U estimator had a tendency to overestimate
ργ for small values of γ and low correlation levels, and it was more sensitive to contam-
ination, particularly when γ = 0. This sensitivity reflects the rank-based nature of the
estimator, which is more impacted by outliers and distribution shifts.
The adjusted Spearman estimator ρ̄γ exhibited strong robustness to contamination,
consistently producing low RMSE values in cases with moderate contamination levels.
However, it tended to underestimate ργ , especially at moderate correlation levels. Despite
this, the adjusted Spearman estimator performed stably across different contamination and
correlation levels, making it a reliable option for highly contaminated datasets.
The flexibility parameter γ played a crucial role in the behavior of all estimators.
Low values of γ, particularly γ = 0 (similar to the Kendall tau), offered more robustness
against contamination and outliers, while high values of γ (closer to the Pearson correlation)
performed better at capturing linear relationships in uncontaminated datasets. The inter-
mediate value of γ = 0.5 balanced sensitivity to both linear and non-linear dependencies
effectively, providing a flexible approach to varying data complexities.
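
The role of γ can be illustrated with the classical general correlation coefficient built from pairwise scores sign(d)|d|^γ. The sketch below assumes that the GCC takes this form, which yields the Kendall tau at γ = 0 (for data without ties) and the Pearson correlation at γ = 1; the definition used in this article is given in an earlier section and may differ in detail. The function name gcc and the toy data are hypothetical.

```python
import math


def gcc(x, y, gamma):
    """General correlation with pairwise scores sign(d) * |d|**gamma (assumed form)."""
    def score(d):
        if d == 0:
            return 0.0
        return math.copysign(abs(d) ** gamma, d)

    n = len(x)
    num = sxx = syy = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            a = score(x[i] - x[j])
            b = score(y[i] - y[j])
            num += a * b
            sxx += a * a
            syy += b * b
    return num / math.sqrt(sxx * syy)


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
print(gcc(x, y, 0.0))  # 0.6, the Kendall tau of this untied sample
print(gcc(x, y, 1.0))  # 0.8, the Pearson correlation

# One gross outlier moves the gamma = 1 value much further than the gamma = 0 value.
xo, yo = x + [6.0], y + [-40.0]
for gamma in (0.0, 0.5, 1.0):
    print(gamma, round(gcc(xo, yo, gamma), 4))
```

The double loop costs O(n²) operations per gene pair, which is acceptable for a sketch but would need vectorization for genome-scale data.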
Across all scenarios, the impact of sample size was clear: as the sample size increased,
all estimators showed improved performance, with reduced RMSE values and better
convergence toward the true value of ργ . The performance gap between the estimators was
more noticeable for small sample sizes (n = 10 and n = 50), but this gap narrowed as the
sample sizes grew (n = 250 and n = 500). Notably, the ML-based estimator demonstrated
the most rapid convergence, particularly in large sample sizes.
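
The dependence of the RMSE on the sample size can be reproduced in miniature with a Monte Carlo loop. The sketch below uses the sample Pearson correlation on uncontaminated bivariate normal data as a stand-in estimator; the function names, the number of replicates (400), and the seed are illustrative assumptions rather than the settings of the simulation study.

```python
import math
import random


def sample_pearson(n, rho, rng):
    """Pearson correlation of one simulated bivariate normal sample of size n."""
    xs, ys = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        xs.append(z1)
        ys.append(rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)


def rmse(n, rho, reps, rng):
    """Root mean squared error of the estimator over reps Monte Carlo replicates."""
    squared_errors = [(sample_pearson(n, rho, rng) - rho) ** 2 for _ in range(reps)]
    return math.sqrt(sum(squared_errors) / reps)


rng = random.Random(1510)  # arbitrary seed
results = {n: rmse(n, 0.3, 400, rng) for n in (10, 50, 250)}
for n, value in results.items():
    print(n, round(value, 4))
```

Replacing sample_pearson with any other estimator of ργ gives the corresponding RMSE curve under the same design.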
In summary, the GCC-ML estimator provided the most reliable and accurate perfor-
mance across diverse scenarios, with the choice of the flexibility parameter γ influencing
the estimators’ behavior. Low values of γ favored robustness, particularly in contami-
nated datasets, while high γ values excelled at capturing linear dependencies. Therefore,
this study highlights the importance of selecting an appropriate value of γ based on the
underlying data structure to optimize estimator performance.
The simulation results demonstrate the capacity of the GCC to handle non-linear dependencies,
high-dimensional data, and contamination, all of which are common challenges in the
analysis of gene expression studies. These strengths position the GCC as a valuable tool for
constructing and analyzing RNs, crucial for identifying complex interactions and potential
biomarkers in biological systems. With these findings, we now transition to the practical
application of these methods in constructing RNs.

4. Relevance Networks and Advanced Statistical Applications


This section explores the practical application of the GCC and its adaptations, focusing
on the construction of RNs using gene expression data. We detail the data collection
process, including sample acquisition, processing, and interpretation, while demonstrating
the enhanced performance of the GCC estimators in real biological data. Special attention is
given to the role of these estimators in improving the robustness of the analysis, particularly
when compared to traditional methods like Pearson and Spearman correlations.

4.1. Data Collection and Relevance Network Methodology


This study utilized gene expression data collected from high-throughput Agilent
microarray platforms. Detailed specifications of the platforms can be found on the manufacturer
website: http://www.genomics.agilent.com (accessed on 4 November 2024).
The dataset comprises complementary deoxyribonucleic acid (cDNA) samples from
biopsies performed on approximately one thousand patients undergoing diagnostic proce-
dures for oncological or precancerous conditions in the esophageal and gastric regions.
The dataset provides an ideal foundation for applying the enhanced GCC methodology
validated in our simulation study. We utilized three estimators, ρ̂γ (GCC-ML), ρ̃γ (GCC-U),
and ρ̄γ (adjusted Spearman), all of which showed robust performance under non-linear
dependencies, high-dimensional data, and contamination scenarios. This robustness makes
them particularly suited for exploring gene interactions.
Leveraging this advanced methodology, we constructed RNs to analyze complex gene
interactions in pathological conditions. The RNs generated provided a more comprehensive
view of gene associations compared to traditional methods, enabling the identification of
intricate interactions that could serve as biomarkers or therapeutic targets in oncology.
These methodologies, supported by extensive simulations, offer a robust framework
for analyzing high-dimensional biological datasets and demonstrate their practical utility
in the construction and interpretation of RNs.

4.2. Integration of Advanced Statistical Methods in RN Analysis


The GCC estimators were integrated into the RN construction process to enhance the
identification and interpretation of gene correlations.
As demonstrated in the simulation study, the GCC provides superior handling of non-
linear dependencies and contamination, making it a valuable tool for real-world datasets.
Applying these methods to gene expression data reveals meaningful patterns that tra-
ditional measures may overlook, highlighting their importance in biomedical informatics.
In this study, we identified a cohort of 146 individuals for deeper analysis, including
samples of normal, inflamed, and metaplastic mucosa. The RNs were constructed using
data from 57 normal gastric tissue samples.
Gene expression associations were measured using the Kendall ρK, Pearson ρ, and
Spearman ρS correlation coefficients, as well as non-linear indices, such as mutual information
[35,51,52]. For RN construction, we applied the squared sample Pearson correlation
coefficient r_P^2 to calculate pairwise gene correlations, forming a fully connected graph. A
threshold on r_P^2 was then used to segment the graph into smaller interconnected subnetworks:
only edges whose r_P^2 exceeds the threshold are retained.
Pearson, Spearman, and GCC metrics were then computed on these 57 normal gastric tissue
samples for comparative analysis.
The empirical evaluation, supported by the histogram in Figure 1, suggests that a
normal distribution fits the gene expression data well.
The histogram was generated through random sampling and includes a kernel density
estimate with an overlay of the normal distribution curve, providing a visual comparison
between the empirical data and the theoretical model.
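
A comparison of this kind can be sketched numerically by evaluating a Gaussian kernel density estimate and a fitted normal density on a common grid. The example below is only an illustration on synthetic data: the values are simulated rather than taken from the expression dataset, the helper gaussian_kde is hypothetical, and the normal-reference bandwidth is an assumed choice.

```python
import random
from statistics import NormalDist, mean, stdev


def gaussian_kde(data, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    kernel = NormalDist(0.0, bandwidth)
    n = len(data)
    return lambda t: sum(kernel.pdf(t - v) for v in data) / n


rng = random.Random(7)  # arbitrary seed
values = [rng.gauss(0.0, 1.0) for _ in range(500)]  # synthetic stand-in, not real expression data

fitted = NormalDist(mean(values), stdev(values))         # normal curve for the overlay
bandwidth = 1.06 * stdev(values) * len(values) ** -0.2   # normal-reference rule (assumed choice)
kde = gaussian_kde(values, bandwidth)

# Largest pointwise gap between the two density curves on a coarse grid.
grid = [i / 2 for i in range(-6, 7)]
max_gap = max(abs(kde(t) - fitted.pdf(t)) for t in grid)
print(round(max_gap, 4))
```

A small gap across the grid is consistent with a good normal fit, although it is a visual aid rather than a formal test.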
In genomic research, Pearson correlation has traditionally been the default method
for measuring associations; the GCC with γ = 1 aligns with this linear measure.
However, relying solely on Pearson correlation can miss important
molecular interactions, particularly those involving non-linear relationships that deviate
from the assumptions of a linear model.

Figure 1. Histogram representing the data distribution, with a kernel density estimate and an overlay
of the normal distribution curve.

To address this limitation, we applied the GCC with varying values of γ, expanding
the analysis to capture a broader spectrum of associations within the biological system.
By adjusting γ, we generated networks that capture a broad range of gene interactions,
providing a more comprehensive understanding.
The diverse networks offer strong candidates for further biological validation and may
uncover deeper insights into the underlying molecular mechanisms. The parameter γ was
set at multiple levels in constructing the RNs, that is, γ ∈ {1, 0.86, 0.71, 0.57, 0.43, 0.29, 0.14, 0},
following guidelines from previous studies [52]. A threshold criterion of r_P^2 > 0.5 was used
to identify subgraphs, defining the RNs.
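
The thresholding step can be sketched as follows: edges whose coefficient passes the threshold are kept, and the connected components of the resulting graph are the subnetworks. This is a minimal sketch under stated assumptions. The function relevance_network, the gene labels G1 to G5, and the coefficients are hypothetical, and the threshold is applied to the absolute coefficient, as in the figure captions (|ρ| > 0.5); a squared criterion can be applied by passing squared coefficients. The grid of γ values listed above coincides with k/7 rounded to two decimals, although whether it was chosen that way is an assumption.

```python
def relevance_network(corr, threshold=0.5):
    """Keep edges with |coefficient| > threshold; return the edges and connected components."""
    genes = sorted({g for pair in corr for g in pair})
    edges = {pair: r for pair, r in corr.items() if abs(r) > threshold}

    neighbours = {g: set() for g in genes}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen, components = set(), []
    for g in genes:
        if g in seen or not neighbours[g]:
            continue  # skip visited genes and isolated genes
        stack, comp = [g], set()
        while stack:  # depth-first search over retained edges
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(neighbours[node] - comp)
        seen |= comp
        components.append(sorted(comp))
    return edges, components


# Hypothetical pairwise coefficients for five placeholder genes.
toy = {
    ("G1", "G2"): 0.77,
    ("G2", "G3"): -0.59,
    ("G1", "G3"): 0.31,
    ("G3", "G4"): 0.12,
    ("G4", "G5"): 0.62,
}
edges, components = relevance_network(toy, threshold=0.5)
print(components)  # [['G1', 'G2', 'G3'], ['G4', 'G5']]

# The gamma grid expressed as k/7 for k = 7, ..., 0, rounded to two decimals.
gammas = [round(k / 7, 2) for k in range(7, -1, -1)]
print(gammas)
```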
Figures 2–11 present a detailed illustration of the RNs derived using the GCC for
various values of γ, as well as networks constructed using the Spearman correlation
coefficient. In these figures, green edges represent negative correlations, while red edges
represent positive correlations.
As γ decreases, the GCC becomes more selective, isolating the strongest and most robust
correlations. The resulting networks are sparser, with fewer but potentially more biologically
relevant connections.
Interestingly, the network structure obtained using the Spearman correlation closely
resembles that derived from the GCC at γ = 0.57. This resemblance arises because the GCC
at γ = 0.57 captures both linear and monotonic relationships, similar to those measured by
the Spearman correlation. These results underscore the flexibility of the GCC in adapting
to different types of dependencies present in gene expression data [35].

[Figure 2 image: nine network panels over the same 17 genes (ADORA3, CD99L2, KBTBD4, CATSPERG, PRNP, MEGF11, SMG7, APBA3, LRP1, CDH6, ATP11B, ITPRIPL2, HDDC3, GOLGA3, DNAJA1, CRCP, TSC1). Panels: (a) γ = 1, (b) γ = 0.86, (c) γ = 0.71, (d) γ = 0.57, (e) γ = 0.43, (f) γ = 0.29, (g) γ = 0.14, (h) γ = 0, each with |ρ| > 0.5; (i) Spearman correlation with |ρS| > 0.5.]

Figure 2. Relevance network constructed using GCC for different values of γ, ρ, and ρS , where green
edges represent negative correlations, while red edges represent positive correlations.

The impact of varying the parameter γ on network topology is illustrated in Figures 3–11.
In these networks, nodes represent genes, and edges represent high correlations between
gene expression levels, with the correlation coefficients indicated along the edges. Blue
edges represent correlations that have weakened compared to the preceding value of γ,
while violet edges indicate correlations that have remained strong or increased. Through
the analysis of these RNs, we observe that lower values of γ filter out weaker
correlations, so the GCC retains only the strongest and potentially most biologically relevant
associations as γ decreases. This behavior is consistent
with the findings from our earlier simulation study, highlighting the practical utility of
the GCC in biological applications. The flexibility to adjust γ provides a powerful tool for
examining data from multiple perspectives, ensuring that important non-linear or complex
dependencies are captured.
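
The edge coloring described above amounts to comparing, edge by edge, the coefficient at the current value of γ with the coefficient at the preceding value. A minimal sketch of that comparison follows; the function classify_edges, the node labels, and the coefficients are hypothetical and are not taken from the networks in the figures.

```python
def classify_edges(previous, current, threshold=0.5):
    """Label each retained edge relative to the preceding value of gamma."""
    labels = {}
    for pair, r in current.items():
        if abs(r) <= threshold:
            continue  # edge is not part of the current network
        before = previous.get(pair)
        if before is not None and abs(r) < abs(before):
            labels[pair] = "blue"    # correlation weakened
        else:
            labels[pair] = "violet"  # correlation remained strong or increased
    # Edges present at the preceding gamma that no longer pass the threshold.
    dropped = [
        pair for pair, r in previous.items()
        if abs(r) > threshold and abs(current.get(pair, 0.0)) <= threshold
    ]
    return labels, dropped


# Hypothetical coefficients at two consecutive values of gamma.
prev = {("A", "B"): 0.77, ("B", "C"): -0.59, ("C", "D"): 0.52}
curr = {("A", "B"): 0.78, ("B", "C"): -0.55, ("C", "D"): 0.49}
labels, dropped = classify_edges(prev, curr)
print(labels)   # {('A', 'B'): 'violet', ('B', 'C'): 'blue'}
print(dropped)  # [('C', 'D')]
```

Comparing absolute values treats positive and negative correlations symmetrically, which matches the |ρ| > 0.5 criterion in the figure captions.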

[Figure 3 image: network diagram. Edge labels as extracted (endpoints not recoverable): −0.5543, −0.5905, 0.7702, 0.5887, 0.6854, 0.5044, −0.5598, 0.5686, 0.6368, 0.7667, 0.6809, 0.5766, 0.6059, 0.5075, 0.7618, −0.5207, 0.5252, −0.5989, −0.5868, −0.6828.]

Figure 3. Gene interaction network using the GCC with γ = 1 and |ρ| > 0.5, where nodes repre-
sent genes and edges represent high correlations between gene expression levels, with correlation
coefficients indicated.

[Figure 4 image: network diagram. Edge labels as extracted (endpoints not recoverable): −0.5504, −0.6012, 0.7722, 0.5915, 0.6861, 0.5235, −0.5412, 0.5383, 0.6857, 0.7647, 0.6295, 0.6112, 0.5132, 0.7652, −0.5265, 0.5205, −0.5890, −0.5910, −0.6852.]

Figure 4. Gene interaction network using the GCC with γ = 0.86 and |ρ| > 0.5, where blue edges
represent correlations that have weakened compared to those with the previous value (γ = 1), while
violet edges indicate correlations that have remained strong or increased.

[Figure 5 image: network diagram. Edge labels as extracted (endpoints not recoverable): −0.5428, −0.6076, 0.7680, 0.5881, 0.6808, 0.5351, −0.5199, 0.6861, 0.7549, 0.6168, 0.6100, 0.5121, 0.7633, −0.5271, 0.5110, −0.5742, −0.5904, −0.6820.]

Figure 5. Gene interaction network using the GCC with γ = 0.71 and |ρ| > 0.5, where blue edges
represent correlations that have weakened compared to those with the previous value (γ = 0.86),
while violet edges indicate correlations that have remained strong or increased.

[Figure 6 image: network diagram. Edge labels as extracted (endpoints not recoverable): −0.5307, −0.6081, 0.7571, 0.5769, 0.6808, 0.5362, 0.6790, 0.7352, 0.5966, 0.6005, 0.5029, 0.7544, −0.5210, −0.5535, −0.5834, −0.6717.]

Figure 6. Gene interaction network using the GCC with γ = 0.57 and |ρ| > 0.5, where blue edges
represent correlations that have weakened compared to those with the previous value (γ = 0.71),
while violet edges indicate correlations that have remained strong or increased.

[Figure 7 image: network diagram. Edge labels as extracted (endpoints not recoverable): −0.5127, −0.6006, 0.7357, 0.5556, 0.6441, 0.5240, 0.6604, 0.7033, 0.5667, 0.5805, 0.7363, −0.5065, −0.5257, −0.5681, −0.6523.]

Figure 7. Gene interaction network using the GCC with γ = 0.43 and |ρ| > 0.5, where blue edges
represent correlations that have weakened compared to those with the previous value (γ = 0.57),
while violet edges indicate correlations that have remained strong or increased.

Figure 8. Gene interaction network using the GCC with γ = 0.29 and |ρ| > 0.5, where blue edges
represent correlations that have weakened compared to those with the previous value (γ = 0.43),
while violet edges indicate correlations that have remained strong or increased.

Figure 9. Gene interaction network using the GCC with γ = 0.14 and |ρ| > 0.5, where blue edges
represent correlations that have weakened compared to those with the previous value (γ = 0.29),
while violet edges indicate correlations that have remained strong or increased.

Figure 10. Gene interaction network using the GCC with γ = 0 and |ρ| > 0.5, where blue edges
represent correlations that have weakened compared to those with the previous value (γ = 0.14).

Figure 11. Gene interaction network using the Spearman correlation coefficient with |ρS| > 0.5.

When the value of γ is reduced from 1 to 0.86, some correlations slightly decrease in
magnitude, while others increase. For example, the correlation between PRNP and LRP1
changes from −0.5905 to −0.6012, reflecting a subtle increase in absolute value. This occurs
because higher γ values emphasize linear relationships, making non-linear interactions
more noticeable as γ decreases [52]. As γ decreases further from 0.86 to 0.71, approximately
23% of the correlations increase in strength, a reduction compared to the previous step.
For instance, the correlation between KBTBD4 and SMG7 decreases from 0.6861 to
0.6808, highlighting the increasing selectivity of the GCC at low γ values, where it focuses
on stronger correlations. At γ = 0.57, only about 12% of the correlations show an increase in
strength, continuing the trend of filtering out weaker associations. This demonstrates how
the GCC progressively isolates the most robust interactions, prioritizing biologically
relevant connections as γ decreases. Notably, as γ is reduced from 0.29 to 0.14 and eventually
to 0, the number of edges in the networks decreases, reflecting fewer correlations surpassing
the threshold of 0.5 in absolute value. This emphasizes the role of the GCC in isolating the
strongest interactions, providing clearer insights into strong gene associations.
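The step-to-step comparison described above (edges that strengthen, weaken, or fall below the 0.5 threshold as γ changes) can be illustrated with a minimal Python sketch. It is not the authors' code, which is written in R, and it does not compute the GCC itself: it only assumes that the correlations at two consecutive γ values are already available as dictionaries keyed by gene pair. The function names `build_network` and `compare_steps` and the `GENE_*` pairs are hypothetical; only the PRNP and LRP1 values are taken from the text.

```python
THRESHOLD = 0.5  # |rho| cut-off used for the networks in the text


def build_network(correlations, threshold=THRESHOLD):
    """Keep only the gene pairs whose |correlation| exceeds the threshold."""
    return {pair: rho for pair, rho in correlations.items() if abs(rho) > threshold}


def compare_steps(previous, current, threshold=THRESHOLD):
    """Classify every edge of the previous network by its fate at the new gamma."""
    report = {"strengthened": [], "weakened": [], "dropped": []}
    for pair, old_rho in build_network(previous, threshold).items():
        new_rho = current.get(pair, 0.0)  # a missing pair is treated as no association
        if abs(new_rho) <= threshold:
            report["dropped"].append(pair)
        elif abs(new_rho) > abs(old_rho):
            report["strengthened"].append(pair)
        else:
            report["weakened"].append(pair)
    return report


# PRNP-LRP1 uses the values quoted in the text (gamma = 1 -> 0.86);
# the GENE_* pairs and their values are hypothetical placeholders.
at_gamma_1_00 = {("PRNP", "LRP1"): -0.5905, ("GENE_A", "GENE_B"): 0.62, ("GENE_C", "GENE_D"): 0.51}
at_gamma_0_86 = {("PRNP", "LRP1"): -0.6012, ("GENE_A", "GENE_B"): 0.60, ("GENE_C", "GENE_D"): 0.48}

report = compare_steps(at_gamma_1_00, at_gamma_0_86)
total_edges = sum(len(pairs) for pairs in report.values())
share_strengthened = len(report["strengthened"]) / total_edges
print(report)
print(f"share of edges that strengthened: {share_strengthened:.0%}")
```

Comparing absolute values rather than signed values matches the way the text treats the change from −0.5905 to −0.6012 as an increase in strength.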
The network derived using the Spearman correlation coefficient, shown in Figure 11,
closely resembles the GCC network at γ = 0.57. However, the Spearman coefficient captures
a negative correlation between HDDC3 and PRNP whose absolute value exceeds 0.5 and which
is not detected in any GCC configuration. This highlights the ability of the Spearman
coefficient to capture monotonic relationships that may not align with linear models, revealing
interactions that could be overlooked when using only the GCC. These observations underscore
the sensitivity of the GCC to the parameter γ, while the comparison with the Spearman
correlation demonstrates the importance of using multiple correlation measures to capture
a broader spectrum of interactions. This comprehensive approach is essential for fully
understanding the complexity of gene expression data and the underlying biological
processes. Figure 12 provides a visual guide to the methodology employed in constructing
RNs, illustrating the application of the GCC across different values of γ. This flowchart
clarifies the analytical process from data collection to network construction, and serves as a
reference for understanding how different parameter settings affect the results.
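The remark that the Spearman coefficient can capture monotonic relationships that do not align with linear models can be illustrated with a small, self-contained Python sketch. The data below are made up for illustration and are not gene expression values from the study; the Pearson coefficient stands in here as a generic linear measure and is not the GCC. The helper names `ranks`, `pearson`, and `spearman` are hypothetical.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = shared
        start = end + 1
    return result


def pearson(x, y):
    """Sample Pearson correlation between two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data: y increases with x, but far from linearly.
x = [1, 2, 3, 4, 5, 6]
y = [1, 2, 4, 8, 16, 1000]

rho_p = pearson(x, y)
rho_s = spearman(x, y)
print(f"Pearson: {rho_p:.3f}  Spearman: {rho_s:.3f}")
```

Because the ranks of `x` and `y` coincide, the rank-based coefficient reaches its maximum while the linear coefficient is pulled down by the extreme value, which is the kind of discrepancy that motivates comparing several correlation measures.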

[Flowchart. Box labels: Begin; Collect raw data; Pre-process data; Identify genes; Is correlation significant? (yes/no); Adjust parameters; Construct RNs; Run Monte Carlo simulations; Analyze data; Compare correlation methods; Is chosen γ optimal? (yes/no); Determine clinical implications; Validate biological relevance; End.]

Figure 12. Flowchart of data analysis process with steps from data collection to network construction.
In summary, our methodology highlights the importance of adaptive analytical strategies
in unraveling complex biological networks. By moving beyond conventional correlation
measures, we invite researchers to explore a wider range of interactions within
molecular systems. Deriving meaningful insights from quantitative data requires a delicate
balance between statistical precision and biological interpretation.

5. Conclusions
Biomedical informatics plays a pivotal role in elucidating molecular interactions,
which are essential for advancing medical diagnostics and therapeutic development. One
of the ongoing challenges is the analysis of high-dimensional gene expression data, where
traditional correlation coefficients, such as Pearson and Spearman, often fall short, particularly
when addressing non-linear relationships. This challenge emphasizes the need for more
robust and flexible analytical tools. In this study, we applied the generalized correlation
coefficient, utilizing its flexibility parameter, to analyze gene association networks. The
generalized correlation coefficient provides a versatile tool for capturing complex
dependencies in molecular biology data, adapting to different correlation structures. To our
knowledge, this research represents one of the earliest applications of the generalized
correlation coefficient in genomic studies, addressing the shortcomings of conventional
correlation methods in dealing with outliers and deviations from normality.
We introduced computational refinements, including robust estimators based on
U-statistics and Fisher-consistent estimators, supported by advanced techniques such as the
delta method. These improvements enhance both the reliability and the practical utility
of the generalized correlation coefficient when analyzing high-dimensional biological
data, such as gene expression profiles. However, it is important to acknowledge that
the generalized correlation coefficient imposes greater computational demands than
traditional methods, posing a challenge, particularly for large-scale genomic datasets, a
common scenario in modern research. Key findings from our analysis include the following:
• The adaptability of the generalized correlation coefficient to various data complexities,
demonstrating robustness and sensitivity in gene association network analysis.
• The influence of the flexibility parameter on network topology, where low values of
this parameter lead to sparser networks, emphasizing the strongest correlations.
• The detection of unique interactions using the Spearman correlation, not captured by
any configuration of the generalized correlation coefficient, underscoring the impor-
tance of applying multiple correlation measures for comprehensive data analysis.
While the focus of this study has been on gene expression data and relevance networks,
the flexibility of the generalized correlation coefficient allows it to be applied to other types
of biological data that exhibit non-linear dependencies. For instance, protein–protein
interaction networks and microbiome data, which often involve complex and high-dimensional
relationships, can benefit from the robust correlation measures provided by the generalized
correlation coefficient. This extends the applicability of the method beyond genomics, making
it a valuable tool for broader applications in systems biology and molecular interactions.
Despite the advantages of using the generalized correlation coefficient, its computational
burden presents a practical challenge, particularly when applied to large datasets.
Optimizing the computational efficiency of the algorithm, or integrating it into high-performance
computing environments, would facilitate its broader use in genomic research. Additionally,
while comprehensive, the dataset employed in this study was based on high-throughput
microarray technology, which has inherent limitations, such as background noise and the
inability to detect novel transcripts or non-coding ribonucleic acids. These limitations may
affect the accuracy of the constructed gene association networks. Moreover, while the
generalized correlation coefficient was applied here to gene expression data, extending
the usual correlation coefficients, such as those arising in single-cell analysis, represents
a promising avenue for future research. This could help in capturing even more complex
dependencies in high-dimensional biological data, further broadening the applicability of
the generalized correlation coefficient to modern challenges in data analysis.
Our empirical analysis was conducted on a subset of 57 normal gastric tissue
observations using the R statistical software, version 4.4.2 [53]. While this subset provided
valuable insights, it may not capture the full spectrum of biological variability. Future
work should validate the application of the generalized correlation coefficient using larger
and more diverse datasets, including different tissue types and pathological conditions, to
ensure the generalizability of the findings.
Another aspect for further study concerns the asymmetries identified in genomic data
distributions, where quantile regression methods [54,55] could be explored. Researchers
might also consider employing other types of asymmetric distributions for the tests under
study. Moreover, utilizing machine learning techniques in genomic data analysis is a
promising avenue for future research.
In conclusion, this study demonstrated the flexibility and strength of the generalized
correlation coefficient for analyzing complex molecular interactions in biomedical
informatics. By capturing both linear and non-linear relationships, this coefficient proved to be
effective for researchers working with high-dimensional biological data. Our methodology
expands the statistical toolkit for genomic and biomedical research, with applications in
controlled simulations and real-world datasets. The adaptability of this coefficient
to varying correlation structures and data complexities offers valuable insights into
gene expression dynamics and their implications for precision medicine.

Author Contributions: Conceptualization, R.O., C.M.X., G.H.E., P.L.E., C.C., and V.L.; data curation,
R.O., C.M.X., G.H.E., P.L.E., and C.C.; formal analysis, R.O., C.M.X., G.H.E., P.L.E., C.C., and V.L.;
investigation, R.O., C.M.X., G.H.E., P.L.E., C.C., and V.L.; methodology, R.O., C.M.X., G.H.E., P.L.E.,
C.C., and V.L.; writing—original draft, R.O., C.M.X., G.H.E., and P.L.E.; writing—review and editing,
V.L. and C.C. All authors have read and agreed to the published version of the manuscript.
Funding: This research was supported by the Conselho Nacional de Desenvolvimento Científico
e Tecnológico (CNPq), No. 303192/2022-4, and Fundação de Amparo a Ciência e Tecnologia
do Estado da Bahia (FAPESB), No. APP0021/2023 (R.O.); by the Vice-rectorate for Research,
Creation, and Innovation (VINCI) of the Pontificia Universidad Católica de Valparaíso (PUCV),
Chile, under grants VINCI 039.470/2024 (regular research), VINCI 039.493/2024 (interdisciplinary
associative research), VINCI 039.309/2024 (PUCV centenary), and FONDECYT 1200525 (V.L.)
from the National Agency for Research and Development (ANID) of the Chilean government;
and by Portuguese funds through the CMAT (Research Centre of Mathematics of University of
Minho, Portugal), within projects UIDB/00013/2020 (https://doi.org/10.54499/UIDB/00013/2020,
accessed on 4 November 2024) and UIDP/00013/2020 (https://doi.org/10.54499/UIDP/00013/2020,
accessed on 4 November 2024) (C.C.).
Data Availability Statement: The dataset used for analysis in this study originates from the FAPESP
research project No. 06/03227-2, titled “Gene Expression in Stomach and Esophagus Tumors: From
Biology to Diagnosis”. The data were obtained through a collaboration between State University of
Paraíba and the Sírio-Libanês Hospital in Brazil. The data and codes used in this study are available
on GitHub at https://github.com/Raydonal/SCGeneNetworkGCC (accessed on 4 November 2024).
Please contact the authors for any additional information.
Acknowledgments: The authors would like to thank the editors and anonymous reviewers for their
valuable comments and suggestions, which helped us to improve the quality of this article.
Conflicts of Interest: The authors declare no conflicts of interest.

References
1. Cavalcante, T.; Ospina, R.; Leiva, V.; Martin-Barreiro, C.; Cabezas, X. Weibull regression and machine learning survival models:
Methodology, comparison, and application to biomedical data related to cardiac surgery. Biology 2023, 11, 1394.
2. Varuzza, L.; Pereira, C.A.D.B. Significance test for comparing digital gene expression profiles: Partial likelihood application. Chil. J.
Stat. 2010, 1, 91–102.
3. Ospina, R.; Ferreira, A.G.O.; de Oliveira, H.M.; Leiva, V.; Castro, C. On the use of machine learning techniques and non-invasive
indicators for classifying and predicting cardiac disorders. Biomedicines 2023, 11, 2604.
4. Bielińska-Wąż, D.; Wąż, P.; Błaczkowska, A.; Mandrysz, J.; Lass, A.; Gładysz, P.; Karamon, J. Mathematical modeling in
bioinformatics: Application of an alignment-free method combined with principal component analysis. Symmetry 2024, 16, 967.
5. Chicco, D.; Jurman, G. A statistical comparison between Matthews correlation coefficient (MCC), prevalence threshold, and
Fowlkes–Mallows index. J. Biomed. Informat. 2023, 144, 104426.
6. Zhou, K.; Zhang, S.; Wang, Y.; Cohen, K.B.; Kim, J.-D.; Luo, Q.; Yao, X.; Zhou, X.; Xia, J. High-quality gene/disease embedding in a
multi-relational heterogeneous graph after a joint matrix/tensor decomposition. J. Biomed. Informat. 2022, 126, 103973.
7. Ortega-Leon, A.; Gucciardi, A.; Segado-Arenas, A.; Benavente-Fernández, I.; Urda, D.; Turias, I.J. Neurodevelopmental
impairments prediction in premature infants based on clinical data and machine learning techniques. Stats 2024, 7, 685–696.
8. Han, H. Bayesian model averaging and regularized regression as methods for data-driven model exploration, with practical
considerations. Stats 2024, 7, 732–744.
9. Leiva, V.; Corzo, J.; Vergara, M.E.; Ospina, R.; Castro, C. A statistical methodology for evaluating asymmetry after normalization
with application to genomic data. Stats 2024, 7, 967–983.
10. Leiva, V.; Sanhueza, A.; Kelmansky, S.; Martinez, E. On the glog-normal distribution and its association with the gene expression
problem. Comput. Stat. Data Anal. 2009, 53, 1613–1621.
11. Vilca, F.; Rodrigues-Motta, M.; Leiva, V. On a variance stabilizing model and its application to genomic data. J. Appl. Stat. 2013, 40,
2354–2371.
12. Kelmansky, D.; Martinez, E.; Leiva, V. A new variance stabilizing transformation for gene expression data analysis. Stat. Appl.
Genet. Mol. Biol. 2013, 12, 653–666.
13. Wilcox, R. The percentage bend correlation coefficient. Psychometrika 1994, 59, 601–616.
14. Wilcox, R. Inferences based on a skipped correlation coefficient. J. Appl. Stat. 2004, 31, 131–143.
15. Reshef, D.N.; Reshef, Y.A.; Finucane, H.K.; Grossman, S.R.; McVean, G.; Turnbaugh, P.J.; Lander, E.S.; Mitzenmacher, M.; Sabeti,
P.C. Detecting novel associations in large datasets. Science 2011, 334, 1518–1524.
16. Ravindran, U.; Gunavathi, C. A survey on gene expression data analysis using deep learning methods for cancer diagnosis. Prog.
Biophys. Mol. Biol. 2023, 177, 1–13.
17. Masoodi, F.; Quasim, M.; Bukhari, S.; Dixit, S.; Alam, S. (Eds.) Applications of Machine Learning and Deep Learning on Biological Data;
CRC Press: New York, NY, USA, 2023.
18. Rahnenführer, J.; De Bin, R.; Benner, A.; Ambrogi, F.; Lusa, L.; Boulesteix, A.L.; Migliavacca, E. Statistical analysis of high-
dimensional biomedical data: A gentle introduction to analytical goals, common approaches and challenges. BMC Med. 2023,
21, 182.
19. Li, J.J.; Zhou, H.J.; Bickel, P.J.; Tong, X. Dissecting gene expression heterogeneity: Generalized Pearson correlation squares and the
K-lines clustering algorithm. J. Am. Stat. Assoc. 2024, 119, 1–14.
20. Bai, X.; Wang, S.; Zhang, X.; Wang, H. Molecular-memory-induced counter-intuitive noise attenuator in protein polymerization.
Symmetry 2024, 16, 315.
21. Chinchilli, V.M.; Philips, B.R.; Mauger, D.T.; Szefler, S.J. A general class of correlation coefficients for the 2 × 2 crossover design.
Biom. J. 2005, 47, 644–653.
22. McManus, C. Cerebral polymorphisms for lateralisation: Modelling the genetic and phenotypic architectures of multiple functional
modules. Symmetry 2022, 14, 814.
23. Chen, V.Y.J.; Chinchilli, V.M.; Richards, D.S.P. Robustness and monotonicity properties of generalized correlation coefficients. J.
Stat. Plan. Infer. 2011, 141, 924–936.
24. Sanchez, J.D.; Rêgo, J.C.; Ospina, R.; Leiva, V.; Chesneau, C.; Castro, C. Similarity-based predictive models: Sensitivity analysis
and a biological application with multi-attributes. Biology 2023, 12, 959.
25. Alkadya, W.; ElBahnasy, K.; Leiva, V.; Gad, W. Classifying COVID-19 based on amino acids encoding with machine learning
algorithms. Chemom. Intell. Lab. Syst. 2022, 224, 104535.
26. Bustos, N.; Tello, M.; Droppelmann, G.; Garcia, N.; Feijoo, F.; Leiva, V. Machine learning techniques as an efficient alternative
diagnostic tool for COVID-19 cases. Signa Vitae 2022, 18, 23.
27. García-Sancho, M.; Lowe, J. A History of Genomics Across Species, Communities and Projects; Springer: New York, NY, USA, 2023.
28. Tully, J.; Hill, A.; Ahmed, H.; Whitley, R.; Skjellum, A.; Mukhtar, M. Expression-based network biology identifies immune-related
functional modules involved in plant defense. BMC Genom. 2014, 15, 421.
29. Jaskowiak, P.A.; Campello, R.J.G.B.; Costa, I. Proximity measures for clustering gene expression microarray data: A validation
methodology and a comparative analysis. Comput. Biol. Bioinform. IEEE/ACM Trans. 2013, 10, 845–857.
30. Langfelder, P.; Horvath, S. Fast R functions for robust correlations and hierarchical clustering. J. Stat. Softw. 2012, 46, 1–17.
31. Jaskowiak, P.; Campello, R.G.B.; Costa, I. Evaluating correlation coefficients for clustering gene expression profiles of cancer. In
Advances in Bioinformatics and Computational Biology; de Souto, M., Kann, M., Eds.; Springer: Heidelberg/Berlin, Germany, 2012;
Volume 7409, pp. 120–131.
32. Son, Y.S.; Baek, J. A modified correlation coefficient based similarity measure for clustering time-course gene expression data.
Pattern Recognit. Lett. 2008, 29, 232–242.
33. Hardin, J.S.; Mitani, A.; Hicks, L.; VanKoten, B. A robust measure of correlation between two genes on a microarray. BMC Bioinform.
2007, 8, 220.
34. Ma, S.; Gong, Q.; Bohnert, H.J. An Arabidopsis gene network based on the graphical Gaussian model. Genome Res. 2007, 17,
1614–1625.
35. Elo, L.L.; Lahesmaa, R.; Aittokallio, T. Inference of gene coexpression networks by integrative analysis across microarray
experiments. J. Integr. Bioinform. 2006, 3, 33.
36. Voy, B.H.; Scharff, J.A.; Perkins, A.D.; Saxton, A.M.; Borate, B.; Chesler, E.J.; Branstetter, L.K.; Langston, M.A. Extracting gene
networks for low-dose radiation using graph theoretical algorithms. PLoS Comput. Biol. 2006, 2, e89.
37. Zhu, D.; Hero, A.O.; Cheng, H.; Khanna, R.; Swaroop, A. Network constrained clustering for gene microarray data. Bioinformatics
2005, 21, 4014–4020.
38. Xu, W.; Hou, Y.; Hung, Y.S.; Zou, Y. A comparative analysis of Spearman rho and Kendall tau in normal and contaminated normal
models. Signal Process. 2013, 93, 261–276.
39. Croux, C.; Dehon, C. Influence functions of the Spearman and Kendall correlation measures. Stat. Methods Appl. 2010, 19, 497–515.
40. Maronna, R.A.; Martin, D.R.; Yohai, V.J. Robust Statistics: Theory and Methods; Wiley: New York, NY, USA, 2006.
41. Kendall, M.G. A new measure of rank correlation. Biometrika 1938, 1, 81–93.
42. Kendall, M.G.; Gibbons, J.D. Rank Correlation Methods. A Charles Griffin Book; E. Arnold: London, UK, 1990.
43. Blomqvist, N. On a measure of dependence between two random variables. Ann. Math. Stat. 1950, 21, 593–600.
44. Spearman, C. The proof and measurement of association between two things. Am. J. Psychol. 1904, 15, 72–101.
45. Lee, A.J. U-Statistics: Theory and Practice; Routledge: Abingdon, UK, 2019.
46. Andrews, G.E.; Askey, R.; Roy, R. Special Functions. Encyclopedia of Mathematics and its Applications; Cambridge University Press:
Cambridge, UK, 1999; Volume 71.
47. Hotelling, H. New light on the correlation coefficient and its transformation. J. Royal Stat. Soc. B 1953, 15, 193–232.
48. Fisher, R.A. On the probable error of a coefficient of correlation deduced from a small sample. Metron 1921, 1, 3–32.
49. David, F.N.; Mallows, C.L. The variance of Spearman rho in normal samples. Biometrika 1961, 48, 19–28.
50. Serfling, R.J. Approximation Theorems of Mathematical Statistics; Wiley: Hoboken, NJ, USA, 1981.
51. Butte, A.J.; Kohane, I.S. Mutual information relevance networks: Functional genomic clustering using pairwise entropy
measurements. Pac. Symp. Biocomput. 2000, 5, 415–426.
52. Butte, A.J.; Kohane, I.S. Unsupervised knowledge discovery in medical databases using relevance networks. Proc. AMIA Symp.
1999, 711–715.
53. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2023.
54. Sanchez, L.; Leiva, V.; Galea, M.; Saulo, H. Birnbaum-Saunders quantile regression and its diagnostics with application to economic
data. Appl. Stoch. Model. Bus. Ind. 2021, 37, 53–73.
55. Deng, D.; Chowdhury, M.H. Quantile regression approach for analyzing similarity of gene expressions under multiple biological
conditions. Stats 2022, 5, 583–605.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.