Article
Symmetry and Complexity in Gene Association Networks Using
the Generalized Correlation Coefficient
Raydonal Ospina 1,2 , Cleber M. Xavier 3 , Gustavo H. Esteves 4 , Patrícia L. Espinheira 1,2 , Cecilia Castro 5, *
and Víctor Leiva 6
Abstract: High-dimensional gene expression data pose challenges for traditional statistical tools, particularly when dealing with non-linear relationships and outliers. The present study addresses these challenges by employing a generalized correlation coefficient (GCC) that incorporates a flexibility parameter, allowing it to adapt to varying levels of symmetry and asymmetry in the data distribution. This adaptability is crucial for analyzing gene association networks, where the GCC demonstrates advantages over traditional measures such as the Kendall, Pearson, and Spearman coefficients. We introduce two novel adaptations of this metric, enhancing its precision and broadening its applicability in the context of complex gene interactions. By applying the GCC to relevance networks, we show how different levels of the flexibility parameter reveal distinct patterns in gene interactions, capturing both linear and non-linear relationships. The maximum likelihood and Spearman-based estimators of the GCC offer a refined approach for disentangling the complexity of biological networks, with potential implications for precision medicine. Our methodology provides a powerful tool for constructing and interpreting relevance networks in biomedicine, supporting advancements in the understanding of biological interactions and healthcare research.
Keywords: asymmetry; bioinformatics; gene expression analysis; high-dimensional data; non-linear associations; robust statistical methods
Citation: Ospina, R.; Xavier, C.M.; Esteves, G.H.; Espinheira, P.L.; Castro, C.; Leiva, V. Symmetry and Complexity in Gene Association Networks Using the Generalized Correlation Coefficient. Symmetry 2024, 16, 1510. https://doi.org/10.3390/sym16111510
where Cov(X, Y) is the covariance, $\sigma_X^2 = \mathrm{Var}[X]$ and $\sigma_Y^2 = \mathrm{Var}[Y]$ are the variances, and
$\mu_X = E[X]$ and $\mu_Y = E[Y]$ are the expected values of X and Y, respectively. This
coefficient quantifies the linear dependence between the two variables X and Y.
The sample Pearson correlation coefficient is stated as
$$ r_P = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}, \tag{1} $$
where $x_i$, $y_i$ are the sample observations and x̄, ȳ are the sample means of X and Y,
respectively. If X and Y follow a bivariate normal distribution, denoted as
$(X, Y) \sim N_2(\mu_X, \mu_Y; \sigma_X^2, \sigma_Y^2; \rho)$, then the sample correlation coefficient presented in (1) is the
maximum likelihood (ML) estimate of the population Pearson correlation coefficient ρ. This
estimator is consistent, meaning that as the sample size n increases, $r_P$ converges in
probability to the population correlation ρ.
However, when the assumption of bivariate normality is violated, several issues
arise. Outliers can disproportionately influence ρ, leading to misleading conclusions
about the relationship between X and Y. Furthermore, in the presence of non-linear
relationships, ρ may underestimate the true strength of association, as it captures only
linear dependencies [38,39]. Non-parametric alternative measures, such as the Kendall tau
and Spearman rank correlation, are less sensitive to outliers and better capture monotonic
relationships, making them more robust in such scenarios [40]. These measures provide
more reliable estimates and reflect the true relationships in the data, particularly when
dealing with biological variables that exhibit complex dependencies.
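The effect of a single outlier on the Pearson coefficient, compared with a rank-based measure, can be illustrated with a small sketch. The toy data and the helper names `pearson`, `ranks`, and `spearman` are illustrative assumptions, not taken from the study; ties are not handled, and the Spearman coefficient is computed with the usual rank-difference formula.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation from centered sums of products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks 1..n of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r

def spearman(x, y):
    """Spearman coefficient via 1 - 6 * sum(d_i^2) / (n (n^2 - 1)), no ties."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Hypothetical data: a perfect linear trend with one gross outlier in y.
x = list(range(1, 11))
y = list(range(1, 10)) + [-50]

r_p = pearson(x, y)    # pulled below zero by the single outlier
r_s = spearman(x, y)   # remains positive
```

In this toy example one contaminated observation is enough to flip the sign of the Pearson coefficient, while the rank-based coefficient still reports a positive association.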
The Kendall correlation coefficient [41,42] is expressed as
$$ r_K = \frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \mathrm{sign}\big((x_i - x_j)(y_i - y_j)\big), $$
where sign(·) is the sign function, which takes the value 1 for a positive argument, −1 for a
negative argument, and 0 when the argument is zero. Its population version is given by
$$ \rho_K(F) = E_F\big[\mathrm{sign}\big((X_1 - X_2)(Y_1 - Y_2)\big)\big], $$
where $(X_1, Y_1)$ and $(X_2, Y_2)$ are independent copies of (X, Y) with joint CDF F, with $E_F$ representing the expectation with respect to the CDF F. For a bivariate normal
distribution, whose CDF is denoted by $\Phi_2$, the relationship is established as
$$ \rho_K(\Phi_2) = \frac{2}{\pi} \arcsin(\rho), $$
which differs from the Pearson correlation when ρ ≠ 0 [43].
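The sample Kendall coefficient and its arcsine relation to ρ under bivariate normality can be sketched as follows. The function names are illustrative assumptions; `kendall_to_pearson_scale` simply inverts the arcsine relation stated above and is meaningful only under the normal model.

```python
import math

def sign(v):
    """Sign function: 1, -1, or 0."""
    return (v > 0) - (v < 0)

def kendall(x, y):
    """Sample Kendall coefficient: average sign concordance over all pairs i < j."""
    n = len(x)
    total = sum(
        sign((x[i] - x[j]) * (y[i] - y[j]))
        for i in range(n - 1)
        for j in range(i + 1, n)
    )
    return 2.0 * total / (n * (n - 1))

def kendall_under_normal(rho):
    """Population Kendall coefficient for a bivariate normal: (2 / pi) * arcsin(rho)."""
    return 2.0 / math.pi * math.asin(rho)

def kendall_to_pearson_scale(r_k):
    """Invert the arcsine relation so that the value is on the scale of rho."""
    return math.sin(math.pi * r_k / 2.0)

r_k = kendall([1, 2, 3, 4], [1, 3, 2, 4])   # 5 concordant pairs, 1 discordant pair
```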
The Spearman rank correlation coefficient [44] is also less sensitive to outliers, providing
a robust estimate of the correlation. Denote $H(t) = P_F(X \le t)$ and $G(t) = P_F(Y \le t)$
as the marginal CDFs of X and Y, respectively. The Spearman correlation coefficient is
formulated as
$$ \rho_S(F) = \mathrm{Cor}(H(X), G(Y)) = 12\, E_F[H(X)\, G(Y)] - 3, $$
and its sample estimate is given by
$$ r_S = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}, $$
where $d_i$ is the difference between the ranks of $x_i$ and $y_i$.
$$ \rho_\gamma(F) = \frac{\theta_1(F)}{\sqrt{\theta_2(F)\, \theta_3(F)}}, \tag{2} $$
where the parameter γ modulates the degree of similarity between the GCC and traditional
correlation coefficients. Specifically, when γ = 1, ργ ( F ) aligns with the Pearson correlation
coefficient, capturing linear relationships, while for γ = 0, it approximates the Kendall
rank correlation, which is more sensitive to ordinal relationships.
$$ \tilde{\rho}_\gamma = \frac{U_{\gamma,XY}}{\sqrt{U_{\gamma,XX}\, U_{\gamma,YY}}}, \tag{3} $$
where $U_{\gamma,XY}$, $U_{\gamma,XX}$, and $U_{\gamma,YY}$ are U-statistic estimators corresponding to the parameters
θ1(F), θ2(F), and θ3(F), respectively, as defined in (2). These estimators are computed as
$$ U_{\gamma,XY} = \frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} g_\gamma(X_i - X_j)\, g_\gamma(Y_i - Y_j), $$
$$ U_{\gamma,XX} = \frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} g_\gamma^2(X_i - X_j), $$
$$ U_{\gamma,YY} = \frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} g_\gamma^2(Y_i - Y_j). $$
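The ratio in (3) can be sketched in a few lines. The kernel used below, g_γ(v) = sign(v)|v|^γ, is an assumption: its definition is not reproduced in this section, and this form is chosen because it yields the Pearson coefficient at γ = 1 and the Kendall coefficient at γ = 0, as described above. The function names are illustrative.

```python
import math

def g(v, gamma):
    """Assumed kernel g_gamma(v) = sign(v) * |v|**gamma (not defined in this section)."""
    if v == 0:
        return 0.0
    return math.copysign(abs(v) ** gamma, v)

def gcc_u(x, y, gamma):
    """U-statistic estimate of the GCC, following the ratio in (3)."""
    n = len(x)
    u_xy = u_xx = u_yy = 0.0
    for i in range(n - 1):
        for j in range(i + 1, n):
            gx = g(x[i] - x[j], gamma)
            gy = g(y[i] - y[j], gamma)
            u_xy += gx * gy
            u_xx += gx * gx
            u_yy += gy * gy
    # The common factor 2 / (n (n - 1)) cancels in the ratio, so it is omitted.
    return u_xy / math.sqrt(u_xx * u_yy)

x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 2.0, 4.0]
at_kendall_end = gcc_u(x, y, 0.0)   # equals the Kendall coefficient when there are no ties
at_pearson_end = gcc_u(x, y, 1.0)   # equals the sample Pearson coefficient
```

The double loop costs O(n²) operations per gene pair, which is one source of the computational burden that grows with the number of genes.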
By utilizing U-statistics, we offer a refinement based on robust and unbiased means for
estimating ργ ( F ), even in the presence of complex or irregular data. The use of U-statistics
ensures consistency and resilience to non-normal distributions and outliers, making them
particularly suitable for biological datasets, which frequently exhibit such challenges.
Further computational refinements include an explicit formulation for ργ in the case
of a bivariate normal distribution, with CDF denoted as Φ2 [23]. This distribution describes
the joint behavior of two normally distributed random variables with a specified correlation
ρ. The explicit formulation for the GCC in this context is given by
$$ W(\gamma, \rho) = K(\gamma)\, \rho \; {}_2F_1\!\left(\tfrac{1}{2}(1-\gamma), \tfrac{1}{2}(1-\gamma); \tfrac{3}{2}; \rho^2\right), \tag{4} $$
where $K(\gamma) = 2\,(\Gamma(\gamma/2 + 1))^2 / \big(\Gamma(\gamma + 1/2)\sqrt{\pi}\big)$, ${}_2F_1(a, b; c; x)$ represents the Gauss
hypergeometric function, defined through the hypergeometric power series in x [46], and Γ is the gamma function.
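Expression (4) can be evaluated numerically with a truncated hypergeometric series. The sketch below is one possible implementation under the assumption |ρ| < 1 (the series converges slowly near ±1); the function names `hyp2f1`, `K`, and `W` are illustrative. The two endpoint checks use the facts that the series reduces to 1 when γ = 1 and to arcsin(ρ)/ρ when γ = 0.

```python
import math

def hyp2f1(a, b, c, z, tol=1e-14, max_terms=20000):
    """Gauss hypergeometric series sum_k (a)_k (b)_k / ((c)_k k!) z**k, for |z| < 1."""
    term, total = 1.0, 1.0
    for k in range(max_terms):
        term *= (a + k) * (b + k) / ((c + k) * (k + 1)) * z
        total += term
        if abs(term) < tol * abs(total):
            break
    return total

def K(gamma):
    """Normalizing constant K(gamma) in (4)."""
    return 2.0 * math.gamma(gamma / 2.0 + 1.0) ** 2 / (
        math.gamma(gamma + 0.5) * math.sqrt(math.pi)
    )

def W(gamma, rho):
    """GCC under the bivariate normal model as a function of gamma and rho."""
    a = (1.0 - gamma) / 2.0
    return K(gamma) * rho * hyp2f1(a, a, 1.5, rho * rho)

pearson_end = W(1.0, 0.3)   # reduces to rho itself
kendall_end = W(0.0, 0.5)   # reduces to (2 / pi) * arcsin(rho)
```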
Thus, the expression W(γ, ρ) presented in (4) quantifies how the GCC ργ(Φ2) deviates
from the traditional Pearson correlation coefficient ρ when ρ ≠ 0. Specifically, W(γ, ρ)
adjusts the weighting of linear versus non-linear relationships based on the parameter γ.
As γ changes, the GCC adapts to capture different types of dependencies, making it more
versatile than traditional correlation coefficients.
To ensure that the GCC estimator remains Fisher-consistent, meaning that it recovers the
target parameter exactly when evaluated at the assumed model distribution, an inverse
transformation of W(γ, ρ) is applied, keeping γ fixed within the interval [−1, 1], as described in (4).
This transformation guarantees that the estimator adapts to various datasets while preserving statistical consistency.
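One way to carry out such an inverse transformation numerically is bisection, since for γ in [0, 1] the function W(γ, ·) in (4) is increasing in ρ (K(γ) is positive and all series coefficients are non-negative). This is a minimal sketch under that assumption; the text does not state which numerical method the authors used, and the names `W_inverse`, `hyp2f1`, `K`, and `W` are illustrative.

```python
import math

def hyp2f1(a, b, c, z, tol=1e-14, max_terms=20000):
    """Truncated Gauss hypergeometric series, intended for |z| < 1."""
    term, total = 1.0, 1.0
    for k in range(max_terms):
        term *= (a + k) * (b + k) / ((c + k) * (k + 1)) * z
        total += term
        if abs(term) < tol * abs(total):
            break
    return total

def K(gamma):
    return 2.0 * math.gamma(gamma / 2.0 + 1.0) ** 2 / (
        math.gamma(gamma + 0.5) * math.sqrt(math.pi)
    )

def W(gamma, rho):
    a = (1.0 - gamma) / 2.0
    return K(gamma) * rho * hyp2f1(a, a, 1.5, rho * rho)

def W_inverse(gamma, target, lo=-0.999, hi=0.999, iters=60):
    """Solve W(gamma, rho) = target for rho by bisection on [lo, hi]."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if W(gamma, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

rho_back = W_inverse(0.5, W(0.5, 0.7))   # round trip should return about 0.7
```

Targets outside the range of W on the search interval are mapped to the nearest endpoint, so a production version would need an explicit range check.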
Let rQ denote the correlation estimator corresponding to ρQ ( F ). In the normal model
Φ2 , it is established that all considered correlation estimators asymptotically follow a
normal distribution, that is, we have that
$$ \sqrt{n}\,\big(r_Q - \rho_Q(\Phi_2)\big) \to N\big(0, \mathrm{AV}(\rho_Q(\Phi_2), \Phi_2)\big), $$
as demonstrated in [47,48].
$$ \mathrm{AV}(\rho_{K}^{*}(\Phi_2), \Phi_2) = (1 - \rho^2)\left(\frac{\pi^2}{4} - \arcsin^2(\rho)\right), $$
as discussed in [43]. The variances for the Spearman correlation ρS (Φ2 ) and the GCC
ργ (Φ2 ) are detailed in [21,49].
Given the Fisher consistency of the Pearson (rP ) and Spearman (rS ) estimators under
a CDF Φ2 , we apply two Fisher-consistent estimators for ργ (Φ2 ) for a fixed value of γ,
defined as
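$$ w(\gamma, \rho) = \frac{\partial W(\gamma, \rho)}{\partial \rho} = \frac{1}{3}(1-\gamma)^2 K(\gamma)\, \rho^2 \; {}_2F_1\!\left(\tfrac{3-\gamma}{2}, \tfrac{3-\gamma}{2}; \tfrac{5}{2}; \rho^2\right) + K(\gamma)\; {}_2F_1\!\left(\tfrac{1-\gamma}{2}, \tfrac{1-\gamma}{2}; \tfrac{3}{2}; \rho^2\right). $$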
Using the delta method [50], we derive the asymptotic distributions of these estimators
for a fixed value of γ as
$$ \sqrt{n}\,(\hat{\rho}_\gamma - \rho_\gamma) \to N\big(0, w(\gamma, \rho)^2\, \mathrm{AV}(\rho(\Phi_2), \Phi_2)\big) $$
and
$$ \sqrt{n}\,(\bar{\rho}_\gamma - \rho_\gamma) \to N\big(0, w(\gamma, \rho)^2\, \mathrm{AV}(\rho_S(\Phi_2), \Phi_2)\big), $$
confirming that both ρ̂γ and ρ̄γ exhibit asymptotic normality, with their variances scaled
by w(γ, ρ)². This demonstrates the effectiveness of these estimators in approximating ργ
under Φ2.
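As a quick check of the scaling factor, the two endpoint values of γ can be worked out by hand. The sketch below assumes that the ML-based estimator is obtained by applying W(γ, ·) to the sample Pearson coefficient (its defining equation is not reproduced in this section) and uses the standard bivariate normal result AV(ρ(Φ2), Φ2) = (1 − ρ²)². It also uses K(1) = 1, K(0) = 2/π, and the identity ₂F₁(1/2, 1/2; 3/2; ρ²) = arcsin(ρ)/ρ.

```latex
\begin{align*}
\gamma = 1:\quad & W(1,\rho) = \rho, &
  w(1,\rho) &= 1, &
  w(1,\rho)^2 (1-\rho^2)^2 &= (1-\rho^2)^2, \\
\gamma = 0:\quad & W(0,\rho) = \tfrac{2}{\pi}\arcsin(\rho), &
  w(0,\rho) &= \frac{2}{\pi\sqrt{1-\rho^2}}, &
  w(0,\rho)^2 (1-\rho^2)^2 &= \frac{4}{\pi^2}\,(1-\rho^2).
\end{align*}
```

For example, at ρ = 0.5 the asymptotic variance is 0.5625 for γ = 1 and approximately 0.304 for γ = 0, each on the scale of the corresponding target ργ.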
In summary, these refinements reinforce the theoretical foundations of the GCC,
particularly for high-dimensional biological data. By ensuring Fisher consistency and
leveraging robust statistical methods, the proposed estimators provide precise and reliable
correlation analyses, which are critical in fields such as epidemiology, genomics, and health
sciences. Having established the theoretical and practical foundations for the GCC, we
proceed to evaluate its performance through a comprehensive simulation study.
3. Simulation Study
This section evaluates the performance of the GCC estimators under various simulated
scenarios, providing empirical evidence of their efficacy and robustness in analyzing
complex biological data, particularly gene expression profiles with prevalent non-linear
dependencies and asymmetries.
• Case 2: Bivariate normal distribution with shifted means. To assess the robustness
of the estimators to location shifts, we generate samples from a bivariate normal
distribution N2(−0.5, 0.5; 1, 1; ρ), using the same correlation coefficients
ρ ∈ {0, 0.3, 0.9}. This case evaluates the effect of mean shifts on the
estimator performance.
• Case 3: Bivariate normal distribution with increased variance. To investigate the
impact of increased variability, we generate samples from the distribution
$N_2(0, 0; \sigma_X^2 = 4, \sigma_Y^2 = 4; \rho)$, with variances four times greater than in the previous cases and the
correlation coefficients remaining as ρ ∈ {0, 0.3, 0.9}. This case simulates scenarios
with high variability in biological data.
• Case 4: Contaminated bivariate normal distribution. In this case, we create a mixture
consisting of 60% of a bivariate normal distribution with high correlation (ρ = 0.9)
and 40% of a bivariate normal distribution with no correlation (ρ = 0). The mixture
proportions considered are 60% and 40%, with correlation coefficients ρ ∈ {0.1, 0.5, 0.9}.
This case evaluates the performance of the estimators in the presence of heterogeneous
subpopulations with varying correlation patterns.
• Case 5: Mixture of bivariate normal distributions. To simulate heterogeneous data
commonly observed in gene expression analysis, we generate samples from a mixture
of two bivariate normal distributions with different means and/or covariances. The
mixture proportions considered are 10%, 30%, and 50%, with a weak correlation
ρ = 0.1. This case evaluates the performance of the estimators when data arise from
different subpopulations with distinct correlation patterns.
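The sampling schemes in these cases can be sketched with the standard two-variable construction of a bivariate normal. The function names and the seed are illustrative assumptions; for the mixture, zero means and unit variances are assumed because the text does not state the component means and variances.

```python
import math
import random

def rbvnorm(n, mu_x, mu_y, var_x, var_y, rho, rng):
    """Draw n pairs from N2(mu_x, mu_y; var_x, var_y; rho)."""
    sx, sy = math.sqrt(var_x), math.sqrt(var_y)
    xs, ys = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        xs.append(mu_x + sx * z1)
        # rho * z1 + sqrt(1 - rho^2) * z2 has unit variance and correlation rho with z1
        ys.append(mu_y + sy * (rho * z1 + math.sqrt(1.0 - rho * rho) * z2))
    return xs, ys

def rmixture(n, weight, comp_a, comp_b, rng):
    """Each pair comes from comp_a with probability `weight`, otherwise from comp_b."""
    xs, ys = [], []
    for _ in range(n):
        params = comp_a if rng.random() < weight else comp_b
        x, y = rbvnorm(1, *params, rng)
        xs.append(x[0])
        ys.append(y[0])
    return xs, ys

rng = random.Random(2024)
case2 = rbvnorm(100, -0.5, 0.5, 1.0, 1.0, 0.3, rng)   # shifted means
case3 = rbvnorm(100, 0.0, 0.0, 4.0, 4.0, 0.3, rng)    # variances equal to 4
# Mixture sketch: 60% with rho = 0.9 and 40% with rho = 0 (assumed component parameters).
case4 = rmixture(100, 0.6, (0.0, 0.0, 1.0, 1.0, 0.9), (0.0, 0.0, 1.0, 1.0, 0.0), rng)
```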
For each of the five cases, we evaluate the performance of the following estimators:
• GCC estimator based on U-statistics (ρ̃γ), denoted GCC-U, as defined in (3).
• GCC estimator based on ML (ρ̂γ), denoted GCC-ML, as stated in (5).
• Adjusted Spearman rank correlation coefficient (ρ̄γ), denoted adjusted Spearman, as
presented in (6).
The contamination scenarios were specifically chosen to reflect conditions frequently
observed in biological data, such as those in molecular biology and epidemiology. These
scenarios generate asymmetries and provide a comprehensive representation of real-world
challenges, simulating outliers and heavy-tailed distributions. Additional contamination
settings were deemed unnecessary, as they would not contribute further insights beyond
what is already observed under the tested conditions.
We conducted 5000 Monte Carlo replicates for each simulation scenario to ensure
robust results and to provide reliable estimates of the behavior of the estimators. The sample
sizes considered were n ∈ {10, 50, 100, 250, 500}, chosen to evaluate the performance of the
estimators in both small-sample and large-sample settings. These sample sizes allowed us
to assess the consistency and convergence properties of each estimator as n increases.
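The Monte Carlo evaluation of an estimator by RMSE can be sketched as below. This is an illustration only: the sample Pearson coefficient stands in for the estimators under study (its target is ρ itself, which corresponds to γ = 1), the data are standard bivariate normal, and 500 replicates are used to keep the example fast rather than the 5000 reported in the text. All function names are assumptions.

```python
import math
import random

def pearson(x, y):
    """Sample Pearson correlation, used here as a stand-in estimator."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def simulate(n, rho, rng):
    """n pairs from a standard bivariate normal with correlation rho."""
    xs, ys = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        xs.append(z1)
        ys.append(rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
    return xs, ys

def mc_rmse(estimator, target, n, rho, replicates, seed=0):
    """Root mean squared error of `estimator` around `target` over the replicates."""
    rng = random.Random(seed)
    sq_err = 0.0
    for _ in range(replicates):
        xs, ys = simulate(n, rho, rng)
        sq_err += (estimator(xs, ys) - target) ** 2
    return math.sqrt(sq_err / replicates)

results = {n: mc_rmse(pearson, 0.3, n, 0.3, replicates=500) for n in (10, 50, 100)}
```

Replacing `pearson` and `target` with a GCC estimator and the corresponding value of ργ would give the kind of RMSE entries reported in the tables.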
As discussed earlier, we evaluated the influence of the flexibility parameter γ at three
key values, γ ∈ {0, 0.5, 1}, which capture a range of behaviors from rank-based to
linear dependencies. By varying both the sample size and the parameter γ, we assessed the
performance of the estimators across different data complexities, including varying levels of
correlation, non-linear dependencies, and robustness to outliers. This comprehensive evalu-
ation provides valuable insights into how the estimators perform under diverse conditions
typically encountered in the analysis of biological data, such as gene expression profiles.
Table 1. RMSE of the estimators for Case 1, with the indicated values of ρ, γ, and n.
for a wide range of scenarios. The choice of γ should be based on the underlying
correlation structure and desired sensitivity to linear or non-linear dependencies.
• Case 2—Bivariate normal distribution with shifted means. In this case, we evaluate
the robustness of the estimators when data are drawn from a bivariate normal distri-
bution with shifted means, reflecting deviations commonly encountered in real-world
datasets, such as gene expression profiles. Specifically, samples were generated from
a bivariate normal distribution N2 (µ X = −0.5, µY = 0.5; σX2 = 1, σY2 = 1; ρ) while
maintaining the same correlation levels as in Case 1, that is, ρ ∈ {0, 0.3, 0.9}. The shift
in means introduces an additional layer of complexity, testing the ability of the esti-
mators to adapt to changes in location. Although the variances remain constant, the
altered central tendency requires the estimators to perform effectively under different
distributional settings. RMSE values for each estimator are presented in Table 2.
Table 2. RMSE of the estimators for Case 2, with the indicated values of ρ, γ, and n.
Table 3. RMSE of the estimators for Case 3, with the indicated values of ρ, γ, and n.
Table 4. RMSE of the estimators for Case 4, with the indicated values of ρ, γ, and n.
Table 5. RMSE of the estimators for Case 5 with contamination levels of 10%, 30%, and 50%, and
ρ = 0.1, with the indicated values of γ and n.
The insights gained from the simulation study demonstrate the varied performance
of the proposed estimators across different conditions of correlation, contamination,
variance, and sample size. Overall, the ML-based estimator ρ̂γ consistently outperformed
the other estimators in terms of accuracy, particularly for small to moderate sample sizes.
Its robustness to different correlation levels and contamination was evident, although it
showed slight sensitivity to high levels of contamination and extreme values for large γ.
As sample sizes increased, the differences in performance between GCC-ML and the other
estimators diminished, but the ML-based estimator maintained its advantage in terms of
lower RMSE values.
The U-statistics-based estimator ρ̃γ , while showing more variability in certain cases—
particularly in small sample sizes or under shifts in location (as seen in Case 2)—improved
as sample sizes increased. However, the GCC-U estimator had a tendency to overestimate
ργ for small values of γ and low correlation levels, and it was more sensitive to contam-
ination, particularly when γ = 0. This sensitivity reflects the rank-based nature of the
estimator, which is more impacted by outliers and distribution shifts.
The adjusted Spearman estimator ρ̄γ exhibited strong robustness to contamination,
consistently producing low RMSE values in cases with moderate contamination levels.
However, it tended to underestimate ργ , especially at moderate correlation levels. Despite
this, the adjusted Spearman estimator performed stably across different contamination and
correlation levels, making it a reliable option for highly contaminated datasets.
The flexibility parameter γ played a crucial role in the behavior of all estimators.
Low values of γ, particularly γ = 0 (similar to the Kendall tau), offered more robustness
against contamination and outliers, while high values of γ (closer to the Pearson correlation)
performed better at capturing linear relationships in uncontaminated datasets. The inter-
mediate value of γ = 0.5 balanced sensitivity to both linear and non-linear dependencies
effectively, providing a flexible approach to varying data complexities.
Across all scenarios, the impact of sample size was clear: as the sample size increased,
all estimators showed improved performance, with reduced RMSE values and better
convergence toward the true value of ργ . The performance gap between the estimators was
more noticeable for small sample sizes (n = 10 and n = 50), but this gap narrowed as the
sample sizes grew (n = 250 and n = 500). Notably, the ML-based estimator demonstrated
the most rapid convergence, particularly in large sample sizes.
In summary, the GCC-ML estimator provided the most reliable and accurate perfor-
mance across diverse scenarios, with the choice of the flexibility parameter γ influencing
the estimators’ behavior. Low values of γ favored robustness, particularly in contami-
nated datasets, while high γ values excelled at capturing linear dependencies. Therefore,
this study highlights the importance of selecting an appropriate value of γ based on the
underlying data structure to optimize estimator performance.
The simulation results confirmed the capacity of the GCC to handle non-linear dependencies,
high-dimensional data, and contamination, all of which are common challenges in the
analysis of gene expression studies. These strengths position the GCC as a valuable tool for
constructing and analyzing RNs, crucial for identifying complex interactions and potential
biomarkers in biological systems. With these findings, we now transition to the practical
application of these methods in constructing RNs.
Figure 1. Histogram representing the data distribution, with a kernel density estimate and an overlay
of the normal distribution curve.
To address this limitation, we applied the GCC with varying values of γ, expanding
the analysis to capture a broader spectrum of associations within the biological system.
By adjusting γ, we generated networks that capture a broad range of gene interactions,
providing a more comprehensive understanding.
The diverse networks offer strong candidates for further biological validation and may
uncover deeper insights into the underlying molecular mechanisms. The parameter γ was
set at multiple levels in constructing the RNs, that is, γ ∈ {1, 0.86, 0.71, 0.57, 0.43, 0.29, 0.14, 0},
following guidelines from previous studies [52]. A threshold criterion of $r_P^{\prime\,2} > 0.5$ was used
to identify subgraphs, defining the RNs.
Figures 2–11 present a detailed illustration of the RNs derived using the GCC for
various values of γ, as well as networks constructed using the Spearman correlation
coefficient. In these figures, green edges represent negative correlations, while red edges
represent positive correlations.
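The construction of a relevance network by thresholding pairwise correlations can be sketched as follows. This is a simplified illustration: the cutoff is applied to the absolute correlation, as in the figure captions, the Pearson coefficient stands in for the GCC (the two coincide at γ = 1), and the gene labels and expression values are hypothetical placeholders.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Sample Pearson correlation, standing in for the GCC at gamma = 1."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def relevance_edges(expression, corr=pearson, threshold=0.5):
    """Keep gene pairs whose absolute correlation exceeds the threshold."""
    edges = []
    for a, b in combinations(sorted(expression), 2):
        r = corr(expression[a], expression[b])
        if abs(r) > threshold:
            color = "red" if r > 0 else "green"   # red: positive, green: negative
            edges.append((a, b, round(r, 4), color))
    return edges

# Hypothetical expression profiles for four placeholder genes.
expression = {
    "G1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "G2": [2.0, 4.0, 6.0, 8.0, 10.5],
    "G3": [5.0, 4.0, 3.0, 2.0, 1.2],
    "G4": [1.0, -1.0, 1.0, -1.0, 1.0],
}
edges = relevance_edges(expression)
```

Passing a different `corr` function, for example a GCC estimator with a fixed γ, yields one network per value of γ.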
As γ decreases, the GCC becomes more selective, isolating the strongest and most robust
correlations. The result is sparser networks with fewer, but potentially more biologically
relevant, connections.
Interestingly, the network structure obtained using the Spearman correlation closely
resembles that derived from the GCC at γ = 0.57. This resemblance arises because the GCC
at γ = 0.57 captures both linear and monotonic relationships, similar to those measured by
the Spearman correlation. These results underscore the flexibility of the GCC in adapting
to different types of dependencies present in gene expression data [35].
[Figure 2 panels: (a) γ = 1 and |ρ| > 0.5; (b) γ = 0.86 and |ρ| > 0.5; (c) γ = 0.71 and |ρ| > 0.5; (d) γ = 0.57 and |ρ| > 0.5; (e) γ = 0.43 and |ρ| > 0.5; (f) γ = 0.29 and |ρ| > 0.5; (g) γ = 0.14 and |ρ| > 0.5; (h) γ = 0 and |ρ| > 0.5; (i) |ρS| > 0.5.]
Figure 2. Relevance network constructed using GCC for different values of γ, ρ, and ρS, where green
edges represent negative correlations, while red edges represent positive correlations.
The impact of varying the parameter γ on network topology is illustrated in Figures 3–11.
In these networks, nodes represent genes, and edges represent high correlations between
gene expression levels, with the correlation coefficients indicated along the edges. Blue
edges represent correlations that have weakened compared to the preceding value of γ,
while violet edges indicate correlations that have remained strong or increased. Through
the analysis of these RNs, we observe that lower values of γ effectively filter out weaker
correlations, allowing the GCC to emphasize the strongest and most biologically relevant
associations. This allows us to focus on the strongest interactions as γ decreases, consistent
with the findings from our earlier simulation study, highlighting the practical utility of
the GCC in biological applications. The flexibility to adjust γ provides a powerful tool for
examining data from multiple perspectives, ensuring that important non-linear or complex
dependencies are captured.
[Network diagram; recovered node labels: PRNP, LRP1, SMG7, KBTBD4, ADORA3, CD99L2, CATSPERG, MEGF11, APBA3, CDH6, CRCP, ITPRIPL2, GOLGA3.]
Figure 3. Gene interaction network using the GCC with γ = 1 and |ρ| > 0.5, where nodes represent
genes and edges represent high correlations between gene expression levels, with correlation
coefficients indicated.
[Network diagram; recovered node labels: PRNP, LRP1, SMG7, KBTBD4, ADORA3, CD99L2, CATSPERG, MEGF11, APBA3, CDH6, CRCP, ITPRIPL2, GOLGA3, ATP11B.]
Figure 4. Gene interaction network using the GCC with γ = 0.86 and |ρ| > 0.5, where blue edges
represent correlations that have weakened compared to those with the previous value (γ = 1), while
violet edges indicate correlations that have remained strong or increased.
[Network diagram; recovered node labels: PRNP, LRP1, SMG7, KBTBD4, ADORA3, CD99L2, CATSPERG, MEGF11, APBA3, CDH6, CRCP, ITPRIPL2, GOLGA3, ATP11B.]
Figure 5. Gene interaction network using the GCC with γ = 0.71 and |ρ| > 0.5, where blue edges
represent correlations that have weakened compared to those with the previous value (γ = 0.86),
while violet edges indicate correlations that have remained strong or increased.
[Network diagram; recovered node labels: PRNP, LRP1, SMG7, KBTBD4, ADORA3, CD99L2, CATSPERG, MEGF11, APBA3, CDH6, CRCP, ITPRIPL2, GOLGA3, ATP11B.]
Figure 6. Gene interaction network using the GCC with γ = 0.57 and |ρ| > 0.5, where blue edges
represent correlations that have weakened compared to those with the previous value (γ = 0.71),
while violet edges indicate correlations that have remained strong or increased.
[Network diagram; recovered node labels: PRNP, LRP1, SMG7, KBTBD4, ADORA3, CD99L2, CATSPERG, MEGF11, APBA3, CDH6, CRCP, ITPRIPL2, GOLGA3, ATP11B.]
Figure 7. Gene interaction network using the GCC with γ = 0.43 and |ρ| > 0.5, where blue edges
represent correlations that have weakened compared to those with the previous value (γ = 0.57),
while violet edges indicate correlations that have remained strong or increased.
[Network diagram; recovered node labels: PRNP, LRP1, SMG7, KBTBD4, ADORA3, CD99L2, CATSPERG, MEGF11, APBA3, CDH6, CRCP, ITPRIPL2, GOLGA3, ATP11B.]
Figure 8. Gene interaction network using the GCC with γ = 0.29 and |ρ| > 0.5, where blue edges
represent correlations that have weakened compared to those with the previous value (γ = 0.43),
while violet edges indicate correlations that have remained strong or increased.
[Network diagram; recovered node labels: PRNP, LRP1, SMG7, CATSPERG, MEGF11, APBA3, CDH6, CRCP, ITPRIPL2, GOLGA3, ATP11B, DNAJA1, TSC1, HDDC3.]
Figure 9. Gene interaction network with correlation coefficients (γ = 0.14 and |ρ| > 0.5), where
blue edges represent correlations that have weakened compared to those in the previous case (with
γ = 0.29), while violet edges indicate correlations that have remained strong or increased.
[Network diagram; recovered node labels: LRP1, SMG7, PRNP, KBTBD4, ADORA3, CD99L2, CATSPERG, MEGF11, CRCP, ITPRIPL2, GOLGA3, DNAJA1, TSC1, HDDC3, ATP11B.]
Figure 10. Gene interaction network using the GCC with γ = 0 and |ρ| > 0.5, where blue edges
represent correlations that have weakened compared to those with the previous value (γ = 0.14).
[Network diagram; recovered node labels: PRNP, LRP1, SMG7, KBTBD4, ADORA3, CD99L2, CATSPERG, MEGF11, APBA3, CDH6, CRCP, ITPRIPL2, GOLGA3.]
Figure 11. Gene interaction network using the Spearman correlation coefficient with |ρS| > 0.5.
When the value of γ is reduced from 1 to 0.86, some correlations slightly decrease in
magnitude, while others increase. For example, the correlation between PRNP and LRP1
changes from −0.5905 to −0.6012, reflecting a subtle increase in absolute value. This occurs
because higher γ values emphasize linear relationships, making non-linear interactions
more noticeable as γ decreases [52]. As γ decreases further from 0.86 to 0.71, approximately
23% of the correlations increase in strength, a reduction compared to the previous step.
For instance, the correlation between KBTBD4 and SMG7 decreases from 0.6861 to
0.6808, highlighting the increasing selectivity of the GCC at low γ values, where it focuses
on stronger correlations. At γ = 0.57, only about 12% of the correlations show an increase in
strength, continuing the trend of filtering out weaker associations. This demonstrates how
the GCC progressively isolates the most robust interactions, prioritizing biologically relevant
connections as γ decreases. Notably, as γ is reduced from 0.29 to 0.14 and eventually to
0, the number of edges in the networks decreases, reflecting fewer correlations surpassing the
threshold of 0.5. This emphasizes the role of the GCC in isolating the strongest interactions,
providing clearer insights into strong gene associations.
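The step-by-step comparison of edges between consecutive values of γ can be sketched as below. The two comparisons use only the coefficient values quoted in the text; the function names and the "new" label for edges absent at the previous value of γ are assumptions made for illustration, since the figure captions describe only blue and violet edges.

```python
def classify_edges(previous, current):
    """Label each edge at the current gamma by comparing |correlation| with the previous gamma."""
    labels = {}
    for edge, r_now in current.items():
        r_before = previous.get(edge)
        if r_before is None:
            labels[edge] = "new"       # assumed label: edge was not present before
        elif abs(r_now) < abs(r_before):
            labels[edge] = "blue"      # weakened
        else:
            labels[edge] = "violet"    # remained strong or increased
    return labels

def share_strengthened(previous, current):
    """Fraction of edges present at both values of gamma whose |correlation| increased."""
    common = [e for e in current if e in previous]
    up = sum(abs(current[e]) > abs(previous[e]) for e in common)
    return up / len(common)

# gamma from 1 to 0.86: PRNP and LRP1, values quoted in the text.
step_a = classify_edges({("LRP1", "PRNP"): -0.5905}, {("LRP1", "PRNP"): -0.6012})
# gamma from 0.86 to 0.71: KBTBD4 and SMG7, values quoted in the text.
step_b = classify_edges({("KBTBD4", "SMG7"): 0.6861}, {("KBTBD4", "SMG7"): 0.6808})
```

Applied to the full edge lists of two consecutive networks, `share_strengthened` gives the kind of percentage reported above for each step of γ.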
The network derived using the Spearman correlation coefficient, shown in Figure 11,
closely resembles the GCC network at γ = 0.57. However, a unique negative correlation
exceeding 0.5 in absolute value between HDDC3 and PRNP is captured by the Spearman coefficient
but is not detected in any GCC configuration. This highlights the ability of the Spearman coefficient to capture
monotonic relationships that may not align with linear models, revealing interactions
that could be overlooked when using only the GCC. These observations underscore the
sensitivity of the GCC to the parameter γ, while the comparison with the Spearman
correlation demonstrates the importance of using multiple correlation measures to capture
a broader spectrum of interactions. This comprehensive approach is essential for fully
understanding the complexity of gene expression data and the underlying biological
processes. Figure 12 provides a visual guide to the methodology employed in constructing
RNs, illustrating the application of the GCC across different values of γ. This flowchart
clarifies the analytical process from data collection to network construction, and serves as a
reference for understanding how different parameter settings affect the results.
[Flowchart; recovered node labels: Begin; Analyze data; Is correlation significant? (yes/no); Adjust parameters; Run Monte Carlo simulations; Construct RNs; Determine clinical implications; End.]
Figure 12. Flowchart of data analysis process with steps from data collection to network construction.
5. Conclusions
Biomedical informatics plays a pivotal role in elucidating molecular interactions,
which are essential for advancing medical diagnostics and therapeutic development. One
of the ongoing challenges is the analysis of high-dimensional gene expression data, where
traditional correlation coefficients, like Pearson and Spearman, often fall short, particularly
when addressing non-linear relationships. This challenge emphasizes the need for more
robust and flexible analytical tools. In this study, we applied the generalized correlation
coefficient, utilizing its flexibility parameter, to analyze gene association networks. The
generalized correlation coefficient provides a versatile tool for capturing complex depen-
dencies in molecular biology data, adapting to different correlation structures. To our
knowledge, this research represents one of the earliest applications of the generalized
correlation coefficient in genomic studies, addressing the shortcomings of conventional
correlation methods in dealing with outliers and deviations from normality.
We introduced computational refinements, including robust estimators based on U-
statistics and Fisher-consistent estimators, supported by advanced techniques such as the
delta method. These improvements enhance both the reliability and the practical utility
of the generalized correlation coefficient when analyzing high-dimensional biological
data, such as gene expression profiles. However, it is important to acknowledge that
the generalized correlation coefficient imposes greater computational demands than
traditional methods, which poses a challenge for large-scale genomic datasets, a
common scenario in modern research. Key findings from our analysis include the following:
• The adaptability of the generalized correlation coefficient to various data complexities,
demonstrating robustness and sensitivity in gene association network analysis.
• The influence of the flexibility parameter on network topology, where low values of
this parameter lead to sparser networks, emphasizing the strongest correlations.
• The detection of unique interactions using the Spearman correlation, not captured by
any configuration of the generalized correlation coefficient, underscoring the impor-
tance of applying multiple correlation measures for comprehensive data analysis.
While the focus of this study has been on gene expression data and relevance networks,
the flexibility of the generalized correlation coefficient allows it to be applied to other types
of biological data that exhibit non-linear dependencies. For instance, protein–protein inter-
action networks and microbiome data, which often involve complex and high-dimensional
relationships, can benefit from the robust correlation measures provided by the generalized
correlation coefficient. This extends the applicability of the method beyond genomics, making it
a valuable tool for broader applications in systems biology and molecular interactions.
Despite the advantages in using the generalized correlation coefficient, its computa-
tional burden presents a practical challenge, particularly when applied to large datasets. Op-
timizing the algorithm’s computational efficiency, or integrating it into high-performance
computing environments, would facilitate its broader use in genomic research. Additionally,
while comprehensive, the dataset employed in this study was based on high-throughput
microarray technology, which has inherent limitations, such as background noise and the
inability to detect novel transcripts or non-coding ribonucleic acids. These limitations may
affect the accuracy of the constructed gene association networks. Moreover, while the
generalized correlation coefficient was applied here to gene expression data, extending
it to other data types, such as those arising in single-cell analysis, represents
a promising avenue for future research. This could help in capturing even more complex
dependencies in high-dimensional biological data, further broadening the applicability of
the generalized correlation coefficient to modern challenges in data analysis.
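To make the computational burden mentioned above concrete, the following sketch assumes the pairwise-difference form of a generalized correlation coefficient in the spirit of [21], in which signed powers of all pairwise differences are accumulated. This form is an assumption of the sketch and may differ from the maximum likelihood and Spearman-based estimators used in this study. The names `gcc_pairwise`, `signed_power`, and `pairwise_terms` are hypothetical.

```python
import math
from itertools import combinations


def signed_power(d, gamma):
    """sign(d) * |d|**gamma, with the convention that 0 maps to 0."""
    if d == 0:
        return 0.0
    return math.copysign(abs(d) ** gamma, d)


def gcc_pairwise(x, y, gamma):
    """Assumed pairwise-difference generalized correlation.

    Under this assumed form, gamma = 1 reproduces the Pearson
    coefficient and gamma = 0 gives a Kendall-type sign correlation.
    The loop visits every pair of observations once.
    """
    num = sx = sy = 0.0
    for i, j in combinations(range(len(x)), 2):
        dx = signed_power(x[i] - x[j], gamma)
        dy = signed_power(y[i] - y[j], gamma)
        num += dx * dy
        sx += dx * dx
        sy += dy * dy
    return num / math.sqrt(sx * sy)


def pairwise_terms(n_genes, n_samples):
    """Number of difference products needed for all gene pairs."""
    return math.comb(n_genes, 2) * math.comb(n_samples, 2)


x = [1, 2, 3, 4, 5]
y = [1, 32, 243, 1024, 3125]  # made-up monotone, non-linear profile

r_pearson_like = gcc_pairwise(x, y, gamma=1.0)
r_kendall_like = gcc_pairwise(x, y, gamma=0.0)
terms_per_gene_pair = pairwise_terms(2, 57)  # 57 samples, one gene pair
```

Under this assumption, each gene pair requires one term per pair of observations, so the work grows quadratically in both the number of samples and the number of genes. This is one plausible reason why efficient implementations or high-performance computing become relevant for large datasets.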
Our empirical analysis was conducted on a subset of 57 normal gastric tissue observations
using the R statistical software, version 4.4.2 [53]. While this subset provided
valuable insights, it may not capture the full spectrum of biological variability. Future
work should validate the application of the generalized correlation coefficient using larger
and more diverse datasets, including different tissue types and pathological conditions, to
ensure the generalizability of the findings.
Another aspect for further study concerns the asymmetries identified in genomic data
distributions, where quantile regression methods [54,55] could be explored. Researchers
might also consider employing other types of asymmetric distributions for the tests under
study. Moreover, utilizing machine learning techniques in genomic data analysis is a
promising avenue for future research.
In conclusion, this study demonstrated the flexibility and strength of the generalized
correlation coefficient for analyzing complex molecular interactions in biomedical infor-
matics. By capturing both linear and non-linear relationships, this coefficient proved to be
effective for researchers working with high-dimensional biological data. Our methodology
expands the statistical toolkit for genomic and biomedical research, with applications in
controlled simulations and real-world datasets. The adaptability of this coefficient
to varying correlation structures and data complexities offers valuable insights into
gene expression dynamics and their implications for precision medicine.
Author Contributions: Conceptualization, R.O., C.M.X., G.H.E., P.L.E., C.C., and V.L.; data curation,
R.O., C.M.X., G.H.E., P.L.E., and C.C.; formal analysis, R.O., C.M.X., G.H.E., P.L.E., C.C., and V.L.;
investigation, R.O., C.M.X., G.H.E., P.L.E., C.C., and V.L.; methodology, R.O., C.M.X., G.H.E., P.L.E.,
C.C., and V.L.; writing—original draft, R.O., C.M.X., G.H.E., and P.L.E.; writing—review and editing,
V.L. and C.C. All authors have read and agreed to the published version of the manuscript.
Funding: This research was supported by the Conselho Nacional de Desenvolvimento Científico
e Tecnológico—CNPq—, No. 303192/2022-4, and Fundação de Amparo a Ciência e Tecnologia
do Estado da Bahia—FAPESB—, No. APP0021/2023 (R.O.); by the Vice-rectorate for Research,
Creation, and Innovation—VINCI—of the Pontificia Universidad Católica de Valparaíso—PUCV—,
Chile, under grants VINCI 039.470/2024—regular research—, VINCI 039.493/2024—interdisciplinary
associative research—, VINCI 039.309/2024—PUCV centenary—, and FONDECYT 1200525 (V.L.)
from the National Agency for Research and Development—ANID—of the Chilean government;
and by Portuguese funds through the CMAT—Research Centre of Mathematics of University of
Minho, Portugal, within projects UIDB/00013/2020 (https://doi.org/10.54499/UIDB/00013/2020,
accessed on 4 November 2024) and UIDP/00013/2020 (https://doi.org/10.54499/UIDP/00013/2020,
accessed on 4 November 2024) (C.C.).
Data Availability Statement: The dataset used for analysis in this study originates from the FAPESP
research project No. 06/03227-2, titled “Gene Expression in Stomach and Esophagus Tumors: From
Biology to Diagnosis”. The data were obtained through a collaboration between State University of
Paraíba and the Sírio-Libanês Hospital in Brazil. The data and codes used in this study are available
on GitHub at https://github.com/Raydonal/SCGeneNetworkGCC (accessed on 4 November 2024).
Please contact the authors for any additional information.
Acknowledgments: The authors would like to thank the editors and anonymous reviewers for their
valuable comments and suggestions, which helped us to improve the quality of this article.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. Cavalcante, T.; Ospina, R.; Leiva, V.; Martin-Barreiro, C.; Cabezas, X. Weibull regression and machine learning survival models:
Methodology, comparison, and application to biomedical data related to cardiac surgery. Biology 2023, 11, 1394.
2. Varuzza, L.; Pereira, C.A.D.B. Significance test for comparing digital gene expression profiles: Partial likelihood application. Chil. J.
Stat. 2010, 1, 91–102.
3. Ospina, R.; Ferreira, A.G.O.; de Oliveira, H.M.; Leiva, V.; Castro, C. On the use of machine learning techniques and non-invasive
indicators for classifying and predicting cardiac disorders. Biomedicines 2023, 11, 2604.
4. Bielińska-Wąż, D.; Wąż, P.; Błaczkowska, A.; Mandrysz, J.; Lass, A.; Gładysz, P.; Karamon, J. Mathematical modeling in
bioinformatics: Application of an alignment-free method combined with principal component analysis. Symmetry 2024, 16, 967.
5. Chicco, D.; Jurman, G. A statistical comparison between Matthews correlation coefficient (MCC), prevalence threshold, and
Fowlkes–Mallows index. J. Biomed. Informat. 2023, 144, 104426.
6. Zhou, K.; Zhang, S.; Wang, Y.; Cohen, K.B.; Kim, J.-D.; Luo, Q.; Yao, X.; Zhou, X.; Xia, J. High-quality gene/disease embedding in a
multi-relational heterogeneous graph after a joint matrix/tensor decomposition. J. Biomed. Informat. 2022, 126, 103973.
7. Ortega-Leon, A.; Gucciardi, A.; Segado-Arenas, A.; Benavente-Fernández, I.; Urda, D.; Turias, I.J. Neurodevelopmental
impairments prediction in premature infants based on clinical data and machine learning techniques. Stats 2024, 7, 685–696.
8. Han, H. Bayesian model averaging and regularized regression as methods for data-driven model exploration, with practical
considerations. Stats 2024, 7, 732–744.
9. Leiva, V.; Corzo, J.; Vergara, M.E.; Ospina, R.; Castro, C. A statistical methodology for evaluating asymmetry after normalization
with application to genomic data. Stats 2024, 7, 967–983.
10. Leiva, V.; Sanhueza, A.; Kelmansky, S.; Martinez, E. On the glog-normal distribution and its association with the gene expression
problem. Comput. Stat. Data Anal. 2009, 53, 1613–1621.
11. Vilca, F.; Rodrigues-Motta, M.; Leiva, V. On a variance stabilizing model and its application to genomic data. J. Appl. Stat. 2013, 40,
2354–2371.
12. Kelmansky, D.; Martinez, E.; Leiva, V. A new variance stabilizing transformation for gene expression data analysis. Stat. Appl.
Genet. Mol. Biol. 2013, 12, 653–666.
13. Wilcox, R. The percentage bend correlation coefficient. Psychometrika 1994, 59, 601–616.
14. Wilcox, R. Inferences based on a skipped correlation coefficient. J. Appl. Stat. 2004, 31, 131–143.
15. Reshef, D.N.; Reshef, Y.A.; Finucane, H.K.; Grossman, S.R.; McVean, G.; Turnbaugh, P.J.; Lander, E.S.; Mitzenmacher, M.; Sabeti,
P.C. Detecting novel associations in large datasets. Science 2011, 334, 1518–1524.
16. Ravindran, U.; Gunavathi, C. A survey on gene expression data analysis using deep learning methods for cancer diagnosis. Prog.
Biophys. Mol. Biol. 2023, 177, 1–13.
17. Masoodi, F.; Quasim, M.; Bukhari, S.; Dixit, S.; Alam, S. (Eds.) Applications of Machine Learning and Deep Learning on Biological Data;
CRC Press: New York, NY, USA, 2023.
18. Rahnenführer, J.; De Bin, R.; Benner, A.; Ambrogi, F.; Lusa, L.; Boulesteix, A.L.; Migliavacca, E. Statistical analysis of high-
dimensional biomedical data: A gentle introduction to analytical goals, common approaches and challenges. BMC Med. 2023,
21, 182.
19. Li, J.J.; Zhou, H.J.; Bickel, P.J.; Tong, X. Dissecting gene expression heterogeneity: Generalized Pearson correlation squares and the
K-lines clustering algorithm. J. Am. Stat. Assoc. 2024, 119, 1–14.
20. Bai, X.; Wang, S.; Zhang, X.; Wang, H. Molecular-memory-induced counter-intuitive noise attenuator in protein polymerization.
Symmetry 2024, 16, 315.
21. Chinchilli, V.M.; Philips, B.R.; Mauger, D.T.; Szefler, S.J. A general class of correlation coefficients for the 2 × 2 crossover design.
Biom. J. 2005, 47, 644–653.
22. McManus, C. Cerebral polymorphisms for lateralisation: Modelling the genetic and phenotypic architectures of multiple functional
modules. Symmetry 2022, 14, 814.
23. Chen, V.Y.J.; Chinchilli, V.M.; Richards, D.S.P. Robustness and monotonicity properties of generalized correlation coefficients. J.
Stat. Plan. Infer. 2011, 141, 924–936.
24. Sanchez, J.D.; Rêgo, J.C.; Ospina, R.; Leiva, V.; Chesneau, C.; Castro, C. Similarity-based predictive models: Sensitivity analysis
and a biological application with multi-attributes. Biology 2023, 12, 959.
25. Alkadya, W.; ElBahnasy, K.; Leiva, V.; Gad, W. Classifying COVID-19 based on amino acids encoding with machine learning
algorithms. Chemom. Intell. Lab. Syst. 2022, 224, 104535.
26. Bustos, N.; Tello, M.; Droppelmann, G.; Garcia, N.; Feijoo, F.; Leiva, V. Machine learning techniques as an efficient alternative
diagnostic tool for COVID-19 cases. Signa Vitae 2022, 18, 23.
27. García-Sancho, M.; Lowe, J. A History of Genomics Across Species, Communities and Projects; Springer: New York, NY, USA, 2023.
28. Tully, J.; Hill, A.; Ahmed, H.; Whitley, R.; Skjellum, A.; Mukhtar, M. Expression-based network biology identifies immune-related
functional modules involved in plant defense. BMC Genom. 2014, 15, 421.
29. Jaskowiak, P.A.; Campello, R.J.G.B.; Costa, I. Proximity measures for clustering gene expression microarray data: A validation
methodology and a comparative analysis. Comput. Biol. Bioinform. IEEE/ACM Trans. 2013, 10, 845–857.
30. Langfelder, P.; Horvath, S. Fast R functions for robust correlations and hierarchical clustering. J. Stat. Softw. 2012, 46, 1–17.
31. Jaskowiak, P.; Campello, R.G.B.; Costa, I. Evaluating correlation coefficients for clustering gene expression profiles of cancer. In
Advances in Bioinformatics and Computational Biology; de Souto, M., Kann, M., Eds.; Springer: Heidelberg/Berlin, Germany, 2012;
Volume 7409, pp. 120–131.
32. Son, Y.S.; Baek, J. A modified correlation coefficient based similarity measure for clustering time-course gene expression data.
Pattern Recognit. Lett. 2008, 29, 232–242.
33. Hardin, J.S.; Mitani, A.; Hicks, L.; VanKoten, B. A robust measure of correlation between two genes on a microarray. BMC Bioinform.
2007, 8, 220.
34. Ma, S.; Gong, Q.; Bohnert, H.J. An Arabidopsis gene network based on the graphical Gaussian model. Genome Res. 2007, 17,
1614–1625.
35. Elo, L.L.; Lahesmaa, R.; Aittokallio, T. Inference of gene coexpression networks by integrative analysis across microarray
experiments. J. Integr. Bioinform. 2006, 3, 33.
36. Voy, B.H.; Scharff, J.A.; Perkins, A.D.; Saxton, A.M.; Borate, B.; Chesler, E.J.; Branstetter, L.K.; Langston, M.A. Extracting gene
networks for low-dose radiation using graph theoretical algorithms. PLoS Comput. Biol. 2006, 2, e89.
37. Zhu, D.; Hero, A.O.; Cheng, H.; Khanna, R.; Swaroop, A. Network constrained clustering for gene microarray data. Bioinformatics
2005, 21, 4014–4020.
38. Xu, W.; Hou, Y.; Hung, Y.S.; Zou, Y. A comparative analysis of Spearman rho and Kendall tau in normal and contaminated normal
models. Signal Process. 2013, 93, 261–276.
39. Croux, C.; Dehon, C. Influence functions of the Spearman and Kendall correlation measures. Stat. Methods Appl. 2010, 19, 497–515.
40. Maronna, R.A.; Martin, D.R.; Yohai, V.J. Robust Statistics: Theory and Methods; Wiley: New York, NY, USA, 2006.
41. Kendall, M.G. A new measure of rank correlation. Biometrika 1938, 1, 81–93.
42. Kendall, M.G.; Gibbons, J.D. Rank Correlation Methods. A Charles Griffin Book; E. Arnold: London, UK, 1990.
43. Blomqvist, N. On a measure of dependence between two random variables. Ann. Math. Stat. 1950, 21, 593–600.
44. Spearman, C. The proof and measurement of association between two things. Am. J. Psychol. 1904, 15, 72–101.
45. Lee, A.J. U-Statistics: Theory and Practice; Routledge: Abingdon, UK, 2019.
46. Andrews, G.E.; Askey, R.; Roy, R. Special Functions. Encyclopedia of Mathematics and its Applications; Cambridge University Press:
Cambridge, UK, 1999; Volume 71.
47. Hotelling, H. New light on the correlation coefficient and its transformation. J. Royal Stat. Soc. B 1953, 15, 193–232.
48. Fisher, R.A. On the probable error of a coefficient of correlation deduced from a small sample. Metron 1921, 1, 3–32.
49. David, F.N.; Mallows, C.L. The variance of Spearman rho in normal samples. Biometrika 1961, 48, 19–28.
50. Serfling, R.J. Approximation Theorems of Mathematical Statistics; Wiley: Hoboken, NJ, USA, 1981.
51. Butte, A.J.; Kohane, I.S. Mutual information relevance networks: Functional genomic clustering using pairwise entropy
measurements. Pac. Symp. Biocomput. 2000, 5, 415–426.
52. Butte, A.J.; Kohane, I.S. Unsupervised knowledge discovery in medical databases using relevance networks. Proc. AMIA Symp.
1999, 711–715.
53. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2023.
54. Sanchez, L.; Leiva, V.; Galea, M.; Saulo, H. Birnbaum-Saunders quantile regression and its diagnostics with application to economic
data. Appl. Stoch. Model. Bus. Ind. 2021, 37, 53–73.
55. Deng, D.; Chowdhury, M.H. Quantile regression approach for analyzing similarity of gene expressions under multiple biological
conditions. Stats 2022, 5, 583–605.