Data 04 00081 v3
Article
Graph Theoretic and Pearson Correlation-Based
Discovery of Network Biomarkers for Cancer
Raihanul Bari Tanvir, Tasmia Aqila, Mona Maharjan, Abdullah Al Mamun and
Ananda Mohan Mondal *
School of Computing and Information Sciences, Florida International University, Miami, FL 33199, USA;
[email protected] (R.B.T.); [email protected] (T.A.); [email protected] (M.M.); [email protected] (A.A.M.)
* Correspondence: [email protected]
Received: 16 April 2019; Accepted: 3 June 2019; Published: 5 June 2019
Abstract: Two graph theoretic concepts—clique and bipartite graphs—are explored to identify the
network biomarkers for cancer at the gene network level. The rationale is that a group of genes
works together by forming a cluster or a clique-like structure to initiate a cancer. After initiation,
the disease signal goes to the next group of genes related to the second stage of a cancer, which can be
represented as a bipartite graph. In other words, bipartite graphs represent the cross-talk among the
genes between two disease stages. To prove this hypothesis, gene expression values for three cancers—
breast invasive carcinoma (BRCA), colorectal adenocarcinoma (COAD) and glioblastoma multiforme
(GBM)—are used for analysis. First, a co-expression gene network is generated with highly correlated
gene pairs with a Pearson correlation coefficient ≥ 0.9. Second, clique structures of all sizes are isolated
from the co-expression network. Then combining these cliques, three different biomarker modules
are developed—maximal clique-like modules, 2-clique-1-bipartite modules, and 3-clique-2-bipartite
modules. The list of biomarker genes discovered from these network modules is validated as the
essential genes for causing a cancer in terms of network properties and survival analysis. This list
of biomarker genes will help biologists to design wet lab experiments for further elucidating the
complex mechanism of cancer.
Keywords: bipartite graph; clique; network biomarker; Pearson correlation coefficient (PCC); gene
co-expression network
1. Introduction
The present work is motivated by the prospective applications of protein-protein interaction (PPI)
networks to diseases and other dynamic processes. Ideker and Sharan [1] enumerated four different
applications of protein networks to diseases: i) identifying new disease genes, ii) studying the network
properties of disease genes, iii) classifying diseases based on protein network, and iv) identifying
disease-related subnetworks. Genome-wide PPI networks come with rich information about the
dynamic processes such as the behavior of genetic networks in response to DNA damage [2] and
exposure to arsenic [3], the prediction of protein function [4], genetic interaction [5], protein subcellular
localization [6–11], the process of aging [12], and protein network biomarkers [13–15].
One widely used approach for elucidating disease biomarkers relies on protein-protein
interaction (PPI) or gene co-expression networks, based on the “guilt by association” concept. In a gene
co-expression network, nodes represent the genes and edges represent the connection between genes
due to significantly similar expression patterns over different samples. Several methods exist for
inferring edges in gene networks. Pearson correlation is one of the most common co-expression
measures employed in various studies [16,17]. Another common method, mutual information (MI) [18],
is an information-theoretic measure of nonlinear relationships between genes or other
variables. A threshold is applied after constructing the co-expression network to retain the most
biologically significant correlations between genes.
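For reference, the Pearson correlation coefficient mentioned above can be written out explicitly. The notation below ($x_{gs}$ for the expression of gene $g$ in sample $s$, $n$ for the number of samples, and $\bar{x}_g$ for the mean expression of gene $g$) is introduced here for illustration and is not taken from the paper.

```latex
r_{gh} = \frac{\sum_{s=1}^{n} (x_{gs} - \bar{x}_g)(x_{hs} - \bar{x}_h)}
              {\sqrt{\sum_{s=1}^{n} (x_{gs} - \bar{x}_g)^2}\,
               \sqrt{\sum_{s=1}^{n} (x_{hs} - \bar{x}_h)^2}},
\qquad -1 \le r_{gh} \le 1
```

Under this definition, an edge is drawn between genes $g$ and $h$ when $r_{gh}$ meets the chosen cutoff, which is $r_{gh} \ge 0.9$ in this study.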
The main purpose of analyzing gene co-expression networks is to identify biologically
significant modules consisting of groups of genes with dense interactions. Usually, highly connected groups
have a higher within-group homogeneity and can be considered as biologically significant modules
performing a common task, such as shared regulatory inputs or functional pathways. Clustering is
a popular method for finding relevant modules from gene co-expression networks. Weighted Gene
Correlation Network Analysis (WGCNA) is the most widely used package for module finding [19]
which applies hierarchical clustering to find modules. It applies a soft threshold during construction of
a gene co-expression network. Several researchers have identified key differentially expressed genes
associated with different cancers, such as breast, cervical, colon, esophageal, osteosarcoma and ovarian
cancers [20–26], using WGCNA.
Lui et al. [27] used a differential entropy technique to identify key genes in diabetes from rat
time-series gene expression data from case and control samples. Guan et al. [28] developed a prediction
model using the Bayes discriminant method to predict the prognosis of hepatocellular carcinoma based on
a gene co-expression network.
Graph theoretic methods are also applied to the analysis of gene co-expression networks. Shi et al. [29]
proposed an algorithm named iterative clique enumeration (ICE) to discover relatively
independent maximal cliques for breast cancer on a GEO dataset and found some highly correlated
modules that may indicate the tumor grades. Similarly, Perkins et al. used spectral graph theory on
Homo sapiens and Saccharomyces cerevisiae microarray data for clustering at various thresholds [30].
Zhang et al. [31] discovered the top five hub genes for bladder cancer using the centrality analysis method.
None of the previous studies used a combination of cliques and bipartite graphs to identify biologically
significant modules. The main goal of this paper is to explore the existence of clique-bipartite-like
network modules in actual gene networks for cancer. Mondal et al. [32] showed that clique-like
structures and bipartite graphs could be the building blocks for disease progression (Figure 2 in [32]).
The rationale is that a group of proteins or genes works together by forming a network (a clique-like
structure) to accomplish a specific function, which could be related to a disease stage [32], and a bipartite
structure represents the cross-talk among genes between two disease stages.
In this study, a gene co-expression network was constructed using highly correlated gene pairs
with PCC ≥ 0.9. Three network modules (maximal clique-like graph, 2-clique-1-bipartite graph,
and 3-clique-2-bipartite graph) were identified. Finally, the effectiveness of the key genes discovered
from these network modules was validated using pathway and survival analyses.
2. Results
Three different types of cancers—breast invasive carcinoma (BRCA), colorectal adenocarcinoma
(COAD), and glioblastoma multiforme (GBM)—are considered in the present study to identify network
biomarkers. Gene correlation networks based on gene expression profiles of BRCA (20,155 genes for
1093 samples), COAD (19,828 genes for 379 samples), and GBM (19,660 genes for 153 samples) are
developed with highly correlated gene pairs (PCC ≥ 0.9). From these networks, three types of gene network
modules, considered as network biomarkers, are isolated: i) a single clique-like module based on maximal
cliques, named the “maximal clique-like” module; ii) clique-bipartite-like modules with two cliques and
one bipartite graph, named “2-clique-1-bipartite” modules, which are selected as the two cliques
connected by the maximum number of inter-clique connections; and iii) clique-bipartite-like modules with
three cliques (A, B, C) and two bipartite graphs (A-B and B-C), named “3-clique-2-bipartite” modules,
which are selected as those whose two bipartite graphs have relatively more edges compared to others.
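The module definitions above can be sketched in a few lines of code. The following is a minimal illustration, not the authors' implementation: it assumes the co-expression network is given as a list of undirected edges, enumerates maximal cliques with the Bron-Kerbosch algorithm (with pivoting), and selects the pair of cliques joined by the most inter-clique edges. The function names, the `min_size` parameter, and the requirement that the two cliques be disjoint are assumptions made for the example.

```python
from itertools import combinations


def build_adjacency(edges):
    """Return an undirected adjacency map {node: set(neighbors)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def maximal_cliques(adj):
    """Enumerate all maximal cliques (Bron-Kerbosch with pivoting)."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        # Pivot on the vertex with the most neighbors in p to prune branches.
        pivot = max(p | x, key=lambda n: len(adj[n] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def inter_clique_edges(adj, a, b):
    """Count edges with one endpoint in clique a and the other in clique b."""
    return sum(1 for u in a for v in b if v in adj[u])


def best_two_clique_module(adj, min_size=3):
    """Return (clique_a, clique_b, n_edges) for the best-connected disjoint pair."""
    candidates = [c for c in maximal_cliques(adj) if len(c) >= min_size]
    best = None
    for a, b in combinations(candidates, 2):
        if a & b:  # assumption: the two cliques must not share genes
            continue
        n = inter_clique_edges(adj, a, b)
        if n > 0 and (best is None or n > best[2]):
            best = (a, b, n)
    return best


# Toy network: two triangles joined by two cross-talk edges.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"),
             ("d", "e"), ("d", "f"), ("e", "f"),
             ("a", "d"), ("b", "e")]
toy_adj = build_adjacency(toy_edges)
print(best_two_clique_module(toy_adj))
```

On the toy network the two triangles are returned together with their two connecting edges. Extending the same idea to a chain of three cliques (A-B and B-C) would give a 3-clique-2-bipartite candidate.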
This section is organized into the following subsections: Section 2.1, topology
of the gene co-expression networks; Section 2.2, cliques and maximal clique-like
modules; Section 2.3, 2-clique-1-bipartite modules; and Section 2.4,
3-clique-2-bipartite modules.
Data 2019, 4, 81 3 of 12
Cancer Name # Of Genes # Of Edges Max Degree Min Degree Avg Degree
BRCA 380 1034 39 1 5.4
COAD 607 3651 75 1 12.0
GBM 506 1243 49 1 4.9
Table 2. List of genes in maximal clique-like modules for three Cancers—BRCA, COAD, and GBM.
focuses on identifying cliques with maximal connections (cross-talks) only. There are 59, 145, and 44 edges
connecting the two cliques in Figure 1a–c, which are the highest counts in the three respective cancers.
Figure 1. Clique-bipartite-like modules with maximal interconnections between two cliques. (a) BRCA;
(b) COAD; and (c) GBM. Nodes in Clique-1 are yellow and nodes in Clique-2 are grey colored.
Intra-clique connections are blue and inter-clique connections (a bipartite graph) are red.
Table 3 shows the list of genes discovered from these clique-bipartite-like modules. Based on these
modules, COAD and GBM cancers share many genes in common. The common genes in clique1 for
both cancers are HCK, ITGB2, LAIR1, and LRRC25, and for clique2 are CD4, CD53, LILRB1, NCKAP1L,
and SPI1. LAPTM5 is the only common gene between clique1 of GBM and clique2 of COAD. On the
other hand, BRCA does not share any gene in common. It can be concluded from 2-clique-1-bipartite
modules that BRCA cancer has unique behavior, which is different from COAD and GBM cancers,
whereas COAD and GBM might have some common characteristics.
Table 3. List of genes in 2-clique-1-bipartite modules.
Clique Cancer Genes
Clique1 BRCA CCL5, CD2, CD247, CD3D, CD3E, CXCR3, GZMA, IL2RG, SIRPG, SLA2, TBX21
Clique1 COAD C3AR1, CD163, CLEC7A, CMKLR1, CSF1R, FPR3, HCK, ITGB2, LAIR1, LILRB2, LRRC25, SIGLEC7, SLAMF8, TLR8
Clique1 GBM C1QB, C1QC, CD68, FERMT3, HCK, ITGB2, LAIR1, LAPTM5, LRRC25, SIGLEC9
Clique2 BRCA BTLA, ITK, LY9, PYHIN1, SH2D1A, SLAMF1, SLAMF6, TRAT1, ZNF831
Clique2 COAD CD4, CD53, CD86, CYBB, CYTH4, DOCK2, DOK2, LAPTM5, LCP2, LILRB1, NCKAP1L, SLA, SPI1
Clique2 GBM CD4, CD53, LILRB4, NCKAP1L, PTPN6, SASH3, SPI1, VAV1
2.4. 3-Clique-2-Bipartite Modules
The top three 3-clique-2-bipartite modules from each cancer are considered for further analysis.
Table 4 summarizes these modules in terms of clique size and the number of inter-clique connections.
For example, BRCA-Module1 consists of three cliques of 13, seven, and four genes connected by two
bipartite graphs of 56 and 13 connections.
Table 4. Summary statistics of 3-clique-2-bipartite modules.
Module Clique-A Clique-B Clique-C Connections A-B Connections B-C
BRCA-Module1 13 7 4 56 13
BRCA-Module2 11 7 4 35 10
BRCA-Module3 8 6 6 8 18
COAD-Module1 16 14 7 85 53
COAD-Module2 16 14 6 111 51
COAD-Module3 16 12 7 69 40
GBM-Module1 9 9 5 30 19
GBM-Module2 9 7 6 22 23
GBM-Module3 9 7 4 36 14
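As a reading aid for Table 4, the sketch below relates each inter-clique edge count to the largest number of edges the corresponding bipartite graph could have, which is the product of the two clique sizes when the cliques share no genes. This fill fraction is an illustrative quantity introduced here, not a statistic reported in the paper, and the disjointness of the cliques is an assumption.

```python
# Rows of Table 4: (module, |A|, |B|, |C|, edges A-B, edges B-C)
TABLE4 = [
    ("BRCA-Module1", 13, 7, 4, 56, 13),
    ("BRCA-Module2", 11, 7, 4, 35, 10),
    ("BRCA-Module3", 8, 6, 6, 8, 18),
    ("COAD-Module1", 16, 14, 7, 85, 53),
    ("COAD-Module2", 16, 14, 6, 111, 51),
    ("COAD-Module3", 16, 12, 7, 69, 40),
    ("GBM-Module1", 9, 9, 5, 30, 19),
    ("GBM-Module2", 9, 7, 6, 22, 23),
    ("GBM-Module3", 9, 7, 4, 36, 14),
]


def fill_fraction(n_left, n_right, n_edges):
    """Fraction of the n_left * n_right possible inter-clique edges present."""
    return n_edges / (n_left * n_right)


for name, a, b, c, ab, bc in TABLE4:
    print(f"{name}: A-B {fill_fraction(a, b, ab):.2f}, "
          f"B-C {fill_fraction(b, c, bc):.2f}")
```

For BRCA-Module1, for instance, 56 of the 13 × 7 = 91 possible A-B edges are present. A fraction of 1 would mean the two cliques merge into a single larger clique, so values below 1 are expected for genuinely separate cliques.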
Figure 2 shows the top three 3-clique-2-bipartite modules for BRCA. Modules for COAD and GBM are
shown in Figure S2. The nodes in the three cliques are represented by yellow (clique-A), grey (clique-B),
and orange (clique-C) colors. Intra-clique edges are colored blue and inter-clique edges are colored red.
Figure 2. Top three 3-clique-2-bipartite modules for BRCA. Yellow nodes: Clique-A, gray nodes:
Clique-B, orange nodes: Clique-C. Blue: intra-clique edges, red: inter-clique edges. (a) Cliques A, B,
and C have 13, 7, and 4 nodes, respectively. There are 56 connecting edges between cliques A and B
and 13 connecting edges between cliques B and C; (b) Cliques A, B, and C have 11, 7, and 4 nodes,
respectively. There are 35 connecting edges between cliques A and B and 10 connecting edges between
cliques B and C; (c) Cliques A, B, and C have 8, 6, and 6 nodes, respectively. There are 8 connecting
edges between cliques A and B and 18 connecting edges between cliques B and C.
The complete lists of genes that are present in each of the top three 3-clique-2-bipartite modules
for BRCA, COAD, and GBM are presented in Supplementary Table S3. Inspection of these lists reveals
that there are many genes in common among the three modules of a particular cancer. Table 5 shows the
combined list: 44, 48, and 32 genes for BRCA, COAD, and GBM, respectively. The three cancers share four
genes: CD53, DOCK2, IKZF1, and NCKAP1L. Other than these four genes, BRCA and COAD share
three more genes (ITK, PTPRC, and TBC1D10C); COAD and GBM share 10 more genes (ARHGAP30,
CD4, CD86, CSF1R, HCK, ITGB2, LAIR1, LAPTM5, SASH3, and SPI1); and BRCA and GBM do not share
any more genes. Thus, BRCA and COAD share a total of seven genes; COAD and GBM share a total of
14 genes; and BRCA and GBM share only four genes. Again, based on 3-clique-2-bipartite modules,
COAD and GBM share many genes, which means that they might have some common cause for
cancer development. These lists of common genes might provide better insight from lab experiments.
Table 5. Combined list of genes from top three 3-clique-2-bipartite modules.
Modules List of Genes
BRCA-Modules ACAP1, CCL5, CD2, CD247, CD3D, CD3E, CD3G, CD5, CD53, CD96, CXCR3, CXCR6, DOCK2, EVI2B, FYB, GZMA, GZMM, IKZF1, IL2RG, ITK, LCP2, LY9, NCKAP1L, PLEK, PRF1, PRKCB, PTPRC, PTPRCAP, PYHIN1, S1PR4, SH2D1A, SIRPG, SIT1, SLA2, SLAMF1, SLAMF6, SPN, TBC1D10C, TBX21, THEMIS, TRAT1, UBASH3A, ZAP70, ZNF831
COAD-Modules APBB1IP, ARHGAP30, ARHGAP9, BTK, C3AR1, CD163, CD4, CD53, CD84, CD86, CLEC7A, CSF1R, CYBB, CYTH4, DOCK10, DOCK2, FPR3, HAVCR2, HCK, HCLS1, IKZF1, IL10RA, ITGAL, ITGB2, ITK, KLHL6, LAIR1, LAPTM5, LILRB1, LILRB4, LRRC25, MAP4K1, MNDA, MYO1G, NCKAP1L, PIK3R5, PTPRC, RASAL3, SASH3, SIGLEC7, SIGLEC9, SIRPB2, SLA, SLAMF8, SPI1, TBC1D10C, TRAF3IP3, WAS
GBM-Modules ARHGAP30, ARL11, C1QA, C1QB, C1QC, CD33, CD4, CD53, CD68, CD86, CSF1R, DOCK2, DOCK8, FCER1G, FCGR3A, FERMT3, HCK, IKZF1, ITGB2, LAIR1, LAPTM5, MYO1F, NCF4, NCKAP1L, PLCG2, SASH3, SPI1, STXBP2, SYK, TYROBP, VAMP8, VAV1
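The overlaps quoted in the text can be re-derived from Table 5 with plain set operations. The snippet below is a small check written for this purpose; the variable names are arbitrary, and the gene lists are copied from Table 5.

```python
BRCA = set("""ACAP1 CCL5 CD2 CD247 CD3D CD3E CD3G CD5 CD53 CD96 CXCR3 CXCR6
DOCK2 EVI2B FYB GZMA GZMM IKZF1 IL2RG ITK LCP2 LY9 NCKAP1L PLEK PRF1 PRKCB
PTPRC PTPRCAP PYHIN1 S1PR4 SH2D1A SIRPG SIT1 SLA2 SLAMF1 SLAMF6 SPN TBC1D10C
TBX21 THEMIS TRAT1 UBASH3A ZAP70 ZNF831""".split())

COAD = set("""APBB1IP ARHGAP30 ARHGAP9 BTK C3AR1 CD163 CD4 CD53 CD84 CD86
CLEC7A CSF1R CYBB CYTH4 DOCK10 DOCK2 FPR3 HAVCR2 HCK HCLS1 IKZF1 IL10RA ITGAL
ITGB2 ITK KLHL6 LAIR1 LAPTM5 LILRB1 LILRB4 LRRC25 MAP4K1 MNDA MYO1G NCKAP1L
PIK3R5 PTPRC RASAL3 SASH3 SIGLEC7 SIGLEC9 SIRPB2 SLA SLAMF8 SPI1 TBC1D10C
TRAF3IP3 WAS""".split())

GBM = set("""ARHGAP30 ARL11 C1QA C1QB C1QC CD33 CD4 CD53 CD68 CD86 CSF1R DOCK2
DOCK8 FCER1G FCGR3A FERMT3 HCK IKZF1 ITGB2 LAIR1 LAPTM5 MYO1F NCF4 NCKAP1L
PLCG2 SASH3 SPI1 STXBP2 SYK TYROBP VAMP8 VAV1""".split())

# Genes present in the combined lists of all three cancers.
shared_all = BRCA & COAD & GBM

# Pairwise overlaps beyond the genes shared by all three.
brca_coad_extra = (BRCA & COAD) - shared_all
coad_gbm_extra = (COAD & GBM) - shared_all
brca_gbm_extra = (BRCA & GBM) - shared_all

print(len(BRCA), len(COAD), len(GBM))
print(sorted(shared_all))
print(sorted(brca_coad_extra))
print(sorted(coad_gbm_extra))
print(sorted(brca_gbm_extra))
```

The list sizes (44, 48, 32) and the overlap counts (4 shared by all, 3 more for BRCA and COAD, 10 more for COAD and GBM, none for BRCA and GBM) agree with the figures stated above.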
3. Discussion
This section discusses the validation of key genes related to three cancers (BRCA, COAD,
and GBM) discovered from three network modules: maximal clique-like modules, 2-clique-1-bipartite
modules, and 3-clique-2-bipartite modules. First, since the key genes are discovered via network
modules, this paper used a network-based app, CytoHubba [34], for validation. CytoHubba
is capable of ranking genes in a network using 12 different graph-theoretic algorithms. The reason
for using CytoHubba is that it produced successful results in predicting essential proteins from the
yeast protein-protein interaction network [34]. Similarly, in a cancer gene co-expression network,
the genes that cause cancer can be thought of as the essential genes for causing that cancer and most
likely will have network properties similar to those of essential proteins in a PPI network. Second, a survival
analysis is conducted to show the effectiveness of the key genes discovered using network modules.
Finally, pathway and GO term enrichment analyses are conducted for the key genes.
3.1. Validation Using CytoHubba
Figure 3 shows the validation process using two validation metrics, Top 20 genes and Top
50 genes, developed using CytoHubba. The original or base gene network (the network created with PCC
≥ 0.9) is analyzed using the 12 scoring methods of CytoHubba (betweenness, bottleneck, closeness,
clustering coefficient (CC), degree, density of maximum neighborhood component (DMNC), eccentricity (EcC),
edge percolated component (EPC), maximal clique centrality (MCC), maximum neighborhood component
(MNC), radiality, and stress) to create the list of genes used as the benchmark for validation.
Metric-1 (Top-20 Genes): First, the Top-20 genes are taken from each of the 12 scoring methods.
Then, the genes that appear in two or more scoring methods are considered as the benchmark for
validation. The benchmarks for BRCA, COAD, and GBM cancers consist of 41, 53, and 42 genes,
respectively (see Supplementary Table S4).
Figure 3. Validation process using two metrics. Metric-1: Top-20 genes from 12 scoring methods of
CytoHubba; Metric-2: Top-50 genes from 12 scoring methods of CytoHubba.
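The benchmark construction described for Metric-1 can be sketched as follows. This is an illustration under assumptions, not CytoHubba's API: the input is taken to be a dictionary mapping each scoring method to its genes ordered best-first, and the function name and parameters (`top_n`, `min_methods`) are hypothetical.

```python
from collections import Counter


def benchmark_genes(rankings, top_n=20, min_methods=2):
    """Genes that fall in the top_n of at least min_methods scoring methods.

    rankings: {method_name: [gene, ...]} with genes ordered best-first.
    """
    counts = Counter()
    for ranked in rankings.values():
        # Use a set so a gene is counted at most once per method.
        counts.update(set(ranked[:top_n]))
    return {gene for gene, n in counts.items() if n >= min_methods}


# Toy rankings from three hypothetical scoring methods.
toy_rankings = {
    "degree": ["G1", "G2", "G3"],
    "closeness": ["G2", "G4", "G1"],
    "stress": ["G5", "G2", "G1"],
}
print(benchmark_genes(toy_rankings, top_n=2))
```

With `top_n=2`, only G2 appears in the top list of two or more methods. Calling the function with `top_n=20` or `top_n=50` on the 12 method rankings would correspond to the two metrics named in Figure 3.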
3.2. Survival Analysis
Cox proportional hazard regression [35], a semi-parametric method, was used for calculating the
Cox coefficients of the key genes (Supplemental Table S5). It can adjust survival rate estimation to
quantify the effect of predictor variables, which are the key genes in the present study. The clinical data
of cancer patients (obtained from TCGA) were divided into two equal groups such that each group
had the same ratio of dead and alive patients. One of the groups was used as the training set for calculating
the Cox coefficients of the key genes. Then, the prognostic risk of each patient in the test set was
calculated based on the expression values of key genes using the gene expression grade index (GGI)
[36]. The following equation calculates the risk:
$$\text{GGI Risk Score} = \sum_{i} x_i - \sum_{i} y_i$$
where $x_i$ and $y_i$ are the expression levels of genes with positive and negative Cox coefficients, respectively.
According to the GGI risk score, patients in the test set were divided into two groups, high risk and low
risk. The patients with a top 50% GGI risk score are in the high-risk group and the others are
in the low-risk group. Then a log-rank test was performed to see if there is a significant difference in
the real survival risks between the two groups.
The survival analysis of the key genes of the three cancers is shown in Figure 4. It is clear from this
figure that the key genes of BRCA, COAD, and GBM are capable of distinguishing between cancer patients in
terms of survival in the respective cancers. The log-rank p-values between high-risk and low-risk groups
were 0.0411, 0.0100, and 0.0171, respectively. A log-rank p-value below 0.05 means there is a significant
difference between the two groups in consideration. The hazard ratios between high-risk and low-risk
groups are 1.6478, 2.1627, and 1.6569 for cancer patients of BRCA, COAD, and GBM, respectively. This means,
for example, that the hazard of death for high-risk COAD patients is 2.1627 times that of low-risk patients.
Figure 4. Survival analysis in data sets of BRCA (a), COAD (b), and GBM (c) cancer patients, using their
respective key genes as prognostic factors. The Kaplan–Meier curve in blue is for the low-risk group
and in orange for the high-risk group. The shaded blue and orange regions around their respective
lines indicate the confidence interval. The y-axis is the probability of survival and the x-axis is the
duration in days.
379 samples, and 19,660 genes for 153 samples, respectively, for BRCA, COAD, and GBM as mentioned
in Table 7. In these datasets, all samples are cancer patients.
Table 7. Summary of gene expression data for BRCA, COAD, and GBM.
The missing values were imputed using the fancyimpute package in Python, employing the
k-nearest neighbors algorithm. The number of genes in the reduced datasets is 16,011, 15,769,
and 16,186, respectively, for BRCA, COAD, and GBM. For the present study, highly correlated positive
gene pairs (PCC ≥ 0.9) in each cancer are considered for creating the base networks for further analysis.
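The construction of the base network from highly correlated positive gene pairs can be sketched as below. This is a standard-library illustration under assumptions, not the authors' pipeline: expression profiles are assumed to be complete (already imputed) lists of equal length, and the function names and data layout are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # Constant profile: correlation is undefined; treat as no edge.
        return 0.0
    return sxy / sqrt(sxx * syy)


def coexpression_edges(profiles, threshold=0.9):
    """Return (gene1, gene2, pcc) for every pair with pcc >= threshold.

    profiles: {gene: [expression value per sample]}.
    Only positive correlations pass, since the test is pcc >= threshold.
    """
    edges = []
    for g1, g2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[g1], profiles[g2])
        if r >= threshold:
            edges.append((g1, g2, r))
    return edges


# Toy profiles over four samples.
toy_profiles = {
    "G1": [1.0, 2.0, 3.0, 4.0],
    "G2": [2.0, 4.0, 6.0, 8.5],   # rises with G1
    "G3": [4.0, 3.0, 2.0, 1.0],   # falls as G1 rises
    "G4": [1.0, 3.0, 2.0, 4.0],   # weaker positive relation to G1
}
for g1, g2, r in coexpression_edges(toy_profiles):
    print(g1, g2, round(r, 3))
```

Only the G1-G2 pair passes the cutoff in the toy data. The pairwise loop is quadratic in the number of genes, so for roughly 16,000 genes a vectorized correlation routine from a numerical library would be the practical choice.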
5. Conclusions
This paper used two graph theoretic concepts—clique and bipartite graphs—to identify the network
biomarkers for cancer from gene co-expression networks developed with highly correlated gene pairs.
The gene expression profiles of three cancers—BRCA, COAD, and GBM—are considered for experiment.
Results show that three types of network modules—maximal clique-like, 2-clique-1-bipartite,
and 3-clique-2-bipartite graphs—derived using the simple graph theoretic concepts clique and bipartite
graph are capable of representing cancer dynamics at the gene network level. The combined list of
genes from the three network modules for a particular cancer is validated against the benchmark developed
with the network-based tool CytoHubba. The effectiveness of the key genes is also validated by survival
and pathway analyses.
The discovered gene network modules provide a short list of genes related to cancer that can be
used by biologists to design wet lab experiments for further elucidation of the complex mechanism
of cancer.
References
1. Ideker, T.; Sharan, R. Protein networks in disease. Genome Res. 2008, 18, 644–652. [CrossRef] [PubMed]
2. Bandyopadhyay, S.; Mehta, M.; Kuo, D.; Sung, M.-K.; Chuang, R.; Jaehnig, E.J.; Bodenmiller, B.; Licon, K.;
Copeland, W.; Shales, M.; et al. Rewiring of Genetic Networks in Response to DNA Damage. Science 2010,
330, 1385–1389. [CrossRef] [PubMed]
3. Haugen, A.C.; Kelley, R.; Collins, J.B.; Tucker, C.J.; Deng, C.; Afshari, C.A.; Brown, J.M.; Ideker, T.;
Van Houten, B. Integrating phenotypic and expression profiles to map arsenic-response networks. Genome
Biol. 2004, 5, R95. [CrossRef] [PubMed]
4. Lee, H.; Tu, Z.; Deng, M.; Sun, F.; Chen, T. Diffusion Kernel-Based Logistic Regression Models for Protein
Function Prediction. OMICS A J. Integr. Biol. 2006, 10, 40–55. [CrossRef] [PubMed]
5. Qi, Y.; Suhail, Y.; Lin, Y.; Boeke, J.D.; Bader, J.S. Finding friends and enemies in an enemies-only network: A
graph diffusion kernel for predicting novel genetic interactions and co-complex membership from yeast
genetic interactions. Genome Res. 2008, 18, 1991–2004. [CrossRef] [PubMed]
6. Ananda, M.M.; Hu, J. NetLoc: Network based protein localization prediction using protein-protein interaction
and co-expression networks. In Proceedings of the 2010 IEEE International Conference on Bioinformatics
and Biomedicine (BIBM), Hong Kong, China, 18–21 December 2010; pp. 142–148.
7. Mondal, A.; Lin, J.-R.; Hu, J. Network based subcellular localization prediction for multi-label proteins.
In Proceedings of the 2011 IEEE International Conference on Bioinformatics and Biomedicine Workshops
(BIBMW), Atlanta, GA, USA, 12–15 November 2011.
8. Mondal, A.M.; Hu, J. Protein Localization by Integrating Multiple Protein Correlation Networks.
In Proceedings of the 2012 International Conference on Bioinformatics & Computational Biology (BIOCOMP’12),
Las Vegas, NV, USA, 16–19 July 2012; pp. 82–88.
9. Lin, J.-R.; Mondal, A.M.; Liu, R.; Hu, J. Minimalist ensemble algorithms for genome-wide protein localization
prediction. BMC Bioinform. 2012, 13, 157. [CrossRef]
10. Mondal, A.; Hu, J. Scored Protein-Protein Interaction to Predict Subcellular Localizations for Yeast Using
Diffusion Kernel. In International Conference on Pattern Recognition and Machine Intelligence; Springer:
Berlin/Heidelberg, Germany, 2013.
11. Mondal, A.; Hu, J. Network based prediction of protein localisation using diffusion kernel. Int. J. Data
Min. Bioinform. 2014, 9, 386–400. [CrossRef]
12. Faisal, F.E.; Milenkovic, T. Dynamic networks reveal key players in aging. Bioinformatics 2014, 30, 1721–1729.
[CrossRef]
13. Kevin, C.; Andrews, A.; Ananda, M. Protein Subnetwork Biomarkers for Yeast Using Brute Force Method.
In Proceedings of the International Conference on Bioinformatics & Computational Biology (BIOCOMP), Las
Vegas, NV, USA, 22–25 July 2013; pp. 218–223.
14. Timalsina, P.; Charles, K.; Mondal, A.M. STRING PPI Score to Characterize Protein Subnetwork Biomarkers for
Human Diseases and Pathways. In Proceedings of the 2014 IEEE International Conference on Bioinformatics
and Bioengineering, Boca Raton, FL, USA, 10–12 November 2014; pp. 251–256.
15. Maharjan, M.; Tanvir, R.B.; Chowdhury, K.; Mondal, A.M. Determination of Biomarkers for Diagnosis of
Lung Cancer Using Cytoscape-based GO and Pathway Analysis. In Proceedings of the 20th International
Conference on Bioinformatics & Computational Biology (BIOCOMP’19), Las Vegas, NV, USA, 29 July–1 August
2019 (accepted).
16. Eisen, M.B.; Spellman, P.T.; Brown, P.O.; Botstein, D. Cluster analysis and display of genome-wide expression
patterns. Proc. Natl. Acad. Sci. USA 1998, 95, 14863–14868. [CrossRef]
17. Wolfe, C.J.; Kohane, I.S.; Butte, A.J. Systematic survey reveals general applicability of ‘guilt-by-association’
within gene coexpression networks. BMC Bioinform. 2005, 6, 227. [CrossRef]
18. Butte, A.J.; Kohane, I.S. Mutual information relevance networks: Functional genomic clustering using
pairwise entropy measurements. Pac. Symp. Biocomput. 2000, 418–429.
19. Zhang, B.; Horvath, S. A general framework for weighted gene co-expression network analysis. Stat. Appl.
Genet. Mol. Biol. 2005, 4, 17. [CrossRef]
20. Tang, J.; Lu, M.; Cui, Q.; Zhang, D.; Kong, D.; Liao, X.; Ren, J.; Gong, Y.; Wu, G. Overexpression of ASPM,
CDC20, and TTK Confer a Poorer Prognosis in Breast Cancer Identified by Gene Co-expression Network
Analysis. Front. Oncol. 2019, 9, 310. [CrossRef]
21. Lalremmawia, H.; Tiwary, B.K. Identification of Molecular Biomarkers for Ovarian Cancer using
Computational Approaches. Carcinogenesis 2019. [CrossRef]
22. Maertens, A.M.; Tran, V.; Kleensang, A.; Hartung, T. Weighted Gene Correlation Network Analysis (WGCNA)
Reveals Novel Transcription Factors Associated With Bisphenol A Dose-Response. Front. Genet. 2018, 9, 508.
[CrossRef]
23. Shi, H.; Zhang, L.; Qu, Y.; Hou, L.; Wang, L.; Zheng, M. Prognostic genes of breast cancer revealed by gene
co-expression network analysis. Oncol. Lett. 2017, 14, 4535–4542. [CrossRef]
24. Liu, X.; Hu, A.-X.; Zhao, J.-L.; Chen, F. Identification of Key Gene Modules in Human Osteosarcoma by
Co-Expression Analysis Weighted Gene Co-Expression Network Analysis (WGCNA). J. Cell. Biochem. 2017,
118, 3953–3959. [CrossRef]
25. Zhang, C.; Sun, Q. Weighted gene co-expression network analysis of gene modules for the prognosis of
esophageal cancer. J. Huazhong Univ. Sci. Technol. [Med. Sci.] 2017, 37, 319–325. [CrossRef]
26. Liu, R.; Zhang, W.; Liu, Z.; Zhou, H. Associating transcriptional modules with colon cancer survival through
weighted gene co-expression network analysis. BMC Genom. 2017, 18, 361. [CrossRef]
27. Liu, Z.-P.; Gao, R. Detecting pathway biomarkers of diabetic progression with differential entropy.
J. Biomed. Inform. 2018, 82, 143–153. [CrossRef]
28. Guan, L.; Luo, Q.; Liang, N.; Liu, H. A prognostic prediction system for hepatocellular carcinoma based on
gene co-expression network. Exp. Ther. Med. 2019, 17, 4506–4516. [CrossRef]
29. Shi, Z.; Derow, C.K.; Zhang, B. Co-expression module analysis reveals biological processes, genomic gain,
and regulatory mechanisms associated with breast cancer progression. BMC Syst. Biol. 2010, 4, 74. [CrossRef]
30. Perkins, A.D.; Langston, M.A. Threshold selection in gene co-expression networks using spectral graph
theory techniques. BMC Bioinform. 2009, 10, S4. [CrossRef]
31. Zhang, D.-Q.; Zhou, C.; Chen, S.-Z.; Yang, Y.; Shi, B. Identification of hub genes and pathways associated
with bladder cancer based on co-expression network analysis. Oncol. Lett. 2017, 14, 1115–1122. [CrossRef]
32. Mondal, A.M.; Schultz, C.A.; Sheppard, M.; Carson, J.; Tanvir, R.B.; Aqila, T. Graph Theoretic Concepts
as the Building Blocks for Disease Initiation and Progression at Protein Network Level: Identification and
Challenges. In Proceedings of the 2018 IEEE International Conference on Bioinformatics and Biomedicine,
BIBM, Madrid, Spain, 3–6 December 2018.
33. Hagberg, A.A.; Schult, D.A.; Swart, P.J. Exploring network structure, dynamics, and function using NetworkX.
In Proceedings of the 7th Python in Science Conference (SciPy), Pasadena, CA, USA, 19–24 August 2008;
pp. 11–15.
34. Chin, C.-H.; Chen, S.-H.; Wu, H.-H.; Ho, C.-W.; Ko, M.-T.; Lin, C.-Y. cytoHubba: Identifying hub objects and
sub-networks from complex interactome. BMC Syst. Biol. 2014, 8 (Suppl. 4), S11. [CrossRef]
35. Mauger, E.A.; Wolfe, R.A.; Port, F.K. Transient effects in the Cox proportional hazards regression model.
Stat. Med. 1995, 14, 1553–1565. [CrossRef]
36. Sotiriou, C.; Wirapati, P.; Loi, S.; Harris, A.; Fox, S.; Smeds, J.; Nordgren, H.; Farmer, P.; Praz, V.; Haibe-Kains, B.;
et al. Gene Expression Profiling in Breast Cancer: Understanding the Molecular Basis of Histologic Grade to
Improve Prognosis. J. Natl. Cancer Inst. 2006, 98, 262–272. [CrossRef]
37. Wu, G.; Dawson, E.; Duong, A.; Haw, R.; Stein, L. ReactomeFIViz: The Reactome FI Cytoscape app for
pathway and network-based data analysis. F1000Research 2014, 3, 146. [CrossRef]
38. Maere, S.; Heymans, K.; Kuiper, M. BiNGO: A Cytoscape plugin to assess overrepresentation of Gene
Ontology categories in Biological Networks. Bioinformatics 2005, 21, 3448–3449. [CrossRef]
39. Monette, A.; Bergeron, D.; Ben Amor, A.; Meunier, L.; Caron, C.; Mes-Masson, A.-M.; Kchir, N.; Hamzaoui, K.;
Jurisica, I.; Lapointe, R. Immune-enrichment of non-small cell lung cancer baseline biopsies for multiplex
profiling define prognostic immune checkpoint combinations for patient stratification. J. Immunother. Cancer
2019, 7, 86. [CrossRef]
40. Erazo-Oliveras, A.; Fuentes, N.R.; Wright, R.C.; Chapkin, R.S. Functional link between plasma membrane
spatiotemporal dynamics, cancer biology, and dietary membrane-altering agents. Cancer Metastasis Rev.
2018, 37, 519–544. [CrossRef]
41. Vasaikar, S.V.; Straub, P.; Wang, J.; Zhang, B. LinkedOmics: Analyzing multi-omics data within and across
32 cancer types. Nucleic Acids Res. 2017, 46, D956–D963. [CrossRef]
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).