DeepGene - An Advanced Cancer Type Classifier Based On Deep Learning and Somatic Point Mutations
DOI 10.1186/s12859-016-1334-9
Abstract
Background: With the development of DNA sequencing technology, large amounts of sequencing data have become available in recent years, providing unprecedented opportunities for advanced association studies between somatic point mutations and cancer types/subtypes, which may contribute to more accurate somatic point mutation based cancer classification (SMCC). In existing SMCC methods, however, issues such as high data sparsity, small sample size, and the use of simple linear classifiers remain major obstacles to improving classification performance.
Results: To address these obstacles, we propose DeepGene, an advanced deep neural network (DNN) based classifier that consists of three steps: first, clustered gene filtering (CGF) concentrates the gene data by mutation occurrence frequency, filtering out the majority of irrelevant genes; second, indexed sparsity reduction (ISR) converts the gene data into indexes of its non-zero elements, thereby significantly suppressing the impact of data sparsity; finally, the data after CGF and ISR is fed into a DNN classifier, which extracts high-level features for accurate classification. Experimental results on our curated TCGA-DeepGene dataset, a reformulated subset of the TCGA dataset covering 12 selected types of cancer, show that CGF, ISR and the DNN classifier all contribute to the overall classification performance. We further compare DeepGene with three widely adopted classifiers and demonstrate that DeepGene achieves at least a 24% improvement in testing accuracy.
Conclusions: Based on deep learning and somatic point mutation data, we devise DeepGene, an advanced cancer type classifier that addresses the obstacles in existing SMCC studies. Experiments indicate that DeepGene outperforms three widely adopted existing classifiers, which is mainly attributed to its deep learning module's ability to extract high-level features between combinatorial somatic point mutations and cancer types.
Compared with conventional cancer classification methods that are mostly based on morphological appearances or gene expressions of the tumor, SMCC is particularly effective in differentiating tumors with similar histopathological appearances [4] and is significantly more robust to environmental influences, and is thus favorable for delivering more accurate classification results. Other genetic aberrations such as copy number variation, translocation, and small insertion or deletion have also been shown to be associated with different cancers [5, 6], but due to the major causal role of somatic point mutations and potential application considerations, we focus only on this kind of genetic aberration in this study. Moreover, the combinatorial point mutation patterns learned in predicting cancer types/subtypes can be used for developing diagnostic gene marker panels that are cost effective. This is particularly true when compared to DNA amplifications and rearrangements, which usually require whole genome sequencing and are expensive for patients, especially regarding time series and whole genome sequencing used in tracing tumor lineage evolution during cancer progression.

Clinically, SMCC may significantly facilitate cancer-related diagnoses and treatments, such as personalized tumor medicine [7], targeted tumor therapy [8] and compound medicine [9]. It can also aid cancer early diagnosis (CED) in combination with the sampling and sequencing of circulating tumor cells (CTCs) or circulating tumor DNA (ctDNA) [10–12]. Given the promising applications above, SMCC has been widely studied in recent research [13–15].

In recent years, the drastic development of machine learning methods has greatly facilitated research in bioinformatics, including SMCC. In order to predict cancer types/subtypes more effectively, many machine learning approaches have been applied in existing cancer type prediction works, with promising results [16–18]. Remarkable developments have been demonstrated in tumor cases of colorectal [19], breast [20], ovary [21], brain [22], and melanoma [23]. However, there are at least three major unresolved challenges:

(1) Normal sequencing results involve an extremely large number of genes, usually in the tens of thousands, but only a small discriminatory subset of genes is related to the cancer classification task. The other genes are largely irrelevant, and their existence only obstructs the cancer classification. Many recent works have been conducted on identifying the discriminatory subset of genes. For example, Cho et al. [24] apply the mean and standard deviation of the distances from each sample to the class center as criteria for classification; Yang et al. [25] improve the method in [24] and bring inter-class variations into the algorithm; Cai et al. [26] propose clustered gene selection, which groups the genes via k-means clustering and picks the top genes in each group that are closest to the centroid locations. These methods are simple and effective in some cases, but their heuristics are designed for continuous gene expression data, and are not directly applicable to discrete, and especially binary, point mutation data.

(2) Even within the discriminatory subset, the majority of genes are not guaranteed to contain informative point mutations and often remain normal (i.e. zero values in the data) [27], which results in extremely sparse gene data (even all-zeros) that is difficult to classify. Yet, to the best of our knowledge, there has been no existing work specifically devised to reduce the data sparsity for SMCC.

(3) Different genes related to specific types of cancer are generally correlated and have complex interactions, which may impede the application of conventional simple linear classifiers such as the linear kernel support vector machine (SVM) [28]. Therefore, an advanced classifier capable of extracting the high-level features within the discriminatory subset is desired. Although there have been recent works utilizing sparse coding [29] or auto-encoders [17] for gene annotation, no work has been devoted to applying high-level machine learning approaches to SMCC.

In recent years, the developments of the deep neural network (DNN) [30] have equipped bioinformaticians with powerful machine learning tools. A DNN is a type of artificial neural network that aims to model abstracted high-level data features using multiple nonlinear and complex processing layers, and provides feedback via back-propagation [31]. First introduced in 1989 [32], DNN has garnered tremendous developments and is widely applied in image classification [33, 34], object localization [35, 36], facial recognition [37, 38], etc. DNN has the potential to introduce novel opportunities for SMCC, where it perfectly fits the need for large scale data processing and high-level feature extraction. However, to the present, applying customized DNN to SMCC is yet to be explored.

In this paper, we propose a novel SMCC method, named DeepGene, designed to simultaneously address the three identified issues. DeepGene is a DNN-based classification model composed of three steps. It first conducts two pre-processing techniques, including the clustered gene filtering (CGF) based on mutation occurrence frequency, and the indexed sparsity reduction (ISR) based on indexes of non-zero elements; the gene data is then classified by a fully-connected DNN classifier into a specific cancer type. The proposed DeepGene model has four distinct contributions.
Fig. 1 Flowchart of the proposed DeepGene method. The raw gene data is first pre-processed by the clustered gene filtering (CGF) and the indexed sparsity reduction (ISR) separately, and then fed into the DNN classifier. The output label from the DNN indicates the cancer type of the input gene sample
Table 1 Workflow of Clustered Gene Filtering (CGF)
Input: Gene data matrix A ∈ {0, 1}^(m×n), distance threshold dCGF, group element threshold nCGF.
1: Sum A by row, sort the result in descending order, and obtain the sorted index A*sum;
2: Initialize each element as ungrouped;
3: For each ungrouped element i in A*sum:
   (a) For each ungrouped element j in A*sum other than i:
       i. Calculate the similarity d(A(i, :), A(j, :));
       ii. If d(A(i, :), A(j, :)) > dCGF, assign j into the group of i;
4: Set the output gene index set gout = ∅;
5: For each group c of A after step 3:
   (a) If the group element number nc ≥ nCGF, select the top nCGF genes with the highest mutation occurrence frequency as gc;
   (b) gout = gout ∪ gc;
6: Apply the index set gout on A and get the filtered gene data ACGF = A(gout, :);
Output: ACGF, i.e. the gene data after CGF

Starting from A*sum(1), which stands for the index of the gene with the highest occurrence frequency, we calculate its similarity with each of the following genes. If their similarity is larger than a predefined threshold dCGF, the latter gene is merged into the group of A*sum(1). After the loop for A*sum(1), we conduct the same loop for the next ungrouped element in A*sum, until all the genes are grouped with a unique group ID.

The final step is to filter the elements from each group and form the discriminatory subset. We do this by selecting the top nCGF genes in each group with the highest mutation occurrence frequency, where nCGF is another predefined threshold. Groups that have fewer than nCGF elements are discarded. All of the selected genes are then united as the result of CGF (steps 5 and 6 in Table 1).
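To make the CGF workflow concrete, the following is a minimal NumPy sketch of Table 1. The similarity measure d is not restated in this section, so cosine similarity between binary mutation profiles is assumed here as a placeholder; variable names mirror the table.

import numpy as np

def cosine_sim(u, v):
    # Assumed similarity measure; the text does not restate the exact d.
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return (u @ v) / denom if denom > 0 else 0.0

def clustered_gene_filtering(A, d_cgf=0.7, n_cgf=5, sim=cosine_sim):
    # A is the {0,1} gene-by-sample matrix; returns indexes of retained genes.
    freq = A.sum(axis=1)                      # step 1: occurrence frequencies
    order = np.argsort(-freq)                 # A*sum: genes by descending frequency
    group = -np.ones(A.shape[0], dtype=int)   # step 2: -1 marks "ungrouped"
    n_groups = 0
    for pos, i in enumerate(order):           # step 3: greedy grouping
        if group[i] != -1:
            continue
        group[i] = n_groups
        for j in order[pos + 1:]:
            if group[j] == -1 and sim(A[i], A[j]) > d_cgf:
                group[j] = n_groups
        n_groups += 1
    g_out = []                                # steps 4-6: per-group selection
    for c in range(n_groups):
        members = np.flatnonzero(group == c)
        if members.size >= n_cgf:             # smaller groups are discarded
            g_out.extend(members[np.argsort(-freq[members])[:n_cgf]].tolist())
    return sorted(g_out)

# Toy usage: 8 genes x 6 samples of binary mutation calls.
rng = np.random.default_rng(0)
A = (rng.random((8, 6)) < 0.3).astype(int)
print(clustered_gene_filtering(A, d_cgf=0.5, n_cgf=2))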
Indexed sparsity reduction
Although CGF can effectively locate the discriminatory gene subset and filter out the majority of irrelevant genes, it is still probable that the selected gene subset is highly sparse, i.e. most of the elements in ACGF are zeros. The high sparsity is likely to obscure any distinguishable feature in the gene data and severely hinder the classification. Hence, an effective process for reducing the gene data sparsity is highly desired.

To address the data sparsity issue, we propose the indexed sparsity reduction (ISR) procedure, which minimizes the sparsity by converting the gene data into the indexes of its non-zero genes. For a 1 × n gene sample p ∈ {0, 1}^(1×n), let the number of its non-zero elements be nNZ, and set a pre-defined threshold nISR. If nNZ ≥ nISR, we find the indexes of the top nISR non-zero elements that have the highest occurrence frequency in A*sum of the previous section, and these nISR indexes are listed in ascending order as a vector pISR, which is the output of ISR; if nNZ < nISR, we conduct zero-padding at the tail of pISR to make it have the length nISR. The workflow of ISR is illustrated in Fig. 2.

The significance of ISR is apparent. For each gene sample p, ISR filters out the majority of its zero elements and keeps most (if nNZ ≥ nISR) or all (if nNZ < nISR) of its non-zero elements. Since nISR ≪ length(p), the percentage of zero elements drops dramatically after ISR, which means the impact of data sparsity is significantly suppressed.
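A minimal sketch of the ISR conversion described above. The freq_rank argument stands in for the A*sum ordering computed during CGF, and indexes are stored 1-based (matching the authors' MATLAB setting) so that the zero padding cannot collide with a real gene index; both choices are assumptions of this sketch.

import numpy as np

def indexed_sparsity_reduction(p, freq_rank, n_isr=800):
    # p: one {0,1} sample vector; freq_rank[g]: rank of gene g in A*sum
    # (0 = most frequently mutated). Output is a fixed-length index vector.
    nz = np.flatnonzero(p)                          # positions of non-zero genes
    if nz.size >= n_isr:
        nz = nz[np.argsort(freq_rank[nz])[:n_isr]]  # keep the most frequent ones
    out = np.sort(nz) + 1                           # ascending order, 1-based
    return np.concatenate([out, np.zeros(n_isr - out.size, dtype=int)])

# Toy usage: a 20-gene sample reduced to a fixed-length index vector.
rng = np.random.default_rng(1)
p = (rng.random(20) < 0.25).astype(int)
freq_rank = rng.permutation(20)                     # stand-in for the A*sum ranking
print(indexed_sparsity_reduction(p, freq_rank, n_isr=6))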
DNN-based classifier
As introduced in the previous two sections, both CGF and ISR have their own advantages when conducted alone. However, the performance can be even higher when they are combined (see more details in the "Evaluate the effect of combining CGF and ISR" Section). We thus combine both CGF and ISR as the pre-processing for our DNN-based classifier.

As shown in Fig. 1, the raw gene data is processed by CGF and ISR separately, and then concatenated as the input of the DNN classifier. The concatenation is conducted by appending the output of ISR to the tail of the output of CGF, by which the two outputs form a new and longer data vector. The classifier is a feed-forward artificial neural network with fixed input and output sizes, and multiple hidden layers for data processing. For a hidden layer l, its activation (or output value to the next layer) is computed as

$x_l = f(z_{l-1})$,

where f is the activation function and $z_l$ is the total weighted sum of the input:

$z_l = W_l x_l + b_l$,

where $W_l$ and $b_l$ are the weight matrix and bias vector of layer l (to be learned in training). In our case, we adopt the ReLU [40] function as f, and $x_1$ is the input gene data after pre-processing. The size of the last layer L's output $x_L$ equals the number of cancer types $n_{cancer}$ ($n_{cancer} = 12$ in our case). $x_L$ is then processed by a softmax layer [41], and the loss J is computed by the logarithmic loss function

$J = -\sum_{i=1}^{n_{cancer}} y_i \log P_i$,

where $y_i \in \{0, 1\}$ is the ground truth label of cancer type i, and

$P_i = \exp(x_L(i)) / \sum_j \exp(x_L(j))$

is the softmax probability of cancer type i.
Fig. 2 Flowchart of the Indexed Sparsity Reduction (ISR) step. After indexing of the non-zero elements, if nNZ ≥ nISR, select the top nISR non-zero elements
that have the highest occurrence frequency; if nNZ < nISR, we conduct zero-padding to the tail of the output data so that it has the length of nISR
In training, the loss J is transferred from the last layer to the former layers via back-propagation [32], by which the parameters W and b of each layer are updated. The training then enters the next epoch, and the feed-forwarding and back-propagation are conducted again. The training stops when a pre-defined epoch number is reached. In testing, only the feed-forwarding is conducted (once) for a testing sample, and the type of cancer i corresponding to the largest softmax probability Pi is adopted as the classification result. The workflow of the DNN classifier is summarized in Table 2, and the complete flowchart of DeepGene is illustrated in Fig. 1.
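As an illustration of this training scheme, the sketch below runs one epoch of feed-forwarding and back-propagation for a single-hidden-layer version of the classifier with plain gradient descent. It is a scaled-down stand-in rather than the authors' implementation, which is deeper and trained in MatConvNet.

import numpy as np

def train_epoch(X, Y, W1, b1, W2, b2, lr=0.1):
    # One epoch of per-sample feed-forwarding and back-propagation for a
    # single-hidden-layer softmax classifier.
    for x, y in zip(X, Y):
        h = np.maximum(W1 @ x + b1, 0.0)           # feed-forward (ReLU hidden)
        s = W2 @ h + b2
        e = np.exp(s - s.max())
        P = e / e.sum()                            # softmax probabilities
        ds = P - y                                 # dJ/ds for J = -sum y_i log P_i
        dh = W2.T @ ds                             # back-propagate to hidden layer
        dz1 = dh * (h > 0)                         # ReLU derivative
        W2 -= lr * np.outer(ds, h); b2 -= lr * ds  # update W and b of each layer
        W1 -= lr * np.outer(dz1, x); b1 -= lr * dz1
    return W1, b1, W2, b2

# Toy usage: 2 classes, 4-dimensional inputs, a fixed number of epochs.
rng = np.random.default_rng(4)
X = rng.standard_normal((20, 4))
Y = np.eye(2)[(X[:, 0] > 0).astype(int)]           # one-hot labels
W1, b1 = 0.1 * rng.standard_normal((8, 4)), np.zeros(8)
W2, b2 = 0.1 * rng.standard_normal((2, 8)), np.zeros(2)
for epoch in range(10):                            # stops at a pre-defined epoch count
    W1, b1, W2, b2 = train_epoch(X, Y, W1, b1, W2, b2)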
Results
Experiment setup
Dataset
Our experiments are all conducted on the newly proposed TCGA-DeepGene dataset, which is a re-formulated subset of The Cancer Genome Atlas (TCGA) dataset [39] that is widely applied in genomic research.

The TCGA-DeepGene subset is formulated by assembling the genes that contain somatic point mutations for each of the 12 selected types of cancer. Detailed sample and point mutation statistics for each cancer type can be found in Table 3. The data is collected from the TCGA database with the filter criteria IlluminaGA_DNASeq_Curated, updated before April 2015. The mutation information for a gene is represented by a binary value according to one or more mutations (1) or no mutation (0) on that gene for a specific sample. We assemble a total of 22,834 genes from the 3122 samples, and generate a 22,834 × 3122 binary data matrix (i.e. the original data matrix A). This data matrix is the product of our proposed TCGA-DeepGene subset, where each sample (column) is assigned one of the labels {1, 2, …, 12} denoting the 12 types of cancer above.

To facilitate the 10-fold cross validation in the following experiments, we randomly divide the samples in each of the 12 cancer categories into 10 subgroups, and each time we take the union of one subgroup from each cancer category as the validation set, while all the other subgroups are combined as the training set. This formulates 10 training/validation configurations with fair distributions of the 12 types of cancer, which will be used for the 10-fold cross validation in our following experiments.
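A minimal sketch of this fold construction, assuming integer class labels; folds[k] collects the validation indexes of the k-th training/validation configuration.

import numpy as np

def stratified_folds(labels, n_folds=10, seed=0):
    # Randomly split the samples of each cancer category into n_folds
    # subgroups; fold k unions the k-th subgroup of every category.
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(n_folds)]
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        for k, chunk in enumerate(np.array_split(idx, n_folds)):
            folds[k].extend(chunk.tolist())
    return folds   # folds[k] = validation indexes of configuration k

# Toy usage: 3 categories of 10 samples each, 5 folds.
labels = np.repeat([1, 2, 3], 10)
print([len(f) for f in stratified_folds(labels, n_folds=5)])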
Table 2 Workflow of the DNN classifier
Input: Gene data matrix A ∈ {0, 1}^(m×n) after CGF and ISR, where rows and columns correspond to samples and genes, respectively; max training epoch Emax.
1: Training: for each training epoch e ≤ Emax:
   (a) For each sample ai = A(i, :):
       i. Conduct feed-forwarding and compute the loss J;
       ii. Conduct back-propagation to update the W and b of each layer;
2: Testing: for each sample ai = A(i, :):
   (a) Conduct feed-forwarding and get the softmax probability P;
   (b) Adopt the cancer type corresponding to max(P) as the result for ai.
Output: Trained network model (training) or classification results for the samples (testing).

Constant parameters
For the proposed DNN classifier, the output size is set to 12 (i.e. the 12 types of cancer to be classified); the total training epoch Emax is set to 50; the learning rate is set to a 50-point logarithmic space between 10^-1 and 10^-4; the weight decay is set to 0.0005; and the training batch size (i.e. the number of samples per training batch) is set to 256.

Additionally, in order to facilitate the evaluation of the variable parameters, we set each parameter a default value: the distance threshold dCGF is set to 0.7; the group element threshold nCGF is set to 5; the non-zero element threshold nISR is set to 800; the hidden layer number and parameters per layer of the DNN classifier are set to 4 and 8192, respectively.

Evaluation metrics
For all the evaluations in our experiments, we randomly select 90% (2810) of the samples for training, and the remaining 10% (312) samples for testing.
Table 3 Sample and mutation statistics of the TCGA-DeepGene dataset on 12 cancer types
Cancer name | Sample number | Missense mutation | Nonsense mutation | Nonstop mutation | RNA | Silent | Splice_Site | Translation start site | Total
ACC | 91 | 6741 | 501 | 15 | 368 | 2534 | 344 | 42 | 10,545
BLCA | 130 | 24,067 | 2142 | 46 | 0 | 9662 | 528 | 55 | 36,500
BRCA | 992 | 55,063 | 4841 | 133 | 3998 | 17,901 | 1424 | 0 | 83,360
CESC | 194 | 26,606 | 2716 | 84 | 5595 | 9765 | 527 | 0 | 45,293
HNSC | 279 | 31,416 | 2545 | 44 | 0 | 12,149 | 776 | 0 | 46,930
KIRP | 171 | 8910 | 499 | 17 | 394 | 3411 | 524 | 0 | 13,755
LGG | 284 | 5341 | 378 | 7 | 102 | 2074 | 294 | 0 | 8196
LUAD | 230 | 44,800 | 3477 | 46 | 0 | 15,594 | 1377 | 99 | 65,393
PAAD | 146 | 21,067 | 1496 | 19 | 859 | 7936 | 1005 | 111 | 32,493
PRAD | 261 | 9628 | 563 | 15 | 652 | 3750 | 513 | 55 | 15,176
STAD | 288 | 82,265 | 4200 | 92 | 48 | 33,344 | 1868 | 227 | 122,044
UCS | 56 | 3070 | 187 | 2 | 234 | 1114 | 171 | 0 | 4778
Total | 3122 | 318,974 | 23,545 | 520 | 12,250 | 119,234 | 9351 | 589 | 484,463
In parameter optimization steps for DeepGene, we adopt the 10-fold cross validation accuracy on the training set as the evaluation metric; on the other hand, in the comparison with widely adopted models, we adopt the testing accuracy as the evaluation metric.

Implementation
The CGF and ISR steps are implemented by original coding in MATLAB, while the DNN classifier is implemented on the MatConvNet toolbox [42], which is a MATLAB-based convolutional neural network (CNN) toolbox with various extensibilities.

Evaluation of design options
Determination of CGF's variables
There are two variables that need to be experimentally determined for the CGF step, namely the distance threshold dCGF and the group element threshold nCGF.
Table 4 10-fold cross validation accuracies (%) of DeepGene with different nCGF (row) and dCGF (column)
The optimal result is marked in red. Mean accuracy: 53.0%; standard deviation: 5.01%; maximum accuracy: 63.9%; minimum accuracy: 38.9%. The corresponding
3D bar-plot is shown in Fig. 3a for sensitivity review
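The parameter determination amounts to an exhaustive grid search over dCGF and nCGF scored by 10-fold cross validation. A minimal sketch follows, where evaluate is a placeholder for running the full CGF + ISR + DNN pipeline and returning the mean validation accuracy.

from itertools import product

def grid_search(evaluate, d_cgf_values, n_cgf_values):
    # Exhaustive search over the two CGF variables, keeping the pair with the
    # best mean 10-fold cross validation accuracy (the procedure behind Table 4).
    best = (None, None, -1.0)
    for d_cgf, n_cgf in product(d_cgf_values, n_cgf_values):
        acc = evaluate(d_cgf, n_cgf)
        if acc > best[2]:
            best = (d_cgf, n_cgf, acc)
    return best

# Toy usage with a stand-in scoring function; the real evaluate would run the
# entire CGF + ISR + DNN pipeline for each parameter pair.
dummy = lambda d, n: 1.0 - abs(d - 0.7) - 0.01 * abs(n - 5)
print(grid_search(dummy, [0.5, 0.6, 0.7, 0.8], [3, 5, 7, 10]))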
Fig. 3 3D bar-plots of parameter estimations for sensitivity review. The Z-axis stands for 10-fold cross validation accuracy. a Parameter estimation
for dCGF and nCGF, corresponding to Table 4; b parameter estimation for layer number and parameter number per layer for the DNN classifier,
corresponding to Table 5; c parameter estimation for cost and gamma for SVM, corresponding to Table 6; d parameter estimation for Table 7
Fig. 4 Non-zero element distribution of the gene samples in the TCGA-DeepGene dataset. Ninety-seven percent of all the 3122 samples have no
more than 800 non-zero gene elements
Determine the network architecture
We also need to determine the network architecture for the DNN classifier, which involves two variables: the hidden layer number (#layer) and the parameter number per layer (#param). Enlightened by [43], we monitor the classifier's 10-fold cross validation accuracy with various hidden layer numbers and parameter numbers, the results of which are listed in Table 5, and the corresponding 3D bar-plot presenting sensitivity is shown in Fig. 3b. We see that the performance reaches its optimum at #layer = 4 and #param = 8192. These values are thus adopted in our following experiments.

Evaluate the effect of combining CGF and ISR
After determining the related parameters for the three steps of DeepGene, we evaluate the impact of our two major innovations, i.e. CGF and ISR. It is mentionable that we conduct CGF and ISR separately and concatenate their results (as shown in Fig. 1) instead of conducting them consecutively. The reason is that the outputs of CGF and ISR are binary data and index data, respectively. Consecutive conduction would only leave the index data (from ISR), while separate conduction can benefit from both the binary data and the index data, thus introducing less bias.

Based on Fig. 1, we compare the performances of the DNN classifier with different input configurations (assembled as in the sketch after this list):

(1) CGF and ISR (i.e. the proposed input structure);
(2) Only CGF (the upper half of Fig. 1);
(3) Only ISR (the lower half of Fig. 1);
(4) Neither CGF nor ISR (use the raw gene data instead).
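A small sketch of how the four input configurations above can be assembled for one sample; cgf_idx and isr_vec stand for the outputs of the two pre-processing sketches given earlier.

import numpy as np

def make_input(sample, mode, cgf_idx, isr_vec):
    # cgf_idx: gene indexes kept by CGF; isr_vec: the sample's ISR output
    # (e.g. from the indexed_sparsity_reduction sketch earlier).
    if mode == "cgf+isr":
        # Append the ISR index vector to the tail of the binary CGF vector.
        return np.concatenate([sample[cgf_idx], isr_vec])
    if mode == "cgf":
        return sample[cgf_idx]
    if mode == "isr":
        return isr_vec
    return sample  # raw gene data

# Toy usage: 10 genes, CGF keeps 4 of them, ISR output of length 3.
sample = np.array([0, 1, 0, 0, 1, 1, 0, 0, 1, 0])
print(make_input(sample, "cgf+isr", np.array([1, 4, 5, 8]), np.array([2, 5, 6])))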
Table 5 10-fold cross validation accuracies (%) of DeepGene with different #layer (row) and #param (column)
The optimal result is marked in red. Mean accuracy: 57.9%; standard deviation: 3.42%; maximum accuracy: 64.0%; minimum accuracy: 53.2%. The corresponding
3D bar-plot is shown in Fig. 3b for sensitivity review
The 10-fold cross validation results are shown in Fig. 5. It is clearly observed that the complete CGF + ISR outperforms both CGF and ISR conducted alone, and also significantly outperforms the raw data without any pre-processing.
Comparison with widely adopted models
We then select the three most representative data classifiers that are commonly used in SMCC as comparison methods, namely Support Vector Machine (SVM) [28], k-Nearest Neighbors (KNN) [44] and Naïve Bayes (NB) [45]. In order to exhibit the pre-processing effect of CGF and ISR, all the comparison methods use raw gene data as inputs. The three methods are set up as below.

SVM: we use the LIBSVM toolbox [46] in implementing the SVM. Based on the results of a previous work on gene classification [26], the kernel type (-t) is set to 0 (linear kernel). Note that because the feature set is high dimensional, the linear kernel is suggested over the RBF (Gaussian) kernel [46]; this suggestion is consistent with our trial and error experience on this problem. A 10-fold cross validation is conducted to optimize the parameters cost (-c) and gamma (-g), and the other parameters are set to their default values. The cross validation results are shown in Table 6, and the corresponding 3D bar-plot presenting sensitivity is shown in Fig. 3c. We adopt 2^2 = 4 and 2^-5 = 0.0313 for -c and -g, respectively, which lead to the best results in Table 6.

KNN: we compare the performances of the Euclidean distance and the Pearson correlation coefficient, which are the two most commonly used similarity measures in gene data analysis [26]. The 10-fold cross validation results of the two similarity measures with different neighborhood numbers are shown in Table 7, and the corresponding 3D bar-plot presenting sensitivity is shown in Fig. 3d. We adopt the Pearson correlation coefficient and set the neighborhood number to 4, which lead to the optimal validation accuracy.

NB: following [47], the average percentage of non-zero elements in the samples of each cancer category is set as the prior probability.

In the performance comparison between different models, the testing accuracy is adopted as the evaluation metric (see the "Evaluation metrics" Section), which is generally slightly lower than the 10-fold validation accuracy of the corresponding model. The experiment results are plotted in Fig. 6. DeepGene shows a significant advantage against all three comparison methods. The performance improvements are 24.3% (65.5% vs. 52.7%), 60.5% (65.5% vs. 40.8%) and 710% (65.5% vs. 9.23%) against SVM, KNN and NB, respectively. To further validate the performance of the DNN classifier itself without CGF and ISR, we also record the accuracy of the DNN classifier with raw gene data, which is the same input as the comparison methods. The results are shown in Fig. 7, in which the DNN classifier still has the optimal accuracy (60.1%) against all of the comparison methods.
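For orientation, the following sketch sets up rough scikit-learn stand-ins for the three comparison methods with the settings reported above; the paper itself uses the LIBSVM MATLAB toolbox, so this is an approximation rather than the original setup. Note that scipy's "correlation" distance corresponds to 1 minus the Pearson coefficient.

import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import BernoulliNB

# Stand-ins for the three comparison classifiers with the reported settings.
svm = SVC(kernel="linear", C=4, gamma=0.0313)      # -t 0, -c 4, -g 0.0313
knn = KNeighborsClassifier(n_neighbors=4, metric="correlation")
nb = BernoulliNB()                                 # priors set separately in [47]

# Toy usage on random binary "gene" data with 3 classes.
rng = np.random.default_rng(3)
X = (rng.random((60, 100)) < 0.1).astype(float)
y = rng.integers(0, 3, size=60)
for clf in (svm, knn, nb):
    print(type(clf).__name__, clf.fit(X[:50], y[:50]).score(X[50:], y[50:]))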
Fig. 5 10-fold cross validation accuracy of DeepGene with different design options. Performance comparison of the complete DeepGene input
structure (CGF + ISR), CGF only, ISR only and raw gene data. The complete DeepGene shows significant advantage against the other
three options
Table 6 10-fold cross validation accuracy (%) of SVM with different cost (row) and gamma (column) parameters
The optimal result is marked in red. Mean accuracy: 46.6%; standard deviation: 3.97%; maximum accuracy: 55.4%; minimum accuracy: 37.6%. The corresponding
3D bar-plot is shown in Fig. 3c for sensitivity review
… negative influence, and only focus the data to the discriminatory gene subset.
Table 7 10-fold cross validation accuracies (%) of KNN with different similarity measures (row) and neighborhood numbers (column)
The optimal result is marked in red. Mean accuracy: 35.3%; standard deviation: 5.63%; maximum accuracy: 43.6%; minimum accuracy: 28.2%. The corresponding
3D bar-plot is shown in Fig. 3d for sensitivity review
Fig. 6 Testing accuracy of DeepGene against three widely adopted classifiers. DeepGene is clearly advantageous to the comparison methods
Fig. 7 Testing accuracy of DeepGene against three widely adopted classifiers with raw gene input data. All methods use raw gene data as input.
The DNN classifier is still favorable against the other methods
… more significant than what the CGF contributes. We attribute ISR's advantage to its remarkable reduction of the gene data sparsity. It is also mentionable that ISR exhibits more strength when combined with CGF, as the first bar in Fig. 5 indicates. This can be explained by the synergy effect between the binary gene data and the indexed gene data.

Furthermore, we note that ISR conducts a lossless conversion when nNZ ≤ nISR, i.e. the indexed data can be readily converted back to the original binary data if necessary.
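The lossless case can be verified directly: with 1-based indexes and zero padding (as in the ISR sketch earlier), the original binary sample is recovered from pISR as follows.

import numpy as np

def isr_inverse(p_isr, n):
    # Every non-zero entry of p_isr is the 1-based index of a mutated gene,
    # so the binary sample of length n is recovered exactly when nNZ <= nISR.
    p = np.zeros(n, dtype=int)
    p[p_isr[p_isr > 0] - 1] = 1
    return p

# Toy usage: indexes {2, 5, 6} (1-based) with zero padding, back to 10 genes.
print(isr_inverse(np.array([2, 5, 6, 0, 0]), 10))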
Data optimization by CGF and ISR
Besides aiding our DeepGene method, the CGF and ISR steps can also benefit other classification methods through input data optimization. To evaluate the optimization effect, we apply CGF + ISR to the three classifiers SVM, KNN and NB discussed in the "Comparison with widely adopted models" Section, and record their testing accuracies before and after the input data optimization. For fair comparison, the parameters of the classifiers remain the same.

Figure 8 shows the accuracy change before and after the input data optimization by CGF + ISR. It is observed that applying CGF + ISR notably refines the input data, thus improving the testing accuracies of the classifiers. We also note that with CGF + ISR, the accuracy improvements of the three classifiers are not as large as that of DeepGene. Since DeepGene is based on DNN, it is more advantageous in processing complicated data structures, and thus benefits more from CGF + ISR.
DNN classifier
The DNN classifier is the mainstay of DeepGene, conducting the classification and generating the final output. Figure 6 has shown the significant advantage of DeepGene against three widely adopted classifiers, over which DeepGene exhibits at least a 24% performance improvement. To examine the performance of the DNN classifier itself without the pre-processing steps of CGF and ISR, we also record the accuracy of the DNN classifier with raw gene data in Fig. 7, which shows that the DNN classifier still generates the best accuracy (60.1% against the second best 52.7% of SVM).

To further validate that the 10-fold validation accuracy of DNN is indeed higher than that of SVM, we assume that the two classifiers are independent of each other, and conduct a t-test with the null hypothesis that the two classifiers have equal validation accuracy, at the significance level of 0.001. The sample standard deviations of DNN and SVM are recorded as $s_{X_1} = 1.51\% = 0.0151$ and $s_{X_2} = 2.12\% = 0.0212$, respectively. The t statistic is then calculated as:
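The computed statistic itself falls outside the extracted text; for reference, the sketch below applies the standard two-sample t statistic with equal fold counts (n = 10), plugging in the standard deviations above and, purely for illustration, the maximum 10-fold accuracies reported in the notes to Tables 5 and 6.

import math

def two_sample_t(x1, x2, s1, s2, n=10):
    # Standard two-sample t statistic with equal group sizes n:
    # t = (x1 - x2) / sqrt((s1^2 + s2^2) / n)
    return (x1 - x2) / math.sqrt((s1 ** 2 + s2 ** 2) / n)

# Illustrative values: maximum 10-fold accuracies of DNN (64.0%, Table 5 note)
# and SVM (55.4%, Table 6 note) with the standard deviations reported above.
print(two_sample_t(0.640, 0.554, 0.0151, 0.0212))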
Fig. 8 Testing accuracies of three widely adopted classifiers with and without CGF + ISR for input data optimization. Applying CGF + ISR can
notably refine the input data, thus improve the testing accuracies of the classifiers
Consent for publication
Not applicable.

Ethics approval and consent to participate
Not applicable.

Author details
1 School of Information Technologies, The University of Sydney, Darlington, NSW 2008, Australia. 2 Key Laboratory of Systems Biomedicine, Shanghai Center for Systems Biomedicine, Shanghai Jiaotong University, Shanghai 200240, China.

Published: 23 December 2016

References
1. Feuerstein M. Defining cancer survivorship. J Cancer Surviv. 2007;1(1):5–7.
2. Stewart B, Wild CP. World cancer report 2014. World; 2015.
3. DeFrancesco L. Life Technologies promises $1,000 genome. Nat Biotechnol. 2012;30(2):126.
4. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999;286(5439):531–7.
5. Greenman C, Stephens P, Smith R, Dalgliesh GL, Hunter C, Bignell G, Davies H, Teague J, Butler A, Stevens C. Patterns of somatic mutation in human cancer genomes. Nature. 2007;446(7132):153–8.
6. Wang Q, Jia P, Li F, Chen H, Ji H, Hucks D, Dahlman KB, Pao W, Zhao Z. Detecting somatic point mutations in cancer genome sequencing data: a comparison of mutation callers. Genome Med. 2013;5(10):91.
7. Longo DL. Tumor heterogeneity and personalized medicine. N Engl J Med. 2012;366(10):956–7.
8. Sledge GW. What is targeted therapy? J Clin Oncol. 2005;23(8):1614–5.
9. Gudeman J, Jozwiakowski M, Chollet J, Randell M. Potential risks of pharmacy compounding. Drugs R D. 2013;13(1):1–8.
10. Franken B, de Groot MR, Mastboom WJ, Vermes I, van der Palen J, Tibbe AG, Terstappen LW. Circulating tumor cells, disease recurrence and survival in newly diagnosed breast cancer. Breast Cancer Res. 2012;14(5):1–8.
11. Sleijfer S, Gratama J-W, Sieuwerts AM, Kraan J, Martens JW, Foekens JA. Circulating tumour cell detection on its way to routine diagnostic implementation? Eur J Cancer. 2007;43(18):2645–50.
12. Hayes DF, Smerage J. Is there a role for circulating tumor cells in the management of breast cancer? Clin Cancer Res. 2008;14(12):3646–50.
13. Forbes SA, Beare D, Gunasekaran P, Leung K, Bindal N, Boutselakis H, Ding M, Bamford S, Cole C, Ward S. COSMIC: exploring the world's knowledge of somatic mutations in human cancer. Nucleic Acids Res. 2015;43(D1):D805–11.
14. Watson IR, Takahashi K, Futreal PA, Chin L. Emerging patterns of somatic mutations in cancer. Nat Rev Genet. 2013;14(10):703–18.
15. Koboldt DC, Zhang Q, Larson DE, Shen D, McLellan MD, Lin L, Miller CA, Mardis ER, Ding L, Wilson RK. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 2012;22(3):568–76.
16. Browne RP, McNicholas PD, Sparling MD. Model-based learning using a mixture of mixtures of Gaussian and uniform distributions. IEEE Trans Pattern Anal Mach Intell. 2012;34(4):814–7.
17. Chicco D, Sadowski P, Baldi P. Deep autoencoder neural networks for gene ontology annotation predictions. In: Proc ACM Conf Bioinformatics, Computational Biology, and Health Informatics. Newport Beach; 2014. p. 533–40.
18. Chow CK, Zhu H, Lacy J, Lingen MW, Kuo WP, Chan K. A cooperative feature gene extraction algorithm that combines classification and clustering. In: IEEE Intl Conf Bioinformatics and Biomedicine Workshop (BIBMW). Washington, DC; 2009. p. 197–202.
19. Huang Z, Huang D, Ni S, Peng Z, Sheng W, Du X. Plasma microRNAs are promising novel biomarkers for early detection of colorectal cancer. Int J Cancer. 2010;127(1):118–26.
20. Aaroe J, Lindahl T, Dumeaux V, Saebo S, Tobin D, Hagen N, Skaane P, Lonneborg A, Sharma P, Borresen-Dale A-L. Gene expression profiling of peripheral blood cells for early detection of breast cancer. Breast Cancer Res. 2010;12(1):R7.
21. Kurman RJ, Visvanathan K, Roden R, Wu T, Shih I-M. Early detection and treatment of ovarian cancer: shifting from early stage to minimal volume of disease based on a new model of carcinogenesis. Am J Obstet Gynecol. 2008;198(4):351–6.
22. Balss J, Meyer J, Mueller W, Korshunov A, Hartmann C, von Deimling A. Analysis of the IDH1 codon 132 mutation in brain tumors. Acta Neuropathol. 2008;116(6):597–602.
23. Winnepenninckx V, Lazar V, Michiels S, Dessen P, Stas M, Alonso SR, Avril M-F, Romero PLO, Robert T, Balacescu O. Gene expression profiling of primary cutaneous melanoma and clinical outcome. J Natl Cancer Inst. 2006;98(7):472–82.
24. Cho J-H, Lee D, Park JH, Lee I-B. New gene selection method for classification of cancer subtypes considering within-class variation. FEBS Lett. 2003;551(1–3):3–7.
25. Yang K, Cai Z, Li J, Lin G. A stable gene selection in microarray data analysis. BMC Bioinformatics. 2006;7(1):228.
26. Cai Z, Xu L, Shi Y, Salavatipour MR, Goebel R, Lin G. Using gene clustering to identify discriminatory genes with higher classification accuracy. In: IEEE Symp Bioinformatics and BioEngineering (BIBE). Arlington; 2006. p. 235–42.
27. Tao Y, Sam L, Li J, Friedman C, Lussier YA. Information theory applied to the sparse gene ontology annotation network to predict novel gene function. Bioinformatics. 2007;23(13):i529–38.
28. Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20(3):273–97.
29. Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, Aken BL, Barrell D, Zadissa A, Searle S. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 2012;22(9):1760–74.
30. Hinton GE, Salakhutdinov RR. Reducing the dimensionality of data with neural networks. Science. 2006;313(5786):504–7.
31. Deng L, Yu D. Deep learning: methods and applications. Foundations and Trends in Signal Processing. 2014;7(3–4):197–387.
32. LeCun Y, Boser B, Denker JS, Henderson D, Howard RE, Hubbard W, Jackel LD. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1989;1(4):541–51.
33. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. 2012. p. 1097–105.
34. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with convolutions. arXiv preprint arXiv:1409.4842. 2014.
35. Girshick R, Donahue J, Darrell T, Malik J. Rich feature hierarchies for accurate object detection and semantic segmentation. In: IEEE Conf Computer Vision and Pattern Recognition (CVPR). Columbus; 2014. p. 580–7.
36. Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. arXiv preprint arXiv:1411.4038. 2014.
37. Sun Y, Wang X, Tang X. Deep convolutional network cascade for facial point detection. In: IEEE Conf Computer Vision and Pattern Recognition (CVPR). Portland; 2013. p. 3476–83.
38. Sun Y, Wang X, Tang X. Deep learning face representation from predicting 10,000 classes. In: IEEE Conf Computer Vision and Pattern Recognition (CVPR). Columbus; 2014. p. 1891–8.
39. Tomczak K, Czerwińska P, Wiznerowicz M. The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Contemp Oncol. 2015;19(1A):A68. Last downloaded on April 8th, 2015.
40. Nair V, Hinton GE. Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10). 2010. p. 807–14.
41. Bishop CM. Pattern recognition and machine learning. New York: Springer; 2006.
42. Vedaldi A, Lenc K. MatConvNet: convolutional neural networks for MATLAB. arXiv preprint arXiv:1412.4564. 2014.
43. Mostajabi M, Yadollahpour P, Shakhnarovich G. Feedforward semantic segmentation with zoom-out features. arXiv preprint arXiv:1412.0774. 2014.
44. Altman NS. An introduction to kernel and nearest-neighbor nonparametric regression. Am Stat. 1992;46(3):175–85.
45. Rennie JD, Shih L, Teevan J, Karger DR. Tackling the poor assumptions of naive Bayes text classifiers. In: ICML. Washington; 2003. p. 616–23.
46. Chang C-C, Lin C-J. LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol (TIST). 2011;2(3):27.
47. Dudoit S, Fridlyand J, Speed TP. Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc. 2002;97(457):77–87.