DeepGene - An Advanced Cancer Type Classifier Based On Deep Learning and Somatic Point Mutations
DOI 10.1186/s12859-016-1334-9
Abstract
Background: With the development of DNA sequencing technology, large amounts of sequencing data have become available in recent years, providing unprecedented opportunities for advanced association studies between somatic point mutations and cancer types/subtypes, which may contribute to more accurate somatic point mutation based cancer classification (SMCC). In existing SMCC methods, however, issues such as high data sparsity, small sample size, and the use of simple linear classifiers remain major obstacles to improving classification performance.
Results: To address these obstacles, we propose DeepGene, an advanced deep neural network (DNN) based classifier that consists of three steps: first, clustered gene filtering (CGF) concentrates the gene data by mutation occurrence frequency, filtering out the majority of irrelevant genes; second, indexed sparsity reduction (ISR) converts the gene data into indexes of its non-zero elements, thereby significantly suppressing the impact of data sparsity; finally, the data after CGF and ISR is fed into a DNN classifier, which extracts high-level features for accurate classification. Experimental results on our curated TCGA-DeepGene dataset, a reformulated subset of the TCGA dataset covering 12 selected types of cancer, show that CGF, ISR and the DNN classifier all contribute to the overall classification performance. We further compare DeepGene with three widely adopted classifiers and demonstrate that DeepGene achieves at least a 24% improvement in testing accuracy.
Conclusions: Based on deep learning and somatic point mutation data, we devise DeepGene, an advanced cancer type classifier that addresses the obstacles in existing SMCC studies. Experiments indicate that DeepGene outperforms three widely adopted existing classifiers, which is mainly attributed to its deep learning module's ability to extract high-level features between combinatorial somatic point mutations and cancer types.
Compared with conventional cancer classification methods that are mostly based on morphological appearances or gene expressions of the tumor, SMCC is particularly effective in differentiating tumors with similar histopathological appearances [4] and is significantly more robust to environmental influences, and is thus favorable for delivering more accurate classification results. Other genetic aberrations such as copy number variation, translocation, and small insertion or deletion have also been shown to be associated with different cancers [5, 6], but due to the major causal role of somatic point mutations and potential application considerations, we focus only on this kind of genetic aberration in this study. Moreover, the combinatorial point mutation patterns learned in predicting cancer types/subtypes can be used for developing diagnostic gene marker panels that are cost effective. This is particularly true when compared to DNA amplifications and rearrangements, which usually require whole genome sequencing and are expensive for patients, especially regarding time series and whole genome sequencing used in tracing tumor lineage evolution during cancer progression.

Clinically, SMCC may significantly facilitate cancer-related diagnoses and treatments, such as personalized tumor medicine [7], targeted tumor therapy [8] and compound medicine [9]. It can also aid cancer early diagnosis (CED) in combination with the sampling and sequencing of circulating tumor cells (CTCs) or circulating tumor DNA (ctDNA) [10–12]. Given the promising applications above, SMCC has been widely studied in recent research [13–15].

In recent years, the drastic development of machine learning methods has greatly facilitated research in bioinformatics, including SMCC. In order to predict cancer types/subtypes more effectively, many machine learning approaches have been applied in existing cancer type prediction works, with promising results [16–18]. Remarkable developments have been demonstrated in tumor cases of colorectal [19], breast [20], ovary [21], brain [22], and melanoma [23]. However, there are at least three major unresolved challenges:

(1) Normal sequencing results involve an extremely large number of genes, usually in the tens of thousands, but only a small discriminatory subset of genes is related to the cancer classification task. The other genes are largely irrelevant, and their existence only obstructs the cancer classification. Many recent works have been conducted on identifying the discriminatory subset of genes. For example, Cho et al. [24] apply the mean and standard deviation of the distances from each sample to the class center as criteria for classification; Yang et al. [25] improve the method in [24] and bring inter-class variations into the algorithm; Cai et al. [26] propose clustered gene selection, which groups the genes via k-means clustering and picks the top genes in each group that are closest to the centroid locations. These methods are simple and effective in some cases, but their heuristics are designed for continuous gene expression data, and are not directly applicable to discrete, and especially binary, point mutation data.

(2) Even within the discriminatory subset, the majority of genes are not guaranteed to contain informative point mutations and often remain normal (i.e. zero values in the data) [27], which results in extremely sparse gene data (even all-zeros) that is difficult to classify. Yet, to the best of our knowledge, there has been no existing work specifically devised to reduce the data sparsity for SMCC.

(3) Different genes related to specific types of cancer are generally correlated and have complex interactions, which may impede the application of conventional simple linear classifiers such as the linear kernel support vector machine (SVM) [28]. Therefore, an advanced classifier capable of extracting the high-level features within the discriminatory subset is desired. Although there have been recent works utilizing sparse coding [29] or auto-encoders [17] for gene annotation, no work has been devoted to applying high-level machine learning approaches to SMCC.

In recent years, the developments of the deep neural network (DNN) [30] have equipped bioinformaticians with powerful machine learning tools. A DNN is a type of artificial neural network that aims to model abstracted high-level data features using multiple nonlinear and complex processing layers, and provides feedback via back-propagation [31]. First introduced in 1989 [32], DNN has garnered tremendous developments and is widely applied in image classification [33, 34], object localization [35, 36], facial recognition [37, 38], etc. DNN has the potential to introduce novel opportunities for SMCC, where it perfectly fits the need for large scale data processing and high-level feature extraction. However, to the present, applying customized DNN to SMCC is yet to be explored.

In this paper, we propose a novel SMCC method, named DeepGene, designed to simultaneously address the three identified issues. DeepGene is a DNN-based classification model composed of three steps. It first conducts two pre-processing techniques, including the clustered gene filtering (CGF) based on mutation occurrence frequency, and the indexed sparsity reduction (ISR) based on indexes of non-zero elements; the gene data is then classified by a fully-connected DNN classifier into a specific cancer type. The proposed DeepGene model has four distinct contributions.
Fig. 1 Flowchart of the proposed DeepGene method. The raw gene data is first pre-processed by the clustered gene filtering (CGF) and the indexed sparsity reduction (ISR) separately, and then fed into the DNN classifier. The output label from the DNN indicates the cancer type of the input gene sample
Table 1 Workflow of Clustered Gene Filtering (CGF)
Input: Gene data matrix A ∈ {0, 1}^(m×n), distance threshold dCGF, group element threshold nCGF.
1: Sum A by row, sort the result in descending order, and obtain the sorted index A*sum;
2: Initialize each element as ungrouped;
3: For each ungrouped element i in A*sum:
   (a) For each ungrouped element j in A*sum other than i:
       i. Calculate the similarity d(A(i, :), A(j, :));
       ii. If d(A(i, :), A(j, :)) > dCGF, assign j into the group of i;
4: Set the output gene index set gout = ∅;
5: For each group c of A after step 3:
   (a) If the group element number nc ≥ nCGF, select the top nCGF genes with the highest mutation occurrence frequency as gc;
   (b) gout = gout ∪ gc;
6: Apply the index set gout on A and get the filtered gene data ACGF = A(gout, :);
Output: ACGF, i.e. the gene data after CGF

Starting from A*sum(1), which stands for the index of the gene with the highest occurrence frequency, we calculate its similarity with each of the following genes. If their similarity is larger than a predefined threshold dCGF, the latter gene is merged into the group of A*sum(1). After the loop for A*sum(1), we conduct the same loop for the next ungrouped element in A*sum, until all the genes are grouped with a unique group ID.

The final step is to filter the elements from each group and form the discriminatory subset. We do this by selecting the top nCGF genes in each group with the highest mutation occurrence frequency, where nCGF is another predefined threshold. Groups that have fewer than nCGF elements are discarded. All of the selected genes are then united as the result of CGF (steps 5 and 6 in Table 1).
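To make the CGF workflow concrete, the following is a minimal NumPy sketch of Table 1. The similarity measure d is not restated in this section, so cosine similarity between binary mutation profiles is assumed here as a placeholder; variable names mirror the table.

import numpy as np

def cosine_sim(u, v):
    # Assumed similarity measure; the text does not restate the exact d.
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return (u @ v) / denom if denom > 0 else 0.0

def clustered_gene_filtering(A, d_cgf=0.7, n_cgf=5, sim=cosine_sim):
    # A is the {0,1} gene-by-sample matrix; returns indexes of retained genes.
    freq = A.sum(axis=1)                      # step 1: occurrence frequencies
    order = np.argsort(-freq)                 # A*sum: genes by descending frequency
    group = -np.ones(A.shape[0], dtype=int)   # step 2: -1 marks "ungrouped"
    n_groups = 0
    for pos, i in enumerate(order):           # step 3: greedy grouping
        if group[i] != -1:
            continue
        group[i] = n_groups
        for j in order[pos + 1:]:
            if group[j] == -1 and sim(A[i], A[j]) > d_cgf:
                group[j] = n_groups
        n_groups += 1
    g_out = []                                # steps 4-6: per-group selection
    for c in range(n_groups):
        members = np.flatnonzero(group == c)
        if members.size >= n_cgf:             # smaller groups are discarded
            g_out.extend(members[np.argsort(-freq[members])[:n_cgf]].tolist())
    return sorted(g_out)

# Toy usage: 8 genes x 6 samples of binary mutation calls.
rng = np.random.default_rng(0)
A = (rng.random((8, 6)) < 0.3).astype(int)
print(clustered_gene_filtering(A, d_cgf=0.5, n_cgf=2))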
Indexed sparsity reduction
Although CGF can effectively locate the discriminatory gene subset and filter out the majority of irrelevant genes, it is still probable that the selected gene subset is highly sparse, i.e. most of the elements in ACGF are zeros. The high sparsity is likely to obscure any distinguishable feature in the gene data and severely hinder the classification. Hence, an effective process for reducing the gene data sparsity is highly desired.

To address the data sparsity issue, we propose the indexed sparsity reduction (ISR) procedure, which minimizes the sparsity by converting the gene data into the indexes of its non-zero genes. For a 1 × n gene sample p ∈ {0, 1}^(1×n), let the number of its non-zero elements be nNZ, and set a pre-defined threshold nISR. If nNZ ≥ nISR, we find the indexes of the top nISR non-zero elements that have the highest occurrence frequency in A*sum of the previous section, and these nISR indexes are listed in ascending order as a vector pISR, which is the output of ISR; if nNZ < nISR, we conduct zero-padding at the tail of pISR to make it have the length nISR. The workflow of ISR is illustrated in Fig. 2.

The significance of ISR is apparent. For each gene sample p, ISR filters out the majority of its zero elements and keeps most (if nNZ ≥ nISR) or all (if nNZ < nISR) of its non-zero elements. Since nISR ≪ length(p), the percentage of zero elements drops dramatically after ISR, which means the impact of data sparsity is significantly suppressed.
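A minimal sketch of the ISR conversion described above. The freq_rank argument stands in for the A*sum ordering computed during CGF, and indexes are stored 1-based (matching the authors' MATLAB setting) so that the zero padding cannot collide with a real gene index; both choices are assumptions of this sketch.

import numpy as np

def indexed_sparsity_reduction(p, freq_rank, n_isr=800):
    # p: one {0,1} sample vector; freq_rank[g]: rank of gene g in A*sum
    # (0 = most frequently mutated). Output is a fixed-length index vector.
    nz = np.flatnonzero(p)                          # positions of non-zero genes
    if nz.size >= n_isr:
        nz = nz[np.argsort(freq_rank[nz])[:n_isr]]  # keep the most frequent ones
    out = np.sort(nz) + 1                           # ascending order, 1-based
    return np.concatenate([out, np.zeros(n_isr - out.size, dtype=int)])

# Toy usage: a 20-gene sample reduced to a fixed-length index vector.
rng = np.random.default_rng(1)
p = (rng.random(20) < 0.25).astype(int)
freq_rank = rng.permutation(20)                     # stand-in for the A*sum ranking
print(indexed_sparsity_reduction(p, freq_rank, n_isr=6))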
DNN-based classifier
As introduced in the previous two sections, both CGF and ISR have their own advantages when conducted alone. However, the performance can be even higher when they are combined (see more details in the "Evaluate the effect of combining CGF and ISR" Section). We thus combine both CGF and ISR as the pre-processing for our DNN-based classifier.

As shown in Fig. 1, the raw gene data is processed by CGF and ISR separately, and then concatenated as the input of the DNN classifier. The concatenation is conducted by appending the output of ISR to the tail of the output of CGF, by which the two outputs form a new and longer data vector. The classifier is a feed-forward artificial neural network with fixed input and output sizes, and multiple hidden layers for data processing. For a hidden layer l, its activation (or output value to the next layer) is computed as

$x_l = f(z_{l-1})$,

where f is the activation function and $z_l$ is the total weighted sum of the input:

$z_l = W_l x_l + b_l$,

where $W_l$ and $b_l$ are the weight matrix and bias vector of layer l (to be learned in training). In our case, we adopt the ReLU [40] function as f, and $x_1$ is the input gene data after pre-processing. The size of the last layer L's output $x_L$ equals the number of cancer types $n_{cancer}$ ($n_{cancer} = 12$ in our case). $x_L$ is then processed by a softmax layer [41], and the loss J is computed by the logarithmic loss function

$J = -\sum_{i=1}^{n_{cancer}} y_i \log P_i$,

where $y_i \in \{0, 1\}$ is the ground truth label of cancer type i, and

$P_i = \exp(x_L(i)) / \sum_j \exp(x_L(j))$

is the softmax probability of cancer type i.
Fig. 2 Flowchart of the Indexed Sparsity Reduction (ISR) step. After indexing of the non-zero elements, if nNZ ≥ nISR, select the top nISR non-zero elements
that have the highest occurrence frequency; if nNZ < nISR, we conduct zero-padding to the tail of the output data so that it has the length of nISR
In training, the loss J is transferred from the last layer to the former layers via back-propagation [32], by which the parameters W and b of each layer are updated. The training then enters the next epoch, and the feed-forwarding and back-propagation are conducted again. The training stops when a pre-defined epoch number is reached. In testing, only the feed-forwarding is conducted (once) for a testing sample, and the type of cancer i corresponding to the largest softmax probability Pi is adopted as the classification result. The workflow of the DNN classifier is summarized in Table 2, and the complete flowchart of DeepGene is illustrated in Fig. 1.
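As an illustration of this training scheme, the sketch below runs one epoch of feed-forwarding and back-propagation for a single-hidden-layer version of the classifier with plain gradient descent. It is a scaled-down stand-in rather than the authors' implementation, which is deeper and trained in MatConvNet.

import numpy as np

def train_epoch(X, Y, W1, b1, W2, b2, lr=0.1):
    # One epoch of per-sample feed-forwarding and back-propagation for a
    # single-hidden-layer softmax classifier.
    for x, y in zip(X, Y):
        h = np.maximum(W1 @ x + b1, 0.0)           # feed-forward (ReLU hidden)
        s = W2 @ h + b2
        e = np.exp(s - s.max())
        P = e / e.sum()                            # softmax probabilities
        ds = P - y                                 # dJ/ds for J = -sum y_i log P_i
        dh = W2.T @ ds                             # back-propagate to hidden layer
        dz1 = dh * (h > 0)                         # ReLU derivative
        W2 -= lr * np.outer(ds, h); b2 -= lr * ds  # update W and b of each layer
        W1 -= lr * np.outer(dz1, x); b1 -= lr * dz1
    return W1, b1, W2, b2

# Toy usage: 2 classes, 4-dimensional inputs, a fixed number of epochs.
rng = np.random.default_rng(4)
X = rng.standard_normal((20, 4))
Y = np.eye(2)[(X[:, 0] > 0).astype(int)]           # one-hot labels
W1, b1 = 0.1 * rng.standard_normal((8, 4)), np.zeros(8)
W2, b2 = 0.1 * rng.standard_normal((2, 8)), np.zeros(2)
for epoch in range(10):                            # stops at a pre-defined epoch count
    W1, b1, W2, b2 = train_epoch(X, Y, W1, b1, W2, b2)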
Results
Experiment setup
Dataset
Our experiments are all conducted on the newly proposed TCGA-DeepGene dataset, which is a re-formulated subset of The Cancer Genome Atlas (TCGA) dataset [39] that is widely applied in genomic research.

The TCGA-DeepGene subset is formulated by assembling the genes that contain somatic point mutations for each of the 12 selected types of cancer. Detailed sample and point mutation statistics for each cancer type can be found in Table 3. The data is collected from the TCGA database with the filter criteria IlluminaGA_DNASeq_Curated, updated before April 2015. The mutation information for a gene is represented by a binary value according to one or more mutations (1) or no mutation (0) on that gene for a specific sample. We assemble a total of 22,834 genes from the 3122 samples, and generate a 22,834 × 3122 binary data matrix (i.e. the original data matrix A). This data matrix is the product of our proposed TCGA-DeepGene subset, where each sample (column) is assigned one of the labels {1, 2, …, 12} denoting the 12 types of cancer above.

To facilitate the 10-fold cross validation in the following experiments, we randomly divide the samples in each of the 12 cancer categories into 10 subgroups, and each time we take the union of one subgroup from each cancer category as the validation set, while all the other subgroups are combined as the training set. This formulates 10 training/validation configurations with fair distributions of the 12 types of cancer, which will be used for the 10-fold cross validation in our following experiments.
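A minimal sketch of this fold construction, assuming integer class labels; folds[k] collects the validation indexes of the k-th training/validation configuration.

import numpy as np

def stratified_folds(labels, n_folds=10, seed=0):
    # Randomly split the samples of each cancer category into n_folds
    # subgroups; fold k unions the k-th subgroup of every category.
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(n_folds)]
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        for k, chunk in enumerate(np.array_split(idx, n_folds)):
            folds[k].extend(chunk.tolist())
    return folds   # folds[k] = validation indexes of configuration k

# Toy usage: 3 categories of 10 samples each, 5 folds.
labels = np.repeat([1, 2, 3], 10)
print([len(f) for f in stratified_folds(labels, n_folds=5)])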
Table 2 Workflow of the DNN classifier
Input: Gene data matrix A ∈ {0, 1}^(m×n) after CGF and ISR, where rows and columns correspond to samples and genes, respectively; max training epoch Emax.
1: Training: for each training epoch e ≤ Emax:
   (a) For each sample ai = A(i, :):
       i. Conduct feed-forwarding and compute the loss J;
       ii. Conduct back-propagation to update the W and b of each layer;
2: Testing: for each sample ai = A(i, :):
   (a) Conduct feed-forwarding and get the softmax probability P;
   (b) Adopt the cancer type corresponding to max(P) as the result for ai.
Output: Trained network model (training) or classification results for the samples (testing).

Constant parameters
For the proposed DNN classifier, the output size is set to 12 (i.e. the 12 types of cancer to be classified); the total training epoch Emax is set to 50; the learning rate is set to a 50-point logarithmic space between 10^-1 and 10^-4; the weight decay is set to 0.0005; and the training batch size (i.e. the number of samples per training batch) is set to 256.

Additionally, in order to facilitate the evaluation of the variable parameters, we set each parameter a default value: the distance threshold dCGF is set to 0.7; the group element threshold nCGF is set to 5; the non-zero element threshold nISR is set to 800; the hidden layer number and parameters per layer of the DNN classifier are set to 4 and 8192, respectively.

Evaluation metrics
For all the evaluations in our experiments, we randomly select 90% (2810) of the samples for training, and the remaining 10% (312) samples for testing.
Table 3 Sample and mutation statistics of the TCGA-DeepGene dataset on 12 cancer types
Cancer name | Sample number | Missense mutation | Nonsense mutation | Nonstop mutation | RNA | Silent | Splice_Site | Translation start site | Total
ACC | 91 | 6741 | 501 | 15 | 368 | 2534 | 344 | 42 | 10,545
BLCA | 130 | 24,067 | 2142 | 46 | 0 | 9662 | 528 | 55 | 36,500
BRCA | 992 | 55,063 | 4841 | 133 | 3998 | 17,901 | 1424 | 0 | 83,360
CESC | 194 | 26,606 | 2716 | 84 | 5595 | 9765 | 527 | 0 | 45,293
HNSC | 279 | 31,416 | 2545 | 44 | 0 | 12,149 | 776 | 0 | 46,930
KIRP | 171 | 8910 | 499 | 17 | 394 | 3411 | 524 | 0 | 13,755
LGG | 284 | 5341 | 378 | 7 | 102 | 2074 | 294 | 0 | 8196
LUAD | 230 | 44,800 | 3477 | 46 | 0 | 15,594 | 1377 | 99 | 65,393
PAAD | 146 | 21,067 | 1496 | 19 | 859 | 7936 | 1005 | 111 | 32,493
PRAD | 261 | 9628 | 563 | 15 | 652 | 3750 | 513 | 55 | 15,176
STAD | 288 | 82,265 | 4200 | 92 | 48 | 33,344 | 1868 | 227 | 122,044
UCS | 56 | 3070 | 187 | 2 | 234 | 1114 | 171 | 0 | 4778
Total | 3122 | 318,974 | 23,545 | 520 | 12,250 | 119,234 | 9351 | 589 | 484,463
In parameter optimization steps for DeepGene, we adopt the 10-fold cross validation accuracy on the training set as the evaluation metric; on the other hand, in the comparison with widely adopted models, we adopt the testing accuracy as the evaluation metric.

Implementation
The CGF and ISR steps are implemented by original coding in MATLAB, while the DNN classifier is implemented on the MatConvNet toolbox [42], which is a MATLAB-based convolutional neural network (CNN) toolbox with various extensibilities.

Evaluation of design options
Determination of CGF's variables
There are two variables that need to be experimentally determined for the CGF step, namely the distance threshold dCGF and the group element threshold nCGF.
Table 4 10-fold cross validation accuracies (%) of DeepGene with different nCGF (row) and dCGF (column)
The optimal result is marked in red. Mean accuracy: 53.0%; standard deviation: 5.01%; maximum accuracy: 63.9%; minimum accuracy: 38.9%. The corresponding
3D bar-plot is shown in Fig. 3a for sensitivity review
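The parameter determination amounts to an exhaustive grid search over dCGF and nCGF scored by 10-fold cross validation. A minimal sketch follows, where evaluate is a placeholder for running the full CGF + ISR + DNN pipeline and returning the mean validation accuracy.

from itertools import product

def grid_search(evaluate, d_cgf_values, n_cgf_values):
    # Exhaustive search over the two CGF variables, keeping the pair with the
    # best mean 10-fold cross validation accuracy (the procedure behind Table 4).
    best = (None, None, -1.0)
    for d_cgf, n_cgf in product(d_cgf_values, n_cgf_values):
        acc = evaluate(d_cgf, n_cgf)
        if acc > best[2]:
            best = (d_cgf, n_cgf, acc)
    return best

# Toy usage with a stand-in scoring function; the real evaluate would run the
# entire CGF + ISR + DNN pipeline for each parameter pair.
dummy = lambda d, n: 1.0 - abs(d - 0.7) - 0.01 * abs(n - 5)
print(grid_search(dummy, [0.5, 0.6, 0.7, 0.8], [3, 5, 7, 10]))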
Fig. 3 3D bar-plots of parameter estimations for sensitivity review. The Z-axis stands for 10-fold cross validation accuracy. a Parameter estimation
for dCGF and nCGF, corresponding to Table 4; b parameter estimation for layer number and parameter number per layer for the DNN classifier,
corresponding to Table 5; c parameter estimation for cost and gamma for SVM, corresponding to Table 6; d parameter estimation for Table 7
Fig. 4 Non-zero element distribution of the gene samples in the TCGA-DeepGene dataset. Ninety-seven percent of all the 3122 samples have no
more than 800 non-zero gene elements
Determine the network architecture
We also need to determine the network architecture for the DNN classifier, which involves two variables: the hidden layer number (#layer) and the parameter number per layer (#param). Enlightened by [43], we monitor the classifier's 10-fold cross validation accuracy with various hidden layer numbers and parameter numbers, the results of which are listed in Table 5, and the corresponding 3D bar-plot presenting sensitivity is shown in Fig. 3b. We see that the performance reaches its optimum at #layer = 4 and #param = 8192. These values are thus adopted in our following experiments.

Evaluate the effect of combining CGF and ISR
After determining the related parameters for the three steps of DeepGene, we evaluate the impact of our two major innovations, i.e. CGF and ISR. It is mentionable that we conduct CGF and ISR separately and concatenate their results (as shown in Fig. 1) instead of conducting them consecutively. The reason is that the outputs of CGF and ISR are binary data and index data, respectively. Consecutive conduction would only leave the index data (from ISR), while separate conduction can benefit from both the binary data and the index data, thus introducing less bias.

Based on Fig. 1, we compare the performances of the DNN classifier with different input configurations (assembled as in the sketch after this list):

(1) CGF and ISR (i.e. the proposed input structure);
(2) Only CGF (the upper half of Fig. 1);
(3) Only ISR (the lower half of Fig. 1);
(4) Neither CGF nor ISR (use the raw gene data instead).
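A small sketch of how the four input configurations above can be assembled for one sample; cgf_idx and isr_vec stand for the outputs of the two pre-processing sketches given earlier.

import numpy as np

def make_input(sample, mode, cgf_idx, isr_vec):
    # cgf_idx: gene indexes kept by CGF; isr_vec: the sample's ISR output
    # (e.g. from the indexed_sparsity_reduction sketch earlier).
    if mode == "cgf+isr":
        # Append the ISR index vector to the tail of the binary CGF vector.
        return np.concatenate([sample[cgf_idx], isr_vec])
    if mode == "cgf":
        return sample[cgf_idx]
    if mode == "isr":
        return isr_vec
    return sample  # raw gene data

# Toy usage: 10 genes, CGF keeps 4 of them, ISR output of length 3.
sample = np.array([0, 1, 0, 0, 1, 1, 0, 0, 1, 0])
print(make_input(sample, "cgf+isr", np.array([1, 4, 5, 8]), np.array([2, 5, 6])))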
Table 5 10-fold cross validation accuracies (%) of DeepGene with different #layer (row) and #param (column)
The optimal result is marked in red. Mean accuracy: 57.9%; standard deviation: 3.42%; maximum accuracy: 64.0%; minimum accuracy: 53.2%. The corresponding
3D bar-plot is shown in Fig. 3b for sensitivity review
The 10-fold cross validation results are shown in Fig. 5. It is clearly observed that the complete CGF + ISR outperforms both CGF and ISR conducted alone, and also significantly outperforms the raw data without any pre-processing.
Comparison with widely adopted models
We then select the three most representative data classifiers that are commonly used in SMCC as comparison methods, namely Support Vector Machine (SVM) [28], k-Nearest Neighbors (KNN) [44] and Naïve Bayes (NB) [45]. In order to exhibit the pre-processing effect of CGF and ISR, all the comparison methods use raw gene data as inputs. The three methods are set up as below.

SVM: we use the LIBSVM toolbox [46] in implementing the SVM. Based on the results of a previous work on gene classification [26], the kernel type (-t) is set to 0 (linear kernel). Note that because the feature set is high dimensional, the linear kernel is suggested over the RBF (Gaussian) kernel [46]; this suggestion is consistent with our trial and error experience on this problem. A 10-fold cross validation is conducted to optimize the parameters cost (-c) and gamma (-g), and the other parameters are set to their default values. The cross validation results are shown in Table 6, and the corresponding 3D bar-plot presenting sensitivity is shown in Fig. 3c. We adopt 2^2 = 4 and 2^-5 = 0.0313 for -c and -g, respectively, which lead to the best results in Table 6.

KNN: we compare the performances of the Euclidean distance and the Pearson correlation coefficient, which are the two most commonly used similarity measures in gene data analysis [26]. The 10-fold cross validation results of the two similarity measures with different neighborhood numbers are shown in Table 7, and the corresponding 3D bar-plot presenting sensitivity is shown in Fig. 3d. We adopt the Pearson correlation coefficient and set the neighborhood number to 4, which lead to the optimal validation accuracy.

NB: following [47], the average percentage of non-zero elements in the samples of each cancer category is set as the prior probability.

In the performance comparison between different models, the testing accuracy is adopted as the evaluation metric (see the "Evaluation metrics" Section), which is generally slightly lower than the 10-fold validation accuracy of the corresponding model. The experiment results are plotted in Fig. 6. DeepGene shows a significant advantage against all three comparison methods. The performance improvements are 24.3% (65.5% vs. 52.7%), 60.5% (65.5% vs. 40.8%) and 710% (65.5% vs. 9.23%) against SVM, KNN and NB, respectively. To further validate the performance of the DNN classifier itself without CGF and ISR, we also record the accuracy of the DNN classifier with raw gene data, which is the same input as the comparison methods. The results are shown in Fig. 7, in which the DNN classifier still has the optimal accuracy (60.1%) against all of the comparison methods.
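For orientation, the following sketch sets up rough scikit-learn stand-ins for the three comparison methods with the settings reported above; the paper itself uses the LIBSVM MATLAB toolbox, so this is an approximation rather than the original setup. Note that scipy's "correlation" distance corresponds to 1 minus the Pearson coefficient.

import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import BernoulliNB

# Stand-ins for the three comparison classifiers with the reported settings.
svm = SVC(kernel="linear", C=4, gamma=0.0313)      # -t 0, -c 4, -g 0.0313
knn = KNeighborsClassifier(n_neighbors=4, metric="correlation")
nb = BernoulliNB()                                 # priors set separately in [47]

# Toy usage on random binary "gene" data with 3 classes.
rng = np.random.default_rng(3)
X = (rng.random((60, 100)) < 0.1).astype(float)
y = rng.integers(0, 3, size=60)
for clf in (svm, knn, nb):
    print(type(clf).__name__, clf.fit(X[:50], y[:50]).score(X[50:], y[50:]))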
Fig. 5 10-fold cross validation accuracy of DeepGene with different design options. Performance comparison of the complete DeepGene input
structure (CGF + ISR), CGF only, ISR only and raw gene data. The complete DeepGene shows significant advantage against the other
three options
Table 6 10-fold cross validation accuracy (%) of SVM with different cost (row) and gamma (column) parameters
The optimal result is marked in red. Mean accuracy: 46.6%; standard deviation: 3.97%; maximum accuracy: 55.4%; minimum accuracy: 37.6%. The corresponding
3D bar-plot is shown in Fig. 3c for sensitivity review
… negative influence, and only focus the data to the discriminatory gene subset.
Table 7 10-fold cross validation accuracies (%) of KNN with different similarity measures (row) and neighborhood numbers (column)
The optimal result is marked in red. Mean accuracy: 35.3%; standard deviation: 5.63%; maximum accuracy: 43.6%; minimum accuracy: 28.2%. The corresponding
3D bar-plot is shown in Fig. 3d for sensitivity review
Fig. 6 Testing accuracy of DeepGene against three widely adopted classifiers. DeepGene is clearly advantageous to the comparison methods
Fig. 7 Testing accuracy of DeepGene against three widely adopted classifiers with raw gene input data. All methods use raw gene data as input.
The DNN classifier is still favorable against the other methods
… more significant than what the CGF contributes. We attribute ISR's advantage to its remarkable reduction of the gene data sparsity. It is also mentionable that ISR exhibits more strength when combined with CGF, as the first bar in Fig. 5 indicates. This can be explained by the synergy effect between the binary gene data and the indexed gene data.

Furthermore, we note that ISR conducts a lossless conversion when nNZ ≤ nISR, i.e. the indexed data can be readily converted back to the original binary data if necessary.
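The lossless case can be verified directly: with 1-based indexes and zero padding (as in the ISR sketch earlier), the original binary sample is recovered from pISR as follows.

import numpy as np

def isr_inverse(p_isr, n):
    # Every non-zero entry of p_isr is the 1-based index of a mutated gene,
    # so the binary sample of length n is recovered exactly when nNZ <= nISR.
    p = np.zeros(n, dtype=int)
    p[p_isr[p_isr > 0] - 1] = 1
    return p

# Toy usage: indexes {2, 5, 6} (1-based) with zero padding, back to 10 genes.
print(isr_inverse(np.array([2, 5, 6, 0, 0]), 10))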
Data optimization by CGF and ISR
Besides aiding our DeepGene method, the CGF and ISR steps can also benefit other classification methods through input data optimization. To evaluate the optimization effect, we apply CGF + ISR to the three classifiers SVM, KNN and NB discussed in the "Comparison with widely adopted models" Section, and record their testing accuracies before and after the input data optimization. For fair comparison, the parameters of the classifiers remain the same.

Figure 8 shows the accuracy change before and after the input data optimization by CGF + ISR. It is observed that applying CGF + ISR notably refines the input data, thus improving the testing accuracies of the classifiers. We also note that with CGF + ISR, the accuracy improvements of the three classifiers are not as large as that of DeepGene. Since DeepGene is based on DNN, it is more advantageous in processing complicated data structures, and thus benefits more from CGF + ISR.
DNN classifier
The DNN classifier is the mainstay of DeepGene, conducting the classification and generating the final output. Figure 6 has shown the significant advantage of DeepGene against three widely adopted classifiers, over which DeepGene exhibits at least a 24% performance improvement. To examine the performance of the DNN classifier itself without the pre-processing steps of CGF and ISR, we also record the accuracy of the DNN classifier with raw gene data in Fig. 7, which shows that the DNN classifier still generates the best accuracy (60.1% against the second best 52.7% of SVM).

To further validate that the 10-fold validation accuracy of DNN is indeed higher than that of SVM, we assume that the two classifiers are independent of each other, and conduct a t-test with the null hypothesis that the two classifiers have equal validation accuracy, at the significance level of 0.001. The sample standard deviations of DNN and SVM are recorded as $s_{X_1} = 1.51\% = 0.0151$ and $s_{X_2} = 2.12\% = 0.0212$, respectively. The t statistic is then calculated as:
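The computed statistic itself falls outside the extracted text; for reference, the sketch below applies the standard two-sample t statistic with equal fold counts (n = 10), plugging in the standard deviations above and, purely for illustration, the maximum 10-fold accuracies reported in the notes to Tables 5 and 6.

import math

def two_sample_t(x1, x2, s1, s2, n=10):
    # Standard two-sample t statistic with equal group sizes n:
    # t = (x1 - x2) / sqrt((s1^2 + s2^2) / n)
    return (x1 - x2) / math.sqrt((s1 ** 2 + s2 ** 2) / n)

# Illustrative values: maximum 10-fold accuracies of DNN (64.0%, Table 5 note)
# and SVM (55.4%, Table 6 note) with the standard deviations reported above.
print(two_sample_t(0.640, 0.554, 0.0151, 0.0212))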
Fig. 8 Testing accuracies of three widely adopted classifiers with and without CGF + ISR for input data optimization. Applying CGF + ISR can
notably refine the input data, thus improve the testing accuracies of the classifiers
Consent for publication
Not applicable.

Ethics approval and consent to participate
Not applicable.

Author details
1 School of Information Technologies, The University of Sydney, Darlington, NSW 2008, Australia. 2 Key Laboratory of Systems Biomedicine, Shanghai Center for Systems Biomedicine, Shanghai Jiaotong University, Shanghai 200240, China.

Published: 23 December 2016

References
1. Feuerstein M. Defining cancer survivorship. J Cancer Surviv. 2007;1(1):5–7.
2. Stewart B, Wild CP. World cancer report 2014. World; 2015.
3. DeFrancesco L. Life Technologies promises $1,000 genome. Nat Biotechnol. 2012;30(2):126.
4. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999;286(5439):531–7.
5. Greenman C, Stephens P, Smith R, Dalgliesh GL, Hunter C, Bignell G, Davies H, Teague J, Butler A, Stevens C. Patterns of somatic mutation in human cancer genomes. Nature. 2007;446(7132):153–8.
6. Wang Q, Jia P, Li F, Chen H, Ji H, Hucks D, Dahlman KB, Pao W, Zhao Z. Detecting somatic point mutations in cancer genome sequencing data: a comparison of mutation callers. Genome Med. 2013;5(10):91.
7. Longo DL. Tumor heterogeneity and personalized medicine. N Engl J Med. 2012;366(10):956–7.
8. Sledge GW. What is targeted therapy? J Clin Oncol. 2005;23(8):1614–5.
9. Gudeman J, Jozwiakowski M, Chollet J, Randell M. Potential risks of pharmacy compounding. Drugs R D. 2013;13(1):1–8.
10. Franken B, de Groot MR, Mastboom WJ, Vermes I, van der Palen J, Tibbe AG, Terstappen LW. Circulating tumor cells, disease recurrence and survival in newly diagnosed breast cancer. Breast Cancer Res. 2012;14(5):1–8.
11. Sleijfer S, Gratama J-W, Sieuwerts AM, Kraan J, Martens JW, Foekens JA. Circulating tumour cell detection on its way to routine diagnostic implementation? Eur J Cancer. 2007;43(18):2645–50.
12. Hayes DF, Smerage J. Is there a role for circulating tumor cells in the management of breast cancer? Clin Cancer Res. 2008;14(12):3646–50.
13. Forbes SA, Beare D, Gunasekaran P, Leung K, Bindal N, Boutselakis H, Ding M, Bamford S, Cole C, Ward S. COSMIC: exploring the world's knowledge of somatic mutations in human cancer. Nucleic Acids Res. 2015;43(D1):D805–11.
14. Watson IR, Takahashi K, Futreal PA, Chin L. Emerging patterns of somatic mutations in cancer. Nat Rev Genet. 2013;14(10):703–18.
15. Koboldt DC, Zhang Q, Larson DE, Shen D, McLellan MD, Lin L, Miller CA, Mardis ER, Ding L, Wilson RK. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 2012;22(3):568–76.
16. Browne RP, McNicholas PD, Sparling MD. Model-based learning using a mixture of mixtures of Gaussian and uniform distributions. IEEE Trans Pattern Anal Mach Intell. 2012;34(4):814–7.
17. Chicco D, Sadowski P, Baldi P. Deep autoencoder neural networks for gene ontology annotation predictions. In: Proc ACM Conf Bioinformatics, Computational Biology, and Health Informatics. Newport Beach; 2014. p. 533–40.
18. Chow CK, Zhu H, Lacy J, Lingen MW, Kuo WP, Chan K. A cooperative feature gene extraction algorithm that combines classification and clustering. In: IEEE Intl Conf Bioinformatics and Biomedicine Workshop (BIBMW). Washington, DC; 2009. p. 197–202.
19. Huang Z, Huang D, Ni S, Peng Z, Sheng W, Du X. Plasma microRNAs are promising novel biomarkers for early detection of colorectal cancer. Int J Cancer. 2010;127(1):118–26.
20. Aaroe J, Lindahl T, Dumeaux V, Saebo S, Tobin D, Hagen N, Skaane P, Lonneborg A, Sharma P, Borresen-Dale A-L. Gene expression profiling of peripheral blood cells for early detection of breast cancer. Breast Cancer Res. 2010;12(1):R7.
21. Kurman RJ, Visvanathan K, Roden R, Wu T, Shih I-M. Early detection and treatment of ovarian cancer: shifting from early stage to minimal volume of disease based on a new model of carcinogenesis. Am J Obstet Gynecol. 2008;198(4):351–6.
22. Balss J, Meyer J, Mueller W, Korshunov A, Hartmann C, von Deimling A. Analysis of the IDH1 codon 132 mutation in brain tumors. Acta Neuropathol. 2008;116(6):597–602.
23. Winnepenninckx V, Lazar V, Michiels S, Dessen P, Stas M, Alonso SR, Avril M-F, Romero PLO, Robert T, Balacescu O. Gene expression profiling of primary cutaneous melanoma and clinical outcome. J Natl Cancer Inst. 2006;98(7):472–82.
24. Cho J-H, Lee D, Park JH, Lee I-B. New gene selection method for classification of cancer subtypes considering within-class variation. FEBS Lett. 2003;551(1–3):3–7.
25. Yang K, Cai Z, Li J, Lin G. A stable gene selection in microarray data analysis. BMC Bioinformatics. 2006;7(1):228.
26. Cai Z, Xu L, Shi Y, Salavatipour MR, Goebel R, Lin G. Using gene clustering to identify discriminatory genes with higher classification accuracy. In: IEEE Symp Bioinformatics and BioEngineering (BIBE). Arlington; 2006. p. 235–42.
27. Tao Y, Sam L, Li J, Friedman C, Lussier YA. Information theory applied to the sparse gene ontology annotation network to predict novel gene function. Bioinformatics. 2007;23(13):i529–38.
28. Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20(3):273–97.
29. Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, Aken BL, Barrell D, Zadissa A, Searle S. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 2012;22(9):1760–74.
30. Hinton GE, Salakhutdinov RR. Reducing the dimensionality of data with neural networks. Science. 2006;313(5786):504–7.
31. Deng L, Yu D. Deep learning: methods and applications. Foundations and Trends in Signal Processing. 2014;7(3–4):197–387.
32. LeCun Y, Boser B, Denker JS, Henderson D, Howard RE, Hubbard W, Jackel LD. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1989;1(4):541–51.
33. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. 2012. p. 1097–105.
34. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with convolutions. arXiv preprint arXiv:1409.4842. 2014.
35. Girshick R, Donahue J, Darrell T, Malik J. Rich feature hierarchies for accurate object detection and semantic segmentation. In: IEEE Conf Computer Vision and Pattern Recognition (CVPR). Columbus; 2014. p. 580–7.
36. Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. arXiv preprint arXiv:1411.4038. 2014.
37. Sun Y, Wang X, Tang X. Deep convolutional network cascade for facial point detection. In: IEEE Conf Computer Vision and Pattern Recognition (CVPR). Portland; 2013. p. 3476–83.
38. Sun Y, Wang X, Tang X. Deep learning face representation from predicting 10,000 classes. In: IEEE Conf Computer Vision and Pattern Recognition (CVPR). Columbus; 2014. p. 1891–8.
39. Tomczak K, Czerwińska P, Wiznerowicz M. The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Contemp Oncol. 2015;19(1A):A68. Last downloaded on April 8th, 2015.
40. Nair V, Hinton GE. Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10). 2010. p. 807–14.
41. Bishop CM. Pattern recognition and machine learning. New York: Springer; 2006.
42. Vedaldi A, Lenc K. MatConvNet: convolutional neural networks for MATLAB. arXiv preprint arXiv:1412.4564. 2014.
43. Mostajabi M, Yadollahpour P, Shakhnarovich G. Feedforward semantic segmentation with zoom-out features. arXiv preprint arXiv:1412.0774. 2014.
44. Altman NS. An introduction to kernel and nearest-neighbor nonparametric regression. Am Stat. 1992;46(3):175–85.
45. Rennie JD, Shih L, Teevan J, Karger DR. Tackling the poor assumptions of naive Bayes text classifiers. In: ICML. Washington; 2003. p. 616–23.
46. Chang C-C, Lin C-J. LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol (TIST). 2011;2(3):27.
47. Dudoit S, Fridlyand J, Speed TP. Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc. 2002;97(457):77–87.