
IEEE REVIEWS IN BIOMEDICAL ENGINEERING, VOL. 1, 2008

Machine Learning Methods for Protein Structure Prediction

Jianlin Cheng, Allison N. Tegge, Member, IEEE, and Pierre Baldi, Senior Member, IEEE

Methodological Review

Abstract—Machine learning methods are widely used in bioinformatics and computational and systems biology. Here, we review the development of machine learning methods for protein structure prediction, one of the most fundamental problems in structural biology and bioinformatics. Protein structure prediction is such a complex problem that it is often decomposed and attacked at four different levels: 1-D prediction of structural features along the primary sequence of amino acids; 2-D prediction of spatial relationships between amino acids; 3-D prediction of the tertiary structure of a protein; and 4-D prediction of the quaternary structure of a multiprotein complex. A diverse set of both supervised and unsupervised machine learning methods has been applied over the years to tackle these problems and has significantly contributed to advancing the state-of-the-art of protein structure prediction. In this paper, we review the development and application of hidden Markov models, neural networks, support vector machines, Bayesian methods, and clustering methods in 1-D, 2-D, 3-D, and 4-D protein structure predictions.

Index Terms—Bioinformatics, machine learning, protein folding, protein structure prediction.

Manuscript received August 27, 2008; revised October 03, 2008. First published November 05, 2008; current version published December 12, 2008. The work of J. Cheng was supported by an MU research board grant. The work of A. N. Tegge was supported by an NLM fellowship. The work of P. Baldi was supported in part by the MU bioinformatics Consortium, in part by NIH Biomedical Informatics Training under Grant (LM-07443-01), and in part by the National Science Foundation under MRI Grant (EIA-0321390) and Grant (0513376).
J. Cheng is with the Computer Science Department, University of Missouri, Columbia, MO 65211 USA (e-mail: [email protected]).
A. N. Tegge is with the Informatics Institute, University of Missouri, Columbia, MO 65211 USA (e-mail: [email protected]).
P. Baldi is with the Department of Computer Science and the Institute for Genomics and Bioinformatics, University of California, Irvine, CA 92697 USA (e-mail: [email protected]).
Digital Object Identifier 10.1109/RBME.2008.2008239
1937-3333/$25.00 © 2008 IEEE

Fig. 1. Protein sequence-structure-function relationship. A protein is a linear polypeptide chain composed of 20 different kinds of amino acids represented by a sequence of letters (left). It folds into a tertiary (3-D) structure (middle) composed of three kinds of local secondary structure elements (helix – red; beta-strand – yellow; loop – green). The protein with its native 3-D structure can carry out several biological functions in the cell (right).

I. INTRODUCTION

A protein is a polymeric macromolecule made of amino acid building blocks arranged in a linear chain and joined together by peptide bonds. The linear polypeptide chain is called the primary structure of the protein. The primary structure is typically represented by a sequence of letters over a 20-letter alphabet associated with the 20 naturally occurring amino acids.

In its native environment, the chain of amino acids (or residues) of a protein folds into local secondary structures including alpha helices, beta strands, and nonregular coils [3], [4]. The secondary structure is specified by a sequence classifying each amino acid into the corresponding secondary structure element (e.g., alpha, beta, or loop). The secondary structure elements are further packed to form a tertiary structure depending on hydrophobic forces and side chain interactions, such as hydrogen bonding, between amino acids [5]–[7]. The tertiary structure is described by the x, y, and z coordinates of all the atoms of a protein or, in a more coarse description, by the coordinates of the backbone atoms. Finally, several related protein chains can interact or assemble together to form protein complexes. These protein complexes correspond to the protein quaternary structure. The quaternary structure is described by the coordinates of all the atoms, or all the backbone atoms in a coarse version, associated with all the chains participating in the quaternary organization, given in the same frame of reference.

In a cell, proteins and protein complexes interact with each other and with other molecules (e.g., DNA, RNA, metabolites) to carry out various types of biological functions ranging from enzymatic catalysis, to gene regulation and control of growth and differentiation, to transmission of nerve impulses [8]. Extensive biochemical experiments [5], [6], [9], [10] have shown that a protein’s function is determined by its structure. Thus, elucidating a protein’s structure is key to understanding its function, which in turn is essential for any related biological, biotechnological, medical, or pharmaceutical applications.


Fig. 2. One-dimensional protein structure prediction. Example of 1-D protein structure prediction where the input primary sequence of amino acids is “translated” into an output sequence of secondary structure assignments for each amino acid (C = coil; H = helix; E = beta-strand (extended sheet)).

Experimental approaches such as X-ray crystallography [11], [12] and nuclear magnetic resonance (NMR) spectroscopy [13], [14] are the main techniques for determining protein structures. Since the determination of the first two protein structures (myoglobin and haemoglobin) using X-ray crystallography [5], [6], the number of proteins with solved structures has increased rapidly. Currently, there are about 40 000 proteins with empirically known structures deposited in the Protein Data Bank (PDB) [15]. This growing set of solved structures provides invaluable information to help further understand how a protein chain folds into its unique 3-D structure, how chains interact in quaternary complexes, and how to predict structures from primary sequences [16].

Since the pioneering experiments [1], [2], [5], [6], [17] showing that a protein’s structure is dictated by its sequence, predicting protein structure from its sequence has become one of the most fundamental problems in structural biology (Fig. 1). This is not only a fundamental theoretical challenge but also a practical one, due to the discrepancy between the number of protein sequences and solved structures. In the genomic era, with the application of high-throughput DNA and protein sequencing technologies, the number of protein sequences has increased exponentially, at a pace that exceeds the pace at which protein structures are solved experimentally. Currently, only about 1.5% of protein sequences (about 40 000 out of 2.5 million known sequences available) have solved structures, and the gap between proteins with known structures and with unknown structures is still increasing.

In spite of progress in robotics and other areas, experimental determination of a protein structure can still be expensive, labor intensive, time consuming, and not always possible. Some of the hardest challenges involve large quaternary complexes or particular classes of proteins, such as membrane proteins, which are associated with a complex lipid bilayer environment. These proteins are particularly difficult to crystallize. Although membrane proteins are extremely important for biology and medicine, only a few dozen membrane protein structures are available in the PDB. Thus, in the remainder of this paper we focus almost exclusively on globular, nonmembrane proteins that are typically found in the cytoplasm or the nucleus of the cell, or that are secreted by the cell.

Protein structure prediction software is becoming an important proteomic tool for understanding phenomena in modern molecular and cell biology [18] and has important applications in biotechnology and medicine [19]. Here, we look at protein structure prediction at multiple levels, from 1-D to 4-D [20], and focus on the contributions made by machine learning approaches [21]. The 1-D prediction focuses on predicting structural features such as secondary structure [22]–[25] and relative solvent accessibility [26], [27] of each residue along the primary 1-D protein sequence (Fig. 2). The 2-D prediction focuses on predicting the spatial relationship between residues, such as distance and contact map prediction [28], [29] and disulfide bond prediction [30]–[33] (Fig. 3). One essential characteristic of these 2-D representations is that they are independent of any rotations and translations of the protein, and therefore independent of any frame of coordinates, which appears only at the 3-D level. The 3-D prediction focuses on predicting the coordinates of all the residues or atoms of a protein in 3-D space. Although the ultimate goal is to predict 3-D structure, 1-D and 2-D predictions are often used as input for 3-D coordinate predictors; furthermore, 1-D and 2-D predictions are also of intrinsic interest (Fig. 4). Finally, 4-D prediction focuses on the prediction of the structure of protein complexes comprised of several folded protein chains (Fig. 5).

The 1-D, 2-D, and 3-D protein structure prediction methods are routinely evaluated in the Critical Assessment of Techniques for Protein Structure Prediction (CASP) [34] experiment—a community-wide experiment for blind protein structure prediction held every two years since 1994. The 4-D prediction methods are currently evaluated in the Critical Assessment of Techniques for Protein Interaction (CAPRI) [35]—a community-wide experiment for protein interaction. The assessment results are published in supplemental issues of the journal Proteins.

To date, the most successful structure prediction methods have been knowledge based. Knowledge-based methods involve learning or extracting knowledge from existing solved protein structures and generalizing the gained knowledge to new proteins whose structures are unknown. Machine learning methods [21] that can automatically extract knowledge from the PDB are an important class of tools and have been widely used in all aspects of protein structure prediction. Here, we review the development and application of machine learning methods in 1-D, 2-D, 3-D, and 4-D structure prediction. We focus primarily on unsupervised clustering methods and three supervised machine learning methods, including hidden Markov models (HMMs) [21], [36], [37], neural networks [21], [38], and support vector machines [39], for 1-D, 2-D, 3-D, and 4-D structure prediction problems. We emphasize their applications to the problem of predicting the structure of globular proteins, which are the most abundant proteins—roughly 75% of a typical proteome—and for which several prediction methods have been developed. We also briefly review some applications of these methods to the prediction of the structure of membrane proteins, although far less training data is available for this class of proteins.

II. MACHINE LEARNING METHODS FOR 1-D STRUCTURE PREDICTION

Many protein structural feature predictions are 1-D prediction problems, including, for example, secondary structure prediction, solvent accessibility prediction, disordered region prediction, binding site prediction, functional site prediction, protein domain boundary prediction, and transmembrane helix prediction [22], [23], [33], [40]–[46].


Fig. 3. Two-dimensional protein structure prediction. Example depicts a predicted 2-D contact map with an 8 Angstrom cutoff. The protein sequence is aligned
along the sides of the contact map both horizontally and vertically. Each dot represents a predicted contact, i.e., a residue pair whose spatial distance is below 8
Angstroms. For instance, the red dotted lines mark a predicted contact associated with the pair (D, T).

Fig. 4. Three-dimensional protein structure prediction. Three-dimensional structure predictors often combine information from the primary sequence and the predicted 1-D and 2-D structures to produce 3-D structure predictions.

The input for 1-D prediction problems is a protein primary sequence and the output is a sequence of predicted features for each amino acid in the sequence. The learning goal is to map the input sequence of amino acids to the output sequence of features. The 1-D structure prediction problem is often viewed as a classification problem for each individual amino acid in the protein sequence. Historically, protein secondary structure prediction has been the most studied 1-D problem and has had a fundamental impact on the development of protein structure prediction methods [22], [23], [47]–[49]. Here, we will mainly focus on machine learning methods for secondary structure prediction of globular proteins. Similar techniques have also been applied to other 1-D prediction problems.
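To make this per-residue classification formulation concrete, the following is a minimal sketch (our own illustration, not code from the references) of how a primary sequence can be turned into fixed-size training examples using a sliding window and a sparse one-hot encoding of the 20 amino acids. The window size and the toy sequence are illustrative assumptions; in practice each one-hot row is replaced by profile probabilities derived from a multiple sequence alignment, as discussed below.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"          # 20-letter alphabet
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
SS_CLASSES = {"H": 0, "E": 1, "C": 2}          # helix, strand, coil

def encode_windows(sequence, window=13):
    """Return one fixed-size feature vector per residue.

    Each residue is described by a window of `window` consecutive
    positions centered on it; every position contributes a sparse
    one-hot vector of length 20 (all zeros for positions that fall
    outside the sequence).
    """
    half = window // 2
    n = len(sequence)
    features = np.zeros((n, window * 20), dtype=np.float32)
    for i in range(n):
        for w in range(-half, half + 1):
            j = i + w
            if 0 <= j < n:
                features[i, (w + half) * 20 + AA_INDEX[sequence[j]]] = 1.0
    return features

# Toy example: a short sequence and its secondary structure labels.
seq = "MKTAYIAKQR"
ss  = "CCHHHHHHCC"
X = encode_windows(seq)                        # shape: (10, 13 * 20)
y = np.array([SS_CLASSES[s] for s in ss])      # one class label per residue
print(X.shape, y)
```

Feature matrices of this shape, one row per residue, are the kind of input consumed by the window-based neural networks and support vector machines discussed in this section.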


Early secondary structure prediction methods [47] were based on extracting the statistical correlations between a window of consecutive amino acids in a protein sequence and the secondary structure classification of the amino acid in the center of the window. Simple correlation methods capture a certain amount of information and can reach an accuracy of about 50%, well above chance levels. With the development of more powerful pattern recognition and nonlinear function fitting methods, new approaches have been used to predict protein secondary structures. In the 1980s, feedforward neural networks were first applied to secondary structure prediction and significantly improved prediction accuracy to a level in the 60% to 70% range [48]. This was probably the first time a large-scale machine learning method was successfully applied to a difficult problem in bioinformatics.

A third important breakthrough occurred with the realization that higher accuracy could be achieved by using a richer input derived from a multiple alignment of a sequence to its homologs. This is due to the fact that protein secondary structure is more conserved than protein primary sequence—i.e., protein sequences in the same protein family evolving from the same ancestor have different amino acid sequences but often maintain the same secondary structure [50], [51]. Rost and Sander [22], [23] were the first to combine neural networks with multiple sequence alignments to improve secondary structure prediction accuracy to about 70%–74%. In this approach, instead of encoding each amino acid with a sparse binary vector of length 20 containing a single 1-bit located at a different position for each different amino acid, the empirical probabilities (i.e., normalized frequencies) of the 20 amino acids appearing in the corresponding column of the multiple sequence alignment are used. The positional frequency vector, called the profile of the family at the corresponding position, captures evolutionary information related to the structural properties of the protein family. Profiles are relatively easy to create and allow one to leverage information contained in the sequence databases (e.g., SWISSPROT [53]) that are much larger than the PDB. Profiles are now used in virtually all knowledge-based protein structure prediction methods and have been further refined. For instance, PSI-PRED [24] uses PSI-BLAST [54] to derive new profiles based on position-specific scoring matrices to further improve secondary structure prediction.
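As a concrete illustration of the profile encoding described above, the sketch below computes the per-column amino acid frequencies of a toy multiple sequence alignment. The alignment itself is invented for illustration; real profiles are built from large alignments and, in PSI-BLAST-style position-specific scoring matrices, are further weighted and log-transformed.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def profile_from_alignment(alignment):
    """Empirical probabilities of the 20 amino acids in each column.

    `alignment` is a list of equal-length, gapped sequences; gaps ('-')
    are simply ignored when counting. Returns an (L, 20) matrix whose
    rows sum to 1 (or to 0 for all-gap columns).
    """
    length = len(alignment[0])
    counts = np.zeros((length, 20), dtype=np.float64)
    for seq in alignment:
        for col, aa in enumerate(seq):
            if aa in AA_INDEX:                 # skip gaps and unknown symbols
                counts[col, AA_INDEX[aa]] += 1.0
    totals = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

# Toy alignment of four hypothetical homologous fragments.
msa = ["MKTAYIA",
       "MKSAYVA",
       "MRTAY-A",
       "MKTGYIA"]
profile = profile_from_alignment(msa)          # shape: (7, 20)
print(profile.shape, profile[0].max())          # column 0 is all 'M': max prob 1.0
```

In a window-based predictor, each one-hot row of the previous sketch is simply replaced by the corresponding row of such a profile.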
New algorithmic developments [49], [27] inspired by the theory of probabilistic graphical models [21] have led to more sophisticated recursive neural network architectures that try to improve prediction accuracy by incorporating information that extends beyond the fixed-size window input of traditional feedforward neural networks. Large ensembles of hundreds of neural networks have also been used [55]. The new technologies available, along with the growth of the protein sequence databases used to build profiles, have improved secondary structure prediction accuracy to about 78%–80%. Moreover, hybrid methods [45], [57] that combine neural network approaches with homology searches have been developed to improve secondary structure prediction. Homologous proteins are proteins that are derived from the same evolutionary ancestor and therefore tend to share structural and functional characteristics. A protein that is strongly homologous to another protein with known structure in the PDB [15] will likely share a similar structure. In addition to neural networks, support vector machines (SVMs) are another set of statistical machine learning methods used to predict protein secondary structure and other 1-D features of globular proteins with good accuracy [58].

Fig. 5. Four-dimensional protein structure prediction. Four-dimensional prediction derived by docking individual protein chains to create a protein complex.

Machine learning methods are also frequently used to predict 1-D features of membrane proteins. For instance, neural networks as well as HMMs have been used to identify membrane proteins and predict their topology, which includes predicting the location of their alpha-helical or beta-strand regions and the intracellular or extracellular localization of the loop regions [59], [46].

While 1-D prediction methods have made good progress over the past three decades, there is still room for improvement in both the accuracy and scope of these methods. For instance, secondary structure prediction accuracy is still at least 8% below the predicted limit of 88% [60]. The prediction of protein domain boundaries [33], [40]–[42] and disordered regions [43]–[45] is still at an early stage of development, while already showing promising results. Some improvements may come from algorithmic advances, for instance using ensemble and meta-learning techniques such as bagging and boosting [62] to combine classifiers and improve accuracy. Other improvements may require exploiting new sources of biological information. For instance, gene structure information, such as alternative splicing sites, may be used to improve domain boundary prediction [42].

III. MACHINE LEARNING METHODS FOR 2-D STRUCTURE PREDICTION

The classic 2-D structure prediction problem is the prediction of protein contact maps [28], [63], [64]. A protein contact map is a matrix M, where M_ij is either one or zero, depending on whether the Euclidean distance between the two amino acids at linear positions i and j is above a specified distance threshold (e.g., 8 Angstroms) or not. Distances can be measured, for instance, between corresponding backbone carbon atoms. A coarser contact map can be derived in a similar way by considering secondary structure elements. Finer contact maps can be derived by considering all the atoms of each amino acid. As previously mentioned, contact map representations are particularly interesting due to their invariance with respect to rotations and translations. Given an accurate contact map, several algorithms can be used to reconstruct the corresponding protein 3-D structure [65]–[67]. Since a contact map is essentially another representation of a protein 3-D structure, the difficulty of predicting a contact map is more or less equivalent to the difficulty of predicting the corresponding 3-D structure. Contact maps can also be used to try to infer protein folding rates [68], [69].
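The following is a small sketch (our own illustration, not code from the references) of how such a contact map can be computed from a set of backbone carbon coordinates with the 8 Angstrom threshold mentioned above; the random coordinates stand in for a real structure.

```python
import numpy as np

def contact_map(coords, threshold=8.0):
    """Binary contact map from an (N, 3) array of residue coordinates.

    M[i, j] = 1 if the Euclidean distance between residues i and j is
    below `threshold` (in Angstroms), and 0 otherwise.
    """
    diff = coords[:, None, :] - coords[None, :, :]      # (N, N, 3) pairwise differences
    dist = np.sqrt((diff ** 2).sum(axis=-1))            # (N, N) pairwise distances
    return (dist < threshold).astype(np.int8)

# Stand-in coordinates for a 50-residue protein (e.g., CA atoms).
rng = np.random.default_rng(0)
ca = np.cumsum(rng.normal(scale=2.0, size=(50, 3)), axis=0)
M = contact_map(ca)
print(M.shape, M.sum())   # symmetric 50 x 50 matrix; diagonal entries are trivially 1
```

Predictors are then trained to reproduce the entries of M, usually only for residue pairs separated by some minimum distance along the sequence, from sequence-derived features.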


Several machine learning methods, including neural networks [28], [70]–[72], self-organizing maps [73], and support vector machines [74], have been applied to contact map prediction. Standard feedforward neural network and support vector machine approaches use two windows around two target amino acids, i and j, to predict if they are in contact or not. This can be viewed as a binary classification problem. Each position in a window is usually a vector consisting of 20 numbers corresponding to the 20 profile probabilities, as in the 1-D prediction problem. Additional useful 1-D information that can be leveraged includes the predicted secondary structure or relative solvent accessibility of each amino acid. As in 1-D prediction, methods based on local window approaches cannot take into account the effect of amino acids outside of the window. To overcome this problem, a 2-D recursive neural network architecture [29] that in principle can use the entire sequence to derive each prediction was designed to improve contact map prediction. In the latest Critical Assessment of Techniques for Protein Structure Prediction (CASP) [34], three methods using standard neural networks [72], 2-D recursive neural networks [45], and support vector machines [74] achieved the best results [75].

Despite progress made in the last several years, contact map prediction remains largely an unsolved problem. The current precision and recall of medium- and long-range contact predictions is around 28% [74]. Although this number is quite low, its accuracy is better than the accuracy of contacts generated by other ab initio 3-D structure prediction methods. Predicted contact maps are likely to provide some help in 3-D structure prediction because even a small fraction of correctly predicted long-range contacts can effectively help build a protein topology [76].

In addition to the prediction of general residue-residue contact maps, special attention has been paid to more specific contact predictions: beta-strand pairing prediction [77] and disulfide bond prediction [33], [78], [79]. Disulfide bonds are covalent bonds that can form between cysteine residues. These disulfide bonds play a crucial role in stabilizing proteins, particularly small proteins. Disulfide bond prediction involves predicting if a disulfide bond exists between any two cysteine residues in a protein. Both neural networks and support vector machines have been used to predict disulfide bonds. The average precision and recall performance measures are slightly above 50%. Likewise, one can try to predict if two amino acids in two different beta-strands are paired or not in the same beta-sheet. Usually, two paired beta-residues form hydrogen bonds with each other or their neighbors and contribute to the stabilization of the corresponding beta-sheet. In part because of the requirements imposed by the hydrogen bonding constraints, the accuracy of amino acid pairing in beta-sheets is above 41%, higher than the accuracy for generic contacts in contact maps. As with other 2-D prediction problems, feedforward and recursive neural networks have been used to predict beta-sheet pairings. Currently, the most successful method is a 2-D recursive neural network approach which takes a grid of beta-residues as input [77] and, together with graph matching algorithms, predicts pairings at the residue, strand, and sheet levels.

In addition to 2-D prediction for globular proteins, these techniques have recently been used to predict contacts in transmembrane beta-barrel proteins. Predicted contacts in transmembrane beta-barrel proteins have been used to reconstruct 3-D structures with reasonable accuracy [59].

To use 2-D prediction more effectively as input features for 3-D structure prediction, one important task is to further improve 2-D prediction accuracy. As for 1-D predictions, progress may come from improvements in machine learning methods or, perhaps more effectively, from incorporating more informative features in the inputs. For instance, mutual information has recently been shown to be a useful feature for 2-D prediction [72], [74]. On the reconstruction side, several optimization algorithms exist to try to reconstruct 3-D structures from contact maps by using Monte Carlo methods [66], [82] and incorporating experimentally determined contacts, or contacts extracted from template structures, into protein structure prediction [83]–[85] or protein structure determination by NMR methods. However, these methods cannot reliably reconstruct 3-D structures from very noisy contact maps that were predicted from primary sequence information alone [66], [82]. Thus, of parallel importance is the development of more robust 3-D reconstruction algorithms that can tolerate the noise contained in predicted contact maps.

IV. MACHINE LEARNING METHODS FOR 3-D STRUCTURE PREDICTION

Machine learning methods have been used in several aspects of protein 3-D structure prediction such as fold recognition [33], [86], [87], model generation [89], and model evaluation [90], [91].

Fold recognition aims to identify a protein, with known structure, that is presumably similar to the unknown structure of a query protein. Identification of structural homologs is an essential step for the most successful template-based 3-D structure prediction approaches. Neural networks were first used for this task in combination with threading [86]. More recently, a general machine learning framework has been proposed to improve both the sensitivity and specificity of fold recognition based on pairwise similarity features between query and template proteins [88]. Although the current implementation of the framework uses support vector machines to identify folds, it can be extended to any other supervised learning method.


In addition to classification methods, HMMs are among the most important techniques for protein fold recognition. Earlier HMM approaches, such as SAM [92] and HMMer [93], built an HMM for a query with its homologous sequences and then used this HMM to score sequences with known structures in the PDB using the Viterbi algorithm, an instance of dynamic programming methods. This can be viewed as a form of profile–sequence alignment. More recently, profile–profile methods have been shown to significantly improve the sensitivity of fold recognition over profile–sequence, or sequence–sequence, methods [94]. In the HMM version of profile–profile methods, the HMM for the query is aligned with the prebuilt HMMs of the template library. This form of profile–profile alignment is also computed using standard dynamic programming methods.
sampling. Conjugate gradient descent (a technique also used in pled, which is even larger than in the 3-D case, improving in-
neural network learning) is used to generate structures in the terface prediction is an essential step to address this bottleneck.
most widely used comparative modeling tool Modeller [95]. Currently, neural networks, HMMs and support vector machine
Lattice Monte Carlo sampling is used in both template-based methods have been used to predict interface sites [114]. Most of
and ab initio structure modeling [85] and the most widely used these methods use some features extracted from the 3-D struc-
ab initio fragment assembly tool, Rosetta, uses simulated an- tures of the protein subunits. Since in most practical cases the
nealing sampling techniques [89]. 3-D structures themselves are currently not available, it may be
In addition to model generation, machine learning methods worthwhile to further develop methods to predict interactions
are also widely used to evaluate and select protein models. Most from protein sequences alone.
ab initio structure prediction methods use clustering techniques The other major bottleneck in protein docking comes from
to select models [96]. These methods first generate a large pop- induced conformational changes, which introduce an additional
ulation of candidate models and then cluster them into several layer of complexity that is not well handled by current methods
clusters based on the structure similarity between the models, [107]. Most current docking methods assume that the structures
using -means clustering or some other similar clustering algo- of the subunits are subjected to little or no changes during
rithm. Representative elements from each cluster, such as the docking. However, upon protein binding, individual proteins
centroids, are then proposed as possible 3-D structures. Usu- may undergo substantial or even large-scale conformational
ally, the centroid of the largest cluster is used as the most con- changes, which cannot be handled by current docking methods.
fident prediction, although occasionally the centroid of a dif- Developing machine learning methods to identify regions, such
ferent cluster can be even closer to the native structure. In ad- as flexible hinges, that facilitate large-scale movement may be
dition to clustering, supervised learning techniques have been of some help in predicting the overall structure of these protein
used to directly assess the quality of a protein model. Neural net- complexes, although the amount of available training data for
works have been used to estimate the root mean square distance this problem may not be as abundant as one would like.
(RMSD) between a model and the native structure [90]. Support Finally, as in the case of 3-D structure prediction, machine
vector machines have been used to rank protein models [91]. learning methods may help in developing better methods for
One main challenge of model selection is that current methods assessing the quality of 4-D models and predict their quality and
cannot consistently select the best model with lowest RMSD. confidence levels.
For model quality evaluation, the correlation between predicted
scores and real quality scores for hard targets (poor models) is VI. CONCLUSION
still low [97], i.e., some poor models may receive good predicted Machine learning methods have played, and continue to play,
scores. In addition, a statistical confidence score should be as- an important role in 1-D-4-D protein structure predictions, as
signed to the predicted quality scores for better model usage well as in many other related problems. For example, machine
and interpretation. It is likely that additional machine learning learning methods are being used to predict protein solubility
methods will have to be developed to better deal with these prob- [115], protein stability [116], protein signal peptides [117],
lems. [118], protein cellular localization [117], protein post-transla-
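The grid-based Fourier correlation idea [111] can be sketched in a few lines: two rigid subunits are discretized onto 3-D occupancy grids, and the correlation score of every relative translation is obtained at once from a product in Fourier space. This is a deliberately simplified illustration with invented random grids; real shape-complementarity methods distinguish surface from core voxels, penalize overlaps, and also scan over rotations.

```python
import numpy as np

def correlation_scores(receptor_grid, ligand_grid):
    """Correlation of two equally sized 3-D grids over all cyclic translations.

    scores[t] = sum_x receptor[x] * ligand[x - t], computed via FFT.
    """
    fr = np.fft.fftn(receptor_grid)
    fl = np.fft.fftn(ligand_grid)
    return np.real(np.fft.ifftn(fr * np.conj(fl)))

# Hypothetical 32^3 occupancy grids for two subunits.
rng = np.random.default_rng(0)
receptor = (rng.random((32, 32, 32)) > 0.9).astype(float)
ligand = (rng.random((32, 32, 32)) > 0.9).astype(float)
scores = correlation_scores(receptor, ligand)
best = np.unravel_index(np.argmax(scores), scores.shape)
print("best translation (voxels):", best)
```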
Four-dimensional prediction is closely related to 1-D, 2-D, and 3-D prediction. For instance, if the protein interaction interfaces (sites) can be accurately predicted by 1-D predictors [113], the conformation search space for the protein docking phase can be drastically reduced. Since one of the major bottlenecks of 4-D prediction is the size of the conformation space to be sampled, which is even larger than in the 3-D case, improving interface prediction is an essential step to address this bottleneck. Currently, neural networks, HMMs, and support vector machine methods have been used to predict interface sites [114]. Most of these methods use features extracted from the 3-D structures of the protein subunits. Since in most practical cases the 3-D structures themselves are not available, it may be worthwhile to further develop methods to predict interactions from protein sequences alone.

The other major bottleneck in protein docking comes from induced conformational changes, which introduce an additional layer of complexity that is not well handled by current methods [107]. Most current docking methods assume that the structures of the subunits are subjected to little or no change during docking. However, upon protein binding, individual proteins may undergo substantial or even large-scale conformational changes, which cannot be handled by current docking methods. Developing machine learning methods to identify regions, such as flexible hinges, that facilitate large-scale movement may be of some help in predicting the overall structure of these protein complexes, although the amount of available training data for this problem may not be as abundant as one would like.

Finally, as in the case of 3-D structure prediction, machine learning methods may help in developing better methods for assessing the quality of 4-D models and predicting their quality and confidence levels.

VI. CONCLUSION

Machine learning methods have played, and continue to play, an important role in 1-D to 4-D protein structure prediction, as well as in many other related problems. For example, machine learning methods are being used to predict protein solubility [115], protein stability [116], protein signal peptides [117], [118], protein cellular localization [117], protein post-translational modification sites, such as phosphorylation sites [119], and protein epitopes [120]–[123]. Here, we have tried to give a selected and nonexhaustive overview of some of the applications of machine learning methods to protein structure prediction problems.


A common question often asked by students is which machine learning method is “better” or more suitable for a given problem? In short, should I use a neural network, an HMM, an SVM, or something else? In our opinion, it turns out that this question is not as fundamental as it may seem. While a given machine learning approach may be easier to implement for a given problem, or more suited to a particular data format, to tackle difficult problems, what matters in the end is the expertise a scientist has in a particular machine learning technology. What can be obtained with a general-purpose machine learning method can be achieved using another general-purpose machine learning method, provided the learning architecture and algorithms are properly crafted.

In the foreseeable future, machine learning methods will continue to play a role in protein structure prediction and its multiple facets. The growth in the size of the available training sets, coupled with the gap between the number of sequences and the number of solved structures, remains a powerful motivator for further developments. Furthermore, in many cases machine learning methods are relatively fast compared to other methods. Machine learning methods spend most of their time in the learning phase, which can be done offline. In “production” mode, a pretrained feedforward neural network, for instance, can produce predictions rather fast. Both accuracy and speed considerations are likely to remain important as genomic, proteomic, and protein engineering projects continue to generate great challenges and opportunities in this area.

REFERENCES

[1] F. Sanger and E. O. Thompson, “The amino-acid sequence in the glycyl chain of insulin. I. The identification of lower peptides from partial hydrolysates,” J. Biochem., vol. 53, no. 3, pp. 353–366, 1953a.
[2] F. Sanger and E. O. Thompson, “The amino-acid sequence in the glycyl chain of insulin. II. The investigation of peptides from enzymic hydrolysates,” J. Biochem., vol. 53, no. 3, pp. 366–374, 1953b.
[3] L. Pauling and R. B. Corey, “The pleated sheet, a new layer configuration of the polypeptide chain,” Proc. Nat. Acad. Sci., vol. 37, pp. 251–256, 1951.
[4] L. Pauling, R. B. Corey, and H. R. Branson, “The structure of proteins: Two hydrogen-bonded helical configurations of the polypeptide chain,” Proc. Nat. Acad. Sci., vol. 37, pp. 205–211, 1951.
[5] J. C. Kendrew, R. E. Dickerson, B. E. Strandberg, R. J. Hart, D. R. Davies, D. C. Phillips, and V. C. Shore, “Structure of myoglobin: A three-dimensional Fourier synthesis at 2 Å resolution,” Nature, vol. 185, pp. 422–427, 1960.
[6] M. F. Perutz, M. G. Rossmann, A. F. Cullis, G. Muirhead, G. Will, and A. T. North, “Structure of haemoglobin: A three-dimensional Fourier synthesis at 5.5 Angstrom resolution, obtained by x-ray analysis,” Nature, vol. 185, pp. 416–422, 1960.
[7] K. A. Dill, “Dominant forces in protein folding,” Biochemistry, vol. 31, pp. 7134–7155, 1990.
[8] R. A. Laskowski, J. D. Watson, and J. M. Thornton, “From protein structure to biochemical function?,” J. Struct. Funct. Genomics, vol. 4, pp. 167–177, 2003.
[9] A. Travers, “DNA conformation and protein binding,” Ann. Rev. Biochem., vol. 58, pp. 427–452, 1989.
[10] P. J. Bjorkman and P. Parham, “Structure, function and diversity of class I major histocompatibility complex molecules,” Ann. Rev. Biochem., vol. 59, pp. 253–288, 1990.
[11] L. Bragg, The Development of X-Ray Analysis. London, U.K.: G. Bell, 1975.
[12] T. L. Blundell and L. H. Johnson, Protein Crystallography. New York: Academic, 1976.
[13] K. Wuthrich, NMR of Proteins and Nucleic Acids. New York: Wiley, 1986.
[14] E. N. Baldwin, I. T. Weber, R. S. Charles, J. Xuan, E. Appella, M. Yamada, K. Matsushima, B. F. P. Edwards, G. M. Clore, A. M. Gronenborn, and A. Wlodawar, “Crystal structure of interleukin 8: Symbiosis of NMR and crystallography,” Proc. Nat. Acad. Sci., vol. 88, pp. 502–506, 1991.
[15] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne, “The protein data bank,” Nucl. Acids Res., vol. 28, pp. 235–242, 2000.
[16] J. M. Chandonia and S. E. Brenner, “The impact of structural genomics: Expectations and outcomes,” Science, vol. 311, pp. 347–351, 2006.
[17] C. B. Anfinsen, “Principles that govern the folding of protein chains,” Science, vol. 181, pp. 223–230, 1973.
[18] D. Petrey and B. Honig, “Protein structure prediction: Inroads to biology,” Mol. Cell., vol. 20, pp. 811–819, 2005.
[19] M. Jacobson and A. Sali, “Comparative protein structure modeling and its applications to drug discovery,” in Annual Reports in Medical Chemistry, J. Overington, Ed. London, U.K.: Academic, 2004, pp. 259–276.
[20] B. Rost, J. Liu, D. Przybylski, R. Nair, K. O. Wrzeszczynski, H. Bigelow, and Y. Ofran, “Prediction of protein structure through evolution,” in Handbook of Chemoinformatics – From Data to Knowledge, J. Gasteiger and T. Engel, Eds. New York: Wiley, 2003, pp. 1789–1811.
[21] P. Baldi and S. Brunak, Bioinformatics: The Machine Learning Approach, 2nd ed. Cambridge, MA: MIT Press, 2001.
[22] B. Rost and C. Sander, “Improved prediction of protein secondary structure by use of sequence profiles and neural networks,” Proc. Nat. Acad. Sci., vol. 90, no. 16, pp. 7558–7562, 1993a.
[23] B. Rost and C. Sander, “Prediction of protein secondary structure at better than 70% accuracy,” J. Mol. Bio., vol. 232, no. 2, pp. 584–599, 1993b.
[24] D. T. Jones, “Protein secondary structure prediction based on position-specific scoring matrices,” J. Mol. Bio., vol. 292, pp. 195–202, 1999b.
[25] G. Pollastri, D. Przybylski, B. Rost, and P. Baldi, “Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles,” Proteins, vol. 47, pp. 228–235, 2002a.
[26] B. Rost and C. Sander, “Conservation and prediction of solvent accessibility in protein families,” Proteins, vol. 20, no. 3, pp. 216–226, 1994.
[27] G. Pollastri, P. Baldi, P. Fariselli, and R. Casadio, “Prediction of coordination number and relative solvent accessibility in proteins,” Proteins, vol. 47, pp. 142–153, 2002b.
[28] P. Fariselli, O. Olmea, A. Valencia, and R. Casadio, “Prediction of contact maps with neural networks and correlated mutations,” Prot. Eng., vol. 13, pp. 835–843, 2001.
[29] G. Pollastri and P. Baldi, “Prediction of contact maps by GIOHMMs and recurrent neural networks using lateral propagation from all four cardinal corners,” Bioinfo., vol. 18, no. Suppl 1, pp. S62–S70, 2002.
[30] P. Fariselli and R. Casadio, “Prediction of disulfide connectivity in proteins,” Bioinfo., vol. 17, pp. 957–964, 2004.
[31] A. Vullo and P. Frasconi, “A recursive connectionist approach for predicting disulfide connectivity in proteins,” in Proc. 18th Annu. ACM Symp. Applied Computing, 2003, pp. 67–71.
[32] P. Baldi, J. Cheng, and A. Vullo, “Large-scale prediction of disulphide bond connectivity,” in Advances in Neural Information Processing Systems, L. Bottou, L. Saul, and Y. Weiss, Eds. Cambridge, MA: MIT Press, 2005, vol. 17, NIPS04 Conf., pp. 97–104.
[33] J. Cheng, H. Saigo, and P. Baldi, “Large-scale prediction of disulphide bridges using kernel methods, two-dimensional recursive neural networks, and weighted graph matching,” Proteins: Structure, Function, Bioinformatics, vol. 62, no. 3, pp. 617–629, 2006b.
[34] J. Moult, K. Fidelis, A. Kryshtafovych, B. Rost, T. Hubbard, and A. Tramontano, “Critical assessment of methods of protein structure prediction-Round VII,” Proteins, vol. 29, pp. 179–187, 2007.
[35] S. J. Wodak, “From the Mediterranean coast to the shores of Lake Ontario: CAPRI’s premiere on the American continent,” Proteins, vol. 69, pp. 687–698, 2007.
[36] P. Baldi, Y. Chauvin, T. Hunkapillar, and M. McClure, “Hidden Markov models of biological primary sequence information,” Proc. Nat. Acad. Sci., vol. 91, no. 3, pp. 1059–1063, 1994.
[37] A. Krogh, M. Brown, I. S. Mian, K. Sjolander, and D. Haussler, “Hidden Markov models in computational biology: Applications to protein modeling,” J. Mol. Biol., vol. 235, pp. 1501–1531, 1994.
[38] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating error,” Nature, vol. 323, pp. 533–536, 1986.
[39] V. Vapnik, The Nature of Statistical Learning Theory. Berlin, Germany: Springer-Verlag, 1995.
[40] K. Bryson, D. Cozzetto, and D. T. Jones, “Computer-assisted protein domain boundary prediction using the DomPred server,” Curr Protein Pept Sci., vol. 8, pp. 181–188, 2007.
[41] M. Tress, J. Cheng, P. Baldi, K. Joo, J. Lee, J. H. Seo, J. Lee, D. Baker, D. Chivian, D. Kim, A. Valencia, and I. Ezkurdia, “Assessment of predictions submitted for the CASP7 domain prediction category,” Proteins: Structure, Function and Bioinformatics, vol. 68, no. S8, pp. 137–151, 2007.
[42] J. Cheng, “DOMAC: An accurate, hybrid protein domain prediction server,” Nucleic Acids Res., vol. 35, pp. w354–w356, 2007.
[43] Z. Obradovic, K. Peng, S. Vucetic, P. Radivojac, and A. K. Dunker, “Exploiting heterogeneous sequence properties improves prediction of protein disorder,” Proteins, vol. 61, no. Suppl 1, pp. 176–182, 2005.
[44] J. J. Ward, L. J. McGuffin, K. Bryson, B. F. Buxton, and D. T. Jones, “The DISOPRED server for the prediction of protein disorder,” Bioinfo., vol. 20, pp. 2138–2139, 2004.
[45] J. Cheng, M. J. Sweredoski, and P. Baldi, “Accurate prediction of protein disordered regions by mining protein structure data,” Data Mining Knowledge Discovery, vol. 11, pp. 213–222, 2005.


[46] A. Krogh, B. Larsson, G. von Heijne, and E. L. L. Sonnhammer, “Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes,” J. Mol. Biol., vol. 305, no. 3, pp. 567–580, 2001.
[47] P. Y. Chou and G. D. Fasman, “Prediction of the secondary structure of proteins from their amino acid sequence,” Adv. Enzymol., vol. 47, pp. 45–148, 1978.
[48] N. Qian and T. J. Sejnowski, “Predicting the secondary structure of globular proteins using neural network models,” J. Mol. Biol., vol. 202, pp. 865–884, 1988.
[49] P. Baldi, S. Brunak, P. Frasconi, G. Soda, and G. Pollastri, “Exploiting the past and the future in protein secondary structure prediction,” Bioinformatics, vol. 15, no. 11, pp. 937–946, 1999.
[50] I. P. Crawford, T. Niermann, and K. Kirchner, “Prediction of secondary structure by evolutionary comparison: Application to the α subunit of tryptophan synthase,” Proteins, vol. 2, pp. 118–129, 1987.
[51] G. J. Barton, R. H. Newman, P. S. Freemont, and M. J. Crumpton, “Amino acid sequence analysis of the annexin supergene family of proteins,” Eur. J. Biochem., vol. 198, pp. 749–760, 1991.
[52] B. Rost and C. Sander, “Improved prediction of protein secondary structure by use of sequence profiles and neural networks,” Proc. Nat. Acad. Sci., vol. 90, pp. 7558–7562, 1993.
[53] A. Bairoch, R. Apweiler, C. H. Wu, W. C. Barker, B. Boeckmann, S. Ferro, E. Gasteiger, H. Huang, R. Lopez, M. Magrane, M. J. Martin, D. A. Natale, C. O’Donovan, N. Redaschi, and L. S. Yeh, “The universal protein resource (UniProt),” Nucleic Acids Res., vol. 33, pp. D154–159, 2005.
[54] S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman, “Gapped BLAST and PSI-BLAST: A new generation of protein database search programs,” Nuc. As. Res., vol. 25, no. 17, pp. 3389–3402, 1997.
[55] G. Pollastri and A. McLysaght, “Porter: A new, accurate server for protein secondary structure prediction,” Bioinfo., vol. 21, no. 8, pp. 1719–1720, 2005.
[56] J. Cheng, A. Z. Randall, M. J. Sweredoski, and P. Baldi, “SCRATCH: A protein structure and structural feature prediction server,” Nuc. As. Res., vol. 33, pp. 72–76, 2005.
[57] R. Bondugula and D. Xu, “MUPRED: A tool for bridging the gap between template based methods and sequence profile based methods for protein secondary structure prediction,” Proteins, vol. 66, no. 3, pp. 664–670, 2007.
[58] J. J. Ward, L. J. McGuffin, B. F. Buxton, and D. T. Jones, “Secondary structure prediction using support vector machines,” Bioinfo., vol. 19, pp. 1650–1655, 2003.
[59] A. Z. Randall, J. Cheng, M. Sweredoski, and P. Baldi, “TMBpro: Secondary structure, beta-contact, and tertiary structure prediction of transmembrane beta-barrel proteins,” Bioinfo., vol. 24, pp. 513–520, 2008.
[60] B. Rost, “Rising accuracy of protein secondary structure prediction,” in Protein Structure Determination, Analysis, and Modeling for Drug Discovery, D. Chasman, Ed. New York: Marcel Dekker, 2003, pp. 207–249.
[61] J. Cheng, M. Sweredoski, and P. Baldi, “DOMpro: Protein domain prediction using profiles, secondary structure, relative solvent accessibility, and recursive neural networks,” Data Mining Knowledge Discovery, vol. 13, pp. 1–10, 2006.
[62] Y. Freund, “Boosting a weak learning algorithm by majority,” in Proc. Third Annu. Workshop Computational Learning Theory, 1990.
[63] O. Olmea and A. Valencia, “Improving contact predictions by the combination of correlated mutations and other sources of sequence information,” Fold Des, vol. 2, pp. s25–s32, 1997.
[64] P. Baldi and G. Pollastri, “A machine learning strategy for protein analysis,” IEEE Intelligent Systems, Special Issue Intelligent Systems in Biology, vol. 17, no. 2, pp. 28–35, Feb. 2002.
[65] A. Aszodi, M. Gradwell, and W. Taylor, “Global fold determination from a small number of distance restraints,” J. Mol. Biol., vol. 251, pp. 308–326, 1995.
[66] M. Vendruscolo, E. Kussell, and E. Domany, “Recovery of protein structure from contact maps,” Folding Design, vol. 2, pp. 295–306, 1997.
[67] J. Skolnick, A. Kolinski, and A. Ortiz, “MONSSTER: A method for folding globular proteins with a small number of distance restraints,” J. Mol. Biol., vol. 265, pp. 217–241, 1997.
[68] K. Plaxco, K. Simons, and D. Baker, “Contact order, transition state placement and the refolding rates of single domain proteins,” J. Mol. Biol., vol. 277, pp. 985–994, 1998.
[69] M. Punta and B. Rost, “Protein folding rates estimated from contact predictions,” J. Mol. Biol., pp. 507–512, 2005a.
[70] P. Baldi and G. Pollastri, “The principled design of large-scale recursive neural network architectures—DAG-RNNs and the protein structure prediction problem,” J. Machine Learning Res., vol. 4, pp. 575–602, 2003.
[71] M. Punta and B. Rost, “PROFcon: Novel prediction of long-range contacts,” Bioinfo., vol. 21, pp. 2960–2968, 2005b.
[72] G. Shackelford and K. Karplus, “Contact prediction using mutual information and neural nets,” Proteins, vol. 69, pp. 159–164, 2007.
[73] R. MacCallum, “Striped sheets and protein contact prediction,” Bioinfo., vol. 20, no. Supplement 1, pp. i224–i231, 2004.
[74] J. Cheng and P. Baldi, “Improved residue contact prediction using support vector machines and a large feature set,” BMC Bioinformatics, vol. 8, p. 113, 2007.
[75] J. M. G. Izarzugaza, O. Graña, M. L. Tress, A. Valencia, and N. D. Clarke, “Assessment of intramolecular contact predictions for CASP7,” Proteins, vol. 69, pp. 152–158, 2007.
[76] S. Wu and Y. Zhang, “A comprehensive assessment of sequence-based and template-based methods for protein contact prediction,” Bioinfo., 2008, to be published.
[77] J. Cheng and P. Baldi, “Three-stage prediction of protein beta-sheets by neural networks, alignments, and graph algorithms,” Bioinfo., vol. 21, pp. i75–i84, 2005.
[78] P. Fariselli, P. Riccobelli, and R. Casadio, “Role of evolutionary information in predicting the disulfide-bonding state of cysteine in proteins,” Proteins, vol. 36, pp. 340–346, 1999.
[79] A. Vullo and P. Frasconi, “Disulfide connectivity prediction using recursive neural networks and evolutionary information,” Bioinfo., vol. 20, pp. 653–659, 2004.
[80] J. Cheng, H. Saigo, and P. Baldi, “Large-scale prediction of disulphide bridges using kernel methods, two-dimensional recursive neural networks, and weighted graph matching,” Proteins: Structure, Function, Bioinformatics, vol. 62, no. 3, pp. 617–629, 2006b.
[81] A. Z. Randall, J. Cheng, M. Sweredoski, and P. Baldi, “TMBpro: Secondary structure, beta-contact, and tertiary structure prediction of transmembrane beta-barrel proteins,” Bioinfo., vol. 24, pp. 513–520, 2008.
[82] M. Vassura, L. Margara, P. Di Lena, F. Medri, P. Fariselli, and R. Casadio, “FT-COMAR: Fault tolerant three-dimensional structure reconstruction from protein contact maps,” Bioinfo., vol. 24, pp. 1313–1315, 2008.
[83] C. A. Rohl and D. Baker, “De novo determination of protein backbone structure from residual dipolar couplings using Rosetta,” J. Amer. Chemical Soc., vol. 124, pp. 2723–2729, 2004.
[84] P. M. Bowers, C. E. Strauss, and D. Baker, “De novo protein structure determination using sparse NMR data,” J. Biomol. NMR, vol. 18, no. 4, pp. 311–318, 2000.
[85] Y. Zhang and J. Skolnick, “Automated structure prediction of weakly homologous proteins on a genomic scale,” Proc Nat. Acad. Sci., vol. 101, no. 20, pp. 7594–7599, 2004a.
[86] D. T. Jones, “GenTHREADER: An efficient and reliable protein fold recognition method for genomic sequences,” J. Mol. Biol., vol. 287, pp. 797–815, 1999a.
[87] D. Kim, D. Xu, J. Guo, K. Ellrott, and Y. Xu, “PROSPECT II: Protein structure prediction method for genome-scale applications,” Protein Eng., vol. 16, no. 9, pp. 641–650, 2003.
[88] J. Cheng and P. Baldi, “A machine learning information retrieval approach to protein fold recognition,” Bioinfo., vol. 22, no. 12, pp. 1456–1463, 2006.
[89] K. T. Simons, C. Kooperberg, E. Huang, and D. Baker, “Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions,” J. Mol. Biol., vol. 268, pp. 209–225, 1997.
[90] B. Wallner and A. Elofsson, “Prediction of global and local model quality in CASP7 using Pcons and ProQ,” Proteins, vol. 69, pp. 184–193, 2007.
[91] J. Qiu, W. Sheffler, D. Baker, and W. S. Noble, “Ranking predicted protein structures with support vector regression,” Proteins, vol. 71, pp. 1175–1182, 2007.
[92] K. Karplus, C. Barrett, and R. Hughey, “Hidden Markov models for detecting remote protein homologies,” Bioinfo., vol. 14, no. 10, pp. 846–856, 1998.
[93] S. R. Eddy, “Profile hidden Markov models,” Bioinfo., vol. 14, pp. 755–763, 1998.
[94] J. Soeding, “Protein homology detection by HMM-HMM comparison,” Bioinfo., vol. 21, pp. 951–960, 2005.
[95] A. Sali and T. L. Blundell, “Comparative protein modelling by satisfaction of spatial restraints,” J. Mol. Biol., vol. 234, pp. 779–815, 1993.
[96] Y. Zhang and J. Skolnick, “SPICKER: A clustering approach to identify near-native protein folds,” J. Comp. Chem., vol. 25, pp. 865–871, 2004b.
[97] D. Cozzetto, A. Kryshtafovych, M. Ceriani, and A. Tramontano, “Assessment of predictions in the model quality assessment category,” Proteins, vol. 69, no. S8, pp. 175–183, 2007.
[98] P. Aloy, G. Moont, H. A. Gabb, E. Querol, F. X. Aviles, and M. J. E. Sternberg, “Modelling protein docking using shape complementarity, electrostatics and biochemical information,” Proteins, vol. 33, pp. 535–549, 1998.
[99] A. J. Bordner and A. A. Gorin, “Protein docking using surface matching and supervised machine learning,” Proteins, vol. 68, pp. 488–502, 2007.


[100] V. Chelliah, T. L. Blundell, and J. Fernandez-Recio, “Efficient restraints for protein-protein docking by comparison of observed amino acid substitution patterns with those predicted from local environment,” J. Mol. Biol., vol. 357, pp. 1669–1682, 2006.
[101] R. Chen, L. Li, and Z. Weng, “ZDOCK: An initial-stage protein docking algorithm,” Proteins, vol. 52, pp. 80–87, 2003.
[102] S. R. Comeau, D. W. Gatchell, S. Vajda, and C. J. Camacho, “ClusPro: An automated docking and discrimination method for the prediction of protein complexes,” Bioinfo., vol. 20, pp. 45–50, 2004.
[103] M. D. Daily, D. Masica, A. Sivasubramanian, S. Somarouthu, and J. J. Gray, “CAPRI rounds 3–5 reveal promising successes and future challenges for RosettaDock,” Proteins, vol. 60, pp. 181–186, 2005.
[104] C. Dominguez, R. Boelens, and A. Bonvin, “HADDOCK: A protein-protein docking approach based on biochemical or biophysical information,” J. Amer. Chem. Soc., vol. 125, pp. 1731–1737, 2003.
[105] H. A. Gabb, R. M. Jackson, and M. J. E. Sternberg, “Modelling protein docking using shape complementarity, electrostatics, and biochemical information,” J. Mol. Biol., vol. 272, pp. 106–120, 1997.
[106] J. J. Gray, S. E. Moughan, C. Wang, O. Schueler-Furman, B. Kuhlman, C. A. Rohl, and D. Baker, “Protein-protein docking with simultaneous optimization of rigid body displacement and side chain conformations,” J. Mol. Biol., vol. 331, pp. 281–299, 2003.
[107] S. J. Wodak and R. Mendez, “Prediction of protein-protein interactions: The CAPRI experiment, its evaluation and implications,” Curr. Opin. Struct. Biol., vol. 14, pp. 242–249, 2004.
[108] H. Lu, L. Lu, and J. Skolnick, “Development of unified statistical potentials describing protein-protein interactions,” Biophysical J., vol. 84, pp. 1895–1901, 2003.
[109] J. Mintseris, B. Pierce, K. Wiehe, R. Anderson, R. Chen, and Z. Weng, “Integrating statistical pair potentials into protein complex prediction,” Proteins, vol. 69, pp. 511–520, 2007.
[110] G. Moont, H. A. Gabb, and M. J. Sternberg, “Use of pair potentials across protein interfaces in screening predicted docked complexes,” Proteins, vol. 35, pp. 364–373, 1999.
[111] E. Katchalski-Katzir, I. Shariv, M. Eisenstein, A. A. Friesem, C. Aflalo, and I. A. Vakser, “Molecular surface recognition: Determination of geometric fit between proteins and their ligands by correlation techniques,” Proc. Nat. Acad. Sci., vol. 89, pp. 2195–2199, 1992.
[112] S. Lorenzen and Y. Zhang, “Identification of near-native structures by clustering protein docking conformations,” Proteins, vol. 68, pp. 187–194, 2007.
[113] H. X. Zhou and S. Qin, “Interaction-site prediction for protein complexes: A critical assessment,” Bioinfo., vol. 23, no. 17, pp. 2203–2209, 2007.
[114] H. X. Zhou and Y. Shan, “Prediction of protein interaction sites from sequence profile and residue neighbor list,” Proteins, vol. 44, pp. 336–343, 2001.
[115] P. Smialowski, A. J. Martin-Galiano, A. Mikolajka, T. Girschick, T. A. Holak, and D. Frishman, “Protein solubility: Sequence based prediction and experimental verification,” Bioinformatics, vol. 23, pp. 2536–2542, 2007.
[116] J. Cheng, A. Randall, and P. Baldi, “Prediction of protein stability changes for single site mutations using support vector machines,” Proteins, vol. 62, no. 4, pp. 1125–1132, 2006c.
[117] O. Emanuelsson, S. Brunak, G. V. Heijne, and H. Nielsen, “Locating proteins in the cell using TargetP, SignalP, and related tools,” Nature Protocols, vol. 2, pp. 953–971, 2007.
[118] J. D. Bendtsen, H. Nielsen, G. V. Heijne, and S. Brunak, “Improved prediction of signal peptides: SignalP 3.0,” J. Mol. Biol., vol. 340, pp. 783–795, 2004.
[119] N. Blom, S. Gammeltoft, and S. Brunak, “Sequence- and structure-based prediction of eukaryotic protein phosphorylation sites,” J. Molecular Biol., vol. 294, pp. 1351–1362, 1999.
[120] P. H. Andersen, M. Nielsen, and O. Lund, “Prediction of residues in discontinuous B-cell epitopes using protein 3D structures,” Protein Sci., vol. 15, pp. 2558–2567, 2006.
[121] J. Larsen, O. Lund, and M. Nielsen, “Improved method for predicting linear B-cell epitopes,” Immunome Res., vol. 2, p. 2, 2006.
[122] J. Sweredoski and P. Baldi, “PEPITO: Improved discontinuous B-cell epitope prediction using multiple distance thresholds and half-sphere exposure,” Bioinformatics, vol. 24, pp. 1459–1460, 2008a.
[123] J. Sweredoski and P. Baldi, COBEpro: A Novel System for Predicting Continuous B-Cell Epitopes, 2008, submitted for publication.

Jianlin Cheng received the Ph.D. degree from the University of California, Irvine, in 2006. He is an Assistant Professor of bioinformatics in the Computer Science Department, University of Missouri, Columbia (MU). He is affiliated with the MU Informatics Institute, the MU Interdisciplinary Plant Group, and the National Center for Soybean Biotechnology. His research is focused on bioinformatics, systems biology, and machine learning.

Allison N. Tegge (M’08) received the B.Sc. degree in animal science and the M.Sc. degree in bioinformatics, both from the University of Illinois, Urbana-Champaign. She is working toward the Ph.D. degree in bioinformatics at the University of Missouri, Columbia (MU). She is a National Library of Medicine (NLM) Fellow. Her research interests include protein structure prediction and systems biology.

Pierre Baldi (M’88–SM’01) received the Ph.D. degree from the California Institute of Technology in 1986. He is the Chancellor’s Professor in the School of Information and Computer Sciences and the Department of Biological Chemistry and the Director of the UCI Institute for Genomics and Bioinformatics at the University of California, Irvine. From 1986 to 1988, he was a Postdoctoral Fellow at the University of California, San Diego. From 1988 to 1995, he held faculty and member of the technical staff positions at the California Institute of Technology and at the Jet Propulsion Laboratory. He was CEO of a startup company from 1995 to 1999 and joined UCI in 1999. His research work is at the intersection of the computational and life sciences, in particular the application of AI/statistical/machine learning methods to problems in bio- and chemical informatics. He has published over 200 peer-reviewed research articles and four books: Modeling the Internet and the Web–Probabilistic Methods and Algorithms (Wiley, 2003); DNA Microarrays and Gene Regulation–From Experiments to Data Analysis and Modeling (Cambridge University Press, 2002); The Shattered Self–The End of Evolution (MIT Press, 2001); and Bioinformatics: the Machine Learning Approach (MIT Press, Second Edition, 2001). Dr. Baldi is the recipient of a 1993 Lew Allen Award, a 1999 Laurel Wilkening Faculty Innovation Award, and a 2006 Microsoft Research Award, and was elected an AAAI Fellow in 2007.
