An Analysis of QSAR Research Based On Machine Learning Concepts
All content following this page was uploaded by Mohammad Reza Keyvanpour on 22 June 2023.
REVIEW ARTICLE
ISSN: 1570-1638
eISSN: 1875-6220
Current Drug Discovery Technologies
1Department of Computer Engineering, Alzahra University, Tehran, Iran; 2Data Mining Research Laboratory, Department of Computer Engineering, Alzahra University, Tehran, Iran
Received: April 08, 2019
Revised: August 22, 2019
Accepted: October 28, 2019
DOI: 10.2174/1570163817666200316104404

of a recent and comprehensive analysis of these algorithms. This study systematically reviews the application of machine learning algorithms in QSAR, aiming to provide an analytical framework. For this purpose, we present a framework called 'ML-QSAR'. This framework has been designed for future research to: a) facilitate the selection of proper strategies among existing algorithms according to the application area requirements, b) help to develop and improve current methods, and c) provide a platform to study existing methodologies comparatively. In ML-QSAR, first a structured categorization of QSAR modeling research based on machine learning models is depicted. Then several criteria are introduced in order to assess the models. Finally, inspired by the aforementioned criteria, a qualitative analysis is carried out.
Keywords: QSAR modeling, machine learning, drug discovery, drug design, computational intelligence, ADME/T modeling.
the best ligands and finally performing experiments to validate the compounds (Fig. 1). Data curation fulfills the preprocessing step, in which noise and redundancy are cleaned. The next step in QSAR is to generate descriptors for molecular structures. As in other learning tasks, the dataset is then divided into training and test sets. Then the QSAR model is built, and its validation is performed in the next step. Then the ligands are virtually designed. Subsequently, the best ligands are predicted and selected. Finally, empirical experiments are conducted to validate the compounds. Machine learning approaches can be used in several of these steps; here, we consider the use of learning methods in constructing the QSAR model (Stage 4).

Fig. (1). QSAR protocol [4].

This paper provides a framework called ML-QSAR, offering an overview of QSAR studies in which machine learning concepts are employed. Learning-based QSAR techniques are addressed and their classification is presented. Several criteria are offered to evaluate the models, and an analytical comparison based on these measures is conducted. ML-QSAR contributes to future research in three ways: first, by determining the application area and its requirements, it helps to select suitable methods; second, it offers criteria under which future work can also be examined; and third, it provides an analysis of existing methods, which can lead to improving them.

The remainder of this paper is organized as follows. Section 2 gives an overview of previous review studies related to learning-based QSAR models. In Section 3, the framework for analyzing learning-based QSAR methods, called ML-QSAR, is provided. The classification of methods is provided in Section 3.1, Section 3.2 presents the criteria, and the analyses are outlined in Section 3.3. Conclusions are drawn in Section 4.

2. RELATED WORK

An increasing number of studies applying learning-based QSAR approaches have been conducted. Various researchers have provided reviews of the existing literature [5-26]. Few of them cover both the recent work and the entire body of work published on this issue.

A number of studies [5-8] have focused on machine learning methods and reviewed them in different parts of QSAR. In a few studies [5, 6], applications of ANNs in QSAR have been examined. In another study [7], several applications of SVM, Particle Swarm Optimization (PSO) and Genetic Programming (GP) in drug design are reported, while another work presents only the usage of SVM in QSAR [8].

Some studies presented overviews of various learning algorithms employed for a specific task in QSAR [9-15]. One study focused on the application of machine learning for virtual screening [9]. Another provided a comparative study for peptides binding to the human amphiphysin-1 SH3 domain [10]. The applications of several learning methods for aqueous solubility have been evaluated in another study [11]. Several machine learning approaches have been compared in a few studies [12, 13]. Another study aims to predict potentiators of metabotropic glutamate receptor 5 (mGluR5), while Bruce et al. attempt to classify active or inactive compounds from different datasets. Other works highlighted the differences between deep and shallow neural networks through an experimental comparison, to predict activity cliffs in QSAR data sets. A study on machine-learning techniques focused on ligand-based virtual screening (LBVS) [15].

A non-exclusive methods-based review of machine learning algorithms used in cheminformatics research was provided in another study, where the authors particularly focused on supervised learning, which makes predictions on the properties of molecules [16]. A review of the processing steps in cheminformatics and of frequently used machine learning models in drug discovery and QSAR analysis is presented in a study [17], together with a discussion of limitations and future directions.

In another study [18], the authors first traced the history of machine learning and provided insight into applications of deep learning strategies in drug discovery. They introduced deep learning as a beneficial model, especially for big data.

Goh et al. reviewed deep neural networks used in cheminformatics, including QSAR and virtual screening, considering deep learning algorithms a useful solution for computational chemistry. Other works also concentrated on deep learning models applied in drug discovery, particularly in the area of cheminformatics and biological image analysis [19, 20]. The authors discuss the application of machine learning and deep learning in drug discovery.

Other studies [21, 22] presented a brief overview of transfer and multi-task learning, which are two subsets of machine learning for drug design. In addition, they provided insight into the potential applications of these approaches in drug design. The authors provide another short study of machine learning in drug design. They review some statistical algorithms and learning algorithms. The authors in another
QSAR Research Based on Machine Learning Concepts Current Drug Discovery Technologies, 2021, Vol. 18, No. 1 19
study limit their investigation to learning strategies used in antibacterial drug discovery by QSAR modeling (Fig. 2) [23, 24].

Also, learning-based algorithms applied in property prediction, de novo design and synthesis planning have been presented in other studies [25, 26]. Despite the extensive studies carried out on machine learning approaches in QSAR, the literature lacks a recent comprehensive review; therefore, we offer an analytic overview.

3. ML-QSAR: PROPOSED FRAMEWORK FOR QSAR RESEARCH BASED ON MACHINE LEARNING

We present a three-component analytical framework comprising classification, criteria and analysis. QSAR has been widely applied in different processes and various domains of drug design. A considerable number of machine learning methods have been used in QSAR in order to develop molecular compounds. This study aims to classify the employed algorithms according to their direction. The framework introduces several criteria that characterize a proper model. Based on these criteria, analysis and comparison are performed. Our framework is made up of three components:

1. Classification of QSAR studies based on QSAR direction and machine learning strategies
2. Definitions of several criteria
3. Analyzing methodologies according to the proposed criteria

These components are described in detail as follows.

3.1. Classification of QSAR Studies based on Machine Learning Strategies

Various categorizations have been proposed for machine learning based QSAR. According to the process, QSAR can be used in two main fields: forward QSAR and inverse QSAR. Forward QSAR refers to the techniques which start with molecular descriptors as input and attempt to find the activity of a new compound. Inverse QSAR is composed of strategies which aim to find the values of molecular descriptors that lead to a desired activity value [27, 28].

In this section, we consider each field with different types of learning techniques and review the related studies that have been conducted.

3.1.1. Forward QSAR

In QSAR, molecular structures are correlated with descriptors. A forward QSAR model maps from molecular descriptor space X to biological activity y (f: X → y), in which f is a scoring function. Fig. (3) shows the block diagram of forward QSAR [29].

Fig. (3). General methodology of QSAR study [29].

Machine learning in forward QSAR is applied in tasks such as feature representation [30], descriptor selection [31] and modeling. Herein, the direct use of learning techniques for QSAR modeling is considered.

Usually in QSAR, the input data are molecular descriptors and the output is either the result of a regression task, which predicts the biological activity (or property), or the result of a classification task, which assigns a compound to one of a group of categories [32], and is employed to build a new molecular compound. Different learning methods, including Multiple Linear Regression [33, 34], Linear Regression [35, 36], Artificial Neural Networks [35, 37-44], Support Vector Machine (SVM), Decision Tree, Random Forests, Random Committee (RC), Naïve Bayes, K-Nearest Neighbors, kappa nearest neighbor, Partial Least Squares (PLS), Gaussian Process, Multilayer Perceptron, Ensemble methods, Transfer Learning, and Deep Neural Networks have been applied in QSAR to make predictions or classifications for their assigned tasks. Table 1 lists each machine learning algorithm and its application in QSAR modeling, and brief introductions to the learning methodologies are provided in what follows.
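To make the forward mapping f: X → y concrete, the following is a minimal sketch rather than the procedure of any cited study. It assumes a single numeric descriptor, a least-squares line as the scoring function f, and a random train/test split. The data, the function names and the activity threshold of 10.0 are all hypothetical.

```python
import random

def split_train_test(rows, test_fraction=0.25, seed=0):
    """Shuffle (descriptor, activity) pairs and hold out a test set."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]

def fit_scoring_function(train):
    """Least-squares line through the training pairs; returns f: x -> y."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def mean_absolute_error(f, rows):
    """Average absolute difference between predicted and observed activity."""
    return sum(abs(f(x) - y) for x, y in rows) / len(rows)

# Hypothetical, noise-free data: activity = 2 * descriptor + 1.
data = [(float(x), 2.0 * x + 1.0) for x in range(12)]
train, test = split_train_test(data)
f = fit_scoring_function(train)  # regression view: predict the activity value

def is_active(x, threshold=10.0):
    """Classification view: threshold the predicted activity (assumed cutoff)."""
    return f(x) >= threshold

print(len(train), len(test), mean_absolute_error(f, test) < 1e-9)
```

The same fitted function serves both output types mentioned above: its raw value is the regression output, and thresholding it gives a two-class label. Real QSAR data have many descriptors and noise, so the held-out error would not be zero as it is for this synthetic line.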
20 Current Drug Discovery Technologies, 2021, Vol. 18, No. 1 Keyvanpour and Shirzad
Table 1. Machine learning models used in forward QSAR and their applications.
Method | Application | Reference
Linear (Logistic) Regression | Blood-brain-barrier (BBB), human intestinal absorption (HIA) and enantiomeric excess (EE) | [35]
Linear (Logistic) Regression | Toxicity of piperidine derivatives against Aedes aegypti (projection pursuit regression (PPR)) | [36]
(method name not recoverable) | Blood-brain-barrier (BBB), human intestinal absorption (HIA) and enantiomeric excess (EE) | [35]
Decision Tree | Blood-brain-barrier (BBB), human intestinal absorption (HIA) and enantiomeric excess (EE) | [35]
Decision Tree | Modes of toxic action of phenols to Tetrahymena pyriformis (CTs) | [42]
Random Forests | Blood-brain-barrier (BBB), human intestinal absorption (HIA) and enantiomeric excess (EE) | [35]
Random Forests | Microarray-based cancer classification | [56]
Random Committee (RC) | Blood-brain-barrier (BBB), human intestinal absorption (HIA) and enantiomeric excess (EE) | [35]
Random Committee (RC) | Prediction of activity of inhibitors of beta-secretase (BACE1) | [44]
Kappa nearest neighbor (kNN) | Classification of cytochrome P450 1A2 inhibitors and noninhibitors | [59]
Multiple Linear Regression is a statistical process which attempts to discover the relationship between several descriptors, viewed as independent variables, and the activity, viewed as the dependent variable [33, 34], whilst linear regression explores the relationship between the dependent variable and a single independent variable [35, 36]. The relationships in linear regression are modeled by a linear function which attempts to predict the unknown dependent variable. Linear regression can be generally formulated by the following equation:

Y = Xb + a (2)

where Y is the dependent variable which needs to be predicted and X can be a vector or a matrix of descriptors. Parameter b holds the regression coefficients and a is the intercept. The term Xb is the inner product between X and b.

Artificial neural networks [35, 37-44], inspired by the neurons of the human brain, are layers of connected neurons made up of input layers and output layers. The input layer receives input variables, applies an activation to them and produces output. The target vector is used by the NN learner in order to evaluate the learning and to set weight values which minimize the overall error. Fig. (4) shows a simple neural network.

Fig. (4). A simple neural network.

Support Vector Machine (SVM) is a maximum-margin classifier inspired by the principles of structural risk minimization, applied in several studies [8, 36, 42-43, 46-61]. SVM aims to learn the "maximum-margin hyperplane", that is, a hyperplane which separates the set of instances with label yi = 1 from the set of instances with label yj = -1 such that the distance from the hyperplane to the nearest instance of either class is maximized. A hyperplane is the set of instances x satisfying the following formula:

w · x - b = 0 (3)

where w indicates the normal vector to the hyperplane, which can be normalized or un-normalized, and b is the offset of the hyperplane (unrelated to the coefficients b in Eq. (2)).

Decision Trees have been applied in studies [12, 35, 42, 43, 55, 59]. A decision tree is a tree-like predictor in which features are represented by internal nodes. Decisions are represented by branches (links), and the class labels at the leaves indicate the decision reached after evaluating the features along the path. Classification rules in a DT are given by tracing the paths from the root node to the leaf nodes. DT can be applied to both discrete and continuous target values; the former are called classification trees and the latter regression trees.

Random forest (RF) is a widely used ensemble classification and regression model which combines a set of decision trees as base learners by voting, to boost accuracy in comparison to a single classifier [35, 44, 54, 56, 59, 62]. Random forests aim to reduce the variance by averaging multiple decision trees which have been trained on different parts of the training set. The bootstrap aggregating (bagging) algorithm is employed, which gives random samples of the training set to the tree learners.

Naïve Bayes classifier is a probabilistic strategy which applies Bayes' rule with prior probabilities to generate class labels according to properties, while assuming that the features are independent of each other [63, 64]. The probability P of state A given state B is described by equation (4):

P(A|B) = P(B|A) P(A) / P(B) (4)

The K-Nearest Neighbors algorithm is applied in [36, 42, 64, 65]. K-NN, used for both classification and regression, is a non-parametric learning method, which makes no assumption about the probability distribution of the data. In the classification task, a new instance in the feature space is classified by a majority vote of its k closest neighbors; in the regression task, the output is the average of the values of its k closest neighbors. A similarity measure is utilized to find the nearest neighbors.

Gaussian Process (GP) is used in other studies [66-68]. The GP method is known as a robust nonlinear stochastic model for regression and classification, which follows the assumption that the random variables have a joint normal distribution. In GP for machine learning, the variables are vectors and a kernel function is applied for prediction. GP follows lazy learning, in which generalization from the training data is postponed until the system receives a query.

Multilayer perceptron (MLP) is a kind of artificial neural network [72]. In an MLP, nodes are connected with nodes from the previous layer, and the connections between nodes do not form cycles. An MLP includes at least an input layer, a hidden layer and an output layer. Backpropagation is employed for training an MLP in order to learn the weights of the network.

Bayesian regularized genetic neural network (BRGNN) is an ensemble model that has been applied in a few studies [69-72]. Bayesian techniques have been implemented in neural networks to overcome their shortcomings. In BRGNN, applying Bayesian regularization prevents the overfitting of the genetic algorithm. The genetic algorithm (GA) is employed in BRGNN as a feature selection method [76, 77].

Ensemble methods, including AdaBoost and bagging, have gained considerable interest [61, 73, 74]. Adaptive boosting (AdaBoost) improves the performance of learning algorithms by weighting and combining the outputs of weak learners (any individual learning method). Bagging, which is usually applied in RF, can be used with any kind of learning method. The main difference between these techniques is that in bagging every learner receives the same weight, while in boosting the learners are weighted.

Recently, transfer learning has been applied in deep neural network learning; it gains knowledge from one problem and trains a model so that it can be reused in order to solve other related problems. In transfer learning, instead of learning the weights of a network for a new task A from scratch, a model learned for a task B is deployed. This method can be useful for new problems which lack adequate labeled data.

Deep Neural Networks are a class of neural networks which have extra hidden layers. Fig. (5) shows a deep neural network. In the deep belief network, each layer is a hidden layer for its previous layer and the input layer for its next layer. CNN is another deep NN applied in QSAR, which is considered a variant of the multilayer perceptron. Different studies [61, 62, 77] applied deep neural networks. The authors provided an overview and comparison of neural networks and deep learning.
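Of the methods introduced in this part, the k-nearest-neighbors rule is the simplest to spell out in code. The sketch below is an illustration under stated assumptions, not an implementation taken from the cited studies: Euclidean distance is assumed as the similarity measure, and the descriptor vectors, labels, activity values and function names are hypothetical.

```python
import math
from collections import Counter

def k_nearest(train, query, k):
    """Return the k training items closest to `query` (Euclidean distance)."""
    return sorted(train, key=lambda item: math.dist(item[0], query))[:k]

def knn_classify(train, query, k=3):
    """Classification: majority vote over the labels of the k neighbours."""
    votes = Counter(label for _, label in k_nearest(train, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train, query, k=3):
    """Regression: average of the target values of the k neighbours."""
    neighbours = k_nearest(train, query, k)
    return sum(value for _, value in neighbours) / len(neighbours)

# Hypothetical compounds described by two descriptors, with class labels.
labelled = [
    ((0.0, 0.0), "inactive"), ((0.1, 0.2), "inactive"), ((0.2, 0.1), "inactive"),
    ((1.0, 1.0), "active"), ((0.9, 1.1), "active"), ((1.1, 0.9), "active"),
]
# Hypothetical compounds described by one descriptor, with activity values.
valued = [((0.0,), 1.0), ((1.0,), 2.0), ((2.0,), 3.0), ((10.0,), 50.0)]

print(knn_classify(labelled, (0.95, 1.0)))  # active
print(knn_regress(valued, (1.1,), k=3))     # 2.0
```

No model is fitted in advance: all work happens at query time, which is why the whole training set must be kept in memory and why irrelevant descriptors distort the distance directly.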
Fig. (5). Deep neural network. (A higher resolution / colour version of this figure is available in the electronic copy of the article).
Table 2. Machine learning models used in inverse QSAR and their applications.
Method | Application | Reference
Genetic Algorithms | Selection of natural amino acids in active peptides for the synthesis of the library | [83]
(method name not recoverable) | Constructing chemical graphs on the basis of MLR equations and algorithms | [95]
A combination of differential evolution (DE) and Support Vector Machine (SVM) was applied for inverse QSAR in another study [90]. The SVM method was introduced in the previous part, and differential evolution is a method of evolutionary computation, which is beyond the scope of machine learning.

MLR, which was introduced in the previous part, is applied in inverse QSAR as well (Table 2) [91-95].

3.2. Criteria

The authors of [96] listed twelve key issues for a QSAR model to be identified as a robust model. Here, we define our metrics and show to what extent each model fulfills these criteria. We define the following measures in order to evaluate the current applications of machine learning in QSAR models.

• Data Generalization: the ability of the model to extend to unlabeled data.
• Model Generalization: this shows whether a solution is valid for both inverse and forward QSAR.
• Accuracy: this measure indicates how precise a model is according to well-defined measures such as Root Mean Square Error (RMSE).
• Interpretability: this measure indicates to what extent a prediction or classification can be understood by a human via the model's features. In other words, some models can reveal information about the salience of each feature.
• Modelability: this metric expresses how widely a model is applicable in various QSAR-based tasks.
• Supporting feature selection: the large number of features is a challenge for some tasks and may require a further step of feature selection, which is computationally expensive and time-consuming. Embedded feature selection strategies offer a quick solution. Thus, supporting feature selection can be regarded as a merit of a model.
• Computational cost: some algorithms are cost-effective while others are computationally intensive. Because other processes of drug design, such as extracting descriptors, are also required, the cost can be considered an influential measure when choosing a modeling technique.
• Memory requirements: the amount of memory needed for each algorithm is another decisive measure.

3.3. Analysis

Table 3 summarizes the properties of various widely used machine learning models utilized in QSAR modeling. In addition to the previous criteria, we note the advantages and disadvantages of each learning model.

Its sound properties have made SVM the most popular method in QSAR. It has been utilized in both prediction and classification tasks of forward QSAR. SVM fails to model unlabeled data, but it has been applied in both forward and inverse QSAR. According to evaluation measures, SVM is precise. However, it suffers from weak interpretability for QSAR problems. For various tasks, the SVM model proves to be efficient and thus it has high modelability. SVM lacks an embedded feature selection process. A fine property of SVM is its low memory requirement. The computational cost of SVM depends very much on the kernel; being computationally reasonable is considered a merit of SVM. Good generalization performance and poor interpretability have been addressed as the advantage and disadvantage of SVM, respectively.
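The accuracy criterion of Section 3.2 names Root Mean Square Error (RMSE) as a well-defined measure. As a minimal sketch of how two models could be compared on it, the snippet below computes RMSE for two sets of predictions. The observed activities, the predictions of "model A" and "model B", and the function name are hypothetical values chosen for illustration.

```python
import math

def rmse(observed, predicted):
    """Root mean square error between observed and predicted activities."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical observed activities and the predictions of two models.
observed = [1.0, 2.0, 3.0, 4.0]
model_a = [1.0, 2.0, 3.0, 6.0]   # exact on three compounds, off by 2 on one
model_b = [1.5, 2.5, 3.5, 4.5]   # off by 0.5 on every compound

print(rmse(observed, model_a))  # 1.0
print(rmse(observed, model_b))  # 0.5
```

Because the errors are squared before averaging, the single large miss of model A costs more than the four small misses of model B, so a lower RMSE favors models without large outlying errors.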
Table 3. Column headings (table body not recovered): Methods; Data Generalization; Model Generalization; Accuracy; Interpretability; Modelability; Computational cost; Memory requirements; Advantages; Disadvantages.
Another popular machine learning method applied in QSAR modeling is the artificial neural network [35, 37-44]. ANN is generally a supervised model, which is unable to work with unlabeled data; in contrast, several ANN models, such as those based on Hebbian learning, support unlabeled data. ANN is model generalizable, accurate, modelable, fairly cost-sensitive and needs low memory space. Also, like SVM, it suffers from low interpretability and does not support feature selection. The most significant advantage of ANN is its performance on complex data, and its drawback is its poor interpretability.

Deep neural networks refer to novel techniques composed of several layers of neural networks. These methods are applied in QSAR and provide desirable results. They behave similarly to ANN, whilst outperforming ANN in some respects. Several DNN models, such as Autoencoders and Deep Belief Nets, can be employed for unlabeled data. In comparison with ANN, DNNs are more interpretable. On the other hand, DNNs require more memory space and are computationally more expensive than ANN. Their downside, similar to ANN, is their poor interpretability. DNNs enjoy hierarchical feature learning as their plus side.

Decision trees are also used in QSAR modeling [12, 35, 42, 43, 55, 59]. One advantage of DT is its inbuilt feature selection process. DT is neither data nor model generalizable. A DT is partly accurate and easy to interpret. DT is modelable. In terms of memory requirement and cost, it shows medium performance. DT's most obvious strength is its high interpretability. Moreover, it can easily work with heterogeneous data, that is, data with various feature types. DTs suffer from the overfitting problem, in which the model fits the training dataset too closely.

Random Forests are an extension of DTs and are similar to them. RF models are neither data nor model generalizable; they are easy to interpret, modelable and support feature selection. RFs are more computationally expensive than DT and require more memory space. Their significant advantages are being highly interpretable and the ability to tolerate overfitting, and their main drawback is their long training time.
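The bootstrap aggregating (bagging) idea behind random forests can be sketched in a few lines. The example below is a simplified illustration, not a random forest: as an assumption, each base learner is a one-threshold "decision stump" on a single descriptor instead of a full decision tree, and the data, function names and number of learners are hypothetical.

```python
import random
from collections import Counter

def train_stump(sample):
    """Pick the threshold t with the fewest errors for: predict 1 if x > t."""
    xs = sorted({x for x, _ in sample})
    midpoints = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    candidates = [xs[0] - 1.0] + midpoints + [xs[-1] + 1.0]

    def errors(t):
        return sum((x > t) != bool(y) for x, y in sample)

    return min(candidates, key=errors)

def bagged_stumps(data, n_learners=25, seed=0):
    """Train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    return [train_stump(rng.choices(data, k=len(data)))
            for _ in range(n_learners)]

def predict(thresholds, x):
    """Bagging: every learner has the same weight in a majority vote."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]

# Hypothetical data: one descriptor, compounds with x >= 5 are active (1).
data = [(float(x), int(x >= 5)) for x in range(10)]
ensemble = bagged_stumps(data)
print(predict(ensemble, 1.0), predict(ensemble, 8.0))
```

Each stump sees a slightly different resample, so the individual thresholds differ, and the vote averages out that variation. This is the variance reduction that the text credits for the tolerance of random forests to overfitting. Training 25 learners instead of one also shows where the extra time and memory go.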
Naïve Bayes is another classifier used in QSAR modeling. NB methods are designed for labeled data, which means they are not data generalizable. Also, NB has not been applied in inverse QSAR, which prevents it from being model generalizable. NBs are fairly accurate methods and easy to interpret. Moreover, they are very modelable. Low memory consumption and low computational expense are two merits of these methods. They are effective, but on small datasets they lead to low accuracy.

K-nearest neighbors algorithms are other models used in QSAR. The KNN technique only works with labeled data and has not been applied to inverse QSAR, which means that KNN is neither data nor model generalizable. KNN is partly accurate, and its performance can be influenced by noisy or irrelevant features. These methods are difficult to interpret for QSAR usage. Failure to support feature selection, high memory consumption, being computationally expensive, and being sensitive to noisy or irrelevant features are the downsides of these models. Simplicity and being non-parametric are the major advantages of KNN.

Ensembles are designed to improve on the base machine learning algorithm. While a single model such as a decision tree is not expected to deliver high performance, ensembles have shown better prediction. They inherit the properties of their learners, such as the ability to work on unlabeled data, generalization, interpretability, modelability and support for feature selection. Ensemble methods are known as accurate strategies. The memory requirements and the computational cost of these approaches are highly affected by the base learners. Since they model several learners, they are expected to be memory consuming and computationally expensive.

Multiple linear regression (MLR) is another well-known algorithm applied in QSAR [33, 34, 91-95]. MLR is not generalizable to unlabeled data. MLR has been applied in both forward and inverse QSAR. MLR is accurate, easy to interpret, and modelable. MLR does not support feature selection, and it needs medium time and space. The main problem with MLR is learning from data with exactly two values for a variable (a dichotomous variable).

Linear regression (LR) applied in QSAR follows MLR in attributes such as not being generalizable to unlabeled data and being accurate, easy to interpret, and modelable. Like MLR, it does not support feature selection. On the other hand, LR needs less time and space than MLR. Whilst MLR faces problems with learning dichotomous variables, LR shows acceptable performance with this kind of data. The most obvious issue with LR is its poor performance on complex data.

3.4. Computer aided QSAR vs. Traditional QSAR

QSAR models are computational techniques which are designed to predict and classify compounds. Computer-aided drug design (CADD) has been developed with the aim of performing computational operations quickly and precisely. Traditional QSAR is limited to simple regression, while computer-aided QSAR takes advantage of a wide range of machine learning and deep learning strategies. Herein, the impact of applying learning algorithms to the descriptor selection part of QSAR is discussed.

Numerous descriptors have been extracted for QSAR modeling, which makes the computation costly and time-consuming. Selecting the most significant descriptors has been identified as a demanding issue in QSAR modeling. The feature selection procedure is employed in order to reduce the number of descriptors. In one study [100], the outcomes of making use of feature selection have been clearly presented. The authors applied a feature selection process and compared their model with and without feature selection. The Random Forests (RFs) voting method and Support Vector Machine (SVM) were used as the feature selection strategy and the learning model, respectively. The results indicated the merits of using feature selection. As they reported, the feature selection process led to about 19% reduction in prediction error (RMSE) on average. Also, the model saw a rise of 49% in terms of percentage of variance explained (PVE) as a result of selecting features. The reduction in the number of features was more than 1000 variables [96].

Two studies [97, 98] considered the problem of designing inhibitors of the enzyme acetylcholinesterase (AChE), where the first selected the descriptors manually while the second applied wrapper feature selection. In the first study, the authors assessed the correlation between descriptors by considering the covariance matrix of descriptors for 24 molecules [97]. The matrix contains the covariances between 325 descriptors, of which 14 were selected by the users, with the aim of low redundancy and preventing chance correlations. In the second study, descriptor selection was carried out to choose the descriptors which are responsible for AChE activity [98]. They applied wrapper feature selection, in which a machine learning algorithm is trained to select a subset of features. As reported, more than 2000 features were investigated. Two strategies, Multiple Linear Regression (MLR) and Support Vector Machine (SVM), were used by the authors as a black box. Each model chose 4 descriptors. In summary, the first study [97] analyzed only 325 descriptors and selected 14 of them manually, taking two considerations into account, while the second study [98] considered more than 2000 descriptors and selected 4 of them automatically. Both models worked on the covariance matrix. For a large number of descriptors, manual selection is not feasible. Training a machine learning model requires some considerations, such as parameter tuning and hardware resources.

Automatic feature selection leads to quick results by removing a considerable number of descriptors. Moreover, the precision of computer-aided feature selection is undeniable. Feature selection can be considered an extra stage of modeling; however, it can be performed offline. Another study [99] removes this stage by applying a novel deep learning model.

CONCLUSION

QSAR analysis has provided progressive methods for chemoinformatics. The present review attempted to collect different studies based on machine learning algorithms in the field of QSAR modeling in a framework called ML-QSAR. To this end, we addressed the existing studies and made a classification based on the point from which the algorithms start. This classification includes forward and inverse QSAR in its first
[30] Winter R, Montanari F, Noé F, Clevert DA. Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations. Chem Sci 2019; 10: 1692-701.
[31] Martínez MJ, Razuc M, Ponzoni I. MoDeSuS: A machine learning tool for selection of molecular descriptors in QSAR studies applied to molecular informatics. BioMed Research International 2019; 2019: 12.
[32] Roy K, Kar S, Das RN. Statistical methods in QSAR/QSPR. A primer on QSAR/QSPR modeling. SpringerBriefs in Molecular Science. Cham: Springer 2015.
[33] Hemmateenejad B, Miri R, Akhond M, Shamsipur M. QSAR study of the calcium channel antagonist activity of some recently synthesized dihydropyridine derivatives. An application of genetic algorithm for variable selection in MLR and PLS methods. Chemom Intell Lab Syst 2002; 64(1): 91-9. http://dx.doi.org/10.1016/S0169-7439(02)00068-0
[34] Churchwell CJ, Rintoul MD, Martin S, et al. The signature molecular descriptor. 3. Inverse-quantitative structure-activity relationship of ICAM-1 inhibitory peptides. J Mol Graph Model 2004; 22(4): 263-73. http://dx.doi.org/10.1016/j.jmgm.2003.10.002 PMID: 15177078
[35] Ponzoni I, Sebastián-Pérez V, Requena-Triguero C, et al. Hybridizing feature selection and feature learning approaches in QSAR modeling for drug discovery. Sci Rep 2017; 7(1): 2403. http://dx.doi.org/10.1038/s41598-017-02114-3 PMID: 28546583
[36] Doucet JP, Papa E, Doucet-Panaye A, Devillers J. QSAR models for predicting the toxicity of piperidine derivatives against Aedes aegypti. SAR QSAR Environ Res 2017; 28(6): 451-70.
http://dx.doi.org/10.1021/ci200409x PMID: 22582859
[47] Gertrudes JC, Maltarollo VG, Silva RA, Oliveira PR, Honório KM, da Silva ABF. Machine learning techniques and drug design. Curr Med Chem 2012; 19(25) 89-97.
[48] Dobchev DA, Pillai GG, Karelson M. In silico machine learning methods in drug development. Curr Top Med Chem 2014; 14(16): 1913-22. http://dx.doi.org/10.2174/1568026614666140929124203 PMID: 25262800
[49] Chen H, Carlsson L, Eriksson M, Varkonyi P, Norinder U, Nilsson I. Beyond the scope of Free-Wilson analysis: building interpretable QSAR models with machine learning algorithms. J Chem Inf Model 2013; 53(6): 1324-36. http://dx.doi.org/10.1021/ci4001376 PMID: 23789733
[50] Heikamp K, Bajorath J. Prediction of compounds with closely related activity profiles using weighted support vector machine linear combinations. J Chem Inf Model 2013; 53(4): 791-801. http://dx.doi.org/10.1021/ci400090t PMID: 23517241
[51] Burbidge R, Trotter M, Buxton B, Holden S. Drug design by machine learning: support vector machines for pharmaceutical data analysis. Comput Chem 2001; 26(1): 5-14. http://dx.doi.org/10.1016/S0097-8485(01)00094-8 PMID: 11765851
[52] Kong D-X, Ren W, Lü W, Zhang HY. Do biologically relevant compounds have more chance to be drugs? J Chem Inf Model 2009; 49(10): 2376-81. http://dx.doi.org/10.1021/ci900229c PMID: 19852515
[53] Deng Z-L, Du CX, Li X, et al. Exploring the biologically relevant
http://dx.doi.org/10.1080/1062936X.2017.1328855 PMID:
chemical space for drug discovery. J Chem Inf Model 2013;
28604113
53(11): 2820-8.
[37] Tetko IV, Tanchuk VY, Chentsova NP, et al. HIV-1 reverse tran-
http://dx.doi.org/10.1021/ci400432a PMID: 24125686
scriptase inhibitor design using artificial neural networks. J Med
Chem 1994; 37(16): 2520-2526. [54] Olier I, Sadawi N, Bickerton GR, et al. Meta-QSAR: a large-scale
application of meta-learning to drug design and discovery. Mach
[38] Maddalena DJ, Johnston GA. Prediction of receptor properties and
Learn 2018; 107(1): 285-311.
binding affinity of ligands to benzodiazepine/GABAA receptors
using artificial neural networks. J Med Chem 1995; 38(4): 715-24. http://dx.doi.org/10.1007/s10994-017-5685-x PMID: 31997851
http://dx.doi.org/10.1021/jm00004a017 PMID: 7861419 [55] Zhang H, Chen QY, Xiang ML, Ma CY, Huang Q, Yang SY. In
silico prediction of mitochondrial toxicity by using GA-CG-SVM
[39] Hu L, Chen G, Chau RM. A neural networks-based drug discovery
approach. Toxicol In Vitro 2009; 23(1): 134-40.
approach and its application for designing aldose reductase inhibi-
tors. J Mol Graph Model 2006; 24(4): 244-53. http://dx.doi.org/10.1016/j.tiv.2008.09.017 PMID: 18940245
http://dx.doi.org/10.1016/j.jmgm.2005.09.002 PMID: 16226911 [56] Statnikov A, Wang L, Aliferis CF. A comprehensive comparison of
random forests and support vector machines for microarray-based
[40] Antanasijević D, Antanasijević J, Trišović N, Ušćumlić G, Pocajt
cancer classification. BMC Bioinformatics 2008; 9(1): 319.
V. From classification to regression multi-tasking QSAR modelling
using a novel modular neural network: Simultaneous prediction of http://dx.doi.org/10.1186/1471-2105-9-319 PMID: 18647401
anticonvulsant activity and neurotoxicity of succinimides. Mol [57] Liu HX, Zhang RS, Yao XJ, Liu MC, Hu ZD, Fan BT. QSAR and
Pharmaceutics 2017; 14(12): 4476-4484. classification models of a novel series of COX-2 selective inhibi-
[41] Sheikhpour R, Sarram MA, Rezaeian M, Sheikhpour E. QSAR mod- tors: 1,5-diarylimidazoles based on support vector machines. J
elling using combined simple competitive learning networks and Comput Aided Mol Des 2004; 18(6): 389-99.
RBF neural networks. SAR QSAR Environ Res 2018; 29(4): 257-76. http://dx.doi.org/10.1007/s10822-004-2722-1 PMID: 15663000
http://dx.doi.org/10.1080/1062936X.2018.1424030 PMID: [58] Warmuth MK, Liao J, Rätsch G, Mathieson M, Putta S, Lemmen
29372662 C. Active learning with support vector machines in the drug dis-
[42] Castillo-Garit JA, Casañola-Martin GM, Barigye SJ, Pham-The H, covery process. J Chem Inf Comput Sci 2003; 43(2): 667-73.
Torrens F, Torreblanca A. Machine learning-based models to predict http://dx.doi.org/10.1021/ci025620t PMID: 12653536
modes of toxic action of phenols to Tetrahymena pyriformis. In: SAR [59] Vasanthanathan P, Taboureau O, Oostenbrink C, Vermeulen NP,
andQSAR in Environmental Research. 2017; 28:9: p. 735-747. Olsen L, Jørgensen FS. Classification of cytochrome P450 1A2 in-
http://dx.doi.org/10.1080/1062936X.2017.1376705 hibitors and noninhibitors by machine learning techniques. Drug
[43] Prachayasittikul V, Worachartcheewan A, Shoombuatong W, Pra- Metab Dispos 2009; 37(3): 658-64.
chayasittikul V, Nantasenamat C. Classification of P-glycoprotein- http://dx.doi.org/10.1124/dmd.108.023507 PMID: 19056915
interacting compounds using machine learning methods. EXCLI J [60] Fernandez-Lozano C, Cuiñas RF, Seoane JA, Fernández-Blanco E,
2015; 14: 958-70. Dorado J, Munteanu CR. Classification of signaling proteins based
PMID: 26862321 on molecular star graph descriptors using Machine Learning mod-
[44] Ponzoni I, Sebastián-Pérez V, Martínez M, et al. QSAR classifica- els. J Theor Biol 2015; 384: 50-8.
tion models for predicting the activity of inhibitors of betasecretase http://dx.doi.org/10.1016/j.jtbi.2015.07.038 PMID: 26297890
(BACE1) associated with Alzheimer’s disease scientific reports. [61] Mansouri KN, Cariello A, Korotcov V, et al. Open source QSAR
Sci Rep 2019; 9: 1-13. models for pKa prediction using multiple machine learning ap-
[45] Thai K M, Huynh N T, Ngo T D, Mai T T, Nguyen T H, Tran T D. proaches. J Cheminform 2019; 11(60): 1-20.
Three- and four-class classification models for P-glycoprotein in- [62] Liu R, Madore M, Glover KP, Feasel MG, Wallqvist A. Assessing
hibitors using counter-propagation neural networks. J SAR and deep and shallow learning methods for quantitative prediction of
QSAR in Environm Res 2015; 26(2): 139-163. acute chemical toxicity. Toxicol Sci 2018; 164(2): 512-526.
[46] Varnek A, Baskin I. Machine learning methods for property predic- http://dx.doi.org/10.1093/toxsci/kfy111
tion in chemoinformatics: Quo Vadis? J Chem Inf Model 2012;
52(6): 1413-37.
QSAR Research Based on Machine Learning Concepts Current Drug Discovery Technologies, 2021, Vol. 18, No. 1 29
[63] Koutsoukas A, Lowe R, Kalantarmotamedi Y, et al. In silico target [78] An Y, Sherman W, Dixon SL. Kernel-based partial least squares:
predictions: defining a benchmarking data set and comparison of application to fingerprint-based QSAR with model visualization. J
performance of the multiclass naïve bayes and parzen-rosenblatt Chem Inf Model 2013; 53(9): 2312-21.
window. J Chem Inf Model 2013; 53(8): 1957-66. http://dx.doi.org/10.1021/ci400250c PMID: 23901898
http://dx.doi.org/10.1021/ci300435j PMID: 23829430 [79] Ghasemi F, Mehridehnavi A, Pérez-Garrido A, Pérez-Sánchez H.
[64] Ballabio D, Grisoni F, Consonni V, Todeschini R. Integrated Neural network and deep-learning algorithms used in QSAR stud-
QSAR models to predict acute oral systemic toxicity. Mol Inform ies: merits and drawbacks. Drug Discov Today 2018; 23(10): 1784-
2018; 381800124. 90.
http://dx.doi.org/10.1002/minf.201800124 PMID: 30549437 http://dx.doi.org/10.1016/j.drudis.2018.06.016 PMID: 29936244
[65] Tripaldi P, Pérez-González A, Rojas C, Radax J, Ballabio D, [80] Miyao T, Kaneko H, Funatsu K. Inverse qspr/qsar analysis for
Todeschini R. Classification-based QSAR models for the predic- chemical structure generation (from y to x). J Chem Inf Model
tion of the bioactivity of ACE-inhibitor peptides. Protein Pept Lett 2016; 56(2): 286-99.
2018; 25(11): 1015-23. http://dx.doi.org/10.1021/acs.jcim.5b00628 PMID: 26818135
http://dx.doi.org/10.2174/0929866525666181114145658 PMID: [81] Miyao T, Arakawa M, Funatsu K. Exhaustive structure generation
30430931 for inverse-QSPR/QSAR. Mol Inform 2010; 29(1-2): 111-25.
[66] Ahmadi M, Vogt M, Iyer P, Bajorath J, Fröhlich H. Predicting [82] Hasegawa K, Kimura T, Funatsu K. Inverse QSAR study using
potent compounds via model-based global optimization. J Chem evolutionary algorithm. J Comput Aid Chem 2009; 10: 10-5.
Inf Model 2013; 53(3): 553-559. [83] Cho S J, Zheng W, Tropsha A. Rational combinatorial library
http://dx.doi.org/10.1021/ci3004682 PMID: 23363236 design. 2. rational design of targeted combinatorial peptide libraries
[67] Obrezanova O, Segall MD. Gaussian processes for classification: using chemical similarity probe and the inverse QSAR approaches.
QSAR modeling of ADMET and target activity. J Chem Inf Model J Chem Inf Comput Sci 1998; 38, 2: 259-268.
2010; 50(6): 1053-61. [84] Wong WW, Burkowski FJ. A constructive approach for discover-
http://dx.doi.org/10.1021/ci900406x PMID: 20433177 ing new drug leads: Using a kernel methodology for the inverse
[68] Obrezanova O, Csanyi G, Gola JM, Segall MD. Gaussian process- QSAR problem. J Cheminform 2009; 28; 1: 4.
es: a method for automatic QSAR modeling of ADME properties. J [85] Matveieva M, Cronin MTD, Polishchuk P. Interpretation of QSAR
Chem Inf Model 2007; 47(5): 1847-57. models: mining structural patterns taking into account molecular
http://dx.doi.org/10.1021/ci7000633 PMID: 17602549 context. Mol Inform 2019; 38(3): e1800084.
[69] González MP, Caballero J, Tundidor-Camba A, Helguera AM, http://dx.doi.org/10.1002/minf.201800084 PMID: 30346106
Fernández M. Modeling of farnesyltransferase inhibition by some [86] Blaschke T, Olivecrona M, Engkvist O, Bajorath J, Chen H. Appli-
thiol and non-thiol peptidomimetic inhibitors using genetic neural cation of generative autoencoder in de novo molecular design. Mol
networks and RDF approaches. Bioorg Med Chem 2006; 14(1): Inform 2018; 37(1-2), 1700123.
200-13. http://dx.doi.org/10.1002/minf.201700123 PMID: 29235269
http://dx.doi.org/10.1016/j.bmc.2005.08.009 PMID: 16185882 [87] Olivecrona M, Blaschke T, Engkvist O, Chen H. Molecular de-
[70] Caballero J, Garriga M, Fernández M. 2D Autocorrelation model- novo design through deep reinforcement learning. J Cheminform
ing of the negative inotropic activity of calcium entry blockers us- 2017; 9(1): 48.
ing Bayesian-regularized genetic neural networks. Bioorg Med http://dx.doi.org/10.1186/s13321-017-0235-x PMID: 29086083
Chem 2006; 14(10): 3330-40. [88] Popova M, Isayev O, Tropsha A. Deep reinforcement learning for
http://dx.doi.org/10.1016/j.bmc.2005.12.048 PMID: 16442799 de novo drug design. Sci Adv 2018; 4(7): eaap7885.
[71] Caballero J, Fernández M. Linear and nonlinear modeling of anti- http://dx.doi.org/10.1126/sciadv.aap7885 PMID: 30050984
fungal activity of some heterocyclic ring derivatives using multiple [89] Segler MHS, Kogej T, Tyrchan C, Waller MP. Generating focused
linear regression and Bayesian-regularized neural networks. J Mol molecule libraries for drug discovery with recurrent neural net-
Model 2006; 12(2): 168-81. works. ACS Cent Sci 2018; 4(1): 120-31.
http://dx.doi.org/10.1007/s00894-005-0014-x PMID: 16205958 http://dx.doi.org/10.1021/acscentsci.7b00512 PMID: 29392184
[72] Fernández M, Caballero J, Fernández L, Abreu JI, Garriga M. [90] Miyao T, Funatsu K, Bajorath J. Exploring differential evolution
Protein radial distribution function (P-RDF) and Bayesian- for inverse QSAR analysis. Chem Inf Sci 2017; 6: 1-20.
Regularized Genetic Neural Networks for modeling protein con- [91] Kier LB, Hall LH, Frazer JW. Design of molecules from quantita-
formational stability: chymotrypsin inhibitor 2 mutants. J Mol tive structure activity relationship models.1. information transfer
Graph Model 2007; 26(4): 748-59. between path and vertex degree counts. J Chem Inf Comput Sci
http://dx.doi.org/10.1016/j.jmgm.2007.04.011 PMID: 17569565 1993; 33(1): 143-147.
[73] Agrafiotis DK, Cedeño W, Lobanov VS. On the use of neural net- [92] Hall LH, Kier LB, Frazer JW. Design of molecules from quantita-
work ensembles in QSAR and QSPR. J Chem Inf Comput Sci tive structure activity relationship models.2. derivation and proof of
2002; 42(4): 903-11. information transfer relating equations. J Chem Inf Comput Sci
http://dx.doi.org/10.1021/ci0203702 PMID: 12132892 1993; 33(1): 148-152.
[74] Liu Y. Drug design by machine learning: Ensemble learning for [93] Skvortsova MI, Baskin II, Slovokhotova OL, et al. Inverse problem
QSAR modeling. Machine Learning and Applications. Proceedings in QSAR/QSPR studies for the case of topological indexes charac-
of the Fourth International Conference on Machine Learning and terizing molecular shape (Kier Indices). J Chem Inf Comput Sci
Applications (ICMLA); 2005 Dec 15-17, Los Angeles, CA, USA. 1993; 33(4): 630-634.
IEEE Computer Society 2005. [94] Skvortsova MI, Fedyaev KS, Palyulin VA, et al. Inverse struc-
[75] Simões RS, Oliveira PR, Honório KM, Lima CAM. Applying tureproperty relationship problem for the case of a correlation
Transfer Learning to QSAR Regression Models. In: Latifi S, Ed. equation containing the hosoya index. Dokl Chem 2001; 379(1-3):
Information Technology - New Generations Advances in Intelli- 191-195.
gent Systems and Computing. Springer, Cham 2018. http://dx.doi.org/10.1023/A:1019217526008
http://dx.doi.org/10.1007/978-3-319-77028-4_81 [95] Souyei B, Seyd A H, Zaiz F, Rebiai A. Application of inverse
[76] Rensi SE, Altman RB. Shallow representation learning via kernel QSAR/QSPR analysis for pesticides structures generation. Acta
PCA improves QSAR modelability. J Chem Inf Model 2017; 57(8): Chimica Slovenica 2019; 66(2) : 1-11.
1859-67. [96] Shoombuatong W, et al. Towards the revival of interpretable
http://dx.doi.org/10.1021/acs.jcim.6b00694 PMID: 28727421 QSAR models. Advances in QSAR modeling challenges and ad-
[77] Ma J, Sheridan RP, Liaw A, Dahl GE, Svetnik V. Deep neural nets vances in computational chemistry and physics. Cham: Springer
as a method for quantitative structure-activity relationships. J Chem 2017; Vol. 24, pp. 3-55.
Inf Model 2015; 55(2): 263-74. http://dx.doi.org/10.1007/978-3-319-56850-8_1
http://dx.doi.org/10.1021/ci500747n PMID: 25635324
30 Current Drug Discovery Technologies, 2021, Vol. 18, No. 1 Keyvanpour and Shirzad
[97] Andersson CD, Hillgren JM, Lindgren C, et al. Benefits of statisti- http://dx.doi.org/10.2174/1570159X14666161213142841
cal molecular design, covariance analysis, and reference models in [99] Suman KC, Mani Alla SR. Descriptor free QSAR modeling using
QSAR: a case study on acetylcholinesterase. J Comput Aided Mol deep learning with long short-term memory neural networks. Front
Des 2015; 29(3): 199-215. Artif Intell 2019; 2: 1-18.
http://dx.doi.org/10.1007/s10822-014-9808-1 PMID: 25351962 [100] Kausar S, Falcao AO. An automated framework for QSAR model
[98] Pulikkal BP, Marunnan SM, Bandaru S. Common SAR derived building. J Cheminform 2018; 10(1): 1.
from linear and non-linear QSAR studies on AChE inhibitors used http://dx.doi.org/10.1186/s13321-017-0256-5 PMID: 29340790
in the treatment of Alzheimer’s Disease. Curr Neuropharmacol
2017; 14;15(8): 1093-1099.