Choose Your Own Algorithm
Introduction
• Machine learning is part art and part science: when you look at machine learning algorithms, there is no single solution or approach that fits all problems.
• Several factors can affect your decision when choosing a machine learning algorithm.
• Some problems are very specific and require a unique approach. For example, a recommender system is a very common type of machine learning application, yet it solves a very specific kind of problem.
• Other problems are very open and need a trial-and-error approach.
• Supervised learning tasks such as classification and regression are very open: they could be used for anomaly detection, or to build more general sorts of predictive models.
• Besides, some of the decisions we make when choosing a machine learning algorithm have less to do with optimization or the technical aspects of the algorithm, and more to do with business decisions.
• Below, we look at some of the factors that can help you narrow down the search for your machine learning algorithm.
Steps
• Understand Your Data
• Know your data
• Clean your data
• Augment your data
• Categorize the problem
• Understand your constraints
• Find the available algorithms
Linear Regression
• Estimating the time to travel from one location to another
• Predicting sales of a particular product next month
• Impact of blood drug content on coordination
• Predicting monthly gift card sales to improve yearly revenue projections
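A minimal linear regression sketch for the sales-forecasting use case above, assuming scikit-learn is available (the monthly figures are made up for illustration):

```python
# Minimal linear regression sketch (assumes scikit-learn is installed).
# The monthly sales figures below are made-up illustration data.
import numpy as np
from sklearn.linear_model import LinearRegression

months = np.array([[1], [2], [3], [4], [5], [6]])  # month index as the single feature
sales = np.array([120, 135, 149, 162, 178, 190])   # units sold per month

model = LinearRegression().fit(months, sales)
print("Predicted sales for month 7:", model.predict([[7]])[0])
```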
Decision Trees
• Investment decisions
• Customer churn
• Bank loan defaulters
• Build vs Buy decisions
• Sales lead qualifications
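A minimal decision-tree sketch for the customer-churn case above, assuming scikit-learn; the [monthly_spend, support_tickets] features and labels are invented for illustration:

```python
# Minimal decision-tree sketch for a churn-style problem (scikit-learn assumed).
# Each row is a made-up customer: [monthly_spend, support_tickets].
from sklearn.tree import DecisionTreeClassifier

X = [[20, 5], [80, 0], [15, 7], [90, 1], [30, 4], [70, 0]]
y = [1, 0, 1, 0, 1, 0]  # 1 = churned, 0 = retained

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(clf.predict([[25, 6]]))  # a low-spend, high-ticket customer: likely churn
```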
K Nearest Neighbor
• Sometimes you don't have any labels, and your goal is to assign labels according to the features of objects. This is called a clustering task. Clustering algorithms can be used, for example, when there is a large group of users and you want to divide them into particular groups based on some common attributes (see the sketch below).
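The scenario above is unsupervised, so a minimal sketch uses scikit-learn's KMeans (one possible choice; the slide names no specific library) to segment users by common attributes; the [age, purchases_per_month] data is made up:

```python
# Minimal clustering sketch matching the user-segmentation scenario above.
# scikit-learn's KMeans is an assumed choice; the user features are made up.
import numpy as np
from sklearn.cluster import KMeans

users = np.array([[18, 2], [22, 3], [45, 10], [50, 12], [33, 6], [40, 9]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(users)
print(kmeans.labels_)  # cluster assignment for each user
```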
SVM
• Detecting persons with common diseases such
as diabetes.
• Hand-written character recognition
• Text categorization, e.g. classifying news articles by topic
• Stock market price prediction
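A minimal SVM sketch for the hand-written character recognition use case above, assuming scikit-learn and its bundled digits dataset:

```python
# Minimal SVM sketch for hand-written digit recognition (scikit-learn assumed).
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", gamma=0.001).fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```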
Naïve Bayes
• Sentiment analysis and text classification
• Recommendation systems like Netflix, Amazon
• To mark an email as spam or not spam
• Face recognition
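A minimal naive Bayes sketch for the spam-filtering use case above, assuming scikit-learn; the tiny corpus is invented for illustration:

```python
# Minimal naive Bayes sketch for spam filtering (scikit-learn assumed).
# The tiny corpus below is made up for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win a free prize now", "meeting at noon tomorrow",
         "free offer click now", "project update attached"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

vec = CountVectorizer()
X = vec.fit_transform(texts)
clf = MultinomialNB().fit(X, labels)
print(clf.predict(vec.transform(["free prize offer"])))  # expected: [1] (spam)
```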
Random Forest
• Random Forest can be used in real-world applications such as:
• Predicting patients at high risk
• Predicting part failures in manufacturing
• Predicting loan defaulters
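A minimal random-forest sketch for the loan-defaulter use case above, assuming scikit-learn; the [income, debt_ratio] features and labels are made up:

```python
# Minimal random-forest sketch for loan-default prediction (scikit-learn
# assumed; applicant features [income_k, debt_ratio] are made up).
from sklearn.ensemble import RandomForestClassifier

X = [[40, 0.8], [90, 0.2], [35, 0.9], [120, 0.1], [50, 0.6], [80, 0.3]]
y = [1, 0, 1, 0, 1, 0]  # 1 = defaulted, 0 = repaid

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict_proba([[60, 0.5]]))  # default probability for a new applicant
```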
Cheat Sheet
Performance Metrics
• True Positive: You predicted positive, and it’s true.
• True Negative: You predicted negative, and it’s true.
• False Positive: (Type 1 Error): You predicted positive, and it’s false.
• False Negative: (Type 2 Error): You predicted negative, and it’s false.
• Accuracy: the proportion of the total number of predictions that were correct.
• Positive Predictive Value or Precision: the proportion of positive cases that were
correctly identified.
• Negative Predictive Value: the proportion of negative cases that were correctly
identified.
• Sensitivity or Recall: the proportion of actual positive cases which are correctly
identified.
• Specificity: the proportion of actual negative cases which are correctly identified.
• Rate: a measure derived from the confusion matrix. There are four such rates: TPR (true positive rate), FPR (false positive rate), TNR (true negative rate), and FNR (false negative rate). A sketch computing these metrics follows after this list.
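To make the definitions above concrete, here is a short sketch computing each metric from raw confusion-matrix counts (the counts themselves are illustrative values):

```python
# Sketch: computing the metrics defined above from raw confusion-matrix counts.
tp, fp, tn, fn = 40, 10, 35, 15  # made-up illustration counts

accuracy    = (tp + tn) / (tp + fp + tn + fn)
precision   = tp / (tp + fp)   # positive predictive value
npv         = tn / (tn + fn)   # negative predictive value
recall      = tp / (tp + fn)   # sensitivity, true positive rate (TPR)
specificity = tn / (tn + fp)   # true negative rate (TNR)
fpr         = fp / (fp + tn)   # false positive rate
fnr         = fn / (fn + tp)   # false negative rate

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} specificity={specificity:.2f}")
```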
If the classifier predicts negative, you can trust it: the example is negative. Pay attention, however: if the example is actually negative, you can't be sure the classifier will predict it as negative (specificity = 78%).
If the classifier predicts positive, you can't trust it (precision = 33%). However, if the example is actually positive, you can trust the classifier to catch it (recall = 100%).
• Predicting everything as positive clearly can't be a good idea. However, because the population is imbalanced, the precision is relatively high; the recall is 100% because all the positive examples are predicted as positive. But the specificity is 0% because no negative example is predicted as negative.
The accuracy for the problem at hand comes out to be 88%. As you can see from the above two tables, the Positive Predictive Value is high, but the Negative Predictive Value is quite low. The same holds for Sensitivity and Specificity. This is primarily driven by the threshold value we have chosen. If we decrease our threshold value, the two pairs of starkly different numbers will come closer.
In general, we are concerned with one of the above-defined metrics. For instance, a pharmaceutical company will be more concerned with minimizing wrong positive diagnoses, and hence will care most about high Specificity. On the other hand, an attrition model will be more concerned with Sensitivity.
Confusion matrices are generally used only with class output models.
F1 Score
In the last section, we discussed precision and recall for classification problems and highlighted the importance of choosing between them based on our use case. What if, for a use case, we are trying to get the best precision and recall at the same time? The F1-Score addresses exactly this: it is the harmonic mean of precision and recall for a classification problem, F1 = 2 × (Precision × Recall) / (Precision + Recall).
Now, an obvious question that comes to mind is why we take a harmonic mean rather than an arithmetic mean. The reason is that the harmonic mean punishes extreme values more. Let us understand this with an example. Suppose we have a binary classification model with the following results:
Precision: 0, Recall: 1
Here, if we take the arithmetic mean, we get 0.5. It is clear that the above result comes from a dumb classifier that ignores the input and simply predicts one of the classes as output. Now, if we take the harmonic mean, we get 0, which is accurate, as this model is useless for all practical purposes.
This seems simple. There are situations, however, in which a data scientist would like to give more importance/weight to either precision or recall; the sketch below shows one standard way to do that.
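To make the harmonic-mean argument concrete, the sketch below computes F1 and the standard F-beta generalization (F-beta is not named on the slide; it is the usual way to weight precision or recall, with beta > 1 favoring recall):

```python
# Sketch: F1 as the harmonic mean of precision and recall, plus the standard
# F-beta generalization (beta > 1 weights recall, beta < 1 weights precision).
def f_beta(precision, recall, beta=1.0):
    if precision == 0 and recall == 0:
        return 0.0  # avoid division by zero for a totally failed classifier
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_beta(0.0, 1.0))          # 0.0  -- the "dumb classifier" example above
print((0.0 + 1.0) / 2)           # 0.5  -- arithmetic mean hides the failure
print(f_beta(0.5, 0.8, beta=2))  # recall-weighted F2 score
```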
• Area Under the ROC Curve (AUC-ROC)
• This is again one of the popular evaluation metrics used
in the industry. The biggest advantage of using the ROC
curve is that it is independent of the change in the
proportion of responders. This statement will get clearer
in the following sections.
• Let's first try to understand what the ROC (Receiver Operating Characteristic) curve is. If we look at the confusion matrix for a probabilistic model, we observe that we get different values for each metric depending on the probability threshold we choose.
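A minimal sketch of computing the ROC curve and AUC from predicted probabilities, assuming scikit-learn (the labels and scores are made up); each threshold over the scores yields a different confusion matrix, which is what the curve traces out:

```python
# Sketch: ROC curve and AUC from predicted probabilities (scikit-learn
# assumed; labels and scores below are made-up illustration values).
from sklearn.metrics import roc_auc_score, roc_curve

y_true  = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.5]

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one point per threshold
print("AUC:", roc_auc_score(y_true, y_score))
```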
Clustering Metrics
• Silhouette Score
• The Silhouette Score and Silhouette Plot are used to measure the separation distance between clusters (a sketch computing both clustering metrics follows at the end of this section).
• Dunn Index
• Dunn's Index (DI) is another metric for evaluating clustering algorithms. It equals the minimum inter-cluster distance divided by the maximum intra-cluster distance (the largest cluster diameter). Large inter-cluster distances (better separation) and small cluster diameters (more compact clusters) lead to a higher DI value, so a higher DI implies better clustering. It assumes that better clustering means clusters that are compact and well separated from other clusters.
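A sketch computing both metrics, assuming scikit-learn for the silhouette score and NumPy/SciPy for a hand-rolled Dunn index (scikit-learn has no built-in Dunn function, so the helper below is an illustrative implementation of the definition above; the data is made up):

```python
# Sketch: silhouette score via scikit-learn, plus a hand-rolled Dunn index
# implementing the definition above (min inter-cluster distance divided by
# max intra-cluster diameter). Data points are made up for illustration.
import numpy as np
from scipy.spatial.distance import cdist, pdist
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[1, 1], [1.2, 0.9], [0.9, 1.1], [5, 5], [5.1, 4.9], [4.8, 5.2]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print("Silhouette:", silhouette_score(X, labels))

def dunn_index(X, labels):
    clusters = [X[labels == k] for k in np.unique(labels)]
    # Minimum distance between points in different clusters (separation).
    min_inter = min(cdist(a, b).min()
                    for i, a in enumerate(clusters)
                    for b in clusters[i + 1:])
    # Maximum within-cluster diameter (compactness).
    max_diam = max(pdist(c).max() for c in clusters if len(c) > 1)
    return min_inter / max_diam

print("Dunn index:", dunn_index(X, labels))
```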