Zarantech - Intro To ML
Introduction to Machine Learning
Disclaimer
• This presentation, including examples, images, and references, is provided for informational
purposes only.
• Complying with all applicable copyright laws is the responsibility of the user.
• Without limiting the rights under copyright, no part of this document may be reproduced,
stored or introduced into a retrieval system, or transmitted in any form or by any means.
• Credit shall be given for images taken from open-source material, and such images cannot be
used for promotional activities.
Agenda
• Trainer Introduction
• What is Machine Learning?
• Why is Machine Learning Important?
“A computer program is said to learn from experience E with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
- Tom Mitchell
“Former Chair of the Machine Learning Department”
What is Machine Learning?
• Provide a definition of what the learner should learn and the need
for learning
Phases for Performing ML
Phase 1 – Training Phase: Here, training data is used to train the model by pairing the given input with the
expected output (learning model).
Phase 2 – Validation and Test Phase: This phase measures the goodness of the learning model that has been
trained and also estimates the model properties, such as error measures, sensitivity, specificity, recall, precision,
and others. It uses a validation dataset, and the output is a sophisticated learning model.
Phase 3 – Application Phase: In this phase, the model is subjected to the real-world data for which the results
need to be derived.
The figure shows how learning can be applied to predict behaviour:
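A rough sketch of these three phases in Python, assuming scikit-learn is available; the dataset, split ratio and model here are purely illustrative:

```python
# Sketch of the three phases, assuming scikit-learn; dataset and model are illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score, precision_score

X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# Phase 1 - Training Phase: pair inputs with expected outputs and fit the model
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# Phase 2 - Validation and Test Phase: estimate error measures, recall, precision, etc.
y_pred = model.predict(X_valid)
print("accuracy:", accuracy_score(y_valid, y_pred))
print("recall:", recall_score(y_valid, y_pred))
print("precision:", precision_score(y_valid, y_pred))

# Phase 3 - Application Phase: apply the trained model to new, real-world data
X_new = X_valid[:5]          # stand-in for unseen real-world records
print("predictions:", model.predict(X_new))
```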
Why Use Machine Learning?
Advantages
• Often much more accurate than human-crafted rules (since data driven)
• Humans are often incapable of expressing what they know (e.g., rules of English, or how to recognize letters),
but can easily classify examples
• Don’t need a human expert or programmer
• Automatic method to search for hypotheses explaining data
• Cheap and flexible; can apply to any learning task
Disadvantages
• Need a lot of data
• Error prone - usually impossible to get perfect accuracy
Practical Applications of ML
• Spam detection
• Digit recognition
• Product recommendation
• Face detection
Complementing Fields of Machine Learning
Four Fields of Machine Learning
Data Mining
• Data mining is the process of sorting through large datasets to identify patterns and establish relationships to
solve problems through data analysis.
• The fields of Data Mining and Machine Learning are intertwined and there is a significant overlap in the
underlying principles and methodologies.
Artificial Intelligence
• Machine Learning is a subfield of Artificial Intelligence.
Statistical Learning
• Just like Machine Learning, Statistical Learning is also about building the ability to infer from data that in some
cases represents experience.
Machine Learning Modeling Flow
1. Business Problem
2. Data Analysis
3. Datasets
4. Data Preparation
5. Modeling
6. Evaluation
7. Deployment
Machine Learning Modeling Flow (Contd.)
• Business Problem: First, define the problem statement and the goal to be achieved, along with the
assumptions we have about the data.
• Data Analysis: Analyze the data, e.g., determine whether it is a regression or a classification problem.
• Data Preparation: Check for null values and outliers and clean them.
Machine Learning Modeling Flow (Contd.)
• Labeling takes a set of unlabeled data and augments each piece of that unlabeled data with some sort of
meaningful "tag," "label," or "class" that is somehow informative or desirable to know
• For example, labels for the above types of unlabeled data might be whether a photo contains a horse or
a cow, which words were uttered in an audio recording, what type of action is being performed in a
video, etc.
Statistical Learning Perspective
The statistical perspective frames data in the context of a hypothetical function (f) that the machine learning
algorithm is trying to learn. That is, given some input variables (input), what is the predicted output variable
(output).
Output = f(Input)
• The columns that are the inputs are referred to as input variables
• The column of data we would like to predict for new input data in the future is called the output variable. It is
also called the response variable
X1     X2     Y
2.3    2.6    1
2.1    2.0    0
2.2    2.3    1
Statistical Learning Perspective (contd.)
• The most common type of machine learning is to learn the mapping Y = f(X) to make predictions of Y for
new X. This is called predictive modeling or predictive analytics and our goal is to make the most accurate
predictions possible
• Generally, there is more than one input variable. In that case, the group of input variables is referred to as
the Input Vector
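A minimal sketch of learning the mapping Y = f(X1, X2) from the small table above, assuming scikit-learn is available; the new input row at the end is made up for illustration:

```python
# Learn Y = f(X1, X2) from the tiny example table; a sketch, not a production model.
from sklearn.linear_model import LogisticRegression

X = [[2.3, 2.6],   # X1, X2 (the input vector)
     [2.1, 2.0],
     [2.2, 2.3]]
y = [1, 0, 1]      # output / response variable

f = LogisticRegression().fit(X, y)          # approximate the unknown function f
print(f.predict([[2.4, 2.5]]))              # predict Y for a new input vector
```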
Algorithms that simplify the function to a known form are called Parametric
Machine Learning Algorithms
Consider a functional form for the mapping function that is a line, as is used in linear regression:
B0 + B1 * X1 + B2 * X2 = 0
where B0, B1 and B2 are the coefficients of the line that control the intercept and slope, and X1 and X2 are two
input variables.
• Estimate the coefficients of the line equation and we have a predictive model for the problem.
• Often the assumed functional form is a linear combination of the input variables and as such parametric
machine learning algorithms are often also called linear machine learning algorithms
• The problem is, the actual unknown underlying function may not be a linear function like a line. It could be
almost a line and require some minor transformation of the input data to work right. Or it could be nothing
like a line in which case the assumption is wrong and the approach will produce poor results.
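A minimal sketch of estimating the coefficients of an assumed linear form with plain NumPy least squares; the data values below are made up purely for illustration:

```python
# Estimating the coefficients B0, B1, B2 of an assumed linear form with ordinary
# least squares (NumPy only); the data here is fabricated purely for illustration.
import numpy as np

X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y  = 1.5 + 2.0 * X1 - 0.5 * X2               # a known linear target, for checking

A = np.column_stack([np.ones_like(X1), X1, X2])   # design matrix [1, X1, X2]
(B0, B1, B2), *_ = np.linalg.lstsq(A, y, rcond=None)
print(B0, B1, B2)                             # should roughly recover 1.5, 2.0, -0.5
```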
Parametric Machine Learning Algorithms
Benefits of Parametric Machine Learning Algorithms:
• Less Data. They do not require as much training data and can work well even if the fit to the data is not
perfect
Limitations of Parametric Machine Learning Algorithms:
• Constrained. By choosing a functional form these methods are highly constrained to the specified form
• Poor Fit. In practice the methods are unlikely to match the underlying mapping function
Non-parametric Machine Learning Algorithms
Algorithms that do not make strong assumptions about the form of the mapping function are called non-parametric
machine learning algorithms.
• By not making assumptions, they are free to learn any functional form from the training data
• The method does not assume anything about the form of the mapping function, other than that patterns
which are close are likely to have a similar output variable
• This method works well when there is a lot of data and it becomes easier to choose the right features
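A minimal sketch of a non-parametric method, assuming scikit-learn; k-Nearest Neighbours assumes nothing about the functional form, only that nearby inputs have similar outputs (the data below is illustrative):

```python
# Non-parametric example: k-Nearest Neighbours regression (no assumed functional form).
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.linspace(0, 10, 50).reshape(-1, 1)
y = np.sin(X).ravel()                         # a clearly non-linear target

knn = KNeighborsRegressor(n_neighbors=3).fit(X, y)
print(knn.predict([[4.2]]))                   # predicted from the 3 closest training points
```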
Non-parametric Machine Learning Algorithms
Limitations of Non-parametric Machine Learning Algorithms:
• More data. They require a lot more training data to estimate the mapping function
• Slower. A lot slower to train, as they often have far more parameters to train
• Overfitting. More of a risk of overfitting the training data, and it is harder to explain why specific predictions
are made
Types of Machine Learning
There are three types of Machine Learning:
• Supervised Learning: All data is labeled and the algorithms learn to predict the output from the input data
• Unsupervised Learning: All data is unlabeled and the algorithms learn the inherent structure from the input
data
• Semi-supervised Learning: Some data is labeled but most of it is unlabeled, and a mixture of supervised and
unsupervised techniques can be used
Supervised Learning
• Regression: Linear Regression
• Classification: Logistic Regression, Naïve Bayes, Decision Tree (CART, C5.0)
Supervised Learning : Regression
Linear Regression
Logistic Regression
Supervised Learning : Classification
Support Vector Machine
1. A supervised Machine Learning algorithm which can be used for both classification and regression
challenges.
2. Uses a technique called the kernel trick to transform the data and then, based on these transformations,
finds an optimal boundary between the possible outputs.
3. The goal of SVM is to find the hyperplane that separates the two classes.
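A minimal sketch, assuming scikit-learn; the RBF kernel and the synthetic data are illustrative:

```python
# Support Vector Machine sketch; the RBF kernel illustrates the "kernel trick" above.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
svm = SVC(kernel="rbf", C=1.0).fit(X, y)      # learns a separating boundary / hyperplane
print(svm.predict(X[:5]))
```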
Supervised Learning : Classification
Decision Trees
• Best known:
  o C5
  o CART
• Very fast to train and evaluate
• Relatively easy to interpret
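A minimal sketch, assuming scikit-learn (whose tree implementation is CART-based); the iris data and depth limit are illustrative:

```python
# Decision Tree sketch; printing the learned rules shows why trees are easy to interpret.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree))                      # the learned if/else rules
```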
Supervised Learning : Classification
Artificial Neural Networks
• Artificial Neural Networks are a class of pattern matching techniques
• These methods are used for regression, classification, image recognition, and sequential data
• They relate to Deep Learning Modeling and have many subfields of algorithms that
help solve specific problems in context
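A minimal sketch of a small neural network classifier, assuming scikit-learn; the layer sizes and iteration limit are illustrative:

```python
# Artificial Neural Network sketch using scikit-learn's multi-layer perceptron.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
ann = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=2000, random_state=0).fit(X, y)
print(ann.predict(X[:5]))
```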
Supervised Learning : Instance Based Learning
• Instance based learning performs comparison explicitly: new problem instances are compared with instances
seen in training, which have been stored in memory
• Instance Based Learning models work on a group of instances that are critical to the problem
• The results across instances are compared, which can include an instance of new data as well
K-Nearest Neighbour
• K-Nearest Neighbour is a simple algorithm that stores all available cases and predicts the target based on a
similarity measure
• It is a highly effective inductive inference method for noisy training data and complex target functions
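A minimal sketch, assuming scikit-learn; the value of k and the iris data are illustrative:

```python
# k-Nearest Neighbour sketch: store all cases, predict from the k most similar ones.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict(X[:3]))                     # class of the majority of the 5 nearest neighbours
```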
Unsupervised Learning
• Clustering: K-means Clustering
• Association: Apriori Rules
Unsupervised Learning : Clustering
K-Means Clustering
• The clustering-based learning method is identified as an unsupervised learning task
• The goal of clustering techniques is to find similar groups in the data
• Clustering is the assignment of a set of observations into subsets so that observations
in the same cluster are similar in some sense
Unsupervised Learning : Clustering
K-Means Clustering
• Used for data discovery and understanding the underlying structure of data
• Useful for unlabeled data as a first round of analysis
• Makes no assumptions about the data
• Must be manually given a target number of clusters
• Utilizes a distance metric and a clustering algorithm
• The k-means algorithm partitions the given data into k clusters, with each cluster having a center called a
centroid
• k is a positive integer
• The purpose of clustering is to classify the data
• The algorithm works iteratively to assign each data point to one of k groups based on the features that are
provided
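A minimal sketch, assuming scikit-learn; the synthetic blobs and the choice k = 3 are illustrative:

```python
# k-means sketch; k (n_clusters) must be supplied manually.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # unlabeled data
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)                # one centroid per cluster
print(kmeans.labels_[:10])                    # cluster assignment for each point
```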
Unsupervised Learning : Association Rule Analysis
An association rule problem is where you want to discover rules that describe large
portions of your data, such as people that buy X also tend to buy Y.
Market Basket Analysis is one of the applications of Association Rule Analysis in the retail industry.
Apriori Algorithm
• It is a classical algorithm in data mining used for mining frequent item sets and relevant association rules
• Apriori uses a "bottom up" approach, where frequent subsets are extended one item at a time
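A minimal pure-Python sketch of the bottom-up counting idea behind Apriori (not the full algorithm); the transactions and the support threshold are made up:

```python
# Toy "bottom up" frequent itemset counting in the spirit of Apriori;
# transactions and the minimum support threshold are illustrative.
from itertools import combinations
from collections import Counter

transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk", "butter"}]
min_support = 2

items = Counter(i for t in transactions for i in t)
frequent_items = {i for i, c in items.items() if c >= min_support}

# extend frequent single items one item at a time to candidate pairs
pairs = Counter()
for t in transactions:
    for pair in combinations(sorted(t & frequent_items), 2):
        pairs[pair] += 1
frequent_pairs = {p: c for p, c in pairs.items() if c >= min_support}
print(frequent_pairs)      # e.g. people who buy bread also tend to buy milk
```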
Performance Measures
Performance Measure - Regression
The three commonly used measures are: 01 Mean Absolute Error (MAE), 02 Mean Squared Error (MSE),
03 Root Mean Squared Error (RMSE)
• As the name suggests, the mean absolute error is an average of the absolute errors.
• Mean Absolute Error (MAE) is a measure of difference between two continuous
variables.
• The MAE is more intuitive and less sensitive to outliers.
Performance Measure - Regression
• The mean squared error (MSE) measures the average of the squares of
the errors that is, the difference between the estimator and what is estimated.
• It is always non-negative, and values closer to zero are better.
• The MSE is a good performance metric as it has more statistical grounding with
variance.
Performance Measure - Regression
• Root Mean Squared Error (RMSE) is a frequently used measure of the differences between the values
predicted by a model (Pi) and the values actually observed (Oi), where n is the number of observations
• RMSE = sqrt( (1/n) * Σ (Pi - Oi)² )
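A minimal sketch computing all three measures, assuming scikit-learn; the observed and predicted values are made up:

```python
# Computing MAE, MSE and RMSE for a regression model; the values are illustrative.
import math
from sklearn.metrics import mean_absolute_error, mean_squared_error

observed  = [3.0, 5.0, 2.5, 7.0]
predicted = [2.5, 5.0, 4.0, 8.0]

mae  = mean_absolute_error(observed, predicted)
mse  = mean_squared_error(observed, predicted)
rmse = math.sqrt(mse)
print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}")
```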
In a Classification problem, you can represent the errors using a “Confusion Matrix”.
Consider 10,000 customer records to predict which customers are likely to respond to a marketing effort.
Accuracy
Accuracy is the proportion of all predictions that were correct: Accuracy = (TP + TN) / Total.
Example: Here, the predictions that were correct are TN = 9000 and TP = 500, so
Accuracy = (9000 + 500) / 10,000 = 95%.
Recall
Recall is the percentage of positive cases that you were able to catch: Recall = TP / (TP + FN).
• Recall is also known as ‘Sensitivity’
• Sensitivity measures the proportion of actual positives that are correctly identified as such (e.g., the
percentage of sick people who are correctly identified as having the condition)
Example: Here, TP = 500 and FN = 100, so Recall = 500 / 600 ≈ 83%.
Precision
Precision is the percentage of positive predictions that were actually correct: Precision = TP / (TP + FP).
Example: Here, TP = 500 and FP = 400, so Precision = 500 / 900 ≈ 56%.
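A minimal sketch computing these measures directly from the counts used in the example above:

```python
# Accuracy, recall and precision from the example counts
# (TP = 500, TN = 9000, FP = 400, FN = 100 out of 10,000 customer records).
TP, TN, FP, FN = 500, 9000, 400, 100

accuracy  = (TP + TN) / (TP + TN + FP + FN)   # 0.95
recall    = TP / (TP + FN)                    # ~0.83, also called sensitivity
precision = TP / (TP + FP)                    # ~0.56
print(accuracy, recall, precision)
```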
Sensitivity
Sensitivity (also called the true positive rate or recall) measures the proportion of
actual positives that are correctly identified as such (e.g., the percentage of sick people who are correctly identified as
having the condition).
Specificity
Specificity (also called the true negative rate) measures the proportion of actual negatives that are correctly
identified as such (e.g., the percentage of healthy people who are correctly identified as not having the condition).
Entropy
• Entropy is a measure of the impurity of a set of examples
• Entropy = - Σ p_i * log2(p_i), where p_i is the proportion of examples belonging to class i
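A minimal sketch computing entropy for a small set of class labels (the labels are illustrative):

```python
# Entropy as a measure of impurity of a set of class labels.
import math
from collections import Counter

labels = ["yes", "yes", "no", "yes", "no", "no", "yes", "yes"]
counts = Counter(labels)
n = len(labels)
entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
print(entropy)   # 0 for a pure set, 1 for a 50/50 binary split
```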
Bias–Variance Trade-Off
A person with high bias is someone who starts to answer before you can even finish asking. A
person with high variance is someone who can think of all sorts of crazy answers. Combining
these gives you different personalities:
- High bias/low variance: this is someone who usually gives you the same answer, no matter
what you ask, and is usually wrong about it;
- High bias/high variance: someone who takes wild guesses, all of which are sort of wrong;
- Low bias/high variance: a person who listens to you and tries to answer the best they can,
but that daydreams a lot and may say something totally crazy;
- Low bias/low variance: a person who listens to you very carefully and gives you good
answers pretty much all the time.
Bias Error
• Bias refers to the simplifying assumptions made by a model to make the target function easier to learn
• Generally parametric algorithms have a high bias making them fast to learn and easier to understand
• Low Bias: Suggests less assumptions about the form of the target function
Examples of low-bias machine learning algorithms include: Decision Trees, k-Nearest Neighbours and Support
Vector Machines.
• High-Bias: Suggests more assumptions about the form of the target function
Examples of high-bias machine learning algorithms include: Linear Regression, Linear Discriminant Analysis
and Logistic Regression.
Variance Error
• Variance is the amount that the estimate of the target function will change if different training data was
used
• The target function is estimated from the training data by a machine learning algorithm, so we should
expect the algorithm to have some variance
• Low Variance: Suggests small changes to the estimate of the target function with changes to the training
dataset.
• High Variance: Suggests large changes to the estimate of the target function with changes to the training
dataset.
Examples of high-variance machine learning algorithms include: Decision Trees, k-Nearest Neighbors and
Support Vector Machines.
Bias-Variance Trade-Off
1. Training set: The training set is typically 60% of the data. As the name suggests,
this is used for training a machine learning model.
2. Validation set: The validation set is also called the development set. This is
typically 20% of the data. This set is not used during training. It is
used to test the quality of the trained model.
3. Test set: This set is typically 20% of the data. Its only purpose is to report the
accuracy of the final model.
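A minimal sketch of the 60/20/20 split, assuming scikit-learn; done here with two calls to train_test_split on an illustrative dataset:

```python
# A 60/20/20 train/validation/test split using two calls to train_test_split.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_train), len(X_valid), len(X_test))   # roughly 600 / 200 / 200
```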
Bias-Variance Trade-Off (contd.)
The prediction error for any machine learning algorithm can be broken down into three parts: bias error,
variance error, and irreducible error.
• Even from a perfect model, we might not be able to remove the errors completely made by a learning
algorithm
• This is because the training data itself may contain noise. This error is called Irreducible error or Bayes’
error rate or the Optimum Error rate
• While we cannot do anything about the Irreducible Error, we can reduce the errors due to bias and
variance
If the model is not performing well, it is usually a high bias or a high variance problem. The figure below
graphically shows the effect of model complexity on error due to bias and variance.
As you increase the complexity of your model, you will see a reduction in error due to lower bias in the model.
However, this only happens up to a particular point. As you continue to make your model more complex, you end
up over-fitting your model, and hence your model will start suffering from high variance.
Bias-Variance Trade-Off (contd.)
The trade-off is the tension between the error introduced by the bias and the variance.
A champion model should maintain a balance between these two types of errors. This is known as the trade-off
management of bias-variance errors. Ensemble learning is one way to perform this trade-off analysis.
Overfitting and Underfitting
Data Inconsistencies in Machine Learning
• Under-fitting
• Over-fitting
• Data instability
• Unpredictable data formats
There are some established processes in place today to address these inconsistencies.
Under Fitting
A model is said to be under-fitting when it doesn't take into consideration enough information to accurately model the
actual data.
Over Fitting
• This case is just the opposite of the under-fitting case. While too small a sample is not appropriate to define
an optimal solution, a large dataset also runs the risk of having the model over-fit the data
• It usually occurs when the statistical model describes noise instead of describing the relationships
Data Instability
1. Machine Learning algorithms are usually robust to noise within the data.
2. A problem will occur if the outliers are due to manual error or misinterpretation of the relevant data.
3. This will result in a skewing of the data, which will ultimately end up in an incorrect model.
4. There is a strong need to have a process to correct or handle human errors that can result in building an
incorrect model.
Resampling Methods
Bootstrapping
The bootstrap method is a statistical technique for estimating quantities about a population by averaging estimates
from multiple small data samples.
• The basic idea of bootstrapping is that inference about a population from sample data (sample →
population) can be modelled by resampling the sample data and performing inference about a sample from
resampled data (resampled → sample)
• As the population is unknown, the true error in a sample statistic against its population value is unknown
• In bootstrap-resamples, the 'population' is in fact the sample, and this is known; hence the quality of
inference of the 'true' sample from resampled data
(resampled → sample) is measurable
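A minimal sketch of the bootstrap, using plain NumPy; the sample values and the number of resamples are illustrative:

```python
# Bootstrap sketch: estimate the standard error of the sample mean by resampling
# the sample with replacement; the data values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
sample = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.3, 4.4])

boot_means = [rng.choice(sample, size=len(sample), replace=True).mean()
              for _ in range(1000)]
print("bootstrap estimate of the mean:", np.mean(boot_means))
print("bootstrap standard error:", np.std(boot_means))
```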
Bootstrapping - Advantages & Disadvantages
Advantages
• A great advantage of bootstrap is its simplicity
• It is a straightforward way to derive estimates of standard errors and confidence intervals for complex
estimators of complex parameters of the distribution, such as percentile points, proportions, odds ratios, and
correlation coefficients
• Bootstrap is also an appropriate way to control and check the stability of the results
Disadvantages
Jackknife
Jackknife, which is similar to bootstrapping, is used to estimate the bias and standard error (variance) of a
statistic when a random sample of observations is used to calculate it.
• Jackknife is a resampling technique especially useful for variance and bias estimation
• The jackknife estimator of a parameter is found by systematically leaving out each observation from a dataset
and calculating the estimate and then finding the average of these calculations
• Given a sample of size n, the jackknife estimate is found by aggregating the estimates of each (n-1) -sized sub-
sample
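A minimal sketch of the jackknife for the sample mean, using plain NumPy; the sample values are illustrative:

```python
# Jackknife sketch: leave each observation out in turn, recompute the statistic,
# and aggregate the (n-1)-sized estimates; the data values are illustrative.
import numpy as np

sample = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.3, 4.4])
n = len(sample)

leave_one_out = np.array([np.delete(sample, i).mean() for i in range(n)])
jackknife_estimate = leave_one_out.mean()
jackknife_variance = (n - 1) / n * np.sum((leave_one_out - jackknife_estimate) ** 2)
print(jackknife_estimate, jackknife_variance)
```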
Cross Validation
Cross-validation is used to estimate the skill of machine learning models
Used in applied machine learning to compare and select a model for a given
predictive modeling problem
• Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data
sample
Leave-p-Out Cross-Validation (LpO CV)
• LpO CV involves using p observations as the validation set and the remaining observations as the training set
• This is repeated on all ways to cut the original sample into a validation set of p observations and a training set
• LpO cross-validation requires training and validating the model nCp times, where n is the number of
observations in the original sample, and where nCp is the binomial coefficient
• For p > 1 and for even moderately large n, LpO CV can become computationally infeasible
Leave-One-Out Cross-Validation (LOOCV)
• The process looks similar to jackknife; however, with cross-validation one computes a statistic on the left-out
sample(s), while with jackknifing one computes a statistic from the kept samples only
• LOO cross-validation requires less computation time than LpO cross-validation because there are only n
passes (one per observation) rather than nCp
• However, n passes may still require quite a large computation time, in which case other approaches such as k-
fold cross validation may be more appropriate.
k-fold Cross-Validation
In k-fold cross-validation, the original sample is randomly partitioned into k equal sized subsamples
• Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the
remaining k − 1 subsamples are used as training data
• The cross-validation process is then repeated k times, with each of the k subsamples used exactly once as
the validation data
• The advantage of this method over repeated random sub-sampling is that all observations are used for both
training and validation, and each observation is used for validation exactly once
The k value must be chosen carefully for your data sample. A poorly chosen value for k may
result in a mis-representative idea of the skill of the model, such as a score with a high variance or a high bias.
• The number of rows in your dataset should ideally be divisible by k, to ensure each of the k groups
has the same number of rows
• Choose a value for k that splits the data into groups with enough rows that each group is still representative
of the original dataset
• A good default to use is k=3 for a small dataset or k=10 for a larger dataset
• A quick way to check if the fold sizes are representative is to calculate summary statistics such as mean and
standard deviation and see how much the values differ from the same statistics on the whole dataset
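A minimal sketch of k-fold cross-validation, assuming scikit-learn; k = 10 and the model are illustrative:

```python
# k-fold cross-validation sketch; k=10 is a common default for a reasonably sized dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
kfold = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kfold)
print(scores.mean(), scores.std())            # average skill and its spread across folds
```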
Ensemble Methods
Max Voting
• In this technique, multiple models are used to make predictions for each data point
• The predictions by each model are considered as a ‘vote’
• The predictions which we get from the majority of the models are used as the final prediction
• The max voting method is generally used for classification
Example
Suppose you asked 5 of your friends to rate your movie (out of 5); we’ll assume three of them
rated it as 4 while two of them gave it a 5. Since the majority gave a rating of 4, the final rating
will be taken as 4. You can consider this as taking the mode of all the predictions.
Averaging
Example
Suppose you asked 5 of your friends to rate your movie (out of 5); we’ll assume three of them
rated it as 4 while two of them gave it a 5.
Now, the averaging method would take the average of all the values.
i.e. (5+4+5+4+4)/5 = 4.4
Weighted Average
Example
Suppose you asked 5 of your friends to rate your movie (out of 5); we’ll assume three of them
rated it as 4 while two of them gave it a 5.
Now, if two of your friends are critics, while the others have no prior experience in this field, then
the answers by these two friends are given more importance as compared to the other people.
The result is calculated as [(4*0.23) + (5*0.23) + (4*0.18) + (5*0.18) + (4*0.18)] = 4.41.
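A minimal sketch of the three combination rules from the movie-rating examples above (ratings and weights copied from the example; the first two raters are the critics):

```python
# Max voting (mode), simple averaging, and weighted averaging of the five ratings.
from statistics import mode, mean

ratings = [4, 5, 4, 5, 4]
weights = [0.23, 0.23, 0.18, 0.18, 0.18]      # critics weighted higher, as in the example

print("max voting:", mode(ratings))                                    # 4
print("averaging:", mean(ratings))                                     # 4.4
print("weighted average:", sum(r * w for r, w in zip(ratings, weights)))  # 4.41
```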
Bagging
Given a sample of data, multiple bootstrapped subsamples are pulled. A Decision Tree is formed on each of the
bootstrapped subsamples. After each subsample Decision Tree has been formed, an algorithm is used to
aggregate over the Decision Trees to form the most efficient predictor.
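A minimal sketch, assuming scikit-learn; a Random Forest is one common realization of bagged decision trees (it adds random feature selection on top of bagging):

```python
# Bagging sketch: build a tree on each bootstrapped subsample and aggregate predictions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, random_state=0)
bagged_trees = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(bagged_trees.predict(X[:5]))            # majority vote over the individual trees
```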
Boosting
• The term ‘Boosting’ refers to a family of algorithms which convert weak learners to strong learners
• Boosting is a sequential process, where each subsequent model attempts to correct the errors of the previous
model
• To find weak learners, we apply base learning (ML) algorithms with a different distribution
• Each time base learning algorithm is applied, it generates a new weak prediction rule. This is an iterative
process
• After many iterations, the boosting algorithm combines these weak rules into a single strong prediction
rule
Boosting (contd.)
Step 1: The base learning algorithm is first applied with equal weight given to each observation in the
training data.
Step 2: If there is any prediction error caused by the first base learning algorithm, then we pay higher attention
to the observations having prediction error. Then, we apply the next base learning algorithm.
Step 3: Iterate Step 2 till the limit of the base learning algorithm is reached or higher accuracy is achieved.
Finally, it combines the outputs from weak learner and creates a strong learner which eventually improves
the prediction power of the model. Boosting pays higher focus on examples which are mis-classified or have
higher errors by preceding weak rules.
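A minimal sketch, assuming scikit-learn; AdaBoost is one common boosting algorithm that follows these steps:

```python
# Boosting sketch: AdaBoost fits weak learners sequentially, re-weighting
# mis-classified observations, and combines them into a strong learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=300, random_state=0)
boosted = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print(boosted.score(X, y))                    # accuracy of the combined strong learner
```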
Questions and Answers
Training Details
[Link]
n/create_request#/ticket-form
[Link]
n/create_request#/ticket-form/26919
Please leave your feedback after the session
Contact Number E-mail Website
515-309-7846 info@[Link] [Link]