
Introduction to Machine Learning
Disclaimer

• This presentation, including examples, images, and references, is provided for informational purposes only.

• Complying with all applicable copyright laws is the responsibility of the user.

• Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means.

• Credit is given for images taken from open-source material; they cannot be used for promotional activities.

Trainer Introduction

Agenda

In this session, you will learn about:

• Introduction to Machine Learning


• Machine Learning Modeling Flow
• How to Treat Data in ML
• Parametric & Non-parametric ML Algorithms
• Types of Machine Learning
• Performance Measures
• Bias-Variance Trade-Off
• Overfitting & Underfitting
• Resampling Methods
• Ensemble Methods
Introduction to Machine Learning (ML)
What is Machine Learning?

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

- Tom Mitchell, former Chair of the Machine Learning Department
What is Machine Learning?

Machine learning is a subset of artificial intelligence that often uses statistical techniques to give systems the ability to "learn" from data.

Some considerations when defining a learning problem are:

• Provide a definition of what the learner should learn and the need for learning
• Define the data requirements and the sources of the data
• Define whether the learner should operate on the dataset in its entirety or whether a subset will do
Phases for Performing ML
Machine learning typically follows three phases:

Phase 1 – Training Phase: Here, training data is used to train the model by pairing the given input with the expected output (the learning model).

Phase 2 – Validation and Test Phase: This phase measures the goodness of the learning model that has been trained and also estimates the model properties, such as error measures, sensitivity, specificity, recall, precision, and others. It uses a validation dataset, and the output is a sophisticated learning model.

Phase 3 – Application Phase: In this phase, the model is subjected to the real-world data for which the results need to be derived.
Phases for Performing ML

The accompanying figure (not reproduced here) shows how learning can be applied to predict behaviour.
Why Use Machine Learning?

Advantages
• Often much more accurate than human-crafted rules (since it is data driven)
• Humans are often incapable of expressing what they know (e.g., rules of English, or how to recognize letters), but can easily classify examples
• Don't need a human expert or programmer
• Automatic method to search for hypotheses explaining data
• Cheap and flexible; can apply to any learning task

Disadvantages
• Need a lot of data
• Error prone; usually impossible to get perfect accuracy
Practical Applications of ML

• Spam detection
• Credit card fraud detection
• Digit recognition
• Product recommendation
• Face detection
Complementing Fields of Machine Learning

Four fields complement Machine Learning:
• Data Mining
• Artificial Intelligence
• Statistical Learning
• Data Science


Data Mining

Data mining is the process of sorting through large datasets to identify patterns and
establish relationships to solve problems through data analysis.

The fields of Data Mining and Machine Learning are intertwined and there is a significant
overlap in the underlying principles and methodologies.
Artificial Intelligence

1. Artificial Intelligence focuses on building systems that can reproduce human behaviour.

2. Machine Learning is a subfield of Artificial Intelligence.

3. Machine Learning and Artificial Intelligence employ learning algorithms and focus on automating reasoning and decision making.
Data Science

• Data science is all about turning data into products
• It is analytics and Machine Learning put into action to draw inferences and insights out of data
• Data science is perceived as a step forward from traditional data analysis and knowledge systems, such as Data Warehouses (DW) and Business Intelligence (BI), in that it considers all aspects of big data
• Machine Learning and Data Science have prediction as a common binding outcome, given the problem's context
Statistical Learning
In Statistical Learning, the predictive functions are arrived at, and primarily derived, from samples of data.

• How the data is collected, cleansed, and managed in this process is of great importance
• Statistics is quite close to mathematics, as it is about quantifying data and operating on numbers

Just like Machine Learning, Statistical Learning is also about building the ability to infer from data that, in some cases, represents experience.
Machine Learning Modeling Flow

The flow proceeds through six stages, with the datasets feeding the process:

Business Problem → Data Analysis → Data Preparation → Modeling → Evaluation → Deployment
Machine Learning Modeling Flow (Contd.)

Business Problem: First, define the problem statement and the goal to be achieved, along with the assumptions we have about the data.

Data Analysis: Analyze the data, for example to determine whether it is a regression or a classification problem.

Data Preparation: Check for null values and outliers and clean them.
Machine Learning Modeling Flow (Contd.)

Modeling: Build a model using the training data.

Evaluation: Tweak the model using the test data.

Deployment: Deploy the final version of the model based on its accuracy.
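A minimal sketch of this end-to-end flow in Python using scikit-learn; the library, the built-in dataset, and the parameter choices are illustrative assumptions and are not part of the original slides:

# --- Illustrative Python sketch of the modeling flow (assumes scikit-learn is installed) ---
from sklearn.datasets import load_breast_cancer        # stand-in for the business problem's data
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)             # data analysis: a binary classification problem

# Data preparation: split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=5000)               # modeling: build a model on the training data
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))  # evaluation: check the model on test data
print(f"Test accuracy: {accuracy:.3f}")                 # deployment would follow once accuracy is acceptable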


How to Treat Data in Machine Learning
How to Treat Data in Machine Learning?

• Data forms the main source of learning in machine learning.
• The data referenced here can be in any format, can be received at any frequency, and can be of any size.
• Data in the ML context can be either labeled or unlabeled.
Labeled & Unlabeled Data
• Unlabeled Data consists of samples of natural or human-created artifacts that you can obtain relatively
easily from the world
• Some examples of unlabeled data might include photos, audio recordings, videos, news articles, tweets etc.

• Labeled Data takes a set of unlabeled data and augments each piece of that unlabeled data with some sort
of meaningful "tag," "label," or "class" that is somehow informative or desirable to know
• For example, labels for the above types of unlabeled data might be whether this photo contains a horse or
a cow, which words were uttered in this audio recording, what type of action is being performed in this
video etc.
Statistical Learning Perspective
The statistical perspective frames data in the context of a hypothetical function (f) that the machine learning
algorithm is trying to learn. That is, given some input variables (input), what is the predicted output variable
(output).

Output = f(Input)
• The columns that are the inputs are referred to as input variables
• The column of data we would like to predict for new input data in the future is called the output variable. It is
also called the response variable

Output Variable = f(Input Variables)

X1     X2     Y
2.3    2.6    1
2.1    2.0    0
2.2    2.3    1
Statistical Learning Perspective (contd.)
• The most common type of machine learning is to learn the mapping Y = f(X) to make predictions of Y for
new X. This is called predictive modeling or predictive analytics and our goal is to make the most accurate
predictions possible

• Generally, there is more than one input variable. In that case, the group of input variables is referred to as the Input Vector

Output Variable = f(Input Vector)


For example, a statistics text may refer to the input variables as independent variables and the output variable as the dependent variable. This is because, in the framing of the prediction problem, the output is dependent on (a function of) the input (also called the independent variables).

Dependent Variable = f(Independent Variables)


Y = f(X)

X1     X2     Y
2.3    2.6    1
2.1    2.0    0
2.2    2.3    1
Parametric & Non-parametric Machine Learning Algorithms
Parametric Machine Learning Algorithms

Algorithms that simplify the function to a known form are called Parametric
Machine Learning Algorithms

“A learning model that summarizes data with a set of parameters of fixed


size (independent of the number of training examples) is called a
parametric model. No matter how much data you throw at a parametric
model, it won't change its mind about how many parameters it needs.”

- Artificial Intelligence: A Modern Approach


Parametric Machine Learning Algorithms
A parametric algorithm involves two steps:
1. Select a form for the function
2. Learn the coefficients for the function from the training data

Consider the case where the assumed functional form of the mapping function is a line, as used in linear regression:

B0 + B1 * X1 + B2 * X2 = 0

where B0, B1 and B2 are the coefficients of the line that control the intercept and slope, and X1 and X2 are two input variables.

• Estimate the coefficients of the line equation and we have a predictive model for the problem.
• Often the assumed functional form is a linear combination of the input variables and as such parametric
machine learning algorithms are often also called linear machine learning algorithms
• The problem is, the actual unknown underlying function may not be a linear function like a line. It could be
almost a line and require some minor transformation of the input data to work right. Or it could be nothing
like a line in which case the assumption is wrong and the approach will produce poor results.
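A minimal sketch of learning the coefficients of such a parametric (linear) model with NumPy; the training data here is made up purely for illustration:

# --- Illustrative sketch: fitting a fixed-size parametric (linear) model with NumPy ---
import numpy as np

# Hypothetical training data: two input variables and one output
X = np.array([[2.3, 2.6],
              [2.1, 2.0],
              [2.2, 2.3],
              [2.5, 2.8]])
y = np.array([1.0, 0.0, 1.0, 1.0])

# Add a column of ones so the model learns the intercept B0 along with B1 and B2
X_design = np.hstack([np.ones((X.shape[0], 1)), X])

# Least-squares estimate of [B0, B1, B2]; the number of parameters stays fixed
# no matter how many training rows are supplied
coefficients, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print("B0, B1, B2 =", coefficients)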
Parametric Machine Learning Algorithms
Benefits of Parametric Machine Learning Algorithms:

• Simpler. These methods are easier to understand, and their results are easier to interpret

• Speed. Parametric models are very fast to learn from data

• Less Data. They do not require as much training data and can work well even if the fit to the data is not
perfect

Limitations of Parametric Machine Learning Algorithms:

• Constrained. By choosing a functional form these methods are highly constrained to the specified form

• Limited Complexity. The methods are more suited to simpler problems

• Poor Fit. In practice the methods are unlikely to match the underlying mapping function
Non-parametric Machine Learning Algorithms

Algorithms that do not make strong assumptions about the form of the mapping function are called non-parametric
machine learning algorithms.

• By not making assumptions, they are free to learn any functional form from the training data

• The method does not assume anything about the form of the mapping function other than that patterns which are close are likely to have a similar output variable

• This method works well when there is a lot of data and you do not want to worry too much about choosing exactly the right features (see the sketch below)
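As an example of a non-parametric method, a minimal k-nearest-neighbours regression sketch with scikit-learn; the data and the value of k are illustrative assumptions:

# --- Illustrative sketch: a non-parametric model (k-nearest neighbours) ---
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical training data: the model keeps the data itself rather than a fixed set of coefficients
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2])

# Prediction is based on the outputs of the closest training points
model = KNeighborsRegressor(n_neighbors=2)
model.fit(X, y)
print(model.predict([[2.5]]))   # averages the outputs of the two nearest neighbours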
Non-parametric Machine Learning Algorithms
Benefits of Non-parametric Machine Learning Algorithms:

• Flexibility. Capable of fitting a large number of functional forms

• Power. No assumptions (or weak assumptions) about the underlying function

• Performance. Can result in higher performance models for prediction

Limitations of Non-parametric Machine Learning Algorithms:

• More data. Require a lot more training data to estimate the mapping function

• Slower. A lot slower to train as they often have far more parameters to train

• Overfitting. There is more of a risk of overfitting the training data, and it is harder to explain why specific predictions are made
Types of Machine Learning
Types of Machine Learning
There are three types of Machine Learning:

Supervised Learning: All data is labeled and the algorithms learn to predict the output from the input data.

Unsupervised Learning: All data is unlabeled and the algorithms learn the inherent structure from the input data.

Semi-supervised Learning: Some data is labeled but most of it is unlabeled, and a mixture of supervised and unsupervised techniques can be used.
Supervised Learning

Supervised learning methods fall into two groups:

Regression
• Linear Regression
• Decision Tree (RPART)

Classification
• Logistic Regression
• Support Vector Machines
• Naïve Bayes
• Decision Tree (CART, C5.0)
Supervised Learning : Regression

Linear Regression

• Linear regression models are often fitted using the least squares approach
• It predicts only continuous variables
• When there is a single input variable (x), the method is referred to as simple linear regression
• When there are multiple input variables, the method is referred to as multiple linear regression
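A minimal scikit-learn sketch of simple linear regression fitted by least squares; the data is made up for illustration:

# --- Illustrative sketch: simple linear regression via least squares ---
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: one input variable (simple linear regression)
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 3.9, 6.2, 7.8])           # continuous target

model = LinearRegression().fit(X, y)
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("prediction for x=5:", model.predict([[5.0]])[0])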
Supervised Learning : Classification

Logistic Regression

• Logistic Regression is used to predict the probability of an instance belonging to the default class
• Logistic Regression is a statistical method for analyzing a dataset in which there are one or more independent variables that determine an outcome
• The outcome is measured with a dichotomous (binary) variable
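A minimal sketch of logistic regression predicting class probabilities with scikit-learn; the data is an illustrative assumption:

# --- Illustrative sketch: logistic regression predicting the probability of the default class ---
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: one independent variable, dichotomous outcome (0/1)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)
print(model.predict_proba([[3.5]]))   # [P(class 0), P(class 1)] for a new instance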
Supervised Learning : Classification

Support Vector Machine

1. A supervised Machine Learning algorithm which can be used for both classification and regression challenges.

2. It uses a technique called the kernel trick to transform the data and then, based on these transformations, finds an optimal boundary between the possible outputs.

3. The goal of SVM is to find the hyperplane that separates the two classes.
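A minimal scikit-learn sketch of an SVM classifier using a non-linear kernel; the RBF kernel and the toy data are illustrative choices, not prescribed by the slides:

# --- Illustrative sketch: SVM classification with the kernel trick ---
import numpy as np
from sklearn.svm import SVC

# Hypothetical 2-D data from two classes
X = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

# The RBF kernel implicitly transforms the data; SVC then finds a separating boundary
model = SVC(kernel="rbf", C=1.0)
model.fit(X, y)
print(model.predict([[2, 2], [6, 5]]))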
Supervised Learning : Classification
Support Vector Machine

• Multiple hyperplanes can be defined to classify the data
• Using the PLA (Perceptron Learning Algorithm), start with a random hyperplane and use it to classify the data
• Pick a misclassified example and select another hyperplane by updating the weights, hoping it will work better at classifying this example (this is called the update rule)
• Classify the data with this new hyperplane
• Repeat the steps (a sketch of this loop follows)
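A minimal NumPy sketch of the perceptron update rule described above; the data, the iteration cap, and the labels are illustrative assumptions:

# --- Illustrative sketch: Perceptron Learning Algorithm (PLA) update rule ---
import numpy as np

# Hypothetical linearly separable data with labels -1 / +1
X = np.array([[2.0, 1.0], [1.0, 3.0], [4.0, 5.0], [5.0, 4.0]])
y = np.array([-1, -1, 1, 1])

w = np.zeros(X.shape[1])   # start with an arbitrary hyperplane (weights)
b = 0.0

for _ in range(100):                       # repeat the steps (bounded for safety)
    misclassified = False
    for xi, yi in zip(X, y):
        if yi * (np.dot(w, xi) + b) <= 0:  # pick a misclassified example
            w += yi * xi                   # update rule: nudge the hyperplane toward it
            b += yi
            misclassified = True
    if not misclassified:                  # every point is classified correctly
        break

print("weights:", w, "bias:", b)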
Supervised Learning : Classification
Naïve Bayes

• The Naïve Bayes classifier uses Bayes' Theorem
• It assumes that all the features are unrelated to each other
• The presence or absence of a feature does not influence the presence or absence of any other feature
• E.g., spam filtering is the best-known use of Naïve Bayesian text classification
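A minimal scikit-learn sketch of Naïve Bayes text classification in the spirit of spam filtering; the tiny corpus and its labels are made up for illustration:

# --- Illustrative sketch: Naive Bayes for spam-style text classification ---
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training messages and labels (1 = spam, 0 = not spam)
messages = ["win a free prize now", "meeting at noon tomorrow",
            "free offer click now", "lunch with the team"]
labels = [1, 0, 1, 0]

vectorizer = CountVectorizer()               # word counts as features
X = vectorizer.fit_transform(messages)

model = MultinomialNB().fit(X, labels)       # assumes features are conditionally independent
print(model.predict(vectorizer.transform(["free prize tomorrow"])))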
Supervised Learning : Classification
Decision Tree

• Learning from examples
• Best known algorithms: C5.0 and CART
• Very fast to train and evaluate
• Relatively easy to interpret
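A minimal scikit-learn sketch of a CART-style decision tree classifier (scikit-learn's tree is an optimised CART variant); the feature names and data are illustrative assumptions:

# --- Illustrative sketch: a CART-style decision tree ---
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical examples: [hours_studied, hours_slept] -> pass (1) / fail (0)
X = [[1, 4], [2, 8], [8, 7], [9, 6], [3, 5], [7, 8]]
y = [0, 0, 1, 1, 0, 1]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["hours_studied", "hours_slept"]))  # readable rules
print(tree.predict([[6, 6]]))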
Supervised Learning : Classification
Artificial Neural Networks
• Artificial Neural Networks are a class of pattern matching techniques
• These methods are used for regression, classification, image recognition, and sequential data
• They relate to Deep Learning models and have many subfields of algorithms that help solve specific problems in context
Supervised Learning : Classification

Methods in Artificial Neural Networks

Learning Vector Quantization (LVQ): an algorithm that is a type of Artificial Neural Network and uses neural computation.

Self-Organizing Maps (SOM): used to transform an incoming signal pattern of arbitrary dimension into a one- or two-dimensional discrete map, and to perform this transformation adaptively in a topologically ordered fashion.
Supervised Learning : Instance Based Learning

• Instance Based Learning is sometimes called memory-based learning

• It explicitly compares new problem instances with instances seen in training, which have been stored in memory

• Instance Based Learning models work on a group of instances that are critical to the problem

• The results across instances are compared, which can include an instance of new data as well
Supervised Learning : Instance Based Learning

K-Nearest Neighbour
• K-Nearest Neighbour is a simple algorithm that stores all available cases and predicts the target based on a similarity measure
• It is a highly effective inductive inference method for noisy training data and complex target functions
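A minimal scikit-learn sketch of a k-nearest-neighbour classifier predicting from stored cases; the data and the choice of k are illustrative:

# --- Illustrative sketch: k-nearest-neighbour classification ---
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stored cases: [feature1, feature2] -> class label
X = [[1, 1], [1, 2], [2, 2], [8, 8], [8, 9], [9, 8]]
y = ["A", "A", "A", "B", "B", "B"]

# The model simply memorises the cases; prediction uses the 3 most similar ones
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[2, 1], [9, 9]]))   # -> ['A' 'B']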
Unsupervised Learning

Unsupervised learning methods fall into two groups:

Clustering
• K-means Clustering

Association
• Apriori Rules
Unsupervised Learning : Clustering

K-Means Clustering
• Clustering-based learning is identified as an unsupervised learning task
• The goal of clustering techniques is to find similar groups in the data
• Clustering is the assignment of a set of observations into subsets so that observations in the same cluster are similar in some sense
Unsupervised Learning : Clustering

K-Means Clustering
• Used for data discovery and understanding the underlying structure of data
• Useful for unlabeled data as a first round of analysis
• Makes no assumptions about the data
• The target number of clusters is given manually
• Utilizes a distance metric and a clustering algorithm
• The k-means algorithm partitions the given data into k clusters, with each cluster having a center called a centroid
• k is a positive integer
• The purpose of clustering is to classify the data
• The algorithm works iteratively to assign each data point to one of the k groups based on the features that are provided
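A minimal scikit-learn sketch of k-means partitioning data into k clusters around centroids; the toy data and k=2 are illustrative assumptions:

# --- Illustrative sketch: k-means clustering ---
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical unlabeled 2-D observations
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster assignments:", kmeans.labels_)
print("centroids:", kmeans.cluster_centers_)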
Unsupervised Learning : Association Rule Analysis
An association rule problem is one where you want to discover rules that describe large portions of your data, such as "people that buy X also tend to buy Y".

Market Basket Analysis is one of the applications of Association Rule Analysis in the retail industry.

Apriori Algorithm
• A classical algorithm in data mining used for mining frequent item sets and relevant association rules
• Apriori uses a "bottom up" approach, where frequent subsets are extended one item at a time
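A minimal, hand-rolled sketch of the bottom-up counting idea behind Apriori; this is a simplified illustration (not a full Apriori implementation), and the toy transactions are made up:

# --- Illustrative sketch: counting frequent itemsets, Apriori-style (simplified) ---
from itertools import combinations
from collections import Counter

# Hypothetical market-basket transactions
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
]
min_support = 2   # an itemset is "frequent" if it appears in at least 2 baskets

# Bottom-up: count single items, then extend to pairs (a full Apriori would prune and continue)
for size in (1, 2):
    counts = Counter(frozenset(c) for t in transactions for c in combinations(sorted(t), size))
    frequent = {items: n for items, n in counts.items() if n >= min_support}
    print(f"frequent itemsets of size {size}: {frequent}")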
Performance Measures
Performance Measure - Regression

Performance measures for regression are used to evaluate learning algorithms and form an important aspect of machine learning. In some cases, these measures are also used as heuristics to build learning models.

01  Mean Squared Error (MSE)
02  Mean Absolute Error (MAE)
03  Root Mean Square Error (RMSE)
Performance Measure - Regression

Mean Absolute Error

• As the name suggests, the mean absolute error is an average of the absolute errors.
• Mean Absolute Error (MAE) is a measure of difference between two continuous
variables.
• The MAE is more intuitive and less sensitive to outliers.
Performance Measure - Regression

Mean Squared Error

• The mean squared error (MSE) measures the average of the squares of the errors, that is, of the differences between the estimated values and what is actually observed.
• It is always non-negative, and values closer to zero are better.
• The MSE is a good performance metric as it has more statistical grounding with
variance.
Performance Measure - Regression

Root Mean Square Error

A frequently used measure of the differences between the values predicted by a model (Pi) and the values actually observed (Oi), where n is the number of observations:

RMSE = sqrt( (1/n) * Σ (Pi - Oi)^2 )

The lower the RMSE, the better the model fits the data.
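A minimal NumPy/scikit-learn sketch computing MAE, MSE and RMSE; the observed and predicted values are made up for illustration:

# --- Illustrative sketch: regression performance measures ---
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

observed  = np.array([3.0, 5.0, 2.5, 7.0])   # Oi: actual values (hypothetical)
predicted = np.array([2.5, 5.0, 4.0, 8.0])   # Pi: model predictions (hypothetical)

mae  = mean_absolute_error(observed, predicted)
mse  = mean_squared_error(observed, predicted)
rmse = np.sqrt(mse)
print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}")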


Performance Measure - Classification

In a classification problem, you can represent the errors using a Confusion Matrix.

Consider 10,000 customer records used to predict which customers are likely to respond to a marketing effort:

                          Predicted (buy)    Predicted (no buy)
Actually bought           TP: 500            FN: 100
Actually did not buy      FP: 400            TN: 9000


Performance Measure - Classification

Accuracy

Accuracy is the percentage of predictions that were correct.

Accuracy = (TP + TN) / (TP + FP + FN + TN)

Example: Here, the correct predictions are TN = 9000 and TP = 500, so the accuracy is (500 + 9000) out of 10,000 = 95%.


Performance Measure - Classification

Recall

Recall is the percentage of positive cases that you were able to catch.
• Recall is also known as 'Sensitivity'
• Sensitivity measures the proportion of actual positives that are correctly identified as such (e.g., the percentage of sick people who are correctly identified as having the condition)

Recall = TP / (TP + FN)

Example: Here, TP = 500 and FN = 100, so the recall is 500 out of 600 = 83.33%.


Performance Measure - Classification

Precision

Precision is the percentage of positive predictions that were correct.
• Precision is also known as 'Positive Predictive Value' (PPV)
• It is the fraction of relevant instances among the retrieved instances

Precision = TP / (TP + FP)

Example: Here, TP = 500 and FP = 400, so the precision is 500 out of 900 = 55.55%.


Performance Measure - Classification

Sensitivity
Sensitivity (also called the true positive rate or recall) measures the proportion of
actual positives that are correctly identified as such (e.g., the percentage of sick people who are correctly identified as
having the condition).

Sensitivity = TP / (TP + FN)

Specificity
Specificity (also called the true negative rate) measures the proportion of actual negatives that are correctly
identified as such (e.g., the percentage of healthy people who are correctly identified as not having the condition).

Specificity = TN / (TN + FP)
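A minimal Python sketch computing these classification measures from the confusion-matrix counts used in the example above:

# --- Illustrative sketch: classification measures from the example confusion matrix ---
TP, FN = 500, 100     # actually bought: predicted buy / predicted no buy
FP, TN = 400, 9000    # actually did not buy: predicted buy / predicted no buy

accuracy    = (TP + TN) / (TP + FP + FN + TN)
recall      = TP / (TP + FN)          # sensitivity / true positive rate
precision   = TP / (TP + FP)          # positive predictive value
specificity = TN / (TN + FP)          # true negative rate

print(f"accuracy={accuracy:.2%}  recall={recall:.2%}  "
      f"precision={precision:.2%}  specificity={specificity:.2%}")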


Receiver Operating Characteristics (ROC) Curve

• It shows the trade-off between sensitivity and specificity

• The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test

• The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test

• The area under the curve (AUC) is a measure of test accuracy
Sensitivity = TP / (TP + FN)

Specificity = TN / (TN + FP)

Positive predictive value (PPV) = TP / (TP + FP)

Negative predictive value (NPV) = TN / (TN + FN)
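A minimal scikit-learn sketch computing an ROC curve and its area; the labels and predicted scores are hypothetical:

# --- Illustrative sketch: ROC curve and AUC ---
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and predicted probabilities for the positive class
y_true  = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3]

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # fpr = 1 - specificity, tpr = sensitivity
print("false positive rates:", fpr)
print("true positive rates: ", tpr)
print("area under the curve:", roc_auc_score(y_true, y_score))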


How to Measure Purity?

Entropy is a measure of impurity: it is zero for a perfectly pure set of labels and largest when the classes are evenly mixed.
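A minimal NumPy sketch computing the entropy of a set of class labels; the label lists are illustrative:

# --- Illustrative sketch: entropy as a purity/impurity measure ---
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(entropy(["yes", "yes", "yes", "yes"]))   # 0.0 -> perfectly pure
print(entropy(["yes", "yes", "no", "no"]))     # 1.0 -> maximally impure for two classes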
Bias–Variance Trade-Off
A person with high bias is someone who starts to answer before you can even finish asking. A
person with high variance is someone who can think of all sorts of crazy answers. Combining
these gives you different personalities:

- High bias/low variance: this is someone who usually gives you the same answer, no matter
what you ask, and is usually wrong about it;

- High bias/high variance: someone who takes wild guesses, all of which are sort of wrong;

- Low bias/high variance: a person who listens to you and tries to answer the best they can, but who daydreams a lot and may say something totally crazy;

- Low bias/low variance: a person who listens to you very carefully and gives you good
answers pretty much all the time.
Bias Error

• Bias consists of the simplifying assumptions made by a model to make the target function easier to learn
• Generally, parametric algorithms have a high bias, making them fast to learn and easier to understand

• Low Bias: Suggests less assumptions about the form of the target function

Examples of low-bias machine learning algorithms include: Decision Trees, k-Nearest Neighbours and Support
Vector Machines.

• High-Bias: Suggests more assumptions about the form of the target function

Examples of high-bias machine learning algorithms include: Linear Regression, Linear Discriminant Analysis
and Logistic Regression.
Variance Error

• Variance is the amount by which the estimate of the target function would change if different training data were used
• The target function is estimated from the training data by a machine learning algorithm, so we should
expect the algorithm to have some variance

• Low Variance: Suggests small changes to the estimate of the target function with changes to the training
dataset.

Examples of low-variance machine learning algorithms include: Linear


Regression, Linear Discriminant Analysis and Logistic Regression.

• High Variance: Suggests large changes to the estimate of the target function with changes to the training
dataset.

Examples of high-variance machine learning algorithms include: Decision Trees, k-Nearest Neighbors and
Support Vector Machines.
Bias-Variance Trade-Off

We can divide the received data into three parts:

1. Training set: The training set is typically 60% of the data. As the name suggests,
this is used for training a machine learning model.

2. Validation set: The validation set is also called the development set. This is typically 20% of the data. This set is not used during training; it is used to test the quality of the trained model.

3. Test set: This set is typically 20% of the data. Its only purpose is to report the
accuracy of the final model.
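A minimal scikit-learn sketch of a 60/20/20 split into training, validation and test sets; the two-step split and the placeholder data are illustrative assumptions:

# --- Illustrative sketch: 60/20/20 train / validation / test split ---
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)   # hypothetical data: 50 rows, 2 features
y = np.arange(50)

# First carve off 40% for validation + test, then split that 40% in half
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # -> 30 10 10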
Bias-Variance Trade-Off (contd.)

The prediction error for any machine learning algorithm can be broken down into three parts:

Bias Error Variance Error Irreducible Error

• Even from a perfect model, we might not be able to remove the errors completely made by a learning
algorithm
• This is because the training data itself may contain noise. This error is called Irreducible error or Bayes’
error rate or the Optimum Error rate
• While we cannot do anything about the Irreducible Error, we can reduce the errors due to bias and
variance

Total Error = Bias + Variance + Irreducible Error
Bias-Variance Trade-Off (contd.)

If the model is not performing well, it is usually a high bias or a high variance problem. The figure below
graphically shows the effect of model complexity on error due to bias and variance.

As you increase the complexity of your model, you will see a reduction in error due to lower bias in the model. However, this only happens up to a particular point. As you continue to make your model more complex, you end up over-fitting it, and hence your model will start suffering from high variance.
Bias-Variance Trade-Off (contd.)
The trade-off is the tension between the error introduced by bias and the error introduced by variance.

A champion model should maintain a balance between these two types of errors. This is known as the trade-off management of bias-variance errors. Ensemble learning is one way to execute this trade-off analysis.
Overfitting and Underfitting
Data Inconsistencies in Machine Learning

Several data inconsistencies may be encountered while implementing Machine Learning projects, such as:

• Under-fitting
• Over-fitting
• Unpredictable data formats
• Data instability

There are some established processes in place today to address these inconsistencies.
Unpredictable Data Formats

• Machine learning is meant to work with new data constantly coming into the system and learning from that data.

• Complexity will creep in when the new data entering the system comes in formats that are not supported by the machine learning system.

• It is then difficult to say whether our models work well for the new data, given the instability of the formats in which we receive the data, unless there is a mechanism built to handle this.
Under Fitting

A model is said to be under-fitting when it doesn't take into consideration enough information to accurately model the
actual data.
Over Fitting

• This case is just the opposite of the under-fitting case. While too small a sample is not appropriate to define
an optimal solution, a large dataset also runs the risk of having the model over-fit the data
• It usually occurs when the statistical model describes noise instead of describing the relationships
Data Instability

1. Machine Learning algorithms are usually robust to noise within the data.

2. A problem will occur if the outliers are due to manual error or misinterpretation of the relevant data.

3. This will result in a skewing of the data, which will ultimately end up in an incorrect model.

4. There is a strong need for a process to correct or handle human errors that can result in building an incorrect model.
Resampling Methods
Bootstrapping

The bootstrap method is a statistical technique for estimating quantities about a population by averaging estimates
from multiple small data samples.

• The basic idea of bootstrapping is that inference about a population from sample data (sample →
population) can be modelled by resampling the sample data and performing inference about a sample from
resampled data (resampled → sample)

• As the population is unknown, the true error in a sample statistic against its population value is unknown

• In bootstrap-resamples, the 'population' is in fact the sample, and this is known; hence the quality of
inference of the 'true' sample from resampled data
(resampled → sample) is measurable
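A minimal NumPy sketch of bootstrapping: estimating the standard error of a sample mean by averaging over resamples; the sample values and resample count are illustrative assumptions:

# --- Illustrative sketch: bootstrap estimate of the standard error of the mean ---
import numpy as np

rng = np.random.default_rng(0)
sample = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.2, 4.4])   # hypothetical observed sample

boot_means = []
for _ in range(1000):
    # Resample the sample with replacement (the sample plays the role of the "population")
    resample = rng.choice(sample, size=len(sample), replace=True)
    boot_means.append(resample.mean())

print("bootstrap estimate of the mean:          ", np.mean(boot_means))
print("bootstrap estimate of its standard error:", np.std(boot_means))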
Bootstrapping - Advantages & Disadvantages

Advantages

• A great advantage of bootstrap is its simplicity
• It is a straightforward way to derive estimates of standard errors and confidence intervals for complex estimators of complex parameters of the distribution, such as percentile points, proportions, odds ratios, and correlation coefficients
• Bootstrap is also an appropriate way to control and check the stability of the results

Disadvantages

• Although bootstrapping is (under some conditions) asymptotically consistent, it does not provide general finite-sample guarantees
• The apparent simplicity may conceal the fact that important assumptions are being made when undertaking the bootstrap analysis (e.g., independence of samples)
Jackknife

Jackknife, which is similar to bootstrapping, is used to estimate the bias and standard
error (variance) of a statistic, when a random sample of observations is used
to calculate it.

• Jackknife is a resampling technique especially useful for variance and bias estimation

• The jackknife estimator of a parameter is found by systematically leaving out each observation from a dataset
and calculating the estimate and then finding the average of these calculations

• Given a sample of size n, the jackknife estimate is found by aggregating the estimates of each (n-1) -sized sub-
sample
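A minimal NumPy sketch of the jackknife: leave each observation out in turn, recompute the statistic, and aggregate; the sample values are illustrative, and the standard-error formula is the usual jackknife estimate:

# --- Illustrative sketch: jackknife estimate of a statistic (the mean) and its standard error ---
import numpy as np

sample = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.2, 4.4])   # hypothetical sample
n = len(sample)

# Leave one observation out at a time and recompute the estimate on the remaining n-1 values
loo_means = np.array([np.delete(sample, i).mean() for i in range(n)])

jackknife_estimate = loo_means.mean()
jackknife_se = np.sqrt((n - 1) / n * np.sum((loo_means - loo_means.mean()) ** 2))
print("jackknife estimate:", jackknife_estimate)
print("jackknife standard error:", jackknife_se)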
Cross Validation
Cross-validation is used to estimate the skill of machine learning models. It is used in applied machine learning to compare and select a model for a given predictive modeling problem.

• Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data
sample

• Common types of Cross validation:

• LpO CV – Leave-p-out cross-validation

• LOOCV - Leave-one-out cross-validation

• k-fold Cross Validation


Leave-p-Out Cross-Validation (LpO CV)

LpO CV involves using p observations as the validation set and the remaining observations as the training set.

• This is repeated over all the ways to cut the original sample into a validation set of p observations and a training set

• LpO cross-validation requires training and validating the model C(n, p) times, where n is the number of observations in the original sample and C(n, p) is the binomial coefficient

• For p > 1 and for even moderately large n, LpO CV can become computationally infeasible
Leave-One-Out Cross-Validation (LOOCV)

Leave-one-out cross-validation (LOOCV) is a particular case of leave-p-out cross-validation with p = 1

• The process looks similar to the jackknife; however, with cross-validation one computes a statistic on the left-out sample(s), while with jackknifing one computes a statistic from the kept samples only

• LOO cross-validation requires less computation time than LpO cross-validation because there are only C(n, 1) = n passes rather than C(n, p)

• However, n passes may still require quite a large computation time, in which case other approaches such as k-fold cross-validation may be more appropriate.
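A minimal scikit-learn sketch of leave-one-out cross-validation; the model and built-in dataset are illustrative choices:

# --- Illustrative sketch: leave-one-out cross-validation (LOOCV) ---
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# One pass per observation: n fits in total
scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("number of passes:", len(scores))          # equals the number of observations
print("LOOCV accuracy:  ", scores.mean())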
k-fold Cross-Validation

In k-fold cross-validation, the original sample is randomly partitioned into k equal sized subsamples

• Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the
remaining k − 1 subsamples are used as training data

• The cross-validation process is then repeated k times, with each of the k subsamples used exactly once as
the validation data

• The k results can then be averaged to produce a single estimation

• The advantage of this method over repeated random sub-sampling is that all observations are used for both
training and validation, and each observation is used for validation exactly once

• 10-fold cross-validation is commonly used, but in general k remains an unfixed parameter


k-fold Cross-Validation (contd.)

The k value must be chosen carefully for your data sample. A poorly chosen value for k may result in a misrepresentative idea of the skill of the model, such as a score with a high variance or a high bias.

Common tactics for choosing a value for k are as follows:

• Choose k so that it divides the number of rows in your training dataset evenly, to ensure each of the k groups has the same number of rows

• Choose a value for k that splits the data into groups with enough rows that each group is still representative of the original dataset

• A good default is k = 3 for a small dataset or k = 10 for a larger dataset

• A quick way to check whether the fold sizes are representative is to calculate summary statistics such as the mean and standard deviation and see how much the values differ from the same statistics on the whole dataset
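A minimal scikit-learn sketch of 10-fold cross-validation; the model, dataset and k = 10 are illustrative assumptions:

# --- Illustrative sketch: k-fold cross-validation ---
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

kfold = KFold(n_splits=10, shuffle=True, random_state=1)   # each observation is validated exactly once
scores = cross_val_score(DecisionTreeClassifier(random_state=1), X, y, cv=kfold)

print("per-fold accuracy:", scores)
print("mean accuracy:    ", scores.mean())   # single averaged estimate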
Ensemble Methods
Max Voting
• In this technique, multiple models are used to make predictions for each data point
• The predictions by each model are considered as a ‘vote’
• The predictions which we get from the majority of the models are used as the final prediction
• The max voting method is generally used for classification

Example

Suppose you asked five of your friends to rate your movie (out of 5); we'll assume three of them rated it as 4 while two of them gave it a 5. Since the majority gave a rating of 4, the final rating will be taken as 4. You can consider this as taking the mode of all the predictions.

Friend 1 Friend 2 Friend 3 Friend 4 Friend 5 Final Rating


4 5 4 5 4 4
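A minimal Python sketch of max voting, taking the mode of the predictions; the ratings follow the example above:

# --- Illustrative sketch: max voting (take the mode of the predictions) ---
from collections import Counter

predictions = [4, 5, 4, 5, 4]                            # ratings from the five friends
final_rating, votes = Counter(predictions).most_common(1)[0]
print(f"final rating: {final_rating} ({votes} votes)")   # -> 4 (3 votes)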
Averaging
• In this technique, multiple models make a prediction for each data point
• Take the average of the predictions from all the models and use it to make the final prediction
• Averaging can be used for making predictions in regression problems or while calculating probabilities for
classification problems

Example

Suppose you asked 5 of your Friends to rate your movie (out of 5); we’ll assume three of them
rated it as 4 while two of them gave it a 5.
Now, the averaging method would take the average of all the values.
i.e. (5+4+5+4+4)/5 = 4.4

Friend 1 Friend 2 Friend 3 Friend 4 Friend 5 Final Rating


4 5 4 5 4 4.4
Weighted Average
• This is an extension of the averaging method
• All models are assigned different weights defining the importance of each model for prediction

Example

Suppose you asked 5 of your Friends to rate your movie (out of 5); we’ll assume three of them
rated it as 4 while two of them gave it a 5.
Now, if two of your Friends are critics, while others have no prior experience in this field, then
the answers by these two friends are given more importance as compared to the other people.
The result is calculated as [(4*0.23) + (5*0.23) + (4*0.18) + (5*0.18) + (4*0.18)] = 4.41.

Friend 1 Friend 2 Friend 3 Friend 4 Friend 5 Final Rating

Weight 0.23 0.23 0.18 0.18 0.18


Rating 4 5 4 5 4 4.41
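A minimal NumPy sketch of the averaging and weighted-averaging calculations from the two examples above:

# --- Illustrative sketch: simple and weighted averaging of predictions ---
import numpy as np

ratings = np.array([4, 5, 4, 5, 4])
weights = np.array([0.23, 0.23, 0.18, 0.18, 0.18])   # the two critics get higher weights

print("simple average:  ", ratings.mean())                        # -> 4.4
print("weighted average:", np.average(ratings, weights=weights))  # -> 4.41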
Bootstrap Aggregation (Bagging)
• Bootstrap Aggregation or Bagging tries to implement similar learners on small sample populations and then
takes a mean of all the predictions
• It combines Bootstrapping and Aggregation to form one ensemble model
• Reduces the variance error and helps to avoid overfitting

Given a sample of data, multiple bootstrapped subsamples are pulled. A Decision Tree is formed on each of the
bootstrapped subsamples. After each subsample Decision Tree has been formed, an algorithm is used to
aggregate over the Decision Trees to form the most efficient predictor.
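A minimal scikit-learn sketch of bagging decision trees over bootstrapped subsamples; the dataset and settings are illustrative assumptions:

# --- Illustrative sketch: bootstrap aggregation (bagging) of decision trees ---
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Each of the 50 learners is trained on a bootstrapped subsample; predictions are aggregated
bagging = BaggingClassifier(n_estimators=50, random_state=0)   # the default base learner is a decision tree
print("bagged accuracy:", cross_val_score(bagging, X, y, cv=5).mean())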
Boosting
• The term ‘Boosting’ refers to a family of algorithms which converts weak learner to strong learners
• Boosting is a sequential process, where each subsequent model attempts to correct the errors of the previous
model

How does boosting identify weak learners?

• To find weak learners, we apply base learning (ML) algorithms with a different distribution each time

• Each time a base learning algorithm is applied, it generates a new weak prediction rule; this is an iterative process

• After many iterations, the boosting algorithm combines these weak rules into a single strong prediction rule
Boosting (contd.)

How do we choose a different distribution for each round?

Step 1: The base learner takes all the distributions and assigns equal weight or attention to each observation.

Step 2: If there are prediction errors caused by the first base learning algorithm, we pay higher attention to the observations with prediction errors. Then, we apply the next base learning algorithm.

Step 3: Iterate Step 2 until the limit of the base learning algorithm is reached or a higher accuracy is achieved.

Finally, boosting combines the outputs from the weak learners and creates a strong learner, which eventually improves the prediction power of the model. Boosting pays more attention to examples which are misclassified or have higher errors under the preceding weak rules.
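A minimal scikit-learn sketch of boosting with AdaBoost, which reweights misclassified examples between rounds; the dataset and settings are illustrative assumptions:

# --- Illustrative sketch: boosting weak learners with AdaBoost ---
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Each round fits a weak learner (a shallow tree by default) on a reweighted
# version of the data, paying more attention to previously misclassified examples
boosted = AdaBoostClassifier(n_estimators=100, random_state=0)
print("boosted accuracy:", cross_val_score(boosted, X, y, cv=5).mean())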
Questions and Answers

Training Details

Upcoming Batches: [Link]
Technical Support: [Link]
Blog Access: [Link]
Placement Assistance: [Link]

Copyright 2017, © ZaranTech LLC. All rights reserved
Please leave your feedback after the session.

Contact Number: 515-309-7846 | E-mail: info@[Link] | Website: [Link]
