LECTURE: INTRODUCTION TO RANDOM FOREST AND GRADIENT BOOSTING METHODS
- Presented by Shreyas S.K
30-03-2019
AD RESEARCH GROUP
WHAT IS MACHINE LEARNING ABOUT??
APPLICATIONS OF MACHINE LEARNING
ANATOMY OF DECISION TREE
• Trees that predict categorical outcomes are called decision trees
• At each node, a certain set of rules must be satisfied
• The output at each node is a Boolean (True/False)
• Splitting is the process of dividing a node into two or more sub-nodes
• The root node represents the entire population
• When a sub-node splits into further sub-nodes, it is a decision node
• Nodes that do not split are called terminal nodes (leaf nodes)
[Diagram: example tree labelled with its root node, decision nodes and leaf nodes]
Decision tree for a regression dataset
X[i] :- Input variables in the dataset
MSE :- Mean squared error of all samples in a node
Samples :- Total number of samples in a node
Value :- Average value of the output variable over all samples in a node
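A tree like the one in the figure can be reproduced with any standard decision-tree library; below is a minimal sketch, assuming scikit-learn (the slide names no library) and purely illustrative data, where the printed nodes report the same split rules, sample counts and leaf values described above.

    # Minimal sketch: fit and inspect a small regression tree.
    # The data below is illustrative only, not the dataset shown in the figure.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor, export_text

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(100, 2))                          # two input variables X[0], X[1]
    y = 3.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.5, 100)

    tree = DecisionTreeRegressor(max_depth=2).fit(X, y)

    # Each internal node prints its split rule; each leaf prints the average
    # target value of the samples that reach it (the "Value" above).
    print(export_text(tree, feature_names=["X[0]", "X[1]"]))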
DECISION TREES FOR CLASSIFICATION
Predict whether or not to play tennis based on
Temperature, Humidity, Wind and Outlook
• A good decision tree is one that makes correct predictions on unseen data
• The split at each node is made based on the Gini score
• The best split is the one that yields the lowest Gini score (a sketch follows below)
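As a concrete illustration, the following sketch scores a candidate split by the weighted Gini impurity of its two child nodes; the class counts are made up for the play-tennis setting and only serve to show the computation.

    # Sketch: weighted Gini impurity of a candidate split (class counts made up).
    def gini(counts):
        """Gini impurity of a node given its class counts, e.g. [play, no_play]."""
        total = sum(counts)
        return 1.0 - sum((c / total) ** 2 for c in counts)

    def split_gini(left_counts, right_counts):
        """Sample-weighted average Gini impurity of the two child nodes."""
        n_left, n_right = sum(left_counts), sum(right_counts)
        n = n_left + n_right
        return (n_left / n) * gini(left_counts) + (n_right / n) * gini(right_counts)

    # Candidate split "Outlook == Sunny": 2 play / 3 no-play go left, 7 / 2 go right.
    print(split_gini([2, 3], [7, 2]))   # lower is better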
DECISION TREE FOR REGRESSION
• Regression trees predict continuous values
• The value at a leaf is the average of all samples in that leaf
• The best split at each node is chosen by MSE or by the weighted average of the child-node standard deviations
Predict the average precipitation based on the
Slope and Elevation of the Himalayan region
BEST SPLIT BASED ON STANDARD DEVIATION
[Figure: weighted standard deviation of candidate splits]
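The criterion in the figure can be computed directly; here is a short sketch with illustrative precipitation values, where each candidate split is scored by the sample-weighted average of the child-node standard deviations and the split with the largest reduction from the parent's standard deviation is preferred.

    # Sketch: weighted standard deviation of a candidate split (values illustrative).
    import numpy as np

    def weighted_std(left, right):
        """Sample-weighted average of the child-node standard deviations."""
        n = len(left) + len(right)
        return (len(left) / n) * np.std(left) + (len(right) / n) * np.std(right)

    parent = np.array([20.0, 25.0, 30.0, 80.0, 90.0, 100.0])   # e.g. precipitation values
    left, right = parent[:3], parent[3:]                        # one candidate split

    reduction = np.std(parent) - weighted_std(left, right)
    print(f"std reduction for this split: {reduction:.2f}")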
HOW LONG TO KEEP SPLITTING??..
• Until:
• Leaf nodes are pure – Only one class remains
• A maximum depth is reached
• A performance metric is achieved
• Problem:
• Decision trees tend to overfit
• Small changes in the data greatly affect the prediction
• Solution:
• Prune the trees
• Restrict the tree from growing to its fullest
• Maintain a minimum number of samples in leaf nodes (see the sketch below)
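These stopping and pruning controls correspond directly to tree hyperparameters; a minimal sketch, assuming scikit-learn as the implementation and with illustrative parameter values:

    # Sketch: restricting growth and pruning a tree in scikit-learn (values illustrative).
    from sklearn.tree import DecisionTreeClassifier

    tree = DecisionTreeClassifier(
        max_depth=4,            # stop splitting at a maximum depth
        min_samples_leaf=5,     # keep a minimum number of samples in each leaf
        ccp_alpha=0.01,         # cost-complexity (post-)pruning strength
    )
    # tree.fit(X_train, y_train)   # X_train, y_train: your own training data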
Pros and Cons of Classification and Regression Trees
Advantages
• Simple to understand, interpret and
visualise
• Can handle both numerical and
categorical data
• Less effort in data preparation
• Non-linear relationships between parameters do not affect tree performance
• Implicitly performs feature selection
Disadvantages
• Prone to creating overly complex trees that lack generalisation capability
• Unstable: small variations in the data can result in a completely different tree
• They create biased trees if some classes dominate
• No guarantee of returning the globally optimal decision tree
Lower the variance of individual trees with ensemble methods such as bagging and boosting
ANALOGY OF ENSEMBLE LEARNING
[Figure: three decision trees predict 2.6, 2.95 and 3.2 for the same sample; averaging gives an ensemble prediction of 2.91 against a desired output of 2.85]
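The ensemble output in the figure is nothing more than the average of the individual tree predictions, as the following one-liner shows.

    # Sketch: the ensemble prediction is the plain average of the tree outputs.
    predictions = [2.6, 2.95, 3.2]
    print(sum(predictions) / len(predictions))   # ≈ 2.92, close to the desired 2.85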
RANDOM FOREST METHOD
[Diagram: the training dataset is resampled into bootstrap samples 1..k; each bootstrap sample has an in-bag part (about 2/3 of the data) used to grow a tree and an out-of-bag part (about 1/3) that is left out; the k trees give predictions 1..k, and the final output is the average of the k predictions]
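The roughly 2/3 in-bag / 1/3 out-of-bag proportions follow from sampling with replacement; a short sketch (plain NumPy, dataset size illustrative) that draws one bootstrap sample and measures both fractions:

    # Sketch: in-bag vs. out-of-bag fractions of one bootstrap sample (size illustrative).
    import numpy as np

    n = 10_000
    rng = np.random.default_rng(0)
    bootstrap_idx = rng.integers(0, n, size=n)   # sample n rows with replacement

    in_bag = np.unique(bootstrap_idx)
    print(len(in_bag) / n)       # ~0.63, i.e. roughly 2/3 of the rows are in-bag
    print(1 - len(in_bag) / n)   # ~0.37, i.e. roughly 1/3 are out-of-bag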
RANDOM FOREST – A BAGGING APPROACH
PSEUDO CODE FOR RANDOM FOREST METHOD
1. Randomly select "k" features from the total "m" features (where k < m)
2. Among the "k" features, calculate the best split point "d"
3. Split the node into daughter nodes using that best split
4. Repeat steps 1 to 3 until a predefined number of nodes is reached
5. Build a forest by repeating steps 1 to 4 "n" times to create "n" trees
6. Take the test features and use the rules of each randomly created tree to predict the output
7. Count the votes for each predicted target
8. The predicted target with the most votes is the final prediction (a sketch of this procedure follows below)
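A hedged sketch of this procedure for regression, where the per-tree learner is assumed to be scikit-learn's DecisionTreeRegressor (its max_features argument performs the "k out of m features" selection at every split) and the tree outputs are averaged rather than voted; parameter values and names are illustrative.

    # Sketch of the random-forest procedure above (regression flavour: average, not vote).
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def random_forest_fit(X, y, n_trees=100, k_features=None, rng=None):
        rng = rng or np.random.default_rng(0)
        n, m = X.shape
        k_features = k_features or max(1, int(np.sqrt(m)))    # choose k < m features per split
        trees = []
        for _ in range(n_trees):
            idx = rng.integers(0, n, size=n)                  # bootstrap (in-bag) sample
            tree = DecisionTreeRegressor(max_features=k_features)
            trees.append(tree.fit(X[idx], y[idx]))
        return trees

    def random_forest_predict(trees, X):
        return np.mean([t.predict(X) for t in trees], axis=0)  # average of the k predictions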
OVERFITTING – HIGH VARIANCE
• High variance
• The outcome can vary even with the tiniest changes in the input
• High-variance models do not generalise well to new data
• High variance compared to "PHYSICAL BALANCE"
• If you are balancing on one foot while standing on solid ground, you're not likely to fall over.
• But what if there are suddenly 100 mph wind gusts? I bet you'd fall over.
• That's because your ability to balance on one leg is highly dependent on the factors in your environment.
• If even one thing changes, it could completely throw you off!
• Likewise, if we change any factor in the model's training data, we could completely change the outcome.
• This is not a stable model, and therefore not a model on which we would want to base decisions.
Don’t fall, lil guy!!
APPLICATIONS OF RANDOM FOREST
1. Banking
• To identify loyal and fraudulent customers
• The growth of a bank depends largely on its loyal customers
• To identify customers who are not profitable to the bank
• The bank can then avoid approving loans to such customers
2. Medicine
• To identify a disease by analysing the patient's medical records
• To identify the correct combination of components when validating a medicine
3. E-commerce
• To estimate the likelihood of a customer liking a recommended product
GRADIENT BOOSTING METHOD
• Iterative loop:
1. Create a decision tree on the known response values
2. Make predictions
3. Calculate the errors (residuals)
4. Fit a new tree using the errors as response values
5. Combine the new tree with the tree from the previous iteration, and repeat
• Tuning parameters:
1. Number of trees
2. Maximum depth of each tree
3. Maximum features at each split
4. Learning rate
5. Minimum samples in leaf
• Builds decision trees sequentially
• More weight is given to mispredicted values at each stage of training
• Builds more accurate models, as the final output combines the predictions of all the decision trees (a sketch of the tuning parameters follows below)
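The five tuning parameters listed above map one-to-one onto the arguments of a typical gradient boosting implementation; a minimal sketch, assuming scikit-learn (the slide names no library) and illustrative values:

    # Sketch: the tuning parameters above in scikit-learn (values illustrative).
    from sklearn.ensemble import GradientBoostingRegressor

    gbm = GradientBoostingRegressor(
        n_estimators=200,       # 1. number of trees
        max_depth=3,            # 2. maximum depth of each tree
        max_features="sqrt",    # 3. maximum features considered at each split
        learning_rate=0.05,     # 4. learning rate
        min_samples_leaf=5,     # 5. minimum samples in a leaf
    )
    # gbm.fit(X_train, y_train); gbm.predict(X_test)   # with your own data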
GRADIENT BOOSTING – A BOOSTING APPROACH
PSEUDO CODE FOR GRADIENT BOOSTING METHOD
1. Initialize the approximation function: $F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)$
2. For m = 1 to M do:
• Calculate the pseudo-responses: $r_{im} = -\left[\partial L(y_i, F(x_i)) / \partial F(x_i)\right]_{F = F_{m-1}}$
• Fit a regression tree $h_m(x)$ to the pseudo-responses using the training set
• Calculate the step size using a line search: $\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L\big(y_i,\, F_{m-1}(x_i) + \gamma\, h_m(x_i)\big)$
• Update the model: $F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)$
3. End the algorithm: $F_M(x)$ is the final output (a from-scratch sketch follows below)
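A minimal from-scratch sketch of this loop for the squared-error case, where the pseudo-responses are simply the residuals and the leaf-wise line search is absorbed into a constant learning rate; the tree depth, shrinkage value and variable names are illustrative.

    # Sketch of the boosting loop above for squared-error loss (values illustrative).
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def gradient_boost_fit(X, y, M=100, learning_rate=0.1, max_depth=2):
        F = np.full(len(y), y.mean())          # 1. initialise F_0(x) with the mean response
        trees = []
        for _ in range(M):                     # 2. for m = 1..M
            residuals = y - F                  #    pseudo-responses = negative gradient of SE loss
            h = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
            F = F + learning_rate * h.predict(X)   # update F_m = F_{m-1} + gamma * h_m
            trees.append(h)
        return y.mean(), trees                 # 3. F_M: the initial value plus all fitted trees

    def gradient_boost_predict(f0, trees, X, learning_rate=0.1):
        # learning_rate must match the value used during fitting
        return f0 + learning_rate * sum(t.predict(X) for t in trees)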
AN EXAMPLE BASED ON GRADIENT BOOSTING
Predict the age of a person based on whether they play video games, enjoy gardening and
their preference in wearing hats
Objective :- Minimize Squared Error
LOSS FUNCTION :- SQUARED ERROR
$F_0 = \frac{1}{n} \sum_{k=1}^{n} \mathrm{Age}_k$
$\mathrm{PseudoResidual}_0 = \mathrm{Age} - F_0$
$F_1 = F_0 + \gamma_0 \cdot h_0$
$\mathrm{SSE} = \sum_{k=1}^{n} (\mathrm{Age}_k - F_1)^2$
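A tiny numeric sketch of these quantities; the ages and the "plays video games" feature below are made up, not the slide's dataset, and the first tree h0 is a depth-1 regression tree fit to the residuals with gamma0 = 1.

    # Sketch: F0, first residuals, F1 and SSE on made-up data (not the slide's dataset).
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    age = np.array([13.0, 14.0, 15.0, 25.0, 35.0, 49.0, 68.0, 71.0, 73.0])
    plays_video_games = np.array([[1], [1], [1], [1], [0], [0], [0], [0], [0]])  # made up

    F0 = age.mean()                                   # F0 = (1/n) * sum(Age)
    residual0 = age - F0                              # PseudoResidual0 = Age - F0

    h0 = DecisionTreeRegressor(max_depth=1).fit(plays_video_games, residual0)
    F1 = F0 + 1.0 * h0.predict(plays_video_games)     # F1 = F0 + gamma0 * h0, with gamma0 = 1

    print("SSE after F0:", np.sum((age - F0) ** 2))
    print("SSE after F1:", np.sum((age - F1) ** 2))   # lower: the first tree reduced the error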
BOOSTING – SEQUENTIAL ACCUMULATION
Tree1 residual = Age − Tree1 prediction
Combined prediction = Tree1 prediction + Tree2 prediction
THANK YOU FOR A PATIENT HEARING!


Editor's Notes

  • #2 Good morning everyone! My name is Shreyas. Now I’ll be giving a presentation on the topic “”
  • #11 The analogy of ensemble methods can be described by comparing the workflow with the popular show "Who Wants to Be a Millionaire?". There are three lifelines in this show, as shown. At each stage of training we build decision trees, and each of them gives an output. Each tree is a weak learner, as its predicted output is only somewhat better than random guessing. Here we combine a set of weak learners into a strong learner by averaging their outputs. The probability of getting the correct answer from a friend is comparably lower than from the audience poll.
  • #12 Data splitting divides the total dataset into training and testing sets. The training set is then further divided into bootstrap samples.
  • #17 The predictions made after adding each new tree are stronger than those of the previous iteration.