
Decision Tree and Random Forest

Topics covered so far
1. Decision Trees
a. Introduction
b. Advantages and Disadvantages
c. Building a Decision Tree
d. Impurity Measures

e. Overfitting

2. Random Forest
a. Bias-Variance Tradeoff
b. Pruning
c. Bagging
d. Random Forest

Discussion Questions
1. What is a decision tree and how does it work?

2. How do we measure the impurity in a decision tree?

Decision Tree
● A decision tree is one of the most popular and effective supervised learning techniques for classification problems. It works well with both categorical and continuous variables.
● It is a graphical representation of all the possible outcomes of a decision, based on certain conditions.
● In this algorithm, the training set is split into two or more sets based on a split condition over the input variables.
● For example: a person has to decide whether or not to go out to play tennis by looking at the weather conditions.
○ If it's cloudy, the person will go out to play.
○ If it's sunny, the person checks the humidity level; if that is normal, the person will go out to play.
○ If it's rainy, the person checks the wind speed; if it is weak, the person will go out to play.

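The tennis example can be reproduced in a few lines of code. Below is a minimal sketch using scikit-learn; the tiny dataset and its integer encoding of the weather conditions are illustrative assumptions, not part of the original example.

```python
# A minimal sketch of the play-tennis decision, fit with scikit-learn.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Encoded weather observations (illustrative): outlook (0=sunny, 1=cloudy,
# 2=rainy), humidity (0=normal, 1=high), wind (0=weak, 1=strong).
X = pd.DataFrame({
    "outlook":  [1, 1, 0, 0, 0, 2, 2, 2],
    "humidity": [0, 1, 0, 1, 1, 0, 0, 1],
    "wind":     [0, 1, 0, 0, 1, 0, 1, 1],
})
y = ["play", "play", "play", "no", "no", "play", "no", "no"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))  # show the learned splits

# A sunny day with normal humidity and weak wind -> "play", as in the example.
print(tree.predict(pd.DataFrame({"outlook": [0], "humidity": [0], "wind": [0]})))
```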
Impurity Measures in Decision Trees
Decision trees recursively split the data on feature values so as to increase the purity of the target variable in the resulting nodes. The algorithm chooses each split to maximize purity. Impurity and split quality can be measured in several ways, such as the Gini index, entropy, and information gain.

Gini Index (used in classification trees)
● Formula: G = 1 - Σᵢ pᵢ²
● Range: 0 to 0.5, where 0 = most pure and 0.5 = most impure
● Characteristics: easy to compute; non-additive

Entropy (used in classification trees)
● Formula: E = -Σ P(X)·log P(X)
● Range: 0 to 1, where 0 = most pure and 1 = most impure
● Characteristics: computationally intensive; additive

Information Gain (used in classification trees)
● Formula: IG(Y, X) = E(Y) - E(Y|X)
● Range: 0 to 1, where 0 = less gain and 1 = more gain
● Characteristics: computationally intensive

Variance (used in regression trees)
● Formula: V = Σ(x - μ)² / N
● Characteristics: the most common measure of dispersion
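The formulas above are easy to evaluate by hand. Below is a hedged sketch, using only NumPy, that computes the Gini index, entropy, and information gain for a small made-up split; the labels are illustrative.

```python
# Computing the slide's impurity measures for a toy split, NumPy only.
import numpy as np

def gini(labels):
    # G = 1 - sum(p_i^2)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def entropy(labels):
    # E = -sum(P(x) * log2 P(x))
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    # IG(Y, X) = E(Y) - E(Y|X), where E(Y|X) is the size-weighted child entropy
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

parent = np.array([1, 1, 1, 1, 0, 0, 0, 0])     # perfectly mixed node
left, right = parent[:4], parent[4:]            # a perfect split
print(gini(parent))                             # 0.5 (most impure)
print(entropy(parent))                          # 1.0 (most impure)
print(information_gain(parent, [left, right]))  # 1.0 (maximum gain)
```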
Discussion Questions
1. What do you mean by Ensemble Learning?

2. What is bootstrap aggregation and how does it work?

3. What is a random forest and how is it useful?

4. What are the advantages and disadvantages of the random forest algorithm?

Ensemble Learning
● Ensemble Learning is a paradigm of machine learning methods for combining predictions from multiple
models.
● The central motivation is rooted in the belief that a committee of experts working together can perform better than a single expert.

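As a concrete illustration of the committee idea, here is a minimal sketch using scikit-learn's VotingClassifier; the choice of the three base models and the synthetic dataset are assumptions made for illustration.

```python
# Three different "experts" vote on each prediction.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

committee = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("nb", GaussianNB()),
    ],
    voting="hard",  # final prediction is the majority vote of the three experts
)
committee.fit(X_tr, y_tr)
print(committee.score(X_te, y_te))
```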
Bootstrap Aggregation (Bagging)
● Bagging is a technique that merges the outputs of several models to produce a final result.
● It reduces the chance of overfitting by training each model on a randomly drawn subset (a bootstrap sample) of the training data. Training can be done in parallel.
● It essentially trains a large number of "strong" learners in parallel, each of which may overfit its own subset of the data.
● It then combines these learners (by averaging or majority voting) to "smooth out" their predictions.
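A minimal sketch of bagging with scikit-learn's BaggingClassifier follows; the synthetic dataset is an illustrative assumption, and the base learner defaults to a decision tree.

```python
# Each of the 100 base learners (decision trees by default) is fit on a
# bootstrap sample, in parallel, and predictions are combined by majority vote.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

single = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
bagged = BaggingClassifier(n_estimators=100, n_jobs=-1, random_state=0).fit(X_tr, y_tr)

print("single tree:", single.score(X_te, y_te))
print("bagged trees:", bagged.score(X_te, y_te))  # typically higher and more stable
```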
Random Forest algorithm
● Random Forest is a supervised machine learning algorithm, which can be used for both classification and
regression.
● It generates decision trees from random samples of the original dataset; the collection of generated decision trees is called the forest. In each tree, at every split, a random subset of the original features is considered, and the best split among them is chosen using an attribute selection indicator such as the Gini index, entropy, or information gain.

The following steps are involved in this algorithm:
1. Select a random sample from the given dataset.
2. Using attribute selection indicators, create a decision tree for each sample and record the prediction outcome from each model.
3. Apply the voting/averaging method over the predicted outcomes of the individual models.
4. Take the final result as the most-voted value (classification) or the average value (regression).
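These steps map directly onto scikit-learn's RandomForestClassifier, as in the minimal sketch below; the breast-cancer dataset and the specific parameter values are illustrative assumptions.

```python
# Each tree is grown on a bootstrap sample, considering a random feature
# subset at every split; the forest votes over the trees' predictions.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # steps 1-2: 100 trees, each on its own bootstrap sample
    max_features="sqrt",  # random subset of features tried at each split
    criterion="gini",     # attribute selection indicator
    random_state=0,
).fit(X_tr, y_tr)

print(forest.score(X_te, y_te))  # steps 3-4: majority vote over the trees
```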
Advantages and Disadvantages of Random Forest
Advantages:

● It can be used to solve classification as well as regression problems.

● It is one of the most accurate algorithms because of the number of decision trees taking part in the process.

● In general, it does not suffer from overfitting.


● It estimates the relative importance of each feature and thus helps with feature selection (see the sketch at the end of this slide).

Disadvantages:
● The Random Forest algorithm is slow compared to many other algorithms because, for every sample, it computes a prediction from each decision tree and then aggregates them by voting, which is time-consuming.

● The model is difficult to interpret compared to a single decision tree, where a decision can easily be traced by following the path down the tree.
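As a brief illustration of the feature-selection advantage, a fitted scikit-learn forest exposes feature_importances_, which ranks inputs by how much they reduce impurity across all trees; the dataset below is an illustrative assumption.

```python
# Rank features by impurity-based importance from a fitted forest.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(data.data, data.target)

# Print the top five features by importance.
ranked = sorted(zip(forest.feature_importances_, data.feature_names), reverse=True)
for importance, name in ranked[:5]:
    print(f"{name}: {importance:.3f}")
```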
Case Study
Appendix

Pruning
● One of the problems with the decision tree is that it easily overfits the training data and becomes too large
and complex.
● A complex and large tree poorly generalizes to new data, whereas a small tree fails to capture the
information of the training data.
● Pruning can be defined as shortening the branches of the tree. It is the process of reducing the size of the tree by turning some branch nodes into leaf nodes and removing all the nodes beneath them.
● By removing branches, we reduce the complexity of the tree, which helps reduce overfitting.

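A minimal sketch of the problem and the fix: an unpruned scikit-learn tree can memorize noisy training data, while a pruned one (here via the ccp_alpha parameter covered on the next slide) is smaller and often generalizes better. The synthetic dataset and the alpha value are illustrative assumptions.

```python
# Compare an unpruned tree with a pruned one on noisy data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y adds label noise so the unpruned tree is forced to overfit.
X, y = make_classification(n_samples=1000, n_informative=5, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_tr, y_tr)

for name, tree in [("full", full), ("pruned", pruned)]:
    print(name, "nodes:", tree.tree_.node_count,
          "train:", round(tree.score(X_tr, y_tr), 3),
          "test:", round(tree.score(X_te, y_te), 3))
```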
Cost Complexity Pruning
● Cost Complexity Pruning is the most popular pruning technique for decision trees. It takes into account both
the number of errors and the complexity of the tree.
● This technique is parametrized by the cost complexity parameter, ccp_alpha, which controls the number of leaf nodes and thereby the complexity of the tree, eventually reducing overfitting. Greater values of ccp_alpha increase the number of nodes pruned.
● The complexity parameter is used to define the cost-complexity measure Rα(T) of a given tree T:

Rα(T) = R(T) + α|T|

where |T| is the number of terminal nodes and R(T) is the total misclassification rate of the terminal nodes.

● Cost complexity pruning proceeds in the following stages:


○ A sequence of trees (T0, T1, ..., Tk) for increasing values of alpha is built on the training data, where T0 is the original tree before pruning and Tk is the tree consisting of the root node alone.
○ The tree Ti+1 is obtained by replacing one or more of the sub-trees in the predecessor tree Ti with
suitable leaves.
○ The impurity of each pruned tree (T0, T1,..., Tk) is estimated and the best pruned tree is then selected
based on the metric under consideration (using test data).
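The staged procedure maps onto scikit-learn as follows; this is a hedged sketch in which the dataset is an illustrative assumption. cost_complexity_pruning_path returns the effective alphas at which subtrees collapse, one tree is fit per alpha (the sequence T0, ..., Tk), and the best tree is chosen on held-out data.

```python
# Build the pruning sequence T0 ... Tk and select the best tree on test data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Effective alphas at which sub-trees are replaced by leaves.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

# One pruned tree per alpha: T0 (alpha=0, the full tree) up to the root-only tree.
trees = [DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_tr, y_tr)
         for a in path.ccp_alphas]

best = max(trees, key=lambda t: t.score(X_te, y_te))
print("best alpha:", best.ccp_alpha, "test accuracy:", best.score(X_te, y_te))
```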
Hyperparameters in Random Forest
1. Number of trees (n_estimators):
● It specifies the number of trees in the forest of the model.
● The default value for this parameter is 100, which means that 100 different decision trees will be
constructed in the random forest.

2. Maximum Depth (max_depth):


● It specifies the maximum depth of the tree.
● The default value is None, which means each tree will expand until every leaf is pure.

3. The minimum number of samples per leaf (min_samples_leaf):


● It specifies the minimum number of samples required to be at a leaf node.
● The default value is 1, which means that every leaf must have at least 1 sample that it classifies.

4. The minimum number of samples to split (min_samples_split):


● It specifies the minimum number of samples required to split a node.
● The default value for this parameter is 2, which means that an internal node must have at least two
samples before it can be split to have a more specific classification.
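Tying the four hyperparameters together, a minimal scikit-learn sketch follows; the specific values are illustrative assumptions, not recommendations.

```python
# A random forest with all four hyperparameters set explicitly.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,      # number of trees (default 100)
    max_depth=10,          # cap tree depth (default None: grow until leaves are pure)
    min_samples_leaf=5,    # every leaf must hold at least 5 samples (default 1)
    min_samples_split=10,  # a node needs at least 10 samples to split (default 2)
    random_state=0,
).fit(X_tr, y_tr)

print(forest.score(X_te, y_te))
```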
Happy Learning!
