
RANDOM FORESTS

PRESENTER: MITHUN ALVA


AGENDA

 Definition

 Origin of Random Forests

 The Algorithm

 Advantages, Shortcomings and Applications of Random Forests

 An example of using Random Forests (in R)

 Readings/References for further review


DEFINITION
 Random Forests are an ensemble learning method for
classification & regression
 Ensemble methods are made up of multiple learning algorithms, which collectively provide better predictions than any single one of them.

 They are made up of multiple decision trees

 The output is
  the mode of the predicted classes (in case of classification), and
  the mean of the predictions (in case of regression) provided by the individual trees.
ORIGIN OF RANDOM FORESTS
 Algorithm developed by Leo Breiman and Adele Cutler.
 Leo Breiman
 January 27, 1928 – July 5, 2005
 Professor Emeritus of Statistics at University of California, Berkeley.

 Adele Cutler is his long-time collaborator and former Ph.D. student.

 “Random Forests” is their trademark.


ORIGIN OF RANDOM FORESTS

 Decision Tree Models

 Involve segmenting the predictor space into a number of simple regions.

 Since the rules used to segment the space can be summarized in a tree, these models are called “Decision Tree” models.

 Can be applied to both ‘classification’ and ‘regression’ problems.

ORIGIN OF RANDOM FORESTS
 Decision Tree Models: An Example
 Predicting the ‘Choice’ (of car model) based on:
 Expense: Expense tolerance of the subject
 Gender : Gender of the subject
 PreviousOwnership: Number of cars previously owned by subject

 Sample training dataset


Gender   PreviousOwnership   Expense   Choice
Male     0                   Less      Yaris
Male     1                   Less      Yaris
Female   0                   Less      Yaris
Male     1                   Less      Yaris
Female   1                   More      Tundra
Male     2                   More      Tundra
Female   2                   More      Tundra
Female   1                   Less      Prius
Male     0                   Average   Prius
Female   1                   Average   Prius
ORIGIN OF RANDOM FORESTS
 Process of building the tree
 Link in ‘References’ section: Step-by-Step Decision Tree building
 Key concept: Gini Index
 Gini Index is a measure of node purity.
 Calculated as Σj Pj * (1 − Pj), where Pj is the proportion of observations in the jth class

 For our dataset, the Pj values are:
   P(Yaris)  = 0.4
   P(Tundra) = 0.3
   P(Prius)  = 0.3

 And the Gini Index is (0.4)*(0.6) + (0.3)*(0.7) + (0.3)*(0.7) = 0.66

 Final tree:
 Use the Gini Index to calculate the ‘Information Gain’ for each variable.
 The variable providing the best Information Gain is used for the next split. Repeat until a full tree is generated.
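
As a quick check of the arithmetic above, the root-node Gini Index can be computed in R. The snippet below is a minimal sketch; it only assumes the ten 'Choice' values from the sample training dataset, stored here in a vector called choice.

  # The ten 'Choice' values from the sample training dataset above
  choice <- c("Yaris", "Yaris", "Yaris", "Yaris", "Tundra",
              "Tundra", "Tundra", "Prius", "Prius", "Prius")

  # Gini Index = sum over classes of Pj * (1 - Pj)
  p <- table(choice) / length(choice)   # class proportions: 0.3, 0.3, 0.4
  sum(p * (1 - p))                      # 0.66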
ORIGIN OF RANDOM FORESTS
 Advantages of Decision Tree Models
 Easy to interpret and explain
 Implicitly perform variable screening
 Variables associated with the top few nodes are the most important

 Require relatively little data preparation effort


 Can handle a mix of categorical and continuous variables
 Can handle missing values
 Not sensitive to outliers
 Scaling of parameters (e.g. revenue in millions and loan age in years in the same dataset) is not necessary.

 Can handle non-linear relationships.


ORIGIN OF RANDOM FORESTS
 Shortcomings of Decision Tree Models

 High Variance: If we split the training data into two parts at random
and fit a decision tree to both halves, we could get very different
trees.

 Tend to favor categorical predictors with many levels. Variables with a large number of levels can cause severe overfitting.

 ‘Bagging’ attempts to address the above shortcomings.


ORIGIN OF RANDOM FORESTS
 Bagging (also known as ‘bootstrap aggregation’)

 From your full dataset, take a sample, generate a tree and obtain predictions.

 Repeat with a different sample from the same dataset. The new tree will typically make different predictions.

 Continue sampling and generating trees in this manner until the desired number of trees (typically around 500) is obtained.

 This process is called “Bagging”.


ORIGIN OF RANDOM FORESTS
 ‘Out of Bag’ (OOB) Data

 If we sample from the available data and build a tree, we already have holdout data available for that tree. This data is referred to as “Out of Bag” data.

 Every tree grown has a different holdout sample

 Every record in the full dataset is “in bag” for some trees (about two-thirds of them) and “out of bag” for the other trees.
ORIGIN OF RANDOM FORESTS
 ‘Out of Bag’ (OOB) Data

 Suppose a given record was “in bag” for 375 trees and “out of bag”
for the remaining 125 trees.

 Predictions for this record could be generated using just the “out of
bag” trees.

 Always having OOB data means we can effectively work with relatively small datasets.

ORIGIN OF RANDOM FORESTS


 Bagging
 The sampling method here is bootstrap sampling

 Each time, the number of observations in the sample equals the number of observations in the training data
 However, sampling is done with replacement
 Therefore, not all observations will be present in the chosen sample

 Example: if the training data is {1,2,3,4,5}

 Sample 1 could be {5,1,1,4,1}


 Sample 2 could be {2,1,5,3,3}
 Sample 3 could be {1,2,3,2,5}
 …and so on
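
A minimal R sketch of this sampling step, using the toy training data {1,2,3,4,5} above; the rows that are never drawn form the “Out of Bag” data described on the previous slide.

  set.seed(1)
  train_ids <- 1:5                                    # toy training data {1,2,3,4,5}

  # One bootstrap sample: same size as the training data, drawn with replacement
  in_bag  <- sample(train_ids, size = length(train_ids), replace = TRUE)
  out_bag <- setdiff(train_ids, in_bag)               # rows never drawn are "Out of Bag"

  in_bag    # e.g. 1 4 1 2 5  (duplicates are expected)
  out_bag   # e.g. 3          (holdout rows for this particular tree)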
ORIGIN OF RANDOM FORESTS
 Bagging & Predictor Subsetting

 Trees in the Bagger were found to be too similar to each other


 To address this, Breiman introduced randomness into the actual tree growing as
well
 Normally, all possible predictors are evaluated for their ability to form a node in the
tree and partition the data in the best possible manner.
 Instead, every time we are forming a node, a subset of the predictors is considered.
 From among these predictors, the one providing the best partitioning is used to
form the node.
 A new random subset of predictors is chosen to build each node.

 ‘Random Forests’ combines the concepts of decision trees, bagging and predictor subsetting.
ORIGIN OF RANDOM FORESTS

 Breiman and Cutler suggested using one of the following rules to form the subset of predictors:

Predictors (N)   sqrt(N)   0.5*sqrt(N)   2*sqrt(N)   log2(N)
100              10        5             20          7
1,000            31        15.5          62          10
10,000           100       50            200         13
100,000          316       158           632         17
1,000,000        1000      500           2000        20
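
These values are easy to reproduce. The helper below is only an illustrative sketch, not part of the randomForest package; it assumes the table truncates sqrt(N) to a whole number and rounds log2(N), which matches the figures shown.

  # Candidate 'mtry' values for a given number of predictors N (illustrative helper)
  mtry_rules <- function(N) {
    s <- floor(sqrt(N))                  # the table appears to truncate sqrt(N)
    c(sqrt_N = s, half_sqrt_N = 0.5 * s, twice_sqrt_N = 2 * s, log2_N = round(log2(N)))
  }

  sapply(c(100, 1000, 10000, 100000, 1000000), mtry_rules)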
ALGORITHM

Repeat until the specified number of trees (ntree) is obtained:

  1. From the training dataset, draw a sample to build a tree.
     Sample size = training dataset size; sampling is done with replacement.
     The remaining (“Out of Bag”) data is set aside for this tree.

  2. Grow the tree. Repeat until the tree is fully grown:
     - Randomly select ‘mtry’ predictors.
     - Split the data using the best of the ‘mtry’ predictors.

  3. Estimate error by applying the fully grown tree to the “Out of Bag” data.
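
For intuition, the loop above can be sketched in a few lines of R. This is an illustration only, not the actual randomForest implementation: it grows each unpruned tree with rpart, and as a simplification it draws one random subset of ‘mtry’ predictors per tree rather than per node.

  library(rpart)

  grow_forest <- function(data, response, ntree = 50, mtry = NULL) {
    predictors <- setdiff(names(data), response)       # response column should be a factor
    if (is.null(mtry)) mtry <- floor(sqrt(length(predictors)))

    lapply(seq_len(ntree), function(i) {
      # 1. Bootstrap sample: same size as the training data, with replacement
      rows <- sample(nrow(data), nrow(data), replace = TRUE)
      boot <- data[rows, ]
      oob  <- data[-unique(rows), ]                    # "Out of Bag" data for this tree

      # 2. Grow an unpruned tree on a random subset of 'mtry' predictors
      vars <- sample(predictors, mtry)
      tree <- rpart(reformulate(vars, response = response), data = boot,
                    method = "class", control = rpart.control(cp = 0, minsplit = 2))

      # 3. Estimate error by applying the tree to its "Out of Bag" data
      oob_pred <- predict(tree, newdata = oob, type = "class")
      list(tree = tree, oob_error = mean(oob_pred != oob[[response]]))
    })
  }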
ADVANTAGES OF RANDOM FORESTS
 Automatic identification of important predictors

 Good for wide data; provide good accuracy and generate reliable
predictor importance rankings.

 Resistant to over-training.

 Each decision tree is independent. Therefore, trees can be grown on different cores or different computers, allowing for quicker analysis.
SHORTCOMINGS OF RANDOM FORESTS

 Suited for wide datasets with only a moderate number of rows. Breiman recommends the use of other tools for larger datasets.

 Large memory needed to store built models.

 Overfitting might be seen with noisy data.


APPLICATIONS OF RANDOM FORESTS

 Online targeted marketing


 Credit card fraud detection
 Text analytics
 Credit risk and insurance risk
 Retail Sales prediction
 Biological & Medical Research
 Manufacturing Quality Control
EXAMPLE

 R packages used in example

 randomForest: Breiman and Cutler's random forests for classification and regression.

 rpart: Recursive partitioning for classification, regression and survival trees. An implementation of most of the functionality of the 1984 book by Breiman, Friedman, Olshen and Stone.

 caret: Miscellaneous functions for training and plotting classification and regression models.
EXAMPLE

 Titanic dataset https://www.kaggle.com/c/titanic

 Predictors :
 pclass Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
 name Name
 sex Sex (“female”, “male”)
 age Age ( in years)
 sibsp Number of Siblings/Spouses Aboard
 parch Number of Parents/Children Aboard
 ticket Ticket Number
 fare Passenger Fare
 cabin Cabin
 embarked Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
EXAMPLE
 Titanic dataset

 Response :
 survived 0 = No; 1 = Yes

 Ask:
 “predict which passengers survived the tragedy”
EXAMPLE
 Additional features created:
 Title:
 Isolating titles (i.e. Col, Dr, Lady, Master, Miss, etc.) from the ‘Name’ field
 Consolidating rare titles into “Mlle”, “Sir” or “Lady”

 FamilyID
 Large families might have had trouble getting to lifeboats together
 SibSp+Parch+1 will give Family Size

 A last name like “Johnson” is common on its own

 Combine the last name with the family size to uniquely identify each family
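
A sketch of these two features in R, in the spirit of the Trevor Stephens tutorial listed in the references. It assumes the train and test sets have been combined into a data frame called combi with the Kaggle column names (Name, SibSp, Parch); the regular expression and the exact title groupings are illustrative.

  # Title: the token between the comma and the first period in e.g. "Braund, Mr. Owen Harris"
  combi$Title <- sub(".*, *([^.]*)\\..*", "\\1", combi$Name)
  combi$Title[combi$Title %in% c("Mme", "Mlle")] <- "Mlle"
  combi$Title[combi$Title %in% c("Capt", "Don", "Major", "Sir")] <- "Sir"
  combi$Title[combi$Title %in% c("Dona", "Lady", "the Countess", "Jonkheer")] <- "Lady"

  # FamilySize and FamilyID: family size joined with the surname identifies a family
  combi$FamilySize <- combi$SibSp + combi$Parch + 1
  surname <- sapply(strsplit(combi$Name, ","), function(x) x[1])
  combi$FamilyID <- paste0(combi$FamilySize, surname)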
EXAMPLE
 Data Preparation:
 ‘randomForest’ package in R cannot handle missing values

 ‘Age’ has 263 missing values

 Could be replaced by the mean/median of the non-missing values
 Another way is to use a decision tree to predict the missing values
 The tree is built using the ‘rpart’ package
 ‘Embarked’ has missing values in two rows
 Replace them with ‘S’
 Nearly 70% of the passengers embarked at Southampton

 ‘Fare’ has one missing value
 Replace it with the median of the non-missing fare values
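
A sketch of these imputation steps, again assuming the combined data frame combi and the Kaggle column names; depending on how the CSV was read, the missing ‘Embarked’ values may appear as empty strings rather than NA.

  library(rpart)

  # Age: predict the 263 missing values with a regression tree on the other predictors
  age_fit <- rpart(Age ~ Pclass + Sex + SibSp + Parch + Fare + Embarked + Title + FamilySize,
                   data = combi[!is.na(combi$Age), ], method = "anova")
  combi$Age[is.na(combi$Age)] <- predict(age_fit, combi[is.na(combi$Age), ])

  # Embarked: fill the two blanks with the most common port, Southampton
  combi$Embarked[combi$Embarked == ""] <- "S"

  # Fare: fill the single missing value with the median of the non-missing fares
  combi$Fare[is.na(combi$Fare)] <- median(combi$Fare, na.rm = TRUE)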
EXAMPLE
 Data Preparation:

Random Forests in R can only digest factors with up to 32 levels

 FamilyID has a larger number of levels

 Create a new feature called FamilyID2

 Equal to “Small” if FamilySize <= 3, and FamilyID otherwise
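
A short sketch of that feature, continuing with the combi data frame assumed above:

  # Collapse small families into one "Small" level to stay under the 32-level limit
  combi$FamilyID2 <- as.character(combi$FamilyID)
  combi$FamilyID2[combi$FamilySize <= 3] <- "Small"
  combi$FamilyID2 <- factor(combi$FamilyID2)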


EXAMPLE
 Install the ‘randomForest’ package in R
 install.packages('randomForest')

 To obtain the same results every time you run the code, use ‘set.seed’

 Syntax:
 fit <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp +
                       Parch + Fare + Embarked + Title + FamilySize + FamilyID2,
                     data = train,
                     importance = TRUE,  # enables inspection of variable importance
                     ntree = 2000)       # number of trees to grow; default is 500
EXAMPLE

 Inspect variable importance plots

 varImpPlot(fit)

 Will tell us which variables have the highest impact on the predictive
ability of the model

 The variable associated with the largest decrease in the relevant measure (accuracy or Gini) has the highest impact.
EXAMPLE
 ‘Title’ has the strongest impact, in terms of both Accuracy and
Gini Index
 Added features ‘FamilyID2’ and ‘FamilySize’ have substantial impact
EXAMPLE
 Additional parameters for model

randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare +
               Embarked + Title + FamilySize + FamilyID2,
             data = train,
             importance = TRUE,   # enables inspection of variable importance
             ntree = ,            # number of trees to grow; default is 500
             mtry = ,             # number of variables selected at each node;
                                  # default is the square root of the number of variables
             nodesize = ,         # minimum size of terminal nodes; setting this to 'k'
                                  # means that no node with fewer than k cases will be split;
                                  # default is 1 for classification and 5 for regression
             ...)
EXAMPLE
 Performance Evaluation
 Confusion Matrix

                Predicted
 Actual        0      1
   0         491     92
   1          58    250
 Total       549    342

 Accuracy = (491 + 250) / (549 + 342) = 0.8316

 95% CI: (0.8054, 0.8557)
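
One way to produce numbers like these is caret's confusionMatrix function; the call below is a sketch that compares the model's predictions on the training data with the known labels.

  library(caret)

  pred <- predict(fit, newdata = train)              # predicted classes
  confusionMatrix(pred, as.factor(train$Survived))   # prints the matrix, accuracy and 95% CI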
EXAMPLE
 Tuning the Random Forests model

 Objective is to find the best value of mtry (i.e. the number of predictors chosen at each node)

 tunefit <- train(as.factor(Survived) ~ .,
                  data = train1,
                  method = "rf",             # 'rf' stands for random forest
                  metric = "Accuracy",       # what we are trying to improve
                  tuneGrid = data.frame(mtry = c(2, 3, 4)))   # set of values to be considered
EXAMPLE
 Tuning the Random Forests model

 Objective is to find the best value of mtry (i.e. the number of features chosen at each node)

 mtry   Accuracy
   2    0.8236414
   3    0.8310338
   4    0.8302738
EXAMPLE
 Apply model to test data
 Prediction <- predict(tunefit, newdata = test)
 Will give predictions for ‘Survived’ as ‘0’ and ‘1’ values
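
For the Kaggle competition, the predictions are written out next to the passenger IDs; a sketch assuming the test set still carries its PassengerId column:

  submission <- data.frame(PassengerId = test$PassengerId, Survived = Prediction)
  write.csv(submission, file = "rf_submission.csv", row.names = FALSE)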
REFERENCES
 Leo Breiman's Random Forests Page
 An Introduction to Random Forest for Beginners: Salford Systems
 Random Forests Lecture by Nando de Freitas, University of British Columbia
 Random Forests Lecture by Derek Kane
 The Elements of Statistical Learning: Hastie, Tibshirani & Friedman
 Introduction to Statistical Learning: James, Witten, Hastie and Tibshirani
 Trevor Stephens: Titanic Dataset Analysis using R
 Curt Wehrley: Titanic Dataset Analysis using R
 randomForest package in R
 rpart package in R
 caret package in R
 Step-by-Step Decision Tree building
 Machine Learning Benchmarks and Random Forest Regression, Mark Segal
