Random Forests
Definition
A random forest is an ensemble of decision trees, each grown on a random sample of the training data.
The Algorithm
The output is
the mode of the predicted classes (in case of classification), and
the mean of the predicted values (in case of regression) provided by the
individual trees.
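As a minimal illustration of this aggregation rule, with toy prediction vectors invented for a single record:

# Toy predictions from five individual trees for one record
class_preds <- c("Yaris", "Prius", "Yaris", "Tundra", "Yaris")
reg_preds   <- c(21.0, 19.5, 20.2, 22.1, 20.7)

# Classification: the mode (majority vote) of the predicted classes
names(which.max(table(class_preds)))  # "Yaris"

# Regression: the mean of the predicted values
mean(reg_preds)                       # 20.7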
ORIGIN OF RANDOM FORESTS
Algorithm developed by Leo Breiman and Adele Cutler.
Leo Breiman
January 27, 1928 – July 5, 2005
Professor Emeritus of Statistics at the University of California, Berkeley.
[Figure: toy decision tree predicting car model, with leaf probabilities P(Yaris) = 0.4, P(Tundra) = 0.3, P(Prius) = 0.3]
[Figure: final tree]
Use the Gini index to calculate the ‘information gain’ for each variable.
The variable providing the best ‘information gain’ becomes the next split in the tree. Repeat until a full tree is
generated.
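A minimal sketch of this step, the Gini impurity and the Gini-based ‘information gain’ of a candidate split, with labels invented for illustration:

# Gini impurity of a vector of class labels
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Gain of a split = parent impurity minus the weighted impurity of the children
gini_gain <- function(parent, left, right) {
  w <- c(length(left), length(right)) / length(parent)
  gini(parent) - (w[1] * gini(left) + w[2] * gini(right))
}

y <- c("Yaris", "Yaris", "Prius", "Tundra", "Prius", "Yaris")
gini_gain(y, y[1:3], y[4:6])  # gain of splitting y into its first and last three labels

The variable whose best split maximizes this gain is chosen at each node.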
Advantages of Decision Tree Models
Easy to interpret and explain.
Implicitly perform variable screening: variables associated with the top few nodes are the most important.
Disadvantage of Decision Tree Models
High Variance: if we split the training data into two halves at random and fit a decision tree to each half, we could get very different trees.
From your full dataset, take a bootstrap sample (drawn with replacement), generate a tree, and obtain predictions.
Repeat with a different sample from the same dataset. The new tree will typically make different predictions.
Continue sampling and generating trees in this manner until about 500 trees are obtained.
Every record in the full dataset is “in bag” for some trees (about two-thirds of them) and “out of bag” for the other trees.
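The ‘about two-thirds’ figure comes from sampling with replacement: a bootstrap sample of size n contains roughly 1 - 1/e ≈ 63% of the distinct records, so each record lands in bag for about 2/3 of the trees. A quick simulation:

set.seed(1)
n <- 10000
in_bag <- unique(sample(n, replace = TRUE))  # one bootstrap sample of size n
length(in_bag) / n                           # ~0.632, i.e. about 2/3 "in bag"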
‘Out of Bag’ (OOB) Data
Suppose a given record was “in bag” for 375 trees and “out of bag”
for the remaining 125 trees.
Predictions for this record could be generated using just the “out of
bag” trees.
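This is how the randomForest package estimates error by default: calling predict on a fitted forest without new data returns each record’s out-of-bag prediction. A minimal sketch, using the built-in iris data for illustration:

library(randomForest)
fit <- randomForest(Species ~ ., data = iris)
head(predict(fit))              # no 'newdata': out-of-bag predictions
fit$err.rate[fit$ntree, "OOB"]  # OOB error estimate after the last tree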
The algorithm:
Start with the Training Dataset.
Repeat until the specified number of trees (ntree) is obtained:
    Repeat until the tree is fully grown:
        Randomly select ‘mtry’ predictors.
        Grow the tree using the best of the ‘mtry’ predictors to split the data.
Random Forests are good for wide data; they provide good accuracy and generate reliable predictor-importance rankings.
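Combining the bagging described earlier with the flow above, a toy version can be sketched in R. One caveat: rpart does not expose per-node predictor selection, so this sketch draws the ‘mtry’ predictors once per tree rather than at every split, and the built-in iris data stands in for a real training set:

library(rpart)

simple_forest <- function(data, response, ntree = 50, mtry = 2) {
  predictors <- setdiff(names(data), response)
  lapply(seq_len(ntree), function(i) {
    boot <- data[sample(nrow(data), replace = TRUE), ]  # bootstrap sample
    vars <- sample(predictors, mtry)                    # random subset of predictors
    rpart(reformulate(vars, response), data = boot, method = "class")
  })
}

trees <- simple_forest(iris, "Species")
votes <- sapply(trees, function(t) as.character(predict(t, iris, type = "class")))
pred  <- apply(votes, 1, function(v) names(which.max(table(v))))  # majority vote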
EXAMPLE
Titanic dataset
Predictors:
pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name: Name
sex: Sex (“female”, “male”)
age: Age (in years)
sibsp: Number of Siblings/Spouses Aboard
parch: Number of Parents/Children Aboard
ticket: Ticket Number
fare: Passenger Fare
cabin: Cabin
embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
Response:
survived: 0 = No; 1 = Yes
Task: predict which passengers survived the tragedy.
Additional features created:
Title:
Isolate titles (e.g., Col, Dr, Lady, Master, Miss) from the ‘Name’ field, combining rare ones into “Mlle”, “Sir”, or “Lady”.
FamilyID:
Large families might have had trouble getting to lifeboats together.
SibSp + Parch + 1 gives the family size.
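A hedged sketch of these derived features in R, following the general approach of Trevor Stephens’ tutorial (see REFERENCES); the column names assume the Kaggle version of the dataset:

# Extract the title between the comma and the period, e.g. "Braund, Mr. Owen" -> "Mr"
train$Title <- gsub("(.*, )|(\\..*)", "", train$Name)

# Family size: siblings/spouses + parents/children + the passenger themselves
train$FamilySize <- train$SibSp + train$Parch + 1

# FamilyID: surname combined with family size, so members of one large family share an ID
surname <- sapply(strsplit(train$Name, ","), `[`, 1)
train$FamilyID <- paste0(surname, train$FamilySize)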
To obtain the same results every time you run the code, use ‘set.seed’.
Syntax:
fit <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp +
                      Parch + Fare + Embarked + Title + FamilySize + FamilyID2,
                    data = train,
                    importance = TRUE, # enables inspection of variable importance
                    ntree = 2000)      # number of trees to grow; default is 500
varImpPlot(fit)
will tell us which variables have the highest impact on the predictive ability of the model.
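The numeric table behind the plot can also be inspected directly (this assumes the fit above was created with importance = TRUE):

importance(fit)  # per-variable MeanDecreaseAccuracy and MeanDecreaseGini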
randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch +
               Fare + Embarked + Title + FamilySize + FamilyID2,
             data = train,
             importance = TRUE, # enables inspection of variable importance
             ntree = ,          # number of trees to grow; default is 500
             mtry = ,           # number of variables selected at each node;
                                # default is the square root of the number of
                                # variables (for classification)
             nodesize = ,       # minimum size of terminal nodes; setting this to ‘k’
                                # means that no node with fewer than k cases will be
                                # split; default is 1 for classification and 5 for regression
             ...)
Performance Evaluation
Confusion Matrix:
              Predicted
Actual         0      1
  0          491     92
  1           58    250
Total        549    342
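From this matrix, the overall accuracy works out to (491 + 250) / 891 ≈ 0.832:

cm <- matrix(c(491, 58, 92, 250), nrow = 2)  # the confusion matrix above
sum(diag(cm)) / sum(cm)                      # 0.8316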
tunefit <- train(as.factor(Survived) ~ .,
                 data = train1,
                 method = "rf",       # 'rf' stands for random forest
                 metric = "Accuracy", # the measure we are trying to improve
                 tuneGrid = data.frame(mtry = c(2, 3, 4))) # set of values to be considered
Tuning the Random Forests model
mtry   Accuracy
2      0.8236414
3      0.8310338
4      0.8302738
mtry = 3 gives the highest accuracy and is selected for the final model.
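The chosen value and the resampling results can be read back off the caret object:

tunefit$bestTune  # mtry = 3, the value with the highest accuracy
tunefit$results   # the accuracy table shown above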
Apply the model to the test data:
Prediction <- predict(tunefit, newdata = test)
This gives predictions for ‘Survived’ as ‘0’ and ‘1’ values.
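A hedged example of saving these predictions, assuming the test data carries a PassengerId column as in the Kaggle version of the dataset:

submit <- data.frame(PassengerId = test$PassengerId, Survived = Prediction)
write.csv(submit, file = "predictions.csv", row.names = FALSE)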
REFERENCES
Leo Breiman's Random Forests Page
An Introduction to Random Forest for Beginners, Salford Systems
Random Forests Lecture by Nando de Freitas, University of British Columbia
Random Forests Lecture by Derek Kane
The Elements of Statistical Learning: Hastie, Tibshirani & Friedman
An Introduction to Statistical Learning: James, Witten, Hastie & Tibshirani
Trevor Stephens: Titanic Dataset Analysis using R
Curt Wehrley: Titanic Dataset Analysis using R
The randomForest package in R
The rpart package in R
The caret package in R
Step-by-Step Decision Tree Building
Machine Learning Benchmarks and Random Forest Regression, Mark Segal