
RANDOM FORESTS

PRESENTER: MITHUN ALVA


AGENDA

 Definition

 Origin of Random Forests

 The Algorithm

 Advantages, Shortcomings and Applications of Random Forests

 An example of using Random Forests (in R)

 Readings/References for further review


DEFINITION
 Random Forests are an ensemble learning method for
classification & regression
 Ensemble methods are made up of multiple learning algorithms, which collectively provide better predictions than any single one of them.

 They are made up of multiple decision trees

 The output is
  the mode of the predicted classes (in case of classification), and
  the mean of the predictions (in case of regression) provided by the individual trees.
ORIGIN OF RANDOM FORESTS
 Algorithm developed by Leo Breiman and Adele Cutler.
 Leo Breiman
 January 27, 1928 – July 5, 2005
 Professor Emeritus of Statistics at University of California, Berkeley.

 Adele Cutler is his long-time collaborator and former Ph.D. student.

 “Random Forests” is their trademark.


ORIGIN OF RANDOM FORESTS

 Decision Tree Models

 Involve segmenting the predictor space into a number of simple regions.

 Since the rules used to segment the space can be summarized in a tree, these models are called “Decision Tree” models.

 Can be applied to both ‘classification’ and ‘regression’ problems.

ORIGIN OF RANDOM FORESTS
 Decision Tree Models: An Example
 Predicting the ‘Choice’ (of car model) based on:
 Expense: Expense tolerance of the subject
 Gender : Gender of the subject
 PreviousOwnership: Number of cars previously owned by subject

 Sample training dataset


Gender   PreviousOwnership   Expense   Choice
Male     0                   Less      Yaris
Male     1                   Less      Yaris
Female   0                   Less      Yaris
Male     1                   Less      Yaris
Female   1                   More      Tundra
Male     2                   More      Tundra
Female   2                   More      Tundra
Female   1                   Less      Prius
Male     0                   Average   Prius
Female   1                   Average   Prius
ORIGIN OF RANDOM FORESTS
 Process of building the tree
 Link in ‘References’ section: Step-by-Step Decision Tree building
 Key concept: Gini Index
 Gini Index is a measure of node purity.
 Calculated as Σj Pj * (1 − Pj), where Pj is the proportion of observations in the jth class

 For our dataset, the Pj values are:
   P(Yaris)  = 0.4
   P(Tundra) = 0.3
   P(Prius)  = 0.3

 And the Gini Index is (0.4)*(0.6) + (0.3)*(0.7) + (0.3)*(0.7) = 0.66

 Final tree:
 Use the Gini Index to calculate the ‘Information Gain’ for each variable.
 The variable providing the best Information Gain is used for the next split. Repeat until a full tree is generated.
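
As a quick check of the arithmetic above, the root-node Gini Index can be computed in R. The snippet below is a minimal sketch; it only assumes the ten 'Choice' values from the sample training dataset, stored here in a vector called choice.

  # The ten 'Choice' values from the sample training dataset above
  choice <- c("Yaris", "Yaris", "Yaris", "Yaris", "Tundra",
              "Tundra", "Tundra", "Prius", "Prius", "Prius")

  # Gini Index = sum over classes of Pj * (1 - Pj)
  p <- table(choice) / length(choice)   # class proportions: 0.3, 0.3, 0.4
  sum(p * (1 - p))                      # 0.66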
ORIGIN OF RANDOM FORESTS
 Advantages of Decision Tree Models
 Easy to interpret and explain
 Implicitly perform variable screening
 Variables associated with the top few nodes are the most important

 Require relatively little data preparation effort


 Can handle a mix of categorical and continuous variables
 Can handle missing values
 Not sensitive to outliers
 Scaling of parameters (e.g. revenue in millions and loan age in years in the same dataset) is not necessary.

 Can handle non-linear relationships.


ORIGIN OF RANDOM FORESTS
 Shortcomings of Decision Tree Models

 High Variance: If we split the training data into two parts at random
and fit a decision tree to both halves, we could get very different
trees.

 Tend to favor categorical predictors with many levels. Variables with a large number of levels can cause severe overfitting.

 ‘Bagging’ attempts to address the above shortcomings.


ORIGIN OF RANDOM FORESTS
 Bagging (also known as ‘bootstrap aggregation’)

 From your full dataset, take a sample, generate a tree and obtain predictions.

 Repeat with a different sample from the same dataset. The new tree will typically make different predictions.

 Continue sampling and generating trees in this manner until the desired number of trees (typically around 500) is obtained.

 This process is called “Bagging”.


ORIGIN OF RANDOM FORESTS
 ‘Out of Bag’ (OOB) Data

 If we sample from the available data and build a tree, we already have holdout data available for that tree. This data is referred to as “Out of Bag” data.

 Every tree grown has a different holdout sample

 Every record in the full dataset is “in bag” for some trees (about two-thirds of them) and “out of bag” for the other trees.
ORIGIN OF RANDOM FORESTS
 ‘Out of Bag’ (OOB) Data

 Suppose a given record was “in bag” for 375 trees and “out of bag”
for the remaining 125 trees.

 Predictions for this record could be generated using just the “out of
bag” trees.

 Always having OOB data means we can effectively work with relatively small datasets.

ORIGIN OF RANDOM FORESTS


 Bagging
 The sampling method here is bootstrap sampling

 Each time, the number of observations in the sample equals the number of observations in the training data
 However, sampling is done with replacement
 Therefore, not all observations will be present in the chosen sample

 Example: if the training data is {1,2,3,4,5}

 Sample 1 could be {5,1,1,4,1}


 Sample 2 could be {2,1,5,3,3}
 Sample 3 could be {1,2,3,2,5}
 …and so on
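
A minimal R sketch of this sampling step, using the toy training data {1,2,3,4,5} above; the rows that are never drawn form the “Out of Bag” data described on the previous slide.

  set.seed(1)
  train_ids <- 1:5                                    # toy training data {1,2,3,4,5}

  # One bootstrap sample: same size as the training data, drawn with replacement
  in_bag  <- sample(train_ids, size = length(train_ids), replace = TRUE)
  out_bag <- setdiff(train_ids, in_bag)               # rows never drawn are "Out of Bag"

  in_bag    # e.g. 1 4 1 2 5  (duplicates are expected)
  out_bag   # e.g. 3          (holdout rows for this particular tree)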
ORIGIN OF RANDOM FORESTS
 Bagging & Predictor Subsetting

 Trees in the Bagger were found to be too similar to each other


 To address this, Breiman introduced randomness into the actual tree growing as
well
 Normally, all possible predictors are evaluated for their ability to form a node in the
tree and partition the data in the best possible manner.
 Instead, every time we are forming a node, a subset of the predictors is considered.
 From among these predictors, the one providing the best partitioning is used to
form the node.
 A new random subset of predictors is chosen to build each node.

 ‘Random Forests’ combines the concepts of decision trees, bagging and predictor subsetting.
ORIGIN OF RANDOM FORESTS

 Breiman and Cutler suggested using one of the following rules to form the subset of predictors:

Predictors (N)   sqrt(N)   0.5*sqrt(N)   2*sqrt(N)   log2(N)
100              10        5             20          7
1,000            31        15.5          62          10
10,000           100       50            200         13
100,000          316       158           632         17
1,000,000        1000      500           2000        20
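
These values are easy to reproduce. The helper below is only an illustrative sketch, not part of the randomForest package; it assumes the table truncates sqrt(N) to a whole number and rounds log2(N), which matches the figures shown.

  # Candidate 'mtry' values for a given number of predictors N (illustrative helper)
  mtry_rules <- function(N) {
    s <- floor(sqrt(N))                  # the table appears to truncate sqrt(N)
    c(sqrt_N = s, half_sqrt_N = 0.5 * s, twice_sqrt_N = 2 * s, log2_N = round(log2(N)))
  }

  sapply(c(100, 1000, 10000, 100000, 1000000), mtry_rules)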
ALGORITHM

Repeat until the specified number of trees (ntree) is obtained:

  1. From the training dataset, draw a sample to build a tree.
     Sample size = training dataset size; sampling is done with replacement.
     The remaining (“Out of Bag”) data is set aside for this tree.

  2. Grow the tree. Repeat until the tree is fully grown:
     - Randomly select ‘mtry’ predictors.
     - Split the data using the best of the ‘mtry’ predictors.

  3. Estimate error by applying the fully grown tree to the “Out of Bag” data.
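
For intuition, the loop above can be sketched in a few lines of R. This is an illustration only, not the actual randomForest implementation: it grows each unpruned tree with rpart, and as a simplification it draws one random subset of ‘mtry’ predictors per tree rather than per node.

  library(rpart)

  grow_forest <- function(data, response, ntree = 50, mtry = NULL) {
    predictors <- setdiff(names(data), response)       # response column should be a factor
    if (is.null(mtry)) mtry <- floor(sqrt(length(predictors)))

    lapply(seq_len(ntree), function(i) {
      # 1. Bootstrap sample: same size as the training data, with replacement
      rows <- sample(nrow(data), nrow(data), replace = TRUE)
      boot <- data[rows, ]
      oob  <- data[-unique(rows), ]                    # "Out of Bag" data for this tree

      # 2. Grow an unpruned tree on a random subset of 'mtry' predictors
      vars <- sample(predictors, mtry)
      tree <- rpart(reformulate(vars, response = response), data = boot,
                    method = "class", control = rpart.control(cp = 0, minsplit = 2))

      # 3. Estimate error by applying the tree to its "Out of Bag" data
      oob_pred <- predict(tree, newdata = oob, type = "class")
      list(tree = tree, oob_error = mean(oob_pred != oob[[response]]))
    })
  }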
ADVANTAGES OF RANDOM FORESTS
 Automatic identification of important predictors

 Good for wide data; provide good accuracy and generate reliable
predictor importance rankings.

 Resistant to over-training.

 Each decision tree is independent. Therefore, trees can be grown on different cores or different computers, allowing for quicker analysis.
SHORTCOMINGS OF RANDOM FORESTS

 Suited for wide datasets with only a moderate number of rows. Breiman recommends the use of other tools for larger datasets.

 Large memory needed to store built models.

 Overfitting might be seen with noisy data.


APPLICATIONS OF RANDOM FORESTS

 Online targeted marketing


 Credit card fraud detection
 Text analytics
 Credit risk and insurance risk
 Retail Sales prediction
 Biological & Medical Research
 Manufacturing Quality Control
EXAMPLE

 R packages used in example

 randomForest: Breiman and Cutler's random forests for classification and regression.

 rpart: Recursive partitioning for classification, regression and survival trees. An implementation of most of the functionality of the 1984 book by Breiman, Friedman, Olshen and Stone.

 caret: Miscellaneous functions for training and plotting classification and regression models.
EXAMPLE

 Titanic dataset https://www.kaggle.com/c/titanic

 Predictors :
 pclass Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
 name Name
 sex Sex (“female”, “male”)
 age Age ( in years)
 sibsp Number of Siblings/Spouses Aboard
 parch Number of Parents/Children Aboard
 ticket Ticket Number
 fare Passenger Fare
 cabin Cabin
 embarked Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
EXAMPLE
 Titanic dataset

 Response :
 survived 0 = No; 1 = Yes

 Ask:
 “predict which passengers survived the tragedy”
EXAMPLE
 Additional features created:
 Title:
 Isolating titles (i.e. Col, Dr, Lady, Master, Miss, etc.) from the ‘Name’ field
 Consolidating rare titles into “Mlle”, “Sir” or “Lady”

 FamilyID
 Large families might have had trouble getting to lifeboats together
 SibSp+Parch+1 will give Family Size

 A last name like “Johnson” is common on its own

 Combine the last name with the family size to uniquely identify each family
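
A sketch of these two features in R, in the spirit of the Trevor Stephens tutorial listed in the references. It assumes the train and test sets have been combined into a data frame called combi with the Kaggle column names (Name, SibSp, Parch); the regular expression and the exact title groupings are illustrative.

  # Title: the token between the comma and the first period in e.g. "Braund, Mr. Owen Harris"
  combi$Title <- sub(".*, *([^.]*)\\..*", "\\1", combi$Name)
  combi$Title[combi$Title %in% c("Mme", "Mlle")] <- "Mlle"
  combi$Title[combi$Title %in% c("Capt", "Don", "Major", "Sir")] <- "Sir"
  combi$Title[combi$Title %in% c("Dona", "Lady", "the Countess", "Jonkheer")] <- "Lady"

  # FamilySize and FamilyID: family size joined with the surname identifies a family
  combi$FamilySize <- combi$SibSp + combi$Parch + 1
  surname <- sapply(strsplit(combi$Name, ","), function(x) x[1])
  combi$FamilyID <- paste0(combi$FamilySize, surname)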
EXAMPLE
 Data Preparation:
 ‘randomForest’ package in R cannot handle missing values

 ‘Age’ has 263 missing values

 Could be replaced by the mean/median of the non-missing values
 Another way is to use a decision tree to predict the missing values
 The tree is built using the ‘rpart’ package
 ‘Embarked’ has missing values in two rows
 Replace them with ‘S’
 Nearly 70% of the passengers embarked at Southampton

 ‘Fare’ has one missing value
 Replace it with the median of the non-missing fare values
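
A sketch of these imputation steps, again assuming the combined data frame combi and the Kaggle column names; depending on how the CSV was read, the missing ‘Embarked’ values may appear as empty strings rather than NA.

  library(rpart)

  # Age: predict the 263 missing values with a regression tree on the other predictors
  age_fit <- rpart(Age ~ Pclass + Sex + SibSp + Parch + Fare + Embarked + Title + FamilySize,
                   data = combi[!is.na(combi$Age), ], method = "anova")
  combi$Age[is.na(combi$Age)] <- predict(age_fit, combi[is.na(combi$Age), ])

  # Embarked: fill the two blanks with the most common port, Southampton
  combi$Embarked[combi$Embarked == ""] <- "S"

  # Fare: fill the single missing value with the median of the non-missing fares
  combi$Fare[is.na(combi$Fare)] <- median(combi$Fare, na.rm = TRUE)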
EXAMPLE
 Data Preparation:

Random Forests in R can only digest factors with up to 32 levels

 FamilyID has a larger number of levels

 Create a new feature called FamilyID2

 Equal to “Small” if FamilySize <= 3, and FamilyID otherwise
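
A short sketch of that feature, continuing with the combi data frame assumed above:

  # Collapse small families into one "Small" level to stay under the 32-level limit
  combi$FamilyID2 <- as.character(combi$FamilyID)
  combi$FamilyID2[combi$FamilySize <= 3] <- "Small"
  combi$FamilyID2 <- factor(combi$FamilyID2)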


EXAMPLE
 Install the ‘randomForest’ package in R
 install.packages('randomForest')

 To obtain the same results every time you run the code, use ‘set.seed’

 Syntax:
 fit <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp +
                       Parch + Fare + Embarked + Title + FamilySize + FamilyID2,
                     data = train,
                     importance = TRUE,  # enables inspection of variable importance
                     ntree = 2000)       # number of trees to grow; default is 500
EXAMPLE

 Inspect variable importance plots

 varImpPlot(fit)

 Will tell us which variables have the highest impact on the predictive
ability of the model

 The variable associated with the largest decrease in the relevant measure (accuracy or Gini) has the highest impact.
EXAMPLE
 ‘Title’ has the strongest impact, in terms of both Accuracy and
Gini Index
 Added features ‘FamilyID2’ and ‘FamilySize’ have substantial impact
EXAMPLE
 Additional parameters for model

randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare +
               Embarked + Title + FamilySize + FamilyID2,
             data = train,
             importance = TRUE,   # enables inspection of variable importance
             ntree = ,            # number of trees to grow; default is 500
             mtry = ,             # number of variables selected at each node;
                                  # default is the square root of the number of variables
             nodesize = ,         # minimum size of terminal nodes; setting this to 'k'
                                  # means that no node with fewer than k cases will be split;
                                  # default is 1 for classification and 5 for regression
             ...)
EXAMPLE
 Performance Evaluation
 Confusion Matrix

                Predicted
 Actual        0      1
   0         491     92
   1          58    250
 Total       549    342

 Accuracy = (491 + 250) / (549 + 342) = 0.8316

 95% CI: (0.8054, 0.8557)
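
One way to produce numbers like these is caret's confusionMatrix function; the call below is a sketch that compares the model's predictions on the training data with the known labels.

  library(caret)

  pred <- predict(fit, newdata = train)              # predicted classes
  confusionMatrix(pred, as.factor(train$Survived))   # prints the matrix, accuracy and 95% CI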
EXAMPLE
 Tuning the Random Forests model

 Objective is to find the best value of mtry (i.e. the number of predictors chosen at each node)

 tunefit <- train(as.factor(Survived) ~ .,
                  data = train1,
                  method = "rf",             # 'rf' stands for random forest
                  metric = "Accuracy",       # what we are trying to improve
                  tuneGrid = data.frame(mtry = c(2, 3, 4)))   # set of values to be considered
EXAMPLE
 Tuning the Random Forests model

 Objective is to find the best value of mtry (i.e. the number of features chosen at each node)

 mtry   Accuracy
   2    0.8236414
   3    0.8310338
   4    0.8302738
EXAMPLE
 Apply model to test data
 Prediction <- predict(tunefit, newdata = test)
 Will give predictions for ‘Survived’ as ‘0’ and ‘1’ values
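
For the Kaggle competition, the predictions are written out next to the passenger IDs; a sketch assuming the test set still carries its PassengerId column:

  submission <- data.frame(PassengerId = test$PassengerId, Survived = Prediction)
  write.csv(submission, file = "rf_submission.csv", row.names = FALSE)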
REFERENCES
 Leo Breiman's Random Forests Page
 An Introduction to Random Forest for Beginners: Salford Systems
 Random Forests Lecture by Nando de Freitas, University of British Columbia
 Random Forests Lecture by Derek Kane
 The Elements of Statistical Learning: Hastie, Tibshirani & Friedman
 Introduction to Statistical Learning: James, Witten, Hastie and Tibshirani
 Trevor Stephens: Titanic Dataset Analysis using R
 Curt Wehrley: Titanic Dataset Analysis using R
 randomForest package in R
 rpart package in R
 caret package in R
 Step-by-Step Decision Tree building
 Machine Learning Benchmarks and Random Forest Regression, Mark Segal
