
A Deep Dive into Random Forests

By: Amit Kumar (EC21B004)
    Anurudh Kumar (EC21B010)
    Abhishek Kumar (EC21B001)
Random Forests May Seem Scary…
But They’re Actually Not Too Bad!
Plan
• Decision Tree
• Random Forests (+ Bagging)
• Tree Optimization and Feature Importance (Gini Criterion)
• Model Regularization
• Closing Notes
Decision Tree
Properties
• A feature can appear more than once in different branches (e.g. windy), as illustrated in the sketch below.
• A node can have both a branch and a leaf stemming from it.
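Not part of the original slides, but as a rough illustration of these properties: a minimal scikit-learn sketch that fits a small tree on a made-up weather dataset and prints its structure (the data and feature names are invented purely for illustration).

# Minimal sketch: fit a small decision tree and print its structure.
# The toy "weather" data below is invented for illustration only.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Columns: outlook (0=sunny, 1=overcast, 2=rain), humidity, windy (0/1)
X = np.array([[0, 85, 0], [0, 90, 1], [1, 78, 0], [2, 96, 0],
              [2, 80, 1], [1, 65, 1], [0, 70, 0], [2, 70, 1]])
y = np.array([0, 0, 1, 1, 0, 1, 1, 0])  # 1 = play, 0 = don't play

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["outlook", "humidity", "windy"]))
# The same feature (e.g. "windy") may show up in several branches,
# and an internal node can lead to both a further split and a leaf.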
Decision Tree – Pros and Cons

Pros:
• Non-linear decision boundaries
• Easy to interpret
• Numerical & Categorical Data
Decision Tree – Pros and Cons

Cons:
• Easy to Overfit
• High Variance (i.e. unstable).
Random Forests
What is Random Forest?
• Random Forest is an ensemble learning method used for classification and regression tasks.
• It builds multiple decision trees and merges them to get a more accurate and stable prediction.
• Key Term: Ensemble Learning – combining multiple models to improve accuracy.
How Does Random Forest Work?
• Random Forest creates multiple decision trees from different subsets of the data.
• Each decision tree makes a prediction.
• For classification: the forest uses majority voting.
• For regression: it averages the outputs of all decision trees (see the sketch below).
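A minimal scikit-learn sketch of this voting/averaging behaviour (my own illustration on synthetic data, not code from the slides):

# Minimal sketch: classification (majority vote) vs. regression (averaging).
import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

Xc, yc = make_classification(n_samples=200, n_features=10, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xc, yc)
print(clf.predict(Xc[:3]))          # majority vote over 100 trees

Xr, yr = make_regression(n_samples=200, n_features=10, random_state=0)
reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(Xr, yr)
print(reg.predict(Xr[:3]))          # average of 100 tree outputs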


Random Forests – Many Decision Trees
• We can guess that a Random Forest = many decision trees. But how?
• Many copies of the exact same tree would be useless…
Random Forests – Many Decision Trees
• OK, so we want some tree variation, but how…
• Vary the trees such that the overall variance is reduced.
• STATS101: Given a set of $n$ independent, uncorrelated observations $X_1, \dots, X_n$, each with variance $\sigma^2$, the variance of their average $\bar{X}$ is $\sigma^2 / n$ (see the formula below for the correlated case).
• This is why a forest of identical trees is useless: identical trees are perfectly correlated, so averaging them does not reduce variance.
• This is also why ensembling many (sufficiently uncorrelated) models together generally improves results.
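A standard extension of this variance argument, not spelled out on the slides: if the trees are positively correlated with pairwise correlation $\rho$, the variance of their average no longer shrinks to zero as more trees are added.

\[
\operatorname{Var}\!\left(\bar{X}\right) = \rho\,\sigma^{2} + \frac{1-\rho}{n}\,\sigma^{2}
\]

As $n$ grows, the second term vanishes but the first does not, so reducing the correlation $\rho$ between trees is exactly what the randomization described next is designed to do.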
Random Forests – Many Decision Trees
• How do we make trees as independent and uncorrelated as possible?
Random Forests – Randomize Data
• Each tree is trained on a random subset of the data, and each tree only sees a random subset of the features (sketched below).
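A rough sketch of what this randomization looks like, written for illustration rather than taken from the slides; the function names (train_forest, predict_forest) and parameters are made up, and features are subsampled per tree to match the framing used on the next slides.

# Minimal sketch of the randomization behind a random forest:
# bootstrap the rows and subsample the features for each tree.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def train_forest(X, y, n_trees=50, max_features=3):
    forest = []
    n_rows, n_cols = X.shape
    for _ in range(n_trees):
        rows = rng.integers(0, n_rows, size=n_rows)                  # bootstrap sample (with replacement)
        cols = rng.choice(n_cols, size=max_features, replace=False)  # random feature subset
        tree = DecisionTreeClassifier().fit(X[np.ix_(rows, cols)], y[rows])
        forest.append((tree, cols))
    return forest

def predict_forest(forest, X):
    votes = np.array([tree.predict(X[:, cols]) for tree, cols in forest])
    return (votes.mean(axis=0) >= 0.5).astype(int)  # majority vote, assumes binary 0/1 labels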
Random Forests – Intuition Check
• What happens if you assign more/less data per tree?
  Less: trees are more uncorrelated, but at some point too little data hurts training.
  More: trees become more correlated, but each individual tree is trained better.
• What happens if you select more/less of the total features per tree? (The sketch below shows the corresponding scikit-learn knobs.)
  Less: trees are more uncorrelated, but at some point many trees become “dead”, i.e. entire trees are fit on unimportant features.
  More: trees become more correlated, but each individual tree is trained better.
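In scikit-learn these two dials correspond roughly to the max_samples and max_features parameters of RandomForestClassifier; a minimal sketch with arbitrarily chosen values:

# Minimal sketch: the "how much data / how many features per tree" dials.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,
    max_samples=0.7,   # each tree sees a bootstrap sample of 70% of the rows
    max_features=5,    # each split considers only 5 of the 20 features
    random_state=0,
).fit(X, y)
print(forest.score(X, y))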
Tree Optimization and Feature Importance
Tree Optimization – Greedy Criterion
• Trees are grown greedily: at each node, the locally best split is chosen.
• Criterion: Gini impurity, Information Gain.
Short Aside – Greedy Algorithm Example
Example: find the largest-sum path. Taking the locally largest branch at each step does not necessarily give the globally best path; this is the trade-off a greedy criterion accepts.
Tree Optimization – Greedy Criterion
• Note: the criterion governing tree growth is different from the global cost function (e.g. precision-recall, accuracy, etc.), which measures how well your entire model is doing.
Tree Optimization – Gini Impurity
“Gini impurity is a measure of how often a randomly chosen element
from the set would be incorrectly labeled if it was randomly labeled
according to the distribution of labels in the subset.”

$$ G_n = 1 - \sum_{i=1}^{C_n} \left( \frac{N_{n,i}}{N_n} \right)^2 $$

where $C_n$ is the number of classes present at node $n$, $N_{n,i}$ is the number of samples of class $i$ at node $n$, and $N_n$ is the total number of samples at node $n$.
Tree Optimization – Gini Impurity Example
Splits are chosen such that the (weighted) Gini impurity of the resulting child nodes is locally minimized; a small computational sketch follows.
Model Regularization
Model Regularization
• The main complexity parameter is the max_depth of the tree (see the sketch below).
• Deep trees can split the data up more, leading to overfitting.
• In the deep example tree, some nodes contain only a single sample!
Model Regularization
[Figure: a deep, unregularized tree vs. a shallower, regularized tree]

Advantages of Random Forest
1. High Accuracy:
Random Forests usually provide more accurate predictions than a single decision tree because they combine the outputs of many trees, reducing overfitting.

2. Handles Both Categorical and Numerical Data:
Random Forests can work with both types of data, making them versatile in various applications.

3. Robust to Overfitting:
While a single decision tree may overfit to noisy data, Random Forests reduce this risk by averaging the predictions of multiple trees, which smooths out irregularities.

4. Handles Missing Data Well:
Random Forests can handle missing data through the use of surrogate splits, which allows the model to still make predictions when some feature values are missing.
Disadvantages of Random Forest
1. Computationally Intensive:
Training a Random Forest can be computationally expensive, especially when there are a large number of trees or when the dataset is very large.

2. Slower Prediction Time:
Because it averages the output of many trees, a Random Forest may take longer to make predictions than simpler models like logistic regression or individual decision trees.

3. Not Suitable for Real-Time Predictions:
Due to its complexity and slower prediction time, Random Forests may not be the best choice for real-time or low-latency applications.

4. Less Effective for Small Datasets:
If the dataset is small, Random Forests may not perform significantly better than simpler algorithms, as the gain from ensembling multiple trees is minimal when there isn't much data to work with.
Applications of Random Forest
• Classification: Email spam detection, image classification, sentiment analysis.
• Regression: Stock market prediction, weather forecasting, medical diagnosis.
• Feature Selection: Identifying important features in datasets.
Conclusion
• Random Forest is a powerful algorithm for classification and
regression.

• It is robust to overfitting and works well with diverse datasets.

• However, it can be resource-intensive and difficult to interpret.


The End