Decision Trees (DT), Random Forest (RF), and XGBoost (XGB)
Machine Learning Chapter
Decision Tree
• A decision tree is a versatile non-
parametric algorithm used for both
classification and regression tasks. It
features a hierarchical structure with a
root node, branches, internal nodes,
and leaf nodes. This tree-like model is
employed in decision support systems,
depicting decisions and outcomes
based on conditional control
statements. Its straightforward
structure makes it easy to understand,
and it finds applications in diverse
areas for tasks such as classification
and regression, using feature-based
splits to guide predictions from the
root to the leaves.
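To make this concrete, below is a minimal sketch (assuming scikit-learn is available) that fits a small decision tree classifier on the Iris dataset and prints its hierarchy of feature-based splits from the root down to the leaves; the dataset and parameter choices are illustrative, not part of the original slides.
```python
# Minimal sketch: fit a shallow decision tree and print its learned splits.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# max_depth=3 keeps the hierarchical structure small enough to read.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)

# export_text renders the root node, internal decision nodes, and leaf nodes
# as nested conditions on the features.
print(export_text(tree, feature_names=load_iris().feature_names))
```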
Decision Tree Terminologies
• Root Node: The initial node at the beginning of a decision tree, where the entire
population or dataset starts dividing based on various features or conditions.
Decision Tree Terminologies
• Decision Nodes: Nodes resulting from the splitting of root nodes are known as
decision nodes. These nodes represent intermediate decisions or conditions within
the tree.
Decision Tree Terminologies
• Leaf Nodes: Nodes where further splitting is not possible, often indicating the final
classification or outcome. Leaf nodes are also referred to as terminal nodes.
Decision Tree Terminologies
• Sub-Tree: Similar to a subsection of a graph being called a sub-graph, a sub-section
of a decision tree is referred to as a sub-tree. It represents a specific portion of the
decision tree.
Decision Tree Terminologies
• Pruning: The process of removing or cutting down specific nodes in a decision tree
to prevent overfitting and simplify the model.
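As a hedged illustration of pruning (not from the slides), the sketch below uses scikit-learn's cost-complexity pruning parameter ccp_alpha to cut back a fully grown tree; larger values remove more nodes and yield a simpler, less overfit model. The dataset and the alpha value are illustrative.
```python
# Illustrative sketch: post-pruning a decision tree via cost-complexity pruning.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

unpruned = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

print("nodes before pruning:", unpruned.tree_.node_count)
print("nodes after pruning: ", pruned.tree_.node_count)
print("test accuracy (pruned):", pruned.score(X_test, y_test))
```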
Decision Tree Terminologies
• Parent and Child Node: In a decision tree, a node that is divided into sub-nodes is known as a parent
node, and the sub-nodes emerging from it are referred to as child nodes. The parent node represents a
decision or condition, while the child nodes represent the potential outcomes or further decisions based
on that condition.
Entropy
• Entropy is a measure of the uncertainty, or disorder, in our dataset.
• The formula for Entropy is shown below:
E(S) = −p₊ log₂(p₊) − p₋ log₂(p₋)
Here,
• p₊ is the probability of the positive class
• p₋ is the probability of the negative class
• S is the subset of the training examples
Decision Trees use Entropy!
• Entropy measures the impurity of a node. Impurity is the degree of randomness in the data: a pure node contains examples of a single class (every example is “yes” or every example is “no”) and has zero entropy, while a node split evenly between the classes has maximum entropy.
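As a small sketch of the entropy formula above (assuming NumPy is installed), the hypothetical entropy() helper below computes E(S) for a binary “yes”/“no” label column.
```python
# Small sketch of the entropy formula; entropy() is illustrative, not a library function.
import numpy as np

def entropy(labels):
    """E(S) = -p+ log2(p+) - p- log2(p-), generalized to any number of classes."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(entropy(["yes", "yes", "no", "no"]))    # 1.0 -> maximally impure node
print(entropy(["yes", "yes", "yes", "yes"]))  # 0.0 -> pure node
print(entropy(["yes", "yes", "yes", "no"]))   # ~0.81
```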
Information Gain
• Information gain measures the reduction of uncertainty given some feature and it is
also a deciding factor for which attribute should be selected as a decision node or
root node.
• It is simply the entropy of the full dataset minus the entropy of the dataset given some feature:
IG = E(Y) − E(Y | X)
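Continuing the sketch, the hypothetical information_gain() helper below computes IG = E(Y) − E(Y | X) for one categorical feature; the tiny “outlook”/“play” toy data is purely illustrative.
```python
# Sketch of IG = E(Y) - E(Y | X); entropy() and information_gain() are
# hypothetical helpers written for this example only.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    feature, labels = np.asarray(feature), np.asarray(labels)
    # E(Y | X): entropy of the labels within each feature value, weighted by frequency.
    conditional = sum(
        (feature == v).mean() * entropy(labels[feature == v])
        for v in np.unique(feature)
    )
    return entropy(labels) - conditional

# Toy data: the feature with the largest information gain becomes the split node.
outlook = ["sunny", "sunny", "rain", "rain", "overcast", "overcast"]
play    = ["no",    "no",    "yes",  "no",   "yes",      "yes"]
print(information_gain(outlook, play))  # ~0.67
```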
Random Forest
• The Random Forest algorithm's widespread popularity stems from its user-friendly nature and adaptability, enabling it to tackle both classification and regression problems effectively. The algorithm's strength lies in its ability to handle complex datasets and mitigate overfitting, making it a valuable tool for various predictive tasks in machine learning.
Random Forest Understanding
• Let’s use a real-life analogy to understand this concept. A student named X wants to choose a course after college, and he is confused about the choice given his skill set. So he decides to consult various people, such as his cousins, teachers, parents, degree students, and working professionals. He asks them varied questions, such as why he should choose a particular course, the job opportunities it offers, the course fee, and so on. Finally, after consulting various people, he decides to take the course suggested by most of them.
Ensemble Learning Technique
Ensemble simply means combining multiple models: a collection of models is used to make predictions rather than an individual model.
• Ensemble learning uses two types of methods:
• Bagging: creates different training subsets from the sample training data with replacement; the final output is based on majority voting. Example: Random Forest.
• Boosting: combines weak learners into strong learners by building models sequentially, such that the final model has the highest accuracy. Examples: AdaBoost, XGBoost.
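As a hedged comparison of the two methods, the sketch below (assuming scikit-learn is installed) cross-validates a bagging ensemble and a boosting ensemble of decision trees; the dataset and hyperparameters are illustrative.
```python
# Bagging vs. boosting: BaggingClassifier resamples the training data with
# replacement, while AdaBoostClassifier fits weak learners sequentially.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

print("bagging :", cross_val_score(bagging, X, y, cv=5).mean())
print("boosting:", cross_val_score(boosting, X, y, cv=5).mean())
```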
Random Forest Algorithm
• Step 1: In the Random Forest model, a subset of data points and a subset of features are selected for constructing each decision tree. Simply put, n random records and m features are taken from a data set having k records.
• Step 2: Individual decision trees are
constructed for each sample.
• Step 3: Each decision tree will generate an
output.
• Step 4: The final output is obtained by majority voting (for classification) or averaging (for regression), as sketched below.
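The sketch below (assuming scikit-learn is installed) maps these four steps onto RandomForestClassifier; for regression, RandomForestRegressor would average the trees' outputs instead of voting. The dataset and hyperparameters are illustrative.
```python
# Minimal Random Forest sketch following the four steps above.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # number of individual decision trees (Step 2)
    max_features="sqrt",   # m features drawn at random for each split (Step 1)
    bootstrap=True,        # n records sampled with replacement (Step 1)
    random_state=0,
)
forest.fit(X_train, y_train)                            # each tree produces an output (Step 3)
print("test accuracy:", forest.score(X_test, y_test))   # majority vote over trees (Step 4)
```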
Difference Between Decision Tree
and Random Forest
Decision Tree
• Normally suffers from overfitting if it is allowed to grow without any control.
• A single decision tree is faster in computation.
• When a data set with features is taken as input, it formulates a single set of rules to make predictions.
Random Forest
• Built from random subsets of the data; the final output is based on averaging or majority voting, so the problem of overfitting is taken care of.
• Comparatively slower in computation.
• Randomly selects observations and features, builds multiple decision trees, and averages (or votes on) their results; it does not rely on a single fixed set of rules.
XGBoost Algorithm
• XGBoost, a potent algorithm, excels in scalability, facilitating swift learning through parallel and distributed computing while ensuring efficient memory utilization. CERN recognized it as the optimal approach for classifying signals from the Large Hadron Collider: faced with the challenge of processing 3 petabytes of data annually, XGBoost emerged as the most effective and robust solution, adept at distinguishing extremely rare signals from background noise in complex physical processes.
Gradient Boosting
• Gradient Boosting, including algorithms like XGBoost, has proven to be a powerful and versatile machine learning technique with several advantages:
1. High predictive accuracy: it builds a strong predictive model by combining the predictions of multiple weak learners (typically decision trees).
2. Handling nonlinear relationships: it is capable of capturing complex, nonlinear relationships in the data, making it suitable for a wide range of applications.
3. Flexibility: it can be used for both regression and classification problems, making it a versatile choice for different types of tasks.
Further advantages include feature importance estimates, robustness to overfitting, parallelization, and ensemble learning.
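As a minimal sketch of gradient boosting in practice (assuming the xgboost package is installed), the example below trains an XGBoost classifier, i.e. a sequentially built ensemble of shallow decision trees; the dataset and hyperparameters are illustrative.
```python
# Gradient boosting sketch using XGBoost's scikit-learn-style API.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBClassifier(
    n_estimators=200,     # number of sequentially added weak learners
    max_depth=3,          # each weak learner is a shallow tree
    learning_rate=0.1,    # shrinks each tree's contribution
    random_state=0,
)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```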
Unique Features of XGBoost
• Regularization: XGBoost can penalize complex models through both L1 and L2 regularization, which helps prevent overfitting.
• Handling sparse data: missing values or preprocessing steps like one-hot encoding make data sparse. XGBoost incorporates a sparsity-aware split-finding algorithm to handle different sparsity patterns in the data.
• Weighted quantile sketch: most existing tree-based algorithms can only find split points when the data points have equal weights (using a quantile sketch algorithm); XGBoost's weighted quantile sketch handles weighted data as well.
• Out-of-core computing: this feature optimizes available disk space and maximizes its usage when handling huge datasets that do not fit into memory.
• Cache awareness: fetching gradient statistics by row index requires non-contiguous memory access, so XGBoost has been designed to make optimal use of the hardware cache.
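To illustrate two of these features (an illustrative sketch, assuming the xgboost package is installed), the example below sets the L1/L2 regularization parameters reg_alpha and reg_lambda and trains directly on data containing NaN values, which the sparsity-aware split finding handles without imputation; the dataset and parameter values are illustrative.
```python
# Regularization and native missing-value handling in XGBoost.
import numpy as np
from sklearn.datasets import load_breast_cancer
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)

# Randomly blank out 10% of the entries to simulate sparse / missing data.
rng = np.random.default_rng(0)
X = X.copy()
X[rng.random(X.shape) < 0.10] = np.nan

model = XGBClassifier(
    reg_alpha=0.5,    # L1 penalty on leaf weights
    reg_lambda=1.0,   # L2 penalty on leaf weights
    missing=np.nan,   # value treated as "missing" by the sparsity-aware splits
    random_state=0,
)
model.fit(X, y)       # no imputation step is needed
print("training accuracy:", model.score(X, y))
```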
Session Finished
Thank You!
MACHINFY EDUCATION TEAM