Chapter 5. Decision Trees

Chapter 5 discusses decision trees (DT), a non-parametric supervised machine learning algorithm used for classification and regression tasks. It covers the structure, advantages, disadvantages, and applications of DT in various fields, as well as the algorithm's workings, including how to find the best split using GINI Impurity. The chapter also includes practical exercises for implementing DT in data analysis.


CHAPTER 5.

DECISION TREES
INTERMEDIATE ECONOMETRICS & DATA ANALYSIS
CHAPTER 5. PLAN

PART I. ABOUT DT
A. What is DT?
B. Why use DT?
C. When to use DT?

PART II. DT ALGORITHM
D. How DT works?
E. DT Calculations

PART III. BEFORE CHOOSING DT
F. DT Advantages
G. DT Disadvantages

PART IV. IN-CLASS PRACTICE
1. Import Libraries
2. Data Import
3. Data Preparation
4. Data Transformation
5. Data Splitting
6. Model Building
7. Performance Measure
8. Tree Visualization

PART V. AT-HOME PRACTICE
• “wine.csv”
ABOUT DT
PART I
A. WHAT IS DT?
(1/2)

• A decision tree (DT) is a non-parametric supervised machine learning algorithm used for classification and regression tasks.

• It is often referred to as a classification tree when used for classification problems and a regression tree when used for regression problems.
A. WHAT IS DT?
(2/2)

• Decision trees can handle both numeric and categorical features. They can have various shapes:
  ◦ Deep tree: Characterized by extended nodes, often with many levels.
  ◦ Bushy tree: Features a broad structure with spreading branches.
  ◦ Balanced tree: Maintains a consistent number of branches and nodes at each level.

• Decision trees can be categorized based on their purpose:
  ◦ A classification tree: Used to predict categorical outcomes (i.e., classes).
  ◦ A regression tree: Used to predict continuous numerical values.


B. WHY USE DT?

• Decision trees can be applied to both regression and classification tasks. However, in the IEDA course, we will focus exclusively on their use for classification.

• Additionally, due to their tree structure, decision trees provide straightforward feature visualization and simplify the decision-making process.

C. WHEN TO USE DT?
(1/3)

1. IN BUSINESS

• In business, decision trees can assist in the decision-making process by


analyzing the potential consequences of significant managerial decisions.
C. WHEN TO USE DT?
(2/3)

2. IN LAW

• In law, decision trees can be used to evaluate the various financial outcomes
that may arise from litigation.
C. WHEN TO USE DT?
(3/3)

3. IN VIDEO GAMES

• In video games, decision trees enable players to shape their own story or
outcome by selecting the options they believe are best.
DT ALGORITHM
PART II
A. HOW DT WORKS?
(1/6)

• A decision tree is typically constructed with the root at the top or on the left
side, depending on the orientation. If the branches are not labeled, the left
branch is generally assumed to represent “true”, while the right branch
represents “false”.
[Diagram: anatomy of a decision tree]
• ROOT NODE: the tree’s root.
• INTERNAL NODES: outcomes of previous decisions or tests on features.
• LEAF NODES: final decision or outcome => no more branches.
• BRANCHES: connect the nodes at each level.
A. HOW DT WORKS?
(2/6)

1. The root node initiates the


decision tree. It uses the feature
that best splits the data and
represents the initial decision to
be made (such as determining the
type of animal in this example).
A. HOW DT WORKS?
(3/6)

2. Internal nodes are positioned in


the middle of the decision tree.
They neither start nor end the tree
but represent tests on features
(such as whether the animal is
short or tall, or has a long or short
neck).
A. HOW DT WORKS?
(4/6)

3. Leaf nodes conclude the decision


tree. They are not followed by
additional branches and represent
the final outcome of a decision
path (such as “might be an
elephant” or “might be a rat”).
A. HOW DT WORKS?
(5/6)

“Growing a tree involves deciding on which features to choose and what


conditions to use for splitting, along with knowing when to stop. As a tree
generally grows arbitrarily, you will need to trim it down for it to look
beautiful.” *
A. HOW DT WORKS?
(6/6)

• In other words, to build a decision tree, we need to address the following


questions:
◦ Which features should be used?

◦ In what order should the selected features be considered?

◦ How should continuous features be handled?

◦ When should we stop growing the tree?

• Therefore, to ensure that our decision tree is optimal, we must:


1. Choose the best possible split (addressing the first three questions).

2. Decide when to stop splitting (addressing the last question).


B. DT CALCULATIONS
1. BEST SPLIT
(1/8)

• Finding the best split requires constructing and comparing multiple decision
trees to determine the most effective one.

• To evaluate which tree is the best, we use the “GINI Impurity”, also known as
the “GINI Index”.

• This index measures the probability that a randomly selected observation will
be incorrectly classified.
1. BEST SPLIT
(2/8)

IMPURITY
• Impurity refers to the degree of heterogeneity within a node.
• An impure node contains cases distributed across more than one branch (as illustrated in the example on slide 24).
• The greater the heterogeneity of a node, the more challenging the classification becomes, which results in a less accurate model.

PURITY
• Purity refers to homogeneity within a node.
• A pure node has all cases classified into a single branch or all cases belonging to a single class (as shown in the example on slide 24).
• A homogeneous node simplifies classification and enhances the model’s performance.
1. BEST SPLIT
(3/8)

GINI INDEX COMPUTATION
• The GINI Impurity ranges from 0 to 1, where 0 indicates a perfect split (i.e., the best possible tree) and 1 indicates the worst.
• The index is calculated by subtracting the sum of the squared proportions of each class from one:

  GI = 1 − (p₁² + p₂² + … + pₖ²), where pᵢ is the proportion of cases in class i.

INTERPRETATION
• A GINI Impurity of “0” indicates a pure node, meaning either only one class is present, or all cases belong to a single branch of the node.
• A GINI Impurity of “0.5” suggests an equal distribution of cases between branches in a binary classification.
• A GINI Impurity of “1” indicates maximum impurity.
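• To make the formula above concrete, here is a minimal sketch (not taken from the course materials) of the GINI Impurity computation in Python:

```python
# Minimal sketch of the GINI Impurity formula: GI = 1 - sum of squared class proportions.
from collections import Counter

def gini_impurity(labels):
    """Return the GINI Impurity of the class labels in one node."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

# Example: a node with 2 "YES" and 1 "NO" -> 1 - ((2/3)^2 + (1/3)^2) ≈ 0.44
print(round(gini_impurity(["YES", "YES", "NO"]), 2))  # 0.44
```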
1. BEST SPLIT
(4/8)

• The first step in finding the best split is to identify the optimal root node. The root node uses the feature that best divides the data, which can be either quantitative or qualitative.

QUANTITATIVE FEATURE
1. Sort the values from smallest to largest;
2. Calculate the average of each pair of consecutive values;
3. Build multiple decision trees using these averages as potential root nodes;
4. Use the GINI Impurity to determine the best split.

QUALITATIVE FEATURE
1. Use each qualitative feature to build a decision tree;
2. Use the GINI Impurity to determine the best split.
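• As an illustration of the quantitative-feature procedure above, the sketch below (an assumed implementation, reusing gini_impurity() from the previous sketch and the chapter’s small GPA example) lists the candidate thresholds and their weighted GINI Impurities:

```python
# Sort the values, take the midpoints of consecutive pairs as candidate splits,
# and score each split with the weighted GINI Impurity.
gpa        = [3.0, 3.1, 3.2, 3.5, 3.8, 4.0]
likes_math = ["NO", "YES", "YES", "NO", "NO", "YES"]

def weighted_gini(threshold, x, y):
    left  = [yi for xi, yi in zip(x, y) if xi < threshold]
    right = [yi for xi, yi in zip(x, y) if xi >= threshold]
    n = len(y)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

values = sorted(gpa)
candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]  # 3.05, 3.15, 3.35, 3.65, 3.9
for t in candidates:
    print(f"GPA < {t:.2f}: weighted GI = {weighted_gini(t, gpa, likes_math):.2f}")
```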
1. BEST SPLIT
(5/8)

1. QUALITATIVE FEATURE

GENDER | GPA | LIKES MATH
  M    | 4.0 | YES
  F    | 3.2 | YES
  F    | 3.5 | NO
  M    | 3.8 | NO
  F    | 3.0 | NO
  F    | 3.1 | YES

[Tree: GENDER as root node]
• Female branch -> LIKES MATH: 2 Yes, 2 No
• Male branch   -> LIKES MATH: 1 Yes, 1 No
1. BEST SPLIT
(6/8)

2. QUANTITATIVE FEATURE

• Sort the GPA values and compute the average of each pair of consecutive values: 3.05, 3.15, 3.35, 3.65, 3.9.

GENDER | GPA | LIKES MATH
  F    | 3.0 | NO
  F    | 3.1 | YES
  F    | 3.2 | YES
  F    | 3.5 | NO
  M    | 3.8 | NO
  M    | 4.0 | YES

[Tree: GPA < 3.05 as root node]
• True branch  -> LIKES MATH: 0 Yes, 1 No (PURE NODE)
• False branch -> LIKES MATH: 3 Yes, 2 No

[Tree: GPA < 3.15 as root node]
• True branch  -> LIKES MATH: 1 Yes, 1 No (IMPURE NODES)
• False branch -> LIKES MATH: 2 Yes, 2 No

• Repeat this process for all other calculated means (3.35, 3.65, 3.9).
1. BEST SPLIT
(7/9)

2. QUANTITATIVE FEATURE (GPA)
• GPA < 3.05: Weighted GI = (1/6 × 0) + (5/6 × 0.48) = 0.40
• GPA < 3.15: Weighted GI = (2/6 × 0.5) + (4/6 × 0.5) = 0.50
• GPA < 3.35: Weighted GI = (3/6 × 0.44) + (3/6 × 0.44) = 0.44
• GPA < 3.65: Weighted GI = (4/6 × 0.5) + (2/6 × 0.5) = 0.50
• GPA < 3.9:  Weighted GI = (5/6 × 0.48) + (1/6 × 0) = 0.40

1. QUALITATIVE FEATURE (GENDER)
• GENDER: Weighted GI = (4/6 × 0.5) + (2/6 × 0.5) = 0.50

• Candidate root nodes: GPA < 3.05 (0.40), GENDER (0.50) and GPA < 3.9 (0.40). The split with the lowest weighted GINI Impurity makes the best root node.
1. BEST SPLIT
(8/9)

GENDER | GPA | LIKES MATH
  F    | 3.0 | NO
  F    | 3.1 | YES
  F    | 3.2 | YES
  F    | 3.5 | NO
  M    | 3.8 | NO
  M    | 4.0 | YES
  M    | 3.7 | ?
  F    | 2.9 | ?
  F    | 3.3 | ?

[Tree: GPA < 3.05 as root node, GENDER tested on each branch]
• True branch -> GENDER:
  ◦ Female -> LIKES MATH: 0 Yes, 1 No => GI = 1 − ((0/1)² + (1/1)²) = 0 (PURE NODE)
  ◦ Male   -> LIKES MATH: 0 Yes, 0 No => empty (UNNECESSARY NODE)
• False branch -> GENDER:
  ◦ Female -> LIKES MATH: 2 Yes, 1 No => GI = 1 − ((2/3)² + (1/3)²) = 0.44 (IMPURE NODE)
  ◦ Male   -> LIKES MATH: 1 Yes, 1 No => GI = 1 − ((1/2)² + (1/2)²) = 0.5 (IMPURE NODE)

• Weighted GI = (1/6 × 0) + (3/6 × 0.44) + (2/6 × 0.5) = 0.39 => Best split, but not ideal.
1. BEST SPLIT
(9/9)

[Tree: GPA < 3.05 as root node; the unnecessary GENDER node on the pure branch has been removed]
• True branch  -> leaf: 0 Yes, 1 No (pure node)
• False branch -> GENDER:
  ◦ Female -> LIKES MATH: 2 Yes, 1 No
  ◦ Male   -> LIKES MATH: 1 Yes, 1 No

Classifying the new students with this tree:

• M, GPA 3.7 -> Male leaf (1 Yes, 1 No) => TIE. With equal chances (50%) that the new student either likes or dislikes math, we cannot make a clear classification. This illustrates what we referred to earlier as an imperfect model, where impurities remain high.

• F, GPA 3.9 -> Female leaf (2 Yes, 1 No) => YES. Since most students in this node (2 out of 3) like math, we can infer that the new student is likely to like it as well.

• F, GPA 2.9 -> pure leaf (0 Yes, 1 No) => NO. This new student does not like math (pure node).

• However, having only one case per leaf is not sufficient to generalize classification predictions. Therefore, to avoid overfitting, we need to determine when to stop splitting the tree.
2. STOPPING CRITERIA
(1/4)

BEST-CASE SCENARIO
• We stop splitting when we achieve 100% purity with a sufficient number of cases in each leaf.

COMMON SCENARIO
• Not all leaves will achieve 100% purity.
• Therefore, we need to determine when to stop splitting the tree.
• In other words, we should decide when to stop adding new nodes.
2. STOPPING CRITERIA
(2/4)

• One method to limit the growth of the tree is to set constraints such as a:
  ◦ Maximum depth: Define the maximum length of the path from the root to any leaf, thereby limiting the number of splits and features used in the decision tree.
  ◦ Minimum number of cases per leaf: Define a threshold for the minimum number of cases required in a leaf.

• To determine these limits, you can (see the sketch below):
  1. Test various values to find the optimal ones;
  2. Use cross-validation to assess performance;
  3. Select the best model based on performance measures.
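• A minimal sketch of this tuning loop, assuming scikit-learn and a stand-in dataset (the iris data is used here only for illustration):

```python
# Tune max_depth and min_samples_leaf with 5-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in data for illustration

param_grid = {
    "max_depth": [2, 3, 4, 5, None],     # maximum depth of the tree
    "min_samples_leaf": [1, 5, 10, 20],  # minimum number of cases per leaf
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```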


2. STOPPING CRITERIA
(3/4)

• Another method is “pruning”, which involves trimming unnecessary branches


that do not significantly contribute to the decision tree, as demonstrated on
slide 28. Removing these branches can enhance the tree’s performance.

• There are two types of pruning:


• Pre-pruning (or forward pruning): This technique prevents the decision tree from
growing beyond a certain point. It involves making decisions during the tree-
building process.

• Post-pruning (or backward pruning): This technique involves removing branches


after the decision tree has been fully grown.
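• As a sketch of post-pruning with scikit-learn (an assumed workflow, again on a stand-in dataset), cost-complexity pruning grows the full tree and then trims branches whose contribution to the fit is small:

```python
# Post-pruning via minimal cost-complexity pruning (ccp_alpha).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in data for illustration
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Candidate alphas computed from the fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Keep the pruned tree that scores best on held-out data.
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda tree: tree.score(X_test, y_test),
)
print("leaves after pruning:", best.get_n_leaves())
```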
2. STOPPING CRITERIA
(4/4)

• For the IEDA course, we will use Scikit-Learn’s default criteria to construct the best decision tree possible, with the exception of setting “max_depth = None”.

• Here are some important criteria:
  ◦ The GINI Impurity to measure the quality of splits;
  ◦ “splitter = best” to choose the best available split at each node (rather than a random one);
  ◦ “max_depth = int” (e.g., 5) to limit the depth of the tree and prevent it from growing until all nodes are pure, which helps avoid overfitting.
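• A brief sketch of a classifier configured with these criteria (assumed code, mirroring the settings named above, which are also scikit-learn’s defaults):

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    criterion="gini",  # GINI Impurity to measure split quality
    splitter="best",   # choose the best split at each node
    max_depth=None,    # grow until leaves are pure (set an int, e.g. 5, to limit depth)
    random_state=0,
)
```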
BEFORE CHOOSING DT
PART III
BEFORE CHOOSING DT

ADVANTAGES
• Easy to understand, visualize and interpret.
• Minimal data preparation required (e.g., handles missing values well).
• Applicable for both classification and regression tasks.
• Capable of handling both numerical and categorical data.

DISADVANTAGES
• Requires balanced data; can produce biased results if some classes dominate.
• Unstable, as small changes in the data can lead to entirely different trees.
• Prone to overfitting.
• Poor at generalization (i.e., inference).
https://www.youtube.com/watch?v=_L39rN6gz7Y
IN-CLASS PRACTICE
PART IV
1. IMPORT LIBRARIES
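The slide’s code is not reproduced in this export; a typical set of imports for the workflow in this chapter might look like the following sketch (assuming pandas, scikit-learn and matplotlib):

```python
# Hedged sketch: typical imports for the DT workflow outlined in this chapter.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
```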
2. DATA IMPORT
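A sketch of the data-import step; the file name “students.csv” is a hypothetical placeholder, since the in-class dataset is not named in this export:

```python
# Load the in-class dataset (file name is a placeholder) and take a first look.
df = pd.read_csv("students.csv")
print(df.head())
print(df.info())
```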
3. DATA PREPARATION
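A sketch of basic preparation steps, continuing from the import above (which checks apply depends on the actual dataset):

```python
# Inspect and handle missing values, and remove duplicated rows.
print(df.isna().sum())     # count missing values per column
df = df.drop_duplicates()  # remove duplicated rows
df = df.dropna()           # or impute, depending on the dataset
```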
4. DATA TRANSFORMATION
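A sketch of a possible transformation step: encoding categorical columns as numbers so the tree can use them. The column names “GENDER” and “LIKES_MATH” mirror the chapter’s toy example and are assumptions about the real data:

```python
# Encode categorical columns as integers.
le = LabelEncoder()
df["GENDER"] = le.fit_transform(df["GENDER"])          # e.g., F/M -> 0/1
df["LIKES_MATH"] = le.fit_transform(df["LIKES_MATH"])  # e.g., NO/YES -> 0/1
```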
5. DATA SPLITTING
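A sketch of the splitting step (the target column name is an assumption, continuing the toy example):

```python
# Separate features from the target and split into training and test sets.
X = df.drop(columns=["LIKES_MATH"])
y = df["LIKES_MATH"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
```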
6. MODEL BUILDING
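A sketch of the model-building step, using the criteria discussed in Part II:

```python
# Fit a classification tree on the training data.
dt = DecisionTreeClassifier(criterion="gini", splitter="best", max_depth=None, random_state=0)
dt.fit(X_train, y_train)
```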
7. PERFORMANCE MEASURE
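A sketch of the evaluation step on the held-out test set:

```python
# Evaluate the fitted tree on the test set.
y_pred = dt.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```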
8. DT VISUALIZATION
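A sketch of the visualization step with scikit-learn’s plot_tree (the class names assume the encoded toy example):

```python
# Draw the fitted tree.
plt.figure(figsize=(12, 6))
plot_tree(dt, feature_names=list(X.columns), class_names=["NO", "YES"], filled=True)
plt.show()
```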
https://www.wooclap.com/
AT-HOME PRACTICE
PART V
CHAPTER 5 HOMEWORK

1. Import libraries as needed;

2. Import the “wine.csv” dataset;

3. Prepare the data;

4. Transform the data if necessary;

5. Split the data;

6. Build the DT model;

7. Evaluate the model performance;

8. Plot the DT.
