04 - Decision Tree Learning

The document discusses Decision Tree Learning, a method for approximating both discrete and continuous target functions, primarily used for classification. It outlines the structure of decision trees, including their representation as tree-like structures and if-then rules, and introduces key concepts such as the ID3 learning algorithm, entropy, and information gain. Additionally, it highlights when to consider decision trees based on the nature of the instances and target functions.


Decision Tree Learning¹

MA325 - Machine Learning

A. Senthil Thilak

Department of Mathematical and Computational Sciences
National Institute of Technology Karnataka

¹Topics from reference (Mitchell, 1997): Chapter 3
Decision Tree Learning

Introduction:
• A method used for approximating both Discrete-valued target functions
  (Classification) and Continuous-valued target functions (Regression);
• But most commonly used for classification, in which the learned
  function is represented as a tree-like structure called a “Decision tree”.
• Other representations of a decision tree: If-then rules (improve human
  readability), Disjunction of Conjunctions.

Topics to be learnt:
• Decision tree representation
• ID3 learning algorithm
• Entropy, Information gain
• Issues in DT learning
• Overfitting
• Pruning strategies

Decision tree (DT) representation

• A DT uses a top-down greedy approach and classifies instances by sorting them down
  the tree, from the root to some leaf node, which provides the classification of the
  instance.
• Each internal node tests an attribute.
• Each branch corresponds to a value of the attribute tested at the node from which it
  branches out.
• Each leaf node assigns a classification.

Table 3.2: Training examples for the target concept “PlayTennis” (Mitchell, 1997)

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

Figure: A DT for the target concept “PlayTennis”

Outlook
├─ Sunny    → Humidity
│              ├─ High   → No
│              └─ Normal → Yes
├─ Overcast → Yes
└─ Rain     → Wind
               ├─ Strong → No
               └─ Weak   → Yes
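
To make this representation concrete, here is a minimal Python sketch (my own illustration, not from the slides): it encodes the PlayTennis tree above as nested dictionaries and sorts an instance down it. The names `tree` and `classify` and the dict-based format are illustrative assumptions.

```python
# A decision tree as nested dicts: internal node = {attribute: {value: subtree}},
# leaf = class label. This mirrors the PlayTennis tree in the figure above.
tree = {
    "Outlook": {
        "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def classify(node, instance):
    """Sort an instance down the tree from the root to a leaf."""
    while isinstance(node, dict):                    # internal node: test an attribute
        attribute = next(iter(node))
        node = node[attribute][instance[attribute]]  # follow the matching branch
    return node                                      # leaf: the classification

# Example: D1 = <Sunny, Hot, High, Weak> should be classified "No".
print(classify(tree, {"Outlook": "Sunny", "Temperature": "Hot",
                      "Humidity": "High", "Wind": "Weak"}))  # -> No
```

Representing internal nodes as single-key dicts keeps the sketch short; a class-based node type would be the more usual choice in production code.
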
When to Consider Decision Trees?

• Instances describable by attribute-value pairs
• Target function is discrete-valued
• Disjunctive hypothesis may be required
• Possibly noisy training data or missing data

Examples
• Equipment or medical diagnosis
• Credit risk analysis
• Modeling calendar scheduling preferences

Top-Down Induction of Decision Trees - Basic ID3 algorithm

ID3(D, TA, A)
(D → training examples; TA → the target attribute whose classification/value is to be
predicted by the DT; A → set of other attributes that may be tested by the learned DT)
1 Create a Root node for the tree
2 If all examples are positive, return the single-node tree Root, with label = +
3 If all examples are negative, return the single-node tree Root, with label = −
4 If A is empty, return the single-node tree Root, with label = most common value of TA in D
5 Else B ← the “best” decision attribute in A for the next node (assign B as the decision
  attribute for the Root node)
6 For each value of B, create a new descendant of the node
7 Sort the training examples to the leaf nodes
8 If the training examples are perfectly classified, then STOP; else iterate over the new
  leaf nodes (a Python sketch of this procedure follows)
9 End
10 Return Root
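
A minimal, self-contained Python sketch of this procedure, assuming examples are dicts mapping attribute names to values with the class stored under a target key; the helpers `entropy` and `information_gain` anticipate the definitions on the following slides, and all names here are my own.

```python
from collections import Counter
from math import log2

def entropy(examples, target):
    """Entropy of the class distribution of `examples` (defined on later slides)."""
    counts = Counter(ex[target] for ex in examples)
    total = len(examples)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def information_gain(examples, attr, target):
    """Expected reduction in entropy from splitting on `attr`."""
    total = len(examples)
    remainder = 0.0
    for value in {ex[attr] for ex in examples}:
        subset = [ex for ex in examples if ex[attr] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, target, attributes):
    """Return a tree as nested dicts {attr: {value: subtree}}; leaves are labels."""
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:                  # steps 2-3: all positive / all negative
        return labels[0]
    if not attributes:                         # step 4: no attributes left to test
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes,                     # step 5: "best" attribute by gain
               key=lambda a: information_gain(examples, a, target))
    node = {best: {}}
    for value in {ex[best] for ex in examples}:          # step 6: one branch per value
        subset = [ex for ex in examples if ex[best] == value]  # step 7: sort examples
        node[best][value] = id3(subset, target,
                                [a for a in attributes if a != best])
    return node
```

On the 14 PlayTennis examples with attributes Outlook, Temperature, Humidity and Wind, this sketch should reproduce the tree shown earlier (up to the order of branches).
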
Choosing the best attribute?

Which attribute is the best classifier?

S: [9+,5−], E = 0.940

Humidity:                                  Wind:
  High   → [3+,4−], E = 0.985                Weak   → [6+,2−], E = 0.811
  Normal → [6+,1−], E = 0.592                Strong → [3+,3−], E = 1.00

Gain(S, Humidity) = .940 − (7/14).985 − (7/14).592 = .151
Gain(S, Wind)     = .940 − (8/14).811 − (6/14)1.0  = .048

Use the statistical property called Information gain, which measures how well a given attribute
separates the training examples according to the target classification.


Entropy - A measure of impurity/ambiguity

• S is a sample set of training examples
• p⊕ is the proportion of positive examples in S
• p⊖ is the proportion of negative examples in S
• Entropy measures the impurity of S

Entropy(S) ≡ −p⊕ log₂ p⊕ − p⊖ log₂ p⊖   (binary classification)

Entropy(S) ≡ −∑_{c∈C} p_c log₂ p_c   (multiclass classification),

where C denotes the set of class labels/values of the target attribute.

• Pure cases: p⊕ = 0 or 1 → Entropy = 0 (min)
• Impure case: p⊕ = 0.5 → Entropy = 1 (max)

Figure: Entropy(S) as a function of p⊕, rising from 0 at p⊕ = 0 to the maximum 1.0 at
p⊕ = 0.5 and falling back to 0 at p⊕ = 1.
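
As a quick numeric check (my own illustration, not from the slides), the following snippet computes the binary entropy of the collections used above:

```python
from math import log2

def entropy(pos, neg):
    """Binary entropy of a collection with `pos` positive and `neg` negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:                       # treat 0 * log2(0) as 0 (pure case)
            p = count / total
            result -= p * log2(p)
    return result

print(f"{entropy(9, 5):.3f}")    # 0.940 -- the E = 0.940 quoted for S = [9+,5-]
print(f"{entropy(7, 7):.3f}")    # 1.000 -- the impure case, p = 0.5
print(f"{entropy(14, 0):.3f}")   # 0.000 -- a pure case, p = 1
```
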
Information Gain - For choice of Hypothesis space

• Choice of Root node:
• Gain(S, A) → expected reduction in entropy due to sorting on A
• Gain(S, A) ≡ Entropy(S) − ∑_{v∈Values(A)} (|S_v|/|S|) Entropy(S_v)

Example: a collection [29+,35−] split by A1 into t: [21+,5−] and f: [8+,30−],
versus the same collection split by A2 into t: [18+,33−] and f: [11+,2−].

• Selecting the Next Attribute (internal node): as computed on the earlier slide,
  Gain(S, Humidity) = .151 > Gain(S, Wind) = .048, so Humidity is the better classifier
  of the two.

Partially learnt tree

Given the initial collection S of 9 positive and 5 negative examples, [9+,5−], sorting these by
Humidity produces the collections [3+,4−] (Humidity = High) and [6+,1−] (Humidity = Normal).
The information gain of this partitioning is .151, compared to a gain of only .048 for the
attribute Wind. ID3 selects Outlook, the attribute with the highest gain overall, as the root
and grows the tree as follows:

{D1, D2, ..., D14}  [9+,5−]

Outlook
├─ Sunny    → {D1,D2,D8,D9,D11}   [2+,3−]  ?
├─ Overcast → {D3,D7,D12,D13}     [4+,0−]  Yes
└─ Rain     → {D4,D5,D6,D10,D14}  [3+,2−]  ?

Which attribute should be tested here?

Ssunny = {D1,D2,D8,D9,D11}
Gain(Ssunny, Humidity)    = .970 − (3/5) 0.0 − (2/5) 0.0 = .970
Gain(Ssunny, Temperature) = .970 − (2/5) 0.0 − (2/5) 1.0 − (1/5) 0.0 = .570
Gain(Ssunny, Wind)        = .970 − (2/5) 1.0 − (3/5) .918 = .019
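
To check these numbers (my own verification, not part of the slides), here is a short snippet that computes gains directly from [pos, neg] counts; the small discrepancies in the third decimal come from the slides rounding intermediate entropies.

```python
from math import log2

def entropy(counts):
    """Entropy of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def gain(parent, *splits):
    """Information gain of a split; `parent` and each split are [pos, neg] counts."""
    n = sum(parent)
    return entropy(parent) - sum(sum(s) / n * entropy(s) for s in splits)

# Root collection S = [9+,5-]:
print(f"{gain([9, 5], [3, 4], [6, 1]):.3f}")           # Humidity    -> 0.152 (slides: .151)
print(f"{gain([9, 5], [6, 2], [3, 3]):.3f}")           # Wind        -> 0.048

# Ssunny = [2+,3-]:
print(f"{gain([2, 3], [0, 3], [2, 0]):.3f}")           # Humidity    -> 0.971 (slides: .970)
print(f"{gain([2, 3], [0, 2], [1, 1], [1, 0]):.3f}")   # Temperature -> 0.571 (slides: .570)
print(f"{gain([2, 3], [1, 1], [1, 2]):.3f}")           # Wind        -> 0.020 (slides: .019)
```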


Hypothesis Space Search by ID3

• ID3 is characterized as searching a space of hypotheses for one that best fits the training
  examples.
• The hypothesis space for ID3 is the set of possible decision trees.
• ID3 performs a simple-to-complex hill-climbing search through this space, starting with the
  empty tree, then inductively considering more elaborate hypotheses in search of a DT
  that correctly classifies the training data, guided by the Information gain measure.

Figure: Hypothesis space search by ID3 — the search moves through the space of possible
decision trees from simplest to increasingly complex, guided by the information gain heuristic.


Hypothesis Space Search by ID3 (contd...)

• Statistically-based search choices - uses all training examples at each step in the search
  for refining its current hypothesis, contrasting with the incremental decisions made in the
  Find-S or Candidate-Elimination algorithms.

Advantages:
• Less sensitive to errors in individual training examples
• Robust to noisy data - can be extended by modifying its termination criterion to accept
  hypotheses that imperfectly fit the training data
• Can handle missing data!

Limitations:
• Unlike the Candidate-Elimination method, ID3 determines only a single consistent
  hypothesis. So, it loses the capability of determining how many alternative
  consistent decision trees are available, or of posing new instance queries that
  optimally resolve among these competing hypotheses.
• ID3 performs no backtracking to reconsider its choice of attributes. Hence, it
  converges only to a locally optimal solution, which may be a less desirable choice
  than one on a different branch of the search. (Pruning overcomes this issue!!!)

Inductive Bias in ID3

Note that the hypothesis space H is the power set of instances X.

→ Unbiased? Not really...

Inductive bias of ID3?
• Approx. IB: Prefer short trees over longer ones → Occam’s Razor.
• Prefers not only shorter trees but also those trees that place attributes with high
  information gain progressively closer to the root. Hence, ID3 performs better than even
  a BFS-based ID3 that simply prefers shorter trees.


Occam’s Razor
Bias: Preference for some hypotheses, rather than a restriction of the hypothesis space H.

Occam’s razor [by William of Occam, around 1320]: Prefer the shortest hypothesis that fits
the data.

Why prefer short hypotheses?

Arguments in favor:
→ There are fewer short hypotheses than long hypotheses
→ a short hypothesis that fits the data is unlikely to be a coincidence
→ a long hypothesis that fits the data might be a coincidence

Arguments opposed:
→ There are many ways to define small sets of hypotheses.
→ For example, all trees with a prime number of nodes that use attributes beginning with “Z”
→ What’s so special about small sets based on the size of a hypothesis??


Issues in DT Learning

• How deeply to grow the DT?
• Choosing an appropriate attribute selection measure.
• Handling training data with missing attribute values.
• Handling attributes with differing costs.
• Handling continuous attributes.
• Improving computational efficiency.


Overfitting in Decision Trees

What is Overfitting? - Formal definition of Overfitting

Definition
Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there
exists some alternative hypothesis h′ ∈ H such that h has smaller error than h′ over the
training examples, but h′ has a smaller error than h over the entire distribution of instances
(including instances beyond the training set).

Precisely,
• Consider a hypothesis h and its
  - error over the training data: errortrain(h), and
  - error over the entire distribution D of data: errorD(h)
• We say h ∈ H overfits the training data if there is another hypothesis h′ ∈ H such that
  - errortrain(h) < errortrain(h′) and
  - errorD(h) > errorD(h′) (↑ Variance)
• Amount (or Degree) of Overfitting = errorD(h) − errorD(h′)
  (e.g., if errorD(h) = 0.20 and errorD(h′) = 0.12, the degree of overfitting is 0.08.)


Overfitting in Decision Trees

• Basic ID3 grows each branch of the DT just deeply enough to perfectly
  classify the training examples.
• Though it’s a reasonable strategy, it can lead to difficulties on noisy data,
  or when the number of training examples is too small to predict the true
  target function.
• Either of these cases leads to trees that overfit the training examples.


Overfitting in Decision Trees - Impact of noisy data

Consider the PlayTennis training examples (Table 3.2) and the earlier tree:

Outlook
├─ Sunny    → Humidity (High → No, Normal → Yes)
├─ Overcast → Yes
└─ Rain     → Wind (Strong → No, Weak → Yes)

Add a noisy training example (a false negative):
⟨Sunny, Hot, Normal, Strong⟩, PlayTennis = No

What impact does it have on the earlier tree?
• ID3 outputs a DT (h) more complex than the original tree (h′) that perfectly fits the
  training examples.
• Overfitting also occurs due to coincidental regularities (some attributes unrelated to the
  true target function happen to partition the examples very well)!!!
• Overfitting decreases the accuracy of the learned DTs by 10-25% on most problems!!!

Overfitting in Decision Trees (contd...)

Figure: Accuracy vs. size of tree (number of nodes). Accuracy on the training data keeps
increasing as the tree grows, while accuracy on the test data peaks and then declines; the
gap between the two curves is the degree of overfitting.


Overfitting in Decision Trees

Avoiding Overfitting

How can we avoid overfitting?

Two common approaches:
• Stop growing the tree (earlier) when the data split is not statistically
  significant. Difficult to get a precise estimation of when to stop!!
• Grow the full tree, then post-prune (More successful!!)

Irrespective of the approach, the key question is: How to select the “best” tree???


Overfitting in Decision Trees

Three approaches to avoid overfitting

• Training and Validation set approach: Measure performance over the
  training data & measure performance over a separate validation data set
  (distinct from the training data), thereby evaluating the utility of post-pruning
  nodes from the tree. (Common!!!)
  - Split the entire dataset into two: Training set → to form the learned
    hypothesis & Validation set → to evaluate the accuracy of this hypothesis
    over subsequent data, in particular to evaluate the effect of pruning. Must
    be large enough to provide statistically significant samples.
• Use all the available data for training, and apply a statistical test like the
  Chi-square test to estimate whether further expanding (or pruning) a node is
  likely to improve performance beyond the training set.
• Minimum Description Length (MDL) Principle:

  minimize size(tree) + size(misclassifications(tree))
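
A toy illustration of the MDL trade-off (my own sketch, not from the slides; the per-node and per-error costs are arbitrary illustrative weights, not a real encoding):

```python
def description_length(num_nodes, num_misclassified, node_cost=1.0, error_cost=2.0):
    """Toy MDL score: cost of encoding the tree plus cost of encoding its exceptions."""
    return node_cost * num_nodes + error_cost * num_misclassified

# A large tree with no training errors vs. a pruned tree with a few errors:
full_tree   = description_length(num_nodes=25, num_misclassified=0)   # 25.0
pruned_tree = description_length(num_nodes=9,  num_misclassified=3)   # 15.0
print(min(("full", full_tree), ("pruned", pruned_tree), key=lambda t: t[1]))
# -> ('pruned', 15.0): under this encoding, the smaller tree wins despite its errors.
```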


Overfitting in Decision Trees

Pruning & Reduced-Error Pruning

Pruning a decision node: Removing the subtree rooted at that node, making
it a leaf node and assigning it the most common classification of the training
examples affiliated with that node.

Reduced-error pruning:
1 Split the dataset into a training and a validation set.
2 Build a tree that classifies the training set correctly.
3 Consider each decision node in the tree as a candidate for pruning.
  • Evaluate the impact on the validation set of pruning each possible node
  • Greedily remove the one that most improves validation set accuracy
    (Validation accuracy → fraction of examples in the validation set correctly
    predicted by the tree)
4 Repeat until further pruning is harmful. (A sketch of this procedure follows.)
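
A minimal sketch of reduced-error pruning in Python, reusing the nested-dict tree format from the earlier sketches. The traversal, the path representation, and the assumption that every attribute value seen in the validation set appears among the tree's branches are my own illustrative choices.

```python
import copy
from collections import Counter

def classify(node, instance):
    """Sort an instance down a nested-dict tree to a leaf label."""
    while isinstance(node, dict):
        attr = next(iter(node))
        node = node[attr][instance[attr]]
    return node

def accuracy(tree, examples, target):
    """Validation accuracy: fraction of examples correctly predicted by the tree."""
    return sum(classify(tree, ex) == ex[target] for ex in examples) / len(examples)

def decision_nodes(tree, path=()):
    """Yield the branch path ((attribute, value), ...) of every decision node."""
    if isinstance(tree, dict):
        yield path
        attr = next(iter(tree))
        for value, subtree in tree[attr].items():
            yield from decision_nodes(subtree, path + ((attr, value),))

def prune_at(tree, path, train, target):
    """Replace the subtree at `path` with a leaf labelled by the most common
    classification of the training examples affiliated with that node."""
    pruned = copy.deepcopy(tree)
    node, reaching, parent, last = pruned, train, None, None
    for attr, value in path:
        reaching = [ex for ex in reaching if ex[attr] == value]
        parent, last = node[attr], value
        node = node[attr][value]
    leaf = Counter(ex[target] for ex in reaching).most_common(1)[0][0]
    if parent is None:
        return leaf              # pruning the root collapses the tree to one leaf
    parent[last] = leaf
    return pruned

def reduced_error_prune(tree, train, val, target):
    """Greedily prune the node whose removal most improves validation accuracy;
    stop when every candidate pruning hurts (ties favour the smaller tree)."""
    while isinstance(tree, dict):
        current = accuracy(tree, val, target)
        options = [prune_at(tree, p, train, target) for p in decision_nodes(tree)]
        best = max(options, key=lambda t: accuracy(t, val, target))
        if accuracy(best, val, target) < current:
            return tree          # further pruning is harmful: stop
        tree = best
    return tree
```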


Overfitting in Decision Trees

Effect of Reduced-Error Pruning

Figure: Accuracy vs. size of tree (number of nodes), with three curves: on training data, on
test data, and on test data during pruning. As nodes are pruned away, the test-set accuracy
of the pruned tree rises above that of the unpruned, overfitted tree.

• Merit: Reduced-error pruning produces the smallest version of the most accurate subtree!!!
• Demerit: What if data is limited??? (Dataset = Training set + Validation set + Test set)


Overfitting in Decision Trees

Rule Post-Pruning − An alternative to Reduced-error pruning

Basic steps in Rule post-pruning:
1 Build the DT from the training set, growing it until the training data are fit as well as
  possible, allowing overfitting to occur.
2 Convert the tree to an equivalent set of rules - create one rule for each path
  from the root node to a leaf node.
3 Prune (generalize) each rule independently of the others.
4 Sort the final rules by their estimated accuracy.
5 Classify new instances using the sorted sequence.

Note: Rule post-pruning results only in a sequence of rules which will be
used to classify new instances; we do not go back to decision trees.


Overfitting in Decision Trees

Estimating rule accuracy:

• Use a validation set disjoint from the training set and evaluate the estimated
  rule accuracy, (or)
• Use the most frequently adopted method of evaluating the performance
  on the training set itself (pessimistically, since the training set gives an
  estimate biased in favour of the rules) and calculate the SD in this estimated
  accuracy assuming a binomial distribution. For a given confidence interval,
  the lower-bound estimate is taken as the measure of rule performance.
  (e.g., C4.5 (Quinlan, 1993))
  - Not statistically valid, but works well on large data sets.
  - A smaller SD implies the pessimistic estimate is very close to the observed
    accuracy. (A small sketch of this estimate follows.)
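
A small sketch of this pessimistic estimate, assuming a normal approximation to the binomial at a ~95% confidence level (z ≈ 1.96); the function name and interface are illustrative, not C4.5's actual code:

```python
from math import sqrt

def pessimistic_accuracy(correct, n, z=1.96):
    """Lower bound of the confidence interval around the observed rule accuracy,
    using the normal approximation to the binomial (z = 1.96 for ~95% confidence)."""
    acc = correct / n
    sd = sqrt(acc * (1 - acc) / n)   # SD of the accuracy estimate
    return acc - z * sd              # lower-bound (pessimistic) estimate

# A rule that covers 20 training examples and classifies 18 of them correctly:
print(f"{pessimistic_accuracy(18, 20):.3f}")   # observed 0.900, pessimistic ~0.769
```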


Overfitting in Decision Trees

An illustration to convert a Tree to Rules

• Each attribute test along the path from the root to the leaf becomes a rule
antecedent (precondition) and the classification at the leaf node
becomes the rule consequent (postcondition).

A. Senthil Thilak (NITK) Supervised Learning - DTL 26 / 35


Overfitting in Decision Trees

An illustration to convert a Tree to Rules

• Each attribute test along the path from the root to the leaf becomes a rule
antecedent (precondition) and the classification at the leaf node
becomes the rule consequent (postcondition).
• Next, each such rule is pruned by removing antecedent or precondition
whose removal does not worsen its estimated accuracy.

A. Senthil Thilak (NITK) Supervised Learning - DTL 26 / 35


Overfitting in Decision Trees

An illustration to convert a Tree to Rules

• Each attribute test along the path from the root to a leaf becomes a rule
antecedent (precondition), and the classification at the leaf node
becomes the rule consequent (postcondition).
• Next, each such rule is pruned by removing any antecedent (precondition)
whose removal does not worsen its estimated accuracy.
• At each step, select the removal that gives the best improvement in
estimated accuracy.
• Proceed until further pruning is harmful.

Overfitting in Decision Trees

An illustration to convert a Tree to Rules (contd...)

Outlook
├─ Sunny → Humidity
│    ├─ High → No
│    └─ Normal → Yes
├─ Overcast → Yes
└─ Rain → Wind
     ├─ Strong → No
     └─ Weak → Yes

IF (Outlook == Sunny) ∧ (Humidity == High) THEN PlayTennis = No
IF (Outlook == Sunny) ∧ (Humidity == Normal) THEN PlayTennis = Yes
...
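The conversion itself is just an enumeration of root-to-leaf paths. A
minimal sketch, assuming the tree is stored as nested dicts (a hypothetical
representation chosen for illustration, not ID3's internal one):

def tree_to_rules(tree, preconditions=()):
    """Enumerate root-to-leaf paths of a nested-dict tree as IF-THEN rules."""
    if not isinstance(tree, dict):            # leaf: emit one rule
        return [(list(preconditions), tree)]
    (attribute, branches), = tree.items()
    rules = []
    for value, subtree in branches.items():
        rules += tree_to_rules(subtree, preconditions + ((attribute, value),))
    return rules

play_tennis = {"Outlook": {
    "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
}}
for pre, label in tree_to_rules(play_tennis):
    print("IF", " ∧ ".join(f"{a} == {v}" for a, v in pre),
          "THEN PlayTennis =", label)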

Overfitting in Decision Trees

An illustration to convert a Tree to Rules (contd...)

• Consider the rule:


IF (Outlook == Sunny) ∧ (Humidity == High) THEN PlayTennis = No
• Preconditions: (Outlook == Sunny), (Humidity == High)
• Prune each of these preconditions in turn and compute the estimated rule
accuracy of each resulting rule.
• Select the pruning that results in improved estimated accuracy.
• Repeat this independently for each rule formed from the decision tree.
• Sort the final pruned rules (by accuracy) to classify new instances.
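A sketch of the greedy pruning loop for a single rule, assuming an
estimate(preconditions, label) scoring function such as the pessimistic
accuracy above (the names are illustrative):

def prune_rule(preconditions, label, estimate):
    """Greedily drop preconditions while the estimated accuracy of the
    rule does not get worse; return the pruned rule and its estimate."""
    pre = list(preconditions)
    best = estimate(pre, label)
    while pre:
        # score every rule obtained by dropping a single precondition
        candidates = [(estimate(pre[:i] + pre[i + 1:], label), i)
                      for i in range(len(pre))]
        score, i = max(candidates)
        if score < best:          # every removal worsens the estimate: stop
            break
        best = score
        del pre[i]
    return pre, best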

Overfitting in Decision Trees

Rule post-pruning vs. Reduced-error pruning

Advantages of Rule post-pruning:


• Allows distinguishing the different contexts in which a decision node is
used.
- Each path → a distinct rule.
- Pruning decisions about an attribute/decision node can be made differently
for each path.
- In contrast, pruning the tree directly either removes a decision node
completely or retains it in its original form.
• Removes the distinction between attribute tests near the root and those near
the leaves.
- Reduces the bookkeeping involved in re-organizing the tree.
• Rules are easier to comprehend and improve readability.

Overfitting in Decision Trees

Continuous Valued Attributes

• Basic ID3 assumes a discrete set of values for each attribute.


• Question: What if the attribute values are continuous?
• Solution: Discretize (or split) the attribute space using a threshold
(a Boolean split).
- A ← A_c, where A_c is True if A < c and False otherwise.
• What threshold to use? We need the c that yields the greatest
information gain.
- Sort the values of attribute A and consider as candidate thresholds all
mid-way points between adjacent values at which the classification changes.
- The c that maximizes information gain must lie at such a boundary (Fayyad, 1991).

Overfitting in Decision Trees

Continuous Valued Attributes (contd..)

Example

Temperature 40 48 60 72 80 90
PlayTennis? No No Yes Yes Yes No

• Candidate thresholds: c = (48 + 60)/2 = 54 or c = (80 + 90)/2 = 85.
• Pick the one with the greatest information gain.
• Temperature > 54 is the best in this case.

Note: The other option is to split into multiple intervals, instead of two
intervals based on a single threshold.
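A sketch of this search for the example above (helper names are ours,
not ID3's):

from collections import Counter
import math

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Try the midpoint between adjacent sorted values where the class
    changes; return (information gain, threshold) for the best one."""
    pairs = sorted(zip(values, labels))
    base, n = entropy(labels), len(pairs)
    best = None
    for (v1, l1), (v2, l2) in zip(pairs, pairs[1:]):
        if l1 == l2:                     # class unchanged: not a candidate
            continue
        c = (v1 + v2) / 2
        below = [l for v, l in pairs if v < c]
        above = [l for v, l in pairs if v >= c]
        gain = base - (len(below) * entropy(below)
                       + len(above) * entropy(above)) / n
        if best is None or gain > best[0]:
            best = (gain, c)
    return best

temps  = [40, 48, 60, 72, 80, 90]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(best_threshold(temps, labels))     # (approx. 0.459, 54.0)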

Overfitting in Decision Trees

Attributes with Many Values


Problem:
• Information gain favours attributes with many values.
• Imagine using Date as an attribute in the PlayTennis example. It has a very
large number of possible values, so it would have the highest information
gain and would be selected as the root, yielding a tree of depth one that
perfectly fits the training data. - A poor predictor for subsequent instances!
An alternative measure: Use GainRatio instead!!! (Quinlan, 1986)
GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)

SplitInformation(S, A) = − Σ_{i=1}^{c} (|S_i| / |S|) log₂ (|S_i| / |S|)

where S_i is the subset of S for which A has value v_i.
(SplitInformation measures how broadly and uniformly the attribute splits the
data; it is the entropy of S with respect to the values of attribute A, in
contrast to the usual entropy with respect to the target classes.)
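A direct transcription as a sketch (examples are assumed to be dicts mapping
attribute names to values; gain is assumed to be the usual information-gain
routine):

from collections import Counter
import math

def split_information(examples, attribute):
    """Entropy of S with respect to the values of attribute A
    (not with respect to the target classes)."""
    n = len(examples)
    counts = Counter(x[attribute] for x in examples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain_ratio(examples, attribute, gain):
    return gain(examples, attribute) / split_information(examples, attribute)

One practical caveat: SplitInformation approaches zero when nearly all
examples share one value of A, which makes the ratio blow up; a standard fix
is to apply the GainRatio test only to attributes whose Gain is at least
average (Mitchell, 1997).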
Overfitting in Decision Trees

Handling Missing Attribute Values


• What if some examples are missing the value of attribute A?
- Example: patient data, where the BloodTestResult value may be missing.
• Treat the missing value as just another attribute value; (or)
• Ignore the instances having missing values.
- Problematic, as it means throwing away data;
(or)
• Assign it the most common value of A; (or)
• Assign it the most common value of A among examples of the class that the
example belongs to; (or)
• Assign a probability p_i to each possible value v_i of attribute A.
- Example: Let x be an instance for which the value of a Boolean attribute A
is missing. Suppose the fractions observed at the node are
P(A(x) = 1) = 0.4 and P(A(x) = 0) = 0.6.
- A fractional 0.4 of instance x goes down the branch for A = 1, and 0.6 of x
down the branch for A = 0.
• Use these fractional instances to compute the gain.
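A sketch of the probabilistic split (our own representation: each training
example is paired with a weight, initially 1.0):

def split_with_missing(examples, attribute):
    """Split weighted (example, weight) pairs on an attribute; an example
    missing the attribute goes down every branch with a fractional weight
    equal to that branch's share among examples whose value is known."""
    known = [(x, w) for x, w in examples if x.get(attribute) is not None]
    total = sum(w for _, w in known)
    branches = {}
    for x, w in known:
        branches.setdefault(x[attribute], []).append((x, w))
    shares = {v: sum(w for _, w in members) / total
              for v, members in branches.items()}
    for x, w in examples:
        if x.get(attribute) is None:
            for v, members in branches.items():
                members.append((x, w * shares[v]))    # fractional instance
    return branches

The weighted subsets returned here feed the usual entropy and gain
computations, with counts replaced by sums of weights.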
Overfitting in Decision Trees

Attributes with (differing) Costs


Consider
• Medical diagnosis → classify based on the attribute values for
Temperature, BiopsyResult, Pulse, BloodTestResults, etc.
• Attributes vary significantly in their costs, both in monetary terms and in
terms of patient comfort.

How to learn a consistent tree with low expected cost?


One approach for attribute selection: Replace gain by
• Tan and Schlimmer (1990), Tan (1993):

Gain²(S, A) / Cost(A)

• Nunez (1988):

(2^{Gain(S, A)} − 1) / (Cost(A) + 1)^w

where w ∈ [0, 1] determines the importance of cost vs. information gain.
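Both selection measures are one-liners once Gain(S, A) and Cost(A) are
available; a sketch (the default w = 0.5 is arbitrary):

def tan_measure(gain, cost):
    """Tan and Schlimmer (1990), Tan (1993)."""
    return gain ** 2 / cost

def nunez_measure(gain, cost, w=0.5):
    """Nunez (1988); w in [0, 1] trades cost off against information gain."""
    return (2.0 ** gain - 1.0) / (cost + 1.0) ** w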

References

Alpaydin, E. (2020). Introduction to machine learning. MIT press.


Mitchell, T. M. (1997). Machine Learning. McGraw-Hill, Inc., USA, 1st
edition.
