04 - Decision Tree Learning

The document discusses Decision Tree Learning, a method for approximating both discrete and continuous target functions, primarily used for classification. It outlines the structure of decision trees, including their representation as tree-like structures and if-then rules, and introduces key concepts such as the ID3 learning algorithm, entropy, and information gain. Additionally, it highlights when to consider decision trees based on the nature of the instances and target functions.


Decision Tree Learning¹

MA325 - Machine Learning

A. Senthil Thilak

Department of Mathematical and Computational Sciences
National Institute of Technology Karnataka

¹Topics from reference (Mitchell, 1997): Chapter 3
Decision Tree Learning

Introduction:
• A method used for approximating both Discrete-valued target functions
  (Classification) and Continuous-valued target functions (Regression);
• But most commonly used for classification, in which the learned
  function is represented as a tree-like structure called a “Decision tree”.
• Other representations of a decision tree: If-then rules (improve human
  readability), Disjunction of Conjunctions.

Topics to be learnt:
• Decision tree representation
• ID3 learning algorithm
• Entropy, Information gain
• Issues in DT learning
• Overfitting
• Pruning strategies

Decision tree (DT) representation

• A DT uses a top-down greedy approach and classifies instances by sorting them down
  the tree, from the root to some leaf node, which provides the classification of the
  instance.
• Each internal node tests an attribute.
• Each branch corresponds to a value of the attribute tested at the node from which it
  branches out.
• Each leaf node assigns a classification.

Table 3.2: Training examples for the target concept “PlayTennis” (Mitchell, 1997)

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

Figure: A DT for the target concept “PlayTennis”

Outlook
├─ Sunny    → Humidity
│              ├─ High   → No
│              └─ Normal → Yes
├─ Overcast → Yes
└─ Rain     → Wind
               ├─ Strong → No
               └─ Weak   → Yes
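
To make this representation concrete, here is a minimal Python sketch (my own illustration, not from the slides): it encodes the PlayTennis tree above as nested dictionaries and sorts an instance down it. The names `tree` and `classify` and the dict-based format are illustrative assumptions.

```python
# A decision tree as nested dicts: internal node = {attribute: {value: subtree}},
# leaf = class label. This mirrors the PlayTennis tree in the figure above.
tree = {
    "Outlook": {
        "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def classify(node, instance):
    """Sort an instance down the tree from the root to a leaf."""
    while isinstance(node, dict):                    # internal node: test an attribute
        attribute = next(iter(node))
        node = node[attribute][instance[attribute]]  # follow the matching branch
    return node                                      # leaf: the classification

# Example: D1 = <Sunny, Hot, High, Weak> should be classified "No".
print(classify(tree, {"Outlook": "Sunny", "Temperature": "Hot",
                      "Humidity": "High", "Wind": "Weak"}))  # -> No
```

Representing internal nodes as single-key dicts keeps the sketch short; a class-based node type would be the more usual choice in production code.
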
When to Consider Decision Trees?

• Instances describable by attribute-value pairs
• Target function is discrete-valued
• Disjunctive hypothesis may be required
• Possibly noisy training data or missing data

Examples
• Equipment or medical diagnosis
• Credit risk analysis
• Modeling calendar scheduling preferences

Top-Down Induction of Decision Trees - Basic ID3 algorithm

ID3(D, TA, A)
(D → training examples; TA → the target attribute whose classification/value is to be
predicted by the DT; A → set of other attributes that may be tested by the learned DT)
1 Create a Root node for the tree
2 If all examples are positive, return the single-node tree Root, with label = +
3 If all examples are negative, return the single-node tree Root, with label = −
4 If A is empty, return the single-node tree Root, with label = most common value of TA in D
5 Else B ← the “best” decision attribute in A for the next node (assign B as the decision
  attribute for the Root node)
6 For each value of B, create a new descendant of the node
7 Sort the training examples to the leaf nodes
8 If the training examples are perfectly classified, then STOP; else iterate over the new
  leaf nodes (a Python sketch of this procedure follows)
9 End
10 Return Root
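
A minimal, self-contained Python sketch of this procedure, assuming examples are dicts mapping attribute names to values with the class stored under a target key; the helpers `entropy` and `information_gain` anticipate the definitions on the following slides, and all names here are my own.

```python
from collections import Counter
from math import log2

def entropy(examples, target):
    """Entropy of the class distribution of `examples` (defined on later slides)."""
    counts = Counter(ex[target] for ex in examples)
    total = len(examples)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def information_gain(examples, attr, target):
    """Expected reduction in entropy from splitting on `attr`."""
    total = len(examples)
    remainder = 0.0
    for value in {ex[attr] for ex in examples}:
        subset = [ex for ex in examples if ex[attr] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, target, attributes):
    """Return a tree as nested dicts {attr: {value: subtree}}; leaves are labels."""
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:                  # steps 2-3: all positive / all negative
        return labels[0]
    if not attributes:                         # step 4: no attributes left to test
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes,                     # step 5: "best" attribute by gain
               key=lambda a: information_gain(examples, a, target))
    node = {best: {}}
    for value in {ex[best] for ex in examples}:          # step 6: one branch per value
        subset = [ex for ex in examples if ex[best] == value]  # step 7: sort examples
        node[best][value] = id3(subset, target,
                                [a for a in attributes if a != best])
    return node
```

On the 14 PlayTennis examples with attributes Outlook, Temperature, Humidity and Wind, this sketch should reproduce the tree shown earlier (up to the order of branches).
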
Choosing the best attribute?

Which attribute is the best classifier?

S: [9+,5−], E = 0.940

Humidity:                                  Wind:
  High   → [3+,4−], E = 0.985                Weak   → [6+,2−], E = 0.811
  Normal → [6+,1−], E = 0.592                Strong → [3+,3−], E = 1.00

Gain(S, Humidity) = .940 − (7/14).985 − (7/14).592 = .151
Gain(S, Wind)     = .940 − (8/14).811 − (6/14)1.0  = .048

Use the statistical property called Information gain, which measures how well a given attribute
separates the training examples according to the target classification.


Entropy - A measure of impurity/ambiguity

• S is a sample set of training examples
• p⊕ is the proportion of positive examples in S
• p⊖ is the proportion of negative examples in S
• Entropy measures the impurity of S

Entropy(S) ≡ −p⊕ log₂ p⊕ − p⊖ log₂ p⊖   (binary classification)

Entropy(S) ≡ −∑_{c∈C} p_c log₂ p_c   (multiclass classification),

where C denotes the set of class labels/values of the target attribute.

• Pure cases: p⊕ = 0 or 1 → Entropy = 0 (min)
• Impure case: p⊕ = 0.5 → Entropy = 1 (max)

Figure: Entropy(S) as a function of p⊕, rising from 0 at p⊕ = 0 to the maximum 1.0 at
p⊕ = 0.5 and falling back to 0 at p⊕ = 1.
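
As a quick numeric check (my own illustration, not from the slides), the following snippet computes the binary entropy of the collections used above:

```python
from math import log2

def entropy(pos, neg):
    """Binary entropy of a collection with `pos` positive and `neg` negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:                       # treat 0 * log2(0) as 0 (pure case)
            p = count / total
            result -= p * log2(p)
    return result

print(f"{entropy(9, 5):.3f}")    # 0.940 -- the E = 0.940 quoted for S = [9+,5-]
print(f"{entropy(7, 7):.3f}")    # 1.000 -- the impure case, p = 0.5
print(f"{entropy(14, 0):.3f}")   # 0.000 -- a pure case, p = 1
```
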
Information Gain - For choice of Hypothesis space

• Choice of Root node:
• Gain(S, A) → expected reduction in entropy due to sorting on A
• Gain(S, A) ≡ Entropy(S) − ∑_{v∈Values(A)} (|S_v|/|S|) Entropy(S_v)

Example: a collection [29+,35−] split by A1 into t: [21+,5−] and f: [8+,30−],
versus the same collection split by A2 into t: [18+,33−] and f: [11+,2−].

• Selecting the Next Attribute (internal node): as computed on the earlier slide,
  Gain(S, Humidity) = .151 > Gain(S, Wind) = .048, so Humidity is the better classifier
  of the two.

Partially learnt tree

Given the initial collection S of 9 positive and 5 negative examples, [9+,5−], sorting these by
Humidity produces the collections [3+,4−] (Humidity = High) and [6+,1−] (Humidity = Normal).
The information gain of this partitioning is .151, compared to a gain of only .048 for the
attribute Wind. ID3 selects Outlook, the attribute with the highest gain overall, as the root
and grows the tree as follows:

{D1, D2, ..., D14}  [9+,5−]

Outlook
├─ Sunny    → {D1,D2,D8,D9,D11}   [2+,3−]  ?
├─ Overcast → {D3,D7,D12,D13}     [4+,0−]  Yes
└─ Rain     → {D4,D5,D6,D10,D14}  [3+,2−]  ?

Which attribute should be tested here?

Ssunny = {D1,D2,D8,D9,D11}
Gain(Ssunny, Humidity)    = .970 − (3/5) 0.0 − (2/5) 0.0 = .970
Gain(Ssunny, Temperature) = .970 − (2/5) 0.0 − (2/5) 1.0 − (1/5) 0.0 = .570
Gain(Ssunny, Wind)        = .970 − (2/5) 1.0 − (3/5) .918 = .019
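
To check these numbers (my own verification, not part of the slides), here is a short snippet that computes gains directly from [pos, neg] counts; the small discrepancies in the third decimal come from the slides rounding intermediate entropies.

```python
from math import log2

def entropy(counts):
    """Entropy of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def gain(parent, *splits):
    """Information gain of a split; `parent` and each split are [pos, neg] counts."""
    n = sum(parent)
    return entropy(parent) - sum(sum(s) / n * entropy(s) for s in splits)

# Root collection S = [9+,5-]:
print(f"{gain([9, 5], [3, 4], [6, 1]):.3f}")           # Humidity    -> 0.152 (slides: .151)
print(f"{gain([9, 5], [6, 2], [3, 3]):.3f}")           # Wind        -> 0.048

# Ssunny = [2+,3-]:
print(f"{gain([2, 3], [0, 3], [2, 0]):.3f}")           # Humidity    -> 0.971 (slides: .970)
print(f"{gain([2, 3], [0, 2], [1, 1], [1, 0]):.3f}")   # Temperature -> 0.571 (slides: .570)
print(f"{gain([2, 3], [1, 1], [1, 2]):.3f}")           # Wind        -> 0.020 (slides: .019)
```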


Hypothesis Space Search by ID3

• ID3 is characterized as searching a space of hypotheses for one that best fits the training
  examples.
• The hypothesis space for ID3 is the set of possible decision trees.
• ID3 performs a simple-to-complex hill-climbing search through this space, starting with the
  empty tree, then inductively considering more elaborate hypotheses in search of a DT
  that correctly classifies the training data, guided by the Information gain measure.

Figure: Hypothesis space search by ID3 — the search moves through the space of possible
decision trees from simplest to increasingly complex, guided by the information gain heuristic.


Hypothesis Space Search by ID3 (contd...)

• Statistically-based search choices - uses all training examples at each step in the search
  for refining its current hypothesis, contrasting with the incremental decisions made in the
  Find-S or Candidate-Elimination algorithms.

Advantages:
• Less sensitive to errors in individual training examples
• Robust to noisy data - can be extended by modifying its termination criterion to accept
  hypotheses that imperfectly fit the training data
• Can handle missing data!

Limitations:
• Unlike the Candidate-Elimination method, ID3 determines only a single consistent
  hypothesis. So, it loses the capability of determining how many alternative
  consistent decision trees are available, or of posing new instance queries that
  optimally resolve among these competing hypotheses.
• ID3 performs no backtracking to reconsider its choice of attributes. Hence, it
  converges only to a locally optimal solution, which may be a less desirable choice
  than one on a different branch of the search. (Pruning overcomes this issue!!!)

Inductive Bias in ID3

Note that the hypothesis space H is the power set of instances X.

→ Unbiased? Not really...

Inductive bias of ID3?
• Approx. IB: Prefer short trees over longer ones → Occam’s Razor.
• Prefers not only shorter trees but also those trees that place attributes with high
  information gain progressively closer to the root. Hence, ID3 performs better than even
  a BFS-based ID3 that simply prefers shorter trees.


Occam’s Razor
Bias: Preference for some hypotheses, rather than a restriction of the hypothesis space H.

Occam’s razor [by William of Occam, around 1320]: Prefer the shortest hypothesis that fits
the data.

Why prefer short hypotheses?

Arguments in favor:
→ There are fewer short hypotheses than long hypotheses
→ a short hypothesis that fits the data is unlikely to be a coincidence
→ a long hypothesis that fits the data might be a coincidence

Arguments opposed:
→ There are many ways to define small sets of hypotheses.
→ For example, all trees with a prime number of nodes that use attributes beginning with “Z”
→ What’s so special about small sets based on the size of a hypothesis??


Issues in DT Learning

• How deeply to grow the DT?
• Choosing an appropriate attribute selection measure.
• Handling training data with missing attribute values.
• Handling attributes with differing costs.
• Handling continuous attributes.
• Improving computational efficiency.


Overfitting in Decision Trees

What is Overfitting? - Formal definition of Overfitting

Definition
Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there
exists some alternative hypothesis h′ ∈ H such that h has smaller error than h′ over the
training examples, but h′ has a smaller error than h over the entire distribution of instances
(including instances beyond the training set).

Precisely,
• Consider a hypothesis h and its
  - error over the training data: errortrain(h), and
  - error over the entire distribution D of data: errorD(h)
• We say h ∈ H overfits the training data if there is another hypothesis h′ ∈ H such that
  - errortrain(h) < errortrain(h′) and
  - errorD(h) > errorD(h′) (↑ Variance)
• Amount (or Degree) of Overfitting = errorD(h) − errorD(h′)
  (e.g., if errorD(h) = 0.20 and errorD(h′) = 0.12, the degree of overfitting is 0.08.)


Overfitting in Decision Trees

• Basic ID3 grows each branch of the DT just deeply enough to perfectly
  classify the training examples.
• Though it’s a reasonable strategy, it can lead to difficulties on noisy data,
  or when the number of training examples is too small to predict the true
  target function.
• Either of these cases leads to trees that overfit the training examples.


Overfitting in Decision Trees - Impact of noisy data

Consider the PlayTennis training examples (Table 3.2) and the earlier tree:

Outlook
├─ Sunny    → Humidity (High → No, Normal → Yes)
├─ Overcast → Yes
└─ Rain     → Wind (Strong → No, Weak → Yes)

Add a noisy training example (a false negative):
⟨Sunny, Hot, Normal, Strong⟩, PlayTennis = No

What impact does it have on the earlier tree?
• ID3 outputs a DT (h) more complex than the original tree (h′) that perfectly fits the
  training examples.
• Overfitting also occurs due to coincidental regularities (some attributes unrelated to the
  true target function happen to partition the examples very well)!!!
• Overfitting decreases the accuracy of the learned DTs by 10-25% on most problems!!!

Overfitting in Decision Trees (contd...)

Figure: Accuracy vs. size of tree (number of nodes). Accuracy on the training data keeps
increasing as the tree grows, while accuracy on the test data peaks and then declines; the
gap between the two curves is the degree of overfitting.


Overfitting in Decision Trees

Avoiding Overfitting

How can we avoid overfitting?

Two common approaches:
• Stop growing the tree (earlier) when the data split is not statistically
  significant. Difficult to get a precise estimation of when to stop!!
• Grow the full tree, then post-prune (More successful!!)

Irrespective of the approach, the key question is: How to select the “best” tree???


Overfitting in Decision Trees

Three approaches to avoid overfitting

• Training and Validation set approach: Measure performance over the
  training data & measure performance over a separate validation data set
  (distinct from the training data), thereby evaluating the utility of post-pruning
  nodes from the tree. (Common!!!)
  - Split the entire dataset into two: Training set → to form the learned
    hypothesis & Validation set → to evaluate the accuracy of this hypothesis
    over subsequent data, in particular to evaluate the effect of pruning. Must
    be large enough to provide statistically significant samples.
• Use all the available data for training, and apply a statistical test like the
  Chi-square test to estimate whether further expanding (or pruning) a node is
  likely to improve performance beyond the training set.
• Minimum Description Length (MDL) Principle:

  minimize size(tree) + size(misclassifications(tree))
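
A toy illustration of the MDL trade-off (my own sketch, not from the slides; the per-node and per-error costs are arbitrary illustrative weights, not a real encoding):

```python
def description_length(num_nodes, num_misclassified, node_cost=1.0, error_cost=2.0):
    """Toy MDL score: cost of encoding the tree plus cost of encoding its exceptions."""
    return node_cost * num_nodes + error_cost * num_misclassified

# A large tree with no training errors vs. a pruned tree with a few errors:
full_tree   = description_length(num_nodes=25, num_misclassified=0)   # 25.0
pruned_tree = description_length(num_nodes=9,  num_misclassified=3)   # 15.0
print(min(("full", full_tree), ("pruned", pruned_tree), key=lambda t: t[1]))
# -> ('pruned', 15.0): under this encoding, the smaller tree wins despite its errors.
```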


Overfitting in Decision Trees

Pruning & Reduced-Error Pruning

Pruning a decision node: Removing the subtree rooted at that node, making
it a leaf node and assigning it the most common classification of the training
examples affiliated with that node.

Reduced-error pruning:
1 Split the dataset into a training and a validation set.
2 Build a tree that classifies the training set correctly.
3 Consider each decision node in the tree as a candidate for pruning.
  • Evaluate the impact on the validation set of pruning each possible node
  • Greedily remove the one that most improves validation set accuracy
    (Validation accuracy → fraction of examples in the validation set correctly
    predicted by the tree)
4 Repeat until further pruning is harmful. (A sketch of this procedure follows.)
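
A minimal sketch of reduced-error pruning in Python, reusing the nested-dict tree format from the earlier sketches. The traversal, the path representation, and the assumption that every attribute value seen in the validation set appears among the tree's branches are my own illustrative choices.

```python
import copy
from collections import Counter

def classify(node, instance):
    """Sort an instance down a nested-dict tree to a leaf label."""
    while isinstance(node, dict):
        attr = next(iter(node))
        node = node[attr][instance[attr]]
    return node

def accuracy(tree, examples, target):
    """Validation accuracy: fraction of examples correctly predicted by the tree."""
    return sum(classify(tree, ex) == ex[target] for ex in examples) / len(examples)

def decision_nodes(tree, path=()):
    """Yield the branch path ((attribute, value), ...) of every decision node."""
    if isinstance(tree, dict):
        yield path
        attr = next(iter(tree))
        for value, subtree in tree[attr].items():
            yield from decision_nodes(subtree, path + ((attr, value),))

def prune_at(tree, path, train, target):
    """Replace the subtree at `path` with a leaf labelled by the most common
    classification of the training examples affiliated with that node."""
    pruned = copy.deepcopy(tree)
    node, reaching, parent, last = pruned, train, None, None
    for attr, value in path:
        reaching = [ex for ex in reaching if ex[attr] == value]
        parent, last = node[attr], value
        node = node[attr][value]
    leaf = Counter(ex[target] for ex in reaching).most_common(1)[0][0]
    if parent is None:
        return leaf              # pruning the root collapses the tree to one leaf
    parent[last] = leaf
    return pruned

def reduced_error_prune(tree, train, val, target):
    """Greedily prune the node whose removal most improves validation accuracy;
    stop when every candidate pruning hurts (ties favour the smaller tree)."""
    while isinstance(tree, dict):
        current = accuracy(tree, val, target)
        options = [prune_at(tree, p, train, target) for p in decision_nodes(tree)]
        best = max(options, key=lambda t: accuracy(t, val, target))
        if accuracy(best, val, target) < current:
            return tree          # further pruning is harmful: stop
        tree = best
    return tree
```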


Overfitting in Decision Trees

Effect of Reduced-Error Pruning

Figure: Accuracy vs. size of tree (number of nodes), with three curves: on training data, on
test data, and on test data during pruning. As nodes are pruned away, the test-set accuracy
of the pruned tree rises above that of the unpruned, overfitted tree.

• Merit: Reduced-error pruning produces the smallest version of the most accurate subtree!!!
• Demerit: What if data is limited??? (Dataset = Training set + Validation set + Test set)


Overfitting in Decision Trees

Rule Post-Pruning − An alternative to Reduced-error pruning

Basic steps in Rule post-pruning:
1 Build the DT from the training set, growing it until the training data are fit as well as
  possible, allowing overfitting to occur.
2 Convert the tree to an equivalent set of rules - create one rule for each path
  from the root node to a leaf node.
3 Prune (generalize) each rule independently of the others.
4 Sort the final rules by their estimated accuracy.
5 Classify new instances using the sorted sequence.

Note: Rule post-pruning results only in a sequence of rules which will be
used to classify new instances; we do not go back to decision trees.


Overfitting in Decision Trees

Estimating rule accuracy:

• Use a validation set disjoint from the training set and evaluate the estimated
  rule accuracy, (or)
• Use the most frequently adopted method of evaluating the performance
  on the training set itself (pessimistically, since the training set gives an
  estimate biased in favour of the rules) and calculate the SD in this estimated
  accuracy assuming a binomial distribution. For a given confidence interval,
  the lower-bound estimate is taken as the measure of rule performance.
  (e.g., C4.5 (Quinlan, 1993))
  - Not statistically valid, but works well on large data sets.
  - A smaller SD implies the pessimistic estimate is very close to the observed
    accuracy. (A small sketch of this estimate follows.)
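
A small sketch of this pessimistic estimate, assuming a normal approximation to the binomial at a ~95% confidence level (z ≈ 1.96); the function name and interface are illustrative, not C4.5's actual code:

```python
from math import sqrt

def pessimistic_accuracy(correct, n, z=1.96):
    """Lower bound of the confidence interval around the observed rule accuracy,
    using the normal approximation to the binomial (z = 1.96 for ~95% confidence)."""
    acc = correct / n
    sd = sqrt(acc * (1 - acc) / n)   # SD of the accuracy estimate
    return acc - z * sd              # lower-bound (pessimistic) estimate

# A rule that covers 20 training examples and classifies 18 of them correctly:
print(f"{pessimistic_accuracy(18, 20):.3f}")   # observed 0.900, pessimistic ~0.769
```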


Overfitting in Decision Trees

An illustration to convert a Tree to Rules

• Each attribute test along the path from the root to the leaf becomes a rule
antecedent (precondition) and the classification at the leaf node
becomes the rule consequent (postcondition).

A. Senthil Thilak (NITK) Supervised Learning - DTL 26 / 35


Overfitting in Decision Trees

An illustration to convert a Tree to Rules

• Each attribute test along the path from the root to the leaf becomes a rule
antecedent (precondition) and the classification at the leaf node
becomes the rule consequent (postcondition).
• Next, each such rule is pruned by removing antecedent or precondition
whose removal does not worsen its estimated accuracy.

A. Senthil Thilak (NITK) Supervised Learning - DTL 26 / 35


Overfitting in Decision Trees

An illustration to convert a Tree to Rules

• Each attribute test along the path from the root to a leaf becomes a rule
antecedent (precondition), and the classification at the leaf node
becomes the rule consequent (postcondition).
• Next, each such rule is pruned by removing any antecedent (precondition)
whose removal does not worsen its estimated accuracy.
• At each step, select the removal that gives the best improvement in
estimated accuracy.
• Proceed until further pruning is harmful.

Overfitting in Decision Trees

An illustration to convert a Tree to Rules (contd...)

Outlook
├─ Sunny → Humidity
│    ├─ High → No
│    └─ Normal → Yes
├─ Overcast → Yes
└─ Rain → Wind
     ├─ Strong → No
     └─ Weak → Yes

IF (Outlook == Sunny) ∧ (Humidity == High) THEN PlayTennis = No
IF (Outlook == Sunny) ∧ (Humidity == Normal) THEN PlayTennis = Yes
...
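The conversion itself is just an enumeration of root-to-leaf paths. A
minimal sketch, assuming the tree is stored as nested dicts (a hypothetical
representation chosen for illustration, not ID3's internal one):

def tree_to_rules(tree, preconditions=()):
    """Enumerate root-to-leaf paths of a nested-dict tree as IF-THEN rules."""
    if not isinstance(tree, dict):            # leaf: emit one rule
        return [(list(preconditions), tree)]
    (attribute, branches), = tree.items()
    rules = []
    for value, subtree in branches.items():
        rules += tree_to_rules(subtree, preconditions + ((attribute, value),))
    return rules

play_tennis = {"Outlook": {
    "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
}}
for pre, label in tree_to_rules(play_tennis):
    print("IF", " ∧ ".join(f"{a} == {v}" for a, v in pre),
          "THEN PlayTennis =", label)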

Overfitting in Decision Trees

An illustration to convert a Tree to Rules (contd...)

• Consider the rule:


IF (Outlook == Sunny) ∧ (Humidity == High) THEN PlayTennis = No
• Preconditions: (Outlook == Sunny), (Humidity == High)
• Prune each of these preconditions in turn and compute the estimated rule
accuracy of each resulting rule.
• Select the pruning that results in improved estimated accuracy.
• Repeat this independently for each rule formed from the decision tree.
• Sort the final pruned rules (by accuracy) to classify new instances.
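A sketch of the greedy pruning loop for a single rule, assuming an
estimate(preconditions, label) scoring function such as the pessimistic
accuracy above (the names are illustrative):

def prune_rule(preconditions, label, estimate):
    """Greedily drop preconditions while the estimated accuracy of the
    rule does not get worse; return the pruned rule and its estimate."""
    pre = list(preconditions)
    best = estimate(pre, label)
    while pre:
        # score every rule obtained by dropping a single precondition
        candidates = [(estimate(pre[:i] + pre[i + 1:], label), i)
                      for i in range(len(pre))]
        score, i = max(candidates)
        if score < best:          # every removal worsens the estimate: stop
            break
        best = score
        del pre[i]
    return pre, best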

Overfitting in Decision Trees

Rule post-pruning vs. Reduced-error pruning

Advantages of Rule post-pruning:


• Allows distinguishing the different contexts in which a decision node is
used.
- Each path → a distinct rule.
- Pruning decisions about an attribute/decision node can be made differently
for each path.
- In contrast, pruning the tree directly either removes a decision node
completely or retains it in its original form.
• Removes the distinction between attribute tests near the root and those near
the leaves.
- Reduces the bookkeeping involved in re-organizing the tree.
• Rules are easier to comprehend and improve readability.

Overfitting in Decision Trees

Continuous Valued Attributes

• Basic ID3 assumes a discrete set of values for each attribute.


• Question: What if the attribute values are continuous?
• Solution: Discretize (or split) the attribute space using a threshold
(a Boolean split).
- A ← A_c, where A_c is True if A < c and False otherwise.
• What threshold to use? We need the c that yields the greatest
information gain.
- Sort the values of attribute A and consider as candidate thresholds all
mid-way points between adjacent values at which the classification changes.
- The c that maximizes information gain must lie at such a boundary (Fayyad, 1991).

Overfitting in Decision Trees

Continuous Valued Attributes (contd..)

Example

Temperature 40 48 60 72 80 90
PlayTennis? No No Yes Yes Yes No

• Candidate thresholds: c = (48 + 60)/2 = 54 or c = (80 + 90)/2 = 85.
• Pick the one with the greatest information gain.
• Temperature > 54 is the best in this case.

Note: The other option is to split into multiple intervals, instead of two
intervals based on a single threshold.
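A sketch of this search for the example above (helper names are ours,
not ID3's):

from collections import Counter
import math

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Try the midpoint between adjacent sorted values where the class
    changes; return (information gain, threshold) for the best one."""
    pairs = sorted(zip(values, labels))
    base, n = entropy(labels), len(pairs)
    best = None
    for (v1, l1), (v2, l2) in zip(pairs, pairs[1:]):
        if l1 == l2:                     # class unchanged: not a candidate
            continue
        c = (v1 + v2) / 2
        below = [l for v, l in pairs if v < c]
        above = [l for v, l in pairs if v >= c]
        gain = base - (len(below) * entropy(below)
                       + len(above) * entropy(above)) / n
        if best is None or gain > best[0]:
            best = (gain, c)
    return best

temps  = [40, 48, 60, 72, 80, 90]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(best_threshold(temps, labels))     # (approx. 0.459, 54.0)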

Overfitting in Decision Trees

Attributes with Many Values


Problem:
• Information gain favours attributes with many values.
• Imagine using Date as an attribute in the PlayTennis example. It has a very
large number of possible values, so it would have the highest information
gain and would be selected as the root, yielding a tree of depth one that
perfectly fits the training data. - A poor predictor for subsequent instances!
An alternative measure: Use GainRatio instead!!! (Quinlan, 1986)
GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)

SplitInformation(S, A) = − Σ_{i=1}^{c} (|S_i| / |S|) log₂ (|S_i| / |S|)

where S_i is the subset of S for which A has value v_i.
(SplitInformation measures how broadly and uniformly the attribute splits the
data; it is the entropy of S with respect to the values of attribute A, in
contrast to the usual entropy with respect to the target classes.)
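A direct transcription as a sketch (examples are assumed to be dicts mapping
attribute names to values; gain is assumed to be the usual information-gain
routine):

from collections import Counter
import math

def split_information(examples, attribute):
    """Entropy of S with respect to the values of attribute A
    (not with respect to the target classes)."""
    n = len(examples)
    counts = Counter(x[attribute] for x in examples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain_ratio(examples, attribute, gain):
    return gain(examples, attribute) / split_information(examples, attribute)

One practical caveat: SplitInformation approaches zero when nearly all
examples share one value of A, which makes the ratio blow up; a standard fix
is to apply the GainRatio test only to attributes whose Gain is at least
average (Mitchell, 1997).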
Overfitting in Decision Trees

Handling Missing Attribute Values


• What if some examples are missing the value of attribute A?
- Example: patient data, where the BloodTestResult value may be missing.
• Treat the missing value as just another attribute value; (or)
• Ignore the instances having missing values.
- Problematic, as it means throwing away data;
(or)
• Assign it the most common value of A; (or)
• Assign it the most common value of A among examples of the class that the
example belongs to; (or)
• Assign a probability p_i to each possible value v_i of attribute A.
- Example: Let x be an instance for which the value of a Boolean attribute A
is missing. Suppose the fractions observed at the node are
P(A(x) = 1) = 0.4 and P(A(x) = 0) = 0.6.
- A fractional 0.4 of instance x goes down the branch for A = 1, and 0.6 of x
down the branch for A = 0.
• Use these fractional instances to compute the gain.
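A sketch of the probabilistic split (our own representation: each training
example is paired with a weight, initially 1.0):

def split_with_missing(examples, attribute):
    """Split weighted (example, weight) pairs on an attribute; an example
    missing the attribute goes down every branch with a fractional weight
    equal to that branch's share among examples whose value is known."""
    known = [(x, w) for x, w in examples if x.get(attribute) is not None]
    total = sum(w for _, w in known)
    branches = {}
    for x, w in known:
        branches.setdefault(x[attribute], []).append((x, w))
    shares = {v: sum(w for _, w in members) / total
              for v, members in branches.items()}
    for x, w in examples:
        if x.get(attribute) is None:
            for v, members in branches.items():
                members.append((x, w * shares[v]))    # fractional instance
    return branches

The weighted subsets returned here feed the usual entropy and gain
computations, with counts replaced by sums of weights.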
Overfitting in Decision Trees

Attributes with (differing) Costs


Consider
• Medical diagnosis → classify based on the attribute values for
Temperature, BiopsyResult, Pulse, BloodTestResults, etc.
• Attributes vary significantly in their costs, both in monetary terms and in
terms of patient comfort.

How to learn a consistent tree with low expected cost?


One approach for attribute selection: Replace gain by
• Tan and Schlimmer (1990), Tan (1993):

Gain²(S, A) / Cost(A)

• Nunez (1988):

(2^{Gain(S, A)} − 1) / (Cost(A) + 1)^w

where w ∈ [0, 1] determines the importance of cost vs. information gain.
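Both selection measures are one-liners once Gain(S, A) and Cost(A) are
available; a sketch (the default w = 0.5 is arbitrary):

def tan_measure(gain, cost):
    """Tan and Schlimmer (1990), Tan (1993)."""
    return gain ** 2 / cost

def nunez_measure(gain, cost, w=0.5):
    """Nunez (1988); w in [0, 1] trades cost off against information gain."""
    return (2.0 ** gain - 1.0) / (cost + 1.0) ** w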

References

Alpaydin, E. (2020). Introduction to machine learning. MIT press.


Mitchell, T. M. (1997). Machine Learning. McGraw-Hill, Inc., USA, 1st
edition.
