ML-3-Decision Tree
Unit-5:
REINFORCEMENT LEARNING: Introduction to Reinforcement Learning, Learning Task, Example of Reinforcement Learning in Practice, Learning Models for Reinforcement (Markov Decision Process, Q-Learning: Q-Learning Function, Q-Learning Algorithm), Application of Reinforcement Learning, Introduction to Deep Q-Learning.
GENETIC ALGORITHMS: Introduction, Components, GA Cycle of Reproduction, Crossover, Mutation, Genetic Programming, Models of Evolution and Learning, Applications.
Books:
1. Tom M. Mitchell, Machine Learning, McGraw-Hill Education (India) Private Limited, 2013.
2. Ethem Alpaydin, Introduction to Machine Learning (Adaptive Computation and Machine Learning), MIT Press.
3. Stephen Marsland, Machine Learning: An Algorithmic Perspective, CRC Press, 2009.
4. Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer-Verlag.
5. M. Gopal, Applied Machine Learning, McGraw Hill Education.
Unit-3:
DECISION TREE LEARNING - Decision tree learning algorithm, Inductive bias, Inductive
inference with decision trees, Entropy and information theory, Information gain, ID-3 Algorithm,
Issues in Decision tree learning.
Decision tree
A decision tree is a non-parametric supervised learning algorithm, which is utilized for both
classification and regression tasks.
Decision tree learning is a method for approximating discrete-valued target functions, in
which the learned function is represented by a decision tree.
Learned trees can also be re-represented as sets of if-then rules to improve human readability.
Decision trees classify instances by sorting them down the tree from the root to some leaf node,
which provides the classification of the instance. Each node in the tree specifies a test of some
attribute of the instance, and each branch descending from that node corresponds to one of the
possible values for this attribute.
An instance is classified by starting at the root node of the tree, testing the attribute specified by
this node, then moving down the tree branch corresponding to the value of the attribute. This
process is then repeated for the subtree rooted at the new node.
For example, in the decision tree for the PlayTennis concept (with Outlook tested at the root), the instance (Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong) would be sorted down the left-most branch (Outlook = Sunny, then Humidity = High) and would therefore be classified as a negative instance (PlayTennis = No).
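A minimal Python sketch of this classification procedure, with the standard PlayTennis tree from Mitchell's example encoded as nested dictionaries (the dictionary representation and the classify helper are our own illustrative choices):

# Decision tree encoded as {attribute: {attribute value: subtree or class label}}
play_tennis_tree = {
    "Outlook": {
        "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def classify(tree, instance):
    # A leaf node is just a class label; return it as the classification.
    if not isinstance(tree, dict):
        return tree
    attribute = next(iter(tree))               # attribute tested at this node
    value = instance[attribute]                # the instance's value for that attribute
    return classify(tree[attribute][value], instance)   # follow the matching branch

instance = {"Outlook": "Sunny", "Temperature": "Hot", "Humidity": "High", "Wind": "Strong"}
print(classify(play_tennis_tree, instance))    # -> "No" (a negative instance)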
Decision tree learning is generally best suited to problems with the following characteristics:
Instances are represented by attribute-value pairs
The target function has discrete output values
Disjunctive descriptions may be required
The training data may contain missing attribute values
Given a collection of training examples, there are typically many decision trees consistent with these examples. The search strategy of ID3 determines which of these trees it outputs, and this preference constitutes its inductive bias.
Approximate inductive bias of ID3: shorter trees are preferred over larger trees.
A closer approximation to the inductive bias of ID3: shorter trees are preferred over longer trees, and trees that place high information gain attributes close to the root are preferred over those that do not.
Inductive inference refers to the process of generalizing knowledge from specific examples to make
probabilistic predictions about new, unseen instances.
The Inductive Learning Algorithm (ILA) is an iterative, inductive machine learning algorithm for generating a set of classification rules of the form "IF-THEN" from a set of examples; at each iteration it produces a new rule and appends it to the rule set.
Entropy and Information Theory
Entropy characterizes the (im)purity of an arbitrary collection of examples. For a collection S containing positive and negative examples of some target concept, the entropy of S relative to this Boolean classification is

Entropy(S) = -(p+) log2(p+) - (p-) log2(p-)

where p+ and p- are the proportions of positive and negative examples in S. For example, if S contains 9 positive and 5 negative examples (written [9+, 5-]), then

Entropy([9+, 5-]) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
Note:
The entropy is 0 if all members of S belong to the same class
Entropy([14+, 0-]) = -(14/14) log2(14/14) - (0/14) log2(0/14) = 0 (using the convention 0 log2 0 = 0)
The entropy is 1 when the collection contains an equal number of positive and negative
examples.
Entropy([7+, 7-]) = -(7/14) log2(7/14) - (7/14) log2(7/14) = 1
If the collection contains unequal numbers of positive and negative examples, the entropy is
between 0 and 1.
More generally, if the target attribute can take on c different values, then the entropy of S relative to this c-wise classification is defined as

Entropy(S) = Σ_{i=1..c} -p_i log2(p_i)

where p_i is the proportion of S belonging to class i. If the target attribute can take on c possible values, the entropy can be as large as log2(c).
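The sketch below computes these entropy values directly from the definition (the entropy helper, which takes the per-class example counts, is our own illustrative function, not library code):

from math import log2

def entropy(class_counts):
    # Entropy of a collection, given the number of examples in each class.
    total = sum(class_counts)
    proportions = [c / total for c in class_counts if c > 0]   # 0 * log2(0) is taken to be 0
    return sum(-p * log2(p) for p in proportions)

print(entropy([9, 5]))     # about 0.940  (mixed collection)
print(entropy([14, 0]))    # 0.0          (all members in the same class)
print(entropy([7, 7]))     # 1.0          (equal numbers of positive and negative examples)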
Information gain
Information gain measures the expected reduction in entropy: it is the difference between the entropy before the split and the weighted average entropy after splitting the dataset on the values of a given attribute. The ID3 (Iterative Dichotomiser 3) decision tree algorithm uses information gain:

Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)

where Values(A) is the set of all possible values for attribute A, and S_v is the subset of S for which attribute A has value v.
The information gain due to sorting the original 14 examples by the attribute Wind may then be calculated as follows. Here Values(Wind) = {Weak, Strong}, S = [9+, 5-], S_Weak = [6+, 2-] and S_Strong = [3+, 3-]:

Gain(S, Wind) = Entropy(S) - (8/14) Entropy(S_Weak) - (6/14) Entropy(S_Strong)
             = 0.940 - (8/14)(0.811) - (6/14)(1.00)
             = 0.048
Information gain is precisely the measure used by ID3 to select the best attribute at each step in
growing the tree.
For example, of the two attributes Humidity and Wind, which is the better classifier over the full set of 14 training examples? Gain(S, Humidity) = 0.151 while Gain(S, Wind) = 0.048, so the answer is Humidity.
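The same comparison can be reproduced with a short gain helper (reusing the entropy function from the sketch above; the class splits are those of the standard 14-example PlayTennis data):

def information_gain(parent_counts, child_counts_per_value):
    # Gain = entropy before the split minus the weighted entropy after the split.
    total = sum(parent_counts)
    weighted_child_entropy = sum(
        (sum(counts) / total) * entropy(counts)
        for counts in child_counts_per_value
    )
    return entropy(parent_counts) - weighted_child_entropy

s = [9, 5]                                                # the 14 examples: [9+, 5-]
gain_wind = information_gain(s, [[6, 2], [3, 3]])         # Weak: [6+, 2-], Strong: [3+, 3-]
gain_humidity = information_gain(s, [[3, 4], [6, 1]])     # High: [3+, 4-], Normal: [6+, 1-]
print(round(gain_wind, 3), round(gain_humidity, 3))       # 0.048 and about 0.152 (0.151 when intermediate entropies are rounded)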
ID-3 Algorithm
ID3 stands for Iterative Dichotomiser 3 and is named such because the algorithm iteratively
(repeatedly) dichotomizes (divides) features into two or more groups at each step.
Invented by Ross Quinlan, ID3 uses a top-down greedy approach to build a decision tree.
Our basic algorithm, ID3, learns decision trees by constructing them top down, beginning with the
question "which attribute should be tested at the root of the tree?" To answer this question, each
instance attribute is evaluated using a statistical test to determine how well it alone classifies the
training examples.
The best attribute is selected and used as the test at the root node of the tree. A descendant of the
root node is then created for each possible value of this attribute, and the training examples are
sorted to the appropriate descendant node (i.e., down the branch corresponding to the example's
value for this attribute).
The entire process is then repeated using the training examples associated with each descendant node to select the best attribute to test at that point in the tree. This forms a greedy search for an acceptable decision tree, in which the algorithm never backtracks to reconsider earlier choices.
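A compact sketch of this top-down procedure is given below. It is a simplified version of ID3 (it reuses the entropy and information_gain helpers above, and it omits refinements such as handling missing values or unseen attribute values):

from collections import Counter

def id3(examples, attributes, target):
    # examples: list of dicts mapping attribute names (and the target) to values.
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:                    # all examples share one class: make a leaf
        return labels[0]
    if not attributes:                           # no attributes left: predict the majority class
        return Counter(labels).most_common(1)[0][0]

    def gain_of(attribute):
        parent = list(Counter(labels).values())
        children = [
            list(Counter(ex[target] for ex in examples if ex[attribute] == v).values())
            for v in {ex[attribute] for ex in examples}
        ]
        return information_gain(parent, children)

    best = max(attributes, key=gain_of)          # test the highest-gain attribute at this node
    tree = {best: {}}
    for value in {ex[best] for ex in examples}:  # one descendant per observed attribute value
        subset = [ex for ex in examples if ex[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, remaining, target)
    return tree

Called with the 14 PlayTennis examples and the attributes Outlook, Temperature, Humidity and Wind, this produces a nested-dictionary tree of the same form used in the classification sketch earlier.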
An Illustrative Example
Consider the standard PlayTennis training set of 14 examples. ID3 computes the information gain of each candidate attribute (Outlook, Temperature, Humidity, Wind); Outlook has the highest gain and is selected as the root test. The process is then repeated on the examples sorted down each branch until every leaf contains examples of a single class or all attributes have been used.
Issues in Decision Tree Learning
Overfitting is a significant practical difficulty for decision tree learning and many other learning methods. Random noise in the training examples, or coincidental regularities when the training set is small, can lead to overfitting.
There are several approaches to avoiding overfitting in decision tree learning. These can be grouped
into two classes:
Pre-pruning – stop growing the tree early: while the tree is being grown, a node of low importance is not expanded (or is removed), before the tree reaches the point where it perfectly classifies the training data.
Post-pruning – allow the tree to grow to its full depth, possibly overfitting the data, and then prune back nodes of low significance.
A common criterion for deciding which nodes to prune is the training and validation set approach: the available data are separated into a training set used to grow the tree and a separate validation set used to evaluate pruning decisions. Even though the learner may be misled by random errors and coincidental regularities within the training set, the validation set provides a safety check against overfitting.
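As a concrete sketch of the grow-then-prune pattern with a validation set, the code below uses scikit-learn's cost-complexity post-pruning (a different pruning criterion than the one described above, but the same validate-to-select idea; the dataset and split sizes are illustrative assumptions):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Grow a full tree on the training set, then enumerate candidate pruning levels.
full_tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Fit one pruned tree per alpha; the validation set decides how much pruning to keep.
best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(criterion="entropy", ccp_alpha=alpha, random_state=0)
    pruned.fit(X_train, y_train)
    score = pruned.score(X_val, y_val)
    if score > best_score:
        best_alpha, best_score = alpha, score

print("selected ccp_alpha:", best_alpha, "validation accuracy:", round(best_score, 3))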
Our initial definition of ID3 is restricted to attributes that take on a discrete set of values. Continuous-valued decision attributes can also be incorporated into the learned tree. This is accomplished by dynamically defining new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals. In particular, for an attribute A that is continuous-valued, the algorithm can dynamically create a new Boolean attribute A_c that is true if A < c and false otherwise, where the threshold c is chosen to maximize information gain.
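A sketch of one common way to pick such a threshold: candidate cutpoints are placed midway between adjacent sorted values where the class label changes, and the candidate with the highest information gain is kept (this reuses the entropy and information_gain helpers above; the data are Mitchell's Temperature example):

from collections import Counter

def best_threshold(values, labels):
    # Choose a cutpoint c for continuous attribute A, defining the Boolean test A < c.
    pairs = sorted(zip(values, labels))
    candidates = [
        (pairs[i][0] + pairs[i + 1][0]) / 2      # midpoint between adjacent sorted values
        for i in range(len(pairs) - 1)
        if pairs[i][1] != pairs[i + 1][1]        # only where the class label changes
    ]
    parent = list(Counter(labels).values())

    def gain_for(c):
        below = [label for value, label in pairs if value < c]
        above = [label for value, label in pairs if value >= c]
        return information_gain(parent, [list(Counter(below).values()),
                                         list(Counter(above).values())])

    return max(candidates, key=gain_for)

temperature = [40, 48, 60, 72, 80, 90]
play_tennis = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(best_threshold(temperature, play_tennis))   # 54.0 -> new Boolean attribute Temperature < 54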
An alternative selection measure that addresses the bias of information gain towards attributes with many values is the gain ratio, which divides the gain by the split information:

SplitInformation(S, A) = - Σ_{i=1..c} (|S_i| / |S|) log2(|S_i| / |S|)

GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)

where S1 through Sc are the c subsets of examples resulting from partitioning S by the c-valued attribute A.
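A sketch of these two measures in code, reusing the entropy and information_gain helpers defined earlier (the guard against a zero split information is needed when nearly all examples share one attribute value):

def split_information(child_counts_per_value):
    # Entropy of S with respect to the values of attribute A itself.
    subset_sizes = [sum(counts) for counts in child_counts_per_value]
    return entropy(subset_sizes)

def gain_ratio(parent_counts, child_counts_per_value):
    # Penalizes attributes that split the data into many small subsets.
    split_info = split_information(child_counts_per_value)
    if split_info == 0:                       # all examples share a single attribute value
        return 0.0
    return information_gain(parent_counts, child_counts_per_value) / split_info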
A variety of other selection measures have been proposed as well. However, experimental studies suggest that the choice of attribute selection measure has a smaller impact on final accuracy than the extent and method of post-pruning.
In certain cases the available data may be missing values for some attributes. One strategy for dealing with a missing attribute value is to assign it the value that is most common among the training examples at the node where the attribute is being evaluated.
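A small sketch of this strategy (pure Python, with None marking a missing value; the examples passed in are those associated with the current node, represented as a list of dictionaries):

from collections import Counter

def fill_most_common(examples, attribute):
    # Replace missing values of an attribute with its most common observed value.
    observed = [ex[attribute] for ex in examples if ex[attribute] is not None]
    most_common_value = Counter(observed).most_common(1)[0][0]
    for ex in examples:
        if ex[attribute] is None:
            ex[attribute] = most_common_value
    return examples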
In some learning tasks the instance attributes have associated measurement costs, and these costs can vary significantly from attribute to attribute. In such cases we prefer decision trees that use low-cost attributes where possible, relying on high-cost attributes only when needed to produce reliable classifications. ID3 can be modified to take attribute costs into account by introducing a cost term into the attribute selection measure. For example, we might divide the Gain by the cost of the attribute, so that lower-cost attributes are preferred.
While such cost-sensitive measures do not guarantee finding an optimal cost-sensitive decision
tree, they do bias the search in favor of low-cost attributes.
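A tiny sketch of this cost-sensitive selection (the attribute names, gains and costs below are purely hypothetical numbers for illustration):

attribute_gains = {"Temperature": 0.12, "BloodTest": 0.20}   # hypothetical information gains
attribute_costs = {"Temperature": 1.0, "BloodTest": 5.0}     # hypothetical measurement costs

def cost_sensitive_score(attribute):
    # Divide the gain by the cost, so lower-cost attributes are preferred.
    return attribute_gains[attribute] / attribute_costs[attribute]

best = max(attribute_gains, key=cost_sensitive_score)
print(best)   # "Temperature": its lower cost outweighs its slightly lower gain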
Reference
https://www.datacamp.com/tutorial/decision-tree-classification-python
https://www.kdnuggets.com/2020/01/decision-tree-algorithm-explained.html
https://scikit-learn.org/stable/modules/tree.html
EXAMPLE:
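A minimal end-to-end sketch in the spirit of the tutorials cited above, using scikit-learn's DecisionTreeClassifier with the entropy criterion (the dataset, depth limit and split sizes are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# criterion="entropy" makes the splits information-gain based, as in ID3.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=1)
clf.fit(X_train, y_train)

print("test accuracy:", round(clf.score(X_test, y_test), 3))
print(export_text(clf, feature_names=list(load_iris().feature_names)))   # the tree as if-then rules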