3510-Machine Learning
Lecture 3: Classification (Part II)
Outline
1 Decision trees
Introduction to Decision Trees
Regression Trees
Classification trees
Bagging or bootstrap aggregation
Random Forests
Boosting
Comparison and summary of decision trees
2 Support Vector Machines (SVM)
Introduction
Maximal Margin Classifier
Support Vector Classifier
Support Vector Machines
3 References
Advantages
+ Decision trees are easy to interpret.
Disadvantages:
- Decision trees are usually not competitive with other supervised
learning approaches in terms of prediction accuracy.
Regression Trees
Raw data:
Salary is color-coded from low (blue) and medium (green)
to high (yellow, red).
Overall, the tree stratifies the predictor space into three regions:
Characteristics of trees:
Nodes at the bottom with no branches are called terminal nodes or leaves.
Each terminal node represents a region Rj.
The nodes in the tree where the predictor space is split are referred to as internal nodes.
For our example, the tree has two internal nodes and three terminal nodes, or leaves.
The segments of the tree that leave an internal node are called branches.
Predictions: The number in each leaf is the mean of the response variable
for the observations that fall in the corresponding region.
Illustrative exercise
We will now focus on step 1. In theory, the regions could have any shape.
However, to simplify, we divide the predictor space into high-dimensional
rectangles, or boxes.
The goal is to find boxes R1, . . . , RJ that minimize the RSS:
Σ_{j=1}^{J} Σ_{i: xi ∈ Rj} (yi − ŷRj)²
where ŷRj is the mean of the target Y for the training observations
belonging to the jth region.
Step 1: Select the predictor j and the cutpoint s such that splitting the
predictor space in two regions: {X |Xj < s} and {X |Xj ≥ s} leads
to the greatest decrease of RSS. In other words:
For any pair (j, s) we define the pair of half-planes:
R1 (j, s) = {X |Xj < s} and R2 (j, s) = {X |Xj ≥ s}
and seek the values of j and s that minimize:
Σ_{i: xi ∈ R1(j,s)} (yi − ŷR1)² + Σ_{i: xi ∈ R2(j,s)} (yi − ŷR2)²
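The greedy search over (j, s) can be written in a few lines. The sketch below is illustrative only (plain NumPy, not code from the lecture) and assumes numeric predictors: for every predictor j and candidate cutpoint s it computes the RSS of the two half-planes and keeps the best pair.

```python
import numpy as np

def best_split(X, y):
    """Return the (j, s, RSS) triple minimizing the two-region RSS."""
    best = (None, None, np.inf)
    n, p = X.shape
    for j in range(p):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            if len(left) == 0 or len(right) == 0:
                continue  # skip splits that leave one region empty
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if rss < best[2]:
                best = (j, s, rss)
    return best

rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 2))
y = np.where(X[:, 0] < 0.5, 1.0, 3.0) + rng.normal(scale=0.1, size=50)
print(best_split(X, y))  # should recover a cutpoint near 0.5 on predictor 0
```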
Tree Pruning
SOLUTION:
A good strategy is to grow a very large tree T0 (with many leaves), and
then prune it back in order to obtain a subtree.
This approach is called cost-complexity pruning, also known as weakest-link pruning.
Intuition: the goal is to select a subtree that leads to the lowest test error.
Approach: For each value of α there is a subtree T ⊂ T0 that minimizes:
Σ_{m=1}^{|T|} Σ_{i: xi ∈ Rm} (yi − ŷRm)² + α|T|
where |T| is the number of terminal nodes of the tree T, Rm is the region corresponding
to the mth terminal node, and ŷRm is the predicted response associated with Rm.
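As an illustration (not part of the lecture), scikit-learn exposes this penalty as ccp_alpha. The sketch below grows a large tree T0 and shows how increasing α prunes it back to smaller subtrees; in practice α is then chosen by cross-validation, as described next.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=0)

# Grow a large tree T0, then obtain the sequence of effective alphas.
t0 = DecisionTreeRegressor(random_state=0).fit(X, y)
path = t0.cost_complexity_pruning_path(X, y)

for alpha in path.ccp_alphas[::10]:
    subtree = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha).fit(X, y)
    print(f"alpha={alpha:8.2f}  leaves={subtree.get_n_leaves()}")
```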
A set of observations is randomly split into five non-overlapping groups. Each of these
fifths acts as a validation set. The test error is estimated by averaging the five estimates.
First, randomly divide the data set in half, yielding 132 observations
in the training set and 131 observations in the test set.
We build a large regression tree T0 on the training data and vary α in
order to create subtrees with different numbers of terminal nodes.
Finally, perform six-fold cross-validation in order to estimate the
cross-validated MSE of the trees as a function of α.
Orange: test error; black: training error curve; Green: CV error. Also shown are standard
error bars around the estimated errors.
The selected tree has three leaves and was shown previously.
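A hedged sketch of the same selection procedure on synthetic data (the lecture's data set is not reproduced here): candidate α values come from the pruning path, and the value with the lowest cross-validated MSE is kept.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=263, n_features=6, noise=15.0, random_state=1)

big_tree = DecisionTreeRegressor(random_state=0).fit(X, y)
alphas = big_tree.cost_complexity_pruning_path(X, y).ccp_alphas

cv_mse = []
for a in alphas:
    tree = DecisionTreeRegressor(random_state=0, ccp_alpha=a)
    scores = cross_val_score(tree, X, y, cv=6, scoring="neg_mean_squared_error")
    cv_mse.append(-scores.mean())

best_alpha = alphas[int(np.argmin(cv_mse))]
print(f"best alpha = {best_alpha:.2f}")
```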
Classification trees
Let us suppose the response variable Y has 3 categories:
Gini index G
G = Σ_{k=1}^{K} p̂mk (1 − p̂mk),
where p̂mk is the proportion of training observations in the mth region that belong to the kth class.
Intuition: Gini index takes on a small value if all of the p̂mk ’s are either
close to 0 or 1. For this reason the Gini index is referred to as a measure
of node purity - a small value indicates that a node contains
predominantly observations from a single class.
Cross-entropy or Deviance D
D = − Σ_{k=1}^{K} p̂mk log p̂mk
It turns out that the Gini index and the cross-entropy are numerically very similar.
Since 0 ≤ p̂mk ≤ 1, it follows that −p̂mk log p̂mk ≥ 0. One can deduce
that the cross-entropy will take on a value near zero if the p̂mk ’s are all
near 0 or near 1. Therefore, the cross-entropy will take on a small value if
the mth node is pure.
Cross-entropy and the Gini index are differentiable, and hence better suited to
numerical optimization. However, the classification error rate is preferable if prediction
accuracy is the goal.
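A small illustrative computation (not from the lecture) of both impurity measures, for a nearly pure node and for a maximally mixed node:

```python
import numpy as np

def gini(p):
    return np.sum(p * (1.0 - p))

def cross_entropy(p):
    p = p[p > 0]                      # convention: 0 * log(0) = 0
    return -np.sum(p * np.log(p))

for p in [np.array([0.98, 0.01, 0.01]),   # nearly pure node
          np.array([1/3, 1/3, 1/3])]:     # maximally mixed node
    print(f"p={p}  Gini={gini(p):.3f}  entropy={cross_entropy(p):.3f}")
```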
Some remarks:
It is possible to include qualitative predictors.
Some splits yield two terminal nodes that have the same predicted value.
Bagging or bootstrap aggregation
Bagging Illustration
Example: relationship between ozone and temperature:
B = 100 models were fitted on bootstrap samples. Gray: predictions from 10 of the
fitted models; red: average of the 100 fitted models.
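The figure can be reproduced in spirit with scikit-learn's BaggingRegressor; the snippet below is a sketch on synthetic one-dimensional data, not the ozone data itself.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 150))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)   # noisy 1-D relationship
X = x.reshape(-1, 1)

bag = BaggingRegressor(n_estimators=100, random_state=0).fit(X, y)  # B = 100 trees
single = DecisionTreeRegressor(random_state=0).fit(X, y)            # one deep tree

x_new = np.array([[2.5], [7.0]])
print("single tree:   ", single.predict(x_new))
print("bagged (B=100):", bag.predict(x_new))
```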
Random Forests
The term cov(Zi, Zj) equals 0 only if the Zi's are uncorrelated (for example, independent).
Goal: predict cancer type based on 500 genes with high variance.
If m = p, i.e., all p predictors are considered at each split, this amounts simply to bagging.
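A sketch (scikit-learn assumed as tooling) contrasting m ≈ √p with m = p: with max_features=None every predictor is considered at each split, which reduces the random forest to plain bagging.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=50, n_informative=8,
                           random_state=0)

for m in ["sqrt", None]:   # m ≈ sqrt(p) vs. m = p (bagging)
    rf = RandomForestClassifier(n_estimators=200, max_features=m, random_state=0)
    acc = cross_val_score(rf, X, y, cv=5).mean()
    print(f"max_features={m}: CV accuracy = {acc:.3f}")
```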
Introduction to Boosting
Source: https://vitalflux.com/adaboost-algorithm-explained-with-python-example/,
https://towardsdatascience.com/the-ultimate-guide-to-adaboost-random-forests-and-xgboost-7f9327061c4f
The final prediction is the weighted majority vote of all the weak learners.
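For illustration only, scikit-learn's AdaBoostClassifier implements this scheme, with shallow trees ("stumps") as the default weak learners fitted sequentially on re-weighted data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Default weak learner: a depth-1 decision tree (a stump).
ada = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("test accuracy:", ada.score(X_te, y_te))
```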
Comparison and summary of decision trees
Decision trees are simple and interpretable models for regression and classification.
However, they are often not competitive with other methods in terms of prediction accuracy.
Bagging, random forests and boosting are good methods for improving the prediction
accuracy of trees at the expense of interpretability. They work by growing many trees
on the training data and then combining the predictions of the resulting ensemble of trees.
The latter two methods, random forests and boosting, are among the state-of-the-art
methods for supervised learning. However, their results can be difficult to interpret.
Introduction
A little history:
Support Vector Machines, usually referred to simply as SVMs, were developed in
the 1990s by Vladimir Vapnik.
Since then, SVMs have been shown to perform well in a variety of settings,
and are often considered one of the best out-of-the-box classifiers.
SVM principle
Finding a hyperplane that separates the classes in feature space.
What is a Hyperplane?
In p dimensions, a hyperplane is defined by the equation
β0 + β1 X1 + β2 X2 + . . . + βp Xp = 0
Hyperplane in 2 dimensions: β0 + β1 X1 + β2 X2 = 0
A separating hyperplane
The hyperplane 1 + 2X1 + 3X2 = 0 is shown. Blue region: the set of points for which
1 + 2X1 + 3X2 > 0; red region: the set of points for which 1 + 2X1 + 3X2 < 0.
Suppose there exists a hyperplane that perfectly separates the two classes
in the training observations:
By coding:
yi = +1 for blue and yi = −1 for red class.
Then, a separating hyperplane has the property:
yi (β0 +β1 xi1 +β2 xi2 +. . .+βp xip ) > 0 ∀i = 1, . . . , n
Given a test observation x∗, classify it based on the sign of:
f(x∗) = β0 + β1x1∗ + β2x2∗ + . . . + βpxp∗
If f(x∗) > 0 then blue; if f(x∗) < 0 then red.
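A tiny sketch of this rule for the hyperplane 1 + 2X1 + 3X2 = 0 used earlier (illustrative values only, not lecture code):

```python
import numpy as np

beta0, beta = 1.0, np.array([2.0, 3.0])      # hyperplane 1 + 2*X1 + 3*X2 = 0

def classify(x_star):
    f = beta0 + beta @ x_star
    return "blue" if f > 0 else "red"        # f > 0 -> blue, f < 0 -> red

print(classify(np.array([1.0, 1.0])))        # f = 6  -> blue
print(classify(np.array([-1.0, -1.0])))      # f = -4 -> red
```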
Maximal Margin Classifier
When the two classes cannot be separated by a hyperplane, the maximal margin
hyperplane does not exist. In this case, the optimization problem has no solution
with M > 0.
Support Vector Classifier
Solution: Consider a hyperplane that does not perfectly separate the two classes, in the interest of:
Greater robustness to individual observations, and
Better classification of most of the training observations.
Such a classifier is called a support vector classifier or soft margin classifier.
It allows some observations:
to be on the wrong side of the margin, and even
to be on the wrong side of the hyperplane.
Left: Red class: observation 1 is on the wrong side of the margin. Blue class: observation 8 is on
the wrong side of the margin.
Right: Same as the left panel with two additional points, 11 and 12. Both are on the wrong
side of the hyperplane and on the wrong side of the margin.
maximize M over β0, β1, . . . , βp, ε1, . . . , εn
subject to: Σ_{j=1}^{p} βj² = 1,
yi(β0 + β1xi1 + β2xi2 + . . . + βpxip) ≥ M(1 − εi) for all i = 1, . . . , n,
εi ≥ 0, Σ_{i=1}^{n} εi ≤ C.
Parameters’ interpretation
C is generally chosen via cross-validation.
C controls the bias-variance trade-off:
If C is small, then margins are narrow and rarely violated ⇒ low bias but high variance.
If C is large, then margins are wide and more violations are allowed ⇒ the classifier is more biased but may have lower variance.
Only observations that either lie directly on the margin or that violate the
margin will affect the hyperplane. These are the support vectors.
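An illustrative sketch with scikit-learn's SVC (assumed tooling, not the lecture's code). Note that scikit-learn's C penalizes margin violations, so it plays roughly the inverse role of the budget C used above: a large scikit-learn C tolerates few violations, a small one tolerates many.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           class_sep=1.0, random_state=0)

for c in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=c).fit(X, y)
    # Only the support vectors determine the fitted hyperplane.
    print(f"C={c:>6}: {clf.support_vectors_.shape[0]} support vectors")
```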
Support Vector Machines
Feature expansion
In a higher-dimensional space, the data can become linearly separable.
Source: https://towardsdatascience.com/the-kernel-trick-c98cdbcaeb3f
Now, consider replacing the inner product in the support vector classifier
optimization function with a generalization of the form K(xi, xi′),
where K is referred to as a kernel. A kernel is a function that quantifies
the similarity of two observations. For instance:
Linear kernel: K(xi, xi′) = Σ_{j=1}^{p} xij xi′j, which is the kernel used by the support vector classifier.
Polynomial kernel: K(xi, xi′) = (1 + Σ_{j=1}^{p} xij xi′j)^d, with d > 0 an integer.
Radial kernel: K(xi, xi′) = exp(−γ Σ_{j=1}^{p} (xij − xi′j)²), where γ is a positive constant.
The support vector machine (SVM) is an extension of the support
vector classifier that results from enlarging the feature space using kernels.
Left: SVM with a polynomial kernel; Right: SVM with a radial kernel.
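A sketch (scikit-learn assumed as tooling, synthetic two-moons data) comparing the linear, polynomial, and radial kernels on data that is not linearly separable in the original feature space:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)

for clf, name in [(SVC(kernel="linear"), "linear"),
                  (SVC(kernel="poly", degree=3), "polynomial (d=3)"),
                  (SVC(kernel="rbf", gamma=1.0), "radial (gamma=1)")]:
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name:>18}: CV accuracy = {acc:.3f}")
```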
OVO (One versus One): fit one binary classifier f̂kl(x) for each pair of classes, i.e., K(K − 1)/2 classifiers in total for K classes.
Summary
References