
IG.3510 - Machine Learning
Lectures 3: Classification (Part II)

Dr. Patricia CONDE-CESPEDES

[email protected]

September 30th, 2024



Plan

1 Decision trees

2 Support Vector Machines (SVM)

3 References




Outline
1 Decision trees
Introduction to Decision Trees
Regression Trees
Classification trees
Bagging or bootstrap aggregation
Random Forests
Boosting
Comparison and summary of decision trees
2 Support Vector Machines (SVM)
Introduction
Maximal Margin Classifier
Support Vector Classifier
Support Vector Machines
3 References



Introduction to Decision Trees

Decision trees can be used for classification or for regression.
Decision tree approaches involve stratifying or segmenting the predictor space into a number of simple regions.
The splitting rules used can be represented using a tree diagram; that is where the name decision tree comes from.


Advantages and disadvantages

Advantages:
+ Decision trees are easy to interpret.

Disadvantages:
- On their own, decision trees are usually not competitive with other supervised learning approaches in terms of prediction accuracy.




Introductory example with the Hitters dataset (1/3)

Example: predict a baseball player's salary based on:
  Years (the number of years that he has played in the major leagues)
  Hits (the number of hits the player made in the previous year)

Raw data: Salary is color-coded from low (blue) and medium (green) to high (yellow, red).

How do we stratify the predictor space?


Regression tree for the Hitters data (2/3)

Overall, the tree stratifies the predictor space into three regions:
R1 = {X | Years < 4.5},
R2 = {X | Years ≥ 4.5, Hits < 117.5}, and
R3 = {X | Years ≥ 4.5, Hits ≥ 117.5}.

How to interpret the regression tree for the Hitters example (3/3)

At a given internal node, the condition Xj < t indicates the rule used to split the predictor space:
  If the condition is True, consider the left-hand branch;
  else, consider the right-hand branch (which corresponds to Xj ≥ t).


Terminology for Trees

Characteristics of trees:
  Nodes at the bottom with no branches are called terminal nodes or leaves.
  Each terminal node represents a region Rj.
  The nodes in the tree where the predictor space is split are referred to as internal nodes.
  For our example, the tree has two internal nodes and three terminal nodes, or leaves.
  The segments of the tree outgoing from an internal node are called branches.

Predictions: the number in each leaf is the mean of the response variable for the observations that fall in the corresponding region.


Illustrative exercise

Given the following training observations and the following rules:

Training observations:
  X1  X2   Y
   1   2   3
   2   1   2
   2   2   4
   2   4   8
   3   1   3
   3   5   9
   4   4  11
   5   1   5
   6   2   7
   6   5  12

Rules:
  if (X2 > 3) then R1
  else:
    if (X1 < 4) then R2
    else R3

Questions:
1) Build the regression tree.
2) Make predictions for the following test observations:
   X1 = 1, X2 = 4
   X1 = 7, X2 = 2
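As a companion to this exercise, here is a minimal Python sketch (not part of the original slides) that applies the given rules to the training observations, computes the mean response per region, and predicts for the two test points. The function and variable names are illustrative choices.

```python
# A small sketch of the exercise: assign each observation to R1, R2 or R3 with
# the given rules, use region means as predictions, then predict the test points.
import numpy as np

X = np.array([[1, 2], [2, 1], [2, 2], [2, 4], [3, 1],
              [3, 5], [4, 4], [5, 1], [6, 2], [6, 5]], dtype=float)
y = np.array([3, 2, 4, 8, 3, 9, 11, 5, 7, 12], dtype=float)

def region(x1, x2):
    """Assign an observation to a region using the exercise's splitting rules."""
    if x2 > 3:
        return "R1"
    return "R2" if x1 < 4 else "R3"

labels = np.array([region(x1, x2) for x1, x2 in X])
means = {r: y[labels == r].mean() for r in ("R1", "R2", "R3")}
print(means)  # region means = the regression tree's leaf predictions

for x1, x2 in [(1, 4), (7, 2)]:
    r = region(x1, x2)
    print((x1, x2), "->", r, "predicted y =", means[r])
```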


Interpretation of Results for the Hitters data

Years of experience is the most important factor to determine Salary.
For a more experienced player (more than 5 years), the number of hits made in the previous year is important to determine the salary.


The process of building a regression tree

There are roughly two steps:

Step 1: Stratification: divide the predictor space - the set of possible values for X1, X2, ..., Xp - into J disjoint regions R1, R2, ..., RJ.
Step 2: Prediction: given an observation (x1, x2, ..., xp) that falls into region Rj, its predicted value ŷ is the mean of the response variable among all the training observations falling in Rj.

We will now focus on Step 1. In theory, the regions could have any shape. However, to simplify, we divide the predictor space into high-dimensional rectangles, or boxes.


How to stratify the feature space?

The goal is to find regions R1, R2, ..., RJ that minimize the RSS (Residual Sum of Squares):

$$\sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2$$

where $\hat{y}_{R_j}$ is the mean of the target Y for the training observations belonging to the jth region.

Unfortunately, it is computationally infeasible to consider every possible partition of the feature space into J boxes!


Recursive Binary Splitting for stratification

Recursive Binary Splitting is a top-down, greedy approach.

Top-down: it begins at the top of the tree (when all observations belong to a single region) and then successively splits the predictor space; each split is indicated via two new branches further down the tree.
Greedy: at each step, the best split is made at that particular step, rather than looking ahead and picking a split that would lead to a better tree in some future step.


Recursive Binary splitting process (1/2)

Step 1: Select the predictor j and the cutpoint s such that splitting the predictor space into the two regions {X | Xj < s} and {X | Xj ≥ s} leads to the greatest decrease in RSS. In other words:
For any pair (j, s) we define the pair of half-planes
  R1(j, s) = {X | Xj < s} and R2(j, s) = {X | Xj ≥ s}
and seek the values of j and s that minimize

$$\sum_{i:\, x_i \in R_1(j,s)} (y_i - \hat{y}_{R_1})^2 + \sum_{i:\, x_i \in R_2(j,s)} (y_i - \hat{y}_{R_2})^2$$

where $\hat{y}_{R_1}$ is the mean response for the training observations in R1(j, s), and $\hat{y}_{R_2}$ is the mean response for the training observations in R2(j, s).

This step can be done quite quickly, especially if p is small!
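To make Step 1 concrete, here is a minimal Python sketch (my own illustration, not the lecture's code) of a brute-force search for the best single split (j, s); it assumes a small NumPy feature matrix X and response vector y.

```python
# Brute-force search for the split (j, s) minimizing the two-region RSS.
import numpy as np

def best_split(X, y):
    """Return (j, s, rss) for the best single binary split of the data."""
    n, p = X.shape
    best = (None, None, np.inf)
    for j in range(p):
        for s in np.unique(X[:, j]):            # candidate cutpoints = observed values
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            if len(left) == 0 or len(right) == 0:
                continue                         # a split must create two non-empty regions
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if rss < best[2]:
                best = (j, s, rss)
    return best
```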
Recursive Binary splitting process (2/2)

The next steps consist in repeating Step 1 to recursively split the previously created regions:

Step 2: Repeat Step 1: look for the best predictor and best cutpoint in order to split the data further so as to minimize the RSS. However, instead of splitting the entire predictor space, split one of the two previously identified regions. At the end of this step, there are three regions.
Step 3: Repeat in order to split one of these three regions further, so as to minimize the RSS.
Step 4: Repeat the process until a stopping criterion is reached, for instance until no region contains more than five observations.

Predictions: once the regions R1, ..., RJ are created, make predictions by taking the mean value of the training observations in each region. A recursive sketch of this procedure is given below.
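The sketch below (again my own illustration, assuming the best_split() function from the previous sketch is in scope and that X, y are NumPy arrays) grows a tree recursively and stops splitting regions with five observations or fewer, matching the stopping rule above.

```python
# Recursive binary splitting with a minimum-region-size stopping criterion.
def grow_tree(X, y, min_size=5):
    j, s, _ = best_split(X, y) if len(y) > min_size else (None, None, None)
    if j is None:                                   # leaf: predict the region mean
        return {"leaf": True, "prediction": float(y.mean())}
    mask = X[:, j] < s
    return {"leaf": False, "feature": j, "cutpoint": float(s),
            "left": grow_tree(X[mask], y[mask], min_size),
            "right": grow_tree(X[~mask], y[~mask], min_size)}

def predict_one(tree, x):
    """Follow the splits until a leaf is reached, then return its mean."""
    while not tree["leaf"]:
        tree = tree["left"] if x[tree["feature"]] < tree["cutpoint"] else tree["right"]
    return tree["prediction"]
```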


Example of the Recursive Binary splitting process result

Left: the output of recursive binary splitting on a two-dimensional example.
Center: a tree corresponding to the partition in the left panel.
Right: a perspective plot of the prediction surface corresponding to that tree.


Counter-example for recursive binary splitting

A partition of the two-dimensional feature space that could not result from recursive binary splitting.


Tree Pruning

How many leaves should the tree have?

A tree with too many terminal nodes may overfit, leading to good performance on the training set but poor performance on the test set. For instance, consider a tree having as many terminal nodes as observations, so that each observation has its own region: the training error is zero.

SOLUTION:
A good strategy is to grow a very large tree T0 (with many leaves), and then prune it back in order to obtain a subtree.
This approach is called cost complexity pruning, also known as weakest link pruning.


Cost complexity pruning / weakest link pruning

Intuition: the goal is to select a subtree that leads to the lowest test error.
Approach: for each value of α there is a subtree T ⊂ T0 such that

$$\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T| \quad \text{is minimal,}$$

where |T| is the number of terminal nodes of the tree T, Rm is the region corresponding to the mth terminal node, and $\hat{y}_{R_m}$ is the predicted response associated with Rm.

Which value of α to select?
Select the optimal value α* using cross-validation to estimate the test error.
Then, use the full dataset to obtain the subtree that corresponds to α*.


About the parameter α

The choice of the parameter α is crucial!

The tuning parameter α controls a trade-off between the RSS (fit to the training data) and the subtree's complexity.
When α = 0, we get T0.
As α increases, there is a price to pay for having a tree with many terminal nodes, and so a smaller subtree will be preferable.
As α increases, branches get pruned from the tree in a nested and predictable fashion, so obtaining the whole sequence of subtrees as a function of α is easy!

When performing cross-validation, α plays the role of a penalty for a very big tree that barely contributes to decreasing the test error.


Reminder: K-fold Cross-validation

Idea: randomly split the data into K equal-sized groups or folds. Then, leave out part k, fit the model to the other K − 1 parts (combined), and obtain predictions for the left-out kth part.
Repeat for each fold k = 1, 2, ..., K and estimate the test error.
Finally, the estimated overall test error is the average of the K estimates.

A schematic display of 5-fold CV: a set of observations is randomly split into five non-overlapping groups; each of these fifths acts as a validation set, and the test error is estimated by averaging the five estimates.

Summary of building a regression tree

Step 1: Use recursive binary splitting to build a large tree T0 on the training data.
Step 2: Apply cost complexity pruning to T0 in order to obtain a sequence of best subtrees, as a function of α.
Step 3: Use K-fold cross-validation to choose α. For k = 1, ..., K:
  1. Repeat Steps 1 and 2 on the (K − 1)/K fraction of the training data, excluding the kth fold.
  2. Estimate the test error on the data in the left-out kth fold, as a function of α.
  Average the results, and choose the α that minimizes the average estimated test error.
Step 4: Return the subtree from Step 2 that corresponds to the chosen value of α.
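As an illustration of these four steps, here is a minimal scikit-learn sketch (my own, not the lecture's code), where ccp_alpha plays the role of α. It assumes a NumPy feature matrix X and response vector y are already loaded (e.g., from the Hitters data).

```python
# Cost-complexity pruning with cross-validated choice of alpha.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

def prune_with_cv(X, y, k=6):
    # Steps 1-2: grow a large tree and obtain the nested sequence of subtrees (alphas)
    path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
    cv_mse = []
    for alpha in path.ccp_alphas:
        tree = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha)
        # Step 3: k-fold CV estimate of the test MSE for this alpha
        mse = -cross_val_score(tree, X, y, cv=k,
                               scoring="neg_mean_squared_error").mean()
        cv_mse.append(mse)
    best_alpha = path.ccp_alphas[int(np.argmin(cv_mse))]
    # Step 4: refit on the full data with the chosen alpha
    return DecisionTreeRegressor(random_state=0, ccp_alpha=best_alpha).fit(X, y)
```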


Example with the Hitters data set (1/3)

Consider the Hitters dataset:
  First, randomly divide the data set in half, yielding 132 observations in the training set and 131 observations in the test set.
  We build a large regression tree T0 on the training data and vary α in order to create subtrees with different numbers of terminal nodes.
  Finally, perform six-fold cross-validation in order to estimate the cross-validated MSE of the trees as a function of α.

Notice there is a ONE-to-ONE correspondence between α and the number of leaves |T|.


Large regression tree T0, Hitters example (2/3)

Unpruned tree resulting from recursive binary splitting on the Hitters data with 9 predictors.


Cross-validation, Hitters example (3/3)

CV error as a function of the number of leaves.
Orange: test error; black: training error curve; green: CV error. Also shown are standard error bars around the estimated errors.
The selected tree has three leaves and was shown previously.


Classification trees

Let us suppose the response variable Y has 3 categories.


How to build a Classification tree?

The procedure is very similar to that for regression trees, except that the predicted variable Y is qualitative.

In classification, given a new observation, we predict that it belongs to the most commonly occurring class in the region to which it belongs.

As in regression, start by building a large classification tree using recursive binary splitting. However, in classification we cannot use the RSS as a criterion for making binary splits.
Instead we minimize other criteria which measure the purity of a node: the Gini index and the cross-entropy.
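For concreteness, here is a minimal scikit-learn sketch (my own illustration on a standard 3-class toy dataset, not the lecture's code); the criterion argument selects between the purity measures introduced below.

```python
# Fitting a classification tree; criterion="gini" or "entropy".
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)                     # 3-class example data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
# predictions are the majority class of the leaf an observation falls into
```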


The classification error rate

The classification error rate of a region m is simply the fraction of the training observations in that region that do not belong to the most common class:

$$\mathrm{Error}_{m,\mathrm{Train}} = 1 - \max_k(\hat{p}_{mk}).$$

Here $\hat{p}_{mk}$ represents the proportion of training observations in the mth region that belong to class k.

The classification error rate is preferable if prediction accuracy of the final pruned tree is the goal.


Gini index G

For a given region m the Gini index is defined by:

$$G_m = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk})$$

where $\hat{p}_{mk}$ is the proportion of training observations in region m that belong to class k.

Intuition: the Gini index takes on a small value if all of the $\hat{p}_{mk}$'s are either close to 0 or 1. For this reason the Gini index is referred to as a measure of node purity: a small value indicates that a node contains predominantly observations from a single class.


Cross-entropy or Deviance D

An alternative to the Gini index is the cross-entropy. For a given region m this index is given by:

$$D_m = - \sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}$$

It turns out that the Gini index and the cross-entropy are very similar numerically.
Since $0 \le \hat{p}_{mk} \le 1$, it follows that $-\hat{p}_{mk} \log \hat{p}_{mk} \ge 0$. One can deduce that the cross-entropy will take on a value near zero if the $\hat{p}_{mk}$'s are all near 0 or near 1. Therefore, the cross-entropy will take on a small value if the mth node is pure.
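A tiny sketch (my own, not from the slides) computing the three impurity measures from the class proportions of a region may help compare them numerically; the example proportions are illustrative.

```python
# Classification error, Gini index and cross-entropy of a node, given its class proportions.
import numpy as np

def impurities(p):
    p = np.asarray(p, dtype=float)
    error = 1.0 - p.max()                            # classification error rate
    gini = np.sum(p * (1.0 - p))                     # Gini index G_m
    entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))   # cross-entropy D_m (0 log 0 treated as 0)
    return error, gini, entropy

print(impurities([0.8, 0.1, 0.1]))     # fairly pure node -> small values
print(impurities([1/3, 1/3, 1/3]))     # impure node -> larger values
```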


Comparison of classification error rate, Gini index and entropy

Cross-entropy and the Gini index are differentiable, and hence more amenable to numerical optimization. However, the classification error rate is preferable if prediction accuracy is the goal.

Example of a classification tree with the Heart data

Consider the Heart data set:
  These data contain a binary variable HD for 303 patients who presented with chest pain:
    HD = Yes: presence of heart disease; HD = No: no heart disease.
  13 predictors including Age, Sex, Chol (a cholesterol measurement), and other heart and lung function measurements.
  Cross-validation yields a tree with 6 terminal nodes (see next slide).
  Decision trees can be constructed with qualitative predictors as well, which is the case for this example. Consider the top node Thal (3 categories: normal, fixed and reversible defects).


Classification tree, Heart dataset

Some remarks:
  It is possible to include qualitative predictors.
  Some splits yield two terminal nodes that have the same predicted value.


Trees vs. Linear Models




Bagging or bootstrap aggregation

We obtain distinct data sets by repeatedly sampling observations from the original data set with replacement.


What is Bagging? - Introduction

Decision trees can suffer from high variance (before pruning).

Bootstrap aggregation, or bagging, is a general procedure for reducing the variance, frequently used in the context of decision trees.

Reminder: given a set of n independent observations Z1, ..., Zn, each with variance σ², the variance of the empirical mean Z̄ of the observations is given by $\sigma^2/n$.
So, averaging a set of observations reduces variance!
However, this is not practical because we generally do not have access to multiple training sets.


Bagging Illustration

Example: relationship between ozone and temperature.
B = 100 models were fitted on bootstrap samples. Gray: predictions from 10 of the fitted models; red: average of the 100 fitted models.

Clearly the average is more stable and there is less overfitting!

Source: https://en.wikipedia.org/wiki/Bootstrap_aggregating
How does Bagging proceed?

Bagging proceeds as follows:

Step 1: Generate B different bootstrapped training data sets by taking samples from the original dataset.
Step 2: Fit the method on the bth bootstrapped training set in order to get the prediction $\hat{f}^{*b}(x)$ for a given observation x.
  Remark: at this point each individual tree has high variance!
Step 3: Then,
  for regression, average the B predictions;
  for classification, take a majority vote: the most commonly occurring class among the B predictions.

Averaging these B trees reduces the variance. A small sketch of the three steps is given below.
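The sketch below (my own illustration, not the lecture's code) implements the three bagging steps for regression with fully grown scikit-learn trees as base learners; X, y and X_test are assumed to be NumPy arrays supplied by the caller.

```python
# Manual bagging for regression: bootstrap, fit, average.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_predict(X, y, X_test, B=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    preds = np.zeros((B, len(X_test)))
    for b in range(B):
        idx = rng.integers(0, n, size=n)                    # Step 1: bootstrap sample (with replacement)
        tree = DecisionTreeRegressor().fit(X[idx], y[idx])  # Step 2: fit f^*b on the bth sample
        preds[b] = tree.predict(X_test)
    return preds.mean(axis=0)                               # Step 3: average the B predictions
```

scikit-learn's BaggingRegressor and BaggingClassifier implement the same idea directly.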


Out-of-Bag (OOB) Error Estimation

A straightforward way to estimate the test error in bagging:

  Trees are fit to bootstrapped subsets of the observations. So, on average, each bagged tree makes use of around 2/3 of the observations (we will see the proof in the tutorial course).
  The remaining 1/3 of the observations are referred to as the out-of-bag (OOB) observations.
  It is possible to predict the response for the ith observation using each of the trees in which that observation was OOB. This will yield around B/3 predictions for each observation.
  Then, we can estimate the overall OOB MSE over the n observations. This will be the estimated test error!
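As a minimal illustration (mine, on a standard toy regression dataset rather than the lecture's data), scikit-learn's bagging ensemble exposes an OOB estimate via oob_score=True:

```python
# OOB estimation with a bagged ensemble of regression trees (the default base learner).
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor

X, y = load_diabetes(return_X_y=True)
bag = BaggingRegressor(n_estimators=200, oob_score=True, random_state=0)
bag.fit(X, y)
print("OOB R^2 estimate:", bag.oob_score_)   # computed from out-of-bag predictions, no hold-out set needed
```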




Introduction to Random Forests (RF)

Random forests provide an improvement over bagged trees that decorrelates the trees by making a slight modification. This reduces the variance when averaging the estimates.
As in bagging, a number of decision trees are built on bootstrapped training samples.
Modification: when building these decision trees, each time a split in a tree is considered, a random selection of m predictors is chosen as split candidates from the full set of p predictors. The split is allowed to use only one of those m predictors.
Typical values for m are p/2 or √p (for instance, if p = 100, choose among only 10 predictors).
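Here is a minimal scikit-learn sketch (my own, on a standard toy dataset) where max_features plays the role of m, the number of predictors considered at each split:

```python
# Random forest with m = sqrt(p) candidate predictors per split and an OOB error estimate.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)               # p = 30 predictors in this example
rf = RandomForestClassifier(n_estimators=500,
                            max_features="sqrt",          # m = sqrt(p)
                            oob_score=True, random_state=0)
rf.fit(X, y)
print("OOB accuracy estimate:", rf.oob_score_)
```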


Why do Random Forests reduce variance?

At each split, the algorithm is only allowed to consider a minority of the predictors.
Idea: suppose that there is one very strong predictor in the data. Then most of the bagged trees will use this predictor in the top split. Consequently, all of the bagged trees will look quite similar to each other, and hence the predictions from the bagged trees will be highly correlated.

Unfortunately, averaging highly correlated quantities does not lead to as substantial a reduction in variance as averaging uncorrelated quantities.

Reminder: $V(\bar{Z}) = V\!\left(\frac{1}{n}\sum_i Z_i\right) = \frac{\sigma_Z^2}{n} + \frac{1}{n^2}\sum_{i \neq j} \mathrm{cov}(Z_i, Z_j)$

The term $\mathrm{cov}(Z_i, Z_j)$ vanishes only if the $Z_i$'s are uncorrelated (independent).


The choice of m in Random Forests

Gene expression data: performance of random forests for different values of m.
Goal: predict cancer type based on 500 genes with high variance.
If m = p, this amounts simply to bagging.


Introduction to Boosting

Like bagging, boosting is an ensemble learning method.

Boosting combines a set of weak learners into a strong learner. A weak learner refers to a learning algorithm that predicts only slightly better than random guessing.
There are different types of boosting algorithms:
  AdaBoost (Adaptive Boosting)
  Gradient Boosting
  XGBoost (Extreme Gradient Boosting)


What is the idea behind boosting?

Intuition: AdaBoost (Adaptive Boosting)

The final prediction is the weighted majority vote of all weak learners.

Sources: https://vitalflux.com/adaboost-algorithm-explained-with-python-example/,
https://towardsdatascience.com/the-ultimate-guide-to-adaboost-random-forests-and-xgboost-7f9327061c4f


Gradient boosting and XGBoost

Gradient boosting trains learners by minimizing a loss function (i.e., training on the residuals of the model).
Unlike fitting a single large decision tree to the data, in gradient boosting the learners are small decision trees grown slowly.
At each step, a decision tree is fit to the residuals or errors of the current model. Then, the residuals are updated. This means the model slowly improves in areas where the classifier does not perform well.

Tuning parameters:
  The number of trees B.
  The depth d, or number of terminal nodes, of each tree.
  A shrinkage parameter λ > 0, which controls the rate at which boosting learns and scales the contribution of each weak learner. Typical values are 0.01 or 0.001.
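The sketch below (my own, on a standard toy dataset) maps the tuning parameters above onto scikit-learn's gradient boosting: n_estimators for B, max_depth for d, learning_rate for λ.

```python
# Gradient boosting with many small, slowly-learned trees.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gbm = GradientBoostingClassifier(n_estimators=1000,   # B: number of trees
                                 max_depth=2,         # d: shallow trees (weak learners)
                                 learning_rate=0.01,  # lambda: shrinkage
                                 random_state=0)
gbm.fit(X_tr, y_tr)
print("test accuracy:", gbm.score(X_te, y_te))
```

AdaBoostClassifier in scikit-learn and XGBClassifier in the xgboost package follow the same fit/predict pattern.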


Comparison example of classification trees

The Spam data contains a 2-class target variable and 50 predictors.


Summary on Decision trees

Decision trees are simple and interpretable models for regression and classification.
However, they are often not competitive with other methods in terms of prediction accuracy.
Bagging, random forests and boosting are good methods for improving the prediction accuracy of trees at the expense of interpretability. They work by growing many trees on the training data and then combining the predictions of the resulting ensemble of trees.
The latter two methods - random forests and boosting - are among the state-of-the-art methods for supervised learning. However, their results can be difficult to interpret.




Introduction

A little history:
Support Vector Machines, usually simply called SVMs, were developed in the 1990s by Vladimir Vapnik.
Since then, SVMs have been shown to perform well in a variety of settings, and are often considered one of the best out-of-the-box classifiers.

SVM principle for the two-class classification problem:

SVM principle
Find a hyperplane that separates the classes in feature space.


What is a Hyperplane?

A hyperplane in p dimensions is a flat affine subspace of dimension p − 1.
In general the equation of a hyperplane has the form:
  β0 + β1 X1 + β2 X2 + ... + βp Xp = 0
In p = 2 dimensions a hyperplane is a line with equation:
  β0 + β1 X1 + β2 X2 = 0
If β0 = 0, the hyperplane goes through the origin.

The vector β = (β1, β2, ..., βp) is called the normal vector; it is orthogonal to the surface of the hyperplane.


Hyperplane in 2 Dimensions


A separating hyperplane

For any point X = (X1, X2, ..., Xp) ∈ R^p in p-dimensional space, there are 3 possibilities:
1. X lies on the hyperplane; then it satisfies
     β0 + β1 X1 + β2 X2 + ... + βp Xp = 0.
2. X does not satisfy this equation and rather
     β0 + β1 X1 + β2 X2 + ... + βp Xp > 0;
   then X lies on one side of the hyperplane.
3. On the other hand, if
     β0 + β1 X1 + β2 X2 + ... + βp Xp < 0,
   then X lies on the other side of the hyperplane.

So, a hyperplane divides a p-dimensional space into two halves.


An example of a separating hyperplane in R^2

The hyperplane 1 + 2X1 + 3X2 = 0 is shown.
Blue region: set of points for which 1 + 2X1 + 3X2 > 0.
Red region: set of points for which 1 + 2X1 + 3X2 < 0.
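A tiny sketch (my own, not from the slides) makes the side-of-hyperplane test explicit for the example hyperplane 1 + 2X1 + 3X2 = 0 above, by evaluating the sign of β0 + β·x:

```python
# Which side of the hyperplane 1 + 2*X1 + 3*X2 = 0 does a point lie on?
import numpy as np

beta0, beta = 1.0, np.array([2.0, 3.0])

def side(x):
    value = beta0 + beta @ np.asarray(x, dtype=float)
    if value == 0:
        return "on the hyperplane"
    return "positive side" if value > 0 else "negative side"

print(side([1, 1]))     # 1 + 2 + 3 = 6 > 0  -> positive (blue) side
print(side([-2, 1]))    # 1 - 4 + 3 = 0      -> on the hyperplane
print(side([-2, -1]))   # 1 - 4 - 3 = -6 < 0 -> negative (red) side
```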


Classification Using a Separating Hyperplane

Now suppose an n × p data matrix that consists of n training observations in p-dimensional space:
  1st observation: x1 = (x11, ..., x1p)
  2nd observation: x2 = (x21, ..., x2p)
  ...
  nth observation: xn = (xn1, ..., xnp)
These observations fall into two classes, y1, ..., yn ∈ {−1, 1}.
Consider a test observation x* = (x1*, ..., xp*).

Goal: develop a classifier based on the training data that correctly classifies the test observation.
Idea: build a separating hyperplane.


How to classify using a separating hyperplane?

Suppose there exists a hyperplane that perfectly separates the two classes in the training observations.
By coding yi = +1 for the blue class and yi = −1 for the red class, a separating hyperplane has the property:
  yi (β0 + β1 xi1 + β2 xi2 + ... + βp xip) > 0   for all i = 1, ..., n.
Given a test observation x*, classify it based on the sign of
  f(x*) = β0 + β1 x1* + β2 x2* + ... + βp xp*:
if f(x*) > 0 then blue; if f(x*) < 0 then red.

f(x*) can also be interpreted as a magnitude of confidence:
  If f(x*) is far from zero, we are confident about its class assignment.
  If f(x*) is close to zero, we are less confident about its class assignment.




Which separating hyperplane to choose?

If a perfect separating hyperplane exists, then there exist an infinite number of such hyperplanes.

Which one to choose?
Support Vector Machines (SVM) Maximal Margin Classifier

The maximal margin hyperplane

A natural choice is the maximal margin hyperplane, also known as the


optimal separating hyperplane, which is the separating hyperplane that
is farthest from the training observations.
Compute the distance from each training observation to a given separating
hyperplane. The smallest such distance is the minimal distance among all
the observations to the hyperplane, and is known as the margin.
The maximal margin hyperplane is the separating hyperplane for which
the margin is largest.
Then, classify a test observation based on which side of the maximal
margin hyperplane it lies. This is known as the maximal margin
classifier.

P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 65 / 88


Support Vector Machines (SVM) Maximal Margin Classifier

Maximal Margin Classifier


Example of Maximal margin hyperplane

The maximal margin hyperplane represents the midline of the widest slab
that can be inserted between the two classes.
The dashed lines indicate the width of the margin.
The three training observations equidistant from the maximal margin
hyperplane that lie along the dashed lines are known as support vectors.

The maximal margin hyperplane depends only on the support vectors!


P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 66 / 88
Support Vector Machines (SVM) Maximal Margin Classifier

Construction of the Maximal Margin Classifier


Given n training observations x1, . . . , xn ∈ Rp with associated class labels
y1, . . . , yn ∈ {−1, 1}, the maximal margin hyperplane is the solution to
the optimization problem:

    maximize_{β0, β1, . . . , βp, M}  M

    subject to:   Σ_{j=1}^{p} βj² = 1   and

    yi(β0 + β1 xi1 + . . . + βp xip) ≥ M   ∀ i = 1, . . . , n.

The second constraint guarantees that each observation will be on the
correct side of the hyperplane (provided that M is positive).
M represents the margin of the hyperplane.
If the first constraint holds, the distance from the ith observation to
the hyperplane is yi(β0 + β1 xi1 + . . . + βp xip).
P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 67 / 88
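
A minimal sketch of this construction with scikit-learn (not part of the lecture, and only one possible implementation): scikit-learn has no explicit hard-margin option, so a very large penalty C on linearly separable toy data approximates the maximal margin classifier.

import numpy as np
from sklearn.svm import SVC

# Two well-separated point clouds (toy data for illustration only)
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2, 2], scale=0.3, size=(20, 2))
X_neg = rng.normal(loc=[-2, -2], scale=0.3, size=(20, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 20 + [-1] * 20)

# A huge C means (almost) no margin violations are tolerated
hard = SVC(kernel="linear", C=1e10).fit(X, y)
print(hard.support_vectors_)  # only these points determine the hyperplane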
Support Vector Machines (SVM) Maximal Margin Classifier

Situations when the Maximal Margin classifier fails (1/2)


The Non-separable Case
In many real-life situations the two classes are not separable.

The maximal margin hyperplane does not exist.
In this case, the optimization problem has no solution with M > 0.

P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 68 / 88


Support Vector Machines (SVM) Maximal Margin Classifier

Situations when the Maximal Margin classifier fails: (2/2)


Noisy data and sensitivity to individual observations

The addition of a single observation leads to a dramatic change in the


maximal margin hyperplane.
This extreme sensitivity suggests overfitting. The margin is smaller!
P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 69 / 88
Support Vector Machines (SVM) Support Vector Classifier

Outline
1 Decision trees
Introduction to Decision Trees
Regression Trees
Classification trees
Bagging or bootstrap aggregation
Random Forests
Boosting
Comparison and summary of decision trees
2 Support Vector Machines (SVM)
Introduction
Maximal Margin Classifier
Support Vector Classifier
Support Vector Machines
3 References

P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 70 / 88


Support Vector Machines (SVM) Support Vector Classifier

Support Vector Classifier - Introduction

Solution: Consider a hyperplane that does not perfectly separate the two
classes, in the interest of
Greater robustness to individual observations, and
Better classification of most of the training observations.
Such a classifier is called support vector classifier or soft margin
classifier. This classifier allows some observations:
to be not only on the wrong side of the margin but also
to be on the wrong side of the hyperplane.

Observations on the wrong side of the hyperplane correspond to


misclassified training observations.

P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 71 / 88


Support Vector Machines (SVM) Support Vector Classifier

Support vector classifier relaxation

Left: Red class: observation 1 is on the wrong side of the margin. Blue class: observation 8 is on
the wrong side of the margin.
Right: Same as left panel with two additional points, 11 and 12. Both are on the wrong
side of the hyperplane and the wrong side of the margin.
P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 72 / 88
Support Vector Machines (SVM) Support Vector Classifier

Construction of the Support Vector Classifier

The Support Vector Classifier is the solution to the problem:

    maximize_{β0, β1, . . . , βp, ε1, . . . , εn, M}  M

    subject to:   Σ_{j=1}^{p} βj² = 1,

    yi(β0 + β1 xi1 + . . . + βp xip) ≥ M(1 − εi),

    εi ≥ 0,   Σ_{i=1}^{n} εi ≤ C,   ∀ i = 1, . . . , n.

C is a nonnegative tuning parameter, M is the width of the margin, and
ε1, . . . , εn are slack variables that allow individual observations to be
on the wrong side of the margin or of the hyperplane.

P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 73 / 88
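
As a hedged illustration in Python/scikit-learn (the slides prescribe no particular software): note that scikit-learn's C is a penalty on margin violations, so it plays roughly the inverse role of the budget C used in the formulation above.

import numpy as np
from sklearn.svm import SVC

# Two overlapping classes, so no separating hyperplane exists
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([1, 1], 1.0, (30, 2)),
               rng.normal([-1, -1], 1.0, (30, 2))])
y = np.array([1] * 30 + [-1] * 30)

# Soft-margin (support vector) classifier
soft = SVC(kernel="linear", C=1.0).fit(X, y)
print("support vectors per class:", soft.n_support_)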


Support Vector Machines (SVM) Support Vector Classifier

Parameters’ interpretation

The slack variable εi tells us where the ith observation is located:

If εi = 0, then observation i is on the correct side of the margin.
If 0 < εi < 1, then observation i is on the wrong side of the margin.
If εi > 1, then observation i is on the wrong side of the hyperplane.

The tuning parameter C bounds Σ_{i=1}^{n} εi ; it represents a budget for
the amount that the margin can be violated by the n observations.
If C = 0, then ε1 = . . . = εn = 0, and we recover the maximal margin
classifier problem.
If C > 0, no more than C observations can be on the wrong side of the
hyperplane.
As C increases, the tolerance increases and the margin widens.
Conversely, as C decreases, the margin narrows.

P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 74 / 88
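
A small sketch of reading the slack values off a fitted model, assuming the soft classifier and the X, y arrays from the previous snippet. scikit-learn fixes the margin at f(x) = ±1 instead of normalizing the β's, so ξi = max(0, 1 − yi f(xi)) plays the role of εi here.

import numpy as np

f = soft.decision_function(X)   # f(x_i) for every training observation
xi = np.maximum(0, 1 - y * f)   # slack values in scikit-learn's scaling

print("correct side of the margin (xi == 0):", np.sum(xi == 0))
print("violate the margin only (0 < xi <= 1):", np.sum((xi > 0) & (xi <= 1)))
print("wrong side of the hyperplane (xi > 1):", np.sum(xi > 1))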


Support Vector Machines (SVM) Support Vector Classifier

The regularization parameter C

C is generally chosen via cross-validation.
C controls the bias-variance trade-off.
If C is small, then the margin is narrow and rarely violated ⇒ low bias
but high variance.
If C is large, then the margin is wide and more violations are allowed ⇒
the classifier is more biased but may have lower variance.

Only observations that either lie directly on the margin or that violate the
margin will affect the hyperplane. These are the support vectors.
P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 75 / 88
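
A sketch of choosing the tuning parameter by cross-validation with scikit-learn's GridSearchCV, again assuming the X, y arrays from the earlier snippet and keeping in mind the inverse convention for C.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# 5-fold cross-validation over a grid of C values
grid = GridSearchCV(SVC(kernel="linear"),
                    param_grid={"C": [0.001, 0.01, 0.1, 1, 10, 100]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)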
Support Vector Machines (SVM) Support Vector Machines

Outline
1 Decision trees
Introduction to Decision Trees
Regression Trees
Classification trees
Bagging or bootstrap aggregation
Random Forests
Boosting
Comparison and summary of decision trees
2 Support Vector Machines (SVM)
Introduction
Maximal Margin Classifier
Support Vector Classifier
Support Vector Machines
3 References

P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 76 / 88


Support Vector Machines (SVM) Support Vector Machines

Classification with Non-linear Decision Boundaries

If the decision boundary is not linear, the Support Vector Classifier


performs poorly!
P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 77 / 88
Support Vector Machines (SVM) Support Vector Machines

Feature expansion
In a higher-dimensional space the data can become linearly separable.

Source: https://towardsdatascience.com/the-kernel-trick-c98cdbcaeb3f

P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 78 / 88


Support Vector Machines (SVM) Support Vector Machines

Feature Expansion

In order to address non-linearity, consider enlarging the feature space


by including transformations of the predictors, e.g. X1², X1³, X1X2,
X1X2² (quadratic, cubic or even higher-order polynomial terms).
This means going from a p-dimensional space to a P > p
dimensional space.
Then, we can fit a support vector classifier in the enlarged space and
get non-linear decision boundaries in the original space.
Example: Suppose we consider (X1, X2, X1², X2², X1X2) instead of just
(X1, X2). Then the decision boundary would be of the form:
β0 + β1 X1 + β2 X2 + β3 X1² + β4 X2² + β5 X1X2 = 0
As we enlarge the feature space, computations become unmanageable!
The support vector machine allows us to enlarge the feature space while
keeping computations efficient.
P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 79 / 88
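
A sketch of explicit feature expansion (illustrative only; the dataset and degree are arbitrary choices): degree-2 polynomial features followed by a linear support vector classifier give a boundary that is linear in the enlarged space but quadratic in the original (X1, X2) space.

from sklearn.datasets import make_circles
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

# Concentric circles: not linearly separable in (X1, X2)
Xc, yc = make_circles(n_samples=200, factor=0.4, noise=0.1, random_state=0)

poly_clf = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                         StandardScaler(),
                         LinearSVC(C=1.0))
poly_clf.fit(Xc, yc)
print("training accuracy:", poly_clf.score(Xc, yc))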
Support Vector Machines (SVM) Support Vector Machines

Inner products and Support Vectors


It can be shown that the linear support vector classifier can be written as:
n
X
f (x) = β0 + αi hx, xi i
i=1
p
X
where ha, bi = aj bj is the inner product between vectors a and b ∈ Rp .
j=1
There are n parameter αi , one per training obsevation.

P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 80 / 88


Support Vector Machines (SVM) Support Vector Machines

Inner products and Support Vectors


It can be shown that the linear support vector classifier can be written as:

    f(x) = β0 + Σ_{i=1}^{n} αi ⟨x, xi⟩

where ⟨a, b⟩ = Σ_{j=1}^{p} aj bj is the inner product between two vectors
a and b ∈ Rp. There are n parameters αi, one per training observation.

To estimate the parameters α1, . . . , αn and β0, we need the n(n − 1)/2
inner products ⟨xi, xi′⟩ between all pairs of training observations.

However, it turns out that αi is nonzero only for the support vectors:

    f(x) = β0 + Σ_{i∈S} αi ⟨x, xi⟩

where S is the collection of indices of these support points.


P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 80 / 88
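
This representation can be checked numerically. The sketch below assumes the soft linear classifier and X from the earlier snippets; in scikit-learn the αi (with the sign of yi absorbed) are stored in dual_coef_, and only support vectors appear in it.

import numpy as np

alphas = soft.dual_coef_.ravel()   # one coefficient per support vector
svs = soft.support_vectors_
beta0 = soft.intercept_[0]

x_new = X[0]
# f(x) = beta_0 + sum_{i in S} alpha_i <x, x_i>
f_manual = beta0 + np.sum(alphas * (svs @ x_new))
print(np.allclose(f_manual, soft.decision_function([x_new])[0]))  # True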
Support Vector Machines (SVM) Support Vector Machines

Kernels and Support Vector Machines

Now, consider replacing the inner product in the support vector classifier
with a generalization of the form K(xi, xi′),
where K is referred to as a kernel. A kernel is a function that quantifies
the similarity of two observations. For instance:
Linear kernel: K(xi, xi′) = Σ_{j=1}^{p} xij xi′j , which is the kernel of the
support vector classifier.
Polynomial kernel: K(xi, xi′) = (1 + Σ_{j=1}^{p} xij xi′j)^d , with integer
degree d > 0.
Radial kernel: K(xi, xi′) = exp(−γ Σ_{j=1}^{p} (xij − xi′j)²) , where γ is a
positive constant.
The support vector machine (SVM) is an extension of the support
vector classifier that results from enlarging the feature space using kernels.

P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 81 / 88
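
A sketch of both kernels with scikit-learn on the same non-linear toy data (the kernel settings are illustrative and would normally be tuned by cross-validation). scikit-learn's polynomial kernel is (γ⟨x, x′⟩ + coef0)^d, so coef0 = 1 mirrors the form above up to the γ scaling.

from sklearn.datasets import make_circles
from sklearn.svm import SVC

Xc, yc = make_circles(n_samples=200, factor=0.4, noise=0.1, random_state=0)

svm_poly = SVC(kernel="poly", degree=3, coef0=1, C=1.0).fit(Xc, yc)
svm_rbf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(Xc, yc)

print("polynomial kernel accuracy:", svm_poly.score(Xc, yc))
print("radial kernel accuracy:", svm_rbf.score(Xc, yc))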


Support Vector Machines (SVM) Support Vector Machines

Example of SVM with polynomial and radial kernels

Left: SVM with a polynomial kernel; Right: SVM with a radial kernel.

P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 82 / 88


Support Vector Machines (SVM) Support Vector Machines

What is the advantage of using a kernel?

The most important advantage is computational.


Indeed, using kernels, one only needs to compute K(xi, xi′) for the
n(n − 1)/2 distinct pairs i, i′. This makes it possible to operate in the
original feature space without ever computing the coordinates of the data
in the higher dimensional space. This is known as the kernel trick.
In many applications, the enlarged feature space is so large that explicit
computations would be intractable.
For some kernels, such as the radial kernel, the implicit feature space is
infinite-dimensional (via the Taylor series of the exponential function), so
we could never do the computations there anyway!

P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 83 / 88
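
The computational point can be made concrete with a short, illustrative sketch using scikit-learn's pairwise helpers: for the radial kernel we only ever form the n × n matrix of kernel values, never the infinite-dimensional feature coordinates themselves.

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(2)
Xr = rng.normal(size=(100, 5))   # n = 100 observations, p = 5 features

K = rbf_kernel(Xr, gamma=0.5)    # K[i, j] = exp(-gamma * ||x_i - x_j||^2)
print(K.shape)                   # (100, 100): all the SVM solver needs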


Support Vector Machines (SVM) Support Vector Machines

Application to the Heart Disease Data, test data


13 predictors, such as Age and Sex, are used to predict whether an individual
has heart disease; 297 subjects, randomly split into 207 training and 90
test observations. ROC curves*:

* Probability scores are calculated using Platt scaling.


P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 84 / 88
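
A sketch of the same kind of pipeline in scikit-learn; the Heart data is not bundled with the library, so a synthetic stand-in with the same dimensions is used here, and probability=True turns on the Platt scaling mentioned in the footnote.

from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in: 297 subjects, 13 predictors (the real Heart data
# would be loaded from file instead)
Xh, yh = make_classification(n_samples=297, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(Xh, yh, train_size=207,
                                          random_state=0)

svm = SVC(kernel="rbf", probability=True).fit(X_tr, y_tr)
scores = svm.predict_proba(X_te)[:, 1]     # Platt-scaled probabilities
fpr, tpr, _ = roc_curve(y_te, scores)
print("test AUC:", roc_auc_score(y_te, scores))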
Support Vector Machines (SVM) Support Vector Machines

SVM with More than Two Classes

The SVM as defined works for K = 2 classes. What do we do if we have


K > 2 classes? Two approaches:
1 OVA (One versus All): Fit K different 2-class SVM classifiers
fˆk(x), k = 1, . . . , K; each class versus the rest. Classify x∗ to the
class for which fˆk(x∗) is largest.
2 OVO (One versus One): Fit all K(K − 1)/2 pairwise classifiers fˆkl(x).
Classify x∗ to the class that wins the most pairwise competitions.

Which to choose? If K is not too large, use OVO.

P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 85 / 88
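
Both strategies are available as wrappers in scikit-learn (SVC itself already applies one-versus-one internally when K > 2); the snippet below is only an illustration on the iris data, which has K = 3 classes.

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)   # K = 3 classes

ova = OneVsRestClassifier(SVC(kernel="linear")).fit(X_iris, y_iris)  # K classifiers
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X_iris, y_iris)   # K(K-1)/2 classifiers

print("OVA training accuracy:", ova.score(X_iris, y_iris))
print("OVO training accuracy:", ovo.score(X_iris, y_iris))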


Support Vector Machines (SVM) Support Vector Machines

Which to use: SVM or Logistic Regression (LR) or LDA?

SVMs became very popular after the introduction of kernels.


When classes are (nearly) separable, SVM does better than LR. So
does LDA.
If the goal is to estimate probabilities, LR is the choice.
For nonlinear boundaries, kernel SVMs are popular. It is possible to
use kernels with LR and LDA as well, but computations are more
expensive.

P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 86 / 88


Support Vector Machines (SVM) Support Vector Machines

Summary

The support vector machine is a generalization of a simple and


intuitive classifier called the maximal margin classifier.
The support vector classifier, an extension of the maximal margin
classifier that can be applied in a broader range of cases.
The support vector machine, which is a further extension of the
support vector classifier in order to accommodate non-linear class
boundaries.
People often loosely refer to the maximal margin classifier, the support
vector classifier, and the support vector machine as ”support vector
machines”.

P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 87 / 88


References

References

James, Gareth; Witten, Daniela; Hastie, Trevor and Tibshirani,
Robert. "An Introduction to Statistical Learning with Applications in
R", 2nd edition, New York: "Springer texts in statistics", 2021.
Website: https://hastie.su.domains/ISLR2/ISLRv2_website.pdf
Hastie, Trevor; Tibshirani, Robert and Friedman, Jerome (2009).
"The Elements of Statistical Learning (Data Mining, Inference, and
Prediction)", 2nd edition. New York: "Springer texts in statistics".
Website: http://statweb.stanford.edu/~tibs/ElemStatLearn/

P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 88 / 88
