
IG.3510 - Machine Learning
Lectures 3: Classification (Part II)

Dr. Patricia CONDE-CESPEDES

[email protected]

September 30th, 2024



Plan

1 Decision trees

2 Support Vector Machines (SVM)

3 References




Outline
1 Decision trees
Introduction to Decision Trees
Regression Trees
Classification trees
Bagging or bootstrap aggregation
Random Forests
Boosting
Comparison and summary of decision trees
2 Support Vector Machines (SVM)
Introduction
Maximal Margin Classifier
Support Vector Classifier
Support Vector Machines
3 References



Introduction to Decision Trees

Decision trees can be used for classification or for regression.
Decision tree approaches involve stratifying or segmenting the predictor space into a number of simple regions.
The splitting rules used can be represented using a tree diagram; that is where the name decision tree comes from.


Advantages and disadvantages

Advantages:
+ Decision trees are easy to interpret.

Disadvantages:
- On their own, decision trees are usually not competitive with other supervised learning approaches in terms of prediction accuracy.




Introductory example with the Hitters dataset (1/3)

Example: predict a baseball player's salary based on:
  Years (the number of years that he has played in the major leagues)
  Hits (the number of hits the player made in the previous year)

Raw data: Salary is color-coded from low (blue) and medium (green) to high (yellow, red).

How do we stratify the predictor space?


Regression tree for the Hitters data (2/3)

Overall, the tree stratifies the predictor space into three regions:
R1 = {X | Years < 4.5},
R2 = {X | Years ≥ 4.5, Hits < 117.5}, and
R3 = {X | Years ≥ 4.5, Hits ≥ 117.5}.

How to interpret the regression tree for the Hitters example (3/3)

At a given internal node, the condition Xj < t indicates the rule used to split the predictor space:
  If the condition is True, consider the left-hand branch;
  else, consider the right-hand branch (which corresponds to Xj ≥ t).


Terminology for Trees

Characteristics of trees:
  Nodes at the bottom with no branches are called terminal nodes or leaves.
  Each terminal node represents a region Rj.
  The nodes in the tree where the predictor space is split are referred to as internal nodes.
  For our example, the tree has two internal nodes and three terminal nodes, or leaves.
  The segments of the tree outgoing from an internal node are called branches.

Predictions: the number in each leaf is the mean of the response variable for the observations that fall in the corresponding region.


Illustrative exercise

Given the following training observations and the following rules:

Training observations:
  X1  X2   Y
   1   2   3
   2   1   2
   2   2   4
   2   4   8
   3   1   3
   3   5   9
   4   4  11
   5   1   5
   6   2   7
   6   5  12

Rules:
  if (X2 > 3) then R1
  else:
    if (X1 < 4) then R2
    else R3

Questions:
1) Build the regression tree.
2) Make predictions for the following test observations:
   X1 = 1, X2 = 4
   X1 = 7, X2 = 2
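As a companion to this exercise, here is a minimal Python sketch (not part of the original slides) that applies the given rules to the training observations, computes the mean response per region, and predicts for the two test points. The function and variable names are illustrative choices.

```python
# A small sketch of the exercise: assign each observation to R1, R2 or R3 with
# the given rules, use region means as predictions, then predict the test points.
import numpy as np

X = np.array([[1, 2], [2, 1], [2, 2], [2, 4], [3, 1],
              [3, 5], [4, 4], [5, 1], [6, 2], [6, 5]], dtype=float)
y = np.array([3, 2, 4, 8, 3, 9, 11, 5, 7, 12], dtype=float)

def region(x1, x2):
    """Assign an observation to a region using the exercise's splitting rules."""
    if x2 > 3:
        return "R1"
    return "R2" if x1 < 4 else "R3"

labels = np.array([region(x1, x2) for x1, x2 in X])
means = {r: y[labels == r].mean() for r in ("R1", "R2", "R3")}
print(means)  # region means = the regression tree's leaf predictions

for x1, x2 in [(1, 4), (7, 2)]:
    r = region(x1, x2)
    print((x1, x2), "->", r, "predicted y =", means[r])
```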


Interpretation of Results for the Hitters data

Years of experience is the most important factor to determine Salary.
For a more experienced player (more than 5 years), the number of hits made in the previous year is important to determine the salary.


The process of building a regression tree

There are roughly two steps:

Step 1: Stratification: divide the predictor space - the set of possible values for X1, X2, ..., Xp - into J disjoint regions R1, R2, ..., RJ.
Step 2: Prediction: given an observation (x1, x2, ..., xp) that falls into region Rj, its predicted value ŷ is the mean of the response variable among all the training observations falling in Rj.

We will now focus on Step 1. In theory, the regions could have any shape. However, to simplify, we divide the predictor space into high-dimensional rectangles, or boxes.


How to stratify the feature space?

The goal is to find regions R1, R2, ..., RJ that minimize the RSS (Residual Sum of Squares):

$$\sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2$$

where $\hat{y}_{R_j}$ is the mean of the target Y for the training observations belonging to the jth region.

Unfortunately, it is computationally infeasible to consider every possible partition of the feature space into J boxes!


Recursive Binary Splitting for stratification

Recursive Binary Splitting is a top-down, greedy approach.

Top-down: it begins at the top of the tree (when all observations belong to a single region) and then successively splits the predictor space; each split is indicated via two new branches further down the tree.
Greedy: at each step, the best split is made at that particular step, rather than looking ahead and picking a split that would lead to a better tree in some future step.


Recursive Binary splitting process (1/2)

Step 1: Select the predictor j and the cutpoint s such that splitting the predictor space into the two regions {X | Xj < s} and {X | Xj ≥ s} leads to the greatest decrease in RSS. In other words:
For any pair (j, s) we define the pair of half-planes
  R1(j, s) = {X | Xj < s} and R2(j, s) = {X | Xj ≥ s}
and seek the values of j and s that minimize

$$\sum_{i:\, x_i \in R_1(j,s)} (y_i - \hat{y}_{R_1})^2 + \sum_{i:\, x_i \in R_2(j,s)} (y_i - \hat{y}_{R_2})^2$$

where $\hat{y}_{R_1}$ is the mean response for the training observations in R1(j, s), and $\hat{y}_{R_2}$ is the mean response for the training observations in R2(j, s).

This step can be done quite quickly, especially if p is small!
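To make Step 1 concrete, here is a minimal Python sketch (my own illustration, not the lecture's code) of a brute-force search for the best single split (j, s); it assumes a small NumPy feature matrix X and response vector y.

```python
# Brute-force search for the split (j, s) minimizing the two-region RSS.
import numpy as np

def best_split(X, y):
    """Return (j, s, rss) for the best single binary split of the data."""
    n, p = X.shape
    best = (None, None, np.inf)
    for j in range(p):
        for s in np.unique(X[:, j]):            # candidate cutpoints = observed values
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            if len(left) == 0 or len(right) == 0:
                continue                         # a split must create two non-empty regions
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if rss < best[2]:
                best = (j, s, rss)
    return best
```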
Recursive Binary splitting process (2/2)

The next steps consist in repeating Step 1 to recursively split the previously created regions:

Step 2: Repeat Step 1: look for the best predictor and best cutpoint in order to split the data further so as to minimize the RSS. However, instead of splitting the entire predictor space, split one of the two previously identified regions. At the end of this step, there are three regions.
Step 3: Repeat in order to split one of these three regions further, so as to minimize the RSS.
Step 4: Repeat the process until a stopping criterion is reached, for instance until no region contains more than five observations.

Predictions: once the regions R1, ..., RJ are created, make predictions by taking the mean value of the training observations in each region. A recursive sketch of this procedure is given below.
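The sketch below (again my own illustration, assuming the best_split() function from the previous sketch is in scope and that X, y are NumPy arrays) grows a tree recursively and stops splitting regions with five observations or fewer, matching the stopping rule above.

```python
# Recursive binary splitting with a minimum-region-size stopping criterion.
def grow_tree(X, y, min_size=5):
    j, s, _ = best_split(X, y) if len(y) > min_size else (None, None, None)
    if j is None:                                   # leaf: predict the region mean
        return {"leaf": True, "prediction": float(y.mean())}
    mask = X[:, j] < s
    return {"leaf": False, "feature": j, "cutpoint": float(s),
            "left": grow_tree(X[mask], y[mask], min_size),
            "right": grow_tree(X[~mask], y[~mask], min_size)}

def predict_one(tree, x):
    """Follow the splits until a leaf is reached, then return its mean."""
    while not tree["leaf"]:
        tree = tree["left"] if x[tree["feature"]] < tree["cutpoint"] else tree["right"]
    return tree["prediction"]
```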


Example of the Recursive Binary splitting process result

Left: the output of recursive binary splitting on a two-dimensional example.
Center: a tree corresponding to the partition in the left panel.
Right: a perspective plot of the prediction surface corresponding to that tree.


Counter-example for recursive binary splitting

A partition of the two-dimensional feature space that could not result from recursive binary splitting.


Tree Pruning

How many leaves should the tree have?

A tree with too many terminal nodes may overfit, leading to good performance on the training set but poor performance on the test set. For instance, consider a tree having as many terminal nodes as observations, so that each observation has its own region: the training error is zero.

SOLUTION:
A good strategy is to grow a very large tree T0 (with many leaves), and then prune it back in order to obtain a subtree.
This approach is called cost complexity pruning, also known as weakest link pruning.


Cost complexity pruning / weakest link pruning

Intuition: the goal is to select a subtree that leads to the lowest test error.
Approach: for each value of α there is a subtree T ⊂ T0 such that

$$\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T| \quad \text{is minimal,}$$

where |T| is the number of terminal nodes of the tree T, Rm is the region corresponding to the mth terminal node, and $\hat{y}_{R_m}$ is the predicted response associated with Rm.

Which value of α to select?
Select the optimal value α* using cross-validation to estimate the test error.
Then, use the full dataset to obtain the subtree that corresponds to α*.


About the parameter α

The choice of the parameter α is crucial!

The tuning parameter α controls a trade-off between the RSS (fit to the training data) and the subtree's complexity.
When α = 0, we get T0.
As α increases, there is a price to pay for having a tree with many terminal nodes, and so a smaller subtree will be preferable.
As α increases, branches get pruned from the tree in a nested and predictable fashion, so obtaining the whole sequence of subtrees as a function of α is easy!

When performing cross-validation, α plays the role of a penalty for a very big tree that barely contributes to decreasing the test error.


Reminder: K-fold Cross-validation

Idea: randomly split the data into K equal-sized groups or folds. Then, leave out part k, fit the model to the other K − 1 parts (combined), and obtain predictions for the left-out kth part.
Repeat for each fold k = 1, 2, ..., K and estimate the test error.
Finally, the estimated overall test error is the average of the K estimates.

A schematic display of 5-fold CV: a set of observations is randomly split into five non-overlapping groups; each of these fifths acts as a validation set, and the test error is estimated by averaging the five estimates.

Summary of building a regression tree

Step 1: Use recursive binary splitting to build a large tree T0 on the training data.
Step 2: Apply cost complexity pruning to T0 in order to obtain a sequence of best subtrees, as a function of α.
Step 3: Use K-fold cross-validation to choose α. For k = 1, ..., K:
  1. Repeat Steps 1 and 2 on the (K − 1)/K fraction of the training data, excluding the kth fold.
  2. Estimate the test error on the data in the left-out kth fold, as a function of α.
  Average the results, and choose the α that minimizes the average estimated test error.
Step 4: Return the subtree from Step 2 that corresponds to the chosen value of α.
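As an illustration of these four steps, here is a minimal scikit-learn sketch (my own, not the lecture's code), where ccp_alpha plays the role of α. It assumes a NumPy feature matrix X and response vector y are already loaded (e.g., from the Hitters data).

```python
# Cost-complexity pruning with cross-validated choice of alpha.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

def prune_with_cv(X, y, k=6):
    # Steps 1-2: grow a large tree and obtain the nested sequence of subtrees (alphas)
    path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
    cv_mse = []
    for alpha in path.ccp_alphas:
        tree = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha)
        # Step 3: k-fold CV estimate of the test MSE for this alpha
        mse = -cross_val_score(tree, X, y, cv=k,
                               scoring="neg_mean_squared_error").mean()
        cv_mse.append(mse)
    best_alpha = path.ccp_alphas[int(np.argmin(cv_mse))]
    # Step 4: refit on the full data with the chosen alpha
    return DecisionTreeRegressor(random_state=0, ccp_alpha=best_alpha).fit(X, y)
```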


Example with the Hitters data set (1/3)

Consider the Hitters dataset:
  First, randomly divide the data set in half, yielding 132 observations in the training set and 131 observations in the test set.
  We build a large regression tree T0 on the training data and vary α in order to create subtrees with different numbers of terminal nodes.
  Finally, perform six-fold cross-validation in order to estimate the cross-validated MSE of the trees as a function of α.

Notice there is a ONE-to-ONE correspondence between α and the number of leaves |T|.


Large regression tree T0, Hitters example (2/3)

Unpruned tree resulting from recursive binary splitting on the Hitters data with 9 predictors.


Cross-validation, Hitters example (3/3)

CV error as a function of the number of leaves.
Orange: test error; black: training error curve; green: CV error. Also shown are standard error bars around the estimated errors.
The selected tree has three leaves and was shown previously.


Classification trees

Let us suppose the response variable Y has 3 categories.


How to build a Classification tree?

The procedure is very similar to that for regression trees, except that the predicted variable Y is qualitative.

In classification, given a new observation, we predict that it belongs to the most commonly occurring class in the region to which it belongs.

As in regression, start by building a large classification tree using recursive binary splitting. However, in classification we cannot use the RSS as a criterion for making binary splits.
Instead we minimize other criteria which measure the purity of a node: the Gini index and the cross-entropy.
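For concreteness, here is a minimal scikit-learn sketch (my own illustration on a standard 3-class toy dataset, not the lecture's code); the criterion argument selects between the purity measures introduced below.

```python
# Fitting a classification tree; criterion="gini" or "entropy".
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)                     # 3-class example data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
# predictions are the majority class of the leaf an observation falls into
```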


The classification error rate

The classification error rate of a region m is simply the fraction of the training observations in that region that do not belong to the most common class:

$$\mathrm{Error}_{m,\mathrm{Train}} = 1 - \max_k(\hat{p}_{mk}).$$

Here $\hat{p}_{mk}$ represents the proportion of training observations in the mth region that belong to class k.

The classification error rate is preferable if prediction accuracy of the final pruned tree is the goal.


Gini index G

For a given region m the Gini index is defined by:

$$G_m = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk})$$

where $\hat{p}_{mk}$ is the proportion of training observations in region m that belong to class k.

Intuition: the Gini index takes on a small value if all of the $\hat{p}_{mk}$'s are either close to 0 or 1. For this reason the Gini index is referred to as a measure of node purity: a small value indicates that a node contains predominantly observations from a single class.


Cross-entropy or Deviance D

An alternative to the Gini index is the cross-entropy. For a given region m this index is given by:

$$D_m = - \sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}$$

It turns out that the Gini index and the cross-entropy are very similar numerically.
Since $0 \le \hat{p}_{mk} \le 1$, it follows that $-\hat{p}_{mk} \log \hat{p}_{mk} \ge 0$. One can deduce that the cross-entropy will take on a value near zero if the $\hat{p}_{mk}$'s are all near 0 or near 1. Therefore, the cross-entropy will take on a small value if the mth node is pure.
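A tiny sketch (my own, not from the slides) computing the three impurity measures from the class proportions of a region may help compare them numerically; the example proportions are illustrative.

```python
# Classification error, Gini index and cross-entropy of a node, given its class proportions.
import numpy as np

def impurities(p):
    p = np.asarray(p, dtype=float)
    error = 1.0 - p.max()                            # classification error rate
    gini = np.sum(p * (1.0 - p))                     # Gini index G_m
    entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))   # cross-entropy D_m (0 log 0 treated as 0)
    return error, gini, entropy

print(impurities([0.8, 0.1, 0.1]))     # fairly pure node -> small values
print(impurities([1/3, 1/3, 1/3]))     # impure node -> larger values
```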


Comparison of classification error rate, Gini index and entropy

Cross-entropy and the Gini index are differentiable, and hence more amenable to numerical optimization. However, the classification error rate is preferable if prediction accuracy is the goal.

Example of a classification tree with the Heart data

Consider the Heart data set:
  These data contain a binary variable HD for 303 patients who presented with chest pain:
    HD = Yes: presence of heart disease; HD = No: no heart disease.
  13 predictors including Age, Sex, Chol (a cholesterol measurement), and other heart and lung function measurements.
  Cross-validation yields a tree with 6 terminal nodes (see next slide).
  Decision trees can be constructed with qualitative predictors as well, which is the case for this example. Consider the top node Thal (3 categories: normal, fixed and reversible defects).


Classification tree, Heart dataset

Some remarks:
  It is possible to include qualitative predictors.
  Some splits yield two terminal nodes that have the same predicted value.


Trees vs. Linear Models




Bagging or bootstrap aggregation

We obtain distinct data sets by repeatedly sampling observations from the original data set with replacement.


What is Bagging? - Introduction

Decision trees can suffer from high variance (before pruning).

Bootstrap aggregation, or bagging, is a general procedure for reducing the variance, frequently used in the context of decision trees.

Reminder: given a set of n independent observations Z1, ..., Zn, each with variance σ², the variance of the empirical mean Z̄ of the observations is given by $\sigma^2/n$.
So, averaging a set of observations reduces variance!
However, this is not practical because we generally do not have access to multiple training sets.


Bagging Illustration

Example: relationship between ozone and temperature.
B = 100 models were fitted on bootstrap samples. Gray: predictions from 10 of the fitted models; red: average of the 100 fitted models.

Clearly the average is more stable and there is less overfitting!

Source: https://en.wikipedia.org/wiki/Bootstrap_aggregating
How does Bagging proceed?

Bagging proceeds as follows:

Step 1: Generate B different bootstrapped training data sets by taking samples from the original dataset.
Step 2: Fit the method on the bth bootstrapped training set in order to get the prediction $\hat{f}^{*b}(x)$ for a given observation x.
  Remark: at this point each individual tree has high variance!
Step 3: Then,
  for regression, average the B predictions;
  for classification, take a majority vote: the most commonly occurring class among the B predictions.

Averaging these B trees reduces the variance. A small sketch of the three steps is given below.
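The sketch below (my own illustration, not the lecture's code) implements the three bagging steps for regression with fully grown scikit-learn trees as base learners; X, y and X_test are assumed to be NumPy arrays supplied by the caller.

```python
# Manual bagging for regression: bootstrap, fit, average.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_predict(X, y, X_test, B=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    preds = np.zeros((B, len(X_test)))
    for b in range(B):
        idx = rng.integers(0, n, size=n)                    # Step 1: bootstrap sample (with replacement)
        tree = DecisionTreeRegressor().fit(X[idx], y[idx])  # Step 2: fit f^*b on the bth sample
        preds[b] = tree.predict(X_test)
    return preds.mean(axis=0)                               # Step 3: average the B predictions
```

scikit-learn's BaggingRegressor and BaggingClassifier implement the same idea directly.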


Out-of-Bag (OOB) Error Estimation

A straightforward way to estimate the test error in bagging:

  Trees are fit to bootstrapped subsets of the observations. So, on average, each bagged tree makes use of around 2/3 of the observations (we will see the proof in the tutorial course).
  The remaining 1/3 of the observations are referred to as the out-of-bag (OOB) observations.
  It is possible to predict the response for the ith observation using each of the trees in which that observation was OOB. This will yield around B/3 predictions for each observation.
  Then, we can estimate the overall OOB MSE over the n observations. This will be the estimated test error!
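As a minimal illustration (mine, on a standard toy regression dataset rather than the lecture's data), scikit-learn's bagging ensemble exposes an OOB estimate via oob_score=True:

```python
# OOB estimation with a bagged ensemble of regression trees (the default base learner).
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor

X, y = load_diabetes(return_X_y=True)
bag = BaggingRegressor(n_estimators=200, oob_score=True, random_state=0)
bag.fit(X, y)
print("OOB R^2 estimate:", bag.oob_score_)   # computed from out-of-bag predictions, no hold-out set needed
```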




Introduction to Random Forests (RF)

Random forests provide an improvement over bagged trees that decorrelates the trees by making a slight modification. This reduces the variance when averaging the estimates.
As in bagging, a number of decision trees are built on bootstrapped training samples.
Modification: when building these decision trees, each time a split in a tree is considered, a random selection of m predictors is chosen as split candidates from the full set of p predictors. The split is allowed to use only one of those m predictors.
Typical values for m are p/2 or √p (for instance, if p = 100, choose among only 10 predictors).
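Here is a minimal scikit-learn sketch (my own, on a standard toy dataset) where max_features plays the role of m, the number of predictors considered at each split:

```python
# Random forest with m = sqrt(p) candidate predictors per split and an OOB error estimate.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)               # p = 30 predictors in this example
rf = RandomForestClassifier(n_estimators=500,
                            max_features="sqrt",          # m = sqrt(p)
                            oob_score=True, random_state=0)
rf.fit(X, y)
print("OOB accuracy estimate:", rf.oob_score_)
```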


Why do Random Forests reduce variance?

At each split, the algorithm is only allowed to consider a minority of the predictors.
Idea: suppose that there is one very strong predictor in the data. Then most of the bagged trees will use this predictor in the top split. Consequently, all of the bagged trees will look quite similar to each other, and hence the predictions from the bagged trees will be highly correlated.

Unfortunately, averaging highly correlated quantities does not lead to as substantial a reduction in variance as averaging uncorrelated quantities.

Reminder: $V(\bar{Z}) = V\!\left(\frac{1}{n}\sum_i Z_i\right) = \frac{\sigma_Z^2}{n} + \frac{1}{n^2}\sum_{i \neq j} \mathrm{cov}(Z_i, Z_j)$

The term $\mathrm{cov}(Z_i, Z_j)$ vanishes only if the $Z_i$'s are uncorrelated (independent).


The choice of m in Random Forests

Gene expression data: performance of random forests for different values of m.
Goal: predict cancer type based on 500 genes with high variance.
If m = p, this amounts simply to bagging.


Introduction to Boosting

Like bagging, boosting is an ensemble learning method.

Boosting combines a set of weak learners into a strong learner. A weak learner refers to a learning algorithm that predicts only slightly better than random guessing.
There are different types of boosting algorithms:
  AdaBoost (Adaptive Boosting)
  Gradient Boosting
  XGBoost (Extreme Gradient Boosting)


What is the idea behind boosting?

Intuition: AdaBoost (Adaptive Boosting)

The final prediction is the weighted majority vote of all weak learners.

Sources: https://vitalflux.com/adaboost-algorithm-explained-with-python-example/,
https://towardsdatascience.com/the-ultimate-guide-to-adaboost-random-forests-and-xgboost-7f9327061c4f


Gradient boosting and XGBoost

Gradient boosting trains learners by minimizing a loss function (i.e., training on the residuals of the model).
Unlike fitting a single large decision tree to the data, in gradient boosting the learners are small decision trees grown slowly.
At each step, a decision tree is fit to the residuals or errors of the current model. Then, the residuals are updated. This means the model slowly improves in areas where the classifier does not perform well.

Tuning parameters:
  The number of trees B.
  The depth d, or number of terminal nodes, of each tree.
  A shrinkage parameter λ > 0, which controls the rate at which boosting learns and scales the contribution of each weak learner. Typical values are 0.01 or 0.001.
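The sketch below (my own, on a standard toy dataset) maps the tuning parameters above onto scikit-learn's gradient boosting: n_estimators for B, max_depth for d, learning_rate for λ.

```python
# Gradient boosting with many small, slowly-learned trees.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gbm = GradientBoostingClassifier(n_estimators=1000,   # B: number of trees
                                 max_depth=2,         # d: shallow trees (weak learners)
                                 learning_rate=0.01,  # lambda: shrinkage
                                 random_state=0)
gbm.fit(X_tr, y_tr)
print("test accuracy:", gbm.score(X_te, y_te))
```

AdaBoostClassifier in scikit-learn and XGBClassifier in the xgboost package follow the same fit/predict pattern.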


Comparison example of classification trees

The Spam data contains a 2-class target variable and 50 predictors.


Summary on Decision trees

Decision trees are simple and interpretable models for regression and classification.
However, they are often not competitive with other methods in terms of prediction accuracy.
Bagging, random forests and boosting are good methods for improving the prediction accuracy of trees at the expense of interpretability. They work by growing many trees on the training data and then combining the predictions of the resulting ensemble of trees.
The latter two methods - random forests and boosting - are among the state-of-the-art methods for supervised learning. However, their results can be difficult to interpret.




Introduction

A little history:
Support Vector Machines, usually simply called SVMs, were developed in the 1990s by Vladimir Vapnik.
Since then, SVMs have been shown to perform well in a variety of settings, and are often considered one of the best out-of-the-box classifiers.

SVM principle for the two-class classification problem:

SVM principle
Find a hyperplane that separates the classes in feature space.


What is a Hyperplane?

A hyperplane in p dimensions is a flat affine subspace of dimension p − 1.
In general the equation of a hyperplane has the form:
  β0 + β1 X1 + β2 X2 + ... + βp Xp = 0
In p = 2 dimensions a hyperplane is a line with equation:
  β0 + β1 X1 + β2 X2 = 0
If β0 = 0, the hyperplane goes through the origin.

The vector β = (β1, β2, ..., βp) is called the normal vector; it is orthogonal to the surface of the hyperplane.


Hyperplane in 2 Dimensions


A separating hyperplane

For any point X = (X1, X2, ..., Xp) ∈ R^p in p-dimensional space, there are 3 possibilities:
1. X lies on the hyperplane; then it satisfies
     β0 + β1 X1 + β2 X2 + ... + βp Xp = 0.
2. X does not satisfy this equation and rather
     β0 + β1 X1 + β2 X2 + ... + βp Xp > 0;
   then X lies on one side of the hyperplane.
3. On the other hand, if
     β0 + β1 X1 + β2 X2 + ... + βp Xp < 0,
   then X lies on the other side of the hyperplane.

So, a hyperplane divides a p-dimensional space into two halves.


An example of a separating hyperplane in R^2

The hyperplane 1 + 2X1 + 3X2 = 0 is shown.
Blue region: set of points for which 1 + 2X1 + 3X2 > 0.
Red region: set of points for which 1 + 2X1 + 3X2 < 0.
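A tiny sketch (my own, not from the slides) makes the side-of-hyperplane test explicit for the example hyperplane 1 + 2X1 + 3X2 = 0 above, by evaluating the sign of β0 + β·x:

```python
# Which side of the hyperplane 1 + 2*X1 + 3*X2 = 0 does a point lie on?
import numpy as np

beta0, beta = 1.0, np.array([2.0, 3.0])

def side(x):
    value = beta0 + beta @ np.asarray(x, dtype=float)
    if value == 0:
        return "on the hyperplane"
    return "positive side" if value > 0 else "negative side"

print(side([1, 1]))     # 1 + 2 + 3 = 6 > 0  -> positive (blue) side
print(side([-2, 1]))    # 1 - 4 + 3 = 0      -> on the hyperplane
print(side([-2, -1]))   # 1 - 4 - 3 = -6 < 0 -> negative (red) side
```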


Classification Using a Separating Hyperplane

Now suppose an n × p data matrix that consists of n training observations in p-dimensional space:
  1st observation: x1 = (x11, ..., x1p)
  2nd observation: x2 = (x21, ..., x2p)
  ...
  nth observation: xn = (xn1, ..., xnp)
These observations fall into two classes, y1, ..., yn ∈ {−1, 1}.
Consider a test observation x* = (x1*, ..., xp*).

Goal: develop a classifier based on the training data that correctly classifies the test observation.
Idea: build a separating hyperplane.


How to classify using a separating hyperplane?

Suppose there exists a hyperplane that perfectly separates the two classes in the training observations.
By coding yi = +1 for the blue class and yi = −1 for the red class, a separating hyperplane has the property:
  yi (β0 + β1 xi1 + β2 xi2 + ... + βp xip) > 0   for all i = 1, ..., n.
Given a test observation x*, classify it based on the sign of
  f(x*) = β0 + β1 x1* + β2 x2* + ... + βp xp*:
if f(x*) > 0 then blue; if f(x*) < 0 then red.

f(x*) can also be interpreted as a magnitude of confidence:
  If f(x*) is far from zero, we are confident about its class assignment.
  If f(x*) is close to zero, we are less confident about its class assignment.




Which separating hyperplane to choose?

If a perfect separating hyperplane exists, then there exist an infinite number of such hyperplanes.

Which one to choose?
Support Vector Machines (SVM) Maximal Margin Classifier

The maximal margin hyperplane

A natural choice is the maximal margin hyperplane, also known as the


optimal separating hyperplane, which is the separating hyperplane that
is farthest from the training observations.
Compute the distance from each training observation to a given separating
hyperplane. The smallest such distance is the minimal distance among all
the observations to the hyperplane, and is known as the margin.
The maximal margin hyperplane is the separating hyperplane for which
the margin is largest.
Then, classify a test observation based on which side of the maximal
margin hyperplane it lies. This is known as the maximal margin
classifier.

P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 65 / 88


Support Vector Machines (SVM) Maximal Margin Classifier

Maximal Margin Classifier


Example of Maximal margin hyperplane

The maximal margin hyperplane represents the midline of the widest slab
that can be inserted between the two classes.
The dashed lines indicate the width of the margin.
The three training observations equidistant from the maximal margin
hyperplane that lie along the dashed lines are known as support vectors.

The maximal margin hyperplane depends only on the support vectors!


P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 66 / 88
Support Vector Machines (SVM) Maximal Margin Classifier

Construction of the Maximal Margin Classifier


Given n training observations x1, . . . , xn ∈ Rp with associated class labels
y1, . . . , yn ∈ {−1, 1}, the maximal margin hyperplane is the solution to
the optimization problem:

    maximize_{β0, β1, . . . , βp, M}  M

    subject to:   Σ_{j=1}^{p} βj² = 1   and

    yi(β0 + β1 xi1 + . . . + βp xip) ≥ M   ∀ i = 1, . . . , n.

The second constraint guarantees that each observation will be on the
correct side of the hyperplane (provided that M is positive).
M represents the margin of the hyperplane.
If the first constraint holds, the distance from the ith observation to
the hyperplane is yi(β0 + β1 xi1 + . . . + βp xip).
P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 67 / 88
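
A minimal sketch of this construction with scikit-learn (not part of the lecture, and only one possible implementation): scikit-learn has no explicit hard-margin option, so a very large penalty C on linearly separable toy data approximates the maximal margin classifier.

import numpy as np
from sklearn.svm import SVC

# Two well-separated point clouds (toy data for illustration only)
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2, 2], scale=0.3, size=(20, 2))
X_neg = rng.normal(loc=[-2, -2], scale=0.3, size=(20, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 20 + [-1] * 20)

# A huge C means (almost) no margin violations are tolerated
hard = SVC(kernel="linear", C=1e10).fit(X, y)
print(hard.support_vectors_)  # only these points determine the hyperplane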
Support Vector Machines (SVM) Maximal Margin Classifier

Situations when the Maximal Margin classifier fails (1/2)


The Non-separable Case
In many real-life situations the two classes are not separable.

The maximal margin hyperplane does not exist.
In this case, the optimization problem has no solution with M > 0.

P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 68 / 88


Support Vector Machines (SVM) Maximal Margin Classifier

Situations when the Maximal Margin classifier fails: (2/2)


Noisy data and sensitivity to individual observations

The addition of a single observation leads to a dramatic change in the


maximal margin hyperplane.
This extreme sensitivity suggests overfitting. The margin is smaller!
P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 69 / 88
Support Vector Machines (SVM) Support Vector Classifier

Outline
1 Decision trees
Introduction to Decision Trees
Regression Trees
Classification trees
Bagging or bootstrap aggregation
Random Forests
Boosting
Comparison and summary of decision trees
2 Support Vector Machines (SVM)
Introduction
Maximal Margin Classifier
Support Vector Classifier
Support Vector Machines
3 References

P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 70 / 88


Support Vector Machines (SVM) Support Vector Classifier

Support Vector Classifier - Introduction

Solution: Consider a hyperplane that does not perfectly separate the two
classes, in the interest of
Greater robustness to individual observations, and
Better classification of most of the training observations.
Such a classifier is called support vector classifier or soft margin
classifier. This classifier allows some observations:
to be not only on the wrong side of the margin but also
to be on the wrong side of the hyperplane.

Observations on the wrong side of the hyperplane correspond to


misclassified training observations.

P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 71 / 88


Support Vector Machines (SVM) Support Vector Classifier

Support vector classifier relaxation

Left: Red class: observation 1 is on the wrong side of the margin. Blue class: observation 8 is on
the wrong side of the margin.
Right: Same as left panel with two additional points, 11 and 12. Both are on the wrong
side of the hyperplane and the wrong side of the margin.
P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 72 / 88
Support Vector Machines (SVM) Support Vector Classifier

Construction of the Support Vector Classifier

The Support Vector Classifier is the solution to the problem:

    maximize_{β0, β1, . . . , βp, ε1, . . . , εn, M}  M

    subject to:   Σ_{j=1}^{p} βj² = 1,

    yi(β0 + β1 xi1 + . . . + βp xip) ≥ M(1 − εi),

    εi ≥ 0,   Σ_{i=1}^{n} εi ≤ C,   ∀ i = 1, . . . , n.

C is a nonnegative tuning parameter, M is the width of the margin, and
ε1, . . . , εn are slack variables that allow individual observations to be
on the wrong side of the margin or of the hyperplane.

P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 73 / 88
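
As a hedged illustration in Python/scikit-learn (the slides prescribe no particular software): note that scikit-learn's C is a penalty on margin violations, so it plays roughly the inverse role of the budget C used in the formulation above.

import numpy as np
from sklearn.svm import SVC

# Two overlapping classes, so no separating hyperplane exists
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([1, 1], 1.0, (30, 2)),
               rng.normal([-1, -1], 1.0, (30, 2))])
y = np.array([1] * 30 + [-1] * 30)

# Soft-margin (support vector) classifier
soft = SVC(kernel="linear", C=1.0).fit(X, y)
print("support vectors per class:", soft.n_support_)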


Support Vector Machines (SVM) Support Vector Classifier

Parameters’ interpretation

The slack variable εi tells us where the ith observation is located:

If εi = 0, then observation i is on the correct side of the margin.
If 0 < εi < 1, then observation i is on the wrong side of the margin.
If εi > 1, then observation i is on the wrong side of the hyperplane.

The tuning parameter C bounds Σ_{i=1}^{n} εi ; it represents a budget for
the amount that the margin can be violated by the n observations.
If C = 0, then ε1 = . . . = εn = 0, and we recover the maximal margin
classifier problem.
If C > 0, no more than C observations can be on the wrong side of the
hyperplane.
As C increases, the tolerance increases and the margin widens.
Conversely, as C decreases, the margin narrows.

P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 74 / 88
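
A small sketch of reading the slack values off a fitted model, assuming the soft classifier and the X, y arrays from the previous snippet. scikit-learn fixes the margin at f(x) = ±1 instead of normalizing the β's, so ξi = max(0, 1 − yi f(xi)) plays the role of εi here.

import numpy as np

f = soft.decision_function(X)   # f(x_i) for every training observation
xi = np.maximum(0, 1 - y * f)   # slack values in scikit-learn's scaling

print("correct side of the margin (xi == 0):", np.sum(xi == 0))
print("violate the margin only (0 < xi <= 1):", np.sum((xi > 0) & (xi <= 1)))
print("wrong side of the hyperplane (xi > 1):", np.sum(xi > 1))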


Support Vector Machines (SVM) Support Vector Classifier

The regularization parameter C

C is generally chosen via cross-validation.
C controls the bias-variance trade-off.
If C is small, then the margin is narrow and rarely violated ⇒ low bias
but high variance.
If C is large, then the margin is wide and more violations are allowed ⇒
the classifier is more biased but may have lower variance.

Only observations that either lie directly on the margin or that violate the
margin will affect the hyperplane. These are the support vectors.
P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 75 / 88
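
A sketch of choosing the tuning parameter by cross-validation with scikit-learn's GridSearchCV, again assuming the X, y arrays from the earlier snippet and keeping in mind the inverse convention for C.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# 5-fold cross-validation over a grid of C values
grid = GridSearchCV(SVC(kernel="linear"),
                    param_grid={"C": [0.001, 0.01, 0.1, 1, 10, 100]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)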
Support Vector Machines (SVM) Support Vector Machines

Outline
1 Decision trees
Introduction to Decision Trees
Regression Trees
Classification trees
Bagging or bootstrap aggregation
Random Forests
Boosting
Comparison and summary of decision trees
2 Support Vector Machines (SVM)
Introduction
Maximal Margin Classifier
Support Vector Classifier
Support Vector Machines
3 References

P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 76 / 88


Support Vector Machines (SVM) Support Vector Machines

Classification with Non-linear Decision Boundaries

If the decision boundary is not linear, the Support Vector Classifier


performs poorly!
P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 77 / 88
Support Vector Machines (SVM) Support Vector Machines

Feature expansion
In a higher-dimensional space the data can become linearly separable.

Source: https://towardsdatascience.com/the-kernel-trick-c98cdbcaeb3f

P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 78 / 88


Support Vector Machines (SVM) Support Vector Machines

Feature Expansion

In order to address non-linearity, consider enlarging the feature space


by including transformations of the predictors, e.g. X1², X1³, X1X2,
X1X2² (quadratic, cubic or even higher-order polynomial terms).
This means going from a p-dimensional space to a P > p
dimensional space.
Then, we can fit a support vector classifier in the enlarged space and
get non-linear decision boundaries in the original space.
Example: Suppose we consider (X1, X2, X1², X2², X1X2) instead of just
(X1, X2). Then the decision boundary would be of the form:
β0 + β1 X1 + β2 X2 + β3 X1² + β4 X2² + β5 X1X2 = 0
As we enlarge the feature space, computations become unmanageable!
The support vector machine allows us to enlarge the feature space while
keeping computations efficient.
P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 79 / 88
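
A sketch of explicit feature expansion (illustrative only; the dataset and degree are arbitrary choices): degree-2 polynomial features followed by a linear support vector classifier give a boundary that is linear in the enlarged space but quadratic in the original (X1, X2) space.

from sklearn.datasets import make_circles
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

# Concentric circles: not linearly separable in (X1, X2)
Xc, yc = make_circles(n_samples=200, factor=0.4, noise=0.1, random_state=0)

poly_clf = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                         StandardScaler(),
                         LinearSVC(C=1.0))
poly_clf.fit(Xc, yc)
print("training accuracy:", poly_clf.score(Xc, yc))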
Support Vector Machines (SVM) Support Vector Machines

Inner products and Support Vectors


It can be shown that the linear support vector classifier can be written as:
n
X
f (x) = β0 + αi hx, xi i
i=1
p
X
where ha, bi = aj bj is the inner product between vectors a and b ∈ Rp .
j=1
There are n parameter αi , one per training obsevation.

P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 80 / 88


Support Vector Machines (SVM) Support Vector Machines

Inner products and Support Vectors


It can be shown that the linear support vector classifier can be written as:

    f(x) = β0 + Σ_{i=1}^{n} αi ⟨x, xi⟩

where ⟨a, b⟩ = Σ_{j=1}^{p} aj bj is the inner product between two vectors
a and b ∈ Rp. There are n parameters αi, one per training observation.

To estimate the parameters α1, . . . , αn and β0, we need the n(n − 1)/2
inner products ⟨xi, xi′⟩ between all pairs of training observations.

However, it turns out that αi is nonzero only for the support vectors:

    f(x) = β0 + Σ_{i∈S} αi ⟨x, xi⟩

where S is the collection of indices of these support points.


P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 80 / 88
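
This representation can be checked numerically. The sketch below assumes the soft linear classifier and X from the earlier snippets; in scikit-learn the αi (with the sign of yi absorbed) are stored in dual_coef_, and only support vectors appear in it.

import numpy as np

alphas = soft.dual_coef_.ravel()   # one coefficient per support vector
svs = soft.support_vectors_
beta0 = soft.intercept_[0]

x_new = X[0]
# f(x) = beta_0 + sum_{i in S} alpha_i <x, x_i>
f_manual = beta0 + np.sum(alphas * (svs @ x_new))
print(np.allclose(f_manual, soft.decision_function([x_new])[0]))  # True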
Support Vector Machines (SVM) Support Vector Machines

Kernels and Support Vector Machines

Now, consider replacing the inner product in the support vector classifier
with a generalization of the form K(xi, xi′),
where K is referred to as a kernel. A kernel is a function that quantifies
the similarity of two observations. For instance:
Linear kernel: K(xi, xi′) = Σ_{j=1}^{p} xij xi′j , which is the kernel of the
support vector classifier.
Polynomial kernel: K(xi, xi′) = (1 + Σ_{j=1}^{p} xij xi′j)^d , with integer
degree d > 0.
Radial kernel: K(xi, xi′) = exp(−γ Σ_{j=1}^{p} (xij − xi′j)²) , where γ is a
positive constant.
The support vector machine (SVM) is an extension of the support
vector classifier that results from enlarging the feature space using kernels.

P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 81 / 88
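
A sketch of both kernels with scikit-learn on the same non-linear toy data (the kernel settings are illustrative and would normally be tuned by cross-validation). scikit-learn's polynomial kernel is (γ⟨x, x′⟩ + coef0)^d, so coef0 = 1 mirrors the form above up to the γ scaling.

from sklearn.datasets import make_circles
from sklearn.svm import SVC

Xc, yc = make_circles(n_samples=200, factor=0.4, noise=0.1, random_state=0)

svm_poly = SVC(kernel="poly", degree=3, coef0=1, C=1.0).fit(Xc, yc)
svm_rbf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(Xc, yc)

print("polynomial kernel accuracy:", svm_poly.score(Xc, yc))
print("radial kernel accuracy:", svm_rbf.score(Xc, yc))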


Support Vector Machines (SVM) Support Vector Machines

Example of SVM with polynomial and radial kernels

Left: SVM with a polynomial kernel; Right: SVM with a radial kernel.

P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 82 / 88


Support Vector Machines (SVM) Support Vector Machines

What is the advantage of using a kernel?

The most important advantage is computational.


Indeed, using kernels, one only needs to compute K(xi, xi′) for the
n(n − 1)/2 distinct pairs i, i′. This makes it possible to operate in the
original feature space without ever computing the coordinates of the data
in the higher dimensional space. This is known as the kernel trick.
In many applications, the enlarged feature space is so large that explicit
computations would be intractable.
For some kernels, such as the radial kernel, the implicit feature space is
infinite-dimensional (via the Taylor series of the exponential function), so
we could never do the computations there anyway!

P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 83 / 88
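
The computational point can be made concrete with a short, illustrative sketch using scikit-learn's pairwise helpers: for the radial kernel we only ever form the n × n matrix of kernel values, never the infinite-dimensional feature coordinates themselves.

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(2)
Xr = rng.normal(size=(100, 5))   # n = 100 observations, p = 5 features

K = rbf_kernel(Xr, gamma=0.5)    # K[i, j] = exp(-gamma * ||x_i - x_j||^2)
print(K.shape)                   # (100, 100): all the SVM solver needs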


Support Vector Machines (SVM) Support Vector Machines

Application to the Heart Disease Data, test data


13 predictors, such as Age and Sex, are used to predict whether an individual
has heart disease; 297 subjects, randomly split into 207 training and 90
test observations. ROC curves*:

* Probability scores are calculated using Platt scaling.


P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 84 / 88
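
A sketch of the same kind of pipeline in scikit-learn; the Heart data is not bundled with the library, so a synthetic stand-in with the same dimensions is used here, and probability=True turns on the Platt scaling mentioned in the footnote.

from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in: 297 subjects, 13 predictors (the real Heart data
# would be loaded from file instead)
Xh, yh = make_classification(n_samples=297, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(Xh, yh, train_size=207,
                                          random_state=0)

svm = SVC(kernel="rbf", probability=True).fit(X_tr, y_tr)
scores = svm.predict_proba(X_te)[:, 1]     # Platt-scaled probabilities
fpr, tpr, _ = roc_curve(y_te, scores)
print("test AUC:", roc_auc_score(y_te, scores))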
Support Vector Machines (SVM) Support Vector Machines

SVM with More than Two Classes

The SVM as defined works for K = 2 classes. What do we do if we have


K > 2 classes? Two approaches:
1 OVA (One versus All): Fit K different 2-class SVM classifiers
fˆk(x), k = 1, . . . , K; each class versus the rest. Classify x∗ to the
class for which fˆk(x∗) is largest.
2 OVO (One versus One): Fit all K(K − 1)/2 pairwise classifiers fˆkl(x).
Classify x∗ to the class that wins the most pairwise competitions.

Which to choose? If K is not too large, use OVO.

P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 85 / 88
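
Both strategies are available as wrappers in scikit-learn (SVC itself already applies one-versus-one internally when K > 2); the snippet below is only an illustration on the iris data, which has K = 3 classes.

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)   # K = 3 classes

ova = OneVsRestClassifier(SVC(kernel="linear")).fit(X_iris, y_iris)  # K classifiers
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X_iris, y_iris)   # K(K-1)/2 classifiers

print("OVA training accuracy:", ova.score(X_iris, y_iris))
print("OVO training accuracy:", ovo.score(X_iris, y_iris))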


Support Vector Machines (SVM) Support Vector Machines

Which to use: SVM or Logistic Regression (LR) or LDA?

SVMs became very popular after the introduction of kernels.


When classes are (nearly) separable, SVM does better than LR. So
does LDA.
If the goal is to estimate probabilities, LR is the choice.
For nonlinear boundaries, kernel SVMs are popular. It is possible to
use kernels with LR and LDA as well, but computations are more
expensive.

P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 86 / 88


Support Vector Machines (SVM) Support Vector Machines

Summary

The support vector machine is a generalization of a simple and


intuitive classifier called the maximal margin classifier.
The support vector classifier, an extension of the maximal margin
classifier that can be applied in a broader range of cases.
The support vector machine, which is a further extension of the
support vector classifier in order to accommodate non-linear class
boundaries.
People often loosely refer to the maximal margin classifier, the support
vector classifier, and the support vector machine as ”support vector
machines”.

P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 87 / 88


References

References

James, Gareth; Witten, Daniela; Hastie, Trevor and Tibshirani,
Robert. "An Introduction to Statistical Learning with Applications in
R", 2nd edition, New York: "Springer texts in statistics", 2021.
Website: https://hastie.su.domains/ISLR2/ISLRv2_website.pdf
Hastie, Trevor; Tibshirani, Robert and Friedman, Jerome (2009).
"The Elements of Statistical Learning (Data Mining, Inference, and
Prediction)", 2nd edition. New York: "Springer texts in statistics".
Website: http://statweb.stanford.edu/~tibs/ElemStatLearn/

P. Conde-Céspedes Lectures 3: Classification (Part II) September 30th, 2024 88 / 88
