Support Vector Machines (SVMs) are classifiers that perform structural risk minimization to achieve good generalization performance. SVMs find the optimal separating hyperplane that maximizes the margin between classes. This hyperplane depends only on the support vectors, which are the training samples closest to the hyperplane. Both separable and non-separable cases can be solved using Lagrange optimization to find the support vectors and coefficients defining the optimal hyperplane.


Support Vector Machines (SVMs)

Chapter 5 (Duda et al.)

CS479/679 Pattern Recognition


Dr. George Bebis
Learning through "empirical risk" minimization

• Typically, a discriminant function g(x) is estimated from a finite set of examples by minimizing an error function, e.g., the training error (empirical risk minimization):

$$R_{emp} = \frac{1}{n}\sum_{k=1}^{n}\left[z_k - \hat{z}_k\right]^2$$

where $z_k$ is the true class label and $\hat{z}_k$ is the predicted class label:

$$z_k = \begin{cases} +1 & \text{if } x_k \in \omega_1 \\ -1 & \text{if } x_k \in \omega_2 \end{cases}
\qquad
\hat{z}_k = \begin{cases} +1 & \text{if } g(x_k) \ge 0 \\ -1 & \text{if } g(x_k) < 0 \end{cases}$$
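As a concrete illustration (not from the original slides), the sketch below computes the empirical risk of a linear discriminant on a small toy set; the data and weight values are assumptions chosen for the example.

```python
import numpy as np

def empirical_risk(z_true, z_pred):
    """R_emp = (1/n) * sum_k (z_k - zhat_k)^2 over the training set."""
    z_true = np.asarray(z_true, dtype=float)
    z_pred = np.asarray(z_pred, dtype=float)
    return np.mean((z_true - z_pred) ** 2)

def predict_labels(X, w, w0):
    """zhat_k = +1 if g(x_k) = w^t x_k + w0 >= 0, else -1."""
    return np.where(X @ w + w0 >= 0, 1, -1)

# Toy data and weights (assumed for illustration only).
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
z = np.array([1, 1, -1, -1])
w, w0 = np.array([1.0, 1.0]), 0.0

print(empirical_risk(z, predict_labels(X, w, w0)))   # 0.0: this w separates the toy set
```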
Learning through "empirical risk" minimization (cont'd)

• Conventional empirical risk minimization does not imply good generalization performance.
  – There could be several different functions g(x) which all approximate the training data set well.
  – Difficult to determine which function would have the best generalization performance.

[Figure: two candidate decision boundaries B1 (Solution 1) and B2 (Solution 2) both separate the training data; which solution generalizes best?]
Statistical Learning: Capacity and VC dimension

• To guarantee good generalization performance, the complexity or capacity of the learned functions must be controlled.
• Functions with high capacity are more complex (i.e., have many degrees of freedom or parameters).

[Figure: examples of low-capacity and high-capacity decision functions.]
Statistical Learning: Capacity and VC dimension (cont'd)

• How can we measure the capacity of a discriminant function?
  – In statistical learning, the Vapnik-Chervonenkis (VC) dimension is a popular measure of the capacity of a classifier.
  – The VC dimension gives a probabilistic upper bound on the generalization error of a classifier.
Statistical Learning: Capacity and VC dimension (cont'd)

• Vapnik showed that a classifier that (1) minimizes the empirical risk and (2) has low VC dimension will generalize well regardless of the dimensionality of the input space (structural risk minimization). With probability (1 − δ):

$$err_{true} \le err_{train} + \sqrt{\frac{h\left(\log(2n/h) + 1\right) - \log(\delta/4)}{n}}$$

(h: VC dimension, n: # of training examples)

(Vapnik, 1995, "Structural Risk Minimization Principle")
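A minimal sketch (assumed numbers, not from the slides) of how the bound behaves: the capacity penalty shrinks as the number of training examples n grows and grows with the VC dimension h.

```python
import numpy as np

def vc_bound(err_train, h, n, delta=0.05):
    """Upper bound on the true error that holds with probability 1 - delta."""
    penalty = np.sqrt((h * (np.log(2 * n / h) + 1) - np.log(delta / 4)) / n)
    return err_train + penalty

# Illustrative values only: more data or lower capacity tightens the bound.
for h in (5, 50):
    for n in (1000, 100000):
        print(f"h={h:2d}  n={n:6d}  bound={vc_bound(0.05, h, n):.3f}")
```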


VC dimension and margin of separation

• Vapnik has shown that maximizing the margin of separation (i.e., the empty space between the classes) is equivalent to minimizing the VC dimension.
• The optimal hyperplane is the one giving the largest margin of separation between the classes.
Margin of separation and support vectors

• How is the margin defined?
  – The margin is defined by the distance of the nearest training samples from the hyperplane.
  – Intuitively speaking, these are the most difficult samples to classify.
  – We refer to these samples as support vectors.
Margin of separation and support vectors (cont'd)

[Figure: two different solutions B1 and B2 and their corresponding margins, bounded by b11/b12 and b21/b22 respectively.]
SVM Overview

• SVMs are primarily two-class classifiers but can be extended to multiple classes.
• They perform structural risk minimization to achieve good generalization performance (i.e., minimize the training error and maximize the margin).
• Training is equivalent to solving a quadratic programming problem with linear constraints (not iterative, unlike gradient descent or Newton's method).
Linear SVM: separable case (i.e., the data is linearly separable)

• Linear discriminant:

$$g(x) = w^t x + w_0$$

Decide ω1 if g(x) > 0 and ω2 if g(x) < 0, with class labels

$$z_k = \begin{cases} +1 & \text{if } x_k \in \omega_1 \\ -1 & \text{if } x_k \in \omega_2 \end{cases}$$

• Consider the equivalent problem:

$$z_k\, g(x_k) > 0 \quad \text{or} \quad z_k (w^t x_k + w_0) > 0, \quad \text{for } k = 1, 2, \ldots, n$$
Linear SVM: separable case (cont'd)

• The distance r of a point x_k from the separating hyperplane should satisfy the constraint:

$$r = \frac{z_k\, g(x_k)}{\|w\|} \ge b, \quad b > 0$$

• To enforce uniqueness of the solution, we impose the following constraint on w:

$$b\,\|w\| = 1, \quad \text{i.e., } b = \frac{1}{\|w\|}$$

which gives

$$z_k\, g(x_k) \ge 1 \quad \text{or} \quad z_k (w^t x_k + w_0) \ge 1$$
Linear SVM: separable case (cont'd)

• Maximizing the margin $2/\|w\|$ is equivalent to the following quadratic optimization problem with linear constraints:

$$\text{minimize } \frac{1}{2}\|w\|^2 \quad \text{subject to } z_k (w^t x_k + w_0) \ge 1, \quad \text{for } k = 1, 2, \ldots, n$$

• Solve using Lagrange optimization.
Lagrange Optimization

• Maximize f(x) subject to the constraint g(x) = 0.
• Form the Lagrangian function (λ ≥ 0):

$$L(x, \lambda) = f(x) + \lambda\, g(x)$$

• Take the derivatives and set them equal to zero:

$$\nabla_x L(x, \lambda) = 0, \qquad \frac{\partial L(x, \lambda)}{\partial \lambda} = g(x) = 0$$

This gives n+1 equations in n+1 unknowns; solve for x and λ.
Lagrange Optimization (cont'd)

Example: maximize f(x1, x2) = x1 x2 subject to the constraint g(x1, x2) = x1 + x2 − 1 = 0.

$$L(x_1, x_2, \lambda) = f(x_1, x_2) + \lambda\, g(x_1, x_2)$$

$$\frac{\partial L}{\partial x_1} = x_2 + \lambda = 0, \qquad \frac{\partial L}{\partial x_2} = x_1 + \lambda = 0, \qquad x_1 + x_2 - 1 = 0$$

3 equations / 3 unknowns; solving gives x1 = x2 = 1/2 and λ = −1/2.
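The same example can be checked symbolically; a small sketch using sympy (the library choice is mine, not the slides') sets the three partial derivatives to zero and solves the system.

```python
import sympy as sp

x1, x2, lam = sp.symbols('x1 x2 lam')
f = x1 * x2                       # objective to maximize
g = x1 + x2 - 1                   # constraint g(x1, x2) = 0
L = f + lam * g                   # Lagrangian L = f + lambda * g

# Stationarity: dL/dx1 = 0, dL/dx2 = 0, dL/dlambda = g = 0.
equations = [sp.diff(L, v) for v in (x1, x2, lam)]
print(sp.solve(equations, (x1, x2, lam), dict=True))
# [{x1: 1/2, x2: 1/2, lam: -1/2}]
```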
Linear SVM: separable case (cont'd)

• Using Lagrange optimization, minimize:

$$L(w, w_0, \lambda) = \frac{1}{2}\|w\|^2 - \sum_{k=1}^{n} \lambda_k \left[ z_k (w^t x_k + w_0) - 1 \right], \quad \lambda_k \ge 0$$

• It is easier to solve the "dual" problem (Kuhn-Tucker construction): maximize

$$\sum_{k=1}^{n} \lambda_k - \frac{1}{2} \sum_{k,j} \lambda_k \lambda_j z_k z_j\, x_j^t x_k$$

subject to $\lambda_k \ge 0$ and $\sum_{k=1}^{n} \lambda_k z_k = 0$.

C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, Kluwer Academic Publishers, 1998.
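For concreteness, here is a sketch of solving this dual with a generic quadratic-programming solver (the cvxopt package and the toy data are assumptions made for the example). It uses the standard construction P = (z zᵀ) ⊙ (X Xᵀ), q = −1, with the constraints λ ≥ 0 and zᵀλ = 0.

```python
import numpy as np
from cvxopt import matrix, solvers

# Toy separable data (assumed): two point clouds in 2-D.
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]])
z = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
n = len(z)

# Dual in cvxopt's standard form: minimize (1/2) l^T P l + q^T l
# subject to G l <= h (lambda_k >= 0) and A l = b (sum_k lambda_k z_k = 0).
P = matrix(np.outer(z, z) * (X @ X.T))
q = matrix(-np.ones(n))
G = matrix(-np.eye(n))
h = matrix(np.zeros(n))
A = matrix(z.reshape(1, -1))
b = matrix(0.0)

solvers.options['show_progress'] = False
lam = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])

sv = lam > 1e-6                          # support vectors have lambda_k > 0
w = (lam * z) @ X                        # w = sum_k z_k lambda_k x_k
w0 = z[sv][0] - w @ X[sv][0]             # w0 from any support vector
print("support vectors:", np.where(sv)[0], " w =", w, " w0 =", round(w0, 3))
```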
Linear SVM: separable case (cont'd)

• The solution is given by:

$$w = \sum_{k=1}^{n} z_k \lambda_k x_k, \qquad w_0 = z_k - w^t x_k \quad \text{(pick any support vector } x_k\text{)}$$

• The discriminant is given by:

$$g(x) = w^t x + w_0 = \sum_{k=1}^{n} z_k \lambda_k (x_k^t x) + w_0 = \sum_{k=1}^{n} z_k \lambda_k (x \cdot x_k) + w_0 \quad \text{(dot product)}$$
Linear SVM: separable case (cont'd)

$$g(x) = \sum_{k=1}^{n} z_k \lambda_k (x \cdot x_k) + w_0$$

• It can be shown that if x_k is not a support vector, then the corresponding λ_k = 0.

The solution depends on the support vectors only!
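The sketch below checks these statements with scikit-learn (my choice of library; the data are made up): a linear SVM with a very large C approximates the hard-margin classifier, only a few training points end up as support vectors, and w recovered from the multipliers matches the fitted coefficients.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two well-separated clouds (assumed toy data), so the separable case applies.
X = np.vstack([rng.normal(-2.0, 0.5, size=(50, 2)),
               rng.normal(+2.0, 0.5, size=(50, 2))])
z = np.array([-1] * 50 + [+1] * 50)

# Very large C ~ hard margin; dual_coef_ stores z_k * lambda_k for the support vectors only.
clf = SVC(kernel='linear', C=1e6).fit(X, z)

print("support vector indices:", clf.support_)
w_from_dual = np.ravel(clf.dual_coef_ @ clf.support_vectors_)   # w = sum_k z_k lambda_k x_k
print("w (from multipliers):", w_from_dual)
print("w (sklearn coef_)   :", clf.coef_.ravel())               # the two agree
print("w0:", clf.intercept_[0])
```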
Linear SVM: non-separable case (i.e., the data is not linearly separable)

• Allow misclassifications (i.e., a soft-margin classifier) by introducing positive error (slack) variables ψ_k:

$$z_k (w^t x_k + w_0) \ge 1 - \psi_k, \quad k = 1, 2, \ldots, n$$

and minimize (c: constant)

$$\frac{1}{2}\|w\|^2 + c \sum_{k=1}^{n} \psi_k$$

• The solution minimizes the sum of errors ψ_k while maximizing the margin of the correctly classified data.

C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, Kluwer Academic Publishers, 1998.
Linear SVM: non-separable case (cont'd)

$$\frac{1}{2}\|w\|^2 + c \sum_{k=1}^{n} \psi_k$$

• The choice of the constant c is very important!
• It controls the trade-off between the margin and the misclassification errors.
• It aims to prevent outliers from affecting the optimal hyperplane.
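A small experiment (scikit-learn; the overlapping data are made up) illustrating the trade-off: a small c gives a wide margin with many support vectors, while a large c gives a narrow margin that tries to classify every point, including outliers.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Overlapping classes (assumed toy data), so slack variables are unavoidable.
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
z = np.array([-1] * 50 + [+1] * 50)

for c in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=c).fit(X, z)
    margin = 2.0 / np.linalg.norm(clf.coef_)        # geometric margin 2 / ||w||
    n_sv = clf.support_vectors_.shape[0]
    print(f"c={c:6.2f}  margin={margin:.3f}  support vectors={n_sv}")
```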
Linear SVM: non-separable case (cont'd)

• It is easier to solve the "dual" problem (Kuhn-Tucker construction): maximize

$$\sum_{k=1}^{n} \lambda_k - \frac{1}{2} \sum_{k,j} \lambda_k \lambda_j z_k z_j\, x_j^t x_k$$

subject to $0 \le \lambda_k \le c$ and $\sum_{k=1}^{n} \lambda_k z_k = 0$; the discriminant is again

$$g(x) = \sum_{k=1}^{n} z_k \lambda_k (x \cdot x_k) + w_0$$

C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, Kluwer Academic Publishers, 1998.
Nonlinear SVM

• The extension to the non-linear case involves mapping the data to an h-dimensional space:

$$x_k \rightarrow \Phi(x_k) = \begin{bmatrix} \varphi_1(x_k) \\ \varphi_2(x_k) \\ \vdots \\ \varphi_h(x_k) \end{bmatrix}$$

• Mapping the data to a sufficiently high-dimensional space is likely to make the data linearly separable in that space.
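A minimal sketch of this idea (the data, the map Φ(x) = (x, x²), and the use of scikit-learn are assumptions made for the illustration): 1-D data with one class sandwiched between the other is not linearly separable, but becomes separable after an explicit quadratic mapping.

```python
import numpy as np
from sklearn.svm import SVC

# 1-D data that is NOT linearly separable: the +1 class sits between two -1 regions.
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]).reshape(-1, 1)
z = np.array([-1, -1, 1, 1, 1, -1, -1])

# Explicit map Phi(x) = (x, x^2): in this 2-D space the line x^2 = 2.5 separates the classes.
Phi = np.hstack([x, x ** 2])

clf = SVC(kernel='linear', C=1e6).fit(Phi, z)
print(clf.predict(Phi))        # matches z: linearly separable after the mapping
```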
Nonlinear SVM (cont'd)

Example: [figure illustrating a non-linear mapping]
Nonlinear SVM (cont'd)

$$\text{linear SVM: } g(x) = \sum_{k=1}^{n} z_k \lambda_k (x \cdot x_k) + w_0$$

$$\text{non-linear SVM: } g(x) = \sum_{k=1}^{n} z_k \lambda_k \left(\Phi(x) \cdot \Phi(x_k)\right) + w_0$$
Nonlinear SVM (cont'd)

$$\text{non-linear SVM: } g(x) = \sum_{k=1}^{n} z_k \lambda_k \left(\Phi(x) \cdot \Phi(x_k)\right) + w_0$$

• The disadvantage of this approach is that the mapping x_k → Φ(x_k) is typically very computationally intensive!
• Is there an efficient way to compute Φ(x) · Φ(x_k)?
The kernel trick

• Compute the dot products using a kernel function:

$$K(x, x_k) = \Phi(x) \cdot \Phi(x_k)$$

$$g(x) = \sum_{k=1}^{n} z_k \lambda_k \left(\Phi(x) \cdot \Phi(x_k)\right) + w_0 \;\;\Longrightarrow\;\; g(x) = \sum_{k=1}^{n} z_k \lambda_k K(x, x_k) + w_0$$
The kernel trick (cont'd)

• Do such kernel functions exist?
  – Kernel functions which can be expressed as a dot product in some space satisfy Mercer's condition (see Burges' paper).
  – Mercer's condition does not tell us how to construct Φ() or even what the high-dimensional space is.
• Advantages of the kernel trick:
  – There is no need to know Φ().
  – Computations remain feasible even if the feature space has very high dimensionality.

C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, Kluwer Academic Publishers, 1998.
Polynomial Kernel

$$K(x, y) = (x \cdot y)^d \qquad (d: \text{parameter})$$
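As a quick check (a sketch; the explicit map below is one standard choice, not something given on the slide), for d = 2 and 2-D inputs the kernel value (x · y)² equals the dot product of the explicitly mapped vectors Φ(v) = (v₁², √2 v₁v₂, v₂²).

```python
import numpy as np

def phi(v):
    """Explicit quadratic feature map for a 2-D vector: (v1^2, sqrt(2)*v1*v2, v2^2)."""
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

print((x @ y) ** 2)        # kernel evaluation: K(x, y) = (x . y)^2 = 1.0
print(phi(x) @ phi(y))     # dot product in the mapped 3-D space = 1.0
```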
Polynomial Kernel (cont’d)
Common Kernel functions
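The table on this slide is contained in a figure; as a reference, kernels commonly used with SVMs (standard forms, not necessarily the exact ones listed on the slide) include:

• Linear: $K(x, y) = x \cdot y$
• Polynomial: $K(x, y) = (x \cdot y + 1)^d$
• Gaussian (RBF): $K(x, y) = \exp\left(-\|x - y\|^2 / 2\sigma^2\right)$
• Sigmoid: $K(x, y) = \tanh(\kappa\, x \cdot y + \theta)$ (satisfies Mercer's condition only for some values of κ and θ)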
Example

[Worked example spanning several slides (content shown in figures): h = 6, Problem 4, w0 = 0, and the resulting discriminant.]
Comments

• SVM training is based on exact optimization, not on approximate methods (i.e., it is a global optimization method with no local optima).
• SVMs appear to avoid overfitting in high-dimensional spaces and to generalize well using small training sets.
• Performance depends on the choice of the kernel and its parameters.
• Complexity depends on the number of support vectors, not on the dimensionality of the transformed space.
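To illustrate the last two points, a short scikit-learn sketch (the data set and the parameter grid are assumptions chosen for the example) searches over kernels and their parameters and reports how many training points end up as support vectors.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_circles

# Concentric circles: a standard data set that is not linearly separable in input space.
X, y = make_circles(n_samples=200, factor=0.4, noise=0.1, random_state=0)

# Performance depends on the kernel and its parameters, so search over both.
param_grid = {'kernel': ['linear', 'poly', 'rbf'],
              'C': [0.1, 1, 10],
              'gamma': ['scale', 0.1, 1.0]}
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)

best = search.best_estimator_
print("best parameters:", search.best_params_)
print("cv accuracy    :", round(search.best_score_, 3))
# Run-time complexity of classification scales with the number of support vectors.
print("support vectors:", best.n_support_.sum(), "out of", len(X))
```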
