Support Vector Machines (SVMs) are classifiers that perform structural risk minimization to achieve good generalization performance. SVMs find the optimal separating hyperplane that maximizes the margin between classes. This hyperplane depends only on the support vectors, which are the training samples closest to the hyperplane. Both separable and non-separable cases can be solved using Lagrange optimization to find the support vectors and coefficients defining the optimal hyperplane.


Support Vector Machines (SVMs)

Chapter 5 (Duda et al.)

CS479/679 Pattern Recognition


Dr. George Bebis
Learning through "empirical risk" minimization

• Typically, a discriminant function g(x) is estimated from a finite set of examples by minimizing an error function, e.g., the training error (empirical risk minimization):

$$R_{emp} = \frac{1}{n}\sum_{k=1}^{n}\left[z_k - \hat{z}_k\right]^2$$

where $z_k$ is the true class label and $\hat{z}_k$ is the predicted class label:

$$z_k = \begin{cases} +1 & \text{if } x_k \in \omega_1 \\ -1 & \text{if } x_k \in \omega_2 \end{cases}
\qquad
\hat{z}_k = \begin{cases} +1 & \text{if } g(x_k) \ge 0 \\ -1 & \text{if } g(x_k) < 0 \end{cases}$$
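As a concrete illustration (not from the original slides), the sketch below computes the empirical risk of a linear discriminant on a small toy set; the data and weight values are assumptions chosen for the example.

```python
import numpy as np

def empirical_risk(z_true, z_pred):
    """R_emp = (1/n) * sum_k (z_k - zhat_k)^2 over the training set."""
    z_true = np.asarray(z_true, dtype=float)
    z_pred = np.asarray(z_pred, dtype=float)
    return np.mean((z_true - z_pred) ** 2)

def predict_labels(X, w, w0):
    """zhat_k = +1 if g(x_k) = w^t x_k + w0 >= 0, else -1."""
    return np.where(X @ w + w0 >= 0, 1, -1)

# Toy data and weights (assumed for illustration only).
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
z = np.array([1, 1, -1, -1])
w, w0 = np.array([1.0, 1.0]), 0.0

print(empirical_risk(z, predict_labels(X, w, w0)))   # 0.0: this w separates the toy set
```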
Learning through "empirical risk" minimization (cont'd)

• Conventional empirical risk minimization does not imply good generalization performance.
  – There could be several different functions g(x) which all approximate the training data set well.
  – Difficult to determine which function would have the best generalization performance.

[Figure: two candidate decision boundaries B1 (Solution 1) and B2 (Solution 2) both separate the training data; which solution generalizes best?]
Statistical Learning: Capacity and VC dimension

• To guarantee good generalization performance, the complexity or capacity of the learned functions must be controlled.
• Functions with high capacity are more complex (i.e., have many degrees of freedom or parameters).

[Figure: examples of low-capacity and high-capacity decision functions.]
Statistical Learning: Capacity and VC dimension (cont'd)

• How can we measure the capacity of a discriminant function?
  – In statistical learning, the Vapnik-Chervonenkis (VC) dimension is a popular measure of the capacity of a classifier.
  – The VC dimension gives a probabilistic upper bound on the generalization error of a classifier.
Statistical Learning: Capacity and VC dimension (cont'd)

• Vapnik showed that a classifier that (1) minimizes the empirical risk and (2) has low VC dimension will generalize well regardless of the dimensionality of the input space (structural risk minimization). With probability (1 − δ):

$$err_{true} \le err_{train} + \sqrt{\frac{h\left(\log(2n/h) + 1\right) - \log(\delta/4)}{n}}$$

(h: VC dimension, n: # of training examples)

(Vapnik, 1995, "Structural Risk Minimization Principle")
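A minimal sketch (assumed numbers, not from the slides) of how the bound behaves: the capacity penalty shrinks as the number of training examples n grows and grows with the VC dimension h.

```python
import numpy as np

def vc_bound(err_train, h, n, delta=0.05):
    """Upper bound on the true error that holds with probability 1 - delta."""
    penalty = np.sqrt((h * (np.log(2 * n / h) + 1) - np.log(delta / 4)) / n)
    return err_train + penalty

# Illustrative values only: more data or lower capacity tightens the bound.
for h in (5, 50):
    for n in (1000, 100000):
        print(f"h={h:2d}  n={n:6d}  bound={vc_bound(0.05, h, n):.3f}")
```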


VC dimension and margin of separation

• Vapnik has shown that maximizing the margin of separation (i.e., the empty space between the classes) is equivalent to minimizing the VC dimension.
• The optimal hyperplane is the one giving the largest margin of separation between the classes.
Margin of separation and support vectors

• How is the margin defined?
  – The margin is defined by the distance of the nearest training samples from the hyperplane.
  – Intuitively speaking, these are the most difficult samples to classify.
  – We refer to these samples as support vectors.
Margin of separation and support vectors (cont'd)

[Figure: two different solutions B1 and B2 and their corresponding margins, bounded by b11/b12 and b21/b22 respectively.]
SVM Overview

• SVMs are primarily two-class classifiers but can be extended to multiple classes.
• They perform structural risk minimization to achieve good generalization performance (i.e., minimize the training error and maximize the margin).
• Training is equivalent to solving a quadratic programming problem with linear constraints (not iterative, unlike gradient descent or Newton's method).
Linear SVM: separable case (i.e., the data is linearly separable)

• Linear discriminant:

$$g(x) = w^t x + w_0$$

Decide ω1 if g(x) > 0 and ω2 if g(x) < 0, with class labels

$$z_k = \begin{cases} +1 & \text{if } x_k \in \omega_1 \\ -1 & \text{if } x_k \in \omega_2 \end{cases}$$

• Consider the equivalent problem:

$$z_k\, g(x_k) > 0 \quad \text{or} \quad z_k (w^t x_k + w_0) > 0, \quad \text{for } k = 1, 2, \ldots, n$$
Linear SVM: separable case (cont'd)

• The distance r of a point x_k from the separating hyperplane should satisfy the constraint:

$$r = \frac{z_k\, g(x_k)}{\|w\|} \ge b, \quad b > 0$$

• To enforce uniqueness of the solution, we impose the following constraint on w:

$$b\,\|w\| = 1, \quad \text{i.e., } b = \frac{1}{\|w\|}$$

which gives

$$z_k\, g(x_k) \ge 1 \quad \text{or} \quad z_k (w^t x_k + w_0) \ge 1$$
Linear SVM: separable case (cont'd)

• Maximizing the margin $2/\|w\|$ is equivalent to the following quadratic optimization problem with linear constraints:

$$\text{minimize } \frac{1}{2}\|w\|^2 \quad \text{subject to } z_k (w^t x_k + w_0) \ge 1, \quad \text{for } k = 1, 2, \ldots, n$$

• Solve using Lagrange optimization.
Lagrange Optimization

• Maximize f(x) subject to the constraint g(x) = 0.
• Form the Lagrangian function (λ ≥ 0):

$$L(x, \lambda) = f(x) + \lambda\, g(x)$$

• Take the derivatives and set them equal to zero:

$$\nabla_x L(x, \lambda) = 0, \qquad \frac{\partial L(x, \lambda)}{\partial \lambda} = g(x) = 0$$

This gives n+1 equations in n+1 unknowns; solve for x and λ.
Lagrange Optimization (cont'd)

Example: maximize f(x1, x2) = x1 x2 subject to the constraint g(x1, x2) = x1 + x2 − 1 = 0.

$$L(x_1, x_2, \lambda) = f(x_1, x_2) + \lambda\, g(x_1, x_2)$$

$$\frac{\partial L}{\partial x_1} = x_2 + \lambda = 0, \qquad \frac{\partial L}{\partial x_2} = x_1 + \lambda = 0, \qquad x_1 + x_2 - 1 = 0$$

3 equations / 3 unknowns; solving gives x1 = x2 = 1/2 and λ = −1/2.
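The same example can be checked symbolically; a small sketch using sympy (the library choice is mine, not the slides') sets the three partial derivatives to zero and solves the system.

```python
import sympy as sp

x1, x2, lam = sp.symbols('x1 x2 lam')
f = x1 * x2                       # objective to maximize
g = x1 + x2 - 1                   # constraint g(x1, x2) = 0
L = f + lam * g                   # Lagrangian L = f + lambda * g

# Stationarity: dL/dx1 = 0, dL/dx2 = 0, dL/dlambda = g = 0.
equations = [sp.diff(L, v) for v in (x1, x2, lam)]
print(sp.solve(equations, (x1, x2, lam), dict=True))
# [{x1: 1/2, x2: 1/2, lam: -1/2}]
```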
Linear SVM: separable case (cont'd)

• Using Lagrange optimization, minimize:

$$L(w, w_0, \lambda) = \frac{1}{2}\|w\|^2 - \sum_{k=1}^{n} \lambda_k \left[ z_k (w^t x_k + w_0) - 1 \right], \quad \lambda_k \ge 0$$

• It is easier to solve the "dual" problem (Kuhn-Tucker construction): maximize

$$\sum_{k=1}^{n} \lambda_k - \frac{1}{2} \sum_{k,j} \lambda_k \lambda_j z_k z_j\, x_j^t x_k$$

subject to $\lambda_k \ge 0$ and $\sum_{k=1}^{n} \lambda_k z_k = 0$.

C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, Kluwer Academic Publishers, 1998.
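For concreteness, here is a sketch of solving this dual with a generic quadratic-programming solver (the cvxopt package and the toy data are assumptions made for the example). It uses the standard construction P = (z zᵀ) ⊙ (X Xᵀ), q = −1, with the constraints λ ≥ 0 and zᵀλ = 0.

```python
import numpy as np
from cvxopt import matrix, solvers

# Toy separable data (assumed): two point clouds in 2-D.
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]])
z = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
n = len(z)

# Dual in cvxopt's standard form: minimize (1/2) l^T P l + q^T l
# subject to G l <= h (lambda_k >= 0) and A l = b (sum_k lambda_k z_k = 0).
P = matrix(np.outer(z, z) * (X @ X.T))
q = matrix(-np.ones(n))
G = matrix(-np.eye(n))
h = matrix(np.zeros(n))
A = matrix(z.reshape(1, -1))
b = matrix(0.0)

solvers.options['show_progress'] = False
lam = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])

sv = lam > 1e-6                          # support vectors have lambda_k > 0
w = (lam * z) @ X                        # w = sum_k z_k lambda_k x_k
w0 = z[sv][0] - w @ X[sv][0]             # w0 from any support vector
print("support vectors:", np.where(sv)[0], " w =", w, " w0 =", round(w0, 3))
```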
Linear SVM: separable case (cont'd)

• The solution is given by:

$$w = \sum_{k=1}^{n} z_k \lambda_k x_k, \qquad w_0 = z_k - w^t x_k \quad \text{(pick any support vector } x_k\text{)}$$

• The discriminant is given by:

$$g(x) = w^t x + w_0 = \sum_{k=1}^{n} z_k \lambda_k (x_k^t x) + w_0 = \sum_{k=1}^{n} z_k \lambda_k (x \cdot x_k) + w_0 \quad \text{(dot product)}$$
Linear SVM: separable case (cont'd)

$$g(x) = \sum_{k=1}^{n} z_k \lambda_k (x \cdot x_k) + w_0$$

• It can be shown that if x_k is not a support vector, then the corresponding λ_k = 0.

The solution depends on the support vectors only!
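The sketch below checks these statements with scikit-learn (my choice of library; the data are made up): a linear SVM with a very large C approximates the hard-margin classifier, only a few training points end up as support vectors, and w recovered from the multipliers matches the fitted coefficients.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two well-separated clouds (assumed toy data), so the separable case applies.
X = np.vstack([rng.normal(-2.0, 0.5, size=(50, 2)),
               rng.normal(+2.0, 0.5, size=(50, 2))])
z = np.array([-1] * 50 + [+1] * 50)

# Very large C ~ hard margin; dual_coef_ stores z_k * lambda_k for the support vectors only.
clf = SVC(kernel='linear', C=1e6).fit(X, z)

print("support vector indices:", clf.support_)
w_from_dual = np.ravel(clf.dual_coef_ @ clf.support_vectors_)   # w = sum_k z_k lambda_k x_k
print("w (from multipliers):", w_from_dual)
print("w (sklearn coef_)   :", clf.coef_.ravel())               # the two agree
print("w0:", clf.intercept_[0])
```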
Linear SVM: non-separable case (i.e., the data is not linearly separable)

• Allow misclassifications (i.e., a soft-margin classifier) by introducing positive error (slack) variables ψ_k:

$$z_k (w^t x_k + w_0) \ge 1 - \psi_k, \quad k = 1, 2, \ldots, n$$

and minimize (c: constant)

$$\frac{1}{2}\|w\|^2 + c \sum_{k=1}^{n} \psi_k$$

• The solution minimizes the sum of errors ψ_k while maximizing the margin of the correctly classified data.

C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, Kluwer Academic Publishers, 1998.
Linear SVM: non-separable case (cont'd)

$$\frac{1}{2}\|w\|^2 + c \sum_{k=1}^{n} \psi_k$$

• The choice of the constant c is very important!
• It controls the trade-off between the margin and the misclassification errors.
• It aims to prevent outliers from affecting the optimal hyperplane.
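A small experiment (scikit-learn; the overlapping data are made up) illustrating the trade-off: a small c gives a wide margin with many support vectors, while a large c gives a narrow margin that tries to classify every point, including outliers.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Overlapping classes (assumed toy data), so slack variables are unavoidable.
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
z = np.array([-1] * 50 + [+1] * 50)

for c in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=c).fit(X, z)
    margin = 2.0 / np.linalg.norm(clf.coef_)        # geometric margin 2 / ||w||
    n_sv = clf.support_vectors_.shape[0]
    print(f"c={c:6.2f}  margin={margin:.3f}  support vectors={n_sv}")
```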
Linear SVM: non-separable case (cont'd)

• It is easier to solve the "dual" problem (Kuhn-Tucker construction): maximize

$$\sum_{k=1}^{n} \lambda_k - \frac{1}{2} \sum_{k,j} \lambda_k \lambda_j z_k z_j\, x_j^t x_k$$

subject to $0 \le \lambda_k \le c$ and $\sum_{k=1}^{n} \lambda_k z_k = 0$; the discriminant is again

$$g(x) = \sum_{k=1}^{n} z_k \lambda_k (x \cdot x_k) + w_0$$

C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, Kluwer Academic Publishers, 1998.
Nonlinear SVM

• The extension to the non-linear case involves mapping the data to an h-dimensional space:

$$x_k \rightarrow \Phi(x_k) = \begin{bmatrix} \varphi_1(x_k) \\ \varphi_2(x_k) \\ \vdots \\ \varphi_h(x_k) \end{bmatrix}$$

• Mapping the data to a sufficiently high-dimensional space is likely to make the data linearly separable in that space.
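A minimal sketch of this idea (the data, the map Φ(x) = (x, x²), and the use of scikit-learn are assumptions made for the illustration): 1-D data with one class sandwiched between the other is not linearly separable, but becomes separable after an explicit quadratic mapping.

```python
import numpy as np
from sklearn.svm import SVC

# 1-D data that is NOT linearly separable: the +1 class sits between two -1 regions.
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]).reshape(-1, 1)
z = np.array([-1, -1, 1, 1, 1, -1, -1])

# Explicit map Phi(x) = (x, x^2): in this 2-D space the line x^2 = 2.5 separates the classes.
Phi = np.hstack([x, x ** 2])

clf = SVC(kernel='linear', C=1e6).fit(Phi, z)
print(clf.predict(Phi))        # matches z: linearly separable after the mapping
```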
Nonlinear SVM (cont'd)

Example: [figure illustrating a non-linear mapping]
Nonlinear SVM (cont'd)

$$\text{linear SVM: } g(x) = \sum_{k=1}^{n} z_k \lambda_k (x \cdot x_k) + w_0$$

$$\text{non-linear SVM: } g(x) = \sum_{k=1}^{n} z_k \lambda_k \left(\Phi(x) \cdot \Phi(x_k)\right) + w_0$$
Nonlinear SVM (cont'd)

$$\text{non-linear SVM: } g(x) = \sum_{k=1}^{n} z_k \lambda_k \left(\Phi(x) \cdot \Phi(x_k)\right) + w_0$$

• The disadvantage of this approach is that the mapping x_k → Φ(x_k) is typically very computationally intensive!
• Is there an efficient way to compute Φ(x) · Φ(x_k)?
The kernel trick

• Compute the dot products using a kernel function:

$$K(x, x_k) = \Phi(x) \cdot \Phi(x_k)$$

$$g(x) = \sum_{k=1}^{n} z_k \lambda_k \left(\Phi(x) \cdot \Phi(x_k)\right) + w_0 \;\;\Longrightarrow\;\; g(x) = \sum_{k=1}^{n} z_k \lambda_k K(x, x_k) + w_0$$
The kernel trick (cont'd)

• Do such kernel functions exist?
  – Kernel functions which can be expressed as a dot product in some space satisfy Mercer's condition (see Burges' paper).
  – Mercer's condition does not tell us how to construct Φ() or even what the high-dimensional space is.
• Advantages of the kernel trick:
  – There is no need to know Φ().
  – Computations remain feasible even if the feature space has very high dimensionality.

C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, Kluwer Academic Publishers, 1998.
Polynomial Kernel

$$K(x, y) = (x \cdot y)^d \qquad (d: \text{parameter})$$
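As a quick check (a sketch; the explicit map below is one standard choice, not something given on the slide), for d = 2 and 2-D inputs the kernel value (x · y)² equals the dot product of the explicitly mapped vectors Φ(v) = (v₁², √2 v₁v₂, v₂²).

```python
import numpy as np

def phi(v):
    """Explicit quadratic feature map for a 2-D vector: (v1^2, sqrt(2)*v1*v2, v2^2)."""
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

print((x @ y) ** 2)        # kernel evaluation: K(x, y) = (x . y)^2 = 1.0
print(phi(x) @ phi(y))     # dot product in the mapped 3-D space = 1.0
```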
Polynomial Kernel (cont’d)
Common Kernel functions
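The table on this slide is contained in a figure; as a reference, kernels commonly used with SVMs (standard forms, not necessarily the exact ones listed on the slide) include:

• Linear: $K(x, y) = x \cdot y$
• Polynomial: $K(x, y) = (x \cdot y + 1)^d$
• Gaussian (RBF): $K(x, y) = \exp\left(-\|x - y\|^2 / 2\sigma^2\right)$
• Sigmoid: $K(x, y) = \tanh(\kappa\, x \cdot y + \theta)$ (satisfies Mercer's condition only for some values of κ and θ)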
Example

[Worked example spanning several slides (content shown in figures): h = 6, Problem 4, w0 = 0, and the resulting discriminant.]
Comments

• SVM training is based on exact optimization, not on approximate methods (i.e., it is a global optimization method with no local optima).
• SVMs appear to avoid overfitting in high-dimensional spaces and to generalize well using small training sets.
• Performance depends on the choice of the kernel and its parameters.
• Complexity depends on the number of support vectors, not on the dimensionality of the transformed space.
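To illustrate the last two points, a short scikit-learn sketch (the data set and the parameter grid are assumptions chosen for the example) searches over kernels and their parameters and reports how many training points end up as support vectors.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_circles

# Concentric circles: a standard data set that is not linearly separable in input space.
X, y = make_circles(n_samples=200, factor=0.4, noise=0.1, random_state=0)

# Performance depends on the kernel and its parameters, so search over both.
param_grid = {'kernel': ['linear', 'poly', 'rbf'],
              'C': [0.1, 1, 10],
              'gamma': ['scale', 0.1, 1.0]}
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)

best = search.best_estimator_
print("best parameters:", search.best_params_)
print("cv accuracy    :", round(search.best_score_, 3))
# Run-time complexity of classification scales with the number of support vectors.
print("support vectors:", best.n_support_.sum(), "out of", len(X))
```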
