Support Vector Machines (SVMs)
CS479/679 Pattern Recognition
Dr. George Bebis
[Figure: candidate linear decision boundaries (e.g., B2) separating the training data]
Which solution generalizes best?
Statistical Learning:
Capacity and VC dimension
• To guarantee good generalization performance, the
complexity or capacity of the learned functions must be
controlled.
• Functions with high capacity are more complex (i.e.,
have many degrees of freedom or parameters).
[Figure: examples of learned functions with low capacity vs. high capacity]
Statistical Learning:
Capacity and VC dimension (cont’d)
[Figure: decision boundaries B1 and B2 with their margins, delimited by b11/b12 and b21/b22 respectively]
SVM Overview
• Primarily two-class classifiers but can be extended to
multiple classes.
• Linear discriminant:

g(x) = w^t x + w0

Decide ω1 if g(x) > 0 and ω2 if g(x) < 0

• Class labels:

z_k = +1 if x_k ∈ ω1,  z_k = -1 if x_k ∈ ω2

Linear SVM: separable case

• Require every training sample to lie at a distance of at least b from the separating hyperplane:

z_k g(x_k) / ||w|| ≥ b,  b > 0

• Normalizing so that b = 1 / ||w||, the constraint becomes:

z_k g(x_k) ≥ 1   or   z_k (w^t x_k + w0) ≥ 1
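As an illustration (using a hypothetical weight vector and bias, not values from the slides), a minimal NumPy sketch of the linear discriminant and decision rule:

```python
import numpy as np

# Hypothetical weight vector and bias (illustrative values only)
w = np.array([2.0, 1.0])
w0 = -3.0

def g(x):
    """Linear discriminant g(x) = w^t x + w0."""
    return np.dot(w, x) + w0

def classify(x):
    """Decide omega_1 if g(x) > 0, omega_2 if g(x) < 0."""
    return "omega_1" if g(x) > 0 else "omega_2"

x = np.array([1.5, 1.0])
print(g(x), classify(x))                # signed discriminant value and decision
print(abs(g(x)) / np.linalg.norm(w))    # distance of x from the hyperplane g(x) = 0
```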
Linear SVM: separable case (cont’d)
• Maximizing the margin 2 / ||w|| is equivalent to minimizing

(1/2) ||w||^2

subject to the constraints

z_k (w^t x_k + w0) ≥ 1,  for k = 1, 2, ..., n

• This is a quadratic optimization problem subject to linear constraints; use Lagrange optimization.
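A small sketch of what this formulation means in code, assuming a hypothetical separable training set and a candidate hyperplane (illustrative values only):

```python
import numpy as np

# Hypothetical separable training set with labels z_k in {+1, -1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
z = np.array([+1, +1, -1, -1])

# A candidate hyperplane satisfying z_k (w^t x_k + w0) >= 1
w = np.array([1.0, 1.0])
w0 = -3.0

constraints = z * (X @ w + w0)       # all entries should be >= 1 for a feasible solution
margin = 2.0 / np.linalg.norm(w)     # the quantity being maximized
print(constraints, margin)
```

Any feasible (w, w0) satisfies all constraints; the SVM selects the feasible hyperplane with the largest margin 2 / ||w||.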
Lagrange Optimization
• Maximize f(x) subject to the constraint g(x)=0
• Form the Lagrangian function:

L(x, λ) = f(x) + λ g(x)    (for the SVM's inequality constraints below, λ ≥ 0)
Example
Maximize f(x1,x2)=x1x2
subject to the constraint g(x1,x2)=x1+x2-1=0
L(x1, x2, λ) = f(x1, x2) + λ g(x1, x2) = x1 x2 + λ (x1 + x2 - 1)

∂L/∂x1 = x2 + λ = 0
∂L/∂x2 = x1 + λ = 0
∂L/∂λ = x1 + x2 - 1 = 0

3 equations / 3 unknowns; solving gives x1 = x2 = 1/2.
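As a quick check (not part of the slides), the three stationarity equations can be solved symbolically, e.g. with SymPy:

```python
import sympy as sp

# Symbols for the example: maximize f = x1*x2 subject to g = x1 + x2 - 1 = 0
x1, x2, lam = sp.symbols('x1 x2 lam')
f = x1 * x2
g = x1 + x2 - 1

# Lagrangian L = f + lam * g; set its partial derivatives to zero
L = f + lam * g
equations = [sp.diff(L, v) for v in (x1, x2, lam)]

# Solve the 3 equations in 3 unknowns
print(sp.solve(equations, (x1, x2, lam)))   # x1 = x2 = 1/2
```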
Linear SVM: separable case (cont’d)
• Using Lagrange optimization, minimize:

L(w, w0, λ) = (1/2) ||w||^2 - Σ_{k=1}^{n} λ_k [ z_k (w^t x_k + w0) - 1 ],   λ_k ≥ 0
C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition," Data Mining and Knowledge Discovery, Kluwer Academic Publishers, 1998.
Linear SVM: separable case (cont’d)
• The solution is given by:
w = Σ_{k=1}^{n} z_k λ_k x_k,    w0 = z_k - w^t x_k   (pick any support vector x_k)

• The discriminant becomes:

g(x) = w^t x + w0 = Σ_{k=1}^{n} z_k λ_k (x · x_k) + w0
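A minimal sketch of how w, w0, and g(x) are assembled from the support vectors and multipliers; the support vectors, labels, and λ_k values below are hypothetical, chosen only to make the code runnable:

```python
import numpy as np

# Hypothetical support vectors x_k, labels z_k, and multipliers lambda_k (illustrative only)
X_sv   = np.array([[2.0, 2.0], [1.0, 0.0]])
z_sv   = np.array([+1.0, -1.0])
lam_sv = np.array([0.5, 0.5])

# w = sum_k z_k * lambda_k * x_k
w = (z_sv * lam_sv) @ X_sv

# w0 = z_k - w^t x_k for any support vector x_k
w0 = z_sv[0] - w @ X_sv[0]

def g(x):
    """Discriminant g(x) = sum_k z_k lambda_k (x . x_k) + w0."""
    return np.sum(z_sv * lam_sv * (X_sv @ x)) + w0

print(w, w0, g(np.array([2.0, 1.0])))
```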
Linear SVM: non-separable case

• If the data are not linearly separable, introduce slack variables ξ_k ≥ 0 and relax the constraints:

z_k (w^t x_k + w0) ≥ 1 - ξ_k,   k = 1, 2, ..., n

• Minimize

(1/2) ||w||^2 + c Σ_{k=1}^{n} ξ_k    (c: constant)

subject to the relaxed constraints.

• The solution has the same form as in the separable case:

g(x) = Σ_{k=1}^{n} z_k λ_k (x · x_k) + w0
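For context (not on the slides), the constant c above plays the same role as the C parameter of standard SVM solvers; a minimal scikit-learn sketch on hypothetical overlapping data:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, slightly overlapping 2-D data with labels in {+1, -1} (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
z = np.hstack([-np.ones(20), np.ones(20)])

# C weights the slack-variable penalty, like the constant c above
clf = SVC(kernel='linear', C=1.0).fit(X, z)

print(clf.coef_, clf.intercept_)     # w and w0
print(clf.support_vectors_.shape)    # the support vectors x_k with lambda_k > 0
```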
Nonlinear SVM
• Extending SVMs to the non-linear case involves mapping the data to a higher, h-dimensional space:

x_k → Φ(x_k) = [ φ1(x_k), φ2(x_k), ..., φh(x_k) ]^t

Example:
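A minimal sketch of one such mapping, assuming 2-D inputs mapped to the h = 6 dimensional space of all monomials up to degree 2 (an illustrative choice, not necessarily the slide's example):

```python
import numpy as np

def phi(x):
    """Map a 2-D point to the h = 6 dimensional space of monomials up to degree 2.
    Illustrative choice of phi; other mappings are possible."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x1, x2 * x2, x1 * x2])

# After the mapping, the linear SVM machinery is applied to phi(x_k) instead of x_k
x = np.array([0.5, -1.0])
print(phi(x))
```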
Nonlinear SVM (cont’d)
• In the mapped space, the linear discriminant becomes:

g(x) = Σ_{k=1}^{n} z_k λ_k (Φ(x) · Φ(x_k)) + w0

The kernel trick

• Rather than computing Φ(x) explicitly, replace the dot product Φ(x) · Φ(x_k) with a kernel function K(x, x_k) = Φ(x) · Φ(x_k):

g(x) = Σ_{k=1}^{n} z_k λ_k K(x, x_k) + w0
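A sketch of evaluating the kernel form of the discriminant, here with a Gaussian (RBF) kernel and hypothetical support vectors, labels, multipliers, and bias:

```python
import numpy as np

def K(x, y, sigma=1.0):
    """Gaussian (RBF) kernel; its feature map phi is never computed explicitly."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

# Hypothetical support vectors x_k, labels z_k, multipliers lambda_k, and bias w0 (illustrative only)
X_sv   = np.array([[1.0, 1.0], [-1.0, 1.0]])
z_sv   = np.array([+1.0, -1.0])
lam_sv = np.array([0.8, 0.8])
w0 = 0.0

def g(x):
    """Kernel form of the discriminant: g(x) = sum_k z_k lambda_k K(x, x_k) + w0."""
    return sum(z * lam * K(x, xk) for z, lam, xk in zip(z_sv, lam_sv, X_sv)) + w0

print(g(np.array([0.9, 1.1])))   # positive -> decide omega_1
```

Note that Φ never appears; only kernel evaluations K(x, x_k) are needed.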
The kernel trick (cont’d)
• Do such kernel functions exist?
– Kernel functions which can be expressed as a dot product in some space satisfy Mercer's condition (see Burges' paper).
– Mercer's condition does not tell us how to construct Φ() or even what the high-dimensional space is.
• Advantages of kernel trick
– No need to know Φ()
– Computations remain feasible even if the feature space
has very high dimensionality.
Polynomial Kernel
K(x, y) = (x · y)^d    (d: parameter)
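As a quick numerical check (an illustration, not from the slides), for 2-D inputs the degree-2 polynomial kernel equals an explicit dot product in the space of degree-2 monomials:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on 2-D inputs."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

lhs = np.dot(x, y) ** 2          # kernel value K(x, y) = (x . y)^2
rhs = np.dot(phi(x), phi(y))     # dot product in the mapped space
print(lhs, rhs)                  # both equal 1.0 here: (1*3 + 2*(-1))^2 = 1
```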
Polynomial Kernel (cont’d)
Common Kernel functions
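Commonly used kernel functions include the polynomial, Gaussian (RBF), and sigmoid kernels; a sketch of each in code (parameter names are conventional choices, not taken from the slides):

```python
import numpy as np

def polynomial_kernel(x, y, d=2):
    """K(x, y) = (x . y)^d"""
    return np.dot(x, y) ** d

def gaussian_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2))  (RBF kernel)"""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=1.0, delta=0.0):
    """K(x, y) = tanh(kappa * (x . y) + delta)  (satisfies Mercer's condition only for some kappa, delta)"""
    return np.tanh(kappa * np.dot(x, y) + delta)

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial_kernel(x, y), gaussian_kernel(x, y), sigmoid_kernel(x, y))
```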
Example

[Worked example from the slides (Problem 4): the data are mapped to an h = 6 dimensional feature space; the solution obtained has w0 = 0, from which the final discriminant follows.]
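A sketch of a comparable worked example, assuming the classic XOR data with the polynomial kernel K(x, y) = (x · y + 1)^2, whose feature space has h = 6 dimensions (an assumption; it may not match the slide's Problem 4):

```python
import numpy as np
from sklearn.svm import SVC

# Assumed data: the classic XOR problem (may not match the slide's Problem 4)
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
z = np.array([+1, +1, -1, -1])

# Polynomial kernel K(x, y) = (x . y + 1)^2; large C approximates the hard-margin case
clf = SVC(kernel='poly', degree=2, gamma=1.0, coef0=1.0, C=1e6).fit(X, z)

print(clf.dual_coef_)     # z_k * lambda_k for each support vector
print(clf.intercept_)     # w0 (approximately 0 for this symmetric data set)
print(clf.predict(X))     # all four training points are classified correctly
```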
Comments
• SVM training is based on exact optimization, not on approximate methods (i.e., it is a global optimization method with no local optima).
• SVMs appear to avoid overfitting in high-dimensional spaces and to generalize well even with small training sets.
• Performance depends on the choice of the kernel and its parameters.
• The complexity of the resulting classifier depends on the number of support vectors, not on the dimensionality of the transformed space.