Machine Learning Lecture 03

The document discusses Support Vector Machines (SVM), a classifier developed from statistical learning theory, highlighting its applications in various fields such as object detection and text recognition. It explains the concept of linear discriminant functions, large margin classifiers, and the optimization problem involved in training SVMs, including the use of slack variables for non-linear separability. Additionally, it introduces the kernel trick for mapping input data to higher-dimensional feature spaces to enhance classification capabilities.


Decision Surfaces

• Decision trees
• Linear functions:  g(x) = w^T x + b
• Nonlinear functions (neural nets)
Today: Support Vector Machine (SVM)

• A classifier derived from statistical learning theory by Vapnik et al. in 1992.
• SVM became famous when, using images as input, it gave accuracy comparable to neural networks with hand-designed features on a handwriting recognition task.
• Currently, SVM is widely used in object detection & recognition, content-based image retrieval, text recognition, biometrics, speech recognition, etc.
• Also used for regression (not covered today).

Reading:
• Chapter 5.1, 5.2, 5.3, 5.11 (5.4*) in Bishop
• SVM tutorial (start reading from Section 3)

[Photo: V. Vapnik]
Linear Discriminant Function or a Linear Classifier

• Given data from two classes (+1 and -1), learn a function of the form:

    g(x) = w^T x + b

  – A hyperplane in the feature space.
  – Decide class = +1 if g(x) > 0, and class = -1 otherwise.

[Figure: points of the two classes in the (x1, x2) plane, split by the hyperplane g(x) = 0, with w^T x + b > 0 on one side and w^T x + b < 0 on the other.]
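As a small illustrative sketch (the weight vector, bias, and points below are assumptions, not from the lecture), the decision rule can be written directly:

```python
# A minimal sketch: class = +1 if g(x) = w^T x + b > 0, else -1.
# w, b, and X are assumed values chosen only for illustration.
import numpy as np

def predict(X, w, b):
    """Return +1 where w^T x + b > 0 and -1 otherwise."""
    return np.where(X @ w + b > 0, 1, -1)

w = np.array([1.0, -2.0])   # assumed weight vector
b = 0.5                     # assumed bias
X = np.array([[3.0, 1.0], [0.0, 2.0], [-1.0, -1.0]])
print(predict(X, w, b))     # -> [ 1 -1  1]
```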
Linear Discriminant Function

• How would you classify these points using a linear discriminant function in order to minimize the error rate?

  – There are an infinite number of answers!
  – Which one is the best?

[Figure: the two classes in the (x1, x2) plane with several candidate separating lines.]
Large Margin Linear Classifier

• The linear discriminant function (classifier) with the maximum margin is the best.
  – The margin is defined as the width by which the boundary could be increased before hitting a data point.
• Why is it the best?
  – The larger the margin, the better the generalization.
  – Robust to outliers.

[Figure: separating hyperplane with the margin ("safe zone") between the two classes in the (x1, x2) plane.]
Large Margin Linear Classifier

• Aim: learn a large margin classifier.
• Given a set of data points, define:

    For y_i = +1:  w^T x_i + b ≥ +1
    For y_i = -1:  w^T x_i + b ≤ -1

• Give an algebraic expression for the width of the margin.

[Figure: the margin ("safe zone") around the separating hyperplane in the (x1, x2) plane.]
Algebraic Expression for the Width of the Margin

[Figure: the margin ("safe zone") between the hyperplanes w^T x + b = +1 and w^T x + b = -1.]
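Filling in the expression the slide asks for (a short derivation, using points x⁺ and x⁻ on the two margin hyperplanes, as in the figure on the next slide):

```latex
% Take x^+ on the hyperplane w^T x + b = +1 and x^- on w^T x + b = -1:
\mathbf{w}^{T}\mathbf{x}^{+} + b = +1, \qquad \mathbf{w}^{T}\mathbf{x}^{-} + b = -1
\;\;\Longrightarrow\;\; \mathbf{w}^{T}(\mathbf{x}^{+} - \mathbf{x}^{-}) = 2 .
% The margin is the projection of (x^+ - x^-) onto the unit normal w / ||w||:
\text{margin} \;=\; \frac{\mathbf{w}^{T}(\mathbf{x}^{+} - \mathbf{x}^{-})}{\lVert \mathbf{w} \rVert}
\;=\; \frac{2}{\lVert \mathbf{w} \rVert}.
```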
Large Margin Linear Classifier

• Aim: learn a large margin classifier.
• Mathematical formulation:

    maximize   2 / ‖w‖
    such that
      For y_i = +1:  w^T x_i + b ≥ +1
      For y_i = -1:  w^T x_i + b ≤ -1

[Figure: margin of width 2/‖w‖ between the hyperplanes through the support points x+ and x-, with normal direction n.]
Large Margin Linear Classifier

• Formulation:

    minimize   (1/2) ‖w‖²
    such that
      For y_i = +1:  w^T x_i + b ≥ +1
      For y_i = -1:  w^T x_i + b ≤ -1

[Figure: same margin picture as above.]
Large Margin Linear Classifier

• Formulation:

    minimize   (1/2) ‖w‖²
    such that
      y_i (w^T x_i + b) ≥ 1

[Figure: same margin picture as above.]
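A hedged sketch of training such a maximum-margin linear classifier in practice: scikit-learn's SVC with a linear kernel and a very large C approximates the hard-margin problem. The toy data and the C value are assumptions chosen only for illustration.

```python
# Approximating the hard-margin linear SVM with scikit-learn.
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable 2-D data (made up for illustration).
X = np.array([[1.0, 2.0], [2.0, 3.0], [2.0, 1.0],        # class +1
              [-1.0, -1.0], [-2.0, -2.0], [-1.5, 0.0]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)   # very large C ~ hard margin
clf.fit(X, y)

w = clf.coef_[0]        # learned weight vector w
b = clf.intercept_[0]   # learned bias b
print("w =", w, "b =", b)
print("margin width 2/||w|| =", 2 / np.linalg.norm(w))
print("y_i (w^T x_i + b):", y * (X @ w + b))   # all approximately >= 1
```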
Solving the Optimization Problem

• Quadratic programming with linear constraints:

    minimize   (1/2) ‖w‖²
    s.t.       y_i (w^T x_i + b) ≥ 1

• Lagrangian function:

    minimize   L_p(w, b, α_i) = (1/2) ‖w‖² − Σ_{i=1}^{n} α_i [ y_i (w^T x_i + b) − 1 ]
    s.t.       α_i ≥ 0
Solving the Optimization Problem

    minimize   L_p(w, b, α_i) = (1/2) ‖w‖² − Σ_{i=1}^{n} α_i [ y_i (w^T x_i + b) − 1 ]
    s.t.       α_i ≥ 0

• Setting the derivatives to zero:

    ∂L_p/∂w = 0   ⇒   w = Σ_{i=1}^{n} α_i y_i x_i
    ∂L_p/∂b = 0   ⇒   Σ_{i=1}^{n} α_i y_i = 0
Solving the Optimization Problem

• From the equations, we can derive the KKT conditions:

    α_i [ y_i (w^T x_i + b) − 1 ] = 0
    α_i ≥ 0

• Thus, only support vectors (points with y_i (w^T x_i + b) = 1) have non-zero α_i; for all other points α_i = 0.
• The solution has the form:

    w = Σ_{i=1}^{n} α_i y_i x_i = Σ_{i∈SV} α_i y_i x_i

  Get b from y_i (w^T x_i + b) − 1 = 0, where x_i is any support vector.

[Figure: the support vectors lie on the margin hyperplanes in the (x1, x2) plane.]
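As a hedged continuation of the sketch above (same assumed toy data; the attribute names are scikit-learn's, where dual_coef_ stores α_i y_i for the support vectors), one can check this solution form numerically:

```python
# Reconstructing w = sum_{i in SV} alpha_i y_i x_i and b from a fitted SVC.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 2.0], [2.0, 3.0], [2.0, 1.0],
              [-1.0, -1.0], [-2.0, -2.0], [-1.5, 0.0]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

alpha_y = clf.dual_coef_[0]       # alpha_i * y_i for each support vector
sv = clf.support_vectors_         # the support vectors x_i
w = alpha_y @ sv                  # w = sum_i (alpha_i y_i) x_i
print("w from dual coefficients:", w)
print("w reported by sklearn:   ", clf.coef_[0])    # should match

# b from any support vector: y_i (w^T x_i + b) = 1  =>  b = y_i - w^T x_i
i = clf.support_[0]               # index of one support vector
b = y[i] - w @ X[i]
print("b =", b, "(sklearn:", clf.intercept_[0], ")")
```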
Large Margin Linear Classifier

• What if the data is not linearly separable (noisy data, outliers, etc.)?
  – Slack variables ξ_i can be added to allow misclassification of difficult or noisy data points.

[Figure: two points violating the margin, with slack distances ξ_1 and ξ_2.]
Large Margin Linear Classifier

• Formulation:

    minimize   (1/2) ‖w‖² + C Σ_{i=1}^{n} ξ_i
    such that
      y_i (w^T x_i + b) ≥ 1 − ξ_i
      ξ_i ≥ 0

• Parameter C can be viewed as a way to control over-fitting.


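A hedged sketch of how C trades the margin width against the slack penalty (the noisy data and the two C values are assumptions, not from the slides): small C tolerates more slack and gives a wider margin, large C penalizes slack heavily.

```python
# Effect of C on the soft-margin linear SVM (illustrative data).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[2.0, 2.0], scale=1.0, size=(50, 2)),
               rng.normal(loc=[-2.0, -2.0], scale=1.0, size=(50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C:>6}: margin width 2/||w|| = {2 / np.linalg.norm(w):.3f}, "
          f"#support vectors = {len(clf.support_vectors_)}")
```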
Non-linear SVMs: Feature Space

• General idea: the original input space can be mapped to some higher-dimensional feature space where the training set is separable:

    Φ: x → φ(x)

This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt
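A minimal sketch of this idea (not from the slide or the tutorial credited above), assuming the simple map φ(x) = (x, x²): 1-D data that no single threshold can separate becomes linearly separable in the 2-D feature space.

```python
# Explicit feature map phi(x) = (x, x^2) making non-separable data separable.
import numpy as np
from sklearn.svm import SVC

# 1-D data: class +1 sits between two clusters of class -1,
# so no threshold on x alone can separate them.
x = np.array([-3.0, -2.5, -2.0, -1.0, 0.0, 1.0, 2.0, 2.5, 3.0])
y = np.array([-1, -1, -1, 1, 1, 1, -1, -1, -1])

phi = np.column_stack([x, x ** 2])          # map to 2-D feature space
clf = SVC(kernel="linear", C=1e6).fit(phi, y)
print("training accuracy in feature space:", clf.score(phi, y))  # 1.0
```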


Nonlinear SVMs: The Kernel Trick

• Examples of commonly used kernel functions:

  – Linear kernel:      K(x_i, x_j) = x_i^T x_j
  – Polynomial kernel:  K(x_i, x_j) = (1 + x_i^T x_j)^p
  – Gaussian (Radial Basis Function, RBF) kernel:  K(x_i, x_j) = exp( −‖x_i − x_j‖² / (2σ²) )
  – Sigmoid kernel:     K(x_i, x_j) = tanh( β₀ x_i^T x_j + β₁ )

• In general, functions that satisfy Mercer's condition can be kernel functions.
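A hedged sketch of these kernel functions written directly in NumPy (parameter names p, sigma, beta0, beta1 follow the formulas above; the default values and test points are assumptions for illustration):

```python
# The four kernels above, evaluated on a pair of vectors.
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj

def polynomial_kernel(xi, xj, p=3):
    return (1.0 + xi @ xj) ** p

def rbf_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(xi, xj, beta0=1.0, beta1=0.0):
    return np.tanh(beta0 * (xi @ xj) + beta1)

xi, xj = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(xi, xj), polynomial_kernel(xi, xj),
      rbf_kernel(xi, xj), sigmoid_kernel(xi, xj))
```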
