Introduction to Support Vector Machines
Hsuan-Tien Lin
Learning Systems Group, California Institute of Technology
Talk in NTU EE/CS Speech Lab, November 16, 2005
Learning Problem
Setup
fixed-length example: a D-dimensional vector x, where each component is a feature, e.g.,
raw digital sampling of a 0.5 sec. wave file
DFT of the raw sampling
a fixed-length feature vector extracted from the wave file
label: a number y ∈ Y
binary classification: is there a man speaking in the wave file?
(y = +1 if man, y = −1 if not)
multi-class classification: which speaker is speaking?
(y ∈ {1, 2, · · · , K })
regression: how excited is the speaker? (y ∈ R)
Binary Classification Problem
learning problem: given training examples and labels {(x_i, y_i)}_{i=1}^N, find a function g: X → Y that predicts the label of unseen x well
vowel identification: given training wave files and their vowel labels, find a function g(x) that maps wave files to vowel labels well
we will focus on the binary classification problem: Y = {+1, −1}
the most basic learning problem, but very useful and extensible to other problems
illustrative demo: for examples with two different colors in a D-dimensional space, how can we “separate” the examples?
Support Vector Machine
Hyperplane Classifier
use a hyperplane to separate the two colors:
g(x) = sign(w^T x + b)
if w^T x + b ≥ 0, the classifier returns +1; otherwise it returns −1
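As an added illustration (not part of the original slides), a minimal NumPy sketch of such a hyperplane classifier; the weight vector w and bias b here are hand-picked toy values rather than learned ones.

import numpy as np

def hyperplane_classify(X, w, b):
    """Predict g(x) = sign(w^T x + b) for each row of X, returning +1/-1."""
    scores = X @ w + b
    return np.where(scores >= 0, +1, -1)

# toy usage with assumed values of (w, b)
X = np.array([[2.0, 1.0], [-1.0, -3.0]])
w, b = np.array([1.0, 1.0]), -0.5
print(hyperplane_classify(X, w, b))   # [ 1 -1]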
there are possibly many hyperplanes that satisfy our needs; which one should we choose?
SVM: Large-Margin Hyperplane Classifier
margin ρ_i = y_i (w^T x_i + b)/‖w‖_2:
does y_i agree with w^T x_i + b in sign?
how large is the distance between the example and the separating hyperplane?
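To make the margin concrete, here is an added sketch (with toy data and a hand-picked candidate (w, b) as assumptions) that computes ρ_i for every example; the smallest value is what the SVM will try to maximize.

import numpy as np

def margins(X, y, w, b):
    """Compute rho_i = y_i (w^T x_i + b) / ||w||_2 for every example."""
    return y * (X @ w + b) / np.linalg.norm(w)

# toy usage: positive margins mean every example is on its correct side
X = np.array([[2.0, 1.0], [-1.0, -3.0]])
y = np.array([+1, -1])
w, b = np.array([1.0, 1.0]), -0.5
print(margins(X, y, w, b), margins(X, y, w, b).min())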
large positive margin → clear separation → low risk classification
idea of SVM: maximize the minimum margin
max_{w,b} min_i ρ_i
s.t. ρ_i = y_i (w^T x_i + b)/‖w‖_2 ≥ 0
Hard-Margin Linear SVM
maximize the minimum margin
max_{w,b} min_i ρ_i
s.t. ρ_i = y_i (w^T x_i + b)/‖w‖_2 ≥ 0, i = 1, . . . , N.
equivalent (rescale (w, b) so that min_i y_i (w^T x_i + b) = 1; the minimum margin is then 1/‖w‖_2, so maximizing it is the same as minimizing (1/2) w^T w) to
min_{w,b} (1/2) w^T w
s.t. y_i (w^T x_i + b) ≥ 1, i = 1, . . . , N.
– hard-margin linear SVM
quadratic programming with D + 1 variables: well-studied in
optimization
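For illustration only, here is a hedged sketch of that quadratic program using the cvxpy modeling library (a tool chosen for this write-up, not one mentioned in the talk); it assumes the data is linearly separable, since otherwise the hard-margin QP has no feasible solution.

import cvxpy as cp
import numpy as np

def hard_margin_svm(X, y):
    """Solve min (1/2) w^T w  s.t.  y_i (w^T x_i + b) >= 1."""
    N, D = X.shape
    w, b = cp.Variable(D), cp.Variable()
    objective = cp.Minimize(0.5 * cp.sum_squares(w))
    constraints = [cp.multiply(y, X @ w + b) >= 1]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value

# toy usage on separable data
X = np.array([[2.0, 2.0], [1.5, 1.0], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([+1, +1, -1, -1])
print(hard_margin_svm(X, y))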
is the hard-margin linear SVM good enough?
Soft-Margin Linear SVM
hard-margin – hard constraints on separation:
min_{w,b} (1/2) w^T w
s.t. y_i (w^T x_i + b) ≥ 1, i = 1, . . . , N.
no feasible solution if some noisy outliers exist
soft-margin – soft constraints as cost:
min_{w,b,ξ} (1/2) w^T w + C Σ_i ξ_i
s.t. y_i (w^T x_i + b) ≥ 1 − ξ_i,
ξ_i ≥ 0, i = 1, . . . , N.
allow the noisy examples to have ξ_i > 0, at a cost
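An added sketch (not from the slides) of the soft-margin linear SVM via scikit-learn; the toy data and the choice C = 1.0 are assumptions for illustration, and C plays exactly the role of the violation cost above.

import numpy as np
from sklearn.svm import SVC

# toy 2-D data with a slightly noisy label
X = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0],
              [0.9, 0.9], [0.1, 0.2]])
y = np.array([-1, +1, -1, +1, +1, -1])

clf = SVC(kernel="linear", C=1.0)   # C: cost per unit of violation xi_i
clf.fit(X, y)
print(clf.coef_, clf.intercept_)    # the learned (w, b)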
is linear SVM good enough?
Soft-Margin Nonlinear SVM
what if we want a boundary g(x) = sign(x^T x − 1)?
it can never be constructed with a hyperplane classifier sign(w^T x + b)
however, we can have more complex feature transforms:
φ(x) = [(x)_1, (x)_2, · · · , (x)_D, (x)_1(x)_1, (x)_1(x)_2, · · · , (x)_D(x)_D]
there is a classifier sign(w^T φ(x) + b) that describes the boundary
soft-margin nonlinear SVM:
min_{w,b,ξ} (1/2) w^T w + C Σ_i ξ_i
s.t. y_i (w^T φ(x_i) + b) ≥ 1 − ξ_i,
ξ_i ≥ 0, i = 1, . . . , N.
– with nonlinear φ(·)
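As an added sketch (an assumption of this write-up, not a recipe from the talk), the degree-2 transform φ can be built explicitly with scikit-learn's PolynomialFeatures and fed to a linear SVM, which then recovers a circular boundary like x^T x = 1.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVC

# toy data labeled by the circular boundary x^T x = 1
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where((X ** 2).sum(axis=1) > 1, +1, -1)

phi = PolynomialFeatures(degree=2, include_bias=False)   # explicit phi(x)
clf = SVC(kernel="linear", C=1.0).fit(phi.fit_transform(X), y)
print(clf.score(phi.transform(X), y))   # should be close to 1.0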
Feature Transformation
what feature transforms φ(·) should we use?
we can only extract a small, finite number of features, but we can use an unlimited number of feature transforms
traditionally:
use domain knowledge to do feature transformation
use only “useful” feature transformations
use a small number of feature transformations
control the goodness of fitting by a suitable choice of feature transformations
what if we use an “infinite number” of feature transformations, and let the algorithm decide a good w automatically?
would an infinite number of transformations introduce overfitting?
are we able to solve the optimization problem?
Dual Problem
an infinite-dimensional quadratic program if φ(·) is infinite-dimensional:
min_{w,b,ξ} (1/2) w^T w + C Σ_i ξ_i
s.t. y_i (w^T φ(x_i) + b) ≥ 1 − ξ_i,
ξ_i ≥ 0, i = 1, . . . , N.
luckily, we can solve its associated dual problem:
min_α (1/2) α^T Q α − e^T α
s.t. y^T α = 0,
0 ≤ α_i ≤ C,
Q_ij ≡ y_i y_j φ^T(x_i) φ(x_j)
α: N-dimensional vector
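An added sketch of how Q is assembled: it only needs inner products of transformed examples, so any function k with k(x, x′) = φ^T(x) φ(x′) suffices. The helper name build_Q and the toy data are illustrative.

import numpy as np

def build_Q(X, y, k):
    """Q_ij = y_i y_j k(x_i, x_j), where k(x, x') = phi(x)^T phi(x')."""
    N = len(y)
    K = np.array([[k(X[i], X[j]) for j in range(N)] for i in range(N)])
    return np.outer(y, y) * K

# toy usage with the linear kernel k(x, x') = x^T x'
Q = build_Q(np.array([[1.0, 0.0], [0.0, 2.0]]),
            np.array([+1, -1]),
            lambda a, b: a @ b)
print(Q)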
Solution of the Dual Problem
associated dual problem:
min_α (1/2) α^T Q α − e^T α
s.t. y^T α = 0,
0 ≤ α_i ≤ C,
Q_ij ≡ y_i y_j φ^T(x_i) φ(x_j)
equivalent solution:
g(x) = sign(w^T φ(x) + b) = sign(Σ_i y_i α_i φ^T(x_i) φ(x) + b)
no need for w and φ(x) explicitly if we can compute K(x, x′) = φ^T(x) φ(x′) efficiently
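An added sketch of this kernel expansion using a fitted scikit-learn SVC: dual_coef_ stores y_i α_i for the support vectors, so g(x) can be rebuilt by hand and compared against predict. The toy data and gamma = 1.0 are assumptions.

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

X = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([-1, +1, -1, +1])
clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)

def g(x):
    # sum_i y_i alpha_i K(x_i, x) + b, summed over support vectors only
    K = rbf_kernel(clf.support_vectors_, x.reshape(1, -1), gamma=1.0)
    return np.sign(clf.dual_coef_ @ K + clf.intercept_)

x_new = np.array([0.2, 0.9])
print(g(x_new), clf.predict([x_new]))   # the two should agree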
Kernel Trick
let the kernel be K(x, x′) = φ^T(x) φ(x′)
revisit: can we compute the kernel of
φ(x) = [(x)_1, (x)_2, · · · , (x)_D, (x)_1(x)_1, (x)_1(x)_2, · · · , (x)_D(x)_D]
efficiently?
well, not really
how about this?
φ(x) = [√2(x)_1, √2(x)_2, · · · , √2(x)_D, (x)_1(x)_1, · · · , (x)_D(x)_D]
K(x, x′) = (1 + x^T x′)^2 − 1
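An added numerical sanity check (not in the slides) that this φ really yields the claimed closed-form kernel:

import numpy as np

def phi(x):
    # [sqrt(2) x_1, ..., sqrt(2) x_D, x_1 x_1, x_1 x_2, ..., x_D x_D]
    return np.concatenate([np.sqrt(2) * x, np.outer(x, x).ravel()])

rng = np.random.default_rng(1)
x, xp = rng.standard_normal(5), rng.standard_normal(5)
lhs = phi(x) @ phi(xp)            # explicit inner product over D + D^2 features
rhs = (1 + x @ xp) ** 2 - 1       # kernel trick: O(D) work
print(np.isclose(lhs, rhs))       # True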
Different Kernels
types of kernels
linear: K(x, x′) = x^T x′
polynomial: K(x, x′) = (a x^T x′ + r)^d
Gaussian RBF: K(x, x′) = exp(−γ‖x − x′‖_2^2)
Laplacian RBF: K(x, x′) = exp(−γ‖x − x′‖_1)
the last two equivalently correspond to feature transformations in an infinite-dimensional space!
new paradigm for machine learning: use many, many feature transformations, and control the goodness of fit through the large margin (clear separation) and the violation cost (amount of violation allowed)
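These kernels are available off the shelf; an added sketch using scikit-learn's pairwise kernel functions, where the parameters gamma, degree, and coef0 correspond to γ, d, and r above (the toy data is random).

import numpy as np
from sklearn.metrics.pairwise import (linear_kernel, polynomial_kernel,
                                      rbf_kernel, laplacian_kernel)

X = np.random.default_rng(2).standard_normal((4, 3))

print(linear_kernel(X))                                    # x^T x'
print(polynomial_kernel(X, degree=2, gamma=1.0, coef0=1))  # (x^T x' + 1)^2
print(rbf_kernel(X, gamma=0.5))                            # exp(-0.5 ||x - x'||_2^2)
print(laplacian_kernel(X, gamma=0.5))                      # exp(-0.5 ||x - x'||_1)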
Properties of SVM
Support Vectors: Meaningful Representation
min_α (1/2) α^T Q α − e^T α
s.t. y^T α = 0,
0 ≤ α_i ≤ C,
equivalent solution:
g(x) = sign(Σ_i y_i α_i K(x_i, x) + b)
only those x_i with α_i > 0 are needed for classification – support vectors
from the optimality conditions, α_i:
“= 0”: not needed in constructing the decision function,
away from the boundary or on the boundary
“> 0 and < C”: free support vector, on the boundary
“= C”: bounded support vector,
violates the boundary (ξ_i > 0) or lies on the boundary
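An added sketch of how these categories can be read off a fitted scikit-learn SVC: dual_coef_ stores y_i α_i for the support vectors, so |dual_coef_| recovers α_i and can be compared against C. The noisy toy data is an assumption.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 2))
y = np.where(X[:, 0] + X[:, 1] + 0.3 * rng.standard_normal(100) > 0, +1, -1)

C = 1.0
clf = SVC(kernel="rbf", gamma=1.0, C=C).fit(X, y)

alpha = np.abs(clf.dual_coef_).ravel()        # alpha_i of the support vectors
n_non_sv = len(X) - len(alpha)                # alpha_i = 0
n_free = int(np.sum(alpha < C - 1e-8))        # 0 < alpha_i < C
n_bounded = int(np.sum(alpha >= C - 1e-8))    # alpha_i = C
print(n_non_sv, n_free, n_bounded)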
Why is SVM Successful?
an infinite number of feature transformations: suitable for conquering nonlinear classification tasks
large-margin concept: theoretically promising
soft-margin trade-off: controls regularization well
convex optimization problems: amenable to good optimization algorithms (compared to neural networks and some other learning algorithms)
support vectors: useful in data analysis and interpretation
Properties of SVM
Why is SVM Not Successful?
SVM can be sensitive to scaling and parameters
standard SVM is only a “discriminative” classification algorithm
SVM training can be time-consuming when N is large and the
solver is not carefully implemented
an infinite number of feature transformations ⇔ a mysterious classifier
Using SVM
Useful Extensions of SVM
multiclass SVM: use the 1-vs-1 approach to combine binary SVMs into a multiclass classifier
– the label that gets the most votes from the classifiers is the prediction
probability output: transform the raw output w^T φ(x) + b to a value in [0, 1] that represents P(+1|x)
– use a sigmoid function to transform from R to [0, 1]
infinite ensemble learning (Lin and Li 2005):
if the kernel K(x, x′) = −‖x − x′‖_1 is used for standard SVM, the classifier is equivalently
g(x) = sign(∫ w_θ s_θ(x) dθ + b)
where s_θ(x) is a thresholding rule on one feature of x.
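An added sketch of the first two extensions as exposed by scikit-learn: SVC combines binary SVMs with one-vs-one voting for multiclass problems, and probability=True fits a sigmoid (Platt scaling) on the raw outputs. The iris data and parameter values are illustrative.

from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)            # a 3-class toy problem

clf = SVC(kernel="rbf", gamma=0.5, C=1.0,
          decision_function_shape="ovo",     # 1-vs-1 decision values
          probability=True)                  # sigmoid fit for P(y | x)
clf.fit(X, y)

print(clf.predict(X[:2]))                    # voted labels
print(clf.predict_proba(X[:2]))              # calibrated probabilities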
Basic Use of SVM
scale each feature of your data to a suitable range (say, [−1, 1])
use the Gaussian RBF kernel K(x, x′) = exp(−γ‖x − x′‖_2^2)
use cross validation and grid search to determine a good (γ, C) pair
train the final SVM with the best (γ, C) on your training set
do testing with the SVM classifier
all of the above is included in LIBSVM (from the lab of Prof. Chih-Jen Lin)
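An added end-to-end sketch of this recipe with scikit-learn (LIBSVM itself ships analogous tools, svm-scale and grid.py); the data set, the grid values, and the [-1, 1] scaling choice are illustrative assumptions.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# scale each feature to [-1, 1], then an RBF-kernel SVM
pipe = make_pipeline(MinMaxScaler(feature_range=(-1, 1)), SVC(kernel="rbf"))

# cross-validated grid search over (gamma, C)
grid = GridSearchCV(pipe, {"svc__gamma": np.logspace(-4, 1, 6),
                           "svc__C": np.logspace(-1, 3, 5)}, cv=5)
grid.fit(X_tr, y_tr)            # refits the best (gamma, C) on the training set
print(grid.best_params_, grid.score(X_te, y_te))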
Advanced Use of SVM
include domain knowledge by specific kernel design (e.g. train a
generative model for feature extraction, and use the extracted
feature in SVM to get discriminative power)
combine SVM with your favorite tools (e.g. HMM + SVM for speech recognition)
fine-tune SVM parameters with specific knowledge of your
problem (e.g. different costs for different examples?)
interpret the SVM results you get (e.g. are the SVs meaningful?)
Resources
LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm
LIBSVM Tools:
http://www.csie.ntu.edu.tw/~cjlin/libsvmtools
Kernel Machines Forum: http://www.kernel-machines.org
Hsu, Chang, and Lin: A Practical Guide to Support Vector
Classification
my email: [email protected]
acknowledgment: some figures obtained from Prof. Chih-Jen Lin