Topic 4: SUPPORT VECTOR MACHINES

STAT 37710/CAAM 37710/CMSC 35400 Machine Learning


Risi Kondor, The University of Chicago
Regularized Risk Minimization (RRM)
Find the hypothesis f̂ by solving a problem of the form

\[
\hat{f} \;=\; \underset{f \in \mathcal{F}}{\operatorname{arg\,min}} \;\; \underbrace{\frac{1}{m} \sum_{i=1}^{m} \ell(f(x_i), y_i)}_{\text{training error}} \;+\; \lambda \underbrace{\Omega[f]}_{\text{regularizer}}
\]

• F can be quite a rich hypothesis space.


• The purpose of the regularizer is to avoid overfitting.
• λ is a tunable parameter.
• ℓ(ŷ, y) : loss function
• ℓ might or might not be the same loss as in E_true.

[Tykhonov regularization] [Vapnik 1970’s–]
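As a concrete (and deliberately simple) illustration of the template above, the sketch below minimizes a regularized empirical risk for a linear model with squared loss and Ω[f] = ∥w∥², using plain gradient descent. The data, λ, step size, and iteration count are hypothetical choices, not part of the slides.

```python
import numpy as np

# Minimal RRM sketch: squared loss + L2 regularizer, minimized by gradient descent.
# Everything below (data, lambda, learning rate, iterations) is a hypothetical example.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

lam, lr, m = 0.1, 0.05, len(y)
w = np.zeros(3)
for _ in range(200):
    residual = X @ w - y                              # f(x_i) - y_i
    grad = (2 / m) * X.T @ residual + 2 * lam * w     # grad of (1/m) sum (f(x_i)-y_i)^2 + lam ||w||^2
    w -= lr * grad

print(w)   # close to the true coefficients when lambda is small
```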

Optimization: equality constraints

Problem:

\[
\min_{x \in \mathbb{R}^n} f(x) \quad \text{subject to} \quad g(x) = c.
\]

1. Form the Lagrangian L(x, λ) = f(x) − λ(g(x) − c).

2. The solution must be at a critical point of L. → Setting

\[
\frac{\partial L(x, \lambda)}{\partial x_i} = 0, \qquad i = 1, 2, \ldots, n,
\]

yields a curve of solutions x = γ(λ).

3. Reintroducing the constraint g(γ(λ)) = c gives λ, hence the optimal x.
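A quick worked example (illustrative, not from the slides): minimize f(x) = x₁² + x₂² subject to g(x) = x₁ + x₂ = 1. The Lagrangian is L(x, λ) = x₁² + x₂² − λ(x₁ + x₂ − 1). Setting ∂L/∂x₁ = ∂L/∂x₂ = 0 gives x₁ = x₂ = λ/2, the curve γ(λ). Reintroducing the constraint gives λ/2 + λ/2 = 1, so λ = 1 and the optimal point is x = (1/2, 1/2).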

Optimization: inequality constraints
Problem:

\[
\min_{x \in \mathbb{R}^n} f(x) \quad \text{subject to} \quad g(x) \geq c.
\]

1. Form the Lagrangian L(x, λ) = f(x) − λ(g(x) − c).

2. Introduce the dual function

\[
h(\lambda) = \inf_{x} L(x, \lambda).
\]

3. Solve the dual problem

\[
\lambda^{*} = \underset{\lambda}{\operatorname{arg\,max}} \; h(\lambda) \quad \text{subject to} \quad \lambda \geq 0.
\]

4. The optimal x is argmin_x L(x, λ*) (assuming strong duality).

When f is a convex function and g(x) ≥ c defines a convex region of space, this gives the global optimum.
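A quick illustrative example (not from the slides): minimize f(x) = x² subject to x ≥ 1. The Lagrangian is L(x, λ) = x² − λ(x − 1), the dual function is h(λ) = inf_x L(x, λ) = −λ²/4 + λ (attained at x = λ/2), and maximizing h over λ ≥ 0 gives λ* = 2. The corresponding primal point is x* = λ*/2 = 1, which is indeed the constrained minimizer, and h(λ*) = f(x*) = 1, so strong duality holds here as the problem is convex.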
Karush–Kuhn–Tucker conditions

At the optimal solution x* of

\[
\min_{x \in \mathbb{R}^n} f(x) \quad \text{subject to} \quad g(x) \geq c,
\]

either
1. we are at the boundary → g(x*) = c, or
2. we are at an interior point → λ* = 0.

→ Complementary slackness: λ*(g(x*) − c) = 0.
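In the toy example above (minimize x² subject to x ≥ 1, added here for illustration), the constraint is active: x* = 1 lies on the boundary, g(x*) = c, and λ* = 2 > 0, so λ*(g(x*) − c) = 2 · 0 = 0 as required. Had the constraint instead been x ≥ −1, the unconstrained minimizer x* = 0 would be an interior point and the KKT conditions would force λ* = 0.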

Support Vector Machines
Linear classifiers

To apply RRM, go back to binary classification in R^n with a linear (affine) hyperplane:

Input space: X = R^n
Output space: Y = {−1, +1}
Hypothesis:
    f(x) = w · x + b,    h(x) = sgn(f(x)).
(Note the sneaky difference between f and h.)

Question: Of all possible hyperplanes that separate the data which one do
we choose?
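A minimal Python sketch of the distinction between f and h noted above; w, b, and the query point are hypothetical values chosen purely for illustration.

```python
import numpy as np

# Hypothetical hyperplane parameters (illustration only).
w = np.array([1.0, -2.0])
b = 0.5

def f(x):
    """Real-valued score: signed (unnormalized) distance to the hyperplane."""
    return np.dot(w, x) + b

def h(x):
    """Classifier: the sign of the score, in {-1, +1}."""
    return 1 if f(x) >= 0 else -1

x = np.array([0.3, -0.7])
print(f(x), h(x))   # f gives a real number, h only its sign
```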

The margin
Recall, the margin of a point (x, y) to the hyperplane f(x) = w · x + b = 0 (with ∥w∥ = 1) is

    y (w · x + b).

The margin of a dataset S = {(x1, y1), . . . , (xm, ym)} to f is

    min_i  y_i (w · x_i + b).

In the case of the perceptron we saw that having a large margin is desirable.

IDEA: Choose w and b explicitly to maximize the margin! → Support Vector Machines (SVM)
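To make the definition concrete, here is a small numpy sketch that evaluates the dataset margin min_i y_i(w · x_i + b) for a hypothetical unit-norm hyperplane and toy data (all values invented for illustration).

```python
import numpy as np

# Toy data and a hypothetical hyperplane (illustration only).
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -2.0], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])

w = np.array([1.0, 1.0])
w = w / np.linalg.norm(w)        # enforce the ||w|| = 1 convention
b = 0.0

margins = y * (X @ w + b)        # per-point margins
print(margins.min())             # margin of the dataset to this hyperplane
```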

Maximizing the margin

Choose the hyperplane that has the largest margin!

Hard Margin Support Vector Machine

Given a dataset S = {(x1, y1), . . . , (xm, ym)},

\[
\max_{\|w\| = 1,\; b} \; \delta \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \geq \delta \;\; \forall i.
\]

Equivalent formulation: drop the ∥w∥ = 1 constraint and solve

\[
\min_{w,\, b} \; \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \geq 1 \;\; \forall i.
\]

The primal problem

The primal SVM optimization problem:

\[
\min_{w,\, b} \; \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \geq 1 \;\; \forall i
\]

This is a nice convex optimization problem (a QP) with a unique minimum.


→ Introduce a Lagrangian.
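Before passing to the Lagrangian, note that the primal can also be handed directly to a generic QP solver. A minimal sketch using cvxpy on hypothetical separable toy data (an illustration only, not part of the original slides):

```python
import numpy as np
import cvxpy as cp

# Hypothetical linearly separable toy data (for illustration only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(X.shape[1])
b = cp.Variable()

# minimize (1/2)||w||^2  s.t.  y_i (w·x_i + b) >= 1 for all i
constraints = [cp.multiply(y, X @ w + b) >= 1]
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
prob.solve()

print(w.value, b.value)   # maximum-margin hyperplane parameters
```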

From primal to dual
\[
\min_{w,\, b} \; \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \geq 1 \;\; \forall i
\]

Lagrangian:

\[
L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \bigl( y_i (w \cdot x_i + b) - 1 \bigr)
\]

\[
\partial_{w} L(w, b, \alpha) = 0 \;\Rightarrow\; w - \sum_i \alpha_i y_i x_i = 0
\]
\[
\partial_{b} L(w, b, \alpha) = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0
\]

Dual function:

\[
L(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j \, y_i y_j \, (x_i \cdot x_j)
\]

The dual problem
The dual SVM optimization problem:

\[
\max_{\alpha_1, \ldots, \alpha_m} \; L(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j \, y_i y_j \, (x_i \cdot x_j)
\]
\[
\text{subject to} \quad \sum_i y_i \alpha_i = 0 \quad \text{and} \quad \alpha_i \geq 0 \;\; \forall i
\]

Still a QP, but in fewer variables, so easier to solve. In particular,

\[
h(x) = \operatorname{sgn}\Bigl[ \sum_i \alpha_i y_i \, (x \cdot x_i) + b \Bigr] = \operatorname{sgn}\Bigl[ \sum_i \gamma_i \, (x \cdot x_i) + b \Bigr],
\]

where γ_i = y_i α_i . → The solution lies in the span of the data,

\[
w = \sum_i \gamma_i x_i .
\]
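For concreteness, the dual QP above can be solved numerically with a general-purpose constrained optimizer. The sketch below uses scipy.optimize.minimize on hypothetical toy data; recovering b from the support vectors via y_i(w · x_i + b) = 1 is a standard step shown here as an assumption, not something stated on the slides.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical separable toy data (illustration only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)

K = (X @ X.T) * np.outer(y, y)                   # Gram matrix weighted by the labels

def neg_dual(a):                                 # negate L(alpha): scipy minimizes
    return 0.5 * a @ K @ a - a.sum()

cons = {"type": "eq", "fun": lambda a: a @ y}    # sum_i y_i alpha_i = 0
bounds = [(0, None)] * m                         # alpha_i >= 0 (hard margin)
res = minimize(neg_dual, np.zeros(m), bounds=bounds, constraints=cons)

alpha = res.x
w = (alpha * y) @ X                              # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                                # support vectors
b = np.mean(y[sv] - X[sv] @ w)                   # from y_i (w·x_i + b) = 1 on the margin
print(alpha.round(3), w.round(3), round(b, 3))
```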

Support vector machine

Sparsity of support vectors

The KKT conditions prescribe that

\[
\alpha_i \bigl( y_i (x_i \cdot w + b) - 1 \bigr) = 0 \quad \forall i
\]

So α_i ≠ 0 only for those examples that lie exactly on the margin, and therefore only these “support vectors” influence the solution

\[
h(x) = \operatorname{sgn}\Bigl[ \sum_i \alpha_i y_i \, (x \cdot x_i) + b \Bigr].
\]

→ Sparsity is a precious thing.
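This sparsity is easy to observe empirically. The sketch below uses hypothetical data and a very large C to approximate the hard margin in scikit-learn's SVC, then prints how many training points end up as support vectors.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data (illustration only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1e6)   # very large C approximates the hard margin
clf.fit(X, y)

print("number of support vectors:", len(clf.support_))     # usually far fewer than 40
print("dual coefficients (y_i * alpha_i):", clf.dual_coef_)
```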

Question: But what about non-separable data? → Soft margin SVMs

The Soft Margin SVM

The primal SVM optimization problem:

\[
\min_{w,\, b,\, \xi_1, \ldots, \xi_m} \; \frac{1}{2}\|w\|^2 + \frac{C}{m} \sum_i \xi_i \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \geq 1 - \xi_i, \;\; \xi_i \geq 0 \;\; \forall i
\]

The ξ_i ’s are called slack variables and C is a “softness parameter”.

[Cortes & Vapnik, 1995]
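A minimal cvxpy sketch of the soft margin primal, with the slack variables made explicit; the toy data and the value of C are hypothetical choices for illustration.

```python
import numpy as np
import cvxpy as cp

# Hypothetical, slightly overlapping toy data (illustration only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (20, 2)), rng.normal(1, 1, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])
m = len(y)
C = 10.0                            # softness parameter (arbitrary choice)

w = cp.Variable(2)
b = cp.Variable()
xi = cp.Variable(m, nonneg=True)    # slack variables

constraints = [cp.multiply(y, X @ w + b) >= 1 - xi]
objective = cp.Minimize(0.5 * cp.sum_squares(w) + (C / m) * cp.sum(xi))
cp.Problem(objective, constraints).solve()

print(w.value, b.value, xi.value.max())   # largest slack = how far inside the worst point sits
```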

From primal to dual

\[
\min_{w,\, b,\, \xi_1, \ldots, \xi_m} \; \frac{1}{2}\|w\|^2 + \frac{C}{m} \sum_i \xi_i \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \geq 1 - \xi_i, \;\; \xi_i \geq 0 \;\; \forall i
\]

Lagrangian:

\[
L(w, b, \alpha, \beta) = \frac{1}{2}\|w\|^2 + \frac{C}{m} \sum_i \xi_i - \sum_i \alpha_i \bigl( y_i (w \cdot x_i + b) - 1 + \xi_i \bigr) - \sum_i \beta_i \xi_i
\]

\[
\partial_{w} L(w, b, \alpha, \beta) = 0 \;\Rightarrow\; w - \sum_i \alpha_i y_i x_i = 0
\]
\[
\partial_{b} L(w, b, \alpha, \beta) = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0
\]
\[
\partial_{\xi_i} L(w, b, \alpha, \beta) = 0 \;\Rightarrow\; \alpha_i + \beta_i = \frac{C}{m}
\]
Soft margin SVM dual

The dual SVM optimization problem:

\[
\max_{\alpha_1, \ldots, \alpha_m} \; L(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j \, y_i y_j \, (x_i \cdot x_j)
\]
\[
\text{subject to} \quad \sum_i y_i \alpha_i = 0 \quad \text{and} \quad 0 \leq \alpha_i \leq \frac{C}{m} \;\; \forall i
\]
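In the hard margin dual sketch shown earlier (itself a hypothetical illustration), the only change needed for the soft margin case is the box constraint on α:

```python
bounds = [(0, C / m)] * m   # 0 <= alpha_i <= C/m replaces alpha_i >= 0
```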

SVM is just a form of RRM
At the optimum of the primal problem the slacks are as small as possible:

\[
\xi_i = \max\{0,\; 1 - y_i (w \cdot x_i + b)\} = \underbrace{\bigl(1 - y_i (w \cdot x_i + b)\bigr)_{\geq 0}}_{\ell_{\text{hinge}}(w \cdot x_i,\, y_i)},
\]

where (z)_{≥ 0} = max(0, z).

The soft-margin SVM finds

\[
\hat{f} = \underset{f \in \mathcal{F}}{\operatorname{arg\,min}} \;\; \underbrace{\frac{1}{m} \sum_{i=1}^{m} \ell_{\text{hinge}}(f(x_i), y_i)}_{\text{empirical loss}} + \underbrace{\frac{1}{2C} \|w\|^2}_{\text{regularizer}},
\]

where F is the hypothesis space of linear functions f(x) = w · x + b.
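Viewed this way, the soft margin SVM can be trained by directly minimizing the regularized hinge loss. The sketch below does this with plain subgradient descent on hypothetical data; the learning rate, C, and iteration count are arbitrary illustrative choices, and a production solver would use the QP formulation instead.

```python
import numpy as np

# Rough subgradient-descent sketch of the RRM objective above:
#   (1/m) sum_i max(0, 1 - y_i (w·x_i + b)) + (1/(2C)) ||w||^2
# Data, C, learning rate and iteration count are hypothetical choices.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])
m, C, lr = len(y), 10.0, 0.1

w, b = np.zeros(2), 0.0
for _ in range(500):
    margins = y * (X @ w + b)
    active = margins < 1                          # points with nonzero hinge loss
    grad_w = -(y[active] @ X[active]) / m + w / C # subgradient w.r.t. w
    grad_b = -y[active].sum() / m                 # subgradient w.r.t. b
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)   # approximate soft-margin solution
```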

Loss functions for classification

Loss functions for regression
