Support Vector Machines

Margins: Intuition
Linear Separators
• Binary classification can be viewed as the task of
separating classes in feature space:

wTx + b = 0   (the separating hyperplane / decision boundary)
wTx + b > 0   (predict class +1)
wTx + b < 0   (predict class −1)

f(x) = sign(wTx + b)
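A minimal sketch of this decision rule in Python (NumPy and the particular w, b values are illustrative assumptions, not part of the slides):

import numpy as np

# Linear decision rule f(x) = sign(w^T x + b); w and b are placeholder values.
w = np.array([2.0, -1.0])
b = 0.5

def predict(x):
    """Return +1 if x lies on the positive side of the hyperplane, else -1."""
    return 1 if np.dot(w, x) + b > 0 else -1

print(predict(np.array([1.0, 0.0])))   # +1, since w.x + b = 2.5 > 0
print(predict(np.array([-1.0, 1.0])))  # -1, since w.x + b = -2.5 < 0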
Linear Separators
• Which of the linear separators is optimal?
Functional Margin
• Given a training example (x(i), y(i)), the functional margin is
M(i) = y(i)(wTx(i) + b)

• For a training set S = {(x(i), y(i)); i = 1, . . . , m},

Functional margin of S = min(M(i)) for i = 1, . . . , m

• The functional margin can be made arbitrarily large simply by rescaling the w and b parameters (which does not change the decision boundary), so on its own it is not a meaningful measure of confidence.


Geometric Margin
• Point A represents the input x(i) of some training example with label y(i) = 1.

• The distance of point A, i.e. x(i), to the decision boundary is given by the line segment AB: AB = γ(i).

• w/||w|| is a unit-length vector pointing in the same direction as w.
Geometric Margin
• Point B is given by x(i) − γ(i)·(w/||w||).

• Point B lies on the decision boundary, and all points x on the decision boundary satisfy the equation wTx + b = 0, so:

wT( x(i) − γ(i)·(w/||w||) ) + b = 0
Geometric Margin
• Solving for γ(i):

γ(i) = (wTx(i) + b) / ||w||

• Geometric margin of (w, b) with respect to a training example (x(i), y(i)):

γ(i) = y(i)·( (w/||w||)T x(i) + b/||w|| )
• If ||w|| = 1, functional margin = geometric margin

• Given a training set S = {(x(i), y(i)); i = 1, . . . , m}, the geometric margin of (w, b) with respect to S is

γ = min(γ(i)) for i = 1, . . . , m
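A small NumPy sketch (the toy data is made up for illustration) of the functional and geometric margins of a fixed (w, b), showing that rescaling w and b changes the former but not the latter:

import numpy as np

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = np.array([1.0, 1.0])
b = 0.0

functional = y * (X @ w + b)                 # M(i) = y(i)(w^T x(i) + b)
geometric = functional / np.linalg.norm(w)   # gamma(i) = M(i) / ||w||
print("functional margin of S:", functional.min())
print("geometric margin of S:", geometric.min())

# Rescaling (w, b) -> (10w, 10b) multiplies the functional margin by 10
# but leaves the geometric margin unchanged.
scaled = y * (X @ (10 * w) + 10 * b)
print("scaled functional:", scaled.min(),
      "scaled geometric:", scaled.min() / np.linalg.norm(10 * w))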
How will we find the separator that achieves the maximum geometric margin?

max_{γ,w,b} γ
s.t. y(i)(wTx(i) + b) ≥ γ for all i, and ||w|| = 1,
where γ is the distance of x(i) from the hyperplane.
• By transforming the problem:
max_{M,w,b} M/||w||   s.t.  y(i)(wTx(i) + b) ≥ M for i = 1, . . . , m

• Maximize M/||w||, subject to the functional margins all being at least M.

• Geometric margin, γ= M/||w||


• Introducing the scaling constraint that the functional margin of (w, b)
with respect to the training set must be 1, i.e. M = 1, we get:
wTx + b ≥ 1 if x Є C1
wTx + b ≤ −1 if x Є C2

• So, y(wTx + b) ≥ 1.
For support vectors, the inequality becomes an equality.

Linear SVMs Mathematically


• Then we can formulate the quadratic optimization problem:

Find w and b such that

1/||w|| is maximized; and for all {(x(i), y(i))}:
wTx(i) + b ≥ 1 if y(i) = 1;  wTx(i) + b ≤ −1 if y(i) = −1

• A better formulation (maximizing 1/||w|| is equivalent to minimizing ||w||):

Find w and b such that

½·||w||² is minimized; and for all {(x(i), y(i))}: y(i)(wTx(i) + b) ≥ 1

Its solution gives us the optimal margin classifier.
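As a rough illustration (not part of the slides), this primal problem can be handed to a generic convex solver; the sketch below assumes the cvxpy package and made-up, linearly separable toy data:

import cvxpy as cp
import numpy as np

# Hard-margin primal QP: minimize 1/2 ||w||^2  s.t.  y_i (w^T x_i + b) >= 1
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
# All margins should be >= 1, and tight (= 1) for the support vectors.
print("margins:", y * (X @ w.value + b.value))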



Support Vector Machine (SVM)

• The support vectors are the training points closest to, and equally distant from, the classifier.

• SVMs maximize the margin around the separating hyperplane.
• A.k.a. Large Margin Classifiers.
• The decision function is fully specified by a subset of training samples, the support vectors.
• Classification is safer when the distance from the boundary is large; a narrower margin is riskier.

(Figure: support vectors sitting on the margin; the maximum-margin separator vs. a narrower-margin alternative.)
Maximum Margin: Formalization
• ρ = margin width

• w: decision hyperplane normal vector
• xi: data point i
• yi: class label of xi, yi ∈ {+1, −1}
• For all points: yi(wTxi + b) > 0
SVM - An Optimization Problem
 The hyperplane depends on the position of the nearest feature vectors (the support vectors).

 Even if we remove any other feature vector from C1 and C2 except the support vectors, the position of the hyperplane remains the same.

 We want to maximize γ = M/||w||: maximize the functional margin M in the numerator while minimizing ||w|| in the denominator.
Solving the Optimization Problem

Any optimization problem can be formulated in two ways: as a primal problem or as a dual problem. We first write the primal formulation; if it is not convenient to solve directly, we move to the dual formulation, which for the SVM yields the same optimal solution (strong duality holds for this convex problem).
Primal Optimization Problem
• Find w and b such that:

Φ(w) =½ wT w = ½ w .w is minimized;
such that for all {(xi , yi)}: yi (wT xi + b) ≥ 1

• This is now optimizing a quadratic function subject to linear constraints.

• This can be transformed into an unconstrained problem by applying Lagrange multipliers (αi).
Primal & Dual Concept

https://www.youtube.com/watch?v=YOsrYl1JRrc&t=362s
Primal & Dual Concept(cont)

https://www.youtube.com/watch?v=YOsrYl1JRrc&t=362s
KKT Condition (Karush-Kuhn-Tucker Condition)

https://www.youtube.com/watch?v=YOsrYl1JRrc&t=362s
Primal Optimization Problem (cont)
Lp = ½(w·w) − Σi αi[ yi(w·xi + b) − 1 ]

Lp = ½(w·w) − Σi αi yi (w·xi) − Σi αi yi b + Σi αi

 KKT Conditions (Karush-Kuhn-Tucker):

∂Lp/∂b = 0  ⇒  Σi αi yi = 0, with i = 1, …, m, where m is the number of feature vectors

∂Lp/∂w = 0  ⇒  w = Σi αi yi xi
Primal to Dual Optimization Problem (cont.)
By substituting w = Σi αi yi xi and Σi αi yi = 0, we get the dual objective:

LD = ½(w·w) − Σi αi yi (w·xi) + Σi αi

   = ½ ΣiΣj αi αj yi yj (xi·xj) − ΣiΣj αi αj yi yj (xi·xj) + Σi αi

   = Σi αi − ½ ΣiΣj αi αj yi yj (xi·xj)

Lagrange multipliers are always non-negative, so we have one constraint: αi ≥ 0, and another constraint: Σi αi yi = 0. LD is maximized over the αi subject to these constraints.
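A sketch of solving this dual QP directly (cvxpy and the toy data are assumptions made for illustration):

import cvxpy as cp
import numpy as np

# Dual: maximize sum(alpha) - 1/2 * sum_ij alpha_i alpha_j y_i y_j (x_i . x_j)
#       s.t. alpha_i >= 0 and sum_i alpha_i y_i = 0
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

Q = (y[:, None] * X) @ (y[:, None] * X).T   # Q_ij = y_i y_j (x_i . x_j)
Q += 1e-8 * np.eye(len(y))                  # tiny ridge so the solver treats Q as PSD

alpha = cp.Variable(len(y))
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.quad_form(alpha, Q))
cp.Problem(objective, [alpha >= 0, y @ alpha == 0]).solve()

# Recover w from the stationarity condition w = sum_i alpha_i y_i x_i.
w = ((alpha.value * y)[:, None] * X).sum(axis=0)
print("alpha =", np.round(alpha.value, 4))  # nonzero entries mark the support vectors
print("w =", w)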
Lagrangian Multiplier

 If αi = 0, then the corresponding training feature vector xi is not a support vector.

 If αi is large, then the corresponding training feature vector xi has a strong influence on the position of the decision surface.
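A quick way to see this in practice (scikit-learn and the toy data are assumptions, not something the slides prescribe): a fitted linear SVC reports which training points received nonzero αi.

import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 1.0], [0.0, 3.0],
              [-1.0, -2.0], [-2.0, -1.0], [0.0, -3.0]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)   # very large C approximates a hard margin
clf.fit(X, y)

print("indices of support vectors:", clf.support_)
print("support vectors:\n", clf.support_vectors_)
print("alpha_i * y_i for support vectors:", clf.dual_coef_)  # signed dual coefficients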
Calculation of b value
• For any support vector xs (αs > 0) the margin constraint is tight, ys(w·xs + b) = 1, so b = ys − w·xs.
For an unknown feature vector
• Now for an unknown feature vector z:

D(z) = sgn(w·z + b)

If sgn is +ve, z ∈ C1
If sgn is −ve, z ∈ C2
Types of SVM
1. Linear SVM

   i) Hard Margin SVM – for linearly separable data

   ii) Soft Margin SVM – for noisy training data

2. Non-linear SVM (kernel function)

 Hard Margin: so far we require that all data points be classified correctly
   - no training error

Soft Margin Classification


• If the training data is not linearly separable, slack variables ξi can be added to allow misclassification of difficult or noisy examples.
• Allow some errors.
• Let some points be moved to where they belong, at a cost ξi.

What should our quadratic optimization criterion be?

Minimize  ½ w·w + C Σ(k=1..R) ξk
Hard Margin vs. Soft Margin
 The old formulation:
Find w and b such that
Φ(w) =½ wTw is minimized and for all {(xi ,yi)}
yi (wTxi + b) ≥ 1

 The new formulation incorporating slack variables:

Find w and b such that


Φ(w) =½ wTw + CΣξi is minimized and for all {(xi ,yi)}
yi (wTxi + b) ≥ 1- ξi and ξi ≥ 0 for all i

 Parameter C can be viewed as how much penalty to give to the misclassified points (see the sketch below).
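A minimal sketch of this tradeoff (scikit-learn and the synthetic noisy data are assumptions made for illustration):

import numpy as np
from sklearn.svm import SVC

# Small C tolerates slack (wider margin, some violations); large C penalizes
# violations heavily (narrower margin, closer to hard-margin behaviour).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[2, 2], scale=1.0, size=(50, 2)),
               rng.normal(loc=[-2, -2], scale=1.0, size=(50, 2))])
y = np.array([1] * 50 + [-1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin_width = 2.0 / np.linalg.norm(clf.coef_)   # rho = 2 / ||w||
    print(f"C={C:>6}: margin width = {margin_width:.2f}, "
          f"#support vectors = {len(clf.support_)}")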
Soft Margin Hyperplane
• The new conditions become: yi(wTxi + b) ≥ 1 − ξi and ξi ≥ 0 for all i.

• The ξi are "slack variables" in the optimization.
• Note that ξi = 0 if there is no error for xi.
• Σi ξi is an upper bound on the number of training errors.
• We want to minimize ½||w||² + C Σ(i=1..n) ξi.

• C: tradeoff parameter between error and margin.

Soft Margin Classification – Solution


• The dual problem for soft margin classification:

Find α1…αN such that


Q(α) =Σαi - ½ΣΣαiαjyiyjxiTxj is maximized and
(1) Σαiyi = 0
(2) 0 ≤ αi ≤ C for all αi

• Neither slack variables ξi nor their Lagrange multipliers appear in the dual problem!
• Again, xi with non-zero αi will be support vectors.
• Solution to the dual problem is:
w = Σ αi yi xi
b = yk(1 − ξk) − wTxk, where k = argmax_k' αk'
f(x) = Σ αi yi xiTx + b
(w is not needed explicitly for classification!)
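A sketch of these recovery formulas using a library solver for the dual (scikit-learn and the toy data are assumptions; sklearn's dual_coef_ stores αi·yi for the support vectors):

import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 1.0], [1.5, 2.0], [3.0, 3.0],
              [-1.0, -1.5], [-2.0, -1.0], [-3.0, -2.5]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

w = clf.dual_coef_ @ clf.support_vectors_   # w = sum_i alpha_i y_i x_i, shape (1, n_features)
b = clf.intercept_

z = np.array([[0.5, 0.5]])
manual = np.sign(z @ w.T + b)               # f(z) = sum_i alpha_i y_i x_i^T z + b
print("manual prediction:", manual.ravel(), "library prediction:", clf.predict(z))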
Multiclass Classification
• The basic SVM is a binary classifier; multiclass problems are usually handled by combining several binary SVMs (one-vs-rest or one-vs-one).
Non-Linearly Separable Data
Non-linear SVMs
 When the dataset cannot be separated in a linear fashion:

(Figure: 1-D data along the x axis that no single threshold separates.)

 How about mapping the data to a higher-dimensional space:

(Figure: the same data plotted against x and x², where it becomes linearly separable.)
Non-linear SVMs: Feature spaces
 General idea: the original input space can always be mapped to some
higher-dimensional feature space where the training set is separable:

Φ: x → φ(x)
Kernel Trick
• With feature mapping, the discriminant function becomes:
g(x) = wTɸ(x) + b = Σ(i∈SV) αi yi ɸ(xi)Tɸ(x) + b
• A kernel function is defined as a function that corresponds to a dot product of two feature vectors:
K(xa, xb) = ɸ(xa)·ɸ(xb)
• Often K(xa, xb) is very inexpensive to compute even when ɸ(xa) is extremely high dimensional.
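A concrete sketch of this point for 2-D inputs: the degree-2 polynomial kernel K(a, b) = (1 + a·b)² equals the dot product of explicit 6-dimensional feature maps, but never builds them (NumPy and the feature map shown are illustrative assumptions):

import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 polynomial kernel in 2-D:
    # phi(x) = (1, sqrt(2)x1, sqrt(2)x2, x1^2, x2^2, sqrt(2)x1x2)
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

def poly_kernel(a, b):
    return (1.0 + np.dot(a, b)) ** 2

a, b = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(a), phi(b)))   # explicit feature-space dot product: 4.0
print(poly_kernel(a, b))        # same value via the kernel, without computing phi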
Commonly used Kernel Functions
 Linear: K(xi,xj)= xi Txj

 Polynomial of power p: K(xi,xj)= (1+ xi Txj)p

 Gaussian (radial-basis function network): K(xi, xj) = exp( −||xi − xj||² / (2σ²) )

 Sigmoid: K(xi,xj) = tanh(β0xi Txj + β1)
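A usage note, as a sketch (scikit-learn is an assumption, not named in the slides): its rbf kernel is parametrized as exp(−γ||xi − xj||²), so γ plays the role of 1/(2σ²) in the Gaussian kernel above.

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 0.0]])
sigma = 1.5
gamma = 1.0 / (2.0 * sigma ** 2)

# Gaussian kernel matrix by hand ...
diff = X[:, None, :] - X[None, :, :]
K_manual = np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * sigma ** 2))

# ... and via the library, with gamma = 1 / (2 sigma^2)
K_sklearn = rbf_kernel(X, X, gamma=gamma)
print(np.allclose(K_manual, K_sklearn))   # True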


Example of Gaussian Kernel

(Figure: Gaussian-kernel example; © Eric Xing @ CMU, 2006-2010.)


Performance

• Support vector machines work very well in practice.
  - The user must choose the kernel function and its parameters.
• They can be expensive in time and space for big data sets.
• SVM is a better classifier than many others, as it can be used for both linearly separable and linearly non-separable data.
SVM Applications

SVM is used successfully in many real-world problems:
- text (and hypertext) categorization
- image classification
- bioinformatics (protein classification, cancer classification)
- hand-written character recognition
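As a small illustration of the first application (a sketch; scikit-learn and the toy corpus are assumptions, not from the slides):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny made-up corpus: a linear SVM over TF-IDF features for text categorization.
docs = ["the match ended in a draw", "the striker scored a late goal",
        "the election results were announced", "parliament passed the new bill"]
labels = ["sports", "sports", "politics", "politics"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(docs, labels)

print(model.predict(["the striker scored another goal"]))  # likely ['sports']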
Reference
• www.astro.caltech.edu/~george/aybi199/AMooreTutorials/svm.ppt
• www.iro.umontreal.ca/~pift6080/H09/documents/papers/svm_tutorial.ppt
• https://medium.com/@ankitnitjsr13/math-behind-support-vector-machine-svm-5e7376d0ee4d
• https://www.youtube.com/watch?v=WLhvjpoCPiY&list=RDCMUC2nvtxeY_rJLnlJKEg-J82g&start_radio=1
• https://shuzhanfan.github.io/2018/05/understanding-mathematics-behind-support-vector-machines/
• https://www.youtube.com/watch?v=SRVswRH5Q7E
• https://www.youtube.com/watch?v=gidJbK1gXmA&t=1586s
• https://www.youtube.com/watch?v=YOsrYl1JRrc&t=362s
Thank You
