ML 18-20 SVM

Support Vector Machines (SVMs)

 Support Vector Machines (SVMs) are powerful and versatile supervised machine learning algorithms used primarily for classification; they can also be adapted for regression tasks.
 Effective in high-dimensional spaces.
 Memory efficient, as they use only a subset of the training points (the support vectors) in the decision function.
 Work well with both linear and non-linear data.
 Robust against overfitting, especially on small datasets.
 If the data points are not linearly separable, we can use a technique called the kernel trick to map the data points into a higher-dimensional space where they become separable. The kernel function computes the inner product between the mapped data points without computing the mapping itself. (A minimal usage sketch follows below.)
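As a minimal illustration of these points (assuming scikit-learn is available; the synthetic dataset and parameter values below are purely illustrative), an SVM classifier can be fit and its support vectors inspected:

```python
# Minimal sketch: fitting an SVM classifier and inspecting its support vectors.
# Assumes scikit-learn is installed; the toy data below is purely illustrative.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = SVC(kernel="rbf", C=1.0)          # kernel trick via the RBF kernel
clf.fit(X, y)

# Only a subset of the training points ends up defining the decision function.
print("number of training points:", len(X))
print("number of support vectors:", clf.support_vectors_.shape[0])
print("predictions:", clf.predict(X[:5]))
```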
Hyperplane and Margin:
 In a dataset with two classes, an SVM aims to find a decision boundary, known as a hyperplane,
that separates the data points of one class from those of another.
 The core idea behind SVMs is to find the optimal hyperplane that best separates different classes
within a dataset.
 The optimal hyperplane is the one that maximizes the "margin," which is the distance between the hyperplane and the closest data points of each class.
 The data points that are closest to the hyperplane are called support vectors.
Maximizing the Margin:
 The goal of maximizing the margin is to ensure
the best possible separation between classes,
leading to better generalization to unseen data.
 A larger margin indicates a more robust and
accurate classification.
 Solving SVMs is a quadratic programming
problem
Classification and Prediction:
Once the optimal hyperplane is determined, new, unseen data points can be classified by determining which side of the hyperplane they fall on.
Common kernels:
Linear Kernel – Works well for linearly separable data.
Polynomial Kernel – Used when data has complex relationships.
Radial Basis Function (RBF) Kernel – Common for non-linear problems.

SVM algorithm steps for classification (see the sketch below):
I. Prepare the Data – Load and preprocess the dataset.
II. Choose a Kernel Function – Select the right kernel (linear, polynomial, RBF).
III. Train the SVM Model – Fit the model using labelled training data.
IV. Find the Optimal Hyperplane – Determine the best decision boundary.
V. Classify New Data – Use the trained model to classify new inputs.
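A sketch of these five steps, assuming scikit-learn and using the Iris dataset purely as an example (kernel and parameter choices are illustrative, not prescribed here):

```python
# Sketch of the five steps above using scikit-learn (illustrative choices only).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# I. Prepare the data: load and preprocess (feature scaling matters for SVMs).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# II. Choose a kernel function (linear, polynomial, or RBF).
clf = SVC(kernel="rbf", C=1.0, gamma="scale")

# III./IV. Train the model; fitting finds the optimal hyperplane internally.
clf.fit(X_train, y_train)

# V. Classify new data.
print("test accuracy:", clf.score(X_test, y_test))
```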

Maximum Margin
• w: decision hyperplane normal vector
• xi: data point i
• yi: class of data point i (+1 or -1) NB: Not 1/0
• Classifier is: f(xi) = sign(wTxi + b)
• Functional margin of xi is: yi (wTxi + b)
• But note that we can increase this margin simply by scaling w, b….
• Functional margin of dataset is twice the minimum functional margin for any point
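To make the scaling remark concrete, a small NumPy sketch with made-up values for w, b, and the data points:

```python
# Illustration of the functional margin and why it can be inflated by rescaling.
# The numbers below are made up purely for demonstration.
import numpy as np

w = np.array([2.0, -1.0])     # hyperplane normal vector
b = 0.5                       # bias term
X = np.array([[1.0, 1.0], [-0.5, 2.0], [2.0, -1.0]])
y = np.array([1, -1, 1])      # class labels in {+1, -1}

func_margins = y * (X @ w + b)          # y_i (w^T x_i + b) for each point
print("functional margins:", func_margins)
print("dataset functional margin:", 2 * func_margins.min())

# Scaling (w, b) by 10 scales every functional margin by 10 as well,
# which is why SVMs work with the geometric margin y_i(w^T x_i + b)/||w||.
print("after scaling:", y * (X @ (10 * w) + 10 * b))
```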

• in 3D the discriminant is a plane, and in nD it is a hyperplane

Linear SVMs Mathematically

• We can formulate the quadratic optimization problem:


Find w and b such that
r = 2/||w|| is maximized; and for all {(xi, yi)}:
wTxi + b ≥ 1 if yi = 1;  wTxi + b ≤ -1 if yi = -1
• A better formulation (min ||w|| = max 1/ ||w|| ):

Find w and b such that


Φ(w) =½ wTw is minimized;

and for all {(xi ,yi)}: yi (wTxi + b) ≥ 1
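As a rough illustration only, this primal problem can be handed to a general-purpose constrained optimizer; the tiny two-class dataset below is made up, and real SVM libraries use dedicated solvers instead:

```python
# Rough sketch: solving the hard-margin primal with a generic solver (SLSQP).
# Toy data only; production code would use a real SVM/QP solver.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(params):                 # Phi(w) = 1/2 w^T w
    w = params[:2]
    return 0.5 * w @ w

constraints = [                        # y_i (w^T x_i + b) >= 1 for every i
    {"type": "ineq", "fun": lambda p, xi=xi, yi=yi: yi * (p[:2] @ xi + p[2]) - 1}
    for xi, yi in zip(X, y)
]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b, "margin =", 2 / np.linalg.norm(w))
```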



Solving the Optimization Problem


Find w and b such that
Φ(w) =½ wTw is minimized;
and for all {(xi ,yi)}: yi (wTxi + b) ≥ 1

• This is now optimizing a quadratic function subject to linear constraints


• Quadratic optimization problems are a well-known class of mathematical
programming problem.
• The solution involves constructing a dual problem in which a Lagrange multiplier αi is associated with every constraint of the primal problem.

The Optimization Problem

The Dual Optimization Problem (the number of parameters is N, one αi per training point)

Find α1…αN such that
Q(α) = Σαi - ½ΣΣαiαjyiyjxiTxj is maximized and
(1) Σαiyi = 0
(2) αi ≥ 0 for all αi
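A sketch of solving this dual numerically with a generic constrained optimizer (the same made-up toy data as above; production libraries rely on specialized solvers such as SMO):

```python
# Sketch: the (hard-margin) dual solved with a generic constrained optimizer.
# Toy data only; real libraries use specialized solvers such as SMO.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
G = (y[:, None] * X) @ (y[:, None] * X).T      # G_ij = y_i y_j x_i^T x_j

def neg_dual(alpha):                            # maximize Q(a) <=> minimize -Q(a)
    return -(alpha.sum() - 0.5 * alpha @ G @ alpha)

res = minimize(
    neg_dual,
    x0=np.zeros(len(y)),
    method="SLSQP",
    bounds=[(0, None)] * len(y),                            # alpha_i >= 0
    constraints=[{"type": "eq", "fun": lambda a: a @ y}],   # sum_i alpha_i y_i = 0
)
alpha = res.x
w = ((alpha * y)[:, None] * X).sum(axis=0)                  # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                                           # support vectors: alpha_i > 0
b = (y[sv] - X[sv] @ w).mean()
print("alpha =", alpha.round(3), " w =", w, " b =", b)
```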
SVM: Example

Soft Margin Classification


• If the training data is not linearly
separable, slack variables ξi can be
added to allow misclassification of
difficult or noisy examples.
• Allow some errors.
• Let some points be moved to where they belong, at a cost ξi.
• Still, try to minimize the training-set errors Σξi, and to place the hyperplane “far” from each class (large margin).


Soft Margin Classification Mathematically


• The old formulation:

Find w and b such that


Φ(w) =½ wTw is minimized and for all {(xi ,yi)}
yi (wTxi + b) ≥ 1

• The new formulation incorporating slack variables:

Find w and b such that


Φ(w) =½ wTw + CΣξi is minimized and for all {(xi ,yi)}
yi (wTxi + b) ≥ 1- ξi and ξi ≥ 0 for all i

• Parameter C can be viewed as a way to control overfitting: it acts as a regularization term, trading off margin width against training errors.
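To illustrate the effect of C, a hedged sketch using scikit-learn on deliberately overlapping synthetic data (the C values are arbitrary):

```python
# Sketch of how C trades margin width against training errors (illustrative values).
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping blobs so the data is not perfectly separable.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=3.0, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: training accuracy={clf.score(X, y):.3f}, "
          f"support vectors={clf.support_vectors_.shape[0]}")
# Small C -> wider margin, more slack (more support vectors, possibly more errors);
# large C -> narrower margin, fewer training errors (risk of overfitting).
```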

Soft Margin Classification – Solution

• The dual problem for soft margin classification:


Find α1…αN such that
Q(α) =Σαi - ½ΣΣαiαjyiyjxiTxj is maximized and
(1) Σαiyi = 0
(2) 0 ≤ αi ≤ C for all αi

• Neither slack variables ξi nor their Lagrange multipliers appear in the dual problem!
• Again, xi with non-zero αi will be support vectors.


Linear SVMs: Summary


• The classifier is a separating hyperplane.

• The most “important” training points are the support vectors; they define
the hyperplane.

• Quadratic optimization algorithms can identify which training points xi are support vectors with non-zero Lagrangian multipliers αi.

• Both in the dual formulation of the problem and in the solution, training
points appear only inside inner products:

Find α1…αN such that
Q(α) = Σαi - ½ΣΣαiαjyiyjxiTxj is maximized and
(1) Σαiyi = 0
(2) 0 ≤ αi ≤ C for all αi

The resulting classifier is: f(x) = ΣαiyixiTx + b
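As a check that the classifier really is f(x) = ΣαiyixiTx + b, one can rebuild scikit-learn's decision function from the stored dual coefficients (binary, linear-kernel case; the blob data is synthetic and purely illustrative):

```python
# Sketch: rebuilding f(x) = sum_i alpha_i y_i x_i^T x + b from a fitted linear SVC.
# scikit-learn stores alpha_i * y_i in dual_coef_ (binary case shown here).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=1)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

alpha_y = clf.dual_coef_[0]              # alpha_i * y_i for each support vector
sv = clf.support_vectors_                # the support vectors x_i
b = clf.intercept_[0]

x_new = X[:5]
f_manual = (alpha_y[:, None] * (sv @ x_new.T)).sum(axis=0) + b
print(np.allclose(f_manual, clf.decision_function(x_new)))   # expect True
```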
SVC as Optimization


Non-linear SVMs
• Datasets that are linearly separable (with some noise) work out great.
• But what are we going to do if the dataset is just too hard?
• How about … mapping the data to a higher-dimensional space (e.g., from x to (x, x²))?

Non-linear SVMs: Feature spaces


• General idea: the original feature space can always be mapped to some
higher-dimensional feature space where the training set is separable:

Φ: x → φ(x)

Polynomial Regression
• Consider a training set {(xi, yi)} with inputs xi ∈ ℝ.
• We can fit a 2nd degree polynomial model y = w0 + w1x + w2x² to find the best quadratic curve that fits the data. But
(i) we first expand the feature dimension of the training set,
(ii) and then train a linear model on the expanded data.
 Training a polynomial model is just training a linear model on data with transformed predictors.
 Transforming the data to fit a 2nd degree polynomial model requires a map φ: ℝ → ℝ^3, x ↦ (1, x, x²), where ℝ is called the input space and ℝ^3 is called the feature space.
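A sketch of this expand-then-fit-a-linear-model recipe, assuming scikit-learn and synthetic quadratic data:

```python
# Sketch: degree-2 polynomial regression as a linear model on expanded features.
# Synthetic quadratic data for illustration only.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(100, 1))
y = 1.0 - 2.0 * x[:, 0] + 0.5 * x[:, 0] ** 2 + rng.normal(0, 0.1, size=100)

# (i) Map each x to the feature space (1, x, x^2) ...
phi = PolynomialFeatures(degree=2, include_bias=True)
X_feat = phi.fit_transform(x)            # shape (100, 3): columns 1, x, x^2

# (ii) ... then train an ordinary linear model on the expanded data.
model = LinearRegression(fit_intercept=False).fit(X_feat, y)
print("recovered coefficients (bias, linear, quadratic):", model.coef_.round(2))
```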
SVC with Non-Linear Decision Boundaries

Kernel Functions

Inner Products
SVM Example: XOR Problem

 The resulting solution satisfies the constraints ∀i.
 All samples are support vectors.
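A quick numerical sketch of the XOR example with a degree-2 polynomial kernel (the specific SVC settings below are illustrative choices, not taken from the slides):

```python
# Sketch: XOR is not linearly separable, but a degree-2 polynomial kernel handles it,
# and every one of the four samples ends up as a support vector.
import numpy as np
from sklearn.svm import SVC

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])
y = np.array([-1, 1, 1, -1])                      # XOR labelling

clf = SVC(kernel="poly", degree=2, gamma=1, coef0=1, C=1e6).fit(X, y)
print("predictions:", clf.predict(X))             # should reproduce y
print("support vectors:\n", clf.support_vectors_) # all four points
```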
How to choose a kernel function K(xi,xj)?
 K(xi,xj) should correspond to the inner product ϕ(xi)Tϕ(xj) in some higher-dimensional space.
 Mercer’s condition tells us which kernel functions can be expressed as a dot product of two (mapped) vectors: the kernel must be symmetric and its Gram matrix positive semi-definite.
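One practical, finite-sample way to sanity-check this condition is to verify that a kernel's Gram matrix is symmetric and has no negative eigenvalues; the random points below are for illustration only:

```python
# Sketch: checking (on a finite sample) that a kernel's Gram matrix is PSD,
# as Mercer's condition requires. Random points are used purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))

def poly_kernel(A, B, degree=2):
    return (1.0 + A @ B.T) ** degree

K = poly_kernel(X, X)
eigvals = np.linalg.eigvalsh(K)               # eigenvalues of the symmetric Gram matrix
print("symmetric:", np.allclose(K, K.T))
print("min eigenvalue (should be >= -1e-8):", eigvals.min())
```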
The Kernel Trick (cont.)


The “Kernel Trick”


• The linear classifier relies on an inner product between vectors K(xi,xj)=xiTxj
• If every datapoint is mapped into high-dimensional space via some
transformation Φ: x → φ(x), the inner product becomes:
K(xi,xj)= φ(xi) Tφ(xj)
• A kernel function is some function that corresponds to an inner product in
some expanded feature space.
• Example:
2-dimensional vectors x = [x1 x2]; let K(xi,xj) = (1 + xiTxj)²
Need to show that K(xi,xj) = φ(xi)Tφ(xj):
K(xi,xj) = (1 + xiTxj)² = 1 + xi1²xj1² + 2xi1xj1xi2xj2 + xi2²xj2² + 2xi1xj1 + 2xi2xj2
= [1  xi1²  √2 xi1xi2  xi2²  √2 xi1  √2 xi2]T [1  xj1²  √2 xj1xj2  xj2²  √2 xj1  √2 xj2]
= φ(xi)Tφ(xj), where φ(x) = [1  x1²  √2 x1x2  x2²  √2 x1  √2 x2]
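A small numerical check of this identity; the φ below simply transcribes the expansion above, and the random 2-D vectors are arbitrary:

```python
# Numerical check that (1 + xi^T xj)^2 equals phi(xi)^T phi(xj)
# for the explicit phi derived above. Random 2-D vectors for illustration.
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2)*x1*x2, x2**2, np.sqrt(2)*x1, np.sqrt(2)*x2])

rng = np.random.default_rng(0)
xi, xj = rng.normal(size=2), rng.normal(size=2)

lhs = (1.0 + xi @ xj) ** 2          # kernel evaluated in the input space
rhs = phi(xi) @ phi(xj)             # inner product in the 6-D feature space
print(np.isclose(lhs, rhs))         # expect True
```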

Kernels
• Why use kernels?
• Make non-separable problem separable.
• Map data into better representational space
• Common kernels
• Linear
• Polynomial K(x,z) = (1+xTz)d
• Gives feature conjunctions
• Radial basis function, K(x,z) = exp(−‖x−z‖²/(2σ²)) (infinite-dimensional feature space)
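scikit-learn's SVC also accepts a callable that returns the Gram matrix, which is one way to try a kernel of your own; the quadratic kernel below reuses the (1 + xTz)² example, and the moons dataset is purely illustrative:

```python
# Sketch: passing a user-defined kernel to SVC as a callable returning the Gram matrix.
# The quadratic kernel reuses the (1 + x^T z)^2 example from the kernel-trick section.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

def quadratic_kernel(A, B):
    return (1.0 + A @ B.T) ** 2     # must return a matrix of shape (len(A), len(B))

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
clf = SVC(kernel=quadratic_kernel, C=1.0).fit(X, y)
print("training accuracy:", clf.score(X, y))
```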


Per class evaluation measures

• Recall: fraction of docs in class i classified correctly:  cii / Σj cij
• Precision: fraction of docs assigned class i that are actually about class i:  cii / Σj cji
• Accuracy (1 − error rate): fraction of docs classified correctly:  Σi cii / Σi Σj cij
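Written out on a confusion matrix c (rows = true class, columns = assigned class), with made-up counts:

```python
# Sketch: per-class precision/recall and overall accuracy from a confusion matrix c,
# where rows are true classes and columns are predicted classes (made-up counts).
import numpy as np

c = np.array([[50,  4,  1],
              [ 6, 40,  9],
              [ 2,  3, 45]])

recall    = np.diag(c) / c.sum(axis=1)   # c_ii / sum_j c_ij  (row sums)
precision = np.diag(c) / c.sum(axis=0)   # c_ii / sum_j c_ji  (column sums)
accuracy  = np.trace(c) / c.sum()        # sum_i c_ii / sum_ij c_ij

print("recall per class:   ", recall.round(3))
print("precision per class:", precision.round(3))
print("accuracy:", round(accuracy, 3))
```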
Summary

• Support vector machines (SVM)


• Choose hyperplane based on support vectors
• Support vector = “critical” point close to decision boundary
• (Degree-1) SVMs are linear classifiers.
• Kernels: powerful and elegant way to define similarity metric
• Perhaps best performing text classifier
• But there are other methods that perform about as well as SVM, such as
regularized logistic regression (Zhang & Oles 2001)
• Partly popular due to availability of good software
• SVMlight is accurate and fast – and free (for research)
• Now lots of good software: libsvm, TinySVM, ….
• Comparative evaluation of methods
• Real world: exploit domain specific structure!


Resources for today’s lecture

• Christopher J. C. Burges. 1998. A Tutorial on Support Vector Machines for Pattern Recognition
• S. T. Dumais. 1998. Using SVMs for text categorization, IEEE Intelligent Systems, 13(4)
• S. T. Dumais, J. Platt, D. Heckerman and M. Sahami. 1998. Inductive learning algorithms and representations for text
categorization. CIKM ’98, pp. 148-155.
• Yiming Yang, Xin Liu. 1999. A re-examination of text categorization methods. 22nd Annual International SIGIR
• Tong Zhang, Frank J. Oles. 2001. Text Categorization Based on Regularized Linear Classification Methods. Information
Retrieval 4(1): 5-31
• Trevor Hastie, Robert Tibshirani and Jerome Friedman. Elements of Statistical Learning: Data Mining, Inference and
Prediction. Springer-Verlag, New York.
• T. Joachims, Learning to Classify Text using Support Vector Machines. Kluwer, 2002.
• Fan Li, Yiming Yang. 2003. A Loss Function Analysis for Classification Methods in Text Categorization. ICML 2003: 472-479.
• Tie-Yan Liu, Yiming Yang, Hao Wan, et al. 2005. Support Vector Machines Classification with Very Large Scale Taxonomy,
SIGKDD Explorations, 7(1): 36-43.
• ‘Classic’ Reuters-21578 data set: http://www.daviddlewis.com/resources/testcollections/reuters21578/
