Support Vector Machines
Support Vector Machines – In a nutshell
• New method for the classification of both linear and nonlinear data.
• It uses a nonlinear mapping to transform the original training data into a
higher dimension.
• Within this new dimension, it searches for the linear optimal separating
hyperplane (that is, a “decision boundary” separating the tuples of one
class from another).
• With an appropriate nonlinear mapping to a sufficiently high dimension,
data from two classes can always be separated by a hyperplane.
• The SVM finds this hyperplane using support vectors (“essential”
training tuples) and margins (defined by the support vectors).
Support Vector Machines
• The first paper => presented in 1992 by Vladimir Vapnik and colleagues Bernhard Boser and Isabelle Guyon.
• The training time of even the fastest SVMs can be extremely slow.
• But they are highly accurate, owing to their ability to model complex
nonlinear decision boundaries.
• They are much less prone to overfitting than other methods.
• SVMs can be used for prediction as well as classification.
• Applications => handwritten digit recognition, object recognition, and speaker identification, as well as benchmark time-series prediction tests.
The Case When the Data Are Linearly
Separable
• Let the data set D be given as (X1, y1), (X2, y2), ..., (X|D|, y|D|), where Xi is a training tuple with associated class label yi.
• Each yi can take one of two values, either +1 or -1 (i.e., yi ∈ {+1, -1}), corresponding to the classes buys_computer = yes and buys_computer = no, respectively.
• To aid in visualization, consider two input attributes, A1 and A2.
• There are an infinite number of separating lines that could be drawn.
• We need to find the “best” one, that is, the one that will have the minimum classification error on previously unseen tuples => identify the best hyperplane.
• Start by searching for the maximum marginal hyperplane (MMH).
• The associated margin gives the largest separation between classes.
• When dealing with the MMH, this distance is the shortest distance
from the MMH to the closest training tuple of either class.
• A separating hyperplane can be written as
      W · X + b = 0,
  where W is a weight vector, namely, W = {w1, w2, . . . , wn};
  n is the number of attributes; and
  b is a scalar, often referred to as a bias.
• Consider two input attributes, A1 and A2.
• Training tuples are 2-D, e.g., X = (x1, x2), where x1 and x2 are the
values of attributes A1 and A2, respectively, for X.
• If we think of b as an additional weight, w0, we can rewrite the above
separating hyperplane as
w0 + w1x1 + w2x2 = 0
• Thus, any point that lies above the separating hyperplane satisfies
w0 + w1x1 + w2x2 > 0
• Similarly, any point that lies below the separating hyperplane satisfies
w0 + w1x1 + w2x2 < 0
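• A minimal sketch of this side-of-the-hyperplane test in Python (the weights w0, w1, w2 below are made-up values for illustration, not weights learned from data):

```python
# Hypothetical, hand-picked weights for illustration only:
# the separating hyperplane is w0 + w1*x1 + w2*x2 = 0.
w0, w1, w2 = -3.0, 1.0, 1.0

def side_of_hyperplane(x1, x2):
    """Return +1 if the point lies on/above the hyperplane, -1 if below."""
    value = w0 + w1 * x1 + w2 * x2
    return 1 if value >= 0 else -1

print(side_of_hyperplane(2.5, 2.0))  # 1.5 > 0  -> class +1
print(side_of_hyperplane(0.5, 1.0))  # -1.5 < 0 -> class -1
```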
• The weights can be adjusted so that the hyperplanes defining the “sides” of the margin can be written as
      H1 : w0 + w1x1 + w2x2 ≥ 1 for yi = +1,
      H2 : w0 + w1x1 + w2x2 ≤ -1 for yi = -1.
• That is, any tuple that falls on or above H1 belongs to class +1, and any tuple that falls on or below H2 belongs to class -1.
• Combining the two inequalities above, we get
      yi (w0 + w1x1 + w2x2) ≥ 1, for all i.
• Any training tuples that fall on hyperplanes H1 or H2 are called support vectors.
• They are equally close to the (separating) MMH.
• They are the most difficult tuples to classify and give the most information regarding classification.
• A formula for the size of the maximal margin:
• The distance from the separating hyperplane to any point on H1 is 1/||W||, where ||W|| is the Euclidean norm of W, that is, √(W · W).
• By definition, this is equal to the distance from any point on H2 to the separating hyperplane.
• Therefore, the maximal margin is 2/||W||.
• Any optimization software package for solving constrained convex
quadratic problems can then be used to find the support vectors and
MMH.
• For larger data, special and more efficient algorithms for training
SVMs can be used instead.
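• As a concrete illustration, a linear SVM can be trained on a toy 2-D data set with scikit-learn's SVC (one assumed choice of such an optimization package); the support vectors, W, b, and the maximal margin 2/||W|| are then read off the fitted model:

```python
import numpy as np
from sklearn.svm import SVC  # assumes scikit-learn is installed

# Toy linearly separable 2-D data: class +1 (upper right) vs. class -1 (lower left).
X = np.array([[3.0, 3.0], [4.0, 3.0], [3.5, 4.0],
              [1.0, 1.0], [1.5, 0.5], [0.5, 1.5]])
y = np.array([1, 1, 1, -1, -1, -1])

# A large C approximates the hard-margin (linearly separable) formulation.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]                  # weight vector W = (w1, w2)
b = clf.intercept_[0]             # bias w0
margin = 2.0 / np.linalg.norm(w)  # maximal margin = 2 / ||W||

print("support vectors:\n", clf.support_vectors_)
print("W =", w, " b =", b, " margin =", margin)
```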
Trained support vector machine -> how do I use it to classify test (i.e., new) tuples?
• Based on the Lagrangian formulation, the MMH can be rewritten as the decision boundary
      d(XT) = Σ (i = 1 to l) yi αi Xi · XT + b0,
  where yi is the class label of support vector Xi;
  XT is a test tuple;
  αi and b0 are numeric parameters that were determined automatically by the optimization (SVM) algorithm; and
  l is the number of support vectors.
• Given a test tuple XT, plug it into the equation above and check the sign of the result.
• This tells us on which side of the hyperplane the test tuple falls.
• If the sign is positive, then XT falls on or above the MMH, and so the SVM predicts that XT belongs to class +1 (representing buys_computer = yes).
• If the sign is negative, then XT falls on or below the MMH, and the class prediction is -1 (representing buys_computer = no).
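• A hedged sketch of this decision rule, computing d(XT) = Σ yi αi Xi · XT + b0 directly from the support vectors; the support vectors, labels, αi, and b0 below are made-up example values, not the output of a real training run:

```python
import numpy as np

# Made-up support vectors, labels, multipliers, and bias for illustration only.
support_vectors = np.array([[3.0, 3.0], [1.0, 1.0]])  # Xi
labels = np.array([1, -1])                             # yi
alphas = np.array([0.25, 0.25])                        # alpha_i
b0 = -1.0

def predict(x_test):
    """Sign of d(XT) = sum_i yi * alpha_i * (Xi . XT) + b0."""
    d = np.sum(labels * alphas * (support_vectors @ x_test)) + b0
    return 1 if d >= 0 else -1

print(predict(np.array([3.5, 2.5])))  # +1 (buys_computer = yes)
print(predict(np.array([0.5, 1.0])))  # -1 (buys_computer = no)
```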
The Case When the Data Are Linearly Inseparable
• i.e., no straight line can be found that would separate the classes.
• SVMs can be extended to create nonlinear SVMs for the classification of linearly
inseparable data.
• Capable of finding nonlinear decision boundaries (i.e., nonlinear hypersurfaces) in
input space.
• There are two main steps.
• Step 1: Transform the original input data into a higher dimensional space using a
nonlinear mapping.
• Step 2: Search for a linear separating hyperplane in the new space.
• This leads to a quadratic optimization problem that can be solved using the linear SVM formulation.
• The maximal marginal hyperplane found in the new space corresponds to a
nonlinear separating hypersurface in the original space.
Eg: Nonlinear transformation of original input data into a higher dimensional space.
• A 3-D input vector X = (x1, x2, x3) is mapped into a 6-D space, Z, using the mappings
      φ1(X) = x1, φ2(X) = x2, φ3(X) = x3, φ4(X) = (x1)², φ5(X) = x1x2, and φ6(X) = x1x3.
• A decision hyperplane in the new space is d(Z) = W · Z + b, where W and Z are vectors.
• This is linear.
• Solve for W and b and then substitute back so that the linear decision hyperplane in the new (Z) space corresponds to a nonlinear second-order polynomial in the original 3-D input space:
      d(Z) = w1x1 + w2x2 + w3x3 + w4(x1)² + w5x1x2 + w6x1x3 + b
           = w1z1 + w2z2 + w3z3 + w4z4 + w5z5 + w6z6 + b
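• A small Python sketch of this explicit mapping; phi below implements exactly the six coordinates listed above, while the weights w and bias b are hypothetical values used only to show that d(Z) is linear in the mapped space:

```python
import numpy as np

def phi(x):
    """Map a 3-D input (x1, x2, x3) to the 6-D space Z described above."""
    x1, x2, x3 = x
    return np.array([x1, x2, x3, x1**2, x1 * x2, x1 * x3])

# Hypothetical weights w1..w6 and bias b (not learned; for illustration only).
w = np.array([0.5, -1.0, 0.3, 0.2, -0.4, 0.1])
b = 0.7

x = np.array([1.0, 2.0, 3.0])
z = phi(x)
d = w @ z + b   # linear in Z, second-order polynomial in the original x
print(z, d)
```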
Issues…
1. How do we choose the nonlinear mapping to a higher dimensional
space?
2. The computation involved is costly.
• Given the test tuple, we have to compute its dot product with every one of
the support vectors.
• In training, we have to compute a similar dot product several times in order to
find the MMH. This is expensive.
• Hence, the dot product computation required is very heavy and costly.
Solution
• Computing the dot product Φ(Xi) · Φ(Xj) on the transformed data tuples is mathematically equivalent to applying a kernel function, K(Xi, Xj), to the original input data:
      K(Xi, Xj) = Φ(Xi) · Φ(Xj)
• Advantage: all calculations are made in the original input space, which is of potentially much lower dimensionality.
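• To illustrate the kernel trick with a standard textbook example (the degree-2 polynomial kernel, not the specific 6-D mapping above): the explicitly mapped dot product Φ(Xi) · Φ(Xj) and the kernel value K(Xi, Xj) = (Xi · Xj)² agree, yet the kernel never forms the higher-dimensional vectors:

```python
import numpy as np

def phi2(x):
    """Explicit feature map for the degree-2 polynomial kernel on 2-D inputs."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

def k_poly2(x, y):
    """Kernel computed directly in the original input space."""
    return (x @ y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

print(phi2(x) @ phi2(y))  # dot product in the transformed space
print(k_poly2(x, y))      # same value, computed in the original space
```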
What are some of the kernel functions that could be used?
• Common choices include the polynomial kernel of degree h, K(Xi, Xj) = (Xi · Xj + 1)^h; the Gaussian radial basis function (RBF) kernel, K(Xi, Xj) = e^(-||Xi - Xj||² / 2σ²); and the sigmoid kernel, K(Xi, Xj) = tanh(κ Xi · Xj - δ).
• Each of these results in a different nonlinear classifier in (the original) input space.
• An SVM with a Gaussian radial basis function (RBF) gives the same
decision hyperplane as a type of neural network known as a radial basis
function (RBF) network.
• An SVM with a sigmoid kernel is equivalent to a simple two-layer neural
network known as a multilayer perceptron.
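• A brief sketch comparing several kernels on the same toy data with scikit-learn (an assumed library choice); each kernel yields a different nonlinear decision boundary in the original input space:

```python
from sklearn.svm import SVC
from sklearn.datasets import make_moons  # toy linearly inseparable data

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel).fit(X, y)
    acc = clf.score(X, y)  # training accuracy, just to contrast the kernels
    print(f"{kernel:8s} training accuracy = {acc:.2f}")
```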
How to choose kernel function?
• No golden rules for determining which kernel will result in the most accurate
SVM.
• In practice, the kernel chosen does not generally make a large difference in
resulting accuracy.
• SVM training always finds a global solution.
• SVMs can be used for binary and multiclass cases.
• A simple and effective approach => given m classes, train m classifiers, one for each class (one-versus-rest; see the sketch after this list).
• SVMs can also be designed for linear and nonlinear regression.
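• A minimal sketch of the one-classifier-per-class idea, using scikit-learn's OneVsRestClassifier wrapper around an SVC as one assumed way to realize the m-classifier scheme described above:

```python
from sklearn.datasets import load_iris        # 3-class toy data set
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# One binary SVM per class: each separates its class from the remaining ones.
clf = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)

print(len(clf.estimators_))   # m = 3 classifiers, one per class
print(clf.predict(X[:5]))     # predicted class labels for the first 5 tuples
```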
Major Research in SVMs
• To improve the speed in training and testing so that SVMs may become a more feasible option for very large data sets (e.g., of millions of support vectors).
• Determining the best kernel for a given data set.
• Finding more efficient methods for the multiclass case.
