Support Vector Machines
Support Vector Machines – In a nutshell
• New method for the classification of both linear and nonlinear data.
• It uses a nonlinear mapping to transform the original training data into a
higher dimension.
• Within this new dimension, it searches for the linear optimal separating
hyperplane (that is, a “decision boundary” separating the tuples of one
class from another).
• With an appropriate nonlinear mapping to a sufficiently high dimension,
data from two classes can always be separated by a hyperplane.
• The SVM finds this hyperplane using support vectors (“essential”
training tuples) and margins (defined by the support vectors).
Support Vector Machines
• The first paper => presented in 1992 by Vladimir Vapnik and colleagues Bernhard Boser and Isabelle Guyon.
• The training time of even the fastest SVMs can be extremely slow.
• But they are highly accurate, owing to their ability to model complex
nonlinear decision boundaries.
• They are much less prone to overfitting than other methods.
• SVMs can be used for prediction as well as classification.
• Applications => handwritten digit recognition, object recognition, and speaker identification, as well as benchmark time-series prediction tests.
The Case When the Data Are Linearly
Separable
• Let the data set D be given as (X1, y1), (X2, y2), ..., (X|D|, y|D|), where Xi is a training tuple with associated class label yi.
• Each yi can take one of two values, either +1 or -1 (i.e., yi ∈ {+1, -1}), corresponding to the classes buys_computer = yes and buys_computer = no, respectively.
• To aid in visualization, consider two input attributes, A1 and A2.
• There are an infinite number of separating lines that could be drawn.
• We need to find the “best” one, that is, the one that will have the minimum classification error on previously unseen tuples => identify the best hyperplane.
• Start by searching for the maximum marginal hyperplane (MMH).
• The associated margin gives the largest separation between classes.
• When dealing with the MMH, this distance is the shortest distance
from the MMH to the closest training tuple of either class.
• A separating hyperplane can be written as
      W · X + b = 0,
  where W is a weight vector, namely, W = {w1, w2, . . . , wn};
  n is the number of attributes; and
  b is a scalar, often referred to as a bias.
• Consider two input attributes, A1 and A2.
• Training tuples are 2-D, e.g., X = (x1, x2), where x1 and x2 are the
values of attributes A1 and A2, respectively, for X.
• If we think of b as an additional weight, w0, we can rewrite the above
separating hyperplane as
w0 + w1x1 + w2x2 = 0
• Thus, any point that lies above the separating hyperplane satisfies
w0 + w1x1 + w2x2 > 0
• Similarly, any point that lies below the separating hyperplane satisfies
w0 + w1x1 + w2x2 < 0
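• A minimal sketch of this side-of-the-hyperplane test in Python (the weights w0, w1, w2 below are made-up values for illustration, not weights learned from data):

```python
# Hypothetical, hand-picked weights for illustration only:
# the separating hyperplane is w0 + w1*x1 + w2*x2 = 0.
w0, w1, w2 = -3.0, 1.0, 1.0

def side_of_hyperplane(x1, x2):
    """Return +1 if the point lies on/above the hyperplane, -1 if below."""
    value = w0 + w1 * x1 + w2 * x2
    return 1 if value >= 0 else -1

print(side_of_hyperplane(2.5, 2.0))  # 1.5 > 0  -> class +1
print(side_of_hyperplane(0.5, 1.0))  # -1.5 < 0 -> class -1
```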
• The weights can be adjusted so that the hyperplanes defining the “sides” of the margin can be written as
      H1 : w0 + w1x1 + w2x2 ≥ 1 for yi = +1,
      H2 : w0 + w1x1 + w2x2 ≤ -1 for yi = -1.
• That is, any tuple that falls on or above H1 belongs to class +1, and any tuple that falls on or below H2 belongs to class -1.
• Combining the two inequalities above, we get
      yi (w0 + w1x1 + w2x2) ≥ 1, for all i.
• Any training tuples that fall on hyperplanes H1 or H2 are called support vectors.
• They are equally close to the (separating) MMH.
• They are the most difficult tuples to classify and give the most information regarding classification.
• A formula for the size of the maximal margin:
• The distance from the separating hyperplane to any point on H1 is 1/||W||, where ||W|| is the Euclidean norm of W, that is, √(W · W).
• By definition, this is equal to the distance from any point on H2 to the separating hyperplane.
• Therefore, the maximal margin is 2/||W||.
• Any optimization software package for solving constrained convex
quadratic problems can then be used to find the support vectors and
MMH.
• For larger data, special and more efficient algorithms for training
SVMs can be used instead.
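• As a concrete illustration, a linear SVM can be trained on a toy 2-D data set with scikit-learn's SVC (one assumed choice of such an optimization package); the support vectors, W, b, and the maximal margin 2/||W|| are then read off the fitted model:

```python
import numpy as np
from sklearn.svm import SVC  # assumes scikit-learn is installed

# Toy linearly separable 2-D data: class +1 (upper right) vs. class -1 (lower left).
X = np.array([[3.0, 3.0], [4.0, 3.0], [3.5, 4.0],
              [1.0, 1.0], [1.5, 0.5], [0.5, 1.5]])
y = np.array([1, 1, 1, -1, -1, -1])

# A large C approximates the hard-margin (linearly separable) formulation.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]                  # weight vector W = (w1, w2)
b = clf.intercept_[0]             # bias w0
margin = 2.0 / np.linalg.norm(w)  # maximal margin = 2 / ||W||

print("support vectors:\n", clf.support_vectors_)
print("W =", w, " b =", b, " margin =", margin)
```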
Trained support vector machine -> how do I use it to classify test (i.e., new) tuples?
• Based on the Lagrangian formulation, the MMH can be rewritten as the decision boundary
      d(XT) = Σ (i = 1 to l) yi αi Xi · XT + b0,
  where yi is the class label of support vector Xi;
  XT is a test tuple;
  αi and b0 are numeric parameters that were determined automatically by the optimization (SVM) algorithm; and
  l is the number of support vectors.
• Given a test tuple XT, plug it into the equation above and check the sign of the result.
• This tells us on which side of the hyperplane the test tuple falls.
• If the sign is positive, then XT falls on or above the MMH, and so the SVM predicts that XT belongs to class +1 (representing buys_computer = yes).
• If the sign is negative, then XT falls on or below the MMH, and the class prediction is -1 (representing buys_computer = no).
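• A hedged sketch of this decision rule, computing d(XT) = Σ yi αi Xi · XT + b0 directly from the support vectors; the support vectors, labels, αi, and b0 below are made-up example values, not the output of a real training run:

```python
import numpy as np

# Made-up support vectors, labels, multipliers, and bias for illustration only.
support_vectors = np.array([[3.0, 3.0], [1.0, 1.0]])  # Xi
labels = np.array([1, -1])                             # yi
alphas = np.array([0.25, 0.25])                        # alpha_i
b0 = -1.0

def predict(x_test):
    """Sign of d(XT) = sum_i yi * alpha_i * (Xi . XT) + b0."""
    d = np.sum(labels * alphas * (support_vectors @ x_test)) + b0
    return 1 if d >= 0 else -1

print(predict(np.array([3.5, 2.5])))  # +1 (buys_computer = yes)
print(predict(np.array([0.5, 1.0])))  # -1 (buys_computer = no)
```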
The Case When the Data Are Linearly Inseparable
• i.e., no straight line can be found that would separate the classes.
• SVMs can be extended to create nonlinear SVMs for the classification of linearly
inseparable data.
• Capable of finding nonlinear decision boundaries (i.e., nonlinear hypersurfaces) in
input space.
• There are two main steps.
• Step 1: Transform the original input data into a higher dimensional space using a
nonlinear mapping.
• Step 2: Search for a linear separating hyperplane in the new space.
• This leads to a quadratic optimization problem that can be solved using the linear SVM formulation.
• The maximal marginal hyperplane found in the new space corresponds to a
nonlinear separating hypersurface in the original space.
Eg: Nonlinear transformation of original input data into a higher dimensional space.
• A 3-D input vector X = (x1, x2, x3) is mapped into a 6-D space, Z, using the mappings
      φ1(X) = x1, φ2(X) = x2, φ3(X) = x3, φ4(X) = (x1)², φ5(X) = x1x2, and φ6(X) = x1x3.
• A decision hyperplane in the new space is d(Z) = W · Z + b, where W and Z are vectors.
• This is linear.
• Solve for W and b and then substitute back so that the linear decision hyperplane in the new (Z) space corresponds to a nonlinear second-order polynomial in the original 3-D input space:
      d(Z) = w1x1 + w2x2 + w3x3 + w4(x1)² + w5x1x2 + w6x1x3 + b
           = w1z1 + w2z2 + w3z3 + w4z4 + w5z5 + w6z6 + b
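• A small Python sketch of this explicit mapping; phi below implements exactly the six coordinates listed above, while the weights w and bias b are hypothetical values used only to show that d(Z) is linear in the mapped space:

```python
import numpy as np

def phi(x):
    """Map a 3-D input (x1, x2, x3) to the 6-D space Z described above."""
    x1, x2, x3 = x
    return np.array([x1, x2, x3, x1**2, x1 * x2, x1 * x3])

# Hypothetical weights w1..w6 and bias b (not learned; for illustration only).
w = np.array([0.5, -1.0, 0.3, 0.2, -0.4, 0.1])
b = 0.7

x = np.array([1.0, 2.0, 3.0])
z = phi(x)
d = w @ z + b   # linear in Z, second-order polynomial in the original x
print(z, d)
```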
Issues…
1. How do we choose the nonlinear mapping to a higher dimensional
space?
2. The computation involved is costly.
• Given the test tuple, we have to compute its dot product with every one of
the support vectors.
• In training, we have to compute a similar dot product several times in order to
find the MMH. This is expensive.
• Hence, the dot product computation required is very heavy and costly.
Solution
• Computing the dot product Φ(Xi) · Φ(Xj) on the transformed data tuples is mathematically equivalent to applying a kernel function, K(Xi, Xj), to the original input data:
      K(Xi, Xj) = Φ(Xi) · Φ(Xj)
• Advantage: all calculations are made in the original input space, which is of potentially much lower dimensionality.
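• To illustrate the kernel trick with a standard textbook example (the degree-2 polynomial kernel, not the specific 6-D mapping above): the explicitly mapped dot product Φ(Xi) · Φ(Xj) and the kernel value K(Xi, Xj) = (Xi · Xj)² agree, yet the kernel never forms the higher-dimensional vectors:

```python
import numpy as np

def phi2(x):
    """Explicit feature map for the degree-2 polynomial kernel on 2-D inputs."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

def k_poly2(x, y):
    """Kernel computed directly in the original input space."""
    return (x @ y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

print(phi2(x) @ phi2(y))  # dot product in the transformed space
print(k_poly2(x, y))      # same value, computed in the original space
```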
What are some of the kernel functions that could be used?
• Common choices include the polynomial kernel of degree h, K(Xi, Xj) = (Xi · Xj + 1)^h; the Gaussian radial basis function (RBF) kernel, K(Xi, Xj) = e^(-||Xi - Xj||² / 2σ²); and the sigmoid kernel, K(Xi, Xj) = tanh(κ Xi · Xj - δ).
• Each of these results in a different nonlinear classifier in (the original) input space.
• An SVM with a Gaussian radial basis function (RBF) gives the same
decision hyperplane as a type of neural network known as a radial basis
function (RBF) network.
• An SVM with a sigmoid kernel is equivalent to a simple two-layer neural
network known as a multilayer perceptron.
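• A brief sketch comparing several kernels on the same toy data with scikit-learn (an assumed library choice); each kernel yields a different nonlinear decision boundary in the original input space:

```python
from sklearn.svm import SVC
from sklearn.datasets import make_moons  # toy linearly inseparable data

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel).fit(X, y)
    acc = clf.score(X, y)  # training accuracy, just to contrast the kernels
    print(f"{kernel:8s} training accuracy = {acc:.2f}")
```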
How to choose kernel function?
• No golden rules for determining which kernel will result in the most accurate
SVM.
• In practice, the kernel chosen does not generally make a large difference in
resulting accuracy.
• SVM training always finds a global solution.
• SVMs can be used for binary and multiclass cases.
• A simple and effective approach => given m classes, train m classifiers, one for each class (one-versus-rest; see the sketch after this list).
• SVMs can also be designed for linear and nonlinear regression.
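• A minimal sketch of the one-classifier-per-class idea, using scikit-learn's OneVsRestClassifier wrapper around an SVC as one assumed way to realize the m-classifier scheme described above:

```python
from sklearn.datasets import load_iris        # 3-class toy data set
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# One binary SVM per class: each separates its class from the remaining ones.
clf = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)

print(len(clf.estimators_))   # m = 3 classifiers, one per class
print(clf.predict(X[:5]))     # predicted class labels for the first 5 tuples
```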
Major Research in SVMs
• To improve the speed in training and testing so that SVMs may become a more feasible option for very large data sets (e.g., of millions of support vectors).
• Determining the best kernel for a given data set.
• Finding more efficient methods for the multiclass case.
