VO MCA S4 Data Mining Unit 6
Names of Sub-Units
Overview
In this unit, you will study the history of SVMs, the support vector classifier (SVC), the limitations of the perceptron, hyperplanes, hard and soft margins, SVM kernels, Mercer's theorem, optimal hyperplanes, and a case study on support vector machines and support vector classifiers.
Learning Objectives
Learning Outcomes
Pre-Unit Preparatory Material
https://towardsdatascience.com/support-vector-machine-introduction-to-machine-
learning-algorithms-934a444fca47
https://machinelearningmastery.com/types-of-classification-in-machine-learning/
6.1 Support Vector Machines – History
Support Vector Machines (SVMs) are a type of supervised learning algorithm that can be used
for classification or regression tasks. The history of SVMs can be traced back to the early 1960s
with the work of Vladimir Vapnik and Alexey Chervonenkis on the theory of statistical learning.
However, it wasn't until the 1990s that SVMs began to gain widespread attention in the machine
learning community.
The basic idea behind SVMs is to find a decision boundary, or a hyperplane, that separates the
different classes in a dataset. The hyperplane is chosen in such a way that it maximizes the margin,
or the distance between the hyperplane and the closest data points from each class. These closest
data points are known as support vectors and play a crucial role in determining the decision
boundary.
The precursor of SVMs was introduced in the early 1960s by Vapnik and Chervonenkis as a method for
solving pattern recognition problems. However, it wasn't until the 1990s that the algorithm began
to gain widespread attention in the machine learning community. This was due in part to the
development of more efficient algorithms for solving the optimization problem that is at the
heart of SVMs, as well as the increasing availability of large, labeled datasets.
In 1992, Bernhard Boser, Isabelle Guyon, and Vladimir Vapnik proposed a training algorithm for optimal-margin classifiers that used what is now known as the kernel trick to produce non-linear decision boundaries. Later, in 1995, Corinna Cortes and Vladimir Vapnik introduced the soft-margin SVM, which tolerates some misclassifications. Together, these developments allowed SVMs to be applied to a wide range of non-linear and noisy classification problems.
In the mid-1990s, SVMs began to be applied to a wide range of real-world problems, including
text and image classification, bioinformatics, and hand-written character recognition. Around 2000, widely used SVM software libraries such as LIBSVM became available, making it easy for researchers and practitioners to apply SVMs to their own problems.
In the years since, SVMs have continued to be an active area of research and development, with
new variations and extensions being proposed. Some of these include:
Support Vector Regression (SVR) for regression tasks
Multi-class SVMs for classification problems with more than two classes
Cost-sensitive SVMs for dealing with imbalanced datasets
Online SVMs for learning from streaming data
Kernel-based feature selection for SVMs
SVMs have also been used in various fields such as natural language processing, computer
vision, and bioinformatics. In natural language processing, SVMs have been used for sentiment
analysis, text classification, and named entity recognition. In computer vision, they have been
used for object recognition, face detection, and image segmentation. In bioinformatics, they
have been used for protein classification, gene expression analysis, and drug discovery. While
SVMs are still widely used, new algorithms such as deep learning have become more popular in
recent years, particularly in image and speech recognition tasks. However, SVMs continue to be
a valuable tool for a wide range of problems, particularly when working with small datasets and
when interpretability is important.
In summary, Support Vector Machines (SVMs) are a family of supervised learning algorithms that
can be used for classification or regression tasks. The idea behind SVMs is to find a decision
boundary, or a hyperplane, that separates the different classes in a dataset. The history of SVMs
can be traced back to the early 1960s with the work of Vladimir Vapnik and Alexey
Chervonenkis on the theory of statistical learning.
6.2 Support Vector Classifier (SVC)
A Support Vector Classifier (SVC) is the SVM formulation used for classification tasks. The basic idea behind SVC is to find a hyperplane that maximizes the margin, or the distance
between the hyperplane and the closest data points from each class. These closest data points
are known as support vectors and play a crucial role in determining the decision boundary. The
goal is to find a hyperplane that separates the different classes in the dataset while also
maximizing the margin.
The SVC algorithm can be used for both linear and non-linear classification problems. For linear classification problems, the decision boundary is a linear hyperplane in the input space. For non-linear classification problems, the decision boundary is non-linear in the input space; in this case, a technique called the kernel trick is used to transform the input data into a higher-dimensional space in which a linear separating hyperplane can be found.
One of the advantages of SVC is that its training reduces to a convex optimization problem, which has a single global minimum. This makes the algorithm robust and largely insensitive to the
choice of initial parameters. Additionally, SVC has the ability to handle high dimensional datasets
and datasets with a large number of features. The SVC algorithm also has the ability to handle
imbalanced datasets, which is a common problem in real-world classification tasks. This can be
done through the use of regularization techniques, such as cost-sensitive learning, and by
adjusting the penalty parameter C.
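The sketch below illustrates this idea with scikit-learn's SVC class; the synthetic dataset and parameter values are assumed purely for illustration. Setting class_weight='balanced' re-weights the penalty C for each class in inverse proportion to its frequency, which is one simple form of cost-sensitive learning.

from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic imbalanced dataset: roughly 90% of the samples belong to class 0
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.9, 0.1], random_state=0)

# Standard SVC: every misclassification is penalised equally
plain = SVC(kernel='rbf', C=1.0).fit(X, y)

# Cost-sensitive SVC: errors on the rare class are penalised more heavily
weighted = SVC(kernel='rbf', C=1.0, class_weight='balanced').fit(X, y)

print(plain.score(X, y), weighted.score(X, y))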
SVC has been used in various fields such as natural language processing, computer vision, and
bioinformatics. In natural language processing, SVC has been used for sentiment analysis, text
classification, and named entity recognition. In computer vision, it has been used for object
recognition, face detection, and image segmentation. In bioinformatics, it has been used for
protein classification, gene expression analysis, and drug discovery.
In summary, Support Vector Classifier (SVC) is a type of supervised learning algorithm that is
used for classification tasks. It works by finding a decision boundary, or a hyperplane, that
separates the different classes in a dataset while also maximizing the margin between the classes.
The SVC algorithm can be used for both linear and non-linear classification problems, and it is
particularly useful for datasets with a large number of features or for datasets that are not linearly
separable. SVC has many advantages, such as good generalization performance, the ability to handle high-dimensional datasets and imbalanced datasets, and robustness to noise and outliers. However, it also has some disadvantages: it is sensitive to the choice of kernel function and parameters, it can be computationally expensive to train, and it may scale poorly to datasets with a very large number of samples.
6.2.1 Limitations of the Perceptron
The perceptron is a simple algorithm for supervised learning of binary classifiers. It is based on the
idea of adjusting the weights and biases of a linear classifier to minimize the classification error.
The perceptron algorithm was first proposed in the late 1950s and was one of the earliest
algorithms used for supervised learning. However, the perceptron algorithm has some limitations
that make it less powerful than other algorithms such as Support Vector Machines (SVMs). One
of the main limitations of perceptron is that it is only able to find linear decision boundaries. This
means that it is not able to classify non-linearly separable datasets, which are common in real-
world problems.
Another limitation of the perceptron algorithm is that it is only able to solve the binary
classification problem. In other words, it can only classify data into two classes. This makes it less
flexible than other algorithms that are able to classify data into multiple classes.
The perceptron algorithm also does not converge to a unique or optimal solution. On linearly separable data it stops at whichever separating hyperplane it reaches first, which depends on the initial weights and the order in which the training samples are presented, and it makes no attempt to maximize the margin; on data that is not linearly separable it does not converge at all. Additionally, the perceptron does not provide a probabilistic output or a margin-based measure of confidence, which makes it less interpretable than algorithms such as SVMs.
SVMs, on the other hand, are a more powerful algorithm that can find both linear and non-linear
decision boundaries. They use a technique called the kernel trick to transform the input data into
a higher dimensional space where a linear decision boundary can be found. This allows SVMs to
classify non-linearly separable datasets, which are common in real-world problems.
SVMs can also handle multi-class classification problems through decomposition schemes such as one-vs-rest and one-vs-one, which makes them more flexible than the perceptron. In addition, they produce a margin-based decision function (and, with calibration techniques such as Platt scaling, probability estimates), which makes their outputs more informative than those of the perceptron.
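A short sketch (using scikit-learn; the synthetic dataset is assumed for illustration) makes the contrast concrete: on two concentric circles, which no straight line can separate, the perceptron performs near chance level while an RBF-kernel SVM separates the classes almost perfectly.

from sklearn.datasets import make_circles
from sklearn.linear_model import Perceptron
from sklearn.svm import SVC

# Two concentric circles: not separable by any linear boundary
X, y = make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=0)

# The perceptron is limited to a linear decision boundary
perc = Perceptron(random_state=0).fit(X, y)

# An RBF-kernel SVM learns a circular boundary via the kernel trick
svm = SVC(kernel='rbf').fit(X, y)

print("Perceptron accuracy:", perc.score(X, y))  # close to 0.5 (chance)
print("RBF SVM accuracy:", svm.score(X, y))      # close to 1.0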
In summary, Perceptron is a simple algorithm for supervised learning of binary classifiers, but it
has some limitations that make it less powerful than other algorithms such as Support Vector
Machines (SVMs). One of the main limitations of perceptron is that it is only able to find linear
decision boundaries, which means that it is not able to classify non-linearly separable datasets.
Another limitation of the perceptron algorithm is that it is only able to solve the binary
classification problem, which makes it less flexible than other algorithms that are able to classify
data into multiple classes. Additionally, the solution found by the perceptron depends on the initial weights and the order of the training samples, and it is not guaranteed to be the maximum-margin hyperplane. Furthermore, the perceptron does not provide a probabilistic output or a margin-based measure of confidence, which makes it less interpretable than algorithms such as SVMs. SVMs, on the other hand, can find both linear and non-linear decision boundaries, handle multi-class classification problems, and provide a margin-based decision function, which makes them both more powerful and more interpretable.
6.3 Hyperplanes
In Support Vector Machines (SVMs), a hyperplane is a decision boundary that separates the
different classes in a dataset. The goal of the SVM algorithm is to find the hyperplane that
maximizes the margin, or the distance between the hyperplane and the closest data points from
each class. These closest data points are known as support vectors and play a crucial role in
determining the decision boundary.
The concept of a hyperplane comes from linear algebra, where a hyperplane is defined as a flat (affine) subspace of one dimension less than the ambient space. In the context of SVMs, the ambient
space is the feature space, and the hyperplane is defined by the equation:
w^T x + b = 0
Where w is the weight vector and b is the bias term. The weight vector represents the normal
vector of the hyperplane, and the bias term represents the distance of the hyperplane from the
origin. The sign of the equation determines which side of the hyperplane a point belongs to.
Points on one side of the hyperplane are classified as one class, while points on the other side
are classified as the other class.
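A minimal numeric sketch of this rule is shown below; the weight vector and bias values are assumed purely for illustration.

import numpy as np

# Illustrative hyperplane in a two-dimensional feature space
w = np.array([1.0, -2.0])   # normal vector of the hyperplane
b = 0.5                     # bias term

def classify(x):
    # The sign of w^T x + b decides which side of the hyperplane x lies on
    return 1 if np.dot(w, x) + b > 0 else -1

print(classify(np.array([3.0, 1.0])))   # 3.0 - 2.0 + 0.5 =  1.5 -> class +1
print(classify(np.array([1.0, 2.0])))   # 1.0 - 4.0 + 0.5 = -2.5 -> class -1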
For linear classification problems, the decision boundary is a linear hyperplane. In this case, the
SVM algorithm finds the optimal weight vector and bias term that maximizes the margin.
However, for non-linear classification problems, the decision boundary is non-linear in the input space. In this case, a technique called the kernel trick is used to transform the input data into a higher dimensional space where a linear decision boundary can be found.
The kernel trick works by mapping the input data into a higher dimensional space, where a linear
decision boundary can be found. The kernel function used in the mapping is chosen based on
the characteristics of the dataset. Commonly used kernel functions include the linear kernel,
polynomial kernel, and radial basis function (RBF) kernel.
The linear kernel is used for datasets that are linearly separable. The polynomial kernel is used
for datasets that are not linearly separable but can be separated by a polynomial decision
boundary. The RBF kernel is used for datasets that are not linearly separable and cannot be
separated by a polynomial decision boundary.
One of the most important properties of a hyperplane in SVM is that the margin, the distance
between the hyperplane and the closest data points from each class, should be maximized. The
margin is a key property of the hyperplane because it provides a measure of the generalization
ability of the classifier. A larger margin results in a classifier that generalizes better to new data,
while a smaller margin results in a classifier that is more likely to overfit the training data.
In summary, in Support Vector Machines (SVMs), a hyperplane is a decision boundary that separates the different classes in a dataset. The goal of the SVM algorithm is to find the hyperplane that maximizes the margin, or the distance between the hyperplane and the closest data points from each class. These closest data points are known as support vectors and play a crucial role in determining the decision boundary. For linear classification problems, the decision boundary is a linear hyperplane, while for non-linear classification problems the decision boundary is non-linear in the input space and is found by using the kernel trick, which maps the input data into a higher dimensional space where a linear decision boundary can be found. The kernel function used in the mapping is chosen based on the characteristics of the dataset. The most important property of a hyperplane in SVM is its margin, the distance between the hyperplane and the closest data points from each class, which should be as large as possible.
6.3.1 Optimal Hyperplanes
In Support Vector Machines (SVMs), the goal is to find the optimal hyperplane that separates the
different classes in a dataset. The optimal hyperplane is the one that maximizes the margin, or
the distance between the hyperplane and the closest data points from each class. These closest
data points are known as support vectors and play a crucial role in determining the decision
boundary.
Finding the optimal hyperplane can be formulated as an optimization problem. The optimization
problem is to minimize the following objective function:
1/2 * (w^T * w)
Subject to the constraints:
y_i * (w^T * x_i + b) >= 1, for all samples i
Where w is the weight vector, b is the bias term, x_i is the feature vector of the ith sample, y_i is
the class label of the ith sample and w^T * w is the squared Euclidean norm of the weight vector.
The optimization problem can be solved using the method of Lagrangian multipliers. The method
of Lagrangian multipliers is used to solve constrained optimization problems by converting the
constraints into a set of Lagrange multipliers. These Lagrange multipliers are then used to adjust
the objective function so that it is minimized subject to the constraints.
In the case of SVMs, one Lagrange multiplier is introduced for each training sample's classification constraint. The resulting Lagrangian combines the margin-maximization objective with the requirement that every training sample be classified correctly (or, in the soft-margin case, be penalized for violating the margin).
The solution to the optimization problem is found by solving the following dual optimization
problem:
maximize L(alpha) = sum(alpha_i) - 1/2 * sum(alpha_i * alpha_j * y_i * y_j * (x_i^T * x_j))
subject to 0 <= alpha_i <= C and sum(alpha_i * y_i) = 0
Where alpha_i is the Lagrange multiplier for the ith sample and C is a regularization parameter.
The regularization parameter is used to control the trade-off between maximizing the margin
and minimizing the classification error.
The optimal hyperplane is found by using the support vectors, which are the samples that
correspond to the non-zero Lagrange multipliers. The weight vector and bias term of the optimal
hyperplane are given by:
w = sum(alpha_i * y_i * x_i)
b = y_i - w^T * x_i (for any i such that 0 < alpha_i < C)
The bias term is computed from any margin support vector, that is, any sample whose Lagrange multiplier satisfies 0 < alpha_i < C. Such samples lie exactly on the margin, so each of them gives the same value of b; in practice the values obtained from all margin support vectors are averaged for numerical stability.
It is important to note that the optimization problem is a convex optimization problem, which
means that there is only one global minimum. This makes the algorithm highly robust and less
sensitive to the choice of initial parameters.
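These relationships can be checked numerically. In the sketch below (using scikit-learn; the small training set is assumed for illustration), SVC exposes the products alpha_i * y_i for the support vectors as dual_coef_, so the weight vector recovered from the support vectors should match the coef_ attribute computed by the library.

import numpy as np
from sklearn.svm import SVC

# Small, linearly separable training set (illustrative values)
X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 1.0],
              [6.0, 5.0], [7.0, 7.5], [8.0, 6.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel='linear', C=1e6).fit(X, y)   # very large C ~ hard margin

# w = sum(alpha_i * y_i * x_i) over the support vectors
w = clf.dual_coef_ @ clf.support_vectors_
print(w, clf.coef_)       # the two should agree
print(clf.intercept_)     # the bias term b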
In summary, in Support Vector Machines (SVMs), the goal is to find the optimal hyperplane that
separates the different classes in a dataset. The optimal hyperplane is the one that maximizes the
margin, or the distance between the hyperplane and the closest data points from each class. The
problem of finding the optimal hyperplane can be formulated as an optimization problem, which
can be solved using the method of Lagrangian multipliers.
6.3.2 Hard and Soft Margins in Support Vector Machines
In Support Vector Machines (SVMs), the goal is to find the optimal hyperplane that separates the
different classes in a dataset while maximizing the margin, or the distance between the
hyperplane and the closest data points from each class. However, in real-world datasets, it is
often not possible to find a hyperplane that separates all the data points perfectly. This is because
the data points may be noisy or overlapping, making it difficult to find a clear decision boundary.
To address this problem, SVMs introduce the concept of hard and soft margins. A hard margin requires that every training point be classified correctly and lie on or outside the margin; only the support vectors lie exactly on the margin, and no violations of any kind are allowed. This is a strict criterion, and for noisy or overlapping data no hyperplane satisfying it may exist.
A soft margin relaxes this criterion by introducing slack variables, which measure how far each data point violates the margin. Some points may lie inside the margin, or even on the wrong side of the decision boundary, but each violation is penalized in the objective function. This allows the algorithm to find a solution in cases where a perfect separation is not possible.
The trade-off between margin width and margin violations is controlled by a regularization parameter C, a positive scalar that weights the total slack in the soft-margin objective:
minimize 1/2 * (w^T * w) + C * sum(xi_i)
subject to y_i * (w^T * x_i + b) >= 1 - xi_i and xi_i >= 0, for all samples i
A small value of C places little weight on the violations, giving a softer, wider margin that tolerates more misclassifications. A large value of C penalizes violations heavily, approaching the hard-margin case with fewer misclassifications but a narrower margin.
The optimization problem for the soft margin is a convex optimization problem, which means
that there is only one global minimum. This makes the algorithm highly robust and less sensitive
to the choice of initial parameters.
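The sketch below (using scikit-learn; the synthetic data and the particular C values are assumed for illustration) shows this trade-off: as C grows, fewer points are allowed to violate the margin, so the number of support vectors typically drops.

from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Noisy, partially overlapping two-class data
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, flip_y=0.1, random_state=1)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    # Smaller C -> softer, wider margin -> more support vectors
    print("C =", C,
          "support vectors =", clf.n_support_.sum(),
          "training accuracy =", round(clf.score(X, y), 2))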
In summary, in Support Vector Machines (SVMs), the goal is to find the optimal hyperplane that separates the different classes in a dataset while maximizing the margin, or the distance between the hyperplane and the closest data points from each class. However, in real-world datasets, it is often not possible to find a hyperplane that separates all the data points perfectly. To address this problem, SVMs introduce the concept of hard and soft margins. A hard margin requires every training point to be correctly classified and to lie on or outside the margin, while a soft margin relaxes this criterion by allowing penalized margin violations. The trade-off between
hard and soft margins is controlled by a regularization parameter C, which is a positive scalar
that determines the trade-off between maximizing the margin and minimizing the classification
error. The optimization problem for the soft margin is a convex optimization problem, which
means that there is only one global minimum.
6.4 SVM Kernel
Support Vector Machines (SVMs) are a type of supervised learning algorithm that can be used
for classification and regression tasks. They are particularly useful for problems with a large
number of features or when the data is not linearly separable. One of the key features of SVMs
is the use of a kernel function, which allows the algorithm to transform the input data into a
higher-dimensional space, where it becomes easier to separate or classify.
A kernel function, also known as a kernel, is a function that takes two inputs and produces a
scalar output. The kernel function is used to measure the similarity between inputs, and it can be
thought of as a dot product of the input vectors in a higher-dimensional space. The
dimensionality of this space may be infinite, and it is known as the feature space. There are several
types of kernel functions that can be used with SVMs. The most common ones are linear,
polynomial, and radial basis function (RBF) kernels.
The linear kernel is the simplest kernel function, and it corresponds to a dot product of the input
vectors in the original input space. It is used when the data is linearly separable.
The polynomial kernel is a more complex kernel function that is used when the data is not
linearly separable. It corresponds to a dot product of the input vectors in a higher-dimensional
space, where the input vectors are raised to a power. The degree of the polynomial is a parameter
that can be adjusted.
The RBF kernel is the most widely used kernel function in SVMs. It corresponds to a dot product
of the input vectors in a higher-dimensional space, where the input vectors are transformed by
a radial basis function. The RBF kernel is particularly useful for non-linear classification and
regression tasks.
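The three kernels can be written down directly as similarity functions. A minimal sketch follows; the gamma, degree and coef0 values are illustrative defaults, not prescriptions.

import numpy as np

def linear_kernel(x, z):
    # Dot product in the original input space
    return np.dot(x, z)

def polynomial_kernel(x, z, degree=3, coef0=1.0):
    # Equivalent to a dot product after an (implicit) polynomial feature map
    return (np.dot(x, z) + coef0) ** degree

def rbf_kernel(x, z, gamma=0.5):
    # Similarity decays with squared distance; the implicit feature space
    # is infinite-dimensional
    return np.exp(-gamma * np.sum((x - z) ** 2))

x, z = np.array([1.0, 2.0]), np.array([2.0, 1.0])
print(linear_kernel(x, z), polynomial_kernel(x, z), rbf_kernel(x, z))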
The choice of kernel function depends on the problem and the data. The linear kernel is the
simplest and the most efficient, but it can only be used when the data is linearly separable. The
polynomial and RBF kernels are more powerful, but they are also more computationally
expensive. The kernel function plays a crucial role in the SVM algorithm as it allows the algorithm
to transform the input data into a higher-dimensional space, where it becomes easier to separate
or classify. It also provides a way to measure the similarity between inputs, which is crucial for
classification and regression tasks.
The kernel trick is a computational trick that allows the SVM algorithm to operate in the high-
dimensional feature space without explicitly computing the coordinates of the input data in that
space. Instead, the algorithm only needs to compute the inner products between the input
vectors in the feature space, which can be done using the kernel function. This is significant
because it allows the SVM algorithm to handle non-linear problems and a large number of
features without incurring a significant computational cost.
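The saving can be seen in a small sketch: for a degree-2 polynomial kernel, computing the kernel value directly in the input space gives exactly the same number as first mapping both points into the higher-dimensional feature space and then taking a dot product there. The example points are assumed for illustration.

import numpy as np

def phi(v):
    # Explicit degree-2 feature map for a 2-D input:
    # (v1, v2) -> (v1^2, v2^2, sqrt(2)*v1*v2)
    return np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2) * v[0] * v[1]])

x, z = np.array([1.0, 2.0]), np.array([3.0, 4.0])

explicit = np.dot(phi(x), phi(z))   # dot product after the explicit mapping
kernel = np.dot(x, z) ** 2          # kernel trick: (x . z)^2, no mapping needed

print(explicit, kernel)             # both equal 121.0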
In conclusion, Support Vector Machines (SVMs) are powerful supervised learning algorithms that can be used for classification and regression tasks. The use of a kernel function is one of their key features: it allows the algorithm to transform the input data into a higher-dimensional space, where it becomes easier to separate or classify. There are several types of kernel functions
that can be used with SVMs, such as linear, polynomial, and radial basis function (RBF) kernels.
The choice of kernel function depends on the problem and the data. The kernel trick allows the
SVM algorithm to operate in the high-dimensional feature space without explicitly computing
the coordinates of the input data in that space. This makes the algorithm computationally
efficient and powerful.
6.5 Mercer's Theorem
Mercer's theorem, named after the British mathematician John Mercer, is a fundamental result in
the field of machine learning and functional analysis. It states that any symmetric positive definite
kernel on a set can be represented as an inner product in some feature space. This means that
any kernel function can be written as a dot product of the input vectors in a higher dimensional
space, where the dimensionality of this space may be infinite.
A kernel function, also known as a kernel, is a function that takes two inputs and produces a
scalar output. It is often used in machine learning algorithms such as support vector machines
(SVMs) and Gaussian processes (GPs) to measure the similarity between inputs. The kernel
function can be used to map the input data from its original space to a higher-dimensional space,
where it becomes easier to separate or classify.
Mercer's theorem states that any symmetric positive semi-definite kernel can be represented as an inner product in some feature space. A symmetric positive semi-definite kernel is a kernel function that satisfies two properties. Symmetry means that k(x, z) = k(z, x) for all inputs x and z. Positive semi-definiteness means that for any finite set of inputs x_1, ..., x_n and any coefficients c_1, ..., c_n, the quadratic form sum_i sum_j c_i * c_j * k(x_i, x_j) is non-negative; equivalently, every Gram matrix built from the kernel has no negative eigenvalues. In other words, Mercer's theorem states that any kernel function satisfying these properties can be represented as a dot product of the input vectors in a higher dimensional space. The dimensionality of this space may be infinite, and it is known as the feature space.
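This property can be checked numerically for a given kernel and a given set of points: build the Gram matrix of pairwise kernel values and confirm that none of its eigenvalues is negative. The points below are assumed for illustration.

import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

# A few arbitrary input points
points = np.array([[0.0, 0.0], [1.0, 0.5], [2.0, 2.0], [3.0, 1.0]])

# Gram matrix K[i, j] = k(x_i, x_j), symmetric by construction
K = np.array([[rbf_kernel(a, b) for b in points] for a in points])

# For a Mercer kernel the Gram matrix is positive semi-definite,
# so all eigenvalues are non-negative (up to numerical round-off)
print(np.linalg.eigvalsh(K))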
The significance of Mercer's theorem is that it allows us to understand kernel-based methods,
such as SVMs and GPs, as methods that operate in a high-dimensional feature space, rather than
directly on the input space. This is important because it allows us to use kernel functions that are
not directly computable in the original input space, such as the radial basis function (RBF) kernel,
which is widely used in SVMs.
Furthermore, Mercer's theorem also provides a theoretical foundation for the use of kernel-based
methods. It shows that these methods are mathematically well-defined and that they can be
understood as linear methods in a high-dimensional feature space. This means that kernel-based
methods have the same expressiveness as linear methods, but they are able to handle non-linear
problems by mapping the input data to a higher-dimensional space.
In conclusion, Mercer's theorem is a fundamental result in machine learning and functional
analysis that states that any symmetric positive definite kernel on a set can be represented as an
inner product in some feature space. This allows us to understand kernel-based methods, such
as SVMs and GPs, as methods that operate in a high-dimensional feature space, rather than
directly on the input space. It also provides a theoretical foundation for the use of kernel-based
methods and shows that they are mathematically well-defined and can handle non-linear
problems.
6.6 Case Study: Support Vector Machines / Support Vector Classifiers
Support Vector Machine (SVM) classifiers are a type of supervised learning algorithm that can be
used for binary and multi-class classification problems. They are particularly useful for problems
with a large number of features or when the data is not linearly separable. The key idea behind
SVM classifiers is to find a hyperplane (also known as a decision boundary) that separates the
different classes of data in the feature space.
The basic idea behind an SVM classifier is to find a hyperplane in the feature space that separates
the different classes of data with the maximum margin. The margin is the distance between the
hyperplane and the closest data points from each class, also known as support vectors. The goal
is to find a hyperplane that maximizes the margin, which means that it maximizes the distance
between the closest data points from each class. This ensures that the classifier is less sensitive
to outliers and less likely to overfit the training data.
SVMs are based on the concept of structural risk minimization, where the goal is to find a
hyperplane that maximizes the margin while also minimizing the number of misclassifications.
This is achieved by introducing slack variables, which allow for misclassifications to occur but at
a cost. The cost is controlled by a parameter called the regularization parameter, which
determines the trade-off between maximizing the margin and minimizing the number of
misclassifications.
One of the key features of SVM classifiers is the use of a kernel function, which allows the
algorithm to transform the input data into a higher-dimensional space, where it becomes easier
to separate or classify. There are several types of kernel functions that can be used with SVM
classifiers, such as linear, polynomial, and radial basis function (RBF) kernels.
The linear kernel is the simplest kernel function and corresponds to a dot product of the input
vectors in the original input space. It is used when the data is linearly separable. The polynomial
kernel is a more complex kernel function that is used when the data is not linearly separable. It
corresponds to a dot product of the input vectors in a higher-dimensional space, where the input
vectors are raised to a power. The degree of the polynomial is a parameter that can be adjusted.
The RBF kernel is the most widely used kernel function in SVM classifiers. It corresponds to a dot product of the input vectors in a higher-dimensional space, where the input vectors are transformed by a radial basis function, and it is particularly useful for non-linear classification tasks.
SVMs can also handle multi-class classification problems. The one-vs-all (also called one-vs-rest) method is one of the most common approaches: one binary classifier is trained for each class, distinguishing that class from all the others. Another approach is the one-vs-one method, in which a binary classifier is trained for every pair of classes and the class that wins the most pairwise votes is chosen as the final prediction.
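Both strategies are available in scikit-learn; the sketch below (with the Iris dataset chosen purely for illustration) wraps a binary SVC in one-vs-rest and one-vs-one meta-classifiers.

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# One-vs-rest: one binary SVM per class (3 classifiers for 3 classes)
ovr = OneVsRestClassifier(SVC(kernel='rbf')).fit(X, y)

# One-vs-one: one binary SVM per pair of classes (n*(n-1)/2 classifiers);
# the class that wins the most pairwise votes is the final prediction
ovo = OneVsOneClassifier(SVC(kernel='rbf')).fit(X, y)

print(len(ovr.estimators_), len(ovo.estimators_))   # 3 and 3 for Iris
print(ovr.score(X, y), ovo.score(X, y))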
SVMs are powerful classifiers, especially in high-dimensional feature spaces, although training can become expensive for very large datasets. They can also be sensitive to the choice of kernel function and the
regularization parameter. Additionally, the performance of an SVM classifier can be affected by
the presence of outliers and noisy data. Therefore, it is important to preprocess the data and
carefully select the kernel function and regularization parameter before training the model.
In conclusion, Support Vector Machine (SVM) classifiers are a type of supervised learning
algorithm that can be used for binary and multi-class classification problems. They are particularly
useful for problems with a large number of features.
Linear SVM for Binary Classification – Code

from sklearn.svm import SVC

# Illustrative training data: two features, two classes
X = [[0, 0], [1, 1], [2, 2], [3, 3]]
y = [0, 0, 1, 1]

# Create and fit a linear-kernel SVM classifier
clf = SVC(kernel='linear')
clf.fit(X, y)

# Make predictions
predictions = clf.predict([[2.5, 2.5], [3, 3.5]])
print(predictions)  # [1, 1]

Non-Linear SVM for Multiclass Classification – Code

from sklearn.datasets import load_iris
from sklearn.svm import SVC

# Load the three-class Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Create and fit a non-linear SVM classifier with an RBF kernel
clf = SVC(kernel='rbf')
clf.fit(X, y)

# Make predictions on one sample from class 0 and one from class 2
predictions = clf.predict([X[0], X[100]])
print(predictions)  # [0, 2]

Linear SVM for Regression (SVR) – Code

from sklearn.svm import SVR

# Illustrative one-dimensional regression data
X = [[1], [2], [3], [4], [5]]
y = [1.1, 2.0, 2.9, 4.2, 5.0]

# Create and fit the model
reg = SVR(kernel='linear')
reg.fit(X, y)

# Make predictions
predictions = reg.predict([[6]])
print(predictions)
In the above examples, the first is a simple binary classification problem using a linear kernel, the second is a multiclass classification problem using an RBF (Radial Basis Function) kernel, and the third is a simple regression problem using Support Vector Regression (SVR) with a linear kernel.
Conclusion
Support Vector Machines (SVMs) are a powerful and widely used technique for classification in
data mining. They are particularly well-suited for high-dimensional data and are able to handle
both binary and multi-class classification tasks. The basic idea behind SVMs is to find a
hyperplane that separates the data into different classes, in such a way that the margin, or the
distance between the hyperplane and the closest data points from each class, is maximized. This
maximization of the margin allows SVMs to achieve a good balance between the accuracy of the
classification and the robustness of the model. One of the key advantages of SVMs is that they
are able to handle non-linearly separable data by using kernel functions. These functions map
the data into a higher-dimensional space, where it becomes possible to find a linear boundary.
Some of the most commonly used kernel functions include the linear, polynomial, and radial
basis function (RBF) kernels. SVMs also have the ability to handle imbalanced data sets, where
one class is under-represented in comparison to the other. One approach to handle this is to
adjust the cost parameter which allows the model to give more weight to the minority class.
Another important aspect of SVMs is the choice of an appropriate kernel function and its parameters, which can be determined through techniques such as cross-validation, grid search, and resampling methods such as the bootstrap. Despite their strengths, SVMs have some limitations. One of the main limitations is that they can be sensitive to the choice of kernel function and its parameters. Additionally, SVMs can be computationally intensive, especially for large datasets. Overall, SVMs
are a valuable tool for classification in data mining. They provide a good balance between
accuracy and robustness, and are able to handle non-linearly separable data and imbalanced
data sets. With the proper choice of kernel function and its parameters, SVMs can achieve high
performance for a wide range of data mining tasks.
Summary
Support Vector Machines (SVMs) are a type of supervised learning algorithm used for
classification and regression tasks.
SVMs find the best boundary (or "hyperplane") that separates different classes in the
data.
SVMs are particularly effective in high-dimensional spaces and when the number of
features exceeds the number of samples.
SVMs are sensitive to the choice of kernel and the values of the parameters C and
gamma.
SVMs have been successfully applied to a variety of fields, including natural language
processing, image and speech recognition, and bioinformatics.
SVMs can handle non-linearly separable data by using the kernel trick, which projects the data into a higher-dimensional space where it becomes linearly separable.
SVMs are robust to outliers and can work well with unbalanced datasets, by using a
technique called "Cost-sensitive learning" or "Cost-sensitive SVM" which assigns different
misclassification costs to different classes.
Self-Assessment Questions
1. Explain the concept of structural risk minimization and how it relates to Support Vector
Machines (SVMs).
2. Describe the role of the kernel function in SVMs and explain how it allows the algorithm to
handle non-linear problems.
3. Discuss the different types of kernel functions that can be used with SVMs, including linear,
polynomial, and radial basis function (RBF) kernels.
4. Explain the concept of the margin and the support vectors in SVMs and how they are used to
find the decision boundary.
5. Analyze the strengths and limitations of SVMs and discuss potential challenges that may arise
when using them for real-world classification tasks.
Book chapters
1. "Support Vector Machines" Chapter 9, in "Introduction to Statistical Learning" by Gareth
James, Daniela Witten, Trevor Hastie and Robert Tibshirani. This chapter provides a
comprehensive introduction to support vector machines, including the theory behind them
and how to construct and interpret them.
2. "Support Vector Machines" Chapter 14, in "Data Mining: Concepts and Techniques" by Jiawei
Han, Micheline Kamber, and Jian Pei. This chapter provides a detailed overview of support
vector machines, including the theory, algorithms, and applications of support vector
machines.
3. "Support Vector Machines" Chapter 9, in "Applied Predictive Modeling" by Max Kuhn and
Kjell Johnson. This chapter provides an introduction to support vector machines, including
how to construct and interpret them, and provides practical examples on how to implement
them in R.
4. "Support Vector Machines" Chapter 8, in "Python Machine Learning" by Sebastian Raschka
and Vahid Mirjalili. This chapter provides an introduction to support vector machines and
their implementation in Python using the scikit-learn library.
5. "Support Vector Machines" Chapter 7, in "The Hundred-Page Machine Learning Book" by
Andriy Burkov. This chapter provides an introduction to support vector machines, including
the theory and algorithms behind them, and discusses various applications and performance
evaluation techniques.