UNIT V
Machine Learning
What is Machine Learning?
In 1959, Arthur Samuel, a computer scientist who pioneered the
study of artificial intelligence, described machine learning as “the
study that gives computers the ability to learn without being
explicitly programmed.”
Alan Turing’s seminal paper (Turing, 1950) introduced a
benchmark standard for demonstrating machine intelligence, such
that a machine has to be intelligent and responsive in a manner that
cannot be differentiated from that of a human being.
Machine Learning is an application of artificial
intelligence where a computer/machine learns
from the past experiences (input data) and
makes future predictions. The performance of
such a system should be at least human level.
A more technical definition was given by Tom M. Mitchell (1997): “A
computer program is said to learn from experience E with respect to
some class of tasks T and performance measure P, if its performance
at tasks in T, as measured by P, improves with experience E.”
Example:
A handwriting recognition learning problem:
Task T: recognizing and classifying handwritten words within images
Performance measure P: percent of words correctly classified (accuracy)
Training experience E: a data-set of handwritten words with
given classifications
In order to perform the task T, the system learns from the data-set
provided. A data-set is a collection of many examples. An example is
a collection of features.
Machine Learning Categories
Machine Learning is generally categorized into three types:
Supervised Learning, Unsupervised Learning, and Reinforcement
Learning.
Supervised Learning:
In supervised learning the machine experiences the examples along
with the labels or targets for each example. The labels in the data
help the algorithm relate the features to the desired output.
Two of the most common supervised machine learning tasks
are classification and regression.
In classification problems the machine must learn to predict
discrete values. That is, the machine must predict the most
probable category, class, or label for new examples.
Applications of classification include predicting whether a
stock's price will rise or fall, or deciding if a news article
belongs to the politics or leisure section. In regression
problems the machine must predict the value of a continuous
response variable. Examples of regression problems include
predicting the sales for a new product, or the salary for a job
based on its description.
Unsupervised Learning:
When we have unclassified and unlabeled data, the system attempts
to uncover patterns from the data. There is no label or target given
for the examples. One common task is to group similar examples
together, which is called clustering.
Reinforcement Learning:
Reinforcement learning refers to goal-oriented algorithms, which
learn how to attain a complex objective (goal) or maximize along a
particular dimension over many steps. This method allows machines
and software agents to automatically determine the ideal behavior
within a specific context in order to maximize its performance.
Simple reward feedback is required for the agent to learn which
action is best; this is known as the reinforcement signal. For
example, maximize the points won in a game over many moves.
Techniques of Supervised Machine Learning
Regression is a technique used to predict the value of a response
(dependent) variable from one or more predictor (independent)
variables.
The most commonly used regression techniques are Linear
Regression and Logistic Regression. We will discuss the theory
behind these two techniques, along with other key concepts in
machine learning: the gradient-descent algorithm, over-fitting and
under-fitting, error analysis, regularization, hyper-parameters, and
cross-validation.
Linear Regression
In linear regression problems, the goal is to predict a real-valued
variable y from a given pattern X. In the case of linear regression the
output is a linear function of the input. Let ŷ be the output our model
predicts: ŷ = WX + b
Here X is a vector (features of an example), W are the weights (vector
of parameters) that determine how each feature affects the
prediction, and b is the bias term. So our task T is to predict y from X;
now we need a performance measure P to know how well the model
performs.
Now to calculate the performance of the model, we first calculate the
error of each example i as:
$$e_i = |y_i - \hat{y}_i|$$
We take the absolute value of the error to take into account both
positive and negative values of error.
Finally we calculate the mean of all recorded absolute errors over
the m examples:
$$\text{MAE} = \frac{1}{m} \sum_{i=1}^{m} |y_i - \hat{y}_i|$$
A more popular way of measuring model performance is the
Mean Squared Error (MSE): the average of squared differences
between prediction and actual observation.
$$\text{MSE} = \frac{1}{2m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2$$
The mean is halved (1/2) as a convenience for the computation of the
gradient descent [discussed later], as the derivative of the
square function will cancel out the 1/2 term. For more discussion on
MAE vs MSE please refer to [1] and [2].
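As an illustration of the two error measures, here is a minimal Python sketch. The function names and the sample values are hypothetical, chosen only to make the arithmetic easy to follow; the squared-error version uses the halved (1/2m) convention described above.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |y_i - y_hat_i| over all examples."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def halved_mean_squared_error(y_true, y_pred):
    """Average of squared errors, halved (the 1/2m convention from the text)."""
    m = len(y_true)
    return sum((p - y) ** 2 for y, p in zip(y_true, y_pred)) / (2 * m)


# Hypothetical targets and predictions, for illustration only.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 12.0]

mae = mean_absolute_error(y_true, y_pred)
mse_half = halved_mean_squared_error(y_true, y_pred)
print(mae, mse_half)
```

The single large error (3 on the last example) contributes 9 of the 11 units of squared error but only 3 of the 5 units of absolute error, which is why MSE is more sensitive to outliers than MAE.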
The main aim of training the ML algorithm is to
adjust the weights W to reduce the MAE or MSE.
To minimize the error, the model updates its parameters W as it
experiences the examples of the training set. The error, viewed as a
function of W, is called the cost function J(w), since it determines
the cost/penalty of the model. Minimizing the error is therefore the
same as minimizing the cost function J.
Gradient descent Algorithm:
Plotting the cost function J(w) against w gives a curve like the one below:
As we see from the curve, there exists a value of the parameters W which
has the minimum cost Jmin. Now we need to find a way to reach this
minimum cost.
In the gradient descent algorithm, we start with random model
parameters, calculate the error in each learning iteration, and keep
updating the model parameters to move closer to the values that
result in minimum cost.
repeat until minimum cost: {
$$w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(w)$$
}
In the above equation we are updating the model parameters after
each iteration. The second term of the equation calculates the slope
or gradient of the curve at each iteration.
The gradient of the cost function is calculated as the partial derivative
of the cost function J with respect to each model parameter wj, where j
runs over the features [1 to n]. α (alpha) is the learning rate, or
how quickly we want to move towards the minimum. If α is too large,
we can overshoot the minimum. If α is too small, each step is small and
the model needs many more iterations to reach the minimum.
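The update rule can be sketched in Python for the simplest case, a single feature with ŷ = w·x + b and the halved mean squared error as the cost. This is a minimal sketch under stated assumptions: the function name, the data, the learning rate and the iteration count are all hypothetical, and the parameters start at zero rather than at random values to keep the result reproducible.

```python
def gradient_descent(xs, ys, alpha=0.05, iterations=2000):
    """Batch gradient descent for y_hat = w * x + b with cost (1/2m) * sum((y_hat - y)^2)."""
    w, b = 0.0, 0.0  # fixed start (the text starts from random values)
    m = len(xs)
    for _ in range(iterations):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of the cost with respect to w and b.
        grad_w = sum(e * x for e, x in zip(errors, xs)) / m
        grad_b = sum(errors) / m
        # Step against the gradient, scaled by the learning rate alpha.
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b


# Hypothetical noise-free data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = gradient_descent(xs, ys)
print(w, b)
```

On this data the parameters approach w = 2 and b = 1. A much larger alpha makes the updates diverge on this data, while a much smaller one needs many more iterations to get equally close.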
There are three ways of doing gradient descent:
Batch gradient descent: Uses all of the training instances to
update the model parameters in each iteration.
Mini-batch Gradient Descent: Instead of using all examples,
mini-batch gradient descent divides the training set into smaller
subsets called mini-batches, each of size ‘b’. One mini-batch is used
to update the model parameters in each iteration.
Stochastic Gradient Descent (SGD): updates the parameters
using only a single training instance in each iteration. The training
instance is usually selected randomly. Stochastic gradient descent is
often preferred to optimize cost functions when there are hundreds
of thousands of training instances or more, as it will converge more
quickly than batch gradient descent [3].
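The three variants differ only in how many examples feed each update, so one sketch with a batch-size parameter can cover them all. The code below is an illustrative assumption, not a reference implementation: the function name, data and settings are hypothetical. A batch size equal to the number of examples gives batch gradient descent, a batch size of 1 gives stochastic gradient descent, and anything in between gives mini-batch gradient descent.

```python
import random


def minibatch_gradient_descent(xs, ys, batch_size, alpha=0.05, epochs=2000, seed=0):
    """Fit y_hat = w * x + b, updating the parameters once per mini-batch."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errors = [w * xs[i] + b - ys[i] for i in batch]
            grad_w = sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            grad_b = sum(errors) / len(batch)
            w -= alpha * grad_w
            b -= alpha * grad_b
    return w, b


# Hypothetical noise-free data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

batch_fit = minibatch_gradient_descent(xs, ys, batch_size=len(xs))  # batch
mini_fit = minibatch_gradient_descent(xs, ys, batch_size=2)         # mini-batch
sgd_fit = minibatch_gradient_descent(xs, ys, batch_size=1)          # stochastic
print(batch_fit, mini_fit, sgd_fit)
```

Because this toy data has no noise, all three variants settle on the same parameters. On real, noisy data the stochastic and mini-batch updates fluctuate around the minimum rather than landing on it exactly.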
Logistic Regression
In some problems the response variable is not normally distributed.
For instance, a coin toss can result in two outcomes: heads or tails.
The Bernoulli distribution describes the probability distribution of a
random variable that can take the positive case with probability P or
the negative case with probability 1-P. If the response variable
represents a probability, it must be constrained to the range [0, 1].
In logistic regression, the response variable describes the probability
that the outcome is the positive case. If the response variable is
equal to or exceeds a discrimination threshold, the positive class is
predicted; otherwise, the negative class is predicted.
The response variable is modeled as a function of a linear
combination of the input variables using the logistic function.
Since our hypothesis ŷ has to satisfy 0 ≤ ŷ ≤ 1, this can be
accomplished by plugging the linear combination z into the logistic
function, or “Sigmoid Function”:
$$g(z) = \frac{1}{1 + e^{-z}}$$
The function g(z) maps any real number to the (0, 1) interval,
making it useful for transforming an arbitrary-valued function into a
function better suited for classification. The following is a plot of the
value of the sigmoid function for the range [-6, 6]:
Now coming back to our logistic regression problem, let us assume
that z is a linear function of a single explanatory variable x. We can
then express z as follows:
$$z = wx + b$$
And the logistic function can now be written as:
$$g(x) = \frac{1}{1 + e^{-(wx + b)}}$$
Note that g(x) is interpreted as the probability that the dependent
variable equals 1.
g(x) = 0.7, gives us a probability of 70% that our output is 1. Our
probability that our prediction is 0 is just the complement of our
probability that it is 1 (e.g. if probability that it is 1 is 70%, then the
probability that it is 0 is 30%).
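A short Python sketch may help tie these pieces together: the sigmoid function, the probability interpretation, and the discrimination threshold. The weight and bias values are hypothetical and were picked only so that the numbers are easy to check by hand.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(x, w, b):
    """Probability that the output is 1, with z = w * x + b."""
    return sigmoid(w * x + b)


def predict_class(x, w, b, threshold=0.5):
    """Positive class if the probability equals or exceeds the threshold."""
    return 1 if predict_probability(x, w, b) >= threshold else 0


# Hypothetical parameters, not fitted to any data.
w, b = 2.0, -4.0

for x in (1.0, 2.0, 3.0):
    p_one = predict_probability(x, w, b)
    p_zero = 1.0 - p_one  # complement: probability that the output is 0
    print(x, round(p_one, 3), round(p_zero, 3), predict_class(x, w, b))
```

With these parameters x = 2 gives z = 0, which the sigmoid maps to exactly 0.5, so it sits on the decision boundary and is assigned to the positive class under the "equal to or exceeds" rule.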
The input to the sigmoid function ‘g’ doesn’t need to be a linear
function of the features. It can include non-linear terms, so that the
resulting decision boundary is a circle or any other shape.
Cost Function
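We cannot use the same cost function that we used for linear
regression because the Sigmoid Function makes the squared-error cost
wavy as a function of the parameters, causing many local optima. In
other words, it will not be a convex function.
[Figure: non-convex cost function]
In order to ensure the cost function is convex (and therefore ensure
convergence to the global minimum), the cost function is
transformed using the logarithm of the sigmoid function. The cost
for a single example looks like:
$$\text{Cost}(\hat{y}, y) = \begin{cases} -\log(\hat{y}) & \text{if } y = 1 \\ -\log(1 - \hat{y}) & \text{if } y = 0 \end{cases}$$
Which can be written as:
$$\text{Cost}(\hat{y}, y) = -y \log(\hat{y}) - (1 - y) \log(1 - \hat{y})$$
So the cost function for logistic regression is:
$$J(w) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$$
Since the cost function is a convex function, we can run the gradient
descent algorithm to find the minimum cost.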
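As a rough sketch of how this fits together, the code below evaluates the logarithmic cost and minimizes it with batch gradient descent for a single feature. It assumes the standard cross-entropy form of the cost, whose gradient with respect to w is the average of (ŷ − y)·x. The function names, data and settings are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(w, b, xs, ys):
    """Average of -[y * log(p) + (1 - y) * log(1 - p)] with p = sigmoid(w * x + b)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(xs)


def fit_logistic(xs, ys, alpha=0.1, iterations=5000):
    """Batch gradient descent on the logistic regression cost."""
    w, b = 0.0, 0.0
    m = len(xs)
    for _ in range(iterations):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= alpha * sum(e * x for e, x in zip(errors, xs)) / m
        b -= alpha * sum(errors) / m
    return w, b


# Hypothetical data; the classes overlap (x = 2.0 is positive, x = 2.5 is
# negative), so the fitted weights stay finite.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

initial_cost = log_loss(0.0, 0.0, xs, ys)
w, b = fit_logistic(xs, ys)
final_cost = log_loss(w, b, xs, ys)
print(initial_cost, final_cost, w, b)
```

With all parameters at zero every prediction is 0.5, so the starting cost is log 2; training lowers it from there.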
Under-fitting & Over-fitting
We try to make the machine learning algorithm fit the input data by
increasing or decreasing the model's capacity. In linear regression
problems, we increase or decrease the degree of the polynomial.
Consider the problem of predicting y from x ∈ R. The leftmost figure
below shows the result of fitting a line to a data-set. Since the data
doesn’t lie on a straight line, the fit is not very good (left side figure).
To increase model capacity, we add another feature by adding the
term x² to the model. This produces a better fit (middle figure). But if we
keep on doing so (x⁵, 5th order polynomial, figure on the right side),
we may fit the training data even better, but the model will not
generalize well to new data. The first figure represents under-fitting
and the last figure represents over-fitting.
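The effect of model capacity can be explored with a small NumPy sketch that fits polynomials of degree 1, 2 and 5 to the same points. The data here is hypothetical (six noisy samples of a quadratic trend), and `np.polyfit` is used as a convenient least-squares fitter rather than as the method behind the figures.

```python
import numpy as np

# Hypothetical data: six noisy samples of a quadratic trend.
rng = np.random.default_rng(0)
x_train = np.linspace(-1.0, 1.0, 6)
y_train = x_train ** 2 + rng.normal(0.0, 0.05, size=x_train.shape)

# New points from the same noise-free trend, used to check generalization.
x_new = np.linspace(-0.9, 0.9, 7)
y_new = x_new ** 2

train_errors = {}
new_errors = {}
for degree in (1, 2, 5):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    train_errors[degree] = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    new_errors[degree] = float(np.mean((np.polyval(coeffs, x_new) - y_new) ** 2))
    print(degree, train_errors[degree], new_errors[degree])
```

The training error can only go down as the degree rises, and the degree-5 polynomial passes through all six training points. The error on the new points is the number to watch: a model that reproduces the training noise has no reason to do well there.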
Under-fitting:
The model has too few features (too little capacity) and hence is not
able to learn from the data very well. Such a model has high bias.
Over-fitting:
The model uses complex functions and hence is able to fit the
training data very well, but it is not able to generalize to new data.
Such a model has high variance.
There are three main options to address the issue of over-fitting:
1. Reduce the number of features: Manually select which
features to keep. Doing so, we may miss some important
information, if we throw away some features.
2. Regularization: Keep all the features, but reduce the
magnitude of the weights W. Regularization works well when we
have a lot of slightly useful features.
3. Early stopping: When we are training a learning algorithm
iteratively such as using gradient descent, we can measure how
well each iteration of the model performs. Up to a certain
number of iterations, each iteration improves the model. After
that point, however, the model’s ability to generalize can weaken
as it begins to over-fit the training data.
Regularization
Regularization can be applied to both linear and logistic regression
by adding a penalty term to the error function in order to discourage
the coefficients or weights from reaching large values.
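One common form of such a penalty, shown here as an illustrative assumption rather than the only option, is the L2 (ridge) penalty: the squared weights are added to the halved mean squared error, scaled by a regularization parameter λ. The symbols m, n, ŷ and w follow the earlier linear regression section.

```latex
J(w) = \frac{1}{2m} \left[ \sum_{i=1}^{m} (\hat{y}_i - y_i)^2 + \lambda \sum_{j=1}^{n} w_j^2 \right]
```

With λ = 0 this reduces to the unregularized cost. Larger values of λ push the weights towards zero, which trades a slightly worse fit on the training data for a smoother model.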
Hyper-parameters
Hyper-parameters are “higher-level” parameters that describe
structural information about a model and must be decided before
fitting the model parameters. Examples of hyper-parameters we
have discussed so far are the learning rate (alpha) and the
regularization parameter (lambda).
Cross-Validation
The process of selecting the optimal values of hyper-parameters is
called model selection. If we reuse the same test data-set over and
over again during model selection, it effectively becomes part of our
training data, and the model will be more likely to over-fit.
The overall data set is divided into:
1. the training data set
2. validation data set
3. test data set.
The training set is used to fit the different models, and the
performance on the validation set is then used for the model
selection. The advantage of keeping a test set that the model hasn’t
seen before during the training and model selection steps is that we
avoid over-fitting the model and the model is able to better
generalize to unseen data.
In many applications, however, the supply of data for training and
testing will be limited, and in order to build good models, we wish to
use as much of the available data as possible for training. However,
if the validation set is small, it will give a relatively noisy estimate of
predictive performance. One solution to this dilemma is to use
cross-validation, which is illustrated in Figure below.
The cross-validation steps below are taken from an external source and
are included here for completeness.
Cross-Validation Step-by-Step:
These are the steps for selecting hyper-parameters using K-fold
cross-validation:
1. Split your training data into K = 4 equal parts, or “folds.”
2. Choose a set of hyper-parameters you wish to evaluate.
3. Train your model with that set of hyper-parameters on the first 3
folds.
4. Evaluate it on the 4th fold, or the “hold-out” fold.
5. Repeat steps (3) and (4) K (4) times with the same set of
hyper-parameters, each time holding out a different fold.
6. Aggregate the performance across all 4 folds. This is your
performance metric for the set of hyper-parameters.
7. Repeat steps (2) to (6) for all sets of hyper-parameters you wish
to consider.
Cross-validation allows us to tune hyper-parameters with only our
training set. This allows us to keep the test set as a truly unseen
data-set for evaluating the final model.
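The steps above can be sketched in plain Python. Everything specific here is a hypothetical stand-in: the model is a one-parameter line through the origin with an L2 penalty `lam` as its only hyper-parameter, the folds are taken in order without shuffling, and the data is made up.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of nearly equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_slope(xs, ys, lam):
    """Hypothetical model: y_hat = w * x, minimizing squared error + lam * w^2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def cross_validate(xs, ys, lam, k=4):
    """Average hold-out mean squared error over k folds for one value of lam."""
    scores = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        w = fit_slope(train_x, train_y, lam)  # train on the other k - 1 folds
        mse = sum((w * xs[i] - ys[i]) ** 2 for i in held_out) / len(held_out)
        scores.append(mse)                    # evaluate on the hold-out fold
    return sum(scores) / len(scores)


# Hypothetical training data, roughly y = 2x with some noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

candidates = [0.0, 0.1, 1.0, 10.0]
cv_scores = {lam: cross_validate(xs, ys, lam) for lam in candidates}
best_lam = min(cv_scores, key=cv_scores.get)
print(cv_scores, best_lam)
```

In practice the data is usually shuffled before it is split into folds, and the chosen hyper-parameters are then used to retrain on the full training set before the test set is touched.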
Conclusion
We’ve covered some of the key concepts in the field of Machine
Learning, starting with the definition of machine learning and then
covering different types of machine learning techniques. We
discussed the theory behind the most common regression
techniques (Linear and Logistic), along with other key
concepts of machine learning.
Clustering in Machine Learning
Clustering, or cluster analysis, is a machine learning technique that groups an
unlabelled dataset. It can be defined as "A way of grouping the data points into
different clusters, consisting of similar data points. The objects with the
possible similarities remain in a group that has less or no similarities with
another group."
It does this by finding similar patterns in the unlabelled dataset, such as shape,
size, color, behavior, etc., and divides the data points according to the presence
or absence of those patterns.
It is an unsupervised learning method, so no supervision is provided to the
algorithm, and it deals with an unlabeled dataset.
After applying a clustering technique, each cluster or group is given a
cluster ID. An ML system can use this ID to simplify the processing of large and
complex datasets.
The clustering technique is commonly used for statistical data analysis.
Note: Clustering is somewhat similar to classification, but the difference is
the type of dataset that we are using. In classification, we work with a labeled dataset,
whereas in clustering, we work with an unlabelled dataset.
Example: Let's understand the clustering technique with the real-world example of
a mall. When we visit any shopping mall, we can observe that things with similar
usage are grouped together: t-shirts are grouped in one section and
trousers in another; similarly, in the produce section, apples, bananas,
mangoes, etc., are grouped separately, so that we can easily find
things. The clustering technique works in the same way. Another example of
clustering is grouping documents according to topic.
The clustering technique can be widely used in various tasks. Some most common
uses of this technique are:
o Market Segmentation
o Statistical data analysis
o Social network analysis
o Image segmentation
o Anomaly detection, etc.
Apart from these general uses, clustering is used by Amazon in its recommendation
system to provide recommendations based on past product searches.
Netflix also uses this technique to recommend movies and web series
to its users based on their watch history.
The diagram below explains the working of the clustering algorithm. We can see that
different fruits are divided into several groups with similar properties.
Types of Clustering Methods
The clustering methods are broadly divided into Hard clustering (each data point
belongs to only one group) and Soft clustering (a data point can belong to more than
one group). There are also various other approaches to clustering. Below are the
main clustering methods used in Machine learning:
1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering
Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also
known as the centroid-based method. The most common example of partitioning
clustering is the K-Means Clustering algorithm.
In this type, the dataset is divided into a set of K groups, where K is the
pre-defined number of groups. The cluster centers are chosen in such a way that
each data point is closer to the centroid of its own cluster than to the
centroid of any other cluster.
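A minimal sketch of the K-Means idea for two-dimensional points is given below. It assumes the usual alternating procedure (assign each point to its nearest centroid, then move each centroid to the mean of its points); the function name, the sample points and the random initialization are hypothetical choices for illustration.

```python
import random


def k_means(points, k, iterations=100, seed=0):
    """Alternate between assigning points to the nearest centroid and re-centering."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append((sum(p[0] for p in cluster) / len(cluster),
                                      sum(p[1] for p in cluster) / len(cluster)))
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # assignments no longer change
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two well-separated groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.0)]
centroids, clusters = k_means(points, k=2)
print(sorted(centroids))
```

K must be supplied up front, as the text notes. On data that is less cleanly separated, the result can depend on the initial centroids, so the algorithm is often run several times with different starting points.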
Density-Based Clustering
The density-based clustering method connects highly dense areas into clusters,
so clusters of arbitrary shape can be formed as long as the dense regions can
be connected. The algorithm identifies the regions of high density in the dataset
and joins them into clusters. The dense areas in data
space are separated from each other by sparser areas.
These algorithms can face difficulty in clustering the data points if the dataset has
varying densities or high dimensionality.
Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided based on the
probability that a data point belongs to a particular distribution. The grouping is done
by assuming some distribution, most commonly the Gaussian distribution.
An example of this type is the Expectation-Maximization Clustering
algorithm, which uses Gaussian Mixture Models (GMM).
Hierarchical Clustering
Hierarchical clustering can be used as an alternative to partitioning clustering, as
there is no requirement to pre-specify the number of clusters to be created. In this
technique, the dataset is divided into clusters to create a tree-like structure, which is
also called a dendrogram. Any desired number of clusters can be
obtained by cutting the tree at the appropriate level. The most common example of this
method is the Agglomerative Hierarchical algorithm.
Fuzzy Clustering
Fuzzy clustering is a type of soft method in which a data object may belong to more
than one group or cluster. Each data object has a set of membership coefficients, which
express its degree of membership in each cluster. The Fuzzy C-means
algorithm is an example of this type of clustering; it is sometimes also known as the
Fuzzy k-means algorithm.
Clustering Algorithms
Clustering algorithms can be grouped according to the models explained
above. Many clustering algorithms have been published, but only a few
are commonly used. The choice of algorithm depends on the kind of data that we
are using. For example, some algorithms require the number of clusters in the
given dataset to be specified, whereas others work from the minimum distance
between the observations of the dataset.
Here we discuss the most popular clustering algorithms that are widely used in
machine learning:
1. K-Means algorithm: The k-means algorithm is one of the most popular
clustering algorithms. It classifies the dataset by dividing the samples into
different clusters of equal variances. The number of clusters must be specified
in this algorithm. It is fast with fewer computations required, with the linear
complexity of O(n).
2. Mean-shift algorithm: Mean-shift algorithm tries to find the dense areas in
the smooth density of data points. It is an example of a centroid-based model,
that works on updating the candidates for centroid to be the center of the
points within a given region.
3. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of
Applications with Noise. It is an example of a density-based model similar to
the mean-shift, but with some remarkable advantages. In this algorithm, the
areas of high density are separated by the areas of low density. Because of
this, the clusters can be found in any arbitrary shape.
4. Expectation-Maximization Clustering using GMM: This algorithm can be
used as an alternative to the k-means algorithm, or for those cases where
K-means may fail. In GMM, it is assumed that the data points are Gaussian
distributed.
5. Agglomerative Hierarchical algorithm: The Agglomerative hierarchical
algorithm performs the bottom-up hierarchical clustering. In this, each data
point is treated as a single cluster at the outset and then successively merged.
The cluster hierarchy can be represented as a tree-structure.
6. Affinity Propagation: It differs from other clustering algorithms in that it does
not require the number of clusters to be specified. Messages are exchanged
between pairs of data points until convergence. It has O(N²T)
time complexity, which is the main drawback of this algorithm.
Applications of Clustering
Below are some commonly known applications of clustering technique in Machine
Learning:
o In Identification of Cancer Cells: The clustering algorithms are widely used
for the identification of cancerous cells. It divides the cancerous and non-
cancerous data sets into different groups.
o In Search Engines: Search engines also work on the clustering technique.
The search result appears based on the closest object to the search query. It
does it by grouping similar data objects in one group that is far from the other
dissimilar objects. The accurate result of a query depends on the quality of the
clustering algorithm used.
o Customer Segmentation: It is used in market research to segment the
customers based on their choice and preferences.
o In Biology: It is used in the biology stream to classify different species of
plants and animals using the image recognition technique.
o In Land Use: The clustering technique is used to identify areas of
similar land use in a GIS database. This can be very useful for finding out
the purpose for which a particular piece of land is most
suitable.
Inductive vs. Deductive Research
Approach (with Examples)
Published on April 18, 2019 by Raimo Streefkerk. Revised on May 6, 2022.
The main difference between inductive and deductive reasoning is that inductive reasoning
aims at developing a theory while deductive reasoning aims at testing an existing theory.
Inductive reasoning moves from specific observations to broad generalizations, and deductive
reasoning the other way around.
Both approaches are used in various types of research, and it’s not uncommon to combine
them in one large study.
Table of contents
1. Inductive research approach
2. Deductive research approach
3. Combining inductive and deductive research
4. Frequently asked questions about inductive vs deductive reasoning
Inductive research approach
When there is little to no existing literature on a topic, it is common to perform inductive
research because there is no theory to test. The inductive approach consists of three stages:
1. Observation
o A low-cost airline flight is delayed
o Dogs A and B have fleas
o Elephants depend on water to exist
2. Observe a pattern
o Another 20 flights from low-cost airlines are delayed
o All observed dogs have fleas
o All observed animals depend on water to exist
3. Develop a theory or general (preliminary) conclusion
o Low cost airlines always have delays
o All dogs have fleas
o All biological life depends on water to exist
Limitations of an inductive approach
A conclusion drawn on the basis of an inductive method can never be proven, but it can be
invalidated.
Example
You observe 1000 flights from low-cost airlines. All of them experience a delay, which is in
line with your theory. However, you can never prove that flight 1001 will also be delayed.
Still, the larger your dataset, the more reliable the conclusion.
Deductive research approach
When conducting deductive research, you always start with a theory (the result of inductive
research). Reasoning deductively means testing these theories. If there is no theory yet, you
cannot conduct deductive research.
The deductive research approach consists of five stages:
1. Start with an existing theory (and create a problem statement)
o Low cost airlines always have delays
o All dogs have fleas
o All biological life depends on water to exist
2. Formulate a falsifiable hypothesis based on existing theory
o If passengers fly with a low cost airline, then they will always experience
delays
o All pet dogs in my apartment building have fleas
o All land mammals depend on water to exist
3. Collect data to test the hypothesis
o Collect flight data of low-cost airlines
o Test all dogs in the building for fleas
o Study all land mammal species to see if they depend on water
4. Analyze and test the data
o 5 out of 100 flights of low-cost airlines are not delayed
o 10 out of 20 dogs didn’t have fleas
o All land mammal species depend on water
5. Decide whether you can reject the null hypothesis
o 5 out of 100 flights of low-cost airlines are not delayed = reject hypothesis
o 10 out of 20 dogs didn’t have fleas = reject hypothesis
o All land mammal species depend on water = support hypothesis
Limitations of a deductive approach
The conclusions of deductive reasoning can only be true if all the premises set in the
inductive study are true and the terms are clear.
Example
All dogs have fleas (premise)
Benno is a dog (premise)
Benno has fleas (conclusion)
Based on the premises we have, the conclusion must be true. However, if the first premise
turns out to be false, the conclusion that Benno has fleas cannot be relied upon.
Combining inductive and deductive research
Many scientists conducting a larger research project begin with an inductive study
(developing a theory). The inductive study is followed up with deductive research to
confirm or invalidate the conclusion.
In the examples above, the conclusion (theory) of the inductive study is also used as
a starting point for the deductive study.
Support Vector Machine Algorithm
Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems.
However, primarily, it is used for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that
can segregate n-dimensional space into classes so that we can easily put the new
data point in the correct category in the future. This best decision boundary is called
a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These
extreme cases are called as support vectors, and hence algorithm is termed as
Support Vector Machine. Consider the below diagram in which there are two
different categories that are classified using a decision boundary or hyperplane:
Example: SVM can be understood with the example that we used for the KNN
classifier. Suppose we see a strange cat that also has some features of a dog, and
we want a model that can accurately identify whether it is a cat or a dog. Such a
model can be created by using the SVM algorithm. We first train the model with
lots of images of cats and dogs so that it can learn the different features of cats
and dogs, and then we test it with this strange creature. The SVM creates
a decision boundary between the two classes (cat and dog) using the extreme
cases (support vectors) of each class. On the basis of
the support vectors, it will classify the creature as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text
categorization, etc.
Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data. If a
dataset can be classified into two classes by using a single straight line, then
the data is termed linearly separable, and the classifier used is called a
Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for data that is not linearly separable.
If a dataset cannot be classified by using a straight line, then
the data is termed non-linear, and the classifier used is called a
Non-linear SVM classifier.
Hyperplane and Support Vectors in the SVM algorithm:
Hyperplane: There can be multiple lines/decision boundaries to segregate the
classes in n-dimensional space, but we need to find out the best decision boundary
that helps to classify the data points. This best boundary is known as the hyperplane
of SVM.
The dimension of the hyperplane depends on the number of features in the dataset:
if there are 2 features (as shown in the image), the hyperplane is a
straight line, and if there are 3 features, the hyperplane is a 2-dimensional plane.
We always create the hyperplane that has the maximum margin, which means the
maximum distance between the hyperplane and the nearest data points of each class.
Support Vectors:
The data points or vectors that are closest to the hyperplane and which affect the
position of the hyperplane are termed support vectors. They are so called because
these vectors support the hyperplane.
How does SVM work?
Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose
we have a dataset that has two tags (green and blue), and the dataset has two
features x1 and x2. We want a classifier that can classify the pair (x1, x2) of
coordinates as either green or blue. Consider the below image:
Since this is a 2-d space, we can easily separate these two classes by using a
straight line. But there can be multiple lines that separate these classes.
Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best
boundary is called a hyperplane. The SVM algorithm finds the points of each
class that lie closest to the boundary. These points are called support vectors. The
distance between the support vectors and the hyperplane is called the margin, and
the goal of SVM is to maximize this margin. The hyperplane with maximum margin is called
the optimal hyperplane.
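To make the idea of the margin concrete, the sketch below compares two candidate separating lines for the same points by measuring the distance from each line to its nearest point. It does not train an SVM; it only evaluates the quantity that an SVM maximizes. The points and both candidate lines are hypothetical, and each line is written as w · x + b = 0.

```python
import math


def signed_distance(w, b, point):
    """Signed distance from a point to the hyperplane w . x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, point)) + b) / norm


def margin(w, b, points):
    """Distance from the hyperplane to the closest point."""
    return min(abs(signed_distance(w, b, p)) for p in points)


# Hypothetical two-feature data with two tags.
green = [(1.0, 1.0), (2.0, 1.0)]
blue = [(4.0, 4.0), (5.0, 3.0)]
points = green + blue

# Two hypothetical lines that both separate the classes.
line_a = ((1.0, 1.0), -5.0)  # x1 + x2 = 5
line_b = ((1.0, 0.0), -3.0)  # x1 = 3

margin_a = margin(*line_a, points)
margin_b = margin(*line_b, points)
print(margin_a, margin_b)
```

Both lines classify every point correctly, but the first keeps a larger distance to its nearest point, so it is the better candidate of the two under the maximum-margin criterion.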
Non-Linear SVM:
If data is linearly separable, then we can separate it by using a straight line, but for
non-linear data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear
data, we have used two dimensions x and y, so for non-linear data, we will add a
third dimension z. It can be calculated as:
$z = x^2 + y^2$
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the
below image:
Since we are in 3-D space, the decision boundary looks like a plane parallel to the
x-axis. If we convert it back to 2-D space with z = 1, then it will become:
Hence we get a circle of radius 1 in the case of non-linear data.
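The effect of the added dimension can be checked numerically. The sketch below is only an illustration: the synthetic ring-shaped data, the radii 0.8 and 1.2, and the variable names are assumptions. It shows that after computing z = x^2 + y^2, the single threshold z = 1 separates two classes that no straight line in the (x, y) plane can separate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: an inner class (radius below 0.8) surrounded by
# an outer class (radius between 1.2 and 2.0).
angles = rng.uniform(0.0, 2.0 * np.pi, size=200)
radii = np.concatenate([rng.uniform(0.0, 0.8, size=100),
                        rng.uniform(1.2, 2.0, size=100)])
x = radii * np.cos(angles)
y = radii * np.sin(angles)
labels = np.concatenate([np.zeros(100, dtype=int), np.ones(100, dtype=int)])

# Third dimension from the text: z = x^2 + y^2.
z = x ** 2 + y ** 2

# In (x, y, z) space the plane z = 1 separates the two classes.
predicted = (z > 1.0).astype(int)
accuracy = (predicted == labels).mean()
print("accuracy of the plane z = 1:", accuracy)
```

Projected back to two dimensions, the plane z = 1 corresponds to the circle x^2 + y^2 = 1.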
Python Implementation of Support Vector Machine
Now we will implement the SVM algorithm using Python. Here we will use the same
dataset user_data, which we have used in Logistic regression and KNN
classification.
o Data Pre-processing step
Till the Data pre-processing step, the code will remain the same. Below is the code:
# Data pre-processing step
# Importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Importing the dataset
data_set = pd.read_csv('user_data.csv')

# Extracting the independent and dependent variables
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values

# Splitting the dataset into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# Feature scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)
After executing the above code, we will pre-process the data. The code will give the
dataset as:
The scaled output for the test set will be:
Fitting the SVM classifier to the training set:
Now the training set will be fitted to the SVM classifier. To create the SVM classifier,
we will import the SVC class from the sklearn.svm module. Below is the code for it:
from sklearn.svm import SVC  # "Support vector classifier"
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(x_train, y_train)
In the above code, we have used kernel='linear', as here we are creating SVM for
linearly separable data. However, we can change it for non-linear data. And then we
fitted the classifier to the training dataset(x_train, y_train)
Output:
Out[8]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
kernel='linear', max_iter=-1, probability=False, random_state=0,
shrinking=True, tol=0.001, verbose=False)
The model performance can be altered by changing the value of C(Regularization
factor), gamma, and kernel.
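To illustrate how the kernel choice can change the result, here is a minimal sketch on synthetic data. It does not use the user_data file; the make_circles dataset, the parameter values and the variable names are assumptions chosen only to show the effect on data that is not linearly separable.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical non-linear data: one class forms a ring around the other.
X, y = make_circles(n_samples=400, noise=0.05, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Same regularization factor C, two different kernels.
linear_clf = SVC(kernel="linear", C=1.0).fit(X_train, y_train)
rbf_clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)

linear_acc = linear_clf.score(X_test, y_test)
rbf_acc = rbf_clf.score(X_test, y_test)
print("linear kernel accuracy:", linear_acc)
print("rbf kernel accuracy:", rbf_acc)
```

On data of this shape the RBF kernel would be expected to score far higher than the linear kernel, because no straight line can separate a ring from its centre.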
o Predicting the test set result:
Now, we will predict the output for test set. For this, we will create a new
vector y_pred. Below is the code for it:
# Predicting the test set result
y_pred = classifier.predict(x_test)
After getting the y_pred vector, we can compare the result of y_pred and y_test to
check the difference between the actual value and predicted value.
Output: Below is the output for the prediction of the test set:
o Creating the confusion matrix:
Now we will evaluate the performance of the SVM classifier, that is, how many
incorrect predictions it makes compared to the Logistic regression
classifier. To create the confusion matrix, we need to import
the confusion_matrix function from sklearn.metrics. After importing the
function, we will call it and store the result in a new variable cm. The function
takes two main parameters, y_true (the actual values) and y_pred (the values
predicted by the classifier). Below is the code for it:
# Creating the confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
Output:
As we can see in the above output image, there are 66+24 = 90 correct predictions
and 8+2 = 10 incorrect predictions. Therefore we can say that our SVM model improved
as compared to the Logistic regression model.
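The accuracy implied by these counts can be computed directly from the matrix. This is a small sketch using the four numbers quoted above; the placement of 2 and 8 in the off-diagonal cells is an assumption, since the output image is not reproduced here, and it does not affect the accuracy.

```python
import numpy as np

# Counts quoted in the text. Which off-diagonal cell holds 2 and which
# holds 8 is an assumption; the accuracy does not depend on it.
cm = np.array([[66, 2],
               [8, 24]])

correct = np.trace(cm)        # diagonal entries are the correct predictions
total = cm.sum()              # all test samples
incorrect = total - correct
accuracy = correct / total

print("correct:", correct, "incorrect:", incorrect, "accuracy:", accuracy)
```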
o Visualizing the training set result:
Now we will visualize the training set result, below is the code for it:
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
# Build a fine grid covering the range of both scaled features
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
# Colour every grid point by its predicted class to show the decision regions
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
# Overlay the actual training points, one colour per class
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                color=ListedColormap(('red', 'green'))(i), label=j)
mtp.title('SVM classifier (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:
By executing the above code, we will get the output as:
As we can see, the above output looks similar to the Logistic regression
output. We get a straight line as the hyperplane because we have used a
linear kernel in the classifier. As discussed above, for a 2-D
space the hyperplane in SVM is a straight line.
o Visualizing the test set result:
# Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
# Build a fine grid covering the range of both scaled features
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
# Colour every grid point by its predicted class to show the decision regions
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
# Overlay the actual test points, one colour per class
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                color=ListedColormap(('red', 'green'))(i), label=j)
mtp.title('SVM classifier (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:
By executing the above code, we will get the output as:
As we can see in the above output image, the SVM classifier has divided the users
into two regions (Purchased or Not purchased). The colour map in the code assigns
red to the first class label and green to the second. Assuming the usual coding of
0 for not purchased and 1 for purchased, users who did not purchase the SUV are in
the red region with red scatter points, and users who purchased the SUV are in the
green region with green scatter points. The hyperplane separates the two classes,
Purchased and Not purchased.
Case-Based Reasoning (CBR) Classifier
As we know, Nearest Neighbour classifiers store training tuples as points in
Euclidean space. Case-Based Reasoning (CBR) classifiers instead use a
database of problem solutions to solve new problems. They store the tuples or
cases for problem-solving as complex symbolic descriptions.
How does CBR work?
When a new case arises to classify, a case-based reasoner (CBR) will first
check if an identical training case exists. If one is found, then the
accompanying solution to that case is returned. If no identical case is found,
then the CBR will search for training cases having components that are similar
to those of the new case. Conceptually, these training cases may be
considered as neighbours of the new case. If cases are represented as
graphs, this involves searching for subgraphs that are similar to subgraphs
within the new case. The CBR tries to combine the solutions of the
neighbouring training cases to propose a solution for the new case. If
incompatibilities arise with the individual solutions, then backtracking to search
for other solutions may be necessary. The CBR may employ background
knowledge and problem-solving strategies to propose a feasible solution.
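The retrieval steps described above can be sketched as follows. This is only a minimal illustration: the help-desk case base, the attribute-overlap similarity measure, and the names CASE_BASE, similarity and solve are all hypothetical. Combining the retrieved solutions and backtracking are left out.

```python
from typing import Dict, List, Tuple

Case = Dict[str, str]

# Hypothetical case base: each entry pairs a symbolic description with a solution.
CASE_BASE: List[Tuple[Case, str]] = [
    ({"device": "printer", "symptom": "no power", "led": "off"},
     "check power cable"),
    ({"device": "printer", "symptom": "paper jam", "led": "red"},
     "open tray and clear paper"),
    ({"device": "router", "symptom": "no power", "led": "off"},
     "replace power adapter"),
]


def similarity(a: Case, b: Case) -> float:
    """Fraction of attributes (over the union of keys) on which two cases agree."""
    keys = set(a) | set(b)
    matches = sum(1 for key in keys if a.get(key) == b.get(key))
    return matches / len(keys) if keys else 0.0


def solve(new_case: Case, k: int = 2) -> List[str]:
    # Step 1: if an identical training case exists, return its solution.
    for case, solution in CASE_BASE:
        if case == new_case:
            return [solution]
    # Step 2: otherwise rank the stored cases by similarity and return the
    # solutions of the k nearest neighbours as candidates to be combined.
    ranked = sorted(CASE_BASE,
                    key=lambda entry: similarity(new_case, entry[0]),
                    reverse=True)
    return [solution for _, solution in ranked[:k]]


exact = solve({"device": "printer", "symptom": "no power", "led": "off"})
similar = solve({"device": "printer", "symptom": "no power", "led": "red"})
print(exact)
print(similar)
```

A real system would replace the simple attribute overlap with a domain-specific similarity measure, for example subgraph matching when cases are represented as graphs.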
Applications of CBR include:
1. Problem resolution for customer service help desks, where cases describe
product-related diagnostic problems.
2. It is also applied to areas such as engineering and law, where cases are
either technical designs or legal rulings, respectively.
3. Medical education, where patient case histories and treatments are used
to help diagnose and treat new patients.
Challenges with CBR
Finding a good similarity metric (e.g., for matching subgraphs) and suitable
methods for combining solutions.
Selecting salient features for indexing training cases and the development
of efficient indexing techniques.
A trade-off between accuracy and efficiency evolves as the number of stored cases
becomes very large. CBR becomes more intelligent as the number of stored cases
grows, but after a certain point, the system’s efficiency will suffer as the
time required to search for and process relevant cases increases.
Neural networks are parallel computing devices, which are basically an attempt to
make a computer model of the brain. The main objective is to develop a system that
performs various computational tasks faster than traditional systems. These tasks
include pattern recognition and classification, approximation, optimization, and data
clustering.
What is Artificial Neural Network?
An Artificial Neural Network (ANN) is an efficient computing system whose central
theme is borrowed from the analogy of biological neural networks. ANNs are also
called “artificial neural systems,” “parallel distributed processing systems,” or
“connectionist systems.” An ANN consists of a large collection of units that are
interconnected in some pattern to allow communication between the units. These
units, also referred to as nodes or neurons, are simple processors which operate in
parallel.
Every neuron is connected to other neurons through connection links. Each
connection link is associated with a weight that carries information about the input signal.
This is the most useful information for neurons to solve a particular problem because
the weight usually excites or inhibits the signal that is being communicated. Each
neuron has an internal state, which is called an activation signal. Output signals,
which are produced after combining the input signals and the activation rule, may be
sent to other units.
A Brief History of ANN
The history of ANN can be divided into the following three eras −
ANN during 1940s to 1960s
Some key developments of this era are as follows −
1943 − It has been assumed that the concept of neural network started with
the work of physiologist, Warren McCulloch, and mathematician, Walter Pitts,
when in 1943 they modeled a simple neural network using electrical circuits in
order to describe how neurons in the brain might work.
1949 − Donald Hebb’s book, The Organization of Behavior, put forth the idea
that repeated activation of one neuron by another strengthens the connection
between them each time it is used.
1956 − An associative memory network was introduced by Taylor.
1958 − A learning method for McCulloch and Pitts neuron model named
Perceptron was invented by Rosenblatt.
1960 − Bernard Widrow and Marcian Hoff developed models called "ADALINE"
and “MADALINE.”
ANN during 1960s to 1980s
Some key developments of this era are as follows −
1961 − Rosenblatt made an unsuccessful attempt but proposed the
“backpropagation” scheme for multilayer networks.
1964 − Taylor constructed a winner-take-all circuit with inhibitions among
output units.
1969 − Minsky and Papert published Perceptrons, an analysis of the limitations of single-layer perceptrons.
1971 − Kohonen developed Associative memories.
1976 − Stephen Grossberg and Gail Carpenter developed Adaptive resonance
theory.
ANN from 1980s till Present
Some key developments of this era are as follows −
1982 − The major development was Hopfield’s Energy approach.
1985 − Boltzmann machine was developed by Ackley, Hinton, and Sejnowski.
1986 − Rumelhart, Hinton, and Williams introduced Generalised Delta Rule.
1988 − Kosko developed Bidirectional Associative Memory (BAM) and also gave
the concept of Fuzzy Logic in ANN.
The historical review shows that significant progress has been made in this field.
Neural network based chips are emerging and applications to complex problems are
being developed. Surely, today is a period of transition for neural network technology.
Biological Neuron
A nerve cell (neuron) is a special biological cell that processes information.
According to one estimate, there is a huge number of neurons, approximately
$10^{11}$, with numerous interconnections, approximately $10^{15}$.
Schematic Diagram
Working of a Biological Neuron
As shown in the above diagram, a typical neuron consists of the following four parts
with the help of which we can explain its working −
Dendrites − They are tree-like branches, responsible for receiving
information from the other neurons the cell is connected to. In a sense, we can say
that they are like the ears of the neuron.
Soma − It is the cell body of the neuron and is responsible for processing the
information received from the dendrites.
Axon − It is just like a cable through which the neuron sends information.
Synapses − They are the connections between the axon and the dendrites of other neurons.
ANN versus BNN
Before taking a look at the differences between Artificial Neural
Network (ANN) and Biological Neural Network (BNN), let us take a look at
the similarities based on the terminology between these two.
Biological Neural Network (BNN) | Artificial Neural Network (ANN)
Soma | Node
Dendrites | Input
Synapse | Weights or Interconnections
Axon | Output
The following table shows the comparison between ANN and BNN based on some
criteria mentioned.
Criteria | BNN | ANN
Processing | Massively parallel, slow but superior to ANN | Massively parallel, fast but inferior to BNN
Size | $10^{11}$ neurons and $10^{15}$ interconnections | $10^{2}$ to $10^{4}$ nodes (mainly depends on the type of application and network designer)
Learning | They can tolerate ambiguity | Very precise, structured and formatted data is required to tolerate ambiguity
Fault tolerance | Performance degrades with even partial damage | It is capable of robust performance, hence has the potential to be fault tolerant
Storage capacity | Stores the information in the synapse | Stores the information in continuous memory locations
Model of Artificial Neural Network
The following diagram represents the general model of ANN followed by its
processing.
For the above general model of artificial neural network, the net input can be
calculated as follows −
$y_{in} = x_1 w_1 + x_2 w_2 + x_3 w_3 + \dots + x_m w_m$
i.e., Net input $y_{in} = \sum_{i=1}^{m} x_i w_i$
The output can be calculated by applying the activation function over the net input.
$Y = F(y_{in})$
Output = function (net input calculated)
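The net input and output computation can be sketched in a few lines. This is a minimal illustration using NumPy; the input values, the weights and the function names net_input and neuron_output are assumptions chosen for the example.

```python
import numpy as np


def net_input(x, w):
    """Net input y_in: the sum of each input multiplied by its weight."""
    return float(np.dot(x, w))


def neuron_output(x, w, activation):
    """Output Y = F(y_in): the activation function applied to the net input."""
    return activation(net_input(x, w))


# Hypothetical inputs and weights for a neuron with m = 3 inputs.
x = np.array([1.0, 0.5, -1.0])
w = np.array([0.2, 0.4, 0.1])

y_in = net_input(x, w)                      # 1.0*0.2 + 0.5*0.4 + (-1.0)*0.1
y_out = neuron_output(x, w, lambda v: v)    # identity activation as F
print("net input:", y_in)
print("output:", y_out)
```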
Processing of ANN depends upon the following three building blocks −
Network Topology
Adjustments of Weights or Learning
Activation Functions
In this chapter, we will discuss these three building blocks of ANN in detail.
Network Topology
A network topology is the arrangement of a network along with its nodes and
connecting lines. According to the topology, ANN can be classified as the following
kinds −
Feedforward Network
It is a non-recurrent network having processing units/nodes in layers, and all the
nodes in a layer are connected with the nodes of the previous layer. The connections
carry different weights. There is no feedback loop, which means the signal can
only flow in one direction, from input to output. It may be divided into the following
two types −
Single layer feedforward network − The concept is of feedforward ANN
having only one weighted layer. In other words, we can say the input layer is
fully connected to the output layer.
Multilayer feedforward network − The concept is of feedforward ANN having
more than one weighted layer. As this network has one or more layers
between the input and the output layer, these layers are called hidden layers.
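The difference between the two feedforward types can be sketched as a forward pass through one or more weight matrices. This is an illustrative sketch only: the layer sizes, the random weights, the use of a sigmoid activation and the omission of bias terms are all assumptions.

```python
import numpy as np


def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))


def forward(x, weight_matrices):
    """Propagate x through each weighted layer in turn, input to output only."""
    activation = x
    for W in weight_matrices:
        activation = sigmoid(W @ activation)
    return activation


rng = np.random.default_rng(0)
x = np.array([0.5, -1.0, 2.0])          # 3 input nodes

# Single layer: one weighted layer, inputs fully connected to 2 outputs.
single_layer = [rng.normal(size=(2, 3))]

# Multilayer: a hidden layer of 4 nodes between the input and output layers.
multilayer = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]

out_single = forward(x, single_layer)
out_multi = forward(x, multilayer)
print("single layer output:", out_single)
print("multilayer output:", out_multi)
```

In both cases the signal moves strictly from input to output; no layer's output is ever fed back to an earlier layer.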
Feedback Network
As the name suggests, a feedback network has feedback paths, which means the
signal can flow in both directions using loops. This makes it a non-linear dynamic
system, which changes continuously until it reaches a state of equilibrium. It may be
divided into the following types −
Recurrent networks − They are feedback networks with closed loops.
Following are the two types of recurrent networks.
Fully recurrent network − It is the simplest neural network architecture
because all nodes are connected to all other nodes and each node works as
both input and output.
Jordan network − It is a closed loop network in which the output will go to the
input again as feedback as shown in the following diagram.
Adjustments of Weights or Learning
Learning, in artificial neural network, is the method of modifying the weights of
connections between the neurons of a specified network. Learning in ANN can be
classified into three categories namely supervised learning, unsupervised learning,
and reinforcement learning.
Supervised Learning
As the name suggests, this type of learning is done under the supervision of a
teacher. This learning process is dependent on the teacher, who supplies the desired output.
During the training of ANN under supervised learning, the input vector is presented
to the network, which will give an output vector. This output vector is compared with
the desired output vector. An error signal is generated, if there is a difference
between the actual output and the desired output vector. On the basis of this error
signal, the weights are adjusted until the actual output is matched with the desired
output.
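The error-driven weight adjustment described above can be sketched with a single threshold neuron. The sketch uses the classic perceptron update rule, which is one possible choice and not necessarily the rule intended by the text; the AND-gate training data, the learning rate and the names are assumptions.

```python
import numpy as np

# Hypothetical training set: the logical AND function.
inputs = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
desired = np.array([0.0, 0.0, 0.0, 1.0])   # desired outputs from the "teacher"

weights = np.zeros(2)
bias = 0.0
learning_rate = 0.5


def predict(x, weights, bias):
    """Threshold neuron: output 1 if the net input is positive, else 0."""
    return 1.0 if np.dot(weights, x) + bias > 0.0 else 0.0


for epoch in range(100):
    mistakes = 0
    for x, target in zip(inputs, desired):
        # Error signal: difference between desired and actual output.
        error = target - predict(x, weights, bias)
        if error != 0.0:
            # Adjust the weights in the direction that reduces the error.
            weights = weights + learning_rate * error * x
            bias = bias + learning_rate * error
            mistakes += 1
    if mistakes == 0:      # actual output matches desired output everywhere
        break

outputs = np.array([predict(x, weights, bias) for x in inputs])
print("weights:", weights, "bias:", bias)
print("outputs:", outputs)
```

Training stops as soon as a full pass over the data produces no error signal, which mirrors the stopping condition stated in the text.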
Unsupervised Learning
As the name suggests, this type of learning is done without the supervision of a
teacher. This learning process is independent of any teacher.
During the training of ANN under unsupervised learning, the input vectors of similar
type are combined to form clusters. When a new input pattern is applied, the
neural network gives an output response indicating the class to which the input
pattern belongs.
There is no feedback from the environment as to what the desired output should be
or whether it is correct or incorrect. Hence, in this type of learning, the network itself must
discover the patterns and features in the input data, and the mapping from the input
data to the output.
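The clustering behaviour described above can be sketched with a simple winner-take-all (competitive) update, in which the unit closest to each input moves a little towards it. This is only one of several possible unsupervised schemes; the synthetic data, the two starting prototypes, the learning rate and the names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical unlabeled data: two groups of similar input vectors.
data = np.vstack([rng.normal(loc=[0.0, 0.0], scale=0.1, size=(50, 2)),
                  rng.normal(loc=[3.0, 3.0], scale=0.1, size=(50, 2))])
rng.shuffle(data)   # present the inputs in random order

# One weight vector (prototype) per output unit; starting values are assumed.
prototypes = np.array([[0.5, 0.5], [2.5, 2.5]])
learning_rate = 0.1

for epoch in range(10):
    for x in data:
        # The unit whose weights are closest to the input wins ...
        winner = np.argmin(np.linalg.norm(prototypes - x, axis=1))
        # ... and only the winner's weights move towards the input.
        prototypes[winner] += learning_rate * (x - prototypes[winner])


def classify(x):
    """Return the index of the cluster (output unit) closest to x."""
    return int(np.argmin(np.linalg.norm(prototypes - np.asarray(x), axis=1)))


print("learned prototypes:")
print(prototypes)
print("cluster of [0.1, -0.1]:", classify([0.1, -0.1]))
print("cluster of [2.9, 3.1]:", classify([2.9, 3.1]))
```

No desired outputs are supplied anywhere in the loop; the cluster structure is discovered from the inputs alone.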
Reinforcement Learning
As the name suggests, this type of learning is used to reinforce or strengthen the
network based on some critic information. This learning process is similar to supervised
learning; however, much less information may be available.
During the training of a network under reinforcement learning, the network receives
some feedback from the environment. This makes it somewhat similar to supervised
learning. However, the feedback obtained here is evaluative, not instructive, which
means there is no teacher as in supervised learning. After receiving the feedback,
the network adjusts its weights to get better critic information in the
future.
Activation Functions
An activation function may be defined as the extra force or effort applied over the
input to obtain an exact output. In ANN, we apply activation functions over the net
input to get the output. The following are some activation functions of interest −
Linear Activation Function
It is also called the identity function as it performs no input editing. It can be defined
as −
$F(x) = x$
Sigmoid Activation Function
It is of two types as follows −
Binary sigmoidal function − This activation function maps the input to a value
between 0 and 1. It is positive in nature. It is always bounded, which means its
output cannot be less than 0 or more than 1. It is also strictly increasing in
nature, which means the higher the input, the higher the output. It can be
defined as
$F(x) = \mathrm{sigm}(x) = \frac{1}{1 + \exp(-x)}$
Bipolar sigmoidal function − This activation function maps the input to a value
between -1 and 1. It can be positive or negative in nature. It is always bounded,
which means its output cannot be less than -1 or more than 1. It is also
strictly increasing in nature, like the binary sigmoid function. It can be defined as
$F(x) = \frac{2}{1 + \exp(-x)} - 1 = \frac{1 - \exp(-x)}{1 + \exp(-x)}$
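The three activation functions can be written out directly. This is a small sketch using NumPy; the function names and the sample inputs are assumptions. It also checks numerically that the two forms of the bipolar sigmoid, 2/(1 + exp(-x)) - 1 and (1 - exp(-x))/(1 + exp(-x)), agree.

```python
import numpy as np


def linear(x):
    """Identity function: the input is returned unchanged."""
    return x


def binary_sigmoid(x):
    """Bounded in (0, 1) and strictly increasing."""
    return 1.0 / (1.0 + np.exp(-x))


def bipolar_sigmoid(x):
    """Bounded in (-1, 1) and strictly increasing."""
    return 2.0 / (1.0 + np.exp(-x)) - 1.0


xs = np.linspace(-5.0, 5.0, 11)
binary_values = binary_sigmoid(xs)
bipolar_values = bipolar_sigmoid(xs)

# Second form of the bipolar sigmoid, for comparison.
alternative_form = (1.0 - np.exp(-xs)) / (1.0 + np.exp(-xs))

print("binary sigmoid at 0:", binary_sigmoid(0.0))
print("bipolar sigmoid at 0:", bipolar_sigmoid(0.0))
print("forms agree:", np.allclose(bipolar_values, alternative_form))
```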