Machine Learning Fundamentals Explained

UNIT-I

What is Machine Learning?

Machine learning is a sub-domain of artificial intelligence. The goal of machine
learning is usually to understand the structure of the data and to fit that
data to models that can be understood and used by humans. While artificial
intelligence and machine learning are often mentioned together, they are two
different concepts.

AI is a broad concept: machines making decisions, learning new skills, and
solving problems in the same way as people. Machine learning is a subset of AI
that enables intelligent systems to independently learn new things from the
data.

Alan Turing’s definition would have fallen under the category of “systems that
act like humans.”

In its simplest form, artificial intelligence is a field that combines computer
science and robust datasets to enable problem-solving. It also encompasses the
sub-fields of machine learning and deep learning, which are frequently
mentioned in conjunction with artificial intelligence. These disciplines
comprise AI algorithms that seek to create expert systems which make
predictions or classifications based on input data.

Machine Learning subset


Machine learning is a subset of artificial intelligence. Machine learning is a tool
for transforming data into knowledge. In the past 50 years, there has been an
explosion of data. This mass of data is useless unless we analyze it and
discover the patterns hidden within. Machine learning techniques are used to
automatically find the significant underlying patterns in complex data that we
would otherwise struggle to discover.

Hidden patterns and information about the problem can be used to predict
future events and to make all sorts of complex decisions.

The name machine learning was coined in 1959 by Arthur Samuel, an American
pioneer in the fields of computer gaming and artificial intelligence, who stated
that machine learning "gives computers the ability to learn without being
explicitly programmed." Arthur Samuel created the first self-learning program,
for playing checkers: the more the system played, the better it performed.

In 1997, Tom Mitchell gave a more formal definition: a computer program is said
to learn from experience E with respect to some task T and some performance
measure P, if its performance on T, as measured by P, improves with
experience E.

Example 1: Playing Chess

Task: The task of playing chess

Experience: The experience of playing many games of chess

Performance measure: The probability that the program will win the next game
of chess

Example 2: Spam Mail Detection

Task: To recognize and classify the emails

Experience: A set of emails with given labels

Performance measure: Total percentage of emails correctly classified as
'spam' (or 'not spam') by the program.
Currently, machine learning is used for a variety of tasks, such as image
recognition, speech recognition (for example, Amazon's Alexa), email filtering
(classifying an email as spam or not spam), Facebook auto-tagging,
recommendation systems (such as Amazon and Flipkart recommending products to
users), and many more.

The Seven Steps of Machine Learning

The process of machine learning can be broken down into seven steps, as shown
in the figure. To illustrate the significance and function of each step, we
will use the example of a simple model responsible for differentiating between
an apple and an orange. Machine learning is capable of much more complex
tasks; however, to explain the process in simple terms, a basic example is
used to explain the relevant concepts:

Step 1: Data Gathering / Data Collection


This step is very important because the quality and quantity of the data we gather will
directly determine how well our model will work. To develop our machine learning
model, our first step is to gather relevant data that can be used to train the model.
The step of gathering data is the foundation of the machine learning process. Mistakes such
as choosing the incorrect features or focusing on limited types of entries in the data set
may render the model completely ineffective.
Step 2: Preparing the data
After the training data is gathered, we move on to the next step of machine learning: data
preparation, where the data is loaded into a suitable place, and then prepared for use in
machine learning training. Here, the data is first put all together, and then
the order is randomized as the order of data should not affect what is learned.
In this step, we wrangle the data collected in Step 1 and prepare it for training. We can clean
the data by removing duplicates, correcting errors, dealing with missing values, converting
data types, and so on. We can also visualize the data, as this will help us see whether there
are any relevant relationships between the different attributes that we can take advantage
of, as well as whether any data imbalances are present.
Another major component of data preparation is splitting the data set into two parts.
The larger part (~80%) is used for training the model, while the smaller part (~20%) is
used to evaluate the trained model's performance. This is important because using the
same data for both training and evaluation would not give a fair assessment of the
model's performance in real-world scenarios.
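The shuffle-then-split step described above can be sketched in a few lines of Python; the fruit data below is invented purely for illustration.

```python
import random

def train_test_split(data, train_fraction=0.8, seed=42):
    """Shuffle the data, then split it into training and evaluation parts."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = data[:]             # copy so the original order is untouched
    rng.shuffle(shuffled)          # randomize order: order should not affect learning
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# 100 labelled fruits: (weight in grams, label) -- made-up example data
fruits = [(140 + i, "apple") for i in range(50)] + \
         [(180 + i, "orange") for i in range(50)]
train, test = train_test_split(fruits)
print(len(train), len(test))  # 80 20
```

The ~80% part is used for training and the held-back ~20% is only touched at evaluation time.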
Step 3: Choosing a Model
The selection of the model type is our next course of action once we are done with the data-
centric steps. There are various existing models developed by data scientists that can be
used for different purposes. Different classes of models are good at modeling the underlying
patterns of different types of datasets. These models are designed with different goals in
mind. For instance, some models are more suited to dealing with texts while another model
may be better equipped to handle images.
Step 4: Training
At the heart of the machine learning process is the training of the model. The bulk of the
work is done at this stage. Training requires patience and experimentation; it is also useful
to know the field in which the model will be deployed. Training can prove highly rewarding
when the model starts to succeed in its role. The training process involves initializing the
model's parameters (say, W and b) with some random values, predicting the output with
those values, comparing the predictions with the actual outputs, and then adjusting the
values so that the predictions move closer to the true outputs.
This process then repeats, and each cycle of updating is called one training step. It is
comparable to a child learning to ride a bicycle: initially they may fall many times but,
after a while, they develop a better grasp of the process and can react better to different
situations while riding.
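The initialize-predict-compare-adjust cycle above can be sketched as a tiny gradient-descent loop. The data, starting values, and learning rate below are illustrative assumptions, not values from the text.

```python
# Fit y = w*x + b by repeated training steps (gradient descent).
# The toy data follows y = 2x + 1; w and b start at arbitrary values.
data = [(x, 2 * x + 1) for x in [0.0, 0.5, 1.0, 1.5, 2.0]]

w, b = 0.0, 0.0            # initial (arbitrary) parameter values
learning_rate = 0.1

for step in range(500):    # each loop iteration is one training step
    grad_w = grad_b = 0.0
    for x, y in data:
        error = (w * x + b) - y          # prediction minus true output
        grad_w += 2 * error * x / len(data)
        grad_b += 2 * error / len(data)
    w -= learning_rate * grad_w          # adjust parameters to shrink the error
    b -= learning_rate * grad_b

print(round(w, 2), round(b, 2))  # 2.0 1.0
```

After enough training steps the parameters settle near the values that generated the data.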
Step 5: Evaluation
With the model trained, it needs to be tested to see whether it would operate well in
real-world situations. That is why the part of the data set reserved for evaluation is used
to check the model's proficiency. This puts the model in a scenario where it encounters
situations that were not a part of its training. Evaluation becomes highly important when it
comes to commercial applications. Evaluation allows data scientists to check whether the
goals they set out to achieve were met or not. If the results are not satisfactory then the
prior steps need to be revisited so that the root cause behind the model’s
underperformance can be identified and, subsequently, rectified. If the evaluation is not
done properly then the model may not excel at fulfilling its desired
commercial purpose. This could mean that the company that designed and sold the model
may lose their goodwill with the client. It could also mean damage to the company’s
reputation as future clients may become hesitant when it comes to trusting the company’s
acumen regarding machine learning models. Therefore, evaluation of the model is essential
for avoiding the aforementioned ill-effects.
Step 6: Hyperparameter Tuning
Once the evaluation is over, further improvement in our training may be possible by
tuning the hyperparameters. A few parameters were implicitly assumed when the training
was done. One such parameter is the learning rate, which defines how far the parameters
are shifted during each step, based on the information from the previous training step.
These values play a role in how accurate the trained model is and how long the training
takes. Naturally, the question arises: why do we need hyperparameter tuning in the first
place when our model is achieving its targets? This can be answered by looking at the
competitive nature of machine learning-based service providers. Clients can choose from
multiple options when they seek a machine learning model to solve their respective
problems, but they are more likely to be enticed by the one that produces the most
accurate results. That is why hyperparameter tuning is a necessary step for ensuring the
commercial success of a machine learning model.
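As a rough sketch of hyperparameter tuning, the snippet below retrains a toy model with several candidate learning rates and keeps the one with the lowest final loss. All data and candidate values are made up for illustration.

```python
def train(data, learning_rate, steps=200):
    """Fit y = w*x with gradient descent; return (w, final training loss)."""
    w = 0.0
    for _ in range(steps):
        grad = sum(2 * ((w * x) - y) * x for x, y in data) / len(data)
        w -= learning_rate * grad
    loss = sum(((w * x) - y) ** 2 for x, y in data) / len(data)
    return w, loss

data = [(x, 3 * x) for x in [1.0, 2.0, 3.0]]   # toy data on the line y = 3x

# Try a few candidate learning rates and keep the best-performing one.
results = {lr: train(data, lr)[1] for lr in [0.001, 0.01, 0.1]}
best_lr = min(results, key=results.get)
print(best_lr)
```

A learning rate that is too small leaves the model far from convergence in the step budget, which is exactly what the loss comparison detects.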
Step 7: Prediction
The final step of the machine learning process is prediction. This is the stage where we
consider the model to be ready for practical applications. This is the point where the value
of machine learning is realized. Here we can finally use our model to predict the outcome of
what we want.
Applications of Machine Learning
Machine learning helps to improve business decisions, boost productivity, detect diseases,
forecast the weather, and much more. A machine learns automatically from the inputs.
Some of the best machine learning applications are mentioned as follows.
Social Media Features
Social media platforms such as Facebook, Instagram, and LinkedIn use machine learning
algorithms to create attractive and useful features for their users. For example, Facebook
notices and records the activities, chats, likes, comments, and the time spent by users on
specific kinds of posts, videos, and so on. Machine learning learns from the user's
activities and experiences and makes friend and page suggestions for the user's profile.
As another example, when a user uploads a picture with a friend, Facebook instantly
recognizes that friend: it checks the poses and projections in the picture, notices the
unique features, matches them with the people in the user's friend list, and tags that
friend automatically.
Product Recommendations
Product recommendation is used in almost every e-commerce website today. This is a very
advanced application of machine learning techniques. Using machine learning, e-commerce
websites track the user’s behavior based on previous purchases, searching patterns, cart
history, and so on. Based on this tracking, the ecommerce websites recommend the product
to users that somehow matches the user’s taste.
Image & Speech Recognition
Image recognition is one of the most common applications of machine learning. Image
processing is used to identify objects, persons, places, digital images, and so on in an image.
This technique is used for further analysis, such as pattern recognition, character
recognition, face detection, or face recognition. Facebook provides us with a feature of auto
friend tagging suggestions. Whenever a user uploads a photo with his Facebook friends,
then the user automatically gets a tagging suggestion with the name of his friends.
Speech recognition is the translation of spoken words into text. It is also known as
speech-to-text. Google Assistant and Alexa use speech recognition technology to follow
voice instructions.
Sentiment Analysis
Sentiment analysis is a real-time machine learning application that determines the emotion
or opinion of the speaker or the writer. For instance, if someone has written a review or
email (or any form of a document), the sentiment analyzer will instantly find out the actual
thought and tone of the text. This sentiment analysis application can be used to analyze a
review-based website, decision-making applications, and so on.
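A very crude sentiment analyzer can be sketched with hand-picked word lists; the lexicons below are hypothetical toy examples, far smaller than anything used in practice.

```python
# Minimal lexicon-based sentiment scoring (hypothetical word lists).
POSITIVE = {"good", "great", "excellent", "love", "happy"}
NEGATIVE = {"bad", "terrible", "awful", "hate", "sad"}

def sentiment(text):
    """Count positive vs negative words and report the overall tone."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this great phone"))       # positive
print(sentiment("terrible awful experience"))     # negative
```

Real sentiment analyzers are learned from labelled reviews rather than fixed word lists, but the input/output shape is the same.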
Self-driving cars
One of the most exciting applications of machine learning is self-driving cars. Machine
learning plays a significant role in self-driving cars. Tesla, a popular car manufacturer,
is working on self-driving cars. These autonomous cars are designed to be safer than cars
driven by humans, in part because they are not affected by factors like the illness or
emotions of a driver.
Email Spam and Malware Filtering
Whenever we receive a new email, it is filtered automatically as important, normal, and
spam. We always receive an important email in our inbox with the important symbol and
spam emails in our spam box, and the technology behind this is machine learning. The
following are some spam filters used by email providers:
Content filter
Header filter
General blacklists filter
Rules-based filters
Permission filters
Stock Market Trading
Machine learning is widely used in stock market trading. In the stock market, there is always
a risk of fluctuations in shares, so for this, machine learning is used to predict the market
trends of stock.
Medical Diagnosis
In medical science, machine learning is used for disease diagnoses. With this, medical
technology is growing rapidly and can build 3D models that can predict the exact position of
lesions in the brain. It helps in finding brain tumors and other brain-related diseases easily.
Online Fraud Detection
Machine learning is making online transactions safe and secure by detecting fraud
transactions. Whenever we perform an online transaction, there are various ways a
fraudulent transaction can take place, such as fake accounts, fake IDs, and money being
stolen in the middle of a transaction. Therefore, to detect this, machine learning helps
us by checking whether a transaction is genuine or fraudulent.
Automatic language translation
Nowadays, if we visit a new place and do not know the language, it is not a problem:
machine learning helps us by converting text into a language we know. Google's Neural
Machine Translation provides this feature; it is a neural machine learning system that
translates text into a familiar language, and the process is called automatic translation.
The technology behind automatic translation is a sequence-to-sequence learning algorithm,
which is used with image recognition to translate text from one language to another.
Deep Learning
A deep learning model is a neural network with many layers. For neural networks, a simple
method of increasing the model capacity is to add more hidden layers, which corresponds to
more parameters such as connection weights and thresholds. The model complexity can
also be increased by simply adding more hidden neurons since we have seen previously that
a multi-layer feedforward network with a single hidden layer already has very strong
learning ability. However, to increase model complexity, adding more hidden layers is more
effective than adding hidden neurons since more layers not only imply more hidden neurons
but also more nesting of activation functions.
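The nesting of activation functions that comes with extra hidden layers can be seen in a minimal forward pass. The weights, biases, and inputs below are arbitrary illustrative numbers, not a trained model.

```python
import math

def sigmoid(z):
    """A common activation function squashing any value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """One fully connected layer: weighted sums followed by the activation."""
    return [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

# Two hidden layers: each extra layer wraps another activation around the last.
x = [0.5, -0.2]
h1 = layer(x,  [[0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1])    # hidden layer 1
h2 = layer(h1, [[0.7, -0.5], [0.2, 0.9]], [0.05, -0.1])  # hidden layer 2
out = layer(h2, [[1.0, -1.0]], [0.0])                    # output layer
print(out)
```

Adding a layer here means composing `sigmoid(...)` one level deeper, which is why depth increases model complexity faster than simply widening a single hidden layer.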
TYPES OF MACHINE LEARNING

Supervised learning
Supervised learning methods are the ML methods that are most commonly used. It takes
the data sample (usually called training data) and the associated output (usually called
labels or responses) with each data sample during the training process of the model. The
main objective of supervised learning is to understand the association between input
training data and corresponding labels.
Let's understand it with an example. Suppose we have:
Input variable: x
Output variable: Y
In order to learn the mapping function from the input to the output, we need to apply an
algorithm whose main objective is to approximate the mapping function so well that we can
easily predict the output variable for new input data. The mapping can be written as:
Y = f(x)
These methods are called supervised learning methods because the ML model learns from
training data where the desired output is already known. Logistic regression, k-nearest
neighbors (KNN), decision trees, and random forests are some of the well-known supervised
machine learning algorithms.
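As a minimal illustration of learning the mapping Y = f(x) from labelled examples, a one-nearest-neighbour classifier simply returns the label of the closest training example. The fruit weights below are made up.

```python
# A minimal 1-nearest-neighbour classifier: the learned mapping y = f(x)
# is simply "the label of the closest training example".
train = [(150, "apple"), (160, "apple"), (190, "orange"), (200, "orange")]  # (weight g, label)

def predict(x):
    """Return the label of the training example whose input is nearest to x."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

print(predict(155))  # apple
print(predict(195))  # orange
```

Even this tiny model shows the supervised pattern: labelled pairs in, a prediction function out.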
Based on the type of ML task, supervised learning methods can be divided into two major
classes: classification and regression. The main objective of classification-based tasks
is to predict categorical output responses based on the input data provided. The output
depends on what the ML model learned in the training phase. Categorical means unordered
and discrete values; hence, the output responses will belong to a specific discrete
category.
[Figure: Supervised learning]
For example, predicting high-risk patients and discriminating them from low-risk patients is
also a classification task. Suppose for newly admitted patients, an emergency room in a
hospital measures 12 variables (such as blood sugar, blood pressure, age, weight, and so
on). After measuring these variables, a decision is to be taken on whether to put the
patient in the ICU. A simple rule is that high priority should be given to the patients
who may survive more than a month.
The main objective of regression-based tasks is to predict continuous numerical output
responses based on the input data that is being provided. The output depends on the ML
model’s learning in the training phase. Similar to classification, with the help of regression,
we can predict the output responses for unseen data instances, but that is with continuous
numerical output values. Predicting the price of houses is one of the most common real-
world examples of regression.
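The house-price example can be sketched with ordinary least squares in closed form. The areas and prices below are invented so the points lie exactly on a line.

```python
# Fit price = a * area + b by ordinary least squares (closed form).
areas  = [50.0, 80.0, 100.0, 120.0]     # input (square metres), made-up data
prices = [150.0, 240.0, 300.0, 360.0]   # continuous numerical output (here price = 3 * area)

n = len(areas)
mean_x = sum(areas) / n
mean_y = sum(prices) / n
# Slope = covariance(x, y) / variance(x); intercept follows from the means.
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(areas, prices)) / \
    sum((x - mean_x) ** 2 for x in areas)
b = mean_y - a * mean_x

print(round(a, 2), round(b, 2))   # 3.0 0.0
predicted = a * 90 + b            # continuous prediction for an unseen 90 m² house
print(round(predicted, 1))        # 270.0
```

The prediction for an unseen input is a continuous number rather than a discrete category, which is the defining property of regression.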
Unsupervised learning
Unsupervised learning methods (as opposed to supervised learning methods) do not require
any pre-labeled training data. In such methods, the machine learning model or algorithm
tries to learn patterns and relationships from the given raw data without any supervision.
Although there are a lot of uncertainties in the result of these models, we can still obtain a
lot of useful information like all kinds of unknown patterns in the data and the features that
can be useful for categorization.
To make it clearer, suppose we have:

[Figure: Unsupervised learning]
Input variable: x
There would be no corresponding output variable. For learning, the algorithm needs to
discover interesting patterns in the data. K-means clustering, hierarchical clustering,
and Hebbian learning are some of the well-known unsupervised machine learning algorithms.
Based on the type of ML task, unsupervised learning methods can be categorized into the
following broad areas: clustering, association, and anomaly detection. Clustering, one of
the most useful unsupervised machine learning methods, is used to find the similarity and
relationship patterns among data samples and then group the samples into clusters having
similar features. The following figure illustrates the working of clustering methods:
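The clustering idea can also be sketched in code. Below is a minimal k-means loop in one dimension with k = 2; the points and initialisation are illustrative.

```python
# A minimal k-means sketch in one dimension (k = 2).
points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]   # two obvious groups of similar values
centres = [points[0], points[-1]]            # naive initialisation

for _ in range(10):                          # a few refinement iterations
    clusters = [[], []]
    for p in points:                         # assign each point to its nearest centre
        idx = min((0, 1), key=lambda i: abs(p - centres[i]))
        clusters[idx].append(p)
    centres = [sum(c) / len(c) for c in clusters]   # recompute centres as cluster means

print(centres)  # [1.5, 10.5]
```

No labels are used anywhere: the groups emerge purely from similarity between samples.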

One other useful unsupervised machine learning algorithm/method is Association. In order


to find patterns representing the interesting relationships between a variety of items,
association analyzes a large dataset. For example, analyzing customer shopping patterns
comes under association. It is also known as Association Rule Mining or Market Basket
Analysis.
Anomaly detection: Sometimes we need to find and eliminate observations that do not occur
normally. In that case, the most useful unsupervised ML method is anomaly detection, which
uses learned knowledge to differentiate between anomalous and normal data points. K-means
clustering, mean-shift clustering, and k-nearest neighbors (KNN) are some of the
unsupervised learning algorithms that can detect anomalous data based on its features.
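One of the simplest anomaly-detection rules flags points lying far from the mean, e.g. more than two standard deviations away. The data below is a toy example with one planted anomaly.

```python
# Flag observations that lie far from the mean (a simple z-score rule).
values = [10.0, 10.2, 9.9, 10.1, 9.8, 10.0, 25.0]   # 25.0 is the planted anomaly

mean = sum(values) / len(values)
std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5

# Anything more than 2 standard deviations from the mean is called anomalous.
anomalies = [v for v in values if abs(v - mean) > 2 * std]
print(anomalies)  # [25.0]
```

The rule is learned from the data itself (its mean and spread), with no labels required.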
Reinforcement learning
Reinforcement learning methods are a bit different from supervised, unsupervised, and
semi-supervised machine learning methods. In these kinds of algorithms, an agent interacts
with a specific environment: it observes the environment and takes actions based on its
current state. Let's understand the working of reinforcement learning methods in the
following steps:
1. Prepare an agent with some set of strategies.
2. Observe the environment's current state.
3. Based on the current state, select the optimal policy and perform a suitable action.
4. Receive a reward or penalty based on the action taken.
5. If needed, update the set of strategies.
6. Repeat until the agent learns and adopts the optimal policy.
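The observe-act-reward-update loop can be sketched as an epsilon-greedy agent facing a two-action environment. The reward probabilities, exploration rate, and step count below are hypothetical choices for illustration.

```python
import random

# An agent repeatedly picks one of two actions; action 1 pays off more often.
# Over many steps the agent's value estimates identify the better action.
random.seed(0)
true_reward = [0.2, 0.8]     # hidden environment (unknown to the agent)
estimates = [0.0, 0.0]       # the agent's current strategy (value table)
counts = [0, 0]
epsilon = 0.1                # exploration rate

for step in range(2000):
    # Observe / act: mostly exploit the best estimate, sometimes explore.
    if random.random() < epsilon:
        action = random.randrange(2)
    else:
        action = max((0, 1), key=lambda a: estimates[a])
    # Reward (or effectively a penalty of 0) comes back from the environment.
    reward = 1.0 if random.random() < true_reward[action] else 0.0
    # Update the strategy with the new experience.
    counts[action] += 1
    estimates[action] += (reward - estimates[action]) / counts[action]

print(max((0, 1), key=lambda a: estimates[a]))  # index of the action the agent prefers
```

With the fixed seed, the agent's estimate for action 1 ends up higher, i.e. it has adopted the better policy through trial, reward, and update alone.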
Advantages of Machine Learning
There are many advantages of machine learning. These are some of the helpful advantages:
Easily identifies trends and patterns: Machine learning takes large amounts of data and
discovers the hidden structures and patterns that are very difficult for humans to find.
For example, an e-commerce website like Flipkart works to understand the browsing and
purchase records of its users in order to target relevant products, deals, and reminders,
and uses the results to show relevant ads or products to them.
No human intervention needed (automation): Machine learning is a very powerful tool for
automating various decision-making tasks. By automating things, we allow the algorithm to
perform difficult tasks. Automation is now practiced almost everywhere because it is very
reliable, and it frees us to focus on more demanding problems.
Efficient handling of any type of data: There are many factors that make machine learning
reliable, and data handling is one of them. Machine learning plays a big role when it
comes to data: it can handle any type and amount of data, and it can process and analyze
data that normal systems cannot. Data is the most important part of a machine learning
model.
Continuous Improvement: As machine learning algorithms gain experience from data, they
keep improving their accuracy and efficiency. This lets them make better decisions.
Suppose, we need to make a weather forecast model. As the amount of data we have keeps
growing, our algorithms learn to make accurate predictions faster.
Disadvantages of Machine Learning
Similar to the advantages of machine learning, we should also know its disadvantages. If
we don't know the cons, we won't know the risks of ML.
Possibility of high error: In machine learning, we choose algorithms based on how accurate
their results are, so the results must be validated across algorithms. A major problem
arises with the training and testing of data: because the data is large, removing errors
sometimes becomes impossible, and these errors can cause headaches for users and take a
lot of time to resolve.
Data acquisition: Machine learning requires large datasets for training and testing the
model, and the algorithms need good-quality data for accurate results. In many situations
the data constantly keeps updating, so we have to wait for the new data to arrive.
Time and resources: Machine learning algorithms require enough time to learn and develop
to the point of achieving their goals with a high degree of accuracy and consistency. They
also require significant resources to operate, which may mean additional computational
power for our computers.

MACHINE LEARNING CHALLENGES

1. Poor Quality of Data

Data plays a significant role in the machine learning process. One of the significant issues
that machine learning professionals face is the absence of good quality data. Unclean and
noisy data can make the whole process extremely exhausting. We don’t want our
algorithm to make inaccurate or faulty predictions. Hence the quality of data is essential
to enhance the output. Therefore, we need to ensure that the process of data
preprocessing which includes removing outliers, filtering missing values, and removing
unwanted features, is done with the utmost level of perfection.

2. Underfitting of Training Data

Underfitting occurs when the model is unable to establish an accurate relationship between
the input and output variables. It is like trying to fit into undersized jeans: the model
is too simple to capture the precise relationship in the data. To overcome this issue:
 Increase the training time of the model
 Increase the complexity of the model
 Add more features to the data
 Reduce the regularization parameters

3. Overfitting of Training Data

Overfitting occurs when a machine learning model is trained on so much (or such skewed)
data that it negatively affects the model's performance. It is like trying to fit into
oversized jeans. Unfortunately, this is one of the significant issues faced by machine
learning professionals: the algorithm is trained with noisy or biased data, which affects
its overall performance. Let's understand this with an example. Consider a model trained
to differentiate between a cat, a rabbit, a dog, and a tiger. If the training data
contains 1,000 cats, 1,000 dogs, 1,000 tigers, and 4,000 rabbits, then there is a
considerable probability that the model will identify a cat as a rabbit. In this example
we had a vast amount of data, but it was biased, so the prediction was negatively
affected.
We can tackle this issue by:
 Analyze the data with the utmost level of perfection
 Use data augmentation techniques
 Remove outliers from the training set
 Select a model with fewer features

4. Machine Learning is a Complex Process

The machine learning industry is young and continuously changing. Rapid hit-and-trial
experiments are being carried out. The process is transforming, and hence there are high
chances of error, which makes learning complex. It includes analyzing the data, removing
data bias, training the data, applying complex mathematical calculations, and a lot more.
Hence it is a really complicated process, which is another big challenge for machine
learning professionals.

5. Lack of Training Data

The most important task in the machine learning process is to train the model on enough
data to achieve accurate output. Too little training data will produce inaccurate or
overly biased predictions. Let us understand this with an example: training a machine
learning algorithm is similar to teaching a child. Suppose one day you decide to explain
to a child how to distinguish between an apple and a watermelon. You take an apple and a
watermelon and show him the difference based on their color, shape, and taste; in this
way, he soon attains perfection in differentiating between the two. A machine learning
algorithm, on the other hand, needs a lot of data to make the same distinction, and for
complex problems it may even require millions of examples. Therefore, we need to ensure
that machine learning algorithms are trained with sufficient amounts of data.
6. Slow Implementation
This is one of the common issues faced by machine learning professionals. Machine learning
models can be highly effective in providing accurate results, but producing those results
can take a tremendous amount of time. Slow programs, data overload, and excessive
requirements usually mean it takes a long time to provide accurate results. Further, the
models require constant monitoring and maintenance to deliver the best output.

7. Imperfections in the Algorithm When Data Grows

So you have found quality data, trained the model well, and the predictions are precise
and accurate. But there is a twist: the model may become less useful over time as the data
grows. The best model of the present may become inaccurate in the future and require
further retraining. You therefore need regular monitoring and maintenance to keep the
algorithm working. This is one of the most exhausting issues faced by machine learning
professionals.

STATISTICAL LEARNING
INTRODUCTION
There are two major goals for modeling data: 1) to accurately predict some future
quantity of interest, given some observed data, and 2) to discover unusual or
interesting patterns in the data.
Function approximation.
Building a mathematical model for data usually means understanding how one data
variable depends on another data variable. The most natural way to represent the
relationship between variables is via a mathematical function or map.
Optimization.
Given a class of mathematical models, we wish to find the best possible model in
that class. This requires some kind of efficient search or optimization procedure. The
optimization step can be viewed as a process of fitting or calibrating a function to
observed data.

Probability and Statistics.


In order to quantify the uncertainty inherent in making predictions about the future,
and the sources of error in the model, data scientists need a firm grasp of probability
theory and statistical inference.

Supervised and Unsupervised Learning


Given an input or feature vector x, one of the main goals of machine learning is to
predict an output or response variable y. For example, x could be a digitized
signature and y a binary variable that indicates whether the signature is genuine or
false. Another example is where x represents the weight and smoking habits of an
expecting mother and y the birth weight of the baby. The data science attempt at this
prediction is encoded in a mathematical function g, called the prediction function ,
which takes as an input x and outputs a guess g(x) for y (denoted by 𝑦^, for example).
In a sense, g encompasses all the information about the relationship between the
variables x and y, excluding the effects of chance and randomness in nature.
Loss function:

We measure the accuracy of a prediction ŷ with respect to a given response y by
using some loss function Loss(y, ŷ). In a regression setting the usual choice is the
squared-error loss (y − ŷ)². In the case of classification, the zero–one (also written
0–1) loss function Loss(y, ŷ) = 1{y ≠ ŷ} is often used, which incurs a loss of 1
whenever the predicted class ŷ is not equal to the class y.
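Both loss functions can be written out directly; the inputs below are arbitrary example values.

```python
# The two loss functions from the text, written out directly.
def squared_error_loss(y, y_hat):
    """Regression loss: (y - y_hat)^2."""
    return (y - y_hat) ** 2

def zero_one_loss(y, y_hat):
    """Classification loss: 1 if the predicted class differs from y, else 0."""
    return 1 if y != y_hat else 0

print(squared_error_loss(3.0, 2.5))   # 0.25
print(zero_one_loss("spam", "ham"))   # 1
print(zero_one_loss("spam", "spam"))  # 0
```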
Risk

Even with the same input x, the output y may be different, depending on chance
circumstances or randomness. For this reason, we adopt a probabilistic approach
and assume that each pair (x, y) is the outcome of a random pair (X, Y) that has
some joint probability density f(x, y). We then assess the predictive performance
via the expected loss, usually called the risk, for g:

ℓ(g) = E Loss(Y, g(X)).

For example, in the classification case with the zero–one loss function the risk is
equal to the probability of incorrect classification:

ℓ(g) = P(Y ≠ g(X)).

We denote this sample by Ƭ = {(X1, Y1), . . . , (Xn, Yn)} and call it the training
set (Ƭ is a mnemonic for training) with n examples. It will be important to distinguish
between a random training set Ƭ and its (deterministic) outcome {(x1, y1), . . . ,
(xn, yn)}.
Our goal is thus to "learn" the unknown g∗ using the n examples in the training set Ƭ.
Let us denote by gƬ the best (by some criterion) approximation for g∗ that we can
construct from Ƭ.
The above setting is called supervised learning , because one tries to learn the
functional relationship between the feature vector x and response y in the presence
of a teacher who provides n examples. It is common to speak of “explaining” or
predicting y on the basis of x, where x is a vector of explanatory variables. In
contrast, unsupervised learning makes no distinction between response and
explanatory variables, and the objective is simply to learn the structure of the
unknown distribution of the data. In other words, we need to learn f (x). In this case
the guess g(x) is an approximation of f (x) and the risk is of the form

ℓ(g) = E Loss(f(X), g(X)).
Training and Test Loss


Given an arbitrary prediction function g, it is typically not possible to compute its risk ℓ(g). However, using the training sample Ƭ, we can approximate ℓ(g) via the empirical (sample average) risk

ℓ_Ƭ(g) = (1/n) Σ_{i=1}^{n} Loss(y_i, g(x_i)),

which we call the training loss. The training loss is thus an unbiased estimator of the risk (the expected loss) for a fixed prediction function g, based on the training data. To approximate the optimal prediction function g∗ (the minimizer of the risk ℓ(g)) we first select a suitable collection of approximating functions G and then take our learner to be the function in G that minimizes the training loss; that is,

g_Ƭ^G = argmin_{g∈G} ℓ_Ƭ(g).
For example, the simplest and most useful G is the set of linear functions of x; that is, the set of all functions g: x ↦ βᵀx for some real-valued vector β.
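For the class G of linear functions, minimizing the squared-error training loss has a closed-form (least-squares) solution. A minimal pure-Python sketch for one feature plus an intercept (function names are my own):

```python
# Fit g(x) = b0 + b1*x by minimizing the training loss
# (1/n) * sum_i (y_i - g(x_i))^2, via the least-squares normal equations.

def fit_linear(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
         sum((x - mx) ** 2 for x in xs)
    b0 = my - b1 * mx
    return b0, b1

def training_loss(xs, ys, g):
    # Empirical (sample average) squared-error risk of g on the training set.
    return sum((y - g(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)
```

For the exactly linear data xs = [0, 1, 2, 3], ys = [1, 3, 5, 7] this returns b0 = 1, b1 = 2, with a training loss of zero.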
Generalization risk

The prediction accuracy on new pairs of data is measured by the generalization risk of the learner. For a fixed training set Ƭ it is defined as

ℓ(g_Ƭ) = E Loss(Y, g_Ƭ(X)),

and in practice it is estimated via the average loss on a test sample,

ℓ_Ƭ′(g_Ƭ) = (1/n′) Σ_{i=1}^{n′} Loss(Y′_i, g_Ƭ(X′_i)),

where Ƭ′ := {(X′_1, Y′_1), (X′_2, Y′_2), . . . , (X′_{n′}, Y′_{n′})} is a so-called test sample. The test sample is completely separate from Ƭ, but is drawn in the same way as Ƭ; that is, via independent draws from f (x, y), for some sample size n′.

(Polynomial Regression) In what follows, it will appear that we have arbitrarily replaced the symbols x, g, G with u, h, H, respectively. The data (depicted as dots) are n = 100 points (u_i, y_i), i = 1, . . . , n drawn from iid random points (U_i, Y_i), i = 1, . . . , n, where the {U_i} are uniformly distributed on the interval (0, 1) and, given U_i = u_i, the random variable Y_i has a normal distribution with expectation 10 − 140u_i + 400u_i² − 250u_i³ and variance ℓ∗ = 25. This is an example of a polynomial regression model. Using a squared-error loss, the optimal prediction function h∗(u) = E[Y | U = u] is thus

h∗(u) = 10 − 140u + 400u² − 250u³.

The figure shows the training data and the optimal polynomial prediction function h∗. To obtain a good estimate of h∗(u) based on the training set Ƭ = {(u_i, y_i), i = 1, . . . , n}, we minimize the training loss

ℓ_Ƭ(h) = (1/n) Σ_{i=1}^{n} (y_i − h(u_i))².

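Under the stated model, the training data can be simulated as follows (a sketch; the seed is an arbitrary choice, and note that random.gauss takes the standard deviation √25 = 5, not the variance):

```python
import math
import random

# Simulate n = 100 points: U ~ Uniform(0, 1) and, given U = u,
# Y ~ Normal(h*(u), variance 25).

def h_star(u):
    return 10 - 140 * u + 400 * u ** 2 - 250 * u ** 3

random.seed(1)  # assumed seed, for reproducibility only
n = 100
training_set = []
for _ in range(n):
    u = random.random()                         # U_i uniform on (0, 1)
    y = random.gauss(h_star(u), math.sqrt(25))  # standard deviation 5
    training_set.append((u, y))
```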
Tradeoffs in Statistical Learning


The art of machine learning in the supervised case is to make the generalization risk
or expected generalization risk as small as possible, while using as few
computational resources as possible. In pursuing this goal, a suitable class G of
prediction functions has to be chosen. This choice is driven by various factors, such
as
1. the complexity of the class (e.g., is it rich enough to adequately approximate,
or even contain, the optimal prediction function g ∗?),
2. the ease of training the learner via the optimization program ,
3. how accurately the training loss estimates the risk within class G,
4. the feature types (categorical, continuous, etc.).

We can decompose the generalization risk into the following three components:

ℓ(g_Ƭ^G) = ℓ∗ + (ℓ(g^G) − ℓ∗) + (ℓ(g_Ƭ^G) − ℓ(g^G))

Irreducible risk

where ℓ∗ = ℓ(g∗) is the irreducible risk and g^G := argmin_{g∈G} ℓ(g) is the best learner within class G. No learner can predict a new response with a smaller risk than ℓ∗.
Approximation error
The second component, ℓ(g^G) − ℓ∗, is the approximation error. It measures the difference between the irreducible risk and the best possible risk that can be obtained by selecting the best prediction function in the chosen class of functions G.
Statistical (estimation) error
The third component, ℓ(g_Ƭ^G) − ℓ(g^G), is the statistical (estimation) error. It depends on the training set Ƭ and, in particular, on how well the learner g_Ƭ^G estimates the best possible prediction function, g^G, within class G. For any sensible estimator this error should decay to zero (in probability or expectation) as the training size tends to infinity.

Estimating Risk
The generalization risk can be estimated via the test loss. However, the generalization risk depends on the training set: different training sets can produce different estimates.

In-sample risk


The training loss of the learner g_Ƭ is not a good estimate of its generalization risk, because of overfitting: the same data are used for both training and testing. The setting can be simplified by estimating the predictive accuracy only at the feature vectors x_1, . . . , x_n observed in the sample; this gives the in-sample risk

ℓ_in(g_Ƭ) = (1/n) Σ_{i=1}^{n} E Loss(Y′_i, g_Ƭ(x_i)),

where each response variable Y′_i is drawn from f (y | x_i). Even in this simplified setting, the training loss of the learner will be a poor estimate of the in-sample risk. Instead, the proper way to assess the prediction accuracy of the learner at the feature vectors x_1, . . . , x_n is to draw new response values Y′_i ~ f (y | x_i), i = 1, . . . , n, that are independent of the responses y_1, . . . , y_n in the training data, and then estimate the in-sample risk of g_Ƭ.
UNIT-II

Regression

Regression is a statistical method, used in finance, investment,


and other disciplines, that attempts to determine the strength and
nature of the relationship between a single dependent/output
variable and one or more independent/input variables.

Regression algorithms are used if there is a relationship between


one or more input variables and output variables. It is used when
the value of the output variable is continuous or real, such as
house price, weather forecasting, stock price prediction, and so
on.

The following Table 2.1 shows the dataset, which serves the
purpose of predicting the house price based on different
parameters:

Table 2.1: Dataset of Regression

Here the input variables are Size of house, No. of Bedrooms, and
No. of Bathrooms, and the output variable is the Price of the
house, which is a continuous value. Therefore, this is a regression problem.

The goal here is to predict a value as close as possible to the actual


output value; evaluation is then done by calculating the error
value. The smaller the error, the greater the accuracy of the
regression model.

Regression is used in many real-life applications, such as financial


forecasting (house price prediction or stock price prediction),
weather forecasting, time series forecasting, and so on.

In regression, we plot a graph between the variables that best


fits the given data points; using this plot, the machine learning
model can make predictions about the data. In simple words,
regression fits a line or curve through the target–predictor
graph so as to minimize the vertical distance between the data points
and the regression line. The distance between the data points and
the line tells us whether the model has captured a strong
relationship or not.
Terminologies used in Regression

Following are the terminologies used in regression:

Dependent variable: The dependent variable, also known as the target,


output variable, or response, is the variable in regression analysis
that we want to predict or understand. It is
denoted by ‘Y’.

Price of a house is the dependent variable (refer to Table 2.1).


Independent variable: The independent variable, also known as the input


variable or predictor, affects the dependent variable and is
used to predict the values of the dependent variable. It
is denoted by ‘X’.

Size of house, number of bedrooms, and number of bathrooms
are independent variables (refer to Table 2.1).

Outlier: Outliers are observed data points that are far from the least-squares
line or that differ significantly from the other data points or
observations; in other words, an outlier is an extreme value
that differs greatly from the other values in a set of values.

In Figure 2.3 there are a bunch of apples, but one apple is different.


This apple is what we call an outlier.
Figure 2.3: Outlier

Outliers need to be examined closely. Sometimes, for some reason


or another, they should not be included in data analysis. It is
possible that an outlier is a result of erroneous data. Other times,
an outlier may hold valuable information about the population
under study and should remain included in the data. The key is
to examine carefully what causes a data point to be an outlier.
Figure 2.4: Outlier in House Price Dataset

Consider the following example. Suppose, as shown in Figure 2.4, we


sample the number of bedrooms in a house and note the price
of each house. We can see from the dataset that ten houses
range between 200000 and 875000, but one house is priced at
3000000; that house will be considered an outlier.

Multicollinearity: Multicollinearity in regression analysis occurs when two or more


independent variables are closely related to each other, so that they do not
provide unique or independent information to the regression model. If
the degree of correlation between variables is high enough, it can
cause problems when fitting and interpreting the regression model.

For example, suppose we do a regression analysis using variable


height, shoe size, and hours spent in practice in a day to predict
high jumps for basketball players. In this case, the height and
size of the shoes may be closely related to each other because
taller people tend to have larger shoe sizes. This means that
multicollinearity may be a problem in this regression.
Take another example. Suppose we have two inputs, X1 and X2:

X1 = [0, 3, 4, 9, 6, 2]

X2 = [0, −1.5, −2, −4.5, −3, −1]

X1 = -2 * X2

So X1 & X2 are collinear. Here it’s better to use only one variable,
either X1 or X2 for the input.
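This can be checked numerically: a sketch computing the Pearson correlation coefficient of the two inputs (the function name is my own), which is exactly −1 here since X1 = −2 × X2:

```python
import math

X1 = [0, 3, 4, 9, 6, 2]
X2 = [0, -1.5, -2, -4.5, -3, -1]

def pearson(a, b):
    # Pearson correlation: cov(a, b) / (std(a) * std(b))
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / math.sqrt(sum((x - ma) ** 2 for x in a) *
                           sum((y - mb) ** 2 for y in b))
```

A correlation of ±1 signals perfect collinearity, so one of the two variables can be dropped.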

Underfitting and Overfitting: Overfitting and underfitting are two primary


issues that happen in machine learning and decrease the
performance of the machine learning model.

The main goal of each machine learning model is to produce a


suitable output for a given set of unseen inputs; this
is known as generalization. It means that after being trained on the dataset, the model
can produce reliable and accurate output. Hence, underfitting and
overfitting are the two terms that need to be checked for the
reliability of the machine learning model.

Let us understand the basic terminology for overfitting and


underfitting:

Signal: The true underlying pattern of the data.

Noise: Unnecessary and insignificant data that reduces the performance.


Bias: The difference between the expected and actual values.

Variance: When the model performs well on the training dataset but not on the
test dataset, variance exists.

Figure 2.5: Underfitting and Overfitting

On the left side of Figure 2.5, one can easily see that
the line does not cover all the points shown in the graph. Such a
model tends to cause a phenomenon known as underfitting of
data. In the case of underfitting, the model cannot learn enough
from the training data and thus has
reduced precision and accuracy. There is high bias and low
variance in the underfitted model.

Contrarily, when we consider the right side of the graph in Figure 2.5,


it shows that the predicted line covers all the points in the graph.
In such a situation, we might think this is a good graph that
covers all the points, but that is not true. The predicted line of
the given graph covers all the points, including those which are
noise and outliers. Such a model tends to cause a phenomenon
known as overfitting of data. This model predicts
poor results due to its high complexity. The overfitted
model has low bias and high variance, so it is also known as the
High Variance model.

Now, consider the middle graph in Figure 2.5; it shows a well-


predicted line. It covers a major portion of the points in the
graph while also maintaining the balance between bias and
variance. Such a model tends to cause a phenomenon known as
appropriate fitting of data.

Types of Linear Regression

As shown in Figure 2.6, linear regression is classified into two


categories based on the number of independent variables:

Figure 2.6: Types of Linear Regression


Simple Linear Regression

Simple Linear Regression is a type of linear regression where we


have only one independent variable to predict the dependent
variable. The dependent variable must be a continuous/real value.

The relationship between the independent variable (X) and the dependent


variable (Y) is shown by a linear, sloped straight line, as
shown in Figure 2.7; hence it is called Simple Linear Regression:

Figure 2.7: Simple Linear Regression

The Simple Linear Regression model can be represented using the


following equation:

Y = B0 + B1X

Where,

Y: Dependent variable

X: Independent variable

B0: The Y-intercept of the regression line, where the best-fitted line


intercepts the Y-axis.

B1: The slope of the regression line, which tells whether the line is
increasing or decreasing.

Therefore, in this graph, the dots are our data and based on this
data we will train our model to predict the results. The black line
is the best-fitted line for the given data. The best-fitted line is a
straight line that best represents the data on a scatter plot. The
best-fitted line may pass through all of the points, some of the
points or none of the points in the graph.

The goals of simple linear regression are as follows:

To find out if there is a correlation between dependent &


independent variables.

To find the best-fit line for the dataset. The best-fit line is the one
for which the total prediction error (over all data points) is as small as
possible.

To determine how the dependent variable changes as the
independent variable changes.

Here Size of house is the independent variable (X) and Price is the dependent


variable (Y). Our aim is to find the values of B0 and B1 such
that they produce the best-fitted regression line. This linear equation
is then used for new data.

The House dataset is used to train our linear regression model.


That is, if we give Size of house as an input, our model should predict Price with
minimum error.

Here Y’ is the predicted value of Y.

The values B0 and B1 must be chosen so that they minimize the error.
If the sum of squared errors is taken as the metric to evaluate the
model, then the goal of obtaining a line that best reduces the error
is achieved.

If we do not square the error, then the positive and negative errors


will cancel each other out.
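The cancellation can be demonstrated numerically. A sketch with assumed toy data: for the least-squares line the signed errors sum to (essentially) zero, while the squared errors do not:

```python
# Toy data (assumed for illustration).
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 6]

# Least-squares estimates of B0 (intercept) and B1 (slope).
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
     sum((x - mx) ** 2 for x in xs)
b0 = my - b1 * mx

errors = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
sum_of_errors = sum(errors)                         # ~ 0: cancellation
sum_of_squared_errors = sum(e * e for e in errors)  # strictly positive
```

Here the raw errors (0.8, −1.2, 0.8, −1.2, 0.8) sum to zero even though the fit is not perfect, whereas the sum of squared errors is 4.8.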
Multiple Linear Regression

If there is more than one independent variable for the prediction


of a dependent variable, then this type of linear regression is
known as multiple linear regression.
Classification

Classification is the process of grouping the output into different


classes based on one or more input variables. Classification is
used when the value of the output variable is discrete or
categorical, such as an email being spam or not spam, yes or no,
true or false, 0 or 1, and so on.

If the algorithm tries to classify input variables into two different


classes, it is called binary classification, such as whether an email is spam
or not spam. When the algorithm tries to classify input variables into
more than two classes, it is called multiclass classification, such as
handwritten character recognition, where classes go from 0 to 9.
Figure 2.8 shows a classification example: deciding whether an email
is spam or not spam:
Figure 2.8: Example of Classification

Naïve Bayes classifier algorithm

Naive Bayes classifiers are a group of classification algorithms that


are based on Bayes' Theorem. It is a family of algorithms that
share a common principle, namely that every pair of the
features being classified is independent of each other.

The Naïve Bayes algorithm is a supervised learning algorithm that


solves classification problems and is based on the Bayes theorem.

It is primarily used in text classification with a large training


dataset.

The Naïve Bayes Classifier is a simple and effective Classification


algorithm that aids in the development of fast machine learning
models capable of making quick predictions.

It is a probabilistic classifier, which means it makes predictions based on


an object's probability.

Spam filtration, Sentimental analysis, and article classification are


some popular applications of the Naïve Bayes Algorithm.

For example, if a fruit is red, round, and about 3 inches in


diameter, it is classified as an apple. Even if these features are
dependent on each other or on the presence of other features, all
of these properties independently contribute to the likelihood that
this fruit is an apple, which is why it is called ‘Naïve’.

The Naive Bayes model is simple to construct and is especially


useful for very large data sets. In addition to its simplicity, Naive
Bayes has been shown to outperform even the most sophisticated
classification methods.

Why is it called Naïve Bayes?

The Naive Bayes algorithm is made up of the words Naive and


Bayes, which can be interpreted as:

It is called Naïve because it assumes that the occurrence of one


feature is unrelated to the occurrence of others. For example, if
the fruit is identified based on color, shape, and taste, then a red,
spherical, and sweet fruit is identified as an apple. As a result,
each feature contributes to identifying it as an apple without
relying on the others.

It is known as Bayes because it is based on the principle of


Bayes' Theorem.

Principle of Naive Bayes Classifier

A Naive Bayes classifier is a type of probabilistic machine learning


model that is used to perform classification tasks. The classifier's
crux is based on the Bayes theorem.

The Bayes theorem allows us to calculate the posterior probability


P(c|x) from P(c), P(x), and P(x|c). Consider the following equation:

P(c|x) = P(x|c) P(c) / P(x)

Where,

P(c|x): Posterior probability of class c (target) given predictor x (attributes)

P(c): The prior probability of the class

P(x|c): The likelihood, which is the probability of the predictor given the class

P(x): The prior probability of the predictor

Working of Naïve Bayes' Classifier

The following steps demonstrate how the Naïve Bayes Classifier


works.

Step 1: Convert the given dataset into frequency tables.

Step 2: Create a likelihood table by calculating the probabilities of


the given features.

Step 3: Use Bayes' theorem to calculate the posterior probability.


Naive Bayes Example

The weather training data set and corresponding target variable


are shown in Figure 2.9. We must now categorize whether or not
players will play based on the weather.

Step 1: Make a frequency table out of the data set.

Step 2: Make a likelihood table by calculating probabilities such as


the overcast probability and the probability of playing.

Figure 2.9: Weather training data set and corresponding frequency and
likelihood table

Step 3: Calculate the posterior probability for each class using the
Naive Bayesian equation. The outcome of the prediction is the class
with the highest posterior probability.
Naive Bayes employs a similar method to predict the likelihood of
various classes based on various attributes. This algorithm is
commonly used in text classification and multi-class problems.
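The three steps can be sketched numerically. The counts below are assumed from a common version of the weather/play dataset (the figure's actual table may differ); they give the posterior P(Play = Yes | Outlook = Sunny):

```python
# Assumed frequency-table counts (a common version of the weather dataset).
n_total = 14          # total observations
n_yes = 9             # days with Play = Yes
n_sunny = 5           # days with Outlook = Sunny
n_sunny_and_yes = 3   # sunny days on which Play = Yes

p_yes = n_yes / n_total                       # prior P(Yes)
p_sunny = n_sunny / n_total                   # evidence P(Sunny)
p_sunny_given_yes = n_sunny_and_yes / n_yes   # likelihood P(Sunny | Yes)

# Bayes' theorem: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
```

With these counts the posterior is 0.6, so Yes would be the predicted class for a sunny day.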

Types of Naïve Bayes

The Naive Bayes Model is classified into three types, which are
listed as follows:

Gaussian Naïve Bayes: The Gaussian model is based on the assumption


that features have a normal distribution. This means that if
predictors take continuous values rather than discrete values, the
model assumes these values are drawn from the Gaussian
distribution.

Multinomial Naïve Bayes: When the data is multinomially distributed, the


Multinomial Naive Bayes classifier is used. It is primarily used to
solve document classification problems, indicating which category a
particular document belongs to, such as sports, politics, education,
and so on. The predictors of the classifier are based on the
frequency of words.

Bernoulli Naïve Bayes: The Bernoulli classifier operates similarly to the


multinomial classifier, except that the predictor variables are
independent Boolean variables. For example, whether or not a
specific word appears in a document. This model is also well-
known for performing document classification tasks.

Advantages and disadvantages of Naïve Bayes

Following are the advantages:


Easy and fast algorithm to predict a class of datasets.

Used for binary as well as multi-class classifications.

It performs well with categorical input variables compared to


numerical variables.

It is one of the most widely used algorithms for text classification problems.

Following are the disadvantages:

Naive Bayes is based on the assumption that all predictors (or


features) are independent, which rarely holds in practice.

This algorithm is confronted by the zero-frequency problem: a category
that appears in the test data but was never observed in training is
assigned zero probability.

Applications of Naïve Bayes Algorithms

Following are the applications of Naïve Bayes Algorithms:

Text classification

Spam Filtering

Real-time Prediction

Multi-class Prediction
Recommendation Systems

Credit Scoring

Sentiment Analysis

Decision Tree

A decision tree is a supervised learning technique that can be


applied to classification and regression problems.

Decision trees are designed to mimic human decision-making


abilities, making them simple to understand.

Because the decision tree has a tree-like structure, the logic


behind it is easily understood.

Decision-tree working

Following steps are involved in the working of Decision-tree:

Begin the tree with node T, which contains the entire dataset.

Using the Attribute Selection Measure, find the best attribute in


the dataset.

Divide T into subsets that contain the possible values of the best
attribute.
Create the decision tree node with the best attribute.

Make new decision trees recursively using the subsets of the


dataset created in step 3.

Continue this process until you reach a point where you can no
longer classify the nodes and refer to the final node as a leaf
node.
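Step 2's Attribute Selection Measure is commonly information gain (reduction in entropy). A pure-Python sketch with assumed toy data (function names are my own):

```python
import math

def entropy(labels):
    # Shannon entropy of a list of class labels, in bits.
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def information_gain(rows, labels, attr_index):
    # Gain(T, A) = Entropy(T) - sum_v (|T_v| / |T|) * Entropy(T_v),
    # where T_v is the subset of rows whose attribute A has value v.
    total = entropy(labels)
    n = len(labels)
    for v in set(r[attr_index] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attr_index] == v]
        total -= len(subset) / n * entropy(subset)
    return total
```

The attribute with the highest gain is chosen as the splitting node; an attribute that separates the labels perfectly has a gain equal to the full entropy of the label set.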

General structure: The general decision tree structure is shown


in Figure 2.10:

Figure 2.10: General Decision Tree structure

A decision tree can contain both categorical (Yes/No) and


numerical data.
Example of decision-tree

Assume we want to play badminton on a specific day, say


Saturday - how will you decide whether or not to play? Assume
you go outside to see if it is hot or cold, the speed of the wind
and humidity, and the weather, i.e., whether it's sunny, cloudy, or
rainy. You consider all of these factors when deciding whether or
not to play. Table 2.3 shows the weather observations of the last
ten days.

Table 2.3: Weather Observations of the last ten days


A decision tree is a great way to represent the data because it
follows a tree-like structure and considers all possible paths that
can lead to the final decision.

Figure 2.11: Play Badminton decision tree

Figure 2.11 depicts a learned decision tree. Each node represents


an attribute or feature, and the branch from each node represents
the node's outcome. Finally, the final decision is made by the
tree's leaves.

Advantages of decision tree

Following are the advantages of decision tree:

Simple and easy to understand.

Popular technique for resolving decision-making problems.


It aids in considering all possible solutions to a problem.

Less data cleaning is required.

Disadvantages of the decision tree

Following are the disadvantages of decision tree:

Overfitting problem

Complexity

K-Nearest Neighbors (K-NN) algorithm

The K-Nearest Neighbors algorithm is a clear and simple


supervised machine learning algorithm that can be used to solve
regression and classification problems.

The K-NN algorithm assumes that the new case and existing
cases are similar and places the new case in the category that is
most similar to the existing categories. The K-NN algorithm stores
all available data and classifies a new data point based on its
similarity to the existing data. This means that when new data
appears, the KNN algorithm can quickly classify it into a suitable
category.

K-NN is a non-parametric algorithm, which means it makes no


assumptions about the data it uses. It's also known as a lazy
learner algorithm because it doesn't learn from the training set
right away; instead, it stores the dataset and performs an action
on it when it comes time to classify it.

Pattern recognition, data mining, and intrusion detection are some


of the demanding applications.

Need of the K-NN Algorithm

Assume there are two categories, Category A and Category B, and


we have a new data point x1. Which of these categories will this
data point fall into? A K-NN algorithm is required to solve this
type of problem. We can easily identify the category or class of a
dataset with the help of K-NN, as shown in Figure 2.12:

Figure 2.12: Category or class of a dataset with the help of K-NN

The following algorithm can be used to explain how KNNs work:

Select the number K of neighbors.

Calculate the Euclidean distance from the new data point to the
existing data points.

Take the K closest neighbors based on the calculated Euclidean
distances.

Count the number of data points in each category among these K


neighbors.

Assign the new data points to the category with the greatest
number of neighbors.

Our model is complete.

Let's say we have a new data point that needs to be placed in


the appropriate category. Consider the following illustration:

Figure 2.13: K-NN example

First, we'll decide on the number of neighbors; we'll go with K = 5.
The Euclidean distance between the data points will then be


calculated, as shown in Figure 2.14. The Euclidean distance is the
distance between two points that we learned about in geometry
class. It can be calculated using the following formula:

d = √((x2 − x1)² + (y2 − y1)²)

Figure 2.14: The Euclidean distance between the data points

We found the closest neighbors by calculating the Euclidean


distance, which yielded three closest neighbors in category A and
two closest neighbors in category B, as shown in Figure 2.15. Consider
the following illustration:
Figure 2.15: Closest neighbors for the Category A and B

As can be seen, the three closest neighbors are all from category
A, so this new data point must also be from that category.
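The steps above can be written as a small pure-Python classifier (a sketch; the point coordinates and function names are my own):

```python
import math
from collections import Counter

def euclidean(p, q):
    # Straight-line distance between two points of equal dimension.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_classify(train, new_point, k=5):
    # train: list of (point, label) pairs.
    # Steps 1-2: compute distances and take the K nearest neighbors.
    neighbors = sorted(train, key=lambda pl: euclidean(pl[0], new_point))[:k]
    # Steps 3-4: count labels among the neighbors and take the majority.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

Mirroring the illustration: with three nearby category-A points, two category-B points, and K = 5, a new point close to the A cluster is assigned to category A.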

Logistic Regression

The classification algorithm logistic regression is used to assign


observations to a discrete set of classes. Unlike linear regression,
which produces a continuous number of values, logistic regression
produces a probability value that can be mapped to two or more
discrete classes using the logistic sigmoid function.

A regression model with a categorical target variable is known as


logistic regression. To model binary dependent variables, it
employs a logistic function.

The target variable in logistic regression has two possible values,


such as yes/no. Consider how the target variable y would be
represented if the value of "yes" is 1 and "no" is 0. The log-odds
of y being 1 is a linear combination of one or more predictor
variables, according to the logistic model. So, let's say we have
two predictors or independent variables, x1 and x2, and p is the
probability of y equaling 1. Then, using the logistic model as a
guide:

ln(p / (1 − p)) = β0 + β1x1 + β2x2

We can recover the odds by exponentiating the equation:

p / (1 − p) = e^(β0 + β1x1 + β2x2)

Here p is the probability that y equals 1: if p is closer to 0, y equals
0, and if p is closer to 1, y equals 1. Solving for p, the logistic
regression equation is:

p = 1 / (1 + e^−(β0 + β1x1 + β2x2))

This equation can be generalized to n parameters and


independent variables as follows:

p = 1 / (1 + e^−(β0 + β1x1 + β2x2 + … + βnxn))
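The generalized equation can be sketched directly in code (the coefficient values would come from fitting a model; the names here are my own):

```python
import math

def predict_probability(betas, xs):
    # betas[0] is the intercept b0; betas[1:] pair with the inputs xs.
    z = betas[0] + sum(b * x for b, x in zip(betas[1:], xs))
    return 1 / (1 + math.exp(-z))   # logistic sigmoid, always in (0, 1)

def predict_class(betas, xs, threshold=0.5):
    # Map the probability to a discrete class label.
    return 1 if predict_probability(betas, xs) >= threshold else 0
```

With no predictors and a zero intercept the probability is exactly 0.5; large positive log-odds push the probability toward 1 and large negative log-odds toward 0.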

Comparison between linear and logistic regression

Linear regression and logistic regression predict different things,


as shown in Figure 2.16:

Linear regression: The predictions of linear regression are continuous


(numbers in a range). It may be able to assist us in predicting
the student's test score on a scale of 0 to 100.

Logistic regression: The predictions of logistic regression are discrete (only


specific values or categories are allowed). It might be able to tell
us whether the student passed or failed. The probability scores
that underpin the model's classifications can also be viewed.

Figure 2.16: Linear regression vs Logistic regression


While linear regression has an infinite number of possible
outcomes, logistic regression has a set of predetermined
outcomes.

When the response variable is continuous, linear regression is


used, but when the response variable is categorical, logistic
regression is used.

A continuous output, such as a stock market score, is an example


of linear regression, and predicting a defaulter in a bank using
transaction details from the past is an example of logistic
regression.

The following Table 2.4 shows the difference between Linear and
Logistic Regression:
Table 2.4: Difference between Linear and Logistic Regression

Types of Logistic Regression

Logistic regression is mainly categorized into three types, as shown


in Figure 2.17:

Binary Logistic Regression

Multinomial Logistic Regression

Ordinal Logistic Regression


Figure 2.17: Types of Logistic Regression

Binary Logistic Regression

The target variable or dependent variable in binary logistic


regression is binary in nature, meaning it has only two possible
values.

There are only two possible outcomes for the categorical response.

For example, determining whether or not a message is a spam.


Multinomial Logistic Regression

In a multinomial logistic regression, the target variable can have


three or more values, but there is no fixed order of preference for
these values.

For instance, the most popular type of food (Indian, Italian,


Chinese, and so on.)

Predicting which food is preferred more is an example (Veg, Non-


Veg, Vegan).

Ordinal Logistic Regression

The target variable in ordinal logistic regression has three or more


possible values, each of which has a preference or order.

For instance, restaurant star ratings range from 1 to 5, and movie


ratings range from 1 to 5.

Examples

The following are some of the scenarios in which logistic


regression can be used.

Weather Prediction: Logistic regression is used to make weather predictions.


We use the information from previous weather reports to forecast
the outcome for a specific day. However, logistic regression can
only predict categorical outcomes, such as whether or not it will rain.

Determining Illness: We can use logistic regression with the help


of the patient's medical history to predict whether the illness is
positive or negative.

Support Vector Machine (SVM) Algorithm

The Support Vector Machine, or SVM, is a popular supervised


learning algorithm that can be used to solve both classification
and regression problems. However, it is primarily used in machine
learning for classification problems, as shown in Figure 2.18.

Many people prefer SVM because it produces significant accuracy


while using less computing power. SVM chooses the extreme points/vectors
that help create the hyperplane. These extreme cases are called support
vectors; hence the algorithm is called a Support Vector Machine.
Figure 2.18: Support Vector Machine (SVM) concept

The support vector machine algorithm's goal is to find a


hyperplane in N-dimensional space (N being the number of features)
that clearly categorizes the data points.

Figure 2.19: SVM hyperplanes with Maximum margin

There are numerous hyperplanes that could separate the


two classes of data points. Our goal is to find the plane with the
greatest margin, that is, the greatest distance between data points of
both classes, as shown in Figure 2.19. Maximizing the margin distance
provides some reinforcement, making it easier to classify future
data points.

Hyperplanes are decision boundaries that aid in data classification.


Different classes can be assigned to data points on either side of
the hyperplane. The hyperplane's dimension is also determined by
the number of features. If there are only two input features, the
hyperplane is just a line. The hyperplane becomes a two-
dimensional plane when the number of input features reaches
three. When the number of features exceeds three, it becomes
difficult to imagine.

Figure 2.20: Support Vectors with small and large margin

Support vectors are data points that are closer to the hyperplane
and have an influence on the hyperplane's position and orientation
as shown in Figure 2.20. We maximize the classifier's margin by using
these support vectors. The hyperplane's position will be altered if
the support vectors are deleted. These are the points that will
assist us in constructing SVM.

The sigmoid function is used in logistic regression to squash the


output of the linear function within the range of [0,1]. If the
squashed value exceeds a threshold value (0.5), it is labeled as 1,
otherwise, it is labeled as 0. In SVM, we take the output of a
linear function and, if it is greater than 1, we assign it to one
class, and if it is less than 1, we assign it to another. We get this
reinforcement range of values ([-1,1]) which acts as a margin
because the threshold values in SVM are changed to 1 and -1.

Hyperplane, Support Vectors, and Margin

The Hyperplane, Support Vectors, and Margin are described as


follows:

In n-dimensional space, there can be multiple lines/decision


boundaries to separate the classes, but we need to find the best
decision boundary to help classify the data points. The hyperplane
of SVM refers to the best boundary. The hyperplane's dimensions
are determined by the features in the dataset; for example, if
there are two features, the hyperplane will be a straight line. If
three features are present, the hyperplane will be a two-dimensional
plane. We always create a hyperplane with the maximum margin, i.e., the
maximum distance between the hyperplane and the nearest data points of
each class.

Support Vectors: Support vectors are the data points or vectors


that are closest to the hyperplane and have an effect on the
hyperplane's position. These vectors are called support vectors
because they support the hyperplane.

Margin: It is the distance between two lines drawn through the closest
data points of different classes. It can be calculated as the
perpendicular distance between the line and the support vectors. A large
margin is regarded as a good margin, while a small margin is regarded as
a bad margin.
Working of SVM

In multidimensional space, an SVM model is essentially a


representation of different classes in a hyperplane. SVM will
generate the hyperplane in an iterative manner in order to reduce
the error. SVM's goal is to divide datasets into classes so that a
maximum marginal hyperplane can be found.

Figure 2.21: Working of SVM

SVM's main goal is to divide datasets into classes in order to


find a maximum marginal hyperplane which can be accomplished
in two steps:

First, SVM will iteratively generate hyperplanes that best separate


the classes.

The hyperplane that correctly separates the classes will then be


chosen.

Types of SVM

Support Vector Machine (SVM) types are described below:


Linear SVM: Linear SVM is used for linearly separable data, which
means that if a dataset can be classified into two classes using
only a single straight line, it is called linearly separable and the
classifier used is called Linear SVM.

Non-linear SVM: Non-Linear SVM is used to classify non-linearly
separable data, which means that if a dataset cannot be classified using
a straight line, it is called non-linear data, and the classifier
used is the Non-Linear SVM classifier.
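The distinction can be illustrated with scikit-learn's SVC on the two-moons dataset, a synthetic non-linearly separable dataset (an assumption for illustration, not a dataset from the text):

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaving half-circles: not separable by a single straight line.
X, y = make_moons(n_samples=300, noise=0.15, random_state=42)

# Linear SVM: a single straight-line (hyperplane) decision boundary.
linear_svm = SVC(kernel='linear').fit(X, y)

# Non-linear SVM: the RBF kernel lets the boundary curve around the moons.
rbf_svm = SVC(kernel='rbf', gamma=2).fit(X, y)

print('linear accuracy:', linear_svm.score(X, y))
print('rbf accuracy:   ', rbf_svm.score(X, y))
```

On this data, the RBF-kernel classifier fits noticeably better than the linear one, because no single straight line can separate the two classes.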

Applications of Support-Vector Machines

The following are a few of the applications of Support-Vector


Machines:

Facial expressions classifications

Pattern classification and regression problems

Classification in military datasets

Speech recognition

Predicting the structure of proteins

In image processing - handwriting recognition

In earthquake potential damage detections


Advantages of SVM

The following are the advantages of SVM:

SVM classifiers are highly accurate and work well in high-


dimensional environments. Because SVM classifiers only use a
subset of training points, they require very little memory.

It can handle data points that are not linearly separable (via kernel functions).

Effective in a higher dimension space.

It works well with a clear margin of separation.

It is effective in cases where the number of dimensions is greater


than the number of samples.

Better utilization of memory space.

Disadvantages of SVM

The following are the disadvantages of SVM:

Not suitable for larger datasets.

Less effective when the data set has more noise.

It doesn’t directly provide probability estimates.


UNIT-III

A Gentle Introduction to Ensemble Learning Algorithms

Ensemble learning is a general meta approach to machine learning that seeks better predictive
performance by combining the predictions from multiple models.

Although there are a seemingly unlimited number of ensembles that you can develop for your predictive
modeling problem, there are three methods that dominate the field of ensemble learning. So much so,
that rather than algorithms per se, each is a field of study that has spawned many more specialized
methods.

The three main classes of ensemble learning methods are bagging, stacking, and boosting, and it is
important to both have a detailed understanding of each method and to consider them on your
predictive modeling project.

But, before that, you need a gentle introduction to these approaches and the key ideas behind each
method prior to layering on math and code.

In this tutorial, you will discover the three standard ensemble learning techniques for machine learning.

After completing this tutorial, you will know:

 Bagging involves fitting many decision trees on different samples of the same dataset and
averaging the predictions.
 Stacking involves fitting many different model types on the same data and using another model
to learn how to best combine the predictions.

 Boosting involves adding ensemble members sequentially that correct the predictions made by
prior models and outputs a weighted average of the predictions.

Standard Ensemble Learning Strategies

Ensemble learning refers to algorithms that combine the predictions from two or more models.

Although there is nearly an unlimited number of ways that this can be achieved, there are perhaps three
classes of ensemble learning techniques that are most commonly discussed and used in practice. Their
popularity is due in large part to their ease of implementation and success on a wide range of predictive
modeling problems.
A rich collection of ensemble-based classifiers has been developed over the last several years. However,
many of these are variations of a select few well-established algorithms whose capabilities have
also been extensively tested and widely reported.

Given their wide use, we can refer to them as “standard” ensemble learning strategies; they are:

1. Bagging.

2. Stacking.

3. Boosting.
There is an algorithm that describes each approach, although more importantly, the success of each
approach has spawned a myriad of extensions and related techniques. As such, it is more useful to
describe each as a class of techniques or standard approaches to ensemble learning.

Rather than dive into the specifics of each method, it is useful to step through, summarize, and contrast
each approach. It is also important to remember that although discussion and use of these methods are
pervasive, these three methods alone do not define the extent of ensemble learning.

Random Forest Algorithm


Random Forest is a popular machine learning algorithm that belongs to the supervised learning
technique. It can be used for both Classification and Regression problems in ML. It is based on the
concept of ensemble learning, which is a process of combining multiple classifiers to solve a complex
problem and to improve the performance of the model.

As the name suggests, "Random Forest is a classifier that contains a number of decision trees on
various subsets of the given dataset and takes the average to improve the predictive accuracy of that
dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree
and, based on the majority vote of predictions, predicts the final output."

A greater number of trees in the forest leads to higher accuracy and helps prevent
overfitting.

The below diagram explains the working of the Random Forest algorithm:
Note: To better understand the Random Forest Algorithm, you should have knowledge of the Decision
Tree Algorithm.

Assumptions for Random Forest

Since the random forest combines multiple trees to predict the class of the dataset, it is possible that
some decision trees may predict the correct output, while others may not. But together, all the trees
predict the correct output. Therefore, below are two assumptions for a better Random forest classifier:

o The feature variables of the dataset should contain some actual values so that the classifier
can predict accurate results rather than guesses.

o The predictions from each tree must have very low correlations.

Why use Random Forest?

Below are some points that explain why we should use the Random Forest algorithm:


o It takes less training time as compared to other algorithms.

o It predicts output with high accuracy, and it runs efficiently even on large datasets.

o It can also maintain accuracy when a large proportion of data is missing.

How does Random Forest algorithm work?

Random Forest works in two phases: the first is to create the random forest by combining N decision
trees, and the second is to make predictions with each tree created in the first phase.

The Working process can be explained in the below steps and diagram:

Step-1: Select random K data points from the training set.

Step-2: Build the decision trees associated with the selected data points (Subsets).

Step-3: Choose the number N for decision trees that you want to build.

Step-4: Repeat Step 1 & 2.

Step-5: For new data points, find the predictions of each decision tree, and assign the new data points
to the category that wins the majority votes.

The working of the algorithm can be better understood by the below example:

Example: Suppose there is a dataset that contains multiple fruit images. So, this dataset is given to the
Random forest classifier. The dataset is divided into subsets and given to each decision tree. During the
training phase, each decision tree produces a prediction result, and when a new data point occurs, then
based on the majority of results, the Random Forest classifier predicts the final decision. Consider the
below image:
Applications of Random Forest

There are mainly four sectors where Random Forest is mostly used:

1. Banking: Banking sector mostly uses this algorithm for the identification of loan risk.

2. Medicine: With the help of this algorithm, disease trends and risks of the disease can be
identified.

3. Land Use: We can identify the areas of similar land use by this algorithm.

4. Marketing: Marketing trends can be identified using this algorithm.

Advantages of Random Forest

o Random Forest is capable of performing both Classification and Regression tasks.

o It is capable of handling large datasets with high dimensionality.

o It enhances the accuracy of the model and prevents the overfitting issue.

Disadvantages of Random Forest


o Although random forest can be used for both classification and regression tasks, it is not as
well suited to regression tasks.

Python Implementation of Random Forest Algorithm

Now we will implement the Random Forest Algorithm tree using Python. For this, we will use the same
dataset "user_data.csv", which we have used in previous classification models. By using the same
dataset, we can compare the Random Forest classifier with other classification models such as Decision
tree Classifier, KNN, SVM, Logistic Regression, etc.
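Since the "user_data.csv" dataset itself is not reproduced here, the following sketch uses a synthetic stand-in dataset (an assumption) to show the shape of the implementation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in for "user_data.csv": a synthetic binary classification problem.
X, y = make_classification(n_samples=400, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# n_estimators is the number N of decision trees in the forest;
# each tree votes and the majority class becomes the final prediction.
clf = RandomForestClassifier(n_estimators=100, criterion='entropy',
                             random_state=0)
clf.fit(X_train, y_train)
print('test accuracy:', clf.score(X_test, y_test))
```

The same fit/score pattern applies when the real user_data.csv features are substituted for the synthetic X and y.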

Voting Classifier :
You never know if your model is useful unless you evaluate the performance of the machine learning
model. The goal of a data scientist is to train a robust and high-performing model. There are various
techniques or hacks to improve the performance of the model, ensembling of models being one of
them.

Ensembling is a powerful technique to improve the performance of the model by combining various
base models in order to produce an optimal and robust model. Types of Ensembling techniques include:

 Bagging or Bootstrap Aggregation

 Boosting

 Stacking Classifier

 Voting Classifier


In this article, we will discuss the implementation of a voting classifier and further discuss how can it be
used to improve the performance of the model.

Voting Classifier:

A voting classifier is a machine learning estimator that trains several base models (estimators) and
predicts by aggregating the findings of each base estimator. The aggregation is a combined voting
decision over the estimators' outputs. The voting criteria can be of two types:

 Hard Voting: Voting is calculated on the predicted output class.

 Soft Voting: Voting is calculated on the predicted probability of the output class.
How Voting Classifier can improve performance?

The voting classifier aggregates the predicted classes or predicted probabilities on the basis of hard
or soft voting. So if we feed a variety of base models to the voting classifier, errors made by any
single model tend to be compensated by the others.

(Image by Author), Left: Hard Voting, Right: Soft Voting

Implementation:

Scikit-learn packages offer implementation of Voting Classifier in a few lines of Python code.

For our sample classification dataset, we are training 4 base estimators of Logistic Regression, Random
Forest, Gaussian Naive Bayes, and Support Vector Classifier.

The parameter voting=‘soft’ or voting=‘hard’ lets developers switch between soft and hard voting
aggregation. The weights parameter can be tuned so that better-performing base estimators count
for more: it is a sequence of weights applied to occurrences of predicted class labels for hard
voting, or to class probabilities before averaging for soft voting.

We are using a soft voting classifier with a weight distribution of [1, 2, 1, 1], where twice the
weight is assigned to the Random Forest model. Now let's observe the benchmark performance of each
of the base estimators vis-a-vis the voting classifier.
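The setup just described can be sketched in a few lines; the dataset below is a synthetic stand-in (an assumption, since the article's sample data is not shown), but the four base estimators, soft voting, and the [1, 2, 1, 1] weights follow the text:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Four base estimators; SVC needs probability=True for soft voting.
estimators = [
    ('lr', LogisticRegression(max_iter=1000)),
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('gnb', GaussianNB()),
    ('svc', SVC(probability=True, random_state=42)),
]

# Soft voting averages class probabilities; weights [1, 2, 1, 1]
# count the Random Forest's probabilities twice.
voting_clf = VotingClassifier(estimators, voting='soft', weights=[1, 2, 1, 1])
voting_clf.fit(X, y)
print('training accuracy:', voting_clf.score(X, y))
```

Switching voting='soft' to voting='hard' makes the ensemble vote on predicted class labels instead of averaging probabilities.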

(Image by Author), Benchmark Performance

From the benchmark table above, the voting classifier boosts performance compared to each of its
base estimators.

Conclusion:

Voting Classifier is a machine-learning algorithm often used by Kagglers to boost the performance of
their model and climb up the rank ladder. Voting Classifier can also be used for real-world datasets to
improve performance, but it comes with some limitations. The model interpretability decreases, as one
cannot interpret the model using shap, or lime packages.
Scikit-learn does not provide an implementation to compute the top-performing features for the voting
classifier, unlike other models, but the feature importance can be approximated by combining the
importance scores of the individual estimators according to their weights.

Bagging and Pasting

In machine learning, multiple predictors grouped together sometimes have better
predictive performance than any one of the group alone. These techniques, called
Ensemble Learning, are very popular in competitions and in production.

There are several ways to group models. They differ in the training algorithm and
data used in each one of them and also how they are grouped. We’ll be talking in the
article about two methods called Bagging and Pasting and how to implement them
in scikit-learn.
But before we begin talking about Bagging and Pasting, we have to know what
is Bootstrapping.

Bootstrapping

In statistics, bootstrapping refers to a resampling method that consists of repeatedly
drawing samples, with replacement, from the data to form other smaller datasets, called
bootstrap samples. It is as if the bootstrapping method were running a bunch of
simulations on our original dataset, so that in some cases we can generalize the mean and
the standard deviation.

For example, let’s say we have a set of observations: [2, 4, 32, 8, 16]. If we want each
bootstrap sample containing n observations, the following are valid samples:

 n=3: [32, 4, 4], [8, 16, 2], [2, 2, 2]…

 n=4: [2, 32, 4, 16], [2, 4, 2, 8], [8, 32, 4, 2]…

Since we draw data with replacement, an observation can appear more than once
in a single sample.
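The sampling above can be reproduced with NumPy, where replace=True makes the draw a bootstrap sample:

```python
import numpy as np

data = np.array([2, 4, 32, 8, 16])
rng = np.random.default_rng(seed=1)

# Draw bootstrap samples: n observations each, with replacement,
# so the same value may appear more than once in a sample.
sample_3 = rng.choice(data, size=3, replace=True)
sample_4 = rng.choice(data, size=4, replace=True)
print(sample_3, sample_4)
```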

Bagging & Pasting

Bagging means bootstrap + aggregating. It is an ensemble method in which we first
bootstrap our data and train one model on each bootstrap sample. After that, we
aggregate the models with equal weights. When sampling is done without replacement
instead, the method is called pasting.

Out-of-Bag Scoring

If we are using bagging, there is a chance that a sample is never selected, while
others may be selected multiple times. The probability of not selecting a specific
sample in one draw is (1 − 1/n), where n is the number of samples. Therefore, the
probability of a specific sample never being picked in n draws is (1 − 1/n)^n. When n
is large, this probability approaches 1/e, which is approximately 0.3679. This means
that when the dataset is big enough, about 37% of its samples are never selected, and
we can use them to test our model. This is called Out-of-Bag scoring, or OOB Scoring.
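The limit can be checked numerically: (1 − 1/n)^n approaches 1/e as n grows.

```python
import math

# Probability that a specific sample is never drawn in n draws with replacement.
for n in (10, 100, 10_000):
    p_never_picked = (1 - 1 / n) ** n
    print(n, round(p_never_picked, 4))

# Limit for large n: about 36.8% of samples end up out-of-bag.
print(round(math.exp(-1), 4))
```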

Random Forests
As the name suggests, a random forest is an ensemble of decision trees that can be
used for classification or regression. In most cases it uses bagging. Each tree in the
forest outputs a prediction, and the most voted one becomes the output of the model. This
helps make the model more accurate and stable, preventing overfitting.

Another very useful property of random forests is the ability to measure the relative
importance of each feature by calculating how much each one reduces the impurity of
the model. This is called feature importance.

A scikit-learn Example
To see how bagging works in scikit-learn, we will train some models alone and then
aggregate them, so we can see if it works.

In this example we’ll be using the 1994 census dataset on US income. It contains
information such as marital status, age, type of work, and more. The target column is
categorical and indicates whether a salary is less than or equal to 50k a year (0)
or not (1). Let’s explore the DataFrame with Pandas’ info method:
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age               32561 non-null int64
workclass         32561 non-null object
fnlwgt            32561 non-null int64
education         32561 non-null object
education_num     32561 non-null int64
marital_status    32561 non-null object
occupation        32561 non-null object
relationship      32561 non-null object
race              32561 non-null object
sex               32561 non-null object
capital_gain      32561 non-null int64
capital_loss      32561 non-null int64
hours_per_week    32561 non-null int64
native_country    32561 non-null object
high_income       32561 non-null int8
dtypes: int64(6), int8(1), object(8)

As we can see, there are numerical (int64 and int8) and categorical (object) data types.
We have to handle each type separately before sending the data to the predictor.

Data Preparation

First we load the CSV file and convert the target column to categorical, so when we
are passing all columns to the pipeline we don’t have to worry about the target
column.
import numpy as np
import pandas as pd

# Load CSV (the original filename is not shown in the source)
df = pd.read_csv('data/[Link]')

# Convert target to categorical codes
col = pd.Categorical(df.high_income)
df["high_income"] = col.codes

There are numerical and categorical columns in our dataset. We need to do different
preprocessing for each of them: the numerical features need to be normalized and
the categorical features need to be converted to integers. To do this, we define a
transformer that preprocesses our data depending on its type.
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import MinMaxScaler

class PreprocessTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, cat_features, num_features):
        self.cat_features = cat_features
        self.num_features = num_features

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        df = X.copy()
        # Treat '?' workclass as unknown
        df.loc[df['workclass'] == '?', 'workclass'] = 'Unknown'
        # Too many categories, just convert to US and non-US
        df.loc[df['native_country'] != 'United-States', 'native_country'] = 'non_usa'
        # Convert categorical columns to integer codes
        for name in self.cat_features:
            col = pd.Categorical(df[name])
            df[name] = col.codes
        # Normalize numerical features
        scaler = MinMaxScaler()
        df[self.num_features] = scaler.fit_transform(df[self.num_features])
        return df
The data is then split into train and test sets, so we can check later whether our
model generalizes to unseen data.

from sklearn.model_selection import train_test_split

# Split the dataset into training and testing
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('high_income', axis=1),
    df['high_income'],
    test_size=0.2,
    random_state=42,
    shuffle=True,
    stratify=df['high_income']
)

Build the Model

Finally, we build our models. First we create a pipeline to preprocess with our custom
transformer, select the best features with SelectKBest and train our predictors.
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier

random_state = 42
leaf_nodes = 5
num_features = 10
num_estimators = 100

# Decision tree for bagging
tree_clf = DecisionTreeClassifier(
    splitter='random',
    max_leaf_nodes=leaf_nodes,
    random_state=random_state
)

# Initialize the bagging classifier
bag_clf = BaggingClassifier(
    tree_clf,
    n_estimators=num_estimators,
    max_samples=1.0,
    max_features=1.0,
    random_state=random_state,
    n_jobs=-1
)

# Create a pipeline
pipe = Pipeline([
    ('preproc', PreprocessTransformer(categorical_features, numerical_features)),
    ('fs', SelectKBest()),
    ('clf', DecisionTreeClassifier())
])

Since what we are trying to do is see the difference between a single decision tree
and an ensemble of them, we can use scikit-learn's GridSearchCV to train all
predictors with a single fit call. We use AUC and accuracy as scoring metrics and a
KFold with 10 splits as cross-validation.
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, make_scorer

# Define our search space for grid search
search_space = [
    {
        'clf': [DecisionTreeClassifier()],
        'clf__max_leaf_nodes': [128],
        'fs__score_func': [chi2],
        'fs__k': [10],
    },
    {
        'clf': [RandomForestClassifier()],
        'clf__n_estimators': [200],
        'clf__max_leaf_nodes': [128],
        'clf__bootstrap': [False, True],
        'fs__score_func': [chi2],
        'fs__k': [10],
    }
]

# Define scoring
scoring = {'AUC': 'roc_auc', 'Accuracy': make_scorer(accuracy_score)}

# Define cross validation
kfold = KFold(n_splits=10, shuffle=True, random_state=42)

# Define grid search
grid = GridSearchCV(
    pipe,
    param_grid=search_space,
    cv=kfold,
    scoring=scoring,
    refit='AUC',
    verbose=1,
    n_jobs=-1
)

# Fit grid search
model = grid.fit(X_train, y_train)

The mean of AUC and accuracy for each model tested on GridSearchCV are:

 Single model: AUC = 0.791, Accuracy: 0.798

 Bagging: AUC = 0.869, Accuracy = 0.816

 Pasting: AUC = 0.870, Accuracy = 0.815

 Native random forest: AUC = 0.887, Accuracy = 0.838


As expected, we had better results with the ensemble methods, even if the constituent
parts are the same training algorithm with the same parameters as the single one.

Since the best estimator was the random forest, we can visualize the OOB score and
the features importance by:
best_estimator = grid.best_estimator_.steps[-1][1]
columns = X_test.columns.tolist()

# oob_score_ is available only if the forest was trained with oob_score=True
print('OOB Score: {}'.format(best_estimator.oob_score_))
print('Feature Importances')
for i, imp in enumerate(best_estimator.feature_importances_):
    print('{}: {:.3f}'.format(columns[i], imp))

Which prints:
OOB Score: 0.8396805896805897

Feature Importances:
age: 0.048
workclass: 0.012
fnlwgt: 0.167
education: 0.138
education_num: 0.001
marital_status: 0.329
occupation: 0.009
relationship: 0.259
race: 0.012
sex: 0.025

Boosting
Theory, Implementation, and Visualization

Unlike many ML models which focus on high quality prediction done by a


single model, boosting algorithms seek to improve the prediction power
by training a sequence of weak models, each compensating for the
weaknesses of its predecessors.

One is weak, together is strong, learning from past is the best

To understand Boosting, it is crucial to recognize that boosting is a


generic algorithm rather than a specific model. Boosting needs you to
specify a weak model (e.g. regression, shallow decision trees, etc) and
then improves it.

With that sorted out, it is time to explore different definitions of


weakness and their corresponding algorithms. I’ll introduce two major
algorithms: Adaptive Boosting (AdaBoost) and Gradient Boosting.

1. AdaBoost

Definition of Weakness

AdaBoost is a specific Boosting algorithm developed for classification


problems (also called discrete AdaBoost). The weakness is identified by
the weak estimator’s error rate:

In each iteration, AdaBoost identifies misclassified data points,
increasing their weights (and decreasing the weights of correctly
classified points) so that the next classifier will pay extra attention
to getting them right. The following figure illustrates how weights
impact the performance of a simple decision stump (a tree with depth 1).
How sample weights affect the decision boundary

Now with weakness defined, the next step is to figure out how to combine
the sequence of models to make the ensemble stronger over time.

Pseudocode

There are several different algorithms proposed by researchers. Here I’ll


introduce the most popular method called SAMME, a specific method
that deals with multi-class classification problems. (Zhu, H. Zou, S. Rosset, T.
Hastie, “Multi-class AdaBoost”, 2009).

AdaBoost trains a sequence of models with augmented sample weights,
generating ‘confidence’ coefficients Alpha for the individual classifiers
based on their errors. A low error leads to a large Alpha, which means a
higher contribution to the final ensemble vote.

the size of dots indicates their weights

Implementation in Python

Scikit-Learn offers a nice implementation of AdaBoost with SAMME (a


specific algorithm for multi-class classification).

Parameters:

base_estimator : object, optional (default=None)

The base estimator from which the boosted ensemble is built. If None,
then the base estimator is DecisionTreeClassifier(max_depth=1)

n_estimators : integer, optional (default=50)

The maximum number of estimators at which boosting is terminated. In


case of perfect fit, the learning procedure is stopped early.

learning_rate : float, optional (default=1.)

Learning rate shrinks the contribution of each classifier


by learning_rate.

algorithm : {‘SAMME’, ‘SAMME.R’}, optional (default=’SAMME.R’)


If ‘SAMME.R’ then use the SAMME.R real boosting
algorithm. base_estimator must support calculation of class
probabilities. If ‘SAMME’ then use the SAMME discrete boosting
algorithm.

random_state : int, RandomState instance or None, optional


(default=None)
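A minimal usage sketch with the parameters left near their defaults (note: newer scikit-learn releases have renamed base_estimator to estimator and deprecated the algorithm option, so this sketch sticks to version-stable arguments):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=7)

# The default weak learner is a decision stump (a tree of depth 1);
# 50 stumps are trained in sequence, each reweighting the data to
# focus on the points its predecessors misclassified.
ada = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=7)
ada.fit(X_train, y_train)
print('test accuracy:', ada.score(X_test, y_test))
```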

2. Gradient Boosting

Definition of Weakness

Gradient boosting approaches the problem a bit differently. Instead of


adjusting weights of data points, Gradient boosting focuses on the
difference between the prediction and the ground truth.

weakness is defined by gradients

Pseudocode

Gradient boosting requires a differentiable loss function and works for
both regression and classification. I’ll use simple Least Squares as the
loss function (for regression).

Gradient Boosting with Least Square

Following is a visualization of how the weak estimators H are built over time.
Each time, we fit a new estimator (a regression tree with max_depth=3 in
this case) to the gradient of the loss (least squares in this case).
gradient is scaled down for visualization purpose

Implementation in Python

Again, you can find the Gradient Boosting estimators in Scikit-Learn’s library.

Regression:

loss : {‘ls’, ‘lad’, ‘huber’, ‘quantile’}, optional (default=’ls’)

Classification:

loss : {‘deviance’, ‘exponential’}, optional (default=’deviance’)

The remaining parameters are the same for both:

learning_rate : float, optional (default=0.1)

n_estimators : int (default=100)

Gradient boosting is fairly robust to over-fitting so a large number


usually results in better performance.

subsample : float, optional (default=1.0)

The fraction of samples to be used for fitting the individual base


learners. If smaller than 1.0 this results in Stochastic Gradient
Boosting. subsample interacts with the parameter n_estimators.
Choosing subsample < 1.0 leads to a reduction of variance and an
increase in bias.

criterion : string, optional (default=”friedman_mse”)

The function to measure the quality of a split.

Strength and Weakness


1. Easy to interpret: boosting is essentially an additive ensemble of simple
models, so its predictions can be traced back to the contributions of
the individual weak learners.
2. Strong prediction power: usually boosting > bagging (random
forest) > a single decision tree.
3. Resilient to overfitting, especially when shrinkage (a small learning
rate) and subsampling are used.
4. Sensitive to outliers: since each weak classifier is dedicated to fixing its
predecessors’ shortcomings, the model may pay too much attention
to outliers.
5. Hard to scale up: since each estimator is built on its predecessors,
the process is hard to parallelize.

Stacking in Machine Learning


Stacking is one of the most popular ensemble machine learning
techniques. It trains multiple models to solve similar problems and,
based on their combined output, builds a new model with improved
performance.

In this topic, "Stacking in Machine Learning", we will discuss a few


important concepts related to stacking, the general architecture of
stacking, important key points to implement stacking, and how stacking
differs from bagging and boosting in machine learning. Before starting
this topic, first, understand the concepts of the ensemble in machine
learning. So, let's start with the definition of ensemble learning in
machine learning.

What is Ensemble learning in Machine Learning?

Ensemble learning is one of the most powerful machine learning
techniques: it uses the combined output of two or more models/weak
learners to solve a particular computational intelligence problem. E.g.,
a Random Forest algorithm is an ensemble of various decision trees.
Ensemble learning is primarily used to improve model performance in
tasks such as classification, prediction, function approximation, etc. In
simple words, we can summarise ensemble learning as follows:
"An ensembled model is a machine learning model that combines the
predictions from two or more models.”

There are 3 most common ensemble learning methods in machine


learning. These are as follows:

o Bagging
o Boosting
o Stacking

However, we will mainly discuss Stacking on this topic.

1. Bagging

Bagging is a method of ensemble modeling, which is primarily used to


solve supervised machine learning problems. It is generally completed in
two steps as follows:

o Bootstrapping: It is a random sampling method used to draw
samples from the data with replacement. Each random sample is
fed to a base model, and a base learning algorithm is run on each
sample to complete the learning process.
o Aggregation: This is a step that involves the process of combining
the output of all base models and, based on their output, predicting
an aggregate result with greater accuracy and reduced variance.

Example: In the Random Forest method, predictions from multiple
decision trees are ensembled in parallel. In regression problems, the
final output is the average of these predictions, whereas in
classification problems, the class receiving the majority of votes is
taken as the final prediction.
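As a sketch, bagging can be run with scikit-learn's BaggingClassifier, whose default base learner is a decision tree. The dataset and hyperparameters below are illustrative choices, not taken from the text:

```python
# Hedged sketch of bagging; dataset and hyperparameters are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bootstrapping: each base tree is trained on a sample drawn with replacement.
# Aggregation: the ensemble votes over the base trees' predictions.
bagging = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=0)
bagging.fit(X_train, y_train)
print(round(bagging.score(X_test, y_test), 2))
```

Setting bootstrap=True (the default) is what makes this bagging rather than plain model averaging: every tree sees a different resample of the training data.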

2. Boosting

Boosting is an ensemble method that enables each member to learn from
the preceding member's mistakes and make better predictions for the
future. Unlike the bagging method, in boosting, all base learners (weak)
are arranged in a sequential format so that they can learn from the
mistakes of their preceding learner. In this way, the weak learners
together become a strong learner and make a better predictive model
with significantly improved performance.
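A minimal boosting sketch with AdaBoost follows; the dataset and parameter values are illustrative assumptions:

```python
# Hedged sketch of boosting with AdaBoost; dataset and parameters are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Weak learners (decision stumps by default) are fitted sequentially; samples
# misclassified by earlier learners get higher weight for later learners.
boost = AdaBoostClassifier(n_estimators=50, random_state=0)
boost.fit(X_train, y_train)
print(round(boost.score(X_test, y_test), 2))
```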

We now have a basic understanding of ensemble techniques in machine
learning and their two common methods, i.e., bagging and boosting.
Now, let's discuss a different paradigm of ensemble learning, i.e.,
Stacking.

3. Stacking

Stacking is one of the popular ensemble modeling techniques in
machine learning. Various weak learners are ensembled in parallel, and
a meta-learner is combined with them so that better predictions can be
made for the future.

This ensemble technique works by feeding the combined predictions of
multiple weak learners to a meta-learner so that a better output
prediction model can be achieved.

In stacking, an algorithm takes the outputs of sub-models as input and
attempts to learn how to best combine the input predictions to make a
better output prediction.

Stacking is also known as stacked generalization. It is an extended
form of the Model Averaging Ensemble technique in which, instead of
all sub-models contributing equally, their predictions are weighted
according to performance to build a new model with better predictions.
Because this new model is stacked on top of the others, the technique is
named stacking.

Architecture of Stacking

The architecture of the stacking model is designed in such a way that it
consists of two or more base/learner models and a meta-model that
combines the predictions of the base models. These base models are
called level-0 models, and the meta-model is known as the level-1
model. So, the stacking ensemble method includes original (training)
data, primary-level models, primary-level predictions, a secondary-level
model, and the final prediction. The basic architecture of stacking is
shown in the image below.

o Original data: This data is divided into n folds and serves as the
training and test data.
o Base models: These models are also referred to as level-0 models.
These models use training data and provide compiled predictions
(level-0) as an output.
o Level-0 Predictions: Each base model is trained on part of the
training data and provides different predictions, which are known as
level-0 predictions.
o Meta Model: The architecture of the stacking model consists of one
meta-model, which helps to best combine the predictions of the
base models. The meta-model is also known as the level-1 model.
o Level-1 Prediction: The meta-model learns how to best combine the
predictions of the base models and is trained on different
predictions made by individual base models, i.e., data not used to
train the base models are fed to the meta-model, predictions are
made, and these predictions, along with the expected outputs,
provide the input and output pairs of the training dataset used to fit
the meta-model.

Steps to implement Stacking models:

There are some important steps to implementing stacking models in
machine learning. These are as follows:

o Split the training dataset into n folds, commonly
using RepeatedStratifiedKFold, as this is the most common
approach to preparing training datasets for meta-models.
o Fit the first base model on n-1 folds and make predictions for the
nth fold.
o Add the predictions made in the above step to the x1_train list.
o Repeat steps 2 and 3 for the remaining folds, giving an x1_train
array of size n.
o Now train the model on all n parts and make predictions for the
test data.
o Add these predictions to the y1_test list.
o In the same way, find x2_train, y2_test, x3_train, and y3_test by
training Models 2 and 3, respectively, to get the level-1
predictions.
o Train the meta-model on the level-1 predictions, where these
predictions are used as features for the model.
o Finally, the meta-learner can be used to make predictions on test
data in the stacking model.
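In practice, this whole procedure is wrapped up by scikit-learn's StackingClassifier, which generates the out-of-fold level-0 predictions via internal cross-validation. The base models and meta-model below are illustrative choices:

```python
# Hedged sketch of stacking; the base models and meta-model are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Level-0 (base) models produce out-of-fold predictions via internal CV;
# the level-1 meta-model learns how to combine those predictions.
stack = StackingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier()),
        ("tree", DecisionTreeClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_train, y_train)
print(round(stack.score(X_test, y_test), 2))
```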

Stacking Ensemble Family

There are some other ensemble techniques that can be considered the
forerunner of the stacking method. For better understanding, we have
divided them into the different frameworks of essential stacking so that
we can easily understand the differences between methods and the
uniqueness of each technique. Let's discuss a few commonly used
ensemble techniques related to stacking.

Voting ensembles:

This is one of the simplest stacking-family ensemble methods, in which
each member is prepared individually using a different algorithm.
Unlike the stacking method, the voting ensemble uses simple statistics
instead of learning how to best combine the base models' predictions.

It is useful for regression problems, where the final prediction is the
mean or median of the base models' predictions, and for classification
problems, where predictions are combined by counting votes. Predicting
the label with the highest number of votes is referred to as hard voting,
whereas predicting the label with the largest sum of predicted
probabilities is referred to as soft voting.

The voting ensemble differs from the stacking ensemble in that it does
not weigh models by each member's performance; all models are
considered to have the same skill level.

Member Assessment: In the voting ensemble, all members are assumed
to have the same skill sets.

Combine with Model: Instead of learning how to combine each
member's prediction, it uses simple statistics, e.g., the mean or median,
to get the final prediction.
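Both voting modes are available in scikit-learn's VotingClassifier; the member models and dataset here are illustrative:

```python
# Hedged sketch of a voting ensemble; members and dataset are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

members = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("nb", GaussianNB()),
    ("tree", DecisionTreeClassifier(random_state=0)),
]
# Hard voting: majority class wins. Soft voting: predicted probabilities
# are summed and the class with the largest sum wins.
hard = VotingClassifier(estimators=members, voting="hard").fit(X_train, y_train)
soft = VotingClassifier(estimators=members, voting="soft").fit(X_train, y_train)
print(round(hard.score(X_test, y_test), 2), round(soft.score(X_test, y_test), 2))
```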
Support Vector Machine Algorithm
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, which is used
for Classification as well as Regression problems. However, primarily, it is used for Classification problems
in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes so that we can easily put a new data point in the correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine. Consider the below diagram, in which two different categories are classified using a decision boundary or hyperplane:

Example: SVM can be understood with the example that we used for the KNN classifier. Suppose we see a strange cat that also has some features of dogs, and we want a model that can accurately identify whether it is a cat or a dog. Such a model can be created using the SVM algorithm. We first train our model with lots of images of cats and dogs so that it can learn their different features, and then we test it with this strange creature. The support vector machine creates a decision boundary between the two classes (cat and dog) and chooses the extreme cases (support vectors) of each. On the basis of the support vectors, it will classify the creature as a cat. Consider the below
diagram:
SVM algorithm can be used for Face detection, image classification, text categorization, etc.

Types of SVM

SVM can be of two types:

o Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified into
two classes using a single straight line, it is termed linearly separable data, and the classifier
used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a dataset cannot be
classified using a straight line, it is termed non-linear data, and the classifier used is called a
Non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional
space, but we need to find the best decision boundary for classifying the data points. This best
boundary is known as the hyperplane of SVM.

The dimensions of the hyperplane depend on the number of features in the dataset: with 2 features (as
shown in the image), the hyperplane is a straight line, and with 3 features, it is a 2-dimensional plane.

We always create the hyperplane with the maximum margin, i.e., the maximum distance from the
nearest data points of each class.
Support Vectors: The data points or vectors that are closest to the hyperplane and affect its position are
termed support vectors. Since these vectors support the hyperplane, they are called support vectors.

How does SVM work?

Linear SVM:

The working of the SVM algorithm can be understood by using an example. Suppose we have a dataset
that has two tags (green and blue), and the dataset has two features x1 and x2. We want a classifier that
can classify the pair(x1, x2) of coordinates in either green or blue. Consider the below image:
Since this is a 2-d space, we can easily separate these two classes with a straight line. But there can
be multiple lines that separate these classes. Consider the below image:

Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is
called a hyperplane. The SVM algorithm finds the closest points of the lines from both classes. These
points are called support vectors. The distance between the vectors and the hyperplane is called
the margin, and the goal of SVM is to maximize this margin. The hyperplane with the maximum margin is
called the optimal hyperplane.
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear data, we
cannot draw a single straight line. Consider the below image:

So to separate these data points, we need to add one more dimension. For linear data, we have used two
dimensions x and y, so for non-linear data, we will add a third dimension z. It can be calculated as:

z = x² + y²
By adding the third dimension, the sample space will become as below image:

So now, SVM will divide the datasets into classes in the following way. Consider the below image:
Since we are in 3-d space, the separating boundary looks like a plane parallel to the x-axis. If we convert it
back to 2-d space with z = 1, it will become:
Hence, we get a circumference of radius 1 in the case of non-linear data.
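This transformation can be demonstrated numerically. In the illustrative sketch below (radii and sample sizes are assumptions, not from the text's figures), points inside and outside a circle become separable by a single threshold on z = x² + y²:

```python
# Illustrative sketch: circular data becomes linearly separable in z = x^2 + y^2.
import numpy as np

rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 200)
inner_r = rng.uniform(0.0, 0.8, 100)   # class 0: inside the unit circle
outer_r = rng.uniform(1.2, 2.0, 100)   # class 1: outside the unit circle
r = np.concatenate([inner_r, outer_r])
x, y = r * np.cos(angles), r * np.sin(angles)

z = x**2 + y**2                        # the added third dimension
labels = np.array([0] * 100 + [1] * 100)

# A single threshold on z (here z = 1) now separates the classes perfectly.
predicted = (z > 1.0).astype(int)
print((predicted == labels).mean())    # 1.0
```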

Python Implementation of Support Vector Machine

Now we will implement the SVM algorithm using Python. Here we will use the same dataset user_data,
which we have used in Logistic regression and KNN classification.

o Data Pre-processing step

Till the Data pre-processing step, the code will remain the same. Below is the code:

#Data Pre-processing Step
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('user_data.csv')

#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)

#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)

After executing the above code, we will pre-process the data. The code will give the dataset as:

The scaled output for the test set will be:
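The text stops at pre-processing; the classifier-fitting step that would typically follow is sketched below. A synthetic dataset stands in for user_data.csv so the sketch is self-contained, and the kernel choice is illustrative:

```python
# Hedged sketch of the classifier step; synthetic data replaces user_data.csv.
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

x, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0)

# Feature scaling, as in the pre-processing step above.
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)

# Fit a linear-kernel SVC and evaluate with a confusion matrix.
classifier = SVC(kernel="linear", random_state=0)
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)
print(confusion_matrix(y_test, y_pred))
```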


Support Vector Regression for Machine
Learning

Introduction to Support Vector Regression (SVR)

Support Vector Regression (SVR) uses the same principle as SVM, but for regression
problems. Let's spend a few minutes understanding the idea behind SVR.
The Idea Behind Support Vector Regression
The problem of regression is to find a function that approximates the mapping from an input
domain to real numbers on the basis of a training sample. So let's now dive deep and
understand how SVR actually works.

Consider these two red lines as the decision boundary and the green line as the
hyperplane. Our objective, when we are moving on with SVR, is to basically consider
the points that are within the decision boundary line. Our best fit line is the hyperplane
that has a maximum number of points.

The first thing we'll understand is the decision boundary (the red lines
above). Consider these lines as being at some distance, say 'a', from the hyperplane: they
are the lines drawn at distances '+a' and '-a' from the hyperplane. This 'a' is
usually referred to as epsilon.

Assuming that the equation of the hyperplane is as follows:

Y = wx + b (equation of hyperplane)

then the equations of the decision boundaries become:

wx + b = +a

wx + b = -a

Thus, any hyperplane that satisfies our SVR should satisfy:

-a < Y - (wx + b) < +a

Our main aim here is to decide a decision boundary at ‘a’ distance from the original
hyperplane such that data points closest to the hyperplane or the support vectors are within
that boundary line.

Hence, we are going to take only those points that are within the decision boundary and
have the least error rate, or are within the Margin of Tolerance. This gives us a better fitting
model.

Implementing Support Vector Regression (SVR) in Python


Time to put on our coding hats! In this section, we’ll understand the use of Support Vector
Regression with the help of a dataset. Here, we have to predict the salary of an employee
given a few independent variables. A classic HR analytics project!

Step 1: Importing the libraries


import numpy as np

import matplotlib.pyplot as plt

import pandas as pd


Step 2: Reading the dataset

dataset = pd.read_csv('Position_Salaries.csv')

X = dataset.iloc[:, 1:2].values

y = dataset.iloc[:, 2].values


Step 3: Feature Scaling

A real-world dataset contains features that vary in magnitudes, units, and range. I would
suggest performing normalization when the scale of a feature is irrelevant or misleading.

Feature scaling helps to normalize the data within a particular range. Many
estimator classes handle feature scaling internally, but the SVR class does not,
so we should perform feature scaling ourselves in Python.

from sklearn.preprocessing import StandardScaler

sc_X = StandardScaler()

sc_y = StandardScaler()

X = sc_X.fit_transform(X)

y = sc_y.fit_transform(y.reshape(-1, 1)).ravel()  # StandardScaler expects a 2-D array


Step 4: Fitting SVR to the dataset

from sklearn.svm import SVR

regressor = SVR(kernel = 'rbf')

regressor.fit(X, y)


Kernel is the most important feature. There are many types of kernels – linear, Gaussian,
etc. Each is used depending on the dataset. To learn more about this, read this: Support
Vector Machine (SVM) in Python and R

Step 5. Predicting a new result

y_pred = regressor.predict(sc_X.transform([[6.5]]))

y_pred = sc_y.inverse_transform(y_pred.reshape(-1, 1))


So, the prediction for a position level of 6.5 will be 170,370.

Step 6. Visualizing the SVR results (for higher resolution and smoother curve)

X_grid = np.arange(min(X), max(X), 0.01) # this step is required because the data is feature scaled

X_grid = X_grid.reshape((len(X_grid), 1))

plt.scatter(X, y, color = 'red')

plt.plot(X_grid, regressor.predict(X_grid), color = 'blue')

plt.title('Truth or Bluff (SVR)')

plt.xlabel('Position level')

plt.ylabel('Salary')

plt.show()
This is what we get as output: the best-fit line that covers the maximum number of points. Quite
accurate!
UNIT-IV
Clustering in Machine Learning
Clustering or cluster analysis is a machine learning technique which groups an unlabelled dataset. It can be defined as "A
way of grouping the data points into different clusters, consisting of similar data points. The objects with
possible similarities remain in a group that has few or no similarities with another group."

It does it by finding some similar patterns in the unlabelled dataset such as shape, size, color, behavior, etc., and divides
them as per the presence and absence of those similar patterns.

It is an unsupervised learning method, hence no supervision is provided to the algorithm, and it deals with the unlabeled
dataset.

After applying this clustering technique, each cluster or group is provided with a cluster ID. The ML system can use this ID
to simplify the processing of large and complex datasets.

The clustering technique is commonly used for statistical data analysis.

Note: Clustering is somewhere similar to the classification algorithm, but the difference is the type of dataset that we are using.
In classification, we work with the labeled data set, whereas in clustering, we work with the unlabelled dataset.

Example: Let's understand the clustering technique with the real-world example of Mall: When we visit any shopping mall,
we can observe that the things with similar usage are grouped together. Such as the t-shirts are grouped in one section,
and trousers are at other sections, similarly, at vegetable sections, apples, bananas, Mangoes, etc., are grouped in separate
sections, so that we can easily find out the things. The clustering technique also works in the same way. Other examples of
clustering are grouping documents according to the topic.

The clustering technique can be widely used in various tasks. Some most common uses of this technique are:

o Market Segmentation
o Statistical data analysis
o Social network analysis
o Image segmentation
o Anomaly detection, etc.

Apart from these general usages, it is used by Amazon in its recommendation system to provide
recommendations based on past product searches. Netflix also uses this technique to recommend movies and web
series to its users based on their watch history.

The below diagram explains the working of the clustering algorithm. We can see the different fruits are divided into
several groups with similar properties.
Types of Clustering Methods
The clustering methods are broadly divided into Hard clustering (a data point belongs to only one group) and Soft
clustering (a data point can belong to more than one group). There are also various other approaches to clustering.
Below are the main clustering methods used in Machine learning:

1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering

Partitioning Clustering

It is a type of clustering that divides the data into non-hierarchical groups. It is also known as the centroid-based
method. The most common example of partitioning clustering is the K-Means Clustering algorithm.

In this type, the dataset is divided into a set of k groups, where K is used to define the number of pre-defined groups. The
cluster center is created in such a way that the distance between the data points of one cluster is minimum as compared
to another cluster centroid.
Density-Based Clustering

The density-based clustering method connects the highly-dense areas into clusters, and the arbitrarily shaped
distributions are formed as long as the dense region can be connected. This algorithm does it by identifying different
clusters in the dataset and connects the areas of high densities into clusters. The dense areas in data space are divided
from each other by sparser areas.

These algorithms can face difficulty in clustering the data points if the dataset has varying densities and high dimensions.
Distribution Model-Based Clustering

In the distribution model-based clustering method, the data is divided based on the probability of how a dataset belongs
to a particular distribution. The grouping is done by assuming some distributions commonly Gaussian Distribution.

The example of this type is the Expectation-Maximization Clustering algorithm that uses Gaussian Mixture Models
(GMM).

Hierarchical Clustering

Hierarchical clustering can be used as an alternative to partitional clustering, as there is no requirement to pre-specify the number of clusters to be created. In this technique, the dataset is divided into clusters to create a tree-like
structure, which is also called a dendrogram. The observations, or any number of clusters, can be selected by cutting the
tree at the correct level. The most common example of this method is the Agglomerative Hierarchical algorithm.
Fuzzy Clustering

Fuzzy clustering is a type of soft method in which a data object may belong to more than one group or cluster. Each
dataset has a set of membership coefficients, which depend on the degree of membership to be in a cluster. Fuzzy C-
means algorithm is the example of this type of clustering; it is sometimes also known as the Fuzzy k-means algorithm.

K-Means Clustering Algorithm


K-Means Clustering is an unsupervised learning algorithm that is used to solve the clustering problems in machine
learning or data science. In this topic, we will learn what is K-means clustering algorithm, how the algorithm works, along
with the Python implementation of k-means clustering.

What is K-Means Algorithm?


K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset into different clusters.
Here K defines the number of pre-defined clusters that need to be created in the process, as if K=2, there will be two
clusters, and for K=3, there will be three clusters, and so on.

It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point
belongs to only one group with similar properties.

It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in an
unlabeled dataset on its own, without the need for any training.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to
minimize the sum of distances between the data point and their corresponding clusters.
The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process
until it finds the best clusters. The value of k should be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:

o Determines the best value for K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. Those data points which are near to the particular k-center, create
a cluster.

Hence each cluster has datapoints with some commonalities, and it is away from other clusters.

The below diagram explains the working of the K-means Clustering Algorithm:

How does the K-Means Algorithm Work?


The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points or centroids. (They may be points other than those in the input dataset.)

Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.

Step-4: Calculate the variance and place a new centroid for each cluster.

Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of its cluster.

Step-6: If any reassignment occurred, go to Step-4; else, go to FINISH.

Step-7: The model is ready.


Let's understand the above steps by considering the visual plots:

Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is given below:

o Let's take number k of clusters, i.e., K=2, to identify the dataset and to put them into different clusters. It means
here we will try to group these datasets into two different clusters.
o We need to choose some random k points or centroid to form the cluster. These points can be either the points
from the dataset or any other point. So, here we are selecting the below two points as k points, which are not the
part of our dataset. Consider the below image:

o Now we will assign each data point of the scatter plot to its closest K-point or centroid. We compute this using
the mathematics we have studied for calculating the distance between two points, and draw a
median line between the two centroids. Consider the below image:

From the above image, it is clear that points on the left side of the line are nearer to the K1 or blue centroid, and points to
the right of the line are closer to the yellow centroid. Let's color them blue and yellow for clear visualization.
o As we need to find the closest cluster, we will repeat the process by choosing new centroids. To choose the
new centroids, we compute the center of gravity of the data points in each cluster, and the new centroids will be as below:
o Next, we will reassign each datapoint to the new centroid. For this, we will repeat the same process of finding a
median line. The median will be like below image:

From the above image, we can see that one yellow point is on the left side of the line, and two blue points are to the right
of the line. So, these three points will be assigned to new centroids.
As reassignment has taken place, we again go to step-4, which is finding new centroids or K-points.
o We will repeat the process by finding the center of gravity of centroids, so the new centroids will be as shown in
the below image:
o Having obtained the new centroids, we again draw the median line and reassign the data points. So, the image will be:

o We can see in the above image that there are no dissimilar data points on either side of the line, which means our
model is formed. Consider the below image:
As our model is ready, we can now remove the assumed centroids, and the two final clusters will be as shown in the
below image.

How to choose the value of "K number of clusters" in K-means Clustering?


The performance of the K-means clustering algorithm depends on the quality of the clusters it forms, but choosing
the optimal number of clusters is a big task. There are several ways to find the optimal number of clusters;
here we discuss the most appropriate method to find the number of clusters, or the value of K. The method is given
below:

Elbow Method

The Elbow method is one of the most popular ways to find the optimal number of clusters. This method uses the concept
of WCSS value. WCSS stands for Within Cluster Sum of Squares, which defines the total variations within a cluster. The
formula to calculate the value of WCSS (for 3 clusters) is given below:

WCSS = ∑Pi in Cluster1 distance(Pi, C1)² + ∑Pi in Cluster2 distance(Pi, C2)² + ∑Pi in Cluster3 distance(Pi, C3)²

In the above formula of WCSS,

∑Pi in Cluster1 distance(Pi, C1)² is the sum of the squared distances between each data point in cluster 1 and its centroid,
and similarly for the other two terms.

To measure the distance between data points and centroid, we can use any method such as Euclidean distance or
Manhattan distance.
To find the optimal value of clusters, the elbow method follows the below steps:

o It executes the K-means clustering on a given dataset for different K values (ranges from 1-10).
o For each value of K, calculates the WCSS value.
o Plots a curve between calculated WCSS values and the number of clusters K.
o The sharp point of bend, where the plot looks like an arm, is considered the best value of
K.

Since the graph shows the sharp bend, which looks like an elbow, hence it is known as the elbow method. The graph for
the elbow method looks like the below image:

Note: We can choose the number of clusters equal to the given data points. If we choose the number of clusters equal to the
data points, then the value of WCSS becomes zero, and that will be the endpoint of the plot.
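A sketch of these steps follows; scikit-learn exposes the WCSS of a fitted model as the `inertia_` attribute, and the four-blob dataset is an illustrative stand-in:

```python
# Hedged sketch of the elbow method; dataset and K range are illustrative.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)   # inertia_ is the within-cluster sum of squares

# WCSS keeps dropping as K grows; the "elbow" is where the drop flattens
# out (around K = 4 for this dataset). Plotting wcss against K shows it.
print([round(v) for v in wcss])
```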

k-Means Advantages and Disadvantages


Advantages of k-means
Relatively simple to implement.

Scales to large data sets.

Guarantees convergence.

Can warm-start the positions of centroids.


Easily adapts to new examples.

Generalizes to clusters of different shapes and sizes, such as elliptical clusters.

k-means Generalization

What happens when clusters are of different densities and sizes? Look at Figure 1. Compare the intuitive clusters on the
left side with the clusters actually found by k-means on the right side. The comparison shows how k-means can stumble
on certain datasets.

Figure 1: Ungeneralized k-means example.

To cluster naturally imbalanced clusters like the ones shown in Figure 1, you can adapt (generalize) k-means. In Figure 2,
the lines show the cluster boundaries after generalizing k-means as:

 Left plot: No generalization, resulting in a non-intuitive cluster boundary.


 Center plot: Allow different cluster widths, resulting in more intuitive clusters of different sizes.

 Right plot: Besides different cluster widths, allow different widths per dimension, resulting in elliptical instead of
spherical clusters.

While this course doesn't dive into how to generalize k-means, remember that the ease of modifying k-means is another
reason why it's powerful. For information on generalizing k-means, see Clustering – K-means Gaussian mixture
models by Carlos Guestrin from Carnegie Mellon University.

Disadvantages of k-means
Choosing k manually.

Use the “Loss vs. Clusters” plot to find the optimal (k), as discussed in Interpret Results.

Being dependent on initial values.

For a low k, you can mitigate this dependence by running k-means several times with different initial values and picking
the best result. As k increases, you need advanced versions of k-means to pick better values for the initial centroids
(called k-means seeding). For a full discussion of k-means seeding, see A Comparative Study of Efficient Initialization
Methods for the K-Means Clustering Algorithm by M. Emre Celebi, Hassan A. Kingravi, and Patricio A. Vela.

Clustering data of varying sizes and density.

k-means has trouble clustering data where clusters are of varying sizes and density. To cluster such data, you need to
generalize k-means as described in the Advantages section.

Clustering outliers.

Centroids can be dragged by outliers, or outliers might get their own cluster instead of being ignored. Consider removing
or clipping outliers before clustering.

Scaling with number of dimensions.

As the number of dimensions increases, a distance-based similarity measure converges to a constant value between any
given examples. Reduce dimensionality either by using PCA on the feature data, or by using “spectral clustering” to
modify the clustering algorithm as explained below.
Curse of Dimensionality and Spectral Clustering

These plots show how the ratio of the standard deviation to the mean of distance between examples decreases as the
number of dimensions increases. This convergence means k-means becomes less effective at distinguishing between
examples. This negative consequence of high-dimensional data is called the curse of dimensionality.

Figure 3: A demonstration of the curse of dimensionality. Each plot shows the pairwise distances between
200 random points.

Spectral clustering avoids the curse of dimensionality by adding a pre-clustering step to your algorithm:

1. Reduce the dimensionality of feature data by using PCA.


2. Project all data points into the lower-dimensional subspace.
3. Cluster the data in this subspace by using your chosen algorithm.

Therefore, spectral clustering is not a separate clustering algorithm but a pre-clustering step that you can use with any
clustering algorithm. The details of spectral clustering are complicated; see A Tutorial on Spectral Clustering by Ulrike von
Luxburg.
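The simplified pre-clustering recipe above (reduce with PCA, project, then cluster) can be sketched with scikit-learn. Note this follows the three steps as described here, not sklearn's graph-based SpectralClustering class; the synthetic blob data and the choices of 5 components and 4 clusters are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# High-dimensional toy data: 4 blobs in 50 dimensions.
X, _ = make_blobs(n_samples=400, centers=4, n_features=50, random_state=0)

# Steps 1-2: reduce dimensionality with PCA and project the points.
X_low = PCA(n_components=5, random_state=0).fit_transform(X)

# Step 3: cluster in the lower-dimensional subspace.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_low)

print(X_low.shape)       # (400, 5)
print(len(set(labels)))  # 4
```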

Segmentation by Clustering
It is a method to perform pixel-wise image segmentation, in which we try to cluster together the pixels that belong together. There
are two approaches for performing segmentation by clustering:
 Clustering by merging (agglomerative)
 Clustering by division (divisive)
Clustering by merging or Agglomerative Clustering:
In this approach, we follow a bottom-up strategy: each pixel starts as its own cluster, and the closest clusters are merged step
by step. The algorithm for performing agglomerative clustering is as follows:
 Take each point as a separate cluster.
 For a given number of epochs, or until the clustering is satisfactory:
 Merge the two clusters with the smallest inter-cluster distance.
 Repeat the above step.
Agglomerative clustering is represented by a dendrogram. It can be performed with 3 linkage methods: by selecting the closest
pair for merging, by selecting the farthest pair for merging, or by selecting the pair which is at an average distance (neither
closest nor farthest). The dendrograms representing these types of clustering are below:

Nearest clustering
Average Clustering

Farthest Clustering
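The three merge criteria above correspond to the single ("nearest"), complete ("farthest"), and average linkage methods in SciPy's hierarchical-clustering module. A small illustrative sketch on made-up 2-D points:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two well-separated groups of 2-D points.
X = np.vstack([rng.normal(0, 0.3, (10, 2)),
               rng.normal(5, 0.3, (10, 2))])

# The three merge criteria map onto linkage methods:
# closest pair -> 'single', farthest pair -> 'complete', average -> 'average'
for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                    # merge tree (dendrogram data)
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
    print(method, len(set(labels)))
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` would draw the corresponding dendrogram.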

Clustering by division or Divisive splitting

In this approach, we follow a top-down strategy: we start with all points in one cluster and repeatedly split it. The algorithm for
performing divisive clustering is as follows:
 Construct a single cluster containing all points.
 For a given number of epochs, or until the clustering is satisfactory:
 Split the cluster into two clusters with the largest inter-cluster distance.
 Repeat the above steps.
Next, we discuss how to perform K-Means clustering.
K-Means Clustering
K-means clustering is a very popular clustering algorithm, applied when we have a dataset with unknown labels. The goal
is to find certain groups based on some kind of similarity in the data, with the number of groups represented by K. This
algorithm is generally used in areas like market segmentation and customer segmentation, but it can also be used to segment
different objects in images on the basis of their pixel values.
The algorithm for image segmentation works as follows:
1. First, we need to select the value of K in K-means clustering.
2. Select a feature vector for every pixel (color values such as RGB, texture, etc.).
3. Define a similarity measure between feature vectors, such as Euclidean distance, to measure the similarity between any two
points/pixels.
4. Apply the K-means algorithm to the feature vectors to obtain cluster centers.
5. Apply the connected components algorithm.
6. Combine any component of size less than the threshold with an adjacent component that is similar to it, until no more
components can be combined.
Following are the steps for applying the K-means clustering algorithm:
 Select K points and assign them one cluster center each.
 Until the cluster centers stop changing, perform the following steps:
 Allocate each point to the nearest cluster center, so that each point belongs to exactly one cluster.
 Replace each cluster center with the mean of the points assigned to it.
 End
The optimal value of K?
For a certain class of clustering algorithms, there is a parameter commonly referred to as K that specifies the number of clusters
to detect. We may have the predefined value of K, if we have domain knowledge about data that how many categories it
contains. But, before calculating the optimal value of K, we first need to define the objective function for the above algorithm.
The objective function can be given by:

J = Σⱼ₌₁ᴷ Σᵢ∈Cⱼ ‖xᵢ − μⱼ‖²

where j indexes the K clusters, Cⱼ is the set of points assigned to the j-th cluster, and μⱼ is that cluster's center. The above
objective function is called the within-cluster sum of squares (WCSS) distance.
A good way to find the optimal value of K is to brute force a smaller range of values (1-10) and plot the graph of WCSS
distance vs K. The point where the graph is sharply bent downward can be considered the optimal value of K. This method is
called Elbow method.
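The Elbow method can be sketched as follows; this is an illustrative example on synthetic blob data, using scikit-learn's inertia_ attribute, which is exactly the WCSS objective above:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# WCSS for K = 1..10 (scikit-learn exposes it as inertia_)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        for k in range(1, 11)]

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("K")
plt.ylabel("WCSS")
plt.show()   # the bend ("elbow") appears near the true number of clusters
```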
For image segmentation, we plot the histogram of the image and try to find peaks, valleys in it. Then, we will perform the
peakiness test on that histogram.

Implementation

 In this implementation, we will be performing Image Segmentation using K-Means clustering. We will be using OpenCV's
k-means API to perform this clustering.
 Python3

# imports
import numpy as np
import cv2 as cv
import matplotlib.pyplot as plt

plt.rcParams["figure.figsize"] = (12, 50)

# load image (replace with the path to your input image)
img = cv.imread('image.jpg')

# flatten the image into an N x 3 array of pixels
Z = img.reshape((-1, 3))

# convert to np.float32
Z = np.float32(Z)

# define stopping criteria, number of clusters (K) and apply kmeans()
# TERM_CRITERIA_EPS: stop when the epsilon value is reached
# TERM_CRITERIA_MAX_ITER: stop when the max iteration count is reached
criteria = (cv.TERM_CRITERIA_EPS + cv.TERM_CRITERIA_MAX_ITER, 10, 1.0)

fig, ax = plt.subplots(10, 2, sharey=True)

for i in range(10):
    K = i + 3

    # apply the K-means algorithm (10 attempts, random initial centers)
    ret, label, center = cv.kmeans(Z, K, None, criteria, 10,
                                   cv.KMEANS_RANDOM_CENTERS)

    # convert the centers back into uint8 and rebuild the segmented image
    center = np.uint8(center)
    res = center[label.flatten()]
    res2 = res.reshape(img.shape)

    # plot the original image and the K-means image
    ax[i, 0].imshow(img)
    ax[i, 0].set_title('Original Image')
    ax[i, 1].imshow(res2)
    ax[i, 1].set_title('K = %s Image' % K)
Image Segmentation for K=3,4,5
Image Segmentation for K=6,7,8
5 Stages of Data Preprocessing for K-means clustering

Data Preprocessing or Data Preparation is a data mining technique that transforms raw data into
an understandable format for ML algorithms. Real-world data usually is noisy (contains errors,
outliers, duplicates), incomplete (some values are missed), could be stored in different places and
different formats. The task of Data Preprocessing is to handle these issues.

In the common ML pipeline, the Data Preprocessing stage sits between the Data Collection stage and
Training / Tuning the Model.

Importance of Data Preprocessing stage

1. Different ML models have different required input data (numerical data, images in specific
format, etc). Without the right data, nothing will work.

2. Because of “bad” data, ML models will not give any useful results, or may even give wrong
answers that lead to wrong decisions (the GIGO principle: garbage in, garbage out).
3. The higher the quality of the data, the less data is needed.

Note: Nowadays the preprocessing stage is the most laborious step; it may take 60–80% of an ML
engineer's effort.

Before starting data preparation, it is recommended to determine what data requirements are
presented by the ML algorithm for getting quality results. In this article we consider the K-means
clustering algorithm.

K-means input data requirements:

 Numerical variables only. K-means uses distance-based measurements to determine the
similarity between data points. If you have categorical data, use K-modes clustering; if the data is
mixed, use K-prototype clustering.

 Data has no noise or outliers. K-means is very sensitive to outliers and noisy data.

 Data has a symmetric distribution of variables (it isn't skewed). Real data always has
outliers and noise, and it's difficult to get rid of them entirely. Transforming the data toward a normal
distribution helps reduce the impact of these issues, making it much easier for the algorithm to
identify clusters.

 Variables on the same scale: they have the same mean and variance, usually in a range of -1.0 to
1.0 (standardized data) or 0.0 to 1.0 (normalized data). For the ML algorithm to consider all
attributes as equal, they must all have the same scale.

 There is no collinearity (a high level of correlation between two variables). Correlated
variables are not useful for ML segmentation algorithms because they represent the same
characteristic of a segment; correlated variables are, therefore, nothing but noise.

 A small number of dimensions. As the number of dimensions (variables) increases, a
distance-based similarity measure converges to a constant value between any given examples.
The more variables there are, the more difficult it is to find strict differences between instances.
Note: What exactly does "a small number" mean? It's an open question for me. If you know the
answer, please let me know. For now, I stick to the rule that fewer is better, plus validation of
the results.

Besides the requirements above, there are a few fundamental model assumptions:

 the variance of the distribution of each attribute (variable) is spherical (or in other words, the
boundaries between k-means clusters are linear);

 all variables have the same variance;

 each cluster has roughly equal numbers of observations.

These assumptions are beyond the data preprocessing stage. There is no way to validate them
before getting model results.

Stages of Data preprocessing for K-means Clustering

1. Data Cleaning

 Removing duplicates

 Removing irrelevant observations and errors

 Removing unnecessary columns

 Handling inconsistent data

 Handling outliers and noise

2. Handling missing data

3. Data Integration
4. Data Transformation

 Feature Construction

 Handling skewness

 Data Scaling

5. Data Reduction
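The stages above can be illustrated with a small sketch on a toy table (the column names and values are made up for illustration), using pandas and scikit-learn:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy raw data with the issues listed above: a duplicate row, a missing
# value, an impossible value, and a skewed positive variable.
df = pd.DataFrame({
    "age":    [25, 25, 31, 47, 52, 200, 38],          # 200 is an entry error
    "income": [30e3, 30e3, 45e3, np.nan, 80e3, 52e3, 1e6],
})

df = df.drop_duplicates()                                  # 1. cleaning: duplicates
df = df[df["age"] <= 120].copy()                           # 1. cleaning: errors/outliers
df["income"] = df["income"].fillna(df["income"].median())  # 2. missing data
df["income"] = np.log1p(df["income"])                      # 4. transformation: reduce skew
X = StandardScaler().fit_transform(df)                     # 4. scaling: mean 0, variance 1

print(X.shape)   # (5, 2): rows remaining after cleaning, 2 scaled columns
```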
Using Clustering for Semi-Supervised Learning
Semi-supervised clustering is a method that partitions unlabeled data by making use of domain knowledge, generally
expressed as pairwise constraints between instances or as an additional set of labeled instances.
The quality of unsupervised clustering can be substantially improved using some weak form of supervision, for
instance, pairwise constraints (i.e., pairs of objects labeled as belonging to the same or different clusters).
A clustering procedure that relies on such user feedback or guidance constraints is known as semi-supervised
clustering.
There are several methods for semi-supervised clustering that can be divided into two classes which are as follows −
Constraint-based semi-supervised clustering − It uses user-provided labels or constraints to guide
the algorithm toward a more appropriate data partitioning. This includes modifying the objective function based on
constraints, or initializing and constraining the clustering process based on the labeled objects.
Distance-based semi-supervised clustering − It employs an adaptive distance measure that is trained to
satisfy the labels or constraints in the supervised data. Multiple adaptive distance measures have been utilized, including
string-edit distance trained using Expectation-Maximization (EM), and Euclidean distance modified by a shortest-path
algorithm.
An interesting clustering method, known as CLTree (CLustering based on decision TREEs), integrates unsupervised
clustering with the concept of supervised classification. It is an instance of constraint-based semi-supervised clustering. It
converts a clustering task into a classification task by considering the set of points to be clustered as belonging to one
class, labeled “Y,” and inserting a set of relatively uniformly distributed “nonexistence points” with another class label, “N.”
The problem of partitioning the data space into data (dense) regions and empty (sparse) regions can then be transformed
into a classification problem: the original points form the “Y” class, and the added, uniformly distributed points form the “N”
class.
The original clustering problem is thus converted into a classification problem that works out a decision boundary
distinguishing the “Y” and “N” points. A decision tree induction method can then be used to partition the space, and the
clusters are recognized from the “Y” points only.
Note that actually inserting a large number of “N” points into the original data can introduce unnecessary overhead in the
computation. Moreover, it is unlikely that the added points would truly be uniformly distributed in a very high-dimensional
space, as this would need an exponential number of points.

DBSCAN Clustering in ML | Density based clustering


Clustering analysis or simply Clustering is basically an Unsupervised learning method that divides the data points
into a number of specific batches or groups, such that the data points in the same groups have similar properties
and data points in different groups have different properties in some sense. It comprises many different methods
based on different similarity measures.
E.g. K-Means (distance between points), Affinity propagation (graph distance), Mean-shift (distance between
points), DBSCAN (distance between nearest points), Gaussian mixtures (Mahalanobis distance to centers),
Spectral clustering (graph distance) etc.
Fundamentally, all clustering methods use the same approach i.e. first we calculate similarities and then we use it
to cluster the data points into groups or batches. Here we will focus on Density-based spatial clustering of
applications with noise (DBSCAN) clustering method.
Clusters are dense regions in the data space, separated by regions of the lower density of points. The DBSCAN
algorithm is based on this intuitive notion of “clusters” and “noise”. The key idea is that for each point of a cluster,
the neighborhood of a given radius has to contain at least a minimum number of points.

Why DBSCAN?
Partitioning methods (K-means, PAM clustering) and hierarchical clustering work for finding spherical-shaped
clusters or convex clusters. In other words, they are suitable only for compact and well-separated clusters.
Moreover, they are also severely affected by the presence of noise and outliers in the data.
Real life data may contain irregularities, like:
1. Clusters can be of arbitrary shape such as those shown in the figure below.
2. Data may contain noise.

The figure below shows a data set containing nonconvex clusters and outliers/noises. Given such data, k-means
algorithm has difficulties in identifying these clusters with arbitrary shapes.
DBSCAN algorithm requires two parameters:
1. eps: It defines the neighborhood around a data point, i.e. if the distance between two points is lower than or equal
to eps, then they are considered neighbors. If the eps value is chosen too small, then a large part of the data will
be considered outliers. If it is chosen very large, then clusters will merge and the majority of the data
points will end up in the same cluster. One way to find the eps value is based on the k-distance graph.
2. MinPts: Minimum number of neighbors (data points) within the eps radius. The larger the dataset, the larger the value of
MinPts that should be chosen. As a general rule, the minimum MinPts can be derived from the number of
dimensions D in the dataset as MinPts >= D+1. MinPts must be chosen to be at least 3.
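The k-distance graph mentioned above can be computed as follows; this is a sketch on synthetic data using scikit-learn's NearestNeighbors, where the choice k = 5 is an assumption for illustration:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

k = 5  # a MinPts candidate (>= D + 1 = 3 for this 2-D data)

# Distance from every point to its k-th nearest neighbour.
# n_neighbors=k+1 because the nearest "neighbour" of a point is itself.
dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
k_dist = np.sort(dist[:, -1])  # sorted k-distance curve

# Plotting k_dist and reading off the 'knee' of the curve gives an eps
# candidate; here we just print a value near the bend.
print(round(float(k_dist[int(0.95 * len(k_dist))]), 2))
```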
In this algorithm, we have 3 types of data points.
Core Point: A point is a core point if it has more than MinPts points within eps.
Border Point: A point which has fewer than MinPts within eps but it is in the neighborhood of a core point.
Noise or outlier: A point which is not a core point or border point.

DBSCAN algorithm can be abstracted in the following steps:


1. Find all the neighbor points within eps and identify the core points, i.e. points with more than MinPts neighbors.
2. For each core point if it is not already assigned to a cluster, create a new cluster.
3. Find recursively all its density connected points and assign them to the same cluster as the core point.
Points a and b are said to be density connected if there exists a point c which has a sufficient number of points
in its neighborhood and both a and b are within eps distance of it. This is a chaining process: if b is a
neighbor of c, c is a neighbor of d, and d is a neighbor of e, which in turn is a neighbor of a, this implies that b is
density connected to a.
4. Iterate through the remaining unvisited points in the dataset. Those points that do not belong to any cluster are
noise.
Below is the DBSCAN clustering algorithm in pseudocode:
DBSCAN(dataset, eps, MinPts){
   # cluster index
   C = 1
   for each unvisited point p in dataset {
      mark p as visited
      # find neighbors
      Neighbors N = find the neighboring points of p
      if |N| >= MinPts:
         # p is a core point; start a new cluster from it
         add p to cluster C
         for each point p' in N {
            if p' is unvisited:
               mark p' as visited
               Neighbors N' = find the neighboring points of p'
               if |N'| >= MinPts:
                  N = N U N'
            if p' is not a member of any cluster:
               add p' to cluster C
         }
         C = C + 1
      else:
         mark p as noise
   }
}
Implementation of the above algorithm in Python :
Here, we'll use the Python library sklearn to compute DBSCAN. We'll also use the matplotlib library for
visualizing the clusters.

Evaluation Metrics

Moreover, we will use the Silhouette score and Adjusted rand score for evaluating clustering algorithms.
Silhouette score is in the range of -1 to 1. A score near 1 is best, meaning that the data point is very
compact within the cluster to which it belongs and far away from the other clusters. The worst value is -1. Values
near 0 denote overlapping clusters.
Adjusted Rand Score is in the range of 0 to 1. More than 0.9 denotes excellent cluster recovery, and above 0.8 is a
good recovery. Less than 0.5 is considered to be poor recovery.
Example
 Python3

import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Load data in X
X, y_true = make_blobs(n_samples=300, centers=4,
                       cluster_std=0.50, random_state=0)

db = DBSCAN(eps=0.3, min_samples=10).fit(X)

core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print(labels)

# Plot result
# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = ['y', 'b', 'g', 'r']
print(colors)

for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = 'k'

    class_member_mask = (labels == k)

    # plot core points of this cluster
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=6)

    # plot border points of this cluster
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=6)

plt.title('number of clusters: %d' % n_clusters_)
plt.show()

# evaluation metrics
sc = metrics.silhouette_score(X, labels)
print("Silhouette Coefficient: %0.2f" % sc)

ari = adjusted_rand_score(y_true, labels)
print("Adjusted Rand Index: %0.2f" % ari)

Output:
Gaussian Mixture Model
Suppose there are set of data points that need to be grouped into several parts or clusters based on their
similarity. In machine learning, this is known as Clustering.
There are several methods available for clustering:

 K Means Clustering
 Hierarchical Clustering
 Gaussian Mixture Models
In this article, Gaussian Mixture Model will be discussed.

Normal or Gaussian Distribution

In real life, many datasets can be modeled by a Gaussian Distribution (univariate or multivariate). So it is quite
natural and intuitive to assume that the clusters come from different Gaussian Distributions. In other words, we
try to model the dataset as a mixture of several Gaussian Distributions. This is the core idea of this model.
In one dimension, the probability density function of a Gaussian Distribution is given by

G(x | μ, σ²) = (1 / √(2πσ²)) exp( −(x − μ)² / (2σ²) )

where μ and σ² are respectively the mean and variance of the distribution.

For a Multivariate (let us say d-variate) Gaussian Distribution, the probability density function is given by

G(X | μ, Σ) = (1 / ((2π)^(d/2) |Σ|^(1/2))) exp( −½ (X − μ)ᵀ Σ⁻¹ (X − μ) )

Here μ is a d-dimensional vector denoting the mean of the distribution and Σ is the d × d covariance matrix.

Gaussian Mixture Model

Suppose there are K clusters (for the sake of simplicity, it is assumed here that the number of clusters is known
and is K). So μₖ and Σₖ must be estimated for each k. Had there been only one distribution, they would have been
estimated by the maximum-likelihood method. But since there are K such clusters, the probability density is
defined as a linear combination of the densities of all these K distributions, i.e.

p(X) = Σₖ₌₁ᴷ πₖ G(X | μₖ, Σₖ),   with Σₖ₌₁ᴷ πₖ = 1

where πₖ is the mixing coefficient of the k-th distribution (πₖ = Nₖ / N, where Nₖ denotes the total number of
sample points in the k-th cluster). Here it is assumed that there are N samples in total and each sample, containing
d features, is denoted by xᵢ.
It can be clearly seen that these parameters cannot be estimated in closed form. This is where the Expectation-
Maximization algorithm is beneficial.

Expectation-Maximization (EM) Algorithm

The Expectation-Maximization (EM) algorithm is an iterative way to find maximum-likelihood estimates for model
parameters when the data is incomplete, has missing data points, or has hidden (latent) variables. EM
chooses some random values for the missing data points and estimates a new set of parameter values. These new values
are then used recursively to obtain better estimates of the missing data, until the values converge.
These are the two basic steps of the EM algorithm, namely E Step or Expectation Step or Estimation
Step and M Step or Maximization Step.

 Estimation (E) step:
 Initialize μₖ, Σₖ and πₖ with some random values, or with K-means clustering results, or with
hierarchical clustering results.
 Then, for those given parameter values, estimate the values of the latent variables (i.e. the responsibility
of each cluster k for each point).
 Maximization (M) step:
 Update the values of the parameters (i.e. μₖ, Σₖ and πₖ) calculated using the maximum-likelihood method.
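In practice, this EM loop is implemented by scikit-learn's GaussianMixture; a minimal sketch on synthetic blob data, where the choice of three components and the K-means initialization are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=1.0, random_state=42)

# GaussianMixture runs the EM loop internally: the E step computes the
# responsibilities (latent variables) and the M step re-estimates the
# means, covariances and mixing coefficients.
gmm = GaussianMixture(n_components=3, init_params="kmeans",
                      random_state=42).fit(X)

print(gmm.converged_)           # True once EM has stabilised
print(gmm.weights_.round(2))    # mixing coefficients, summing to 1
```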

Curse of Dimensionality: A “Curse” to Machine Learning

Curse of Dimensionality describes the explosive nature of increasing data dimensions and its
resulting exponential increase in computational efforts required for its processing and/or analysis.
This term was first introduced by Richard E. Bellman, in the area of dynamic programming, to explain the increase in the
volume of Euclidean space associated with adding extra dimensions.
Today, this phenomenon is observed in fields like machine learning, data analysis, data mining to
name a few. An increase in the dimensions can in theory, add more information to the data
thereby improving the quality of data but practically increases the noise and redundancy during its
analysis.

Behavior of a Machine Learning Algorithms — Need for data points and Accuracy of
Model
In machine learning, a feature of an object can be an attribute or a characteristic that defines it.
Each feature represents a dimension and group of dimensions creates a data point. This represents
a feature vector that defines the data point to be used by a machine learning algorithm(s). When
we say increase in dimensionality it implies an increase in the number of features used to describe
the data. For example, in the field of breast cancer research, age, number of cancerous nodes can
be used as features to define the prognosis of the breast cancer patient. These features constitute
the dimensions of a feature vector. But other factors like past surgeries, patient history, type of
tumor and other such features help a doctor to better determine the prognosis. In this case by
adding features, we are theoretically increasing the dimensions of our data.

As the dimensionality increases, the number of data points required for good performance of any
machine learning algorithm increases exponentially. The reason is that we would need more data points
for any given combination of feature values for any machine learning model to be
valid. For example, let's say that for a model to perform well, we need at least 10 data points for
each combination of feature values. If we assume that we have one binary feature, then for its 2¹ = 2
unique values (0 and 1) we would need 2¹ × 10 = 20 data points. For 2 binary features, we would
have 2² unique values and need 2² × 10 = 40 data points. Thus, for k binary features we
would need 2ᵏ × 10 data points.
Hughes (1968) in his study concluded that with a fixed number of training samples, the predictive
power of any classifier first increases as the number of dimensions increase, but after a certain
value of number of dimensions, the performance deteriorates. Thus, the phenomenon of curse of
dimensionality is also known as Hughes phenomenon.
Graphical Representation of Hughes Principle

Effect of Curse of Dimensionality on Distance Functions:


For any point A, let us assume distₘᵢₙ(A) is the distance between A and its nearest
neighbor and distₘₐₓ(A) is the distance between A and its farthest neighbor.

It can be shown that, in a d-dimensional space with n random points, as d grows large the ratio
(distₘₐₓ(A) − distₘᵢₙ(A)) / distₘᵢₙ(A) tends to 0, so distₘᵢₙ(A) ≈ distₘₐₓ(A); that is,
any given pair of points becomes nearly equidistant.

Therefore, any machine learning algorithms which are based on the distance measure including
KNN(k-Nearest Neighbor) tend to fail when the number of dimensions in the data is very high.
Thus, dimensionality can be considered as a “curse” in such algorithms.
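The claim that distₘᵢₙ ≈ distₘₐₓ in high dimensions can be checked numerically; a sketch with random points in the unit cube (the point count and dimension values are chosen for illustration), using SciPy's pdist:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n = 200
ratios = {}
for d in (2, 10, 100, 1000):
    X = rng.random((n, d))            # n random points in the unit d-cube
    dist = pdist(X)                   # all pairwise Euclidean distances
    ratios[d] = (dist.max() - dist.min()) / dist.min()
    print(d, round(ratios[d], 2))     # the contrast shrinks as d grows
```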

Solutions to Curse of Dimensionality:

One of the ways to reduce the impact of high dimensions is to use a different distance measure
in the vector space. For example, one could explore the use of cosine similarity to replace Euclidean distance,
since cosine similarity is less affected by high dimensionality. However, the suitability of such a
measure is specific to the problem being solved.
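A minimal sketch of cosine similarity, showing that it depends only on the direction of the vectors, not their magnitude (the vectors are made up for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    # angle-based similarity; ignores vector magnitude entirely
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = 10 * a            # same direction, very different magnitude
print(round(cosine_similarity(a, b), 6))   # 1.0, although the Euclidean distance is large
```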

Other methods:

Other methods involve reducing the number of dimensions. Some of the techniques that can
be used are:

1. Forward feature selection: This method involves picking the most useful subset of features from
all given features.

2. PCA / t-SNE: Though these methods help in reducing the number of features, they do not
necessarily preserve the class labels and thus can make the interpretation of results a tough task.
Top 10 Dimensionality Reduction Techniques For Machine Learning
Every second, the world generates an unprecedented volume of data. As data has become a crucial
component of businesses and organizations across all industries, it is essential to process, analyze, and
visualize it appropriately to extract meaningful insights from large datasets. However, there’s a catch –
more does not always mean productive and accurate. The more data we produce every second, the more
challenging it is to analyze and visualize it to draw valid inferences.
This is where Dimensionality Reduction comes into play.

What is Dimensionality Reduction?


In simple words, dimensionality reduction refers to the technique of reducing the dimension of a data
feature set. Usually, machine learning datasets (feature set) contain hundreds of columns (i.e., features) or
an array of points, creating a massive sphere in a three-dimensional space. By applying dimensionality
reduction, you can decrease or bring down the number of columns to quantifiable counts, thereby
transforming the three-dimensional sphere into a two-dimensional object (circle).
Now comes the question, why must you reduce the columns in a dataset when you can directly feed it into
an ML algorithm and let it work out everything by itself?
The curse of dimensionality mandates the application of dimensionality reduction.
The Curse of Dimensionality
The curse of dimensionality is a phenomenon that arises when you work (analyze and visualize) with data
in high-dimensional spaces that do not exist in low-dimensional spaces.

The higher the number of features or factors (a.k.a. variables) in a feature set, the more difficult it
becomes to visualize the training set and work on it. Another vital point to consider is that most of the
variables are often correlated. So, if you include every variable in the feature set, you will include many
redundant factors in the training set.
Furthermore, the more variables you have at hand, the higher the number of samples needed to represent all
the possible combinations of feature values. As the number of variables increases, the
model becomes more complex, thereby increasing the likelihood of overfitting. An ML
model trained on a large dataset with many features becomes overly dependent on the training data, resulting
in an overfitted model that fails to perform well on real data.
The primary aim of dimensionality reduction is to avoid overfitting. Training data with considerably
fewer features will ensure that your model remains simple and makes fewer assumptions.
Apart from this, dimensionality reduction has many other benefits, such as:

 It eliminates noise and redundant features.


 It helps improve the model's accuracy and performance.
 It enables the use of algorithms that are unfit for large numbers of dimensions.
 It reduces the amount of storage space required (less data needs less storage space).
 It compresses the data, which reduces the computation time and facilitates faster training.
Dimensionality Reduction Techniques
Dimensionality reduction techniques can be categorized into two broad categories:
1. Feature selection
The feature selection method aims to find a subset of the input variables (that are most relevant) from the
original dataset. Feature selection includes three strategies, namely:
 Filter strategy
 Wrapper strategy
 Embedded strategy
2. Feature extraction
Feature extraction, a.k.a. feature projection, converts the data from the high-dimensional space to one with
fewer dimensions. This data transformation may be either linear or nonlinear. The
technique finds a smaller set of new variables, each of which is a combination of the input variables
(containing essentially the same information as the input variables).
Without further ado, let’s dive into a detailed discussion of a few commonly used dimensionality reduction
techniques!
1. Principal Component Analysis (PCA)
Principal Component Analysis is one of the leading linear techniques of dimensionality reduction. This
method performs a direct mapping of the data to a lower-dimensional space in a way that maximizes the
variance of the data in the low-dimensional representation.
Essentially, it is a statistical procedure that orthogonally converts the ‘n’ coordinates of a dataset into a new
set of n coordinates, known as the principal components. This conversion results in the creation of the first
principal component having the maximum variance. Each succeeding principal component bears the
highest possible variance, under the condition that it is orthogonal (not correlated) to the preceding
components.
The PCA conversion is sensitive to the relative scaling of the original variables. Thus, the data column
ranges must first be normalized before implementing the PCA method. Another thing to remember is that
using the PCA approach will make your dataset lose its interpretability. So, if interpretability is crucial to
your analysis, PCA is not the right dimensionality reduction method for your project.
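A minimal PCA sketch on the Iris dataset, including the normalization step the text calls for (the use of scikit-learn and the choice of 2 components are assumptions for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data                        # 150 samples, 4 features

# Normalize first: PCA is sensitive to the relative scaling of the variables.
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=2).fit(X_std)
X_2d = pca.transform(X_std)

print(X_2d.shape)                           # (150, 2)
print(pca.explained_variance_ratio_)        # first component carries the most variance
```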
2. Non-negative matrix factorization (NMF)
NMF breaks down a non-negative matrix into the product of two non-negative ones. This is what makes the
NMF method a valuable tool in areas that are primarily concerned with non-negative signals (for instance,
astronomy). Since the multiplicative update rules of Lee & Seung, the NMF technique has been extended to handle
uncertainties, missing data, parallel computation, and sequential construction; these extensions have helped make
the NMF approach stable and linear. Unlike PCA, NMF does not subtract the mean of the matrices, which would
create unphysical negative fluxes. Thus, NMF can preserve more information than the PCA method.
Sequential NMF is characterized by a stable component base during construction and a linear modeling
process. This makes it the perfect tool in astronomy. Sequential NMF can preserve the flux in the direct
imaging of circumstellar structures in astronomy, such as detecting exoplanets and direct imaging of
circumstellar disks.
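A minimal sketch of NMF with scikit-learn; the matrix size, rank, and initialization below are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.decomposition import NMF

# Random non-negative data: 100 samples, 10 features
X = np.random.RandomState(0).rand(100, 10)

nmf = NMF(n_components=3, init='nndsvda', max_iter=500, random_state=0)
W = nmf.fit_transform(X)   # reduced representation (100 x 3)
H = nmf.components_        # non-negative basis matrix (3 x 10)

# X is approximated by the product W @ H of two non-negative matrices
print(W.shape, H.shape)
```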
3. Linear discriminant analysis (LDA)
The linear discriminant analysis is a generalization of Fisher’s linear discriminant method that is widely
applied in statistics, pattern recognition, and machine learning. The LDA technique aims to find a linear
combination of features that can characterize or differentiate between two or more classes of objects. LDA
represents data in a way that maximizes class separability: under the projection, objects from the same class are
placed close together, while objects from different classes are placed far apart.
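A minimal LDA projection with scikit-learn; the iris data and two components are illustrative choices (LDA can produce at most n_classes − 1 components):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# Iris has 3 classes, so at most 2 discriminant components are possible
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)
print(X_lda.shape)  # (150, 2)
```

Unlike PCA, LDA is supervised: it uses the class labels `y` to choose the projection.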
4. Generalized discriminant analysis (GDA)
The generalized discriminant analysis is a nonlinear discriminant analysis that leverages the kernel function
operator. Its underlying theory matches very closely to that of support vector machines (SVM), such that
the GDA technique helps to map the input vectors into high-dimensional feature space. Just like the LDA
approach, GDA also seeks to find a projection for variables in a lower-dimensional space by maximizing
the ratio of between-class scatters to within-class scatter.
5. Missing Values Ratio
When you explore a given dataset, you might find that there are some missing values in the dataset. The
first step in dealing with missing values is to identify the reason behind them. Accordingly, you can then
impute the missing values or drop them altogether using appropriate methods. This approach works well
when only a few values are missing.
However, what to do when there are too many missing values, say, over 50%? In such situations, you can
set a threshold value and use the missing values ratio method. The higher the threshold value, the more
aggressive will be the dimensionality reduction. If the percentage of missing values in a variable exceeds
the threshold, you can drop the variable.
Generally, data columns having numerous missing values hardly contain useful information. So, you can
remove all the data columns having missing values higher than the set threshold.
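The threshold approach can be sketched in pandas; the toy data and the 50% threshold below are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [1, np.nan, np.nan, np.nan],   # 75% missing
    'c': [1, np.nan, 3, 4],             # 25% missing
})

threshold = 0.5  # drop columns with more than 50% missing values
missing_ratio = df.isna().mean()        # fraction of NaNs per column
df_reduced = df.loc[:, missing_ratio <= threshold]
print(list(df_reduced.columns))  # ['a', 'c']
```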
6. Low Variance Filter
Just as the missing values ratio method handles variables with many missing values, the low variance filter
handles (near-)constant variables. A constant variable cannot improve the model's performance because it has
zero variance: it carries no information that distinguishes one observation from another.
In this method, too, you set a threshold value, and all the data columns with variance lower than the threshold
are eliminated. One thing you must remember about the low variance filter is that variance is range dependent,
so normalization is a must before implementing this dimensionality reduction technique.
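A sketch using scikit-learn's VarianceThreshold, with normalization first; the toy data and the 0.01 threshold are illustrative:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import VarianceThreshold

X = np.array([[1.0, 100.0, 5.0],
              [1.0, 200.0, 5.1],
              [1.0, 300.0, 4.9],
              [1.0, 400.0, 5.0]])   # first column is constant

# Normalize first, since raw variance depends on each column's scale
X_scaled = MinMaxScaler().fit_transform(X)

selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X_scaled)
print(X_reduced.shape)  # (4, 2): the constant column is dropped
```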
7. High Correlation Filter
If a dataset consists of data columns having a lot of similar patterns/trends, these data columns are highly
likely to contain identical information. Also, dimensions that depict a higher correlation can adversely
impact the model’s performance. In such an instance, one of those variables is enough to feed the ML
model.
For such situations, it's best to use the Pearson correlation matrix to identify the variables showing a high
correlation, and then decide which of them to keep using the VIF (Variance Inflation Factor): variables with a
high value (say, VIF > 5) can be removed. In this approach, you calculate the correlation coefficient between
numerical columns (Pearson's product-moment coefficient) and between nominal columns (Pearson's chi-square
value); from each pair of columns whose correlation exceeds the set threshold, only one column is kept.
Since correlation is scale-sensitive, you must normalize the columns first.
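A sketch of the correlation-based filter in pandas; the 0.9 threshold and the toy data are illustrative, and the filter drops one column from each highly correlated pair:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({'x1': rng.rand(100)})
df['x2'] = df['x1'] * 2 + rng.rand(100) * 0.01   # nearly a copy of x1
df['x3'] = rng.rand(100)                          # independent feature

threshold = 0.9
corr = df.corr().abs()
# Keep only the upper triangle so each pair is checked exactly once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
df_reduced = df.drop(columns=to_drop)
print(to_drop)  # ['x2']
```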
8. Backward Feature Elimination
In the backward feature elimination technique, you begin with all 'n' dimensions: at the first iteration, a chosen
classification algorithm is trained on all n input features. You then remove one input feature at a time, training
the same model on n-1 input variables n times, and eliminate the input variable whose removal produces the
smallest increase in the error rate. This leaves behind n-1 input features. You then repeat the classification
using n-2 features, and so on, until no further variable can be removed.
Each iteration (k) creates a model trained on n-k features having an error rate of e(k). Following this, you
must select the maximum bearable error rate to define the smallest number of features needed to reach that
classification performance with the given ML algorithm.
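One way to sketch this procedure is scikit-learn's SequentialFeatureSelector in backward mode; the estimator and the target of 2 features are illustrative choices, and scoring here uses cross-validated accuracy rather than a raw error rate:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Start from all 4 features; repeatedly drop the feature whose removal
# hurts cross-validated accuracy the least, until 2 remain
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=2,
                                direction='backward')
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask of the surviving features
```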
Also Read: Why Data Analysis is Important in Business
9. Forward Feature Construction
The forward feature construction is the opposite of the backward feature elimination method. In the
forward feature construction method, you begin with one feature and continue to progress by adding one
feature at a time (this is the variable that results in the greatest boost in performance).
Both forward feature construction and backward feature elimination are time and computation-intensive.
These methods are best suited for datasets that already have a low number of input columns.
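The forward variant can be sketched with scikit-learn's SequentialFeatureSelector in forward mode; the estimator and the target feature count are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Start with no features; greedily add the one that most improves
# cross-validated accuracy, until 2 features are selected
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=2,
                                direction='forward')
sfs.fit(X, y)
print(sfs.get_support().sum())  # 2
```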
10. Random Forests
Random forests are not only excellent classifiers but are also extremely useful for feature selection. In this
dimensionality reduction approach, you construct a large ensemble of trees against a target attribute. For
instance, you can create a large set (say, 2,000) of shallow trees (say, two levels deep), where each tree is
trained on a small random subset (say, 3) of the total attributes.
The aim is to use each attribute's usage statistics to identify the most informative subset of features. If an
attribute is often chosen as the best split, it is likely to be an informative feature worth keeping. Scoring each
attribute's usage statistics in the random forest relative to the other attributes reveals the most predictive
attributes.
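This idea can be sketched with scikit-learn's feature_importances_; the tree count, depth, and per-split feature subset below are illustrative choices matching the description above:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()

# Many shallow trees, each split considering a small random feature subset
forest = RandomForestClassifier(n_estimators=200, max_depth=2,
                                max_features=3, random_state=0)
forest.fit(data.data, data.target)

# Rank attributes by how effectively the trees used them to split
order = np.argsort(forest.feature_importances_)[::-1]
for i in order[:5]:
    print(data.feature_names[i], round(forest.feature_importances_[i], 3))
```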
ML | Principal Component Analysis(PCA)

Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal transformation that converts a set of
correlated variables to a set of uncorrelated variables. PCA is the most widely used tool in exploratory data analysis and in
machine learning for predictive models. Moreover, PCA is an unsupervised statistical technique used to examine the
interrelations among a set of variables. It is also known as a general factor analysis where regression determines a line of best
fit.
Module Needed:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

Code #1:

# Here we are using an inbuilt dataset of scikit-learn
from sklearn.datasets import load_breast_cancer

# instantiating
cancer = load_breast_cancer()

# creating dataframe
df = pd.DataFrame(cancer['data'], columns = cancer['feature_names'])

# checking head of dataframe
df.head()

Output:

Code #2:

# Importing StandardScaler module
from sklearn.preprocessing import StandardScaler

scalar = StandardScaler()

# fitting
scalar.fit(df)
scaled_data = scalar.transform(df)

# Importing PCA
from sklearn.decomposition import PCA

# Let's say, components = 2
pca = PCA(n_components = 2)
pca.fit(scaled_data)
x_pca = pca.transform(scaled_data)

x_pca.shape

Output:
(569, 2)

# giving a larger plot
plt.figure(figsize =(8, 6))

plt.scatter(x_pca[:, 0], x_pca[:, 1], c = cancer['target'], cmap ='plasma')

# labeling x and y axes
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')

Output:

# components
pca.components_

Output:

df_comp = pd.DataFrame(pca.components_, columns = cancer['feature_names'])

plt.figure(figsize =(14, 6))

# plotting heatmap
sns.heatmap(df_comp)
Output:

Scikit Learn
Scikit-learn (sklearn) is one of the most useful and robust libraries for machine learning in Python. It
provides a selection of efficient tools for machine learning and statistical modeling, including classification, regression,
clustering, and dimensionality reduction, via a consistent interface in Python. This library, which is largely written in
Python, is built upon NumPy, SciPy, and Matplotlib.

Audience
This tutorial will be useful for graduates, postgraduates, and research students who either have an interest in this Machine
Learning subject or have this subject as a part of their curriculum. The reader can be a beginner or an advanced learner.

Prerequisites
The reader must have basic knowledge of machine learning, and should also be familiar with Python, NumPy,
SciPy, and Matplotlib.

Randomized PCA
Principal component analysis (PCA) is widely used for dimension reduction and embedding of real data in social network analysis,
information retrieval, and natural language processing, etc. In this work we propose a fast randomized PCA algorithm for processing
large sparse data. The algorithm has similar accuracy to the basic randomized SVD (rPCA) algorithm (Halko et al., 2011), but is
largely optimized for sparse data. It also has good flexibility to trade off runtime against accuracy for practical usage. Experiments on
real data show that the proposed algorithm is up to 9.1X faster than the basic rPCA algorithm without accuracy loss, and is up to 20X
faster than the svds in Matlab with little error. The algorithm computes the first 100 principal components of a large information
retrieval data with 12,869,521 persons and 323,899 keywords in less than 400 seconds on a 24-core machine, while all conventional
methods fail due to the out-of-memory issue.
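In scikit-learn, a randomized solver in this spirit is available through PCA's svd_solver='randomized' option; the data shape and component count below are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative data: 1000 samples, 200 features
X = np.random.RandomState(0).rand(1000, 200)

# svd_solver='randomized' uses the Halko et al. randomized SVD, which is
# much faster than a full SVD when n_components << n_features
rpca = PCA(n_components=10, svd_solver='randomized', random_state=0)
X_reduced = rpca.fit_transform(X)
print(X_reduced.shape)  # (1000, 10)
```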

ML | Introduction to Kernel PCA


PRINCIPAL COMPONENT ANALYSIS: is a tool which is used to reduce the dimension of the data. It allows us to reduce
the dimension of the data without much loss of information. PCA reduces the dimension by finding a few orthogonal linear
combinations (principal components) of the original variables with the largest variance.
The first principal component captures most of the variance in the data. The second principal component is orthogonal
to the first and captures as much of the remaining variance as possible, and so on. There are as many principal
components as there are original variables.
These principal components are uncorrelated and are ordered in such a way that the first several principal components explain
most of the variance of the original data. To learn more about PCA you can read the article Principal Component Analysis
KERNEL PCA:
PCA is a linear method, so it does an excellent job on datasets that are linearly separable. But if we apply it to
non-linear datasets, the result may not be the optimal dimensionality reduction. Kernel PCA uses a kernel function to
project the dataset into a higher-dimensional feature space, where it is linearly separable. This is similar to the idea
behind Support Vector Machines.
There are various kernel functions, such as linear, polynomial, and Gaussian (RBF).
Code: Create a dataset which is nonlinear and then apply PCA on the dataset.

import matplotlib.pyplot as plt
from sklearn.datasets import make_moons

X, y = make_moons(n_samples = 500, noise = 0.02, random_state = 417)

plt.scatter(X[:, 0], X[:, 1], c = y)
plt.show()

Code: Let’s apply PCA on this dataset

from sklearn.decomposition import PCA

pca = PCA(n_components = 2)
X_pca = pca.fit_transform(X)

plt.title("PCA")
plt.scatter(X_pca[:, 0], X_pca[:, 1], c = y)
plt.xlabel("Component 1")
plt.ylabel("Component 2")
plt.show()

As you can see, PCA failed to distinguish the two classes.

Code: Applying kernel PCA on this dataset with RBF kernel with a gamma value of 15.

from sklearn.decomposition import KernelPCA

kpca = KernelPCA(kernel ='rbf', gamma = 15)
X_kpca = kpca.fit_transform(X)

plt.title("Kernel PCA")
plt.scatter(X_kpca[:, 0], X_kpca[:, 1], c = y)
plt.show()

In the kernel space the two classes are linearly separable. Kernel PCA uses a kernel function to project the dataset into a higher-
dimensional space, where it is linearly separable.
Finally, we applied the kernel PCA to a non-linear dataset using scikit-learn.
UNIT-V

Introduction to Artificial Neural Networks with Keras

Birds inspired us to fly, burdock plants inspired Velcro, and nature has inspired countless
more inventions. It seems only logical, then, to look at the brain’s architecture for inspiration
on how to build an intelligent machine. This is the logic that sparked artificial neural
networks (ANNs): an ANN is a Machine Learning model inspired by the networks of
biological neurons found in our brains. However, although planes were inspired by birds,
they don’t have to flap their wings. Similarly, ANNs have gradually become quite different
from their biological cousins. Some researchers even argue that we should drop the
biological analogy altogether (e.g., by saying “units” rather than “neurons”), lest we restrict
our creativity to biologically plausible systems.1

ANNs are at the very core of Deep Learning. They are versatile, powerful, and scalable,
making them ideal to tackle large and highly complex Machine Learning tasks such as
classifying billions of images (e.g., Google Images), powering speech recognition services
(e.g., Apple’s Siri), recommending the best videos to watch to hundreds of millions of users
every day (e.g., YouTube), or learning to beat the world champion at the game of Go
(DeepMind’s AlphaGo).

Overview of Keras
Keras runs on top of open-source machine learning libraries such as TensorFlow, Theano, or the Cognitive Toolkit
(CNTK). Theano is a Python library used for fast numerical computation. TensorFlow is the most famous symbolic
math library for creating neural networks and deep learning models; it is very flexible, and its primary benefit is
distributed computing. CNTK is a deep learning framework developed by Microsoft that can be used from Python,
C#, or C++, or as a standalone machine learning toolkit. Theano and TensorFlow are very powerful libraries but are
difficult to use directly for creating neural networks.
Keras is based on a minimal structure that provides a clean and easy way to create deep learning models on top of
TensorFlow or Theano. Keras is designed to let you define deep learning models quickly, which makes it an
excellent choice for deep learning applications.

Features
Keras leverages various optimization techniques to make high level neural network API easier and more
performant. It supports the following features −
 Consistent, simple and extensible API.
 Minimal structure - easy to achieve the result without any frills.
 It supports multiple platforms and backends.
 It is user friendly framework which runs on both CPU and GPU.
 Highly scalable computation.

Benefits
Keras is a highly powerful and dynamic framework that offers the following advantages −
 Larger community support.
 Easy to test.
 Keras neural networks are written in Python, which makes things simpler.
 Keras supports both convolutional and recurrent networks.
 Deep learning models are built from discrete components that can be combined in many ways.
How to Build Multi-Layer Perceptron Neural Network
Models with Keras
The Keras Python library for deep learning focuses on creating models as a sequence of layers.

In this post, you will discover the simple components you can use to create neural networks and simple deep
learning models using Keras from TensorFlow.

Kick-start your project with my new book Deep Learning With Python, including step-by-step tutorials and
the Python source code files for all examples.
Let’s get started.

Neural Network Models in Keras


The focus of the Keras library is a model.

The simplest model is defined in the Sequential class, which is a linear stack of Layers.

You can create a Sequential model and define all the layers in the constructor; for example:

from tensorflow.keras.models import Sequential

model = Sequential(...)

A more useful idiom is to create a Sequential model and add your layers in the order of the computation you
wish to perform; for example:

from tensorflow.keras.models import Sequential

model = Sequential()
model.add(...)
model.add(...)
model.add(...)
Model Inputs
The first layer in your model must specify the shape of the input.

This is the number of input attributes defined by the input_shape argument. This argument expects a tuple.
For example, you can define input in terms of 8 inputs for a Dense type layer as follows:
Dense(16, input_shape=(8,))

Model Layers
Layers of different types have a few properties in common, specifically their method of weight initialization and
activation functions.

Weight Initialization
The type of initialization used for a layer is specified in the kernel_initializer argument.
Some common types of layer initialization include:

 random_uniform: Weights are initialized to small uniformly random values between -0.05 and 0.05.
 random_normal: Weights are initialized to small Gaussian random values (zero mean and standard deviation
of 0.05).
 zeros: All weights are set to zero values.
You can see a full list of the initialization techniques supported on the Usage of initializations page.
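A sketch showing the kernel_initializer argument on Dense layers; the layer sizes are illustrative:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Each layer picks its own weight initializer via a standard Keras string
model = Sequential([
    Dense(16, input_shape=(8,), kernel_initializer='random_uniform'),
    Dense(8, kernel_initializer='random_normal'),
    Dense(1, kernel_initializer='zeros'),
])

# The 'zeros' initializer leaves the last layer's weight matrix all zero
weights = model.layers[2].get_weights()[0]
print(weights.shape, (weights == 0).all())
```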
Activation Function
Keras supports a range of standard neuron activation functions, such as softmax, rectified linear (relu), tanh,
and sigmoid.

You typically specify the type of activation function used by a layer in the activation argument, which takes a
string value.

You can see a full list of activation functions supported by Keras on the Usage of activations page.
Interestingly, you can also create an Activation object and add it directly to your model after your layer to apply
that activation to the output of the layer.
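A sketch of both styles, inline activation versus a separate Activation layer; the layer sizes are illustrative:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation

model = Sequential([
    Dense(16, input_shape=(8,), activation='relu'),  # inline activation
    Dense(8),
    Activation('relu'),   # separate layer applied to the previous output
])
print(len(model.layers))  # 3
```

The two styles are functionally equivalent; the separate layer form is handy when you want to inspect the pre-activation output.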

Layer Types

There are a large number of core layer types for standard neural networks.

Some common and useful layer types you can choose from are:

 Dense: Fully connected layer and the most common type of layer used on multi-layer perceptron models
 Dropout: Apply dropout to the model, setting a fraction of inputs to zero in an effort to reduce overfitting
 Concatenate: Combine the outputs from multiple layers as input to a single layer
You can learn about the full list of core Keras layers on the Core Layers page.
Model Compilation
Once you have defined your model, it needs to be compiled.

This creates the efficient structures used by TensorFlow in order to efficiently execute your model during
training. Specifically, TensorFlow converts your model into a graph so the training can be carried out efficiently.

You compile your model using the compile() function, and it accepts three important attributes:
1. Model optimizer
2. Loss function
3. Metrics
model.compile(optimizer=..., loss=..., metrics=...)

1. Model Optimizers

The optimizer is the search technique used to update weights in your model.

You can create an optimizer object and pass it to the compile function via the optimizer argument. This allows
you to configure the optimization procedure with its own arguments, such as learning rate. For example:

from tensorflow.keras.optimizers import SGD

sgd = SGD(...)
model.compile(optimizer=sgd)

You can also use the default parameters of the optimizer by specifying the name of the optimizer to the
optimizer argument. For example:

model.compile(optimizer='sgd')

Some popular gradient descent optimizers you might want to choose from include:

 SGD: stochastic gradient descent, with support for momentum


 RMSprop: adaptive learning rate optimization method proposed by Geoff Hinton
 Adam: Adaptive Moment Estimation (Adam) that also uses adaptive learning rates
You can learn about all of the optimizers supported by Keras on the Usage of optimizers page.
You can learn more about different gradient descent methods in the Gradient descent optimization algorithms
section of Sebastian Ruder’s post, An overview of gradient descent optimization algorithms.
2. Model Loss Functions
The loss function, also called the objective function, is the evaluation of the model used by the optimizer to
navigate the weight space.
You can specify the name of the loss function to use in the compile function by the loss argument. Some
common examples include:

 ‘mse‘: for mean squared error


 ‘binary_crossentropy‘: for binary logarithmic loss (logloss)
 ‘categorical_crossentropy‘: for multi-class logarithmic loss (logloss)
You can learn more about the loss functions supported by Keras on the Losses page.
3. Model Metrics

Metrics are evaluated by the model during training.

The most commonly used metric is accuracy; Keras also supports a range of other built-in metrics, as well as custom ones.

Model Training
The model is trained on NumPy arrays using the fit() function; for example:

model.fit(X, y, epochs=..., batch_size=...)

Training both specifies the number of epochs to train on and the batch size.

 Epochs (epochs) refer to the number of times the model is exposed to the training dataset.
 Batch Size (batch_size) is the number of training instances shown to the model before a weight update is
performed.
The fit function also allows for some basic evaluation of the model during training. You can set the
validation_split value to hold back a fraction of the training dataset for validation to be evaluated in each epoch
or provide a validation_data tuple of (X, y) data to evaluate.

Fitting the model returns a history object with details and metrics calculated for the model in each epoch. This
can be used for graphing model performance.
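Putting fit() together with validation_split and the returned history object; the toy data, layer sizes, and epoch count below are illustrative:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Toy data; in practice X and y come from your dataset
X = np.random.rand(100, 8)
y = (X.sum(axis=1) > 4).astype(int)

model = Sequential([Dense(16, input_shape=(8,), activation='relu'),
                    Dense(1, activation='sigmoid')])
model.compile(optimizer='sgd', loss='binary_crossentropy',
              metrics=['accuracy'])

# Hold back 20% of the data for validation at the end of each epoch
history = model.fit(X, y, epochs=5, batch_size=10,
                    validation_split=0.2, verbose=0)

# history.history maps each metric name to one value per epoch
print(sorted(history.history.keys()))
```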

Model Prediction
Once you have trained your model, you can use it to make predictions on test data or new data.

There are a number of different output types you can calculate from your trained model, each calculated using
a different function call on your model object. For example:

 model.evaluate(): To calculate the loss values for the input data
 model.predict(): To generate network output for the input data
For example, if you provide a batch of data X and the expected output y, you can use evaluate() to
calculate the loss metric (the one you defined with compile() before). For a batch of new data X, you can
obtain the network output with predict(). It may not be the output you want, but it will be the output of your
network: a classification problem will typically output a softmax vector for each sample, and you will need
np.argmax() to convert the softmax vector into class labels.

Installing TensorFlow 2 GPU [Step-by-Step Guide]
TensorFlow is one of the most-used deep learning frameworks. It's arguably the most popular
machine learning platform on the web, with a broad range of users, from those just starting out to
people looking for an edge in their careers and businesses.

Not all users know that you can install the TensorFlow GPU if your hardware supports it. We’ll
discuss what Tensorflow is, how it’s used in today’s world, and how to install the latest TensorFlow
version with CUDA, cudNN, and GPU support in Windows, Mac, and Linux.

Introduction to TensorFlow
TensorFlow is an open-source software library for machine learning, created by Google. It was
initially released on November 28, 2015, and it’s now used across many fields including research in
the sciences and engineering.

The idea behind TensorFlow is to make it quick and simple to train deep neural networks that use a
diversity of mathematical models. These networks are then able to learn from data without human
intervention or supervision, making them more efficient than conventional methods. The library also
offers support for processing on multiple machines simultaneously with different operating systems
and GPUs.

TensorFlow applications
TensorFlow is a library for deep learning built by Google, and it has been gaining traction ever since its
introduction. Its main features include automatic differentiation, convolutional neural networks (CNNs), and
recurrent neural networks (RNNs). It is written in C++ and Python, and for high performance it can also run on
Google Cloud Platform. It doesn't require a GPU, although computations are accelerated considerably when one
is available.
TensorFlow also works well with matplotlib for data visualization. This visualization library is very popular,
and it's often used in data science coursework, as well as by artists and engineers for data visualization.

Installing the latest TensorFlow version with CUDA, cuDNN, and GPU support
Let’s see how to install the latest TensorFlow version on Windows, macOS, and Linux.

Windows
Prerequisite

 Python 3.6–3.8 
 Windows 7 or later (with C++ redistributable) 
 Check [Link] For latest version information

Steps

1) Download Microsoft Visual Studio from:

[Link]

2) Install the NVIDIA CUDA Toolkit ([Link] check the version of


software and hardware requirements, we’ll be using :

Version | Python version | Compiler | Build tools | cuDNN | CUDA
tensorflow-2.5.0 | 3.6–3.9 | GCC 7.3.1 | Bazel 3.7.2 | 8.1 | 11.2

We will install CUDA version 11.2, but make sure you install the latest or updated version (for
example – 11.2.2 if it’s available).

Click on the newest version and a screen will pop up, where you can choose from a few options, so
follow the below image and choose these options for Windows.

Once you choose the above options, wait for the download to complete.

Install it with the Express (Recommended) option, it will take a while to install on your machine.

3) Now we’ll download NVIDIA cuDNN, [Link]

Check the version code from the TensorFlow site.


Now, check versions for CUDA and cuDNN, and click download for your operating system. If you
can’t find your desired version, click on cuDNN Archive and download from there.

Once the download is complete, extract the files.

Now, copy these 3 folders (bin, include, lib). Go to C Drive>Program Files, and search for NVIDIA
GPU Computing Toolkit.

Open the folder, select CUDA > Version Name, and replace (paste) those copied files.

Now click on the bin folder and copy the path. It should look like this: C:\Program Files\NVIDIA
GPU Computing Toolkit\CUDA\v11.2\bin.
On your PC, search for Environment variables, as shown below.

Click on Environment Variables on the bottom left. Now click on the link which states PATH.
Once you click on the PATH, you will see something like this.
Now click on New (Top Left), and paste the bin path here. Go to the CUDA folder, select
libnvvm folder, and copy its path. Follow the same process and paste that path into the system
path. Next, just restart your PC.

4) Installing Tensorflow

Open conda prompt. If not installed, get it here → [Link]

Now copy the below commands and paste them into the prompt (Check for the versions).

conda create --name tf2.5 python=3.8

conda activate tf2.5

# Install TensorFlow with GPU support:
pip install --upgrade tensorflow-gpu
# or the standard package (recent releases include GPU support):
pip install tensorflow

You’ll see an installation screen like this. If you see any errors, Make sure you’re using the correct
version and don’t miss any steps.

We’ve installed everything, so let’s test it out in Pycharm.

Test

To test the whole process we’ll use Pycharm. If not installed, get the community edition
→ [Link]

First, to check if TensorFlow GPU has been installed properly on your machine, run the below code:

# importing the tensorflow package
import tensorflow as tf

tf.test.is_built_with_cuda()
tf.test.is_gpu_available(cuda_only=False, min_cuda_compute_capability=None)

It should show True as output. If it shows False or an error, go back over the steps.

Now let’s run some code.


For a simple demo, we train a convolutional network on the MNIST dataset of handwritten digits. We'll
see how to create the network as well as initialize a loss function, check accuracy, and more. (Note: this
demo uses PyTorch rather than TensorFlow, but it exercises the same CUDA and cuDNN installation.)

Configure the env, create a new Python file, and paste the below code:

# Imports
import torch
import torchvision
import torch.nn.functional as F
import torchvision.datasets as datasets
import torchvision.transforms as transforms
from torch import optim
from torch import nn
from torch.utils.data import DataLoader
from tqdm import tqdm
Check the rest of the code here -> [Link]
Collection/blob/master/ML/Pytorch/Basics/pytorch_simple_CNN.py.

When you run the code, look for a log line reporting that the CUDA libraries were successfully opened (it
includes the version code). If training starts, all the steps were successful.

MacOS
MacOS doesn’t support Nvidia GPU for the latest versions, so this will be a CPU-only installation.
You can get GPU support on a Mac with some extra effort and requirements.

Prerequisite

 Python 3.6–3.8 
 macOS 10.12.6 (Sierra) or later (no GPU support)
 Check [Link] For the latest version information

You can install the latest version available on the site, but for this tutorial, we’ll be using Python 3.8.
Also, check with the TensorFlow site for version support.

2) Prepare environment:

After installing Miniconda, open the command prompt.

conda install -y jupyter


This will take some time to install jupyter. Next, install the Mac [Link] file. You can also
create a .yml file to install TensorFlow and dependencies (mentioned below).

dependencies:
- python=3.8
- pip>=19.0
- jupyter
- scikit-learn
- scipy
- pandas
- pandas-datareader
- matplotlib
- pillow
- tqdm
- requests
- h5py
- pyyaml
- flask
- boto3
- pip:
- tensorflow==2.4
- bayesian-optimization
- gym
- kaggle
Run the following command from the same directory that contains [Link].

conda env create -f [Link] -n tensorflow


This installation might take a few minutes.

Register the environment as a Jupyter kernel using the following command:

python -m ipykernel install --user --name tensorflow --display-name "Python 3.8 (tensorflow)"
Test

To test the whole process, we’ll use a Jupyter notebook. Use this command to start Jupyter:

jupyter notebook
Copy the code below and run it in a Jupyter notebook.

import sys

import tensorflow.keras
import pandas as pd
import sklearn as sk
import tensorflow as tf

print(f"Tensor Flow Version: {tf.__version__}")
print(f"Keras Version: {tensorflow.keras.__version__}")
print()
print(f"Python {sys.version}")
print(f"Pandas {pd.__version__}")
print(f"Scikit-Learn {sk.__version__}")
gpu = len(tf.config.list_physical_devices('GPU')) > 0
print("GPU is", "available" if gpu else "NOT AVAILABLE")
This might take some time, but you’ll see something like this with your installed versions.

Linux
We can install both CPU and GPU versions on Linux.

Prerequisite

 Python 3.6–3.8 
 Ubuntu 16.04 or later
 Check [Link] for the latest version information

Steps

1) First download and install Miniconda from [Link]


2) To install CUDA on your machine, you will need:

 CUDA capable GPU,


 A supported version of Linux,
 NVIDIA CUDA Toolkit ([Link]

You can install CUDA by running,

$ sudo apt install nvidia-cuda-toolkit


After installing CUDA, run to verify the install:

nvcc -V
You’ll see it output something like this:

nvcc: NVIDIA (R) Cuda compiler driver


Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Jul_22_[Link]_PDT_2019
Cuda compilation tools, release ‘version’
3) Now we’ll download NVIDIA cuDNN, [Link]

Check the version code from the TensorFlow site.

After downloading, extract the files:

tar -xvzf cudnn-10.1-linux-x64-'version'.tgz


Now, we’ll copy the extracted files to the CUDA installation path:

sudo cp cuda/include/cudnn.h /usr/lib/cuda/include/


sudo cp cuda/lib64/libcudnn* /usr/lib/cuda/lib64/
Setting up the file permissions of cuDNN:

$ sudo chmod a+r /usr/lib/cuda/include/cudnn.h /usr/lib/cuda/lib64/libcudnn*
4) Get the environment ready:

Export CUDA environment variables. To set them, run:

$ echo 'export LD_LIBRARY_PATH=/usr/lib/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
$ echo 'export LD_LIBRARY_PATH=/usr/lib/cuda/include:$LD_LIBRARY_PATH' >> ~/.bashrc
You can also set the environment with conda and jupyter notebook.
After installing Miniconda, open the command prompt.

conda install -y jupyter

Now, check with TensorFlow site for version, and run the below command:

conda create --name tensorflow python=3.8


To enter the environment:

conda activate tensorflow


Let’s create Jupyter support for your new environment:

conda install nb_conda


This will take some time to get things done.

To install the CPU-only version, use the following command:

conda install -c anaconda tensorflow

To install both GPU and CPU versions, use the following command:

conda install -c anaconda tensorflow-gpu


To add additional libraries, update or create the .yml file in your root location, then run:

conda env update --file [Link]


Below are additional libraries you need to install (you can install them with pip).

dependencies:
- jupyter
- scikit-learn
- scipy
- pandas
- pandas-datareader
- matplotlib
- pillow
- tqdm
- requests
- h5py
- pyyaml
- flask
- boto3
- pip
- pip:
- bayesian-optimization
- gym
- kaggle

Test

There are two ways you can test your GPU.

First, you can run this command:

import tensorflow as tf
[Link].list_physical_devices("GPU")
You will see similar output, [PhysicalDevice(name=’/physical_device:GPU:0′, device_type=’GPU’)]

Second, you can also use a jupyter notebook. Use this command to start Jupyter.

jupyter notebook
Now, run the below code:

import sys

import tensorflow.keras
import pandas as pd
import sklearn as sk
import tensorflow as tf

print(f"Tensor Flow Version: {tf.__version__}")
print(f"Keras Version: {tensorflow.keras.__version__}")
print()
print(f"Python {sys.version}")
print(f"Pandas {pd.__version__}")
print(f"Scikit-Learn {sk.__version__}")
gpu = len(tf.config.list_physical_devices('GPU')) > 0
print("GPU is", "available" if gpu else "NOT AVAILABLE")
You’ll see results something like this:

TensorFlow Version: 'version'


Keras Version: 'version'-tf

Python 3.8.0
Pandas 'version'
Scikit-Learn 'version'
GPU is available
Load and preprocess images
This tutorial shows how to load and preprocess an image dataset in three ways:

 First, you will use high-level Keras preprocessing utilities (such as tf.keras.utils.image_dataset_from_directory) and
layers (such as tf.keras.layers.Rescaling) to read a directory of images on disk.
 Next, you will write your own input pipeline from scratch using tf.data.
 Finally, you will download a dataset from the large catalog available in TensorFlow Datasets.

Setup
import numpy as np
import os
import PIL
import PIL.Image
import tensorflow as tf
import tensorflow_datasets as tfds

print(tf.__version__)

2.9.1

Download the flowers dataset

This tutorial uses a dataset of several thousand photos of flowers. The flowers dataset contains five sub-
directories, one per class:

flowers_photos/
daisy/
dandelion/
roses/
sunflowers/
tulips/

Note: all images are licensed CC-BY, creators are listed in the [Link] file.
import pathlib
dataset_url = "[Link]"
data_dir = tf.keras.utils.get_file(origin=dataset_url,
                                   fname='flower_photos',
                                   untar=True)
data_dir = pathlib.Path(data_dir)

Downloading data from [Link]


228813984/228813984 [==============================] - 2s 0us/step

After downloading (218MB), you should now have a copy of the flower photos available. There are 3,670 total
images:

image_count = len(list(data_dir.glob('*/*.jpg')))
print(image_count)

3670

Each directory contains images of that type of flower. Here are some roses:
roses = list(data_dir.glob('roses/*'))
PIL.Image.open(str(roses[0]))

roses = list(data_dir.glob('roses/*'))
PIL.Image.open(str(roses[1]))

Load data using a Keras utility


Let's load these images off disk using the
helpful tf.keras.utils.image_dataset_from_directory utility.

Create a dataset

Define some parameters for the loader:

batch_size = 32
img_height = 180
img_width = 180
It's good practice to use a validation split when developing your model. You will use
80% of the images for training and 20% for validation.

train_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="training",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)

Found 3670 files belonging to 5 classes.


Using 2936 files for training.
val_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="validation",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)

Found 3670 files belonging to 5 classes.


Using 734 files for validation.

You can find the class names in the class_names attribute on these datasets.

class_names = train_ds.class_names
print(class_names)

['daisy', 'dandelion', 'roses', 'sunflowers', 'tulips']

Visualize the data

Here are the first nine images from the training dataset.

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 10))
for images, labels in train_ds.take(1):
    for i in range(9):
        ax = plt.subplot(3, 3, i + 1)
        plt.imshow(images[i].numpy().astype("uint8"))
        plt.title(class_names[labels[i]])
        plt.axis("off")

You can train a model using these datasets by passing them to Model.fit (shown later in
this tutorial). If you like, you can also manually iterate over the dataset and retrieve
batches of images:

for image_batch, labels_batch in train_ds:
    print(image_batch.shape)
    print(labels_batch.shape)
    break

(32, 180, 180, 3)


(32,)
The image_batch is a tensor of the shape (32, 180, 180, 3). This is a batch of 32 images of
shape 180x180x3 (the last dimension refers to the RGB color channels). The label_batch is a
tensor of the shape (32,); these are the corresponding labels for the 32 images.

You can call .numpy() on either of these tensors to convert them to a numpy.ndarray.

Standardize the data

The RGB channel values are in the [0, 255] range. This is not ideal for a neural network;
in general you should seek to make your input values small.

Here, you will standardize values to be in the [0, 1] range by using tf.keras.layers.Rescaling:

normalization_layer = tf.keras.layers.Rescaling(1./255)

There are two ways to use this layer. You can apply it to the dataset by
calling Dataset.map:

normalized_ds = train_ds.map(lambda x, y: (normalization_layer(x), y))


image_batch, labels_batch = next(iter(normalized_ds))
first_image = image_batch[0]
# Notice the pixel values are now in `[0,1]`.
print(np.min(first_image), np.max(first_image))

0.0 0.96902645

Or, you can include the layer inside your model definition to simplify deployment. You
will use the second approach here.

Note: If you would like to scale pixel values to [-1, 1], you can instead
write tf.keras.layers.Rescaling(1./127.5, offset=-1).

Note: You previously resized images
using the image_size argument of tf.keras.utils.image_dataset_from_directory. If you
want to include the resizing logic in your model as well, you can use
the tf.keras.layers.Resizing layer.

Configure the dataset for performance

Let's make sure to use buffered prefetching so you can yield data from disk without
having I/O become blocking. These are two important methods you should use when
loading data:

 Dataset.cache keeps the images in memory after they're loaded off disk during the first epoch.
This will ensure the dataset does not become a bottleneck while training your model. If your
dataset is too large to fit into memory, you can also use this method to create a performant on-
disk cache.
 Dataset.prefetch overlaps data preprocessing and model execution while training.
Interested readers can learn more about both methods, as well as how to cache data to
disk, in the Prefetching section of the Better performance with the tf.data API guide.

AUTOTUNE = tf.data.AUTOTUNE

train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)
UNIT-II

What is Ensemble Learning?


Ensemble Learning is a method of reaching a consensus in predictions by fusing the salient
properties of two or more models. The final ensemble learning framework is more robust
than the individual models that constitute the ensemble because ensembling reduces the
variance in the prediction errors.
Ensemble Learning tries to capture complementary information from its different contributing
models—that is, an ensemble framework is successful when the contributing models are
statistically diverse.
In other words, models that display performance variation when evaluated on the same
dataset are better suited to form an ensemble. For example—
Different models which make incorrect predictions on different sets of samples from the
dataset should be ensembled. If two statistically similar models are ensembled (models that
make wrong predictions on the same set of samples), the resulting model will only be as good as
the contributing models. An ensemble won’t make any difference to the prediction ability in such a
case.
The diversity in the predictions of the contributing models of an ensemble is popularly verified
using the Kullback-Leibler and Jensen-Shannon Divergence metrics (this paper is a great example
demonstrating the point).
Here are some of the scenarios where ensemble learning comes in handy.

1. Can't choose an “optimal” model


As explained in the example at the beginning of this article, there may arise situations where
different models perform better on some distributions within the dataset. For example, a model
may be well adapted to differentiate between cats and dogs, but not so much when distinguishing
between dogs and wolves.
On the other hand, a second model can accurately differentiate between dogs and wolves while
producing wrong predictions on the “cat” class. An ensemble of these two models might draw a
more discriminative decision boundary between all the three classes of the data.

2. Excess/Shortage of data
In cases where a substantial amount of data is available, we may divide the classification tasks
between different classifiers and ensemble them during prediction time, rather than trying to train
one classifier with large volumes of data.
On the other hand, in cases where the dataset available is small (for example, in the biomedical
domain, where acquiring labeled medical data is costly), we can use a bootstrapping
ensemble strategy.
The way it works is quite simple—
We train different classifiers using various “bootstrap samples” of data, i.e., we create several
subsets of a single dataset using replacement. It means that the same data sample may be present in
more than one subset, which will be later used to train different models (for further reading, check
out this paper).
This method will be further explained in the section on the “Bagging” ensemble technique.
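The bootstrap sampling idea can be sketched in a few lines of NumPy (an illustrative snippet; the ten-element "dataset" of sample ids is made up):

```python
import numpy as np

# Minimal sketch of bootstrap sampling: draw subsets the same size as the
# original dataset *with* replacement, so a sample can repeat in a subset.
rng = np.random.default_rng(42)        # fixed seed for reproducibility
data = np.arange(10)                   # a toy "dataset" of 10 sample ids

bootstrap_subsets = [rng.choice(data, size=len(data), replace=True)
                     for _ in range(3)]

# Each subset would train a different model; duplicates are expected.
for subset in bootstrap_subsets:
    print(sorted(subset.tolist()))
```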

3. Confidence Estimation

The very core of an ensemble framework is based on the confidence in predictions by the different
models. For example, when trying to draw a consensus between four models on a cat/dog
classification problem, if two models predict the sample as class “cat” and the other two predict as
“dog,” the confidence of the ensemble is low.
Further, researchers also use the confidence scores of the individual classifiers to generate a final
confidence score of the ensemble (Examples: Paper-1, Paper-2). Involving the confidence scores for
developing the ensemble gives more robust predictions than simple “majority voting” since a
prediction with 95% confidence is more reliable than a prediction with 51% confidence.
Therefore, we can assign more importance to classifiers that predict with more confidence during
the ensemble.

4. High Problem Complexity


Sometimes, a problem can have a complex decision boundary, and it might become impossible
for a single classifier to generate the appropriate boundary.
For example, if we have a linear classifier and we try to tackle a problem with a parabolic
(polynomial) decision boundary. One linear classifier obviously cannot do the job well.
However, an ensemble of multiple linear classifiers can generate any polynomial decision
boundary.
An example of such a case is shown in the diagram below.

5. Information Fusion
The most prevalent reason for using an ensemble learning model is information fusion for
enhancing classification performance. That is, models that have been trained on different
distributions of data pertaining to the same set of classes are employed during prediction time to
get a more robust decision.
For example, we may have trained one cat/dog classifier on high-quality images taken by a
professional photographer. In contrast, another classifier has been trained on data using low-
quality photos captured on mobile phones. When predicting a new sample, integrating the decisions
from both these classifiers will be more robust and bias-free.

How does ensemble learning work?



Ensemble learning combines the mapping functions learned by different classifiers to


generate an aggregated mapping function.
The diverse methods proposed over the years use different strategies for computing this
combination.
Below we describe the most popular methods that are commonly used in the literature.

1. Bagging
The Bagging ensemble technique is the acronym for “bootstrap aggregating” and is one of the
earliest ensemble methods proposed.
For this method, subsamples from a dataset are created, and this is called “bootstrap
sampling.” To put it simply, random subsets of a dataset are created using replacement,
meaning that the same data point may be present in several subsets.
These subsets are now treated as independent datasets, on which several Machine Learning
models will be fit. During test time, the predictions from all such models trained on different
subsets of the same data are accounted for.
There is an aggregation mechanism used to compute the final prediction (like averaging,
weighted averaging, etc., discussed later).

Bagging

The image shown above exemplifies the Bagging ensemble mechanism.



Note that, in the bagging mechanism, a parallel stream of processing occurs. The main aim of the
bagging method is to reduce variance in the ensemble predictions.
Thus, the chosen ensemble classifiers usually have high variance and low bias (complex models
with many trainable parameters). Popular ensemble methods based on this approach include:
 Bagged Decision Trees
 Random Forest Classifiers
 Extra Trees
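As a concrete sketch, scikit-learn's BaggingClassifier implements this parallel bootstrap-and-aggregate scheme (shown here on a synthetic dataset; its default base learner is a decision tree, a high-variance, low-bias model):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Synthetic stand-in for a real dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# The ensemble fits its default base learner (a decision tree) on 25
# bootstrap subsets drawn in parallel, then aggregates by voting.
bagging = BaggingClassifier(n_estimators=25, random_state=0)
bagging.fit(X, y)
print(bagging.score(X, y))
```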

2. Boosting
The boosting ensemble mechanism works in a way markedly different from the bagging
mechanism.
Here, instead of parallel processing of data, sequential processing of the dataset occurs. The first
classifier is fed with the entire dataset, and the predictions are analyzed.
The instances where Classifier-1 fails to produce correct predictions (these are usually samples
near the decision boundary of the feature space) are fed to the second classifier.
This is done so that Classifier-2 can specifically focus on the problematic areas of feature space
and learn an appropriate decision boundary. Similarly, further steps of the same idea are employed,
and then the ensemble of all these previous classifiers is computed to make the final prediction on
the test data.
The pictorial representation of the same is shown below.

Boosting ensemble mechanism

The main aim of the boosting method is to reduce bias in the ensemble decision. Thus, the
classifiers chosen for the ensemble usually need to have low variance and high bias, i.e., simpler
models with fewer trainable parameters.
Other algorithms based on this approach include:
 Adaptive Boosting
 Stochastic Gradient Boosting
 Gradient Boosting Machines
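A minimal sketch of the sequential idea using scikit-learn's AdaBoostClassifier on a synthetic dataset (its default base learner is a one-level decision tree, a simple high-bias model):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic stand-in for a real dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# AdaBoost fits a sequence of weak learners, each one re-weighting the
# samples that the previous learners predicted incorrectly.
boost = AdaBoostClassifier(n_estimators=50, random_state=0)
boost.fit(X, y)
print(boost.score(X, y))
```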

3. Stacking
The stacking ensemble method also involves creating bootstrapped data subsets, like the bagging
ensemble mechanism for training multiple models.
However, here, the outputs of all such models are used as an input to another classifier, called
meta-classifier, which finally predicts the samples. The intuition behind using two layers of
classifiers is to determine whether the training data have been appropriately learned.

For example, in the example of the cat/dog/wolf classifier at the beginning of this article, if,
say, Classifier-1 can distinguish between cats and dogs, but not between dogs and wolves, the
meta-classifier present in the second layer will be able to capture this behavior from
Classifier-1. The meta-classifier can then correct this behavior before making the final
prediction.
A pictorial representation of the stacking mechanism is shown below.

Stacking ensemble mechanism

The diagram above shows one level of stacking. There are also multi-level stacking
ensemble methods where additional layers of classifiers are added in between.
However, such practices become computationally very expensive for a relatively small boost in
performance.
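A minimal sketch of one level of stacking with scikit-learn's StackingClassifier (synthetic data; the particular level-0 models and the logistic-regression meta-classifier are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a real dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Level-0 classifiers; their (cross-validated) predictions become the
# input features of the level-1 meta-classifier.
stack = StackingClassifier(
    estimators=[('tree', DecisionTreeClassifier(random_state=0)),
                ('knn', KNeighborsClassifier())],
    final_estimator=LogisticRegression())  # the meta-classifier
stack.fit(X, y)
print(stack.score(X, y))
```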

4. Mixture of Experts
The “Mixture of Experts” genre of ensemble trains several classifiers, the outputs of which
are ensembled using a generalized linear rule.
The weights assigned to these combinations are further determined by a “Gating Network,” also a
trainable model and usually a neural network.

The Mixture of Experts ensemble mechanism


Such an ensemble technique is usually used when different classifiers are trained on different
parts of the feature space. Following the previous example of the cat/dog/wolf classification
problem, suppose one classifier is trained only on cats/dogs data, and another is trained on
dogs/wolves data.
Such a method is also successful on the “Information Fusion” problem described before.
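The gating idea can be sketched numerically as follows (all numbers are made up for illustration; in practice the gate weights come from a trained gating network and vary per sample):

```python
import numpy as np

# Class-probability outputs of two "experts" for one 3-class sample
# (cat, dog, wolf). These numbers are illustrative only.
expert_outputs = np.array([[0.9, 0.1, 0.0],   # expert trained on cats/dogs
                           [0.0, 0.6, 0.4]])  # expert trained on dogs/wolves

# Weights a (hypothetical) gating network assigns to the two experts
# for this sample; a generalized linear combination fuses the outputs.
gate_weights = np.array([0.7, 0.3])

combined = gate_weights @ expert_outputs
print(combined, combined.argmax())  # class 0 ("cat") wins here
```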

5. Majority Voting
Majority voting is one of the earliest and easiest ensemble schemes in the literature. In this
method, an odd number of contributing classifiers are chosen, and for each sample, the
predictions from the classifiers are computed. Then, as the name suggests, the class that gets most
of the votes from the classifier pool is deemed the ensemble’s predicted class.
Such a method works well for binary classification problems, where there are only two
candidates for which the classifiers can vote. However, it fails for a problem with many classes,
since many cases arise where no class gets a clear majority of the votes.
In such cases, we usually choose a random class among the top candidates, which leads to
a more considerable margin of error. Thus, methods based on the confidence scores are more
reliable and are used more widely now.
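A minimal sketch of majority voting over hard predictions (the prediction matrix is made up for illustration):

```python
import numpy as np

# Hard class predictions from an odd number (3) of classifiers
# on 4 samples (classes are 0 or 1).
preds = np.array([[0, 1, 1, 0],   # classifier 1
                  [0, 1, 0, 0],   # classifier 2
                  [1, 1, 1, 0]])  # classifier 3

# For each sample (column), the most frequent class wins the vote.
def majority_vote(predictions):
    return np.array([np.bincount(col).argmax() for col in predictions.T])

print(majority_vote(preds))  # -> [0 1 1 0]
```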

6. Max Rule
The “Max Rule” ensemble method relies on the probability distributions generated by each
classifier. This method employs the concept of “confidence in prediction” of the classifiers and
thus is a superior method to Majority Voting for multi-class classification challenges.
Here, for a predicted class by a classifier, the corresponding confidence score is checked. The class
prediction of the classifier that predicts with the highest confidence score is deemed the
prediction of the ensemble framework.
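A numerical sketch of the max rule (the confidence scores are made up for illustration):

```python
import numpy as np

# Per-class confidence scores from two classifiers for one 3-class sample.
all_probs = np.array([[0.50, 0.30, 0.20],   # classifier 1
                      [0.10, 0.80, 0.10]])  # classifier 2

# Max rule: locate the single highest confidence anywhere, and take the
# class predicted by that most-confident classifier.
clf_idx, class_idx = np.unravel_index(all_probs.argmax(), all_probs.shape)
print(clf_idx, class_idx)  # classifier index 1 is most confident (0.80), class 1
```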

7. Probability Averaging
In this ensemble technique, the probability scores for multiple models are first computed. Then,
the scores are averaged over all the models for all the classes in the dataset.
Probability scores are the confidence in predictions by a particular model. So, here we are pooling
the confidences of several models to generate a final probability score for the ensemble. The
class that has the highest probability after the averaging operation is assigned as the predicted
class.
Such a method has been used in this paper for COVID-19 detection from lung CT-scan images.
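A numerical sketch of probability averaging (the probability scores are made up for illustration):

```python
import numpy as np

# Confidence (probability) scores of three models for one 3-class sample.
probs = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3],
                  [0.3, 0.4, 0.3]])

# Average the scores class-wise over all models; the class with the
# highest averaged probability becomes the ensemble prediction.
avg = probs.mean(axis=0)
print(avg.round(3), avg.argmax())
```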

8. Weighted Probability Averaging


In the weighted probability averaging technique, similar to the previous method, the probability or
confidence scores are extracted from the different contributing models.
But here, unlike the other case, we calculate a weighted average of the probability. The weights
in this approach refer to the importance of each classifier, i.e., a classifier whose overall
performance on the dataset is better than another classifier is given more importance while
computing the ensemble, which leads to a better predictive ability of the ensemble framework.
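A numerical sketch of weighted probability averaging, contrasting it with the plain average (the scores and weights are made up for illustration):

```python
import numpy as np

# Probability scores of two classifiers for one 2-class sample.
probs = np.array([[0.6, 0.4],    # classifier 1 (stronger on this dataset)
                  [0.3, 0.7]])   # classifier 2

weights = np.array([0.8, 0.2])   # illustrative importance weights

plain_avg = probs.mean(axis=0)                             # [0.45, 0.55]
weighted_avg = np.average(probs, axis=0, weights=weights)  # [0.54, 0.46]

# The weighted ensemble sides with the better classifier (class 0),
# while the plain average would have predicted class 1.
print(plain_avg.argmax(), weighted_avg.argmax())
```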

Advanced Ensemble Techniques


The ensemble methods described above have been around for decades. However, with the
advancements in research, much more powerful ensemble techniques have been developed for
different use cases.

For example, Fuzzy Ensembles are a class of ensemble techniques that use the concept of “dynamic
importance.”
The “weights” given to the classifiers are not fixed; they are modified based on the contributing
models’ confidence scores for every sample, rather than checking the performance on the entire
dataset. They perform much better than the popularly used weighted average probability
methods. The codes for the papers are also available here: Paper-1, Paper-2, and Paper-3.
Another genre of ensemble technique that has recently gained popularity is called “Snapshot
Ensembling.”
As we can see from the discussion throughout this article, ensemble learning comes at the expense
of training multiple models.

Especially in deep learning, it is a costly operation, even with transfer learning. So, the ensemble
learning method proposed in this paper trains only one deep learning model and saves the model
snapshots at different training epochs.
The ensemble of these models generates a final ensemble prediction framework on the test data.
They proposed some modifications to the usual deep learning model training regime to ensure
the diversity in the model snapshots. The model weights saved at these different epochs need to be
significantly different to make the ensemble successful.

Applications of Ensemble Learning


Ensemble learning is a fairly common strategy in deep learning and has been applied to tackle a
myriad of problems. It has helped tackle complex pattern recognition tasks that require computers
to learn high-level semantic information from digital images or videos, like object detection, where
bounding boxes need to be formed around the objects of interest and image classification.
Here are some of the real-life applications of Ensemble Learning.

1. Disease detection

Classification and localization of diseases for simplistic and fast prognosis have been aided by
Ensemble learning, like in cardiovascular disease detection from X-Ray and CT scans.

AI chest X-ray annotation analysis

2. Remote Sensing
Monitoring of physical characteristics of a target area without coming in physical contact, called
Remote Sensing, is a difficult task since the data acquired by different sensors have varying
resolutions leading to incoherence in data distribution.
Tasks like Landslide Detection and Scene Classification have also been accomplished with the help
of Ensemble Learning.

Construction land cover mapping with annotated vehicles

3. Fraud Detection
Detection of digital fraud is an important and challenging task since very minute precision is
required to automate the process. Ensemble Learning has proved its efficacy in
detecting Credit Card Fraud and Impression Fraud.

4. Speech emotion recognition


Ensemble Learning is also applied in speech emotion recognition, especially in the case of multi-
lingual environments. The technique allows for combining the effects of all classifiers instead of
choosing one classifier and compromising accuracy on certain language corpora.

Emotion recognition using V7 bounding box

Ensemble Learning: Key Takeaways


Ensemble Learning is a standard machine learning technique that involves taking the
opinions of multiple experts (classifiers) to make predictions.
The need for ensemble learning arises in several problematic situations that can be both
data-centric and algorithm-centric, like a scarcity/excess of data, the complexity of the
problem, constraints on computational resources, etc.
The several methods evolved over the decades have proven their utility in tackling many
such issues. Still, newer ensemble approaches are being developed by researchers that
address the caveats of the traditional ensembles.

Random Forest Algorithm

Random Forest is a well-known machine learning algorithm that


uses the supervised learning method. In machine learning, it can
be used for both classification and regression problems. It is
based on ensemble learning, which is a method of combining
multiple classifiers to solve a complex problem and improve the
model's performance.

Random Forest is a classifier that combines a number of decision


trees on different subsets of a dataset and averages the results to
improve the predictive accuracy. Instead of relying on a
single decision tree, the random forest takes the predictions from
each tree and predicts the final output based on the majority
votes of predictions.

The greater the number of trees in the forest, the more accurate
it is and the problem of overfitting is avoided.

Each tree in the random forest produces a class prediction, and

the class with the most votes becomes the prediction of our
model (see figure). Figure 2.22 shows the random forest
algorithm visualization; with six 1’s and three 0’s,
the prediction is 1:

Figure 2.22: Random Forest Algorithm Visualization



Working of the Random Forest Algorithm

The random forest is created in two phases: the first is to


combine N decision trees to create the random forest, and the
second is to make predictions for each tree created in the first
phase.

The following steps and Figure 2.23 can be used to explain the
working process:

Figure 2.23: Working of the Random Forest algorithm

Choose K data points at random from the training data set.

Create decision trees for the data points chosen (subsets).


Choose N for the number of decision trees you want to create.

Repeat Steps I and II.

Find the predictions of each decision tree for new data points and
assign the new data points to the category with the most votes.
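The steps above map directly onto scikit-learn's RandomForestClassifier; a minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a real dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# n_estimators is the number N of decision trees; each tree is trained
# on a random subset of the samples, and predictions are combined by
# majority vote.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)
print(forest.score(X, y))
print(forest.feature_importances_.round(2))  # per-feature importance
```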

Assume you have a dataset with a variety of fruit images. As a


result, the random forest classifier is given this dataset. Each
decision tree is given a subset of the dataset to work with.
During the training phase, each decision tree generates a
prediction result, and when a new data point appears, the random
forest classifier predicts the final decision based on the majority of
results. Figure 2.24 shows the random forest fruit instance
example:

Figure 2.24: Random Forest Fruit’s instance example


Source: [Link]

Advantages of Random Forest Algorithm

The following are some reasons why we should use the random
forest algorithm:

Less training time required as compared to other algorithms.

Output with high accuracy.

Effective for the large dataset.

Can be used for both classification and regression.

Prevents overfitting problems.

It can be used for identifying the most important features from


the training dataset i.e., feature engineering.

Disadvantages of Random Forest Algorithm

Although random forest can be used for both
classification and regression tasks, it is not well suited to
regression.

Applications of Random Forest Algorithm

The random forest algorithm concept is most commonly used in


four sectors as follows:

Banking: This algorithm is primarily used in the banking industry
to identify loan risk.

Medical: The disease trends and risks can be identified with the
help of this algorithm.

Land Development: This algorithm can identify areas with similar
land use.

Marketing: This algorithm can be used to spot marketing trends.



Regression

Regression is a mathematical method used in finance,


investment, and other disciplines that attempts to determine the
strength and nature of the relationship between a single
dependent/output variable and one or more other
independent/input variables.

Regression algorithms are used if there is a relationship between


one or more input variables and output variables. It is used when
the value of the output variable is continuous or real, such as
house price, weather forecasting, stock price prediction, and so
on.

The following Table 2.1 shows the dataset, which serves

the purpose of predicting the house price based on
different parameters:

Table 2.1: Dataset of Regression

Here the input variables are Size of house, No. of Bedrooms, and No. of
Bathrooms, and the output variable is the Price, which is a
continuous value. Therefore, this is a regression problem.

The goal here is to predict a value as much closer to the actual


output value and then evaluation is done by calculating the
error value. The smaller the error the greater the accuracy of
the regression model.

Regression is used in many real-life applications, such as financial


forecasting (house price prediction or stock price prediction),
weather forecasting, time series forecasting, and so on.

In regression, we plot a graph between the variables which best

fits the given data points; using this plot, the machine learning
model can make predictions about the data. In simple words,
regression fits a line or curve on the target-predictor
graph in order to minimize the vertical gap between the
data points and the regression line. The distance between the
data points and the line tells us whether a model has captured a
strong relationship or not.
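A minimal sketch of fitting such a line with scikit-learn's LinearRegression (the size/price pairs below are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical (size in sq. ft., price) pairs lying roughly on a line.
X = np.array([[1000], [1500], [2000], [2500], [3000]])
y = np.array([200000, 290000, 410000, 500000, 610000])

# Fit the line that minimizes the squared vertical gaps between the
# data points and the line, then predict the price of an unseen house.
model = LinearRegression().fit(X, y)
prediction = model.predict(np.array([[1800]]))
print(prediction[0])
```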
Terminologies used in Regression

Following are the terminologies used in regression:

Dependent variable: The dependent variable is also known as the target, output
variable, or response variable. The dependent variable in regression
analysis is the variable which we want to predict or understand. It
is denoted by ‘Y’.

Price of a house is the dependent variable (refer to Table 2.1).

Independent variable: The independent variable is also known as the input
variable or predictor. Independent variables affect the dependent
variable, or are used to predict the values of the dependent
variable. It is denoted by ‘X’.

Size of house, number of bedrooms, and number of bathrooms
are independent variables (refer to Table 2.1).

Outliers are observed data points that are far from the least
squares line or that differ significantly from other data or
observations. In other words, an outlier is an extreme
value that differs greatly from the other values in a set of values.

In Figure 2.3, there are a bunch of apples, but one apple is different.

This apple is what we call an outlier.
Figure 2.3: Outlier

Outliers need to be examined closely. Sometimes, for some


reason or another, they should not be included in data analysis. It
is possible that an outlier is a result of erroneous data. Other
times, an outlier may hold valuable information about the
population under study and should remain included in the data.
The key is
to examine carefully what causes a data point to be an outlier.
Figure 2.4: Outlier in House Price Dataset

Consider the following example. Suppose, as shown in Figure 2.4, we

sample the number of bedrooms in a house and note the price
of each house. We can see from the dataset that ten houses
range between 200000 and 875000, but one house is priced at
3000000; that house will be considered an outlier.
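One common way to flag such a point automatically is the 1.5 × IQR rule; a sketch using prices made up to match the example:

```python
import numpy as np

# Ten houses between 200000 and 875000, plus one extreme price.
prices = np.array([200000, 250000, 310000, 400000, 450000,
                   520000, 600000, 700000, 800000, 875000, 3000000])

# 1.5 * IQR rule: anything beyond Q3 + 1.5*IQR (or below Q1 - 1.5*IQR)
# is flagged as an outlier.
q1, q3 = np.percentile(prices, [25, 75])
iqr = q3 - q1
outliers = prices[(prices > q3 + 1.5 * iqr) | (prices < q1 - 1.5 * iqr)]
print(outliers)  # -> [3000000]
```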

Multicollinearity in regression analysis occurs when two or more


independent variables are so closely related to each other that they do not
provide unique or independent information to the regression model. If
the degree of correlation is high enough between variables, it can
cause problems when fitting and interpreting the regression
model.

For example, suppose we do a regression analysis using variable


height, shoe size, and hours spent in practice in a day to predict
high jumps for basketball players. In this case, the height and
size of the shoes may be closely related to each other because
taller people tend to have larger shoe sizes. This means that
multicollinearity may be a problem in this regression.
Take another example. Suppose we have two inputs, X1 and X2:

X1 = [0, 3, 4, 9, 6, 2]

X2 = [0, −1.5, −2, −4.5, −3, −1]

X1 = −2 * X2

So X1 and X2 are collinear. Here it is better to use only one variable, either X1 or X2, as the input.
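One quick way to detect such collinearity is to compute the correlation coefficient between the two inputs; a value near +1 or −1 signals that one variable adds almost no information beyond the other. A minimal sketch (the `pearson_corr` helper is a hypothetical name, not from the text):

```python
def pearson_corr(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

X1 = [0, 3, 4, 9, 6, 2]
X2 = [0, -1.5, -2, -4.5, -3, -1]
print(pearson_corr(X1, X2))  # ≈ -1.0: perfectly negatively correlated
```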

Underfitting and Overfitting: Overfitting and underfitting are two primary issues that occur in machine learning and decrease the performance of the machine learning model.

The main goal of each machine learning model is to produce a suitable output by adapting to a given set of unknown inputs; this is known as generalization. It means that after providing training on the dataset, the model can produce reliable and accurate output. Hence, underfitting and overfitting are the two terms that need to be checked for the reliability of the machine learning model.

Let us understand the basic terminology for overfitting and underfitting:

Signal: It is the true underlying pattern of the data.

Noise: Unnecessary and insignificant data that reduces the performance.

Bias: It is the difference between the expected and real values.

Variance: When the model performs well on the training dataset but not on the test dataset, variance exists.

Figure 2.5: Underfitting and Overfitting

On the left side of Figure 2.5, one can easily see that the line does not cover all the points shown in the graph. Such a model tends to cause a phenomenon known as underfitting of data. In the case of underfitting, the model cannot learn enough from the training data, and this reduces precision and accuracy. There is high bias and low variance in the underfitted model.

Contrarily, when we consider the right side of the graph in Figure 2.5, it shows that the predicted line covers all the points in the graph. In such a situation, we might think this is a good graph that covers all the points, but that is not true. The predicted line of the given graph covers all the points, including those which are noise and outliers. Such a model tends to cause a phenomenon known as overfitting of data. This model is responsible for predicting poor results due to its high complexity. The overfitted model has low bias and high variance, so this model is also known as the High Variance model.

Now, consider the middle graph in Figure 2.5: it shows a well-predicted line. It covers a major portion of the points in the graph while also maintaining the balance between bias and variance. Such a model tends to cause a phenomenon known as an appropriate fitting of the data.
Types of Linear Regression

As shown in Figure 2.6, linear regression is classified into two categories based on the number of independent variables:

Figure 2.6: Types of Linear Regression


Simple Linear Regression

Simple Linear Regression is a type of linear regression where we


have only one independent variable to predict the dependent
variable. The dependent variable must be a continuous/real
value.

The relationship between the independent variable (X) and the dependent variable (Y) is shown by a linear or sloped straight line, as shown in Figure 2.7; hence it is called Simple Linear Regression:

Figure 2.7: Simple Linear Regression

The Simple Linear Regression model can be represented using the following equation:

Y = B0 + B1X

Where,

Y: Dependent variable

X: Independent variable

B0: the Y-intercept of the regression line, where the best-fitted line intercepts the Y-axis.

B1: the slope of the regression line, which tells whether the line is increasing or decreasing.

In this graph, the dots are our data, and based on this data we will train our model to predict results. The black line is the best-fitted line for the given data. The best-fitted line is a straight line that best represents the data on a scatter plot. The best-fitted line may pass through all of the points, some of the points, or none of the points in the graph.

The goals of simple linear regression are as follows:

To find out if there is a correlation between dependent


& independent variables.

To find the best-fit line for the dataset. The best-fit line is the one for which the total prediction error (over all data points) is as small as possible.

To see how the dependent variable changes as the independent variable changes.

Suppose we have a dataset that contains information about the relationship between Size of House and Price, as shown in Table 2.2.

Table 2.2: House Dataset

Here, Size of House is the independent variable (X) and Price is the dependent variable (Y). Our aim is to find the values of B0 and B1 such that they produce the best-fitted regression line. This linear equation is then used for new data.
then used for new data.

The House dataset is used to train our linear regression model. That is, if we give Size of House as an input, our model should predict Price with minimum error.

Here Y' is the predicted value of Y.

The values B0 and B1 must be chosen so that they minimize the error. If the sum of squared error is taken as a metric to evaluate the model, then the goal of obtaining a line that best reduces the error is achieved.

If we do not square the error, then positive and negative points


will cancel out each other.
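The closed-form least-squares estimates of B0 and B1 can be sketched directly. The house sizes and prices below are made-up illustrative numbers, not the values from Table 2.2:

```python
def fit_simple_linear_regression(x, y):
    """Estimate B0 (intercept) and B1 (slope) by minimizing squared error."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Slope: covariance of x and y divided by the variance of x
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
         / sum((xi - mx) ** 2 for xi in x)
    # Intercept: the line must pass through the point of means
    b0 = my - b1 * mx
    return b0, b1

# Hypothetical house data: size in square feet vs. price
size = [1000, 1500, 2000, 2500, 3000]
price = [200000, 300000, 400000, 500000, 600000]
b0, b1 = fit_simple_linear_regression(size, price)
print(b0, b1)          # 0.0 200.0 for this perfectly linear data
print(b0 + b1 * 1800)  # 360000.0: predicted price for an 1800 sq ft house
```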
Multiple Linear Regression

If there is more than one independent variable for the


prediction of a dependent variable, then this type of linear
regression is known as multiple linear regression.
Classification

Classification is the process of grouping the output into different classes based on one or more input variables. Classification is used when the value of the output variable is discrete or categorical, such as an email being spam or not spam, yes or no, true or false, 0 or 1, and so on.

If the algorithm tries to classify input variables into two different classes, it is called binary classification, such as an email being spam or not spam. When the algorithm tries to classify input variables into more than two classes, it is called multiclass classification, such as handwritten character recognition, where classes go from 0 to 9. Figure 2.8 shows a classification example where an email is classified as spam or not spam:
Figure 2.8: Example of Classification
Naïve Bayes classifier algorithm

Naive Bayes classifiers are a group of classification algorithms based on Bayes' Theorem. It is a family of algorithms that share a common principle, namely that every pair of features being classified is independent of each other.

The Naïve Bayes algorithm is a supervised learning algorithm that


solves classification problems and is based on the Bayes theorem.

It is primarily used in text classification with a large


training dataset.

The Naïve Bayes Classifier is a simple and effective classification algorithm that aids in the development of fast machine learning models capable of making quick predictions.

It is a probabilistic classifier, which means it makes predictions based on the probability of an object.

Spam filtration, Sentimental analysis, and article classification


are some popular applications of the Naïve Bayes Algorithm.

For example, if a fruit is red, round, and about 3 inches in diameter, it is classified as an apple. Even if these features depend on each other or on the presence of other features, all of these properties independently contribute to the likelihood that this fruit is an apple, which is why it is called 'Naïve'.

The Naive Bayes model is simple to construct and is especially


useful for very large data sets. In addition to its simplicity, Naive
Bayes has been shown to outperform even the most sophisticated
classification methods.
Why is it called Naïve Bayes?

The Naive Bayes algorithm is made up of the words Naive and


Bayes, which can be interpreted as:

Naïve: It is called Naïve because it assumes that the occurrence of one feature is unrelated to the occurrence of others. For example, if a fruit is identified based on color, shape, and taste, then a red, spherical, and sweet fruit is identified as an apple. As a result, each feature contributes to identifying it as an apple without relying on the others.

Bayes: It is known as Bayes because it is based on the principle of Bayes' Theorem.
Principle of Naive Bayes Classifier

A Naive Bayes classifier is a type of probabilistic machine learning


model that is used to perform classification tasks. The classifier's
crux is based on the Bayes theorem.

The Bayes theorem allows us to calculate the posterior probability P(c|x) from P(c), P(x), and P(x|c). Consider the following equation:

P(c|x) = (P(x|c) * P(c)) / P(x)

Where,

P(c|x): the posterior probability of class c (target) given predictor x (attributes)

P(c): the prior probability of the class

P(x|c): the likelihood, which is the probability of the predictor given the class

P(x): the prior probability of the predictor


Working of Naïve Bayes' Classifier

The following steps demonstrate how the Naïve Bayes' Classifier works:

Step 1: Convert the given dataset into frequency tables.

Step 2: Create a likelihood table by calculating the probabilities of the given features.

Step 3: Use Bayes theorem to calculate the posterior probability.

Naive Bayes Example

The weather training data set and corresponding target variable are shown in Figure 2.9. We must now categorize whether or not players will play based on the weather.

Step 1: Make a frequency table out of the data set.

Step 2: Make a likelihood table by calculating probabilities such as the overcast probability and the probability of playing.

Figure 2.9: Weather training data set and corresponding frequency and likelihood table

Step 3: Calculate the posterior probability for each class using the Naive Bayesian equation. The outcome of prediction is the class with the highest posterior probability.
Naive Bayes employs a similar method to predict the likelihood of
various classes based on various attributes. This algorithm is
commonly used in text classification and multi-class problems.
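These steps can be carried out directly from raw counts. In the sketch below, the (outlook, play) observations are hypothetical and merely shaped like the weather table; they are not the exact values from Figure 2.9:

```python
# Hypothetical weather observations (outlook, play?) — illustrative counts only
data = [
    ("Sunny", "No"), ("Sunny", "No"), ("Overcast", "Yes"), ("Rainy", "Yes"),
    ("Rainy", "Yes"), ("Rainy", "No"), ("Overcast", "Yes"), ("Sunny", "No"),
    ("Sunny", "Yes"), ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
    ("Overcast", "Yes"), ("Rainy", "No"),
]

def posterior(outlook, play):
    """P(play | outlook) via Bayes: P(outlook | play) * P(play) / P(outlook)."""
    n = len(data)
    p_play = sum(1 for _, p in data if p == play) / n        # prior
    p_outlook = sum(1 for o, _ in data if o == outlook) / n  # evidence
    p_o_given_p = (sum(1 for o, p in data if o == outlook and p == play)
                   / sum(1 for _, p in data if p == play))   # likelihood
    return p_o_given_p * p_play / p_outlook

print(posterior("Sunny", "Yes"))  # ≈ 0.4
print(posterior("Sunny", "No"))   # ≈ 0.6
```

Since P(No | Sunny) is the larger posterior here, the predicted class on a sunny day would be "No".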
Types of Naïve Bayes

The Naive Bayes Model is classified into three types, which are
listed as follows:

Gaussian Naïve Bayes: The Gaussian model is based on the assumption that the features have a normal distribution. This means that if predictors take continuous values rather than discrete values, the model assumes these values are drawn from the Gaussian distribution.

Multinomial Naïve Bayes: When the data is multinomially distributed, the Multinomial Naive Bayes classifier is used. It is primarily used to solve document classification problems, indicating which category a particular document belongs to, such as sports, politics, education, and so on. The predictors of the classifier are based on the frequency of words.

Bernoulli Naïve Bayes: The Bernoulli classifier operates similarly to the multinomial classifier, except that the predictor variables are independent Boolean variables, for example, whether or not a specific word appears in a document. This model is also well-known for performing document classification tasks.
Advantages and disadvantages of Naïve Bayes

Following are the advantages:

Easy and fast algorithm to predict a class of datasets.

Used for binary as well as multi-class classifications.

When compared to numerical variables, it performs well


with categorical input variables.

It is one of the most widely used algorithms for text classification problems.

Following are the disadvantages:

Naive Bayes is based on the assumption that all predictors (or features) are independent, which rarely holds in real data.

It is confronted by the zero-frequency problem: if a categorical value appears in the test data but was never observed during training, the model assigns it zero probability.

Applications of Naïve Bayes Algorithms

Following are the applications of Naïve Bayes Algorithms:

Text classification

Spam Filtering

Real-time Prediction

Multi-class Prediction

Recommendation Systems

Credit Scoring

Sentiment Analysis
Decision Tree

A decision tree is a supervised learning technique that can be


applied to classification and regression problems.

Decision trees are designed to mimic human decision-making


abilities, making them simple to understand.

Because the decision tree has a tree-like structure, the


logic behind it is easily understood.
Decision-tree working

Following steps are involved in the working of Decision-tree:

Begin the tree with the root node T, which contains the entire dataset.

Using the Attribute Selection Measure, find the best attribute in the dataset.

Divide T into subsets that contain the best possible values for the attribute.

Create the decision tree node with the best attribute.

Make new decision trees recursively using the subsets of the dataset created in step 3.

Continue this process until you reach a point where you can no longer classify the nodes; refer to the final node as a leaf node.

General structure: The general decision tree structure is shown in Figure 2.10:
Figure 2.10: General Decision Tree structure

A decision tree can contain both categorical (Yes/No)


and numerical data.
Example of decision-tree

Assume we want to play badminton on a specific day, say Saturday; how will you decide whether or not to play? You go outside to check whether it is hot or cold, the speed of the wind, the humidity, and the weather, i.e., whether it is sunny, cloudy, or rainy. You consider all of these factors when deciding whether or not to play. Table 2.3 shows the weather observations of the last ten days.

days.

days.

days.

days.

days.

days.

days.

days.

days.

days.

days.

Table 2.3: Weather Observations of the last ten


days
A decision tree is a great way to represent the data because it
follows a tree-like structure and considers all possible paths
that can lead to the final decision.

Figure 2.11: Play Badminton decision tree

Figure 2.11 depicts a learned decision tree. Each node represents


an attribute or feature, and the branch from each node
represents the node's outcome. Finally, the final decision is made
by the tree's leaves.
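The Attribute Selection Measure mentioned in the steps above is often entropy-based information gain: the attribute whose split most reduces class entropy becomes the next node. A small sketch on made-up play-badminton rows (not the ten days in Table 2.3):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr_index, labels):
    """Entropy reduction from splitting `rows` on the attribute at attr_index."""
    total = entropy(labels)
    n = len(rows)
    by_value = {}
    for row, lab in zip(rows, labels):
        by_value.setdefault(row[attr_index], []).append(lab)
    # Weighted average entropy of the subsets after the split
    remainder = sum(len(subset) / n * entropy(subset)
                    for subset in by_value.values())
    return total - remainder

# Hypothetical rows: (weather, windy) -> play?
rows = [("Sunny", "No"), ("Sunny", "Yes"), ("Rainy", "No"), ("Rainy", "Yes")]
play = ["Yes", "Yes", "No", "No"]
print(information_gain(rows, 0, play))  # 1.0: weather alone decides the outcome
print(information_gain(rows, 1, play))  # 0.0: windy tells us nothing here
```

A tree builder would pick weather as the root here and recurse on each branch, exactly as the steps describe.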
Advantages of decision tree

Following are the advantages of decision tree:

Simple and easy to understand.

Popular technique for resolving decision-making problems.

It aids in considering all possible solutions to a problem.

Less data cleaning is required.


Disadvantages of the decision tree

Following are the disadvantages of decision tree:

Overfitting problem

Complexity
K-Nearest Neighbors (K-NN) algorithm

The K-Nearest Neighbors algorithm is a clear and simple


supervised machine learning algorithm that can be used to solve
regression and classification problems.

The K-NN algorithm assumes that the new case and existing
cases are similar and places the new case in the category that is
most similar to the existing categories. The K-NN algorithm stores
all available data and classifies a new data point based on its
similarity to the existing data. This means that when new data
appears, the KNN algorithm can quickly classify it into a suitable
category.

K-NN is a non-parametric algorithm, which means it makes no


assumptions about the data it uses. It's also known as a lazy
learner algorithm because it doesn't learn from the
training set right away; instead, it stores the dataset and
performs an action on it when it comes time to classify it.

Pattern recognition, data mining, and intrusion detection are some of its demanding applications.
Need of the K-NN Algorithm

Assume there are two categories, Category A and Category B, and we have a new data point x1. Which of these categories will this data point fall into? A K-NN algorithm is required to solve this type of problem. We can easily identify the category or class of a dataset with the help of K-NN, as shown in Figure 2.12.

Figure 2.12: Category or class of a dataset with the help of K-NN

The following algorithm can be used to explain how K-NN works:

Select the number K of neighbors.

Calculate the Euclidean distance between the new data point and the existing data points.

Take the K nearest neighbors based on the calculated Euclidean distance.

Among these K neighbors, count the number of data points in each category.

Assign the new data point to the category with the greatest number of neighbors.

Our model is complete.

Let's say we have a new data point that needs to be


placed in the appropriate category. Consider the following
illustration:

Figure 2.13: K-NN example

First, we will decide on the number of neighbors, so we will go with K = 5.

The Euclidean distance between the data points will then be calculated, as shown in Figure 2.14. The Euclidean distance is the distance between two points that we learned about in geometry class. It can be calculated using the following formula:

d = √((x2 − x1)² + (y2 − y1)²)

Figure 2.14: The Euclidean distance between the data points

We found the closest neighbors by calculating the Euclidean distance, which yielded three closest neighbors in Category A and two closest neighbors in Category B, as shown in Figure 2.15. Consider the following illustration:
Figure 2.15: Closest neighbors for the Category A and B

As can be seen, the majority of these nearest neighbors belong to Category A, so this new data point must also belong to Category A.
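The steps above translate almost line-for-line into code. The points and labels below are invented for illustration:

```python
from collections import Counter

def euclidean(p, q):
    """Straight-line distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def knn_classify(train, new_point, k=5):
    """train is a list of (point, label) pairs. Return the majority label
    among the k nearest neighbours of new_point."""
    nearest = sorted(train, key=lambda item: euclidean(item[0], new_point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D points for two categories
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 2), "A"),
         ((6, 6), "B"), ((7, 7), "B"), ((8, 6), "B")]
print(knn_classify(train, (2, 1), k=5))  # A: three of the five nearest are "A"
```

Note that the whole "training" phase is just storing the data, which is why K-NN is called a lazy learner.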
Logistic Regression

The classification algorithm logistic regression is used to assign observations to a discrete set of classes. Unlike linear regression, which produces continuous values, logistic regression produces a probability value that can be mapped to two or more discrete classes using the logistic sigmoid function.

A regression model with a categorical target variable is known as


logistic regression. To model binary dependent variables, it
employs a logistic function.

The target variable in logistic regression has two possible values, such as yes/no. Consider how the target variable y would be represented if the value of "yes" is 1 and "no" is 0. The log-odds of y being 1 is a linear combination of one or more predictor variables, according to the logistic model. So, let's say we have two predictors or independent variables, x1 and x2, and p is the probability of y equaling 1. Then, using the logistic model as a guide:

log(p / (1 − p)) = b0 + b1x1 + b2x2

We can recover the odds by exponentiating the equation:

p / (1 − p) = e^(b0 + b1x1 + b2x2)

As a result, p is the probability that y is 1. If p is closer to 0, y equals 0, and if p is closer to 1, y equals 1. As a result, the logistic regression equation is:

p = 1 / (1 + e^−(b0 + b1x1 + b2x2))

This equation can be generalized to n parameters and independent variables as follows:

p = 1 / (1 + e^−(b0 + b1x1 + ... + bnxn))
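The sigmoid mapping at the heart of logistic regression can be sketched directly. The coefficients b0, b1, b2 below are arbitrary illustrative values, not fitted ones:

```python
import math

def sigmoid(z):
    """Logistic function: squashes any real number into (0, 1)."""
    return 1 / (1 + math.exp(-z))

def predict_probability(x1, x2, b0, b1, b2):
    """p(y = 1 | x1, x2) for a two-predictor logistic model."""
    return sigmoid(b0 + b1 * x1 + b2 * x2)

# Hypothetical coefficients purely for illustration
p = predict_probability(x1=2.0, x2=1.0, b0=-4.0, b1=1.5, b2=0.5)
print(p)                     # ≈ 0.378
print(1 if p >= 0.5 else 0)  # 0: predicted class with a 0.5 threshold
```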
Comparison between linear and logistic regression

Different things can be predicted using linear regression and logistic regression, as shown in Figure 2.16:

Linear regression: The predictions of linear regression are continuous (numbers in a range). It may be able to assist us in predicting a student's test score on a scale of 0 to 100.

Logistic regression: The predictions of logistic regression are discrete (only specific values or categories are allowed). It might be able to tell us whether the student passed or failed. The probability scores that underpin the model's classifications can also be viewed.

Figure 2.16: Linear regression vs Logistic regression
While linear regression has an infinite number of possible
outcomes, logistic regression has a set of predetermined
outcomes.

When the response variable is continuous, linear regression is


used, but when the response variable is categorical, logistic
regression is used.

A continuous output, such as a stock market score, is an example of linear regression, and predicting a defaulter in a bank using transaction details from the past is an example of logistic regression.

The following Table 2.4 shows the difference between Linear and Logistic Regression:

Table 2.4: Difference between Linear and Logistic Regression


Types of Logistic Regression

Logistic regression is mainly categorized into three types, as shown in Figure 2.17:

Binary Logistic Regression

Multinomial Logistic Regression

Ordinal Logistic Regression

Figure 2.17: Types of Logistic Regression


Binary Logistic Regression

The target variable or dependent variable in binary logistic


regression is binary in nature, meaning it has only two possible
values.

There are only two possible outcomes for the categorical response.

For example, determining whether or not a message is spam.


Multinomial Logistic Regression

In a multinomial logistic regression, the target variable can have


three or more values, but there is no fixed order of preference
for these values.

For instance, the most popular type of food (Indian,


Italian, Chinese, and so on.)

Predicting which food is preferred more is an example (Veg, Non-


Veg, Vegan).
Ordinal Logistic Regression

The target variable in ordinal logistic regression has three or


more possible values, each of which has a preference or order.

For instance, restaurant star ratings range from 1 to 5, and movie


ratings range from 1 to 5.
Examples

The following are some of the scenarios in which logistic


regression can be used.

Weather Prediction: Logistic regression is used to make weather predictions. We use the information from previous weather reports to forecast the outcome for a specific day. However, logistic regression can only predict categorical data, such as whether it will rain or not.

Determining Illness: We can use logistic regression with the


help of the patient's medical history to predict whether
the illness is positive or negative.
Support Vector Machine (SVM) Algorithm

The Support Vector Machine, or SVM, is a popular Supervised Learning algorithm that can be used to solve both classification and regression problems. However, it is primarily used in machine learning for classification problems, as shown in Figure 2.18.

Many people prefer SVM because it produces significant accuracy while using less computing power. The extreme points/vectors that help create the hyperplane are chosen by SVM. These extreme cases are called support vectors; hence, the algorithm is called a Support Vector Machine.

Figure 2.18: Support Vector Machine (SVM) concept

The support vector machine algorithm's goal is to find a hyperplane in N-dimensional space (N is the number of features) that clearly categorizes the data points.

Figure 2.19: SVM hyperplanes with maximum margin

There are numerous hyperplanes that could separate the two classes of data points. Our goal is to find the plane with the greatest margin, that is, the greatest distance between data points from both classes, as shown in Figure 2.19. Maximizing the margin distance provides some reinforcement, making it easier to classify future data points.

Hyperplanes are decision boundaries that aid in data


classification. Different classes can be assigned to data points on
either side of the hyperplane. The hyperplane's dimension is also
determined by the number of features. If there are only two
input features, the
hyperplane is just a line. The hyperplane becomes a two-
dimensional plane when the number of input features reaches
three. When the number of features exceeds three, it
becomes difficult to imagine.

Figure 2.20: Support Vectors with small and large margin

Support vectors are data points that are closer to the hyperplane and influence the hyperplane's position and orientation, as shown in Figure 2.20. We maximize the classifier's margin by using these support vectors. The hyperplane's position will be altered if the support vectors are deleted. These are the points that will assist us in constructing the SVM.

The sigmoid function is used in logistic regression to squash the


output of the linear function within the range of [0,1]. If the
squashed value exceeds a threshold value (0.5), it is labeled as 1,
otherwise, it is labeled as 0. In SVM, we take the output of a
linear function and, if it is greater than 1, we assign it to one
class, and if it is less than 1, we assign it to another. We get this
reinforcement range of values ([-1,1]) which acts as a margin
because the threshold values in SVM are changed to 1 and -1.
Hyperplane, Support Vectors, and Margin

The Hyperplane, Support Vectors, and Margin are described


as follows:

Hyperplane: In n-dimensional space, there can be multiple lines/decision boundaries to separate the classes, but we need to find the best decision boundary to help classify the data points. This best boundary is referred to as the hyperplane of SVM. The hyperplane's dimensions are determined by the number of features in the dataset; for example, if there are two features, the hyperplane will be a straight line, and if there are three features, the hyperplane will be a two-dimensional plane. We always create a hyperplane with a maximum margin, which refers to the distance between the hyperplane and the nearest data points.

Support Vectors: Support vectors are the data points or


vectors that are closest to the hyperplane and have an effect
on the hyperplane's position. These vectors are called support
vectors because they support the hyperplane.

Margin: It is the distance between two lines drawn on the closest data points of different classes. It can be calculated as the perpendicular distance between the line and the support vectors. A large margin is regarded as a good margin, while a small margin is regarded as a bad margin.
Working of SVM

In multidimensional space, an SVM model is essentially a


representation of different classes in a hyperplane. SVM will
generate the hyperplane in an iterative manner in order to
reduce the error. SVM's goal is to divide datasets into classes so
that a maximum marginal hyperplane can be found.

Figure 2.21: Working of SVM

SVM's main goal is to divide datasets into classes in order to find


a maximum marginal hyperplane which can be accomplished in
two steps:

First, SVM will iteratively generate hyperplanes that best


separate the classes.

The hyperplane that correctly separates the classes will then


be chosen.
Types of SVM

Support Vector Machine (SVM) types are described below:

Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes using only a single straight line, it is called linearly separable data, and the classifier used is called the Linear SVM classifier.

Non-linear SVM: Non-Linear SVM is used for non-linearly separated data, which means that if a dataset cannot be classified using a straight line, it is called non-linear data, and the classifier used is the Non-Linear SVM classifier.
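A linear SVM can be approximated with sub-gradient descent on the hinge loss (the idea behind the Pegasos solver). The sketch below is a minimal, illustrative trainer on made-up separable points, not a production SVM solver:

```python
def train_linear_svm(points, labels, lr=0.01, lam=0.01, epochs=500):
    """Tiny linear SVM trained by sub-gradient descent on the hinge loss.
    labels must be +1 or -1; points are 2-D tuples."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(points, labels):
            # Margin < 1 means the point is inside the margin or misclassified
            if y * (w[0] * x1 + w[1] * x2 + b) < 1:
                w[0] += lr * (y * x1 - lam * w[0])
                w[1] += lr * (y * x2 - lam * w[1])
                b += lr * y
            else:
                # Correct with room to spare: only apply the regularizer
                w[0] -= lr * lam * w[0]
                w[1] -= lr * lam * w[1]
    return w, b

def classify(w, b, point):
    """The sign of the decision function picks the side of the hyperplane."""
    return 1 if w[0] * point[0] + w[1] * point[1] + b >= 0 else -1

# Two linearly separable, made-up clusters
pts = [(1, 1), (2, 1), (1, 2), (6, 6), (7, 6), (6, 7)]
ys = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(pts, ys)
print(classify(w, b, (2, 2)))  # -1: falls on the negative-class side
print(classify(w, b, (6, 5)))  # 1: falls on the positive-class side
```

Only the points near the boundary keep triggering updates, which mirrors the idea that support vectors alone determine the hyperplane.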
Applications of Support-Vector Machines

The following are few of the applications of Support-


Vector Machines:

Facial expressions classifications

Pattern classification and regression problems

In the military datasets

Speech recognition

Predicting the structure of proteins

In image processing - handwriting recognition

In earthquake potential damage detections


Advantages of SVM

The following are the advantages of SVM:

SVM classifiers are highly accurate and work well in high-


dimensional environments. Because SVM classifiers only use a
subset of training points, they require very little memory.

Solve the data points that are not linearly separable.

Effective in a higher dimension space.

It works well with a clear margin of separation.

It is effective in cases where the number of dimensions is greater


than the number of samples.

Better utilization of memory space.


Disadvantages of SVM

The following are the disadvantages of SVM:

Not suitable for the larger data sets.

Less effective when the data set has more noise.

It doesn’t directly provide probability estimates.

Overfitting problem
UNIT-IV

K means Clustering – Introduction


We are given a data set of items with certain features, and values for these features (like a vector). The task is to categorize those items into groups. To achieve this, we will use the k-means algorithm, an unsupervised learning algorithm. 'K' in the name of the algorithm represents the number of groups/clusters we want to classify our items into.
Overview
(It will help if you think of items as points in an n-dimensional space.) The algorithm will categorize the items into k groups or clusters of similarity. To calculate that similarity, we will use the Euclidean distance as the measurement.
K Means Clustering Algorithm:
K Means is a clustering algorithm. Clustering algorithms are unsupervised algorithms
which means that there is no labelled data available. It is used to identify different
classes or clusters in the given data based on how similar the data is. Data points in
the same group are more similar to other data points in that same group than those in
other groups.
K-means clustering is one of the most commonly used clustering algorithms.
Here, k represents the number of clusters.
Let’s see how K-means clustering works –
1. Choose the number of clusters you want to find which is k.
2. Randomly assign the data points to any of the k clusters.
3. Then calculate the center of the clusters.
4. Calculate the distance of the data points from the centers of each of the
clusters.
5. Depending on the distance of each data point from the cluster, reassign the
data points to the nearest clusters.
6. Again calculate the new cluster center.
7. Repeat steps 4, 5, and 6 until the data points no longer change clusters, or until we reach the assigned number of iterations.
The “points” mentioned above are called means because they are the mean values of
the items categorized in them. To initialize these means, we have a lot of options. An
intuitive method is to initialize the means at random items in the data set. Another
method is to initialize the means at random values between the boundaries of the data
set (if for a feature x the items have values in [0,3], we will initialize the means with
values for x at [0,3]).
To actually find the means, we will loop through all the items, classify each to its nearest cluster, and update the cluster’s mean. We will repeat the process for a fixed number of iterations. If between two iterations no item changes classification, we stop the process, as the algorithm has converged.
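The loop just described — classify each item to its nearest mean, update the means, stop when nothing changes — can be sketched in plain Python. The six 2-D points below are invented for illustration:

```python
import random

def euclidean_sq(p, q):
    """Squared Euclidean distance (the square root is not needed for comparisons)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=100, seed=0):
    """Plain k-means: initialize means at random items, then alternate
    assignment and update until no point changes cluster."""
    rng = random.Random(seed)
    means = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(iterations):
        # Classify each point to its nearest mean
        new_assignment = [min(range(k), key=lambda j: euclidean_sq(p, means[j]))
                          for p in points]
        if new_assignment == assignment:
            break  # no item changed classification: stop
        assignment = new_assignment
        # Update each cluster's mean to the centre of its members
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                means[j] = [sum(c) / len(members) for c in zip(*members)]
    return means, assignment

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
means, labels = kmeans(pts, k=2)
print(sorted(tuple(round(c, 2) for c in m) for m in means))
# [(1.33, 1.33), (8.33, 8.33)]: the two cluster centres
```

This sketch uses the first initialization strategy mentioned above (means placed at random items of the data set).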


Image Segmentation using K Means Clustering


Image Segmentation: In computer vision, image segmentation is the process of
partitioning an image into multiple segments. The goal of segmenting an image is to
change the representation of an image into something that is more meaningful and
easier to analyze. It is usually used for locating objects and creating boundaries.
It is not a great idea to process an entire image because many parts in an image may
not contain any useful information. Therefore, by segmenting the image, we can make
use of only the important segments for processing.
An image is basically a set of given pixels. In image segmentation, pixels which have
similar attributes are grouped together. Image segmentation creates a pixel-wise mask
for objects in an image which gives us a more comprehensive and granular
understanding of the object.
1. Used in self-driving cars. Autonomous driving is not possible without object detection
which involves segmentation.

2. Used in the healthcare industry. Helpful in segmenting cancer cells and tumours using
which their severity can be gauged.

There are many more uses of image segmentation.

In this article, we will perform segmentation on an image of the monarch butterfly using a
clustering method called K Means Clustering.

1. Read in the image and convert it to an RGB image.


2. Now we have to prepare the data for K means. The image is a 3-dimensional shape but
to apply k-means clustering on it we need to reshape it to a 2-dimensional array.
3. Taking k = 3, which means that the algorithm will identify 3 clusters in the image.


import numpy as np
import matplotlib.pyplot as plt
import cv2

%matplotlib inline

# Read in the image (path to the butterfly image; adjust as needed)
image = cv2.imread('images/monarch.jpg')

# Change color to RGB (from OpenCV's default BGR)
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
plt.imshow(image)

# Reshape the image into a 2D array of pixels with 3 color values (RGB)
pixel_vals = image.reshape((-1, 3))

# Convert to float type, as required by cv2.kmeans
pixel_vals = np.float32(pixel_vals)

# The criteria below tells the algorithm to stop either after 100 iterations
# or once the cluster centers move by less than epsilon = 0.85
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 100, 0.85)

# Perform k-means clustering with the number of clusters defined as 3;
# random centers are chosen initially
k = 3
retval, labels, centers = cv2.kmeans(pixel_vals, k, None, criteria, 10,
                                     cv2.KMEANS_RANDOM_CENTERS)

# Convert the cluster centers back into 8-bit pixel values
centers = np.uint8(centers)
segmented_data = centers[labels.flatten()]

# Reshape the data into the original image dimensions
segmented_image = segmented_data.reshape(image.shape)

plt.imshow(segmented_image)


Semi-Supervised Cluster Analysis


Semi-supervised clustering is a method that partitions unlabeled data by making use of
domain knowledge. That knowledge is generally expressed as pairwise constraints between
instances, or as an additional small set of labeled instances.
The quality of unsupervised clustering can be substantially improved using some weak form
of supervision, for instance in the form of pairwise constraints (i.e., pairs of objects
labeled as belonging to the same or to different clusters). A clustering procedure that
relies on such user feedback or guidance constraints is known as semi-supervised
clustering.
Constraint-based semi-supervised clustering − Uses user-provided labels or constraints to
steer the algorithm toward a more appropriate data partitioning. This includes modifying
the objective function to respect the constraints, or initializing and constraining the
clustering process based on the labeled objects.
Distance-based semi-supervised clustering − Employs an adaptive distance measure that is
trained to satisfy the labels or constraints in the supervised data. Multiple adaptive
distance measures have been used, including string-edit distance trained using
Expectation-Maximization (EM), and Euclidean distance modified by a shortest-path
algorithm.
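As a sketch of the constraint-based idea, the few labeled instances can simply seed the initial cluster means before the standard k-means loop runs over all the data (illustrative only, not a library API; `seeds` is an assumed mapping from each cluster label to its labeled points):

```python
import numpy as np

def seeded_kmeans(X, seeds, n_iter=20):
    """Constraint-based semi-supervised k-means: initialize each cluster's
    mean from a handful of labeled instances (the 'seeds'), then run the
    usual assign/update loop over all the unlabeled data."""
    means = np.array([pts.mean(axis=0) for pts in seeds.values()])
    for _ in range(n_iter):
        # Assign every point to its nearest mean
        d = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Re-estimate each mean from its assigned points
        for c in range(len(means)):
            if (labels == c).any():
                means[c] = X[labels == c].mean(axis=0)
    return labels, means
```

Because the means start near the user-labeled examples, the final partitioning tends to agree with the supervision instead of an arbitrary random initialization.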

Using clustering for pre-processing

Data Preprocessing or Data Preparation is a data mining technique that transforms raw data into
an understandable format for ML algorithms. Real-world data usually is noisy (contains errors,
outliers, duplicates), incomplete (some values are missed), could be stored in different places
and different formats. The task of Data Preprocessing is to handle these issues.

In the common ML pipeline, the Data Preprocessing stage sits between the Data Collection
stage and the Training / Tuning of the model.

Importance of Data Preprocessing stage


1. Different ML models have different required input data (numerical data, images in
specific format, etc). Without the right data, nothing will work.


2. Because of “bad” data, ML models will give no useful results, or may even give
wrong answers that lead to wrong decisions (the GIGO principle: garbage in, garbage out).
3. The higher the quality of the data, the less data is needed.
Stages of Data preprocessing for Clustering
1. Data Cleaning
 Removing duplicates
 Removing irrelevant observations and errors
 Removing unnecessary columns
 Handling inconsistent data
 Handling outliers and noise
2. Handling missing data
3. Data Integration
4. Data Transformation
 Feature Construction
 Handling skewness
 Data Scaling
5. Data Reduction
 Removing dependent (highly correlated) variables
 Feature selection
 PCA
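A few of these stages can be illustrated with pandas and scikit-learn (a toy sketch; the column names are invented for the example):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy raw data: a duplicate row, a missing value, and very different scales
df = pd.DataFrame({
    "age":    [25, 25, 40, None, 33],
    "income": [30_000, 30_000, 90_000, 55_000, 48_000],
})

# 1. Data cleaning: remove duplicate rows
df = df.drop_duplicates()

# 2. Handling missing data: impute the missing age with the column median
df["age"] = df["age"].fillna(df["age"].median())

# 4. Data scaling: standardize each column before distance-based clustering
X = StandardScaler().fit_transform(df)
```

After scaling, each column has mean 0 and unit variance, so neither feature dominates the distance computations used by clustering algorithms.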

DBSCAN Clustering in ML | Density based clustering


Clustering analysis, or simply clustering, is basically an unsupervised learning method that divides the
data points into a number of specific batches or groups, such that the data points in the same group
have similar properties and data points in different groups have dissimilar properties in some sense. It
comprises many different methods based on different notions of distance or similarity.
E.g. K-Means (distance between points), Affinity propagation (graph distance), Mean-shift (distance
between points), DBSCAN (distance between nearest points), Gaussian mixtures (Mahalanobis distance
to centers), Spectral clustering (graph distance), etc.

Fundamentally, all clustering methods use the same approach i.e. first we calculate similarities and then
we use it to cluster the data points into groups or batches. Here we will focus on Density-based spatial
clustering of applications with noise (DBSCAN) clustering method.

Clusters are dense regions in the data space, separated by regions of the lower density of points.
The DBSCAN algorithm is based on this intuitive notion of “clusters” and “noise”. The key idea is that for
each point of a cluster, the neighborhood of a given radius has to contain at least a minimum number of
points.


Limitations of K-means
Partitioning methods (K-means, PAM clustering) and hierarchical clustering work for finding
spherical-shaped clusters or convex clusters. In other words, they are suitable only for
compact and well-separated clusters. Moreover, they are also severely affected by the
presence of noise and outliers in the data.

Real life data may contain irregularities, like:

1. Clusters can be of arbitrary shape such as those shown in the figure below.
2. Data may contain noise.

The figure below shows a data set containing non convex clusters and outliers/noises. Given
such data, k-means algorithm has difficulties in identifying these clusters with arbitrary
shapes.
DBSCAN algorithm requires two parameters:


1. eps: It defines the neighborhood around a data point, i.e. if the distance between two
points is lower than or equal to ‘eps’, they are considered neighbors. If the eps value is
chosen too small, a large part of the data will be considered outliers. If it is
chosen too large, the clusters will merge and the majority of the data points will
end up in the same cluster. One way to choose the eps value is based on the k-distance
graph.
2. MinPts: The minimum number of neighbors (data points) within the eps radius. The larger
the dataset, the larger the value of MinPts that should be chosen. As a general rule, a
minimum MinPts can be derived from the number of dimensions D in the dataset as MinPts >=
D + 1. MinPts must be chosen to be at least 3.

In this algorithm, we have 3 types of data points.


Core Point: A point is a core point if it has more than MinPts points within eps.
Border Point: A point which has fewer than MinPts within eps but it is in the neighborhood of
a core point.
Noise or outlier: A point which is not a core point or border point.

DBSCAN algorithm can be abstracted in the following steps:

1. Find all the neighbor points within eps and identify the core points, i.e. those with
more than MinPts neighbors.
2. For each core point, if it is not already assigned to a cluster, create a new cluster.
3. Recursively find all of the core point's density-connected points and assign them to the
same cluster as the core point.
Two points a and b are said to be density connected if there exists a point c which has a
sufficient number of points in its neighborhood and both a and b are within the eps
distance of it. This is a chaining process: if b is a neighbor of c, c is a neighbor of d,
and d is a neighbor of e, which in turn is a neighbor of a, then b is density connected to a.
4. Iterate through the remaining unvisited points in the dataset. Those points that do not
belong to any cluster are noise.
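The algorithm is available directly in scikit-learn; the sketch below applies it to the classic two-moons data, a non-convex shape where k-means fails (the eps and min_samples values here are illustrative choices for this data):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-circles: non-convex clusters k-means cannot separate
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# eps and min_samples play the roles of eps and MinPts described above
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_          # -1 marks noise/outlier points

# Count clusters, excluding the noise label
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```

Because DBSCAN follows density chains rather than distances to a center, each half-moon comes out as one cluster despite its arbitrary shape.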


Introduction to Gaussian Mixture Models (GMMs)

Gaussian Mixture Models (GMMs) assume that there are a certain number of Gaussian
distributions, and each of these distributions represent a cluster. Hence, a Gaussian Mixture
Model tends to group the data points belonging to a single distribution together.

Let’s say we have three Gaussian distributions (more on that in the next section) – GD1, GD2,
and GD3. These have a certain mean (μ1, μ2, μ3) and variance (σ1, σ2, σ3) value respectively.
For a given set of data points, our GMM would identify the probability of each data point
belonging to each of these distributions.

Gaussian Mixture Models are probabilistic models and use the soft clustering approach for
distributing the points in different clusters.

Here, we have three clusters that are denoted by three colors – Blue, Green, and Cyan. Let’s
take the data point highlighted in red. The probability of this point being a part of the blue
cluster is 1, while the probability of it being a part of the green or cyan clusters is 0.

Now, consider another point – somewhere in between the blue and cyan (highlighted in the
below figure). The probability that this point is a part of cluster green is 0, right? And the
probability that this belongs to blue and cyan is 0.2 and 0.8 respectively.


Gaussian Mixture Models use the soft clustering technique for assigning data points to Gaussian distributions.
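These soft assignments can be read off directly with scikit-learn (a small sketch on synthetic blobs; the membership probabilities described above correspond to `predict_proba`):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs in 2D
X = np.vstack([rng.normal(0, 0.5, size=(100, 2)),
               rng.normal(5, 0.5, size=(100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Soft clustering: each row gives P(cluster | point) and sums to 1
probs = gmm.predict_proba(X)
```

A point deep inside one blob gets a probability near 1 for that component and near 0 for the other, while a point between the blobs receives intermediate probabilities, exactly the behavior described in the text.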

Dimensionality Reduction
Curse of Dimensionality — A “Curse” to Machine Learning
Curse of Dimensionality describes the explosive nature of increasing data dimensions and its
resulting exponential increase in computational efforts required for its processing and/or
analysis. In machine learning, a feature of an object can be an attribute or a characteristic that
defines it. Each feature represents a dimension and group of dimensions creates a data point.
This represents a feature vector that defines the data point to be used by a machine learning
algorithm(s). When we say increase in dimensionality it implies an increase in the number of
features used to describe the data. As the dimensionality increases, the number of data points
required for good performance of any machine learning algorithm increases exponentially.


Techniques for Dimensionality Reduction


1. Feature Selection Methods
2. Matrix Factorization
3. Manifold Learning

1. Feature Selection Methods


Perhaps the most common are the so-called feature selection techniques, which use scoring or
statistical methods to decide which features to keep and which to delete. We perform
feature selection to remove “irrelevant” features that do not help much with the classification
problem.

Two main classes of feature selection techniques include wrapper methods and filter methods.
Wrapper methods, as the name suggests, wrap a machine learning model, fitting and evaluating
the model with different subsets of input features and selecting the subset that results in the
best model performance. RFE (Recursive Feature Elimination) is an example of a wrapper feature selection method.
Filter methods use scoring methods, like correlation between the feature and the target
variable, to select a subset of input features that are most predictive. Examples include
Pearson’s correlation and Chi-Squared test.
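For example, scikit-learn's SelectKBest applies the Chi-Squared test as a filter method (a small sketch on the Iris data):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Score each of the 4 features against the target with the chi-squared test
# and keep only the 2 highest-scoring (most predictive) features
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
```

Note that the filter never fits a classifier: it scores each feature against the target independently, which is what makes filter methods cheap compared with wrapper methods.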

2. Matrix Factorization
Techniques from linear algebra can be used for dimensionality reduction. Specifically, matrix
factorization methods can be used to reduce a dataset matrix into its constituent parts.
Examples include the eigendecomposition and the singular value decomposition (SVD). The most
common approach to dimensionality reduction is called principal component analysis, or PCA.
3. Manifold Learning

Techniques from high-dimensionality statistics can also be used for dimensionality reduction.
These techniques are sometimes referred to as “manifold learning” and are used to create a
low-dimensional projection of high-dimensional data, often for the purposes of data
visualization.
The projection is designed to create a low-dimensional representation of the dataset
whilst best preserving the salient structure or relationships in the data.
Examples of manifold learning techniques include:
 Kohonen Self-Organizing Map (SOM).
 Sammon's Mapping
 Multidimensional Scaling (MDS)
 t-distributed Stochastic Neighbor Embedding (t-SNE).
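For instance, t-SNE can project the 64-dimensional digits dataset down to 2D for visualization (a minimal sketch; parameters are left at their defaults and only a subset of the data is used to keep the run fast):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()          # 8x8 images of handwritten digits, 64 features each
X = digits.data[:500]           # a subset, since t-SNE is relatively slow

# Non-linear projection from 64 dimensions down to 2
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X)
# X_2d now holds one 2D point per image, ready for a scatter plot
```

Unlike PCA, the mapping is non-linear and tuned for visualization: nearby digits in the 64-dimensional space stay nearby in the 2D plot, so same-digit images tend to form visible islands.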


Principal Component Analysis


Principal Component Analysis, or PCA for short, is a method for reducing the dimensionality of
data.

It can be thought of as a projection method where data with m-columns (features) is projected
into a subspace with m or fewer columns, whilst retaining the essence of the original data.

 The first step is to calculate the mean values of each column of the data matrix A.
 Next, we center the values in each column by subtracting the mean column value,
giving the centered matrix C.
 The next step is to calculate the covariance matrix V of the centered matrix C.
 Correlation is a normalized measure of the amount and direction (positive or negative)
in which two columns change together. Covariance is a generalized, unnormalized
version of correlation across multiple columns. A covariance matrix holds the
covariance of every column with every other column, including itself.
 Finally, we calculate the eigendecomposition of the covariance matrix V. This results in
a list of eigenvalues and a list of eigenvectors.
The eigenvectors represent the directions or components of the reduced subspace,
whereas the eigenvalues represent the magnitudes of those directions.

The eigenvectors can be sorted by their eigenvalues in descending order to provide a ranking of
the components or axes of the new subspace for A.
If all eigenvalues have a similar value, the existing representation may already be
reasonably compressed or dense, and the projection may offer little. Eigenvalues close to
zero represent components or axes that may be discarded.
A total of m or fewer components must be selected to comprise the chosen subspace. Ideally, we
would select the k eigenvectors, called principal components, that have the k largest eigenvalues.
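The steps above can be reproduced directly in NumPy (a compact sketch of the procedure, not a replacement for scikit-learn's PCA):

```python
import numpy as np

def pca(A, k):
    # 1-2. Center each column of A by subtracting its mean
    B = A - A.mean(axis=0)
    # 3. Covariance matrix of the centered data
    V = np.cov(B, rowvar=False)
    # 4. Eigendecomposition: directions (eigenvectors) and magnitudes (eigenvalues)
    values, vectors = np.linalg.eigh(V)
    # Sort components by eigenvalue, descending, and keep the top k
    order = np.argsort(values)[::-1][:k]
    components = vectors[:, order]
    # Project the centered data onto the chosen subspace
    return B @ components, values[order]

A = np.array([[1., 2.], [3., 4.], [5., 6.]])
P, ev = pca(A, 1)   # reduce 2 columns to 1 principal component
```

For this toy matrix all the variance lies along one direction, so the single retained eigenvalue captures it entirely and the second component could be discarded with no loss.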
Using Scikit-Learn
PCA example with Iris Data-set
import numpy as np
import matplotlib.pyplot as plt
from sklearn import decomposition
from sklearn import datasets

np.random.seed(5)

iris = datasets.load_iris()
X = iris.data
y = iris.target
fig = plt.figure(1, figsize=(4, 3))
plt.clf()

ax = fig.add_subplot(111, projection="3d", elev=48, azim=134)
ax.set_position([0, 0, 0.95, 1])
plt.cla()
pca = decomposition.PCA(n_components=3)
pca.fit(X)
X = pca.transform(X)

for name, label in [("Setosa", 0), ("Versicolour", 1), ("Virginica", 2)]:
    ax.text3D(
        X[y == label, 0].mean(),
        X[y == label, 1].mean() + 1.5,
        X[y == label, 2].mean(),
        name,
        horizontalalignment="center",
        bbox=dict(alpha=0.5, edgecolor="w", facecolor="w"),
    )
# Reorder the labels to have colors matching the cluster results
y = np.choose(y, [1, 2, 0]).astype(float)
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, cmap=plt.cm.nipy_spectral, edgecolor="k")

ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([])
plt.show()

RANDOMIZED PCA:

The classical PCA uses a low-rank matrix approximation to estimate the principal
components. However, this method becomes costly and hard to scale for large datasets. By
randomizing how the singular value decomposition of the dataset happens, we can approximate
the first K principal components more quickly than with classical PCA.

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

rpca = PCA(n_components=2, svd_solver='randomized')
X_rpca = rpca.fit_transform(X)

plt.scatter(X_rpca[:, 0], X_rpca[:, 1], c=y)
plt.show()

KERNEL PCA:

PCA is a linear method. It works great for linearly separable datasets. However, if the dataset
has non-linear relationships, then it produces undesirable results.

Kernel PCA is a technique which uses the so-called kernel trick and projects the linearly
inseparable data into a higher dimension where it is linearly separable.


There are various kernels that are popularly used; some of them are linear, polynomial, RBF,
and sigmoid.
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

X, y = make_circles(n_samples=500, factor=.1, noise=0.02, random_state=47)

plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()

UNIT-V

What is Artificial Neural Network?

An Artificial Neural Network (ANN) is an efficient computing system whose central theme is
borrowed from the analogy of biological neural networks. ANNs are also known as “artificial
neural systems,” “parallel distributed processing systems,” or “connectionist systems.”
An ANN consists of a large collection of units that are interconnected in some pattern to allow
communication between the units. These units, also referred to as nodes or neurons, are
simple processors which operate in parallel.
Every neuron is connected to other neurons through connection links. Each connection
link is associated with a weight that carries information about the input signal. This is the most
useful information for neurons to solve a particular problem, because the weight usually
excites or inhibits the signal being communicated. Each neuron has an internal state,
which is called an activation signal. Output signals, which are produced by combining the
input signals and the activation rule, may be sent on to other units.

Biological Neuron

A nerve cell (neuron) is a special biological cell that processes information. According
to one estimate, there is a huge number of neurons, approximately 10^11, with numerous
interconnections, approximately 10^15.

Working of a Biological Neuron

As shown in the above diagram, a typical neuron consists of the following four parts with
the help of which we can explain its working −


 Dendrites − They are tree-like branches, responsible for receiving information
from the other neurons the neuron is connected to. In a sense, we can say that they are like
the ears of the neuron.

 Soma − It is the cell body of the neuron and is responsible for processing the
information it has received from the dendrites.

 Axon − It is just like a cable through which neurons send the information.

 Synapses − It is the connection between the axon and other neuron dendrites.

Model of Artificial Neural Network

The following diagram represents the general model of ANN followed by its
processing.

For the above general model of an artificial neural network, the net input can be calculated as
follows − the weighted sum of the inputs, y_in = x1*w1 + x2*w2 + ... + xn*wn.
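As a concrete sketch, a single artificial neuron computing its net input and applying a sigmoid activation (the inputs, weights, and bias below are made-up example values):

```python
import math

def neuron(inputs, weights, bias=0.0):
    # Net input: the weighted sum of the input signals plus a bias term
    net = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Activation rule: a sigmoid squashes the net input into (0, 1)
    return 1.0 / (1.0 + math.exp(-net))

# Two inputs, two connection weights, one bias -> one output signal
out = neuron([1.0, 0.5], [0.4, -0.2], bias=0.1)
```

The output signal `out` could then be sent on as an input to other units, exactly as described above.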


Workflow of ANN

Let us first understand the different phases of deep learning and then, learn how Keras
helps in the process of deep learning.
Collect required data
Deep learning requires a lot of input data to successfully learn and predict the results. So, first
collect as much data as possible.
Analyze data
Analyze the data to acquire a good understanding of it. A good understanding of
the data is required to select the correct ANN algorithm.
Choose an algorithm (model)
Choose an algorithm which best fits the type of learning process (e.g. image
classification, text processing, etc.) and the available input data. An algorithm is represented
by a Model in Keras. An algorithm includes one or more layers; each layer in an ANN can be
represented by a Keras Layer.
 Prepare data − Process, filter and select only the required information from the data.
 Split data − Split the data into training and test data sets. The test data will be used to
evaluate the prediction of the algorithm / model (once the machine learns) and to cross-
check the efficiency of the learning process.
 Compile the model − Compile the algorithm / model so that it can be used further to
learn by training and finally to do prediction. This step requires us to choose a loss
function and an optimizer. They are used in the learning phase to find
the error (deviation from the actual output) and to do optimization so that the error is
minimized.
 Fit the model − The actual learning process will be done in this phase using the
training data set.
 Predict result for unknown value − Predict the output for the unknown input data
(other than existing training and test data)
 Evaluate model − Evaluate the model by predicting the output for test data and cross-
comparing the prediction with actual result of the test data.
 Freeze, Modify or choose new algorithm − Check whether the evaluation of the model
is successful. If yes, save the algorithm for future prediction purpose. If not, then
modify or choose new algorithm / model and finally, again train, predict and evaluate
the model. Repeat the process until the best algorithm (model) is found.

Architecture of Keras

Keras API can be divided into three main categories −


 Model
 Layer
 Core Modules
In Keras, every ANN is represented by Keras Models. In turn, every Keras Model is a
composition of Keras Layers and represents ANN layers like input, hidden, and output
layers, convolution layers, pooling layers, etc. Keras models and layers access Keras
modules for activation functions, loss functions, regularization functions, etc. Using Keras
models, Keras Layers, and Keras modules, any ANN algorithm (CNN, RNN, etc.) can be
represented in a simple and efficient manner.

Model

Keras Models are of two types as mentioned below −


Sequential Model − A sequential model is basically a linear composition of Keras Layers. The
sequential model is easy and minimal, and is able to represent nearly all available
neural networks.
A simple sequential model is as follows −
from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential()
model.add(Dense(512, activation = 'relu', input_shape = (784,)))
Where,

 Line 1 imports the Sequential model from Keras models.
 Line 2 imports the Dense layer and the Activation module.
 Line 4 creates a new sequential model using the Sequential API.
 Line 5 adds a dense layer (Dense API) with a relu activation function (using the Activation module).

Layer

Each Keras layer in the Keras model represents the corresponding layer (input layer, hidden
layer, or output layer) in the actual proposed neural network model. Keras provides a lot of
pre-built layers so that any complex neural network can be easily created. Some of the
important Keras layers are specified below,
 Core Layers
 Convolution Layers
 Pooling Layers
 Recurrent Layers
A simple python code to represent a neural network model using sequential model is as
follows −
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout

model = Sequential()
model.add(Dense(512, activation = 'relu', input_shape = (784,)))
model.add(Dropout(0.2))
model.add(Dense(512, activation = 'relu'))
model.add(Dropout(0.2))
model.add(Dense(num_classes, activation = 'softmax'))
Where,


 Line 1 imports the Sequential model from Keras models.
 Line 2 imports the Dense, Activation, and Dropout layers.
 Line 4 creates a new sequential model using the Sequential API.
 Line 5 adds a dense layer (Dense API) with a relu activation function.
 Line 6 adds a dropout layer (Dropout API) to handle over-fitting.
 Line 7 adds another dense layer with relu activation.
 Line 8 adds another dropout layer to handle over-fitting.
 Line 9 adds the final dense layer with softmax activation.

Core Modules

Keras also provides a lot of built-in neural-network-related functions to properly create
Keras models and Keras layers. Some of the functions are as follows −
 Activations module − The activation function is an important concept in ANNs, and the
activations module provides many activation functions like softmax, relu, etc.
 Loss module − The loss module provides loss functions like mean_squared_error,
mean_absolute_error, poisson, etc.
 Optimizer module − The optimizer module provides optimizer functions like adam, sgd,
etc.
 Regularizers − The regularizer module provides functions like the L1 regularizer, L2
regularizer, etc.
Keras - Regression Prediction using MPL
Step 1 − Import the modules

Let us import the necessary modules.

import keras
from keras.datasets import boston_housing
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import RMSprop
from keras.callbacks import EarlyStopping
from sklearn import preprocessing
from sklearn.preprocessing import scale

Step 2 − Load data
Let us import the Boston housing dataset.

(x_train, y_train), (x_test, y_test) = boston_housing.load_data()

Here, boston_housing is a dataset provided by Keras. It represents a collection of
housing information in the Boston area, each entry having 13 features.
Step 3 − Process the data


Let us transform the dataset so that we can feed it into our model. The data can be
transformed using the code below −

scaler = preprocessing.StandardScaler().fit(x_train)
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)

Here, we have normalized the training data
using preprocessing.StandardScaler. The StandardScaler().fit() call
returns a scaler holding the mean and standard deviation of the
training data, which we then apply to the test data using scaler.transform().
This normalizes the test data as well, with the same settings as the training data.
Step 4 − Create the model
Let us create the actual model.
model = Sequential()
model.add(Dense(64, kernel_initializer = 'normal', activation = 'relu',
    input_shape = (13,)))
model.add(Dense(64, activation = 'relu'))
model.add(Dense(1))
Step 5 − Compile the model
Let us compile the model using selected loss function, optimizer and metrics.
model.compile(
loss = 'mse',
optimizer = RMSprop(),
metrics = ['mean_absolute_error']
)
Step 6 − Train the model
Let us train the model using fit() method.
history = model.fit(
x_train_scaled, y_train,
batch_size=128,
epochs = 500,
verbose = 1,
validation_split = 0.2,
callbacks = [EarlyStopping(monitor = 'val_loss', patience = 20)]
)
Step 7 − Evaluate the model
Let us evaluate the model using the test data. Note that since this is a regression
task, the second score is the mean absolute error, not a classification accuracy.

score = model.evaluate(x_test_scaled, y_test, verbose = 0)
print('Test loss (MSE):', score[0])
print('Test MAE:', score[1])

Executing the above code will output the below information −
Test loss (MSE): 21.928471583946077
Test MAE: 2.9599233234629914
Step 8 − Predict


Finally, predict using test data as below −


prediction = model.predict(x_test_scaled)
print(prediction.flatten())
print(y_test)

TENSORFLOW INSTALLATION
Step 1 − Verify the python version being installed.

Step 2 − A user can pick any mechanism to install TensorFlow in the system. We
recommend “pip” or “Anaconda”. Pip is a command used for executing and
installing modules in Python.
Before we install TensorFlow, we need to install Anaconda framework in our system.


After successful installation, check in command prompt through “conda” command.


The execution of command is displayed below −

Step 3 − Execute the following command to initialize the installation of TensorFlow −


conda create --name tensorflow python=3.5


It downloads the necessary packages needed for TensorFlow setup.


Step 4 − After successful environmental setup, it is important to activate TensorFlow
module.
activate tensorflow

Step 5 − Use pip to install “Tensorflow” in the system. The command used for
installation is mentioned as below −
pip install tensorflow
And,
pip install tensorflow-gpu


Loading and Preprocessing Data with TensorFlow.


Deep Learning systems are often trained on very large datasets that will not fit in RAM.
Ingesting a large dataset and preprocessing it efficiently can be tricky to implement with
other Deep Learning libraries, but TensorFlow makes it easy thanks to the Data API: you just
create a dataset object, tell it where to get the data, then transform it in any way you want,
and TensorFlow takes care of all the implementation details, such as multithreading,
queuing, batching, prefetching, and so on.
Off the shelf, the Data API can read from text files (such as CSV files), binary files with
fixed-size records, and binary files that use TensorFlow's TFRecord format, which supports
records of varying sizes. TFRecord is a flexible and efficient binary format based on Protocol
Buffers (an open source binary format). The Data API also has support for reading from SQL
databases. Moreover, many open source extensions are available to read from all sorts of
data sources, such as Google's BigQuery

[Link] MOHAMMAD RAFI, PROFESSOR, SRI MITTAPALLI COLLEGE OF ENGINEERING

service. However, reading huge datasets efficiently is not the only difficulty: the data also
needs to be preprocessed. Indeed, it is not always composed strictly of convenient
numerical fields: sometimes there will be text features, categorical features, and so on. To
handle this, TensorFlow provides the Features API: it lets you easily convert these features
to numerical features that can be consumed by your neural network.
TensorFlow's ecosystem:
• TF Transform ([Link]) makes it possible to write a single preprocessing function that
can be run in batch mode on your full training set before training (to speed it up), and
then exported to a TF Function and incorporated into your trained model, so that once it is
deployed in production, it can take care of preprocessing new instances on the fly.
• TF Datasets (TFDS) provides a convenient function to download many common datasets of
all kinds, including large ones like ImageNet, and it provides convenient dataset objects to
manipulate them using the Data API.

The Data API


Usually you will use datasets that gradually read data from disk, but for simplicity let's just
create a dataset entirely in RAM using tf.data.Dataset.from_tensor_slices():
>>> X = tf.range(10) # any data tensor
>>> dataset = tf.data.Dataset.from_tensor_slices(X)
>>> dataset
<TensorSliceDataset shapes: (), types: tf.int32>
The from_tensor_slices() function takes a tensor and creates a tf.data.Dataset whose
elements are all the slices of X (along the first dimension), so this dataset contains 10 items:
tensors 0, 1, 2, …, 9. In this case we would have obtained the same dataset had we used
tf.data.Dataset.range(10).
Chaining Transformations
Once you have a dataset, you can apply all sorts of transformations to it by calling its
transformation methods. Each method returns a new dataset, so you can chain
transformations
>>> dataset = dataset.repeat(3).batch(7)
>>> for item in dataset:
...     print(item)
...
tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int32)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int32)
tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int32)
tf.Tensor([1 2 3 4 5 6 7], shape=(7,), dtype=int32)
tf.Tensor([8 9], shape=(2,), dtype=int32)
Shuffling the Data
As you know, Gradient Descent works best when the instances in the training set are
independent and identically distributed. A simple way to ensure this is to shuffle the
instances. For this, you can just use the shuffle() method. It will create a new dataset that
will start by filling up a buffer with the first items of the source dataset, then whenever it is
asked for an item, it will pull one out randomly from the buffer, and replace it with a fresh
one from the source dataset, until it has iterated entirely through the source dataset.
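A short sketch of shuffle() in action (the buffer size and seed are illustrative values):

```python
import tensorflow as tf

# 0..9 repeated 3 times, shuffled with a small buffer, then batched
dataset = tf.data.Dataset.range(10).repeat(3)
dataset = dataset.shuffle(buffer_size=5, seed=42).batch(7)

batches = [t.numpy().tolist() for t in dataset]
```

Shuffling only permutes the items, it never drops or duplicates them, and a larger `buffer_size` yields a more thorough shuffle at the cost of more RAM.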


Preprocessing the Data


Let's implement a small function that will perform this preprocessing:

X_mean, X_std = [...] # mean and scale of each feature in the training set
n_inputs = 8

def preprocess(line):
    defs = [0.] * n_inputs + [tf.constant([], dtype=tf.float32)]
    fields = tf.io.decode_csv(line, record_defaults=defs)
    x = tf.stack(fields[:-1])
    y = tf.stack(fields[-1:])
    return (x - X_mean) / X_std, y
• First, we assume that you have precomputed the mean and standard deviation of each
feature in the training set. X_mean and X_std are just 1D tensors (or NumPy arrays)
containing 8 floats, one per input feature.
• The preprocess() function takes one CSV line and starts by parsing it. For this, it uses the
tf.io.decode_csv() function, which takes two arguments: the first is the line to parse, and the
second is an array containing the default value for each column in the CSV file. This tells
TensorFlow not only the default value for each column, but also the number of columns and
the type of each column. In this example, we tell it that all feature columns are floats and
that missing values should default to 0, but we provide an empty array of type tf.float32 as the
default value for the last column (the target): this tells TensorFlow that this column contains
floats, but that there is no default value, so it will raise an exception if it encounters a
missing value.
• The decode_csv() function returns a list of scalar tensors (one per column), but we need to
return 1D tensor arrays. So we call tf.stack() on all tensors except the last one (the
target): this stacks the tensors into a 1D array. We then do the same for the target
value (making it a 1D tensor array with a single value, rather than a scalar tensor).
• Finally, we scale the input features by subtracting the feature means and then dividing by
the feature standard deviations, and we return a tuple containing the scaled features and
the target.
Let’s test this preprocessing function:
>>> preprocess(b'4.2083,44.0,5.3232,0.9171,846.0,2.3370,37.47,-122.2,2.782')
(<tf.Tensor: id=6227, shape=(8,), dtype=float32, numpy=array([ 0.16579159, 1.216324 ,
-0.05204564, -0.39215982, -0.5277444 , -0.2633488 , 0.8543046 , -1.3072058 ],
dtype=float32)>, <tf.Tensor: [...], numpy=array([2.782], dtype=float32)>)
