
APPLIED MACHINE LEARNING

Unit - I

Importance of Machine Learning


Machine Learning is one of the most popular sub-fields of Artificial Intelligence. Machine
learning concepts are used almost everywhere: healthcare, finance, infrastructure,
marketing, self-driving cars, recommendation systems, chatbots, social media, gaming, cyber
security, and many more.

What is Machine Learning?


Machine Learning is a branch of Artificial Intelligence that allows machines to learn and
improve from experience automatically. It is defined as the field of study that gives
computers the capability to learn without being explicitly programmed. It is quite different
from traditional programming.

How Machine Learning Works?


Machine Learning is a core form of Artificial Intelligence that enables machines to learn from
past data and make predictions.

It involves data exploration and pattern matching with minimal human intervention. Machine
learning mainly works through four approaches:

1. Supervised Learning:
Supervised Learning is a machine learning method that needs supervision, similar to the
student-teacher relationship. In supervised learning, a machine is trained with well-labeled
data, which means some data is already tagged with the correct outputs. So, whenever new
data is introduced into the system, supervised learning algorithms analyze it and predict the
correct outputs with the help of that labeled data.
It is classified into two different categories of algorithms. These are as follows:

o Classification: It applies when the output is a category, such as yellow or blue,
right or wrong, etc.
o Regression: It applies when the output variable is a real value, such as age or height.

This approach allows us to produce outputs from experience. It works the
same way humans learn, using labeled data points from the training set. It helps in
optimizing the performance of models using experience and in solving various complex
computation problems.

2. Unsupervised Learning:
Unlike supervised learning, unsupervised Learning does not require classified or well-
labeled data to train a machine. It aims to make groups of unsorted information based on
some patterns and differences even without any labelled training data. In unsupervised
Learning, no supervision is provided, so no sample data is given to the machines. Hence,
machines are restricted to finding hidden structures in unlabeled data by their own.

It is classified into two different categories of algorithms. These are as follows:

o Clustering: It applies when there is a requirement for inherent grouping in the training data,
e.g., grouping students by their area of interest.
o Association: It deals with rules that identify relationships across a large portion of the data,
such as students who are interested in ML also tending to be interested in AI.

3. Semi-supervised learning:
Semi-supervised Learning is defined as the combination of both supervised and
unsupervised learning methods. It is used to overcome the drawbacks of both supervised
and unsupervised learning methods.

In the semi-supervised learning method, a machine is trained with labeled as well as
unlabeled data. Typically, it involves a few labeled examples and a large number of
unlabeled examples.
Speech analysis, web content classification, protein sequence classification, and text
document classifiers are some of the most popular real-world applications of semi-supervised
learning.

4. Reinforcement learning:
Reinforcement learning is defined as a feedback-based machine learning method that does
not require labeled data. In this learning method, an agent learns to behave in an
environment by performing the actions and seeing the results of actions. Agents can
provide positive feedback for each good action and negative feedback for bad actions.
Since, in reinforcement learning, there is no training data, hence agents are restricted to
learn with their experience only.

Types of Machine Learning

Supervised Machine Learning


Supervised learning is the type of machine learning in which machines are trained using
well "labelled" training data, and on the basis of that data, machines predict the output. The
labelled data means some input data is already tagged with the correct output.

In supervised learning, the training data provided to the machines works as the supervisor
that teaches the machines to predict the output correctly. It applies the same concept as a
student learning under the supervision of a teacher.
Supervised learning is a process of providing input data as well as correct output data to the
machine learning model. The aim of a supervised learning algorithm is to find a mapping
function to map the input variable(x) with the output variable(y).

In the real-world, supervised learning can be used for Risk Assessment, Image
classification, Fraud Detection, spam filtering

How Supervised Learning Works?


In supervised learning, models are trained using a labelled dataset, where the model learns
about each type of data. Once the training process is completed, the model is tested on the
basis of test data (a held-out subset of the data), and then it predicts the output.

The working of Supervised learning can be easily understood by the below example and
diagram:
Suppose we have a dataset of different types of shapes which includes square,
rectangle, triangle, and Polygon. Now the first step is that we need to train the model for
each shape.

o If the given shape has four sides, and all the sides are equal, then it will be labelled as
a Square.
o If the given shape has three sides, then it will be labelled as a triangle.
o If the given shape has six equal sides, then it will be labelled as a hexagon.

Now, after training, we test our model using the test set, and the task of the model is to
identify the shape.

The machine is already trained on all types of shapes, and when it finds a new shape, it
classifies the shape on the basis of the number of sides and predicts the output.
Steps Involved in Supervised Learning:

o First, determine the type of training dataset.
o Collect/gather the labelled training data.
o Split the dataset into a training set, a test set, and a validation set.
o Determine the input features of the training dataset, which should carry enough
information for the model to accurately predict the output.
o Determine a suitable algorithm for the model, such as a support vector machine,
decision tree, etc.
o Execute the algorithm on the training dataset. Sometimes we need validation sets as
control parameters, which are subsets of the training dataset.
o Evaluate the accuracy of the model by providing the test set. If the model predicts the
correct outputs, the model is accurate (see the sketch below).
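
A minimal sketch of these steps with scikit-learn; the sample dataset (iris) and the choice of a decision tree here are assumptions made purely for illustration:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                        # labelled training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)  # split the data
model = DecisionTreeClassifier()                         # choose a suitable algorithm
model.fit(X_train, y_train)                              # execute it on the training set
print(accuracy_score(y_test, model.predict(X_test)))     # evaluate accuracy on the test set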

Types of supervised Machine learning Algorithms:


Supervised learning can be further divided into two types of problems:

1. Regression

Regression algorithms are used if there is a relationship between the input variable and the
output variable. It is used for the prediction of continuous variables, such as Weather
forecasting, Market Trends, etc. Below are some popular Regression algorithms which
come under supervised learning:
o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression

2. Classification

Classification algorithms are used when the output variable is categorical, which means
there are two or more classes such as Yes-No, Male-Female, True-False, etc.

Spam filtering is a common example. Below are some popular classification algorithms which
come under supervised learning:

o Random Forest
o Decision Trees
o Logistic Regression
o Support vector Machines

Advantages of Supervised learning:

o With the help of supervised learning, the model can predict the output on the basis of
prior experiences.
o In supervised learning, we can have an exact idea about the classes of objects.
o Supervised learning model helps us to solve various real-world problems such as fraud
detection, spam filtering, etc.

Disadvantages of supervised learning:

o Supervised learning models are not suitable for handling complex tasks.
o Supervised learning cannot predict the correct output if the test data is different from the
training dataset.
o Training requires a lot of computation time.
o In supervised learning, we need enough knowledge about the classes of objects.
Regression Analysis in Machine Learning
Regression analysis is a statistical method for modelling the relationship between a dependent
(target) variable and one or more independent (predictor) variables. More
specifically, regression analysis helps us to understand how the value of the dependent
variable changes with respect to an independent variable when the other independent
variables are held fixed. It predicts continuous/real values such as temperature, age,
salary, price, etc.

We can understand the concept of regression analysis using the below example:

Example: Suppose there is a marketing company A, which runs various advertisements every
year and earns sales from them. The list below shows the advertisements made by the company in
the last 5 years and the corresponding sales:
Now, the company wants to spend $200 on advertisement in the year 2019 and wants to
know the prediction of sales for this year. To solve such prediction
problems in machine learning, we need regression analysis.
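
A minimal sketch of such a prediction with scikit-learn; the advertisement and sales figures below are hypothetical, since the original table is not reproduced here:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical advertisement spend (in $) and corresponding sales over 5 years
ads = np.array([[90], [120], [150], [100], [130]])
sales = np.array([1000, 1300, 1800, 1200, 1380])

model = LinearRegression().fit(ads, sales)
print(model.predict([[200]]))    # predicted sales for a $200 advertisement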

Regression is a supervised learning technique which helps in finding the correlation


between variables and enables us to predict the continuous output variable based on the
one or more predictor variables. It is mainly used for prediction, forecasting, time series
modeling, and determining the causal-effect relationship between variables.

In regression, we plot a graph between the variables that best fits the given datapoints.
Using this plot, the machine learning model can make predictions about the data. In simple
words, "Regression shows a line or curve that passes through all the datapoints on the
target-predictor graph in such a way that the vertical distance between the datapoints
and the regression line is minimum." The distance between the datapoints and the line tells
whether the model has captured a strong relationship or not.

Some examples of regression are:

o Prediction of rain using temperature and other factors


o Determining Market trends
o Prediction of road accidents due to rash driving.

Terminologies Related to the Regression Analysis:

o Dependent Variable: The main factor in Regression analysis which we want to predict
or understand is called the dependent variable. It is also called target variable.
o Independent Variable: The factors which affect the dependent variables or which are
used to predict the values of the dependent variables are called independent variable,
also called as a predictor.
o Outliers: An outlier is an observation with either a very low or a very high value in
comparison to the other observed values. An outlier may hamper the results, so it
should be avoided.
o Multicollinearity: If the independent variables are highly correlated with each other,
the condition is called multicollinearity. It should not be present in
the dataset, because it creates problems when ranking the most influential variables.
o Underfitting and Overfitting: If our algorithm works well with the training dataset but
not well with test dataset, then such problem is called Overfitting. And if our algorithm
does not perform well even with training dataset, then such problem is
called underfitting.

Why do we use Regression Analysis?


As mentioned above, regression analysis helps in the prediction of a continuous variable.
There are various scenarios in the real world where we need future predictions, such
as weather conditions, sales, marketing trends, etc. For such cases, we need a
technique which can make predictions more accurately. Regression
analysis is such a statistical method, used in machine learning and data
science. Below are some other reasons for using regression analysis:

o Regression estimates the relationship between the target and the independent variable.
o It is used to find the trends in data.
o It helps to predict real/continuous values.
o By performing the regression, we can confidently determine the most important factor,
the least important factor, and how each factor is affecting the other factors.

Types of Regression
There are various types of regressions which are used in data science and machine
learning. Each type has its own importance in different scenarios, but at the core, all the
regression methods analyze the effect of the independent variables on the dependent variable.
Here we are discussing some important types of regression which are given below:

o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression

Linear Regression:

o Linear regression is a statistical regression method which is used for predictive analysis.
o It is one of the very simple and easy algorithms which works on regression and shows
the relationship between the continuous variables.
o It is used for solving the regression problem in machine learning.
o Linear regression shows the linear relationship between the independent variable (X-
axis) and the dependent variable (Y-axis), hence called linear regression.
o If there is only one input variable (x), then such linear regression is called simple linear
regression. And if there is more than one input variable, then such linear regression is
called multiple linear regression.
o The relationship between variables in the linear regression model can be explained
using the below image, where we are predicting the salary of an employee on the
basis of years of experience.
o Below is the mathematical equation for Linear regression:

1. Y= aX+b

Here, Y = dependent variable (target variable),
X = independent variable (predictor variable),
a and b are the linear coefficients
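
A minimal sketch of fitting this equation, assuming hypothetical years-of-experience and salary values:

import numpy as np

X = np.array([1, 2, 3, 4, 5])                       # years of experience (hypothetical)
Y = np.array([35000, 40000, 46000, 51000, 55000])   # salary (hypothetical)

a, b = np.polyfit(X, Y, deg=1)     # least-squares fit of Y = aX + b
print(a, b)                        # the linear coefficients
print(a * 6 + b)                   # predicted salary for 6 years of experience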

Some popular applications of linear regression are:

o Analyzing trends and sales estimates


o Salary forecasting
o Real estate prediction
o Arriving at ETAs in traffic.

Logistic Regression:

o Logistic regression is another supervised learning algorithm which is used to solve the
classification problems. In classification problems, we have dependent variables in a
binary or discrete format such as 0 or 1.
o Logistic regression algorithm works with the categorical variable such as 0 or 1, Yes or
No, True or False, Spam or not spam, etc.
o It is a predictive analysis algorithm which works on the concept of probability.
o Logistic regression is a type of regression, but it differs from the linear regression
algorithm in terms of how it is used.
o Logistic regression uses the sigmoid function (logistic function) to map predicted values
to probabilities between 0 and 1. This sigmoid function is used to model the data in
logistic regression. The function can be represented as:

f(x) = 1 / (1 + e^-x)

o f(x) = output between the 0 and 1 value.
o x = input to the function
o e = base of the natural logarithm.

When we provide the input values (data) to the function, it gives the S-curve as follows:

o It uses the concept of threshold levels: values above the threshold level are rounded up
to 1, and values below the threshold level are rounded down to 0 (see the sketch below).
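
A minimal sketch of the sigmoid mapping and the thresholding step; the 0.5 cut-off used here is the usual default and an assumption for illustration:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))        # output always lies between 0 and 1

scores = np.array([-3.0, -0.5, 0.0, 1.2, 4.0])
probs = sigmoid(scores)
labels = (probs >= 0.5).astype(int)    # above the threshold -> 1, below -> 0
print(probs, labels)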

There are three types of logistic regression:

o Binary (0/1, pass/fail)
o Multinomial (cats, dogs, lions)
o Ordinal (low, medium, high)

Polynomial Regression:

o Polynomial Regression is a type of regression which models the non-linear


dataset using a linear model.
o It is similar to multiple linear regression, but it fits a non-linear curve between the value
of x and corresponding conditional values of y.
o Suppose there is a dataset which consists of datapoints arranged in a non-linear
fashion; in such a case, linear regression will not best fit those datapoints. To
cover such datapoints, we need polynomial regression.
o In polynomial regression, the original features are transformed into polynomial
features of a given degree and then modeled using a linear model. This means the
datapoints are best fitted using a polynomial curve (see the sketch below).
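
A minimal sketch of polynomial regression with scikit-learn, assuming a small synthetic non-linear dataset and a polynomial of degree 2:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

x = np.array([[1], [2], [3], [4], [5]])
y = np.array([1, 4, 9, 16, 25])                  # synthetic, clearly non-linear target

poly = PolynomialFeatures(degree=2)              # transform the original feature into polynomial features
x_poly = poly.fit_transform(x)
model = LinearRegression().fit(x_poly, y)        # still a linear model over the new features
print(model.predict(poly.transform([[6]])))      # approximately 36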

Unsupervised Machine Learning


Unsupervised learning is a type of machine learning in which models are trained using an
unlabeled dataset and are allowed to act on that data without any supervision.

Unsupervised learning cannot be directly applied to a regression or classification problem


because unlike supervised learning, we have the input data but no corresponding output
data. The goal of unsupervised learning is to find the underlying structure of dataset,
group that data according to similarities, and represent that dataset in a compressed
format.

Example: Suppose the unsupervised learning algorithm is given an input dataset containing
images of different types of cats and dogs. The algorithm is never trained upon the given
dataset, which means it does not have any idea about the features of the dataset. The task
of the unsupervised learning algorithm is to identify the image features on their own.
Unsupervised learning algorithm will perform this task by clustering the image dataset into
the groups according to similarities between images.

Why use Unsupervised Learning?

Below are some main reasons which describe the importance of Unsupervised Learning:

o Unsupervised learning is helpful for finding useful insights from the data.
o Unsupervised learning is much like how a human learns to think through their own
experiences, which makes it closer to true AI.
o Unsupervised learning works on unlabeled and uncategorized data, which makes
unsupervised learning all the more important.
o In the real world, we do not always have input data with corresponding outputs, so to
solve such cases, we need unsupervised learning.

Working of Unsupervised Learning


Working of unsupervised learning can be understood by the below diagram:

Types of Unsupervised Learning Algorithm:


The unsupervised learning algorithm can be further categorized into two types of problems:
o Clustering: Clustering is a method of grouping objects into clusters such that
objects with the most similarities remain in a group and have few or no similarities
with the objects of another group. Cluster analysis finds the commonalities
between the data objects and categorizes them as per the presence and
absence of those commonalities.
o Association: An association rule is an unsupervised learning method which is
used for finding the relationships between variables in a large database. It
determines the set of items that occur together in the dataset. Association rules
make marketing strategies more effective; for example, people who buy item X
(say, bread) also tend to purchase item Y (butter or jam). A typical
example of an association rule is Market Basket Analysis (a small clustering sketch
follows this list).
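
A minimal clustering sketch with k-means; the two-dimensional points and the choice of two clusters are assumptions for illustration:

import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 2], [1, 4], [1, 0],
                   [10, 2], [10, 4], [10, 0]])   # unlabeled data
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(points)
print(kmeans.labels_)                            # group membership found without any labels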

Unsupervised Learning algorithms:


Below is the list of some popular unsupervised learning algorithms:

o K-means clustering
o KNN (k-nearest neighbours)
o Hierarchical clustering
o Anomaly detection
o Neural networks
o Principal Component Analysis
o Independent Component Analysis
o Apriori algorithm
o Singular value decomposition
Advantages of Unsupervised Learning

o Unsupervised learning is used for more complex tasks as compared to supervised


learning because, in unsupervised learning, we don't have labeled input data.
o Unsupervised learning is preferable as it is easy to get unlabeled data in comparison to
labeled data.

Disadvantages of Unsupervised Learning

o Unsupervised learning is intrinsically more difficult than supervised learning as it does


not have corresponding output.
o The result of the unsupervised learning algorithm might be less accurate as input data is
not labeled, and algorithms do not know the exact output in advance.

What is the Classification Algorithm?


The Classification algorithm is a Supervised Learning technique that is used to identify the
category of new observations on the basis of training data. In classification, a program
learns from the given dataset or observations and then classifies new observations into a
number of classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam, cat or
dog, etc. Classes can be called targets/labels or categories.

Unlike regression, the output variable of Classification is a category, not a value, such as
"Green or Blue", "fruit or animal", etc. Since the Classification algorithm is a Supervised
learning technique, hence it takes labeled input data, which means it contains input with the
corresponding output.

In a classification algorithm, the input variable (x) is mapped to a discrete output (y).

1. y=f(x), where y = categorical output


The best example of an ML classification algorithm is Email Spam Detector.
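
A minimal sketch of such a spam detector, assuming a tiny made-up set of labelled messages and a Naïve Bayes classifier:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = ["win a free prize now", "meeting at 10 am", "free offer click now", "lunch tomorrow?"]
labels = [1, 0, 1, 0]                            # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)           # turn text into numeric features
clf = MultinomialNB().fit(X, labels)
print(clf.predict(vectorizer.transform(["free prize now"])))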

The main goal of the Classification algorithm is to identify the category of a given dataset,
and these algorithms are mainly used to predict the output for the categorical data.
Classification algorithms can be better understood using the below diagram. In the below
diagram, there are two classes, class A and Class B. These classes have features that are
similar to each other and dissimilar to other classes.

The algorithm which implements the classification on a dataset is known as a classifier.


There are two types of Classifications:

o Binary Classifier: If the classification problem has only two possible outcomes, then it is
called as Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
o Multi-class Classifier: If a classification problem has more than two outcomes, then it is
called as Multi-class Classifier.
Example: Classifications of types of crops, Classification of types of music.

Learners in Classification Problems:


In the classification problems, there are two types of learners:
1. Lazy Learners: A lazy learner first stores the training dataset and waits until it
receives the test dataset. In the lazy learner's case, classification is done on the basis of
the most related data stored in the training dataset. It takes less time in training but
more time for predictions.
Example: K-NN algorithm, Case-based reasoning
2. Eager Learners: Eager learners develop a classification model based on a training
dataset before receiving a test dataset. Opposite to lazy learners, eager learners
take more time in learning and less time in prediction. Example: Decision Trees,
Naïve Bayes, ANN.

Types of ML Classification Algorithms:


Classification algorithms can be further divided mainly into two categories:

o Linear Models

o Logistic Regression
o Support Vector Machines

o Non-linear Models

o K-Nearest Neighbours
o Kernel SVM
o Naïve Bayes
o Decision Tree Classification
o Random Forest Classification

Data Preparation in Machine Learning


Data preparation is defined as the process of gathering, combining, cleaning, and transforming
raw data to make accurate predictions in machine learning projects.

Data preparation is also known as data "pre-processing", "data wrangling", "data cleaning",
and "feature engineering". It is a later stage of the machine
learning lifecycle, which comes after data collection.
Data preparation is particular to the data, the objectives of the project, and the algorithms that
will be used in the data modeling techniques.

Prerequisites for Data Preparation


Everyone must explore a few essential tasks when working with data in the data preparation
step. These are as follows:

o Data cleaning: This task includes the identification of errors and making corrections or
improvements to those errors.
o Feature Selection: We need to identify the most important or relevant input data
variables for the model.
o Data Transforms: Data transformation involves converting raw data into a well-suitable
format for the model.
o Feature Engineering: Feature engineering involves deriving new variables from the
available dataset.
o Dimensionality Reduction: The dimensionality reduction process involves converting
higher-dimensional data into lower-dimensional features while preserving as much
information as possible.

Data Preparation in Machine Learning


Data preparation is the process of cleaning and transforming raw data so that predictions
can be made accurately using ML algorithms. Although data preparation is considered the most
complicated stage in ML, it reduces process complexity later in real-time projects. Various
issues have been reported during the data preparation step in machine learning, as follows:

o Missing data: Missing data or incomplete records is a prevalent issue found in most
datasets. Instead of appropriate data, records sometimes contain empty cells, placeholder
values (e.g., NULL or N/A), or a specific character, such as a question mark.
o Outliers or Anomalies: ML algorithms are sensitive to the range and distribution of
values when data comes from unknown sources. These values can spoil the entire
machine learning training system and the performance of the model. Hence, it is
essential to detect these outliers or anomalies through techniques such as
visualization.
o Unstructured data format: Data comes from various sources and needs to be
extracted into a different format. Hence, before deploying an ML project, always consult
with domain experts or import data from known sources.
o Limited Features: Whenever data comes from a single source, it contains limited
features, so it is necessary to import data from various sources for feature enrichment or
build multiple features in datasets.
o Understanding feature engineering: Feature engineering helps develop additional
content in the ML models, increasing model performance and accuracy in predictions.

Why is Data Preparation important?


Each machine learning project requires a specific data format. To achieve this, datasets need to
be prepared well before applying them to the project. Sometimes, datasets have
missing or incomplete information, which leads to less accurate or incorrect predictions.
Further, sometimes datasets are clean but not adequately shaped, such as aggregated or
pivoted data, and some lack business context. Hence, after collecting data from various
data sources, data preparation is needed to transform the raw data. Below are a few significant
advantages of data preparation in machine learning:

o It helps to provide reliable prediction outcomes in various analytics operations.


o It helps identify data issues or errors and significantly reduces the chances of errors.
o It increases decision-making capability.
o It reduces overall project cost (data management and analytic cost).
o It helps to remove duplicate content to make it worthwhile for different applications.
o It increases model performance.

Steps in Data Preparation Process


Data preparation is one of the critical steps in the machine learning project building process,
and it must be done in particular series of steps which includes different tasks. There are
some essential steps of the data preparation process in machine learning suggested by
different ML experts and professionals as follows:

1. Understand the problem: This is one of the essential steps of data preparation for
a machine learning model, in which we need to understand the actual problem and
try to solve it. To build a better model, we must have detailed information on all
issues, such as what to do and how to do it. A clear problem statement also helps
avoid wasted effort.
2. Data collection: Data collection is probably the most typical step in the data
preparation process, where data scientists need to collect data from various potential
sources. These data sources may be either within the enterprise or from third-party vendors.
Data collection helps reduce and mitigate bias in the ML model; hence,
before collecting data, always analyze it and ensure that the dataset was
collected from diverse people, geographical areas, and perspectives.
There are some common problems that can be addressed using data
collection as follows:
o It is helpful for determining the relevant attributes for the .csv file
format.
o It is used to parse highly nested data formats, such as XML or JSON, into
tabular form.
o It makes scanning and pattern detection in datasets easier.
o Data collection is a practical step in machine learning for finding relevant data from
external repositories.

3. Profiling and Data Exploration: After analyzing and collecting data from various
data sources, it's time to explore the data for trends, outliers, exceptions, incorrect,
inconsistent, missing, or skewed information, etc. Although the source data will drive
all model findings, it must be checked for unseen biases. Data exploration helps to
identify problems such as collinearity and situations in which standardization of the
datasets and other data transformations are necessary.
4. Data Cleaning and Validation: Data cleaning and validation techniques help
determine and resolve inconsistencies, outliers, anomalies, incomplete data, etc.
Clean data helps to find valuable patterns and information in the data and
ignores irrelevant data in the datasets. It is very much essential to build high-quality
models, and missing or incomplete data is one of the best examples of poor data.
Since missing data always reduces the prediction accuracy and performance of the
model, data must be cleaned and validated through various imputation tools to fill
incomplete fields with statistically relevant substitutes.
5. Data Formatting: After cleaning and validating the data, the next step is to
ensure that the data is correctly formatted. If data is formatted incorrectly, it
will hinder building a high-quality model.
Since data comes from various sources or is sometimes updated manually, there are
high chances of discrepancies in the data format. For example, if you have collected
data from two sources, one source has updated the product's price to USD10.50,
and the other has updated the same value to $10.50. Similarly, there may be
anomalies in their spelling, abbreviations, etc. This type of inconsistency leads to
incorrect predictions. To reduce these errors, you must format your data in a consistent
manner by using some input formatting protocols.
6. Improve data quality: Quality is one of the essential parameters in building high-
quality models. Quality data helps to reduce errors, missing data, extreme values,
and outliers in the datasets. For example, in one dataset the columns may be First
Name and Last Name, while another dataset has a single column named Customer
that combines first and last name. In such cases, intelligent ML algorithms must
have the ability to match these columns and join the datasets for a singular view
of the customer.
7. Feature engineering and selection:
Feature engineering is defined as the study of selecting, manipulating, and
transforming raw data into valuable features or the most relevant variables in supervised
machine learning. Feature engineering enables you to build an enhanced predictive
model with accurate predictions.
For example, data can be split into various parts to capture more specific
information, such as analyzing marketing performance by the day of the week, not
only the month or year. In this situation, segregating the day as a separate
categorical value from the data (e.g., "Mon; 07.12.2021") may provide the algorithm
with more relevant information. There are various feature engineering techniques
used in machine learning as follows:
o Imputation: Feature imputation is the technique to fill incomplete fields in the
datasets. It is essential because most machine learning models don't work when
there is missing data in the dataset. However, the missing values problem can
be reduced by using techniques such as single value imputation, multiple value
imputation, K-Nearest neighbor, deleting the row, etc.
o Encoding: Feature encoding is defined as the method to convert string values
into numeric form. This is important as all ML models require all values in
numeric format. Feature encoding includes label encoding and One Hot
Encoding (also known as get_dummies).

Similarly, feature engineering also includes handling outliers, log transforms, scaling,
normalization, standardization, etc. (a small sketch of imputation and encoding appears
after this list).

8. Splitting data:
After feature engineering and selection, the last step is to split your data into two
different sets (training and evaluation sets). Further, always select non-overlapping
subsets of your data for the training and evaluation sets to ensure proper testing.
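
A minimal sketch of two of these data preparation techniques, imputation and encoding, using pandas and scikit-learn; the column names and values are assumptions for illustration:

import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, None, 40], "city": ["Delhi", "Paris", "Delhi"]})

# Imputation: fill the missing age with the column mean
df[["age"]] = SimpleImputer(strategy="mean").fit_transform(df[["age"]])

# Encoding: one-hot encode the categorical column (get_dummies)
df = pd.get_dummies(df, columns=["city"])
print(df)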

Bias and Variance in Machine Learning


Machine learning is a branch of Artificial Intelligence which allows machines to perform
data analysis and make predictions. However, if the machine learning model is not
accurate, it can make prediction errors, and these prediction errors are usually known as
bias and variance. In machine learning, these errors will always be present, as there is
always a slight difference between the model's predictions and the actual values. The main
aim of ML/data science analysts is to reduce these errors in order to get more accurate
results. In this topic, we are going to discuss bias and variance, the bias-variance trade-off,
and underfitting and overfitting. But before starting, let's first understand what errors in
machine learning are.
Errors in Machine Learning?
In machine learning, an error is a measure of how accurately an algorithm can make
predictions for a previously unseen dataset. On the basis of these errors, we select the
machine learning model that performs best on the particular dataset. There are mainly
two types of errors in machine learning, which are:

o Reducible errors: These errors can be reduced to improve the model accuracy. Such
errors can further be classified into bias and Variance.

o Irreducible errors: These errors will always be present in the model
regardless of which algorithm has been used. Their cause is unknown
variables whose influence can't be reduced.

What is Bias?
In general, a machine learning model analyses the data, finds patterns in it, and makes
predictions. While training, the model learns these patterns in the dataset and applies them
to test data for prediction. While making predictions, a difference occurs between the
prediction values made by the model and the actual/expected values, and this
difference is known as bias error, or error due to bias. It can be defined as the inability
of machine learning algorithms such as linear regression to capture the true relationship
between the data points. Each algorithm begins with some amount of bias, because bias
arises from assumptions in the model that make the target function simpler to learn. A
model has either:

o Low Bias: A low bias model will make fewer assumptions about the form of the target
function.
o High Bias: A model with a high bias makes more assumptions, and the model becomes
unable to capture the important features of our dataset. A high bias model also cannot
perform well on new data.

Generally, a linear algorithm has high bias, which makes it learn fast. The simpler the
algorithm, the more bias it is likely to introduce. Whereas a nonlinear algorithm
often has low bias.

Some examples of machine learning algorithms with low bias are Decision Trees, k-
Nearest Neighbours and Support Vector Machines. At the same time, an algorithm with
high bias is Linear Regression, Linear Discriminant Analysis and Logistic Regression.

Ways to reduce High Bias:


High bias mainly occurs due to an overly simple model. Below are some ways to reduce
high bias:

o Increase the input features as the model is underfitted.


o Decrease the regularization term.
o Use more complex models, such as including some polynomial features.

What is a Variance Error?


Variance specifies the amount by which the prediction would change if different training
data were used. In simple words, variance tells how much a random variable
differs from its expected value. Ideally, a model should not vary too much from one
training dataset to another, which means the algorithm should be good at understanding the
hidden mapping between input and output variables. Variance errors are either of low
variance or high variance.

Low variance means there is a small variation in the prediction of the target function with
changes in the training data set. At the same time, High variance shows a large variation in
the prediction of the target function with changes in the training dataset.

A model that shows high variance learns a lot and performs well with the training dataset,
but does not generalize well to an unseen dataset. As a result, such a model gives good
results with the training dataset but shows high error rates on the test dataset.

Since, with high variance, the model learns too much from the dataset, it leads to overfitting
of the model. A model with high variance has the below problems:

o A high variance model leads to overfitting.


o It increases model complexity.

Usually, nonlinear algorithms, which have a lot of flexibility to fit the model, have high variance.
Some examples of machine learning algorithms with low variance are, Linear Regression,
Logistic Regression, and Linear discriminant analysis. At the same time, algorithms
with high variance are decision tree, Support Vector Machine, and K-nearest
neighbours.

Ways to Reduce High Variance:

o Reduce the input features or the number of parameters, as the model is overfitted.
o Do not use an overly complex model.
o Increase the training data.
o Increase the regularization term.

Different Combinations of Bias-Variance


There are four possible combinations of bias and variances, which are represented by the
below diagram:
1. Low-Bias, Low-Variance:
The combination of low bias and low variance gives an ideal machine learning
model. However, it is practically not possible.
2. Low-Bias, High-Variance: With low bias and high variance, model predictions are
inconsistent but accurate on average. This case occurs when the model learns with
a large number of parameters and hence leads to overfitting.
3. High-Bias, Low-Variance: With high bias and low variance, predictions are
consistent but inaccurate on average. This case occurs when a model does not learn
well from the training dataset or uses too few parameters. It leads
to underfitting problems in the model.
4. High-Bias, High-Variance:
With high bias and high variance, predictions are inconsistent and also inaccurate on
average.

How to identify High variance or High Bias?


High variance can be identified if the model has:

o Low training error and high test error.

High Bias can be identified if the model has:

o High training error, and the test error is almost similar to the training error (see the sketch below).
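
A minimal sketch of this diagnostic, comparing training and test error; the synthetic dataset and the choice of a flexible decision tree are assumptions for illustration:

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeRegressor().fit(X_train, y_train)            # a flexible, low-bias model
train_err = mean_squared_error(y_train, model.predict(X_train))
test_err = mean_squared_error(y_test, model.predict(X_test))
print(train_err, test_err)   # low training error with much higher test error suggests high variance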
Bias-Variance Trade-Off
While building the machine learning model, it is really important to take care of bias and
variance in order to avoid overfitting and underfitting in the model. If the model is very
simple with fewer parameters, it may have low variance and high bias. Whereas, if the
model has a large number of parameters, it will have high variance and low bias. So, it is
required to make a balance between bias and variance errors, and this balance between
the bias error and variance error is known as the Bias-Variance trade-off.

For an accurate prediction of the model, algorithms need a low variance and low bias. But
this is not possible because bias and variance are related to each other:

o If we decrease the variance, it will increase the bias.


o If we decrease the bias, it will increase the variance.

The bias-variance trade-off is a central issue in supervised learning. Ideally, we need a model
that accurately captures the regularities in the training data and simultaneously generalizes well
to unseen data. Unfortunately, doing both simultaneously is not possible: a
high variance algorithm may perform well on training data, but it may overfit to
noisy data, whereas a high bias algorithm generates an overly simple model that may not
even capture important regularities in the data. So, we need to find a sweet spot between
bias and variance to make an optimal model.

Overfitting and Underfitting in Machine Learning


Overfitting and Underfitting are the two main problems that occur in machine learning and
degrade the performance of the machine learning models.

The main goal of each machine learning model is to generalize well.

Here, generalization defines the ability of an ML model to provide a suitable output when
given a set of unseen inputs. It means that after training on the dataset, the model
can produce reliable and accurate output. Hence, underfitting and overfitting are the two
terms that need to be checked to judge the performance of the model and whether the model is
generalizing well or not.

Before understanding the overfitting and underfitting, let's understand some basic term that
will help to understand this topic well:

o Signal: It refers to the true underlying pattern of the data that helps the machine learning
model to learn from the data.
o Noise: Noise is unnecessary and irrelevant data that reduces the performance of the
model.
o Bias: Bias is a prediction error that is introduced in the model due to oversimplifying the
machine learning algorithms. Or it is the difference between the predicted values and the
actual values.
o Variance: If the machine learning model performs well with the training dataset, but
does not perform well with the test dataset, then variance occurs.

Overfitting
Overfitting occurs when our machine learning model tries to cover all the data points, or
more than the required data points, present in the given dataset. Because of this, the model
starts capturing the noise and inaccurate values present in the dataset, and all these factors
reduce the efficiency and accuracy of the model. The overfitted model has low
bias and high variance.

The chances of overfitting increase the more we train our model.

Overfitting is the main problem that occurs in supervised learning.

Example: The concept of the overfitting can be understood by the below graph of the linear
regression output:

As we can see from the above graph, the model tries to cover all the data points present in
the scatter plot. It may look efficient, but in reality, it is not. Because the goal of the
regression model is to find the best-fit line, and here we have not got a true best fit, the model
will generate prediction errors.

How to avoid the Overfitting in Model


Both overfitting and underfitting cause degraded performance of the machine learning
model. But the main cause is overfitting, so there are some ways by which we can reduce
the occurrence of overfitting in our model (a small sketch follows this list):

o Cross-Validation
o Training with more data
o Removing features
o Early stopping the training
o Regularization
o Ensembling
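
A minimal sketch of two of these remedies, cross-validation and regularization (ridge regression); the synthetic dataset and the alpha value are assumptions for illustration:

from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=200, n_features=20, noise=15, random_state=0)

# Regularization: the alpha term penalizes large coefficients and limits overfitting
model = Ridge(alpha=1.0)

# Cross-validation: evaluate on several held-out folds instead of a single split
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())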

Underfitting
Underfitting occurs when our machine learning model is not able to capture the underlying
trend of the data. To avoid overfitting in the model, the feeding of training data can be
stopped at an early stage, due to which the model may not learn enough from the training
data. As a result, it may fail to find the best fit for the dominant trend in the data.

In the case of underfitting, the model is not able to learn enough from the training data, and
hence it reduces the accuracy and produces unreliable predictions.

An underfitted model has high bias and low variance.

Example: We can understand the underfitting using below output of the linear regression
model:

As we can see from the above diagram, the model is unable to capture the data points
present in the plot.
How to avoid underfitting:

o By increasing the training time of the model.


o By increasing the number of features.

Goodness of Fit
The "Goodness of fit" term is taken from the statistics, and the goal of the machine learning
models to achieve the goodness of fit. In statistics modeling, it defines how closely the
result or predicted values match the true values of the dataset.

The model with a good fit is between the underfitted and overfitted model, and ideally, it
makes predictions with 0 errors, but in practice, it is difficult to achieve it.

As we train our model for a time, the errors on the training data go down, and the
same happens initially with the test data. But if we train the model for too long, the
performance of the model may decrease due to overfitting, as the model also learns the
noise present in the dataset. The errors on the test dataset then start increasing, so the point
just before the errors rise is the good point, and we can stop there to achieve a good
model.

There are two other methods by which we can find a good point for our model: the
resampling method to estimate model accuracy, and the use of a validation dataset.
UNIT - II

DATA PREPROCESSING

Data preprocessing is a process of preparing the raw data and making it suitable for a
machine learning model. It is the first and crucial step while creating a machine learning
model.

When creating a machine learning project, it is not always the case that we come across
clean and formatted data. And while doing any operation with data, it is mandatory to clean
it and put it in a formatted way. So for this, we use the data preprocessing task.

Why do we need Data Preprocessing?


Real-world data generally contains noise and missing values, and may be in an unusable
format which cannot be directly used for machine learning models. Data preprocessing is
the required task for cleaning the data and making it suitable for a machine learning model,
which also increases the accuracy and efficiency of the machine learning model.

It involves below steps:

o Getting the dataset


o Importing libraries
o Importing datasets
o Finding Missing Data
o Encoding Categorical Data
o Splitting dataset into training and test set
o Feature scaling

1) Get the Dataset


To create a machine learning model, the first thing we require is a dataset, as a machine
learning model works entirely on data. The collected data for a particular problem in a
proper format is known as the dataset.

Datasets may come in different formats for different purposes; for example, if we want to create a
machine learning model for a business purpose, the dataset will be different from the dataset
required for a liver-patient problem. So each dataset is different from another dataset. To use the
dataset in our code, we usually put it into a CSV file. However, sometimes, we may also
need to use an HTML or xlsx file.

What is a CSV File?


CSV stands for "Comma-Separated Values"; it is a file format which allows us to save
tabular data, such as spreadsheets. It is useful for huge datasets, and we can use these
datasets in programs.
Here we will use a demo dataset for data preprocessing; for practice, it can be
downloaded from "https://www.superdatascience.com/pages/machine-learning". For
real-world problems, we can download datasets online from various sources such
as https://www.kaggle.com/uciml/datasets, https://archive.ics.uci.edu/ml/index.php, etc.

We can also create our dataset by gathering data using various API with Python and put
that data into a .csv file.

2) Importing Libraries
In order to perform data preprocessing using Python, we need to import some predefined
Python libraries. These libraries are used to perform some specific jobs. There are three
specific libraries that we will use for data preprocessing, which are:

Numpy: The Numpy Python library is used for including any type of mathematical operation in
the code. It is the fundamental package for scientific calculation in Python. It also supports
large, multidimensional arrays and matrices. So, in Python, we can import it as:

1. import numpy as nm
Here we have used nm, which is a short name for Numpy, and it will be used in the whole
program.

Matplotlib: The second library is matplotlib, which is a Python 2D plotting library; with
this library, we need to import its sub-library pyplot. This library is used to plot any type of
chart in Python. It will be imported as below:

1. import matplotlib.pyplot as mpt


Here we have used mpt as a short name for this library.

Pandas: The last library is the Pandas library, which is one of the most famous Python
libraries and is used for importing and managing datasets. It is an open-source data
manipulation and analysis library. It will be imported as below:

1. import pandas as pd

Here, we have used pd as a short name for this library. Consider the below image:

3) Importing the Datasets


Now we need to import the datasets which we have collected for our machine learning
project. But before importing a dataset, we need to set the current directory as a working
directory. To set a working directory in Spyder IDE, we need to follow the below steps:

1. Save your Python file in the directory which contains the dataset.
2. Go to the File explorer option in Spyder IDE, and select the required directory.
3. Press the F5 button or click the run option to execute the file.
Note: We can set any directory as a working directory, but it must contain the required dataset.

Here, in the below image, we can see the Python file along with required dataset. Now, the
current folder is set as a working directory.

read_csv() function:

Now to import the dataset, we will use the read_csv() function of the pandas library, which is
used to read a csv file and perform various operations on it. Using this function, we can read a
csv file locally as well as through a URL.

We can use read_csv function as below:

1. data_set= pd.read_csv('Dataset.csv')
Here, data_set is a name of the variable to store our dataset, and inside the function, we
have passed the name of our dataset. Once we execute the above line of code, it will
successfully import the dataset in our code. We can also check the imported dataset by
clicking on the section variable explorer, and then double click on data_set. Consider the
below image:

As in the above image, indexing is started from 0, which is the default indexing in Python.
We can also change the format of our dataset by clicking on the format option.

Extracting dependent and independent variables:


In machine learning, it is important to distinguish the matrix of features (independent
variables) and dependent variables from dataset. In our dataset, there are three
independent variables that are Country, Age, and Salary, and one is a dependent variable
which is Purchased.

Extracting independent variable:

To extract an independent variable, we will use iloc[ ] method of Pandas library. It is used
to extract the required rows and columns from the dataset.

1. x= data_set.iloc[:,:-1].values
In the above code, the first colon(:) is used to take all the rows, and the second colon(:) is
for all the columns. Here we have used :-1, because we don't want to take the last column
as it contains the dependent variable. So by doing this, we will get the matrix of features.

By executing the above code, we will get output as:

1. [['India' 38.0 68000.0]


2. ['France' 43.0 45000.0]
3. ['Germany' 30.0 54000.0]
4. ['France' 48.0 65000.0]
5. ['Germany' 40.0 nan]
6. ['India' 35.0 58000.0]
7. ['Germany' nan 53000.0]
8. ['France' 49.0 79000.0]
9. ['India' 50.0 88000.0]
10. ['France' 37.0 77000.0]]
As we can see in the above output, there are only three variables.

Extracting dependent variable:

To extract dependent variables, again, we will use Pandas .iloc[] method.

1. y= data_set.iloc[:,3].values
Here we have taken all the rows with the last column only. It will give the array of dependent
variables.

By executing the above code, we will get output as:

Output:

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
dtype=object)
Note: If you are using the Python language for machine learning, then this extraction is mandatory,
but for the R language it is not required.

4) Handling Missing data:


The next step of data preprocessing is to handle missing data in the datasets. If our dataset
contains some missing data, then it may create a huge problem for our machine learning
model. Hence it is necessary to handle missing values present in the dataset.

Ways to handle missing data:

There are mainly two ways to handle missing data, which are:

By deleting the particular row: The first way is commonly used to deal with null values. In
this way, we just delete the specific row or column which consists of null values. But this
way is not very efficient, and removing data may lead to a loss of information which will not
give an accurate output.

By calculating the mean: In this way, we will calculate the mean of that column or row
which contains any missing value and will put it on the place of missing value. This strategy
is useful for the features which have numeric data such as age, salary, year, etc. Here, we
will use this approach.

To handle missing values, we will use Scikit-learn library in our code, which contains
various libraries for building machine learning models. Here we will use Imputer class
of sklearn.preprocessing library. Below is the code for it:

1. #handling missing data (Replacing missing data with the mean value)
2. from sklearn.preprocessing import Imputer
3. imputer= Imputer(missing_values ='NaN', strategy='mean', axis = 0)
4. #Fitting imputer object to the independent variables x.
5. imputer= imputer.fit(x[:, 1:3])
6. #Replacing missing data with the calculated mean value
7. x[:, 1:3]= imputer.transform(x[:, 1:3])
Output:

array([['India', 38.0, 68000.0],


['France', 43.0, 45000.0],
['Germany', 30.0, 54000.0],
['France', 48.0, 65000.0],
['Germany', 40.0, 65222.22222222222],
['India', 35.0, 58000.0],
['Germany', 41.111111111111114, 53000.0],
['France', 49.0, 79000.0],
['India', 50.0, 88000.0],
['France', 37.0, 77000.0]], dtype=object
As we can see in the above output, the missing values have been replaced with the mean
of the remaining column values.
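
Note that the Imputer class shown above belongs to older scikit-learn releases; in recent versions it has been replaced by SimpleImputer in sklearn.impute. A roughly equivalent sketch for newer versions:

import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
x[:, 1:3] = imputer.fit_transform(x[:, 1:3])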

5) Encoding Categorical data:


Categorical data is data which has some categories; in our dataset, there are two
categorical variables, Country and Purchased.

Since a machine learning model works entirely on mathematics and numbers, if our
dataset has a categorical variable, it may create trouble while building the
model. So it is necessary to encode these categorical variables into numbers.

For Country variable:

Firstly, we will encode the country variable into numeric form. To do this, we will
use the LabelEncoder() class from the preprocessing library.

1. #Catgorical data
2. #for Country Variable
3. from sklearn.preprocessing import LabelEncoder
4. label_encoder_x= LabelEncoder()
5. x[:, 0]= label_encoder_x.fit_transform(x[:, 0])
Output:

Out[15]:
array([[2, 38.0, 68000.0],
[0, 43.0, 45000.0],
[1, 30.0, 54000.0],
[0, 48.0, 65000.0],
[1, 40.0, 65222.22222222222],
[2, 35.0, 58000.0],
[1, 41.111111111111114, 53000.0],
[0, 49.0, 79000.0],
[2, 50.0, 88000.0],
[0, 37.0, 77000.0]], dtype=object)

Explanation:

In the above code, we have imported the LabelEncoder class of the sklearn library. This class
has successfully encoded the variable into digits.

But in our case, the Country column has three categories, and as we can see in the above output,
these categories are encoded into 0, 1, and 2. With these values, the machine learning model
may assume that there is some ordering or correlation between these categories, which will
produce the wrong output. So to remove this issue, we will use dummy encoding.

Dummy Variables:
Dummy variables are variables which take the values 0 or 1. The value 1 indicates the
presence of that category in a particular column, and the rest of the variables become 0. With
dummy encoding, we will have a number of columns equal to the number of categories.

In our dataset, we have 3 categories so it will produce three columns having 0 and 1 values.
For Dummy Encoding, we will use OneHotEncoder class of preprocessing library.

1. #for Country Variable


2. from sklearn.preprocessing import LabelEncoder, OneHotEncoder
3. label_encoder_x= LabelEncoder()
4. x[:, 0]= label_encoder_x.fit_transform(x[:, 0])
5. #Encoding for dummy variables
6. onehot_encoder= OneHotEncoder(categorical_features= [0])
7. x= onehot_encoder.fit_transform(x).toarray()
Output:

array([[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.80000000e+01,


6.80000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.30000000e+01,
4.50000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 3.00000000e+01,
5.40000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.80000000e+01,
6.50000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.00000000e+01,
6.52222222e+04],
[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.50000000e+01,
5.80000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.11111111e+01,
5.30000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.90000000e+01,
7.90000000e+04],
[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 5.00000000e+01,
8.80000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.70000000e+01,
7.70000000e+04]])

As we can see in the above output, the country variable has been encoded into 0s and 1s and
split across three dummy columns.

It can be seen more clearly in the variable explorer section by clicking on the x option.
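Note: the categorical_features argument of OneHotEncoder has been removed in newer scikit-learn
releases. A minimal sketch of the same dummy encoding with the current API (ColumnTransformer),
assuming a small hypothetical array x whose first column holds the country label:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

x = np.array([['India', 38.0, 68000.0],
              ['France', 43.0, 45000.0],
              ['Germany', 30.0, 54000.0]], dtype=object)

ct = ColumnTransformer(
    transformers=[('country', OneHotEncoder(), [0])],   # one-hot encode column 0
    remainder='passthrough')                             # keep age and salary unchanged
x = ct.fit_transform(x)
print(x)   # dummy columns followed by the original numeric columns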
For Purchased Variable:

1. labelencoder_y= LabelEncoder()
2. y= labelencoder_y.fit_transform(y)
For the second categorical variable, we will only use the labelencoder object
of the LabelEncoder class. Here we are not using the OneHotEncoder class because the
purchased variable has only two categories, yes or no, which are automatically encoded
into 0 and 1.

Output:

Out[17]: array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

It can also be seen in the variable explorer section.


6) Splitting the Dataset into the Training set and Test set
In machine learning data preprocessing, we divide our dataset into a training set and test
set. This is one of the crucial steps of data preprocessing as by doing this, we can enhance
the performance of our machine learning model.

Suppose we train our machine learning model on one dataset and then test it on a
completely different dataset. This will create difficulties for the model in
understanding the correlations between the variables.

If we train our model very well and its training accuracy is very high, but then we provide a
new dataset to it, its performance will decrease. So we always try to build a
machine learning model which performs well with the training set and also with the test
dataset. Here, we can define these datasets as:
Training Set: A subset of dataset to train the machine learning model, and we already
know the output.

Test set: A subset of dataset to test the machine learning model, and by using the test set,
model predicts the output.

For splitting the dataset, we will use the below lines of code:

1. from sklearn.model_selection import train_test_split


2. x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)
Explanation:

o In the above code, the first line imports the function used for splitting arrays of the dataset
into random train and test subsets.
o In the second line, we have used four variables for our output that are
o x_train: features for the training data
o x_test: features for testing data
o y_train: Dependent variables for training data
o y_test: Dependent variable for testing data
o In the train_test_split() function, we have passed four arguments, of which the first two are
the data arrays, and test_size specifies the size of the test set. The test_size
may be .5, .3, or .2, which sets the dividing ratio between the training and testing sets.
o The last parameter random_state is used to set a seed for a random generator so that
you always get the same result, and the most used value for this is 42.
Output:

By executing the above code, we will get 4 different variables, which can be seen under the
variable explorer section.
As we can see in the variable explorer, the x and y variables are divided into 4 different
variables with corresponding values.

7) Feature Scaling
Feature scaling is the final step of data preprocessing in machine learning. It is a technique
to standardize the independent variables of the dataset to a specific range. In feature
scaling, we put our variables in the same range and on the same scale so that no
variable dominates the others.

Consider the below dataset:


As we can see, the age and salary column values are not on the same scale. Many machine
learning models are based on Euclidean distance, and if we do not scale the variables, it
will cause issues in our machine learning model.

The Euclidean distance between two points A(x1, y1) and B(x2, y2) is given as:

d(A, B) = sqrt( (x2 - x1)^2 + (y2 - y1)^2 )

If we compute any two values from age and salary, then salary values will dominate the age
values, and it will produce an incorrect result. So to remove this issue, we need to perform
feature scaling for machine learning.

There are two ways to perform feature scaling in machine learning:

Standardization

Normalization
Here, we will use the standardization method for our dataset.
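As a quick reference, the two scaling methods can be written as:

Standardization: x' = (x - mean(x)) / standard_deviation(x)

Normalization (min-max scaling): x' = (x - min(x)) / (max(x) - min(x))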

For feature scaling, we will import StandardScaler class of sklearn.preprocessing library


as:

1. from sklearn.preprocessing import StandardScaler


Now, we will create the object of StandardScaler class for independent variables or
features. And then we will fit and transform the training dataset.

1. st_x= StandardScaler()
2. x_train= st_x.fit_transform(x_train)
For the test dataset, we will directly apply the transform() function instead
of fit_transform() because the scaler is already fitted on the training set.

1. x_test= st_x.transform(x_test)
Output:

By executing the above lines of code, we will get the scaled values for x_train and x_test as:

x_train:
x_test:
As we can see in the above output, all the variables are scaled to a common scale centred around 0.

Data Cleaning
Data cleaning is the process of correcting or deleting inaccurate, damaged, improperly
formatted, duplicated, or insufficient data from a dataset. Even if results and algorithms
appear to be correct, they are unreliable if the data is inaccurate. There are numerous ways
for data to be duplicated or incorrectly labeled when merging multiple data sources.

Steps for Cleaning Data


The techniques employed may vary depending on the sorts of data your firm stores, but you can
follow these fundamental stages to clean your data:

1. Remove duplicate or irrelevant observations


Remove duplicate or pointless observations as well as undesirable observations from your
dataset. The majority of duplicate observations occur during data gathering. Duplicate
data can be produced when you merge data sets from several sources, scrape data, or receive
data from clients or other departments. De-duplication is one of the most important factors
to take into account in this procedure. Observations are deemed irrelevant when they do not
pertain to the particular issue you are attempting to analyze.

You might eliminate those useless observations, for instance, if you wish to analyze data on
millennial clients but your dataset also includes observations from earlier generations. This
can improve the analysis's efficiency, reduce deviance from your main objective, and
produce a dataset that is easier to maintain and use.
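A minimal pandas sketch of this step, using a small hypothetical DataFrame with a birth_year
column to stand in for the millennial example above:

import pandas as pd

df = pd.DataFrame({'customer': ['A', 'A', 'B', 'C'],
                   'birth_year': [1990, 1990, 1975, 1988]})

df = df.drop_duplicates()              # remove exact duplicate observations
df = df[df['birth_year'] >= 1981]      # drop observations irrelevant to the analysis
print(df)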

2. Fix structural errors


Odd naming conventions, typos, or wrong capitalization found when you measure or transfer
data are structural errors. These inconsistencies may result in mislabelled categories or
classes. For instance, "N/A" and "Not Applicable" might both be present on any
given sheet, but they ought to be analyzed under the same heading.
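A minimal pandas sketch of fixing such structural errors, assuming a hypothetical status column:

import pandas as pd

df = pd.DataFrame({'status': [' active', 'Active', 'N/A', 'not applicable']})
df['status'] = df['status'].str.strip().str.title()             # fix stray spaces and capitalization
df['status'] = df['status'].replace({'N/A': 'Not Applicable'})  # merge equivalent labels
print(df['status'].unique())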

3. Filter unwanted outliers


There will frequently be isolated findings that, at first glance, do not seem to fit the data you
are analyzing. Removing an outlier if you have a good reason to, such as incorrect data
entry, will improve the performance of the data you are working with.

However, occasionally the emergence of an outlier will support a theory you are
investigating, and just because there is an outlier doesn't necessarily mean it is
inaccurate. This step is necessary to determine the reliability of the number. If an outlier
turns out to be incorrect or unimportant for the analysis, you might want to remove it.
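A minimal sketch of one common way to flag outliers (the 1.5 x IQR rule), using a hypothetical
salary column; whether to actually remove the flagged value remains the analyst's call, as
discussed above:

import pandas as pd

df = pd.DataFrame({'salary': [45000, 54000, 58000, 65000, 68000, 77000, 900000]})

q1, q3 = df['salary'].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df['salary'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # True for values inside the fences
print(df[~mask])    # 900000 is flagged as a potential outlier
print(df[mask])     # data with the outlier filtered out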

4. Handle missing data


Because many algorithms won't tolerate missing values, you can't overlook missing data.
There are a few options for handling missing data. None of them is ideal, but all can be
considered, for example:

Although you can remove observations with missing values, doing so will result in the loss
of information, so proceed with caution.

Again, there is a chance to undermine the integrity of the data since you can be working
from assumptions rather than actual observations when you input missing numbers based
on other observations.

As a third option, you might need to change the way the data is used so that null values can be handled effectively.
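A minimal pandas sketch of the first two options, on a hypothetical DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [38, np.nan, 30], 'salary': [68000, 45000, np.nan]})

dropped = df.dropna()                            # option 1: drop observations with missing values
filled = df.fillna(df.mean(numeric_only=True))   # option 2: impute from the other observations
print(dropped)
print(filled)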

5. Validate and QA
As part of fundamental validation, you ought to be able to respond to the following queries
once the data cleansing procedure is complete:

o Are the data coherent?


o Does the data abide by the regulations that apply to its particular field?
o Does it support or refute your working theory? Does it offer any new information?
o To support your next theory, can you identify any trends in the data?
o If not, is there a problem with the data's quality?
Inaccurate or noisy data can lead to false conclusions, which in turn inform poor company
strategy and decision-making. False conclusions can also result in an embarrassing situation in a
reporting meeting when you find out your data couldn't withstand further investigation.
Before you get there, it is crucial to establish a culture of quality data in your organization,
and to document the tools you might employ to develop this plan.

Techniques for Cleaning Data


The data should be passed through one of the various data-cleaning procedures available.
The procedures are explained below:

1. Ignore the tuples: This approach is only suitable when a tuple has several attributes
with missing values, so it is not very practical.
2. Fill in the missing value: This strategy can be time-consuming and is not always
practical or effective. The missing value has to be supplied; the most common method
is manual entry, but other options include using the attribute mean or the most
likely value.
3. Binning method: This strategy is fairly easy to comprehend. The sorted data is
smoothed using the values around it: the data is split into several equal-sized bins,
and various techniques (such as replacing each value with its bin mean) are then used
to finish the smoothing. A short sketch follows this list.
4. Regression: With the use of a regression function, the data is smoothed out.
Regression may be linear or multiple; multiple regression has more than one
independent variable, whereas linear regression has only one.
5. Clustering: This technique focuses mostly on the group. Data are grouped using
clustering. After that, clustering is used to find the outliers. After that, the comparable
values are grouped into a "group" or "cluster".
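A minimal sketch of smoothing by bin means (the binning method referenced above), on a small
hypothetical list of prices:

import numpy as np

prices = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))

bins = prices.reshape(4, 3)                 # four equal-sized bins of the sorted values
smoothed = np.repeat(bins.mean(axis=1), 3)  # replace each value with its bin mean
print(smoothed)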
Process of Data Cleaning
The data cleaning method for data mining is demonstrated in the subsequent sections.

1. Monitoring the errors: Keep track of the areas where errors seem to occur most
frequently. It will be simpler to identify and maintain inaccurate or corrupt information.
This information is particularly important when integrating a potential substitute with
your current management software.
2. Standardize the mining process: To help lower the likelihood of duplicity,
standardize the place of insertion.
3. Validate data accuracy: Analyse the data and invest in data cleaning
software. Artificial intelligence-based tools can be utilized to thoroughly check for
accuracy.
4. Scrub for duplicate data: To save time when analyzing data, find duplicates. By
analyzing and investing in independent data-erasing tools that can analyze
imperfect data in quantity and automate the operation, it is possible to avoid
processing the same data again.
5. Research on data: Our data needs to be vetted, standardized, and duplicate-
checked before this action. There are numerous third-party sources, and these
vetted and approved sources can extract data straight from our databases. They
assist us in gathering the data and cleaning it up so that it is reliable, accurate, and
comprehensive for use in business decisions.
6. Communicate with the team: Keeping the group informed will help with client
development and strengthening as well as giving more focused information to
potential clients.

Usage of Data Cleaning in Data Mining.


The following are some examples of how data cleaning is used in data mining:
o Data Integration: Since it is challenging to guarantee quality with low-quality data, data
integration is crucial in resolving this issue. The process of merging information from
various data sets into one is known as data integration. Before transferring to the
ultimate location, this step makes sure that the embedded data set is standardized and
formatted using data cleansing technologies.
o Data Migration: The process of transferring a file from one system, format, or
application to another is known as data migration. To ensure that the resulting data has
the correct format, structure, and consistency without any delicacy at the destination, it is
crucial to maintain the data's quality, security, and consistency while it is in transit.
o Data Transformation: The data must be changed before being uploaded to a location.
Data cleansing, which takes into account system requirements for formatting, organizing,
etc., is the only method that can achieve this. Before conducting additional analysis, data
transformation techniques typically involve the use of rules and filters. Most data
integration and data management methods include data transformation as a necessary
step. Utilizing the systems' internal transformations, data cleansing tools assist in
cleaning the data.
o Data Debugging in ETL Processes: To prepare data for reporting and analysis
throughout the extract, transform, and load (ETL) process, data cleansing is essential.
Only high-quality data are used for decision-making and analysis thanks to data
purification.
Cleaning data is essential. For instance, a retail business could receive inaccurate or
duplicate data from different sources, including CRM or ERP systems. A reliable data
debugging tool would find and fix data discrepancies. The cleaned information will be
transformed into a common format and transferred to the intended database.
Characteristics of Data Cleaning
To ensure the correctness, integrity, and security of corporate data, data cleaning is a
requirement. These may be of varying quality depending on the properties or attributes of
the data. The key components of data cleansing in data mining are as follows:

o Accuracy: The business's database must contain only extremely accurate data.
Comparing them to other sources is one technique to confirm their veracity. The stored
data will also have issues if the source cannot be located or contains errors.
o Coherence: To ensure that the information on a person or body is the same throughout
all types of storage, the data must be consistent with one another.
o Validity: There must be rules or limitations in place for the stored data. The information
must also be confirmed to support its veracity.
o Uniformity: The data in a database must all share the same units or scales. This keeps the
process from becoming complicated, which makes it a crucial component of the data cleansing
process.
o Data Verification: Every step of the process, including its appropriateness and
effectiveness, must be checked. The study, design, and validation stages all play a role
in the verification process. The disadvantages are frequently obvious after applying the
data to a specific number of changes.
o Clean Data Backflow: After quality issues have been addressed, the cleaned data should flow
back to replace the dirty data in the original source, so that legacy applications can
profit from it and a subsequent data-cleaning program is not needed.

Tools for Data Cleaning in Data Mining


Data Cleansing Tools can be very helpful if you are not confident of cleaning the data
yourself or have no time to clean up all your data sets. You might need to invest in those
tools, but it is worth the expenditure. There are many data cleaning tools in the market.
Here are some top-ranked data cleaning tools, such as:

1. OpenRefine
2. Trifacta Wrangler
3. Drake
4. Data Ladder
5. Data Cleaner
6. Cloudingo
7. Reifier
8. IBM Infosphere Quality Stage
9. TIBCO Clarity
10. Winpure

Benefits of Data Cleaning


When you have clean data, you can make decisions using the highest-quality information
and eventually boost productivity. The following are some important advantages of data
cleaning in data mining, including:

o Removal of inaccuracies when several data sources are involved.


o Clients are happier and employees are less annoyed when there are fewer mistakes.
o The capacity to map out the many functions and the planned uses of your data.
o Monitoring mistakes and improving reporting make it easier to resolve inaccurate or
damaged data for future applications by allowing users to identify where issues are
coming from.
o Making decisions more quickly and with greater efficiency will be possible with the use of
data cleansing tools.

Redundancy and Correlation in Data Mining


In this article, we will learn Redundancy and correlation in data mining with some examples.
What is Data Redundancy?
In data mining, during data integration, many data stores are used. It may lead to data
redundancy. An attribute is known as redundant if it can be derived from any set of
attributes. Let us consider we have a set of data where there are 20 attributes. Now
suppose that out of 20, an attribute can be derived from some of the other set of attributes.
Such attributes that can be derived from other sets of attributes are called Redundant
attributes. Inconsistencies in attribute or dimension naming may lead to redundancies in the
set of data.

Let's understand this concept with the help of an example.

Suppose we have a data set that has three attributes - pizza_name, is_veg, is_nonveg

1. Is_veg is 1 if the selected pizza is veg; otherwise, it is 0.

2. Is_nonveg is 1 if the selected pizza is non-veg; otherwise, it is 0.

Pizza_name Is_veg Is_nonveg

Farm House 1 0

Veg Loaded 1 0

Chicken Sausage 0 1

Non-Veg Supreme 0 1

Chicken Fiesta 0 1

Veg Extravaganza 1 0

Deluxe Veggie 1 0

On analyzing the above table, we have found that if a pizza is not veg (i.e., is_veg is 0 for
the selected pizza_name), the pizza is surely non-veg (since there are only two possible
categories, veg and non-veg). Hence, one of these attributes is redundant: the two attributes
are very strongly related to each other, and one attribute can be derived from the other. So,
you can drop either the first or the second attribute without any loss of information.
Detection of Data Redundancy
The following method is used to detect the redundancies:

1. X2 Test
2. Correlation coefficient and covariance

X2 Test
The X2 (chi-square) test is used for qualitative, nominal, or categorical data. Suppose we have
two attributes X and Y in the set of data. To represent the data tuples, you have to build a
contingency table.

The formula used for the X2 test is:

X2 = Σ ( (Observed − Expected)^2 / Expected )

Where,

Observed values are the actual counts.

Expected values are the counts acquired from the contingency table joint events (i.e., the counts expected under independence).

The X2 test examines the hypothesis that X and Y are independent. If this hypothesis can be
rejected, we can assume that X and Y are statistically related to each other, and we can
drop any one of them (either X or Y) as redundant.
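A minimal sketch of running this test in code, using scipy's chi2_contingency on a hypothetical
contingency table for X and Y:

import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[250, 200],     # hypothetical contingency table:
                  [ 50, 1000]])   # rows = values of X, columns = values of Y

chi2, p_value, dof, expected = chi2_contingency(table)
print(chi2, p_value)   # a very small p-value lets us reject independence of X and Y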

The correlation coefficient for Numeric data


In the case of numeric data, this test is used. In this test, the relation between the A attribute
and B attribute is computed by Pearson's product-moment coefficient, also called the
correlation coefficient. A correlation coefficient measures the extent to which the value of
one variable changes with another. The best known are Pearson's and Spearman's rank-
order. The first is used where both variables are continuous, the second where at least one
represents a rank.

There are several different correlation coefficients, each of which is appropriate for different
types of data. The most common is the Pearson r, used for continuous variables. It is a
statistic that measures the degree to which one variable varies in tandem with another. It
ranges from -1 to +1. A +1 correlation means that as one variable rises, the other rises
proportionally; a -1 correlation means that as one rises, the other falls proportionally. A 0
correlation means that there is no relationship between the movements of the two variables.

The formula used for numeric data is given below:

r(A, B) = Σ_{i=1..n} (a_i − mean(A)) (b_i − mean(B)) / (n · σ_A · σ_B)

Where,

n = number of tuples

a_i = value of attribute A in tuple i

b_i = value of attribute B in tuple i

σ_A, σ_B = standard deviations of A and B

From the above discussion, we can say that the greater the magnitude of the correlation
coefficient, the more strongly the attributes are correlated to each other, and we can drop
any one of them (either A or B). If the value of the correlation coefficient is zero, the
attributes have no linear relationship. If the value of the correlation coefficient is
negative, one attribute discourages the other: as the value of one attribute increases, the
value of the other attribute decreases.
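A minimal sketch of computing the Pearson correlation coefficient with NumPy, on two
hypothetical numeric attributes:

import numpy as np

a = np.array([38, 43, 30, 48, 40], dtype=float)   # hypothetical attribute A (e.g., age)
b = np.array([68, 45, 54, 65, 65], dtype=float)   # hypothetical attribute B (e.g., salary in thousands)

r = np.corrcoef(a, b)[0, 1]   # Pearson product-moment correlation coefficient
print(r)                      # values near +1 or -1 suggest redundancy; near 0, no linear relation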

What is Dimensionality Reduction?


The number of input features, variables, or columns present in a given dataset is known as
dimensionality, and the process to reduce these features is called dimensionality reduction.

In many cases a dataset contains a huge number of input features, which makes the
predictive modeling task more complicated. Because it is very difficult to visualize or make
predictions for a training dataset with a high number of features, dimensionality reduction
techniques are required in such cases.

Dimensionality reduction technique can be defined as, "It is a way of converting the
higher dimensions dataset into lesser dimensions dataset ensuring that it provides
similar information." These techniques are widely used in machine learning for obtaining
a better fit predictive model while solving the classification and regression problems.

It is commonly used in the fields that deal with high-dimensional data, such as speech
recognition, signal processing, bioinformatics, etc. It can also be used for data
visualization, noise reduction, cluster analysis, etc.
Benefits of applying Dimensionality Reduction
Some benefits of applying dimensionality reduction technique to the given dataset are given
below:

o By reducing the dimensions of the features, the space required to store the dataset also
gets reduced.
o Less Computation training time is required for reduced dimensions of features.
o Reduced dimensions of features of the dataset help in visualizing the data quickly.
o It removes the redundant features (if present) by taking care of multicollinearity.

Disadvantages of dimensionality Reduction


There are also some disadvantages of applying the dimensionality reduction, which are
given below:

o Some data may be lost due to dimensionality reduction.


o In the PCA dimensionality reduction technique, the number of principal components
that need to be considered is sometimes not known in advance.

Principal Component Analysis


Principal Component Analysis is an unsupervised learning algorithm that is used for the
dimensionality reduction in machine learning. It is a statistical process that converts the
observations of correlated features into a set of linearly uncorrelated features with the help
of orthogonal transformation. These new transformed features are called the Principal
Components.

PCA generally tries to find a lower-dimensional surface onto which the high-dimensional data can be
projected. PCA works by considering the variance of each attribute, because attributes with high
variance give a good split between the classes, and hence it reduces the dimensionality. Some
real-world applications of PCA are image processing, movie recommendation systems, and optimizing
the power allocation in various communication channels. It is a feature extraction technique, so it
keeps the important variables and drops the least important ones.

The PCA algorithm is based on some mathematical concepts such as variance and covariance, and eigenvalues and eigenvectors.


Principal Components in PCA


As described above, the transformed new features or the output of PCA are the Principal
Components. The number of these PCs are either equal to or less than the original features
present in the dataset. Some properties of these principal components are given below:

o The principal component must be the linear combination of the original features.
o These components are orthogonal, i.e., the correlation between a pair of variables is
zero.
o The importance of each component decreases when going from 1 to n; it means the 1st PC
has the most importance, and the nth PC will have the least importance.

Steps for PCA algorithm

1. Getting the dataset


Firstly, we need to take the input dataset and divide it into two subparts X and Y,
where X is the training set, and Y is the validation set.
2. Representing data into a structure
Now we will represent our dataset into a structure. Such as we will represent the
two-dimensional matrix of independent variable X. Here each row corresponds to the
data items, and the column corresponds to the Features. The number of columns is
the dimensions of the dataset.
3. Standardizing the data
In this step, we will standardize our dataset. Such as in a particular column, the
features with high variance are more important compared to the features with lower
variance.
If the importance of features is independent of the variance of the feature, then we
will divide each data item in a column with the standard deviation of the column.
Here we will name the matrix as Z.
4. Calculating the Covariance of Z
To calculate the covariance of Z, we will take the matrix Z, and will transpose it. After
transpose, we will multiply it by Z. The output matrix will be the Covariance matrix of
Z.
5. Calculating the Eigen Values and Eigen Vectors
Now we need to calculate the eigenvalues and eigenvectors for the resultant
covariance matrix Z. Eigenvectors of the covariance matrix are the directions of the
axes with high information, and the corresponding eigenvalues measure the amount of
variance along those directions.
6. Sorting the Eigen Vectors
In this step, we will take all the eigenvalues and will sort them in decreasing order,
which means from largest to smallest. And simultaneously sort the eigenvectors
accordingly in matrix P of eigenvalues. The resultant matrix will be named as P*.
7. Calculating the new features Or Principal Components
Here we will calculate the new features. To do this, we will multiply the P* matrix by
Z. In the resultant matrix Z*, each observation is a linear combination of the original
features, and each column of the Z* matrix is independent of the others.
8. Remove less or unimportant features from the new dataset.
The new feature set is obtained, so we will decide here what to keep and what to
remove. That is, we will only keep the relevant or important features in the new
dataset, and the unimportant features will be removed.
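In practice these steps are usually carried out by a library. A minimal sketch with
scikit-learn's PCA on a hypothetical random dataset with 5 features:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 5)                  # hypothetical dataset with 5 features

X_std = StandardScaler().fit_transform(X)   # step 3: standardize the data
pca = PCA(n_components=2)                   # step 8: keep only the 2 most important components
X_reduced = pca.fit_transform(X_std)        # steps 4-7 are handled internally

print(X_reduced.shape)                      # (100, 2)
print(pca.explained_variance_ratio_)        # variance captured by each principal component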

Applications of Principal Component Analysis

o PCA is mainly used as the dimensionality reduction technique in various AI applications


such as computer vision, image compression, etc.
o It can also be used for finding hidden patterns if data has high dimensions. Some fields
where PCA is used are Finance, data mining, Psychology, etc.

Linear Discriminant Analysis (LDA) in Machine Learning
Linear Discriminant Analysis (LDA) is one of the commonly used dimensionality
reduction techniques in machine learning to solve more than two-class classification
problems. It is also known as Normal Discriminant Analysis (NDA) or Discriminant
Function Analysis (DFA).

Whenever there is a requirement to separate two or more classes having multiple features efficiently,
the Linear Discriminant Analysis model is considered the most common technique to solve such
classification problems. For example, if we have two classes with multiple features and classify
them using a single feature only, the classes may overlap.

To overcome the overlapping issue in the classification process, we must keep increasing the number
of features.
Example:
Let's assume we have to classify two different classes having two sets of data points in a 2-
dimensional plane as shown below image:

However, it is impossible to draw a straight line in a 2-D plane that separates these data points
efficiently, but using Linear Discriminant Analysis, we can dimensionally reduce the 2-D plane into a
1-D plane. Using this technique, we can also maximize the separability between multiple classes.

How Linear Discriminant Analysis (LDA) works?


Linear Discriminant analysis is used as a dimensionality reduction technique in machine learning,
using which we can easily transform a 2-D and 3-D graph into a 1-dimensional plane.

Let's consider an example where we have two classes in a 2-D plane having an X-Y axis,
and we need to classify them efficiently. As we have already seen in the above example,
LDA enables us to draw a straight line that can completely separate the two classes of
the data points. Here, LDA uses the X-Y axes to create a new axis, separating the classes
with a straight line and projecting the data onto this new axis.

Hence, we can maximize the separation between these classes and reduce the 2-D plane
into 1-D.
To create a new axis, Linear Discriminant Analysis uses the following criteria:

o It maximizes the distance between means of two classes.


o It minimizes the variance within the individual class.
Using the above two conditions, LDA generates a new axis in such a way that it can
maximize the distance between the means of the two classes and minimizes the variation
within each class.

In other words, we can say that the new axis will increase the separation between the data
points of the two classes and plot them onto the new axis.
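A minimal sketch of this projection with scikit-learn, using the built-in iris dataset
(three classes, four features) purely as an illustrative example:

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis(n_components=2)   # project 4 features onto 2 discriminant axes
X_lda = lda.fit_transform(X, y)                    # supervised: the class labels y are used

print(X_lda.shape)   # (150, 2)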

Why LDA?

o Logistic Regression is one of the most popular classification algorithms that perform well
for binary classification but falls short in the case of multiple classification problems with
well-separated classes. At the same time, LDA handles these quite efficiently.
o LDA can also be used in data pre-processing to reduce the number of features, just as
PCA, which reduces the computing cost significantly.
o LDA is also used in face detection algorithms. In Fisherfaces, LDA is used to extract
useful data from different faces. Coupled with eigenfaces, it produces effective results.

Drawbacks of Linear Discriminant Analysis (LDA)


Although, LDA is specifically used to solve supervised classification problems for two or
more classes which are not possible using logistic regression in machine learning. But LDA
also fails in some cases where the Mean of the distributions is shared. In this case, LDA
fails to create a new axis that makes both the classes linearly separable.

To overcome such problems, we use non-linear Discriminant analysis in machine


learning.

Extension to Linear Discriminant Analysis (LDA)


Linear Discriminant analysis is one of the most simple and effective methods to solve
classification problems in machine learning. It has so many extensions and variations as
follows:

1. Quadratic Discriminant Analysis (QDA): For multiple input variables, each class
deploys its own estimate of variance.
2. Flexible Discriminant Analysis (FDA): It is used when non-linear combinations
of inputs are used, such as splines.
3. Regularized Discriminant Analysis (RDA): This uses regularization in the estimate of
the variance (actually covariance) and hence moderates the influence of different
variables on LDA.
Unit – III

Linearly separable and nonlinearly separable populations

Linear separability is an important concept in machine learning, particularly in the field of


supervised learning. It refers to the ability of a set of data points to be separated into distinct
categories using a linear decision boundary. In other words, if there exists a straight line
that can cleanly divide the data into two classes, then the data is said to be linearly
separable.

Linearly separable data points can be separated using a line, linear function, or flat
hyperplane. In practice, there are several methods to determine whether data is linearly
separable[3]. One method is linear programming, which defines an objective function
subjected to constraints that satisfy linear separability. Another method is to train and test
on the same data - if there is a line that separates the data points, then the accuracy or
AUC should be close to 100%. If there is no such line, then training and testing on the same
data will result in at least some error.
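A minimal sketch of the second check described above: fit a linear classifier and test it on the
same data; a training accuracy of 1.0 suggests the hypothetical points are linearly separable:

import numpy as np
from sklearn.svm import LinearSVC

X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]], dtype=float)   # hypothetical 2-D points
y = np.array([0, 0, 0, 1, 1, 1])                                              # two classes

clf = LinearSVC().fit(X, y)
print(clf.score(X, y))   # 1.0 means a separating line was found for the training data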

Non-linearity in machine learning refers to the ability of a model to capture complex


relationships in data that cannot be represented as a straight line or a simple linear
equation. In many real-world problems, the relationship between input features and the
target variable is not a linear function, making non-linear models essential for accurate
predictions.

Key Concepts of Non-Linearity

1. Linear vs. Non-Linear Models:


- Linear Models: These models assume a direct proportional relationship between
input features and the output. Examples include linear regression and logistic
regression.
- Non-Linear Models: These allow for more complex relationships. Examples
include decision trees, neural networks, and support vector machines (with non-
linear kernels).
2. Activation Functions:
- In neural networks, non-linearity is introduced through activation functions like
ReLU (Rectified Linear Unit), sigmoid, and tanh. These functions transform the
weighted sum of inputs into a non-linear output, enabling the network to learn
complex patterns.
3. Basis Functions:
- Non-linear transformations can also be achieved using basis functions that map
input features into higher-dimensional spaces. This can help linear models capture
non-linear relationships.
Applications of Non-Linearity
1. Deep Learning: Neural networks (especially deep ones) rely heavily on non-linear
activation functions to learn intricate patterns in data, making them suitable for
tasks like image recognition, natural language processing, and more.
2. Decision Trees and Ensemble Methods: Decision trees inherently create non-
linear decision boundaries by splitting data based on feature values. Ensemble
methods like Random Forests and Gradient Boosting leverage multiple decision
trees to capture complex relationships.
3. Support Vector Machines: SVMs can use non-linear kernels (like the RBF kernel)
to create complex decision boundaries in high-dimensional spaces.
4. Polynomial Regression: This is a form of regression analysis where the
relationship between the independent variable x and the dependent
variable y is modeled as an nth-degree polynomial, allowing for non-linear
relationships.
Importance of Non-Linearity

Non-linearity is crucial in machine learning because:

• Real-World Data: Most real-world data exhibit non-linear relationships, and


models that can capture these complexities tend to perform better.
• Flexibility: Non-linear models are more flexible and can adapt to different types
of data distributions, improving generalization.
In summary, non-linearity is a fundamental concept in machine learning that enables
models to learn and represent complex relationships in data, significantly enhancing their
predictive power across various applications.

Logistic Regression in Machine Learning


o Logistic regression is one of the most popular Machine Learning algorithms,
which comes under the Supervised Learning technique. It is used for predicting
the categorical dependent variable using a given set of independent variables.
o The outcome must be a categorical or discrete value, such as Yes or No, 0 or 1, True or
False, etc.; but instead of giving the exact values 0 and 1, it gives probabilistic
values which lie between 0 and 1.
o Logistic Regression is much similar to Linear Regression except for how it is
used. Linear Regression is used for solving regression problems,
whereas Logistic Regression is used for solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped
logistic function, which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something such as
whether the cells are cancerous or not, a mouse is obese or not based on its
weight, etc.
Logistic Function (Sigmoid Function):

o The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
o It maps any real value into another value within the range of 0 and 1, using the function sigmoid(z) = 1 / (1 + e^(-z)).
o The value of the logistic regression must be between 0 and 1, which cannot go beyond
this limit, so it forms a curve like the "S" form. The S-form curve is called the Sigmoid
function or the logistic function.
o In logistic regression, we use the concept of a threshold value, which decides between
the classes 0 and 1. Values above the threshold tend to 1, and values below the
threshold tend to 0.

Assumptions for Logistic Regression:

o The dependent variable must be categorical in nature.


o The independent variable should not have multi-collinearity.

Logistic Regression Equation:


The Logistic regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get Logistic Regression equations are given below:

o We know the equation of a straight line can be written as:

y = b0 + b1x1 + b2x2 + ... + bnxn

o In Logistic Regression y can be between 0 and 1 only, so let's divide the above
equation by (1-y):

y / (1 - y) ;  which is 0 for y = 0 and infinity for y = 1

o But we need a range between -[infinity] and +[infinity], so we take the logarithm of the
equation, and it becomes:

log[ y / (1 - y) ] = b0 + b1x1 + b2x2 + ... + bnxn

The above equation is the final equation for Logistic Regression.

Type of Logistic Regression:


On the basis of the categories, Logistic Regression can be classified into three types:

o Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dogs", or "sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as "low", "Medium", or "High".
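A minimal sketch of binomial logistic regression with scikit-learn, using the built-in breast
cancer dataset purely as an illustrative example:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=5000)   # binomial logistic regression
clf.fit(x_train, y_train)

print(clf.predict_proba(x_test[:3]))      # probabilistic outputs between 0 and 1
print(clf.score(x_test, y_test))          # classification accuracy on the test set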

Radial Basis Function Network


Radial Basis Function (RBF) Neural Networks are a specialized type of
Artificial Neural Network (ANN) used primarily for function approximation
tasks. Known for their distinct three-layer architecture and universal
approximation capabilities, RBF Networks offer faster learning speeds and
efficient performance in classification and regression problems.

Radial Basis Functions (RBFs) are a special category of feed-forward neural


networks comprising three layers:
1. Input Layer: Receives input data and passes it to the hidden layer.
2. Hidden Layer: The core computational layer where RBF neurons process
the data.
3. Output Layer: Produces the network’s predictions, suitable for
classification or regression tasks.
Architecture of RBF Networks
The architecture of an RBF Network typically consists of three layers:
Input Layer
• Function: After receiving the input features, the input layer sends them
straight to the hidden layer.
• Components: It is made up of the same number of neurons as the
characteristics in the input data. One feature of the input vector
corresponds to each neuron in the input layer.
Hidden Layer
• Function: This layer uses radial basis functions (RBFs) to conduct the
non-linear transformation of the input data.
• Components: Neurons in the hidden layer apply the RBF to the incoming
data. The Gaussian function is the RBF that is most frequently utilized.
• RBF Neurons: Every neuron in the hidden layer has a spread parameter
(σ) and a center, which are also referred to as prototype vectors. The
spread parameter modulates the distance between the center of an RBF
neuron and the input vector, which in turn determines the neuron’s output.
Output Layer
• Function: The output layer uses weighted sums to integrate the hidden
layer neurons’ outputs to create the network’s final output.
• Components: It is made up of neurons that combine the outputs of the
hidden layer in a linear fashion. To reduce the error between the network’s
predictions and the actual target values, the weights of these
combinations are changed during training.
Training Process of radial basis function neural network
An RBF neural network is trained in three stages: choosing the centers, determining the
spread parameters, and training the output weights.
Step 1: Selecting the Centers
• Techniques for Center Selection: Centers can be picked at random from the training
data or by applying techniques such as k-means clustering.
• K-Means Clustering: In this widely used center-selection technique, the input data
are grouped into k clusters, and the centers of these clusters are employed as the
centers of the RBF neurons.
Step 2: Determining the Spread Parameters
• The spread parameter (σ) governs each RBF neuron's area of influence and
establishes the width of the RBF.
• Calculation: The spread parameter can be adjusted manually for each neuron or set
as a constant for all neurons. A popular method is to set σ based on the separation
between the centers, frequently using a heuristic such as dividing the greatest
distance between centers by the square root of twice the number of centers.
Step 3: Training the Output Weights
• Linear Regression: The objective of linear regression techniques, which
are commonly used to estimate the output layer weights, is to minimize
the error between the anticipated output and the actual target values.
• Pseudo-Inverse Method: One popular technique for figuring out the
weights is to utilize the pseudo-inverse of the hidden layer outputs matrix
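A minimal NumPy sketch of these three training stages on hypothetical 2-D data (k-means centers,
a heuristic spread, and least-squares output weights); this is an illustrative toy implementation,
not a production RBF network:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((200, 2))                       # hypothetical 2-D inputs
y = (X[:, 0] + X[:, 1] > 1).astype(float)      # hypothetical targets

# Step 1: choose the centers with k-means
centers = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X).cluster_centers_

# Step 2: spread from the maximum distance between centers
d_max = max(np.linalg.norm(c1 - c2) for c1 in centers for c2 in centers)
sigma = d_max / np.sqrt(2 * len(centers))

# Hidden layer: Gaussian RBF activations for every input/center pair
dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
H = np.exp(-(dists ** 2) / (2 * sigma ** 2))

# Step 3: output weights by least squares (pseudo-inverse of the hidden layer outputs)
w, *_ = np.linalg.lstsq(H, y, rcond=None)
predictions = H @ w
print(predictions[:5])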
Advantages of RBF Networks
1. Universal Approximation: RBF Networks can approximate any
continuous function with arbitrary accuracy given enough neurons.
2. Faster Learning: The training process is generally faster compared to
other neural network architectures.
3. Simple Architecture: The straightforward, three-layer architecture makes
RBF Networks easier to implement and understand.
Applications of RBF Networks
• Classification: RBF Networks are used in pattern recognition
and classification tasks, such as speech recognition and image
classification.
• Regression: These networks can model complex relationships in data for
prediction tasks.
• Function Approximation: RBF Networks are effective in approximating
non-linear functions.
Example of RBF Network
Consider a dataset with two-dimensional data points from two classes. An
RBF Network trained with 20 neurons will have each neuron representing a
prototype in the input space. The network computes category scores, which
can be visualized using 3-D mesh or contour plots. Positive weights are
assigned to neurons belonging to the same category and negative weights to
those from different categories. The decision boundary can be plotted by
evaluating scores over a grid.

Support vector machine (SVM)


A support vector machine (SVM) is a supervised machine learning algorithm that
classifies data by finding an optimal line or hyperplane that maximizes the distance
between each class in an N-dimensional space

What is Kernel Method?


A set of techniques known as kernel methods are used in machine learning to address
classification, regression, and other prediction issues. They are built around the idea of
kernels, which are functions that gauge how similar two data points are to one another in a
high-dimensional feature space.

Kernel methods' fundamental premise is used to convert the input data into a high-
dimensional feature space, which makes it simpler to distinguish between classes or
generate predictions. Kernel methods employ a kernel function to implicitly map the data
into the feature space, as opposed to manually computing the feature space.

Kernel Method in SVMs


Support Vector Machines (SVMs) use kernel methods to transform the input data into a
higher-dimensional feature space, which makes it simpler to distinguish between classes or
generate predictions. Kernel approaches in SVMs work on the fundamental principle of
implicitly mapping input data into a higher-dimensional feature space without directly
computing the coordinates of the data points in that space.

The kernel function in SVMs is essential in determining the decision boundary that divides
the various classes. In order to calculate the degree of similarity between any two points in
the feature space, the kernel function computes their dot product.

The most commonly used kernel function in SVMs is the Gaussian or radial basis function
(RBF) kernel. The RBF kernel maps the input data into an infinite-dimensional feature
space using a Gaussian function. This kernel function is popular because it can capture
complex nonlinear relationships in the data.
Other types of kernel functions that can be used in SVMs include the polynomial kernel, the
sigmoid kernel, and the Laplacian kernel. The choice of kernel function depends on the
specific problem and the characteristics of the data.

Characteristics of Kernel Function


Kernel functions used in machine learning, including in SVMs (Support Vector Machines),
have several important characteristics, including:

o Mercer's condition: A kernel function must satisfy Mercer's condition to be valid.

This condition ensures that the kernel function is positive semi-definite, which
means that the kernel (Gram) matrix it produces has no negative eigenvalues.
o Positive definiteness: A kernel function is positive definite if it is always greater
than zero except for when the inputs are equal to each other.
o Non-negativity: A kernel function is non-negative, meaning that it produces non-
negative values for all inputs.
o Symmetry: A kernel function is symmetric, meaning that it produces the same
value regardless of the order in which the inputs are given.
o Reproducing property: A kernel function satisfies the reproducing property if it
can be used to reconstruct the input data in the feature space.
o Smoothness: A kernel function is said to be smooth if it produces a smooth
transformation of the input data into the feature space.

Major Kernel Function in Support Vector Machine


In Support Vector Machines (SVMs), there are several types of kernel functions that can be
used to map the input data into a higher-dimensional feature space. The choice of kernel
function depends on the specific problem and the characteristics of the data.

Here are some most commonly used kernel functions in SVMs:

Linear Kernel
A linear kernel is a type of kernel function used in machine learning, including in SVMs
(Support Vector Machines). It is the simplest and most commonly used kernel function, and
it defines the dot product between the input vectors in the original feature space.

The linear kernel can be defined as:

1. K(x, y) = x .y
Where x and y are the input feature vectors. The dot product of the input vectors is a
measure of their similarity or distance in the original feature space.

When using a linear kernel in an SVM, the decision boundary is a linear hyperplane that
separates the different classes in the feature space. This linear boundary can be useful
when the data is already separable by a linear decision boundary or when dealing with high-
dimensional data, where the use of more complex kernel functions may lead to overfitting.
Polynomial Kernel
A particular kind of kernel function utilised in machine learning, such as in SVMs, is a
polynomial kernel (Support Vector Machines). It is a nonlinear kernel function that employs
polynomial functions to transfer the input data into a higher-dimensional feature space.

One definition of the polynomial kernel is:

1. K(x, y) = (x . y + c)^d

Where x and y are the input feature vectors, c is a constant term, and d is the degree of the
polynomial. The constant term is added to the dot product of the input vectors, and the result is
raised to the degree of the polynomial.

The decision boundary of an SVM with a polynomial kernel might capture more intricate
correlations between the input characteristics because it is a nonlinear hyperplane.

The degree of nonlinearity in the decision boundary is determined by the degree of the
polynomial.

The polynomial kernel has the benefit of being able to detect both linear and nonlinear
correlations in the data. It can be difficult to select the proper degree of the polynomial,
though, as a larger degree can result in overfitting while a lower degree cannot adequately
represent the underlying relationships in the data.

In general, the polynomial kernel is an effective tool for converting the input data into a
higher-dimensional feature space in order to capture nonlinear correlations between the
input characteristics.

Gaussian (RBF) Kernel


The Gaussian kernel, also known as the radial basis function (RBF) kernel, is a popular
kernel function used in machine learning, particularly in SVMs (Support Vector Machines). It
is a nonlinear kernel function that maps the input data into a higher-dimensional feature
space using a Gaussian function.

The Gaussian kernel can be defined as:

1. K(x, y) = exp(-gamma * ||x - y||^2)


Where x and y are the input feature vectors, gamma is a parameter that controls the width
of the Gaussian function, and ||x - y||^2 is the squared Euclidean distance between the input
vectors.

When using a Gaussian kernel in an SVM, the decision boundary is a nonlinear hyper plane
that can capture complex nonlinear relationships between the input features. The width of
the Gaussian function, controlled by the gamma parameter, determines the degree of
nonlinearity in the decision boundary.

One advantage of the Gaussian kernel is its ability to capture complex relationships in the
data without the need for explicit feature engineering. However, the choice of the gamma
parameter can be challenging, as a smaller value may result in underfitting, while a larger
value may result in overfitting.

Laplace Kernel
The Laplacian kernel, also known as the Laplace kernel or the exponential kernel, is a type
of kernel function used in machine learning, including in SVMs (Support Vector Machines).
It is a non-parametric kernel that can be used to measure the similarity or distance between
two input feature vectors.

The Laplacian kernel can be defined as:

1. K(x, y) = exp(-gamma * ||x - y||)


Where x and y are the input feature vectors, gamma is a parameter that controls the width
of the Laplacian function, and ||x - y|| is the L1 norm or Manhattan distance between the
input vectors.

When using a Laplacian kernel in an SVM, the decision boundary is a nonlinear hyperplane
that can capture complex relationships between the input features. The width of the
Laplacian function, controlled by the gamma parameter, determines the degree of
nonlinearity in the decision boundary.

One advantage of the Laplacian kernel is its robustness to outliers, as it places less weight
on large distances between the input vectors than the Gaussian kernel. However, like the
Gaussian kernel, choosing the correct value of the gamma parameter can be challenging.
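A minimal sketch comparing some of these kernels with scikit-learn's SVC on a hypothetical
non-linear dataset (the Laplacian kernel is not built into SVC, but a custom kernel callable
could be supplied instead):

from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)   # hypothetical non-linear data

for kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
    clf = SVC(kernel=kernel, gamma='scale')   # gamma controls the width of poly/rbf/sigmoid kernels
    clf.fit(X, y)
    print(kernel, clf.score(X, y))            # training accuracy for each kernel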

Loss functions are a fundamental aspect of machine learning algorithms,


serving as the bridge between model predictions and the actual outcomes.

Importance of Loss Functions in Machine Learning


Loss functions are integral to the training process of machine learning
models. They provide a measure of how well the model’s predictions align
with the actual data. By minimizing this loss, models learn to make more
accurate predictions.
The choice of a loss function can significantly affect the performance of a
model, making it crucial to select an appropriate one based on the specific
task at hand.
Categories of Loss Functions
The loss function estimates how well a particular algorithm models the
provided data. Loss functions are classified into two classes based on the
type of learning task
• Regression Models: predict continuous values.
• Classification Models: predict the output from a set of finite categorical
values.
By selecting the right loss function, you optimize the model to meet the
task’s specific needs, whether it’s a regression or classification problem.
Regression Loss Functions in Machine Learning
Regression tasks involve predicting continuous values, such as house prices
or temperatures. Here are some commonly used loss functions for
regression:
1. Mean Squared Error (MSE)
It is the mean of the squared residuals over all the data points in the dataset. A
residual is the difference between the actual value and the value predicted
by the model.
The Mean Squared Error (MSE) is a common loss function in machine
learning where the mean of the squared residuals is taken rather than just
the sum. This ensures that the loss function is independent of the number
of data points in the training set, making the metric more reliable across
datasets of varying sizes. However, MSE is sensitive to outliers, as large
errors have a disproportionately large impact on the final result.
This squaring process is essential for most regression loss functions,
ensuring that models can minimize error and improve performance. The
formula is:
MSE = ( Σ_{i=1..n} (y_i − ŷ_i)^2 ) / n
where,
• i – ith training sample in a dataset
• n – number of training samples
• y(i) – Actual output of ith training sample
• y-hat(i) – Predicted value of ith training sample
2. Mean Absolute Error (MAE) / L1 Loss
The Mean Absolute Error (MAE) is a commonly used loss function in
machine learning that calculates the mean of the absolute values of
the residuals for all datapoints in the dataset.
• The absolute value of the residuals is taken to convert any negative
differences into positive values, ensuring that all errors are treated
equally.
• Taking the mean makes the loss function independent of the number of
datapoints in the training set, allowing it to provide a consistent measure
of error across datasets of different sizes.
One key advantage of MAE is that it is robust to outliers, meaning that
extreme values do not disproportionately affect the overall error calculation.
However, despite this robustness, MAE is often less preferred than Mean
Squared Error (MSE) in practice. This is because it is harder to calculate
the derivative of the absolute function, as it is not differentiable at the
minima. This makes MSE a more common choice when working with
optimization algorithms that rely on gradient-based methods.
This loss function example illustrates how the choice of a loss
function can significantly impact model performance and training efficiency.


The formula:
MAE = ( Σ_{i=1..n} |y_i − ŷ_i| ) / n
3. Mean Bias Error
It is similar to Mean Squared Error (MSE) but provides less accuracy.
However, it can help in determining whether the model has a positive
bias or negative bias. By analyzing the loss function results, you can
assess whether the model
consistently overestimates or underestimates the actual values. This
insight allows for further refinement of the machine learning model to
improve prediction accuracy. Such loss function examples are useful in
understanding model performance and identifying areas for optimization,
making them an essential part of the machine learning process.
The formula:
MBE = ( Σ_{i=1..n} (y_i − ŷ_i) ) / n
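A minimal NumPy sketch computing the three regression losses above on hypothetical predictions:

import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)    # Mean Squared Error
mae = np.mean(np.abs(y_true - y_pred))   # Mean Absolute Error
mbe = np.mean(y_true - y_pred)           # Mean Bias Error (sign shows over- or under-estimation)
print(mse, mae, mbe)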
Classification Loss Functions in Machine Learning
1. Cross-Entropy Loss
Cross-Entropy Loss, also known as Negative Log Likelihood, is a
commonly used loss function in machine learning for classification
tasks. This loss function measures how well the predicted probabilities
match the actual labels.
The cross-entropy loss increases as the predicted probability diverges from
the true label. In simpler terms, the farther the model’s prediction is from the
actual class, the higher the loss. This makes cross-entropy loss an
essential tool for improving the accuracy of classification models by
minimizing the difference between the predicted and actual labels.
A loss function example using cross-entropy would involve comparing the
predicted probabilities for each class against the actual class label, adjusting
the model to reduce this error during training.
The formula:
[Tex]\begin{equation} \text { CrossEntropyLoss }=-\left(y_{i} \log
\left(\hat{y}_{i}\right)+\left(1-y_{i}\right) \log \left(1-\hat{y}_{i}\right)\right)
\end{equation}[/Tex]
2. Hinge Loss
Hinge Loss, also known as Multi-class SVM Loss, is a type of loss
function used for maximum-margin classification tasks, most commonly
applied in support vector machines (SVMs). This loss function in
machine learning is particularly effective in ensuring that the decision
boundary is as far away as possible from any data points. Hinge Loss is
a convex function, making it suitable for optimization using a convex
optimizer.
This type of loss function is widely used in classification tasks as it
encourages models to achieve a larger margin between different classes,
leading to better generalization. A common loss function example involving
Hinge Loss can be seen in SVM models.
The formula:
[Tex]\begin{equation} \text { SVMLoss }=\sum_{j \neq y_{i}} \max \left(0,
s_{j}-s_{y_{i}}+1\right) \end{equation}[/Tex]
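As a rough illustration of the margin idea behind the formula above, here is a small NumPy sketch (the class scores are made-up values, not from the notes) that computes the multi-class SVM loss for a single example:

import numpy as np

def multiclass_hinge_loss(scores, correct_class):
    # scores: raw class scores for one example; correct_class: index of the true label
    margins = np.maximum(0, scores - scores[correct_class] + 1)
    margins[correct_class] = 0  # the true class does not contribute to the loss
    return margins.sum()

scores = np.array([2.0, 5.0, -1.0])                      # hypothetical scores for classes 0, 1, 2
print(multiclass_hinge_loss(scores, correct_class=1))    # 0.0 -> margin of 1 is satisfied
print(multiclass_hinge_loss(scores, correct_class=0))    # 4.0 -> class 1 violates the margin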

Multi Class Classification


Multi-class classification can be treated as an extension of binary classification to
more than two classes. If each example can only be assigned to one class, then the
classification problem can be handled as a binary classification problem, where one
class contains one of the multiple classes, and the other class contains all the other
classes put together. The process can then be repeated for each of the original classes.

For example, in a three-class multi-class classification problem, where you're classifying


examples with the labels A, B, and C, you could turn the problem into two separate
binary classification problems. First, you might create a binary classifier that categorizes
examples using the label A+B and the label C. Then, you could create a second binary
classifier that reclassifies the examples that are labeled A+B using the label A and the
label B.

An example of a multi-class problem is a handwriting classifier that takes an image of a


handwritten digit and decides which digit, 0-9, is represented.

If class membership isn't exclusive, which is to say, an example can be assigned to


multiple classes, this is known as a multi-label classification problem.

Naive Bayes

Naive Bayes is a parametric algorithm that requires a fixed set of parameters


or assumptions to simplify the machine’s learning process. In parametric
algorithms, the number of parameters used is independent of the size of the
training data.

Naïve Bayes Assumption:

• It assumes that the features of a dataset are independent of each
other given the class label. This is generally not true in practice,
which is why we also call it a ‘naïve’ algorithm.
How it works?
It is a classification model based on conditional probability that uses the
Bayes theorem to predict the class of unknown datasets. This model is mostly
used for large datasets as it is easy to build and fast for both training and
prediction. Moreover, without hyperparameter tuning, it can give better results
than other algorithms.

Naïve Bayes can also be an extremely good text classifier as it performs well,
such as in the spam ham dataset.

Bayes theorem is stated as:

P(A|B) = [ P(B|A) × P(A) ] / P(B)

• By P (A|B), we are trying to find the probability of event A given that


event B is true. It is also known as posterior probability.
• Event B is known as evidence.
• P(A) is called the prior probability of A, which means it is the probability of the event before the evidence is seen.
• P (B|A) is known as conditional probability or likelihood.
Note: Naïve Bayes is a linear classifier, which might not be suitable for classes
that are not linearly separable in a dataset.
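As a quick sketch of how Naïve Bayes is typically used in practice (the dataset and the Gaussian variant are assumptions chosen purely for illustration), scikit-learn's GaussianNB applies the Bayes theorem above with a Gaussian likelihood for each feature:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# load a small labelled dataset and split it into train/test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = GaussianNB()           # assumes features are independent given the class
model.fit(X_train, y_train)    # training only estimates per-class means, variances and priors

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

For text problems such as the spam/ham example, MultinomialNB over word-count features is the more common choice.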

Advantages

• It is beneficial in cases involving large datasets and many dimensions


• One of the most efficient algorithms in terms of training when you have
limited data, and very fast at prediction time
• Works well for multiclass classification which involves categorical
variables
Disadvantages

• It is naive in terms of assuming every feature to be independent of one


another
• Independence of every feature is not possible in real life hence some
dependent features influence the output
• Might not generalize well on unseen data, because a zero probability is
assigned to feature values never seen during training (the zero-frequency
problem) unless smoothing is applied
2. KNN (K-nearest neighbours)

KNN is a supervised machine learning algorithm that can be used to solve


both classification and regression problems. It is one of the simplest yet
powerful algorithms. It does not learn a discriminative function from the
training data but memorizes it instead. For this reason, it is also known as a
lazy algorithm.

How it works?

The K-nearest neighbour algorithm forms a majority vote between the K most
similar instances to a query point, using a distance metric between data points
to measure their similarity. The most popular choice is Euclidean distance,
which is written as:

d(x, y) = sqrt( (x1 − y1)² + (x2 − y2)² + ... + (xd − yd)² )

K in KNN is the hyperparameter we can choose to get the best possible fit for
the dataset. Suppose we keep the smallest value for K, i.e., K=1. In that case,
the model will show low bias but high variance because our model will be
overfitted.

A larger value of K, e.g., K=10, will smoothen our decision boundary,
meaning low variance but high bias. So, we always go for a trade-off
between bias and variance, known as the bias-variance trade-off.

Let us understand more about it by looking at its advantages and


disadvantages:

Advantages-

• KNN makes no assumptions about the distribution of classes i.e. it is a


non-parametric classifier
• It is one of the methods that can be widely used in multiclass
classification
• It does not get impacted by the outliers
• This classifier is easy to use and implement
Disadvantages-

• K value is difficult to find as it must work well with test data also, not only
with the training data
• It is a lazy algorithm, as it does not build an explicit model during training
• It is computationally expensive at prediction time because it measures the
distance to every training data point
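A minimal scikit-learn sketch of KNN classification follows (the dataset and the choice of K = 5 are assumptions for illustration); the hyperparameter K discussed above is passed as n_neighbors, and Euclidean distance is the default metric:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)   # K controls the bias-variance trade-off
knn.fit(X_train, y_train)                   # "training" simply memorizes the data (lazy learner)

print("Test accuracy:", knn.score(X_test, y_test))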
Decision Trees

As the name suggests, the decision tree is a tree-like structure of decisions


made based on some conditional statements. This is one of the most used
supervised learning methods in classification problems because of their high
accuracy, stability, and easy interpretation. They can map linear as well as
non-linear relationships in a good way.

Let us look at the figure below, Fig.3, where we have used adult census
income dataset with two independent variables and one dependent variable.
Our target or dependent variable is income, which has binary classes i.e,
<=50K or >50K.

Fig 3: Decision Tree- Binary Classifier
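Since the notes describe a decision tree classifying income into two classes, here is a hedged scikit-learn sketch on a stand-in binary dataset (the dataset and the max_depth value are assumptions for illustration, not the census data mentioned above):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# a readily available binary classification dataset standing in for the income example
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

tree = DecisionTreeClassifier(max_depth=4, random_state=0)  # max_depth limits the number of conditional splits
tree.fit(X_train, y_train)

print("Test accuracy:", tree.score(X_test, y_test))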


Support vector regression (SVR)

Support vector regression (SVR) is a type of support vector machine (SVM)


that is used for regression tasks. It tries to find a function that best predicts
the continuous output value for a given input value.
SVR can use both linear and non-linear kernels. A linear kernel is a simple
dot product between two input vectors, while a non-linear kernel is a more
complex function that can capture more intricate patterns in the data. The
choice of kernel depends on the data’s characteristics and the task’s
complexity.

Concepts related to the Support vector regression (SVR):

There are several concepts related to support vector regression (SVR) that
you may want to understand in order to use it effectively. Here are a few of
the most important ones:
• Support vector machines (SVMs): SVR is a type of support vector
machine (SVM), a supervised learning algorithm that can be used for
classification or regression tasks. SVMs try to find the hyperplane in a
high-dimensional space that maximally separates different classes or
output values.
• Kernels: SVR can use different types of kernels, which are functions that
determine the similarity between input vectors. A linear kernel is a simple
dot product between two input vectors, while a non-linear kernel is a more
complex function that can capture more intricate patterns in the data. The
choice of kernel depends on the data’s characteristics and the task’s
complexity.
• Hyperparameters: SVR has several hyperparameters that you can adjust
to control the behavior of the model. For example, the ‘C’ parameter
controls the trade-off between model complexity and the degree to which
errors larger than the ε-insensitive margin are penalized. A larger value
of ‘C’ means the model will try harder to fit the training data by penalizing
such errors more heavily, while a smaller value of C means the model
will be more lenient in allowing larger errors.
• Model evaluation: Like any machine learning model, it’s important to
evaluate the performance of an SVR model. One common way to do this
is to split the data into a training set and a test set, and use the training
set to fit the model and the test set to evaluate it. You can then use
metrics like mean squared error (MSE) or mean absolute error (MAE) to
measure the error between the predicted and true output values.
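The short scikit-learn sketch below ties together the kernel and C hyperparameters and the train/test evaluation with MSE described above (the synthetic sine data and the parameter values are assumptions for illustration):

import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# synthetic 1-D regression data: a noisy sine curve
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(200, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# RBF (non-linear) kernel; C controls how strongly errors outside the epsilon tube are penalized
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1)
svr.fit(X_train, y_train)

print("Test MSE:", mean_squared_error(y_test, svr.predict(X_test)))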
One-vs-Rest and One-vs-One for Multi-Class
Classification
Binary Classifiers for Multi-Class Classification
Classification is a predictive modeling problem that involves assigning a class label to an
example.

Binary classification are those tasks where examples are assigned exactly one of two
classes. Multi-class classification is those tasks where examples are assigned exactly one
of more than two classes.

• Binary Classification: Classification tasks with two classes.


• Multi-class Classification: Classification tasks with more than two classes.
Some algorithms are designed for binary classification problems. Examples include:

• Logistic Regression
• Perceptron
• Support Vector Machines
As such, they cannot be used for multi-class classification tasks, at least not directly.

Instead, heuristic methods can be used to split a multi-class classification problem into
multiple binary classification datasets and train a binary classification model each.

Two examples of these heuristic methods include:

• One-vs-Rest (OvR)
• One-vs-One (OvO)
Let’s take a closer look at each.

One-Vs-Rest for Multi-Class Classification


One-vs-rest (OvR for short, also referred to as One-vs-All or OvA) is a heuristic method for
using binary classification algorithms for multi-class classification.

It involves splitting the multi-class dataset into multiple binary classification problems. A
binary classifier is then trained on each binary classification problem and predictions are
made using the model that is the most confident.
For example, given a multi-class classification problem with examples for each class ‘red,’
‘blue,’ and ‘green‘. This could be divided into three binary classification datasets as follows:
• Binary Classification Problem 1: red vs [blue, green]
• Binary Classification Problem 2: blue vs [red, green]
• Binary Classification Problem 3: green vs [red, blue]
A possible downside of this approach is that it requires one model to be created for each
class. For example, three classes requires three models. This could be an issue for large
datasets (e.g. millions of rows), slow models (e.g. neural networks), or very large numbers
of classes (e.g. hundreds of classes).

This approach requires that each model predicts a class membership probability or a
probability-like score. The argmax of these scores (class index with the largest score) is
then used to predict a class.

This approach is commonly used for algorithms that naturally predict numerical class
membership probability or score, such as:

• Logistic Regression
• Perceptron
As such, the implementation of these algorithms in the scikit-learn library implements the
OvR strategy by default when using these algorithms for multi-class classification.

We can demonstrate this with an example on a 3-class classification problem using
the LogisticRegression algorithm. The strategy for handling multi-class classification can be
set via the “multi_class” argument, which can be set to “ovr” for the one-vs-rest strategy.
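A minimal sketch of that demonstration is given below (the synthetic 3-class dataset is an assumption; note also that in recent scikit-learn releases the multi_class argument is deprecated and the OneVsRestClassifier wrapper can be used instead):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# synthetic 3-class classification dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_classes=3, random_state=1)

# one-vs-rest strategy: one binary logistic regression model is fit per class
model = LogisticRegression(multi_class="ovr", max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)
print("Mean cross-validated accuracy:", scores.mean())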
Unit – IV
Clustering in Machine Learning

The task of grouping data points based on their similarity with each other is
called Clustering or Cluster Analysis. This method is defined under the
branch of Unsupervised Learning, which aims at gaining insights from
unlabelled data points, that is, unlike supervised learning we don’t have a
target variable.

Clustering aims at forming groups of homogeneous data points from a


heterogeneous dataset. It evaluates the similarity based on a metric like
Euclidean distance, Cosine similarity, Manhattan distance, etc. and then
group the points with highest similarity score together.
For Example, In the graph given below, we can clearly see that there are 3
circular clusters forming on the basis of distance.

Now it is not necessary that the clusters formed must be circular in shape.
The shape of clusters can be arbitrary, and there are many algorithms that
work well at detecting arbitrarily shaped clusters.
For example, In the below given graph we can see that the clusters formed
are not circular in shape.
Types of Clustering
Broadly speaking, there are 2 types of clustering that can be performed to
group similar data points:
• Hard Clustering: In this type of clustering, each data point belongs to a
cluster completely or not. For example, let’s say there are 4 data points
and we have to cluster them into 2 clusters. So each data point will either
belong to cluster 1 or cluster 2.
Data Points Clusters

A C1

B C2

C C2

D C1

• Soft Clustering: In this type of clustering, instead of assigning each data
point to a single cluster, a probability or likelihood of that point belonging
to each cluster is evaluated. For example, let’s say there are 4 data points
and we have to cluster them into 2 clusters. So we will evaluate a
probability of each data point belonging to both clusters. This probability is
calculated for all data points.
Data Points Probability of C1 Probability of C2

A 0.91 0.09

B 0.3 0.7

C 0.17 0.83

D 1 0
Uses of Clustering
Now before we begin with types of clustering algorithms, we will go through
the use cases of Clustering algorithms. Clustering algorithms are majorly
used for:
• Market Segmentation – Businesses use clustering to group their
customers and use targeted advertisements to attract more audience.
• Market Basket Analysis – Shop owners analyze their sales and figure out
which items are majorly bought together by the customers. For example,
In USA, according to a study diapers and beers were usually bought
together by fathers.
• Social Network Analysis – Social media sites use your data to understand
your browsing behaviour and provide you with targeted friend
recommendations or content recommendations.
• Medical Imaging – Doctors use Clustering to find out diseased areas in
diagnostic images like X-rays.
• Anomaly Detection – To find outliers in a stream of real-time dataset or
forecasting fraudulent transactions we can use clustering to identify them.
Types of Clustering Algorithms
1. Centroid-based Clustering (Partitioning methods)
2. Density-based Clustering (Model-based methods)
3. Connectivity-based Clustering (Hierarchical clustering)
4. Distribution-based Clustering
Applications of Clustering in different fields:
1. Marketing: It can be used to characterize & discover customer segments
for marketing purposes.
2. Biology: It can be used for classification among different species of
plants and animals.
3. Libraries: It is used in clustering different books on the basis of topics
and information.
4. Insurance: It is used to acknowledge the customers, their policies and
identifying the frauds.
K-Means Clustering Algorithm
K-Means Clustering is an unsupervised learning algorithm that is used to solve the
clustering problems in machine learning or data science. In this topic, we will learn what is
K-means clustering algorithm, how the algorithm works, along with the Python
implementation of k-means clustering.

The k-means clustering algorithm mainly performs two tasks:

o Determines the best value for K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. Those data points which are near to the
particular k-center, create a cluster.
Hence each cluster has datapoints with some commonalities, and it is away from other
clusters.

The below diagram explains the working of the K-means Clustering Algorithm:

The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points as centroids. (They can be points other than those in the input dataset.)

Step-3: Assign each data point to their closest centroid, which will form the predefined K
clusters.

Step-4: Calculate the variance and place a new centroid of each cluster.

Step-5: Repeat the third step, which means reassign each datapoint to the new closest
centroid of its cluster.

Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.

Step-7: The model is ready.
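The following scikit-learn sketch runs those steps in practice (the blob data and the choice K = 3 are assumptions for illustration; KMeans performs the assign/update iterations internally inside fit):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# synthetic data with 3 natural groups
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)  # Step 1: choose K
labels = kmeans.fit_predict(X)                             # Steps 2-7 run inside fit_predict

print("Cluster centroids:\n", kmeans.cluster_centers_)
print("First 10 cluster assignments:", labels[:10])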


Mean-Shift Clustering
Mean shift is a clustering algorithm under unsupervised learning that assigns
data points to clusters iteratively by shifting points towards the mode (the
mode being the region of highest density of data points, in the context of
mean shift). As such, it is also known as the mode-seeking algorithm.

Mean-shift algorithm has applications in the field of image processing and


computer vision. Unlike the popular K-Means cluster algorithm, mean-shift
does not require specifying the number of clusters in advance. The number
of clusters is determined by the algorithm with respect to the data.

The process of mean-shift clustering algorithm can be summarized as


follows:
Initialize the data points as cluster centroids.

Repeat the following steps until convergence or a maximum number of


iterations is reached:

For each data point, calculate the mean of all points within a certain radius
(i.e., the “kernel”) centered at the data point.

Shift the data point to the mean.

Identify the cluster centroids as the points that have not moved after
convergence.

Return the final cluster centroids and the assignments of data points to
clusters.

One of the main advantages of mean-shift clustering is that it does not


require the number of clusters to be specified beforehand.
It also does not make any assumptions about the distribution of the data,
and can handle arbitrary shapes and sizes of clusters. However, it can be
sensitive to the choice of kernel and the radius of the kernel.
Mean-Shift clustering can be applied to various types of data, including
image and video processing, object tracking and bioinformatics.
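A short scikit-learn sketch of mean-shift follows (the synthetic blobs and the bandwidth estimation are assumptions for illustration); notice that the number of clusters is discovered by the algorithm rather than specified:

from sklearn.datasets import make_blobs
from sklearn.cluster import MeanShift, estimate_bandwidth

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# the bandwidth plays the role of the kernel radius discussed above
bandwidth = estimate_bandwidth(X, quantile=0.2)
ms = MeanShift(bandwidth=bandwidth)
labels = ms.fit_predict(X)

print("Number of clusters found:", len(ms.cluster_centers_))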

Kernel Density Estimation

The first step when applying the mean-shift clustering algorithm is representing
your data in a mathematical manner; this means representing your data as a
set of points in a feature space.

Mean-shift builds upon the concept of kernel density estimation, in short


KDE. Imagine that the above data was sampled from a probability
distribution.

KDE is a method to estimate the underlying distribution also called the


probability density function for a set of data. It works by placing a kernel on
each point in the data set.

A kernel is a fancy mathematical word for a weighting function generally


used in convolution.

There are many different types of kernels, but the most popular one is the
Gaussian kernel. Adding up all of the individual kernels generates a
probability surface, i.e. an estimated density function.
Depending on the kernel bandwidth parameter used, the resultant density
function will vary. For example, the KDE surface for a set of points can be
computed using a Gaussian kernel with a kernel bandwidth of 2.

[Figure: surface plot and contour plot of the estimated KDE density]

Hierarchical clustering
Hierarchical clustering is a connectivity-based clustering model that groups
the data points together that are close to each other based on the measure
of similarity or distance. The assumption is that data points that are close to
each other are more similar or related than data points that are farther apart.
A dendrogram, a tree-like figure produced by hierarchical clustering, depicts
the hierarchical relationships between groups. Individual data points are
located at the bottom of the dendrogram, while the largest clusters, which
include all the data points, are located at the top. In order to generate
different numbers of clusters, the dendrogram can be sliced at various
heights.

Types of Hierarchical Clustering

Basically, there are two types of hierarchical Clustering:


1. Agglomerative Clustering
2. Divisive clustering
Hierarchical Agglomerative Clustering
It is also known as the bottom-up approach or hierarchical agglomerative
clustering (HAC). A structure that is more informative than the unstructured
set of clusters returned by flat clustering. This clustering algorithm does not
require us to prespecify the number of clusters. Bottom-up algorithms treat
each data as a singleton cluster at the outset and then successively
agglomerate pairs of clusters until all clusters have been merged into a
single cluster that contains all data.
Algorithm :
given a dataset (d1, d2, d3, ....dN) of size N
# compute the distance matrix
for i = 1 to N:
    # as the distance matrix is symmetric about
    # the primary diagonal, we compute only the lower
    # part of the primary diagonal
    for j = 1 to i:
        dis_mat[i][j] = distance(di, dj)
each data point is a singleton cluster
repeat
    merge the two clusters having minimum distance
    update the distance matrix
until only a single cluster remains

Steps:
• Consider each alphabet as a single cluster and calculate the distance of
one cluster from all the other clusters.
• In the second step, comparable clusters are merged together to form a
single cluster. Let’s say cluster (B) and cluster (C) are very similar to each
other therefore we merge them in the second step similarly to cluster (D)
and (E) and at last, we get the clusters [(A), (BC), (DE), (F)]
• We recalculate the proximity according to the algorithm and merge the
two nearest clusters([(DE), (F)]) together to form new clusters as [(A),
(BC), (DEF)]
• Repeating the same process; The clusters DEF and BC are comparable
and merged together to form a new cluster. We’re now left with clusters
[(A), (BCDEF)].
• At last, the two remaining clusters are merged together to form a single
cluster [(ABCDEF)].
Hierarchical Divisive clustering
It is also known as a top-down approach. This algorithm also does not
require to prespecify the number of clusters. Top-down clustering requires a
method for splitting a cluster that contains the whole data and proceeds by
splitting clusters recursively until individual data have been split into
singleton clusters.
Algorithm :
given a dataset (d1, d2, d3, ....dN) of size N
at the top we have all data in one cluster
the cluster is split using a flat clustering method, e.g. K-Means
repeat
    choose the best cluster among all the clusters to split
    split that cluster by the flat clustering algorithm
until each data point is in its own singleton cluster
Computing Distance Matrix
While merging two clusters we check the distance between every pair of
clusters and merge the pair with the least distance/most similarity. But the
question is how that distance is determined. There are different ways of
defining Inter Cluster distance/similarity. Some of them are:
1. Min Distance: Find the minimum distance between any two points of the
cluster.
2. Max Distance: Find the maximum distance between any two points of the
cluster.
3. Group Average: Find the average distance between every two points of
the clusters.
4. Ward’s Method: The similarity of two clusters is based on the increase in
squared error when two clusters are merged.
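As a sketch of agglomerative clustering with these linkage options (the toy points and the choice of Ward linkage are assumptions for illustration), SciPy's linkage and fcluster functions build and then cut the dendrogram described above:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# six toy 2-D points forming two visible groups
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# method can be 'single' (min), 'complete' (max), 'average' or 'ward'
Z = linkage(X, method="ward")

# cut the dendrogram so that exactly 2 clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")
print("Cluster labels:", labels)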
Gaussian Mixture Model
Normal or Gaussian Distribution

In real life, many datasets can be modeled by Gaussian Distribution


(Univariate or Multivariate). So it is quite natural and intuitive to assume that
the clusters come from different Gaussian Distributions. Or in other words, it
tried to model the dataset as a mixture of several Gaussian Distributions.
This is the core idea of this model.
In one dimension the probability density function of a Gaussian Distribution is
given by

N(x | μ, σ²) = (1 / √(2πσ²)) · exp( −(x − μ)² / (2σ²) )

where μ and σ² are respectively the mean and variance of the distribution.
For a Multivariate (let us say d-variate) Gaussian Distribution, the probability
density function is given by

N(x | μ, Σ) = (1 / ((2π)^(d/2) |Σ|^(1/2))) · exp( −(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ) )

Here μ is a d-dimensional vector denoting the mean of the distribution
and Σ is the d × d covariance matrix.
Gaussian Mixture Model

Suppose there are K clusters (for the sake of simplicity here it is assumed
that the number of clusters is known and it is K). So μk and Σk are also
estimated for each k. Had it been only one distribution, they would have
been estimated by the maximum-likelihood method. But since there are K
such clusters, the probability density is defined as a linear function of the
densities of all these K distributions, i.e.

p(x) = Σ k=1..K  πk · N(x | μk, Σk)

where πk is the mixing coefficient for the k-th distribution. For estimating the
parameters by the maximum log-likelihood method, compute p(X | μ, Σ, π).

Now define a random variable γk(x) such that γk(x) = p(k | x).

From Bayes theorem,

γk(x) = πk · N(x | μk, Σk) / Σ j=1..K  πj · N(x | μj, Σj)

Now for the log-likelihood function to be maximum, its derivative with respect
to μk, Σk, and πk should be zero. So equating the derivative of ln p(X | μ, Σ, π)
with respect to μk to zero and rearranging the terms,

μk = ( Σ n=1..N  γk(xn) · xn ) / ( Σ n=1..N  γk(xn) )

Similarly taking the derivative with respect to Σk and πk respectively, one can
obtain the following expressions.

Σk = ( Σ n=1..N  γk(xn) · (xn − μk)(xn − μk)ᵀ ) / ( Σ n=1..N  γk(xn) )

And

πk = Nk / N, where Nk = Σ n=1..N  γk(xn)

Note: Nk denotes the effective number of sample points assigned to the k-th
cluster. Here it is assumed that there is a total of N samples and each sample
containing d features is denoted by xn.
Since γk(xn) itself depends on μk, Σk, and πk, it can be clearly seen that the
parameters cannot be estimated in closed form. This is where the
Expectation-Maximization algorithm is beneficial.

Expectation-Maximization (EM) Algorithm

The Expectation-Maximization (EM) algorithm is an iterative way to find


maximum-likelihood estimates for model parameters when the data is
incomplete or has some missing data points or has some hidden variables.
EM chooses some random initial values for the missing data points (or parameters)
and uses them to estimate a new set of values. These new values are then used
recursively to obtain better estimates, by filling up the missing points, until
the values converge.
In the Expectation-Maximization (EM) algorithm, the estimation step (E-step)
and maximization step (M-step) are the two most important steps that are
iteratively performed to update the model parameters until the model
convergence.
Estimation Step (E-step):
• In the estimation step, we first initialize our model parameters like the
mean (μk), covariance matrix (Σk), and mixing coefficients (πk).
• For each data point, We calculate the posterior probabilities of data points
belonging to each centroid using the current parameter values. These
probabilities are often represented by the latent variables γk.
• At the end Estimate the values of the latent variables γ k based on the
current parameter values
Maximization Step
• In the maximization step, we update the parameter values (i.e., μk, Σk,
and πk) using the estimated latent variables γk.
• We will update the mean of the cluster point (μk) by taking the weighted
average of data points using the corresponding latent variable
probabilities
• We will update the covariance matrix (Σk) by taking the weighted average
of the squared differences between the data points and the mean, using
the corresponding latent variable probabilities.
• We will update the mixing coefficients (πk) by taking the average of the
latent variable probabilities for each component.
Repeat the E-step and M-step until convergence
• We iterate between the estimation step and maximization step until the
change in the log-likelihood or the parameters falls below a predefined
threshold or until a maximum number of iterations is reached.
• Basically, in the estimation step, we update the latent variables based on
the current parameter values.
• However, in the maximization step, we update the parameter values using
the estimated latent variables
• This process is iteratively repeated until our model converges.
The Expectation-Maximization (EM) algorithm is a general framework and
can be applied to various models, including Gaussian Mixture Models
(GMMs). The steps described above are specifically for GMMs, but the
overall concept of the Estimation step and Maximization step remains the
same for other models that use the EM algorithm.
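A compact scikit-learn sketch of fitting a GMM with EM is shown below (the synthetic blobs and K = 3 are assumptions for illustration); GaussianMixture runs the E and M steps internally until convergence and exposes the soft assignments:

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=1.0, random_state=7)

# n_components corresponds to K; EM alternates E and M steps inside fit()
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=7)
gmm.fit(X)

print("Mixing coefficients:", gmm.weights_)
print("Responsibilities of the first sample:", gmm.predict_proba(X[:1]))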

Clustering High-Dimensional Data in Data


Mining

Clustering is basically a type of unsupervised learning method. An


unsupervised learning method is a method in which we draw references from
datasets consisting of input data without labeled responses.
Clustering is the task of dividing the population or data points into a number
of groups such that data points in the same groups are more similar to other
data points in the same group and dissimilar to the data points in other
groups.

Challenges of Clustering High-Dimensional Data:


Clustering of high-dimensional data returns groups of objects as clusters. It is
required to group similar types of objects together to perform the cluster
analysis of high-dimensional data, but the high-dimensional data space is huge
and it has complex data types and attributes. A major challenge is that we need
to find out the set of attributes that are present in each cluster, since a
cluster is defined and characterized based on the attributes present in it.
When clustering high-dimensional data, we need to search for clusters and find
out the subspace in which each cluster exists. The high-dimensional data is
often reduced to low-dimensional data to make the clustering and the search
for clusters simpler. Some applications need appropriate models of clusters,
especially for high-dimensional data. Clusters in high-dimensional data are
significantly small, and conventional distance measures can be ineffective.
Instead, to find the hidden clusters in high-dimensional data we need to apply
sophisticated techniques that can model correlations among the objects in
subspaces.
Subspace Clustering Methods:
There are 3 Subspace Clustering Methods:
• Subspace search methods
• Correlation-based clustering methods
• Biclustering methods
Subspace clustering approaches to search for clusters existing in subspaces
of the given high-dimensional data space, where a subspace is defined using
a subset of attributes in the full space.
1. Subspace Search Methods: A subspace search method searches the subspaces
for clusters. Here, a cluster is a group of similar types of objects in a
subspace. The similarity between objects is measured by using distance or
density features. The CLIQUE algorithm is a subspace clustering method.
Subspace search methods search a series of subspaces, and there are two
approaches. The bottom-up approach starts the search from the low-dimensional
subspaces; if the hidden clusters are not found in low-dimensional subspaces,
it then searches in higher-dimensional subspaces. The top-down approach starts
the search from the high-dimensional subspaces and then searches in subsets of
low-dimensional subspaces. Top-down approaches are effective if the subspace
of a cluster can be determined by the local neighborhood of the sub-space
clusters.
2. Correlation-Based Clustering: correlation-based approaches discover
the hidden clusters by developing advanced correlation models. Correlation-
Based models are preferred if it is not possible to cluster the objects by using
the Subspace Search Methods. Correlation-Based clustering includes the
advanced mining techniques for correlation cluster analysis. Biclustering
Methods are the Correlation-Based clustering methods in which both the
objects and attributes are clustered.
3. Biclustering Methods:
Biclustering means clustering the data based on the two factors. we can
cluster both objects and attributes at a time in some applications. The
resultant clusters are biclusters. To perform the biclustering there are four
requirements:
• Only a small set of objects participate in a cluster.
• A cluster only involves a small number of attributes.
• A data object can take part in multiple biclusters, or it may not
participate in any bicluster at all.
• An attribute may be involved in multiple clusters.
UNIT – V

NEURAL NETWORKS

Multi-layer perceptron

A multi-layer perceptron (MLP) is a type of artificial neural network consisting of multiple


layers of neurons. The neurons in the MLP typically use nonlinear activation functions,
allowing the network to learn complex patterns in data. MLPs are significant in machine
learning because they can learn nonlinear relationships in data, making them powerful
models for tasks such as classification, regression, and pattern recognition.

Multilayer Perceptrons

A multilayer perceptron is a type of feedforward neural network consisting of fully connected


neurons with a nonlinear kind of activation function. It is widely used to distinguish data that is
not linearly separable.

MLPs have been widely used in various fields, including image recognition, natural language
processing, and speech recognition, among others. Their flexibility in architecture and ability to
approximate any function under certain conditions make them a fundamental building block in
deep learning and neural network research. Let's take a deeper dive into some of its key concepts.

Input layer

The input layer consists of nodes or neurons that receive the initial input data. Each neuron
represents a feature or dimension of the input data. The number of neurons in the input layer is
determined by the dimensionality of the input data.

Hidden layer

Between the input and output layers, there can be one or more layers of neurons. Each neuron in
a hidden layer receives inputs from all neurons in the previous layer (either the input layer or
another hidden layer) and produces an output that is passed to the next layer. The number of
hidden layers and the number of neurons in each hidden layer are hyperparameters that need to
be determined during the model design phase.

Output layer

This layer consists of neurons that produce the final output of the network. The number of
neurons in the output layer depends on the nature of the task. In binary classification, there may
be either one or two neurons depending on the activation function and representing the
probability of belonging to one class; while in multi-class classification tasks, there can be
multiple neurons in the output layer.
Weights

Neurons in adjacent layers are fully connected to each other. Each connection has an associated
weight, which determines the strength of the connection. These weights are learned during the
training process.

Bias neurons

In addition to the input and hidden neurons, each layer (except the input layer) usually includes a
bias neuron that provides a constant input to the neurons in the next layer. Bias neurons have
their own weight associated with each connection, which is also learned during training.

The bias neuron effectively shifts the activation function of the neurons in the subsequent layer,
allowing the network to learn an offset or bias in the decision boundary. By adjusting the weights
connected to the bias neuron, the MLP can learn to control the threshold for activation and better
fit the training data.

Note: It is important to note that in the context of MLPs, bias can refer to two related but distinct
concepts: bias as a general term in machine learning and the bias neuron (defined above). In
general machine learning, bias refers to the error introduced by approximating a real-world
problem with a simplified model. Bias measures how well the model can capture the underlying
patterns in the data. A high bias indicates that the model is too simplistic and may underfit the
data, while a low bias suggests that the model is capturing the underlying patterns well.

Activation function

Typically, each neuron in the hidden layers and the output layer applies an activation function to
its weighted sum of inputs. Common activation functions include sigmoid, tanh, ReLU
(Rectified Linear Unit), and softmax. These functions introduce nonlinearity into the network,
allowing it to learn complex patterns in the data.

Training with backpropagation

MLPs are trained using the backpropagation algorithm, which computes gradients of a loss
function with respect to the model's parameters and updates the parameters iteratively to
minimize the loss.
Workings of a Multilayer Perceptron: Layer by Layer

In a multilayer perceptron, neurons process information in a step-by-step manner, performing


computations that involve weighted sums and nonlinear transformations. Let's walk layer by
layer to see the magic that goes within.

Input layer

• The input layer of an MLP receives input data, which could be features extracted from
the input samples in a dataset. Each neuron in the input layer represents one feature.
• Neurons in the input layer do not perform any computations; they simply pass the
input values to the neurons in the first hidden layer.
Hidden layers

• The hidden layers of an MLP consist of interconnected neurons that perform


computations on the input data.
• Each neuron in a hidden layer receives input from all neurons in the previous layer.
The inputs are multiplied by corresponding weights, denoted as w. The weights
determine how much influence the input from one neuron has on the output of
another.
• In addition to weights, each neuron in the hidden layer has an associated bias, denoted
as b. The bias provides an additional input to the neuron, allowing it to adjust its
output threshold. Like weights, biases are learned during training.
• For each neuron in a hidden layer or the output layer, the weighted sum of its inputs is
computed. This involves multiplying each input by its corresponding weight,
summing up these products, and adding the bias:

z = w1·x1 + w2·x2 + ... + wn·xn + b

Where n is the total number of input connections, wi is the weight for the i-th input, and xi is the
i-th input value.

• The weighted sum is then passed through an activation function, denoted as f. The
activation function introduces nonlinearity into the network, allowing it to learn and
represent complex relationships in the data. The activation function determines the
output range of the neuron and its behavior in response to different input values. The
choice of activation function depends on the nature of the task and the desired
properties of the network.
Output layer

• The output layer of an MLP produces the final predictions or outputs of the network.
The number of neurons in the output layer depends on the task being performed (e.g.,
binary classification, multi-class classification, regression).
• Each neuron in the output layer receives input from the neurons in the last hidden
layer and applies an activation function. This activation function is usually different
from those used in the hidden layers and produces the final output value or prediction.
During the training process, the network learns to adjust the weights associated with each
neuron's inputs to minimize the discrepancy between the predicted outputs and the true target
values in the training data. By adjusting the weights and learning the appropriate activation
functions, the network learns to approximate complex patterns and relationships in the data,
enabling it to make accurate predictions on new, unseen samples.

This adjustment is guided by an optimization algorithm, such as stochastic gradient descent


(SGD), which computes the gradients of a loss function with respect to the weights and updates
the weights iteratively.

You have seen the working of the multilayer perceptron layers and learned about stochastic
gradient descent; to put it all together, there is one last topic to dive into: backpropagation.

Backpropagation

Backpropagation is short for “backward propagation of errors.” In the context of


backpropagation, SGD involves updating the network's parameters iteratively based on the
gradients computed during each batch of training data. Instead of computing the gradients using
the entire training dataset (which can be computationally expensive for large datasets), SGD
computes the gradients using small random subsets of the data called mini-batches. Here’s an
overview of how backpropagation algorithm works:

1. Forward Pass: During the forward pass, input data is fed into the neural network, and
the network's output is computed layer by layer. Each neuron computes a weighted
sum of its inputs, applies an activation function to the result, and passes the output to
the neurons in the next layer.

2. Loss Computation: After the forward pass, the network's output is compared to the
true target values, and a loss function is computed to measure the discrepancy
between the predicted output and the actual output.

3. Backward Pass (Gradient Calculation): In the backward pass, the gradients of the
loss function with respect to the network's parameters (weights and biases) are
computed using the chain rule of calculus. The gradients represent the rate of change
of the loss function with respect to each parameter and provide information about how
to adjust the parameters to decrease the loss.

4. Parameter update: Once the gradients have been computed, the network's
parameters are updated in the opposite direction of the gradients in order to minimize
the loss function. This update is typically performed using an optimization algorithm
such as stochastic gradient descent (SGD), that we discussed earlier.

5. Iterative Process: Steps 1-4 are repeated iteratively for a fixed number of epochs or
until convergence criteria are met. During each iteration, the network's parameters are
adjusted based on the gradients computed in the backward pass, gradually reducing
the loss and improving the model's performance.
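To tie the layers, activation functions, loss, and backpropagation together, here is a hedged Keras sketch of a small MLP (the random toy data, the architecture, and the hyperparameters are all assumptions made purely for illustration):

import numpy as np
import tensorflow as tf

# toy data: 1000 samples with 20 features and 3 classes (illustration only)
X = np.random.rand(1000, 20).astype("float32")
y = np.random.randint(0, 3, size=(1000,))

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),    # hidden layer 1
    tf.keras.layers.Dense(32, activation="relu"),    # hidden layer 2
    tf.keras.layers.Dense(3, activation="softmax"),  # output layer: class probabilities
])

# backpropagation and the gradient-based weight updates happen inside fit()
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
print("Final training accuracy:", model.evaluate(X, y, verbose=0)[1])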
Activation function
What is an Activation Function?
An activation function is a mathematical function applied to the output of a
neuron. It introduces non-linearity into the model, allowing the network to
learn and represent complex patterns in the data. Without this non-linearity
feature, a neural network would behave like a linear regression model, no
matter how many layers it has.
The activation function decides whether a neuron should be activated by
calculating the weighted sum of inputs and adding a bias term. This helps
the model make complex decisions and predictions by introducing non-
linearities to the output of each neuron.
Why is Non-Linearity Important in Neural Networks?
Neural networks consist of neurons that operate using weights, biases,
and activation functions.
In the learning process, these weights and biases are updated based on the
error produced at the output—a process known as backpropagation.
Activation functions enable backpropagation by providing gradients that are
essential for updating the weights and biases.
Without non-linearity, even deep networks would be limited to solving only
simple, linearly separable problems. Activation functions empower neural
networks to model highly complex data distributions and solve advanced
deep learning tasks. Adding non-linear activation functions introduce
flexibility and enable the network to learn more complex and abstract
patterns from data.
Mathematical Proof of Need of Non-Linearity in Neural Networks
To illustrate the need for non-linearity in neural networks with a specific
example, let’s consider a network with two input nodes (i1 and i2), a
single hidden layer containing one neuron (h1), and an output neuron
(out). We will use w1, w2 as weights connecting the inputs to the hidden
neuron, and w5 as the weight connecting the hidden neuron to the output.
We’ll also include biases (b1 for the hidden neuron and b2 for the output
neuron) to complete the model.
Network Structure
1. Input Layer: Two inputs i1 and i2.
2. Hidden Layer: One neuron h1.
3. Output Layer: One output neuron.
Mathematical Model Without Non-linearity
Hidden Layer Calculation:
The input to the hidden neuron h1 is calculated as a weighted sum of
the inputs plus a bias:
z_h1 = w1·i1 + w2·i2 + b1
Output Layer Calculation:
The output neuron is then a weighted sum of the hidden neuron’s output plus
a bias:
output = w5·h1 + b2
If h1 were directly the output of z_h1 (no activation function applied,
i.e., h1 = z_h1), then substituting h1 in the output equation yields:
output = w5·(w1·i1 + w2·i2 + b1) + b2
output = (w5·w1)·i1 + (w5·w2)·i2 + (w5·b1 + b2)
This shows that the output neuron is still a linear combination of the
inputs i1 and i2.
Thus, the entire network, despite having multiple layers and weights,
effectively performs a linear transformation, equivalent to a single-layer
perceptron.
Introducing Non-Linearity in Neural Network
To introduce non-linearity, let’s use a non-linear activation function σ for the
hidden neuron. A common choice is the ReLU function, defined
as σ(x) = max(0, x).
Updated Hidden Layer Calculation:
h1 = σ(z_h1) = σ(w1·i1 + w2·i2 + b1)
Output Layer Calculation with Non-linearity:
output = w5·σ(w1·i1 + w2·i2 + b1) + b2
Effect of Non-linearity
The inclusion of the ReLU activation function σ allows h1 to introduce
a non-linear decision boundary in the input space. This non-linearity enables
the network to learn more complex patterns that are not possible with a
purely linear model, such as:
• Modeling functions that are not linearly separable.
• Increasing the capacity of the network to form multiple decision
boundaries based on the combination of weights and biases.
Types of Activation Functions in Deep Learning
1. Linear Activation Function
Linear Activation Function resembles a straight line defined by y = x. No matter
how many layers the neural network contains, if they all use linear activation
functions, the output is a linear combination of the input.
• The range of the output spans from −∞ to +∞.
• Linear activation function is used at just one place i.e. output layer.
• Using linear activation across all layers makes the network’s ability to
learn complex patterns limited.
Linear activation functions are useful for specific tasks but must be combined
with non-linear functions to enhance the neural network’s learning and
predictive capabilities.

2. Non-Linear Activation Functions


1. Sigmoid Function
Sigmoid Activation Function is characterized by an ‘S’ shape. It is
mathematically defined as A = 1 / (1 + e^(−x)). This formula ensures a smooth
and continuous output that is essential for gradient-based optimization
methods.
• It allows neural networks to handle and model complex patterns that
linear equations cannot.
• The output ranges between 0 and 1, hence useful for binary classification.
• The function exhibits a steep gradient when x values are between -2 and
2. This sensitivity means that small changes in input x can cause
significant changes in output y, which is critical during the training
process.

2. Tanh Activation Function


Tanh function, or hyperbolic tangent function, is a shifted version of the
sigmoid, allowing it to stretch across the y-axis. It is defined as:
f(x) = tanh(x) = 2 / (1 + e^(−2x)) − 1
Alternatively, it can be expressed using the sigmoid function:
tanh(x) = 2 × sigmoid(2x) − 1
• Value Range: Outputs values from -1 to +1.
• Non-linear: Enables modeling of complex data patterns.
• Use in Hidden Layers: Commonly used in hidden layers due to its zero-
centered output, facilitating easier learning for subsequent layers.
3. ReLU (Rectified Linear Unit) Function
ReLU activation is defined by A(x) = max(0, x); this means that if
the input x is positive, ReLU returns x, and if the input is negative, it returns 0.
• Value Range: [0, ∞), meaning the function only outputs non-negative
values.
• Nature: It is a non-linear activation function, allowing neural networks to
learn complex patterns and making backpropagation more efficient.
• Advantage over other activations: ReLU is less computationally
expensive than tanh and sigmoid because it involves simpler
mathematical operations. At any time only a few neurons are activated,
making the network sparse and therefore efficient and easy to compute.
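A tiny NumPy sketch of the three non-linear activation functions just described is given below (purely illustrative; deep learning libraries ship their own optimized implementations):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # output in (0, 1)

def tanh(x):
    return np.tanh(x)                 # output in (-1, 1), zero-centered

def relu(x):
    return np.maximum(0.0, x)         # output in [0, inf)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print("sigmoid:", sigmoid(x))
print("tanh:   ", tanh(x))
print("relu:   ", relu(x))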
Types of Loss Functions

Loss functions are one of the most important aspects of neural


networks, as they (along with the optimization functions) are
directly responsible for fitting the model to the given training data.

Types of Loss Functions

In supervised learning, there are two main types of loss functions —


these correlate to the 2 major types of neural networks: regression and
classification loss functions

1. Regression Loss Functions — used in regression neural networks;


given an input value, the model predicts a corresponding output
value (rather than pre-selected labels); Ex. Mean Squared Error,
Mean Absolute Error

2. Classification Loss Functions — used in classification neural


networks; given an input, the neural network produces a vector of
probabilities of the input belonging to various pre-set categories —
can then select the category with the highest probability of
belonging; Ex. Binary Cross-Entropy, Categorical Cross-Entropy

Mean Squared Error (MSE)

One of the most popular loss functions, MSE finds the average of the
squared differences between the target and the predicted outputs

This function has numerous properties that make it especially suited


for calculating loss. The difference is squared, which means it does not
matter whether the predicted value is above or below the target value;
however, values with a large error are penalized. MSE is also a convex
function (as shown in the diagram above) with a clearly defined global
minimum — this allows us to more easily utilize gradient descent
optimization to set the weight values.
import tensorflow as tf

def mse(y_true, y_pred):
    return tf.reduce_mean(tf.square(y_true - y_pred))

However, one disadvantage of this loss function is that it is very


sensitive to outliers; if a predicted value is significantly greater than or
less than its target value, this will significantly increase the loss.
Mean Absolute Error (MAE)

MAE finds the average of the absolute differences between the target
and the predicted outputs.


This loss function is used as an alternative to MSE in some cases. As


mentioned previously, MSE is highly sensitive to outliers, which can
dramatically affect the loss because the distance is squared. MAE is
used in cases when the training data has a large number of outliers to
mitigate this.

Here is a standard implementation in TensorFlow — built into the


TensorFlow library as well.
def mae(y_true, y_pred):
    return tf.reduce_mean(tf.abs(y_true - y_pred))

It also has some disadvantages: as the average error approaches 0,
gradient descent optimization can struggle, because the derivative of the
absolute value function is undefined at 0 and the gradient magnitude does not
decrease as the minimum is approached.

Because of this, a loss function called a Huber Loss was developed,


which has the advantages of both MSE and MAE.


If the absolute difference between the actual and predicted value is less
than or equal to a threshold value, 𝛿, then MSE is applied. Otherwise —
if the error is sufficiently large — MAE is applied.
This is the TensorFlow implementation —this involves using a wrapper
function to utilize the threshold variable, which we will discuss in a
little bit.
def huber_loss_with_threshold(t=1.0):
    # t plays the role of the threshold 𝛿
    def huber_loss(y_true, y_pred):
        error = y_true - y_pred
        within_threshold = tf.abs(error) <= t
        small_error = 0.5 * tf.square(error)
        large_error = t * (tf.abs(error) - 0.5 * t)
        # quadratic branch for small errors, linear branch for large errors
        return tf.where(within_threshold, small_error, large_error)
    return huber_loss

Binary Cross-Entropy/Log Loss

This is the loss function used in binary classification models — where


the model takes in an input and has to classify it into one of two pre-set
categories.

Classification neural networks work by outputting a vector of


probabilities — the probability that the given input fits into each of the
pre-set categories; then selecting the category with the highest
probability as the final output.

In binary classification, there are only two possible actual values of y —


0 or 1. Thus, to accurately determine loss between the actual and
predicted values, it needs to compare the actual value (0 or 1) with the
probability that the input aligns with that category (p(i) = probability
that the category is 1; 1 — p(i) = probability that the category is 0)

This is the TensorFlow implementation.


def log_loss(y_true, y_pred):
    y_pred = tf.clip_by_value(y_pred, 1e-7, 1 - 1e-7)
    error = y_true * tf.math.log(y_pred) + (1 - y_true) * tf.math.log(1 - y_pred)
    return -tf.reduce_mean(error)

Categorical Cross-Entropy Loss

In cases where the number of classes is greater than two, we utilize


categorical cross-entropy — this follows a very similar process to
binary cross-entropy.

Binary cross-entropy is a special case of categorical cross-entropy,


where M = 2 — the number of categories is 2.
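A small NumPy sketch of categorical cross-entropy for a single example follows (the one-hot label and the predicted probabilities are made-up values for illustration):

import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-7):
    # y_true: one-hot encoded label; y_pred: predicted class probabilities
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.sum(y_true * np.log(y_pred))

y_true = np.array([0, 0, 1])           # the example belongs to the third class
y_pred = np.array([0.1, 0.2, 0.7])     # the model is fairly confident in the third class
print(categorical_cross_entropy(y_true, y_pred))   # ~0.357, a relatively low loss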

Basics of Neural Networks

Neural networks or artificial neural networks are fundamental tools in machine learning,
powering many state-of-the-art algorithms and applications across various domains,
including computer vision, natural language processing, robotics, and more.

A neural network consists of interconnected nodes, called neurons, organized into


layers. Each neuron receives input signals, performs a computation on them using an
activation function, and produces an output signal that may be passed to other neurons
in the network. An activation function determines the output of a neuron given its
input. These functions introduce nonlinearity into the network, enabling it to learn
complex patterns in data.

The network is typically organized into layers, starting with the input layer, where data is
introduced. Followed by hidden layers where computations are performed and finally, the output
layer where predictions or decisions are made.
Neurons in adjacent layers are connected by weighted connections, which transmit signals from
one layer to the next. The strength of these connections, represented by weights, determines how
much influence one neuron's output has on another neuron's input. During the training process,
the network learns to adjust its weights based on examples provided in a training dataset.
Additionally, each neuron typically has an associated bias, which allows the neuron to adjust its
output threshold.

Neural networks are trained using techniques called feedforward propagation


and backpropagation. During feedforward propagation, input data is passed through the
network layer by layer, with each layer performing a computation based on the inputs it receives
and passing the result to the next layer.

Backpropagation is an algorithm used to train neural networks by iteratively adjusting the


network's weights and biases in order to minimize the loss function. A loss function (also known
as a cost function or objective function) is a measure of how well the model's predictions match
the true target values in the training data. The loss function quantifies the difference between the
predicted output of the model and the actual output, providing a signal that guides the
optimization process during training.

The goal of training a neural network is to minimize this loss function by adjusting the weights
and biases. The adjustments are guided by an optimization algorithm, such as gradient descent.

Types of Neural Network

The simplest artificial neural network is called a ‘perceptron’. It
consists of a single layer, which is the input layer, with multiple neurons with their own weights;
there are no hidden layers. The perceptron algorithm learns the weights for the input signals in
order to draw a linear decision boundary.

However, to solve more complicated, non-linear problems related to image processing, computer
vision, and natural language processing tasks, we work with deep neural networks.


There are several types of ANN, each designed for specific tasks and architectural requirements.
Let's briefly discuss some of the most common types before diving deeper into MLPs next.

Feedforward Neural Networks (FNN)

These are the simplest form of ANNs, where information flows in one direction, from input to
output. There are no cycles or loops in the network architecture. Multilayer perceptrons (MLP)
are a type of feedforward neural network.
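
As a sketch only (assuming TensorFlow/Keras is installed; the input size and layer widths are arbitrary assumptions), a multilayer perceptron for binary classification might be defined like this:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),              # 20 input features (assumed)
    tf.keras.layers.Dense(64, activation="relu"),    # hidden layer
    tf.keras.layers.Dense(1, activation="sigmoid"),  # output layer for a binary label
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()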

Recurrent Neural Networks (RNN)

In RNNs, connections between nodes form directed cycles, allowing information to persist over
time. This makes them suitable for tasks involving sequential data, such as time series prediction,
natural language processing, and speech recognition.

Convolutional Neural Networks (CNN)

CNNs are designed to effectively process grid-like data, such as images. They consist of layers
of convolutional filters that learn hierarchical representations of features within the input data.
CNNs are widely used in tasks like image classification, object detection, and image
segmentation.
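
A minimal, illustrative CNN for classifying 28x28 grayscale images into 10 classes (again assuming TensorFlow/Keras; the shapes and filter counts are assumptions) might look as follows:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(32, kernel_size=3, activation="relu"),  # learn local image filters
    tf.keras.layers.MaxPooling2D(pool_size=2),                     # downsample the feature maps
    tf.keras.layers.Conv2D(64, kernel_size=3, activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),               # class probabilities
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])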

Long Short-Term Memory Networks (LSTM) and Gated Recurrent Units (GRU)

These are specialized types of recurrent neural networks designed to address the vanishing
gradient problem in traditional RNNs. LSTMs and GRUs incorporate gated mechanisms to
better capture long-range dependencies in sequential data, making them particularly effective for
tasks like speech recognition, machine translation, and sentiment analysis.
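
As a rough sketch (TensorFlow/Keras assumed; the sequence length and feature size are arbitrary), an LSTM for binary sequence classification could be set up like this:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(50, 8)),            # 50 time steps, 8 features per step
    tf.keras.layers.LSTM(32),                        # gated recurrent layer
    tf.keras.layers.Dense(1, activation="sigmoid"),  # sequence-level prediction
])
model.compile(optimizer="adam", loss="binary_crossentropy")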

Autoencoder

An autoencoder is designed for unsupervised learning and consists of an encoder network that compresses the
input data into a lower-dimensional latent space, and a decoder network that reconstructs the
original input from the latent representation. Autoencoders are often used for dimensionality
reduction, data denoising, and generative modeling.
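
A minimal sketch of a fully connected autoencoder (TensorFlow/Keras assumed; the 784-dimensional input and 32-dimensional latent space are illustrative choices) is shown below:

import tensorflow as tf

inputs = tf.keras.Input(shape=(784,))                                 # e.g. a flattened 28x28 image
latent = tf.keras.layers.Dense(32, activation="relu")(inputs)         # encoder: compress to the latent space
outputs = tf.keras.layers.Dense(784, activation="sigmoid")(latent)    # decoder: reconstruct the input

autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")   # reconstruction loss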

Generative Adversarial Networks (GAN)

GANs consist of two neural networks, a generator and a discriminator, trained simultaneously in
a competitive setting. The generator learns to generate synthetic data samples that are
indistinguishable from real data, while the discriminator learns to distinguish between real and
fake samples. GANs have been widely used for generating realistic images, videos, and other
types of data.

Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent is an optimization algorithm that estimates the gradient of the loss function from a single training example or a small mini-batch at a time, rather than from the entire training set, and updates the parameters after each such estimate. A typical training run proceeds as follows:

1. Initialization: SGD starts with an initial set of model parameters (weights and biases)
randomly or using some predefined method.

2. Iterative Optimization: The aim of this step is to find the minimum of the loss
function by iteratively moving in the direction of the steepest decrease in the
function's value.
For each iteration (or epoch) of training:

• Shuffle the training data to ensure that the model doesn't learn from the same patterns
in the same order every time.
• Split the training data into mini-batches (small subsets of data).
• For each mini-batch:
• Compute the gradient of the loss function with respect to the model parameters using
only the data points in the mini-batch. This gradient estimation is a stochastic
approximation of the true gradient.
• Update the model parameters by taking a step in the opposite direction of the gradient,
scaled by a learning rate:

    θ_{t+1} = θ_t − η · ∇J(θ_t)

where θ_t represents the model parameters (e.g., the weights) at iteration t, ∇J(θ_t) is the
gradient of the loss function J with respect to the parameters θ_t, and η is the learning rate,
which controls the size of the steps taken during optimization.
3. Direction of Descent: The gradient of the loss function indicates the direction of the
steepest ascent. To minimize the loss function, gradient descent moves in the opposite
direction, towards the steepest descent.

4. Learning Rate: The step size taken in each iteration of gradient descent is determined
by a parameter called the learning rate, denoted above as η. This parameter controls
the size of the steps taken towards the minimum. If the learning rate is too small,
convergence may be slow; if it is too large, the algorithm may oscillate or diverge.
5. Convergence: Repeat the process for a fixed number of iterations or until a
convergence criterion is met (e.g., the change in loss function is below a certain
threshold).
Stochastic gradient descent updates the model parameters more frequently using smaller subsets
of data, making it computationally efficient, especially for large datasets. The randomness
introduced by SGD can have a regularization effect, preventing the model from overfitting to the
training data. It is also well-suited for online learning scenarios where new data becomes
available incrementally, as it can update the model quickly with each new data point or mini-
batch.

However, SGD can also have some challenges, such as increased noise due to the stochastic
nature of the gradient estimation and the need to tune hyperparameters like the learning rate.
Various extensions and adaptations of SGD, such as mini-batch stochastic gradient descent,
momentum, and adaptive learning rate methods like AdaGrad, RMSProp, and Adam, have been
developed to address these challenges and improve convergence and performance.
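
For illustration, the classic momentum variant can be sketched as follows (the function and values below are assumptions, not a reference implementation):

import numpy as np

def momentum_step(theta, grad, velocity, lr=0.01, beta=0.9):
    # velocity keeps an exponentially decaying sum of past gradients,
    # which damps the oscillations caused by noisy mini-batch gradients
    velocity = beta * velocity + grad
    theta = theta - lr * velocity
    return theta, velocity

# one illustrative update for a two-parameter model
theta = np.array([0.0, 0.0])
velocity = np.zeros_like(theta)
grad = np.array([0.4, -1.2])          # gradient computed from some mini-batch
theta, velocity = momentum_step(theta, grad, velocity)
print(theta, velocity)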

Stochastic Gradient Descent Algorithm


• Initialization: Randomly initialize the parameters of the model.
• Set Parameters: Determine the number of iterations and the learning rate
(alpha) for updating the parameters.
• Stochastic Gradient Descent Loop: Repeat the following steps until the
model converges or reaches the maximum number of iterations:
o Shuffle the training dataset to introduce randomness.
o Iterate over each training example (or a small batch) in the
shuffled order.
o Compute the gradient of the cost function with respect to the
model parameters using the current training
example (or batch).
o Update the model parameters by taking a step in the direction of
the negative gradient, scaled by the learning rate.
o Evaluate the convergence criteria, such as the change in the
cost function between successive iterations.
• Return Optimized Parameters: Once the convergence criteria are met or
the maximum number of iterations is reached, return the optimized model
parameters.
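
The steps above can be sketched in Python for a simple linear-regression model; the synthetic data, learning rate, and batch size are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))                     # 200 samples, 3 features
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)       # targets with a little noise

w = np.zeros(3)                                   # initialization
lr, batch_size, epochs = 0.05, 16, 30             # set parameters

for epoch in range(epochs):
    idx = rng.permutation(len(X))                 # shuffle the data each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)   # mini-batch gradient of the MSE loss
        w -= lr * grad                                  # step in the direction of the negative gradient

print(w)   # the optimized parameters should be close to true_w

Because each update looks at only a small batch of examples, the parameters are adjusted many times per pass over the data, which is exactly what makes SGD attractive for large datasets.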
In SGD, since only one sample (or a small mini-batch) from the dataset is chosen at random for each iteration, the path taken by the algorithm to reach the minimum is usually noisier than that of typical (batch) Gradient Descent. This matters little in practice: the exact path taken by the algorithm is unimportant as long as the minimum is reached, and SGD typically reaches it with a significantly shorter training time. Whereas Batch Gradient Descent follows a relatively smooth path towards the minimum, the path taken by Stochastic Gradient Descent zig-zags, because each update is based on a noisy gradient estimate.

One thing to note is that, as SGD is generally noisier than typical Gradient Descent, it usually takes a higher number of iterations to reach the minimum because of the randomness in its descent. Even though it requires more iterations than typical Gradient Descent, it is still computationally much less expensive, since each update uses only a small subset of the data. Hence, in most scenarios, SGD is preferred over Batch Gradient Descent for optimizing a learning algorithm.
