Applied ML notes
Unit - I
Machine learning involves data exploration and pattern matching with minimal human intervention. There
are four main types of machine learning techniques:
1. Supervised Learning:
Supervised Learning is a machine learning method that needs supervision similar to the
student-teacher relationship. In supervised Learning, a machine is trained with well-labeled
data, which means some data is already tagged with correct outputs. So, whenever new
data is introduced into the system, supervised learning algorithms analyze this sample data
and predict correct outputs with the help of that labeled data.
It is classified into two different categories of algorithms. These are as follows:
o Classification: It applies when the output is a category, such as yellow or blue,
right or wrong, etc.
o Regression: It applies when the output variable is a real value, such as age, height, etc.
This technology allows us to collect or produce data output from experience. It works the
same way as humans learn using some labeled data points of the training set. It helps in
optimizing the performance of models using experience and solving various complex
computation problems.
2. Unsupervised Learning:
Unlike supervised learning, unsupervised learning does not require classified or well-
labeled data to train a machine. It aims to group unsorted information based on
patterns and differences, without any labelled training data. In unsupervised
learning, no supervision is provided, so no labeled sample data is given to the machine. Hence,
the machine has to find hidden structures in unlabeled data on its own.
3. Semi-supervised learning:
Semi-supervised Learning is defined as the combination of both supervised and
unsupervised learning methods. It is used to overcome the drawbacks of both supervised
and unsupervised learning methods.
4. Reinforcement learning:
Reinforcement learning is defined as a feedback-based machine learning method that does
not require labeled data. In this learning method, an agent learns to behave in an
environment by performing actions and observing their results. The agent
receives positive feedback for each good action and negative feedback for each bad action.
Since there is no labeled training data in reinforcement learning, the agent has to
learn from its experience only.
In supervised learning, the training data provided to the machines work as the supervisor
that teaches the machines to predict the output correctly. It applies the same concept as a
student learns in the supervision of the teacher.
Supervised learning is a process of providing input data as well as correct output data to the
machine learning model. The aim of a supervised learning algorithm is to find a mapping
function to map the input variable(x) with the output variable(y).
In the real world, supervised learning can be used for risk assessment, image
classification, fraud detection, and spam filtering.
The working of Supervised learning can be easily understood by the below example and
diagram:
Suppose we have a dataset of different types of shapes which includes square,
rectangle, triangle, and Polygon. Now the first step is that we need to train the model for
each shape.
o If the given shape has four sides, and all the sides are equal, then it will be labelled as
a Square.
o If the given shape has three sides, then it will be labelled as a triangle.
o If the given shape has six equal sides then it will be labelled as hexagon.
Now, after training, we test our model using the test set, and the task of the model is to
identify the shape.
The machine is already trained on all types of shapes, and when it finds a new shape, it
classifies the shape on the basis of the number of sides and predicts the output.
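As a rough illustration of the train-then-predict flow in the shape example, the sketch below stores a few labelled shapes and labels a new shape by its nearest stored example. The feature encoding (number of sides, whether all sides are equal), the name `TRAINING_DATA`, and the function `predict_shape` are assumptions made for this sketch; the notes do not prescribe a particular algorithm.

```python
# Hypothetical features: (number_of_sides, all_sides_equal as 0/1).
TRAINING_DATA = [
    ((4, 1), "square"),
    ((4, 0), "rectangle"),
    ((3, 0), "triangle"),
    ((3, 1), "triangle"),
    ((6, 1), "hexagon"),
]


def predict_shape(features, training_data=TRAINING_DATA):
    """Return the label of the stored example closest to `features`."""
    def squared_distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    _, nearest_label = min(
        training_data, key=lambda item: squared_distance(item[0], features)
    )
    return nearest_label


print(predict_shape((4, 1)))
print(predict_shape((3, 0)))
```

Here the labelled pairs play the role of the training set, and the call to `predict_shape` plays the role of the test step.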
Steps Involved in Supervised Learning:
1. Regression
Regression algorithms are used if there is a relationship between the input variable and the
output variable. It is used for the prediction of continuous variables, such as Weather
forecasting, Market Trends, etc. Below are some popular Regression algorithms which
come under supervised learning:
o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
2. Classification
Classification algorithms are used when the output variable is categorical, which means
the output falls into classes such as Yes-No, Male-Female, True-False, etc. A typical
application is spam filtering. Some popular classification algorithms are:
o Random Forest
o Decision Trees
o Logistic Regression
o Support vector Machines
Advantages of Supervised Learning
o With the help of supervised learning, the model can predict the output on the basis of
prior experiences.
o In supervised learning, we can have an exact idea about the classes of objects.
o Supervised learning model helps us to solve various real-world problems such as fraud
detection, spam filtering, etc.
Disadvantages of Supervised Learning
o Supervised learning models are not suitable for handling very complex tasks.
o Supervised learning cannot predict the correct output if the test data is very different from the
training dataset.
o Training requires a lot of computation time.
o In supervised learning, we need enough knowledge about the classes of objects.
Regression Analysis in Machine learning
Regression analysis is a statistical method to model the relationship between a dependent
(target) variable and one or more independent (predictor) variables. More
specifically, Regression analysis helps us to understand how the value of the dependent
variable is changing corresponding to an independent variable when other independent
variables are held fixed. It predicts continuous/real values such as temperature, age,
salary, price, etc.
We can understand the concept of regression analysis using the below example:
Example: Suppose there is a marketing company A, which runs various advertisements every
year and earns sales from them. The below list shows the advertisement spend by the company in
the last 5 years and the corresponding sales:
Now, the company wants to spend $200 on advertisement in the year 2019 and wants to
predict the sales for that year. To solve this type of prediction
problem in machine learning, we need regression analysis.
In regression, we fit a line or curve to the given data points on a graph of the variables;
using this fit, the machine learning model can make predictions about the data. In simple
words, "Regression finds a line or curve through the data points on the
target-predictor graph in such a way that the vertical distance between the data points
and the regression line is minimum." The distance between the data points and the line tells
whether a model has captured a strong relationship or not.
o Dependent Variable: The main factor in Regression analysis which we want to predict
or understand is called the dependent variable. It is also called target variable.
o Independent Variable: The factors which affect the dependent variables or which are
used to predict the values of the dependent variables are called independent variable,
also called as a predictor.
o Outliers: An outlier is an observation with either a very low or a very high
value in comparison to other observed values. An outlier may distort the result, so it
should be detected and handled.
o Multicollinearity: If the independent variables are highly correlated with each other,
the condition is called multicollinearity. It should not be present in
the dataset, because it makes it hard to rank the variables by their effect.
o Underfitting and Overfitting: If our algorithm works well with the training dataset but
not well with test dataset, then such problem is called Overfitting. And if our algorithm
does not perform well even with training dataset, then such problem is
called underfitting.
o Regression estimates the relationship between the target and the independent variable.
o It is used to find the trends in data.
o It helps to predict real/continuous values.
o By performing the regression, we can confidently determine the most important factor,
the least important factor, and how each factor is affecting the other factors.
Types of Regression
There are various types of regressions which are used in data science and machine
learning. Each type has its own importance on different scenarios, but at the core, all the
regression methods analyze the effect of the independent variable on dependent variables.
Here we are discussing some important types of regression which are given below:
o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression
Linear Regression:
o Linear regression is a statistical regression method which is used for predictive analysis.
o It is one of the very simple and easy algorithms which works on regression and shows
the relationship between the continuous variables.
o It is used for solving the regression problem in machine learning.
o Linear regression shows the linear relationship between the independent variable (X-
axis) and the dependent variable (Y-axis), hence called linear regression.
o If there is only one input variable (x), then such linear regression is called simple linear
regression. And if there is more than one input variable, then such linear regression is
called multiple linear regression.
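o The relationship between variables in the linear regression model can be explained
using the below image. Here we are predicting the salary of an employee on the basis of
the years of experience.
o Below is the mathematical equation for linear regression:
Y = aX + b
Here Y is the dependent variable, X is the independent variable, and a and b are the
linear coefficients (the slope and the intercept of the line).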
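To make the equation concrete, the sketch below estimates a and b by ordinary least squares for a single input variable. The experience and salary figures are hypothetical values chosen for illustration, and `fit_line` is a helper name introduced here, not something defined in the notes.

```python
def fit_line(xs, ys):
    """Least-squares estimates of slope a and intercept b for Y = aX + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Hypothetical data: years of experience vs. salary (in thousands).
experience = [1, 2, 3, 4, 5]
salary = [30, 35, 40, 45, 50]

a, b = fit_line(experience, salary)
predicted = a * 6 + b  # predicted salary for 6 years of experience
print(a, b, predicted)
```

Because the hypothetical points lie exactly on a line, the fitted line passes through all of them; with real data the line only minimizes the squared vertical distances.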
Logistic Regression:
o Logistic regression is another supervised learning algorithm which is used to solve the
classification problems. In classification problems, we have dependent variables in a
binary or discrete format such as 0 or 1.
o Logistic regression algorithm works with the categorical variable such as 0 or 1, Yes or
No, True or False, Spam or not spam, etc.
o It is a predictive analysis algorithm which works on the concept of probability.
o Logistic regression is a type of regression, but it differs from the linear regression
algorithm in how it is used.
o Logistic regression uses the sigmoid function, also called the logistic function, to map
any real-valued input to a value between 0 and 1. This sigmoid function is used to model
the data in logistic regression. The function can be represented as:
f(x) = 1 / (1 + e^(-x))
When we provide the input values (data) to the function, it gives the S-curve as follows:
o It uses the concept of threshold levels: values above the threshold level are rounded up
to 1, and values below the threshold level are rounded down to 0.
o Binary (0/1, pass/fail)
o Multinomial (cats, dogs, lions)
o Ordinal (low, medium, high)
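The sigmoid and threshold ideas from the logistic regression bullets can be sketched in a few lines. The threshold of 0.5 and the function names are assumptions for this example; in a real model the input z would be a weighted sum of the features.

```python
import math


def sigmoid(z):
    """Logistic function: maps a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(z, threshold=0.5):
    """Return 1 if the predicted probability reaches the threshold, else 0."""
    return 1 if sigmoid(z) >= threshold else 0


print(sigmoid(0))
print(predict_class(2.0), predict_class(-2.0))
```

This simple form is meant for moderate values of z; very large negative inputs would need a numerically safer formulation.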
Polynomial Regression:
Example: Suppose the unsupervised learning algorithm is given an input dataset containing
images of different types of cats and dogs. The algorithm is never trained upon the given
dataset, which means it does not have any idea about the features of the dataset. The task
of the unsupervised learning algorithm is to identify the image features on its own.
Unsupervised learning algorithm will perform this task by clustering the image dataset into
the groups according to similarities between images.
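A minimal sketch of clustering by similarity is given below, using k-means on a handful of 2-D points. The points stand in for image features and are hypothetical, as are the starting centroids; the notes do not say which clustering method the example uses.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append((
                    sum(p[0] for p in cluster) / len(cluster),
                    sum(p[1] for p in cluster) / len(cluster),
                ))
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        centroids = new_centroids
    return centroids, clusters


# Hypothetical feature vectors forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

No labels are supplied anywhere; the grouping comes only from distances between the points.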
Below are some main reasons which describe the importance of Unsupervised Learning:
o Unsupervised learning is helpful for finding useful insights from the data.
o Unsupervised learning is similar to how a human learns to think from their own
experiences, which makes it closer to real AI.
o Unsupervised learning works on unlabeled and uncategorized data, which makes
it more important.
o In real-world, we do not always have input data with the corresponding output so to
solve such cases, we need unsupervised learning.
o K-means clustering
o KNN (k-nearest neighbors)
o Hierarchical clustering
o Anomaly detection
o Neural Networks
o Principal Component Analysis
o Independent Component Analysis
o Apriori algorithm
o Singular value decomposition
Advantages of Unsupervised Learning
Unlike regression, the output variable of classification is a category, not a value, such as
"Green or Blue", "fruit or animal", etc. Since the classification algorithm is a supervised
learning technique, it takes labeled input data, which means it contains input with the
corresponding output.
The main goal of the Classification algorithm is to identify the category of a given dataset,
and these algorithms are mainly used to predict the output for the categorical data.
Classification algorithms can be better understood using the below diagram. In the below
diagram, there are two classes, class A and Class B. These classes have features that are
similar to each other and dissimilar to other classes.
o Binary Classifier: If the classification problem has only two possible outcomes, then it is
called a Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
o Multi-class Classifier: If a classification problem has more than two outcomes, then it is
called a Multi-class Classifier.
Examples: Classification of types of crops, classification of types of music.
o Linear Models
  o Logistic Regression
  o Support Vector Machines
o Non-linear Models
  o K-Nearest Neighbours
  o Kernel SVM
  o Naïve Bayes
  o Decision Tree Classification
  o Random Forest Classification
Data preparation is also known as "data pre-processing," "data wrangling," "data cleaning,"
and "feature engineering." It is the stage of the machine
learning lifecycle that comes after data collection.
Data preparation is particular to data, the objectives of the projects, and the algorithms that
will be used in data modeling techniques.
o Data cleaning: This task includes the identification of errors and making corrections or
improvements to those errors.
o Feature Selection: We need to identify the most important or relevant input data
variables for the model.
o Data Transforms: Data transformation involves converting raw data into a well-suitable
format for the model.
o Feature Engineering: Feature engineering involves deriving new variables from the
available dataset.
o Dimensionality Reduction: The dimensionality reduction process involves converting
higher-dimensional features into lower-dimensional features while preserving as much information as possible.
o Missing data: Missing data or incomplete records is a prevalent issue found in most
datasets. Instead of appropriate data, sometimes records contain empty cells, values
(e.g., NULL or N/A), or a specific character, such as a question mark, etc.
o Outliers or Anomalies: ML algorithms are sensitive to the range and distribution of
values when data comes from unknown sources. These values can spoil the entire
machine learning training system and the performance of the model. Hence, it is
essential to detect these outliers or anomalies through techniques such as
visualization.
o Unstructured data format: Data comes from various sources and needs to be
extracted into a different format. Hence, before deploying an ML project, always consult
with domain experts or import data from known sources.
o Limited Features: Whenever data comes from a single source, it contains limited
features, so it is necessary to import data from various sources for feature enrichment or
build multiple features in datasets.
o Understanding feature engineering: Feature engineering helps develop additional
content in the ML models, increasing model performance and accuracy in predictions.
1. Understand the problem: This is one of the essential steps of data preparation for
a machine learning model in which we need to understand the actual problem and
try to solve it. To build a better model, we must have detailed information on all
issues, such as what to do and how to do it. It is also very much effective to retain
clients without wasting much effort.
2. Data collection: Data collection is probably the most typical step in the data
preparation process, where data scientists need to collect data from various potential
sources. These data sources may be either within the enterprise or third-party vendors.
Data collection is beneficial to reduce and mitigate biasing in the ML model; hence
before collecting data, always analyze it and also ensure that the data set was
collected from diverse people, geographical areas, and perspectives.
There are some common problems that can be addressed using data
collection as follows:
o It is helpful to determine the relevant attributes in the string for the .csv file
format.
o It is used to parse highly nested data structures files such as XML or JSON into
tabular form.
o It is significant in easier scanning and pattern detection in data sets.
o Data collection is a practical step in machine learning to find relevant data from
external repositories.
3. Profiling and Data Exploration: After analyzing and collecting data from various
data sources, it's time to explore the data for trends, outliers, exceptions, and incorrect,
inconsistent, missing, or skewed information. Although source data will provide
all model findings, it may contain unseen biases. Data exploration helps to
identify problems such as collinearity, as well as situations in which
standardization of data sets and other data transformations are necessary.
4. Data Cleaning and Validation: Data cleaning and validation techniques help
determine and solve inconsistencies, outliers, anomalies, incomplete data, etc.
Clean data helps to find valuable patterns and information in data and
ignores irrelevant data in the datasets. It is essential for building high-quality
models, and missing or incomplete data is one of the most common examples of poor data.
Since missing data generally reduces the prediction accuracy and performance of the
model, data must be cleaned and validated through various imputation tools to fill
incomplete fields with statistically relevant substitutes.
5. Data Formatting: After cleaning and validating data, the next step is to
ensure that the data is correctly formatted. If data is formatted incorrectly, it
will not help build a high-quality model.
Since data comes from various sources or is sometimes updated manually, there are
high chances of discrepancies in the data format. For example, if you have collected
data from two sources, one source has updated the product's price to USD10.50,
and the other has updated the same value to $10.50. Similarly, there may be
anomalies in their spelling, abbreviation, etc. This type of inconsistent formatting leads to
incorrect predictions. To reduce these errors, you must format your data in a consistent
manner by using some input formatting protocols.
6. Improve data quality: Quality is one of the essential parameters in building high-
quality models. Quality data has fewer errors, missing values, extreme values,
and outliers. We can understand it with an example: in one
dataset, columns are named First Name and Last Name, and another dataset has a column
named Customer that combines the first and last name. In such cases,
intelligent ML algorithms must have the ability to match these columns and join the
datasets for a singular view of the customer.
7. Feature engineering and selection:
Feature engineering is defined as the process of selecting, manipulating, and
transforming raw data into valuable features or most relevant variables in supervised
machine learning. Feature engineering enables you to build an enhanced predictive
model with accurate predictions.
For example, data can be split into various parts to capture more specific
information, such as analyzing marketing performance by the day of the week, not
only the month or year. In this situation, segregating the day as a separate
categorical value from the data (e.g., "Mon; 07.12.2021") may provide the algorithm
with more relevant information. There are various feature engineering techniques
used in machine learning as follows:
o Imputation: Feature imputation is the technique to fill incomplete fields in the
datasets. It is essential because most machine learning models don't work when
there are missing data in the dataset. The missing values problem can
be reduced by using techniques such as single value imputation, multiple value
imputation, K-Nearest neighbor, deleting the row, etc.
o Encoding: Feature encoding is defined as the method to convert string values
into numeric form. This is important as most ML models require all values in
numeric format. Feature encoding includes label encoding and one-hot
encoding (available in pandas as get_dummies).
Similarly, feature engineering also includes handling outliers, log transform, scaling,
normalization, Standardization, etc.
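The day-of-week example and one-hot encoding can be sketched together with the standard library alone. The date format `%m.%d.%Y` is an assumption, chosen because it makes the date "07.12.2021" fall on the Monday given in the example; `day_of_week` and `one_hot` are helper names introduced for this sketch.

```python
from datetime import datetime

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def day_of_week(date_text, date_format="%m.%d.%Y"):
    """Derive a categorical day-of-week feature from a date string."""
    return DAY_NAMES[datetime.strptime(date_text, date_format).weekday()]


def one_hot(value, categories):
    """Encode `value` as a list with a single 1 at its category position."""
    return [1 if value == category else 0 for category in categories]


day = day_of_week("07.12.2021")
encoded = one_hot(day, DAY_NAMES)
print(day, encoded)
```

The new feature has seven columns, one per weekday, which is the same shape of result that a one-hot encoder would produce for this column.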
8. Splitting data:
After feature engineering and selection, the last step is to split your data into two
different sets (training and evaluation sets). Further, always select non-overlapping
subsets of your data for the training and evaluation sets to ensure proper testing.
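A minimal sketch of a non-overlapping split is shown below. The 80/20 ratio, the fixed seed, and the function name `split_rows` are assumptions for illustration; libraries such as Scikit-learn provide their own splitting utilities.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle row indices once and cut them into two disjoint subsets."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(10))  # stand-in for 10 records
train, test = split_rows(rows)
print(len(train), len(test))
```

Because every index is placed in exactly one of the two slices, no record can appear in both the training and the evaluation set.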
o Reducible errors: These errors can be reduced to improve the model accuracy. Such
errors can further be classified into bias and Variance.
What is Bias?
In general, a machine learning model analyses the data, finds patterns in it, and makes
predictions. While training, the model learns these patterns in the dataset and applies them
to test data for prediction. While making predictions, a difference occurs between
the prediction values made by the model and the actual/expected values, and this
difference is known as bias error or error due to bias. It can be defined as the inability
of machine learning algorithms such as Linear Regression to capture the true relationship
between the data points. Each algorithm begins with some amount of bias because bias
arises from assumptions in the model, which make the target function simpler to learn. A
model has either:
o Low Bias: A low bias model will make fewer assumptions about the form of the target
function.
o High Bias: A model with a high bias makes more assumptions, and the model becomes
unable to capture the important features of our dataset. A high bias model also cannot
perform well on new data.
Generally, a linear algorithm has a high bias, which also makes it fast to learn. The simpler the
algorithm, the higher the bias it is likely to introduce. A nonlinear algorithm, in contrast,
often has low bias.
Some examples of machine learning algorithms with low bias are Decision Trees, k-
Nearest Neighbours and Support Vector Machines. At the same time, algorithms with
high bias include Linear Regression, Linear Discriminant Analysis and Logistic Regression.
Low variance means there is a small variation in the prediction of the target function with
changes in the training data set. At the same time, High variance shows a large variation in
the prediction of the target function with changes in the training dataset.
A model that shows high variance learns a lot and performs well with the training dataset,
but does not generalize well to unseen data. As a result, such a model gives good
results with the training dataset but shows high error rates on the test dataset.
Since, with high variance, the model learns too much from the dataset, it leads to overfitting
of the model. A model with high variance has the below problems:
Usually, nonlinear algorithms, which have a lot of flexibility to fit the data, have high variance.
Some examples of machine learning algorithms with low variance are, Linear Regression,
Logistic Regression, and Linear discriminant analysis. At the same time, algorithms
with high variance are decision tree, Support Vector Machine, and K-nearest
neighbours.
o High training error and the test error is almost similar to training error.
Bias-Variance Trade-Off
While building the machine learning model, it is really important to take care of bias and
variance in order to avoid overfitting and underfitting in the model. If the model is very
simple with fewer parameters, it may have low variance and high bias. Whereas, if the
model has a large number of parameters, it will have high variance and low bias. So, it is
required to make a balance between bias and variance errors, and this balance between
the bias error and variance error is known as the Bias-Variance trade-off.
For an accurate prediction of the model, algorithms need a low variance and low bias. But
achieving both at once is usually not possible, because bias and variance are related to
each other: decreasing one tends to increase the other.
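One way to see the trade-off is to fit polynomials of increasing degree to the same noisy data and compare training and test error. The sketch below uses NumPy; the sine-shaped data, the noise level, the seed, and the chosen degrees are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noisy samples of a smooth curve.
x_train = np.linspace(0.0, 1.0, 12)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.2, x_train.size)
x_test = np.linspace(0.05, 0.95, 10)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0.0, 0.2, x_test.size)


def mse(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_error = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_error = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_error, test_error


for degree in (1, 3, 7):
    train_error, test_error = mse(degree)
    print(f"degree={degree}  train MSE={train_error:.4f}  test MSE={test_error:.4f}")
```

The training error can only go down as the degree grows, since each higher-degree model contains the lower-degree ones. The test error typically falls at first and then rises once the model starts fitting noise, although the exact numbers depend on the data.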
Before understanding overfitting and underfitting, let's understand some basic terms that
will help to understand this topic well:
o Signal: It refers to the true underlying pattern of the data that helps the machine learning
model to learn from the data.
o Noise: Noise is unnecessary and irrelevant data that reduces the performance of the
model.
o Bias: Bias is a prediction error that is introduced in the model due to oversimplifying the
machine learning algorithms. Or it is the difference between the predicted values and the
actual values.
o Variance: If the machine learning model performs well with the training dataset, but
does not perform well with the test dataset, then variance occurs.
Overfitting
Overfitting occurs when our machine learning model tries to cover all the data points, or
more than the required data points, present in the given dataset. Because of this, the model
starts capturing noise and inaccurate values present in the dataset, and all these factors
reduce the efficiency and accuracy of the model. The overfitted model has low
bias and high variance.
The chance of overfitting increases the longer we train our model:
the more we train our model, the more likely it is to become overfitted.
Example: The concept of the overfitting can be understood by the below graph of the linear
regression output:
As we can see from the above graph, the model tries to cover all the data points present in
the scatter plot. It may look efficient, but in reality, it is not so. The goal of the
regression model is to find the best fit line, but here we have not got a best fit, so it will
generate prediction errors.
How to avoid overfitting:
o Cross-Validation
o Training with more data
o Removing features
o Early stopping the training
o Regularization
o Ensembling
Underfitting
Underfitting occurs when our machine learning model is not able to capture the underlying
trend of the data. To avoid overfitting in the model, the feeding of training data can be
stopped at an early stage, due to which the model may not learn enough from the training
data. As a result, it may fail to find the best fit of the dominant trend in the data.
In the case of underfitting, the model is not able to learn enough from the training data, and
hence it reduces the accuracy and produces unreliable predictions.
Example: We can understand the underfitting using below output of the linear regression
model:
As we can see from the above diagram, the model is unable to capture the data points
present in the plot.
How to avoid underfitting:
Goodness of Fit
The "Goodness of fit" term is taken from the statistics, and the goal of the machine learning
models is to achieve a good fit. In statistical modeling, it defines how closely the
result or predicted values match the true values of the dataset.
The model with a good fit is between the underfitted and overfitted model, and ideally, it
makes predictions with 0 errors, but in practice, it is difficult to achieve.
When we train our model for some time, the errors on the training data go down, and the
same happens with the test data. But if we train the model for too long, then the
performance of the model may decrease due to overfitting, as the model also learns the
noise present in the dataset. The errors on the test dataset start increasing, so the point just
before the errors start rising is the good point, and we can stop there to achieve a good
model.
There are two other methods by which we can get a good point for our model, which are
the resampling method to estimate model accuracy and validation dataset.
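The idea of stopping just before the test error starts to rise can be sketched as a small early-stopping rule. The error values and the `patience` parameter are hypothetical; this is one common way to pick the stopping point, not necessarily the one the notes have in mind.

```python
def early_stopping_epoch(validation_errors, patience=2):
    """Return the index of the best epoch seen before training is stopped.

    Training is considered finished once the validation error has failed to
    improve for `patience` consecutive epochs.
    """
    best_epoch = 0
    best_error = float("inf")
    epochs_without_improvement = 0
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch


# Hypothetical validation errors recorded after each training epoch.
errors = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55]
best = early_stopping_epoch(errors)
print(best)
```

With these values the error is lowest at epoch index 3, so the model from that epoch would be kept.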
UNIT - II
DATA PREPROCESSING
Data preprocessing is a process of preparing the raw data and making it suitable for a
machine learning model. It is the first and crucial step while creating a machine learning
model.
When creating a machine learning project, we do not always come across
clean and formatted data. Before doing any operation with data, it is mandatory to clean
it and put it in a formatted way. For this, we use the data preprocessing task.
Datasets may be of different formats for different purposes. For example, if we want to create a
machine learning model for a business purpose, then the dataset will be different from the dataset
required for a liver patient. So each dataset is different from another dataset. To use the
dataset in our code, we usually put it into a CSV file. However, sometimes, we may also
need to use an HTML or xlsx file.
We can also create our dataset by gathering data using various APIs with Python and put
that data into a .csv file.
2) Importing Libraries
In order to perform data preprocessing using Python, we need to import some predefined
Python libraries. These libraries are used to perform some specific jobs. There are three
specific libraries that we will use for data preprocessing, which are:
NumPy: The NumPy library is used for performing mathematical operations in
the code. It is the fundamental package for scientific computing in Python. It also supports
large, multidimensional arrays and matrices. So, in Python, we can import it as:
import numpy as nm
Here we have used nm, which is a short name for NumPy, and it will be used in the whole
program.
Matplotlib: The second library is matplotlib, which is a Python 2D plotting library, and with
this library, we need to import a sub-library pyplot. This library is used to plot any type of
charts in Python for the code. It will be imported as below:
Pandas: The last library is the Pandas library, which is one of the most famous Python
libraries and is used for importing and managing datasets. It is an open-source data
manipulation and analysis library. It will be imported as below:
import pandas as pd
Here, we have used pd as a short name for this library. Consider the below image:
Here, in the below image, we can see the Python file along with required dataset. Now, the
current folder is set as a working directory.
read_csv() function:
Now to import the dataset, we will use the read_csv() function of the pandas library, which
reads a CSV file into a DataFrame. Using this function, we can read a CSV
file locally as well as through a URL.
data_set = pd.read_csv('Dataset.csv')
Here, data_set is a name of the variable to store our dataset, and inside the function, we
have passed the name of our dataset. Once we execute the above line of code, it will
successfully import the dataset in our code. We can also check the imported dataset by
clicking on the section variable explorer, and then double click on data_set. Consider the
below image:
As in the above image, indexing is started from 0, which is the default indexing in Python.
We can also change the format of our dataset by clicking on the format option.
To extract an independent variable, we will use the iloc[ ] indexer of the pandas library. It is used
to extract the required rows and columns from the dataset by position.
x = data_set.iloc[:, :-1].values
In the above code, the first colon(:) is used to take all the rows, and the second part (:-1)
selects the columns. Here we have used :-1 because we don't want to take the last column,
as it contains the dependent variable. So by doing this, we will get the matrix of features.
y = data_set.iloc[:, 3].values
Here we have taken all the rows with the last column only (index 3 in this four-column
dataset). It will give the array of dependent variables.
Output:
array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
dtype=object)
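A minimal sketch of this extraction, assuming a small hypothetical DataFrame in place of
Dataset.csv (the column names Country, Age, Salary, and Purchased and the country values are
illustrative and are not taken from the original file):

```python
import pandas as pd

# Hypothetical stand-in for the dataset loaded with pd.read_csv()
data_set = pd.DataFrame({
    "Country": ["India", "France", "Germany"],
    "Age": [38, 43, 30],
    "Salary": [68000, 45000, 54000],
    "Purchased": ["No", "Yes", "No"],
})

# All rows, every column except the last: the matrix of features
x = data_set.iloc[:, :-1].values
# All rows, last column only: the dependent variable
y = data_set.iloc[:, -1].values

print(x.shape)
print(y)
```

Using -1 instead of a fixed index such as 3 keeps the code working if more feature columns
are added before the target column.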
Note: If you are using Python language for machine learning, then extraction is mandatory, but for
R language it is not required.
There are mainly two ways to handle missing data, which are:
By deleting the particular row: The first way is commonly used to deal with null values. Here,
we simply delete the specific row or column that contains null values. However, this
approach is wasteful, because removing data may lose information and make the output
less accurate.
By calculating the mean: In this way, we calculate the mean of the column or row
that contains a missing value and put it in place of the missing value. This strategy
is useful for features that have numeric data such as age, salary, year, etc. Here, we
will use this approach.
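The two strategies can be compared on a toy example. This is only a sketch using pandas
directly; the column names and values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data with one missing value per column
df = pd.DataFrame({
    "Age": [38.0, 43.0, np.nan, 48.0],
    "Salary": [68000.0, np.nan, 54000.0, 65000.0],
})

# Strategy 1: delete every row that contains a null value
dropped = df.dropna()

# Strategy 2: replace each null value with the mean of its column
filled = df.fillna(df.mean())

print(len(df), len(dropped))   # rows before and after deletion
print(filled)
```

Deleting rows discards half of this tiny dataset, while mean imputation keeps every row at
the cost of inserting estimated values.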
To handle missing values, we will use the Scikit-learn library in our code, which contains
various modules for building machine learning models. Older tutorials use the Imputer class
of sklearn.preprocessing, but that class was removed in scikit-learn 0.22; current versions
provide the SimpleImputer class of sklearn.impute instead. Below is the code for it:
# Handling missing data (replacing missing data with the mean value)
import numpy as np
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# Fitting the imputer object to the numeric columns of x
imputer = imputer.fit(x[:, 1:3])
# Replacing missing data with the calculated mean value
x[:, 1:3] = imputer.transform(x[:, 1:3])
Machine learning models work entirely on mathematics and numbers, so a categorical
variable in the dataset may cause trouble while building the model. It is therefore
necessary to encode categorical variables as numbers.
First, we will convert the Country variable from category names into numeric codes. To do
this, we will use the LabelEncoder class from the preprocessing library.
# Categorical data
# For the Country variable
from sklearn.preprocessing import LabelEncoder
label_encoder_x = LabelEncoder()
x[:, 0] = label_encoder_x.fit_transform(x[:, 0])
Output:
Out[15]:
array([[2, 38.0, 68000.0],
[0, 43.0, 45000.0],
[1, 30.0, 54000.0],
[0, 48.0, 65000.0],
[1, 40.0, 65222.22222222222],
[2, 35.0, 58000.0],
[1, 41.111111111111114, 53000.0],
[0, 49.0, 79000.0],
[2, 50.0, 88000.0],
[0, 37.0, 77000.0]], dtype=object)
Explanation:
In the above code, we have imported the LabelEncoder class of the sklearn library. This class
has successfully encoded the country names into digits.
But in our case, there are three country categories, and as we can see in the above output,
they are encoded as 0, 1, and 2. From these values, the machine learning model
may assume that there is an order or numeric relationship between the countries (for
example, that 2 is greater than 0), which will produce wrong output. To remove this issue,
we will use dummy encoding.
Dummy Variables:
Dummy variables are variables that take only the values 0 or 1. A value of 1 indicates that
the row belongs to the category represented by that column, and the columns for all other
categories are 0. With dummy encoding, we get a number of columns equal to the number of
categories.
In our dataset, we have 3 categories, so it will produce three columns having 0 and 1 values.
For Dummy Encoding, we will use the OneHotEncoder class of the preprocessing library.
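A minimal sketch of dummy encoding with OneHotEncoder, assuming a hypothetical feature
matrix whose first column holds the country name (the country names and numbers below are
illustrative only):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical feature matrix: Country, Age, Salary
x = np.array([
    ["India", 38.0, 68000.0],
    ["France", 43.0, 45000.0],
    ["Germany", 30.0, 54000.0],
    ["France", 48.0, 65000.0],
], dtype=object)

# Encode only the country column; the encoder returns a sparse matrix,
# so it is converted to a dense array before stacking
encoder = OneHotEncoder()
country_dummies = encoder.fit_transform(x[:, [0]]).toarray()

# Put the dummy columns in front of the remaining numeric columns
x_encoded = np.hstack([country_dummies, x[:, 1:].astype(float)])

print(encoder.categories_[0])  # order of the dummy columns
print(x_encoded)
```

Each row now has exactly one 1 among the first three columns, so no artificial ordering
between the countries is implied.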
As we can see in the above output, all the variables are encoded into numbers 0 and 1 and
divided into three columns.
It can be seen more clearly in the variables explorer section, by clicking on x option as:
For Purchased Variable:
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
For the second categorical variable, we only use a labelencoder object
of the LabelEncoder class. Here we are not using the OneHotEncoder class because the
purchased variable has only two categories, yes and no, which are automatically encoded
into 0 and 1.
Suppose we train our machine learning model on one dataset and then test it on a
completely different dataset. The model will then have difficulty, because the relationships
it learned from the first dataset may not hold in the second.
If we train our model very well and its training accuracy is very high, but its performance
drops when we give it a new dataset, the model is of little use. So we always try to build a
machine learning model that performs well on the training set and also on the test
dataset. Here, we can define these datasets as:
Training Set: A subset of dataset to train the machine learning model, and we already
know the output.
Test set: A subset of dataset to test the machine learning model, and by using the test set,
model predicts the output.
For splitting the dataset, we will use the below lines of code:
o In the above code, the first line is used for splitting arrays of the dataset into random
train and test subsets.
o In the second line, we have used four variables for our output, which are
o x_train: features for the training data
o x_test: features for the testing data
o y_train: dependent variable for the training data
o y_test: dependent variable for the testing data
o In the train_test_split() function, we have passed four parameters, of which the first two
are the arrays of data, and test_size specifies the size of the test set. The test_size
may be .5, .3, or .2, which gives the fraction of the data placed in the test set.
o The last parameter, random_state, sets a seed for the random generator so that
you always get the same split; a frequently used value for this is 42.
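The split described above can be sketched as follows. The arrays x and y are small
hypothetical stand-ins for the preprocessed features and labels, and the values chosen for
test_size and random_state are examples only:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each, and binary labels
x = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

# 20% of the samples go to the test set; the seed makes the split repeatable
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

print(x_train.shape, x_test.shape)
print(y_train.shape, y_test.shape)
```

With 10 samples and test_size=0.2, 8 samples are used for training and 2 for testing.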
Output:
By executing the above code, we will get 4 different variables, which can be seen under the
variable explorer section.
As we can see in the above image, the x and y variables are divided into 4 different
variables with corresponding values.
7) Feature Scaling
Feature scaling is the final step of data preprocessing in machine learning. It is a technique
to bring the independent variables of the dataset onto a common scale, so that no variable
dominates the others simply because of its units or range.
There are two common ways to perform feature scaling:
Standardization: x' = (x - mean(x)) / standard deviation(x)
Normalization: x' = (x - min(x)) / (max(x) - min(x))
Here, we will use the standardization method for our dataset.
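from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
For the test dataset, we directly apply the transform() function instead
of fit_transform(), because the scaler has already learned the mean and standard deviation
from the training set, and the test set must be scaled with those same values.
x_test = st_x.transform(x_test)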
Output:
By executing the above lines of code, we will get the scaled values for x_train and x_test as:
x_train:
x_test:
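As we can see in the above output, the variables are now on a comparable scale. After
standardization each training column has mean 0 and standard deviation 1, so most values
lie within a few units of 0; they are not strictly limited to the range -1 to 1.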
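A small sketch of standardization, assuming hypothetical age and salary columns. It checks
that the scaled training columns have mean 0 and standard deviation 1, and that the test row
is scaled with the statistics of the training set:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test features: age and salary
x_train = np.array([[30.0, 40000.0], [40.0, 60000.0], [50.0, 80000.0]])
x_test = np.array([[45.0, 70000.0]])

st_x = StandardScaler()
x_train_scaled = st_x.fit_transform(x_train)  # learns mean and std, then scales
x_test_scaled = st_x.transform(x_test)        # reuses the training mean and std

print(x_train_scaled.mean(axis=0))  # approximately 0 for each column
print(x_train_scaled.std(axis=0))   # approximately 1 for each column
print(x_test_scaled)
```

Fitting the scaler on the test set as well would leak information about the test data into
preprocessing, which is why only transform() is applied to it.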
Data Cleaning
Data cleaning is the process of correcting or deleting inaccurate, damaged, improperly
formatted, duplicated, or insufficient data from a dataset. Even if results and algorithms
appear to be correct, they are unreliable if the data is inaccurate. There are numerous ways
for data to be duplicated or incorrectly labeled when merging multiple data sources.
You might eliminate those useless observations, for instance, if you wish to analyze data on
millennial clients but your dataset also includes observations from earlier generations. This
can improve the analysis's efficiency, reduce deviance from your main objective, and
produce a dataset that is easier to maintain and use.
However, occasionally the emergence of an outlier will support a theory you are
investigating. And just because there is an outlier, that doesn't necessarily indicate it is
inaccurate. To determine the reliability of the number, this step is necessary. If an outlier
turns out to be incorrect or unimportant for the analysis, you might want to remove it.
Although you can remove observations with missing values, doing so will result in the loss
of information, so proceed with caution.
Alternatively, you can fill in missing values based on other observations. Again, there is a
chance of undermining the integrity of the data, since you may then be working
from assumptions rather than actual observations.
As a third option, you may need to change the way the data is used so that null values are
handled effectively.
5. Validate and QA
As part of fundamental validation, you ought to be able to respond to the following queries
once the data cleansing procedure is complete:
1. Ignore the tuples: This approach is not very practical, because it is only useful when
a tuple has several attributes with missing values.
2. Fill in the missing value: This strategy is not always practical or effective either, and
it can be time-consuming. The missing value can be filled in manually, or it can be
replaced with the attribute mean or the most probable value.
3. Binning method: This strategy is fairly easy to comprehend. The data is first sorted
and then split into several equal-sized bins. The values in each bin are then
smoothed using their neighbours, for example by replacing them with the bin mean.
4. Regression: The data is smoothed by fitting it to a regression function.
The regression may be simple linear, with one independent variable, or multiple,
with more than one independent variable.
5. Clustering: This technique groups comparable values into a "group" or "cluster".
Values that fall outside every cluster can then be treated as outliers.
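As an illustration of the binning method, the sketch below smooths a small list of values by
bin means. The numbers and the choice of three equal-sized bins are assumptions made for the
example:

```python
import numpy as np

# Hypothetical noisy values
values = np.array([21, 4, 28, 15, 34, 8, 25, 21, 24])

# Step 1: sort the data
sorted_values = np.sort(values)

# Step 2: split the sorted data into three equal-sized bins
bins = np.array_split(sorted_values, 3)

# Step 3: replace every value in a bin by the mean of that bin
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])

print(sorted_values)
print(smoothed)
```

The bins here are [4, 8, 15], [21, 21, 24], and [25, 28, 34], so the smoothed values are 9,
22, and 29 repeated three times each.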
Process of Data Cleaning
The data cleaning method for data mining is demonstrated in the subsequent sections.
1. Monitoring the errors: Keep track of the areas where errors seem to occur most
frequently. This makes it simpler to identify and fix inaccurate or corrupt information,
which is particularly important when integrating another solution with
current management software.
2. Standardize the mining process: To help lower the likelihood of duplication,
standardize the point of data entry.
3. Validate data accuracy: Analyse the data and invest in data cleaning
software. Artificial intelligence-based tools can be used to check thoroughly for
accuracy.
4. Scrub for duplicate data: To save time when analyzing data, find duplicates.
Investing in separate data-cleaning tools that can analyze
imperfect data in bulk and automate the operation makes it possible to avoid
processing the same data repeatedly.
5. Research on data: Before this step, our data needs to be vetted, standardized, and
checked for duplicates. There are numerous vetted and approved third-party
sources that can extract data straight from our databases. They
assist us in gathering the data and cleaning it up so that it is reliable, accurate, and
comprehensive for use in business decisions.
6. Communicate with the team: Keeping the team informed helps with client
development and strengthening, as well as with giving more focused information to
potential clients.
o Accuracy: The business's database must contain only extremely accurate data.
Comparing them to other sources is one technique to confirm their veracity. The stored
data will also have issues if the source cannot be located or contains errors.
o Coherence: To ensure that the information on a person or body is the same throughout
all types of storage, the data must be consistent with one another.
o Validity: There must be rules or limitations in place for the stored data. The information
must also be confirmed to support its veracity.
o Uniformity: A database's data must all share the same units or values. Since it doesn't
complicate the process, it is a crucial component while doing the Data Cleansing
process.
o Data Verification: Every step of the process, including its appropriateness and
effectiveness, must be checked. The study, design, and validation stages all play a role
in the verification process. The disadvantages are frequently obvious after applying the
data to a specific number of changes.
o Clean Data Backflow: After quality issues have been addressed, the cleaned data should
replace the dirty data in the original source, so that legacy applications can
profit from it and a subsequent data-cleaning program is not needed.
1. OpenRefine
2. Trifacta Wrangler
3. Drake
4. Data Ladder
5. Data Cleaner
6. Cloudingo
7. Reifier
8. IBM Infosphere Quality Stage
9. TIBCO Clarity
10. Winpure
Suppose we have a data set that has three attributes - pizza_name, is_veg, is_nonveg
pizza_name          is_veg   is_nonveg
Farm House          1        0
Veg Loaded          1        0
Chicken Sausage     0        1
Non-Veg Supreme     0        1
Chicken Fiesta      0        1
Veg Extravaganza    1        0
Deluxe Veggie       1        0
On analyzing the above table, we find that if a pizza is not veg (i.e., is_veg is 0), then
the pizza is surely non-veg, since every pizza falls into exactly one of the two classes,
Veg or Nonveg. Hence, one of these attributes is redundant. The two attributes are
completely determined by each other, and either one can be derived from the other.
So, you can drop either is_veg or is_nonveg without any loss of information.
Detection of Data Redundancy
The following methods are used to detect redundancies:
1. χ² (chi-square) test
2. Correlation coefficient and covariance
χ² Test
The χ² test is used for qualitative (nominal or categorical) data. Let us suppose we have
two attributes X and Y in the set of data. To represent the data tuples, you have to make a
contingency table that counts how often each pair of values of X and Y occurs together.
The statistic is then computed as
\[ \chi^{2}=\sum \frac{(\text{observed}-\text{expected})^{2}}{\text{expected}} \]
Where,
the observed value is the actual count of a joint event in the contingency table, and the
expected value is the count that the joint event would have if X and Y were independent
(row total × column total / number of tuples).
The χ² test examines the hypothesis that X and Y are independent. If this hypothesis can be
rejected, we can assume that X and Y are statistically related to each other, and we can
ignore any one of them (either X or Y).
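The computation can be sketched with NumPy as follows. The 2 × 2 contingency table is
hypothetical, and the critical value 3.841 quoted in the comment is the standard χ² value
for 1 degree of freedom at the 5% significance level:

```python
import numpy as np

# Hypothetical contingency table: rows are values of X, columns are values of Y
observed = np.array([[10.0, 20.0],
                     [20.0, 10.0]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()

# Expected count of each joint event if X and Y were independent
expected = row_totals @ col_totals / n

chi2 = ((observed - expected) ** 2 / expected).sum()
degrees_of_freedom = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(chi2, degrees_of_freedom)
# chi2 is above 3.841, so independence is rejected at the 5% level here
```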
There are several different correlation coefficients, each of which is appropriate for different
types of data. The most common is the Pearson r, used for continuous variables. It is a
statistic that measures the degree to which one variable varies in tandem with another. It
ranges from -1 to +1. A +1 correlation means that as one variable rises, the other rises
proportionally; a -1 correlation means that as one rises, the other falls proportionally. A 0
correlation means that there is no relationship between the movements of the two variables.
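For two numeric attributes A and B, the coefficient is computed as
\[ r_{A,B}=\frac{\sum_{i=1}^{n}\left(a_{i}-\bar{A}\right)\left(b_{i}-\bar{B}\right)}{n\,\sigma_{A}\,\sigma_{B}} \]
where
n = number of tuples
ai = value of A in tuple i
bi = value of B in tuple i
and the remaining symbols are the mean values and the standard deviations of A and B.
From the above discussion, we can say that the greater the magnitude of the correlation
coefficient, the more strongly the attributes are correlated to each other, and we can ignore
any one of them (either A or B). If the value of the correlation coefficient is zero, there is
no linear relationship between the attributes. If the value of the correlation coefficient is
negative, one attribute discourages the other: as the value of one attribute increases, the
value of the other attribute decreases.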
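As a quick check, the Pearson coefficient can be computed for the is_veg and is_nonveg
columns of the pizza table. The helper name pearson_r is introduced only for this sketch:

```python
import numpy as np

# Columns taken from the pizza table
is_veg = np.array([1, 1, 0, 0, 0, 1, 1], dtype=float)
is_nonveg = np.array([0, 0, 1, 1, 1, 0, 0], dtype=float)


def pearson_r(a, b):
    """Pearson correlation using population standard deviations."""
    n = len(a)
    covariance_sum = ((a - a.mean()) * (b - b.mean())).sum()
    return covariance_sum / (n * a.std() * b.std())


r = pearson_r(is_veg, is_nonveg)
print(r)
```

The result is -1 (up to rounding), which confirms that one attribute fully determines the
other and that keeping both is redundant.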
A dataset often contains a huge number of input features, which makes the
predictive modeling task more complicated. Because it is very difficult to visualize the data
or make predictions when the training dataset has a high number of features,
dimensionality reduction techniques are needed in such cases.
Dimensionality reduction technique can be defined as, "It is a way of converting the
higher dimensions dataset into lesser dimensions dataset ensuring that it provides
similar information." These techniques are widely used in machine learning for obtaining
a better fit predictive model while solving the classification and regression problems.
It is commonly used in the fields that deal with high-dimensional data, such as speech
recognition, signal processing, bioinformatics, etc. It can also be used for data
visualization, noise reduction, cluster analysis, etc.
Benefits of applying Dimensionality Reduction
Some benefits of applying dimensionality reduction technique to the given dataset are given
below:
o By reducing the dimensions of the features, the space required to store the dataset also
gets reduced.
o Less Computation training time is required for reduced dimensions of features.
o Reduced dimensions of features of the dataset help in visualizing the data quickly.
o It removes the redundant features (if present) by taking care of multicollinearity.
PCA generally tries to find a lower-dimensional surface onto which to project the high-dimensional
data. PCA works by considering the variance of each attribute, because high variance indicates a good
split between the classes, and it reduces the dimensionality by keeping the directions of highest
variance. Some real-world applications of PCA are image processing, movie recommendation systems,
and optimizing the power allocation in various communication channels. It is a feature extraction
technique, so it retains the important variables and drops the least important ones.
o Each principal component must be a linear combination of the original features.
o These components are orthogonal, i.e., the correlation between any pair of components is
zero.
o The importance of each component decreases when going from 1 to n, which means the first PC
has the most importance, and the n-th PC has the least importance.
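These properties can be checked with a small NumPy sketch that performs PCA through an
eigendecomposition of the covariance matrix. The synthetic data and the choice of keeping
two components are assumptions made for the example:

```python
import numpy as np

# Synthetic data: the second feature is almost a multiple of the first
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([
    base,
    2.0 * base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Center the data and compute the covariance matrix of the features
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# Eigenvectors of the covariance matrix are the principal components
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]      # sort by decreasing variance
eigenvalues = eigenvalues[order]
components = eigenvectors[:, order]

explained_ratio = eigenvalues / eigenvalues.sum()
X_reduced = X_centered @ components[:, :2]  # keep the first two components

print(explained_ratio)
print(X_reduced.shape)
```

The columns of components are mutually orthogonal, and explained_ratio decreases from the
first component to the last.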
Whenever there is a requirement to separate two or more classes having multiple features efficiently,
the Linear Discriminant Analysis model is considered the most common technique to solve such
classification problems. For example, suppose we have two classes with multiple features and need to
separate them efficiently. If we classify them using a single feature, the classes may overlap.
To overcome the overlapping issue in the classification process, we keep increasing the number of
features used.
Example:
Let's assume we have to classify two different classes having two sets of data points in a 2-
dimensional plane as shown below image:
Here, it is impossible to draw a straight line in the 2-D plane that separates these data points
efficiently. Using Linear Discriminant Analysis, however, we can reduce the 2-D plane to a
1-D line, and in doing so maximize the separability between the classes.
Let's consider an example where we have two classes in a 2-D plane having an X-Y axis,
and we need to classify them efficiently. As we have already seen in the above example
that LDA enables us to draw a straight line that can completely separate the two classes of
the data points. Here, LDA uses an X-Y axis to create a new axis by separating them using
a straight line and projecting data onto a new axis.
Hence, we can maximize the separation between these classes and reduce the 2-D plane
into 1-D.
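To create a new axis, Linear Discriminant Analysis uses the following criteria:
o It maximizes the distance between the means of the two classes.
o It minimizes the variance within each individual class.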
In other words, we can say that the new axis will increase the separation between the data
points of the two classes and plot them onto the new axis.
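For two classes, the new axis can be computed with Fisher's linear discriminant. The sketch
below uses NumPy only, on two synthetic point clouds whose centers and spread are
assumptions chosen for illustration:

```python
import numpy as np

# Two synthetic classes in a 2-D plane
rng = np.random.default_rng(0)
class_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
class_b = rng.normal(loc=[3.0, 3.0], scale=0.5, size=(50, 2))

mean_a = class_a.mean(axis=0)
mean_b = class_b.mean(axis=0)

# Within-class scatter: how much each class spreads around its own mean
scatter_within = (
    (class_a - mean_a).T @ (class_a - mean_a)
    + (class_b - mean_b).T @ (class_b - mean_b)
)

# Direction that separates the means while keeping each class compact
w = np.linalg.solve(scatter_within, mean_b - mean_a)
w = w / np.linalg.norm(w)

# Project the 2-D points onto the new 1-D axis
proj_a = class_a @ w
proj_b = class_b @ w

print(proj_a.max(), proj_b.min())
```

For these well-separated clouds, every projected point of the first class lies below every
projected point of the second class on the new axis.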
Why LDA?
o Logistic Regression is one of the most popular classification algorithms that perform well
for binary classification but falls short in the case of multiple classification problems with
well-separated classes. At the same time, LDA handles these quite efficiently.
o LDA can also be used in data pre-processing to reduce the number of features, just as
PCA, which reduces the computing cost significantly.
o LDA is also used in face detection algorithms. In Fisherfaces, LDA is used to extract
useful data from different faces. Coupled with eigenfaces, it produces effective results.
1. Quadratic Discriminant Analysis (QDA): For multiple input variables, each class
uses its own estimate of variance (or covariance).
2. Flexible Discriminant Analysis (FDA): It is used when non-linear combinations
of inputs, such as splines, are used.
3. Regularized Discriminant Analysis (RDA): This uses regularization in the estimate of
the variance (actually covariance) and hence moderates the influence of different
variables on LDA.
Unit – III
Linearly separable data points can be separated using a line, linear function, or flat
hyperplane. In practice, there are several methods to determine whether data is linearly
separable[3]. One method is linear programming, which defines an objective function
subjected to constraints that satisfy linear separability. Another method is to train and test
on the same data - if there is a line that separates the data points, then the accuracy or
AUC should be close to 100%. If there is no such line, then training and testing on the same
data will result in at least some error.
o The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
o It maps any real value into another value within a range of 0 and 1.
o The value of the logistic regression must be between 0 and 1, which cannot go beyond
this limit, so it forms a curve like the "S" form. The S-form curve is called the Sigmoid
function or the logistic function.
o In logistic regression, we use the concept of a threshold value, which decides whether a
predicted probability is mapped to 0 or 1. Values above the threshold are mapped to 1, and
values below the threshold are mapped to 0.
o In logistic regression, y can lie only between 0 and 1, so we divide the above
equation by (1-y). The resulting ratio y / (1 - y), the odds, is 0 for y = 0 and tends to
infinity as y approaches 1.
o But we need a range from -infinity to +infinity, so we take the logarithm of the ratio,
and it becomes log[y / (1 - y)].
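The relationship between the sigmoid function, the threshold, and the log of the odds can be
sketched as follows. The threshold of 0.5 and the input values are assumptions made for the
example:

```python
import numpy as np


def sigmoid(z):
    """Map any real value into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def logit(p):
    """Log of the odds p / (1 - p); the inverse of the sigmoid."""
    return np.log(p / (1.0 - p))


z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probabilities = sigmoid(z)

# Apply a threshold of 0.5 to turn probabilities into class labels
predictions = (probabilities >= 0.5).astype(int)

print(probabilities)
print(predictions)
print(logit(probabilities))  # recovers the original values of z
```

Taking the logit of the sigmoid output returns the original unbounded values, which is why
the log of the odds can be modelled with a linear equation.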
o Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dogs", or "sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as "low", "Medium", or "High".
The fundamental premise of kernel methods is to map the input data into a high-
dimensional feature space, which makes it simpler to distinguish between classes or
generate predictions. Kernel methods employ a kernel function to implicitly map the data
into the feature space, as opposed to explicitly computing the coordinates in that space.
The kernel function in SVMs is essential in determining the decision boundary that divides
the various classes. In order to calculate the degree of similarity between any two points in
the feature space, the kernel function computes their dot product.
The most commonly used kernel function in SVMs is the Gaussian or radial basis function
(RBF) kernel. The RBF kernel maps the input data into an infinite-dimensional feature
space using a Gaussian function. This kernel function is popular because it can capture
complex nonlinear relationships in the data.
Other types of kernel functions that can be used in SVMs include the polynomial kernel, the
sigmoid kernel, and the Laplacian kernel. The choice of kernel function depends on the
specific problem and the characteristics of the data.
Linear Kernel
A linear kernel is a type of kernel function used in machine learning, including in SVMs
(Support Vector Machines). It is the simplest and most commonly used kernel function, and
it defines the dot product between the input vectors in the original feature space.
K(x, y) = x · y
Where x and y are the input feature vectors. The dot product of the input vectors is a
measure of their similarity or distance in the original feature space.
When using a linear kernel in an SVM, the decision boundary is a linear hyperplane that
separates the different classes in the feature space. This linear boundary can be useful
when the data is already separable by a linear decision boundary or when dealing with high-
dimensional data, where the use of more complex kernel functions may lead to overfitting.
Polynomial Kernel
A polynomial kernel is a kernel function used in machine learning, such as in SVMs
(Support Vector Machines). It is a nonlinear kernel function that employs
polynomial functions to map the input data into a higher-dimensional feature space.
K(x, y) = (x · y + c)^d
Where x and y are the input feature vectors, c is a constant term, and d is the degree of the
polynomial. The constant term is added to the dot product of the input vectors, and the
result is raised to the degree of the polynomial.
The decision boundary of an SVM with a polynomial kernel might capture more intricate
correlations between the input characteristics because it is a nonlinear hyperplane.
The degree of nonlinearity in the decision boundary is determined by the degree of the
polynomial.
The polynomial kernel has the benefit of being able to detect both linear and nonlinear
correlations in the data. It can be difficult to select the proper degree of the polynomial,
though, as a larger degree can result in overfitting while a lower degree may not adequately
represent the underlying relationships in the data.
In general, the polynomial kernel is an effective tool for converting the input data into a
higher-dimensional feature space in order to capture nonlinear correlations between the
input characteristics.
When using a Gaussian kernel in an SVM, the decision boundary is a nonlinear hyper plane
that can capture complex nonlinear relationships between the input features. The width of
the Gaussian function, controlled by the gamma parameter, determines the degree of
nonlinearity in the decision boundary.
One advantage of the Gaussian kernel is its ability to capture complex relationships in the
data without the need for explicit feature engineering. However, the choice of the gamma
parameter can be challenging, as a smaller value may result in under fitting, while a larger
value may result in over fitting.
Laplace Kernel
The Laplacian kernel, also known as the Laplace kernel or the exponential kernel, is a type
of kernel function used in machine learning, including in SVMs (Support Vector Machines).
It is a non-parametric kernel that can be used to measure the similarity or distance between
two input feature vectors.
When using a Laplacian kernel in an SVM, the decision boundary is a nonlinear hyperplane
that can capture complex relationships between the input features. The width of the
Laplacian function, controlled by the gamma parameter, determines the degree of
nonlinearity in the decision boundary.
One advantage of the Laplacian kernel is its robustness to outliers, as it places less weight
on large distances between the input vectors than the Gaussian kernel. However, like the
Gaussian kernel, choosing the correct value of the gamma parameter can be challenging.
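The four kernels discussed above can be written in a few lines of NumPy. This is a sketch
under common conventions rather than the exact definitions of any particular library: the
Gaussian kernel uses the squared Euclidean distance, the Laplacian kernel is assumed to use
the Manhattan (L1) distance, and the default values of c, d, and gamma are arbitrary:

```python
import numpy as np


def linear_kernel(x, y):
    return float(np.dot(x, y))


def polynomial_kernel(x, y, c=1.0, d=2):
    return float((np.dot(x, y) + c) ** d)


def gaussian_kernel(x, y, gamma=0.5):
    # exp(-gamma * squared Euclidean distance)
    return float(np.exp(-gamma * np.sum((x - y) ** 2)))


def laplacian_kernel(x, y, gamma=0.5):
    # exp(-gamma * Manhattan distance); an assumed, commonly used form
    return float(np.exp(-gamma * np.sum(np.abs(x - y))))


x = np.array([1.0, 2.0])
y = np.array([2.0, 0.0])

print(linear_kernel(x, y))      # dot product: 1*2 + 2*0
print(polynomial_kernel(x, y))  # (dot product + 1) squared
print(gaussian_kernel(x, y))
print(laplacian_kernel(x, y))
```

For both the Gaussian and the Laplacian kernel, a point compared with itself gives the
maximum similarity of 1, and the value decays towards 0 as the points move apart.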
The formula:
\[ MAE=\frac{\sum_{i=1}^{n}\left|y_{i}-\hat{y}_{i}\right|}{n} \]
3. Mean Bias Error
It is computed like Mean Absolute Error (MAE) but without the absolute
value, so positive and negative errors can cancel out. This makes it less
accurate as a measure of overall error. However, it can help in
determining whether the model has a positive bias or negative bias. By
analyzing the result, you can assess whether the model consistently
overestimates or underestimates the actual values. This insight allows
for further refinement of the machine learning model to improve
prediction accuracy.
The formula:
\[ MBE=\frac{\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)}{n} \]
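A short sketch comparing the two measures on made-up values shows how the sign is lost in
MAE but kept in MBE:

```python
import numpy as np

# Hypothetical true values and predictions
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

errors = y_true - y_pred          # [1, 0, -2, -1]

mae = np.abs(errors).mean()       # average size of the errors
mbe = errors.mean()               # average signed error

print(mae)
print(mbe)
```

With the error defined as the true value minus the prediction, a negative MBE means that
the model overestimates on average, even though one individual prediction here is too low.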
Classification Loss Functions in Machine Learning
1. Cross-Entropy Loss
Cross-Entropy Loss, also known as Negative Log Likelihood, is a
commonly used loss function in machine learning for classification
tasks. This loss function measures how well the predicted probabilities
match the actual labels.
The cross-entropy loss increases as the predicted probability diverges from
the true label. In simpler terms, the farther the model’s prediction is from the
actual class, the higher the loss. This makes cross-entropy loss an
essential tool for improving the accuracy of classification models by
minimizing the difference between the predicted and actual labels.
A loss function example using cross-entropy would involve comparing the
predicted probabilities for each class against the actual class label, adjusting
the model to reduce this error during training.
The formula:
\[ \text{CrossEntropyLoss}=-\left(y_{i} \log \left(\hat{y}_{i}\right)+\left(1-y_{i}\right) \log \left(1-\hat{y}_{i}\right)\right) \]
2. Hinge Loss
Hinge Loss, also known as Multi-class SVM Loss, is a type of loss
function used for maximum-margin classification tasks, most commonly
applied in support vector machines (SVMs). This loss function in
machine learning is particularly effective in ensuring that the decision
boundary is as far away as possible from any data points. Hinge Loss is
a convex function, making it suitable for optimization using a convex
optimizer.
This type of loss function is widely used in classification tasks as it
encourages models to achieve a larger margin between different classes,
leading to better generalization. A common loss function example involving
Hinge Loss can be seen in SVM models.
The formula:
\[ \text{SVMLoss}=\sum_{j \neq y_{i}} \max \left(0, s_{j}-s_{y_{i}}+1\right) \]
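Both classification losses can be evaluated directly from their formulas. The function names
and the example scores below are introduced only for this sketch:

```python
import numpy as np


def binary_cross_entropy(y_true, y_prob):
    """Cross-entropy for one example with label y_true in {0, 1}."""
    return -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))


def multiclass_hinge(scores, correct_class):
    """Sum of max(0, s_j - s_correct + 1) over all wrong classes j."""
    margins = np.maximum(0.0, scores - scores[correct_class] + 1.0)
    margins[correct_class] = 0.0   # the correct class is excluded from the sum
    return margins.sum()


# Cross-entropy: the true label is 1 in both cases
loss_good = binary_cross_entropy(1.0, 0.9)   # confident and correct
loss_bad = binary_cross_entropy(1.0, 0.1)    # confident and wrong

# Hinge loss: three class scores, the correct class is index 0
scores = np.array([3.2, 5.1, -1.7])
hinge = multiclass_hinge(scores, correct_class=0)

print(loss_good, loss_bad)
print(hinge)
```

The cross-entropy grows sharply when a confident prediction is wrong, while the hinge loss
is zero for every wrong class whose score is at least 1 below the score of the correct
class.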
Naive Bayes
Naïve Bayes can also be an extremely good text classifier as it performs well,
such as in the spam ham dataset.
Advantages
How it works?
The K-nearest neighbor algorithm takes a majority vote among the K most
similar instances, and it uses a distance metric between two data points to
decide how similar they are. The most popular choice is Euclidean distance,
which is written as:
\[ d(x, y)=\sqrt{\sum_{i=1}^{n}\left(x_{i}-y_{i}\right)^{2}} \]
K in KNN is a hyperparameter that we can choose to get the best possible fit for
the dataset. Suppose we choose the smallest value for K, i.e., K=1. In that case,
the model will show low bias but high variance, because it will overfit the
training data.
A larger value of K, e.g., K=10, will smooth the decision boundary, meaning
lower variance but higher bias. So, we always look for a balance between bias
and variance, known as the bias-variance trade-off.
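A minimal from-scratch sketch of the algorithm, assuming Euclidean distance and a simple
majority vote. The function name knn_predict and the toy points are hypothetical:

```python
from collections import Counter

import numpy as np


def knn_predict(x_train, y_train, query, k):
    """Predict the label of query by majority vote among its k nearest points."""
    distances = np.sqrt(((x_train - query) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Two small hypothetical groups of points
x_train = np.array([
    [1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
    [6.0, 6.0], [7.0, 7.0], [6.5, 8.0],
])
y_train = np.array(["A", "A", "A", "B", "B", "B"])

prediction = knn_predict(x_train, y_train, np.array([2.0, 2.0]), k=3)
print(prediction)
```

Note that no model is built in advance: every prediction computes the distance to all
training points, which is what makes the method expensive on large datasets.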
Disadvantages-
• A good K value is difficult to find, as it must work well with test data also,
not only with the training data
• It is a lazy algorithm, as it does not build a model during training
• It is computationally expensive, because it measures the distance to each
data point
Decision Trees
Let us look at the figure below, Fig.3, where we have used adult census
income dataset with two independent variables and one dependent variable.
Our target or dependent variable is income, which has binary classes i.e,
<=50K or >50K.
There are several concepts related to support vector regression (SVR) that
you may want to understand in order to use it effectively. Here are a few of
the most important ones:
• Support vector machines (SVMs): SVR is a type of support vector
machine (SVM), a supervised learning algorithm that can be used for
classification or regression tasks. SVMs try to find the hyperplane in a
high-dimensional space that maximally separates different classes or
output values.
• Kernels: SVR can use different types of kernels, which are functions that
determine the similarity between input vectors. A linear kernel is a simple
dot product between two input vectors, while a non-linear kernel is a more
complex function that can capture more intricate patterns in the data. The
choice of kernel depends on the data’s characteristics and the task’s
complexity.
• Hyperparameters: SVR has several hyperparameters that you can adjust
to control the behavior of the model. For example, the ‘C’ parameter
controls the trade-off between keeping the model simple and penalizing
training errors that fall outside the epsilon-insensitive margin.
A larger value of ‘C’ means that the model will try harder to fit the
training points, while a smaller value of ‘C’ means that the model
will be more lenient in allowing larger errors.
• Model evaluation: Like any machine learning model, it’s important to
evaluate the performance of an SVR model. One common way to do this
is to split the data into a training set and a test set, and use the training
set to fit the model and the test set to evaluate it. You can then use
metrics like mean squared error (MSE) or mean absolute error (MAE) to
measure the error between the predicted and true output values.
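The workflow described in these points can be sketched with scikit-learn. The synthetic
sine-shaped data and the values chosen for kernel, C, and epsilon are assumptions for
illustration, not recommended settings:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic regression data: a noisy sine curve
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(80, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=80)

# Hold out 25% of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# RBF kernel; C and epsilon are example values
model = SVR(kernel="rbf", C=10.0, epsilon=0.1)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

print(mse, mae)
```

Trying a few values of C and epsilon and comparing the test errors is a simple way to see
the trade-off described above.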
One-vs-Rest and One-vs-One for Multi-Class
Classification
Binary Classifiers for Multi-Class Classification
Classification is a predictive modeling problem that involves assigning a class label to an
example.
Binary classification refers to those tasks where examples are assigned exactly one of two
classes. Multi-class classification refers to those tasks where examples are assigned exactly one
of more than two classes.
Some algorithms are designed for binary classification problems, for example:
• Logistic Regression
• Perceptron
• Support Vector Machines
As such, they cannot be used for multi-class classification tasks, at least not directly.
Instead, heuristic methods can be used to split a multi-class classification problem into
multiple binary classification datasets and train a binary classification model on each.
Two common heuristics are:
• One-vs-Rest (OvR)
• One-vs-One (OvO)
Let’s take a closer look at each.
One-vs-Rest (OvR) involves splitting the multi-class dataset into multiple binary
classification problems. A binary classifier is then trained on each binary classification
problem and predictions are made using the model that is the most confident.
For example, consider a multi-class classification problem with examples for each of the
classes ‘red,’ ‘blue,’ and ‘green’. This could be divided into three binary classification
datasets as follows:
• Binary Classification Problem 1: red vs [blue, green]
• Binary Classification Problem 2: blue vs [red, green]
• Binary Classification Problem 3: green vs [red, blue]
A possible downside of this approach is that it requires one model to be created for each
class. For example, three classes requires three models. This could be an issue for large
datasets (e.g. millions of rows), slow models (e.g. neural networks), or very large numbers
of classes (e.g. hundreds of classes).
This approach requires that each model predicts a class membership probability or a
probability-like score. The argmax of these scores (class index with the largest score) is
then used to predict a class.
This approach is commonly used for algorithms that naturally predict numerical class
membership probability or score, such as:
• Logistic Regression
• Perceptron
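As such, the scikit-learn library applies the OvR strategy when an algorithm such as the
Perceptron is used for multi-class classification. Logistic regression also used OvR by
default in older scikit-learn releases; recent releases default to a multinomial (softmax)
formulation instead.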
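A minimal sketch of the One-vs-Rest bookkeeping described above, in plain Python. It only shows how the binary targets are built and how the argmax over scores picks a class; the per-class scores are made-up numbers standing in for the outputs of trained binary classifiers, and both function names are hypothetical.

```python
def one_vs_rest_targets(labels):
    """Build one binary target list per class: 1 for that class, 0 for the rest."""
    classes = sorted(set(labels))
    return {c: [1 if y == c else 0 for y in labels] for c in classes}

def predict_one_vs_rest(scores_by_class):
    """Return the class whose binary model gave the largest score (argmax)."""
    return max(scores_by_class, key=scores_by_class.get)

labels = ["red", "blue", "green", "red"]
targets = one_vs_rest_targets(labels)
print(targets["red"])    # red vs [blue, green]
print(targets["blue"])   # blue vs [red, green]

# Hypothetical scores from the three binary models for one new example.
scores = {"red": 0.2, "blue": 0.7, "green": 0.1}
print(predict_one_vs_rest(scores))
```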
The task of grouping data points based on their similarity with each other is
called Clustering or Cluster Analysis. This method is defined under the
branch of Unsupervised Learning, which aims at gaining insights from
unlabelled data points, that is, unlike supervised learning we don’t have a
target variable.
Now it is not necessary that the clusters formed must be circular in shape.
The shape of clusters can be arbitrary. There are many algorithms that
work well at detecting arbitrarily shaped clusters.
For example, in the graph given below we can see that the clusters formed
are not circular in shape.
Types of Clustering
Broadly speaking, there are 2 types of clustering that can be performed to
group similar data points:
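• Hard Clustering: In this type of clustering, each data point either belongs
to a cluster completely or not at all. For example, let’s say there are 4 data
points and we have to cluster them into 2 clusters. So each data point will
belong either to cluster 1 or to cluster 2.
Data Points    Clusters
A              C1
B              C2
C              C2
D              C1
• Soft Clustering: In this type of clustering, instead of assigning each data
point to exactly one cluster, a probability of belonging to each cluster is
computed. For the same 4 data points and 2 clusters:
Data Points    Probability of C1    Probability of C2
A              0.91                 0.09
B              0.3                  0.7
C              0.17                 0.83
D              1                    0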
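As a small illustration of how the two views relate, the sketch below turns per-cluster membership probabilities into a hard assignment by picking the most probable cluster for each point. The probabilities are the ones listed above; the function name harden is a hypothetical helper, not part of any library.

```python
# Membership probabilities for clusters C1 and C2, taken from the table above.
soft_assignments = {
    "A": [0.91, 0.09],
    "B": [0.3, 0.7],
    "C": [0.17, 0.83],
    "D": [1.0, 0.0],
}

def harden(memberships):
    """Assign each point to the cluster with the highest probability."""
    hard = {}
    for point, probabilities in memberships.items():
        best = max(range(len(probabilities)), key=lambda k: probabilities[k])
        hard[point] = f"C{best + 1}"
    return hard

print(harden(soft_assignments))
```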
Uses of Clustering
Now before we begin with types of clustering algorithms, we will go through
the use cases of Clustering algorithms. Clustering algorithms are majorly
used for:
• Market Segmentation – Businesses use clustering to group their
customers and use targeted advertisements to attract more audience.
• Market Basket Analysis – Shop owners analyze their sales and figure out
which items are frequently bought together by customers. For example,
according to a study in the USA, diapers and beer were often bought
together by fathers.
• Social Network Analysis – Social media sites use your data to understand
your browsing behaviour and provide you with targeted friend
recommendations or content recommendations.
• Medical Imaging – Doctors use Clustering to find out diseased areas in
diagnostic images like X-rays.
• Anomaly Detection – Clustering can be used to find outliers in a real-time
data stream or to flag potentially fraudulent transactions.
Types of Clustering Algorithms
1. Centroid-based Clustering (Partitioning methods)
2. Density-based Clustering
3. Connectivity-based Clustering (Hierarchical clustering)
4. Distribution-based Clustering (Model-based methods)
Applications of Clustering in different fields:
1. Marketing: It can be used to characterize & discover customer segments
for marketing purposes.
2. Biology: It can be used for classification among different species of
plants and animals.
3. Libraries: It is used in clustering different books on the basis of topics
and information.
4. Insurance: It is used to understand customers and their policies and to
identify fraud.
K-Means Clustering Algorithm
K-Means Clustering is an unsupervised learning algorithm that is used to solve the
clustering problems in machine learning or data science. In this topic, we will learn what is
K-means clustering algorithm, how the algorithm works, along with the Python
implementation of k-means clustering.
The k-means algorithm mainly performs two tasks:
o Determines the best value for K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. The data points that are near a
particular k-center form a cluster.
Hence each cluster has data points with some commonalities, and it is away from other
clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
Step-1: Choose the number of clusters K.
Step-2: Select K random points as the initial centroids. (They need not be points from the
input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K
clusters.
Step-4: Calculate the mean of the points in each cluster and place the new centroid of that
cluster there.
Step-5: Repeat the third step, which means reassign each data point to the new closest
centroid of each cluster, and continue until the assignments stop changing.
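The steps above can be sketched as follows for 2-D points in plain Python. This is a minimal sketch under stated assumptions: the initial centroids are passed in explicitly (instead of being chosen at random) so that the run is reproducible, each centroid is moved to the mean of its assigned points, and the sample points are made up.

```python
def kmeans(points, initial_centroids, max_iter=100):
    """Basic k-means on 2-D points; K is the number of initial centroids."""
    centroids = list(initial_centroids)
    assignments = None
    for _ in range(max_iter):
        # Assign each point to its closest centroid (squared Euclidean distance).
        new_assignments = [
            min(
                range(len(centroids)),
                key=lambda k: (p[0] - centroids[k][0]) ** 2
                + (p[1] - centroids[k][1]) ** 2,
            )
            for p in points
        ]
        # Stop once no point changes cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Move each centroid to the mean of the points assigned to it.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centroids, assignments

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
final_centroids, cluster_labels = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(final_centroids)
print(cluster_labels)
```

In practice the initial centroids are usually drawn at random and the algorithm is run several times, because the result depends on the starting points.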
For each data point, calculate the mean of all points within a certain radius
(i.e., the “kernel”) centered at the data point, and shift the point to that
mean. This is repeated until the points stop moving (convergence).
Identify the cluster centroids as the points that have not moved after
convergence.
Return the final cluster centroids and the assignments of data points to
clusters.
The first step when applying mean shift clustering algorithms is representing
your data in a mathematical manner. This means representing your data as
points, such as the set below.
There are many different types of kernels, but the most popular one is the
Gaussian kernel. Adding up all of the individual kernels generates a
probability surface (for example, a density function).
Depending on the kernel bandwidth parameter used, the resultant density
function will vary. Below is the KDE surface for our points above using a
Gaussian kernel with a kernel bandwidth of 2.
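A minimal 1-D sketch of the kernel density estimate and of the mean shift update, in plain Python. The data points, the bandwidth of 0.5, and the tolerance are illustrative assumptions (they are not the points or the bandwidth of 2 used in the figures), and the Gaussian-weighted mean is used in place of a hard radius.

```python
import math

def gaussian_kernel(distance, bandwidth):
    """Unnormalized Gaussian weight for a given distance."""
    return math.exp(-0.5 * (distance / bandwidth) ** 2)

def kde(x, data, bandwidth):
    """Kernel density estimate at x: the sum of one Gaussian per data point."""
    norm = len(data) * bandwidth * math.sqrt(2 * math.pi)
    return sum(gaussian_kernel(x - d, bandwidth) for d in data) / norm

def mean_shift_point(x, data, bandwidth, tol=1e-6, max_iter=500):
    """Repeatedly move x to the kernel-weighted mean of the data."""
    for _ in range(max_iter):
        weights = [gaussian_kernel(x - d, bandwidth) for d in data]
        new_x = sum(w * d for w, d in zip(weights, data)) / sum(weights)
        if abs(new_x - x) < tol:  # the point has stopped moving
            return new_x
        x = new_x
    return x

data = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9]
# Each point climbs to a peak of the density; points sharing a peak form a cluster.
modes = [round(mean_shift_point(x, data, bandwidth=0.5), 1) for x in data]
print(modes)
print(kde(1.0, data, 0.5) > kde(3.0, data, 0.5))  # density is higher near a cluster
```

A larger bandwidth smooths the density and tends to merge nearby peaks, so the number of clusters found depends on this parameter.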
Hierarchical clustering
Hierarchical clustering is a connectivity-based clustering model that groups
the data points together that are close to each other based on the measure
of similarity or distance. The assumption is that data points that are close to
each other are more similar or related than data points that are farther apart.
A dendrogram, a tree-like figure produced by hierarchical clustering, depicts
the hierarchical relationships between groups. Individual data points are
located at the bottom of the dendrogram, while the largest clusters, which
include all the data points, are located at the top. In order to generate
different numbers of clusters, the dendrogram can be sliced at various
heights.
Steps:
• Consider each letter (A to F) as a single cluster and calculate the distance
of each cluster from all the other clusters.
• In the second step, comparable clusters are merged together to form a
single cluster. Let’s say cluster (B) and cluster (C) are very similar to each
other, so we merge them in the second step, and likewise clusters (D)
and (E). We are left with the clusters [(A), (BC), (DE), (F)].
• We recalculate the proximity according to the algorithm and merge the
two nearest clusters, (DE) and (F), to form the new clusters [(A),
(BC), (DEF)].
• Repeating the same process, the clusters (DEF) and (BC) are comparable
and are merged together to form a new cluster. We’re now left with the
clusters [(A), (BCDEF)].
• At last, the two remaining clusters are merged together to form a single
cluster [(ABCDEF)].
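The merging procedure above can be sketched in plain Python. The 1-D positions assigned to the points A to F are made-up values chosen so that the merges happen in the order described, the distance between clusters is taken as the smallest point-to-point distance (an assumption for this sketch), and one pair is merged per iteration rather than two pairs in the same step.

```python
def single_linkage(cluster_a, cluster_b, positions):
    """Smallest distance between a point of cluster_a and a point of cluster_b."""
    return min(abs(positions[a] - positions[b]) for a in cluster_a for b in cluster_b)

def agglomerate(positions):
    """Merge the two closest clusters until one remains; return the merge order."""
    clusters = [(name,) for name in sorted(positions)]
    history = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]], positions),
        )
        merged = tuple(sorted(clusters[i] + clusters[j]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        history.append("".join(merged))
    return history

# Hypothetical 1-D coordinates for the six points.
positions = {"A": 0.0, "B": 5.0, "C": 5.5, "D": 9.0, "E": 9.6, "F": 11.0}
merge_order = agglomerate(positions)
print(merge_order)
```

The list of merges is exactly the information a dendrogram draws: each entry is one junction, from the bottom of the tree to the top.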
Hierarchical Divisive clustering
It is also known as a top-down approach. This algorithm also does not
require the number of clusters to be specified in advance. Top-down
clustering requires a method for splitting a cluster. It starts with a single
cluster that contains the whole data and proceeds by splitting clusters
recursively until every data point is in its own singleton cluster.
Algorithm :
given a dataset (d1, d2, d3, ..., dN) of size N
at the top we have all data in one cluster
the cluster is split using a flat clustering method, e.g. K-Means
repeat
    choose the best cluster among all the clusters to split
    split that cluster by the flat clustering algorithm
until each data point is in its own singleton cluster
Computing Distance Matrix
While merging two clusters we check the distance between every pair of
clusters and merge the pair with the least distance/most similarity. But the
question is how that distance is determined. There are different ways of
defining inter-cluster distance/similarity. Some of them are:
1. Min Distance: Find the minimum distance between any two points, one
from each cluster.
2. Max Distance: Find the maximum distance between any two points, one
from each cluster.
3. Group Average: Find the average distance over all pairs of points, one
from each cluster.
4. Ward’s Method: The similarity of two clusters is based on the increase in
squared error when two clusters are merged.
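The four definitions can be compared on a small example. The sketch below uses 1-D points and plain Python; the two clusters are made-up values and the function names are illustrative.

```python
def pairwise_distances(cluster_a, cluster_b):
    """All distances between a point of cluster_a and a point of cluster_b."""
    return [abs(a - b) for a in cluster_a for b in cluster_b]

def min_distance(cluster_a, cluster_b):
    return min(pairwise_distances(cluster_a, cluster_b))

def max_distance(cluster_a, cluster_b):
    return max(pairwise_distances(cluster_a, cluster_b))

def group_average(cluster_a, cluster_b):
    distances = pairwise_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)

def sum_squared_error(cluster):
    """Sum of squared deviations of the points from the cluster mean."""
    mean = sum(cluster) / len(cluster)
    return sum((x - mean) ** 2 for x in cluster)

def ward_increase(cluster_a, cluster_b):
    """Increase in squared error caused by merging the two clusters."""
    merged = cluster_a + cluster_b
    return sum_squared_error(merged) - sum_squared_error(cluster_a) - sum_squared_error(cluster_b)

c1 = [1.0, 2.0]
c2 = [4.0, 7.0]
print(min_distance(c1, c2))    # |2 - 4|
print(max_distance(c1, c2))    # |1 - 7|
print(group_average(c1, c2))   # (3 + 6 + 2 + 5) / 4
print(ward_increase(c1, c2))   # 21.0 - 0.5 - 4.5
```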
Gaussian Mixture Model
Normal or Gaussian Distribution
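The probability density function of a univariate Gaussian distribution is
$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
where $\mu$ and $\sigma^2$ are respectively the mean and variance of the distribution.
For Multivariate (let us say d-variate) Gaussian Distribution, the probability
density function is given by
$f(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$
where $\mu$ is the d-dimensional mean vector and $\Sigma$ is the d x d covariance matrix.
Suppose there are K clusters (For the sake of simplicity here it is assumed
that the number of clusters is known and it is K). So $\mu_k$ and $\Sigma_k$ are also
estimated for each k. Had it been only one distribution, they would have
been estimated by the maximum-likelihood method. But since there are K
such clusters and the probability density is defined as a linear function of
densities of all these K distributions, i.e.
$p(x) = \sum_{k=1}^{K} \pi_k f(x \mid \mu_k, \Sigma_k)$
where $\pi_k$ is the mixing coefficient of the k-th distribution, with $\pi_k \ge 0$
and $\sum_{k=1}^{K} \pi_k = 1$, the parameters of all K components have to be
estimated jointly.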
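A minimal 1-D sketch of a Gaussian mixture density in plain Python: the mixture density is a weighted sum of Gaussian densities whose weights (mixing coefficients) sum to 1. The last function, which computes the posterior probability of each component for a point, is an addition for illustration and is not spelled out in the text above; it shows how a fitted mixture yields soft cluster assignments. All parameter values are made up.

```python
import math

def gaussian_pdf(x, mean, variance):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)

def mixture_pdf(x, weights, means, variances):
    """Mixture density: weighted sum of the K component densities."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances))

def responsibilities(x, weights, means, variances):
    """Posterior probability of each component for x (soft assignment)."""
    parts = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(parts)
    return [p / total for p in parts]

# Hypothetical two-component mixture.
weights = [0.5, 0.5]      # mixing coefficients, sum to 1
means = [-1.0, 1.0]
variances = [1.0, 1.0]

print(mixture_pdf(0.0, weights, means, variances))
print(responsibilities(0.0, weights, means, variances))  # equal by symmetry
print(responsibilities(2.0, weights, means, variances))  # favors the second component
```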
NEURAL NETWORKS
Multi-layer perceptron
Multilayer Perceptrons
MLPs have been widely used in various fields, including image recognition, natural language
processing, and speech recognition, among others. Their flexibility in architecture and ability to
approximate any function under certain conditions make them a fundamental building block in
deep learning and neural network research. Let's take a deeper dive into some of its key concepts.
Input layer
The input layer consists of nodes or neurons that receive the initial input data. Each neuron
represents a feature or dimension of the input data. The number of neurons in the input layer is
determined by the dimensionality of the input data.
Hidden layer
Between the input and output layers, there can be one or more layers of neurons. Each neuron in
a hidden layer receives inputs from all neurons in the previous layer (either the input layer or
another hidden layer) and produces an output that is passed to the next layer. The number of
hidden layers and the number of neurons in each hidden layer are hyperparameters that need to
be determined during the model design phase.
Output layer
This layer consists of neurons that produce the final output of the network. The number of
neurons in the output layer depends on the nature of the task. In binary classification, there may
be one neuron (for example, with a sigmoid activation that outputs the probability of belonging
to the positive class) or two neurons (for example, with softmax, one per class); in multi-class
classification tasks, there is typically one neuron per class.
Weights
Neurons in adjacent layers are fully connected to each other. Each connection has an associated
weight, which determines the strength of the connection. These weights are learned during the
training process.
Bias neurons
In addition to the input and hidden neurons, each layer (except the output layer) usually includes a
bias neuron that provides a constant input to the neurons in the next layer. Bias neurons have
their own weight associated with each connection, which is also learned during training.
The bias neuron effectively shifts the activation function of the neurons in the subsequent layer,
allowing the network to learn an offset or bias in the decision boundary. By adjusting the weights
connected to the bias neuron, the MLP can learn to control the threshold for activation and better
fit the training data.
Note: It is important to note that in the context of MLPs, bias can refer to two related but distinct
concepts: bias as a general term in machine learning and the bias neuron (defined above). In
general machine learning, bias refers to the error introduced by approximating a real-world
problem with a simplified model. Bias measures how well the model can capture the underlying
patterns in the data. A high bias indicates that the model is too simplistic and may underfit the
data, while a low bias suggests that the model is capturing the underlying patterns well.
Activation function
Typically, each neuron in the hidden layers and the output layer applies an activation function to
its weighted sum of inputs. Common activation functions include sigmoid, tanh, ReLU
(Rectified Linear Unit), and softmax. These functions introduce nonlinearity into the network,
allowing it to learn complex patterns in the data.
MLPs are trained using the backpropagation algorithm, which computes gradients of a loss
function with respect to the model's parameters and updates the parameters iteratively to
minimize the loss.
Workings of a Multilayer Perceptron: Layer by Layer
Input layer
• The input layer of an MLP receives input data, which could be features extracted from
the input samples in a dataset. Each neuron in the input layer represents one feature.
• Neurons in the input layer do not perform any computations; they simply pass the
input values to the neurons in the first hidden layer.
Hidden layers
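• Each neuron in a hidden layer computes a weighted sum of its inputs,
$z = \sum_{i=1}^{n} w_i x_i$, where n is the total number of input connections, $w_i$ is the
weight for the i-th input, and $x_i$ is the i-th input value.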
• The weighted sum is then passed through an activation function, denoted as f. The
activation function introduces nonlinearity into the network, allowing it to learn and
represent complex relationships in the data. The activation function determines the
output range of the neuron and its behavior in response to different input values. The
choice of activation function depends on the nature of the task and the desired
properties of the network.
Output layer
• The output layer of an MLP produces the final predictions or outputs of the network.
The number of neurons in the output layer depends on the task being performed (e.g.,
binary classification, multi-class classification, regression).
• Each neuron in the output layer receives input from the neurons in the last hidden
layer and applies an activation function. This activation function is usually different
from those used in the hidden layers and produces the final output value or prediction.
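The layer-by-layer computation described above can be sketched in plain Python. The weights, biases, and inputs are made-up values, the bias is written as an explicit term added to the weighted sum, and ReLU for the hidden layer with a sigmoid output is just one possible choice of activation functions.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

def dense_layer(inputs, weight_rows, biases, activation):
    """One fully connected layer: one row of weights and one bias per neuron."""
    return [neuron_output(inputs, w, b, activation) for w, b in zip(weight_rows, biases)]

inputs = [1.0, 2.0]                                  # two input features
hidden = dense_layer(inputs, [[0.5, -1.0], [1.0, 1.0]], [0.0, -1.0], relu)
output = neuron_output(hidden, [1.0, 0.5], 0.0, sigmoid)

print(hidden)   # first neuron: relu(-1.5), second neuron: relu(2.0)
print(output)   # sigmoid(1.0), a value between 0 and 1
```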
During the training process, the network learns to adjust the weights associated with each
neuron's inputs to minimize the discrepancy between the predicted outputs and the true target
values in the training data. By adjusting the weights and biases, the network learns to
approximate complex patterns and relationships in the data, enabling it to make accurate
predictions on new, unseen samples.
You have seen the working of the multilayer perceptron layers and learned about stochastic
gradient descent; to put it all together, there is one last topic to dive into: backpropagation.
Backpropagation
1. Forward Pass: During the forward pass, input data is fed into the neural network, and
the network's output is computed layer by layer. Each neuron computes a weighted
sum of its inputs, applies an activation function to the result, and passes the output to
the neurons in the next layer.
2. Loss Computation: After the forward pass, the network's output is compared to the
true target values, and a loss function is computed to measure the discrepancy
between the predicted output and the actual output.
3. Backward Pass (Gradient Calculation): In the backward pass, the gradients of the
loss function with respect to the network's parameters (weights and biases) are
computed using the chain rule of calculus. The gradients represent the rate of change
of the loss function with respect to each parameter and provide information about how
to adjust the parameters to decrease the loss.
4. Parameter update: Once the gradients have been computed, the network's
parameters are updated in the opposite direction of the gradients in order to minimize
the loss function. This update is typically performed using an optimization algorithm
such as stochastic gradient descent (SGD).
5. Iterative Process: Steps 1-4 are repeated iteratively for a fixed number of epochs or
until convergence criteria are met. During each iteration, the network's parameters are
adjusted based on the gradients computed in the backward pass, gradually reducing
the loss and improving the model's performance.
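The five steps can be traced on a tiny network with one input, one hidden neuron, and one output neuron. This is a hand-written sketch in plain Python, not a general implementation: the tanh hidden activation, the squared-error loss, the learning rate, the starting parameters, and the training pairs are all assumptions made for the example.

```python
import math

def forward(params, x):
    """Forward pass: hidden activation h and network output y."""
    w1, b1, w2, b2 = params
    h = math.tanh(w1 * x + b1)
    y = w2 * h + b2
    return h, y

def backward(params, x, target):
    """Gradients of L = 0.5 * (y - target)^2 via the chain rule."""
    w1, b1, w2, b2 = params
    h, y = forward(params, x)
    d_y = y - target              # dL/dy
    d_w2 = d_y * h                # dL/dw2
    d_b2 = d_y                    # dL/db2
    d_h = d_y * w2                # dL/dh
    d_z = d_h * (1.0 - h * h)     # tanh'(z) = 1 - tanh(z)^2
    d_w1 = d_z * x                # dL/dw1
    d_b1 = d_z                    # dL/db1
    return [d_w1, d_b1, d_w2, d_b2]

def train(params, data, learning_rate=0.1, epochs=200):
    """Repeat forward pass, backward pass, and parameter update."""
    params = list(params)
    for _ in range(epochs):
        for x, target in data:
            grads = backward(params, x, target)
            params = [p - learning_rate * g for p, g in zip(params, grads)]
    return params

def total_loss(params, data):
    return sum(0.5 * (forward(params, x)[1] - t) ** 2 for x, t in data)

data = [(-1.0, -0.5), (0.0, 0.0), (1.0, 0.5)]   # made-up (input, target) pairs
initial = [0.5, 0.0, 0.5, 0.0]                   # w1, b1, w2, b2
trained = train(initial, data)
print(total_loss(initial, data), total_loss(trained, data))
```

The backward function mirrors the forward pass in reverse order, which is where the name backpropagation comes from: each gradient reuses the one computed for the layer after it.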
Activation function
What is an Activation Function?
An activation function is a mathematical function applied to the output of a
neuron. It introduces non-linearity into the model, allowing the network to
learn and represent complex patterns in the data. Without this non-linearity
feature, a neural network would behave like a linear regression model, no
matter how many layers it has.
For each neuron, the weighted sum of its inputs plus a bias term is calculated
first; the activation function then determines how strongly the neuron is
activated for that value. This helps the model make complex decisions and
predictions by introducing non-linearities to the output of each neuron.
Why is Non-Linearity Important in Neural Networks?
Neural networks consist of neurons that operate using weights, biases,
and activation functions.
In the learning process, these weights and biases are updated based on the
error produced at the output—a process known as backpropagation.
Activation functions enable backpropagation by providing gradients that are
essential for updating the weights and biases.
Without non-linearity, even deep networks would be limited to solving only
simple, linearly separable problems. Activation functions empower neural
networks to model highly complex data distributions and solve advanced
deep learning tasks. Adding non-linear activation functions introduces
flexibility and enables the network to learn more complex and abstract
patterns from data.
Mathematical Proof of Need of Non-Linearity in Neural Networks
To illustrate the need for non-linearity in neural networks with a specific
example, let’s consider a network with two input nodes ($i_1$ and $i_2$), a
single hidden layer containing one neuron ($h_1$), and an output neuron
(out). We will use $w_1, w_2$ as weights connecting the inputs to the hidden
neuron, and $w_5$ as the weight connecting the hidden neuron to the output.
We’ll also include biases ($b_1$ for the hidden neuron and $b_2$ for the output
neuron) to complete the model.
Network Structure
1. Input Layer: Two inputs $i_1$ and $i_2$.
2. Hidden Layer: One neuron $h_1$.
3. Output Layer: One output neuron.
Mathematical Model Without Non-linearity
Hidden Layer Calculation:
The input to the hidden neuron $h_1$ is calculated as a weighted sum of
the inputs plus a bias:
$z_{h_1} = w_1 i_1 + w_2 i_2 + b_1$
Output Layer Calculation:
The output neuron is then a weighted sum of the hidden neuron’s output plus
a bias:
$\text{output} = w_5 h_1 + b_2$
If $h_1$ were directly the output of $z_{h_1}$ (no activation function applied,
i.e., $h_1 = z_{h_1}$), then substituting $h_1$ in the output equation yields:
$\text{output} = w_5 (w_1 i_1 + w_2 i_2 + b_1) + b_2$
$\text{output} = w_5 w_1 i_1 + w_5 w_2 i_2 + w_5 b_1 + b_2$
This shows that the output neuron is still a linear combination of the
inputs $i_1$ and $i_2$.
Thus, the entire network, despite having multiple layers and weights,
effectively performs a linear transformation, equivalent to a single-layer
perceptron.
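The algebra above can be checked numerically. The sketch below, in plain Python, evaluates the two-layer network without an activation function and the collapsed single-layer form; the weight, bias, and input values are made up for the check.

```python
def two_layer_linear(i1, i2, w1, w2, w5, b1, b2):
    """Hidden neuron followed by output neuron, with no activation function."""
    h1 = w1 * i1 + w2 * i2 + b1
    return w5 * h1 + b2

def collapsed_single_layer(i1, i2, w1, w2, w5, b1, b2):
    """Equivalent single layer: combined weights and one combined bias."""
    return (w5 * w1) * i1 + (w5 * w2) * i2 + (w5 * b1 + b2)

weights = dict(w1=2.0, w2=-1.0, w5=3.0, b1=0.5, b2=1.0)
for i1, i2 in [(1.0, 4.0), (0.0, 0.0), (-2.0, 3.0)]:
    print(two_layer_linear(i1, i2, **weights), collapsed_single_layer(i1, i2, **weights))
```

Both columns agree for every input, so the extra layer adds parameters but no extra modeling power.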
Introducing Non-Linearity in Neural Network
To introduce non-linearity, let’s use a non-linear activation function $\sigma$ for the
hidden neuron. A common choice is the ReLU function, defined
as $\sigma(x) = \max(0, x)$.
Updated Hidden Layer Calculation:
$h_1 = \sigma(z_{h_1}) = \sigma(w_1 i_1 + w_2 i_2 + b_1)$
Output Layer Calculation with Non-linearity:
$\text{output} = w_5 \sigma(w_1 i_1 + w_2 i_2 + b_1) + b_2$
Effect of Non-linearity
The inclusion of the ReLU activation function $\sigma$ allows $h_1$ to introduce
a non-linear decision boundary in the input space. This non-linearity enables
the network to learn more complex patterns that are not possible with a
purely linear model, such as:
• Modeling functions that are not linearly separable.
• Increasing the capacity of the network to form multiple decision
boundaries based on the combination of weights and biases.
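A quick numerical check of this effect, as a sketch in plain Python. With zero biases, a purely linear network f must satisfy f(a) + f(b) = f(a + b) for any two inputs; the ReLU network below violates that for the chosen inputs. The weights and inputs are illustrative values.

```python
def relu(x):
    return max(0.0, x)

def relu_network(i1, i2, w1=1.0, w2=1.0, w5=1.0, b1=0.0, b2=0.0):
    """Same two-input network, but with ReLU applied in the hidden neuron."""
    return w5 * relu(w1 * i1 + w2 * i2 + b1) + b2

a = (1.0, 0.0)
b = (-2.0, 0.0)
a_plus_b = (a[0] + b[0], a[1] + b[1])

print(relu_network(*a) + relu_network(*b))   # sum of the two outputs
print(relu_network(*a_plus_b))               # output for the summed input
```

The two printed values differ, so no single set of linear weights can reproduce this network.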
Types of Activation Functions in Deep Learning
1. Linear Activation Function
The linear activation function is a straight line defined by y = x. No matter
how many layers the neural network contains, if they all use linear activation
functions, the output is a linear combination of the input.
• The range of the output spans from −∞ to +∞.
• The linear activation function is typically used in only one place, the
output layer.
• Using linear activation across all layers limits the network’s ability to
learn complex patterns.
Linear activation functions are useful for specific tasks but must be combined
with non-linear functions to enhance the neural network’s learning and
predictive capabilities.
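Mean Squared Error (MSE) is one of the most popular loss functions. It finds
the average of the squared differences between the target and the
predicted outputs:
$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
Mean Absolute Error (MAE) finds the average of the absolute differences
between the target and the predicted outputs:
$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$
Huber loss combines the two. If the absolute difference between the actual
and predicted value is less than or equal to a threshold value, 𝛿, then a
squared-error (MSE-style) term is applied. Otherwise, if the error is
sufficiently large, an absolute-error (MAE-style) term is applied.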
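This is the TensorFlow implementation. It uses a wrapper function so that
the threshold can be passed in as a variable, which we will discuss in a
little bit.
import tensorflow as tf

def huber_loss_with_threshold(t=1.0):
    # t is the threshold 𝛿; 1.0 is only a placeholder default
    def huber_loss(y_true, y_pred):
        error = y_true - y_pred
        within_threshold = tf.abs(error) <= t
        # quadratic term; the 0.5 factor makes both branches equal at |error| = t
        small_error = 0.5 * tf.square(error)
        # linear term for errors larger than the threshold
        large_error = t * (tf.abs(error) - (0.5 * t))
        # select element-wise; a Python `if` cannot branch on a tensor
        return tf.where(within_threshold, small_error, large_error)
    return huber_loss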
Neural networks or artificial neural networks are fundamental tools in machine learning,
powering many state-of-the-art algorithms and applications across various domains,
including computer vision, natural language processing, robotics, and more.
The network is typically organized into layers, starting with the input layer, where data is
introduced, followed by hidden layers, where computations are performed, and finally the output
layer, where predictions or decisions are made.
Neurons in adjacent layers are connected by weighted connections, which transmit signals from
one layer to the next. The strength of these connections, represented by weights, determines how
much influence one neuron's output has on another neuron's input. During the training process,
the network learns to adjust its weights based on examples provided in a training dataset.
Additionally, each neuron typically has an associated bias, which allows the neuron to adjust its
output threshold.
The goal of training a neural network is to minimize a loss function, which measures the
discrepancy between the network's predictions and the true targets, by adjusting the weights
and biases. The adjustments are guided by an optimization algorithm, such as gradient descent.
The ANN depicted on the right of the image is a simple neural network called ‘perceptron’. It
consists of a single layer, which is the input layer, with multiple neurons with their own weights;
there are no hidden layers. The perceptron algorithm learns the weights for the input signals in
order to draw a linear decision boundary.
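A minimal sketch of the perceptron learning rule in plain Python, trained on the logical AND function, which is linearly separable. The step activation, the learning rate of 1.0, the zero initialization, and the epoch limit are assumptions made for this example.

```python
def predict(weights, bias, x):
    """Step activation on the weighted sum: output 1 if positive, else 0."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0

def train_perceptron(data, learning_rate=1.0, epochs=20):
    """Adjust weights and bias whenever an example is misclassified."""
    weights = [0.0] * len(data[0][0])
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in data:
            update = learning_rate * (y - predict(weights, bias, x))
            if update != 0:
                weights = [w + update * xi for w, xi in zip(weights, x)]
                bias += update
                errors += 1
        if errors == 0:   # every example is classified correctly
            break
    return weights, bias

and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_data)
print([predict(weights, bias, x) for x, _ in and_data])
```

The learned weights and bias define a straight line that separates the single positive example from the three negative ones. The same rule cannot fit XOR, which is not linearly separable.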
However, to solve more complicated, non-linear problems related to image processing, computer
vision, and natural language processing tasks, we work with deep neural networks.
There are several types of ANN, each designed for specific tasks and architectural requirements.
Let's briefly discuss some of the most common types before diving deeper into MLPs next.
Feedforward Neural Networks (FNN)
These are the simplest form of ANNs, where information flows in one direction, from input to
output. There are no cycles or loops in the network architecture. Multilayer perceptrons (MLP)
are a type of feedforward neural network.
Recurrent Neural Networks (RNN)
In RNNs, connections between nodes form directed cycles, allowing information to persist over
time. This makes them suitable for tasks involving sequential data, such as time series prediction,
natural language processing, and speech recognition.
Convolutional Neural Networks (CNN)
CNNs are designed to effectively process grid-like data, such as images. They consist of layers
of convolutional filters that learn hierarchical representations of features within the input data.
CNNs are widely used in tasks like image classification, object detection, and image
segmentation.
Long Short-Term Memory Networks (LSTM) and Gated Recurrent Units (GRU)
These are specialized types of recurrent neural networks designed to address the vanishing
gradient problem in traditional RNN. LSTMs and GRUs incorporate gated mechanisms to
better capture long-range dependencies in sequential data, making them particularly effective for
tasks like speech recognition, machine translation, and sentiment analysis.
Autoencoder
It is designed for unsupervised learning and consists of an encoder network that compresses the
input data into a lower-dimensional latent space, and a decoder network that reconstructs the
original input from the latent representation. Autoencoders are often used for dimensionality
reduction, data denoising, and generative modeling.
Generative Adversarial Networks (GAN)
GANs consist of two neural networks, a generator and a discriminator, trained simultaneously in
a competitive setting. The generator learns to generate synthetic data samples that are
indistinguishable from real data, while the discriminator learns to distinguish between real and
fake samples. GANs have been widely used for generating realistic images, videos, and other
types of data.
Stochastic Gradient Descent (SGD)
1. Initialization: SGD starts with an initial set of model parameters (weights and biases),
chosen randomly or using some predefined method.
2. Iterative Optimization: The aim of this step is to find the minimum of a loss
function, by iteratively moving in the direction of the steepest decrease in the
function's value.
For each iteration (or epoch) of training:
• Shuffle the training data to ensure that the model doesn't learn from the same patterns
in the same order every time.
• Split the training data into mini-batches (small subsets of data).
• For each mini-batch:
  o Compute the gradient of the loss function with respect to the model parameters using
  only the data points in the mini-batch. This gradient estimation is a stochastic
  approximation of the true gradient.
  o Update the model parameters by taking a step in the opposite direction of the gradient,
  scaled by a learning rate:
  $\theta_{t+1} = \theta_t - \eta \nabla J(\theta_t)$
  Where:
  $\theta_t$ represents the model parameters at iteration t (for example, the weights),
  $\nabla J(\theta_t)$ is the gradient of the loss function J with respect to the parameters $\theta_t$,
  $\eta$ is the learning rate, which controls the size of the steps taken during optimization.
3. Direction of Descent: The gradient of the loss function indicates the direction of the
steepest ascent. To minimize the loss function, gradient descent moves in the opposite
direction, towards the steepest descent.
4. Learning Rate: The step size taken in each iteration of gradient descent is determined
by a parameter called the learning rate, denoted above as $\eta$. This parameter controls
the size of the steps taken towards the minimum. If the learning rate is too small,
convergence may be slow; if it is too large, the algorithm may oscillate or diverge.
5. Convergence: Repeat the process for a fixed number of iterations or until a
convergence criterion is met (e.g., the change in loss function is below a certain
threshold).
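The procedure above can be sketched for a one-variable linear model y = w*x + b with a mean squared error loss, in plain Python. The data (generated from y = 2x + 1), the learning rate, the batch size, the number of epochs, and the fixed random seed are all assumptions made so that the example is small and reproducible; a fixed number of epochs stands in for a convergence test.

```python
import random

def sgd_linear_regression(xs, ys, learning_rate=0.05, epochs=200, batch_size=2, seed=0):
    """Mini-batch SGD for y = w*x + b with a mean squared error loss."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0                      # initialization
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)             # shuffle the training data
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]   # one mini-batch
            # Gradient of the loss, estimated from the mini-batch only.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # Step in the opposite direction of the gradient.
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]           # generated from y = 2x + 1
w, b = sgd_linear_regression(xs, ys)
print(w, b)                              # should be close to 2 and 1
```

Raising the learning rate much further makes the updates overshoot and diverge on this data, while a much smaller value needs many more epochs to get close to the solution.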
Stochastic gradient descent updates the model parameters more frequently using smaller subsets
of data, making it computationally efficient, especially for large datasets. The randomness
introduced by SGD can have a regularization effect, preventing the model from overfitting to the
training data. It is also well-suited for online learning scenarios where new data becomes
available incrementally, as it can update the model quickly with each new data point or mini-
batch.
However, SGD can also have some challenges, such as increased noise due to the stochastic
nature of the gradient estimation and the need to tune hyperparameters like the learning rate.
Various extensions and adaptations of SGD, such as mini-batch stochastic gradient descent,
momentum, and adaptive learning rate methods like AdaGrad, RMSProp, and Adam, have been
developed to address these challenges and improve convergence and performance.