Machine Learning Fundamentals Explained
Alan Turing’s definition would have fallen under the category of “systems that
act like humans.”
Hidden patterns and information about the problem can be used to predict
future events and to make all sorts of complex decisions.
The name machine learning was coined in 1959 by Arthur Samuel, an American
pioneer in the field of computer gaming and artificial intelligence, who stated
that machine learning "gives computers the ability to learn without being explicitly
programmed." Arthur Samuel created the first self-learning program for playing
checkers: the more the system played, the better it performed.
In 1997, Tom Mitchell gave a mathematical and relational definition: a
computer program is said to learn from experience E with respect to some task
T and some performance measure P, if its performance on T, as measured by P,
improves with experience.
Performance measure P: the probability that the program will win the next
game of chess.
The process of machine learning can be broken down into 7 steps, as shown in
the figure. To illustrate the significance and function of each step, we will use
a simple model that is responsible for differentiating between an apple and an
orange. Machine learning is capable of much more complex tasks, but this
example keeps the process simple to explain.
Supervised learning
Supervised learning
Supervised learning methods are the most commonly used ML methods. During training,
the model takes data samples (usually called training data) together with the associated
output (usually called labels or responses) for each data sample. The main objective of
supervised learning is to learn the association between the input training data and the
corresponding labels.
Let’s understand it with an example. Suppose we have:
Input variable: x
Output variable: Y
In order to learn the mapping function from the input to the output,
Y = f(x),
we need to apply an algorithm whose main objective is to approximate this mapping
function so well that we can easily predict the output variable Y for new input data x.
These methods are called supervised learning methods because the ML model learns from
training data where the desired output is already known. Logistic regression, k-nearest
neighbors (KNN), decision trees, and random forests are some of the well-known supervised
machine learning algorithms.
Based on the type of ML task, supervised learning methods can be divided into two
major classes: classification and regression. The main objective of classification-based
tasks is to predict categorical output responses from the input data. The output depends
on what the ML model has learned in the training phase. Categorical means unordered
and discrete values; hence, the output responses will belong to a specific discrete
category.
For example, predicting high-risk patients and discriminating them from low-risk patients is
also a classification task. Suppose that for newly admitted patients, an emergency room in a
hospital measures 12 variables (such as blood sugar, blood pressure, age, weight, and so
on). After measuring these variables, a decision must be taken on whether or not to put the
patient in the ICU. A simple condition is that high priority should be given to the patients
who may not survive more than a month.
The main objective of regression-based tasks is to predict continuous numerical output
responses from the input data. The output depends on what the ML model has learned in
the training phase. As with classification, regression lets us predict the output responses
for unseen data instances, but with continuous numerical output values. Predicting the
price of houses is one of the most common real-world examples of regression.
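As a minimal sketch of a regression task (using scikit-learn's LinearRegression and invented house-price numbers, not data from the text):

```python
from sklearn.linear_model import LinearRegression

# Feature: house size in square feet; target: price (illustrative numbers)
sizes = [[1000], [1500], [2000], [2500], [3000]]
prices = [200000, 300000, 400000, 500000, 600000]

model = LinearRegression()
model.fit(sizes, prices)  # learn the mapping from size to price

# Predict a continuous numerical output for an unseen house
predicted = model.predict([[1800]])[0]
print(round(predicted))
```

Unlike classification, the output here is a continuous value rather than a discrete category.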
Unsupervised learning
Unsupervised learning methods (as opposed to supervised learning methods) do not require
any pre-labeled training data. In such methods, the machine learning model or algorithm
tries to learn patterns and relationships from the given raw data without any supervision.
Although there are a lot of uncertainties in the result of these models, we can still obtain a
lot of useful information like all kinds of unknown patterns in the data and the features that
can be useful for categorization.
To make it clearer, suppose we have:
Input variable: x
There would be no corresponding output variable. For learning, the algorithm needs to
discover interesting patterns in the data on its own. K-means clustering, hierarchical
clustering, and Hebbian learning are some of the well-known unsupervised machine
learning algorithms.
Based on the type of ML-based tasks, unsupervised learning methods can be categorized
into the following broad areas: Clustering, one of the most useful unsupervised machine
learning algorithms/methods, is used to find the similarity and relationship patterns among
data samples. Once the relationship patterns are found, it clusters the data samples into
groups having similar features. The following figure illustrates the working of clustering
methods:
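A minimal sketch of clustering with scikit-learn's KMeans; the 2-D points below are invented to show two obvious groups:

```python
from sklearn.cluster import KMeans

# Unlabeled 2-D points forming two visible groups (illustrative data)
points = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)  # no labels are ever supplied

# Points with similar features receive the same cluster label
print(labels)
```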
Data plays a significant role in the machine learning process. One of the significant issues
that machine learning professionals face is the absence of good-quality data. Unclean and
noisy data can make the whole process extremely exhausting, and we don’t want our
algorithm to make inaccurate or faulty predictions. Hence, the quality of the data is
essential to enhance the output. We therefore need to ensure that data preprocessing,
which includes removing outliers, filtering missing values, and removing unwanted
features, is done with the utmost care.
Underfitting occurs when the model is unable to establish an accurate relationship between
the input and output variables. It is like trying to fit into undersized jeans: the model is too
simple to capture the relationship in the data. To overcome this issue:
Increase the training time
Enhance the complexity of the model
Add more features to the data
Reduce the regularization parameters
Overfitting occurs when a machine learning model learns its training data too closely,
including its noise and bias, which negatively affects its performance on new data. It is
like trying to fit into oversized jeans. Unfortunately, this is one of the significant issues
faced by machine learning professionals. Let’s understand this with the help of an example.
Consider a model trained to differentiate between a cat, a rabbit, a dog, and a tiger. If the
training data contains 1,000 cats, 1,000 dogs, 1,000 tigers, and 4,000 rabbits, then there
is a considerable probability that the model will identify a cat as a rabbit. In this example,
we had a vast amount of data, but it was biased; hence the prediction was negatively
affected.
We can tackle this issue by:
Analyzing the data with the utmost care
Using data augmentation techniques
Removing outliers from the training set
Selecting a model with fewer features
The machine learning industry is young and continuously changing, with rapid trial-and-
error experimentation. Because the process keeps transforming, there are high chances of
error, which makes learning complex. The process includes analyzing the data, removing
data bias, training the model, applying complex mathematical calculations, and a lot
more. It is therefore a really complicated process, which is another big challenge for
machine learning professionals.
The most important task in the machine learning process is to train the model on enough
data to achieve accurate output. Too little training data will produce inaccurate or overly
biased predictions. Let us understand this with the help of an example. Consider teaching
a child to distinguish between an apple and a watermelon: you show the child both fruits
and point out the differences in color, shape, and taste, and soon the child can tell them
apart. A machine-learning algorithm, on the other hand, needs a lot of data to make the
same distinction; for complex problems it may even require millions of examples.
Therefore, we need to ensure that machine learning algorithms are trained with sufficient
amounts of data.
6. Slow Implementation
This is one of the common issues faced by machine learning professionals. The machine
learning models are highly efficient in providing accurate results, but it takes a
tremendous amount of time. Slow programs, data overload, and excessive requirements
usually take a lot of time to provide accurate results. Further, it requires constant
monitoring and maintenance to deliver the best output.
So you have found quality data, trained your model well, and the predictions are really
concise and accurate. Yay, you have learned how to create a machine learning algorithm!
But wait, there is a twist: the model may become useless in the future as the data grows.
The best model of the present may become inaccurate in the future and require further
adjustment. So you need regular monitoring and maintenance to keep the algorithm
working. This is one of the most exhausting issues faced by machine learning
professionals.
STATISTICAL LEARNING
INTRODUCTION
There are two major goals for modeling data: 1) to accurately predict some future
quantity of interest, given some observed data, and 2) to discover unusual or
interesting patterns in the data.
Function approximation.
Building a mathematical model for data usually means understanding how one data
variable depends on another data variable. The most natural way to represent the
relationship between variables is via a mathematical function or map.
Optimization.
Given a class of mathematical models, we wish to find the best possible model in
that class. This requires some kind of efficient search or optimization procedure. The
optimization step can be viewed as a process of fitting or calibrating a function to
observed data.
For example, in the classification case with the zero–one loss function, the risk is equal to
the probability of incorrect classification:
ℓ(g) = P(g(X) ≠ Y).
We denote this sample by Ƭ = {(X1, Y1), . . . , (Xn, Yn)} and call it the training
set (Ƭ is a mnemonic for training) with n examples. It will be important to distinguish
between a random training set Ƭ and its (deterministic) outcome {(x1, y1), . . . ,
(xn, yn)}, denoted by τ.
Our goal is thus to “learn” the unknown g* using the n examples in the training set Ƭ.
Let us denote by gT the best (by some criterion) approximation for g∗ that we can
construct from Ƭ.
The above setting is called supervised learning, because one tries to learn the
functional relationship between the feature vector x and response y in the presence
of a teacher who provides n examples. It is common to speak of “explaining” or
predicting y on the basis of x, where x is a vector of explanatory variables. In
contrast, unsupervised learning makes no distinction between response and
explanatory variables, and the objective is simply to learn the structure of the
unknown distribution of the data. In other words, we need to learn f(x). In this case
the guess g(x) is an approximation of f(x), and the risk is of the form
ℓ(g) = E Loss(f(X), g(X)).
In practice the risk is approximated by its sample average over the training data,
ℓƬ(g) = (1/n) Σᵢ Loss(yᵢ, g(xᵢ)),
which we call the training loss. The training loss is thus an unbiased estimator of the
risk (the expected loss) for a prediction function g, based on the training data. To
approximate the optimal prediction function g∗ (the minimizer of the risk ℓ(g)), we first
select a suitable collection of approximating functions G and then take our learner to
be the function in G that minimizes the training loss; that is,
gƬ = argmin g∈G ℓƬ(g).
For example, the simplest and most useful G is the set of linear functions of x; that
is, the set of all functions g: x → βᵀx for some real-valued vector β.
Generalization risk
The prediction accuracy on new pairs of data is measured by the generalization risk
of the learner. For a fixed training set τ it is defined as
ℓ(gτ) = E Loss(Y, gτ(X)),
and it can be estimated by the test loss
ℓƬ′(gτ) = (1/n′) Σᵢ Loss(Y′ᵢ, gτ(X′ᵢ)),
where {(X′1, Y′1), (X′2, Y′2), . . . , (X′n′, Y′n′)} =: Ƭ′ is a so-called test sample. The test
sample is completely separate from Ƭ, but is drawn in the same way as Ƭ; that is, via
independent draws from f(x, y), for some sample size n′.
We can decompose the generalization risk into the following three components:
ℓ(gƬ) = ℓ∗ + (ℓ(g^G) − ℓ∗) + (ℓ(gƬ) − ℓ(g^G)).
Irreducible risk
Here ℓ∗ = ℓ(g∗) is the irreducible risk and g^G := argmin g∈G ℓ(g) is the best learner
within class G. No learner can predict a new response with a smaller risk than ℓ∗.
Approximation error
The second component is the approximation error. It measures the difference
between the irreducible risk and the best possible risk that can be obtained by
selecting the best prediction function in the selected class of functions G.
Statistical (estimation) error
The third component is the statistical (estimation) error. It depends on the training
set τ and, in particular, on how well the learner gƬ estimates the best possible
prediction function, g^G, within class G. For any sensible estimator this error should
decay to zero (in probability or expectation) as the training size tends to infinity.
Estimating Risk
The generalization risk can be estimated via the test loss. However, because the
generalization risk depends on the training set, different training sets can produce
different estimates.
Suppose each response variable Y′ᵢ is drawn from f(y | xᵢ). Even in this simplified
setting, the training loss of the learner will be a poor estimate of the in-sample risk.
Instead, the proper way to assess the prediction accuracy of the learner at the feature
vectors x1, . . . , xn is to draw new response values Y′ᵢ ~ f(y | xᵢ), i = 1, . . . , n, that
are independent of the responses y1, . . . , yn in the training data, and then estimate
the in-sample risk of gƬ from these new values.
UNIT-II
Regression
The following Table 2.1 shows the dataset, which serves the
purpose of predicting the house price, based on different
parameters:
Here the input variables are the Size of the house, the No. of
Bedrooms, and the No. of Bathrooms, and the output variable is
the Price, which is a continuous value. Therefore, this is a
Regression problem.
Outliers are observed data points that are far from the least-
squares line or that differ significantly from the other
observations; in other words, an outlier is an extreme value
that differs greatly from the other values in a set of values.
X1 = [0, 3, 4, 9, 6, 2]
X2 = -2 * X1 = [0, -6, -8, -18, -12, -4]
So X1 and X2 are collinear. Here it’s better to use only one
variable, either X1 or X2, as the input.
When a model performs well on the training dataset but not on
the test dataset, variance exists.
On the left side of the above figure, anyone can easily see that
the line does not cover all the points shown in the graph. Such a
model tends to cause a phenomenon known as underfitting. In
the case of underfitting, the model cannot learn enough from
the training data, which reduces precision and accuracy. An
underfitted model has high bias and low variance.
y = β0 + β1X
Where,
Y: Dependent variable
X: Independent variable
β0 is the intercept, and β1 is the slope of the regression line,
which tells whether the line is increasing or decreasing.
Therefore, in this graph, the dots are our data and based on this
data we will train our model to predict the results. The black line
is the best-fitted line for the given data. The best-fitted line is a
straight line that best represents the data on a scatter plot. The
best-fitted line may pass through all of the points, some of the
points or none of the points in the graph.
The goal is to find the best-fit line for the dataset: the one for
which the total prediction error (over all data points) is as small
as possible.
Regression shows how the dependent variable changes as the
independent variable changes.
The values of β0 and β1 must be chosen so that they minimize the error.
If the sum of squared error is taken as a metric to evaluate the
model, then the goal to obtain a line that best reduces the error
is achieved.
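Minimizing the sum of squared errors can be illustrated with the closed-form least-squares estimates (a NumPy sketch on invented data points, not the book's table):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])  # lies exactly on y = 2x

# Least-squares estimates: slope = cov(x, y) / var(x),
# intercept = mean(y) - slope * mean(x)
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

print(slope, intercept)  # the line y = 2x minimizes the squared error here
```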
Figure 2.9: Weather training data set and corresponding frequency and
likelihood table
Step: Calculate the posterior probability for each class using the
Naive Bayes equation. The outcome of the prediction is the class
with the highest posterior probability.
Naive Bayes employs a similar method to predict the likelihood of
various classes based on various attributes. This algorithm is
commonly used in text classification and multi-class problems.
The Naive Bayes model is classified into three types: Gaussian,
Multinomial, and Bernoulli Naive Bayes. Some of its common
applications are listed as follows:
Text classification
Spam Filtering
Real-time Prediction
Multi-class Prediction
Recommendation Systems
Credit Scoring
Sentiment Analysis
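A minimal sketch of Naive Bayes with scikit-learn's GaussianNB (one of the three types); the tiny weather-style dataset below is an assumption for illustration, not the text's Figure 2.9 data:

```python
from sklearn.naive_bayes import GaussianNB

# Features: [temperature, humidity]; labels: 0 = "no play", 1 = "play"
X = [[30, 85], [32, 90], [21, 70], [20, 65], [22, 72], [31, 88]]
y = [0, 0, 1, 1, 1, 0]

clf = GaussianNB()
clf.fit(X, y)

# The predicted class is the one with the highest posterior probability
print(clf.predict([[22, 68]]))
print(clf.predict_proba([[22, 68]]))
```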
Decision Tree
Decision-tree working
Begin the tree with node T, which contains the entire dataset.
Divide the T into subsets that contain the best possible values for
the attributes.
Create the decision tree node with the best attribute.
Continue this process until you reach a point where you can no
longer classify the nodes and refer to the final node as a leaf
node.
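The steps above can be sketched with scikit-learn's DecisionTreeClassifier; the apple/orange features and labels below are invented for illustration:

```python
from sklearn.tree import DecisionTreeClassifier

# Features: [diameter (cm), weight (g)]; labels: 0 = apple, 1 = orange
X = [[7, 150], [7, 160], [8, 170], [9, 200], [10, 210], [9, 190]]
y = [0, 0, 0, 1, 1, 1]

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, y)  # repeatedly splits the data on the best attribute

print(tree.predict([[8, 165], [10, 205]]))  # [0 1]
```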
Overfitting problem
Complexity
The K-NN algorithm assumes that the new case and existing
cases are similar and places the new case in the category that is
most similar to the existing categories. The K-NN algorithm stores
all available data and classifies a new data point based on its
similarity to the existing data. This means that when new data
appears, the KNN algorithm can quickly classify it into a suitable
category.
Assign the new data points to the category with the greatest
number of neighbors.
As can be seen, the three closest neighbors are all from category
A, so this new data point must also be from that category.
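The majority-vote idea can be sketched with scikit-learn's KNeighborsClassifier (illustrative 2-D points standing in for categories A and B):

```python
from sklearn.neighbors import KNeighborsClassifier

# Category A = 0, category B = 1 (illustrative data)
X = [[1, 1], [1, 2], [2, 2], [6, 6], [7, 7], [6, 7]]
y = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)  # K-NN simply stores the training data

# The three nearest neighbors of [2, 1] are all category A
print(knn.predict([[2, 1]]))  # [0]
```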
Logistic Regression
The following Table 2.4 shows the difference between Linear and
Logistic Regression:
There are only two possible outcomes for the categorical response.
Examples
Support vectors are data points that are closer to the hyperplane
and influence the hyperplane's position and orientation, as shown
in the figure. We maximize the classifier's margin by using these
support vectors. Deleting the support vectors will alter the
position of the hyperplane. These are the points that help us
construct the SVM.
The margin is the distance between the two lines through the
closest data points of the different classes. It can be calculated
as the perpendicular distance between the line and the support
vectors. A large margin is regarded as a good margin, while a
small margin is regarded as a bad margin.
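A minimal sketch of a maximum-margin classifier with scikit-learn's SVC (linear kernel; the points are invented for illustration):

```python
from sklearn.svm import SVC

# Two linearly separable classes (illustrative data)
X = [[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]]
y = [0, 0, 0, 1, 1, 1]

svm = SVC(kernel='linear', C=1.0)
svm.fit(X, y)

# Only the points closest to the hyperplane become support vectors
print(svm.support_vectors_)
print(svm.predict([[2, 2], [6, 6]]))  # [0 1]
```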
Working of SVM
Types of SVM
Speech recognition
Disadvantages of SVM
Ensemble learning is a general meta approach to machine learning that seeks better predictive
performance by combining the predictions from multiple models.
Although there are a seemingly unlimited number of ensembles that you can develop for your predictive
modeling problem, there are three methods that dominate the field of ensemble learning. So much so,
that rather than algorithms per se, each is a field of study that has spawned many more specialized
methods.
The three main classes of ensemble learning methods are bagging, stacking, and boosting, and it is
important to both have a detailed understanding of each method and to consider them on your
predictive modeling project.
But, before that, you need a gentle introduction to these approaches and the key ideas behind each
method prior to layering on math and code.
In this tutorial, you will discover the three standard ensemble learning techniques for machine learning.
Bagging involves fitting many decision trees on different samples of the same dataset and
averaging the predictions.
Stacking involves fitting many different model types on the same data and using another model
to learn how to best combine the predictions.
Boosting involves adding ensemble members sequentially that correct the predictions made by
prior models and outputs a weighted average of the predictions.
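The three approaches above can be sketched with scikit-learn's ready-made estimators; the synthetic dataset here is an assumption, not data from the tutorial:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Bagging: many trees on different bootstrap samples, predictions averaged
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10,
                            random_state=0)

# Stacking: different model types combined by a final meta-model
stacking = StackingClassifier(
    estimators=[('tree', DecisionTreeClassifier()),
                ('lr', LogisticRegression(max_iter=1000))],
    final_estimator=LogisticRegression(max_iter=1000),
)

# Boosting: members added sequentially to correct earlier errors
boosting = GradientBoostingClassifier(n_estimators=10, random_state=0)

for model in (bagging, stacking, boosting):
    model.fit(X, y)
    print(type(model).__name__, model.score(X, y))
```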
Ensemble learning refers to algorithms that combine the predictions from two or more models.
Although there is nearly an unlimited number of ways that this can be achieved, there are perhaps three
classes of ensemble learning techniques that are most commonly discussed and used in practice. Their
popularity is due in large part to their ease of implementation and success on a wide range of predictive
modeling problems.
A rich collection of ensemble-based classifiers has been developed over the last several years.
However, many of these are variations of a select few well-established algorithms whose
capabilities have also been extensively tested and widely reported.
Given their wide use, we can refer to them as “standard” ensemble learning strategies; they are:
1. Bagging.
2. Stacking.
3. Boosting.
There is an algorithm that describes each approach, although more importantly, the success of each
approach has spawned a myriad of extensions and related techniques. As such, it is more useful to
describe each as a class of techniques or standard approaches to ensemble learning.
Rather than dive into the specifics of each method, it is useful to step through, summarize, and contrast
each approach. It is also important to remember that although discussion and use of these methods are
pervasive, these three methods alone do not define the extent of ensemble learning.
As the name suggests, "Random Forest is a classifier that contains a number of decision trees on
various subsets of the given dataset and takes the average to improve the predictive accuracy of that
dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree
and, based on the majority vote of those predictions, predicts the final output.
The greater the number of trees in the forest, the higher the accuracy, and the problem of
overfitting is prevented.
The below diagram explains the working of the Random Forest algorithm:
Note: To better understand the Random Forest Algorithm, you should have knowledge of the Decision
Tree Algorithm.
Since the random forest combines multiple trees to predict the class of the dataset, it is possible that
some decision trees may predict the correct output, while others may not. But together, all the trees
predict the correct output. Therefore, below are two assumptions for a better Random forest classifier:
o There should be some actual values in the feature variable of the dataset so that the classifier
can predict accurate results rather than a guessed result.
o The predictions from each tree must have very low correlations.
Below are some points that explain why we should use the Random Forest algorithm:
o It predicts output with high accuracy, even for the large dataset it runs efficiently.
Random Forest works in two phases: the first is to create the random forest by combining N decision
trees, and the second is to make predictions using the trees created in the first phase.
The Working process can be explained in the below steps and diagram:
Step-2: Build the decision trees associated with the selected data points (Subsets).
Step-3: Choose the number N for decision trees that you want to build.
Step-5: For new data points, find the predictions of each decision tree, and assign the new data points
to the category that wins the majority votes.
The working of the algorithm can be better understood by the below example:
Example: Suppose there is a dataset that contains multiple fruit images. So, this dataset is given to the
Random forest classifier. The dataset is divided into subsets and given to each decision tree. During the
training phase, each decision tree produces a prediction result, and when a new data point occurs, then
based on the majority of results, the Random Forest classifier predicts the final decision. Consider the
below image:
Applications of Random Forest
There are mainly four sectors where Random Forest is mostly used:
1. Banking: Banking sector mostly uses this algorithm for the identification of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the disease can be
identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
o It enhances the accuracy of the model and prevents the overfitting issue.
Now we will implement the Random Forest Algorithm tree using Python. For this, we will use the same
dataset "user_data.csv", which we have used in previous classification models. By using the same
dataset, we can compare the Random Forest classifier with other classification models such as Decision
tree Classifier, KNN, SVM, Logistic Regression, etc.
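Since the "user_data.csv" file is not reproduced here, the sketch below substitutes a synthetic dataset to show the same Random Forest workflow:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the user_data.csv dataset
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# N decision trees, each on a random subset; the majority vote decides
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print("test accuracy:", forest.score(X_test, y_test))
```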
Voting Classifier :
You never know if your model is useful unless you evaluate the performance of the machine learning
model. The goal of a data scientist is to train a robust and high-performing model. There are various
techniques or hacks to improve the performance of the model, ensembling of models being one of
them.
Ensembling is a powerful technique to improve the performance of the model by combining various
base models in order to produce an optimal and robust model. Types of Ensembling techniques include:
Boosting
Stacking Classifier
Voting Classifier
In this article, we will discuss the implementation of a voting classifier and further discuss how can it be
used to improve the performance of the model.
Voting Classifier:
A voting classifier is a machine learning estimator that trains various base models or estimators and
predicts on the basis of aggregating the findings of each base estimator. The aggregating criteria can be
combined decision of voting for each estimator output. The voting criteria can be of two types:
Hard Voting: the predicted output class is the class with the majority of votes from the base estimators.
Soft Voting: voting is calculated on the predicted probabilities of the output class.
How Voting Classifier can improve performance?
The voting classifier aggregates the predicted class or predicted probability on the basis of hard or
soft voting. So if we feed a variety of base models to the voting classifier, errors made by any single
model can be compensated for by the others.
Implementation:
Scikit-learn packages offer implementation of Voting Classifier in a few lines of Python code.
For our sample classification dataset, we are training 4 base estimators of Logistic Regression, Random
Forest, Gaussian Naive Bayes, and Support Vector Classifier.
The parameter voting=‘soft’ or voting=‘hard’ lets developers switch between hard and soft voting
aggregation. The weights parameter can be tuned to emphasize some of the better-performing base
estimators: it is a sequence of weights applied to the predicted class labels for hard voting, or to the
class probabilities before averaging for soft voting.
We are using a soft voting classifier with a weight distribution of [1, 2, 1, 1], where twice the weight is
assigned to the Random Forest model. Now let's observe the benchmark performance of each of the
base estimators vis-à-vis the voting classifier.
From the above table, the voting classifier boosts the performance compared to its base
estimators.
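A sketch of the soft-voting setup described above, with the [1, 2, 1, 1] weights (double weight on the Random Forest); the synthetic dataset stands in for the article's data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=1)

voting = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=1000)),
        ('rf', RandomForestClassifier(random_state=1)),
        ('gnb', GaussianNB()),
        ('svc', SVC(probability=True, random_state=1)),  # soft voting needs probabilities
    ],
    voting='soft',
    weights=[1, 2, 1, 1],  # Random Forest counts twice in the average
)
voting.fit(X, y)
print("training accuracy:", voting.score(X, y))
```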
Conclusion:
Voting Classifier is a machine-learning algorithm often used by Kagglers to boost the performance of
their model and climb up the rank ladder. Voting Classifier can also be used for real-world datasets to
improve performance, but it comes with some limitations. The model interpretability decreases, as one
cannot interpret the model using shap, or lime packages.
Scikit-learn does not provide an implementation to compute the top-performing features for the
voting classifier, unlike other models, but I have come up with a hack to compute the same. You can
compute the feature importance by combining the importance scores of each of the estimators,
weighted accordingly.
There are several ways to group models. They differ in the training algorithm and
data used in each one of them and also how they are grouped. We’ll be talking in the
article about two methods called Bagging and Pasting and how to implement them
in scikit-learn.
But before we begin talking about Bagging and Pasting, we have to know what
Bootstrapping is.
Bootstrapping
Bootstrapping means drawing samples from a dataset with replacement. For example,
let’s say we have a set of observations: [2, 4, 32, 8, 16]. Any selection of n observations
drawn with replacement from this set is a valid bootstrap sample.
Since we draw data with replacement, an observation can appear more than once
in a single sample.
Out-of-Bag Scoring
If we are using bagging, there’s a chance that a sample is never selected, while
others may be selected multiple times. The probability of not selecting a specific
sample in one draw is (1 − 1/n), where n is the number of samples. Therefore, the
probability of a specific sample not being picked in n draws is (1 − 1/n)^n. When n is
large, this probability approaches 1/e, which is approximately 0.368. This means that
when the dataset is big enough, about 37% of its samples are never selected, and we
can use them to test our model. This is called Out-of-Bag scoring, or OOB scoring.
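scikit-learn's BaggingClassifier can compute this score directly via oob_score=True; a short sketch on a synthetic dataset (an assumption, not the article's data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

bag = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,
    bootstrap=True,   # draw training samples with replacement
    oob_score=True,   # score each estimator on the samples it never saw
    random_state=0,
)
bag.fit(X, y)
print("OOB score:", bag.oob_score_)
```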
Random Forests
As the name suggests, a random forest is an ensemble of decision trees that can be
used for classification or regression. In most cases it uses bagging. Each tree in the
forest outputs a prediction, and the most-voted prediction becomes the output of the
model. This helps make the model more accurate and stable, preventing overfitting.
Another very useful property of random forests is the ability to measure the relative
importance of each feature by calculating how much each one reduces the impurity of
the model. This is called feature importance.
A scikit-learn example
To see how bagging works in scikit-learn, we will train some models alone and then
aggregate them, so we can see if it works.
In this example we’ll be using the 1994 census dataset on US income. It contains
information such as marital status, age, type of work, and more. The target column is
a categorical data type that indicates whether a salary is less than or equal to 50k a
year (0) or not (1). Let’s explore the DataFrame with Pandas’ info method:
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age 32561 non-null int64
workclass 32561 non-null object
fnlwgt 32561 non-null int64
education 32561 non-null object
education_num 32561 non-null int64
marital_status 32561 non-null object
occupation 32561 non-null object
relationship 32561 non-null object
race 32561 non-null object
sex 32561 non-null object
capital_gain 32561 non-null int64
capital_loss 32561 non-null int64
hours_per_week 32561 non-null int64
native_country 32561 non-null object
high_income 32561 non-null int8
dtypes: int64(6), int8(1), object(8)
As we can see, there are numerical (int64 and int8) and categorical (object) data types.
We have to deal with each type separately before sending the data to the predictor.
Data Preparation
First we load the CSV file and convert the target column to categorical, so when we
are passing all columns to the pipeline we don’t have to worry about the target
column.
import numpy as np
import pandas as pd

# Load CSV
df = pd.read_csv('data/[Link]')

# Convert target to categorical
col = pd.Categorical(df.high_income)
df["high_income"] = col.codes
There are numerical and categorical columns in our dataset, and each needs different
preprocessing: the numerical features need to be normalized, and the categorical
features need to be converted to integers. To do this, we define a transformer that
preprocesses our data depending on its type.
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import MinMaxScaler

class PreprocessTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, cat_features, num_features):
        self.cat_features = cat_features
        self.num_features = num_features

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        df = X.copy()
        # Treat '?' workclass as unknown
        df.loc[df['workclass'] == '?', 'workclass'] = 'Unknown'
        # Too many categories, just convert to US and non-US
        df.loc[df['native_country'] != 'United-States', 'native_country'] = 'non_usa'
        # Convert categorical columns to integer codes
        for name in self.cat_features:
            col = pd.Categorical(df[name])
            df[name] = col.codes
        # Normalize numerical features
        scaler = MinMaxScaler()
        df[self.num_features] = scaler.fit_transform(df[self.num_features])
        return df
The data is then split into train and test sets, so we can check later whether our
model generalizes to unseen data.
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('high_income', axis=1),
    df['high_income'],
    test_size=0.2,
    random_state=42,
    shuffle=True,
    stratify=df['high_income']
)
Finally, we build our models. First we create a pipeline that preprocesses the data
with our custom transformer, selects the best features with SelectKBest, and trains
our predictors.
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier

random_state = 42
leaf_nodes = 5
num_features = 10
num_estimators = 100

# Decision tree for bagging
tree_clf = DecisionTreeClassifier(
    splitter='random',
    max_leaf_nodes=leaf_nodes,
    random_state=random_state
)

# Initialize the bagging classifier
bag_clf = BaggingClassifier(
    tree_clf,
    n_estimators=num_estimators,
    max_samples=1.0,
    max_features=1.0,
    random_state=random_state,
    n_jobs=-1
)

# Create a pipeline
pipe = Pipeline([
    ('preproc', PreprocessTransformer(categorical_features, numerical_features)),
    ('fs', SelectKBest()),
    ('clf', DecisionTreeClassifier())
])
Since we want to see the difference between a single decision tree and an ensemble of
them, we can use scikit-learn’s GridSearchCV to train all predictors with a single fit
call. We use AUC and accuracy as scoring metrics and a KFold with 10 splits as
cross-validation.
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, make_scorer

# Define our search space for grid search
search_space = [
    {
        'clf': [DecisionTreeClassifier()],
        'clf__max_leaf_nodes': [128],
        'fs__score_func': [chi2],
        'fs__k': [10],
    },
    {
        'clf': [RandomForestClassifier()],
        'clf__n_estimators': [200],
        'clf__max_leaf_nodes': [128],
        'clf__bootstrap': [False, True],
        'fs__score_func': [chi2],
        'fs__k': [10],
    }
]

# Define scoring
scoring = {'AUC': 'roc_auc', 'Accuracy': make_scorer(accuracy_score)}

# Define cross validation
kfold = KFold(n_splits=10, shuffle=True, random_state=42)

# Define grid search
grid = GridSearchCV(
    pipe,
    param_grid=search_space,
    cv=kfold,
    scoring=scoring,
    refit='AUC',
    verbose=1,
    n_jobs=-1
)

# Fit grid search
model = grid.fit(X_train, y_train)
The mean AUC and accuracy for each model tested in GridSearchCV are shown below.
Since the best estimator was the random forest, we can inspect the OOB score and the
feature importances with:
best_estimator = grid.best_estimator_.steps[-1][1]
columns = X_test.columns.tolist()

print('OOB Score: {}'.format(best_estimator.oob_score_))
print('Feature Importances:')
for name, importance in zip(columns, best_estimator.feature_importances_):
    print('{}: {:.3f}'.format(name, importance))
Which prints:
OOB Score: 0.8396805896805897
Feature Importances:
age: 0.048
workclass: 0.012
fnlwgt: 0.167
education: 0.138
education_num: 0.001
marital_status: 0.329
occupation: 0.009
relationship: 0.259
race: 0.012
sex: 0.025
Boosting
Theory, Implementation, and Visualization
1. AdaBoost
Definition of Weakness
Now that weakness is defined, the next step is to figure out how to combine
the sequence of models so that the ensemble grows stronger over time.
Pseudocode
Implementation in Python
The base estimator from which the boosted ensemble is built. If None,
then the base estimator is DecisionTreeClassifier(max_depth=1)
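The "Implementation in Python" step can be sketched with scikit-learn's AdaBoostClassifier; with no base estimator given it defaults to the depth-1 decision stump quoted above. The dataset and hyperparameters below are illustrative, not from the text:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Toy dataset for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# With no estimator given, AdaBoost boosts depth-1 decision stumps
ada_clf = AdaBoostClassifier(n_estimators=100, random_state=42)
ada_clf.fit(X_train, y_train)
print('Test accuracy: {:.3f}'.format(ada_clf.score(X_test, y_test)))
```

Each stump is trained on a reweighted version of the data, with misclassified samples from earlier rounds receiving higher weight.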
2. Gradient Boosting
Definition of Weakness
Pseudocode
Implementation in Python
Regression:
Classification:
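Both the regression and classification cases can be sketched with scikit-learn's gradient boosting estimators (datasets and hyperparameters here are illustrative):

```python
from sklearn.datasets import make_regression, make_classification
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier

# Regression: each new tree fits the residual errors of the current ensemble
Xr, yr = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3,
                                random_state=0)
gbr.fit(Xr, yr)
print('Regression R^2: {:.3f}'.format(gbr.score(Xr, yr)))

# Classification: trees are fit to the gradient of the log-loss
Xc, yc = make_classification(n_samples=300, n_features=8, random_state=0)
gbc = GradientBoostingClassifier(n_estimators=100, random_state=0)
gbc.fit(Xc, yc)
print('Classification accuracy: {:.3f}'.format(gbc.score(Xc, yc)))
```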
1. Bagging
2. Boosting
3. Stacking
Architecture of Stacking
There are some other ensemble techniques that can be considered forerunners of
the stacking method. For better understanding, we have grouped them into the
different frameworks of essential stacking so that we can easily see the
differences between the methods and the uniqueness of each technique. Let's
discuss a few commonly used ensemble techniques related to stacking.
Voting ensembles:
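A voting ensemble combines the class predictions of several different models by majority vote. A minimal sketch with scikit-learn's VotingClassifier (the base models and dataset are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=1)

# Each base model votes; with voting='hard' the majority class wins
voting_clf = VotingClassifier(estimators=[
    ('lr', LogisticRegression(max_iter=1000)),
    ('dt', DecisionTreeClassifier(random_state=1)),
    ('knn', KNeighborsClassifier()),
], voting='hard')
voting_clf.fit(X, y)
print('Training accuracy: {:.3f}'.format(voting_clf.score(X, y)))
```

Unlike stacking, there is no meta-learner here: the combination rule is fixed (majority vote) rather than learned.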
Support Vector Machine (SVM)
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-
dimensional space into classes so that we can easily put a new data point in the correct category in the
future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are
called support vectors, and hence the algorithm is termed Support Vector Machine. Consider the below
diagram, in which two different categories are classified using a decision boundary or hyperplane:
Example: SVM can be understood with the example that we used for the KNN classifier. Suppose we
see a strange cat that also has some features of dogs; if we want a model that can accurately identify
whether it is a cat or a dog, such a model can be created using the SVM algorithm. We first train our
model with lots of images of cats and dogs so that it can learn about their different features, and then
we test it with this strange creature. The SVM creates a decision boundary between the two classes
(cat and dog) and chooses the extreme cases (support vectors) of cat and dog. On the basis of the
support vectors, it will classify it as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text categorization, etc.
Types of SVM
o Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified into
two classes using a single straight line, then such data is termed linearly separable data, and
the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-Linear SVM is used for non-linearly separable data. If a dataset cannot
be classified using a straight line, then such data is termed non-linear data, and the classifier
used is called a Non-linear SVM classifier.
Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional
space, but we need to find the best decision boundary that helps to classify the data points. This best
boundary is known as the hyperplane of SVM.
The dimensions of the hyperplane depend on the number of features present in the dataset: if there are
2 features (as shown in the image), the hyperplane will be a straight line, and if there are 3 features, the
hyperplane will be a 2-dimensional plane.
We always create the hyperplane that has a maximum margin, i.e., the maximum distance to the
data points.
Support Vectors: The data points or vectors that are closest to the hyperplane and which affect its
position are termed support vectors. Since these vectors support the hyperplane, they are called
support vectors.
Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we have a dataset
that has two tags (green and blue), and the dataset has two features x1 and x2. We want a classifier that
can classify the pair(x1, x2) of coordinates in either green or blue. Consider the below image:
Since it is a 2-d space, just by using a straight line we can easily separate these two classes. But there
can be multiple lines that separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region
is called a hyperplane. The SVM algorithm finds the closest points of the lines from both classes. These
points are called support vectors. The distance between the vectors and the hyperplane is called the
margin, and the goal of SVM is to maximize this margin. The hyperplane with maximum margin is
called the optimal hyperplane.
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear data, we
cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data we have used the
two dimensions x and y, so for non-linear data we will add a third dimension z, calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below image:
Since we are in 3-d Space, hence it is looking like a plane parallel to the x-axis. If we convert it in 2d space
with z=1, then it will become as:
Hence we get a circumference of radius 1 in case of non-linear data.
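The transformation above can be checked numerically. In the sketch below (the ring radii and threshold are illustrative), two concentric rings that no straight line can separate become separable by a single threshold on the new feature z = x² + y²:

```python
import numpy as np

rng = np.random.default_rng(0)

# Inner ring (radius 1) and outer ring (radius 3): not linearly separable in 2-D
theta = rng.uniform(0, 2 * np.pi, 100)
inner = np.c_[1.0 * np.cos(theta), 1.0 * np.sin(theta)]
outer = np.c_[3.0 * np.cos(theta), 3.0 * np.sin(theta)]

# Third dimension: z = x^2 + y^2 (the squared distance from the origin)
z_inner = inner[:, 0] ** 2 + inner[:, 1] ** 2   # all equal to 1
z_outer = outer[:, 0] ** 2 + outer[:, 1] ** 2   # all equal to 9

# In z, a single threshold (a plane in 3-D) separates the classes
print(z_inner.max() < 5 < z_outer.min())  # True
```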
Now we will implement the SVM algorithm using Python. Here we will use the same dataset user_data,
which we have used in Logistic regression and KNN classification.
Till the data pre-processing step, the code will remain the same. After executing the
pre-processing code, we get the dataset as:
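The listing itself is not reproduced here; a minimal sketch of the usual steps (scaling the features and fitting a linear SVM classifier) is shown below. The synthetic two-feature dataset stands in for user_data and is purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in for the two user_data features (e.g. age and estimated salary)
X, y = make_classification(n_samples=400, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

# Feature scaling, as in the pre-processing step
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fit a linear SVM classifier
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(X_train, y_train)
print('Test accuracy: {:.3f}'.format(classifier.score(X_test, y_test)))
```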
Consider these two red lines as the decision boundary and the green line as the
hyperplane. Our objective in SVR is to consider only the points that are within the
decision boundary lines. Our best-fit line is the hyperplane that has the maximum
number of points.
The first thing to understand is the decision boundary (the red lines above). Consider
these lines as being at some distance, say ‘a’, from the hyperplane: they are the lines we
draw at distance ‘+a’ and ‘-a’ from the hyperplane. This ‘a’ is usually referred to as
epsilon.
wx+b= +a
wx+b= -a
Our main aim here is to decide a decision boundary at ‘a’ distance from the original
hyperplane such that data points closest to the hyperplane or the support vectors are within
that boundary line.
Hence, we are going to take only those points that are within the decision boundary and
have the least error rate, or are within the Margin of Tolerance. This gives us a better fitting
model.
import pandas as pd

dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
A real-world dataset contains features that vary in magnitude, units, and range. I would
suggest performing normalization when the scale of a feature is irrelevant or misleading.
Feature scaling helps to normalize the data within a particular range. Many estimator
classes perform feature scaling automatically, but the SVR class does not, so we have to
perform feature scaling ourselves:
from sklearn.preprocessing import StandardScaler

sc_X = StandardScaler()
sc_y = StandardScaler()
X = sc_X.fit_transform(X)
y = sc_y.fit_transform(y.reshape(-1, 1)).ravel()
Kernel is the most important feature. There are many types of kernels – linear, Gaussian,
etc. Each is used depending on the dataset. To learn more about this, read this: Support
Vector Machine (SVM) in Python and R
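The fitting step itself is elided in the text; a self-contained sketch of fitting an SVR with the RBF kernel is shown below. The position-level/salary values are illustrative stand-ins, not the actual Position_Salaries data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Stand-in for the position-level / salary data (values are illustrative)
X = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = np.array([45, 50, 60, 80, 110, 150, 200, 300, 500, 1000], dtype=float)

# Scale both the feature and the target, as in the previous step
sc_X = StandardScaler()
sc_y = StandardScaler()
X_scaled = sc_X.fit_transform(X)
y_scaled = sc_y.fit_transform(y.reshape(-1, 1)).ravel()

# Fit the SVR with the (default) RBF kernel
regressor = SVR(kernel='rbf')
regressor.fit(X_scaled, y_scaled)

# Predict for position level 6.5 and undo the target scaling
pred_scaled = regressor.predict(sc_X.transform([[6.5]]))
pred = sc_y.inverse_transform(pred_scaled.reshape(-1, 1))
print(pred.shape)  # (1, 1)
```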
# 'regressor' is the SVR model fitted on the scaled data
y_pred = regressor.predict(sc_X.transform([[6.5]]))
y_pred = sc_y.inverse_transform(y_pred.reshape(-1, 1))
Step 6. Visualizing the SVR results (for higher resolution and smoother curve)
X_grid = np.arange(X.min(), X.max(), 0.01) # this step is required because the data is feature scaled
X_grid = X_grid.reshape((len(X_grid), 1))
plt.plot(X_grid, regressor.predict(X_grid))
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()
This is what we get as output- the best fit line that has a maximum number of points. Quite
accurate!
UNIT-IV
Clustering in Machine Learning
Clustering or cluster analysis is a machine learning technique which groups an unlabelled dataset. It can be defined as "A
way of grouping the data points into different clusters, consisting of similar data points. The objects with the
possible similarities remain in a group that has few or no similarities with another group."
It does this by finding similar patterns in the unlabelled dataset, such as shape, size, color, or behavior, and divides the
data according to the presence and absence of those patterns.
It is an unsupervised learning method, hence no supervision is provided to the algorithm, and it deals with the unlabeled
dataset.
After applying this clustering technique, each cluster or group is given a cluster-ID, which the ML system can use to
simplify the processing of large and complex datasets.
Note: Clustering is somewhere similar to the classification algorithm, but the difference is the type of dataset that we are using.
In classification, we work with the labeled data set, whereas in clustering, we work with the unlabelled dataset.
Example: Let's understand the clustering technique with the real-world example of a mall. When we visit a shopping mall,
we can observe that things with similar usage are grouped together: t-shirts are grouped in one section and trousers in
another, and in the vegetable section apples, bananas, mangoes, etc. are grouped separately so that we can easily find
things. The clustering technique works in the same way. Another example of clustering is grouping documents
according to topic.
The clustering technique can be widely used in various tasks. Some most common uses of this technique are:
o Market Segmentation
o Statistical data analysis
o Social network analysis
o Image segmentation
o Anomaly detection, etc.
Apart from these general usages, it is used by Amazon in its recommendation system to provide recommendations based
on past product searches. Netflix also uses this technique to recommend movies and web series to its users based on
their watch history.
The below diagram explains the working of the clustering algorithm. We can see the different fruits are divided into
several groups with similar properties.
Types of Clustering Methods
The clustering methods are broadly divided into hard clustering (each data point belongs to only one group) and soft
clustering (data points can belong to more than one group). There are also various other approaches to clustering.
Below are the main clustering methods used in Machine learning:
1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering
Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also known as the centroid-based
method. The most common example of partitioning clustering is the K-Means Clustering algorithm.
In this type, the dataset is divided into a set of k groups, where k defines the number of pre-defined groups. The
cluster centers are created in such a way that the distance between the data points of one cluster is minimal compared
to their distance to another cluster centroid.
Density-Based Clustering
The density-based clustering method connects highly dense areas into clusters, and arbitrarily shaped distributions
are formed as long as the dense regions can be connected. The algorithm does this by identifying different clusters in
the dataset and connecting the areas of high density into clusters. The dense areas in data space are separated
from each other by sparser areas.
These algorithms can face difficulty in clustering the data points if the dataset has varying densities and high dimensions.
Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided based on the probability that a data point belongs
to a particular distribution. The grouping is done by assuming some distributions, commonly the Gaussian distribution.
The example of this type is the Expectation-Maximization Clustering algorithm that uses Gaussian Mixture Models
(GMM).
Hierarchical Clustering
Hierarchical clustering can be used as an alternative to partitioned clustering, as there is no requirement to pre-
specify the number of clusters to be created. In this technique, the dataset is divided into clusters to create a tree-like
structure called a dendrogram. The observations, or any number of clusters, can be selected by cutting the
tree at the correct level. The most common example of this method is the Agglomerative Hierarchical algorithm.
Fuzzy Clustering
Fuzzy clustering is a type of soft method in which a data object may belong to more than one group or cluster. Each
data point has a set of membership coefficients, which depend on its degree of membership in each cluster. The Fuzzy C-
means algorithm is an example of this type of clustering; it is sometimes also known as the Fuzzy k-means algorithm.
K-Means Clustering
K-means is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data
point belongs to only one group with similar properties. It allows us to cluster the data into different groups and is a
convenient way to discover the categories of groups in an unlabeled dataset on its own, without the need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to
minimize the sum of distances between the data point and their corresponding clusters.
The algorithm takes the unlabeled dataset as input, divides it into k clusters, and repeats the process until it cannot
find better clusters. The value of k should be predetermined in this algorithm.
o Determines the best value for K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. Those data points which are near to the particular k-center, create
a cluster.
Hence each cluster has datapoints with some commonalities, and it is away from other clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids. (They can be other than points from the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of each cluster.
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is given below:
o Let's take number k of clusters, i.e., K=2, to identify the dataset and to put them into different clusters. It means
here we will try to group these datasets into two different clusters.
o We need to choose some random k points or centroid to form the cluster. These points can be either the points
from the dataset or any other point. So, here we are selecting the below two points as k points, which are not the
part of our dataset. Consider the below image:
o Now we will assign each data point of the scatter plot to its closest K-point or centroid. We will compute this by
applying the mathematics we have studied to calculate the distance between two points. To do so, we will draw a
median between both centroids. Consider the below image:
From the above image, it is clear that the points on the left side of the line are near K1, the blue centroid, and the points
to the right of the line are close to the yellow centroid. Let's color them blue and yellow for clear visualization.
o As we need to find the closest cluster, we will repeat the process by choosing new centroids. To choose the
new centroids, we will compute the center of gravity of each cluster, and will find the new centroids as below:
o Next, we will reassign each datapoint to the new centroid. For this, we will repeat the same process of finding a
median line. The median will be like below image:
From the above image we can see that one yellow point is on the left side of the line, and two blue points are to the right
of the line, so these three points will be assigned to new centroids.
As reassignment has taken place, we again go to step-4, which is finding new centroids or K-points.
o We will repeat the process by finding the center of gravity of centroids, so the new centroids will be as shown in
the below image:
o As we have the new centroids, we will again draw the median line and reassign the data points. So the image will be:
o We can see in the above image that there are no dissimilar data points on either side of the line, which means our
model is formed. Consider the below image:
As our model is ready, we can now remove the assumed centroids, and the two final clusters will be as shown in the
below image:
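The iterative procedure walked through above can be sketched with scikit-learn's KMeans; the two-blob dataset below is illustrative, standing in for the M1/M2 scatter plot:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Two well-separated groups of points, like the walkthrough above
X, _ = make_blobs(n_samples=200, centers=2, random_state=42)

# K=2: the algorithm alternates the assignment and centroid-update steps
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
print(kmeans.cluster_centers_.shape)  # (2, 2)
```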
Elbow Method
The Elbow method is one of the most popular ways to find the optimal number of clusters. This method uses the concept
of WCSS value. WCSS stands for Within Cluster Sum of Squares, which defines the total variations within a cluster. The
formula to calculate the value of WCSS (for 3 clusters) is given below:
WCSS = ∑Pi in Cluster1 distance(Pi, C1)² + ∑Pi in Cluster2 distance(Pi, C2)² + ∑Pi in Cluster3 distance(Pi, C3)²
∑Pi in Cluster1 distance(Pi, C1)² is the sum of the squared distances between each data point and its centroid within
cluster 1, and similarly for the other two terms.
To measure the distance between data points and centroid, we can use any method such as Euclidean distance or
Manhattan distance.
To find the optimal value of clusters, the elbow method follows the below steps:
o It executes the K-means clustering on a given dataset for different K values (ranges from 1-10).
o For each value of K, calculates the WCSS value.
o Plots a curve between calculated WCSS values and the number of clusters K.
o The sharp point of bend in the plot (where the curve looks like an arm) is considered the best value of K.
Since the graph shows a sharp bend that looks like an elbow, it is known as the elbow method. The graph for
the elbow method looks like the below image:
Note: We can choose the number of clusters equal to the number of data points. In that case the value of WCSS becomes
zero, and that will be the endpoint of the plot.
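The elbow procedure above can be sketched as follows; KMeans exposes the WCSS of a fitted clustering as its `inertia_` attribute (the three-blob dataset is illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Run k-means for K = 1..10 and record the WCSS (inertia_) for each
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    km.fit(X)
    wcss.append(km.inertia_)

# WCSS always decreases as K grows; the "elbow" is where the drop flattens
print(wcss[0] > wcss[1] > wcss[2])  # True
```

Plotting `wcss` against `range(1, 11)` gives the elbow curve described in the text.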
One advantage of k-means is that it guarantees convergence.
k-means Generalization
What happens when clusters are of different densities and sizes? Look at Figure 1. Compare the intuitive clusters on the
left side with the clusters actually found by k-means on the right side. The comparison shows how k-means can stumble
on certain datasets.
To cluster naturally imbalanced clusters like the ones shown in Figure 1, you can adapt (generalize) k-means. In Figure 2,
the lines show the cluster boundaries after generalizing k-means as:
While this course doesn't dive into how to generalize k-means, remember that the ease of modifying k-means is another
reason why it's powerful. For information on generalizing k-means, see Clustering – K-means Gaussian mixture
models by Carlos Guestrin from Carnegie Mellon University.
Disadvantages of k-means
Choosing k manually.
Use the “Loss vs. Clusters” plot to find the optimal (k), as discussed in Interpret Results.
Being dependent on initial values.
For a low k, you can mitigate this dependence by running k-means several times with different initial values and picking
the best result. As k increases, you need advanced versions of k-means to pick better values of the initial centroids
(called k-means seeding). For a full discussion of k-means seeding, see A Comparative Study of Efficient Initialization
Methods for the K-Means Clustering Algorithm by M. Emre Celebi, Hassan A. Kingravi, and Patricio A. Vela.
k-means has trouble clustering data where clusters are of varying sizes and densities. To cluster such data, you need to
generalize k-means as described in the Advantages section.
Clustering outliers.
Centroids can be dragged by outliers, or outliers might get their own cluster instead of being ignored. Consider removing
or clipping outliers before clustering.
As the number of dimensions increases, a distance-based similarity measure converges to a constant value between any
given examples. Reduce dimensionality either by using PCA on the feature data, or by using “spectral clustering” to
modify the clustering algorithm as explained below.
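The PCA option mentioned above can be sketched as follows, reducing feature dimensionality before clustering (the array sizes and cluster count are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))  # 500 examples, 50 features

# Project onto the top 10 principal components, then cluster in that subspace
X_reduced = PCA(n_components=10, random_state=0).fit_transform(X)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_reduced)
print(X_reduced.shape)  # (500, 10)
```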
Curse of Dimensionality and Spectral Clustering
These plots show how the ratio of the standard deviation to the mean of distance between examples decreases as the
number of dimensions increases. This convergence means k-means becomes less effective at distinguishing between
examples. This negative consequence of high-dimensional data is called the curse of dimensionality.
Figure 3: A demonstration of the curse of dimensionality. Each plot shows the pairwise distances between
200 random points.
Spectral clustering avoids the curse of dimensionality by adding a pre-clustering step to your algorithm:
1. Reduce the dimensionality of the feature data by using PCA.
2. Project all data points into the lower-dimensional subspace.
3. Cluster the data in this subspace by using your chosen algorithm.
Therefore, spectral clustering is not a separate clustering algorithm but a pre-clustering step that you can use with any
clustering algorithm. The details of spectral clustering are complicated; see A Tutorial on Spectral Clustering by Ulrike von
Luxburg.
Segmentation by Clustering
It is a method to perform pixel-wise image segmentation. In this type of segmentation, we try to cluster the
pixels that belong together. There are two approaches for performing segmentation by clustering:
Clustering by Merging
Divisive Clustering
Clustering by merging, or Agglomerative Clustering:
In this approach we follow a bottom-up strategy, which means we assign each point to its closest cluster. The algorithm
for performing agglomerative clustering is as follows:
Take each point as a separate cluster.
For a given number of epochs or until clustering is satisfactory.
Merge two clusters with the smallest inter-cluster distance (WCSS).
Repeat the above step
Agglomerative clustering is represented by a dendrogram. It can be performed in 3 ways: by selecting the closest pair for
merging, by selecting the farthest pair for merging, or by selecting the pair at an average distance (neither closest nor
farthest). The dendrograms representing these types of clustering are below:
Nearest clustering
Average Clustering
Farthest Clustering
The k-means objective function can be written as WCSS = ∑j ∑i∈Cj ||xi − cj||², where j runs over the clusters and i runs
over the points belonging to the jth cluster. This objective is called the within-cluster sum of squares (WCSS) distance.
A good way to find the optimal value of K is to brute-force a small range of values (1-10) and plot the graph of WCSS
distance vs. K. The point where the graph bends sharply downward can be considered the optimal value of K. This method
is called the elbow method.
For image segmentation, we plot the histogram of the image and try to find peaks and valleys in it. Then we perform the
peakiness test on that histogram.
Implementation
In this implementation, we will be performing image segmentation using K-means clustering, via the OpenCV k-means API.
Python3
# imports
import numpy as np
import cv2 as cv
import matplotlib.pyplot as plt

plt.rcParams["figure.figsize"] = (12, 50)

# load image
img = cv.imread('[Link]')
Z = img.reshape((-1, 3))

# convert to np.float32
Z = np.float32(Z)

# stopping criteria for k-means: 10 iterations or epsilon 1.0
criteria = (cv.TERM_CRITERIA_EPS + cv.TERM_CRITERIA_MAX_ITER, 10, 1.0)

fig, ax = plt.subplots(10, 2)
for i in range(10):
    K = i + 3
    ret, label, center = cv.kmeans(Z, K, None, criteria, 10,
                                   cv.KMEANS_RANDOM_CENTERS)
    # map each pixel back to the colour of its cluster center
    center = np.uint8(center)
    res = center[label.flatten()]
    res2 = res.reshape((img.shape))
    ax[i, 1].imshow(res2)
    ax[i, 1].set_title('K = %s Image' % K)
    ax[i, 0].imshow(img)
    ax[i, 0].set_title('Original Image')
Image Segmentation for K=3,4,5
Image Segmentation for K=6,7,8
Data Preprocessing or Data Preparation is a data mining technique that transforms raw data into
an understandable format for ML algorithms. Real-world data is usually noisy (contains errors,
outliers, duplicates), incomplete (some values are missing), and may be stored in different places
and different formats. The task of Data Preprocessing is to handle these issues.
In the common ML pipeline, the Data Preprocessing stage sits between the Data Collection stage
and Training / Tuning the Model.
1. Different ML models require different input data (numerical data, images in a specific
format, etc.). Without the right data, nothing will work.
2. Because of “bad” data, ML models will not give any useful results, or may even give wrong
answers that lead to wrong decisions (the GIGO principle).
3. The higher the quality of the data, the less data is needed.
Before starting data preparation, it is recommended to determine what data requirements are
presented by the ML algorithm for getting quality results. In this article we consider the K-means
clustering algorithm.
Data has no noise or outliers. K-means is very sensitive to outliers and noisy data.
Data has a symmetric distribution of variables (it isn’t skewed). Real data always has
outliers and noise, and it’s difficult to get rid of them. Transforming data toward a normal
distribution helps to reduce the impact of these issues, making it much easier for the algorithm
to identify clusters.
Variables are on the same scale — they have the same mean and variance, usually in the range -1.0 to
1.0 (standardized data) or 0.0 to 1.0 (normalized data). For the ML algorithm to consider all
attributes as equal, they must all have the same scale.
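The two scaling options mentioned, standardization and normalization, can be sketched as follows (the small data matrix is illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

# Standardization: zero mean, unit variance per column
X_std = StandardScaler().fit_transform(X)
# Normalization: rescale each column to the range [0, 1]
X_norm = MinMaxScaler().fit_transform(X)

print(np.allclose(X_std.mean(axis=0), 0.0))  # True
print(X_norm.min(axis=0), X_norm.max(axis=0))  # [0. 0.] [1. 1.]
```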
Besides the requirements above, there are a few fundamental model assumptions:
the variance of the distribution of each attribute (variable) is spherical (or in other words, the
boundaries between k-means clusters are linear);
These assumptions are beyond the data preprocessing stage. There is no way to validate them
before getting model results.
1. Data Cleaning
Removing duplicates
3. Data Integration
4. Data Transformation
Feature Construction
Handling skewness
Data Scaling
5. Data Reduction
Using Clustering for Semi-Supervised Learning
Semi-supervised clustering is a method that partitions unlabeled data by making use of domain knowledge. This
knowledge is generally expressed as pairwise constraints between instances, or simply as an additional set of labeled
instances.
The quality of unsupervised clustering can be substantially improved using some weak form of supervision, for
instance pairwise constraints (i.e., pairs of objects labeled as belonging to the same or different clusters). Such a
clustering procedure that relies on user feedback or guidance constraints is known as semi-supervised clustering.
There are several methods for semi-supervised clustering that can be divided into two classes which are as follows −
Constraint-based semi-supervised clustering − Uses user-provided labels or constraints to steer the algorithm toward a
more appropriate data partitioning. This includes modifying the objective function based on the constraints, or
initializing and constraining the clustering process based on the labeled objects.
Distance-based semi-supervised clustering − Employs an adaptive distance measure that is trained to satisfy the labels
or constraints in the supervised data. Multiple adaptive distance measures have been used, including string-edit
distance trained using Expectation-Maximization (EM), and Euclidean distance modified by the shortest-distance
algorithm.
An interesting clustering method, known as CLTree (CLustering based on decision TREEs), integrates unsupervised
clustering with the concept of supervised classification. It is an instance of constraint-based semi-supervised clustering. It
transforms a clustering task into a classification task by considering the set of points to be clustered as belonging to one
class, labeled “Y,” and inserting a set of relatively uniformly distributed “nonexistence points” with another class label,
“N.”
The problem of partitioning the data area into data (dense) regions and empty (sparse) regions can then be changed into
a classification problem. These points can be considered as a set of “Y” points. It shows the addition of a collection of
uniformly distributed “N” points, defined by the “o” points.
The original clustering problem is thus changed into a classification problem, which works out a design that distinguishes
“Y” and “N” points. A decision tree induction method can be used to partition the two-dimensional space. Two clusters are
recognized, which are from the “Y” points only.
Inserting a large number of “N” points into the original data can introduce unnecessary overhead in the
computation. Moreover, it is unlikely that the added points would truly be uniformly distributed in a very high-dimensional
space, as this would require an exponential number of points.
Why DBSCAN?
Partitioning methods (K-means, PAM clustering) and hierarchical clustering work for finding spherical-shaped
clusters or convex clusters. In other words, they are suitable only for compact and well-separated clusters.
Moreover, they are also severely affected by the presence of noise and outliers in the data.
Real life data may contain irregularities, like:
1. Clusters can be of arbitrary shape such as those shown in the figure below.
2. Data may contain noise.
The figure below shows a data set containing nonconvex clusters and outliers/noises. Given such data, k-means
algorithm has difficulties in identifying these clusters with arbitrary shapes.
DBSCAN algorithm requires two parameters:
1. eps : It defines the neighborhood around a data point, i.e. if the distance between two points is lower than or equal
to ‘eps’ then they are considered neighbors. If the eps value is chosen too small, then a large part of the data will
be considered outliers. If it is chosen very large, then the clusters will merge and the majority of the data
points will be in the same cluster. One way to find the eps value is based on the k-distance graph.
2. MinPts: Minimum number of neighbors (data points) within the eps radius. The larger the dataset, the larger the value of
MinPts that should be chosen. As a general rule, the minimum MinPts can be derived from the number of
dimensions D in the dataset as MinPts >= D+1. The value of MinPts must be at least 3.
In this algorithm, we have 3 types of data points.
Core Point: A point is a core point if it has more than MinPts points within eps.
Border Point: A point which has fewer than MinPts within eps but lies in the neighborhood of a core point.
Noise or outlier: A point which is neither a core point nor a border point.
DBSCAN(D, eps, MinPts):
    C = 0
    for each unvisited point p in D:
        mark p as visited; N = neighbors(p, eps)
        if |N| < MinPts: mark p as noise
        else:
            C = C + 1; add p to cluster C
            for each point p' in N:
                if p' is unvisited:
                    mark p' as visited; N' = neighbors(p', eps)
                    if |N'| >= MinPts:
                        N = N U N'
                if p' is not a member of any cluster:
                    add p' to cluster C
Implementation of the above algorithm in Python :
Here, we’ll use the Python library sklearn to compute DBSCAN and the matplotlib library for
visualizing the clusters.
Evaluation Metrics
Moreover, we will use the Silhouette score and Adjusted rand score for evaluating clustering algorithms.
Silhouette score is in the range of -1 to 1. A score near 1 denotes the best meaning that the data point i is very
compact within the cluster to which it belongs and far away from the other clusters. The worst value is -1. Values
near 0 denote overlapping clusters.
Adjusted Rand Score is in the range of 0 to 1. More than 0.9 denotes excellent cluster recovery, and above 0.8 is
good recovery. Less than 0.5 is considered poor recovery.
Example
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Load data in X (synthetic blobs stand in for the original dataset)
X, y_true = make_blobs(n_samples=300, centers=4,
                       cluster_std=0.50, random_state=0)

db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
print(labels)

# Plot result
unique_labels = set(labels)
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
print(colors)
for k, col in zip(unique_labels, colors):
    if k == -1:
        col = 'k'  # black is used for noise
    class_member_mask = (labels == k)
    # core points, plotted larger
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k',
             markersize=10)
    # border points, plotted smaller
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k',
             markersize=6)
plt.show()

# evaluation metrics
sc = metrics.silhouette_score(X, labels)
print("Silhouette Coefficient:%0.2f" % sc)
ars = metrics.adjusted_rand_score(y_true, labels)
print("Adjusted Rand Score:%0.2f" % ars)
Output:
Gaussian Mixture Model
Suppose there are set of data points that need to be grouped into several parts or clusters based on their
similarity. In machine learning, this is known as Clustering.
There are several methods available for clustering:
K Means Clustering
Hierarchical Clustering
Gaussian Mixture Models
In this article, Gaussian Mixture Model will be discussed.
In real life, many datasets can be modeled by a Gaussian Distribution (univariate or multivariate). So it is quite
natural and intuitive to assume that the clusters come from different Gaussian Distributions. In other words, we
try to model the dataset as a mixture of several Gaussian Distributions. This is the core idea of this model.
In one dimension the probability density function of a Gaussian Distribution is given by

G(x | μ, σ²) = (1 / √(2πσ²)) exp(−(x − μ)² / (2σ²))

and in d dimensions by

G(x | μ, Σ) = (1 / ((2π)^(d/2) |Σ|^(1/2))) exp(−(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ))

Here μ is a d-dimensional vector denoting the mean of the distribution and Σ is the d × d covariance matrix.
Suppose there are K clusters (for the sake of simplicity it is assumed here that the number of clusters is known
and it is K). So μₖ and Σₖ must also be estimated for each k. Had it been only one distribution, they would have been
estimated by the maximum-likelihood method. But since there are K such clusters, the probability density is
defined as a linear function of the densities of all these K distributions, i.e.

p(x) = ∑ₖ₌₁ᴷ πₖ G(x | μₖ, Σₖ)

where πₖ is the mixing coefficient of the k-th distribution.
Note: Nₖ = ∑ᵢ γ(zᵢₖ) denotes the effective number of sample points in the k-th cluster. Here it is assumed that
there is a total of N samples and each sample containing d features is denoted by xᵢ.
So it can be clearly seen that the parameters cannot be estimated in closed form. This is where the Expectation-
Maximization algorithm is beneficial.
The Expectation-Maximization (EM) algorithm is an iterative way to find maximum-likelihood estimates for model
parameters when the data is incomplete, has some missing data points, or has some hidden variables. EM
chooses some random values for the missing data points and estimates a new set of data. These new values are
then used recursively to produce a better estimate of the data, by filling in the missing points, until the values converge.
These are the two basic steps of the EM algorithm, namely E Step or Expectation Step or Estimation
Step and M Step or Maximization Step.
Estimation step:
Initialize μₖ, Σₖ and πₖ by some random values, or by K-means clustering results, or by
hierarchical clustering results.
Then, for those given parameter values, estimate the values of the latent variables (i.e., the responsibilities γ(zᵢₖ)).
Maximization Step:
Update the values of the parameters (i.e., μₖ, Σₖ and πₖ) calculated using the ML method.
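These two alternating steps are what scikit-learn's GaussianMixture runs internally. A minimal sketch on synthetic 1-D data; the sample sizes and the true means (0 and 5) are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two hypothetical 1-D Gaussian clusters with means 0 and 5
rng = np.random.default_rng(42)
X = np.concatenate([rng.normal(0, 1, 300),
                    rng.normal(5, 1, 300)]).reshape(-1, 1)

# fit() runs the EM algorithm until the log-likelihood converges
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(sorted(gmm.means_.ravel()))   # means recovered near 0 and 5
print(gmm.weights_)                 # mixing coefficients near 0.5 each
```

The fitted `means_`, `covariances_`, and `weights_` attributes correspond to the μₖ, Σₖ and πₖ updated in the M step.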
Curse of Dimensionality describes the explosive nature of increasing data dimensions and its
resulting exponential increase in computational efforts required for its processing and/or analysis.
This term was first introduced by Richard E. Bellman, in the area of dynamic programming, to explain the increase in
the volume of Euclidean space associated with adding extra dimensions.
Today, this phenomenon is observed in fields like machine learning, data analysis, data mining to
name a few. An increase in the dimensions can in theory, add more information to the data
thereby improving the quality of data but practically increases the noise and redundancy during its
analysis.
Behavior of a Machine Learning Algorithms — Need for data points and Accuracy of
Model
In machine learning, a feature of an object can be an attribute or a characteristic that defines it.
Each feature represents a dimension, and a group of dimensions creates a data point. This represents
a feature vector that defines the data point to be used by a machine learning algorithm. When
we say increase in dimensionality it implies an increase in the number of features used to describe
the data. For example, in the field of breast cancer research, age, number of cancerous nodes can
be used as features to define the prognosis of the breast cancer patient. These features constitute
the dimensions of a feature vector. But other factors like past surgeries, patient history, type of
tumor and other such features help a doctor to better determine the prognosis. In this case by
adding features, we are theoretically increasing the dimensions of our data.
As the dimensionality increases, the number of data points required for good performance of any
machine learning algorithm increases exponentially. The reason is that, we would need more
number of data points for any given combination of features, for any machine learning model to be
valid. For example, let’s say that for a model to perform well, we need at least 10 data points for
each combination of feature values. If we assume that we have one binary feature, then for its 2¹
unique values (0 and 1) we would need 2¹ × 10 = 20 data points. For 2 binary features, we would
have 2² unique values and need 2² × 10 = 40 data points. Thus, for k binary features we
would need 2ᵏ × 10 data points.
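The arithmetic above can be captured in a two-line helper (the function name and the default of 10 points per combination are ours):

```python
def required_points(n_binary_features, points_per_combination=10):
    # 2**k feature-value combinations, each needing a minimum number of points
    return (2 ** n_binary_features) * points_per_combination

print(required_points(1))   # 20
print(required_points(2))   # 40
print(required_points(10))  # 10240
```

Ten binary features already push the requirement past ten thousand samples, which is the exponential blow-up the curse of dimensionality refers to.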
Hughes (1968) in his study concluded that with a fixed number of training samples, the predictive
power of any classifier first increases as the number of dimensions increase, but after a certain
value of number of dimensions, the performance deteriorates. Thus, the phenomenon of curse of
dimensionality is also known as Hughes phenomenon.
Graphical Representation of Hughes Principle
That is, for a d-dimensional space, given n random points, distₘᵢₙ(A) ≈ distₘₐₓ(A), meaning
any given pair of points is nearly equidistant from each other.
Therefore, any machine learning algorithms which are based on the distance measure including
KNN(k-Nearest Neighbor) tend to fail when the number of dimensions in the data is very high.
Thus, dimensionality can be considered as a “curse” in such algorithms.
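This distance-concentration effect is easy to observe numerically. A small sketch, assuming NumPy and SciPy; the sample size and the dimensions tested are illustrative choices.

```python
import numpy as np
from scipy.spatial.distance import pdist

def dist_ratio(d, n=100, seed=0):
    # Ratio of the largest to the smallest pairwise distance
    # among n uniform random points in d dimensions
    rng = np.random.default_rng(seed)
    X = rng.random((n, d))
    dist = pdist(X)
    return dist.max() / dist.min()

for d in (2, 100, 1000):
    print(d, round(dist_ratio(d), 2))
# The ratio shrinks toward 1 as d grows: distances concentrate,
# so "nearest" neighbors stop being meaningfully nearer than others.
```

A ratio near 1 means the minimum and maximum pairwise distances are nearly equal, which is exactly why distance-based methods like KNN degrade in high dimensions.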
One of the ways to reduce the impact of high dimensions is to use a different measure of distance
in a space vector. One could explore the use of cosine similarity to replace Euclidean distance.
Cosine similarity can have a lesser impact on data with higher dimensions. However, use of such
method could also be specific to the required solution of the problem.
Other methods:
Other methods could involve the use of reduction in dimensions. Some of the techniques that can
be used are:
1. Forward feature selection: This method involves picking the most useful subset of features from
all the given features.
2. PCA/t-SNE: Though these methods help reduce the number of features, they do not
necessarily preserve class separability, which can make interpreting the results a tough task.
Top 10 Dimensionality Reduction Techniques For
Machine Learning
Every second, the world generates an unprecedented volume of data. As data has become a crucial
component of businesses and organizations across all industries, it is essential to process, analyze, and
visualize it appropriately to extract meaningful insights from large datasets. However, there’s a catch –
more does not always mean productive and accurate. The more data we produce every second, the more
challenging it is to analyze and visualize it to draw valid inferences.
This is where Dimensionality Reduction comes into play.
The higher the number of features or factors (a.k.a. variables) in a feature set, the more difficult it
becomes to visualize the training set and work on it. Another vital point to consider is that most of the
variables are often correlated. So, if you keep every variable in the feature set, you will include many
redundant factors in the training set.
Furthermore, the more variables you have at hand, the higher will be the number of samples to represent all
the possible combinations of feature values in the example. When the number of variables increases, the
model will become more complex, thereby increasing the likelihood of overfitting. When you train an ML
model on a large dataset containing many features, it is bound to be dependent on the training data. This
will result in an overfitted model that fails to perform well on real data.
The primary aim of dimensionality reduction is to avoid overfitting. Training data with considerably
fewer features will ensure that your model remains simple – it will make fewer assumptions.
Apart from this, dimensionality reduction has many other benefits, such as:
Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal transformation that converts a set of
correlated variables to a set of uncorrelated variables. PCA is the most widely used tool in exploratory data analysis and in
machine learning for predictive models. Moreover, PCA is an unsupervised statistical technique used to examine the
interrelations among a set of variables. It is also known as a general factor analysis where regression determines a line of best
fit.
Module Needed:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Code #1:
# importing and instantiating the dataset
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
# creating dataframe
df = pd.DataFrame(cancer['data'], columns =
cancer['feature_names'])
df.head()
Output:
Code #2:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# fitting
scaler.fit(df)
scaled_data = scaler.transform(df)
# Importing PCA
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
pca.fit(scaled_data)
x_pca = pca.transform(scaled_data)
x_pca.shape
Output:
(569, 2)
# components
pca.components_
Output:
# plotting heatmap
import seaborn as sns
df_comp = pd.DataFrame(pca.components_, columns = cancer['feature_names'])
sns.heatmap(df_comp)
Output:
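A useful follow-up to a fitted PCA is checking how much of the total variance the two components retain; this sketch recreates the breast-cancer pipeline so it stands alone.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
scaled = StandardScaler().fit_transform(cancer['data'])
pca = PCA(n_components=2).fit(scaled)

# Fraction of the total variance captured by each principal component
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
```

If the two components together explain only a modest share of the variance, more components (or a nonlinear method) may be needed before visualizing or modeling.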
Scikit Learn
Scikit-learn (Sklearn) is the most useful and robust library for machine learning in Python. It
provides a selection of efficient tools for machine learning and statistical modeling including classification, regression,
clustering and dimensionality reduction via a consistent interface in Python. This library, which is largely written in
Python, is built upon NumPy, SciPy and Matplotlib.
Audience
This tutorial will be useful for graduates, postgraduates, and research students who either have an interest in this Machine
Learning subject or have this subject as a part of their curriculum. The reader can be a beginner or an advanced learner.
Prerequisites
The reader must have basic knowledge about Machine Learning. He/she should also be aware of Python, NumPy,
SciPy, and Matplotlib.
Randomized PCA
Principal component analysis (PCA) is widely used for dimension reduction and embedding of real data in social network analysis,
information retrieval, and natural language processing, etc. In this work we propose a fast randomized PCA algorithm for processing
large sparse data. The algorithm has similar accuracy to the basic randomized SVD (rPCA) algorithm (Halko et al., 2011), but is
largely optimized for sparse data. It also has good flexibility to trade off runtime against accuracy for practical usage. Experiments on
real data show that the proposed algorithm is up to 9.1X faster than the basic rPCA algorithm without accuracy loss, and is up to 20X
faster than the svds in Matlab with little error. The algorithm computes the first 100 principal components of a large information
retrieval data with 12,869,521 persons and 323,899 keywords in less than 400 seconds on a 24-core machine, while all conventional
methods fail due to the out-of-memory issue.
PRINCIPAL COMPONENT ANALYSIS: is a tool which is used to reduce the dimension of the data. It allows us to reduce
the dimension of the data without much loss of information. PCA reduces the dimension by finding a few orthogonal linear
combinations (principal components) of the original variables with the largest variance.
The first principal component captures most of the variance in the data. The second principal component is orthogonal to the
first and captures most of the variance that is left after the first principal component, and so on. There are as
many principal components as the number of original variables.
These principal components are uncorrelated and are ordered in such a way that the first several principal components explain
most of the variance of the original data. To learn more about PCA you can read the article Principal Component Analysis
KERNEL PCA:
PCA is a linear method, so it works best on datasets which are linearly separable. It does an excellent job for such
datasets. But if we use it on non-linear datasets, we might get a result which is not the
optimal dimensionality reduction. Kernel PCA uses a kernel function to project the dataset into a higher-dimensional feature space,
where it is linearly separable. It is similar to the idea behind Support Vector Machines.
There are various kernel methods like linear, polynomial, and gaussian.
Code: Create a dataset which is nonlinear and then apply PCA on the dataset.

import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.decomposition import PCA, KernelPCA

X, y = make_moons(n_samples=500, noise=0.02, random_state=417)
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
plt.title("PCA")
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)
plt.xlabel("Component 1")
plt.ylabel("Component 2")
plt.show()

Code: Applying kernel PCA on this dataset with RBF kernel with a gamma value of 15.

kpca = KernelPCA(kernel='rbf', gamma=15)
X_kpca = kpca.fit_transform(X)
plt.title("Kernel PCA")
plt.scatter(X_kpca[:, 0], X_kpca[:, 1], c=y)
plt.show()
In the kernel space the two classes are linearly separable. Kernel PCA uses a kernel function to project the dataset into a higher-
dimensional space, where it is linearly separable.
Finally, we applied the kernel PCA to a non-linear dataset using scikit-learn.
UNIT-V
Birds inspired us to fly, burdock plants inspired Velcro, and nature has inspired countless
more inventions. It seems only logical, then, to look at the brain’s architecture for inspiration
on how to build an intelligent machine. This is the logic that sparked artificial neural
networks (ANNs): an ANN is a Machine Learning model inspired by the networks of
biological neurons found in our brains. However, although planes were inspired by birds,
they don’t have to flap their wings. Similarly, ANNs have gradually become quite different
from their biological cousins. Some researchers even argue that we should drop the
biological analogy altogether (e.g., by saying “units” rather than “neurons”), lest we restrict
our creativity to biologically plausible systems.1
ANNs are at the very core of Deep Learning. They are versatile, powerful, and scalable,
making them ideal to tackle large and highly complex Machine Learning tasks such as
classifying billions of images (e.g., Google Images), powering speech recognition services
(e.g., Apple’s Siri), recommending the best videos to watch to hundreds of millions of users
every day (e.g., YouTube), or learning to beat the world champion at the game of Go
(DeepMind’s AlphaGo).
Overview of Keras
Keras runs on top of open-source machine learning libraries like TensorFlow, Theano or Cognitive Toolkit (CNTK).
Theano is a Python library used for fast numerical computation tasks. TensorFlow is the most famous
symbolic math library used for creating neural networks and deep learning models. TensorFlow is very
flexible and its primary benefit is distributed computing. CNTK is a deep learning framework developed by
Microsoft, usable from Python, C#, and C++, or as a standalone machine learning toolkit. Theano and
TensorFlow are very powerful libraries but difficult to understand when creating neural networks.
Keras is based on a minimal structure that provides a clean and easy way to create deep learning models
based on TensorFlow or Theano. Keras is designed to quickly define deep learning models, making it
an optimal choice for deep learning applications.
Features
Keras leverages various optimization techniques to make high level neural network API easier and more
performant. It supports the following features −
Consistent, simple and extensible API.
Minimal structure - easy to achieve the result without any frills.
It supports multiple platforms and backends.
It is user friendly framework which runs on both CPU and GPU.
Highly scalable computation.
Benefits
Keras is highly powerful and dynamic framework and comes up with the following advantages −
Larger community support.
Easy to test.
Keras neural networks are written in Python which makes things simpler.
Keras supports both convolution and recurrent networks.
Deep learning models are built from discrete components, so you can combine them in many ways.
How to Build Multi-Layer Perceptron Neural Network
Models with Keras
The Keras Python library for deep learning focuses on creating models as a sequence of layers.
In this post, you will discover the simple components you can use to create neural networks and simple deep
learning models using Keras from TensorFlow.
Kick-start your project with my new book Deep Learning With Python, including step-by-step tutorials and
the Python source code files for all examples.
Let’s get started.
The simplest model is defined in the Sequential class, which is a linear stack of Layers.
You can create a Sequential model and define all the layers in the constructor; for example:
model = Sequential(...)
A more useful idiom is to create a Sequential model and add your layers in the order of the computation you
wish to perform; for example:
model = Sequential()
model.add(...)
model.add(...)
model.add(...)
Model Inputs
The first layer in your model must specify the shape of the input.
This is the number of input attributes defined by the input_shape argument. This argument expects a tuple.
For example, you can define input in terms of 8 inputs for a Dense type layer as follows:
Dense(16, input_shape=(8,))
Model Layers
Layers of different types have a few properties in common, specifically their method of weight initialization and
activation functions.
Weight Initialization
The type of initialization used for a layer is specified in the kernel_initializer argument.
Some common types of layer initialization include:
random_uniform: Weights are initialized to small uniformly random values between -0.05 and 0.05.
random_normal: Weights are initialized to small Gaussian random values (zero mean and standard deviation
of 0.05).
zeros: All weights are set to zero values.
You can see a full list of the initialization techniques supported on the Usage of initializations page.
Activation Function
Keras supports a range of standard neuron activation functions, such as softmax, rectified linear (relu), tanh,
and sigmoid.
You typically specify the type of activation function used by a layer in the activation argument, which takes a
string value.
You can see a full list of activation functions supported by Keras on the Usage of activations page.
Interestingly, you can also create an Activation object and add it directly to your model after your layer to apply
that activation to the output of the layer.
Layer Types
There are a large number of core layer types for standard neural networks.
Some common and useful layer types you can choose from are:
Dense: Fully connected layer and the most common type of layer used on multi-layer perceptron models
Dropout: Apply dropout to the model, setting a fraction of inputs to zero in an effort to reduce overfitting
Concatenate: Combine the outputs from multiple layers as input to a single layer
You can learn about the full list of core Keras layers on the Core Layers page.
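Putting the pieces together — weight initialization, a separate Activation layer, and Dropout — a minimal illustrative model might look like this; the layer sizes and the 8-input binary-classification setup are arbitrary assumptions.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation

model = Sequential()
model.add(Dense(16, input_shape=(8,), kernel_initializer='random_uniform'))
model.add(Activation('relu'))   # activation added as its own layer
model.add(Dropout(0.5))         # randomly zero half the units during training
model.add(Dense(1, activation='sigmoid'))

# The model maps 8 inputs to a single probability-like output
out = model.predict(np.zeros((2, 8)), verbose=0)
print(out.shape)  # (2, 1)
```

Note that specifying the activation as a separate `Activation` layer is equivalent to passing `activation='relu'` to the `Dense` layer directly.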
Model Compilation
Once you have defined your model, it needs to be compiled.
This creates the efficient structures used by TensorFlow in order to efficiently execute your model during
training. Specifically, TensorFlow converts your model into a graph so the training can be carried out efficiently.
You compile your model using the compile() function, and it accepts three important attributes:
1. Model optimizer
2. Loss function
3. Metrics
model.compile(optimizer=..., loss=..., metrics=...)
1. Model Optimizers
The optimizer is the search technique used to update weights in your model.
You can create an optimizer object and pass it to the compile function via the optimizer argument. This allows
you to configure the optimization procedure with its own arguments, such as learning rate. For example:
sgd = SGD(...)
model.compile(optimizer=sgd)
You can also use the default parameters of the optimizer by specifying the name of the optimizer to the
optimizer argument. For example:
model.compile(optimizer='sgd')
Some popular gradient descent optimizers you might want to choose from include:
Model Training
The model is trained on NumPy arrays using the fit() function; for example:

model.fit(X, y, epochs=..., batch_size=...)

Training specifies both the number of epochs to train on and the batch size.
Epochs (epochs) refer to the number of times the model is exposed to the training dataset.
Batch Size (batch_size) is the number of training instances shown to the model before a weight update is
performed.
The fit function also allows for some basic evaluation of the model during training. You can set the
validation_split value to hold back a fraction of the training dataset for validation to be evaluated in each epoch
or provide a validation_data tuple of (X, y) data to evaluate.
Fitting the model returns a history object with details and metrics calculated for the model in each epoch. This
can be used for graphing model performance.
Model Prediction
Once you have trained your model, you can use it to make predictions on test data or new data.
There are a number of different output types you can calculate from your trained model, each calculated using
a different function call on your model object. For example, model.predict(X) returns the model's output for the input data X.
Not all users know that you can install TensorFlow with GPU support if your hardware supports it. We’ll
discuss what TensorFlow is, how it’s used in today’s world, and how to install the latest TensorFlow
version with CUDA, cuDNN, and GPU support on Windows, Mac, and Linux.
Introduction to TensorFlow
TensorFlow is an open-source software library for machine learning, created by Google. It was
initially released on November 28, 2015, and it’s now used across many fields including research in
the sciences and engineering.
The idea behind TensorFlow is to make it quick and simple to train deep neural networks that use a
diversity of mathematical models. These networks are then able to learn from data without human
intervention or supervision, making them more efficient than conventional methods. The library also
offers support for processing on multiple machines simultaneously with different operating systems
and GPUs.
TensorFlow applications
TensorFlow is a library for deep learning built by Google, and it has been gaining a lot of traction ever since
its introduction. The main features include automatic differentiation, convolutional
neural networks (CNN), and recurrent neural networks (RNN). It’s written in C++ and Python, and for
high performance it can offload computation to Google Cloud Platform.
It doesn’t require a GPU, which is one of its main features.
The newest release of TensorFlow also supports data visualization through matplotlib. This
visualization library is very popular, and it’s often used in data science coursework, as well as by
artists and engineers doing data visualizations in Python.
Windows
Prerequisite
Python 3.6–3.8
Windows 7 or later (with C++ redistributable)
Check [Link] for the latest version information
Steps
[Link]
We will install CUDA version 11.2, but make sure you install the latest or updated version (for
example – 11.2.2 if it’s available).
Click on the newest version and a screen will pop up, where you can choose from a few options, so
follow the below image and choose these options for Windows.
Once you choose the above options, wait for the download to complete.
Install it with the Express (Recommended) option, it will take a while to install on your machine.
Now, copy these 3 folders (bin, include, lib). Go to C Drive>Program Files, and search for NVIDIA
GPU Computing Toolkit.
Open the folder, select CUDA > Version Name, and replace (paste) those copied files.
Now click on the bin folder and copy the path. It should look like this: C:\Program Files\NVIDIA
GPU Computing Toolkit\CUDA\v11.2\bin.
On your PC, search for Environment variables, as shown below.
Click on Environment Variables on the bottom left. Now click on the link which states PATH.
Once you click on the PATH, you will see something like this.
Now click on New (Top Left), and paste the bin path here. Go to the CUDA folder, select
libnvvm folder, and copy its path. Follow the same process and paste that path into the system
path. Next, just restart your PC.
4) Installing Tensorflow
Now copy the below commands and paste them into the prompt (Check for the versions).
You’ll see an installation screen like this. If you see any errors, make sure you’re using the correct
version and didn’t miss any steps.
Test
To test the whole process we’ll use Pycharm. If not installed, get the community edition
→ [Link]
First, to check if TensorFlow GPU has been installed properly on your machine, run the below code:
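A quick sanity check that TensorFlow itself can see the GPU uses the standard tf.config API; an empty list simply means a CPU-only build or missing drivers.

```python
import tensorflow as tf

# Lists the GPUs visible to TensorFlow; empty means no GPU support
gpus = tf.config.list_physical_devices('GPU')
print(gpus)
print("Built with CUDA:", tf.test.is_built_with_cuda())
```

If this prints a PhysicalDevice entry, the CUDA/cuDNN setup from the previous steps is working.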
Configure the env, create a new Python file, and paste the below code:
# Imports
import torch
import torchvision
import torch.nn.functional as F
import torchvision.datasets as datasets
import torchvision.transforms as transforms
from torch import optim
from torch import nn
from torch.utils.data import DataLoader
from tqdm import tqdm
Check the rest of the code here -> [Link]
Collection/blob/master/ML/Pytorch/Basics/pytorch_simple_CNN.py.
When you run the code, look for successfully opened cuda(versioncode).
Once the training started, all the steps were successful!
MacOS
MacOS doesn’t support Nvidia GPU for the latest versions, so this will be a CPU-only installation.
You can get GPU support on a Mac with some extra effort and requirements.
Prerequisite
Python 3.6–3.8
macOS 10.12.6 (Sierra) or later (no GPU support)
Check [Link] For the latest version information
You can install the latest version available on the site, but for this tutorial, we’ll be using Python 3.8.
Also, check with the TensorFlow site for version support.
2) Prepare environment:
dependencies:
- python=3.8
- pip>=19.0
- jupyter
- scikit-learn
- scipy
- pandas
- pandas-datareader
- matplotlib
- pillow
- tqdm
- requests
- h5py
- pyyaml
- flask
- boto3
- pip:
- tensorflow==2.4
- bayesian-optimization
- gym
- kaggle
Run the following command from the same directory that contains [Link].
To test the whole process, we’ll use a Jupyter notebook. Use this command to start Jupyter:
jupyter notebook
Copy the below code and run it in the Jupyter notebook.
import sys
import tensorflow.keras
import pandas as pd
import sklearn as sk
import tensorflow as tf
Linux
We can install both CPU and GPU versions on Linux.
Prerequisite
Python 3.6–3.8
Ubuntu 16.04 or later
Check [Link] for the latest version information
Steps
nvcc -V
You’ll see it output something like this:
Now, check with TensorFlow site for version, and run the below command:
dependencies:
- jupyter
- scikit-learn
- scipy
- pandas
- pandas-datareader
- matplotlib
- pillow
- tqdm
- requests
- h5py
- pyyaml
- flask
- boto3
- pip
- pip:
- bayesian-optimization
- gym
- kaggle
Test
import tensorflow as tf
[Link].list_physical_devices("GPU")
You will see similar output, [PhysicalDevice(name=’/physical_device:GPU:0′, device_type=’GPU’)]
Second, you can also use a jupyter notebook. Use this command to start Jupyter.
jupyter notebook
Now, run the below code:
import sys
import tensorflow.keras
import pandas as pd
import sklearn as sk
import tensorflow as tf
Python 3.8.0
Pandas 'version'
Scikit-Learn 'version'
GPU is available
Load and preprocess images
This tutorial shows how to load and preprocess an image dataset in three ways:
First, you will use high-level Keras preprocessing utilities (such as tf.keras.utils.image_dataset_from_directory) and
layers (such as tf.keras.layers.Rescaling) to read a directory of images on disk.
Next, you will write your own input pipeline from scratch using tf.data.
Finally, you will download a dataset from the large catalog available in TensorFlow Datasets.
Setup
import numpy as np
import os
import PIL
import PIL.Image
import tensorflow as tf
import tensorflow_datasets as tfds
print(tf.__version__)
2.9.1
This tutorial uses a dataset of several thousand photos of flowers. The flowers dataset contains five sub-
directories, one per class:
flowers_photos/
daisy/
dandelion/
roses/
sunflowers/
tulips/
Note: all images are licensed CC-BY; creators are listed in the LICENSE.txt file.
import pathlib
dataset_url = "[Link]"
data_dir = tf.keras.utils.get_file(origin=dataset_url,
                                   fname='flower_photos',
                                   untar=True)
data_dir = pathlib.Path(data_dir)
After downloading (218MB), you should now have a copy of the flower photos available. There are 3,670 total
images:
image_count = len(list(data_dir.glob('*/*.jpg')))
print(image_count)
3670
Each directory contains images of that type of flower. Here are some roses:
roses = list(data_dir.glob('roses/*'))
PIL.Image.open(str(roses[0]))
roses = list(data_dir.glob('roses/*'))
PIL.Image.open(str(roses[1]))
Create a dataset
batch_size = 32
img_height = 180
img_width = 180
It's good practice to use a validation split when developing your model. You will use
80% of the images for training and 20% for validation.
train_ds = tf.keras.utils.image_dataset_from_directory(
  data_dir,
  validation_split=0.2,
  subset="training",
  seed=123,
  image_size=(img_height, img_width),
  batch_size=batch_size)
val_ds = tf.keras.utils.image_dataset_from_directory(
  data_dir,
  validation_split=0.2,
  subset="validation",
  seed=123,
  image_size=(img_height, img_width),
  batch_size=batch_size)
You can find the class names in the class_names attribute on these datasets.
class_names = train_ds.class_names
print(class_names)
Here are the first nine images from the training dataset.
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 10))
for images, labels in train_ds.take(1):
  for i in range(9):
    ax = plt.subplot(3, 3, i + 1)
    plt.imshow(images[i].numpy().astype("uint8"))
    plt.title(class_names[labels[i]])
    plt.axis("off")
You can train a model using these datasets by passing them to Model.fit (shown later in
this tutorial). If you like, you can also manually iterate over the dataset and retrieve
batches of images:
You can call .numpy() on either of these tensors to convert them to a numpy.ndarray.
The RGB channel values are in the [0, 255] range. This is not ideal for a neural network;
in general you should seek to make your input values small.
Here, you will standardize values to be in the [0, 1] range by using tf.keras.layers.Rescaling:
normalization_layer = tf.keras.layers.Rescaling(1./255)
There are two ways to use this layer. You can apply it to the dataset by
calling Dataset.map:
normalized_ds = train_ds.map(lambda x, y: (normalization_layer(x), y))
image_batch, labels_batch = next(iter(normalized_ds))
first_image = image_batch[0]
print(np.min(first_image), np.max(first_image))
0.0 0.96902645
Or, you can include the layer inside your model definition to simplify deployment. You
will use the second approach here.
Note: If you would like to scale pixel values to [-1, 1], you can instead
write tf.keras.layers.Rescaling(1./127.5, offset=-1).
Note: You previously resized images using the image_size argument
of tf.keras.utils.image_dataset_from_directory. If you want to include the resizing
logic in your model as well, you can use the tf.keras.layers.Resizing layer.
Let's make sure to use buffered prefetching so you can yield data from disk without
having I/O become blocking. These are two important methods you should use when
loading data:
Dataset.cache keeps the images in memory after they're loaded off disk during the first epoch.
This will ensure the dataset does not become a bottleneck while training your model. If your
dataset is too large to fit into memory, you can also use this method to create a performant on-
disk cache.
Dataset.prefetch overlaps data preprocessing and model execution while training.
Interested readers can learn more about both methods, as well as how to cache data to
disk, in the Prefetching section of the Better performance with the tf.data API guide.
AUTOTUNE = tf.data.AUTOTUNE
train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)
UNIT-II
dataset should be ensembled. If two statistically similar models are ensembled (models that
make wrong predictions on the same set of samples), the resulting model will only be as good as
the contributing models. An ensemble won't make any difference to the prediction ability in such a
case.
The diversity in the predictions of the contributing models of an ensemble is popularly verified
using the Kullback-Leibler and Jensen-Shannon Divergence metrics (this paper is a great example
demonstrating the point).
Here are some of the scenarios where ensemble learning comes in handy.
2. Excess/Shortage of data
In cases where a substantial amount of data is available, we may divide the classification tasks
between different classifiers and ensemble them during prediction time, rather than trying to train
one classifier with large volumes of data.
On the other hand, in cases where the dataset available is small (for example, in the biomedical
domain, where acquiring labeled medical data is costly), we can use a bootstrapping
ensemble strategy.
The way it works is quite simple: we train different classifiers using various "bootstrap samples"
of data, i.e., we create several subsets of a single dataset using replacement. This means that
the same data sample may be present in more than one subset; these subsets are later used to
train different models (for further reading, check out this paper).
This method will be further explained in the section on the “Bagging” ensemble technique.
3. Confidence Estimation
The very core of an ensemble framework is based on the confidence in predictions by the different
models. For example, when trying to draw a consensus between four models on a cat/dog
classification problem, if two models predict the sample as class “cat” and the other two predict as
“dog,” the confidence of the ensemble is low.
Further, researchers also use the confidence scores of the individual classifiers to generate a final
confidence score of the ensemble (Examples: Paper-1, Paper-2). Involving the confidence scores for
developing the ensemble gives more robust predictions than simple “majority voting” since a
prediction with 95% confidence is more reliable than a prediction with 51% confidence.
Therefore, we can assign more importance to classifiers that predict with more confidence during
the ensemble.
5. Information Fusion
The most prevalent reason for using an ensemble learning model is information fusion for
enhancing classification performance. That is, models that have been trained on different
distributions of data pertaining to the same set of classes are employed during prediction time to
get a more robust decision.
For example, we may have trained one cat/dog classifier on high-quality images taken by a
professional photographer. In contrast, another classifier has been trained on data using low-
quality photos captured on mobile phones. When predicting a new sample, integrating the decisions
from both these classifiers will be more robust and bias-free.
1. Bagging
The Bagging ensemble technique is an acronym for "bootstrap aggregating" and is one of the earliest
ensemble methods proposed.
For this method, subsamples from a dataset are created; this is called "bootstrap
sampling." To put it simply, random subsets of a dataset are created using replacement,
meaning that the same data point may be present in several subsets.
These subsets are now treated as independent datasets, on which several Machine Learning
models will be fit. During test time, the predictions from all such models trained on different
subsets of the same data are accounted for.
There is an aggregation mechanism used to compute the final prediction (like averaging,
weighted averaging, etc., discussed later).
Bagging
Note that, in the bagging mechanism, a parallel stream of processing occurs. The main aim of the
bagging method is to reduce variance in the ensemble predictions.
Thus, the chosen ensemble classifiers usually have high variance and low bias (complex models
with many trainable parameters). Popular ensemble methods based on this approach include:
Bagged Decision Trees
Random Forest Classifiers
Extra Trees
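The bootstrap-and-aggregate scheme above can be sketched with scikit-learn's BaggingClassifier; the toy dataset below is an illustrative assumption, not from the original text:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy dataset standing in for real training data
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Each tree is fit on a bootstrap sample (drawn with replacement);
# high-variance, low-bias base models benefit most from bagging
bagging = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=25,
    bootstrap=True,
    random_state=0,
)
bagging.fit(X, y)
print(bagging.score(X, y))
```

At prediction time, the 25 trees vote and the majority class is returned, which reduces the variance of any single tree.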
2. Boosting
The boosting ensemble mechanism works in a way markedly different from the bagging
mechanism.
Here, instead of parallel processing of data, sequential processing of the dataset occurs. The first
classifier is fed with the entire dataset, and the predictions are analyzed.
The instances where Classifier-1 fails to produce correct predictions (typically samples near the
decision boundary of the feature space) are fed to the second classifier.
This is done so that Classifier-2 can specifically focus on the problematic areas of feature space
and learn an appropriate decision boundary. Similarly, further steps of the same idea are employed,
and then the ensemble of all these previous classifiers is computed to make the final prediction on
the test data.
The pictorial representation of the same is shown below.
The main aim of the boosting method is to reduce bias in the ensemble decision. Thus, the
classifiers chosen for the ensemble usually need to have low variance and high bias, i.e., simpler
models with fewer trainable parameters.
Other algorithms based on this approach include:
Adaptive Boosting
Stochastic Gradient Boosting
Gradient Boosting Machines
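The sequential reweighting idea can be sketched with scikit-learn's AdaBoostClassifier; the dataset and hyperparameters below are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Each weak learner is a shallow (high-bias, low-variance) decision stump;
# each round up-weights the samples the previous learners misclassified
boosting = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=50,
    random_state=0,
)
boosting.fit(X, y)
print(boosting.score(X, y))
```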
3. Stacking
The stacking ensemble method also involves creating bootstrapped data subsets, like the bagging
ensemble mechanism for training multiple models.
However, here, the outputs of all such models are used as an input to another classifier, called
meta-classifier, which finally predicts the samples. The intuition behind using two layers of
classifiers is to determine whether the training data have been appropriately learned.
For example, in the example of the cat/dog/wolf classifier at the beginning of this article, if,
say, Classifier-1 can distinguish between cats and dogs, but not between dogs and wolves, the
meta-classifier present in the second layer will be able to capture this behavior from
Classifier-1. The meta-classifier can then correct this behavior before making the final
prediction.
A pictorial representation of the stacking mechanism is shown below.
The diagram above shows one level of stacking. There are also multi-level stacking
ensemble methods where additional layers of classifiers are added in between.
However, such practices become computationally very expensive for a relatively small boost in
performance.
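One level of stacking can be sketched with scikit-learn's StackingClassifier; the level-0 models and dataset below are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Level-0 classifiers produce predictions that become input features for the
# level-1 meta-classifier, which can correct their systematic mistakes
stack = StackingClassifier(
    estimators=[('tree', DecisionTreeClassifier(max_depth=3)),
                ('knn', KNeighborsClassifier())],
    final_estimator=LogisticRegression(),  # the meta-classifier
    cv=5,  # level-0 outputs are generated with cross-validation to avoid leakage
)
stack.fit(X, y)
print(stack.score(X, y))
```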
4. Mixture of Experts
The "Mixture of Experts" genre of ensemble trains several classifiers, the outputs of which
are ensembled using a generalized linear rule.
The weights assigned to these combinations are further determined by a "Gating Network," also a
trainable model, usually a neural network.
5. Majority Voting
Majority voting is one of the earliest and easiest ensemble schemes in the literature. In this
method, an odd number of contributing classifiers are chosen, and for each sample, the
predictions from the classifiers are computed. Then, as the name suggests, the class that gets
the most votes from the classifier pool is deemed the ensemble's predicted class.
Such a method works well for binary classification problems, where there are only two
candidates for which the classifiers can vote. However, it fails for a problem with many classes,
since many cases arise where no class gets a clear majority of the votes.
In such cases, we usually choose a random class among the top candidates, which leads to
a more considerable margin of error. Thus, methods based on the confidence scores are more
reliable and are used more widely now.
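Majority voting can be sketched with scikit-learn's VotingClassifier in hard-voting mode; the three base models and the dataset are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# An odd number (3) of classifiers; voting='hard' picks, for each sample,
# the class that receives the most votes
vote = VotingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=1000)),
                ('tree', DecisionTreeClassifier(max_depth=3)),
                ('knn', KNeighborsClassifier())],
    voting='hard',
)
vote.fit(X, y)
print(vote.score(X, y))
```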
6. Max Rule
The “Max Rule” ensemble method relies on the probability distributions generated by each
classifier. This method employs the concept of “confidence in prediction” of the classifiers and
thus is a superior method to Majority Voting for multi-class classification challenges.
Here, for a predicted class by a classifier, the corresponding confidence score is checked. The class
prediction of the classifier that predicts with the highest confidence score is deemed the
prediction of the ensemble framework.
7. Probability Averaging
In this ensemble technique, the probability scores for multiple models are first computed. Then,
the scores are averaged over all the models for all the classes in the dataset.
Probability scores are the confidence in predictions by a particular model. So, here we are pooling
the confidences of several models to generate a final probability score for the ensemble. The
class that has the highest probability after the averaging operation is assigned as the predicted
class.
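The averaging step can be sketched in NumPy; the per-model probability scores below are made-up numbers for illustration:

```python
import numpy as np

# Hypothetical per-class probability scores from three models for one sample
# (columns: class 0, class 1, class 2)
probs = np.array([
    [0.70, 0.20, 0.10],   # model 1
    [0.10, 0.60, 0.30],   # model 2
    [0.40, 0.50, 0.10],   # model 3
])

# Average the confidences over all models, then pick the class with the
# highest averaged probability
avg = probs.mean(axis=0)
predicted_class = int(np.argmax(avg))
print(avg, predicted_class)
```

Note that models 1 and 3 individually favor different classes, yet pooling the confidences yields class 1, showing how averaging differs from simple majority voting.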
Such a method has been used in this paper for COVID-19 detection from lung CT-scan images.
For example, Fuzzy Ensembles are a class of ensemble techniques that use the concept of "dynamic
importance."
The “weights” given to the classifiers are not fixed; they are modified based on the contributing
models’ confidence scores for every sample, rather than checking the performance on the entire
dataset. They perform much better than the popularly used weighted average probability
methods. The codes for the papers are also available here: Paper-1, Paper-2, and Paper-3.
Another genre of ensemble technique that has recently gained popularity is called “Snapshot
Ensembling.”
As we can see from the discussion throughout this article, ensemble learning comes at the expense
of training multiple models.
Especially in deep learning, it is a costly operation, even with transfer learning. So, the ensemble
learning method proposed in this paper trains only one deep learning model and saves the model
snapshots at different training epochs.
The ensemble of these models generates a final ensemble prediction framework on the test data.
They proposed some modifications to the usual deep learning model training regime to ensure
the diversity in the model snapshots. The model weights saved at these different epochs need to be
significantly different to make the ensemble successful.
1. Disease detection
Classification and localization of diseases for simple and fast prognosis have been aided by
Ensemble Learning, as in cardiovascular disease detection from X-Ray and CT scans.
2. Remote Sensing
Monitoring of physical characteristics of a target area without coming in physical contact, called
Remote Sensing, is a difficult task since the data acquired by different sensors have varying
resolutions leading to incoherence in data distribution.
Tasks like Landslide Detection and Scene Classification have also been accomplished with the help
of Ensemble Learning.
3. Fraud Detection
Detection of digital fraud is an important and challenging task, since very minute precision is
required to automate the process. Ensemble Learning has proved its efficacy in
detecting Credit Card Fraud and Impression Fraud.
The greater the number of trees in the forest, the higher the accuracy,
and the better the problem of overfitting is avoided.
The following steps and Figure 2.23 can be used to explain the
working process:
Find the predictions of each decision tree for new data points and
assign the new data points to the category with the most votes.
The following are some reasons why we should use the random
forest algorithm:
Although random forest can be used for both classification and
regression tasks, it is better suited to classification than to
regression.
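The voting process described above can be sketched with scikit-learn's RandomForestClassifier; the dataset is an illustrative assumption:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 100 trees, each trained on a bootstrap sample with random feature subsets;
# new data points are assigned to the category with the most votes
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)
print(forest.predict(X[:5]))
print(forest.score(X, y))
```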
Medical: Disease trends and risks can be identified with the
help of this algorithm.
Regression
Here the input variables are the size, the number of bedrooms, and the
number of bathrooms, and the output variable is the price, which is a
continuous value. Therefore, this is a regression problem.
Outliers are observed data points that are far from the least-squares
line, or that differ significantly from other observations; in other
words, an outlier is an extreme value that differs greatly from other
values in a set of values.
X1 = [0, 3, 4, 9, 6, 2]
X2 = -2 * X1 = [0, -6, -8, -18, -12, -4]
So X1 & X2 are collinear. Here it's better to use only one variable,
either X1 or X2, for the input.
When a model performs well on the training dataset but not on the
test dataset, high variance exists.
On the left side of the above figure, anyone can easily see that
the line does not cover all the points shown in the graph. Such a
model tends to cause a phenomenon known as underfitting of
data. In the case of underfitting, the model cannot learn enough
from the training data, which reduces precision and accuracy.
There is high bias and low variance in an underfitted model.
y = a0 + a1·x
Where,
Y: Dependent Variable
X: Independent variable
a0 is the intercept, and a1 is the slope of the regression line, which
tells whether the line is increasing or decreasing.
Therefore, in this graph, the dots are our data and based on this
data we will train our model to predict the results. The black
line is the best-fitted line for the given data. The best-fitted line
is a straight line that best represents the data on a scatter plot.
The best-fitted line may pass through all of the points, some of
the points or none of the points in the graph.
The goal is to find the best-fit line for the dataset: the one for
which the total prediction error (over all data points) is as small
as possible.
It shows how the dependent variable changes when the
independent variable changes.
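The best-fit line can be sketched with scikit-learn's LinearRegression; the small dataset below (roughly y = 2x + 1 with noise) is a made-up example:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: one independent variable x, one dependent variable y
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

# Fit minimizes the total squared prediction error over all data points
model = LinearRegression().fit(X, y)

# model.intercept_ is a0; model.coef_[0] is the slope a1
print(model.intercept_, model.coef_[0])
print(model.predict([[6.0]]))
```

A positive slope here means the fitted line is increasing: the dependent variable grows as the independent variable grows.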
Step: Calculate the posterior probability for each class using the
Naive Bayesian equation, P(class|data) ∝ P(data|class) · P(class).
The outcome of prediction is the class with the highest posterior
probability.
Naive Bayes employs a similar method to predict the likelihood of
various classes based on various attributes. This algorithm is
commonly used in text classification and multi-class problems.
Types of Naïve Bayes
The Naive Bayes Model is classified into three types, which are
listed as follows:
Text classification
Spam Filtering
Real-time Prediction
Multi-class Prediction
Recommendation Systems
Credit Scoring
Sentiment Analysis
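The posterior-probability prediction described above can be sketched with scikit-learn's GaussianNB; the Iris data split is an illustrative choice:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# GaussianNB assumes continuous features follow a normal distribution
nb = GaussianNB().fit(X_train, y_train)

# predict_proba gives the posterior probability of each class;
# predict returns the class with the highest posterior
print(nb.predict_proba(X_test[:1]))
print(nb.score(X_test, y_test))
```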
Decision Tree
Begin the tree with node T, which contains the entire dataset.
Divide the T into subsets that contain the best possible values for
the attributes.
Continue this process until you reach a point where you can no
longer classify the nodes and refer to the final node as a leaf
node.
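The splitting procedure above can be sketched with scikit-learn's DecisionTreeClassifier; the Iris data and depth limit are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# The root node holds the entire dataset; each split divides it by the
# attribute value that best separates the classes, until leaf nodes remain
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# export_text prints the learned splits and the class at each leaf
print(export_text(tree))
```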
Overfitting problem
Complexity
K-Nearest Neighbors (K-NN) algorithm
The K-NN algorithm assumes that the new case and existing
cases are similar and places the new case in the category that is
most similar to the existing categories. The K-NN algorithm stores
all available data and classifies a new data point based on its
similarity to the existing data. This means that when new data
appears, the KNN algorithm can quickly classify it into a suitable
category.
As can be seen, the three closest neighbors are all from category
A, so this new data point must also be from that category.
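The nearest-neighbor vote can be sketched with scikit-learn's KNeighborsClassifier; the 2-D points below are made-up coordinates for illustration:

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 2-D points: category A clusters near (1, 1), B near (5, 5)
X = [[1, 1], [1, 2], [2, 1], [5, 5], [5, 6], [6, 5]]
y = ['A', 'A', 'A', 'B', 'B', 'B']

# K = 3: a new point is assigned to the majority class of its 3 neighbors
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# A new point near (1.5, 1.5): its three nearest neighbors are all 'A'
print(knn.predict([[1.5, 1.5]]))
```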
Logistic Regression
There are only two possible outcomes for the categorical response.
Support vectors are data points that are closer to the hyperplane
and have an influence on the hyperplane's position and
orientation, as shown in the figure. We maximize the classifier's margin
by using these support vectors. The hyperplane's position will be
altered if the support vectors are deleted. These are the points
that help us construct the SVM.
The margin is the distance between two lines drawn through the
closest data points of the different classes. It can be calculated as
the perpendicular distance between the line and the support
vectors. A large margin is regarded as a good margin, while a small
margin is regarded as a bad margin.
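The maximum-margin hyperplane and its support vectors can be sketched with scikit-learn's SVC; the two separable blobs below are made-up data:

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable classes (hypothetical points)
X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]], dtype=float)
y = np.array([0, 0, 1, 1, 1, 1])
y = np.array([0, 0, 0, 1, 1, 1])

# A linear SVM finds the hyperplane that maximizes the margin
svm = SVC(kernel='linear').fit(X, y)

# The support vectors are the points closest to the hyperplane;
# removing them would change its position
print(svm.support_vectors_)
print(svm.predict([[2, 2], [6, 6]]))
```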
Working of SVM
Speech recognition
Overfitting problem
UNIT-IV
2. Used in the healthcare industry. Helpful in segmenting cancer cells and tumours using
which their severity can be gauged.
In this article, we will perform segmentation on an image of the monarch butterfly using a
clustering method called K Means Clustering.
import numpy as np
import matplotlib.pyplot as plt
import cv2
%matplotlib inline

# Read the image (the filename is an assumption) and convert BGR -> RGB
image = cv2.cvtColor(cv2.imread('butterfly.jpg'), cv2.COLOR_BGR2RGB)
plt.imshow(image)

# Reshaping the image into a 2D array of pixels and 3 color values (RGB)
pixel_vals = np.float32(image.reshape((-1, 3)))

# Cluster the pixel values with k-means (k = 3 clusters)
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 100, 0.85)
_, labels, centers = cv2.kmeans(pixel_vals, 3, None, criteria, 10, cv2.KMEANS_RANDOM_CENTERS)
segmented_image = np.uint8(centers)[labels.flatten()].reshape(image.shape)
plt.imshow(segmented_image)
Data Preprocessing or Data Preparation is a data mining technique that transforms raw data into
an understandable format for ML algorithms. Real-world data usually is noisy (contains errors,
outliers, duplicates), incomplete (some values are missed), could be stored in different places
and different formats. The task of Data Preprocessing is to handle these issues.
In the common ML pipeline, the Data Preprocessing stage sits between the Data Collection stage
and the Training / Tuning Model stage.
2. Because of "bad" data, ML models will not give any useful results, or may even give
wrong answers, which may lead to wrong decisions (the GIGO principle).
3. The higher the quality of the data, the less data is needed.
Stages of Data preprocessing for Clustering
1. Data Cleaning
Removing duplicates
Removing irrelevant observations and errors
Removing unnecessary columns
Handling inconsistent data
Handling outliers and noise
2. Handling missing data
3. Data Integration
4. Data Transformation
Feature Construction
Handling skewness
Data Scaling
5. Data Reduction
Removing dependent (highly correlated) variables
Feature selection
PCA
Fundamentally, all clustering methods use the same approach i.e. first we calculate similarities and then
we use it to cluster the data points into groups or batches. Here we will focus on Density-based spatial
clustering of applications with noise (DBSCAN) clustering method.
Clusters are dense regions in the data space, separated by regions of the lower density of points.
The DBSCAN algorithm is based on this intuitive notion of “clusters” and “noise”. The key idea is that for
each point of a cluster, the neighborhood of a given radius has to contain at least a minimum number of
points.
Limitations of K-means
Partitioning methods (K-means, PAM clustering) and hierarchical clustering work for finding
spherical-shaped clusters or convex clusters. In other words, they are suitable only for
compact and well-separated clusters. Moreover, they are also severely affected by the
presence of noise and outliers in the data.
1. Clusters can be of arbitrary shape such as those shown in the figure below.
2. Data may contain noise.
The figure below shows a data set containing non-convex clusters and outliers/noise. Given
such data, the k-means algorithm has difficulty identifying these clusters with arbitrary
shapes.
DBSCAN algorithm requires two parameters:
1. eps: It defines the neighborhood around a data point, i.e., if the distance between two
points is lower than or equal to 'eps' then they are considered neighbors. If the eps value is
chosen too small, then a large part of the data will be considered outliers. If it is
chosen very large, then the clusters will merge and the majority of the data points will
be in the same cluster. One way to find the eps value is based on the k-distance
graph.
2. MinPts: Minimum number of neighbors (data points) within the eps radius. The larger the
dataset, the larger the value of MinPts that must be chosen. As a general rule, the minimum
MinPts can be derived from the number of dimensions D in the dataset as MinPts >=
D+1. The minimum value of MinPts must be at least 3.
1. Find all the neighbor points within eps and identify the core points or visited with more
than MinPts neighbors.
2. For each core point if it is not already assigned to a cluster, create a new cluster.
3. Find recursively all its density connected points and assign them to the same cluster as
the core point.
Points a and b are said to be density-connected if there exists a point c that has a
sufficient number of points in its neighborhood and both a and b are within the eps
distance of it. This is a chaining process: if b is a neighbor of c, c is a neighbor of d, and d is a
neighbor of e, which in turn is a neighbor of a, then b is connected to a.
4. Iterate through the remaining unvisited points in the dataset. Those points that do not
belong to any cluster are noise.
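The algorithm above can be sketched with scikit-learn's DBSCAN; the two-moons data (non-convex clusters that k-means handles poorly) is an illustrative assumption:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Non-convex "two moons" data with a little noise
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps: neighborhood radius; min_samples: the MinPts parameter
db = DBSCAN(eps=0.3, min_samples=5).fit(X)

labels = db.labels_  # -1 marks noise points that belong to no cluster
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters)
```

DBSCAN recovers both crescent-shaped clusters here, while k-means with k=2 would cut each moon in half.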
Gaussian Mixture Models (GMMs) assume that there are a certain number of Gaussian
distributions, and each of these distributions represent a cluster. Hence, a Gaussian Mixture
Model tends to group the data points belonging to a single distribution together.
Let’s say we have three Gaussian distributions (more on that in the next section) – GD1, GD2,
and GD3. These have a certain mean (μ1, μ2, μ3) and variance (σ1, σ2, σ3) value respectively.
For a given set of data points, our GMM would identify the probability of each data point
belonging to each of these distributions.
Gaussian Mixture Models are probabilistic models and use the soft clustering approach for
distributing the points in different clusters.
Here, we have three clusters that are denoted by three colors – Blue, Green, and Cyan. Let’s
take the data point highlighted in red. The probability of this point being a part of the blue
cluster is 1, while the probability of it being a part of the green or cyan clusters is 0.
Now, consider another point – somewhere in between the blue and cyan (highlighted in the
below figure). The probability that this point is a part of cluster green is 0, right? And the
probability that this belongs to blue and cyan is 0.2 and 0.8 respectively.
Gaussian Mixture Models use the soft clustering technique for assigning data points to Gaussian distributions.
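The soft assignment described above can be sketched with scikit-learn's GaussianMixture; the two synthetic Gaussian clusters are an illustrative assumption:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two hypothetical Gaussian clusters with different means
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(6, 1, (200, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Soft clustering: each point gets a probability of belonging to each
# Gaussian, rather than a single hard label
probs = gmm.predict_proba(X[:1])
print(probs, probs.sum())
```

For a point between the two distributions, both probabilities would be non-zero (e.g. roughly 0.2 / 0.8), which is exactly the behavior described in the text.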
Dimensionality Reduction
Curse of Dimensionality — A “Curse” to Machine Learning
Curse of Dimensionality describes the explosive nature of increasing data dimensions and its
resulting exponential increase in computational efforts required for its processing and/or
analysis. In machine learning, a feature of an object can be an attribute or a characteristic that
defines it. Each feature represents a dimension and group of dimensions creates a data point.
This represents a feature vector that defines the data point to be used by a machine learning
algorithm(s). When we say increase in dimensionality it implies an increase in the number of
features used to describe the data. As the dimensionality increases, the number of data points
required for good performance of any machine learning algorithm increases exponentially.
Two main classes of feature selection techniques include wrapper methods and filter methods.
Wrapper methods, as the name suggests, wrap a machine learning model, fitting and evaluating
the model with different subsets of input features and selecting the subset that results in the
best model performance. RFE is an example of a wrapper feature selection method.
Filter methods use scoring methods, like correlation between the feature and the target
variable, to select a subset of input features that are most predictive. Examples include
Pearson’s correlation and Chi-Squared test.
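The wrapper approach can be sketched with scikit-learn's RFE; the synthetic dataset with a known number of informative features is an illustrative assumption:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# 10 input features, of which only 5 are informative
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           random_state=0)

# Wrapper method: RFE repeatedly fits the model and drops the weakest
# feature until the requested number of features remains
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)
print(rfe.support_)  # boolean mask of the selected features
```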
2. Matrix Factorization
Techniques from linear algebra can be used for dimensionality reduction. Specifically, matrix
factorization methods can be used to reduce a dataset matrix into its constituent parts.
Examples include the eigendecomposition and singular value decomposition (SVD). The most
common approach to dimensionality reduction is called principal component analysis, or PCA.
3. Manifold Learning
Techniques from high-dimensionality statistics can also be used for dimensionality reduction.
These techniques are sometimes referred to as “manifold learning” and are used to create a
low-dimensional projection of high-dimensional data, often for the purposes of data
visualization.
The projection is designed to both create a low-dimensional representation of the dataset
whilst best preserving the salient structure or relationships in the data.
Examples of manifold learning techniques include:
Kohonen Self-Organizing Map (SOM)
Sammon Mapping
Multidimensional Scaling (MDS)
t-distributed Stochastic Neighbor Embedding (t-SNE)
It can be thought of as a projection method where data with m-columns (features) is projected
into a subspace with m or fewer columns, whilst retaining the essence of the original data.
The eigenvectors can be sorted by the eigenvalues in descending order to provide a ranking of
the components or axes of the new subspace for A.
If all eigenvalues have a similar value, then we know that the existing representation may
already be reasonably compressed or dense and that the projection may offer little. If there are
eigenvalues close to zero, they represent components or axes of B that may be discarded.
A total of m or less components must be selected to comprise the chosen subspace. Ideally, we
would select k eigenvectors, called principal components, that have the k largest eigenvalues.
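The eigendecomposition procedure above can be sketched in NumPy; the small data matrix A is a made-up example:

```python
import numpy as np

# Small hypothetical data matrix A (rows = samples, columns = features)
A = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

# Center the data, compute the covariance matrix, then its eigenvectors
B = A - A.mean(axis=0)
cov = np.cov(B, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# Sort components by eigenvalue in descending order (the ranking of axes)
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order]

# Project the centered data onto the principal components
projected = B @ components
print(eigvals[order])
```

The first column of `projected` carries the most variance; components whose eigenvalues are close to zero could be discarded to reduce the dimensionality.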
Using Scikit-Learn
PCA example with Iris Data-set
import numpy as np
import matplotlib.pyplot as plt
from sklearn import decomposition
from sklearn import datasets

np.random.seed(5)
iris = datasets.load_iris()
X = iris.data
y = iris.target

fig = plt.figure(1, figsize=(4, 3))
plt.clf()
ax = fig.add_subplot(111, projection='3d', elev=48, azim=134)

pca = decomposition.PCA(n_components=3)
pca.fit(X)
X = pca.transform(X)
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y)

ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([])
plt.show()
RANDOMIZED PCA:
Classical PCA uses the low-rank matrix approximation to estimate the principal
components. However, this method becomes costly and makes the whole process difficult to
scale for large datasets. By randomizing how the singular value decomposition of the dataset
happens, we can approximate the first K principal components more quickly than with classical PCA.
KERNEL PCA:
PCA is a linear method. It works great for linearly separable datasets. However, if the dataset
has non-linear relationships, then it produces undesirable results.
Kernel PCA is a technique which uses the so-called kernel trick and projects the linearly
inseparable data into a higher dimension where it is linearly separable.
There are various kernels that are popularly used; some of them are linear, polynomial, RBF,
and sigmoid.
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

X, y = make_circles(n_samples=500, factor=.1, noise=0.02, random_state=47)

plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()

# Project with an RBF kernel; in the new space the circles become linearly separable
kpca = KernelPCA(kernel='rbf', gamma=10)
X_kpca = kpca.fit_transform(X)
plt.scatter(X_kpca[:, 0], X_kpca[:, 1], c=y)
plt.show()
UNIT-V
Artificial Neural Network (ANN)
An ANN is an efficient computing system whose central theme is
borrowed from the analogy of biological neural networks. ANNs are also named "artificial
neural systems," "parallel distributed processing systems," or "connectionist systems."
ANN acquires a large collection of units that are interconnected in some pattern to allow
communication between the units. These units, also referred to as nodes or neurons, are
simple processors which operate in parallel.
Every neuron is connected with other neuron through a connection link. Each connection
link is associated with a weight that has information about the input signal. This is the most
useful information for neurons to solve a particular problem because the weight usually
excites or inhibits the signal that is being communicated. Each neuron has an internal state,
which is called an activation signal. Output signals, which are produced after combining the
input signals and activation rule, may be sent to other units.
Biological Neuron
A nerve cell (neuron) is a special biological cell that processes information. According
to an estimate, there are a huge number of neurons, approximately 10^11, with numerous
interconnections, approximately 10^15.
As shown in the above diagram, a typical neuron consists of the following four parts with
the help of which we can explain its working −
Dendrites − They are tree-like branches, responsible for receiving information
from the other neurons the neuron is connected to. In a sense, we can say that they are like
the ears of the neuron.
Soma − It is the cell body of the neuron and is responsible for processing the
information received from the dendrites.
Axon − It is just like a cable through which neurons send the information.
Synapses − It is the connection between the axon and the dendrites of other neurons.
The following diagram represents the general model of an ANN followed by its
processing.
For the above general model of an artificial neural network, the net input can be
calculated as follows −

y_in = x_1*w_1 + x_2*w_2 + ... + x_n*w_n = Σ (i = 1 to n) x_i*w_i

where x_i is the i-th input signal and w_i is the corresponding connection weight. The
output is then obtained by applying the activation function to this net input.
Workflow of ANN
Let us first understand the different phases of deep learning and then learn how Keras
helps in the process.
Collect required data
Deep learning requires a lot of input data to successfully learn and predict results. So,
first collect as much data as possible.
Analyze data
Analyze the data and acquire a good understanding of it. A good understanding of the
data is required to select the correct ANN algorithm.
Choose an algorithm (model)
Choose an algorithm that best fits the type of learning process (e.g., image
classification, text processing, etc.) and the available input data. An algorithm is
represented by a Model in Keras. An algorithm includes one or more layers; each layer
in the ANN can be represented by a Keras Layer.
Prepare data − Process, filter, and select only the required information from the data.
Split data − Split the data into training and test data sets. The test data will be used to
evaluate the predictions of the algorithm / model (once the machine has learned) and
to cross-check the efficiency of the learning process.
Compile the model − Compile the algorithm / model so that it can be used to learn by
training and finally to do prediction. This step requires us to choose a loss function and
an optimizer. The loss function and optimizer are used in the learning phase to find the
error (deviation from the actual output) and to perform optimization so that the error
is minimized.
Fit the model − The actual learning process is done in this phase using the training
data set.
Predict result for unknown values − Predict the output for unknown input data
(other than the existing training and test data).
Evaluate model − Evaluate the model by predicting the output for the test data and
cross-comparing the predictions with the actual results of the test data.
Freeze, modify, or choose a new algorithm − Check whether the evaluation of the
model is successful. If yes, save the algorithm for future prediction purposes. If not,
modify or choose a new algorithm / model and, again, train, predict, and evaluate it.
Repeat the process until the best algorithm (model) is found.
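The phases above can be sketched end to end with a toy example. The sketch below uses a plain least-squares line fit in NumPy in place of a real ANN, so the workflow itself (prepare, split, fit, predict, evaluate) stays visible; all data and names here are illustrative.

```python
# A toy end-to-end run of the workflow phases described above.
import numpy as np

rng = np.random.default_rng(0)

# Prepare data: samples drawn from y = 3x + 1 plus a little noise
X = rng.uniform(0, 10, size=100)
y = 3 * X + 1 + rng.normal(0, 0.5, size=100)

# Split data: 80% for training, 20% held out for testing
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Fit the "model" (slope and intercept) on the training data only
slope, intercept = np.polyfit(X_train, y_train, deg=1)

# Predict outputs for the unseen test inputs
y_pred = slope * X_test + intercept

# Evaluate: mean squared error on the held-out test data
mse = float(np.mean((y_test - y_pred) ** 2))
print(round(float(slope), 2), round(float(intercept), 2))  # close to 3 and 1
print(round(mse, 4))  # small, since the model generalizes to unseen data
```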
Architecture of Keras
Keras provides many pre-built layers (core layers, convolution layers, pooling layers,
etc.). Keras models and layers access Keras modules for activation functions, loss
functions, regularization functions, and so on. Using the Keras model, Keras layers, and
Keras modules, any ANN algorithm (CNN, RNN, etc.) can be represented in a simple
and efficient manner.
Model
model = Sequential()
model.add(Dense(512, activation = 'relu', input_shape = (784,)))

Here, the first line creates a Sequential model and the second adds a Dense layer with
512 units, relu activation, and an input shape of 784 features.
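As a rough sketch of what a Dense layer like the one above computes, the following NumPy snippet performs the same shape of computation (matrix multiply, bias add, relu activation). The weights below are random stand-ins, not a trained model; the shapes are the point.

```python
# What Dense(512, activation='relu', input_shape=(784,)) computes, roughly.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(784, 512))  # one weight per input-output pair
b = np.zeros(512)                # one bias per output unit

x = rng.normal(size=(1, 784))    # a single sample with 784 features
out = np.maximum(0, x @ W + b)   # relu(x @ W + b)

print(out.shape)  # (1, 512): 512 activations for the one sample
```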
Layer
Each Keras layer in the Keras model represents the corresponding layer (input layer,
hidden layer, or output layer) in the actual proposed neural network model. Keras
provides a lot of pre-built layers so that any complex neural network can be easily
created. Some of the important Keras layers are specified below,
Core Layers
Convolution Layers
Pooling Layers
Recurrent Layers
A simple Python code to represent a neural network model using the sequential model
is as follows −
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout

model = Sequential()
Core Modules
Keras also provides a lot of built-in neural-network-related functions to properly
create Keras models and Keras layers. Some of these functions are as follows −
Activations module − The activation function is an important concept in ANNs, and
the activations module provides many activation functions, such as softmax, relu, etc.
Loss module − The loss module provides loss functions such as mean_squared_error,
mean_absolute_error, poisson, etc.
Optimizer module − The optimizer module provides optimizer functions such as
adam, sgd, etc.
Regularizers module − The regularizers module provides functions such as the L1
regularizer, L2 regularizer, etc.
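To make a few of these named functions concrete, here are minimal NumPy sketches of what they compute. Keras's own implementations are more general (batched, numerically hardened), so treat these as illustrations only.

```python
# NumPy sketches of relu (activation), softmax (activation), and
# mean squared error (loss).
import numpy as np

def relu(x):
    # Activation: pass positive values through, zero out negatives
    return np.maximum(0, x)

def softmax(x):
    # Activation: turn a vector of scores into probabilities summing to 1
    e = np.exp(x - np.max(x))  # shift for numerical stability
    return e / e.sum()

def mean_squared_error(y_true, y_pred):
    # Loss: average squared deviation of predictions from actual outputs
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

print(relu(np.array([-2.0, 3.0])))         # [0. 3.]
print(softmax(np.array([1.0, 1.0])))       # [0.5 0.5]
print(mean_squared_error([1, 2], [1, 4]))  # 2.0
```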
Keras − Regression Prediction using MLP
Step 1 − Import the modules
Let us change the dataset according to our model, so that it can be fed into our model.
The data can be changed using the code below −
x_train_scaled = preprocessing.scale(x_train)
scaler = preprocessing.StandardScaler().fit(x_train)
x_test_scaled = scaler.transform(x_test)
Here, we have normalized the training data using the sklearn.preprocessing.scale
function. The preprocessing.StandardScaler().fit function returns a scaler fitted with
the mean and standard deviation of the training data, which we can apply to the test
data using the scaler.transform function. This normalizes the test data with the same
settings as the training data.
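The fit-on-train, transform-both idea behind this scaling can be sketched in plain NumPy, assuming standardization (subtract the training mean, divide by the training standard deviation). The arrays below are made-up examples.

```python
# The mean and standard deviation are computed from the training data
# only, then those same two numbers are reused on the test data.
import numpy as np

train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[2.0], [6.0]])

mean = train.mean(axis=0)  # "fit": statistics from training data only
std = train.std(axis=0)

train_scaled = (train - mean) / std  # "transform" the training data
test_scaled = (test - mean) / std    # same settings on the test data

print(train_scaled.mean(axis=0))  # ~0: training data is now centered
print(train_scaled.std(axis=0))   # ~1: and has unit variance
```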
Step 4 − Create the model
Let us create the actual model.
model = Sequential()
model.add(Dense(64, kernel_initializer = 'normal', activation = 'relu',
input_shape = (13,)))
model.add(Dense(64, activation = 'relu'))
model.add(Dense(1))
Step 5 − Compile the model
Let us compile the model using selected loss function, optimizer and metrics.
model.compile(
loss = 'mse',
optimizer = RMSprop(),
metrics = ['mean_absolute_error']
)
Step 6 − Train the model
Let us train the model using fit() method.
history = model.fit(
x_train_scaled, y_train,
batch_size=128,
epochs = 500,
verbose = 1,
validation_split = 0.2,
callbacks = [EarlyStopping(monitor = 'val_loss', patience = 20)]
)
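The EarlyStopping(monitor = 'val_loss', patience = 20) callback used above stops training once the monitored value has not improved for `patience` consecutive epochs. A rough sketch of that logic in plain Python (the function name and loss values are illustrative):

```python
# Core early-stopping logic: track the best validation loss seen so far,
# and stop when it has not improved for `patience` epochs in a row.

def early_stopping_epoch(val_losses, patience):
    """Return the epoch index at which training stops, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss  # improvement: remember it, reset the counter
            wait = 0
        else:
            wait += 1    # no improvement this epoch
            if wait >= patience:
                return epoch
    return None          # patience never ran out

# Loss improves for three epochs, then plateaus; with patience=2,
# training stops two epochs after the last improvement.
print(early_stopping_epoch([0.9, 0.7, 0.5, 0.5, 0.6, 0.6], patience=2))  # 4
```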
Step 7 − Evaluate the model
Let us evaluate the model using test data.
score = model.evaluate(x_test_scaled, y_test, verbose = 0)
print('Test loss:', score[0])
print('Test mean absolute error:', score[1])
Executing the above code will output the below information −
Test loss: 21.928471583946077
Test mean absolute error: 2.9599233234629914
(Since the model was compiled with metrics = ['mean_absolute_error'], score[1] is the
mean absolute error, not an accuracy.)
Step 8 − Predict
TENSORFLOW INSTALLATION
Step 1 − Verify the Python version installed.
Step 2 − A user can pick any mechanism to install TensorFlow on the system. We
recommend "pip" or "Anaconda". Pip is a command used for executing and installing
modules in Python.
Before we install TensorFlow, we need to install the Anaconda framework on our
system.
Step 5 − Use pip to install TensorFlow on the system. The commands used for
installation are as follows −
pip install tensorflow
And, for the GPU build,
pip install tensorflow-gpu
Reading huge datasets efficiently is not the only difficulty: the data also needs to be
preprocessed. Indeed, it is not always composed strictly of convenient numerical
fields: sometimes there will be text features, categorical features, and so on. To handle
this, TensorFlow provides the Features API: it lets you easily convert these features to
numerical features that can be consumed by your neural network.
TensorFlow's ecosystem:
• TF Transform makes it possible to write a single preprocessing function that can be
run in batch mode on your full training set, before training (to speed it up), and then
exported to a TF Function and incorporated into your trained model, so that once it is
deployed in production, it can take care of preprocessing new instances on the fly.
• TF Datasets (TFDS) provides a convenient function to download many common datasets of
all kinds, including large ones like ImageNet, and it provides convenient dataset objects to
manipulate them using the Data API.