
A quality product by Brainheaters™ LLC

Brainheaters Notes
IML Semester-7

SERIES 313-2018 (A.Y 2021-22)


www.brainheaters.in
BH.Index
(Learn as per the priority to prepare smartly)

Sr No | Chapter Name & Content                         | Priority
1.    | Introduction to Machine Learning               | 4
2.    | Basic Machine Learning Algorithms              | 2
3.    | Dimensionality Reduction                       | 3
4.    | Bayesian Concept of Learning                   | 3
5.    | Logistic Regression and Support Vector Machine | 2
6.    | Basics of Neural Network                       | -
7.    | Computation and Ensemble Learning              | -
8.    | Basic Concepts of Clustering                   | 2


MODULE-1

Q1. Explain Different Types of Learning (P4 - Appeared 1 Time) (3-7 Marks)

ANS: Supervised Learning

• Supervised learning is one of the most basic types of machine learning. In this type, the machine learning algorithm is trained on labeled data.
• Even though the data needs to be labeled accurately for this method to work, supervised learning is extremely powerful when used in the right circumstances.
• In supervised learning, the ML algorithm is given a small training dataset to work with. This training dataset is a smaller part of the bigger dataset and serves to give the algorithm a basic idea of the problem, solution, and data points to be dealt with.
• The training dataset is also very similar to the final dataset in its characteristics and provides the algorithm with the labelled parameters required for the problem.
• The algorithm then finds relationships between the parameters given, essentially establishing a cause-and-effect relationship between the variables in the dataset.
• At the end of the training, the algorithm has an idea of how the data works and the relationship between the input and the output.
• This solution is then deployed for use with the final dataset, which it learns from in the same way as the training dataset.
• This means that supervised machine learning algorithms will continue to improve even after being deployed, discovering new patterns and relationships as they train themselves on new data.
Unsupervised Learning

• Unsupervised machine learning holds the advantage of being able to work with unlabeled data.
• This means that human labour is not required to make the dataset machine-readable, allowing much larger datasets to be worked on by the program.
• In supervised learning, the labels allow the algorithm to find the exact nature of the relationship between any two data points. However, unsupervised learning does not have labels to work off of, resulting in the creation of hidden structures.
• Relationships between data points are perceived by the algorithm in an abstract manner, with no input required from human beings.
• The creation of these hidden structures is what makes unsupervised learning algorithms versatile.
• Instead of a defined and set problem statement, unsupervised learning algorithms can adapt to the data by dynamically changing hidden structures.
• This offers more post-deployment development than supervised learning algorithms.
Reinforcement Learning

• Reinforcement learning directly takes inspiration from how human beings learn from data in their lives.
• It features an algorithm that improves upon itself and learns from new situations using a trial-and-error method.
• Favourable outputs are encouraged or 'reinforced', and non-favourable outputs are discouraged or 'punished'.
• Based on the psychological concept of conditioning, reinforcement learning works by putting the algorithm in a work environment with an interpreter and a reward system.
• In every iteration of the algorithm, the output result is given to the interpreter, which decides whether the outcome is favourable or not.
• In case of the program finding the correct solution, the interpreter reinforces the solution by providing a reward to the algorithm.
• If the outcome is not favourable, the algorithm is forced to reiterate until it finds a better result. In most cases, the reward system is directly tied to the effectiveness of the result.
• In typical reinforcement learning use-cases, such as finding the shortest route between two points on a map, the solution is not an absolute value.
• Instead, it takes on a score of effectiveness, expressed as a percentage value.
• The higher this percentage value is, the more reward is given to the algorithm.
• Thus, the program is trained to give the best possible solution for the best possible reward.

Q2. Explain Hypothesis Space (P4 - Appeared 1 Time) (3-7 Marks)

ANS: Hypothesis Space (H):

• Hypothesis space is the set of all possible legal hypotheses. This is the set from which the machine learning algorithm would determine the best possible hypothesis (only one) which would best describe the target function or the outputs.

Hypothesis space learning assumes the following sets:

• I, the instance space, is the set of all possible examples.
• H, the hypothesis space, is a set of Boolean functions on the input features.
• E ⊆ I is the set of training examples. Values for the input features and the target feature are given for each training example.

Figure: Scatter plot of Sales vs MRP

Q3. Explain Inductive Bias (P4 - Appeared 1 Time) (3-7 Marks)

ANS: Inductive Bias:

• The inductive bias (also known as learning bias) of a learning algorithm is the set of assumptions that the learner uses to predict outputs for given inputs that it has not encountered.
• In machine learning, one aims to construct algorithms that are able to predict a certain target output. To achieve this, the learning algorithm is presented with some training examples that demonstrate the intended relation of input and output values. Then the learner is supposed to approximate the correct output, even for examples that have not been shown during training.
• Without any additional assumptions, this problem cannot be solved, since unseen situations might have an arbitrary output value. The kind of necessary assumptions about the nature of the target function is subsumed in the phrase inductive bias.
• A classical example of an inductive bias is Occam's razor, assuming that the simplest consistent hypothesis about the target function is actually the best. Here consistent means that the hypothesis of the learner yields correct outputs for all of the examples that have been given to the algorithm.

Types:

The following is a list of common inductive biases in machine learning algorithms.

• Maximum conditional independence: if the hypothesis can be cast in a Bayesian framework, try to maximize conditional independence. This is the bias used in the Naive Bayes classifier.
• Minimum cross-validation error: when trying to choose among hypotheses, select the hypothesis with the lowest cross-validation error. Although cross-validation may seem to be free of bias, the 'no free lunch' theorems show that cross-validation must be biased.
• Maximum margin: when drawing a boundary between two classes, attempt to maximize the width of the boundary. This is the bias used in support vector machines. The assumption is that distinct classes tend to be separated by wide boundaries.
• Minimum description length: when forming a hypothesis, attempt to minimize the length of the description of the hypothesis. The assumption is that simpler hypotheses are more likely to be true. This is NOT what Occam's razor says. Simpler models are more testable, not "more likely to be true." See Occam's razor.
• Minimum features: unless there is good evidence that a feature is useful, it should be deleted. This is the assumption behind feature selection algorithms.
• Nearest neighbors: assume that most of the cases in a small neighborhood in feature space belong to the same class. Given a case for which the class is unknown, guess that it belongs to the same class as the majority in its immediate neighborhood. This is the bias used in the k-nearest neighbors algorithm. The assumption is that cases that are near each other tend to belong to the same class.

Q4. Explain Evaluation and Cross-Validation (P4 - Appeared 1 Time) (3-7 Marks)

ANS: Evaluation:
• Model evaluation aims to estimate the generalization accuracy of a model on future (unseen/out-of-sample) data.
• Methods for evaluating a model's performance are divided into 2 categories: namely, holdout and cross-validation.

Holdout

• The purpose of holdout evaluation is to test a model on different data than it was trained on. This provides an unbiased estimate of learning performance.
• In this method, the dataset is randomly divided into three subsets:
• Training set is a subset of the dataset used to build predictive models.
• Validation set is a subset of the dataset used to assess the performance of the model built in the training phase. It provides a test platform for fine-tuning a model's parameters and selecting the best performing model. Not all modeling algorithms need a validation set.
• Test set, or unseen data, is a subset of the dataset used to assess the likely future performance of a model.
• If a model fits the training set much better than it fits the test set, overfitting is probably the cause.

Cross-Validation

• Cross-validation is a technique that involves partitioning the original observation dataset into a training set, used to train the model, and an independent set used to evaluate the analysis.
• The most common cross-validation technique is k-fold cross-validation, where the original dataset is partitioned into k equal-size subsamples, called folds.
• The k is a user-specified number, usually with 5 or 10 as its preferred value. This is repeated k times, such that each time, one of the k subsets is used as the test set/validation set and the other k-1 subsets are put together to form a training set.
• The error estimation is averaged over all k trials to get the total effectiveness of our model.
• Cross-validation is a technique in which we train our model using a subset of the data-set and then evaluate using the complementary subset of the data-set.

The three steps involved in cross-validation are as follows:

1. Reserve some portion of the sample data-set.
2. Using the rest of the data set, train the model.
3. Test the model using the reserved portion of the data-set.

Methods of Cross-Validation

Validation

• In this method, we perform training on 50% of the given data-set and the remaining 50% is used for testing.
• The major drawback of this method is that since we train on only 50% of the dataset, it may be possible that the remaining 50% of the data contains some important information which we leave out while training our model, i.e. higher bias.

LOOCV (Leave One Out Cross Validation)

• In this method, we perform training on the whole data-set but leave out only one data-point of the available data-set, and then iterate for each data-point. It has some advantages as well as disadvantages.
• An advantage of using this method is that we make use of all data points, and hence it has low bias.
• The major drawback of this method is that it leads to higher variation in the testing model, as we are testing against one data point. If the data point is an outlier it can lead to higher variation.
• Another drawback is that it takes a lot of execution time, as it iterates over 'the number of data points' times.

K-Fold Cross Validation

• In this method, we split the data-set into k subsets (known as folds), then we perform training on k-1 of the subsets and leave one subset out for the evaluation of the trained model.
• In this method, we iterate k times with a different subset reserved for testing purposes each time.
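A minimal sketch of k-fold cross-validation, assuming scikit-learn and its iris dataset purely for illustration (the notes do not prescribe any particular library):

# k-fold cross-validation sketch (the library and dataset are assumptions)
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# cv=5: each of the 5 folds is held out once while the other 4 train the model
scores = cross_val_score(model, X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())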
MODULE-2

Q1. Explain Linear Regression (P4 - Appeared 1 Time) (3-7 Marks)

ANS: Linear Regression:

• Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a statistical method that is used for predictive analysis.
• Linear regression makes predictions for continuous/real or numeric variables such as sales, salary, age, product price, etc.
• The linear regression algorithm shows a linear relationship between a dependent (y) variable and one or more independent (x) variables, hence it is called linear regression.
• Since linear regression shows the linear relationship, it finds how the value of the dependent variable changes according to the value of the independent variable.
• The linear regression model provides a sloped straight line representing the relationship between the variables. Consider the image below:


Figure: Linear Regression in Machine Learning (data points around a line of regression; dependent variable Y plotted against independent variable X)

Mathematically, we can represent a linear regression as:

    y = a0 + a1*x + ε

Here:
• y = dependent variable (target variable)
• x = independent variable (predictor variable)
• a0 = intercept of the line (gives an additional degree of freedom)
• a1 = linear regression coefficient (scale factor to each input value)
• ε = random error

The values for the x and y variables are training datasets for the linear regression model representation.

Types of Linear Regression

Linear regression can be further divided into two types of algorithm:

Simple Linear Regression:
• If a single independent variable is used to predict the value of a numerical dependent variable, then such a linear regression algorithm is called Simple Linear Regression.

Multiple Linear Regression:
• If more than one independent variable is used to predict the value of a numerical dependent variable, then such a linear regression algorithm is called Multiple Linear Regression.
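A small illustrative sketch of fitting y = a0 + a1*x with scikit-learn (the library choice and the toy experience/salary data are assumptions, not part of these notes):

# Simple linear regression sketch (library and data are assumptions)
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # e.g. years of experience
y = np.array([30.0, 35.0, 41.0, 44.0, 50.0])        # e.g. salary

model = LinearRegression().fit(x, y)
print("Intercept a0:", model.intercept_)    # additional degree of freedom
print("Coefficient a1:", model.coef_[0])    # scale factor per unit of x
print("Prediction for x = 6:", model.predict([[6.0]])[0])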

Q2. Describe Decision Trees (P4 - Appeared 1 Time) (3-7 Marks)

ANS: Decision Tree

• It is a supervised learning technique that can be used for both Classification and Regression problems, but mostly it is preferred for solving Classification problems.
• It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules and each leaf node represents the outcome.
• In a decision tree, there are two kinds of nodes, the Decision Node and the Leaf Node. Decision nodes are used to make any decision and have multiple branches, whereas leaf nodes are the outputs of those decisions and do not contain any further branches.
• The decisions or tests are performed on the basis of features of the given dataset. It is a graphical representation for getting all the possible solutions to a problem/decision based on given conditions.
• It is called a decision tree because, similar to a tree, it starts with the root node, which expands on further branches and constructs a tree-like structure.
• In order to build a tree, we use the CART algorithm, which stands for Classification and Regression Tree algorithm.
• A decision tree simply asks a question, and based on the answer (Yes/No), it further splits the tree into subtrees.
The diagram below explains the general structure of a decision tree:

Figure: General structure of a decision tree (a root node splits into decision nodes and sub-trees, which end in leaf nodes)

Working of the Decision Tree algorithm:

• In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root node of the tree.
• This algorithm compares the values of the root attribute with the record (real dataset) attribute and, based on the comparison, follows the branch and jumps to the next node.
• For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further. It continues the process until it reaches the leaf node of the tree.

The complete process can be better understood using the below algorithm:

Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using the Attribute Selection Measure (ASM).
Step-3: Divide S into subsets that contain possible values for the best attributes.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where you cannot further classify the nodes, and call the final node a leaf node.
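A hedged sketch of training a CART-style decision tree with scikit-learn (the iris dataset, depth limit and library calls are assumptions used only for illustration):

# Decision tree classification sketch (dataset and library are assumptions)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Attribute selection here uses Gini impurity, scikit-learn's default measure
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))
print(export_text(tree))   # text view of the decision nodes and leaf nodes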



Q3. Explain Learning Decision Trees (P4 - Appeared 1 Time) (3-7 Marks)
ANS: Learning Decision Trees

• A decision tree is a simple representation for classifying examples. Decision tree learning is one of the most successful techniques for supervised classification learning.
• For this section, assume that all of the features have finite discrete domains, and there is a single target feature called the classification. Each element of the domain of the classification is called a class.
• A decision tree or a classification tree is a tree in which each internal (non-leaf) node is labeled with an input feature. The arcs coming from a node labeled with a feature are labeled with each of the possible values of the feature.
• Each leaf of the tree is labeled with a class or a probability distribution over the classes.
• To classify an example, filter it down the tree, as follows. For each feature encountered in the tree, the arc corresponding to the value of the example for that feature is followed.
• When a leaf is reached, the classification corresponding to that leaf is returned.
Figure: Two decision trees (splitting on features such as Length, Thread and Author to decide whether an article is read or skipped)

Q4. Write about K-Nearest Neighbour (P4 - Appeared 1 Time) (3-7 Marks)

ANS: K-Nearest Neighbour

• It is one of the simplest Machine Learning algorithms, based on the supervised learning technique. The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
• The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be easily classified into a well-suited category by using the K-NN algorithm.
• The K-NN algorithm can be used for Regression as well as for Classification, but mostly it is used for Classification problems.
• K-NN is a non-parametric algorithm, which means it does not make any assumption on the underlying data. It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and, at the time of classification, it performs an action on the dataset.
• The KNN algorithm at the training phase just stores the dataset, and when it gets new data, it classifies that data into a category that is much similar to the new data.
• Example: Suppose we have an image of a creature that looks similar to a cat and a dog, but we want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find the similar features of the new data set to the cat and dog images and, based on the most similar features, it will put it in either the cat or the dog category.
The K-NN working can be explained on the basis of the below algorithm:

1. Select the number K of neighbors.
2. Calculate the Euclidean distance of K number of neighbors.
3. Take the K nearest neighbors as per the calculated Euclidean distance.
4. Among these K neighbors, count the number of the data points in each category.
5. Assign the new data point to the category for which the number of neighbors is maximum.
6. Our model is ready.
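A minimal sketch of these steps using scikit-learn (the iris dataset and K=5 are assumptions chosen only for illustration):

# K-nearest neighbours sketch (dataset, library and K are assumptions)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Steps 1-2: choose K and use Euclidean distance (the default metric)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)   # lazy learner: fitting only stores the training data

# Steps 3-5: each test point gets the majority class of its 5 nearest neighbours
print("Test accuracy:", knn.score(X_test, y_test))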

Q5. Explain Collaborative Filtering (P4 - Appeared 1 Time) (3-7 Marks)

ANS: Collaborative Filtering:

• Collaborative filtering filters information by using the interactions and data collected by the system from other users. It's based on the idea that people who agreed in their evaluation of certain items are likely to agree again in the future.
• Most collaborative filtering systems apply the so-called similarity index-based technique. In the neighborhood-based approach, a number of users are selected based on their similarity to the active user. Inference for the active user is made by calculating a weighted average of the ratings of the selected users.
• Collaborative-filtering systems focus on the relationship between users and items. The similarity of items is determined by the similarity of the ratings of those items by the users who have rated both items.

There are two classes of Collaborative Filtering:

• User-based, which measures the similarity between target users and other users.
• Item-based, which measures the similarity between the items that target users rate or interact with and other items.
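A toy user-based collaborative-filtering sketch with NumPy; the small rating matrix, the cosine-similarity measure and the prediction rule are all illustrative assumptions rather than anything prescribed by these notes:

# User-based collaborative filtering sketch (data and similarity choice are assumptions)
import numpy as np

# Rows = users, columns = items; 0 means "not rated"
ratings = np.array([[5, 4, 0, 1],
                    [4, 5, 1, 0],
                    [1, 0, 5, 4]], dtype=float)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

active = 0   # the active user
sims = np.array([cosine(ratings[active], ratings[u]) for u in range(len(ratings))])
sims[active] = 0.0   # ignore self-similarity

# Predict the active user's rating of item 2 as a similarity-weighted average
item = 2
rated = ratings[:, item] > 0
pred = (sims[rated] @ ratings[rated, item]) / (sims[rated].sum() + 1e-9)
print("Predicted rating of user 0 for item 2:", round(pred, 2))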


MODULE-3

Q1. Explain Feature Selection (P4 - Appeared 1 Time) (3-7 Marks)

ANS: Feature selection:

• It is the process of reducing the number of input variables when developing a predictive model.
• It is desirable to reduce the number of input variables to both reduce the computational cost of modelling and, in some cases, to improve the performance of the model.

Benefits of feature selection

• The main benefit of feature selection is that it reduces overfitting. By removing extraneous data, it allows the model to focus only on the important features of the data, and not get hung up on features that don't matter.
• Another benefit of removing irrelevant information is that it improves the accuracy of the model's predictions.
• It also reduces the computation time involved to get the model.
• Finally, having a smaller number of features makes your model more interpretable and easy to understand.


• Overall, feature selection is key to being able to predict values with any amount of accuracy.

There are three types of feature selection:

• Wrapper methods (forward, backward, and stepwise selection)
• Filter methods (ANOVA, Pearson correlation, variance thresholding)
• Embedded methods (Lasso, Ridge, Decision Tree)
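A short sketch of a filter-method selection (ANOVA F-test) with scikit-learn; the iris dataset and the choice of keeping k=2 features are assumptions made only for illustration:

# Filter-method feature selection sketch (dataset, library and k are assumptions)
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
print("Original number of features:", X.shape[1])

# Keep the 2 features with the highest ANOVA F-score against the class label
selector = SelectKBest(score_func=f_classif, k=2)
X_reduced = selector.fit_transform(X, y)
print("Selected feature indices:", selector.get_support(indices=True))
print("Reduced shape:", X_reduced.shape)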

Q2. Explain Feature Extraction (P4 - Appeared 1 Time) (3-7 Marks)

ANS: Feature Extraction:

• Feature extraction is a process of dimensionality reduction by which an initial set of raw data is reduced to more manageable groups for processing.
• A characteristic of these large data sets is a large number of variables that require a lot of computing resources to process.
• Feature extraction is the name for methods that select and/or combine variables into features, effectively reducing the amount of data that must be processed, while still accurately and completely describing the original data set.

Usage of Feature Extraction:

• The process of feature extraction is useful when you need to reduce the number of resources needed for processing without losing important or relevant information.
• Feature extraction can also reduce the amount of redundant data for a given analysis. Also, the reduction of the data and the machine's efforts in building variable combinations (features) facilitate the speed of the learning and generalization steps in the machine learning process.

Feature extraction techniques include:

• PCA (Principal Component Analysis)
• ICA (Independent Component Analysis)
• LDA (Linear Discriminant Analysis)
• LLE (Locally Linear Embedding)
• t-SNE
• AE (Autoencoders)

Principal Components Analysis (PCA)

• PCA is one of the most used linear dimensionality reduction techniques.
• When using PCA, we take as input our original data and try to find a combination of the input features which can best summarize the original data distribution, so as to reduce its original dimensions.
• PCA is able to do this by maximizing variances and minimizing the reconstruction error by looking at pairwise distances.
• In PCA, our original data is projected onto a set of orthogonal axes and each of the axes gets ranked in order of importance.
• PCA is an unsupervised learning algorithm, therefore it doesn't care about the data labels but only about variation. This can lead in some cases to misclassification of data.
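A minimal PCA sketch with scikit-learn (the library and the iris dataset are assumptions); it projects the data onto the two orthogonal axes that capture the most variance:

# PCA sketch (dataset and library are assumptions)
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)   # PCA is unsupervised, so labels are unused
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print("Reduced shape:", X_2d.shape)
print("Variance explained by each axis:", pca.explained_variance_ratio_)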

Independent Component Analysis (ICA)

• ICA is a linear dimensionality reduction method that takes as input data a mixture of independent components, and it aims to correctly identify each of them (deleting all the unnecessary noise).
• Two input features can be considered independent if both their linear and non-linear dependence is equal to zero.
• Independent Component Analysis is commonly used in medical applications such as EEG and fMRI analysis to separate useful signals from unhelpful ones.

Linear Discriminant Analysis (LDA)

• LDA is a supervised learning dimensionality reduction technique and Machine Learning classifier.
• LDA aims to maximize the distance between the mean of each class and minimize the spreading within the class itself.
• LDA therefore uses within-class and between-class scatter as measures.
• This is a good choice because maximizing the distance between the means of each class when projecting the data in a lower-dimensional space can lead to better classification results (thanks to the reduced overlap between the different classes).
• When using LDA, it is assumed that the input data follows a Gaussian distribution; therefore applying LDA to non-Gaussian data can possibly lead to poor classification results.


MODULE-4

Q1. Explain Bayesian Learning (P4 - Appeared 1 Time) (3-7 Marks)

ANS: Bayesian Learning:

• Bayesian ML is a paradigm for constructing statistical models based on Bayes' Theorem:

    p(θ|x) = p(x|θ) p(θ) / p(x)

• Generally speaking, the goal of Bayesian ML is to estimate the posterior distribution p(θ|x) given the likelihood p(x|θ) and the prior distribution p(θ).
• The likelihood is something that can be estimated from the training data.
• Maximum Likelihood Estimation (MLE) is an iterative process that updates the model's parameters in an attempt to maximize the probability of the training data x given the model parameters θ.


• Alternatively, we can maximize the posterior distribution, which takes the training data as fixed and determines the probability of any parameter setting θ given that data.
• We call this process Maximum a Posteriori (MAP) estimation.
• It's easier, however, to think about it in terms of the likelihood function.
• By Bayes' Theorem, we can write the posterior as

    p(θ|x) ∝ p(x|θ) p(θ)

• Here we leave out the denominator, p(x), because we are taking the maximization with respect to θ, which p(x) does not depend on. Therefore, we can ignore it in the maximization procedure.
• The key piece of the puzzle which leads Bayesian models to differ from their classical counterparts trained by MLE is the inclusion of the term p(θ).
• We call this the prior distribution over θ.
• The idea is that its purpose is to encode our beliefs about the model's parameters before we've even seen them.
• That's to say, we can often make reasonable assumptions about the "suitability" of different parameter configurations based simply on what we know about the problem domain and the laws of statistics.


• For example, it's pretty common to use a Gaussian prior over the model's parameters.
• This means we assume that they're drawn from a normal distribution having some mean and variance.
• This distribution's classic bell-curved shape consolidates most of its mass close to the mean, while values towards its tails are rather rare.
• It turns out that using these prior distributions and performing MAP is equivalent to performing MLE in the classical sense along with the addition of regularization.
• There's a pretty easy mathematical proof of this fact that we won't go into here, but the gist is that by constraining the acceptable model weights via the prior we're effectively imposing a regularizer.
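As a short worked illustration of that last point (a standard derivation, stated here as an added example rather than something spelled out in these notes): taking logs,

    θ_MAP = argmax_θ [ log p(x|θ) + log p(θ) ]

and with a Gaussian prior p(θ) ∝ exp(−||θ||² / (2σ²)), the second term becomes −||θ||²/(2σ²) plus a constant, so

    θ_MAP = argmax_θ [ log p(x|θ) − λ ||θ||² ],   with λ = 1/(2σ²)

which is exactly MLE plus an L2 (ridge-style) penalty on the weights.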

Q2. Explain Naive Bayes (P4 - Appeared 1 Time) (3-7 Marks)

ANS: Naive Bayes algorithm:

• It is a supervised learning algorithm, which is based on the Bayes theorem and used for solving classification problems.
• It is mainly used in text classification that includes a high-dimensional training dataset.
• The Naive Bayes Classifier is one of the simple and most effective classification algorithms which helps in building fast machine learning models that can make quick predictions.
• It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.

Some popular examples of the Naive Bayes algorithm are:

• Spam filtration
• Sentiment analysis
• Classifying articles

The Naive Bayes algorithm is comprised of two words, Naive and Bayes, which can be described as:

Naive:
• It is called Naive because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence each feature individually contributes to identifying that it is an apple, without depending on the others.

Bayes:
• It is called Bayes because it depends on the principle of Bayes' Theorem.


Advantages of Naive Bayes Classifier:

• Naive Bayes is one of the fast and easy ML algorithms to predict a class of datasets.
• It can be used for Binary as well as Multi-class Classifications.
• It performs well in Multi-class predictions as compared to the other algorithms.
• It is the most popular choice for text classification problems.

Disadvantages of Naive Bayes Classifier:

• Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the relationship between features.

Applications of Naive Bayes Classifier:

• It is used for Credit Scoring.
• It is used in medical data classification.
• It can be used in real-time predictions because the Naive Bayes Classifier is an eager learner.
• It is used in text classification such as Spam filtering and Sentiment analysis.
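A hedged sketch of a Gaussian Naive Bayes classifier with scikit-learn (the iris dataset and the Gaussian variant are assumptions chosen for illustration; text problems would typically use the multinomial variant instead):

# Gaussian Naive Bayes sketch (dataset, library and variant are assumptions)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_train, y_train)   # learns per-class feature distributions
print("Test accuracy:", nb.score(X_test, y_test))
print("Class probabilities for one sample:", nb.predict_proba(X_test[:1]))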


Q3. Explain Bayesian Network (P4 - Appeared 1 Time) (3-7 Marks)

ANS: Bayesian networks:

It consists of two parts:
1. Structure
2. Parameters

• The structure is a directed acyclic graph (DAG) that expresses conditional independencies and dependencies among random variables associated with nodes.
• The parameters consist of conditional probability distributions associated with each node.
• A Bayesian network is a compact, flexible and interpretable representation of a joint probability distribution.
• It is also a useful tool in knowledge discovery, as directed acyclic graphs allow representing causal relations between variables. Typically, a Bayesian network is learned from data.

Applications of Bayesian Networks

Healthcare Industry:

• The Bayesian network is used in the healthcare industry for the detection and prevention of diseases.
• Based on models built, we can find out the likely symptoms and predict whether a person will be diseased or not.
• For instance, if a person has high cholesterol, then there are high chances that the person gets a heart problem. With this information, a person can take preventive measures.

Web Search:

• Bayesian Network models can be used for search accuracy based on user intent.
• Based on the user's intent, these models show things that are relevant to the person.
• For instance, when we search for Python functions most of the time, then the web search model activates our intent and it makes sure to show relevant things to the user.

Mail Spam Filtering:

• Gmail uses Bayesian models to filter mail by reading or understanding the context of the mail.
• For instance, we may have observed spam emails in the spam folder in Gmail. So, how are these emails classified as spam? Using the Bayesian model, which observes the mail and, based on previous experience and the probability, classifies the mail as spam or not.

Biomonitoring:

• Bayesian models are used to quantify the concentration of chemicals in blood and human tissues.
• These use indicators to measure blood and urine.
• To find the level of ECCs, one can conduct biometric studies.

Information Retrieval:

• Models can be used for retrieving information from the database.
• During this process, we refine our problem multiple times.
• It is used to reduce information overload.


MODULE-5

Q1. Explain Logistic Regression (P4 - Appeared 1 Time) (3-7 Marks)

ANS: Logistic regression:

• It is a supervised learning classification algorithm used to predict the probability of a target variable. The nature of the target or dependent variable is dichotomous, which means there would be only two possible classes.
• In simple words, the dependent variable is binary in nature, having data coded as either 1 (stands for success/yes) or 0 (stands for failure/no).
• Mathematically, a logistic regression model predicts P(Y=1) as a function of X. It is one of the simplest ML algorithms that can be used for various classification problems such as spam detection, diabetes prediction, cancer detection etc.

Types of Logistic Regression

• Generally, logistic regression means binary logistic regression having binary target variables, but there can be two more categories of target variables that can be predicted by it.
• Based on the number of categories, logistic regression can be divided into the following types:

Binary or Binomial
• In such a kind of classification, a dependent variable will have only two possible types, either 1 and 0.
• For example, these variables may represent success or failure, yes or no, win or loss etc.

Multinomial
• In such a kind of classification, the dependent variable can have 3 or more possible unordered types, or types having no quantitative significance.
• For example, these variables may represent "Type A" or "Type B" or "Type C".

Ordinal
• In such a kind of classification, dependent variables can have 3 or more possible ordered types, or types having a quantitative significance.
• For example, these variables may represent "poor" or "good", "very good", "Excellent", and each category can have scores like 0, 1, 2, 3.
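A minimal binary logistic regression sketch with scikit-learn, predicting P(Y=1) as a function of X (the breast-cancer dataset, the feature scaling step and the library itself are assumptions made for illustration):

# Binary logistic regression sketch (dataset and library are assumptions)
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)   # binary target: 0 or 1
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling the features first helps the solver converge
clf = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
print("P(Y=1) for the first test sample:", clf.predict_proba(X_test[:1])[0, 1])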
Q2. Explain Support Vector Machine (P4 - Appeared 1 Time) (3-7 Marks)

ANS: Support Vector Machine or SVM:

• It is one of the most popular Supervised Learning algorithms, which is used for Classification as well as Regression problems.
• However, primarily, it is used for Classification problems in Machine Learning.
• The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes, so that we can easily put the new data point in the correct category in the future. This best decision boundary is called a hyperplane.
• SVM chooses the extreme points/vectors that help in creating the hyperplane.
• These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine.
• Consider the diagram below, in which there are two different categories that are classified using a decision boundary or hyperplane:


Figure: Support Vector Machine Algorithm (a maximum-margin hyperplane separating two classes, with the support vectors lying on the positive and negative hyperplanes)

Types of SVM

SVM can be of two types:

• Linear SVM: Linear SVM is used for linearly separable data, which means if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
• Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means if a dataset cannot be classified by using a straight line, then such data is termed non-linear data and the classifier used is called a Non-linear SVM classifier.
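A short sketch comparing a linear and a non-linear (RBF-kernel) SVM with scikit-learn; the two-moons dataset is an assumption, chosen only because it is not linearly separable:

# Linear vs non-linear SVM sketch (dataset and library are assumptions)
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf").fit(X_train, y_train)   # kernel trick for non-linear data

print("Linear SVM accuracy:", linear_svm.score(X_test, y_test))
print("RBF SVM accuracy:", rbf_svm.score(X_test, y_test))
print("Support vectors per class (RBF):", rbf_svm.n_support_)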

Q3. Explain the Dual Formulation (P4 - Appeared 1 Time) (3-7 Marks)

ANS: Dual Form of SVM

• The Lagrange problem is typically solved using the dual form.
• The duality principle says that an optimization problem can be viewed from two different perspectives.
• The first one is the primal form, which is a minimization problem, and the other one is the dual problem, which is a maximization problem.

The Lagrange formulation of SVM is

    L(w, b, α) = (1/2)||w||² − Σ_{i=1..n} α_i [y_i (wᵀx_i + b) − 1]
               = (1/2) wᵀw − Σ_{i=1..n} α_i [y_i (wᵀx_i + b) − 1]        ... (7.1)

    subject to α_i ≥ 0 for all i = 1, 2, ..., n

To solve the minimization problem we take the partial derivatives with respect to w as well as b:

    ∂L/∂w = w − Σ_{i=1..n} α_i y_i x_i = 0    ⇒    w = Σ_{i=1..n} α_i y_i x_i

    ∂L/∂b = − Σ_{i=1..n} α_i y_i = 0

Substituting these back into equation 7.1 (expanding − Σ_i α_i [y_i (wᵀx_i + b) − 1] = − wᵀ Σ_i α_i y_i x_i − b Σ_i y_i α_i + Σ_i α_i) gives

    W(α; b) = Σ_{i=1..n} α_i − b Σ_{i=1..n} y_i α_i − (1/2) Σ_{i=1..n} Σ_{j=1..n} α_i α_j y_i y_j (x_iᵀ x_j)

But Σ_{i=1..n} α_i y_i = 0, and so the final equation is

    W(α) = Σ_{i=1..n} α_i − (1/2) Σ_{i=1..n} Σ_{j=1..n} α_i α_j y_i y_j (x_iᵀ x_j)

The following optimization problem is called the dual problem:

    maximize W(α)
    subject to α_i ≥ 0 for all i = 1, 2, ..., n  and  Σ_{i=1..n} α_i y_i = 0

Important observations from the dual form of SVM are:

• For every x_i we have a corresponding α_i.
• The dual form only involves inner products (x_iᵀ x_j).
• α_i is greater than zero only for support vectors, and for all other points it is 0. So while predicting for a query point, only the support vectors matter.

Q4. Explain Nonlinear SVM and Kernel Function OR Give the difference between Linear and Nonlinear SVM (P4 - Appeared 1 Time) (3-7 Marks)

ANS:

No. | Linear SVM                                                | Non-Linear SVM
1.  | It can be easily separated with a linear line.            | It cannot be easily separated with a linear line.
2.  | Data is classified with the help of a hyperplane.         | We use kernels to make non-separable data into separable data.
3.  | Data can be easily classified by drawing a straight line. | We map data into high-dimensional space to classify.
Kernel Function:

• A kernel is a function used in SVM to help solve problems.
• Kernels provide shortcuts to avoid complex calculations.
• The amazing thing about kernels is that we can go to higher dimensions and perform smooth calculations with their help.

Working of Kernel Functions

• Kernels are a way to solve non-linear problems with the help of linear classifiers. This is known as the kernel trick method.
• The kernel functions are used as parameters in the SVM codes. They help to determine the shape of the hyperplane and decision boundary.
• The value can be any type of kernel from linear to polynomial.
• If the value of the kernel is linear, then the decision boundary would be linear and two-dimensional.
• These kernel functions also help in giving decision boundaries for higher dimensions.


• Overfitting happens when there are more feature sets than sample sets in the data.
• We can solve the problem by either increasing the data or by choosing the right kernel.
• There are kernels like RBF that work well with smaller data as well.
• But RBF is a universal kernel, and using it on smaller datasets might increase the chances of overfitting.

Q5. Define and explain SVM and its Solution to the Dual Problem OR Explain SVM advantages, disadvantages and limitations (P4 - Appeared 1 Time) (3-7 Marks)

ANS: SVM Advantages

• SVMs are very good when we have no idea about the data.
• They work well even with unstructured and semi-structured data like text, images and trees.
• The kernel trick is a real strength of SVM. With an appropriate kernel function, we can solve any complex problem.
• Unlike neural networks, SVM is not solved for local optima.
• It scales relatively well to high-dimensional data.
• SVM models generalize well in practice; the risk of overfitting is less in SVM.
• SVM is always compared with ANN. When compared to ANN models, SVMs give better results.
SVM Disadvantages

• Choosing a "good" kernel function is not easy.
• Long training time for large datasets.
• It is difficult to understand and interpret the final model, variable weights and individual impact.
• Since the final model is not so easy to see, we cannot do small calibrations to the model, hence it's tough to incorporate our business logic.
• The SVM hyperparameters are the cost C and gamma. It is not that easy to fine-tune these hyper-parameters, and it is hard to visualize their impact.

SVM Applications

• Protein Structure Prediction
• Intrusion Detection
• Handwriting Recognition
• Detecting Steganography in digital images
• Breast Cancer Diagnosis
• Almost all the applications where ANN is used


MODULE-6

Q1. Explain Neural Network and Multilayer Neural Network (P4 - Appeared 1 Time) (3-7 Marks)

ANS: Neural networks:

• They are also known as artificial neural networks (ANNs) or simulated neural networks (SNNs), which are a subset of machine learning and are at the heart of deep learning algorithms.
• Their name and structure are inspired by the human brain, mimicking the way that biological neurons signal to one another.
• Artificial neural networks (ANNs) are composed of node layers, containing an input layer, one or more hidden layers, and an output layer.
• Each node, or artificial neuron, connects to another and has an associated weight and threshold.
• If the output of any individual node is above the specified threshold value, that node is activated, sending data to the next layer of the network.
• Otherwise, no data is passed along to the next layer of the network.
Multilayer Perceptron

• A multi-layer neural network contains more than one layer of artificial neurons or nodes.
• They differ widely in design. It is important to note that while single-layer neural networks were useful early in the evolution of AI, the vast majority of networks used today have a multi-layer model.
• Multi-layer neural networks can be set up in numerous ways. Typically, they have at least one input layer, which sends weighted inputs to a series of hidden layers, and an output layer at the end.
• These more sophisticated setups are also associated with nonlinear builds using sigmoids and other functions to direct the firing or activation of artificial neurons.
• While some of these systems may be built physically, with physical materials, most are created with software functions that model neural activity.
• Convolutional neural networks (CNNs), so useful for image processing and computer vision, as well as recurrent neural networks, deep networks and deep belief systems, are all examples of multi-layer neural networks.
• CNNs, for example, can have dozens of layers that work sequentially on an image.
• All of this is central to understanding how modern neural networks function.

Advantages of Multi-Layer Perceptron

• Used for deep learning (due to the presence of dense fully connected layers and backpropagation).

Disadvantages of Multi-Layer Perceptron

• Comparatively complex to design and maintain.
• Comparatively slow (depends on the number of hidden layers).
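A minimal multi-layer perceptron sketch with scikit-learn (the digits dataset, a single 32-unit hidden layer and the iteration limit are assumptions chosen for illustration):

# Multi-layer perceptron sketch (dataset, library and layer sizes are assumptions)
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Input layer -> one hidden layer of 32 ReLU units -> softmax output over the 10 digits
mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0)
mlp.fit(X_train, y_train)
print("Test accuracy:", mlp.score(X_test, y_test))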

Q2. Explain Neural Network and Backpropagation Algorithm (P4 - Appeared 1 Time) (3-7 Marks)

ANS: Backpropagation Algorithm

• Backpropagation (backward propagation) is an important mathematical tool for improving the accuracy of predictions in data mining and machine learning.
• Essentially, backpropagation is an algorithm used to calculate derivatives quickly.
• Artificial neural networks use backpropagation as a learning algorithm to compute a gradient descent with respect to weights.
• Desired outputs are compared to achieved system outputs, and then the systems are tuned by adjusting connection weights to narrow the difference between the two as much as possible.
• The algorithm gets its name because the weights are updated backwards, from output towards input.
• The difficulty of understanding exactly how changing weights and biases affects the overall behaviour of an artificial neural network was one factor that held back wider application of neural networks, arguably until the early 2000s when computers provided the necessary insight.
• Today, backpropagation algorithms have practical applications in many areas of artificial intelligence (AI), including optical character recognition (OCR), natural language processing (NLP) and image processing.
• Because backpropagation requires a known, desired output for each input value in order to calculate the loss function gradient, it is usually classified as a type of supervised machine learning.
• Along with classifiers such as Naive Bayesian filters and decision trees, the backpropagation algorithm has emerged as an important part of machine learning applications that involve predictive analytics.
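A from-scratch sketch of backpropagation on a tiny 2-4-1 network trained on XOR with NumPy; the architecture, learning rate, epoch count and squared-error loss are all illustrative assumptions:

# Backpropagation sketch on XOR (all design choices here are assumptions)
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))   # input -> hidden weights
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))   # hidden -> output weights
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5

for epoch in range(10000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward pass: propagate the squared-error gradient from output towards input
    d_out = (out - y) * out * (1 - out)   # gradient at the output pre-activation
    d_h = (d_out @ W2.T) * h * (1 - h)    # gradient at the hidden pre-activation

    # Gradient-descent updates (weights adjusted backwards, output layer first)
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0, keepdims=True)

print("Predictions after training:", out.ravel().round(3))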


Q3. Explain Deep Neural Network (P4 - Appeared 1 Time) (3-7 Marks)

ANS: Deep Neural Networks (DNNs):

• Deep Neural Networks (DNNs) are networks in which each layer can perform complex operations such as representation and abstraction that make sense of images, sound, and text.
• Considered the fastest-growing field in machine learning, deep learning represents a truly disruptive digital technology, and it is being used by increasingly more companies to create new business models.
• The neural network needs to learn all the time to solve tasks in a more qualified manner, or even to use various methods to provide a better result.
• When it gets new information in the system, it learns how to act accordingly to a new situation.
• Learning becomes deeper when the tasks you solve get harder.
• A deep neural network represents the type of machine learning where the system uses many layers of nodes to derive high-level functions from input information.
• It means transforming the data into a more creative and abstract component.
• In order to understand the result of deep learning better, let's imagine a picture of an average man.
• Although you have never seen this picture or his face and body before, you will always identify that it is a human and differentiate it from other creatures.
• This is an example of how the deep neural network works. Creative and analytical components of information are analyzed and grouped to ensure that the object is identified correctly.
• These components are not brought to the system directly, thus the ML system has to modify and derive them.
• A deep neural network is beneficial when you need to replace human labour with autonomous work without compromising its efficiency.
• Deep neural network usage can find various applications in real life. For example, a Chinese company, SenseTime, created an automatic face recognition system to identify criminals, which uses real-time cameras to find an offender in the crowd.
• Nowadays, it has become a popular practice in police and other governmental entities.


MODULE-7

Q1. Explain Computational Learning (P4 - Appeared 1 Time) (3-7 Marks)

ANS: Computational Learning Theory:

• It is concerned with supervised learning, which is a type of inductive learning in the field of Machine Learning that maps an input to an output on the basis of existing input-output pairs.
• It provides a formal framework for accurately formulating and answering questions about the performance of various learning algorithms, allowing for thorough comparisons of both the predictive capacity and the computational efficiency of alternative learning algorithms.
• Computational learning theory provides a formal framework in which it is possible to precisely formulate and address questions regarding the performance of different learning algorithms.
• Thus, careful comparisons of both the predictive power and the computational efficiency of competing learning algorithms can be made.


Q2. Explain Finite Hypothesis Space (P4 - Appeared 1 Time) (3-7 Marks)

ANS: Finite Hypothesis Space:

• A hypothesis is a function on the sample space, giving a value for each point in the sample space.
• If the possible values are {0,1}, then we can identify a hypothesis with the subset of those points that are given value 1.
• The error of a hypothesis is the probability of the subset where the hypothesis disagrees with the true hypothesis.
• Learning from examples is the process of making independent random observations and eliminating those hypotheses that disagree with your observations.
• Suppose we have a finite set of hypotheses, H, and that we make m observations.
• If h is a hypothesis with error greater than ε, then the probability that it will be consistent with a given observation is less than 1 − ε, and the probability that it will be consistent with all m observations is less than (1 − ε)^m, which is less than exp(−εm).
• Therefore the total probability that some hypothesis with error greater than ε remains after m observations is less than |H| exp(−εm).
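A small worked consequence of that bound (a standard PAC-style rearrangement, added here as an illustration rather than something stated in the notes): if we want this failure probability to be at most δ, we need

    |H| exp(−εm) ≤ δ    ⇒    m ≥ (1/ε) (ln|H| + ln(1/δ))

so the number of training examples required grows only logarithmically with the size of the finite hypothesis space.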



Q3. Explain the VC Dimension (P4 - Appeared 1 Time) (3-7 Marks)

ANS: VC dimension:

• It is a formal measure of bias which has played an important role in mathematical work on learnability.
• The VC dimension of a representation system is defined to be the maximum number of datapoints that can be separated (i.e., grouped) in all possible ways.
• Another way of saying this is to describe it as the most datapoints that can be 'shattered' by the representation. More powerful representations are able to shatter larger sets of datapoints; these have higher VC dimension.
• Less powerful representations can only shatter smaller sets of datapoints; these then have lower VC dimension.

VC dimension as a definition of bias

• VC dimension seems to focus on a particularly demanding representation task, i.e., representing all possible ways of grouping datapoints.
• From the intuitive point of view, this makes it less than ideal as a general measure of bias strength.


• We could have a system with very low VC dimension that is actually quite weakly biased.
• This would happen, for example, if the system was able to almost shatter large datasets, while only being able to fully shatter very small ones.

VC dimension in mathematics

• VC dimension is useful in formal analysis of learnability, however. This is because the VC dimension provides an upper bound on generalization error.
• The mathematics of this is quite complex. The basic idea is that reducing VC dimension has the effect of eliminating potential generalization errors.
• So if we have some notion of how many generalization errors are possible, the VC dimension gives an indication of how many could be made in any given context.
• The subfield of Computational Learning Theory is concerned with deriving VC-dimension bounds in different training scenarios.
Q4. Define and explain Ensembles OR Explain Bagging and Boosting (P4 - Appeared 1 Time) (3-7 Marks)

ANS: Ensemble:

It is a Machine Learning concept in which the idea is to train

multiple models using the same learning algorithm.

Ensembles belong to a bigger group of methods, called multi-classifiers, where a set of hundreds or thousands of learners with a common objective are fused together to solve the problem.

The second group of multi-classifiers contains the hybrid methods. They use a set of learners too, but these can be trained using different learning techniques.

Stacking is the most well-known hybrid method.

The main causes of error in learning are noise, bias and variance. Ensembles help to minimize these factors.

These methods are designed to improve the stability and accuracy

of Machine Learning algorithms.



Combinations of multiple classifiers decrease variance, especially in the case of unstable classifiers, and may produce a more reliable classification than a single classifier.

Types of Ensemble Methods

BAGGing, or Bootstrap AGGregating:

BAGGing gets its name because it combines Bootstrapping and Aggregation to form one ensemble model.

Given a sample of data, multiple bootstrapped subsamples are

pulled.
A Decision Tree is formed on each of the bootstrapped subsamples.
After each subsample Decision Tree has been formed, an
algorithm
is used to aggregate over the Decision Trees to form the most

efficient predictor.
Random Forest Models:

Random Forest Models can be thought of as BAGGing, with a slight tweak.

When deciding where to split and how to make decisions, BAGGed Decision Trees have all of the features at their disposal to choose from.

Therefore, although the bootstrapped samples may be slightly


different, the data is largely going to break off at the same features
throughout each model.

In contrast, Random Forest models decide where to split based on a

random selection of features.

Rather than splitting on similar features at each node throughout, Random Forest models implement a level of differentiation because each tree will split based on different features.

This level of differentiation provides a greater ensemble to aggregate over, thereby producing a more accurate predictor.


Similar to BAGGing, bootstrapped subsamples are pulled from a larger dataset.

A decision tree is formed on each subsample; however, each decision tree is split on a different random subset of the features.
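As a hedged illustration of the two methods described above (assuming scikit-learn is available; the dataset is synthetic and purely for demonstration, not part of the original notes), a minimal sketch might look like this:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Synthetic data, used only so the example is self-contained.
    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    # BAGGing: decision trees fitted on bootstrapped subsamples,
    # predictions aggregated by majority vote.
    bagging = BaggingClassifier(n_estimators=50, random_state=0)

    # Random Forest: the same idea, but each split also considers only a
    # random subset of the features, which decorrelates the trees.
    forest = RandomForestClassifier(n_estimators=50, max_features="sqrt", random_state=0)

    print("Bagging accuracy:      ", cross_val_score(bagging, X, y, cv=5).mean())
    print("Random forest accuracy:", cross_val_score(forest, X, y, cv=5).mean())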



MODULE-8

Q1. Explain Clustering (P4 - Appeared 1 Time) (3-7 Marks)

ANS: Clustering
These algorithms take the data and, using some sort of similarity metric, form groups of similar data points; later these groups can be used in various business processes like information retrieval, pattern recognition, image processing, data compression, bioinformatics, etc.
In the Machine Learning process for Clustering, as mentioned

above, a distance-based similarity metric plays a pivotal role in

deciding the clustering.

The various types of clustering are:

Connectivity-based Clustering (Hierarchical Clustering)


Centroids-based Clustering (Partitioning methods)
Distribution-based Clustering

Density-based Clustering (Model-based methods)


Fuzzy Clustering
Constraint-based (Supervised Clustering)



Connectivity-Based Clustering (Hierarchical Clustering)

Hierarchical Clustering is a method of unsupervised machine learning clustering where it begins with a predefined top-to-bottom hierarchy of clusters. It then proceeds to perform a decomposition of the data objects based on this hierarchy, hence obtaining the clusters.

This method follows two approaches based on the direction of

progress, i.e., whether it is the top-down or bottom-up flow of

creating clusters. These are the Divisive Approach and the


Agglomerative Approach respectively.
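A minimal sketch of the agglomerative (bottom-up) direction, assuming SciPy is available and using made-up toy points (illustrative only, not from the notes):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Toy points, illustrative only.
    X = np.array([[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [5.1, 4.8], [9.0, 9.1]])

    # Agglomerative direction: repeatedly merge the closest clusters.
    Z = linkage(X, method="average")

    # Cut the resulting hierarchy at a distance threshold to obtain flat clusters.
    labels = fcluster(Z, t=2.0, criterion="distance")
    print(labels)   # e.g. three clusters: the two tight pairs and the lone point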
Centroid Based Clustering
Centroid-based clustering is considered one of the simplest clustering algorithms, yet one of the most effective ways of creating clusters and assigning data points to them.

The intuition behind centroid based clustering is that a cluster is

characterized and represented by a central vector and data points

that are in close proximity to these vectors are assigned to the

respective clusters.

These groups of clustering methods iteratively measure the

distance between the clusters and the characteristic centroids

using various distance metrics.



These are typically the Euclidean distance, Manhattan distance or Minkowski distance.
The major setback here is that we should either intuitively or scientifically (e.g., using the Elbow Method) define the number of clusters, "k", to begin the iteration of any clustering machine learning algorithm and start assigning the data points.
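A hedged sketch of the Elbow Method mentioned above (assuming scikit-learn; the blob data is synthetic and only for illustration): run k-Means for increasing k and watch where the within-cluster sum of squares stops dropping sharply.

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Synthetic blobs with a known number of groups, illustrative only.
    X, _ = make_blobs(n_samples=300, centers=4, random_state=1)

    for k in range(1, 9):
        inertia = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X).inertia_
        print(k, round(inertia, 1))   # inertia drops sharply until k reaches the true count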


Density-based Clustering (Model-based Methods)
Density-based algorithms can give us clusters with arbitrary shapes, clusters without any limitation on cluster size, and clusters that contain the maximum level of homogeneity by ensuring the same level of density within them; these methods also account for outliers and noisy data.

Distribution-Based Clustering

The distribution models of clustering are most closely related to statistics, as they closely relate to the way datasets are generated and arranged using random sampling principles, i.e., by fetching data points from one form of distribution.

Clusters can then easily be defined as objects that are most likely to belong to the same distribution.

A major drawback of density- and boundary-based approaches is the need to specify the clusters a priori for some of the algorithms, and mainly the definition of the shape of the clusters for most of the algorithms.



There is at least one tuning hyper-parameter which needs to be selected, and not only is that non-trivial, but any inconsistency in it would lead to unwanted results.

Distribution-based clustering has a distinct advantage over the proximity- and centroid-based clustering methods in terms of flexibility, correctness and the shape of the clusters formed.

The major problem, however, is that these clustering methods work well only with synthetic or simulated data, or with data where most of the data points certainly belong to a predefined distribution; if not, the results will overfit.

Fuzzy Clustering
Fuzzy clustering can be used with datasets where the variables have a high level of overlap.

It is a strongly preferred algorithm for Image Segmentation, especially in bioinformatics, where identifying overlapping gene codes makes it difficult for generic clustering algorithms to differentiate between the image's pixels, so they fail to perform a proper clustering.



Q2. Explain K-means Clustering (P4 - Appeared 1 Time) (3-7 Marks)

ANS: k-Means Clustering

k-Means is one of the most widely used and perhaps the simplest

unsupervised algorithms to solve clustering problems.

Using this algorithm, we classify a given data set through a certain

number of predetermined clusters or "k" clusters.

Each cluster is assigned a designated cluster centre, and the centres are placed as far away from each other as possible.

Subsequently, each point gets associated with the nearest centroid, till no point is left unassigned.


Once it is done, the centres are re-calculated and the above steps

are repeated.

The algorithm converges at a point where the centroids cannot

move any further.

This algorithm aims to minimize an objective function called the squared error function F(V):

F(V) = Σ_{i=1}^{C} Σ_{j=1}^{C_i} ||x_j − v_i||²

Where,

||x_j − v_i|| is the (Euclidean) distance between data point x_j and centroid v_i.

C_i is the count of data points in cluster i. C is the number of cluster centroids.
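A minimal, illustrative sketch of k-Means in practice (assuming scikit-learn; the blob data below is synthetic and not from the notes):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Synthetic data for illustration.
    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

    print(km.cluster_centers_)   # the converged centroids v_i
    print(km.inertia_)           # the minimized objective F(V) (sum of squared distances)
    print(km.labels_[:10])       # cluster assignments of the first 10 points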

Advantages:
Can be applied to any form of data, as long as the data has numerical (continuous) entities.

Much faster than other algorithms.

Easy to understand and interpret.

Drawbacks:

Fails for non-linear data.

It requires us to decide on the number of clusters before we start

the algorithm - where the user needs to use additional

mathematical methods and also heuristic knowledge to verify the

correct number of centres.

This cannot work for Categorical data.

Cannot handle outliers.

Application Areas:

Document clustering - a major application area, e.g., segmenting text-matrix data like DTM, TF-IDF, etc.

Banking and insurance fraud detection, where the majority of the columns represent a financial figure (continuous data).

Image segmentation.



Q3. Explain Agglomerative Hierarchical Clustering (P4 - Appeared 1 Time) (3-7 Marks)

ANS: AGNES starts by considering the fact that each data point has its own cluster, i.e., if there are n data rows, then the algorithm begins with n clusters initially.
Then, iteratively, the clusters that are most similar - again based on the distances, as measured in DIANA - are combined to form a larger cluster.

The iterations are performed until we are left with one huge cluster

that contains all the data-points.

Implementation:
In R, we make use of the agnes() function from the cluster package (cluster::agnes()) or the built-in hclust() function from the native stats package. In Python, the implementation can be found in the scikit-learn package via the AgglomerativeClustering function inside the cluster module (sklearn.cluster.AgglomerativeClustering).
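A hedged sketch of the scikit-learn route mentioned above, using made-up toy data (illustrative only):

    from sklearn.cluster import AgglomerativeClustering
    from sklearn.datasets import make_blobs

    # Toy data, illustrative only.
    X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

    # 'ward' linkage merges, at each step, the pair of clusters whose union
    # least increases within-cluster variance (the bottom-up AGNES direction).
    agg = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(X)

    print(agg.labels_[:10])   # cluster label assigned to each of the first 10 points
    print(agg.n_clusters_)    # number of clusters produced (here, the requested 3)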

Advantages:
No prior knowledge about the number of clusters is needed

although the user needs to define a threshold for divisions.

Easy to implement across various forms of data, and known to provide robust results for data generated via various sources. Hence it has a wide application area.

Disadvantages:
The cluster division (DIANA) or combination (AGNES) is really strict
and once performed, it cannot be undone and re-assigned in

subsequent iterations or reruns.


It has a high time complexity, on the order of O(n² log n) for n data points, hence it cannot be used for larger datasets.

Cannot handle outliers and noise

Application areas:

Widely used in DNA sequencing to analyse the evolutionary history

and the relationships among biological entities (Phylogenetics).


Identifying fake news by clustering the news article corpus, by

assigning the tokens or words into these clusters and marking out

suspicious and sensationalized words to get possible faux words.

Personalization and targeting in marketing and sales.

