Brainheaters Notes: SERIES 313-2018 (A.Y.)
BrainheatersT" LLC
Brainheaters Notes
IML Semester-7
(3-7 Marks)
ANS: Supervised learning trains the algorithm on labeled data.
The training dataset is a smaller part of a bigger dataset and serves to give the algorithm a basic idea of how the problem works and the relationship between the input and the output.
The training dataset is also very similar to the final dataset in its characteristics, so the patterns it teaches carry over to the data later processed by the program.
However, unsupervised learning does not have labels to work off of, which is the key difference from supervised learning algorithms.
Reinforcement Learning
Reinforcement learning features an algorithm, an interpreter and a reward system.
In every iteration of the algorithm, the output result is given to the interpreter, which decides whether the outcome is favorable or not.
If the solution is not exact, the interpreter returns not an absolute value but a percentage value describing how close the output is to the desired one.
The higher this percentage value is, the more reward is given to the algorithm.
Thus, the program is trained to give the best possible solution for the best possible reward.
determine the best possible hypothesis (only one) which would best describe the relationship between the target and the input features.
E is the set of training examples, i.e., values for the input features together with the corresponding target output.
actually the best. Here, consistent means that the hypothesis of the learner yields correct outputs for all of the examples that have been given to the algorithm.
Types:
The following is a list of common inductive biases in machine learning algorithms.
Maximum conditional independence: if the hypothesis can be cast in a Bayesian framework, try to maximize conditional independence. This is the bias used in the Naive Bayes classifier.
Minimum features: unless there is good evidence that a feature is useful, it should be deleted. This is NOT what Occam's razor says; simpler models are merely preferred, not assumed to be correct. This is the assumption behind feature selection algorithms.
Nearest neighbors: assume that most of the cases in a small neighborhood in feature space belong to the same class. Given a case for which the class is unknown, guess that it belongs to the same class as the majority in its immediate neighborhood.
(3-7 Marks)
ANS: Evaluation:
Model evaluation aims to estimate the generalization accuracy of a model on future, unseen data.
Holdout
The purpose of holdout evaluation is to test a model on different data than it was trained on, which gives an unbiased estimate of learning performance.
In this method, the dataset is randomly divided into three subsets: training, validation and test sets, used to fit, tune and finally evaluate models.
If a model fits the training set much better than it fits the test set, overfitting is probably the cause.
Cross-Validation
In cross-validation, the dataset is partitioned into k subsets of roughly equal size. The procedure is repeated k times, such that each time one of the k subsets is used as the test set/validation set and the other k-1 subsets are put together to form a training set.
Methods of Cross-Validation
Validation Set Approach
In this approach, we reserve 50% of the dataset for validation and train on the remaining 50%. The major drawback is that, since we train a model on only 50% of the dataset, it may be possible that the remaining 50% of the data contains some important information which we are leaving out while training our model, i.e., higher bias.
LOOCV (Leave One Out Cross Validation)
In this method, we perform training on the whole data-set but leave out only one data-point of the available data-set, and then iterate for each data-point. It has some advantages as well as disadvantages.
An advantage of using this method is that we make use of all data points, and hence it is low bias.
The major drawback of this method is that it leads to higher variation in the testing model, as we test against a single data point.
If that data point is an outlier, it can lead to higher variation.
Another drawback is that it takes a lot of execution time, as it iterates once per data point.
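To make this concrete, here is a minimal LOOCV sketch using scikit-learn (assumed available); the iris dataset and logistic-regression model are illustrative placeholders, not from the notes:

```python
# A minimal LOOCV sketch: one fold per data point, so the model is
# trained n times on n-1 points each (low bias, but slow).
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("LOOCV accuracy:", scores.mean())
```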
MODULE-2
Linear regression is a statistical method that is used for predictive analysis.
Linear regression makes predictions for continuous/real or numeric variables.
[Figure: data points scattered around a line of regression; x-axis: Independent Variable X, y-axis: Dependent Variable Y]
y = a0 + a1x + ε
Here,
Y = Dependent Variable (Target Variable)
X = Independent Variable (Predictor Variable)
a0 = intercept of the line (gives an additional degree of freedom)
a1 = linear regression coefficient (scale factor applied to each input value)
ε = random error
The values for the x and y variables are training datasets for the Linear Regression model representation.
Linear regression can be further divided into two types of algorithm: Simple Linear Regression (a single independent variable) and Multiple Linear Regression (more than one independent variable).
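A minimal sketch of fitting y = a0 + a1x with scikit-learn; the synthetic data and the true coefficients (a0 = 2, a1 = 3) are illustrative assumptions:

```python
# Fit a simple linear regression and recover intercept a0 and slope a1.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))             # independent variable x
y = 2.0 + 3.0 * X[:, 0] + rng.normal(0, 1, 50)   # a0=2, a1=3 plus random error

model = LinearRegression().fit(X, y)
print("a0 (intercept):", model.intercept_)
print("a1 (coefficient):", model.coef_[0])
```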
In a Decision tree, there are two kinds of nodes: the Decision Node and the Leaf Node.
Decision nodes are used to make any decision and have multiple branches, whereas leaf nodes are the outputs of those decisions and do not contain any further branches; together they form a tree-like structure.
In order to build a tree, we use the CART algorithm, which stands for Classification and Regression Tree algorithm.
[Figure: decision tree with a root node branching into decision nodes and leaf nodes; one branch forms a sub-tree]
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root node of the tree and proceeds as follows:
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using the Attribute Selection Measure (ASM).
Step-3: Divide S into subsets that contain possible values for the best attributes.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where you cannot further classify the nodes; call the final node a leaf node, whose class is returned.
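A minimal sketch of this CART procedure via scikit-learn's DecisionTreeClassifier (which implements an optimized CART); the iris dataset and depth limit are illustrative assumptions:

```python
# Build a small CART tree and print its decision and leaf nodes.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
# 'gini' is one common Attribute Selection Measure; 'entropy' is another.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(X, y)
print(export_text(tree))  # textual view of decision nodes and leaf nodes
```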
(3-7 Marks)
ANS: The K-NN algorithm assumes the similarity between the new case/data and the available cases, and puts the new case into the category that is most similar to the available categories.
The K-NN algorithm stores all the available data and classifies a new data point based on the similarity. This means that when new data appears, it can be easily classified into a well-suited category by using the K-NN algorithm.
The K-NN algorithm can be used for Regression as well as for Classification, but mostly it is used for Classification problems.
For example, suppose we have an image of a creature that looks similar to a cat or a dog. K-NN will compare the features of the new data set to the cat and dog images, and based on the most similar features it will put it in either the cat or the dog category.
The K-NN working can be explained on the basis of the below algorithm:
1. Select the number K of the neighbors.
2. Calculate the Euclidean distance of the K number of neighbors.
3. Take the K nearest neighbors as per the calculated Euclidean distance.
4. Among these K neighbors, count the number of the data points in each category.
5. Assign the new data points to that category for which the number of neighbors is maximum.
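A minimal sketch of these steps with scikit-learn; K = 5, the Euclidean metric and the iris dataset are illustrative assumptions:

```python
# K-NN is a lazy learner: fit() just stores the data, and prediction
# compares a new point to its K nearest stored neighbors.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)
print("accuracy:", knn.score(X_test, y_test))
```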
Collaborative filtering works from ratings provided by users and data collected by the system from other users. It's based on the idea that people who agreed in their evaluation of certain items in the past are likely to agree again; the similarity between two items is computed from the similarity of the ratings of those items by the users who have rated both items.
Feature extraction is a process of dimensionality reduction by which an initial set of raw data is reduced to more manageable groups for processing.
• Feature extraction can also reduce the amount of redundant data for a given analysis. Also, the reduction of the data makes the learning and generalization steps of the machine learning process faster.
• PCA (Principal Component Analysis)
• ICA (Independent Component Analysis)
• LDA (Linear Discriminant Analysis)
• LLE (Locally Linear Embedding)
• t-SNE (t-distributed Stochastic Neighbor Embedding)
• AE (Autoencoders)
• PCA is one of the most used linear dimensionality reduction techniques.
• When using PCA, we take as input our original data and try to find a combination of the input features which can best summarize the original data distribution, so that we can reduce its original dimensions.
• PCA is able to do this by maximizing variances and minimizing the reconstruction error.
• PCA is unsupervised: it does not care about the data labels but only about variation. This can lead in some cases to misclassification of data.
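A minimal PCA sketch: project the data onto the directions of maximum variance, ignoring labels as noted above. The iris dataset and the choice of 2 components are illustrative assumptions:

```python
# Reduce 4 input features to 2 principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)  # labels deliberately unused
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print("explained variance ratio:", pca.explained_variance_ratio_)
```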
• LDA aims to maximize the distance between the mean of each class and minimize the spread within the class itself, using within-class and between-class scatter measures.
• This is a good choice because maximizing the distance between the class means when projecting the data into a lower-dimensional space can lead to better classification results.
• When using LDA, it is assumed that the input data follows a Gaussian distribution; applying LDA to non-Gaussian data can lead to poor classification results.
p(θ|x) = p(x|θ) p(θ) / p(x)
• Generally speaking, the goal of Bayesian ML is to estimate the posterior distribution p(θ|x) given the likelihood p(x|θ) and the prior distribution p(θ) over the model's parameters θ.
• The likelihood is something we can estimate from the training data: it treats the data as fixed and determines the probability of any parameter setting θ, so it acts as a function of the parameters.
• Here we leave out the denominator, p(x), because we are taking the maximization with respect to θ, which p(x) does not depend on.
• The key piece of the puzzle which leads Bayesian models to differ from their classical counterparts is the prior distribution p(θ).
• The idea is that its purpose is to encode our beliefs about the model's parameters before seeing any data.
• That's to say, we can often make reasonable assumptions about the model's parameters using domain knowledge or basic statistics.
• For example, a Gaussian prior places most of its probability mass close to the mean, while values towards its tails are rather rare.
• It turns out that using these prior distributions and performing MAP estimation is mathematically equivalent to performing classical ML with the addition of regularization.
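A minimal numeric sketch of that equivalence for linear regression: the MAP estimate under a Gaussian prior is exactly the ridge-regression solution. The data, the true weights and the regularization strength lam (the noise-variance to prior-variance ratio) are illustrative assumptions:

```python
# MAP with Gaussian prior == ridge regression: closed form
# w_map = (X^T X + lam*I)^-1 X^T y.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(0, 0.1, 100)

lam = 0.5  # assumed sigma^2 / tau^2 ratio
w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
print("MAP / ridge weights:", w_map)
```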
• Spam filtration
• Sentiment analysis
• Classifying articles
The Naive Bayes algorithm is comprised of two words, Naive and Bayes:
Naive: it assumes that the occurrence of a certain feature is independent of the occurrence of other features.
Bayes: it depends on the principle of Bayes' Theorem.
• Naive Bayes is one of the fast and easy ML algorithms to predict a class of datasets.
• It performs well in multi-class predictions as compared to the other algorithms.
• It is the most popular choice for text classification problems.
• It can be used in real-time predictions because Naive Bayes is an eager learner, which makes it suitable for applications such as spam filtering and sentiment analysis.
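A minimal sketch of the spam-filtration use case with multinomial Naive Bayes; the tiny corpus and its labels are made up purely for illustration:

```python
# Bag-of-words features + multinomial Naive Bayes for spam vs ham.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win money now", "meeting at noon", "free prize win", "lunch tomorrow?"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham (hypothetical labels)

vec = CountVectorizer()
X = vec.fit_transform(texts)
clf = MultinomialNB().fit(X, labels)  # eager learner: fast at predict time
print(clf.predict(vec.transform(["free money prize"])))  # likely [1]
```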
A Bayesian network consists of two parts:
1. Structure
2. Parameters
Healthcare Industry:
• Bayesian models are used in the healthcare industry to help detect diseases and assess patient risk from observed symptoms.
Web Search:
• Bayesian models are used in web search to rank results based on user intent.
• Based on the user's intent, these models show things that are relevant to the person.
• For instance, when we search for Python functions most of the time, the web search model activates our intent and makes sure the most relevant results are shown first.
Mail Spam Filtering:
• We all know how spam emails land in the spam folder in Gmail. So, how are these emails classified as spam? Using the Bayesian model, which observes the mail and, based on the learned probabilities, decides whether it is spam or not.
Biomonitoring:
• Bayesian models are used in biomonitoring to quantify the concentration of chemicals in blood and human tissue.
Information Retrieval:
• Bayesian models are used to refine the ranking and relevance of results in information retrieval systems.
Types of Logistic Regression:
Binary or Binomial
• In such a kind of classification, the dependent variable can have only two possible types, either 1 or 0. For example, these variables may represent success/yes or failure/no.
Multinomial
• In such a kind of classification, the dependent variable can have three or more possible unordered types, i.e., types having no quantitative significance.
• For example, these variables may represent "Type A" or "Type B" or "Type C".
Ordinal
• In such a kind of classification, the dependent variable can have three or more possible ordered types, i.e., types having a quantitative significance.
• For example, these variables may represent "poor", "good", "very good" or "excellent", and each category can have scores like 0, 1, 2, 3.
Q2. Explain Support Vector Machine (P4 - Appeared 1 Time) (3-7 Marks)
ANS: Support Vector Machine (SVM) is one of the most popular supervised learning algorithms, used for Classification as well as Regression problems in Machine Learning.
• The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes, so that we can easily put the new data point in the correct category in the future. This best decision boundary is called a hyperplane.
• SVM chooses the extreme points/vectors that help in creating the hyperplane; these extreme cases are called support vectors. Consider the below diagram, in which two different categories are classified using a decision boundary or hyperplane:
[Figure: two classes of points separated by a maximum-margin hyperplane; the support vectors lie on the positive and negative hyperplanes]
Types of SVM
• Linear SVM: Linear SVM is used for linearly separable data, which means a dataset that can be classified into two classes by using a single straight line.
• Non-linear SVM: Non-linear SVM is used for non-linearly separated data, which means a dataset that cannot be classified by using a straight line.
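A minimal sketch of a linear SVM; the support vectors that define the hyperplane can be read back from the fitted model. The blob data is an illustrative assumption:

```python
# Fit a linear SVM and inspect its support vectors and hyperplane.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, random_state=0)
svm = SVC(kernel="linear").fit(X, y)
print("support vectors per class:", svm.n_support_)
print("hyperplane: w =", svm.coef_[0], ", b =", svm.intercept_[0])
```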
Q3. Explain The Dual Formulation (P4 - Appeared 1 Time) (3-7 Marks)
• The duality principle says that optimization can be viewed from 2 different perspectives.
• The 1st one is the primal form, which is a minimization problem, and the other is the dual problem, which is a maximization problem. To derive the dual, we differentiate the Lagrangian with respect to w as well as b:
Setting the derivatives of the Lagrangian L to zero:
∂L/∂w = 0  ⟹  w = Σ (i=1 to n) αi yi xi
∂L/∂b = 0  ⟹  − Σ (i=1 to n) αi yi = 0, i.e., Σ (i=1 to n) αi yi = 0
Alpha(i) is greater than zero only for support vectors; for all other points it is zero, so those points do not matter.
Q4. Explain Nonlinear SVM and Kernel Function OR Give the difference
between Linear and Nonlinear SVM (P4 - Appeared 1 Time) (3-7 Marks)
ANS:
Kernel Function:
• Kernel functions map the original, non-linearly separable data into a higher-dimensional space where it becomes separable by linear classifiers.
• The choice of kernel determines the shape of the decision boundary.
• If the value of the kernel is linear, then the decision boundary would be linear; non-linear kernels (such as polynomial or RBF) produce non-linear boundaries by implicitly working in higher dimensions.
• There are kernels like RBF that work well with smaller data as well.
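A minimal sketch of the kernel trick on data that is not linearly separable; the concentric-circle dataset is an illustrative assumption, and the RBF kernel should clearly outperform the linear one here:

```python
# Same SVC, two kernels: linear vs RBF on non-linearly separable data.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.1, random_state=0)
for kernel in ("linear", "rbf"):
    acc = SVC(kernel=kernel).fit(X, y).score(X, y)
    print(kernel, "training accuracy:", round(acc, 2))
```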
Q5. Define and explain SVM and its Solution to the Dual Problem OR Explain SVM advantages, disadvantages and limitations (P4 - Appeared 1 Time) (3-7 Marks)
ANS: SVM Advantages
• SVMs are very good when we have no idea about the data.
• The risk of overfitting is less in SVM.
SVM Disadvantages
• Since the final model is not so easy to see, we cannot do small calibrations to the model; hence it is tough to incorporate our business logic.
• Choosing a good kernel function and tuning the hyperparameters is not easy, and it is difficult to interpret their impact.
SVM Applications
• Intrusion Detection
• Handwriting Recognition
• Each node applies weights to its inputs and sums them; if the sum exceeds a threshold value, that node is activated, sending data to the next layer of the network.
Multilayer Perceptron
• A multi-layer neural network contains more than one layer of artificial neurons or nodes, and the vast majority of networks used today have a multi-layer model.
• Typically, they have at least one input layer, which sends weighted inputs to a series of hidden layers, and an output layer at the end.
• These more sophisticated setups are also associated with nonlinear functions that allow the network to model more complex neural activity.
• Convolutional neural networks, recurrent networks, deep networks and deep belief systems are all examples of multi-layer architectures; convolutional networks, for instance, apply filters sequentially on an image.
• All of this is central to understanding how modern neural networks function.
• Used for deep learning (due to the presence of dense fully connected layers).
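To make the multi-layer idea concrete, here is a minimal sketch using scikit-learn's MLPClassifier; the digits dataset and the two hidden-layer sizes are illustrative assumptions:

```python
# Input layer -> two hidden layers (ReLU) -> output layer.
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), activation="relu",
                    max_iter=300, random_state=0)
mlp.fit(X, y)
print("training accuracy:", mlp.score(X, y))
```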
Q2. Explain Neural Network and Backpropagation Algorithm (P4 - Appeared 1 Time) (3-7 Marks)
ANS: Backpropagation is a mathematical tool for improving the accuracy of predictions in data mining and machine learning. Essentially, it is an algorithm used to compute a gradient descent with respect to weights.
Desired outputs are compared to achieved system outputs, and then the systems are tuned by adjusting connection weights to narrow the difference between the two as much as possible.
The algorithm gets its name because the weights are updated backwards, from output towards input.
The difficulty of knowing exactly how changing weights and biases affects the overall behaviour of an artificial neural network was one factor that held back wider application of neural networks in areas such as image processing.
Because backpropagation requires a known, desired output for each input value in order to calculate the loss function gradient, it is classified as supervised learning; today it is widely used in predictive analytics.
Deep learning represents a truly disruptive digital technology, and it is being used by increasingly more companies to create new business models.
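As a sketch of "gradient descent with respect to weights", the following hand-rolled example trains a single sigmoid layer with backpropagated gradients; the data, shapes and learning rate are all illustrative assumptions:

```python
# Minimal backpropagation: forward pass, compare desired vs achieved
# output, propagate the error gradient backward, update the weights.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))                 # inputs
t = (X[:, 0] + X[:, 1] > 0).astype(float)   # desired outputs

w = rng.normal(size=2)
b = 0.0
lr = 0.5
for _ in range(200):
    y = 1 / (1 + np.exp(-(X @ w + b)))      # forward pass (sigmoid)
    err = y - t                             # achieved minus desired
    grad_w = X.T @ (err * y * (1 - y))      # chain rule, backward
    grad_b = (err * y * (1 - y)).sum()
    w -= lr * grad_w                        # gradient descent on weights
    b -= lr * grad_b
print("final squared error:", float((err ** 2).sum()))
```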
The neural network needs to learn all the time to solve tasks in a more qualified manner or even to use various methods to provide a better result.
When it gets new information in the system, it learns how to act accordingly to a new situation.
Learning becomes deeper when the tasks you solve get harder.
A deep neural network represents the type of machine learning where the system uses many layers of nodes to derive high-level functions from input information. It means transforming the data into a more creative and abstract component.
Although you have never seen a particular picture of a person's face and body before, you can still identify it as a human; a deep network similarly identifies objects from creative and analytical components of the information.
These components are not brought to the system directly; the network has to derive them itself, and the deeper the learning, the higher its efficiency.
Deep neural network usage can find various applications in real life, for example in services used by banks and governmental entities.
Computational learning theory studies the design and analysis of machine learning algorithms.
Computational learning theory provides a formal framework in which it is possible to precisely formulate and address questions regarding the performance of different learning algorithms, so that careful comparisons of both the predictive power and the computational efficiency of competing algorithms can be made. It also studies how many training examples (observations) are needed to learn a concept.
If h is a hypothesis with error greater than ε, then the probability that h is consistent with m independent training examples is at most (1 − ε)^m ≤ e^(−εm).
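Reading that bound the other way gives a sample-size estimate: with a finite hypothesis space of size |H|, taking m ≥ (1/ε)(ln|H| + ln(1/δ)) examples makes the chance that any bad hypothesis stays consistent at most δ. A minimal numeric sketch with illustrative numbers:

```python
# Solve |H| * e^(-eps*m) <= delta for the required number of examples m.
import math

H, eps, delta = 10_000, 0.05, 0.01
m = math.ceil((math.log(H) + math.log(1 / delta)) / eps)
print("examples needed:", m)
```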
ANS: VC dimension:
The VC (Vapnik-Chervonenkis) dimension is a measure, in mathematics, of the capacity of the set of functions a model can learn.
VC dimension is useful in the formal analysis of learnability, because it provides an upper bound on generalization error.
The mathematics of this is quite complex, but the basic idea is that reducing the VC dimension has the effect of eliminating potential generalization errors. So if we have some notion of how many generalization errors are possible, the VC dimension gives an indication of how many could be made in any given context.
For example, linear classifiers in the plane have VC dimension 3: any three points in general position can be shattered (labelled in every possible way), but no set of four points can.
A subfield of Computational Learning Theory is concerned with deriving VC-dimension bounds in different training scenarios.
Q4. Define and explain Ensembles OR Explain Bagging and Boosting (P4 - Appeared 1 Time) (3-7 Marks)
ANS: Ensemble:
Ensembles use a set of learners too, but the learners can be trained using different strategies and their outputs combined; the individual models are often called "weak classifiers".
The main causes of error in learning are due to noise, bias and variance; ensembles help to minimize these factors.
In Bagging (Bootstrap Aggregation), random subsamples of the dataset are repeatedly pulled, with replacement.
A Decision Tree is formed on each of the bootstrapped subsamples.
After each subsample Decision Tree has been formed, an algorithm is used to aggregate over the Decision Trees (majority vote for classification, averaging for regression) to form the most efficient predictor.
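A minimal sketch of this bagging-of-trees procedure with scikit-learn's BaggingClassifier; the dataset and ensemble size are illustrative assumptions (the estimator argument was named base_estimator in older scikit-learn releases):

```python
# 50 decision trees, each fit on a bootstrapped subsample, then aggregated.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
bag = BaggingClassifier(estimator=DecisionTreeClassifier(),
                        n_estimators=50, bootstrap=True, random_state=0)
bag.fit(X, y)
print("training accuracy:", bag.score(X, y))
```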
Random Forest Models:
In contrast, Random Forest models decide where to split based on a random selection of features, rather than splitting on similar features at each node; this differentiation helps the ensemble generalize better over a larger dataset.
ANS: Clustering
Clustering is the task of dividing data points into groups such that points in the same group are more similar to each other than to those in other groups.
These algorithms take the data and, using some sort of similarity metric, they form these groups; later these groups can be used in recommendation systems, anomaly detection, etc.
In the Machine Learning process for Clustering, as mentioned, the chosen similarity metric decides how data points are assigned to their respective clusters.
Minkowski Distance
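As a quick reference, here is a minimal sketch of the Minkowski distance, which generalizes the Manhattan (p = 1) and Euclidean (p = 2) distances; the helper function and test points are illustrative:

```python
# D(x, y) = ( sum_i |x_i - y_i|^p )^(1/p)
import numpy as np

def minkowski(x, y, p):
    diff = np.abs(np.asarray(x) - np.asarray(y))
    return float((diff ** p).sum() ** (1 / p))

print(minkowski([0, 0], [3, 4], p=2))  # 5.0, the Euclidean case
print(minkowski([0, 0], [3, 4], p=1))  # 7.0, the Manhattan case
```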
The major setback here is that we should either intuitively or scientifically define the number of clusters before the algorithm starts.
In Density-Based Clustering, a cluster is defined by the density of points within it, and also these clusters are inclusive of outliers or noise.
Distribution-Based Clustering
Here, data points are grouped by the probability of belonging to the same distribution (for example, Gaussian), which relaxes the definition of the shape of the clusters used by most of the algorithms. A predefined number of distributions must be selected, which is not trivial, and any inconsistency in the assumed distribution distorts the result; in practice this approach works well only with synthetic or simulated data, or with data where most points certainly belong to a predefined distribution.
Fuzzy Clustering
Fuzzy clustering can be used with datasets where the variables have a high level of overlap, so that a data point can belong to more than one cluster with a certain degree of membership, rather than requiring one proper clustering.
k-Means is one of the most widely used and perhaps the simplest unsupervised clustering algorithm. It assigns points to k predetermined clusters and recomputes the cluster centroids, and these two steps are repeated until the assignments stop changing.
J = Σ (i=1 to k) Σ (xj ∈ Ci) ||xj − μi||²
Where,
Ci is the i-th cluster, μi is its centroid, and J is the total squared distance of the data points from their assigned centroids.
Advantages:
Can be applied to any form of data, as long as the data has a numerical (continuous) representation over which a distance can be computed.
Drawbacks:
The number of clusters k must be fixed in advance, and results are sensitive to the initial centroids and to outliers.
Application Areas:
Image segmentation.
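A minimal k-Means sketch with scikit-learn; k = 3 and the blob data are illustrative assumptions, echoing the note above that k must be chosen up front (inertia_ reports the objective J):

```python
# Cluster synthetic data with k=3 and inspect centroids and objective J.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("centroids:\n", km.cluster_centers_)
print("objective J (inertia):", km.inertia_)
```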
ANS: AGNES starts by considering the fact that each data point has its own cluster, i.e., if there are n data rows, then the algorithm begins with n clusters initially.
Then, iteratively, clusters that are most similar, again based on the chosen distance measure, are combined to form a larger cluster.
The iterations are performed until we are left with one huge cluster that contains all the data points.
Implementation:
In R, we make use of the agnes() function from the cluster package.
Advantages:
No prior knowledge about the number of clusters is needed.
It can provide robust results for data generated via various sources.
Disadvantages:
The cluster division (DIANA) or combination (AGNES) is really strict, and once performed, it cannot be undone and re-assigned in a later iteration.
Application areas:
Text mining, for example: assigning the tokens or words into these clusters and marking out the common topics.
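The notes implement AGNES with R's agnes(); a comparable bottom-up sketch in Python (an assumption, not the notes' own code) uses scikit-learn's AgglomerativeClustering:

```python
# Agglomerative (bottom-up) clustering, merging the most similar clusters.
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)
# As noted in the disadvantages, merges cannot be undone once performed.
agg = AgglomerativeClustering(n_clusters=3, linkage="average").fit(X)
print("cluster sizes:", [list(agg.labels_).count(c) for c in range(3)])
```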