Machine Learning
Introduction to Machine Learning
Syllabus
Introduction : What is Machine Learning, Definitions and Real-life applications, Comparison of Machine learning with traditional programming, ML vs AI vs Data Science.
Learning Paradigms : Learning Tasks - Descriptive and Predictive Tasks, Supervised, Unsupervised,
Semi-supervised and Reinforcement Learnings.
Models of Machine learning : Geometric model, Probabilistic Models, Logical Models, Grouping and
grading models, Parametric and non-parametric models.
Feature Transformation : Dimensionality reduction techniques - PCA and LDA.
Contents
1.1 What is Machine Learning ?
1.2 Real-life Applications
1.3 Comparison of Machine Learning with Traditional Programming
1.4 Learning Paradigms
1.5 Supervised Learning
1.6 Unsupervised Learning
1.7 Semi-supervised Learning
1.8 Reinforcement Learning
1.9 Models of Machine Learning
1.10 Grouping and Grading Models
1.11 Parametric Models
1.12 Non-parametric Models
1.13 Feature Transformation
1.14 PCA
1.15 LDA
1.16 Application of Machine Learning
1.1 What is Machine Learning ?
• Machine Learning (ML) is a sub-field of Artificial Intelligence (AI) which concerns with developing computational theories of learning and building learning machines.
• Learning is a phenomenon and process which has manifestations of various aspects. The learning process includes gaining of new symbolic knowledge and development of cognitive skills through instruction and practice. It is also discovery of new facts and theories through observation and experiment.
• Machine Learning Definition : A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
• Machine learning is programming computers to optimize a performance criterion using example data or past experience. Application of machine learning methods to large databases is called data mining.
• It is very hard to write programs that solve problems like recognizing a human face. We do not know what program to write because we don't know how our brain does it. Instead of writing a program by hand, it is possible to collect lots of examples that specify the correct output for a given input.
• A machine learning algorithm then takes these examples and produces a program that does the job. The program produced by the learning algorithm may look very different from a typical hand-written program. It may contain millions of numbers. If we do it right, the program works for new cases as well as the ones we trained it on.
• The main goal of machine learning is to devise learning algorithms that do the learning automatically without human intervention or assistance. The machine learning paradigm can be viewed as "programming by example." Another goal is to develop computational models of the human learning process and perform computer simulations.
• The goal of machine learning is to build computer systems that can adapt and learn from their experience.
• An algorithm is used to solve a problem on a computer. An algorithm is a sequence of instructions that should be carried out to transform the input to output. For example, the addition of four numbers is carried out by giving the four numbers as input to the algorithm, and the output is the sum of all four numbers. For the same task, there may be various algorithms. We are interested in finding the most efficient one, requiring the least number of instructions or memory or both.
• For some tasks, however, we do not have an algorithm.
• Machine learning provides business insight and intelligence. Decision makers are provided with greater insights into their organizations. This adaptive technology is being used by global enterprises to gain a competitive edge.
• Machine learning algorithms discover the relationships between the variables of a system (input, output and hidden) from direct samples of the system.
• Following are some of the reasons why machine learning is needed :
1. Some tasks cannot be defined well, except by examples. For example : recognizing people.
2. Relationships and correlations can be hidden within large amounts of data. To solve these problems, machine learning and data mining may be able to find these relationships.
3. Human designers often produce machines that do not work as well as desired in the environments in which they are used.
4. The amount of knowledge available about certain tasks might be too large for explicit encoding by humans.
5. Environments change from time to time.
6. New knowledge about tasks is constantly being discovered by humans.
• Machine learning also helps us find solutions to many problems in computer vision, speech recognition and robotics. Machine learning uses the theory of statistics in building mathematical models, because the core task is making inference from a sample.
How Machines Learn ?
• Machine learning typically follows three phases :
1. Training : A training set of examples of correct behavior is analyzed and some representation of the newly learnt knowledge is stored. This is often some form of rules.
2. Validation : The rules are checked and, if necessary, additional training is given. Sometimes additional test data are used, but instead, a human expert may validate the rules, or some other automatic knowledge-based component may be used. The role of the tester is often called the opponent.
3. Application : The rules are used in responding to some new situation.
• Fig. 1.1.1 shows the phases of ML.
Fig. 1.1.1 : Phases of ML - existing knowledge and training data drive training, the learned rules (new knowledge) are validated, and in application the system responds to a new situation
e Machine learning algorithms can figure out how to perform important tasks by
generalizing from examples.
• Feature selection is a process that chooses a subset of features from the original features so that the feature space is optimally reduced according to a certain criterion.
Review Questions
1. Justify the following :
   i) Predict the height of a person. Is it a regression task ?
   ii) Find the gender of a person by analyzing his writing style. Is it a classification task ?
   iii) Filter out spam emails. Is it an example of unsupervised learning ?
2. What is machine learning ? Explain types of machine learning.
1.2 Real-life Applications
• Examples of successful applications of machine learning :
1. Optical character recognition : Categorize images of handwritten characters by the letters represented.
2. Face detection : Find faces in images (or indicate if a face is present).
3. Spam filtering : Identify email messages as spam or non-spam. Topic spotting : categorize news articles (say) as to whether they are about politics, sports, entertainment, etc.
1.2.1 Classification
• Classification is the process of placing each individual from the population under study into one of many classes, on the basis of independent variables.
• Classification helps analysts use measurements of an object to identify the category to which that object belongs. To establish an efficient rule, analysts use data. The data consist of many examples of objects with their correct classification.
• For example, before a bank decides to disburse a loan, it assesses customers on their ability to repay the loan. This can be done by considering factors such as the customer's earning, age, savings and financial history. This information is taken from the past data of the loan. Hence, the analyst uses this data to create a relationship between customer attributes and the related risks.
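• The bank-loan decision above can be phrased as a small classification program. The following is a minimal sketch (not from the text) using scikit-learn; the feature values, the risk labels and the choice of a decision-tree classifier are illustrative assumptions.

# Hypothetical loan-risk classification sketch (illustrative values only)
from sklearn.tree import DecisionTreeClassifier

# columns : earning (thousands), age, savings (thousands)
X_train = [[45, 32, 10], [120, 45, 80], [25, 23, 2], [90, 51, 40], [30, 28, 5]]
y_train = ["high-risk", "low-risk", "high-risk", "low-risk", "high-risk"]

clf = DecisionTreeClassifier(max_depth=2)     # learn a classification rule from past loan data
clf.fit(X_train, y_train)

new_customer = [[60, 35, 15]]                 # assess a new applicant
print(clf.predict(new_customer))              # outputs the predicted risk class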
1.2.2 Face Recognition
• We perform the face recognition task effortlessly : every day we recognize our friends, relatives and family members. We also recognize them by looking at photographs, in which they appear in different poses, hair styles, background light, with makeup and without makeup.
• We do it subconsciously and cannot explain how we do it. Because we can't explain how we do it, we can't write an algorithm.
• A face has some structure. It is not a random collection of pixels; it is a symmetric structure. It contains predefined components like nose, mouth, eyes, ears. Every person's face is a pattern composed of a particular combination of these features. By analyzing sample face images of a person, a learning program captures the pattern specific to that person and uses it to recognize if a new real face or new image belongs to this specific person or not.
• In the case of face recognition, the input is an image, the classes are the people to be recognized, and the learning program should learn to associate the face images with identities. This problem is more difficult than optical character recognition because there are more classes, the input image is larger, and a face is 3D, so differences in pose and lighting cause significant changes in the image.
1.2.3 Medical Diagnosis
• In medical diagnosis, the inputs are the relevant information about the patient and the classes are the illnesses. The inputs contain the patient's age, gender, past medical history and current symptoms.
• Some tests may not have been applied to the patient, and thus these inputs would be missing. Tests take time, may be costly and may inconvenience the patient, so we do not want to apply them unless we believe that they will give us valuable information.
1.2.4 Regression
• Regression : Trying to predict a real value. For instance, predict the value of a stock tomorrow given its past performance, or predict Alice's score on the machine learning final exam based on her homework scores.
• If the desired output consists of one or more continuous variables, then the task is called regression. An example of a regression problem would be the prediction of the yield in a chemical manufacturing process in which the inputs consist of the concentrations of reactants, the temperature, and the pressure.
• The goal of regression is to predict the value of one or more continuous target variables.
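• As a concrete illustration of the chemical-yield example, the following minimal sketch (not from the text) fits a linear regression with scikit-learn; the numeric values are made-up assumptions.

# Hypothetical regression sketch : predict yield from reactant concentration,
# temperature and pressure (all numbers are illustrative)
from sklearn.linear_model import LinearRegression

X = [[0.2, 300, 1.0], [0.4, 310, 1.2], [0.6, 320, 1.1], [0.8, 330, 1.4]]
y = [12.1, 18.4, 24.2, 30.8]                  # continuous target : yield

reg = LinearRegression().fit(X, y)
print(reg.predict([[0.5, 315, 1.2]]))         # predicted yield for new process settings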
Review Questions
1.3 Comparison of Machine Learning with Traditional Programming; ML vs AI vs Data Science
• Machine Learning is a proven technique for helping to solve complex problems, such as facial and voice recognition, recommendation systems, self-driving cars and email spam detection.
Fig. 1.3.1 : Relationship between Artificial Intelligence, Machine Learning, Deep Learning and Data Science
In table format :
Machine Learning : uses statistical models.
Artificial Intelligence : uses logic and decision trees.
Data Science : deals with structured data.
1.4 Learning Paradigms
Descriptive Tasks
• Two primary techniques are used for reporting past events : data aggregation and data mining.
• Descriptive analysis presents past data in an easily digestible format for the benefit of a wide business audience.
• It is a set of techniques for reviewing and examining the data set to understand the data and analyze business performance.
Descriptive tasks vs. predictive tasks (partial comparison) :
• Descriptive : Used when the user wants to summarize results for all or part of the business. Predictive : Used when the user wants to make an educated guess at likely results.
• Limitation of descriptive tasks : A snapshot of the past, often with limited ability to help guide decisions. Limitation of predictive tasks : A guess at the future, helps inform low-complexity decisions.
Review Question
1.5 Supervised Learning
• Supervised learning is the machine learning task of inferring a function from supervised training data. The training data consist of a set of training examples. The task of the supervised learner is to predict the output behavior of a system for any set of input values, after an initial training phase.
• Supervised learning is learning in which the network is trained by providing it with input and matching output patterns. These input-output pairs are usually provided by an external teacher.
• Human learning is based on past experiences. A computer does not have experiences.
• A computer system learns from data, which represent some "past experiences" of an application domain.
• The goal is to learn a target function that can be used to predict the values of a discrete class attribute, e.g. approved or not-approved, and high-risk or low-risk. The task is commonly called supervised learning, classification or inductive learning.
• Training data includes both the input and the desired results. For some examples the correct results (targets) are known and are given as input to the model during the learning process. The construction of a proper training, validation and test set is crucial. These methods are usually fast and accurate.
• The learner has to be able to generalize : give the correct results when new data are given in input without knowing a priori the target.
• Supervised learning is the machine learning task of inferring a function from supervised training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object and a desired output value.
• A supervised learning algorithm analyzes the training data and produces an inferred function, which is called a classifier or a regression function. Fig. 1.5.1 shows the supervised learning process.
Fig. 1.5.1 : Supervised learning - a learning algorithm is applied to training data, and the learned model is then used in testing
• The learned model helps the system to perform the task better as compared to no learning.
• Each input vector requires a corresponding target vector :
  Training Pair = (Input Vector, Target Vector)
• Fig. 1.5.2 shows the input vector. (See Fig. 1.5.2 on next page.)
• Supervised learning denotes a method in which some input vectors are collected and presented to the network. The output computed by the network is observed and the deviation from the expected answer is measured. The weights are corrected according to the magnitude of the error in the way defined by the learning algorithm.
Fig. 1.5.2 : An error signal is generated from the difference between the desired output and the output computed by the network, and is used by the learning algorithm
5. Complete the design and then run the learning algorithm on the collected training set.
6. Evaluate the accuracy of the learned function. After parameter adjustment and learning, the performance of the resulting function should be measured on a test set that is separate from the training set.
• Supervised learning is divided into two types : Classification and Regression.
1. Classification :
• Classification predicts categorical labels (classes), while prediction models continuous-valued functions. Classification is considered to be supervised learning.
• It classifies data based on the training set and the values in a classifying attribute and uses it in classifying new data. Prediction means modeling continuous-valued functions, i.e., predicting unknown or missing values.
• We have complete control over choosing the number of classes we want in the training data.
1.6 Unsupervised Learning
Fig. 1.6.1 : Unsupervised learning
Supervised vs. unsupervised learning (partial comparison) :
5. Supervised : Trying to predict a function from labeled data. Unsupervised : Trying to detect interesting relations in data.
6. Supervised learning requires that the target variable is well defined and that a sufficient number of its values are given. For unsupervised learning, typically either the target variable is unknown or has only been recorded for too small a number of cases.
1.7 Semi-supervised Learning
• In many real world applications, it is relatively easy to acquire a large amount of unlabeled data x.
• For example, documents can be crawled from the Web, images can be obtained from surveillance cameras, and speech can be collected from broadcast. However, obtaining the corresponding labels for such a large amount of input data is difficult and expensive.
1.8 Reinforcement Learning
• Reinforcement learning deals with agents that must sense and act upon their environment. It combines classical Artificial Intelligence and machine learning techniques. Fig. 1.8.1 shows reinforcement learning : the agent takes an action, and the environment returns the next state s(t+1) and a reward.
• It allows machines and software agents to automatically determine the ideal behavior within a specific context, in order to maximize performance. Simple reward feedback is required for the agent to learn its behavior; this is known as the reinforcement signal.
• The two most important distinguishing features of reinforcement learning are trial-and-error search and delayed reward.
• With reinforcement learning algorithms an agent can improve its performance by using the feedback it gets from the environment. This environmental feedback is called the reward signal.
• Based on accumulated experience, the agent needs to learn which action to take in a given situation in order to obtain a desired long term goal. Essentially, actions that lead to long term rewards need to be reinforced. Reinforcement learning has connections with control theory, Markov decision processes and game theory.
• The agent interacts with the environment at a sequence of discrete time steps. At each time step t, the agent receives the state of the environment and a scalar numerical reward for the previous action, and then the agent chooses an action.
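• To make the reward-feedback loop concrete, here is a toy sketch of tabular Q-learning (not from the text); the one-dimensional corridor environment, the constants and the epsilon-greedy policy are illustrative assumptions.

# Toy Q-learning sketch : states 0..4 on a line, reward only at state 4
import random

n_states, actions = 5, [-1, +1]               # move left / move right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.3

for episode in range(300):
    s = 0
    while s != n_states - 1:
        # epsilon-greedy action selection (trial-and-error)
        a = random.choice(actions) if random.random() < epsilon \
            else max(actions, key=lambda act: Q[(s, act)])
        s_next = min(max(s + a, 0), n_states - 1)
        r = 1.0 if s_next == n_states - 1 else 0.0        # delayed reward
        # update the action-value estimate using the reward signal
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)])
        s = s_next

print(max(actions, key=lambda act: Q[(0, act)]))          # best first action learned
• Actions that eventually lead to the rewarding terminal state accumulate larger Q-values, which is exactly the reinforcement of long-term rewards described above.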
Review Question
1.9 Models of Machine Learning
• For classification and regression problems, there are different choices of machine learning models, each of which can be viewed as a black box that solves the same problem. However, each model comes from a different algorithmic approach and will perform differently under different data sets. The best way is to use cross-validation to determine which model performs best on test data.
• The model-based approach seeks to create a tailored solution for each new application. Instead of having to transform the user's problem to fit some standard algorithm, in model-based machine learning the user designs the algorithm precisely to fit the problem.
• The core idea at the heart of model-based machine learning is that all the assumptions about the problem domain are made explicit in the form of a model. A model is just made up of a set of assumptions, expressed in a precise mathematical form. These assumptions include the number and types of variables in the problem domain, which variables affect each other, and what the effect of changing one variable is on another variable.
• Machine learning models are classified as :
1. Geometric model (using the geometry of the instance space).
2. Probabilistic model (using probability to classify the instance space).
3. Logical model (using a logical expression).
Geometric Models
• Geometric models are based on extracting geometric features from instances (e.g. images) and learning them using efficient machine learning methods.
• In geometric models, there are two ways we could impose similarity. We could use geometric concepts like lines or planes to segment (classify) the instance space. These are called linear models.
• Linear models are parametric, which means that they have a fixed form with a small number of numeric parameters that need to be learned from data. Linear models have low variance and high bias. This implies that linear models are less likely to overfit the training data than some other models.
• In the other method, we can use the geometric notion of distance to represent similarity. In this case, if two points are close together, they have similar values for features and thus can be classed as similar. We call such models distance-based models.
• Examples of distance-based models include the nearest-neighbour models, which use the training data.
• Geometric learning methods can not only solve recognition problems but also predict subsequent actions by analyzing a set of sequential input sensory images, usually by extracting some features of the images.
• Examples of geometric models : k-nearest neighbours, linear regression, support vector machines.
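• The following is a minimal sketch of a distance-based geometric model; the tiny 2-D dataset and the choice k = 3 are assumptions made purely for illustration.

# k-nearest neighbours : similarity measured by Euclidean distance
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [2, 1], [6, 5], [7, 7], [6, 6]]
y = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
print(knn.predict([[2, 2], [6, 7]]))          # points close to each cluster -> [0 1]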
Probabilistic Models
Fig. 1.9.1 : Probabilistic logic learning as the intersection of probability, logic and learning
There are two types of probabilistic models : Predictive and Genera
Predictive probability models use. the idea of a conditional probability distribution
P (Y |X) from which Y can be predicted from X. Generative models estimate the
joint distribution P (Y, X). |
can derive any
Once we know the joint distribution for the generative models, we
Thus, the
conditional or marginal distribution involving the same variables.
knowing
generative model is capable of creating new data points and their labels,
the joint probability distribution.
• The joint distribution looks for a relationship between two variables. Once this relationship is inferred, it is possible to infer new data points. Naive Bayes is an example of a probabilistic classifier.
• Examples of probabilistic models : Naive Bayes, Gaussian process regression, conditional random field.
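• As a small illustration of a generative probabilistic classifier, the following sketch uses Gaussian Naive Bayes from scikit-learn; the iris dataset is only a convenient assumption.

# Gaussian Naive Bayes : estimates P(X | Y) and P(Y), predicts via Bayes' rule
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
model = GaussianNB().fit(X, y)
print(model.predict(X[:3]))                   # predicted class labels
print(model.predict_proba(X[:3]).round(3))    # conditional probabilities P(Y | X)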
• Probabilistic modeling is a statistical technique used to take into account the impact of random events or actions in predicting the potential occurrence of future outcomes.
• In machine learning, we train the system by using a limited data set called "training data" and, based on the confidence level of the training data, we expect the machine learning algorithm to depict the behaviour of the larger set of actual data.
• Probability theory provides a mathematical foundation for quantifying uncertainty of the knowledge.
¢ ML is focused on making predictions as accurate as possible, while traditional
statistical models are aimed at inferring relationships between variables.
• We make observations using sensors in the world. Based on the observations, we intend to make decisions. Given the same observations, the decision should be the same. However, the world changes, observations change and our sensors change, yet the output should not change.
• We build models for predictions; can we trust them ? Are they certain ? Many applications of machine learning depend on good estimation of the uncertainty :
  a) Forecasting
  b) Decision making
  c) Learning from limited, noisy and missing data
  d) Learning complex personalised models
  e) Data compression
  f) Automating scientific modelling, discovery and experiment design.
• A signal is called random if its occurrence cannot be predicted. Such a signal cannot be described by any mathematical equation.
• Random signals are represented collectively by a random variable; the value it will take at a particular time is not known in advance.
• The random variables are analyzed statistically with the help of probability, probability density functions and statistical averages such as mean, variance etc.
• Relative frequency : For event A, the relative frequency is defined as
  Relative frequency = (Number of times event A occurs) / (Total number of trials) = N_A / N
• As the number of trials approaches infinity, the relative frequency is called probability. The probability of event A is defined as the ratio of the number of possible favourable outcomes to the total number of outcomes, i.e.,
  $P(A) = \lim_{N \to \infty} \dfrac{\text{Number of possible favourable outcomes}}{\text{Total number of outcomes}} = \lim_{N \to \infty} \dfrac{N_A}{N}$
Permutations and Combinations
• Combination of 'n' items taken 'r' at a time : $^nC_r = \dfrac{n!}{(n-r)!\, r!}$
• Permutation of 'n' items taken 'r' at a time : $^nP_r = \dfrac{n!}{(n-r)!}$
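• A quick numeric check of the two formulas (assuming Python 3.8 or later, where math.comb and math.perm are available) :

import math

n, r = 5, 2
print(math.comb(n, r))    # nCr = 5! / (3! 2!) = 10
print(math.perm(n, r))    # nPr = 5! / 3!      = 20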
Conditional Probability
• Let A and B be two events such that P(A) > 0. We denote by P(B|A) the probability of B given that A has occurred. Since A is known to have occurred, it becomes the new sample space, replacing the original S. From this, the definition is
  $P(B|A) = \dfrac{P(A \cap B)}{P(A)}$
  or equivalently $P(A \cap B) = P(A)\, P(B|A)$.
• The notation P(B|A) is read "the probability of event B given event A". It is the probability of an event B given the occurrence of the event A.
• We say that the probability that both A and B occur is equal to the probability that A occurs times the probability that B occurs given that A has occurred. We call P(B|A) the conditional probability of B given A, i.e., the probability that B will occur given that A has occurred.
• Similarly, the conditional probability of an event A given B is
  $P(A|B) = \dfrac{P(A \cap B)}{P(B)}$
• The probability P(A|B) simply reflects the fact that the probability of an event A may depend on a second event B. If A and B are mutually exclusive, $A \cap B = \emptyset$ and P(A|B) = 0.
  P(Second | First) = P(First choice and second choice) / P(First choice)
• Conditional probability is a defined quantity and cannot be proven.
The key to solving conditional probability problems is to :
1. Define the events.
2. Express the given information and question in probability notation.
3. Apply the formula.
Joint Probability
A joint probability is a probability that measures the likelihood that two or more
events will happen concurrently.
If there are two independent events A and B, the probability that A and B will
occur is found by multiplying the two probabilities. Thus for two events A and B,
the special rule of multiplication shown symbolically is :
P(A and B) = P(A) P(B).
The general rule of multiplication is used to find the joint probability that two
events will occur. Symbolically, the general rule of multiplication is,
  P(A and B) = P(A) P(B|A).
• The probability P(A ∩ B) is called the joint probability for two events A and B which intersect in the sample space. A Venn diagram readily shows that
  $P(A \cap B) = P(A) + P(B) - P(A \cup B)$
  or, equivalently, $P(A \cup B) = P(A) + P(B) - P(A \cap B) \le P(A) + P(B)$.
• The probability of the union of two events never exceeds the sum of the event probabilities.
• A tree diagram is very useful for portraying conditional and joint probabilities. A tree diagram portrays outcomes that are mutually exclusive.
Bayes Theorem
• Bayes' theorem is a method to revise the probability of an event given additional information. Bayes' theorem calculates a conditional probability called a posterior or revised probability.
• Bayes' theorem is a result in probability theory that relates conditional probabilities. If A and B denote two events, P(A|B) denotes the conditional probability of A occurring, given that B occurs. The two conditional probabilities P(A|B) and P(B|A) are in general different.
• Bayes' theorem gives a relation between P(A|B) and P(B|A). An important application of Bayes' theorem is that it gives a rule for how to update or revise the strengths of evidence-based beliefs in light of new evidence a posteriori.
• A prior probability is an initial probability value originally obtained before any additional information is obtained.
• A posterior probability is a probability value that has been revised by using additional information that is later obtained.
• Suppose that B₁, B₂, B₃, ..., Bₙ partition the outcomes of an experiment and that A is another event. For any number k, with 1 ≤ k ≤ n, we have the formula :
  $P(B_k|A) = \dfrac{P(A|B_k)\, P(B_k)}{\sum_{i=1}^{n} P(A|B_i)\, P(B_i)}$
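• A short worked example of this formula follows; the prior and test-accuracy numbers are assumptions chosen only to illustrate the computation.

# Bayes' theorem : posterior probability of a condition given a positive test
p_B1, p_B2 = 0.01, 0.99               # prior : condition present / absent
p_A_given_B1 = 0.99                   # P(positive test | condition present)
p_A_given_B2 = 0.05                   # P(positive test | condition absent)

p_A = p_A_given_B1 * p_B1 + p_A_given_B2 * p_B2    # total probability of a positive test
posterior = (p_A_given_B1 * p_B1) / p_A            # P(condition | positive test)
print(round(posterior, 3))                         # approximately 0.167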
1.10 Grouping and Grading Models
• Grouping Model : Grouping models break up the instance space into segments and, in each segment, apply a very simple (often constant) model; decision trees are a typical example.
• Grading Model : Grading models form one global model over the instance space. They don't use the notion of a segment.
• Grading models are usually able to distinguish between arbitrary instances, no matter how similar they are.
1.11 Parametric Models
• A parametric model can be represented using a pre-determined number of parameters. These methods in machine learning typically take a model-based approach. We make an assumption with respect to the form of the function to be guessed, and then we choose an appropriate model based on this assumption to estimate the set of parameters.
• The advantage of the parametric approach is that the model is defined up to a small number of parameters, for example the mean and variance, the sufficient statistics of the distribution. Once those parameters are estimated from the sample, the whole distribution is known.
• The likelihood of the parameter θ given the observed sample, lik(θ) = f(x₁, x₂, ..., xₙ | θ), is considered as a function of θ.
• If the distribution is discrete, f will be the frequency distribution function.
• The maximum likelihood estimate of θ is that value of θ that maximises lik(θ) : it is the value that makes the observed data the most probable.
• Usually, we use the notation P(.) for a probability mass and the notation p(.) for a probability density. For a Bernoulli random variable X, P(X = 1) = θ and P(X = 0) = 1 − θ; for mathematical convenience we write
  $P(X = x) = \theta^{x} (1 - \theta)^{1-x}, \quad x \in \{0, 1\}$
1.12 Non-parametric Models
• A simple non-parametric density estimator is the histogram. Suppose the observations X₁, ..., Xₙ lie in [0, 1] and we partition [0, 1] into M equal-width bins
  $B_1 = \left[0, \tfrac{1}{M}\right),\; B_2 = \left[\tfrac{1}{M}, \tfrac{2}{M}\right),\; \ldots,\; B_M = \left[\tfrac{M-1}{M}, 1\right]$
• In such a case, for a given point x ∈ B_k, the density estimator from the histogram will be
  $\hat{p}_n(x) = \dfrac{\text{Number of observations within } B_k}{n} \times \dfrac{1}{\text{Length of the bin}} = \dfrac{M}{n} \sum_{i=1}^{n} I(X_i \in B_k)$
• The intuition of this density estimator is that the histogram assigns an equal density value to every point within the bin. So for the bin B_k that contains x, the ratio of observations within this bin is $\tfrac{1}{n}\sum_{i=1}^{n} I(X_i \in B_k)$, which should be equal to the density estimate times the length of the bin.
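• The estimator above can be written in a few lines of numpy; this is a sketch on synthetic uniform data, where the sample, the seed and the number of bins are assumptions.

# Histogram density estimator on [0, 1] with M equal-width bins
import numpy as np

rng = np.random.default_rng(0)
X = rng.random(1000)                           # n samples in [0, 1)
M = 10

def hist_density(x, X, M):
    k = min(int(x * M), M - 1)                 # index of the bin containing x
    count = ((X >= k / M) & (X < (k + 1) / M)).sum()
    return (count / len(X)) * M                # (count / n) divided by bin length 1/M

print(hist_density(0.37, X, M))                # close to 1.0 for uniform data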
• Non-parametric methods lean towards additional precision because they try to find the best fit for the data points. Though, this comes at the cost of needing a larger number of observations.
Parametric vs. non-parametric methods (partial comparison) :
3. Parametric methods can be used on small samples. Non-parametric methods tend to need larger samples.
1.13 Feature Transformation
• In machine learning, features are individual independent variables that act like inputs in your system. A feature is an attribute of a data set and is used in a machine learning process.
• Selection of the subset of features which are meaningful for machine learning is a sub-area of feature engineering.
• The features in a data set are also called its dimensions; so a data set having 'n' features is called an n-dimensional data set.
• Feature engineering is the process of creating features (also called "attributes") that don't already exist in the dataset. This means that if your dataset already contains enough "useful" features, you don't necessarily need to engineer additional features.
• Feature engineering refers to the process of translating a data set into features such that these features are able to represent the data set more effectively and result in a better learning performance.
• If feature engineering is performed properly, it helps to improve the power of prediction of machine learning algorithms by creating the features using the raw data that facilitate the machine learning process.
• Elements of feature engineering are feature transformation and feature subset selection; feature transformation includes feature construction and feature extraction.
Feature Construction
• Feature construction involves transforming a given set of input features to generate a new set of more powerful features which can then be used for prediction.
• Feature construction methods may be applied to pursue two distinct goals : reducing data dimensionality and improving prediction performance.
• Steps :
  1. Start with an initial feature space F₀.
  2. Transform F₀ to construct a new feature space F₁.
• Feature construction discovers relationships between features and augments the feature space by creating additional features. Hence, if there are 'n' features or dimensions in a data set, after feature construction 'm' more features or dimensions may get added. So at the end, the data set will become 'n + m'-dimensional.
• In feature subset selection we either start from an empty set and add features, or start with the full feature set and remove features, until a preset bound can no longer be maintained.
• Using a suitable error function, this can be used in both regression and classification problems. There are 2^d possible subsets of d variables, but we cannot test all of them unless d is small, so we employ heuristics to get a reasonable (but not optimal) solution in reasonable (polynomial) time.
• Subset selection is of two types : forward and backward selection.
1. Forward selection : It starts without variables and adds them one by one, at each step adding the one that decreases the error the most, until any further addition does not decrease the error.
2. Backward selection : It starts with all variables and removes them one by one, at each step removing the one that decreases the error the most, until any further removal increases the error significantly.
• Sequential Forward Selection (SFS) : SFS is the simplest greedy search algorithm. It starts from the empty set and sequentially adds the best feature x⁺. SFS performs best when the optimal subset is small.
• The main disadvantage of SFS is that it is unable to remove features that become obsolete after the addition of other features.
• Sequential Backward Selection (SBS) : It works in the opposite direction of SFS. Starting from the full set, it sequentially removes the feature x⁻ that least reduces the value of the objective function.
• SBS works best when the optimal feature subset is large, since SBS spends most of its time visiting large subsets. The main limitation of SBS is its inability to reevaluate the usefulness of a feature after it has been discarded.
• SFS is performed from the empty set. SBS is performed from the full set.
e There are two floating methods :
1. Sequential Floating Forward Selection (SFFS) starts from the empty set. After
each forward step, SFFS performs backward steps as long as the objective
function increases.
2. Sequential Floating Backward Selection (SFBS) starts from the full set. After
each backward step, SFBS performs forward steps as long as. the objective
function increases.
• Subset selection is supervised in that outputs are used by the regressor or classifier to calculate the error, but it can be used with any regression or classification method.
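• For reference, the forward/backward search described above is available in scikit-learn; the sketch below assumes scikit-learn 0.24 or later, and the estimator and dataset are illustrative choices.

# Sequential forward selection of 2 features (direction="backward" gives SBS)
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
sfs = SequentialFeatureSelector(KNeighborsClassifier(n_neighbors=3),
                                n_features_to_select=2,
                                direction="forward")
sfs.fit(X, y)
print(sfs.get_support())          # boolean mask of the selected features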
1.13.6 Curse of Dimensionality
• In machine learning, "dimensionality" simply refers to the number of features (i.e., input variables) in your dataset.
• Dimensionality reduction can be achieved either by feature selection or by feature combination (extraction).
Advantages and Disadvantages of Dimensionality Reduction
• Advantages of dimensionality reduction :
  1. It helps in data compression and hence reduced storage space.
  2. It reduces computation time.
  3. It also helps remove redundant features, if any.
• Disadvantages of dimensionality reduction :
  1. It may lead to some amount of data loss.
  2. PCA tends to find linear correlations between variables, which is sometimes undesirable.
  3. PCA fails in cases where mean and covariance are not enough to define datasets.
  4. We may not know how many principal components to keep - in practice, some thumb rules are applied.
1.14 PCA (Principal Component Analysis)
• PCA employs a linear transformation that is based on preserving the most variance in the data using the least number of dimensions.
e It involves the following steps :
1. Construct the covariance matrix of the data.
2. Compute the eigenvectors of this matrix.
3. Eigenvectors corresponding to the largest eigen values are used to reconstruct
a large fraction of variance of the original data.
e The data instances are projected onto a lower dimensional space where the new
features best represent the entire data in the least squares sense.
• It can be shown that the optimal approximation, in the least square error sense, of a d-dimensional random vector x by a linear combination of m < d independent vectors is obtained by projecting the vector x onto the eigenvectors eᵢ corresponding to the largest eigenvalues λᵢ of the covariance matrix (or the scatter matrix) of the data from which x is drawn.
e The eigenvectors of the covariance matrix of the data are referred to as principal
axes of the data, and the projection of the data instances on to these principal axes
are called the principal components. Dimensionality reduction is then obtained by
only retaining those axes (dimensions) that account for most of the variance, and
discarding all others.
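• The three steps listed above can be carried out directly with numpy; this is a sketch on a small synthetic dataset, where the data, the seed and the number of retained components are assumptions.

# PCA by eigendecomposition of the covariance matrix
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])   # correlated data
Xc = X - X.mean(axis=0)                        # centre the data

cov = np.cov(Xc, rowvar=False)                 # 1. covariance matrix
eig_vals, eig_vecs = np.linalg.eigh(cov)       # 2. eigenvalues / eigenvectors
order = np.argsort(eig_vals)[::-1]             # 3. sort by decreasing eigenvalue
principal_axes = eig_vecs[:, order]

Z = Xc @ principal_axes[:, :1]                 # project onto the first principal axis
print(eig_vals[order], Z.shape)                # variance along each axis, (100, 1)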
• In Fig. 1.14.1, the principal axes are along the eigenvectors of the covariance matrix of the data. There are two principal axes shown in the figure; the first one is close to the origin, the other is far from the origin.
• If X = {x₁, x₂, ..., xₙ} is the set of n patterns of dimension d, the sample mean ...
• In non-negative matrix factorization (NMF) the data matrix is factorized as X ≈ W H. When decomposing images, each row in X is a single image, and each column represents some feature :
  $X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_k \end{bmatrix}, \quad W = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_k \end{bmatrix}, \quad H = \begin{bmatrix} h_1 \\ h_2 \\ \vdots \\ h_r \end{bmatrix}$
• Take the i-th row in X, xᵢ. If you think about the equation, you will find that xᵢ can be written as
  $x_i = \sum_{j} W_{ij}\, h_j$
• Basically, we can interpret xᵢ to be a weighted sum of some components, where each row in H is a component, and each row in W contains the weights of each component.
PCA vs. NMF (partial comparison) :
4. PCA is non-iterative. NMF is iterative.
5. PCA is designed for producing optimal basis images. NMF is designed for producing coefficients with a specific property.
• In sparse PCA one wants to get a small number of features which still capture most of the variance. Thus one needs to enforce sparsity of the PCA components, which yields a trade-off between explained variance and sparsity.
• To address the non-sparsity issue of traditional PCA, sparse PCA imposes an additional constraint on the number of non-zero elements in the vector v.
• This is achieved through the ℓ₀ norm, which gives the number of non-zero elements in the vector v. A sparse PCA with at most k non-zero loadings can then be formulated as the following optimization problem.
• Optimization problems with an ℓ₀ norm constraint are in general NP-hard. Therefore, most methods for sparse PCA relax the ℓ₀ norm constraint with an ℓ₁ norm appended to the objective function.
Fig. 1.14.2 : (a) PCA  (b) Kernel PCA (KPCA)
Preliminaries :
# Load libraries
from sklearn.decomposition import PCA, KernelPCA
from sklearn.datasets import make_circles
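• Continuing the preliminaries above, a minimal usage sketch follows; the number of samples, the RBF kernel and the gamma value are illustrative assumptions rather than values from the text.

# Linear PCA cannot unfold the concentric circles; an RBF-kernel PCA can
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=15)
X_kpca = kpca.fit_transform(X)        # non-linear projection, as in Fig. 1.14.2 (b)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)          # linear projection, as in Fig. 1.14.2 (a)
print(X_kpca.shape, X_pca.shape)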
1.15 LDA (Linear Discriminant Analysis)
Fig. 1.15.1 : Projection of samples onto a line in direction v
Scalar v ‘x; is the distance of projection of x; from the origin. Thus it v'x; is the
Projection of x; into a one dimensional subspace.
Thus the projection of sample x; onto a line in direction v is given by v 'x;.
How to measure separation between projection of different classes ?
Let p; and p, be the means of projection of classes 1 and
2.
Let uw, and U> be the means of classes 1 and 2.
1 n4{ 1 nN t
t _ t —
pp
_
=—ny Dv XV — Yxjavli
n4
xj €Cy xj EC]
Similarly, p2 = V th
Fig. 1.15.2 : The distance between projected means |μ̃₁ − μ̃₂| alone is not a good measure of separation when the classes have large variance
• We need to normalize |μ̃₁ − μ̃₂| by a factor which is proportional to the variance.
• Given samples z₁, ..., zₙ, the sample mean is $\mu_z = \frac{1}{n}\sum_{i=1}^{n} z_i$ and the scatter is $s = \sum_{i=1}^{n} (z_i - \mu_z)^2$.
• Fisher's solution : normalize |μ̃₁ − μ̃₂| by the scatter. We need to normalize by both the scatter of class 1 and the scatter of class 2. Thus the Fisher linear discriminant projects onto the line in the direction v which maximizes
  $J(v) = \dfrac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^{\,2} + \tilde{s}_2^{\,2}}$
• Define the separate class scatter matrices S₁ and S₂ for classes 1 and 2.
Discriminant function :
• For a 2-class case, we seek a weight vector w and threshold w₀ such that
  $w^T x + w_0 > 0 \Rightarrow x \in \omega_1, \qquad w^T x + w_0 < 0 \Rightarrow x \in \omega_2$
• For a pattern x, the discriminant value g(x) = wᵀx + w₀ gives its distance to the decision hyperplane, scaled by the length of the weight vector :
  $r = \dfrac{g(x)}{\lVert w \rVert}$
• The value of the discriminant function for a pattern x is therefore a measure of its distance from the decision hyperplane.
• Limitations of LDA :
  a) LDA is a parametric method since it assumes unimodal Gaussian likelihoods.
  b) LDA will fail when the discriminatory information is not in the mean but rather in the variance of the data.
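• In practice, Fisher's LDA is available directly in scikit-learn; the short sketch below uses the iris dataset only as a convenient assumption.

# LDA as supervised dimensionality reduction and as a classifier
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2)
X_proj = lda.fit_transform(X, y)      # directions that maximize class separation
print(X_proj.shape)                   # (150, 2)
print(lda.predict(X[:3]))             # LDA can also classify directly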
PCA vs. LDA (partial comparison) :
4. PCA : bad for discriminability. LDA : good discriminability.
5. PCA is a technique that finds the directions of maximal variance. LDA attempts to find a feature subspace that maximizes class separability.
1.16 Application of Machine Learning
• Face detection : Find faces in images (or indicate if a face is present).
• Data and machine learning are responsible for the explosive growth of digital voice assistants. They continue to get better with the more experiences they have and the data they accumulate.
• When a user makes a request of Alexa, the microphone on the device records the command. This recording is sent over the internet to the cloud. If the user is talking to Alexa, the recording is sent to Alexa Voice Services (AVS). This cloud-based service will review the recording and interpret the user's request. Then, the system will send a relevant response back to the device.
• Amazon breaks down the user's "orders" into individual sounds. It then consults a database to interpret the sounds as words.
Google Home :
• Google services such as its image search and translation tools use sophisticated machine learning which allows computers to see, listen and speak in much the same way as humans do.
• Google Home is a voice-activated virtual assistant connected to the internet.
° The Google Home is always listening to its environment, but it won't record what
we are saying or respond to our commands until we speak one of its
pre-programmed wake words -- either "OK, Google" or "Hey, Google."
• TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor for searches in information retrieval, text mining and user modeling.
• The TF-IDF value increases proportionally to the number of times a word appears in the document, but it is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.
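• A minimal sketch of TF-IDF weighting with scikit-learn follows; the three-sentence toy corpus is an assumption used only for illustration (get_feature_names_out assumes scikit-learn 1.0 or later).

# TF-IDF document-term matrix for a toy corpus
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat on the mat",
          "the dog sat on the log",
          "cats and dogs are pets"]

vec = TfidfVectorizer()
tfidf = vec.fit_transform(corpus)           # sparse document-term matrix
print(vec.get_feature_names_out())          # vocabulary
print(tfidf.toarray().round(2))             # TF-IDF weight of each word per document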
Unmanned vehicles :
• The technology was initially developed to aid ground forces in the transfer of heavy equipment.
• However, the technology has witnessed significant evolution over the years, giving rise to more tactical vehicles designed to assist in surveillance or search-and-destroy missions.
• For example, for unmanned ships, the default route during a voyage is straight-line navigation, on the premise that it is unobstructed.
• During the voyage, the hull is affected by the intensity and direction of the waves, which are unpredictable; the unmanned boat must sense this itself. Therefore, while navigating, the unmanned ship continually trains its perception of the surrounding environment and chooses an appropriate strategy; if the result of executing the strategy keeps it in line with the default route, it is rewarded.
Review Question
Regression
Syllabus
Introduction - Regression, Need of Regression, Difference between Regression and Correlation, Types of Regression - Univariate vs. Multivariate, Linear vs. Nonlinear, Simple Linear vs. Multiple Linear,
Bias-Variance tradeoff, Overfitting and Underfitting.
Regression Techniques - Polynomial Regression, Stepwise Regression, Decision Tree Regression,
Random Forest Regression, Support Vector Regression, Ridge Regression, Lasso Regres
sion,
ElasticNet Regression, Bayesian Linear Regression.
Evaluation Metrics : Mean Squared Error (MSE), Mean Absolute Error (MAE), Root Mean Squared
Error (RMSE), R-squared , Adjusted R-squared.
Contents
2.1. Introduction
2.2 Regression
2.3 Types of Regression
2.4 Overfitting and Underfitting
2.5 Regression Techniques : Polynomial Regression
2.6 Support Vector Regression
2.7 Ridge Regression
2.8 Lasso Regression
2.9 ElasticNet Regression
2.10 Bayesian Linear Regression
2.11 Evaluation Metrics
2.1 Introduction
• Regression analysis is a set of statistical methods used for the estimation of relationships between a dependent variable and one or more independent variables. It can be utilized to assess the strength of the relationship between the variables and for modelling the future relationship between them.
• The two basic types of regression are linear regression and multiple linear regression.
• Regression analysis includes several variations, such as linear, multiple linear and nonlinear. The most common models are simple linear and multiple linear. Nonlinear regression analysis is commonly used for more complicated data sets in which the dependent and independent variables show a nonlinear relationship.
Review Question
1. How is the performance of regression assessed ? Write the various performance metrics used for it.
2.2 Regression
e For an input x, if the output is continuous, this is called a regression problem. For
example, based on historical information of demand for tooth paste in your
supermarket, you are asked to predict the demand for the next month.
Regression is concerned with the prediction of continuous quantities. Linear
regression is the oldest and most widely used predictive model in the field of
machine learning. The goal is to minimize the sum of the squared errors to fit a
straight line to a set of data points.
• The simple linear regression model is
  $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
  where β₀ is the population Y-intercept, β₁ is the population slope (the change in Y per unit change in X), εᵢ is the random error, Y is the dependent variable and X is the independent variable.
• For regression tasks, the typical accuracy metrics are Root Mean Square Error (RMSE) and Mean Absolute Percentage Error (MAPE). These metrics measure the distance between the predicted numeric target and the actual numeric answer.
Regression Line
• Least squares : The least squares regression line is the line that makes the sum of squared residuals as small as possible. Linear means "straight line".
e Regression line is the line which gives the best estimate of one variable from the
value of any other given variable. .
e The regression line gives the average relationship between the two variables in
mathematical form.
For two variables X and Y, there are always two lines of regression.
• The regression line of X on Y gives the best estimate for the value of X for any specific given value of Y :
  X = a + bY
  where a is the X-intercept and b is the slope of the line.
• By using the least squares method (a procedure that minimizes the vertical deviations of plotted points surrounding a straight line) we are able to construct a best fitting straight line to the scatter diagram points and then formulate a regression equation in the form
  Y = a + bX, i.e. ŷ = ȳ + b(x − x̄)
  where a = Y-intercept, b = slope of the line, Y = dependent variable and X = independent variable.
• In the population model, the betas are constants and the epsilons are independent and identically distributed normal random variables with mean zero.
e In a regression tree the idea is this : since the target variable does not have classes,
we fit a regression model to the target variable using each of the independent
variables. Then for each independent variable, the data is split at several split
points.
e At each split point, the "error" between the predicted value and the actual values
is squared to get a "Sum of Squared Errors (SSE)". The split point errors across the
variables are compared and the variable/point yielding the lowest SSE is chosen
as the root node/split point. This process is recursively continued.
• The error function measures how much our predictions deviate from the desired answers.
  Mean-squared error : $J_n = \dfrac{1}{n} \sum_{i=1}^{n} \big(y_i - f(x_i)\big)^2$
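• The evaluation metrics named in the syllabus (MSE, MAE, RMSE, R-squared) can be computed directly with scikit-learn; the true and predicted values below are made-up assumptions for illustration.

# Regression evaluation metrics on toy predictions
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 3.0, 8.0]

mse = mean_squared_error(y_true, y_pred)
print("MSE :", mse)
print("RMSE:", mse ** 0.5)
print("MAE :", mean_absolute_error(y_true, y_pred))
print("R2  :", r2_score(y_true, y_pred))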
Advantages :
• By examining the magnitude and sign of the regression coefficients you can infer how the features affect the target outcome.
• Regression is a technique for investigating the relationship between independent variables or features and a dependent variable or outcome. It is used as a method for predictive modelling in machine learning, in which an algorithm is used to predict continuous outcomes.
• Regression models are used to predict a continuous value. Predicting the price of a house given features of the house, like its size, is one of the common examples of regression. It is a supervised technique.
• Regression analysis is a way to find trends in data. For example, we might guess that there is a connection between how much we eat and how much we weigh; regression analysis can help us quantify that relationship. Regression analysis will provide an equation for a graph so that we can make predictions about our data.
Regression is essentially the "best guess" at. utilizing a collection of data to
generate some form of forecast. It is the process of fitting a set of points to a
graph.
• Linear model selection seeks the set of predictors that minimize the prediction error. Linear model selection approaches include best subsets regression and stepwise regression.
Regression vs. Correlation :
• Regression tells us how to draw the straight line described by the correlation. Correlation describes the strength of a linear relationship between two variables.
Univariate vs. Multivariate (partial comparison) :
2. Univariate analysis does not deal with causes and relationships. Multivariate analysis deals with causes and relationships.
3. Univariate analysis does not contain any dependent variable. Multivariate analysis contains more than one dependent variable.
• Equation : $y = \beta_0 + \beta_1 x + \varepsilon$
• The nonlinear model is more flexible and accurate. Although both models can accommodate curvature, the nonlinear model is significantly more versatile in terms of the forms of the curves it can accept.
• If a model can take the inputs and routinely get the same outputs, the model is interpretable :
1. If you overeat your meal at dinnertime and you always have trouble sleeping, the situation is interpretable.
2. If all 2019 polls showed an "ABC party" win and the "XYZ party" candidate took office, all those models showed low interpretability.
• Interpretability poses no issue in low-risk scenarios. If a model is recommending movies to watch, that can be a low-risk task.
• The fitness of a target function approximated by a learning algorithm determines how correctly it is able to classify a set of data it has never seen.
• Training error can be reduced by making the hypothesis more sensitive to training data, but this may lead to overfitting and poor generalization.
• Underfitting : If we put too few variables in the model, leaving out variables that could help explain the response, we are underfitting. Consequences :
1. The fitted model is not good for prediction of new data - the prediction is biased.
2. The regression coefficients are biased.
• Because of overfitting, we get low error on training data and high error on test data. Overfitting occurs when a model begins to memorize training data rather than learning to generalize from the trend.
• The more difficult a criterion is to predict, the more noise exists in past information that needs to be ignored. The problem is determining which part to ignore.
Fig. 2.4.1
• In machine learning, the more complex model is said to show signs of overfitting, while the simpler model shows underfitting. Several heuristics are often used in order to avoid overfitting; for example, when designing neural networks one may :
1. Limit the number of hidden nodes.
2. Stop training early to avoid a perfect explanation of the training set, and
3. Apply weight decay to limit the size of the weights, and thus of the function class implemented by the network.
A small numerical illustration of under- and over-fitting is sketched below.
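• As referenced above, here is a small sketch (not from the text) that fits polynomials of increasing degree to noisy data; the degrees, the noise level and the seed are illustrative assumptions.

# Under- and over-fitting : polynomial fits of increasing degree
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 12)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

for degree in (1, 3, 9):                      # underfit, reasonable, overfit
    coeffs = np.polyfit(x, y, degree)
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"degree {degree:2d}  training MSE = {train_err:.4f}")
# The highest-degree fit drives the training error towards zero but generalizes poorly.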
• Given two classes of hypotheses (e.g. linear models and k-NNs) to fit to some training data set, we observe that the more flexible hypothesis class has a low bias term but a higher variance term. If we have a parametric family of hypotheses, then as we increase the flexibility of the hypothesis we still observe the increase of variance.
• The bias is error from erroneous assumptions made by the learning algorithm. High bias can cause an algorithm to miss relevant relations between the features and the target outputs.
• The variance is error from sensitivity to small fluctuations in the training set. High variance can cause overfitting : modeling the random noise in the training data, rather than the intended outputs.
• In order to reduce the model error, the designer can aim at reducing either the bias or the variance, as the noise component is irreducible.
• As the model increases in complexity, its bias is likely to diminish. However, as the number of training examples is kept fixed, the parametric identification of the model may strongly vary from one training set D_N to another. This will increase the variance term.
e At one stage, the decrease in bias will be inferior to the increase in variance,
warning that the model should not be too complex. Conversely, to decrease the
variance term, the designer has to simplify its model so that it is less sensitive to a
specific training set. This simplification will lead to a higher bias.
Example 2.4.1 : Refer Fig. 2.4.3.
Solution :
• The given Fig. 2.4.3 is related to overfitting and underfitting.
Underfitting (High bias and low variance) :
° A statistical model, or a machine learning algorithm is said
to have underfitting
when it cannot capture the underlying
trend of the data.
• It happens when we have less data to build an accurate model and also when we try to build a linear model with non-linear data.
Fig. 2.4.4 : Fitting house-size data with θ₀ + θ₁x (high bias, underfit), θ₀ + θ₁x + θ₂x² (high bias, underfit) and θ₀ + θ₁x + θ₂x² + θ₃x³ + θ₄x⁴ (high variance, overfit)
• In such cases the rules of the machine learning model are too easy and flexible to be applied on such minimal data, and therefore the model will probably make a lot of wrong predictions.
• Underfitting can be avoided by using more data and also by reducing the features through feature selection.
Review Questions
(Refer Fig. 2.4.5)
i) Find the contingency table  ii) Find recall  iii) Precision  iv) Negative recall  v) False positive rate
6. Difference between overfitting and underfitting.
2. Backward elimination : Appropriately fit the highest order model and then delete terms one at a time, starting with the highest order, until the highest order remaining term has a significant t statistic.
Advantages :
• We can model non-linear relationships between variables.
• There is a large range of different functions that you can use for fitting.
• Good for exploration purposes : you can test for the presence of curvature and its inflections.
• The random forest takes the prediction from each tree and, based primarily on the majority votes of the predictions, it predicts the final output.
• A greater number of trees in the forest results in better accuracy and prevents the hassle of overfitting.
• Step - 3 : Choose the number N of decision trees which we want to build.
3. Land use : We can identify areas of similar land use with the aid of this algorithm.
4. Marketing : Marketing tendencies can be recognized by the usage of this algorithm.
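• A minimal sketch of random forest regression with scikit-learn follows; the synthetic dataset and the choice N = 100 trees (Step 3 above) are illustrative assumptions.

# Random forest regression : average the predictions of N decision trees
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=100, random_state=0)   # N = 100 trees
forest.fit(X, y)
print(forest.predict(X[:3]))          # averaged prediction of all trees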
2.6 Support Vector Regression
• Support Vector Machines (SVMs) are a set of supervised learning methods which learn from the dataset and are used for classification.
• SVM is a classifier derived from statistical learning theory by Vapnik and Chervonenkis. Fig. 2.6.1 shows a two class problem.
• An SVM is a kind of large-margin classifier : it is a vector space based machine learning method where the goal is to find a decision boundary between two classes that is maximally far from any point in the training data.
• Given a set of training examples, each marked as belonging to one of two classes, an SVM algorithm builds a model that predicts whether a new example falls into one class or the other. Simply speaking, we can think of an SVM model as representing the examples as points in space, mapped so that the examples of the separate classes are divided by a gap that is as wide as possible. Fig. 2.6.2 shows a bad decision boundary of an SVM.
• New examples are then mapped into the same space and classified as belonging to a class based on which side of the gap they fall on.
Two-Class Problems :
• Many decision boundaries can separate these two classes. Which one should we choose ?
• The perceptron learning rule can be used to find any decision boundary between class 1 and class 2.
¢ The line that maximizes the minimum margin is a good bet. The model class of
“hyper-planes with a margin of m" has a low VC dimension if m is big.
e This maximum-margin separator is determined by a subset of the data points,
Data points in this subset are called "support vectors". It will be useful |
computationally if only a small fraction of the data points are support vectors, |
because we use the support vectors to decide which side of the separator a test
case is on.
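To make the idea of support vectors concrete, the short sketch below trains a linear SVM with scikit-learn and lists which training points ended up as support vectors. The toy data points and the choice of a linear kernel are assumptions made only for illustration.

# Sketch : fit a linear SVM on toy 2-D data and inspect its support vectors.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 2], [2, 0],      # class 0
              [4, 4], [5, 5], [4, 6]])     # class 1
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print("Support vectors :\n", clf.support_vectors_)
print("Indices of support vectors :", clf.support_)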
Example of Bad Decision Boundaries
• SVMs are primarily two-class classifiers with the distinct characteristic that they aim to find the optimal hyperplane such that the expected generalization error is minimized. Instead of directly minimizing the empirical risk calculated from the training data, SVMs perform structural risk minimization to achieve good generalization.
• The empirical risk is the average loss of an estimator for a finite set of data drawn from P. The idea of risk minimization is not only to measure the performance of an estimator by its risk, but to actually search for the estimator that minimizes risk over distribution P. Because we don't know distribution P we instead minimize empirical risk over a training dataset drawn from P. This general learning technique is called empirical risk minimization.
• Fig. 2.6.3 shows the expected risk, the confidence term and the empirical risk plotted against the complexity of the function set.
Fig. 2.6.3 Empirical risk
• Maximal margin classifier : A classifier in the family F that maximizes the margin. Maximizing the margin is good according to intuition and PAC theory. It implies that only support vectors matter; other training examples are ignorable.
Example : For the following figure, find a linear hyperplane (decision boundary) that will separate the data.
Fig. 2.6.5
(Figures : one possible solution and other possible solutions - several different hyperplanes, such as B1 and B2, can separate the data)
• Which one is better ? B1 or B2 ?
Fig. 2.6.6 : The margin of decision boundary B1 is bounded by the hyperplanes w · x + b = +1 and w · x + b = −1
1. Use a single hyperplane which subdivides the space into two half-spaces, one of which is occupied by Class 1 and the other by Class 2.
2. Map data to a high dimensional space where it is easier to classify with linear decision surfaces : reformulate the problem so that the data is mapped implicitly to this space.
• The most popular versions of SVMs use non-linear kernel functions and map the attribute space into a higher dimensional space to facilitate finding "good" linear decision boundaries in the modified space.
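The kernel idea above can be demonstrated with scikit-learn's RBF kernel. The circular toy dataset and the gamma value below are assumptions chosen only because such data is not linearly separable in the original space; this is a sketch, not the book's example.

# Sketch : a non-linear (RBF) kernel lets the SVM separate data that is not
# linearly separable in the original attribute space. Data are illustrative.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma=2.0).fit(X, y)

print("Linear kernel training accuracy :", linear_svm.score(X, y))
print("RBF kernel training accuracy    :", rbf_svm.score(X, y))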
Example : In Fig. 2.6.7, identify which data points (1, 2, 3, 4, 5) are support vectors.
Fig. 2.6.7
• Margin (m) is the gap between the data points and the classifier boundary. The margin is the minimum distance of any data point from the decision boundary. If this hyperplane is in the canonical form, the margin can be measured by the length of the weight vector.
• What if a training set is not linearly separable ? Slack variables can be added to allow misclassification of difficult or noisy examples; the resulting margin is called soft.
• A soft-margin allows a few variables to cross into the margin or over the hyperplane, allowing misclassification.
• We penalize the crossover by looking at the number and distance of the misclassifications. This is a trade off between the hyperplane violations and the margin size. The slack variables are bounded by some set cost. The farther they are from the soft margin, the less influence they have on the prediction.
• All observations have an associated slack variable.
1. Slack variable = 0 then all points are on the margin.
2. Slack variable > 0 then a point is in the margin or on the wrong side of the hyperplane.
3. C is the tradeoff between the slack variable penalty and the margin.
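The role of C as the tradeoff between the slack-variable penalty and the margin can be seen by fitting the same data with different C values. The generated data and the particular C values below are assumptions for illustration only.

# Sketch : a small C tolerates more margin violations (softer margin, typically
# more support vectors); a large C penalizes slack heavily. Values illustrative.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print("C =", C, "-> number of support vectors :", clf.n_support_.sum())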
Review Questions
1. What are support vectors and margins ? Also explain soft margin SVM.
2. What is slack variable ? Discuss margin errors.
2.7 Ridge Regression
• Ridge regression is one of the most robust versions of linear regression, in which some amount of bias is introduced so that we can get better long term predictions.
• The amount of bias added to the model is called the ridge regression penalty. We can compute this penalty term by multiplying lambda with the squared weight of each individual feature.
• The equation for ridge regression might be :
L(x, y) = min( Σᵢ₌₁ⁿ (yᵢ − wᵢxᵢ)² + λ Σᵢ₌₁ⁿ (wᵢ)² )
• If there is high collinearity among the independent variables, then to solve such problems ridge regression may be used.
• Ridge regression is a regularization technique, which is used to reduce the complexity of the model. It is also referred to as L2 regularization.
• It helps to solve problems where we have more parameters than samples.
Scikit-Learn Code for Ridge Regression
from sklearn.linear_model import Ridge
import numpy as np

# Generate a small random regression dataset
n_samples, n_features = 24, 19
rng = np.random.RandomState(0)
y = rng.randn(n_samples)
X = rng.randn(n_samples, n_features)

# Fit ridge regression with regularization strength alpha = 0.5
rdg = Ridge(alpha=0.5)
rdg.fit(X, y)
print(rdg.score(X, y))
2.8 Lasso Regression
• It is similar to ridge regression except that the penalty term includes only the absolute weights instead of the square of weights.
• Since it takes absolute values, it is able to shrink the slope to 0. The equation for lasso regression might be :
L(x, y) = min( Σᵢ₌₁ⁿ (yᵢ − wᵢxᵢ)² + λ Σᵢ₌₁ⁿ |wᵢ| )
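By analogy with the ridge snippet above, a lasso fit can be sketched with scikit-learn. The random data and the alpha value are assumptions for illustration only.

# Sketch : lasso (L1) regression on random data; the absolute-value penalty
# can shrink some coefficients exactly to zero. Data/alpha are illustrative.
from sklearn.linear_model import Lasso
import numpy as np

n_samples, n_features = 24, 19
rng = np.random.RandomState(0)
y = rng.randn(n_samples)
X = rng.randn(n_samples, n_features)

lasso = Lasso(alpha=0.5)
lasso.fit(X, y)
print("Score :", lasso.score(X, y))
print("Number of zero coefficients :", int(np.sum(lasso.coef_ == 0)))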
2.9 ElasticNet Regression
• This allows for a balance between feature selection and feature preservation.
»
or example, when there are more features than observations, lasuo repression may
“4
e
features
th
ated
ay keep them all. Elastic net regression can select a subset of correl
3
Advantages
1. It can reduce model complexity by eliminating irrelevant features, which is more effective than ridge regression.
2. Elastic net regression can achieve a better trade-off between bias and variance than lasso and ridge regression by tuning the regularization parameters.
3. This type of regression can be applied to various types of data, such as linear, logistic or Cox regression models.
Disadvantages
1. It requires more computational resources and time due to two regularization parameters and a cross-validation process.
2. It may not be easily interpretable, as it could select a large number of features with small coefficients or a small number of features with large coefficients.
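An elastic net fit combines both penalty types. The sketch below mirrors the earlier ridge and lasso snippets; the alpha and l1_ratio values are assumed only for illustration.

# Sketch : elastic net mixes L1 and L2 penalties (l1_ratio controls the mix).
# Data and hyper-parameter values are assumptions for illustration.
from sklearn.linear_model import ElasticNet
import numpy as np

n_samples, n_features = 24, 19
rng = np.random.RandomState(0)
y = rng.randn(n_samples)
X = rng.randn(n_samples, n_features)

enet = ElasticNet(alpha=0.5, l1_ratio=0.5)
enet.fit(X, y)
print("Score :", enet.score(X, y))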
• With enough flips, the bias of a coin can be estimated by the law of large numbers. However, what if we could only flip the coin a handful of times ? Would we guess that a coin is biased if we saw three heads in a row ? In such cases we need probabilities for hypotheses.
2.11 Evaluation Metrics
• Mean Absolute Error (MAE) and Mean Squared Error (MSE) are used to evaluate the regression problem's accuracy.
• Mean Squared Error (MSE) is calculated by taking the average of the square of the difference between the original and predicted values of the data. It can also be called the quadratic cost function or sum of squared errors.
• The value of MSE is always positive or greater than zero. A value close to zero will represent better quality of the estimator/predictor. An MSE of zero (0) represents the fact that the predictor is a perfect predictor.
MSE = (1/N) Σᵢ₌₁ᴺ (Actual values − Predicted values)²
• Here N is the total number of observations/rows in the dataset. The sigma symbol denotes that the difference between actual and predicted values is taken for every i value ranging from 1 to n.
Fig. 2.11.1 Representation of MSE
• Mean squared error is the most commonly used loss function for regression. MSE is sensitive towards outliers and, given several examples with the same input feature values, the optimal prediction will be their mean target value. This should be compared with Mean Absolute Error, where the optimal prediction is the median. MSE is thus good to use if you believe that your target data, conditioned on the input, is normally distributed around a mean value, and when it's important to penalize outliers extra much.
• MSE incorporates both the variance and the bias of the predictor. MSE also gives more weight to larger differences. The bigger the error, the more it is penalized.
• Example : You want to predict future house prices. The price is a continuous value, and therefore we want to do regression. MSE can here be used as the loss function.
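The MSE formula above can be computed directly. The small set of actual and predicted house prices below is made up purely to illustrate the calculation.

# Sketch : MSE = (1/N) * sum((actual - predicted)^2). Values are illustrative.
import numpy as np

actual = np.array([250.0, 300.0, 180.0, 410.0])      # e.g. house prices
predicted = np.array([245.0, 310.0, 175.0, 400.0])

mse = np.mean((actual - predicted) ** 2)
print("MSE :", mse)   # larger errors are penalized quadratically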
• MAE is the sum of absolute differences between our target and predicted variables. So it measures the average magnitude of errors in a set of predictions, without considering their directions.
• The loss is the mean, over seen data, of the absolute differences between true and predicted values.
• Use mean absolute error when you are doing regression and don't want outliers to play a big role. It can also be useful if you know that your distribution is multimodal. MAE loss is useful if the training data is corrupted with outliers.
R-square
• R-squared is also known as the coefficient of determination. This metric gives an indication of how good a model fits a given dataset. It indicates how close the regression line is to the actual data values.
• The R squared value lies between 0 and 1 where 0 indicates that this model doesn't fit the given data and 1 indicates that the model fits perfectly to the dataset provided.
R-squared = 1 − (First sum of errors / Second sum of errors)
e R-squared can also be expressed as a function of mean squared error. R-squared
represents the fraction of variance of response variable captured by the regression
model rather than the MSE which captures the residual error.
• Specifically, this linear regression is used to determine how well a line fits to a data set of observations, especially when comparing models. Also, it is the fraction of the total variation in y that is captured by a model. Or, how well does a line follow the variations within a set of data.
• The R² value varies between 0 and 1 where 0 represents no correlation between the predicted and actual value and 1 represents complete correlation.
e R-squared is a good measure to evaluate the model fitness. It is also known as the
coefficient of determination. R-squared is the fraction by which the variance of the
errors is less than the variance of the dependent variable.
• It is called R-squared because in a simple regression model it is just the square of the correlation between the dependent and independent variables, which is commonly denoted by "r".
• In a multiple regression model R-squared is determined by pairwise correlations among all the variables, including correlations of the independent variables with each other as well as with the dependent variable.
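MAE, MSE and R-squared are all available in scikit-learn's metrics module. The toy actual/predicted values below are assumptions used only to show the calls.

# Sketch : computing the regression metrics discussed above with scikit-learn.
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

print("MAE :", mean_absolute_error(y_true, y_pred))
print("MSE :", mean_squared_error(y_true, y_pred))
print("R-squared :", r2_score(y_true, y_pred))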
• The "best" match between the line and the data is found by choosing β₀, β₁ to minimize
Σᵢ₌₁ⁿ (yᵢ − (β₀ + β₁xᵢ))² = Σᵢ₌₁ⁿ eᵢ²
• The resulting least squares estimates are
β₁ = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ₌₁ⁿ (xᵢ − x̄)² ,  β₀ = ȳ − β₁ x̄
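The minimization above has a closed-form solution for β₀ and β₁. The short check below, with data made up for illustration, computes the least squares line both from the standard formulas and with np.polyfit.

# Sketch : simple linear regression coefficients that minimize the sum of
# squared errors. Data values are illustrative.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print("beta0, beta1 :", b0, b1)
print("np.polyfit   :", np.polyfit(x, y, 1))   # returns [beta1, beta0]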
Review Question
1. What do you mean by coefficient of regression ? Explain SST, SSE, SSR, MSE in the context of regression.
• In other words, it tells us how concentrated the data is around the line of best fit.
• Root mean square error is commonly used in climatology, forecasting and regression analysis to verify experimental results.
• The RMSE represents the square root of the second sample moment of the differences between predicted values and observed values, or the quadratic mean of these differences. These deviations are called residuals when the calculations are performed over the data sample that was used for estimation and are called errors (or prediction errors) when computed out-of-sample.
• The RMSE serves to aggregate the magnitudes of the errors in predictions for various data points into a single measure of predictive power.
• RMSE is calculated by using the following formula :
RMSE = √(SSEw / W)
where :
SSEw = Weighted sum of squares
W = Total weight of the population
N = Number of observations
• RMSD is proportional to the size of the squared errors; thus larger errors have a disproportionately large effect on RMSD. Consequently, RMSE is sensitive to outliers.
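For the unweighted case, RMSE is simply the square root of the MSE. The values below are the same illustrative assumptions used in the earlier MSE sketch.

# Sketch : RMSE as the square root of the mean squared error (unweighted case).
import numpy as np

actual = np.array([250.0, 300.0, 180.0, 410.0])
predicted = np.array([245.0, 310.0, 175.0, 400.0])

rmse = np.sqrt(np.mean((actual - predicted) ** 2))
print("RMSE :", rmse)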
Q.2 a) Explain PCA and LDA. What is the difference between LDA and PCA ? (Refer sections 1.14 and 1.15) [7]
Q.4 a) What is regression ? Explain need of regression. Discuss types of regression. (Refer sections 2.2 and 2.3) [5]
b) Explain Lasso and ElasticNet regression. (Refer sections 2.8 and 2.9) [7]