Unit 1

Introduction to Machine Learning

Syllabus
Introduction : What is Machine Learning, Definitions and Real-life applications, Comparison of Machine learning with traditional programming, ML vs AI vs Data Science.
Learning Paradigms : Learning Tasks - Descriptive and Predictive Tasks, Supervised, Unsupervised, Semi-supervised and Reinforcement Learning.
Models of Machine Learning : Geometric models, Probabilistic models, Logical models, Grouping and grading models, Parametric and non-parametric models.
Feature Transformation : Dimensionality reduction techniques - PCA and LDA.

Contents
1.1 What is Machine Learning ?
1.2 Real-life Applications
1.3 Comparison of Machine Learning with Traditional Programming
1.4 Learning Paradigms
1.5 Supervised Learning
1.6 Unsupervised Learning
1.7 Semi-supervised Learning
1.8 Reinforcement Learning
1.9 Models of Machine Learning
1.10 Grouping and Grading Models
1.11 Parametric Models
1.12 Non-parametric Models
1.13 Feature Transformation
1.14 PCA
1.15 LDA
1.16 Application of Machine Learning
1.1 What is Machine Learning ?
• Machine Learning (ML) is a sub-field of Artificial Intelligence (AI) which is concerned with developing computational theories of learning and building learning machines.
• Learning is a phenomenon and process which has manifestations of various aspects. The learning process includes gaining of new symbolic knowledge and development of cognitive skills through instruction and practice. It is also the discovery of new facts and theories through observation and experiment.
• Machine Learning Definition : A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
• Machine learning is programming computers to optimize a performance criterion using example data or past experience. Application of machine learning methods to large databases is called data mining.
• It is very hard to write programs that solve problems like recognizing a human face. We do not know what program to write because we don't know how our brain does it. Instead of writing a program by hand, it is possible to collect lots of examples that specify the correct output for a given input.
• A machine learning algorithm then takes these examples and produces a program that does the job. The program produced by the learning algorithm may look very different from a typical hand-written program. It may contain millions of numbers. If we do it right, the program works for new cases as well as the ones we trained it on.
• The main goal of machine learning is to devise learning algorithms that do the learning automatically without human intervention or assistance. The machine learning paradigm can be viewed as "programming by example." Another goal is to develop computational models of the human learning process and perform computer simulations.
• The goal of machine learning is to build computer systems that can adapt and learn from their experience.
• An algorithm is used to solve a problem on a computer. An algorithm is a sequence of instructions that transforms the input to the output. For example, addition of four numbers is carried out by giving the four numbers as input to the algorithm; the output is the sum of all four numbers. For the same task there may be various algorithms, and we are interested in finding the most efficient one, requiring the least number of instructions or memory or both.
• For some tasks, however, we do not have an algorithm.

How Machines Learn ?
• Machine learning typically follows three phases :
1. Training : A training set of examples of correct behavior is analyzed and some representation of the newly learnt knowledge is stored. This is often some form of rules.
2. Validation : The rules are checked and, if necessary, additional training is given. Sometimes additional test data are used, but instead, a human expert may validate the rules, or some other automatic knowledge-based component may be used. The role of the tester is often called the opponent.
3. Application : The rules are used in responding to some new situation.
• Fig. 1.1.1 shows the phases of ML.


[Fig. 1.1.1 : Phases of ML - existing knowledge and test data (tester) feed training and validation, producing new knowledge and a response to a new situation]

1.1.1 Why Machine Learning is Important ?
• Machine learning algorithms can figure out how to perform important tasks by generalizing from examples.
• Machine learning provides business insight and intelligence. Decision makers are provided with greater insights into their organizations. This adaptive technology is being used by global enterprises to gain a competitive edge.
• Machine learning algorithms discover the relationships between the variables of a system (input, output and hidden) from direct samples of the system.
• Following are some of the reasons :
1. Some tasks cannot be defined well, except by examples. For example : recognizing people.
2. Relationships and correlations can be hidden within large amounts of data. To solve these problems, machine learning and data mining may be able to find these relationships.
3. Human designers often produce machines that do not work as well as desired in the environments in which they are used.
4. The amount of knowledge available about certain tasks might be too large for explicit encoding by humans.
5. Environments change from time to time.
6. New knowledge about tasks is constantly being discovered by humans.

• Machine learning also helps us find solutions to many problems in computer vision, speech recognition and robotics. Machine learning uses the theory of statistics in building mathematical models, because the core task is making inference from a sample.
• Learning is used when :
1. Human expertise does not exist (navigating on Mars),
2. Humans are unable to explain their expertise (speech recognition),
3. The solution changes in time (routing on a computer network),
4. The solution needs to be adapted to particular cases (user biometrics).

Ingredients of Machine Learning
The ingredients of machine learning are as follows :
1. Tasks : The problems that can be solved with machine learning. A task is an abstract representation of a problem. The standard methodology in machine learning is to learn one task at a time. Large problems are broken into small, reasonably independent sub-problems that are learned separately and then recombined.
Predictive tasks perform inference on the current data in order to make predictions. Descriptive tasks characterize the general properties of the data in the database.
2. Models : The output of machine learning. Different models are geometric models, probabilistic models, logical models, and grouping and grading models.
The model-based approach seeks to create a modified solution tailored to each new application. Instead of having to transform your problem to fit some standard algorithm, in model-based machine learning you design the algorithm precisely to fit your problem.
A model is just made up of a set of assumptions, expressed in a precise mathematical form. These assumptions include the number and types of variables in the problem domain, which variables affect each other and what the effect of changing one variable is on another variable.
Machine learning models are classified as : Geometric models, Probabilistic models and Logical models.
3. Features : The workhorses of machine learning. A good feature representation is central to achieving high performance in any machine learning task.
Feature extraction starts from an initial set of measured data and builds derived values intended to be informative and non-redundant, facilitating the subsequent learning and generalization steps.

Feature selection is a process that chooses a subset of features from the original features so that the feature space is optimally reduced according to a certain criterion.

Review Questions
1. Justify the following :
i) Predict the height of a person. Is it a regression task ?
ii) Find the gender of a person by analyzing his writing style. Is it a classification task ?
iii) Filter out spam emails. Is it an example of unsupervised learning ?
2. What is machine learning ? Explain types of machine learning.

1.2 Real-life Applications
• Examples of successful applications of machine learning :
1. Optical character recognition : Categorize images of handwritten characters by the letters represented.
2. Face detection : Find faces in images (or indicate if a face is present).
3. Spam filtering : Identify email messages as spam or non-spam. Topic spotting : categorize news articles (say) as to whether they are about politics, sports, entertainment, etc.
4. Spoken language understanding : Within the context of a limited domain, determine the meaning of something uttered by a speaker to the extent that it can be classified into one of a fixed set of categories.

Learning Associations
• Learning association is the process of developing insights into various associations between products. A good example is how seemingly unrelated products may reveal an association to one another when analyzed in relation to the buying behaviors of customers.
• This application of machine learning involves studying the association between the products people buy and is also known as basket analysis.
• If a buyer buys X, would they buy Y because of a relationship that can be identified between them ? Knowing these relationships could help in suggesting an associated product to the customer.
• For a higher likelihood of the customer buying it, it can also help in bundling products for a better package.


• This learning of associations between products by a machine is called learning associations. Once we find an association by examining a large amount of sales data, big data analysts can develop a rule to derive a probability test in learning a conditional probability.
• In finding an association rule, we learn a conditional probability of the form P(Y | X), where Y is the product we would like to condition on X, which is the product or the set of products which we know that the customer has already purchased.
• So an example of an association rule can be {eggs, bacon} → {cheese}, expressing that customers who buy eggs and bacon also often buy cheese (to make ham-and-eggs).
• The two basic characteristics of an association rule are support and confidence.
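As a rough illustration (not from the text), support and confidence of a rule such as {eggs, bacon} → {cheese} can be computed directly from transaction counts; the tiny transaction list below is invented for the example.

```python
# Hypothetical transactions; support({A} -> {B}) = P(A and B), confidence = P(B | A).
transactions = [
    {"eggs", "bacon", "cheese"},
    {"eggs", "bacon"},
    {"eggs", "bacon", "cheese", "bread"},
    {"milk", "bread"},
]

antecedent = {"eggs", "bacon"}
consequent = {"cheese"}

n = len(transactions)
n_antecedent = sum(antecedent <= t for t in transactions)           # transactions containing {eggs, bacon}
n_both = sum((antecedent | consequent) <= t for t in transactions)  # containing {eggs, bacon, cheese}

support = n_both / n                 # P(eggs, bacon, cheese)
confidence = n_both / n_antecedent   # P(cheese | eggs, bacon)
print(f"support = {support:.2f}, confidence = {confidence:.2f}")
```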

Classification
• Classification is the process of placing each individual from the population under study into one of many classes. These classes are identified on the basis of independent variables.
• Classification helps analysts use measurements of an object to identify the category to which that object belongs. To establish an efficient rule, analysts use data. The data consist of many examples of objects with their correct classification.
• For example, before a bank decides to disburse a loan, it assesses customers on their ability to repay the loan. We can do this by considering factors such as the customer's earnings, age, savings and financial history. This information is taken from the past data of the loan. Hence, the lender uses this data to create a relationship between customer attributes and related risks.
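A minimal sketch of learning such a rule from past loan data (illustrative only; the records and the single-threshold decision rule are made up, not a bank's actual method):

```python
# Learn a single-threshold rule on "savings" from made-up past loan records.
past_loans = [  # (savings in thousands, repaid?)
    (5, False), (12, False), (20, True), (35, True), (50, True), (8, False),
]

def accuracy(threshold):
    # Rule: predict "will repay" when savings >= threshold.
    return sum((savings >= threshold) == repaid for savings, repaid in past_loans) / len(past_loans)

# "Training": pick the threshold that best separates the labeled examples.
candidates = [savings for savings, _ in past_loans]
best = max(candidates, key=accuracy)
print(f"learned rule: approve if savings >= {best} (training accuracy {accuracy(best):.2f})")

# Classify a new applicant with the learned rule.
new_applicant_savings = 25
print("approve" if new_applicant_savings >= best else "reject")
```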

Face Recognition
• We perform the face recognition task effortlessly; every day we recognize our friends, relatives and family members. We also recognize them by looking at photographs, in which they appear in different poses, hair styles, background lighting, and with or without makeup.
• We do it subconsciously and cannot explain how we do it. Because we can't explain how we do it, we can't write an algorithm.
• A face has some structure. It is not a random collection of pixels; it is a symmetric structure. It contains predefined components like nose, mouth, eyes and ears. Every person's face is a pattern composed of a particular combination of these features. By analyzing sample face images of a person, a learning program captures the pattern specific to that person and uses it to recognize whether a new real face or new image belongs to this specific person or not.


• A machine learning algorithm creates an optimized model of the concept being learned, based on data or past experience.
• In the case of face recognition, the input is an image, the classes are the people to be recognized, and the learning program should learn to associate the face images to identities. This problem is more difficult than optical character recognition because there are more classes, the input image is larger, and a face is three-dimensional, so differences in pose and lighting cause significant changes in the image.

Medical Diagnosis
• In medical diagnosis, the inputs are the relevant information about the patient and the classes are the illnesses. The inputs contain the patient's age, gender, past medical history and current symptoms.
• Some tests may not have been applied to the patient, and thus these inputs would be missing. Tests take time, may be costly and may inconvenience the patient, so we do not want to apply them unless we believe that they will give us valuable information.
• In the case of a medical diagnosis, a wrong decision may lead to a wrong or no treatment, and in cases of doubt it is preferable that the classifier rejects and defers the decision to a human expert.

Regression
• Regression : Trying to predict a real value. For instance, predict the value of a stock tomorrow given its past performance, or predict Alice's score on the machine learning final exam based on her homework scores.
• If the desired output consists of one or more continuous variables, then the task is called regression. An example of a regression problem would be the prediction of the yield in a chemical manufacturing process in which the inputs consist of the concentrations of reactants, the temperature and the pressure.
• The goal of regression is to predict the value of one or more continuous target variables t given the value of a D-dimensional vector x of input variables.
• Navigation of a mobile robot is one of the examples of regression - an autonomous car, where the output is the angle by which the steering wheel should be turned at each time, to advance without hitting obstacles and without deviating from the route.
• Inputs in such a case are provided by sensors on the car, for example, a video camera, GPS etc. Training data can be collected by monitoring and recording the actions of a human driver.


• In regression, we can use the principles of machine learning to optimize parameters, to reduce the approximation error and to calculate the closest possible outcome.
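As an illustrative sketch (not from the text), ordinary least squares for a single input fits y ≈ w·x + b by minimizing the sum of squared errors; the closed-form solution uses means and covariances. The data points below are invented.

```python
# Minimal least-squares fit of y = w*x + b on made-up points.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]   # roughly y = 2x

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# w = cov(x, y) / var(x), b = mean_y - w * mean_x  (closed-form least squares)
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
var_x = sum((x - mean_x) ** 2 for x in xs)
w = cov_xy / var_x
b = mean_y - w * mean_x

print(f"fitted line: y = {w:.2f} * x + {b:.2f}")
print("prediction for x = 6:", w * 6 + b)
```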

Review Questions
1. Write the mathematical form of the following :
i) Classification ii) Class probability estimation iii) Regression
Which one out of these three is more precise ? Which one leads to overfitting ?

1.3 Comparison of Machine Learning with Traditional Programming
• Machine learning seeks to construct a model or logic for the problem by analyzing its input data and answers. In contrast, traditional programming aims to answer a problem using a predefined set of rules or logic.
• Machine learning is the ability of machines to automate a learning process. The input of this learning process is data and the output is a model. Through machine learning, a system can perform a learning function with the data it ingests and thus it becomes progressively better at said function.
• Traditional programming is a manual process. It requires a programmer to create the rules or logic of the program. We have to manually come up with the rules and feed them to the computer alongside the input data. The machine then processes the given data according to the coded rules and comes up with answers as output.
• Fig. 1.3.1 shows machine learning and traditional programming.

[Fig. 1.3.1 : (a) Machine learning process - data and output go into the computer, which produces a program; (b) Traditional programming - data and a program go into the computer, which produces output]

• For projects that involve predicting output or identifying objects in images, machine learning has proven to be much more efficient. In traditional programming, the rule-based approach is preferred in situations where the problem is of an algorithmic nature and there are not so many parameters to consider when writing the logical rules.
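A rough sketch of this contrast (illustrative only; the temperature-conversion task and its data are made up): in traditional programming we write the rule ourselves, while in machine learning the "program" (here, just a slope and an intercept) is estimated from input/output examples.

```python
# Traditional programming: the rule is hand-written by the programmer.
def fahrenheit_traditional(celsius):
    return celsius * 9.0 / 5.0 + 32.0

# Machine learning: the rule is learned from (input, output) examples.
examples = [(0.0, 32.0), (10.0, 50.0), (20.0, 68.0), (30.0, 86.0)]
xs = [c for c, _ in examples]
ys = [f for _, f in examples]
mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
slope = sum((x - mean_x) * (y - mean_y) for x, y in examples) / sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

def fahrenheit_learned(celsius):
    return slope * celsius + intercept

print(fahrenheit_traditional(25.0), fahrenheit_learned(25.0))  # both ~77.0
```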

• Machine learning is a proven technique for helping to solve complex problems, such as facial and voice recognition, recommendation systems, self-driving cars and email spam detection.

ML vs AI vs Data Science
• Artificial Intelligence (AI) is the broad concept of developing machines that can simulate human thinking, reasoning and behavior.
• Machine Learning (ML) is a subset of AI wherein computer systems learn from the environment and, in turn, use these learnings to improve experiences and processes. All machine learning is AI, but not all AI is machine learning.
• Data Science is the processing, analysis and extraction of relevant assumptions from data. It is about finding hidden patterns in the data. A data scientist makes use of machine learning in order to predict future events.
• Fig. 1.3.2 shows the relation between AI, ML and Data Science.

[Fig. 1.3.2 : Relation between AI, ML and Data Science - machine learning (with deep learning inside it) sits within artificial intelligence and overlaps with data science]

• Machine Learning uses statistical models. Artificial Intelligence uses logic and decision trees. Data Science deals with structured data.
• Machine Learning : A form of analytics in which software programs learn about data and find patterns.
• AI : Development of computerized applications that simulate human intelligence and interaction.
• Data Science : The process of using advanced analytics to extract relevant information from data.


In Table Format :

Machine Learning | Artificial Intelligence | Data Science
1. | Focuses on providing a means for algorithms and systems to learn from experience with data and use that experience to improve over time. | Focuses on giving machines cognitive and intellectual capabilities similar to those of humans. | Focuses on extracting information needles from data haystacks to aid in decision-making and planning.
2. | Machine Learning uses statistical models. | Artificial Intelligence uses logic and decision trees. | Data Science deals with structured data.
3. | A form of analytics in which software programs learn about data and find patterns. | Development of computerized applications that simulate human intelligence and interaction. | The process of using advanced analytics to extract relevant information from data.
4. | Objective is to maximize accuracy. | Objective is to maximize the chance of success. | Objective is to extract actionable insights from the data.
5. | ML can be done through supervised, unsupervised or reinforcement learning approaches. | AI encompasses a collection of intelligence concepts, including elements of perception, planning and prediction. | Uses statistics, mathematics, data wrangling, big data analytics, machine learning and various other methods to answer analytics questions.
6. | ML is concerned with knowledge accumulation. | AI is concerned with knowledge dissemination and conscious machine actions. | Data Science is all about data engineering.

1.4 Learning Paradigms

Descriptive Tasks
• Descriptive analytics is the conventional form of business intelligence and data analysis. It seeks to provide a depiction or "summary view" of facts and figures in an understandable format, to either inform or prepare data for further analysis.
• Two primary techniques are used for reporting past events : data aggregation and data mining.
• It presents past data in an easily digestible format for the benefit of a wide business audience.
• It is a set of techniques for reviewing and examining the data set to understand the data and analyze business performance.


• Descriptive analytics helps organisations to understand what happened in the past. It helps to understand the relationship between products and customers.
• The objective of this analysis is to understand what approach to take in the future. If we learn from past behaviour, it helps us to influence future outcomes.
• Company reports are an example of descriptive analytics, which simply provide a historic review of company operations, stakeholders, customers and financials.
• It also helps to describe and present data in such a format that it can be easily understood by a wide variety of business readers.

Predictive Tasks
• To make predictions, predictive mining tasks perform inference on the current data. Predictive analysis provides answers to future queries, using historical data as the chief principle for decisions.
• It involves the supervised learning functions used for the prediction of the target value. The methods that fall under this mining category are classification, time-series analysis and regression.
• Data modeling is a necessity of predictive analysis, which works by utilizing some variables to anticipate the unknown future data values for other variables.
• It provides organizations with actionable insights based on data. It provides an estimation regarding the likelihood of a future outcome.
• To do this, a variety of techniques are used, such as machine learning, data mining, modeling and game theory.
• Predictive modeling can, for example, help to identify any risks or opportunities in the future.
• Predictive analytics can be used in all departments, from predicting customer behaviour in sales and marketing, to forecasting demand for operations or determining risk profiles for finance.
• A very well-known application of predictive analytics is credit scoring, used by financial services to determine the likelihood of customers making future credit payments on time. Determining such a risk profile requires a vast amount of data, including public and social data.
• Historical and transactional data are used to identify patterns, and statistical models and algorithms are used to capture relationships in various datasets.
• Predictive analytics has taken off in the big data era and there are many tools available for organisations to predict future outcomes.


Difference between Descriptive and Predictive Tasks

Sr. No. | Descriptive model | Predictive model
1. | It uses data aggregation and data mining to provide insight into the past and answer. | It uses statistical models and forecasting techniques to understand the future and answer.
2. | What has happened ? | What could happen ?
3. | Descriptive analytics is the analysis of past or historical data to understand trends and evaluate metrics over time. | Predictive analytics predicts future trends.
4. | Examples of tools used : data aggregation and data mining. | Examples of tools used : machine learning, statistical models and simulation.
5. | Used when the user wants to summarize results for all or part of the business. | Used when the user wants to make an educated guess at likely results.
6. | Limitation : a snapshot of the past, often with limited ability to help guide decisions. | Limitation : a guess at the future, helps inform low complexity decisions.

Review Question
1. Explain predictive and descriptive tasks.

1.5 Supervised Learning
• Supervised learning is the machine learning task of inferring a function from supervised training data. The training data consist of a set of training examples. The task of the supervised learner is to predict the output behavior of a system for any set of input values, after an initial training phase.
• In supervised learning the network is trained by providing it with input and matching output patterns. These input-output pairs are usually provided by an external teacher.
• Human learning is based on past experiences. A computer does not have experiences.

• A computer system learns from data, which represent some "past experiences" of an application domain.
• The goal is to learn a target function that can be used to predict the values of a discrete class attribute, e.g. approved or not-approved, and high-risk or low-risk. The task is commonly called : supervised learning, classification or inductive learning.
• Training data include both the input and the desired results. For some examples the correct results (targets) are known and are given in input to the model during the learning process. The construction of a proper training, validation and test set is crucial. These methods are usually fast and accurate.
• The model has to be able to generalize - give the correct results when new data are given in input without knowing a priori the target.
• Supervised learning is the machine learning task of inferring a function from supervised training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object and a desired output value.
• A supervised learning algorithm analyzes the training data and produces an inferred function, which is called a classifier or a regression function. Fig. 1.5.1 shows the supervised learning process.

[Fig. 1.5.1 : Supervised learning process - training data feeds a learning algorithm, which produces a model that is then used for testing]


sy st em , to pe rf or m tas k better as compared to
the
The learned model helps
learning.
es a corres ponding target vector.
‘e Each inpu t vector requir
rget Vector)
_.- Training Pair = ( Input Vector, Ta
15.2 sho ws inp ut vec tor . (Se e Fig. 1.5.2 on next page)
e Fig. collected
inp ut vectors ar e
in which some
od
° Supervised learning denotes a meth ne t- wo rk 1 5 observ
put comput ed by the
and presented to the network. The out ights art
the dev iat ion fro m the exp ect ed ans wer is m easured. The Wwe d by th
and
rected acc ord ing to the ma gn it ud e of the error in the way define
cor
_learning algorithm.


[Fig. 1.5.2 : Input vector - the input feeds a neural network; the actual output is compared with the desired output to generate an error signal]

• Supervised learning is further divided into methods which use reinforcement or error correction. The perceptron learning algorithm is an example of supervised learning with reinforcement.

Data formats in supervised learning :
• Supervised learning always uses a dataset defined as a finite set of real vectors with m features each :
X = {x1, x2, ..., xn} where xi ∈ R^m
• Considering that our approach is always probabilistic, we need to consider each X as drawn from a statistical multivariate distribution D. It is also useful to add an important condition upon the whole dataset X : here we consider all samples to be independent and identically distributed. This means all variables belong to the same distribution D and, considering an arbitrary subset of m values, it happens that :
P(x1, x2, ..., xm) = ∏ (i = 1 to m) P(xi)

• The corresponding output values can be either numerical-continuous or categorical. In the first case the process is called regression, while in the second it is called classification.
• Example : A dataset contains city populations by year for the past 100 years and the user wants to know what the population of a specific city will be four years from now. The outcome uses labels that already exist in the data set : population, city and year.
• In order to solve a given problem of supervised learning, the following steps are performed :
1. Find out the type of training examples.


2. Collect a training set.
3. Determine the input feature representation of the learned function.
4. Determine the structure of the learned function and the corresponding learning algorithm.
5. Complete the design and then run the learning algorithm on the collected training set.
6. Evaluate the accuracy of the learned function. After parameter adjustment and learning, the performance of the resulting function should be measured on a test set that is separate from the training set.
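A compact sketch of steps 2-6 above (illustrative only; the tiny dataset and the 1-nearest-neighbour rule are made up for the example):

```python
import random

# Step 2: collect a (made-up) training set of (feature, label) pairs.
data = [(x, "high" if x > 5 else "low") for x in range(11)]
random.seed(0)
random.shuffle(data)

# Steps 3-4: features are single numbers; the learned function is 1-nearest-neighbour.
train, test = data[:8], data[8:]

def predict(x):
    nearest = min(train, key=lambda pair: abs(pair[0] - x))  # closest training example
    return nearest[1]

# Steps 5-6: "run" the learner (it just memorizes train) and evaluate on the held-out test set.
accuracy = sum(predict(x) == label for x, label in test) / len(test)
print(f"test accuracy: {accuracy:.2f}")
```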
• Supervised learning is divided into two types : Classification and Regression.
1. Classification :
• Classification predicts categorical labels (classes), while prediction models continuous-valued functions. Classification is considered to be supervised learning.
• Classification classifies data based on the training set and the values in a classifying attribute, and uses it in classifying new data. Prediction means modelling continuous-valued functions, i.e., it predicts unknown or missing values.
• Preprocessing of the data in preparation for classification and prediction can involve data cleaning to reduce noise or handle missing values, relevance analysis to remove irrelevant or redundant attributes, and data transformation, such as generalizing the data to higher level concepts or normalizing data.
• Numeric prediction is the task of predicting continuous values for a given input. For example, we may wish to predict the salary of a college employee with 15 years of work experience, or the potential sales of a new product given its price.
• Some of the classification methods like back-propagation, support vector machines and k-nearest-neighbor classifiers can be used for prediction.
2. Regression :
• For an input x, if the output is continuous, this is called a regression problem. For example, based on historical information of demand for toothpaste in a supermarket, the user is asked to predict the demand for the next month.
• Regression is concerned with the prediction of continuous quantities. Linear regression is the oldest and most widely used predictive model in the field of machine learning. The goal is to minimize the sum of the squared errors to fit a straight line to a set of data points.
• Regression algorithms used in supervised learning are linear regression, Bayesian linear regression, polynomial regression, regression tree etc.


Advantages and Disadvantages of Supervised Learning
1. Advantages of supervised learning
• It performs classification and regression tasks.
• It allows estimating or mapping the result to a new sample.
• We have complete control over choosing the number of classes we want in the training data.
2. Disadvantages of supervised learning
• Supervised learning cannot handle all complex tasks in machine learning.
• Computation time is vast for supervised learning.
• It requires a labelled data set.
• It requires a training process.

Review Questions
1. Explain supervised learning with an example.
2. Explain data formats for a supervised learning problem with an example.

1.6 Unsupervised Learning
• Unsupervised learning is a type of machine learning in which models are trained using an unlabeled dataset and are allowed to act on that data without any supervision.
• In unsupervised learning, a dataset is provided without labels and a model learns useful properties of the structure of the dataset. The main goal of unsupervised learning is to discover hidden and interesting patterns in unlabeled data.
• They are called unsupervised because they do not need a teacher or supervisor to label a set of training examples. Only the original data is required to start the analysis.
• Unsupervised learning tasks typically involve grouping similar examples together, dimensionality reduction and density estimation.
• Common algorithms used in unsupervised learning include clustering, anomaly detection and neural networks.


• Fig. 1.6.1 shows unsupervised learning.
[Fig. 1.6.1 : Unsupervised learning - raw input data passes through interpretation and a processing algorithm to produce outputs]
• The most common unsupervised learning method is cluster analysis, which applies clustering methods to explore data and find hidden patterns or groupings in data.
• Unsupervised learning is typically applied before supervised learning, to identify features in exploratory data analysis and establish classes based on groupings.
• Unsupervised machine learning is mainly used to :
1. Cluster datasets on similarities between features or segment data.
2. Understand relationships between different data points, such as automated music recommendations.
3. Perform initial data analysis.
• Unsupervised learning algorithms have the capability of analyzing large amounts of data and identifying unusual points among the dataset. Once those anomalies have been detected, they can be brought to the awareness of the user, who can then decide to act or not on this warning.
• Anomaly detection can be very useful in the financial and banking sectors. Indeed, financial fraud has become a daily problem, due to the ease with which credit card details can be stolen. Using unsupervised learning models, unauthorized or fraudulent transactions on a bank account can be identified, as they will most often constitute a change in the user's normal pattern of spending.
• Example : Using customer data, the user wants to create segments of customers who like similar products. The data that the user is providing is not labeled, and the labels in the outcome are generated based on the similarities that were discovered between data points.
• The types of unsupervised learning are clustering and association analysis.
• There is a wide range of algorithms that can be deployed under unsupervised learning. A few of them include k-means clustering, principal component analysis, hierarchical clustering and dendrograms.
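A small clustering sketch using the k-means idea mentioned above (illustrative only; the points and the choice of k are made up): assign each point to the nearest centre, move each centre to the mean of its points, and repeat.

```python
# Toy k-means on 1-D points with k = 2 clusters.
points = [1.0, 1.2, 0.8, 8.0, 8.5, 7.9, 9.1]
centres = [points[0], points[3]]          # naive initialisation

for _ in range(10):                       # a few assignment / update rounds
    clusters = {0: [], 1: []}
    for p in points:
        nearest = min((0, 1), key=lambda c: abs(p - centres[c]))
        clusters[nearest].append(p)
    centres = [sum(c) / len(c) for c in clusters.values()]

print("centres:", centres)
print("cluster sizes:", [len(c) for c in clusters.values()])
```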

Advantages and Disadvantages of Unsupervised Learning
1. Advantages of unsupervised learning
• It does not require the training data to be labelled.


• Dimensionality reduction can be easily accomplished using unsupervised learning.
• It is capable of finding previously unknown patterns in data.
2. Disadvantages of unsupervised learning
• It is difficult to measure accuracy or effectiveness due to the lack of predefined answers during training.
• The results often have lesser accuracy.
• The user needs to spend time interpreting and labelling the classes which follow that classification.

Difference between Supervised and Unsupervised Learning

Sr. No. | Supervised learning | Unsupervised learning
1. | Desired output is given. | Desired output is not given.
2. | It is not possible to learn larger and more complex models than with unsupervised learning. | It is possible to learn larger and more complex models with unsupervised learning.
3. | Uses training data to infer the model. | No training data is used.
4. | Every input pattern that is used to train the network is associated with an output pattern. | The target output is not presented to the network.
5. | Tries to predict a function from labeled data. | Tries to detect interesting relations in data.
6. | Supervised learning requires that the target variable is well defined and that a sufficient number of its values are given. | For unsupervised learning, typically either the target variable is unknown or it has only been recorded for too small a number of cases.
7. | Example : Optical character recognition. | Example : Find a face in an image.
8. | We can test our model. | We cannot test our model.
9. | Supervised learning is also called classification. | Unsupervised learning is also called clustering.

1.7 Semi-supervised Learning
• Semi-supervised learning uses both labeled and unlabeled data to improve supervised learning. The goal is to learn a predictor that predicts future test data better than the predictor learned from the labeled training data alone.
• Semi-supervised learning is motivated by its practical value in learning faster, better and cheaper.


• In many real world applications, it is relatively easy to acquire a large amount of unlabeled data x.
• For example, documents can be crawled from the Web, images can be obtained from surveillance cameras, and speech can be collected from broadcasts. However, their corresponding labels y for the prediction task, such as sentiment orientation, intrusion detection and phonetic transcripts, often require slow human annotation and expensive laboratory experiments.
• In many practical learning domains, there is a large supply of unlabeled data but limited labeled data, which can be expensive to generate. For example : text processing, video-indexing, bio-informatics etc.
• Semi-supervised learning makes use of both labeled and unlabeled data for training, typically a small amount of labeled data with a large amount of unlabeled data. When unlabeled data is used in conjunction with a small amount of labeled data, it can produce a considerable improvement in learning accuracy.
• Semi-supervised learning sometimes enables predictive model testing at reduced cost.
• Semi-supervised classification : Training on labeled data exploits additional unlabeled data, frequently resulting in a more accurate classifier.
• Semi-supervised clustering : Uses a small amount of labeled data to aid and bias the clustering of unlabeled data.
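A rough self-training sketch, one common semi-supervised scheme (illustrative only; the data and the nearest-class-mean classifier are made up): train on the labeled examples, pseudo-label the unlabeled points the model is confident about, then retrain.

```python
# Self-training with a nearest-class-mean classifier on 1-D points (toy data).
labeled = [(1.0, "A"), (1.5, "A"), (8.0, "B"), (9.0, "B")]
unlabeled = [1.2, 2.0, 7.5, 8.6, 5.2]

def class_means(pairs):
    means = {}
    for cls in {c for _, c in pairs}:
        vals = [x for x, c in pairs if c == cls]
        means[cls] = sum(vals) / len(vals)
    return means

for _ in range(3):  # a few self-training rounds
    means = class_means(labeled)
    confident, remaining = [], []
    for x in unlabeled:
        dists = sorted((abs(x - m), cls) for cls, m in means.items())
        if dists[0][0] < 0.5 * dists[1][0]:      # pseudo-label only clearly closer points
            confident.append((x, dists[0][1]))
        else:
            remaining.append(x)
    labeled += confident
    unlabeled = remaining

print(class_means(labeled), "still unlabeled:", unlabeled)
```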
1.7.1 Comparison between Supervised, Unsupervised and Semi-supervised Learning

Sr. No. | Supervised learning | Unsupervised learning | Semi-supervised learning
1. | Input data is labeled. | Input data is unlabeled. | A large amount of input data is unlabeled while a small amount is labeled.
2. | Trying to predict a specific quantity. | Trying to understand the data. | Using unsupervised methods to improve a supervised algorithm.
3. | Used in fraud detection. | Used in identity management. | Used in spam detection.
4. | Subtypes : Classification and regression. | Subtypes : Clustering and association. | Subtypes : Classification, clustering, regression and association.
5. | Higher accuracy. | Lesser accuracy. | Lesser accuracy.

1.8 Reinforcement Learning

• Reinforcement Learning (RL) is the science of decision making. It is about learning the optimal behavior in an environment to obtain maximum reward. In RL, the data is accumulated from machine learning systems that use a trial-and-error method. Data is not part of the input that we would find in supervised or unsupervised machine learning.
• Reinforcement learning uses algorithms that learn from outcomes and decide which action to take next. After each action, the algorithm receives feedback that helps it determine whether the choice it made was correct, neutral or incorrect. It is a good technique to use for automated systems that have to make a lot of small decisions without human guidance.
• Reinforcement learning is an autonomous, self-teaching system that essentially learns by trial and error. It performs actions with the aim of maximizing rewards, or in other words, it is learning by doing in order to achieve the best outcomes.
• A good example of using reinforcement learning is a robot learning how to walk. The robot first tries a large step forward and falls. The outcome of a fall with that big step is a data point the reinforcement learning system responds to. Since the feedback was negative, a fall, the system adjusts the action to try a smaller step. The robot is able to move forward. This is an example of reinforcement learning in action.
• Reinforcement learning is learning what to do and how to map situations to actions. The learner is not told which actions to take. Fig. 1.8.1 shows the concept of reinforcement learning.
[Fig. 1.8.1 : Reinforcement learning - the agent senses the situation (state s_t) of the environment, takes an action, and the environment returns the next state s_t+1 and a reward]
• Reinforcement learning deals with agents that must sense and act upon their environment. It combines classical Artificial Intelligence and machine learning techniques.
• It allows machines and software agents to automatically determine the ideal behavior within a specific context, in order to maximize performance. Simple reward feedback is required for the agent to learn its behavior; this is known as the reinforcement signal.
• The two most important distinguishing features of reinforcement learning are trial-and-error and delayed reward.


• With reinforcement learning algorithms an agent can improve its performance by using the feedback it gets from the environment. This environmental feedback is called the reward signal.
• Based on accumulated experience, the agent needs to learn which action to take in a given situation in order to obtain a desired long term goal. Essentially, actions that lead to long term rewards need to be reinforced. Reinforcement learning has connections with control theory, Markov decision processes and game theory.

1.8.1 Elements of Reinforcement Learning
• The reinforcement learning elements are as follows :
1. Policy 2. Reward function 3. Value function 4. Model of the environment
• Fig. 1.8.2 shows the elements of RL.
[Fig. 1.8.2 : Elements of reinforcement learning - the agent acts on the environment; an interpreter turns the environment's response into a state and a reward for the agent]
• Policy : A policy defines the learning agent's behavior for a given time period. It is a mapping from perceived states of the environment to actions to be taken when in those states.
• Reward function : A reward function is used to define a goal in a reinforcement learning problem. It also maps each perceived state of the environment to a single number.
• Value function : Value functions specify what is good in the long run. The value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state.
• Model of the environment : Models are used for planning.
• Credit assignment problem : Reinforcement learning algorithms learn to generate an internal value for the intermediate states as to how good they are in leading to the goal.
• The learning decision maker is called the agent. The agent interacts with the environment, which includes everything outside the agent.
• The agent has sensors to decide on its state in the environment and takes an action that modifies its state.
• The reinforcement learning problem model is an agent continuously interacting with an environment. The agent and the environment interact in a sequence of time steps. At each time step t, the agent receives the state of the environment and a scalar numerical reward for the previous action, and then the agent takes an action.
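A toy sketch of this interaction loop (illustrative only; the 5-state corridor environment and the Q-learning update rule are my example, not the book's): at each step the agent observes a state, picks an action, receives a reward, and updates its estimate of long-term value.

```python
import random

# Corridor of 5 states; reaching state 4 gives reward 1, every other step gives 0.
n_states, actions = 5, [-1, +1]          # move left or right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.2    # learning rate, discount, exploration

random.seed(0)
for episode in range(200):
    s = 0
    while s != n_states - 1:
        # Epsilon-greedy action selection (trial and error).
        a = random.choice(actions) if random.random() < epsilon else max(actions, key=lambda b: Q[(s, b)])
        s_next = min(max(s + a, 0), n_states - 1)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Q-learning update: move Q(s, a) toward reward + discounted best future value.
        best_next = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next

# Learned policy: the greedy action in every non-terminal state (should be +1, move right).
print([max(actions, key=lambda b: Q[(s, b)]) for s in range(n_states - 1)])
```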


• Reinforcement learning is a technique for solving Markov decision problems.
• Reinforcement learning uses a formal framework defining the interaction between a learning agent and its environment in terms of states, actions and rewards. This framework is intended to be a simple way of representing essential features of the artificial intelligence problem.

1.8.2 Applications of Reinforcement Learning
1. Robotics : Robots with pre-programmed behavior are useful in structured environments, such as the assembly line of an automobile manufacturing plant, where the task is repetitive in nature.
2. A master chess player makes a move. The choice is informed both by planning, anticipating possible replies and counter-replies.
3. An adaptive controller adjusts parameters of a petroleum refinery's operation in real time.

1.8.3 Advantages and Disadvantages of Reinforcement Learning
Advantages of reinforcement learning
1. Reinforcement learning can be used to solve very complex problems that cannot be solved by conventional techniques.
2. The model can correct the errors that occurred during the training process.
3. In RL, training data is obtained via the direct interaction of the agent with the environment.
Disadvantages of reinforcement learning
1. Reinforcement learning is not preferable for solving simple problems.
2. Reinforcement learning needs a lot of data and a lot of computation.

Review Question

1. Discuss reinforcement learning and briefly write its applications.

1.9 Models of Machine Learning
• A machine learning model is a program that can find patterns or make decisions from a previously unseen dataset. For example, in natural language processing, machine learning models can parse and correctly recognize the intent behind previously unheard sentences or combinations of words.

• A machine learning model can perform such tasks by having it 'trained' with a large dataset. During training, the machine learning algorithm is optimized to find certain patterns or outputs from the dataset, depending on the task. The output of this process - often a computer program with specific rules and data structures - is called a machine learning model.
• For classification and regression problems, there are different choices of machine learning models, each of which can be viewed as a black box that solves the same problem. However, each model comes from a different algorithmic approach and will perform differently under different data sets. The best way is to use cross-validation to determine which model performs best on test data.
• The model-based approach seeks to create a modified solution tailored to each new application. Instead of having to transform the user's problem to fit some standard algorithm, in model-based machine learning the user designs the algorithm precisely to fit the problem.
• The core idea at the heart of model-based machine learning is that all the assumptions about the problem domain are made explicit in the form of a model. A model is just made up of a set of assumptions, expressed in a precise mathematical form. These assumptions include the number and types of variables in the problem domain, which variables affect each other and what the effect of changing one variable is on another variable.
• Machine learning models are classified as :
1. Geometric models (using the geometry of the instance space).
2. Probabilistic models (using probability to classify the instance space).
3. Logical models (using a logical expression).

Geometric Model
• Here, we consider models that define similarity by considering the geometry of the instance space. In geometric models, features could be described as points in two dimensions (x- and y-axis) or in a three-dimensional space (x, y and z).
• Geometric models are constructed directly in instance space, using geometric concepts like lines, planes and distances.
• Even when features are not intrinsically geometric, they could be modelled in a geometric manner (for example, temperature as a function of time can be modelled in two axes).
• These models use intuitions from geometry such as separating hyperplanes, linear transformations and distance metrics. The main goal of this method is to find a set of representative features of geometric form to represent an object, by collecting geometric features from images and learning them using efficient machine learning methods.

• In geometric models, there are two ways we could impose similarity. We could use geometric concepts like lines or planes to segment (classify) the instance space. These are called linear models.
• Linear models are parametric, which means that they have a fixed form with a small number of numeric parameters that need to be learned from data. Linear models have low variance and high bias. This implies that linear models are less likely to overfit the training data than some other models.
• In the other method, we can use the geometric notion of distance to represent similarity. In this case, if two points are close together, they have similar values for features and thus can be classed as similar. We call such models distance-based models.
• Examples of distance-based models include the nearest-neighbour models, which use the training data.
• Geometric learning methods can not only solve recognition problems but also predict subsequent actions by analyzing a set of sequential input sensory images, usually by extracting some features of the images.
• Examples of geometric models : k-nearest neighbours, linear regression, support vector machine, logistic regression.
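As a small illustration of a linear (geometric) model (not from the text; the 2-D data are invented), the perceptron learns a separating line w·x + b = 0 by nudging the weights whenever a training point is misclassified:

```python
# Perceptron: learn a linear separator for 2-D points (toy, linearly separable data).
points = [((2.0, 1.0), +1), ((3.0, 2.5), +1), ((0.5, 3.0), -1), ((1.0, 4.0), -1)]
w, b = [0.0, 0.0], 0.0

for _ in range(20):                      # a few passes over the data
    for (x1, x2), label in points:
        pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else -1
        if pred != label:                # mistake: move the hyperplane toward the point
            w[0] += label * x1
            w[1] += label * x2
            b += label

print("weights:", w, "bias:", b)
print([1 if w[0] * x1 + w[1] * x2 + b > 0 else -1 for (x1, x2), _ in points])
```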


Geometric features :
1. Corners : A corner is a very simple but significant feature of objects. In particular, complex objects usually have corner features different from each other. The corners of an object can be extracted by using the technique called corner detection.
2. Edges : Edges are one-dimensional structural features of an image. They represent the boundary between different image regions. The outline of an object can be easily detected by finding the edges using the technique of edge detection.
3. Blobs : Blobs represent regions of images, which can be detected using a blob detection method.
4. Ridges : From a practical viewpoint, a ridge can be thought of as a one-dimensional curve that represents an axis of symmetry; ridges are detected with a ridge detection method.
Probabilistic Models
• Probabilistic models view learning as a process of reducing uncertainty, modelled by means of probability distributions. A model describes data that one could observe from a system. We use the mathematics of probability theory to express all forms of uncertainty and noise associated with our model.
• Probabilistic models see features and target variables as random variables. The process of modelling represents and manipulates the level of uncertainty with respect to these variables.
• Fig. 1.9.1 shows probabilistic logic learning.

[Fig. 1.9.1 : Probabilistic logic learning as the intersection of probability, logic and learning]

• There are two types of probabilistic models : predictive and generative.
• Predictive probability models use the idea of a conditional probability distribution P(Y | X) from which Y can be predicted from X. Generative models estimate the joint distribution P(Y, X).
• Once we know the joint distribution for the generative models, we can derive any conditional or marginal distribution involving the same variables. Thus, the generative model is capable of creating new data points and their labels, knowing the joint probability distribution.
• The joint distribution looks for a relationship between two variables. Once this relationship is inferred, it is possible to infer new data points. Naive Bayes is an example of a probabilistic classifier.
• Examples of probabilistic models : Naive Bayes, Gaussian process regression, conditional random field.
• Probabilistic modeling is a statistical technique used to take into account the impact of random events or actions in predicting the potential occurrence of future outcomes.
• In machine learning, we train the system by using a limited data set called 'training data' and, based on the confidence level of the training data, we expect the machine learning algorithm to depict the behaviour of the larger set of actual data.


• Probability theory provides a mathematical foundation for quantifying the uncertainty of the knowledge.
• ML is focused on making predictions as accurate as possible, while traditional statistical models are aimed at inferring relationships between variables.
• We make observations using the sensors in the world. Based on the observations, we intend to make decisions. Given the same observations, the decision should be the same. However, the world changes, observations change and our sensors change, but the output should not change.
• We build models for predictions; can we trust them ? Are they certain ? Many applications of machine learning depend on good estimation of the uncertainty :
a) Forecasting
b) Decision making
c) Learning from limited, noisy and missing data
d) Learning complex personalised models
e) Data compression
f) Automating scientific modelling, discovery and experiment design.
• A signal is called random if its occurrence cannot be predicted. Such a signal cannot be described by any mathematical equation.
• Random signals are represented collectively by a random variable; the value it will take at a particular time is not known.
• The random variables are analyzed statistically with the help of probability, probability density functions and statistical averages such as mean, variance etc.
• Relative frequency : For event 'A', the relative frequency is defined as,

    Relative frequency = Number of times event A occurs (N_A) / Total number of trials (N) = N_A / N

• As the number of trials approaches infinity, the relative frequency is called probability. The probability of event 'A' is defined as the ratio of the number of possible favourable outcomes to the total number of outcomes, i.e.,

    Probability, P(A) = lim (N → ∞) (Number of possible favourable outcomes / Total number of outcomes)
Permutations and Combinations

• Combination of 'n' taken 'r' at a time : nCr = n! / ((n - r)! r!)
• Permutation of 'n' taken 'r' at a time : nPr = n! / (n - r)!

Naive Bayes Classifier
• Naive Bayes classifiers are a family of simple probabilistic classifiers applying Bayes' theorem with strong independence assumptions between the features. They are highly scalable, requiring a number of parameters linear in the number of variables (features / predictors) in a learning problem.
• A Naive Bayes classifier is a program which predicts a class value given a set of attributes.

• For each known class value :
  1. Calculate probabilities for each attribute, conditional on the class value.
  2. Use the product rule to obtain a joint conditional probability for the attributes.
  3. Use Bayes rule to derive conditional probabilities for the class variable.
• Once this has been done for all class values, output the class with the highest probability.
• Naive Bayes simplifies the calculation of probabilities by assuming that the probability of each attribute belonging to a given class value is independent of all other attributes. This is a strong assumption but results in a fast and effective method.
• The probability of a class value given a value of an attribute is called the conditional probability. By multiplying the conditional probabilities together for each attribute for a given class value, we have a probability of a data instance belonging to that class.
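• As an illustration of these steps, the following is a minimal sketch (not from the text) of a categorical Naive Bayes classifier written from scratch in Python; the toy weather data and attribute names are hypothetical.

# Minimal from-scratch Naive Bayes sketch on hypothetical categorical data
from collections import Counter, defaultdict

# Each row : ((outlook, windy), play)
data = [(("sunny", "no"), "yes"), (("sunny", "yes"), "no"),
        (("rain", "no"), "yes"), (("rain", "yes"), "no"),
        (("sunny", "no"), "yes")]

class_counts = Counter(label for _, label in data)          # counts for the prior P(class)
cond_counts = defaultdict(Counter)                          # counts for P(attribute value | class)
for features, label in data:
    for i, value in enumerate(features):
        cond_counts[(i, label)][value] += 1

def predict(features):
    total = sum(class_counts.values())
    posteriors = {}
    for label, count in class_counts.items():
        p = count / total                                   # prior probability of the class
        for i, value in enumerate(features):
            p *= cond_counts[(i, label)][value] / count     # product rule over the attributes
        posteriors[label] = p                               # unnormalised posterior (Bayes rule)
    return max(posteriors, key=posteriors.get)

print(predict(("sunny", "no")))                             # expected output : 'yes'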

Conditional Probability
• Let A and B be two events such that P(A) > 0. We denote by P(B|A) the probability of B given that A has occurred. Since A is known to have occurred, it becomes the new sample space, replacing the original S. From this, the definition is,

    P(B|A) = P(A ∩ B) / P(A)
    or
    P(A ∩ B) = P(A) P(B|A)
• The notation P(B|A) is read "the probability of event B given event A". It is the probability of an event B given the occurrence of the event A.
• We say that the probability that both A and B occur is equal to the probability that A occurs times the probability that B occurs given that A has occurred. We call P(B|A) the conditional probability of B given A, i.e., the probability that B will occur given that A has occurred.
• Similarly, the conditional probability of an event A given B is,

    P(A|B) = P(A ∩ B) / P(B)

• The probability P(A|B) simply reflects the fact that the probability of an event A may depend on a second event B. If A and B are mutually exclusive, A ∩ B = ∅ and P(A|B) = 0.

• Another way to look at the conditional probability formula is :

    P(Second | First) = P(First choice and second choice) / P(First choice)

• Conditional probability is a defined quantity and cannot be proven.
The key to solving conditional probability problems is to :
1. Define the events.
2. Express the given information and question in probability notation.
3. Apply the formula.

Joint Probability
• A joint probability is a probability that measures the likelihood that two or more events will happen concurrently.
• If there are two independent events A and B, the probability that A and B will occur is found by multiplying the two probabilities. Thus for two events A and B, the special rule of multiplication, shown symbolically, is :

    P(A and B) = P(A) P(B)

• The general rule of multiplication is used to find the joint probability that two events will occur. Symbolically, the general rule of multiplication is,

    P(A and B) = P(A) P(B|A)

• The probability P(A ∩ B) is called the joint probability for two events A and B which intersect in the sample space. A Venn diagram readily shows that

    P(A ∩ B) = P(A) + P(B) - P(A ∪ B)


• Equivalently :

    P(A ∪ B) = P(A) + P(B) - P(A ∩ B) ≤ P(A) + P(B)

• The probability of the union of two events never exceeds the sum of the event probabilities.
• A tree diagram is very useful for portraying conditional and joint probabilities. A tree diagram portrays outcomes that are mutually exclusive.

Bayes Theorem
• Bayes' theorem is a method to revise the probability of an event given additional information. Bayes' theorem calculates a conditional probability called a posterior or revised probability.
• Bayes' theorem is a result in probability theory that relates conditional probabilities. If A and B denote two events, P(A|B) denotes the conditional probability of A occurring, given that B occurs. The two conditional probabilities P(A|B) and P(B|A) are in general different.
• Bayes theorem gives a relation between P(A|B) and P(B|A). An important application of Bayes' theorem is that it gives a rule for how to update or revise the strengths of evidence-based beliefs in light of new evidence a posteriori.
• A prior probability is an initial probability value originally obtained before any additional information is obtained.
• A posterior probability is a probability value that has been revised by using additional information that is later obtained.
• Suppose that B1, B2, B3, ..., Bn partition the outcomes of an experiment and that A is another event. For any number k, with 1 ≤ k ≤ n, we have the formula :

    P(Bk|A) = P(A|Bk) P(Bk) / Σ (i = 1 to n) P(A|Bi) P(Bi)
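• The following small numeric sketch (not from the text) shows how the formula can be applied in Python; the partition {B1, B2} and the probability values are hypothetical.

# Hypothetical priors P(Bk) and likelihoods P(A | Bk) for a two-part partition
priors = {"B1": 0.3, "B2": 0.7}
likelihoods = {"B1": 0.8, "B2": 0.1}

# Total probability P(A) and the revised (posterior) probabilities P(Bk | A)
evidence = sum(priors[b] * likelihoods[b] for b in priors)
posteriors = {b: priors[b] * likelihoods[b] / evidence for b in priors}
print(posteriors)      # P(B1 | A) ≈ 0.774, P(B2 | A) ≈ 0.226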

Logical Models
• Logical models are defined in terms of easily interpretable logical expressions. Logical models use a logical expression to divide the instance space into segments and hence construct grouping models.
• A logical expression is an expression that returns a Boolean outcome, i.e., a True or False value. Models involving logical statements are human-understandable and can easily be translated into rules.
• Once the data is grouped using a logical expression, the data is divided into homogeneous groupings for the problem we are trying to solve. For example, for a classification problem, all the instances in a group belong to one class.
• There are mainly two kinds of logical models : Tree models and Rule models.
• Rule models consist of a collection of implications or IF-THEN rules. For tree-based models, the 'if-part' defines a segment and the 'then-part' defines the behaviour of the model for this segment. Rule models follow the same reasoning.
• Examples of logical models : Decision tree, random forest.
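• As a minimal sketch of a tree-based logical model (assuming scikit-learn is available), the snippet below fits a shallow decision tree and prints its IF-THEN structure; each root-to-leaf path is a conjunction of literals defining a segment of the instance space.

# Fit a small decision tree and print its rule-like structure
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)
print(export_text(tree, feature_names=iris.feature_names))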
1.10 Grouping and Grading Models

• Grading vs. grouping is an orthogonal categorization to the geometric - probabilistic - logical - compositional models. The difference between grouping and grading models is the way they handle the instance space.

Grouping Model :
• Grouping models break the instance space up into groups or segments and apply a very simple method in each segment. Example : Decision tree, KNN.
• Grouping models have fixed resolution. They cannot distinguish instances beyond that resolution. At the finest resolution, grouping models assign the majority class to all instances that fall into the segment. Find the right segments and label all the objects in that segment.

Grading Model :
• Grading models form one global model over the instance space. They do not use the notion of a segment.
• Grading models are usually able to distinguish between arbitrary instances, no matter how similar they are.
1.11 Parametric Models
• A parametric model can be represented using a pre-determined number of parameters. These methods in machine learning typically take a model-based approach. We make an assumption with respect to the form of the function to be estimated, then we choose an appropriate model based on this assumption to estimate the set of parameters.
• The advantage of the parametric approach is that the model is defined up to a small number of parameters, for example the mean and variance, the sufficient statistics of the distribution. Once those parameters are estimated from the sample, the whole distribution is known.

• We estimate the parameters of the distribution from the given sample, plug these estimates into the assumed model and get an estimated distribution, which we then use to make a decision. The method we use to estimate the parameters of a distribution is maximum likelihood estimation.
• Examples of parametric machine learning algorithms are Logistic Regression, Linear Discriminant Analysis, Perceptron, Naive Bayes and Simple Neural Networks.
• Advantages :
  1. These methods are simpler and easier to understand.
  2. These models are very rapid to learn from data.
  3. They do not need as much training data.
  4. The methods are well-matched to simpler problems.

1.11.1 Maximum Likelihood Estimation
• Maximum Likelihood Estimation (MLE) is a method of estimating the parameters of a statistical model. When applied to a data set and given a statistical model, maximum likelihood estimation provides estimates for the model's parameters.
• Suppose X1, X2, X3, ..., Xn have joint density f(x1, x2, ..., xn | θ). Given observed values X1 = x1, X2 = x2, ..., Xn = xn, the likelihood is

    lik(θ) = f(x1, x2, ..., xn | θ)

  considered as a function of θ.
• If the distribution is discrete, f will be the frequency distribution function.
• The maximum likelihood estimate of θ is that value of θ that maximises lik(θ) : it is the value that makes the observed data the most probable.

Examples of maximizing likelihood :
• A random variable with the Bernoulli distribution is a formalization of a coin toss. The value of the random variable is 1 with probability θ and 0 with probability 1 - θ. Let X be a Bernoulli random variable and let x be an outcome of X. Then we have

    P(X = x) = θ        if x = 1
             = 1 - θ    if x = 0

• Usually, we use the notation P(.) for a probability mass and the notation p(.) for a probability density. For mathematical convenience we write

    P(X = x) = θ^x (1 - θ)^(1-x)
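• The following minimal sketch (not from the text) estimates the Bernoulli parameter θ by maximum likelihood on a hypothetical sequence of coin tosses; for the Bernoulli case the MLE is simply the sample mean.

# Maximum likelihood estimation of the Bernoulli parameter theta (toy data)
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 1])        # observed outcomes of the coin toss

def log_likelihood(theta, x):
    # log lik(theta) = sum_i [ x_i log(theta) + (1 - x_i) log(1 - theta) ]
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

thetas = np.linspace(0.01, 0.99, 99)
best = thetas[np.argmax([log_likelihood(t, x) for t in thetas])]
print("MLE by grid search :", best)            # close to the sample mean
print("Closed form (sample mean) :", x.mean()) # 0.75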


1.12 Non-parametric Models

• The size of a non-parametric model depends on the data; it cannot be represented using a pre-determined number of parameters.
• Non-parametric methods refer to a set of algorithms that do not make any prior assumptions with respect to the form of the function to be estimated. These methods work by approximating the unknown function f, which could be of any form.
• In machine learning, non-parametric methods are also called instance-based or memory-based learning algorithms.
• Density estimation is the problem of reconstructing the probability density function using a set of given data points. Namely, we observe X1, ..., Xn and we want to recover the underlying probability density function generating our dataset.
• Here we discuss histogram methods for density estimation. A histogram is a chart that plots the distribution of a numeric variable's values as a series of bars. Each bar typically covers a range of numeric values called a bin or class; a bar's height indicates the frequency of data points with a value within the corresponding bin.
• For simplicity, we assume that Xi ∈ [0, 1], so p(x) is non-zero only within [0, 1]. We also assume that p(x) is smooth and |p'(x)| ≤ L for all x. The histogram approach is to partition the set [0, 1] into several bins and use the count in each bin as a density estimate.
• When we have M bins, this yields a partition :

    B1 = [0, 1/M), B2 = [1/M, 2/M), ..., B(M-1) = [(M-2)/M, (M-1)/M), BM = [(M-1)/M, 1]

• In such a case, for a given point x ∈ Bℓ, the density estimator from the histogram will be

    p̂n(x) = (Number of observations within Bℓ / n) × (1 / Length of the bin)
          = (M/n) Σ (i = 1 to n) I(Xi ∈ Bℓ)

• The intuition of this density estimator is that the histogram assigns an equal density value to every point within the bin. So for the bin Bℓ that contains x, the ratio of observations within this bin is (1/n) Σ (i = 1 to n) I(Xi ∈ Bℓ), which should be equal to the density estimate times the length of the bin.
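• A minimal sketch of this histogram density estimator, using NumPy on synthetic data in [0, 1] (the data and the choice of M are hypothetical), is given below.

# Histogram density estimate with M equal-width bins on [0, 1]
import numpy as np

rng = np.random.default_rng(0)
X = rng.beta(2, 5, size=1000)                  # synthetic samples in [0, 1]
M = 10

def hist_density(x, X, M):
    n = len(X)
    bin_idx = min(int(x * M), M - 1)           # index of the bin containing x
    lo, hi = bin_idx / M, (bin_idx + 1) / M
    count = np.sum((X >= lo) & (X < hi))       # observations falling in that bin
    return (count / n) / (1.0 / M)             # count fraction divided by bin length

print(hist_density(0.2, X, M))                 # estimated density near x = 0.2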
• Non-parametric methods lean towards additional precision because they try to find the best fit for the data points. However, this comes at the cost of needing a very large number of observations, which is required so as to approximate the unknown function f exactly.
• Non-parametric methods can occasionally present overfitting. They can on occasion learn the errors and noise in such a way that they cannot generalize to new, unseen data points, as these methods tend to be more flexible.
• Examples of non-parametric methods are k-Nearest Neighbors, Decision Trees like CART and C4.5, and Support Vector Machines.


• Advantages of non-parametric methods :
  1. Capable of fitting a large number of functional forms.
  2. There are no assumptions about the underlying function.
  3. They may result in higher performance models for prediction.
• Limitations of non-parametric methods :
  1. They require a lot more training data to estimate the mapping function.
  2. Overfitting : Extra risk of overfitting the training data.

1.12.1 Difference between Non-parametric Methods and Parametric Methods

Sr. No. | Non-parametric methods | Parametric methods
1. | Algorithms that do not make particular assumptions about the kind of mapping function are known as non-parametric algorithms. | A parametric model is a learner that summarizes data through a collection of parameters.
2. | Non-parametric analysis is used to test group medians. | Parametric analysis is used to test group means.
3. | Can be used on small samples. | Tend to need larger samples.
4. | No information about the population is available. | Information about the population is completely known.
5. | Can be used on ordinal and nominal scale data. | Used mainly on interval and ratio scale data.
6. | The samples are not necessarily independent. | Samples are independent.
7. | K-nearest neighbors is an example of a non-parametric model. | Examples of parametric models include logistic regression and Naive Bayes.

1.13 Feature

• In machine learning, features are individual independent variables that act as inputs to your system. A feature is an attribute of a data set that is used in a machine learning process. Selection of the subset of features which are meaningful for machine learning is a sub-area of feature engineering.
• The features in a data set are also called its dimensions. So a data set having 'n' features is called an n-dimensional data set.


A good feature representation is central to achieving high performance in any
machine learning task.
Consider an example of text categorization. Assume that we need to train a model
for classifying a given document as spam and not spam. If we represent
a
document as a bag of words, the feature space consists of a vocabulary of all
unique words present in all the documents in the training
set.
For a collection of 100,000 to 1,000,000 documents, we can easily expect hundreds
of thousands of features. If we further extend this document model to include all
possible bigrams and trigrams, we could easily get over a million features.
A feature tree is a tree such that each internal node is labelled with a feature, and
each edge emanating from an internal node is labelled with a literal. The set of
literals at a node is called a split. Each leaf of the tree represents a logical
expression, which is the conjunction of literals encountered on the path from the
root of the tree to the leaf. The extension of that conjunction is called the instance
space segment associated with the leaf.
• Two features are redundant if they are highly correlated, regardless of whether they are correlated with the task or not.

Feature engineering is the process of creating features (also called "attributes") that
don't already exist in the dataset. This means that if your dataset already contains
enough "useful" features, you don't necessarily need to engineer additional
features. .
Feature engineering refers to the process of translating a data set into features
such that these features are able to represent the data set more effectively and
result in a better learning performance.
If feature engineering is performed properly, it helps to improve the power of
prediction of machine learning algorithms by creating the features using the raw
data that facilitate the machine learning process.
• Elements of feature engineering are feature transformation and feature subset selection.

1.13.1 Feature Transformation

• Feature transformation transforms the data, structured or unstructured, into a new set of features which can represent the underlying problem which machine learning is trying to solve.
• There are two distinct goals of feature transformation :
  1. Achieving the best reconstruction of the original features in the data set.
  2. Achieving the highest efficiency in the learning task.
• There are two variants of feature transformation :
  1. Feature construction.
  2. Feature extraction.

1.13.2 Feature Construction

• Feature construction involves transforming a given set of input features to generate a new set of more powerful features which can then be used for prediction.
• Feature construction methods may be applied to pursue two distinct goals : Reducing data dimensionality and improving prediction performance.
• Steps :
  1. Start with an initial feature space F0.
  2. Transform F0 to construct a new feature space Fnew.
  3. Select a subset of features F from Fnew.
  4. If some terminating criterion is achieved, go back to step 3; otherwise set F0 = F.
  5. F is the newly constructed feature space.
• Feature construction exploits relationships between features and augments the feature space by creating additional features.
• Hence, if there are 'n' features or dimensions in a data set, after feature construction 'm' more features or dimensions may get added. So at the end, the data set will become 'n + m' dimensional.

• The task of constructing appropriate features is often highly application specific and labour intensive. Thus building automated feature construction methods that require minimal user effort is challenging. In particular we want methods that :
  1. Generate a set of features that help improve prediction accuracy.
  2. Are computationally efficient.
  3. Are generalizable to different classifiers.
  4. Allow for easy addition of domain knowledge.


• Genetic programming is an evolutionary algorithm-based technique that starts with a population of individuals, evaluates them based on some fitness function and constructs a new population by applying a set of mutation and crossover operators on high scoring individuals and eliminating the low scoring ones.
• In the feature construction paradigm, genetic programming is used to derive a new feature set from the original one. Individuals are often tree-like representations of features, the fitness function is usually based on the prediction performance of the classifier trained on these features, while the operators can be application specific.
• The method essentially performs a search in the new feature space and helps generate a high performing subset of features. The newly generated features may often be more comprehensible and intuitive than the original feature set, which makes GP-related methods well-suited for such tasks.
• In decision trees, the model explicitly selects features that are highly correlated with the label. In particular, by limiting the depth of the decision tree, one can at least hope that the model will be able to throw away irrelevant features.

1.13.3 Feature Extraction


• Feature extraction is a process that extracts a set of new features from the original features through some functional mapping. A feature extraction method creates a new feature set.
• Feature extraction increases the accuracy of learned models by extracting features from the input data. This phase of the general framework reduces the dimensionality of data by removing the redundant data.
• A characteristic of these large data sets is a large number of variables that require a lot of computing resources to process.
• Feature extraction is the name for methods that select and combine variables into features, effectively reducing the amount of data that must be processed, while still accurately and completely describing the original data set.
• The process of feature extraction is useful when you need to reduce the number of resources needed for processing without losing important or relevant information. Feature extraction can also reduce the amount of redundant data for a given analysis. Also, the reduction of the data and the machine's efforts in building variable combinations (features) facilitate the speed of the learning and generalization steps in the machine learning process.


1.13.4 Feature Selection

• Feature selection is a process that chooses a subset of features from the original features so that the feature space is optimally reduced according to a certain criterion.
• Feature selection is a critical step in the feature construction process. In text categorization problems, some words simply do not appear very often. Perhaps the word "groovy" appears in exactly one training document, which is positive. Is it really worth keeping this word around as a feature ? It is a dangerous endeavor, because it is hard to tell with just one training example whether it is really correlated with the positive class or is just noise. You could hope that your learning algorithm is smart enough to figure it out, or you could just remove it.
• There are three general classes of feature selection algorithms : Filter methods, wrapper methods and embedded methods.
• The role of feature selection in machine learning is,
  1. To reduce the dimensionality of feature space.
  2. To speed up a learning algorithm.
  3. To improve the predictive accuracy of a classification algorithm.
  4. To improve the comprehensibility of the learning results.
• Feature selection algorithms are as follows :
  1. Instance-based approaches : There is no explicit procedure for feature subset generation. Many small data samples are sampled from the data. Features are weighted according to their roles in differentiating instances of different classes for a data sample. Features with higher weights can be selected.
  2. Nondeterministic approaches : Genetic algorithms and simulated annealing are also used in feature selection.
  3. Exhaustive complete approaches : Branch and Bound evaluates estimated accuracy and ABB checks an inconsistency measure that is monotonic. Both start with a full feature set until the preset bound cannot be maintained.

1.13.5 Subset Selection


• Finding the best subset of the set of features is the main aim of subset selection. The best subset contains the least number of dimensions that most contribute to accuracy.
• Using a suitable error function, this can be used in both regression and classification problems. There are 2^d possible subsets of d variables, but we cannot test for all of them unless d is small, and we employ heuristics to get a reasonable (but not optimal) solution in reasonable (polynomial) time.
• Subset selection is of two types : Forward and backward selection.
  1. Forward selection : Start with no variables and add them one by one, at each step adding the one that decreases the error the most, until any further addition does not decrease the error.
  2. Backward selection : Start with all variables and remove them one by one, at each step removing the one that decreases the error the most, until any further removal increases the error significantly.
• Sequential Forward Selection (SFS) : SFS is the simplest greedy search algorithm. It starts from the empty set and sequentially adds the feature x+ that most improves the objective function. SFS performs best when the optimal subset is small.
• The main disadvantage of SFS is that it is unable to remove features that become obsolete after the addition of other features.
• Sequential Backward Selection (SBS) : It works in the opposite direction of SFS. Starting from the full set, it sequentially removes the feature x− that least reduces the value of the objective function.
• SBS works best when the optimal feature subset is large, since SBS spends most of its time visiting large subsets. The main limitation of SBS is its inability to re-evaluate the usefulness of a feature after it has been discarded.
• SFS is performed from the empty set. SBS is performed from the full set.
e There are two floating methods :
1. Sequential Floating Forward Selection (SFFS) starts from the empty set. After
each forward step, SFFS performs backward steps as long as the objective
function increases.

2. Sequential Floating Backward Selection (SFBS) starts from the full set. After
each backward step, SFBS performs forward steps as long as. the objective
function increases.
e Subset selection is supervised in that outputs are used by the regressor or classifier
to calculate the error, but it can be used with any regression or classification
‘method.
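• A minimal sketch of sequential forward selection, using scikit-learn's SequentialFeatureSelector (assumed to be available in the installed version) on a standard data set, is shown below; direction='backward' would give sequential backward selection instead.

# Sequential forward selection with a logistic regression estimator
from sklearn.datasets import load_wine
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_wine(return_X_y=True)
estimator = LogisticRegression(max_iter=5000)

sfs = SequentialFeatureSelector(estimator, n_features_to_select=4,
                                direction='forward', cv=3)
sfs.fit(X, y)
print(sfs.get_support(indices=True))           # indices of the selected features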

Fig. 1.13.1 (a) Sequential forward selection (b) Sequential backward selection

• In an application like face recognition, feature selection is not a good method for dimensionality reduction, because individual pixels by themselves do not carry much discriminative information; it is the combination of the values of several pixels together that carries information about the face identity. This is done by feature extraction methods.

1.13.6 Curse of Dimensionality

• In machine learning, "dimensionality" simply refers to the number of features (i.e. input variables) in your dataset.
• When the number of features is very large relative to the number of observations in your dataset, certain algorithms struggle to train effective models. This is called the "Curse of Dimensionality", and it is especially relevant for clustering algorithms that rely on distance calculations.
• Dimensionality reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. It can be divided into feature selection and feature extraction.
• Classification problem example : We have an input data {x1, x2, x3, ..., xn} and a set of corresponding output labels. Assume the dimension d of the data point x is very large and we want to classify x.
• Problems with high dimensional input vectors are a large number of parameters to learn; if a dataset is small, this can result in overfitting and large variance of the estimates.
• The solution to this problem is as follows :
  1. Selection of a smaller subset of inputs from a large set of inputs; train the classifier on them.
  2. Combination of high dimensional inputs into a smaller set of features φ(x); train the classifier on the new features.

Fig. 1.13.2 Dimensionality reduction


• There are two components of dimensionality reduction :
  1. Feature selection : We try to find a subset of the original set of variables, or features, to get a smaller subset which can be used to model the problem. It usually involves three ways : Filter, wrapper and embedded.
  2. Feature extraction : This reduces the data in a high dimensional space to a lower dimension space, i.e. a space with a lesser number of dimensions.
• There are many methods to perform dimensionality reduction :
  1. Missing values : While exploring data, if we encounter missing values, what do we do ? Our first step should be to identify the reason, then impute missing values or drop variables using appropriate methods. But what if we have too many missing values ? Should we impute missing values or drop the variables ?
  2. Low variance : Think of a scenario where we have a constant variable in our data set; it carries no information and can be dropped.
  3. Decision trees : They can be used as an ultimate solution to tackle multiple challenges like missing values, outliers and identifying significant variables.
  4. Random forest : Similar to decision trees is the random forest.
  5. High correlation : Dimensions exhibiting higher correlation can lower the performance of the model. Moreover, it is not good to have multiple variables carrying similar information or variation, also known as "multicollinearity".

  6. Backward feature elimination : In this method, we start with all n dimensions. Compute the sum of square of error (SSR) after eliminating each variable (n times). Then, identify the variable whose removal has produced the smallest increase in the SSR and remove it, finally leaving us with n-1 input features.

1.13.7 Advantages and Disadvantages of Dimensionality Reduction

• Advantages of dimensionality reduction :
  1. It helps in data compression and hence reduced storage space.
  2. It reduces computation time.
  3. It also helps remove redundant features, if any.
• Disadvantages of dimensionality reduction :
  1. It may lead to some amount of data loss.
  2. PCA tends to find linear correlations between variables, which is sometimes undesirable.
  3. PCA fails in cases where mean and covariance are not enough to define datasets.
  4. We may not know how many principal components to keep - in practice, some thumb rules are applied.

1.14 PCA

• This method was introduced by Karl Pearson. It works on the condition that while the data in a higher dimensional space is mapped to data in a lower dimensional space, the variance of the data in the lower dimensional space should be maximum.
• The main idea of Principal Component Analysis (PCA) is to reduce the dimensionality of a data set by finding a new set of variables, smaller than the original set of variables, that retains most of the sample's information and is useful for the compression and classification of data.
• Fig. 1.14.1 shows PCA. In PCA, it is assumed that the information is carried in the variance of the features, that is, the higher the variation in a feature, the more information that feature carries.

• Hence, PCA employs a linear transformation that is based on preserving the most variance in the data using the least number of dimensions.
• It involves the following steps :
  1. Construct the covariance matrix of the data.
  2. Compute the eigenvectors of this matrix.
  3. Eigenvectors corresponding to the largest eigenvalues are used to reconstruct a large fraction of the variance of the original data.
e The data instances are projected onto a lower dimensional space where the new
features best represent the entire data in the least squares sense.
• It can be shown that the optimal approximation, in the least square error sense, of a d-dimensional random vector x by a linear combination of k < d independent vectors is obtained by projecting the vector x onto the eigenvectors e_i corresponding to the largest eigenvalues λ_i of the covariance matrix (or the scatter matrix) of the data from which x is drawn.
e The eigenvectors of the covariance matrix of the data are referred to as principal
axes of the data, and the projection of the data instances on to these principal axes
are called the principal components. Dimensionality reduction is then obtained by
only retaining those axes (dimensions) that account for most of the variance, and
discarding all others.
• In Fig. 1.14.1, the principal axes are along the eigenvectors of the covariance matrix of the data. There are two principal axes shown in the figure; the first one is close to the origin, the other is far from the origin.
• If X = {x1, x2, ..., xn} is the set of n patterns of dimension d, the sample mean is

    m = (1/n) Σ (i = 1 to n) xi

• The sample covariance matrix is

    C = (1/n) Σ (i = 1 to n) (xi - m)(xi - m)^T
• C is a symmetric matrix. An orthogonal basis can be calculated by finding its eigenvalues and eigenvectors.
• The eigenvectors e_i and the corresponding eigenvalues λ_i are solutions of the equation

    C e_i = λ_i e_i ,   i = 1, ..., d

• The eigenvector corresponding to the largest eigenvalue gives the direction of the largest variance of the data. By ordering the eigenvectors according to the eigenvalues, the directions along which there is maximum variance can be found.

• If E is the matrix consisting of the eigenvectors as row vectors, we can transform the data X to get Y :

    Y = E (X - m)

• The original data X can be recovered from Y as follows :

    X = E^T Y + m

• Instead of using all d eigenvectors, the data can be represented by using only the first k eigenvectors, where k < d.
• If only the first k eigenvectors are used, represented by Ek, then

    Y = Ek (X - m)   and   X' = Ek^T Y + m
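• The steps above can be sketched directly with NumPy (the synthetic data below is hypothetical); this is only an illustration of the eigen-decomposition route, not an optimized implementation.

# PCA via the covariance matrix and its eigenvectors
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.0, 0.0],
                                          [0.5, 1.0, 0.0],
                                          [0.0, 0.2, 0.1]])
m = X.mean(axis=0)                             # sample mean
C = np.cov(X - m, rowvar=False)                # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)           # eigenvalues in ascending order

k = 2
E_k = eigvecs[:, ::-1][:, :k].T                # top-k eigenvectors as row vectors
Y = (X - m) @ E_k.T                            # projected data, Y = E_k (X - m)
X_rec = Y @ E_k + m                            # approximate reconstruction X'
print(Y.shape, X_rec.shape)                    # (200, 2) (200, 3)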

1.14.1 Non-Negative Matrix Factorization (NMF)


• Non-negative Matrix Factorization is a matrix factorization method where we constrain the matrices to be non-negative. In order to understand NMF, we should clarify the underlying intuition behind matrix factorization.
• Suppose we factorize a matrix X into two matrices W and H so that X ≈ W H. There is no guarantee that we can recover the original matrix exactly, so we will approximate it as best as we can.
• Now, suppose that X is composed of m rows x1, x2, ..., xm, W is composed of m rows w1, w2, ..., wm and H is composed of k rows h1, h2, ..., hk.
• Each row in X can be considered a data point. For instance, in the case of decomposing images, each row in X is a single image and each column represents some feature.
• Take the i-th row in X, xi. If you think about the equation, you will find that xi can be written as

    xi = Σ (j = 1 to k) w_ij h_j

• Basically, we can interpret xi to be a weighted sum of some components, where each row in H is a component and each row in W contains the weights of each component.

How Does it Work ?


• NMF decomposes multivariate data by creating a user-defined number of features. Each feature is a linear combination of the original attribute set; the coefficients of these linear combinations are non-negative.
• NMF decomposes a data matrix V into the product of two lower rank matrices W and H so that V is approximately equal to W times H.
• NMF uses an iterative procedure to modify the initial values of W and H so that the product approaches V. The procedure terminates when the approximation error converges or the specified number of iterations is reached.
• During model apply, an NMF model maps the original data into the new set of attributes (features) discovered by the model.
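• A minimal sketch of NMF using scikit-learn's NMF class (assumed available) on a hypothetical non-negative data matrix is shown below; the number of components is a user choice.

# Approximate a non-negative matrix V as the product W H
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
V = rng.random((20, 10))                       # hypothetical non-negative data matrix

model = NMF(n_components=4, init='random', random_state=0, max_iter=500)
W = model.fit_transform(V)                     # per-row weights of the components
H = model.components_                          # the components (features)
print(np.linalg.norm(V - W @ H))               # reconstruction error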

1.14.2 Difference between PCA and NMF

Sr. No. | PCA | NMF
1. | It uses unsupervised dimensionality reduction. | It also uses unsupervised dimensionality reduction.
2. | Orthogonal vectors with positive and negative coefficients. | Non-negative coefficients.
3. | Difficult to interpret. | Easier to interpret.
4. | PCA is non-iterative. | NMF is iterative.
5. | Designed for producing optimal basis images. | Designed for producing coefficients with a specific property.

1.14.3 Sparse PCA

• In sparse PCA one wants to get a small number of features which still capture most of the variance. Thus one needs to enforce sparsity of the PCA components, which yields a trade-off between explained variance and sparsity.
• To address the non-sparsity issue of traditional PCA, sparse PCA imposes an additional constraint on the number of non-zero elements in the vector v.
• This is achieved through the l0 norm, which gives the number of non-zero elements in the vector v. A sparse PCA with at most k non-zero loadings can then be formulated as an optimization problem.
• Optimization problems with an l0 norm constraint are in general NP-hard. Therefore, most methods for sparse PCA relax the l0 norm constraint with an l1 norm penalty appended to the objective function.


1.14.4 Kernel PCA


¢ Kernel PCA is the nonlinear form of PCA, which better exploits the complicated
spatial structure of high-dimensional features.
e It can extract up to n (number of samples) nonlinear principal components without
expensive computations.
e The standard steps of kernel PCA dimensionality reduction can be summarized
as:
1. Construct the kernel matrix K from the training data set.
2. Compute the gram matrix.
3. Solve N-dimensional column vector.

4. Compute the kernel principal components.


e Kernel PCA supports both transform and inverse_transform.
e Fig 1.14.2 (a), (b) shows PCA and KPCA.

Fig. 1.14.2 (a) PCA  (b) KPCA

Preliminaries :
# Load libraries
from sklearn.decomposition import PCA, KernelPCA
from sklearn.datasets import make_circles

Create linearly inseparable data :

# Create linearly inseparable data
X, _ = make_circles(n_samples=1000, random_state=1, noise=0.1, factor=0.1)

Conduct Kernel PCA :

# Apply kernel PCA with a radial basis function (RBF) kernel
kpca = KernelPCA(kernel='rbf', gamma=15, n_components=1)
X_kpca = kpca.fit_transform(X)

Review Question

1. What is Principal Component Analysis (PCA) and when is it used ?

1.15 LDA

• Fisher Linear Discriminant Analysis is also called Linear Discriminant Analysis (LDA). LDA is closely related to PCA, for both of them are based on linear, i.e. matrix multiplication, transformations.
• In LDA, the transformation is based on maximizing the ratio of "between-class variance" to "within-class variance", with the goal of reducing data variation in the same class and increasing the separation between classes.
• First applied in 1935 by M. Barnard at the suggestion of R. A. Fisher (1936), Fisher linear discriminant analysis :
  1. finds linear combinations of the gene expression profiles X = X1, ..., Xp with large ratios of between-groups to within-groups sums of squares - the discriminant variables;
  2. predicts the class of an observation X by the class whose mean vector is closest to X in terms of the discriminant variables.
• Suppose we have two classes and d-dimensional samples x1, ..., xn, where
  a. n1 samples come from the first class.
  b. n2 samples come from the second class.
• Consider projection onto a line. Let the line direction be given by the unit vector v.

Fig. 1.15.1

• The scalar v^T xi is the distance of the projection of xi from the origin. Thus v^T xi is the projection of xi into a one dimensional subspace, and the projection of sample xi onto a line in direction v is given by v^T xi.
• How do we measure the separation between the projections of different classes ?
• Let μ̃1 and μ̃2 be the means of the projections of classes 1 and 2. Let μ1 and μ2 be the means of classes 1 and 2.

• |μ̃1 − μ̃2| seems like a good measure of separation :

    μ̃1 = (1/n1) Σ (xi ∈ C1) v^T xi = v^T ( (1/n1) Σ (xi ∈ C1) xi ) = v^T μ1

  Similarly, μ̃2 = v^T μ2.
• The problem with |μ̃1 − μ̃2| is that it does not consider the variance of the classes.
Fig. 1.15.2 Large variance
• We need to normalize |μ̃1 − μ̃2| by a factor which is proportional to the variance.
• Suppose we have samples z1, ..., zn. The sample mean is

    μz = (1/n) Σ (i = 1 to n) zi

• Define their scatter as

    s = Σ (i = 1 to n) (zi − μz)²

• Fisher solution : Normalize |μ̃1 − μ̃2| by the scatter. Let yi = v^T xi, i.e. the yi's are the projected samples.
• Scatter for the projected samples of class 1 is

    s̃1² = Σ (yi ∈ Class 1) (yi − μ̃1)²

• Scatter for the projected samples of class 2 is

    s̃2² = Σ (yi ∈ Class 2) (yi − μ̃2)²

• We need to normalize by both the scatter of class 1 and the scatter of class 2. Thus the Fisher linear discriminant is to project on the line in the direction v which maximizes

    J(v) = (μ̃1 − μ̃2)² / (s̃1² + s̃2²)

• Define the separate class scatter matrices S1 and S2 for classes 1 and 2.

Discriminant function :

¢ Discriminant function is a function of the pattern x that leads to a classification


rule. The form of the discriminant function is specified and is not imposed by the
underlying distribution.
• When g(x) is linear, the decision surface is a hyperplane :

    g(x) = w^T x + w0 = Σ (i = 1 to d) wi xi + w0

• For a 2-class case, we seek a weight vector w and threshold w0 such that

    w^T x + w0 > 0  ⇒  x ∈ ω1
    w^T x + w0 < 0  ⇒  x ∈ ω2

• If x1 and x2 are both on the decision surface, then

    w^T x1 + w0 = w^T x2 + w0
    w^T (x1 − x2) = 0

  i.e. the weight vector is normal to vectors lying in the hyperplane.
• We can write x = xp + r w / ||w||, where xp is the projection of x onto the hyperplane. Then

    g(x) = w^T x + w0 = w^T (xp + r w / ||w||) + w0 = r ||w||

  or, r = g(x) / ||w||
e The value of the discriminant function for a pattern x is a measure of its distance

from the hyperplane. A pattern classifier using linear discriminant functions is


called a linear machine.
• The decision boundaries are assumed to be linear. The discriminant function divides the feature space by a hyperplane whose orientation is determined by the weight vector w and whose distance from the origin is determined by the threshold w0.
• Different optimization schemes lead to different methods, such as the perceptron, Fisher's linear discriminant function and support vector machines. Linear combinations of nonlinear functions serve as a stepping stone to nonlinear models.

• Limitations of LDA :
  a) LDA is a parametric method since it assumes unimodal Gaussian likelihoods.
  b) LDA will fail when the discriminatory information is not in the mean but rather in the variance of the data.
  c) LDA produces at most C-1 feature projections.
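• A minimal sketch of LDA with scikit-learn (assumed available), used both as a classifier and as a supervised projection to at most C - 1 dimensions, is given below on a standard 3-class data set.

# LDA as classifier and supervised dimensionality reducer
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)                       # 3 classes, 4 features
lda = LinearDiscriminantAnalysis(n_components=2)        # at most C - 1 = 2 projections
X_proj = lda.fit_transform(X, y)

print(X_proj.shape)                                     # (150, 2)
print(lda.score(X, y))                                  # training accuracy of the classifier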

1.15.1 Difference between PCA and LDA

Sr. No. | PCA | LDA
1. | Performs dimensionality reduction while preserving as much of the variance in the high dimensional space as possible. | Performs dimensionality reduction while preserving as much of the class discriminatory information as possible.
2. | The transformation is based on minimizing the mean square error between the original data vectors and the data vectors that can be estimated from the reduced dimensionality data vectors. | The transformation is based on maximizing the ratio of "between-class variance" to "within-class variance", with the goal of reducing data variation in the same class and increasing the separation between classes.
3. | Higher variance. | Smaller variance.
4. | Bad for discriminability. | Good discriminability.
5. | PCA is a technique that finds the directions of maximal variance. | LDA attempts to find a feature subspace that maximizes class separability.
6. | PCA is an unsupervised algorithm. | LDA is a supervised algorithm.
7. | May give a bad projection for separating the classes. | A good projection separates the classes well.

1.16 Application of Machine Learning


• Examples of successful applications of machine learning are as follows :
  1. Optical character recognition : Categorize images of handwritten characters by the letters represented.
  2. Face detection : Find faces in images (or indicate if a face is present).
  3. Spam filtering : Identify email messages as spam or non-spam. Topic spotting : Categorize news articles (say) as to whether they are about politics, sports, entertainment, etc.
  4. Spoken language understanding : Within the context of a limited domain, determine the meaning of something uttered by a speaker to the extent that it can be classified into one of a fixed set of categories.

1.16.1 Face Recognition and Medical Diagnosis


Face recognition :
• The face recognition task is done effortlessly every day; we recognize our friends, relatives and family members. We also recognize them by looking at photographs, in which they appear in different poses, hair styles, background light, and with and without makeup.
• We do it subconsciously and cannot explain how we do it. Because we cannot explain how we do it, we cannot write an algorithm for it by hand.
• A face has some structure. It is not a random collection of pixels; it is a symmetric structure and contains predefined components like the nose, mouth, eyes and ears. Every person's face is a pattern composed of a particular combination of these features. By analyzing sample face images of a person, a learning program captures the pattern specific to that person and uses it to recognize whether a new real face or new image belongs to this specific person or not.
• A machine learning algorithm creates an optimized model of the concept being learned based on data or past experience.
• In the case of face recognition, the input is an image, the classes are the people to be recognized and the learning program should learn to associate the face images with identities. This problem is more difficult than optical character recognition because there are more classes, the input image is larger, a face is 3D, and differences in pose and lighting cause significant changes in the image.
Medical diagnosis :
• In medical diagnosis, the inputs are the relevant information about the patient and the classes are the illnesses. The inputs contain the patient's age, gender, past medical history and current symptoms.
• Some tests may not have been applied to the patient, and thus these inputs would be missing. Tests take time, may be costly and may inconvenience the patient, so we do not want to apply them unless we believe that they will give us valuable information.
• In the case of a medical diagnosis, a wrong decision may lead to a wrong or no treatment, and in cases of doubt it is preferable that the classifier rejects and defers the decision to a human expert.

1.16.2 Google Home and Amazon Alexa

Amazon Alexa / Siri


e Every time Alexa or Siri make a mistake when responding to our request, it uses
the data it receives based on how it responded to the original query to improve
the next time. If an error was made, it takes that data and learns from it. If the
response was favourable, the system notes that as well.

• Data and machine learning are responsible for the explosive growth of digital voice assistants. They continue to get better with the more experiences they have and the data they accumulate.
• When a user makes a request of Alexa, the microphone on the device records the command. This recording is sent over the internet to the cloud. If the user is talking to Alexa, the recording is sent to Alexa Voice Services (AVS). This cloud-based service will review the recording and interpret the user's request. Then, the system will send a relevant response back to the device.
e Amazon breaks down user "orders" into individual sounds. It then consults a

database containing various words' pronunciations to find which words most

closely correspond to the combination of individual sounds.


e It then identifies important words to make sense of the tasks and carry out
corresponding functions. For instance, if Alexa notices words like "sport" of
"basketball", it would open the sports app.
2 ' . . ; :
e Amazon's servers send the information back to our device and Alexa may speak.
If Alexa needs to say anything back, it would go through the same process
described above, but in reverse order.

Google Home :
• Google services such as its image search and translation tools use sophisticated machine learning which allows computers to see, listen and speak in much the same way as humans do.
• To perform its functions, Google Assistant relies on Artificial Intelligence (AI) technologies such as natural language processing and machine learning to understand what the user is saying and to make suggestions or act on that language input.
• The Google Home can play music, but it's primarily designed as a vehicle for Google Assistant - Google's voice-activated virtual helper that's connected to the internet.
• The Google Home is always listening to its environment, but it won't record what we are saying or respond to our commands until we speak one of its pre-programmed wake words - either "OK, Google" or "Hey, Google".
• TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches for information retrieval, text mining and user modeling.
• The TF-IDF value increases proportionally with the number of times a word appears in the document, but it is often offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.
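• A minimal sketch of TF-IDF weighting with scikit-learn's TfidfVectorizer (a recent scikit-learn version is assumed; the toy documents are hypothetical) is shown below.

# TF-IDF weights for a few toy voice-command style documents
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["play the music", "play some jazz music", "turn on the lights"]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)                  # one row of weights per document

# Words shared by several documents get lower weights than rarer words
print(dict(zip(vectorizer.get_feature_names_out(),
               tfidf.toarray()[1].round(2))))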

1.16.3 Unmanned Vehicles

e An Unmanned Aerial Vehicle (UAV), sometimes known as a drone, is an aircraft


or airborne system that is controlled remotely by an onboard computer or a
human operator. The ground control station, aircraft components and various
types of sensors make up the UAV system.
e UAVs are categorized depending on their endurance, weight and altitude range.
They can be used for multiple commercial and military applications.
e Machine learning is the process of using, storing and finding patterns within
-massive amounts of data, which can eventually be fed into algorithms. It's
basically a process of using the data accumulated by the machine or device that
allows computers to develop their own algorithm so that humans won't have to
create challenging algorithms manually.
¢ Unmanned ground vehicles are classified into two broad types, remotely operated
and autonomous.
* Autonomous unmanned ground vehicles comprise several technologies that allow
the machine to be self - acting and self - regulating, sans human intervention. The


technology was initially developed to aid ground forces in the transfer of heavy equipment.
• However, the technology has witnessed significant evolution over the years, giving rise to more tactical vehicles designed to assist in surveillance or search-and-destroy missions.

• For example, for unmanned ships in the course of a voyage, the default route is straight line navigation on the premise that there is no obstruction.

Fig. 1.16.1 First time lane

Fig. 1.16.2 The second time

e During the course of the voyage, the hull is changed by the intensity and direction
of the waves and is unpredictable. It is clear for the unmanned boat itself.
Therefore, unmanned ships in the process of navigation, continue to train the
perception of the surrounding environment and make the appropriate strategy, if
the results of the implementation of the strategy in line with the default route to

be rewarded.

Review Question

1. Describe the role of machine learning in the following applications :


a) Google home or Alexa — b) Unmanned vehicles.

Regression

Syllabus
Introduction - Regression : Need of Regression, Difference between Regression and Correlation, Types of Regression - Univariate vs. Multivariate, Linear vs. Nonlinear, Simple Linear vs. Multiple Linear, Bias-Variance tradeoff, Overfitting and Underfitting.
Regression Techniques - Polynomial Regression, Stepwise Regression, Decision Tree Regression, Random Forest Regression, Support Vector Regression, Ridge Regression, Lasso Regression, ElasticNet Regression, Bayesian Linear Regression.
Evaluation Metrics : Mean Squared Error (MSE), Mean Absolute Error (MAE), Root Mean Squared
Error (RMSE), R-squared , Adjusted R-squared.

Contents
2.1. Introduction
2.2 Regression
2.3 Types of Regression
2.4 Overfitting and Underfitting
2.5 Regression Techniques : Polynomial Regression
. 2.6 Support Vector Regression
2.7 Ridge Regression
2.8 Lasso Regression
2.9 ElasticNet Regression
2.10 Bayesian Linear Regression
2.11 Evaluation Metrics

2.1 Introduction
• Regression analysis is a set of statistical methods used for the estimation of relationships between a dependent variable and one or more independent variables. It can be utilized to assess the strength of the relationship between variables and for modelling the future relationship between them.
• The two basic types of regression are linear regression and multiple linear regression.
• Regression analysis includes several variations, such as linear, multiple linear and nonlinear. The most common models are simple linear and multiple linear. Nonlinear regression analysis is commonly used for more complicated data sets in which the dependent and independent variables show a nonlinear relationship.

Review Question
———_

1. How is the performance of regression assessed ? Write the various performance metrics used for it.

2.2 Regression

e For an input x, if the output is continuous, this is called a regression problem. For
example, based on historical information of demand for tooth paste in your
supermarket, you are asked to predict the demand for the next month.
Regression is concerned with the prediction of continuous quantities. Linear
regression is the oldest and most widely used predictive model in the field of
machine learning. The goal is to minimize the sum of the squared errors to fit a
straight line to a set of data points.

    Yi = β0 + β1 Xi + εi    (β0 : population Y-intercept, β1 : population slope, εi : random error; Y dependent, X independent)

Fig. 2.2.1 Regression


• For regression tasks, the typical accuracy metrics are Root Mean Square Error (RMSE) and Mean Absolute Percentage Error (MAPE). These metrics measure the distance between the predicted numeric target and the actual numeric answer.

Regression Line
east squares : The least <> ae Acne? ,
° I ee The least squares regression line is the line that makes the sum of
squared residuals as small as possible. Linear means "straight line" ;
S str e”.
e Regression line is the line which gives the best estimate of one variable from the
value of any other given variable. .
e The regression line gives the average relationship between the two variables in
mathematical form.
For two variables X and Y, there are always two lines of regression.
Regression line of X on Y gives the best estimate for the value of X for any
specific given values of Y :
X = at+byY

where
a. = X - intercept

b = Slope of the line


X = Dependent variable
Y = Independent variable
e Regression line of Y on X: gives the best estimate for the value of Y for any
specific given values of X :
Y = a + bX

where
le a = Y - intercept
b = Slope of the-line
Y = Dependent variable
x = Independent variable
the vertical
e By using the least squares method (a procedure that minimizes
deviations of plotted points surrounding a straight line) we are able to construct a
best fitting straight line to the scatter diagram points and then formulate a
regression equation of the form :
Y = a + bX
Ŷ = Ȳ + b(X - X̄)


• Regression analysis is the art and science of fitting straight lines to patterns of data. In a linear regression model, the variable of interest (the "dependent" variable) is predicted from k other variables (the "independent" variables) using a linear equation. If Y denotes the dependent variable and X₁, ..., Xₖ are the independent variables, then the assumption is that the value of Y at time t in the data sample is determined by the linear equation :
Yₜ = β₀ + β₁X₁ₜ + β₂X₂ₜ + ... + βₖXₖₜ + εₜ
where the betas are constants and the epsilons are independent and identically distributed normal random variables with mean zero.

Fig. 2.2.2 Linear regression model with bias term, input vector x and output f(x)
e In a regression tree the idea is this : since the target variable does not have classes,
we fit a regression model to the target variable using each of the independent
variables. Then for each independent variable, the data is split at several split
points.
e At each split point, the "error" between the predicted value and the actual values
is squared to get a "Sum of Squared Errors (SSE)". The split point errors across the
variables are compared and the variable/point yielding the lowest SSE is chosen
as the root node/split point. This process is recursively continued.
e Error function measures how much our predictions deviate from the desired
answers. .
Mean-squared error : Jₙ = (1/n) Σᵢ₌₁ⁿ (yᵢ - f(xᵢ))²

e Multiple linear regression is an extension of linear regression, which allows a


response variable, y, to be modeled as a linear function of two or more predictor
variables

Evaluating a Regression Model


e Assume we want to predict a car's price using some features such as dimensions,
horsepower, engine specification, mileage etc. This is a typical regression problem,
where the target variable (price) is a continuous numeric value.
• We can fit a simple linear regression model that, given the feature values of a certain car, can predict the price of that car. This regression model can be used to score the same dataset we trained on. Once we have the predicted prices for all of the cars, we can evaluate the performance of the model by looking at how much the predictions deviate from the actual prices on average.


Advantages :
a. Training a linear regression model is usually much faster than methods such as neural networks.
b. By examining the magnitude and sign of the regression coefficients you can infer how the predictor variables affect the target outcome.

2.2.1 Need of Regression

• Regression is a technique for investigating the relationship between independent variables or features and a dependent variable or outcome. It is used as a method for predictive modelling in machine learning, in which an algorithm is used to predict continuous outcomes.
Regression models are used to predict a continuous value. Predicting prices of a
house given the features of house like size, price is one of the common examples
of Regression. It is a supervised technique.
• Regression analysis is a way to find trends in data. For example, we might guess that there is a connection between how much we eat and how much we weigh; regression analysis can help us quantify that relationship. Regression analysis will provide an equation for a graph so that we can make predictions about our data.
• Regression is essentially the "best guess" at utilizing a collection of data to generate some form of forecast. It is the process of fitting a set of points to a graph.

Briefly, the goal of regression model is to build a mathematical equation that


defines y as a function of the x variables. Next, this equation can be used to
predict the outcome (y) on the basis of new values of the predictor variables (x).
• Linear regression is the simplest and most popular technique for predicting a continuous variable. It assumes a linear relationship between the outcome and the predictor variables.
In some cases, the relationship between the outcome and the predictor variables is
not linear. In these situations, we need to build a non-linear regression, such as
polynomial and spline regression.
When we have multiple predictors in the regression model, we might want to
select the best combination of predictor variables to build an optimal predictive
model. This process called model selection, consists of comparing multiple models
containing different sets of predictors in order to select the best performing model

that minimizes the prediction error. Linear model selection approaches include best subsets regression and stepwise regression.

2.2.2 Difference between Regression and Correlation

Regression | Correlation
Regression tells us how to draw the straight line described by the correlation. | Correlation describes the strength of a linear relationship between two variables.
For regression, only the dependent variable Y must be random. | For correlation, both variables should be random variables.
Main goal is to use the measure of relation to predict values of the random variable based on values of the fixed variable. | Main goal is simply to find a number that expresses the relation between the variables.

2.3 Types of Regression

e The most common regression algorithms are,


a) Simple linear regression
b) Multiple linear regression
c) Polynomial regression
d) Multivariate adaptive regression splines
e) Logistic regression
f) Maximum likelihood estimation (Least squares)

2.3.1 Univariate vs. Multivariate

Sr. No. | Univariate | Multivariate
1. | Univariate analysis refers to the analysis of one variable. | Multivariate analysis refers to the analysis of more than one variable.
2. | It does not deal with causes and relationships. | It deals with causes and relationships.
3. | It does not contain any dependent variable. | It contains more than one dependent variable.
4. | Equation : Y = A + BX | Equation : Y = A + BX1 + CX2

2.3.2 Linear vs. Nonlinear


• Linear regression is an approach for modelling a dependent variable (y) using one or more explanatory variables (x).
• Equation : y = β₀ + β₁x + ε
• Nonlinear regression arises when the predictors and the response follow a particular nonlinear function form.
• Equation : y = f(β, x) + ε
• The nonlinear model is more flexible and accurate. Although both models can accommodate curvature, the nonlinear model is significantly more versatile in terms of the forms of the curves it can accept.

2.3.3 Simple Linear vs. Multiple Linear


Simple regression | Multiple regression
One dependent variable Y predicted from one independent variable X. | One dependent variable Y predicted from a set of independent variables (X1, X2, ..., Xk).
One regression coefficient. | One regression coefficient for each independent variable.
r² : Proportion of variation in dependent variable Y predictable from X. | R² : Proportion of variation in dependent variable Y predictable by the set of independent variables (X's).

2.4 Overfitting and Underfitting


e In addition to using models for prediction, the ability to interpret what a model
has learned is receiving an increasing amount of attention.
e Interpretability has to do with how accurate a machine learning model can
associate a cause to an effect. |

• If a model can take the inputs and routinely get the same outputs, the model is interpretable :
1. If you overeat at dinnertime and you always have trouble sleeping, the situation is interpretable.
2. If all 2019 polls showed an "ABC party" win and the "XYZ party" candidate took office, all those models showed low interpretability.
• Interpretability poses no issue in low-risk scenarios. If a model is recommending movies to watch, that can be a low-risk task.
• Fitness of a target function approximated by a learning algorithm determines how correctly it is able to classify a set of data it has never seen.
© Training error can be reduced by making the hypothesis more sensitive to training
data, but this may lead to overfitting and poor generalization.

• Overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship. Overfitting is when a classifier fits the training data too tightly. Such a classifier works well on the training data but poorly on independent test data. It is a general problem that plagues all machine learning methods.

• Underfitting : If we put too few variables in the model, leaving out variables that could help explain the response, we are underfitting. Consequences :
1. Fitted model is not good for prediction of new data - prediction is biased.
2. Regression coefficients are biased.
3. Estimate of error variance is too large.

• Because of overfitting, we see low error on training data and high error on test data. Overfitting occurs when a model begins to memorize training data rather than learning to generalize from the trend.
• The more difficult a criterion is to predict, the more noise exists in past information that needs to be ignored. The problem is determining which part to ignore.

Overfitting generally occurs when a model is excessively complex, such as having


too many parameters relative to the number of observations. We can determine
whether a predictive model is underfitting or overfitting the training data by
looking at the prediction error on the training data and the evaluation data.
Fig. 2.4.1 shows underfitting, a balanced fit and overfitting.

Fig. 2.4.1 Underfitting, balanced fit and overfitting

Reasons for overfitting


1, Noisy data
2. Training set is too small
3. Large number of features


• In machine learning, the more complex model is said to be in danger of overfitting, while the simpler model may underfit. Often several heuristics are developed in order to avoid overfitting; for example, when designing neural networks one may :
1. Limit the number of hidden nodes.
2. Stop training early to avoid a perfect explanation of the training set, and
3. Apply weight decay to limit the size of the weights, and thus of the function class implemented by the network.

2.4.1 Bias vs. Variance

e In the experimental practice we observe an important phenomenon called the bias


variance dilemma.
e In supervised learning, the class value assigned by the learning model built based
on the training data may differ from the actual class value. This error in learning
can be of two types, errors due to ‘bias’ and error due to ‘variance’.
• Fig. 2.4.2 shows the bias-variance trade off in terms of the four combinations of low/high bias and low/high variance.

Fig. 2.4.2 Bias-variance trade off

• Given two classes of hypotheses (e.g. linear models and k-NNs) to fit to some training data set, we observe that the more flexible hypothesis class has a low bias term but a higher variance term. If we have a parametric family of hypotheses, then we can increase the flexibility of the hypothesis but we still observe the increase of variance.

• The bias-variance dilemma is the problem of simultaneously minimizing two sources of error that prevent supervised learning algorithms from generalizing beyond their training set :
1. The bias is error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs.
2. The variance is error from sensitivity to small fluctuations in the training set. High variance can cause overfitting, modeling the random noise in the training data rather than the intended outputs.
• In order to reduce the model error, the designer can aim at reducing either the bias or the variance, as the noise component is irreducible.
• As the model increases in complexity, its bias is likely to diminish. However, as the number of training examples is kept fixed, the parametric identification of the model may strongly vary from one training set to another. This will increase the variance term.
e At one stage, the decrease in bias will be inferior to the increase in variance,
warning that the model should not be too complex. Conversely, to decrease the
variance term, the designer has to simplify its model so that it is less sensitive to a
specific training set. This simplification will lead to a higher bias.
Example 2.4.1 : Explain Fig. 2.4.3 (a), (b) and (c).

Fig. 2.4.3 (a), (b), (c) : Three fits to the same data (plotted against Size)

Solution
¢ Given Fig. 2.4.3 is related to overfitting and underfitting.
Underfitting (High bias and low variance) :
° A statistical model, or a machine learning algorithm is said
to have underfitting
when it cannot capture the underlying
trend of the data.
• It happens when we have less data to build an accurate model and also when we try to build a linear model with non-linear data.

Fig. 2.4.4 Polynomial fits of increasing degree against Size : high bias (underfit), balanced fit and high variance (overfit)
• In such cases the rules of the machine learning model are too easy and flexible to be applied on such minimal data and therefore the model will probably make a lot of wrong predictions.
• Underfitting can be avoided by using more data and also by reducing the features through feature selection.

Overfitting (High variance and low bias) :
• A statistical model is said to be overfitted when we train it with a lot of data. When a model gets trained with so much data, it starts learning from the noise and inaccurate data entries in our data set. Then the model does not categorize the data correctly, because of too many details and noise.
• The causes of overfitting are the non-parametric and non-linear methods, because these types of machine learning algorithms have more freedom in building the model based on the dataset and therefore they can really build unrealistic models.
• A solution to avoid overfitting is using a linear algorithm if we have linear data, or using parameters like the maximal depth if we are using decision trees. A short sketch illustrating the train/test error pattern follows.
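As an illustration (not from the original text), a minimal sketch of how over- and underfitting can be detected by comparing training and test error, assuming scikit-learn and NumPy are available; the data and polynomial degrees are arbitrary :

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Arbitrary non-linear sample data
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 3, size=(60, 1)), axis=0)
y = np.sin(2 * X[:, 0]) + rng.normal(scale=0.2, size=60)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A very low degree tends to underfit, a very high degree tends to overfit
for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(degree,
          "train MSE:", round(mean_squared_error(y_tr, model.predict(X_tr)), 3),
          "test MSE:", round(mean_squared_error(y_te, model.predict(X_te)), 3))
```

Underfitting shows up as high error on both sets, while overfitting shows low training error but high test error.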
2.4.2 Difference between Overfitting and Underfitting

Sr. No. | Parameter | Overfitting | Underfitting
1. | Complexity | Model is too complex. | Model is too simple.
2. | Reason | Low bias, high variance. | High bias, low variance.
3. | Quantity of features | Smaller quantity of features. | A larger quantity of features.
4. | Regularization | More regularization. | Less regularization.

Review Questions

1. Explain the term bias-variance dilemma.
2. What is overfitting and underfitting ? What are the catalysts of overfitting ?
3. Elaborate bias-variance dilemma.
4. Explain with example K-fold cross validation.
5. Refer to Fig. 2.4.5, which gives counts of fast learners and slow learners answering apply-level and recall-level questions (fast learners form the +ve class) :
   i) Find the contingency table
   ii) Find recall
   iii) Precision
   iv) Negative recall
   v) False positive rate
6. Difference between overfitting and underfitting.

Fig. 2.4.5

2.5 Regression Techniques : Polynomial Regression

• Polynomial regression is a regression algorithm that models the relationship between a dependent variable (y) and an independent variable (x) as an nth degree polynomial.
• The polynomial regression equation is given below :
y = b₀ + b₁x + b₂x² + b₃x³ + ... + bₙxⁿ
• Polynomial regression is needed when there is no linear correlation fitting all the variables. So instead of looking like a line, the fitted curve looks like a nonlinear function. Fig. 2.5.1 shows polynomial regression.

Fig. 2.5.1 Polynomial regression (y = b₀ + b₁x + b₂x²)

• If the datasets are arranged in a non-linear fashion, then we should use the polynomial regression model instead of simple linear regression.


;
peatmng £-13
acl
ve
! Roqgrossion

regression model instead of Simpl


Oprroypes
ogre
‘ t’
Linea
very powelrorful lo handle honlinearity NON, Polynomial models are
/ yen adtise .
Funelions withi ANY piv CN polynomials can approximate
cont ‘auous within precision,
ahoare are WO Ss ‘ Ls
Phere are landard: procedures for build)ding
forward selection : Successively fit | )
4 polynomial model :
1. mode} until; the t test
est+ orde r term js Aeeunlenia
S (
rf increasing, order
of ‘rand,

for
7 .
the ’
high
i 5

SIAMficant,
2. Backward elimination : Appropriately fit th highest order model and then
delete terms one at a time starting13 with
wi =
vy the: highest order, until the highest
sjenifi
order remaining© term hashas aa sign ificant { slalistic

Advantages :
• We can model non-linear relationships between variables.
• There is a large range of different functions that you can use for fitting.
• Good for exploration purposes : You can test for the presence of curvature and its inflections.
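A minimal polynomial regression sketch (not from the original text), assuming a NumPy/scikit-learn environment; the sample data and the degree are arbitrary :

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Arbitrary sample data roughly following a quadratic trend
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, size=(30, 1)), axis=0)
y = 1.5 * X[:, 0] ** 2 - 2.0 * X[:, 0] + rng.normal(scale=0.5, size=30)

# Degree-2 polynomial regression: expand features, then fit ordinary least squares
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)

print("Training R^2:", model.score(X, y))
print("Prediction at x = 1.0:", model.predict([[1.0]]))
```

The pipeline simply adds the powers of x as extra columns, so the model remains linear in its coefficients.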

2.5.1 Stepwise Regression


Stepwise regression is a step-by-step process of constructing a model by
T-tests
introducing or eliminating predictor variables. First, the variables undergo
to fit a linear
and F-tests. Then, predictor variables are individually tested
regression model.
by one. This is particularly
Stepwise regression drops. insignificant variables one
ables.
useful if we have many potential explanatory vari
regression model to introduce only
« Stepwise regression is used to design a
s. Other variables are discarded.
relevant and_ statistically significant variable
anted variables. These
However, every regression calculation contains unw
e the proces s unnecessarily.
variables are predictive and complicat
step wise regr essi on cons ists of iter atively adding and removing. predictors, in
* The
fi nd the subset of variables in the data set
the predictive model, in order to
that is a model that lowers prediction model,
resulting in the best performing
error.

• There are three strategies of stepwise regression (a small sketch follows this list) :
1. Forward selection, which starts with no predictors in the model, iteratively adds the most contributive predictors and stops when the improvement is no longer statistically significant.
2. Backward selection, which starts with all predictors in the model (full model), iteratively removes the least contributive predictors and stops when you have a model where all predictors are statistically significant.
3. Stepwise selection, which is a combination of forward and backward selections. We start with no predictors, then sequentially add the most contributive predictors. After adding each new variable, we remove any variables that no longer provide an improvement in the model fit.
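A minimal forward-selection sketch (not from the original text), assuming scikit-learn is available; SequentialFeatureSelector greedily adds features the way forward selection does, though scikit-learn scores candidates by cross-validated model performance rather than t-tests, and the dataset here is synthetic :

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data: 10 candidate predictors, only a few are informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)

# Forward selection: start with no predictors, greedily add the most contributive ones
selector = SequentialFeatureSelector(LinearRegression(),
                                     n_features_to_select=3,
                                     direction="forward")
selector.fit(X, y)
print("Selected feature indices:", np.flatnonzero(selector.get_support()))
```

Setting direction="backward" would instead start from the full model, mirroring backward selection.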

2.5.2 Decision Tree Regression

• Decision tree is a decision-making tool that uses a flowchart-like tree structure; it is a model of decisions and all of their possible results, including outcomes, input costs and utility.

e Decision-tree algorithm falls under the category of supervised learning algorithms.


It works for both continuous as well as categorical output variables.
e Decision tree regression observes features of an object and trains a model in the

structure of a tree to predict data in the future to produce meaningful continuous


output. Continuous output means that the output/result is not discrete, i.e., it is
not represented just by a discrete, known set of numbers or values.

Discrete output example : A weather prediction model that predicts whether or


not there'll be rain in a particular day.
• Continuous output example : A profit prediction model that states the probable profit that can be generated from the sale of a product.
a decision tree regression
Here, continuous values are predicted with the help of
model.
• A decision tree is able to make a prediction by running through the entire tree, asking true/false questions, until it reaches a leaf node. The final prediction is given by the average of the value of the dependent variable in that leaf node.
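A minimal decision tree regression sketch (not from the original text), assuming scikit-learn; the data and max_depth are arbitrary :

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Arbitrary one-dimensional data with a non-linear trend
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

# max_depth limits how finely the tree partitions the input space
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(X, y)

# Each prediction is the mean target value of the training points in the reached leaf
print(tree.predict([[2.5]]))
```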

2.5.3 Random Forest Regression

• Random forest is a famous machine learning algorithm that belongs to the supervised learning method. It may be used for both classification and regression problems in ML. It is based on the concept of ensemble learning, which is a process of combining multiple classifiers to solve a complex problem and to enhance the overall performance of the model.
• As the name indicates, random forest is a classifier that combines a number of decision trees built on diverse subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset. Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority votes of predictions, it predicts the final output.
• A greater number of trees in the forest results in better accuracy and prevents the problem of overfitting.

How does the Random Forest Algorithm Work ?


• Random forest works in two phases : first, create the random forest by combining N decision trees; second, make predictions for each tree created in the first phase.


• The working technique may be explained with the following steps and diagram :
Step 1 : Select random K data points from the training set.
Step 2 : Build the decision trees associated with the selected data points (subsets).
Step 3 : Choose the number N of decision trees which we want to build.
Step 4 : Repeat steps 1 and 2.
Step 5 : For new data points, find the predictions of each decision tree and assign the new data points to the category that wins the majority of votes.
• The working of the algorithm may be better understood by the following example :
• Example : Suppose there is a dataset that includes multiple fruit images. This dataset is given to the random forest classifier. The dataset is divided into subsets and given to every decision tree. During the training phase, each decision tree produces a prediction result, and when a new data point occurs, then based on the majority of results, the random forest classifier predicts the final decision. Consider the picture in Fig. 2.5.2.
Fig. 2.5.2 Example of random forest

Applications of Random Forest


• There are mainly four sectors where random forest is normally used :
1. Banking : The banking sector in general uses this algorithm for the identification of loan risk.
2. Medicine : With the assistance of this algorithm, disease trends and risks of the disease may be identified.
3. Land use : We can identify the areas of comparable land use with the aid of this algorithm.
4. Marketing : Marketing tendencies can be recognized by the usage of this algorithm.
Advantages of Random Forest


• Random forest is capable of performing both classification and regression tasks.
• It is capable of managing large datasets with high dimensionality.
• It enhances the accuracy of the model and prevents the overfitting problem.


Disadvantages of Random Forest
• Although random forest can be used for both classification and regression tasks, it is not more appropriate for regression tasks.
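A minimal random forest regression sketch (not from the original text), assuming scikit-learn; the synthetic dataset and hyperparameters are arbitrary :

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data; in practice this would be your own feature matrix
X, y = make_regression(n_samples=500, n_features=8, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators is the number N of decision trees whose predictions are averaged
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print("Test R^2:", forest.score(X_test, y_test))
```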

2.6 Support Vector Regression

• Support Vector Machines (SVMs) are a set of supervised learning methods which learn from the dataset and are used for classification.
• SVM is a classifier derived from statistical learning theory by Vapnik and Chervonenkis.
• An SVM is a kind of large-margin classifier : it is a vector space based machine learning method where the goal is to find a decision boundary between two classes that is maximally far from any point in the training data.

Fig. 2.6.1 Two class problem

• Given a set of training examples, each marked as belonging to one of two classes, an SVM algorithm builds a model that predicts whether a new example falls into one class or the other. Simply speaking, we can think of an SVM model as representing the examples as points in space, mapped so that the examples of the separate classes are divided by a gap that is as wide as possible.
• New examples are then mapped into the same space and classified to belong to a class based on which side of the gap they fall on.

Fig. 2.6.2 Bad decision boundary of SVM

Problems
• Two classes : Many decision boundaries can separate these two classes. Which one should we choose ?


¢ Perceptron learning rule can be used to find any decision boundary between
class 1 and class 2.
¢ The line that maximizes the minimum margin is a good bet. The model class of
“hyper-planes with a margin of m" has a low VC dimension if m is big.
e This maximum-margin separator is determined by a subset of the data points,
Data points in this subset are called "support vectors". It will be useful |
computationally if only a small fraction of the data points are support vectors, |
because we use the support vectors to decide which side of the separator a test
case is on.
Example of Bad Decision Boundaries
• SVMs are primarily two-class classifiers with the distinct characteristic that they aim to find the optimal hyperplane such that the expected generalization error is minimized. Instead of directly minimizing the empirical risk calculated from the training data, SVMs perform structural risk minimization to achieve good generalization.
• The empirical risk is the average loss of an estimator for a finite set of data drawn from P. The idea of risk minimization is not only to measure the performance of an estimator by its risk, but to actually search for the estimator that minimizes risk over distribution P. Because we don't know distribution P, we instead minimize empirical risk over a training dataset drawn from P. This general learning technique is called empirical risk minimization.
• Fig. 2.6.3 shows how the expected risk, the empirical risk and the confidence term vary with the complexity of the function set.

Fig. 2.6.3 Empirical risk

Good Decision Boundary : Margin Should Be Large


• The decision boundary should be as far away from the data of both classes as possible. If data points lie very close to the boundary, the classifier may be consistent but is more "likely" to make errors on new instances from the distribution. Hence, we prefer classifiers that maximize the minimal distance of data points to the separator.
1. Margin (m) : The gap between the data points and the classifier boundary. The margin is the minimum distance of any sample to the decision boundary. If the hyperplane is in the canonical form, the margin can be measured by the length of the weight vector. The margin is given by the projection of the distance between two support vectors on the direction perpendicular to the hyperplane. The margin of the separator is the distance between support vectors :
Margin (m) = 2 / ||w||

Fig. 2.6.4 Good decision boundary

2, Maximal margin classifier : a classifier in the family F that maximizes the margin.
Maximizing the margin is good according to intuition and PAC theory. Implies
that only support vectors matter; other training examples are ignorable.
Example 2.6.1 : For the following figure, find a linear hyperplane (decision boundary) that will separate the data.

Fig. 2.6.5
• The approach is as follows (see Fig. 2.6.6) :
1. Define what an optimal hyperplane is : maximize the margin.
2. Extend the above definition for non-linearly separable problems : have a penalty term for misclassifications.
3. Map data to a high dimensional space where it is easier to classify with linear decision surfaces : reformulate the problem so that data is mapped implicitly to this space.
• Many hyperplanes (e.g. B1 and B2) are possible solutions. Which one is better, B1 or B2 ? The hyperplane that maximizes the margin is better, so B1 is better than B2.

Fig. 2.6.6

2.6.1 Key Properties of Support Vector Machines

1. Use a single hyperplane which subdivides the space into two half-spaces, one of which is occupied by Class 1 and the other by Class 2.
2. They maximize the margin of the decision boundary using quadratic optimization techniques which find the optimal hyperplane.
3. Ability to handle large feature spaces.
4. Overfitting can be controlled by the soft margin approach.
5. When used in practice, SVM approaches frequently map the examples to a higher dimensional space and find margin maximal hyperplanes in the mapped space, obtaining decision boundaries which are not hyperplanes in the original space.
6. The most popular versions of SVMs use non-linear kernel functions and map the attribute space into a higher dimensional space to facilitate finding "good" linear decision boundaries in the modified space.

2.6.2 SVM Applications

• SVM has been used successfully in many real-world problems :
1. Text (and hypertext) categorization
2. Image classification
3. Bioinformatics (protein classification, cancer classification)
4. Hand-written character recognition
5. Determination of SPAM email.

2.6.3 Limitations of SVM


1. It is sensitive to noise.
lies in the choice of the kernel.
2. The biggest limitation of SVM
3. Another limitation is speed and size.
4. The optimal design for multiclass SVM classifiers is also a research area.

EXZ" Soft Margin SVM


¢ For the very high dimensional problems common in text classification, sometimes
the data are linearly separable. But in the general case they are not, and even if
they are, we might prefer a solution that better separates the bulk of the data
while ignoring a few weird noise documents. -
• What if the training set is not linearly separable ? Slack variables can be added to allow misclassification of difficult or noisy examples, resulting in a margin called soft.
• A soft-margin SVM allows a few variables to cross into the margin or over the hyperplane, allowing misclassification.
• We penalize the crossover by looking at the number and distance of the misclassifications. This is a trade-off between the hyperplane violations and the margin size. The slack variables are bounded by some set cost. The farther they are from the soft margin, the less influence they have on the prediction.
• All observations have an associated slack variable :
1. Slack variable = 0 : the point is on the margin.
2. Slack variable > 0 : the point is in the margin or on the wrong side of the hyperplane.
3. C is the tradeoff between the slack variable penalty and the margin.
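Although the discussion above focuses on SVM classification, the same large-margin idea with an ε-insensitive loss gives Support Vector Regression. A minimal sketch (not from the original text) using scikit-learn's SVR with synthetic data and arbitrary hyperparameters :

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic one-dimensional data with a smooth non-linear trend
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(100, 1), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=100)

# RBF-kernel SVR: C trades off flatness against violations of the epsilon-tube,
# epsilon is the width of the tube within which errors are not penalized
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1)
svr.fit(X, y)

print("Training R^2:", svr.score(X, y))
print("Prediction at x = 2.0:", svr.predict([[2.0]]))
```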

2.6.5 Comparison of SVM and Neural Networks

Support Vector Machine | Neural Network
Kernel maps to a very-high dimensional space. | Hidden layers map to lower dimensional spaces.
Search space has a unique minimum. | Search space has multiple local minima.
Training is extremely efficient. | Training is expensive.
Classification extremely efficient. | Classification extremely efficient.
Very good accuracy in typical domains. | Very good accuracy in typical domains.
Kernel and cost are the two parameters to select. | Requires number of hidden units and layers.

Example 2.6.2 : From the following diagram, identify which data points (1, 2, 3, 4, 5) are support vectors (if any), slack variables on the correct side of the classifier (if any) and slack variables on the wrong side of the classifier (if any). Mention which point will have maximum penalty and why ?

Fig. 2.6.7

Solution :
• Data points 1 and 5 will have maximum penalty.
• Margin (m) is the gap between the data points and the classifier boundary. The margin is the minimum distance of any sample to the decision boundary. If the hyperplane is in the canonical form, the margin can be measured by the length of the weight vector.
• Maximal margin classifier : a classifier in the family F that maximizes the margin. Maximizing the margin is good according to intuition and PAC theory. It implies that only support vectors matter; other training examples are ignorable.
• If a training set is not linearly separable, slack variables can be added to allow misclassification of difficult or noisy examples, resulting in a margin called soft. A soft-margin SVM allows a few variables to cross into the margin or over the hyperplane, allowing misclassification.
• We penalize the crossover by looking at the number and distance of the misclassifications. This is a trade-off between the hyperplane violations and the margin size. The slack variables are bounded by some set cost. The farther they are from the soft margin, the less influence they have on the prediction.
• All observations have an associated slack variable :
1. Slack variable = 0 : the point is on the margin.
2. Slack variable > 0 : the point is in the margin or on the wrong side of the hyperplane.
3. C is the tradeoff between the slack variable penalty and the margin.
Review Questions

1. What are support vectors and margins ? Also explain soft margin SVM.
2. What is slack variable ? Discuss margin errors.
3. Explain support vector machine.
4. What are the support vectors and margins ? Explain soft SVM and hard SVM.

2.7 Ridge Regression

• Ridge regression is one of the most robust versions of linear regression, in which a small quantity of bias is introduced so that we get better long-term predictions.
• The quantity of bias added to the model is called the ridge regression penalty. We can compute this penalty term by multiplying lambda with the squared weight of each individual feature.
• The equation for ridge regression is :
L(x, y) = min( Σᵢ₌₁ⁿ (yᵢ - wᵢxᵢ)² + λ Σᵢ₌₁ᵐ wᵢ² )
• A standard linear or polynomial regression will fail if there is high collinearity among the independent variables; to solve such problems, ridge regression can be used.
• Ridge regression is a regularization method, which is used to reduce the complexity of the model. It is also referred to as L2 regularization.
• It helps to solve problems when we have more parameters than samples.

2.7.1 Scikit-Learn Code for Ridge Regression

from sklearn.linear_model import Ridge
import numpy as np

# Random demo data : 24 samples, 19 features
n_samples, n_features = 24, 19
rng = np.random.RandomState(0)
y = rng.randn(n_samples)
X = rng.randn(n_samples, n_features)

# Fit ridge regression with regularization strength alpha = 0.5
rdg = Ridge(alpha=0.5)
rdg.fit(X, y)
print(rdg.score(X, y))
2.8 Lasso Regression

• Lasso regression is another regularization technique to reduce the complexity of the model.
• It is similar to ridge regression except that the penalty term includes only the absolute weights instead of the square of the weights.
• Since it takes absolute values, it is able to shrink the slope to 0, while ridge regression can only shrink it close to zero.

• It is likewise known as L1 regularization. The equation for Lasso regression is :
L(x, y) = min( Σᵢ₌₁ⁿ (yᵢ - wᵢxᵢ)² + λ Σᵢ₌₁ᵐ |wᵢ| )
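A minimal sketch (not from the original text), analogous to the ridge example above, using scikit-learn's Lasso; the data and alpha are arbitrary, and the point of interest is that some coefficients are driven exactly to zero :

```python
import numpy as np
from sklearn.linear_model import Lasso

# Random demo data in which only the first two features truly matter
rng = np.random.RandomState(0)
X = rng.randn(50, 10)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.randn(50) * 0.5

# The L1 penalty shrinks some coefficients exactly to zero (feature selection effect)
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
print("Coefficients:", np.round(lasso.coef_, 3))
```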

2.9 ElasticNet Regression

• Elastic net is a combination of the two most popular regularized variants of linear regression : Ridge and Lasso. Ridge utilizes an L2 penalty and Lasso uses an L1 penalty.
• This allows it to balance between feature selection and feature preservation, and to deal with situations where lasso and ridge regression may fail.
• For example, when there are more features than observations, lasso regression may select only one feature from a group of correlated features, while ridge regression may keep them all. Elastic net regression can select a subset of correlated features and avoid the instability of lasso regression.

Advantages :
1. It can reduce model complexity by eliminating irrelevant features, which is more effective than ridge regression.
2. Elastic net regression can achieve a better trade-off between bias and variance than lasso and ridge regression by tuning the regularization parameters.
3. This type of regression can be applied to various types of models, such as linear, logistic or Cox regression models.

Disadvantages :
1. It requires more computational resources and time due to two regularization parameters and a cross-validation process.
2. It may not be easily interpretable, as it could select a large number of features with small coefficients or a small number of features with large coefficients.
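A minimal sketch (not from the original text), assuming scikit-learn; l1_ratio mixes the L1 and L2 penalties (1.0 would be pure lasso, 0.0 pure ridge), and the values here are arbitrary :

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Random demo data with a pair of strongly correlated features
rng = np.random.RandomState(0)
X = rng.randn(100, 8)
X[:, 1] = X[:, 0] + 0.05 * rng.randn(100)   # feature 1 correlated with feature 0
y = 2.0 * X[:, 0] + rng.randn(100) * 0.3

# alpha is the overall penalty strength, l1_ratio the L1/L2 mix
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
enet.fit(X, y)
print("Coefficients:", np.round(enet.coef_, 3))
```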

2.10 Bayesian Linear Regression


• Bayesian linear regression allows a useful mechanism to deal with insufficient data, or poorly distributed data. It allows the user to put a prior on the coefficients and on the noise so that in the absence of data, the priors can take over. A prior is a distribution on a parameter.
• If we could flip a coin an infinite number of times, inferring its bias would be easy by the law of large numbers. However, what if we could only flip the coin a handful of times ? Would we guess that a coin is biased if we saw three heads in three flips, an event that happens one out of eight times with unbiased coins ? MLE would overfit these data, inferring a coin bias of p = 1.
• A Bayesian approach avoids overfitting by quantifying our prior knowledge that most coins are unbiased, so that the prior on the bias parameter is peaked around one-half. The data must overwhelm this prior belief about coins.
• Bayesian methods allow us to estimate model parameters, to construct model forecasts and to conduct model comparisons. Bayesian learning algorithms calculate explicit probabilities for hypotheses.

• Bayesian classifiers use a simple idea : the training data are utilized to calculate an observed probability of each class based on feature values.
• When a Bayesian classifier is used for unclassified data, it uses the observed probabilities to predict the most likely class for the new features.
• Each observed training example can incrementally decrease or increase the estimated probability that a hypothesis is correct.
• Prior knowledge can be combined with observed data to determine the final probability of a hypothesis. In Bayesian learning, prior knowledge is provided by asserting a prior probability for each candidate hypothesis and a probability distribution over observed data for each possible hypothesis.
• Bayesian methods can accommodate hypotheses that make probabilistic predictions. New instances can be classified by combining the predictions of multiple hypotheses, weighted by their probabilities.
• Even in cases where Bayesian methods prove computationally intractable, they can provide a standard of optimal decision making against which other practical methods can be measured.
• Uses of Bayesian classifiers are as follows :
1. Used in text-based classification for finding spam or junk mail filtering.
2. Medical diagnosis.
3. Network security such as detecting illegal intrusion.
• The basic procedure for implementing Bayesian linear regression is :
i) Specify priors for the model parameters.
ii) Create a model mapping the training inputs to the training outputs.
iii) Have a Markov Chain Monte Carlo (MCMC) algorithm draw samples from the posterior distributions for the parameters.
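As a hedged illustration (not from the original text), scikit-learn's BayesianRidge gives a Bayesian linear regression with Gaussian priors on the coefficients; note it estimates the posterior by evidence maximization rather than the MCMC sampling described above, and the data here are synthetic :

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

# Small synthetic dataset: few observations, so the prior matters
rng = np.random.RandomState(0)
X = rng.randn(15, 3)
y = 1.0 * X[:, 0] - 0.5 * X[:, 2] + rng.randn(15) * 0.2

model = BayesianRidge()
model.fit(X, y)

# return_std=True gives a predictive standard deviation alongside the mean
mean, std = model.predict(X[:2], return_std=True)
print("Posterior mean coefficients:", np.round(model.coef_, 3))
print("Predictive mean:", mean, "predictive std:", std)
```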
2.11 Evaluation Metrics
• Mean Squared Error (MSE) and Mean Absolute Error (MAE) are used to evaluate the regression problem's accuracy.

2.11.1 Mean Squared Error

Mean Squared Error (MSE) is calculated by taking the average of the square of the
also be
difference between the original and predicted values of the data. It can
called the quadratic cost function or sum of squared errors.
is always positive or greater than zero. A value close to zero
The value of MSE
of the estimator/predictor. An MSE of zero (0)
will represent better quality
represents the fact that the predictor is a perfect predictor.
MSE = (1/N) Σᵢ₌₁ᴺ (Actual valueᵢ - Predicted valueᵢ)²
• Here N is the total number of observations/rows in the dataset. The sigma symbol denotes that the difference between the actual and predicted values is taken and squared for every i ranging from 1 to N.

Fig. 2.11.1 Representation of MSE

• Mean squared error is the most commonly used loss function for regression. MSE is sensitive towards outliers and, given several examples with the same input feature values, the optimal prediction will be their mean target value. This should be compared with Mean Absolute Error, where the optimal prediction is the median. MSE is thus good to use if you believe that your target data, conditioned on the input, is normally distributed around a mean value, and when it's important to penalize outliers extra much.
• MSE incorporates both the variance and the bias of the predictor. MSE also gives more weight to larger differences. The bigger the error, the more it is penalized.
• Example : You want to predict future house prices. The price is a continuous value, and therefore we want to do regression. MSE can here be used as the loss function.

2.11.2 Mean Absolute Error

• MAE is the sum of absolute differences between our target and predicted variables. So it measures the average magnitude of errors in a set of predictions, without considering their directions.
• The loss is the mean, over the seen data, of the absolute differences between true and predicted values :
MAE = (1/n) Σᵢ₌₁ⁿ |yᵢ - ŷᵢ|
• Use mean absolute error when you are doing regression and don't want outliers to play a big role. It can also be useful if you know that your distribution is multimodal. MAE loss is useful if the training data is corrupted with outliers.

2.11.3 R-squared

• R-squared is also known as the coefficient of determination. This metric gives an indication of how well a model fits a given dataset. It indicates how close the regression line is to the actual data values.
• The R-squared value lies between 0 and 1, where 0 indicates that this model doesn't fit the given data and 1 indicates that the model fits perfectly to the dataset provided.
R-squared = 1 - (Sum of squared residual errors / Total sum of squares)
• R-squared can also be expressed as a function of mean squared error. R-squared represents the fraction of variance of the response variable captured by the regression model, rather than the MSE which captures the residual error.
• Specifically, this is used to determine how well a line fits to a data set of observations, especially when comparing models. Also, it is the fraction of the total variation in y that is captured by a model, or how well a line follows the variations within a set of data.
• The R² value varies between 0 and 1, where 0 represents no correlation between the predicted and actual values and 1 represents complete correlation.
• R-squared is a good measure to evaluate the model fitness. It is the fraction by which the variance of the errors is less than the variance of the dependent variable.
• It is called R-squared because in a simple regression model it is just the square of the correlation between the dependent and independent variables, which is commonly denoted by "r".
• In a multiple regression model, R-squared is determined by pairwise correlations among all the variables, including correlations of the independent variables with each other as well as with the dependent variable.
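A small sketch (not from the original text) computing MSE, MAE and R² with scikit-learn on made-up values, together with RMSE and adjusted R², which are discussed in the following subsections; the adjusted R² line uses the standard correction 1 - (1 - R²)(n - 1)/(n - p - 1), with n observations and p predictors :

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Made-up actual values and model predictions
y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.5])
y_pred = np.array([2.8, 5.4, 2.9, 6.6, 4.2])

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)

n, p = len(y_true), 2          # pretend the model used p = 2 predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"MSE={mse:.3f}  MAE={mae:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}  Adj R2={adj_r2:.3f}")
```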
Example : Suppose you have been given a set of training examples (x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ). Find the equation of the line that best fits the data, that is, the line that minimizes the squared error.

Solution : Fit the regression line y = β₀ + β₁x to the data by finding the "best" match between the line and the data. The "best" choice of β₀, β₁ is the one that minimizes

Q = Σᵢ₌₁ⁿ (yᵢ - (β₀ + β₁xᵢ))² = Σᵢ₌₁ⁿ eᵢ²

This is called the least squares fit. Setting the partial derivatives to zero :

∂Q/∂β₀ = -2 Σᵢ (yᵢ - (β₀ + β₁xᵢ)) = 0
∂Q/∂β₁ = -2 Σᵢ xᵢ (yᵢ - (β₀ + β₁xᵢ)) = 0

These give the normal equations Σyᵢ = nβ₀ + β₁Σxᵢ and Σxᵢyᵢ = β₀Σxᵢ + β₁Σxᵢ². After a little algebra,

β₁ = (n Σxᵢyᵢ - Σxᵢ Σyᵢ) / (n Σxᵢ² - (Σxᵢ)²)
β₀ = ȳ - β₁x̄,  where ȳ = (1/n) Σyᵢ and x̄ = (1/n) Σxᵢ

Equivalently, β₁ = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)², using Σ(xᵢ - x̄)² = Σxᵢ² - n x̄² and Σ(xᵢ - x̄)(yᵢ - ȳ) = Σxᵢyᵢ - n x̄ ȳ.
Review Question

1. What do you mean by coefficient of regression ? Explain SST, SSE, SSR and MSE in the context of regression.


2-30 RegresB56)
Machine Learning 29/07)

2.11.4 Root Mean Squared Error (RMSE)


• It measures the average difference between values predicted by a model and the actual values. It provides an estimation of how well the model is able to predict the target value (accuracy).
• The lower the value of the Root Mean Squared Error, the better the model is. A perfect model would have a Root Mean Squared Error value of 0.
• Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals are a measure of how far from the regression line data points are; RMSE is a measure of how spread out these residuals are.
• In other words, it tells us how concentrated the data is around the line of best fit. Root mean square error is commonly used in climatology, forecasting and regression analysis to verify experimental results.
• The RMSE represents the square root of the second sample moment of the differences between predicted values and observed values, or the quadratic mean of these differences. These deviations are called residuals when the calculations are performed over the data sample that was used for estimation, and are called errors (or prediction errors) when computed out-of-sample.
• The RMSE serves to aggregate the magnitudes of the errors in predictions for various data points into a single measure of predictive power.
• RMSE (in weighted form) is calculated by using the following formula :
RMSE = √(SSE_w / W),  with SSE_w = Σᵢ₌₁ᴺ wᵢuᵢ²
where :
SSE_w = Weighted sum of squared errors
W = Total weight of the population
N = Number of observations
wᵢ = Weight of the iᵗʰ observation
uᵢ = Error associated with the iᵗʰ observation

• RMSE is a measure of accuracy, used to compare forecasting errors of different models for a particular dataset and not between datasets, as it is scale-dependent.
• RMSE is always non-negative and a value of 0 would indicate a perfect fit to the data. In general, a lower RMSE is better than a higher one. However, comparison across different types of data would be invalid because the measure is dependent on the scale of the numbers used.

• RMSE is the square root of the average of squared errors. The effect of each error on RMSE is proportional to the size of the squared error; thus larger errors have a disproportionately large effect on RMSE. Consequently, RMSE is sensitive to outliers.
2.11.5 Adjusted R-squared


ses if new
Adjusted R-squared is a modified form of R-squared whose value increa
to improve models performance and decreases if new pre dictors
predictors tend
does not improve performance as expected.
(model accuracy) measure for linear
Adjusted R? is a corrected goodness-of-fit
t field th at is explained
models. It identifies the percentage of variance in the targe
by the input or inputs.
ng the residual mean square error by
Adjusted R squared is calculated by dividi
then subtracted from 1.
the total mean square error. The result is
model
ed R? is alwa ys less than or equal to R2. A value of 1 indicates a
Adjust
is less than or equal
that perfectly predicts values in the target field. A value that
ve valu e, In the real world, adjusted R?
to 0 indicates a model that has no predicti
lies between these values.
n R- squared is close to zero. Adjust
ed
R-s qua red can be neg ati ve whe
Adjusted
equal to R-squared value.
R-squared value always be less than or
SOLVED MODEL QUESTION PAPER (In Sem)
Machine Learning
B.E. (AI & DS) Semester - VII (As Per 2020 Pattern)

Time : 1 Hour]                                           [Maximum Marks : 30


N. B. :
i) Attempt Q.1 or Q.2, Q.3 or Q.4.
ii) Neat diagrams must be drawn wherever necessary.
iii) Figures to the right side indicate full marks.
iv) Assume suitable data, if necessary.


Q.1 a) Compare machine learning vs artificial intelligence. (Refer-section 1.3.1) [5]

b) Describe parametric and non-parametric machine learning models.


(Refer sections 1.11 and 1.12) [5]

c) Explain supervised learning with its advantages and disadvantages.


(Refer section 1.5) [5]
OR

Q.2 a) Explain PCA and LDA. What is the difference between LDA and PCA ?
        (Refer sections 1.14 and 1.15) [7]

b) What is reinforcement learning ? Explain elements of reinforcement learning.


(Refer section 1.8) [8]

Q.3 a) Explain the following evaluation metrics :
        MSE, MAE, RMSE, R-squared (Refer section 2.11) [8]

     b) What is SVM ? Explain key properties of SVM. Compare SVM with neural
        network. (Refer section 2.6) [7]
OR

Q.4 a) What is regression ? Explain the need of regression. Discuss types of regression.
        (Refer sections 2.2 and 2.3) [8]
     b) Explain Lasso and ElasticNet regression. (Refer sections 2.8 and 2.9) [7]
