Introduction to Machine Learning
Syllabus
Introduction to Machine Learning : Comparison of Machine learning with traditional programming, Types of learning : Supervised, Unsupervised, Semi-supervised and Reinforcement learning techniques, Models of Machine learning : Geometric model, Probabilistic Models, Logical Models, Grouping and grading models, Parametric and non-parametric models.
Important Elements of Machine Learning - Data formats, Learnability, Statistical learning
approaches.
Contents

1.1  Introduction to Machine Learning
1.2  Comparison of Machine Learning with Traditional Programming ............ Oct.-19, Dec.-19 ... Marks 5
1.3  Types of Learning
1.4  Supervised Learning ............ March-20, June-22 ... Marks 6
1.5  Unsupervised Learning
1.6  Semi-supervised Learning
1.7  Reinforcement Learning
1.8  Models of Machine Learning
1.9  Distance-based Models
1.10 Tree Based Model
1.11 Grouping and Grading Models
1.12 Parametric Models
1.13 Nonparametric Methods
1.14 Important Elements of Machine Learning
1.15 Application of Machine Learning ............ March-20
1.1 Introduction to Machine Learning

• Machine Learning (ML) is a subfield of Artificial Intelligence (AI) which concerns with developing computational theories of learning and building learning machines.
• Learning is a phenomenon and process which has manifestations of varying aspects. The learning process includes gaining of new symbolic knowledge and development of cognitive skills through instruction and practice. It is also the discovery of new facts and theories through observation and experiment.
• Machine Learning Definition : A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
• Machine learning is programming computers to optimize a performance criterion using example data or past experience. Application of machine learning methods to large databases is called data mining.
« It is very hard to write programs that solve problems like recognizing a human
face. We do not know what program to write because we don't know how our
brain does it. Instead of writing a program by hand, it is possible to collect lots of
examples that specify the correct output for a given input.
«A machine learning algorithm then takes these examples and produces a program
that does the job. The program produced by the learning algorithm may look very
different from a typical hand-written program. It may contain millions of numbers.
If we do it right, the program works for new cases as well as the ones we trained
it on.
• The main goal of machine learning is to devise learning algorithms that do the learning automatically without human intervention or assistance. The machine learning paradigm can be viewed as "programming by example." Another goal is to develop computational models of the human learning process and perform computer simulations.
« The goal of machine learning is to build computer systems that can adapt and
learn from their experience.
• An algorithm is used to solve a problem on a computer. An algorithm is a sequence of instructions that is carried out to transform the input to the output. For example, addition of four numbers is carried out by giving the four numbers as input to the algorithm; the output is their sum. For the same task there may be various algorithms, and we are interested in finding the most efficient one, requiring the least number of instructions or memory or both.
• For some tasks, however, we do not have an algorithm.
Why is Machine Learning Important ?

• Machine learning algorithms can figure out how to perform important tasks by generalizing from examples.
• Machine learning provides business insight and intelligence. Decision makers are provided with greater insights into their organizations. This adaptive technology is being used by global enterprises to gain a competitive edge.
• Machine learning algorithms discover the relationships between the variables of a system (input, output and hidden) from direct samples of the system.
• Following are some of the reasons :
1. Some tasks cannot be defined well, except by examples. For example:
recognizing people.
2. Relationships and correlations can be hidden within large amounts of data. To
solve these problems, machine learning and data mining may be able to find
these relationships.
3. Human designers often produce machines that do not work as well as desired
in the environments in which they are used.
4. The amount of knowledge available about certain tasks might be too large for
explicit encoding by humans.
5. Environments change from time to time.
6. New knowledge about tasks is constantly being discovered by humans.
• Machine learning also helps us find solutions to many problems in computer vision, speech recognition and robotics. Machine learning uses the theory of statistics in building mathematical models, because the core task is making inference from a sample.
How Machines Learn ?
® Machine learning typically follows three phases :
1. Training : A training set of examples of correct behavior is analyzed and some
representation of the newly learnt knowledge is stored. This is some form of
rules.
2. Validation : The rules are checked and, if necessary, additional training is
given. Sometimes additional test data are used, but instead, a human expert
may validate the rules, or some other automatic knowledge - based component
may be used. The role of the tester is often called the opponent.
3. Application : The rules are used in responding to some new situation.
• Fig. 1.1.1 shows phases of ML.

Fig. 1.1.1 Phases of ML
* Learning is used when :
1. Human expertise does not exist (navigating on Mars).
2. Humans are unable to explain their expertise (speech recognition).
3. Solution changes in time (routing on a computer network).
4. Solution needs to be adapted to particular cases (user biometrics).
Ingredients of Machine Learning
• The ingredients of machine learning are as follows :
1. Tasks : The problems that can be solved with machine learning. A task is an abstract representation of a problem. The standard methodology in machine learning is to learn one task at a time. Large problems are broken into small, reasonably independent sub-problems that are learned separately and then recombined.
• Predictive tasks perform inference on the current data in order to make predictions. Descriptive tasks characterize the general properties of the data in the database.
2. Models : The output of machine learning. Different models are geometric models, probabilistic models, logical models, grouping and grading.
• The model-based approach seeks to create a modified solution tailored to each new application. Instead of having to transform your problem to fit some standard algorithm, in model-based machine learning you design the algorithm precisely to fit your problem.
• A model is just made up of a set of assumptions, expressed in a precise mathematical form. These assumptions include the number and types of variables in the problem domain, which variables affect each other, and what the effect of changing one variable is on another variable.
• Machine learning models are classified as : Geometric model, Probabilistic model and Logical model.
3. Features : The workhorses of machine learning. A good feature representation is central to achieving high performance in any machine learning task.
• Feature extraction starts from an initial set of measured data and builds derived values intended to be informative and non-redundant, facilitating the subsequent learning and generalization steps.
• Feature selection is a process that chooses a subset of features from the original features so that the feature space is optimally reduced according to a certain criterion.
Review Questions

1. Justify the following with respect to types of machine learning :
(i) Predicting the height of a person is a regression task.
(ii) Finding the gender of a person by analyzing his writing style is a classification task.
(iii) Filtering out spam email is an example of unsupervised learning.
2. What is machine learning ?

1.2 Comparison of Machine Learning with Traditional Programming  SPPU : Oct.-19, Dec.-19, Marks 5
• Machine learning seeks to construct a model or logic for the problem by analyzing its input data and answers. In contrast, traditional programming aims to answer a problem using a predefined set of rules or logic.
• Machine learning is the ability of machines to automate a learning process. The input of this learning process is data and the output is a model. Through machine learning, a system can perform a learning function with the data it ingests and thus it becomes progressively better at said function.
• Traditional programming is a manual process. It requires a programmer to create the rules or logic of the program. We have to manually come up with the rules and feed them to the computer alongside input data. The machine then processes the given data according to the coded rules and comes up with answers as output.
• Fig. 1.2.1 shows machine learning and traditional programming.

Fig. 1.2.1 (a) Machine learning : data and output go to the computer, which produces a program; (b) Traditional programming : data and a program go to the computer, which produces output
• For projects that involve predicting output or identifying objects in images, machine learning has proven to be much more efficient. In traditional programming, the rule-based approach is preferred in situations where the problem is of an algorithmic manner and there are not so many parameters to consider when writing the logical rules.
* Machine Learning is a proven technique for helping to solve complex problems
such as facial and voice recognition, recommendation systems, self-driving cars
and email spam detection.
ML vs AI vs Data Science
• Artificial Intelligence (AI) is the broad concept of developing machines that can simulate human thinking, reasoning and behavior.
• Machine Learning (ML) is a subset of AI wherein computer systems learn from the environment and, in turn, use these learnings to improve experiences and processes. All machine learning is AI, but not all AI is machine learning.
• Data Science is the processing, analysis and extraction of relevant assumptions from data. It's about finding hidden patterns in the data. A Data Scientist makes use of machine learning in order to predict future events.
* Fig. 1.22 shows relation between Al, ML and Data science.
Astificial
intelligence
} 4
Machine
learning Data
0 science
Deep
leaming
Fig, 1.2.2 Relation between Al, ML and Data science
• Machine Learning uses statistical models. Artificial Intelligence uses logic and decision trees. Data Science deals with structured data.
• Machine Learning : A form of analytics in which software programs learn about data and find patterns.
• AI : Development of computerized applications that simulate human intelligence and interaction.
• Data Science : The process of using advanced analytics to extract relevant information from data.
• Table 1.2.1 shows the comparison of Machine Learning, Artificial Intelligence and Data Science.

Table 1.2.1 Comparison of ML, AI and Data Science

Machine Learning | Artificial Intelligence | Data Science
Focuses on providing means for algorithms and systems to learn from experience with data and use that experience to improve over time. | Focuses on giving machines cognitive and intellectual capabilities similar to those of humans. | Focuses on extracting information needles from data haystacks to aid in decision-making and planning.
A form of analytics in which software programs learn about data and find patterns. | Development of computerized applications that simulate human intelligence and interaction. | The process of using advanced analytics to extract relevant information from data.
Machine Learning uses statistical models. | Artificial Intelligence uses logic and decision trees. | Data Science deals with structured data.
Objective is to maximize accuracy. | Objective is to maximize the chance of success. | Objective is to extract actionable insights from the data.
ML can be done through supervised, unsupervised or reinforcement learning approaches. | AI encompasses a collection of intelligence concepts, including elements of perception, planning and prediction. | Uses statistics, mathematics, data wrangling, big data analytics, machine learning and various other methods to answer analytics questions.
ML is concerned with knowledge accumulation. | AI is concerned with knowledge dissemination and conscious machine actions. | Data science is all about data engineering.

1.3 Types of Learning

• Learning is essential for unknown environments, i.e. when the designer lacks omniscience. Learning simply means incorporating information from the training examples into the system.
• Learning is any change in a system that allows it to perform better the second time on repetition of the same task or on another task drawn from the same population. One part of learning is acquiring knowledge and new information; the other part is problem-solving.
• Supervised and unsupervised learning are the different types of machine learning methods. A computational learning model should be clear about the following aspects :
1. Learner : Who or what is doing the learning. For example : Program or algorithm.
2. Domain : What is being learned ?
3. Goal : Why the learning is done ?
4. Representation : The way the objects to be learned are represented.
5. Algorithmic technology : The algorithmic framework to be used.
6. Information source : The information (training data) the program uses for
learning,
7. Training scenario : The description of the learning process.
• Learning is constructing or modifying representations of what is being experienced. To learn means to get knowledge of by study, experience or being taught.
• Machine learning is a scientific discipline concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data, such as data from sensors or databases.
• Machine learning is usually divided into two main types : Supervised Learning and Unsupervised Learning.
Why do Machine Learning ?
1. To understand and improve efficiency of human learning.
2. Discover new things or structure that is unknown to humans (Example : Data
mining).
3. Fill in skeletal or incomplete specifications about a domain.
• Fig. 1.3.1 shows types of machine learning.

Fig. 1.3.1 Types of machine learning : Supervised learning (classification, regression), Unsupervised learning (clustering, association analysis) and Reinforcement learning
1.4 Supervised Learning  SPPU : March-20, June-22, Marks 6
« Supervised learning is the machine learning task of inferring a function from
supervised training data. The training data consist of a set of training examples.
The task of the supervised learner is to predict the output behavior of a system for
any set of input values, after an initial training phase.
« Supervised learning in which the network is trained by providing it with input
and matching output patterns. These input-output pairs are usually provided by
an external teacher.
• Human learning is based on past experiences. A computer does not have experiences. A computer system learns from data, which represent some "past experiences" of an application domain.
• Supervised learning can be used to predict the values of a discrete attribute, e.g. approved or not approved, and high or low risk. This task is commonly called supervised learning or classification.
• Training data includes both the input and the desired results. The correct results (target values) are known and are given in input to the model during the learning process. The construction of a proper training, validation and test set is crucial. These methods are usually fast and accurate.
• The model has to be able to generalize : give the correct results when new data are given in input without knowing the target a priori.
• Supervised learning is the machine learning task of inferring a function from supervised training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object and a desired output value.
• A supervised learning algorithm analyzes the training data and produces an inferred function, which is called a classifier or a regression function. Fig. 1.4.1 shows the supervised learning process.

Fig. 1.4.1 Supervised learning process
• The learned model helps the system to perform the task better as compared to no learning.
• Each input vector requires a corresponding target vector.
Training Pair = (Input Vector, Target Vector)
• Fig. 1.4.2 shows the input vector.
Fig. 1.4.2 Input vector (desired output compared against actual output)
* Supervised learning is further divided into methods which use reinforcement or
error correction. The perceptron learning algorithm is an example of supervised
learning with reinforcement.
Data formats in supervised learning :
• Supervised learning always uses a dataset, defined as a finite set of real vectors with m features each :

$X = \{\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_n\}$ where $\bar{x}_i \in \mathbb{R}^m$

• Considering that our approach is always probabilistic, we need to consider each $\bar{x}_i$ as drawn from a statistical multivariate distribution D. It is also useful to add an important condition upon the whole dataset X : here we consider all samples to be independent and identically distributed. This means all variables belong to the same distribution D and, considering an arbitrary subset of m values, it happens that :

$P(\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_m) = \prod_{i=1}^{m} P(\bar{x}_i)$
+ The corresponding output values can be both numerical - continuous and
categorical. In the first case, the process is called regression, while in the second, it
is called classification.
• Example : A dataset contains city populations by year for the past 100 years, and the user wants to know what the population of a specific city will be four years from now. The outcome uses labels that already exist in the data set : population, city and year.
• In order to solve a given problem of supervised learning, the following steps are performed :
1. Find out the type of training examples.
2. Collect a training set.
3. Determine the input teature representation of the learned function.
4. Determine the structure of the learned function and corresponding learning
algorithm,
5 Complete the design and then run the learning algorithm on the collected
training set.
6. Evaluate the accuracy of the learned function. After parameter adjustment and
learning, the performance of the resulting function should be measured on a
test set that is separate from the training set.
• Supervised learning is divided into two types : Classification and Regression.
1. Classification :
• Classification predicts categorical labels (classes), while prediction models continuous-valued functions. Classification is considered to be supervised learning.
• Classification classifies data based on the training set and the values in a classifying attribute and uses it in classifying new data. Prediction models continuous-valued functions, i.e., predicts unknown or missing values.
© Preprocessing of the data in preparation for classification and prediction can
involve data cleaning to reduce noise or handle missing values, relevance analysis
to remove irrelevant or redundant attributes and data transformation, such as
generalizing the data to higher level concepts or normalizing data.
• Numeric prediction is the task of predicting continuous values for a given input. For example, we may wish to predict the salary of a college employee with 15 years of work experience, or the potential sales of a new product given its price.
• Some classification methods, like back-propagation, support vector machines and k-nearest-neighbor classifiers, can be used for prediction.
2. Regression :
• For an input x, if the output is continuous, this is called a regression problem. For example, based on historical information of demand for toothpaste in a supermarket, the user is asked to predict the demand for the next month.
• Regression is concerned with the prediction of continuous quantities. Linear regression is the oldest and most widely used predictive model in the field of machine learning. The goal is to minimize the sum of the squared errors to fit a straight line to a set of data points.
• Regression algorithms used in supervised learning are linear regression, Bayesian linear regression, polynomial regression, regression tree etc.
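• As an illustration, a minimal supervised regression workflow might be sketched as follows using scikit-learn; the synthetic one-feature dataset and its parameters are assumptions for illustration, not taken from the text :

# A minimal supervised regression sketch using scikit-learn.
# The synthetic data below is illustrative, not from the text.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))                # single input feature
y = 3.0 * X.ravel() + 2.0 + rng.normal(0, 1, 100)    # noisy linear target

# Split into training and test sets, as step 6 of the procedure suggests.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)     # minimizes squared error
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("test MSE:", mean_squared_error(y_test, model.predict(X_test)))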
Advantages and Disadvantages of Supervised Learning
1. Advantages of supervised learning
• It performs classification and regression tasks.
• It allows estimating or mapping the result to a new sample.
• We have complete control over choosing the number of classes we want in the training data.
2. Disadvantages of supervised learning
• Supervised learning cannot handle all complex tasks in Machine Learning.
• Computation time is vast for supervised learning.
• It requires a labelled data set.
• It requires a training process.
Review Questions

1. Explain supervised learning with example.  SPPU : March-20, In Sem, Marks 5
2. Explain data formats for supervised learning problem with example.  SPPU : June-22, End Sem, Marks 6
1.5 Unsupervised Learning
• Unsupervised learning is a type of machine learning in which models are trained using an unlabeled dataset and are allowed to act on that data without any supervision.
• In unsupervised learning, a dataset is provided without labels and a model learns useful properties of the structure of the dataset. The main goal of unsupervised learning is to discover hidden and interesting patterns in unlabeled data.
• They are called unsupervised because they do not need a teacher or supervisor to label a set of training examples. Only the original data is required to start the analysis.
• Unsupervised learning tasks typically involve grouping similar examples together, dimensionality reduction and density estimation.
• Common algorithms used in unsupervised learning include clustering, anomaly detection and neural networks.
• Fig. 1.5.1 shows unsupervised learning.

Fig. 1.5.1 Unsupervised learning (input data is processed by the algorithm and interpreted into outputs)
• The most common unsupervised learning method is cluster analysis, which applies clustering methods to explore data and find hidden patterns or groupings in data. Unsupervised learning is typically applied before supervised learning, to identify features in exploratory data analysis and establish classes based on groupings.
• Unsupervised machine learning is mainly used to :
1. Cluster datasets on similarities between features or segment data.
2. Understand relationships between different data points, such as automated music recommendations.
3. Perform initial data analysis.
Unsupervised learning algorithms have the capability of analyzing large amounts
of data and identifying unusual points among the dataset. Once those anomalies
have been detected, they can be brought to the awareness of the user, who can
then decide to act or not on this warning.
• Anomaly detection can be very useful in the financial and banking sectors. Indeed, financial fraud has become a daily problem, due to the ease with which credit card details can be stolen. Using unsupervised learning models, unauthorized or fraudulent transactions on a bank account can be identified, as they will most often constitute a change in the user's normal pattern of spending.
• Example : Using customer data, the user wants to create segments of customers who like similar products. The data that the user provides is not labeled, and the labels in the outcome are generated based on the similarities that were discovered between data points.
• Types of unsupervised learning are clustering and association analysis.
• There is a wide range of algorithms that can be deployed under unsupervised learning. A few of them include : K-means clustering, Principal component analysis, Hierarchical clustering and Dendrogram.
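• A minimal unsupervised clustering sketch with K-means follows (scikit-learn; the two synthetic blobs are assumed for illustration) :

# A minimal K-means clustering sketch; no labels are supplied,
# the algorithm groups points by similarity alone.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)),     # cluster around (0, 0)
               rng.normal(5, 1, (50, 2))])    # cluster around (5, 5)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster centers:\n", km.cluster_centers_)
print("labels of first five points:", km.labels_[:5])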
Advantages and Disadvantages of Unsupervised Learning
1. Advantages of unsupervised learning
• It does not require training data to be labelled.
• Dimensionality reduction can be easily accomplished using unsupervised learning.
• Capable of finding previously unknown patterns in data.
2. Disadvantages of unsupervised learning
• Difficult to measure accuracy or effectiveness due to lack of predefined answers during training.
• The results often have lesser accuracy.
• The user needs to spend time interpreting and labeling the classes which follow that classification.
Difference between Supervised and Unsupervised Learning

Sr. No. | Supervised learning | Unsupervised learning
1. | Desired output is given. | Desired output is not given.
2. | It is not possible to learn larger and more complex models than with unsupervised learning. | It is possible to learn larger and more complex models with unsupervised learning.
3. | Uses training data to infer the model. | No training data is used.
4. | Every input pattern that is used to train the network is associated with an output pattern. | The target output is not presented to the network.
5. | Trying to predict a function from labeled data. | Tries to detect interesting relations in data.
6. | Supervised learning requires that the target variable is well defined and that a sufficient number of its values are given. | For unsupervised learning, typically either the target variable is unknown or has only been recorded for too small a number of cases.
7. | Example : Optical character recognition. | Example : Find a face in an image.
8. | We can test our model. | We cannot test our model.
9. | Supervised learning is also called classification. | Unsupervised learning is also called clustering.
1.6 Semi-supervised Learning
• Semi-supervised learning uses both labeled and unlabeled data to improve supervised learning. The goal is to learn a predictor that predicts future test data better than the predictor learned from the labeled training data alone.
• Semi-supervised learning is motivated by its practical value in learning faster, better and cheaper.
• In many real world applications, it is relatively easy to acquire a large amount of unlabeled data x.
• For example, documents can be crawled from the Web, images can be obtained from surveillance cameras, and speech can be collected from broadcast. However, their corresponding labels for the prediction task, such as sentiment orientation, intrusion detection and phonetic transcript, often require slow human annotation and expensive laboratory experiments.
In many practical learning domains, there is a large supply of unlabeled data but
limited labeled data, which can be expensive to generate. For example : text
processing, video-indexing, bioinformatics etc.
Semi-supervised Learning makes use of both labeled and unlabeled data for
training, typically a small amount of labeled data with a large amount of
unlabeled data. When unlabeled data is used in conjunction with a small amount
of labeled data, it can produce considerable improvement in learning accuracy.
Semi-supervised learning sometimes enables predictive model testing at reduced
cost.
Semi-supervised classification : Training on labeled data exploits additional
unlabeled data, frequently resulting in a more accurate classifier.
Semi-supervised clustering : Uses small amount of labeled data to aid and bias
the clustering of unlabeled data.
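• A minimal sketch of semi-supervised classification, assuming scikit-learn's SelfTrainingClassifier as one possible implementation (the dataset, masking fraction and base learner are illustrative choices, not prescribed by the text) :

# Self-training wraps a supervised base model and iteratively
# pseudo-labels the unlabeled data (marked with -1 in y).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

# Hide 70 % of the labels: unlabeled points are marked with -1.
y_partial = y.copy()
y_partial[rng.random(len(y)) < 0.7] = -1

base = SVC(probability=True, gamma="auto")   # base learner must expose predict_proba
clf = SelfTrainingClassifier(base).fit(X, y_partial)
print("accuracy against all true labels:", clf.score(X, y))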
Comparison between Supervised, Unsupervised and Semi-supervised Learning

Sr. No. | Supervised learning | Unsupervised learning | Semi-supervised learning
1. | Input data is labeled. | Input data is unlabeled. | A large amount of input data is unlabeled while a small amount is labeled.
2. | Trying to predict a specific quantity. | Trying to understand the data. | Uses unsupervised methods to improve the supervised algorithm.
3. | Used in fraud detection. | Used in identity management. | Used in spam detection.
4. | Subtype : Classification and regression. | Subtype : Clustering and association. | Subtype : Classification, regression, clustering and association.
5. | Higher accuracy. | Lesser accuracy. | Lesser accuracy.
1.7 Reinforcement Learning
Reinforcement learning uses algorithms that learn from outcomes and decide
which action to take next. After each action, the algorithm receives feedback that
helps it determine whether the choice it made was correct, neutral or incorrect. It
is a good technique to use for automated systems that have to make a lot of small
decisions without human guidance.
Reinforcement learning is an autonomous, self - teaching system that essentially
learns by trial and error. It performs actions with the aim of maximizing rewards,
or in other words, it is learning by doing in order to achieve the best outcomes.
A good example of using reinforcement learning is a robot learning how to walk.
The robot first tries a large step forward and falls. The outcome of a fall with that
big step is a data point the reinforcement learning system responds to. Since the
feedback was negative, a fall, the system adjusts the action to try a smaller step.
The robot is able to move forward. This is an example of reinforcement learning in
action.
• Reinforcement learning is learning what to do and how to map situations to actions. The learner is not told which actions to take. Fig. 1.7.1 shows the concept of reinforcement learning.

Fig. 1.7.1 Reinforcement learning (the agent receives a situation and a reward from the environment and responds with an action)

• Reinforcement learning deals with agents that must sense and act upon their environment. It combines classical Artificial Intelligence and machine learning techniques.
It allows machines and software agents to automatically determine the ideal
behavior within a specific context, in order to maximize its performance. Simple
reward feedback is required for the agent to learn its behavior; this is known as
the reinforcement signal.
Two most important distinguishing features of reinforcement learning is
trial-and-error and delayed reward.
• With reinforcement learning algorithms an agent can improve its performance by using the feedback it gets from the environment. This environmental feedback is called the reward signal.
• Based on accumulated experience, the agent needs to learn which action to take in a given situation in order to obtain a desired long term goal. Essentially, actions that lead to long term rewards need to be reinforced. Reinforcement learning has connections with control theory, Markov decision processes and game theory.
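• The robot-walking idea above can be sketched in code with tabular Q-learning, one classical reinforcement learning algorithm; the 1-D corridor environment, rewards and parameters below are illustrative assumptions, not prescribed by the text :

# Toy Q-learning sketch: an agent on a corridor of 6 cells learns,
# by trial and error with a delayed reward, to walk right to the goal.
import numpy as np

n_states, actions = 6, [-1, +1]            # move left or move right
Q = np.zeros((n_states, len(actions)))     # value of each (state, action)
alpha, gamma, epsilon = 0.5, 0.9, 0.1      # learning rate, discount, exploration

rng = np.random.default_rng(0)
for episode in range(200):
    s = 0
    while s != n_states - 1:
        # epsilon-greedy action choice; ties are broken randomly
        if rng.random() < epsilon or Q[s, 0] == Q[s, 1]:
            a = int(rng.integers(2))
        else:
            a = int(Q[s].argmax())
        s_next = min(max(s + actions[a], 0), n_states - 1)
        r = 1.0 if s_next == n_states - 1 else 0.0   # reward signal from environment
        # Q-learning update using the environmental feedback
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print("learned preference for 'right' in each state:", Q.argmax(axis=1))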
Elements of Reinforcement Learning

• Reinforcement learning elements are as follows :
1. Policy
2. Reward function
3. Value function
4. Model of the environment

Conditional Probability

• Let P(A) > 0. We denote by P(B|A) the probability of B given that A has occurred. Since A is known to have occurred, it becomes the new sample space replacing the original S. From this, the definition is

$P(B \mid A) = \frac{P(A \cap B)}{P(A)}$

or

$P(A \cap B) = P(A) \cdot P(B \mid A)$
• The probability P(B|A) is read "the probability of event B given event A". It is the probability of an event B given the occurrence of the event A.
• We say that the probability that both A and B occur is equal to the probability that A occurs times the probability that B occurs given that A has occurred. We call P(B|A) the conditional probability of B given A, i.e., the probability that B will occur given that A has occurred.
• Similarly, the conditional probability of an event A given B is

$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$

• The probability P(A|B) simply reflects the fact that the probability of an event A may depend on a second event B. If A and B are mutually exclusive, $A \cap B = \emptyset$ and P(A|B) = 0.
• Another way to look at the conditional probability formula is :

$P(\text{Second} \mid \text{First}) = \frac{P(\text{First choice and second choice})}{P(\text{First choice})}$
• Conditional probability is a defined quantity and cannot be proven.
• The key to solving conditional probability problems is to :
1. Define the events.
2. Express the given information and question in probability notation.
3. Apply the formula.
Joint Probability
• A joint probability is a probability that measures the likelihood that two or more events will happen concurrently.
• If there are two independent events A and B, the probability that A and B will occur is found by multiplying the two probabilities. Thus for two events A and B, the special rule of multiplication, shown symbolically, is :
P(A and B) = P(A) P(B)
• The general rule of multiplication is used to find the joint probability that two events will occur. Symbolically, the general rule of multiplication is :
P(A and B) = P(A) P(B|A)
• The probability $P(A \cap B)$ is called the joint probability for two events A and B which intersect in the sample space. A Venn diagram will readily show that

$P(A \cap B) = P(A) + P(B) - P(A \cup B)$
• Equivalently, $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.
• The probability of the union of two events never exceeds the sum of the event probabilities.
• A tree diagram is very useful for portraying conditional and joint probabilities. A tree diagram portrays outcomes that are mutually exclusive.
• Bayes' theorem is a method to revise the probability of an event given additional information. Bayes' theorem calculates a conditional probability called a posterior or revised probability.
• Bayes' theorem is a result of probability theory that relates conditional probabilities. If A and B denote two events, P(A|B) denotes the conditional probability of A occurring, given that B occurs. The two conditional probabilities P(A|B) and P(B|A) are in general different.
n between P(A|B) and P(B|A)- An important
* Bayes theorem gives a relatio
application of Bayes’ theorem is that it gives a rule how to update or revise the
idence a posteriori.
strengths of evidence-based beliefs in light of new evi
• A prior probability is an initial probability value originally obtained before any additional information is obtained.
• A posterior probability is a probability value that has been revised by using additional information that is later obtained.
• Suppose that $B_1, B_2, B_3, \ldots, B_n$ partition the outcomes of an experiment and that A is another event. For any number k, with $1 \le k \le n$, we have the formula :

$P(B_k \mid A) = \frac{P(A \mid B_k) \cdot P(B_k)}{\sum_{i=1}^{n} P(A \mid B_i) \cdot P(B_i)}$
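• A small numeric check of the formula follows; the test sensitivity, false-positive rate and prevalence are assumed for illustration only :

# Bayes' theorem with illustrative numbers: a diagnostic test with
# 99 % sensitivity, 5 % false-positive rate, 1 % disease prevalence.
p_b1 = 0.01            # P(B1): person has the disease (prior)
p_b2 = 0.99            # P(B2): person is healthy
p_a_given_b1 = 0.99    # P(A|B1): positive test if diseased
p_a_given_b2 = 0.05    # P(A|B2): positive test if healthy

# Denominator: total probability of a positive test over the partition.
p_a = p_a_given_b1 * p_b1 + p_a_given_b2 * p_b2

# Posterior (revised) probability of disease given a positive test.
p_b1_given_a = p_a_given_b1 * p_b1 / p_a
print(f"P(disease | positive) = {p_b1_given_a:.3f}")   # ~0.167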
Logical Models

• Logical models are defined in terms of easily interpretable logical expressions. Logical models use a logical expression to divide the instance space into segments and hence construct grouping models.
• A logical expression is an expression that returns a Boolean value, i.e., a True or False outcome. Models involving logical statements correspond to human-understandable rules.
Once the data is grouped using a logical expression, the data is divided into
homogeneous groupings for the problem we are trying to solve. For example, for a
classification problem, all the instances in the group belong to one class.
There are mainly two kinds of logical models : Tree models and Rule models.
Rule models consist of a collection of implications or IF-THEN rules. For
tree-based models, the ‘if - part’ defines a segment and the ‘then - part’ defines
the behavior of the model for this segment. Rule models follow the same
reasoning.
Example of Logical models : Decision tree, random forest.
1.9 Distance-based Models
• Learning a good distance metric in feature space is crucial in real-world applications. Good distance metrics are important to many computer vision tasks, such as image classification and content-based image retrieval.
• The arithmetic and geometric means, usually used to average a finite set of positive numbers, generalize naturally to a finite set of Symmetric Positive Definite matrices. This generalization is based on the key observation that a mean has various characterizations.
The arithmetic mean minimizes the sum of the squared Euclidean distances to
given positive numbers. The geometric mean minimizes the sum of the squared
hyperbolic distances to the given positive numbers.
• Examples of distance-based algorithms are Hierarchical Agglomerative Clustering (HAC) and the K-nearest neighbor algorithm (KNN) for prediction. These algorithms work on arbitrary types of structured data. They require a distance function on the underlying data type.
• The distance calculation on complex/structured types requires three types of functions :
1. A function to generate pairs of objects of the simpler constitutive types, i.e. a pairing function.
2. Distance functions on the simpler types.
3. An aggregation function that is applied to the distance values obtained from the above steps.
Depending on the availability of the training examples, algorithms for distance
metric learning can be divided into two categories : Supervised distance metric
learning and unsupervised distance metric learning.
• Unlike most supervised learning algorithms where training examples are given class labels, the training examples of supervised distance metric learning are cast into pair-wise constraints : equivalence constraints, where pairs of data points
that belong to the same classes and in-equivalence constraints where pairs of data
points belong to different classes.
• The supervised distance metric learning can be further divided into two categories : global distance metric learning and local distance metric learning. The first one learns the distance metric in a global sense, i.e., to satisfy all the pair-wise constraints simultaneously. The second approach is to learn a distance metric in a local setting, i.e., only to satisfy local pair-wise constraints.
• The main ingredients of distance-based models are distance metrics, which can be Euclidean, Manhattan, Minkowski or Mahalanobis, among many others.
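• A minimal sketch of a distance-based model in practice: k-nearest neighbors with interchangeable Minkowski metrics (scikit-learn; the Iris data and the values of k and p are illustrative assumptions) :

# KNN with varying distance metrics: Minkowski with p = 1 is Manhattan,
# p = 2 is Euclidean; only the distance function changes, not the learner.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for name, p in [("manhattan (p=1)", 1), ("euclidean (p=2)", 2), ("minkowski p=3", 3)]:
    knn = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=p)
    knn.fit(X_tr, y_tr)
    print(name, "accuracy:", round(knn.score(X_te, y_te), 3))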
Euclidean Distance
• The Euclidean distance is the most common distance metric used in low dimensional data sets. It is also known as the $L_2$ norm. The Euclidean distance is the usual manner in which distance is measured in the real world :

$d_{Euclidean}(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$

where x and y are m-dimensional vectors, denoted by $x = (x_1, x_2, x_3, \ldots, x_m)$ and $y = (y_1, y_2, y_3, \ldots, y_m)$, representing the m attribute values of two records.
• While the Euclidean metric is useful in low dimensions, it doesn't work well in high dimensions and for categorical variables. The drawback of Euclidean distance is that it ignores the similarity between attributes. Each attribute is treated as totally different from all of the other attributes.
Mahalanobis Distance
• Mahalanobis distance is also called quadratic distance.
• Mahalanobis distance is a distance measure between two points in the space defined by two or more correlated variables. Mahalanobis distance takes the correlations within a data set between the variables into consideration.
• If there are two non-correlated variables, the Mahalanobis distance between the points of the variables in a 2D scatter plot is the same as the Euclidean distance.
• The Mahalanobis distance is the distance between an observation and the center for each group in m-dimensional space defined by m variables and their covariance. Thus, a small value of Mahalanobis distance increases the chance of an observation being closer to the group's center, and the more likely it is to be assigned to that group.
• Mahalanobis distance between two samples (x, y) of a random variable is defined as :

$d_{Mahalanobis}(x, y) = \sqrt{(x - y)^T \Sigma^{-1} (x - y)}$
• The Mahalanobis metric is defined in independence of the data matrix.
• No pre-processing of labeled data samples is needed before using KNN.
• The most frequent class label in the K-nearest neighbors of a data point is assigned as the class label to that point. A tie occurs when the neighborhood has the same amount of labels from multiple classes.
© To break the tie, the distances of neighbors can be summed up in each class that
is tied and vector f is assigned to the class with minimal distance. Or, the class
can be chosen with the nearest neighbor. Clearly, tie is still possible here, in which
case an arbitrary assignment is taken.
• Mahalanobis distance takes into account the correlation S of the dataset :

$L(x, y) = \sqrt{(x - y)^T S^{-1} (x - y)}$
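• A minimal numpy sketch of this distance follows; the two-dimensional dataset and covariance are synthetic, assumed purely for illustration :

# Mahalanobis distance between two samples, using the inverse
# covariance estimated from an illustrative dataset.
import numpy as np

rng = np.random.default_rng(0)
data = rng.multivariate_normal([0, 0], [[3, 1], [1, 2]], size=500)

cov_inv = np.linalg.inv(np.cov(data.T))      # Sigma^{-1} from the data

def mahalanobis(x, y, cov_inv):
    """sqrt((x - y)^T Sigma^{-1} (x - y))"""
    d = x - y
    return float(np.sqrt(d @ cov_inv @ d))

x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print("Mahalanobis:", mahalanobis(x, y, cov_inv))
print("Euclidean  :", float(np.linalg.norm(x - y)))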
Hamming Distance
• Hamming bits are inserted into the message at random locations. Hamming code is a single error correcting code. It is most complex from the standpoint of creating and interpreting the error bits. Let us consider a frame which consists of m data bits and r check bits. The total length of the message is then n = m + r. An n-bit unit containing data and check bits is often referred to as an n-bit codeword.
• If 10001001 and 10110001 are two codewords, then the corresponding bits differ in 3 positions. The number of bit positions in which two codewords differ is called the hamming distance.
• If two codewords are a hamming distance d apart, it will require d single bit errors to convert one into the other.
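• A short sketch computing the hamming distance between the two codewords mentioned above, by counting differing bit positions :

# Hamming distance: count the positions where corresponding bits differ.
def hamming_distance(a: str, b: str) -> int:
    assert len(a) == len(b)
    return sum(bit_a != bit_b for bit_a, bit_b in zip(a, b))

print(hamming_distance("10001001", "10110001"))   # -> 3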
• Determining the placement and binary value of the hamming bits can be implemented using hardware, but it is often more practical to implement them using software.
• The number of bits in the message is counted and used to determine the number of hamming bits to be used. The following equation is used to count the number of hamming bits :

$2^H \ge M + H + 1$

where M = Number of bits in the message and H = Number of hamming bits.
• The parity bits are inserted into the message. The position of a parity bit is calculated as follows. Create a 4-bit binary number $b_4 b_3 b_2 b_1$ where
$b_i = 0$ if the parity check for $P_i$ succeeds,
$b_i = 1$ otherwise,
for i = 1, 2, 3 or 4.
1) The parity bit $P_1$ is inserted at bit position 1, for even parity over bit positions 1, 3, 5, 7, 9, 11. Over these bit positions it maintains an even number of 1s.
2) The parity bit $P_2$ is inserted at bit position 2, for even parity over bit positions 2, 3, 6, 7, 10, 11.
3) The parity bit $P_3$ is inserted at bit position 4, for even parity over bit positions 4, 5, 6, 7, 12.
4) The parity bit $P_4$ is inserted at bit position 8, for even parity over bit positions 8, 9, 10, 11, 12.
• For inserting the parity bits, even or odd parity can be used. Each parity bit is determined by the data bits it checks. When a receiver gets a transmitted frame, it performs each of the parity checks.
« The combination of failures and successes then determines whether there was no
error or in which position an error occurred. Once the receiver knows where the
error occurred, it changes the bit value in that position and the error is corrected.
Minimum hamming distance ($d_{min}$) :
• The minimum hamming distance is the smallest hamming distance between all possible pairs in a set of words.
• To find the value of $d_{min}$, we find the hamming distances between all words and select the smallest one.
Minkowski Distance Metric
• The Minkowski distance is the generalized form of the Euclidean and Manhattan distances. The Minkowski distance between two variables X and Y is defined as

$D = \left( \sum_{i=1}^{m} |x_i - y_i|^p \right)^{1/p}$

• The case where p = 1 is equivalent to the Manhattan distance and the case where p = 2 is equivalent to the Euclidean distance.
Although p can be any real value, it is typically set to a value between 1 and 2.
For values of p less than 1, the formula above does not define a valid distance
metric since the triangle inequality is not satisfied.
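• A minimal sketch of the Minkowski distance follows, showing that p = 1 reduces to the Manhattan distance and p = 2 to the Euclidean distance (the two vectors are illustrative) :

# Minkowski distance: D = (sum_i |x_i - y_i|^p)^(1/p)
import numpy as np

def minkowski(x, y, p):
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])
print("p=1 (Manhattan):", minkowski(x, y, 1))   # 3 + 2 + 0 = 5.0
print("p=2 (Euclidean):", minkowski(x, y, 2))   # sqrt(13) ~ 3.606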
Review Question

1. What do you mean by distance metric and exemplar ? Explain different types of distance measures.
1.10 Tree Based Model  SPPU : Dec.-19

Decision Trees
• A decision tree is a simple representation for classifying examples. A decision tree or a classification tree is a tree in which each internal node is labeled with an input feature. The arcs coming from a node labeled with a feature are labeled with each of the possible values of the feature. Each leaf of the tree is labeled with a class or a probability distribution over the classes.
• In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
«The key idea is to use a decision tree to partition the data space into cluster (or
dense) regions and empty (or sparse) regions.
• A decision tree consists of :
Nodes : Test for the value of a certain attribute.
Edges : Correspond to the outcome of a test and connect to the next node or leaf.
Leaves : Terminal nodes that predict the outcome.
• In Decision Tree Learning, a new example is classified by submitting it to a series of tests that determine the class label of the example. These tests are organized in a hierarchical structure called a decision tree.
• Learn trees in a Top-Down fashion :
1. Divide the problem in subproblems.
2. Solve each subproblem.
• Basic Divide-and-Conquer Algorithm :
1. Select a test for the root node. Create a branch for each possible outcome of the test.
2. Split instances into subsets, one for each branch extending from the node.
3. Repeat recursively for each branch, using only instances that reach the branch.
4. Stop recursion for a branch if all its instances have the same class.
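• A minimal sketch of this top-down procedure in practice, using scikit-learn's DecisionTreeClassifier as one implementation (the Iris data, depth limit and shortened feature names are illustrative assumptions) :

# Each internal node tests one attribute; leaves predict a class.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Print the learned tests and leaves as human-readable rules.
print(export_text(tree, feature_names=["sl", "sw", "pl", "pw"]))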
Ranking and Probability Estimation Trees
• Decision trees are one of the most effective and widely used classification methods. Many applications require class probability estimates, and Probability Estimation Trees (PET) have the same attractive features as classification trees. But decision trees have been found to provide poor probability estimates.
• A tree is defined as a set of logical conditions on attributes; a leaf represents the subset of instances corresponding to the conjunction of conditions along its branch or path back to the root. A simple approach to ranking is to estimate the probability of an instance's membership in a class and assign that probability as the instance's rank. A decision tree can easily be used to estimate these probabilities.
• Rule learning is known for its descriptive and therefore comprehensible classification models which also yield good class predictions. In some application areas, we also need good class probability estimates.
« For different classification models, such as decision trees, a variety of techniques
for obtaining good probability estimates have been proposed and evaluated.
• In classification rule mining one searches for a set of rules that describes the data as accurately as possible. There are many different generation approaches and types of generated classification rules.
+ A probabilistic rule is an extension of a classification rule, which does not only
predict a single class value, but a set of class probabilities, which form a
probability distribution over the classes. This probability distribution estimates all.
probabilities that a covered instance belongs to any of the class in the data set, so
we get one class probability per class.
• Error rate does not consider the probability of the prediction, so it is considered in PET. Instead of predicting a class, the leaves give a probability. This is very useful when we do not want just the class, but the examples most likely to belong to a class. There is no additional effort in learning a PET compared to a decision tree.
© Building decision trees with accurate probability estimates, called probability
estimation trees. A small tree has a small number of leaves, thus more examples
will have the same class probability. That prevents the learning algorithm from
building an accurate PET.
© On the other hand, if the tree is large, not only may the tree overfit the training
data, but the number of examples in each leaf is also small and thus the
probability estimates would not be accurate and reliable. Such a contradiction does
exist in traditional decision trees.
• Decision trees acting as probability estimators, however, are often observed to produce bad probability estimates. There are two types of probability estimation trees - a single tree estimator and an ensemble of multiple trees.
• Applying a learned PET involves minimal computational effort, which makes the tree-based approach particularly suited for a fast reranking of large candidate sets.
• For simplicity, all attributes are assumed to be numeric. For n attributes, each input datum is then given by an n-tuple $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$.
• Let $X = \{x^{(1)}, \ldots, x^{(m)}\} \subset \mathbb{R}^n$ be the set of training items.
• A probability estimation tree is introduced as a binary tree T with $s \ge 0$ inner nodes $D^T = \{d_1, d_2, \ldots, d_s\}$ and leaf nodes $E^T = \{e_0, e_1, \ldots, e_s\}$ with $E^T \cap D^T = \emptyset$. Each inner node $d_i$, $i \in \{1, 2, \ldots, s\}$, is labeled by an attribute $a_i \in \{1, \ldots, n\}$, while each leaf node $e_j$, $j \in \{0, 1, \ldots, s\}$, is labeled by a probability $p_j \in [0, 1]$.
• The arcs in $A^T$ correspond to conditions on the inputs. Since it is a binary tree, every inner node has exactly two children. By splitting inputs at each decision node until a leaf is reached, the PET partitions the input space into n-dimensional cartesian blocks, one block per leaf.
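• The idea can be sketched with scikit-learn, whose decision trees expose the class distribution at each leaf through predict_proba (a sketch, not a dedicated PET learner; data and depth are illustrative assumptions) :

# Each row of predict_proba is the probability distribution over classes
# at the leaf the input reaches, i.e. the p_j attached to that leaf.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
pet = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(pet.predict_proba(X[:3]))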
Regression Tree
• Regression tree models are known for their simplicity and efficiency when dealing with domains with a large number of variables and cases.
• Regression trees are obtained using a fast divide and conquer greedy algorithm that recursively partitions the given training data into smaller subsets.
• When the complexity of the model is dependent on the learning sample size, both bias and variance decrease with the learning sample size, e.g. regression trees. They have small bias : a tree can approximate any non-linear function.
• Regression trees are among the machine learning methods that present the highest variance. Even a small change of the learning sample can result in a very different tree. Even small trees have a high variance.
• Possible sources of variance :
1. Discretization of numerical attributes : The selected threshold has a high variance.
2. Structure choice : Sometimes, attribute scores are very close.
3. Estimation at leaf nodes : Because of the recursive partitioning, predictions at leaf nodes are based on very small samples of objects.
• Regression trees are constructed using a Recursive Partitioning (RP) algorithm. This algorithm builds a tree by recursively splitting the training sample into smaller subsets. The RP algorithm receives as input a set of n data points and, if certain termination criteria are not met, it generates a test node t, whose branches are obtained by applying the same algorithm with two subsets of the input data points.
• At each node the best split test is chosen according to some local criterion, which means that this is a greedy hill-climbing algorithm.
Algorithm : Recursive Partitioning Algorithm
Input : A set of n data points
Output : A regression tree

IF termination criterion THEN
    Create Leaf Node and assign it a Constant Value
    Return Leaf Node
ELSE
    Find Best Splitting Test s*
    Create Node t with s*
    Left_branch(t) = RecursivePartitioningAlgorithm(data points satisfying s*)
    Right_branch(t) = RecursivePartitioningAlgorithm(data points not satisfying s*)
    Return Node t
ENDIF
• The algorithm has three main components :
1. A way to select a split test.
2. A rule to determine when a tree node is terminal.
3. A rule for assigning a value to each terminal node.
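• A minimal regression tree sketch follows; scikit-learn's DecisionTreeRegressor implements recursive partitioning with a constant value (the mean target) at each leaf, and the noisy sine data is an illustrative assumption :

# Regression tree on a synthetic non-linear target.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 2 * np.pi, (200, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)   # noisy sine target

reg = DecisionTreeRegressor(max_depth=4).fit(X, y)
print("prediction at x=1.0:", reg.predict([[1.0]])[0])   # ~ sin(1.0)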
Impurity Measures - Gini Index and Entropy
© One of the decision tree algorithms is CART (Classification and Regression Tree).
© Classification Tree : When decision or target variable is categorical, the decision
tree is classification decision tree.
Regression Tree : When the decision or target variable is continuous variable, the
decision tree is called regression decision tree.
* CART algorithm can be used for building both Classification and Regression
Decision Trees. The impurity measure used in building decision tree in CART is
Gini Index. The decision tree built by CART algorithm is always a binary decision
tree.
• Gini index is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset.
» Gini index, entropy and twoing rule are some of the frequently used impurity
measures.
• Gini index for a given node t :

$GINI(t) = \sum_j p(j \mid t)\,(1 - p(j \mid t)) = 1 - \sum_j [p(j \mid t)]^2$

• Maximum of $1 - 1/n_c$ (where $n_c$ is the number of classes) when records are equally distributed among all classes = maximal impurity.
• Minimum of 0 when all records belong to one class = complete purity.
• Entropy at a given node t :

$Entropy(t) = -\sum_j p(j \mid t) \log p(j \mid t)$

• Maximum ($\log n_c$) when records are equally distributed among all classes (maximal impurity).
• Minimum (0.0) when all records belong to one class (maximal purity).
• Entropy is the only function that satisfies all of the following three properties :
1. When a node is pure, the measure should be zero.
2. When impurity is maximal (i.e. all classes equally likely), the measure should be maximal.
3. The measure should obey the multistage property.
• When a node p is split into k partitions (children), the quality of the split is computed as a weighted sum :

$GINI_{split} = \sum_{i=1}^{k} \frac{n_i}{n}\, GINI(i)$

where $n_i$ = number of records at child i, and n = number of records at node p.
Fig. 1.10.2
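• The two impurity measures can be computed from a node's class counts, as in this small sketch matching the formulas above :

# Gini index and entropy of a node, given its class counts.
import numpy as np

def gini(counts):
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

def entropy(counts):
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    p = p[p > 0]                          # convention: 0 * log 0 = 0
    return -np.sum(p * np.log2(p))

print(gini([5, 5]), entropy([5, 5]))      # maximal impurity: 0.5, 1.0
print(gini([10, 0]), entropy([10, 0]))    # pure node: 0.0, 0.0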
• A problem with all impurity measures is that they depend only on the number of (training) patterns of different classes on either side of the hyperplane. Thus, if we change the class regions without changing the effective areas of class regions on either side of a hyperplane, the impurity measure of the hyperplane will not change.