UNIT III
SUPERVISED LEARNING
SYLLABUS
Introduction to machine learning - Linear Regression Models: Least squares, single &
multiple variables, Bayesian linear regression, gradient descent, Linear Classification Models:
Discriminant function - Probabilistic discriminative model - Logistic regression, Probabilistic
generative model - Naive Bayes, Maximum margin classifier - Support vector machine,
Decision Tree, Random forests
• Machine Learning Definition: A computer program is said to learn from experience E with
respect to some class of tasks T and performance measure P, if its performance at tasks in T,
as measured by P, improves with experience E.
• It is very hard to write programs that solve problems like recognizing a human face. We do
not know what program to write because we don't know how our brain does it. Instead of
writing a program by hand, it is possible to collect lots of examples that specify the correct
output for a given input.
• A machine learning algorithm then takes these examples and produces a program that does
the job. The program produced by the learning algorithm may look very different from a
typical hand-written program. It may contain millions of numbers. If we do it right, the
program works for new cases as well as the ones we trained it on.
• The main goal of machine learning is to devise learning algorithms that do the learning
automatically, without human intervention or assistance. The machine learning paradigm can
be viewed as "programming by example." Another goal is to develop computational models
of the human learning process and perform computer simulations.
• The goal of machine learning is to build computer systems that can adapt and learn from
their experience.
2. Validation: The rules are checked and, if necessary, additional training is given. Sometimes
additional test data are used; alternatively, a human expert may validate the rules, or some other
automatic knowledge-based component may be used. The role of the tester is often called the
opponent.
• Machine learning algorithms can figure out how to perform important tasks by generalizing
from examples.
• Machine learning provides business insight and intelligence. Decision makers are provided
with greater insights into their organizations. This adaptive technology is being used by global
enterprises to gain a competitive edge.
• Machine learning algorithms discover the relationships between the variables of a system
(input, output and hidden) from direct samples of the system.
1) Some tasks cannot be defined well, except by examples. For example: Recognizing people.
2) Relationships and correlations can be hidden within large amounts of data. Machine
learning and data mining techniques may be able to discover these relationships.
3) Human designers often produce machines that do not work as well as desired in the
environments in which they are used.
4) The amount of knowledge available about certain tasks might be too large for explicit
encoding by humans.
• Machine learning also helps us find solutions of many problems in computer vision, speech
recognition and robotics. Machine learning uses the theory of statistics in building
mathematical models, because the core task is making inference from a sample.
1. Tasks: The problems that can be solved with machine learning. A task is an abstract
representation of a problem. The standard methodology in machine learning is to learn one
task at a time. Large problems are broken into small, reasonably independent sub-problems
that are learned separately and then recombined.
• Predictive tasks perform inference on the current data in order to make predictions.
Descriptive tasks characterize the general properties of the data in the database.
2. Models: The output of machine learning. Different models are geometric models,
probabilistic models, logical models, grouping and grading.
• The model-based approach seeks to create a custom solution tailored to each new
application. Instead of having to transform your problem to fit some standard algorithm, in
model-based machine learning you design the algorithm precisely to fit your problem.
• Machine learning models are classified as: Geometric model, Probabilistic model and
Logical model.
• Feature extraction starts from an initial set of measured data and builds derived values
intended to be informative and non-redundant, facilitating the subsequent learning and
generalization steps.
• Feature selection is a process that chooses a subset of features from the original features so
that the feature space is optimally reduced according to a certain criterion.
• Learning is essential for unknown environments, i.e. when the designer lacks
omniscience. Learning simply means incorporating information from the training examples
into the system.
• Learning is any change in a system that allows it to perform better the second time on
repetition of the same task or on another task drawn from the same population. One part of
learning is acquiring knowledge and new information; and the other part is problem-solving.
• Supervised and Unsupervised Learning are the different types of machine learning methods.
A computational learning model should be clear about the following aspects:
1. Learner: Who or what is doing the learning. For example: Program or algorithm.
6. Information source: The information (training data) the program uses for learning.
• Machine learning is a scientific discipline concerned with the design and development of
algorithms that allow computers to evolve behaviors based on empirical data, such as data
from sensors or databases.
• Machine learning is usually divided into two main types: Supervised Learning and
Unsupervised Learning.
• One goal of machine learning is to discover new things or structure that are unknown to humans (Example: Data mining).
Supervised Learning
• Supervised learning is the machine learning task of inferring a function from supervised
training data. The training data consist of a set of training examples. The task of the
supervised learner is to predict the output behavior of a system for any set of input values,
after an initial training phase.
• Supervised learning in which the network is trained by providing it with input and matching
output patterns. These input-output pairs are usually provided by an external teacher.
• Human learning is based on past experiences. A computer does not have experiences.
• A computer system learns from data, which represent some "past experiences" of an
application domain.
• The goal is to learn a target function that can be used to predict the values of a discrete class
attribute, e.g., approved or not-approved, high-risk or low-risk. The task is commonly called
supervised learning, classification or inductive learning.
• Training data includes both the input and the desired results. For some examples the correct
results (targets) are known and are given in input to the model during the learning process.
The construction of a proper training, validation and test set is crucial. These methods are
usually fast and accurate.
• The model has to be able to generalize: to give correct results when new data are given as
input without knowing the target a priori.
• In supervised learning, each example is a pair consisting of an input object and a desired
output value.
• A supervised learning algorithm analyzes the training data and produces an inferred
function, which is called a classifier or a regression function. Fig. 8.2.1 shows the supervised
learning process.
• The learned model helps the system to perform the task better as compared to no learning.
• Supervised learning denotes a method in which some input vectors are collected and
presented to the network. The output computed by the network is observed, and the deviation
from the expected answer is measured. The weights are corrected according to the magnitude
of the error in the way defined by the learning algorithm.
• Supervised learning is further divided into methods which use reinforcement or error
correction. The perceptron learning algorithm is an example of supervised learning with
reinforcement.
In order to solve a given problem of supervised learning, the following steps are performed:
1. Determine the type of training examples.
2. Gather a training set that is representative of the real-world use of the function.
3. Determine the input feature representation of the learned function.
4. Determine the structure of the learned function and corresponding learning algorithm.
5. Complete the design and then run the learning algorithm on the collected training set.
6. Evaluate the accuracy of the learned function. After parameter adjustment and learning, the
performance of the resulting function should be measured on a test set that is separate from
the training set.
Unsupervised Learning
• The model is not provided with the correct results during the training. It can be used to
cluster the input data in classes on the basis of their statistical properties only; the
significance of the clusters and their labeling must then be assessed.
• The labeling can be carried out even if the labels are only available for a small number of
objects representative of the desired classes. All similar input patterns are grouped together
as clusters.
• If a matching pattern is not found, a new cluster is formed. There is no error feedback.
• No external teacher is used; learning is based only upon local information. It is also referred
to as self-organization.
• They are called unsupervised because they do not need a teacher or supervisor to label a set
of training examples. Only the original data is required to start the analysis.
• Unsupervised learning algorithms aim to learn rapidly and can be used in real-time.
Unsupervised learning is frequently employed for data clustering, feature extraction etc.
• Another mode of learning, called recording learning by Zurada, is typically employed for
associative memory networks. An associative memory network is designed by recording
several ideal patterns into the network's stable states.
Semi-supervised Learning
• Semi-supervised learning uses both labeled and unlabeled data to improve supervised
learning. The goal is to learn a predictor that predicts future test data better than the predictor
learned from the labeled training data alone.
• Semi-supervised learning is motivated by its practical value in learning faster, better and
cheaper.
In many real-world applications, it is relatively easy to acquire a large amount of unlabeled
data x.
• For example, documents can be crawled from the Web, images can be obtained from
surveillance cameras, and speech can be collected from broadcasts. However, the
corresponding labels y for the prediction task, such as sentiment orientation, intrusion
detection and phonetic transcripts, often require slow human annotation and expensive
laboratory experiments.
• In many practical learning domains, there is a large supply of unlabeled data but limited
labeled data, which can be expensive to generate. For example: text processing,
video-indexing, bioinformatics etc.
• Semi-supervised Learning makes use of both labeled and unlabeled data for training,
typically a small amount of labeled data with a large amount of unlabeled data. When
unlabeled data is used in conjunction with a small amount of labeled data, it can produce
considerable improvement in learning accuracy.
• Semi-supervised clustering: uses a small amount of labeled data to aid and bias the clustering
of unlabeled data.
Reinforcement Learning
• In supervised learning the learner gets immediate feedback, and in unsupervised learning no
feedback; in reinforcement learning, the feedback is a delayed scalar reward.
• Reinforcement learning is learning what to do and how to map situations to actions. The
learner is not told which actions to take. Fig. 8.2.3 shows the concept of reinforcement learning.
• Reinforcement learning deals with agents that must sense and act upon their environment. It
combines classical Artificial Intelligence and machine learning techniques.
• It allows machines and software agents to automatically determine the ideal behavior within
a specific context, in order to maximize its performance. Simple reward feedback is required
for the agent to learn its behavior; this is known as the reinforcement signal.
• With reinforcement learning algorithms an agent can improve its performance by using the
feedback it gets from the environment. This environmental feedback is called the reward
signal.
• Based on accumulated experience, the agent needs to learn which action to take in a
given situation in order to obtain a desired long term goal. Essentially, actions that lead to long
term rewards need to be reinforced. Reinforcement learning has connections with control theory,
Markov decision processes and game theory.
• Example of reinforcement learning: A mobile robot decides whether it should enter a
new room in search of more trash to collect or start trying to find its way back to its battery
recharging station. It makes its decision based on how quickly and easily it has been able to
find the recharger in the past.
3.2 REGRESSION
• Regression finds correlations between dependent and independent variables. If the desired
output consists of one or more continuous variables, then the task is called regression.
• Therefore, regression algorithms help predict continuous variables such as house prices,
market trends, weather patterns, oil and gas prices etc.
• When the targets in a dataset are real numbers, the machine learning task is known as
regression and each sample in the dataset has a real-valued output or target.
• Regression analysis is a set of statistical methods used for the estimation of relationships
between a dependent variable and one or more independent variables. It can be utilized to
assess the strength of the relationship between variables and for modelling the future
relationship between them.
• The two basic types of regression are linear regression and multiple linear regression.
• The objective of a linear regression model is to find a relationship between the input
variables and a target variable.
1. One variable, denoted x, is regarded as the predictor, explanatory or independent variable.
2. The other variable, denoted y, is regarded as the response, outcome or dependent variable.
• Regression models predict a continuous variable, such as the sales made on a day or the
temperature of a city. Imagine that we fit a line to the training points we have. If we
add another data point, the existing model must change to fit it.
• This happens with each data point added to the model; hence, linear regression is not
well suited to classification problems.
• Regression estimates are used to explain the relationship between one dependent variable
and one or more independent variables, while classification predicts categorical labels (classes).
• Classification classifies data based on the training set and the values in a classifying
attribute, and uses this in classifying new data. Prediction models continuous-valued
functions, i.e. predicts unknown or missing values.
• The regression line gives the average relationship between the two variables in mathematical
form.
• For two variables X and Y, there are always two lines of regression.
• Regression line of X on Y: gives the best estimate of X for any specified value of Y:
X = a + bY
where a = X-intercept, b = slope of the line, X = dependent variable and Y = independent variable.
• Regression line of Y on X: gives the best estimate of Y for any specified value of X:
Y = a + bX
where a = Y-intercept, b = slope of the line, Y = dependent variable and X = independent variable.
• By using the least squares method (a procedure that minimizes the vertical deviations of
plotted points surrounding a straight line) we are able to construct a best-fitting straight line to
the scatter diagram points and then formulate a regression equation in the form:
y = a + bx, i.e. ŷ = ȳ + b(x − x̄)
• Regression analysis is the art and science of fitting straight lines to patterns of data. In a
linear regression model, the variable of interest ("dependent" variable) is predicted from k
other variables ("independent" variables) using a linear equation. If Y denotes the dependent
variable and X1,..., Xk, are the independent variables, then the assumption is that the value of
Y at time t in the data sample is determined by the linear equation:
Yt = β0 + β1X1t + β2X2t + ... + βkXkt + εt
where the betas are constants and the epsilons are independent and identically distributed
normal random variables with mean zero.
• In regression trees, at each split point, the "error" between the predicted value and the actual
values is squared to get the "Sum of Squared Errors (SSE)". The split point errors across the
variables are compared, and the variable/point yielding the lowest SSE is chosen as the root
node/split point. This process is continued recursively.
• Error function measures how much our predictions deviate from the desired answers.
Advantages:
a. Training a linear regression model is usually much faster than methods such as neural
networks.
b. Linear regression models are simple and require minimum memory to implement.
c. By examining the magnitude and sign of the regression coefficients you can infer how
predictor variables affect the target outcome.
• The method of least squares is about estimating parameters by minimizing the squared
discrepancies between observed data, on the one hand, and their expected values on the
other.
• Consider an arbitrary straight line, y = b0 + b1x, to be fitted through these data points.
The question is: which line is the most representative?
• What are the values of b0 and b1 such that the resulting line "best" fits the data points? And
what goodness-of-fit criterion should be used to decide among all possible combinations of
b0 and b1?
• The Least Squares (LS) criterion states that the sum of the squares of errors is minimum.
The least-squares solution yields y(x) whose elements sum to 1, but it does not ensure the
outputs to be in the range [0, 1].
• How can such a line be drawn based on the observed data points? Suppose an imaginary
line y = a + bx.
• Imagine the vertical distance between the line and a data point, e = Y − E(Y).
• This error is the deviation of the data point from the imaginary line, the regression line. Then
what are the best values of a and b? The a and b that minimize the sum of such errors.
• Raw deviations do not have good properties for computation, because positive and negative
deviations cancel. This is why we use squares of the deviations: we seek a and b that
minimize the sum of squared deviations rather than the sum of deviations. This method is
called least squares.
• The least squares method minimizes the sum of squares of errors. Such a and b are called
least squares estimators, i.e. estimators of the parameters α and β.
• The process of getting parameter estimators (e.g., a and b) is called estimation. This least
squares method of estimation is known as Ordinary Least Squares (OLS).
Example 8.3.1 Fit a straight line to the points in the table. Compute m and b by least squares.
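• Since the table for Example 8.3.1 is not reproduced here, the following sketch shows the least squares computation on made-up data points (the x and y values are assumptions for illustration only):

import numpy as np

# Hypothetical data points (the table from Example 8.3.1 is not reproduced here)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# Least squares estimators for y = b + m*x:
# m = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2),  b = y_bar - m*x_bar
x_bar, y_bar = x.mean(), y.mean()
m = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b = y_bar - m * x_bar
print(f"slope m = {m:.3f}, intercept b = {b:.3f}")

• The same estimates can be cross-checked with np.polyfit(x, y, 1), which fits a degree-1 polynomial by least squares.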
• Regression analysis is used to predict the value of one or more responses from a set of
predictors. It can also be used to estimate the linear association between the predictors and
responses. Predictors can be continuous or categorical or a mixture of both.
• If multiple independent variables affect the response variable, then the analysis calls for a
model different from that used for the single predictor variable. In a situation where more than
one independent factor (variable) affects the outcome of process, a multiple regression model
is used. This is referred to as multiple linear regression model or multivariate least squares
fitting.
• Let z1, z2, ..., zr be a set of r predictors believed to be related to a response variable Y. The
linear regression model for the jth sample unit has the form
Yj = β0 + β1zj1 + β2zj2 + ... + βrzjr + εj
where εj is a random error and the βi, i = 0, 1, ..., r, are unknown regression coefficients.
• With n independent observations, we can write one model for each sample unit so that the
model is now
Y = Zβ + ε
• In order to estimate β, we take a least squares approach that is analogous to what we did in
the simple linear regression case.
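• A minimal sketch of this least squares approach, solving Y = Zβ + ε for the coefficient estimates (the predictor and response values below are made-up assumptions):

import numpy as np

# Design matrix Z: a column of ones for the intercept plus two predictors (made-up values)
Z = np.array([[1.0, 2.0, 1.0],
              [1.0, 3.0, 2.0],
              [1.0, 5.0, 2.5],
              [1.0, 7.0, 4.0]])
Y = np.array([6.0, 9.0, 12.5, 17.0])

# Least squares estimate of beta; lstsq is a numerically stable
# alternative to computing inv(Z.T @ Z) @ Z.T @ Y directly
beta_hat, *_ = np.linalg.lstsq(Z, Y, rcond=None)
print("estimated coefficients:", beta_hat)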
• If we could flip the coin an infinite number of times, inferring its bias would be easy by the
law of large numbers. However, what if we could only flip the coin a handful of times?
Would we guess that a coin is biased if we saw three heads in three flips, an event that
happens one out of eight times with unbiased coins? The MLE would overfit these data,
inferring a coin bias of p = 1.
• A Bayesian approach avoids overfitting by quantifying our prior knowledge that most coins
are unbiased, that the prior on the bias parameter is peaked around one-half. The data must
overwhelm this prior belief about coins.
• Bayesian methods allow us to estimate model parameters, to construct model forecasts and
to conduct model comparisons. Bayesian learning algorithms can calculate explicit
probabilities for hypotheses.
• Bayesian classifiers use a simple idea that the training data are utilized to calculate an
observed probability of each class based on feature values.
• When Bayesian classifier is used for unclassified data, it uses the observed probabilities to
predict the most likely class for the new features.
• Each observed training example can incrementally decrease or increase the estimated
probability that a hypothesis is correct.
• Prior knowledge can be combined with observed data to determine the final probability of a
hypothesis. In Bayesian learning, prior knowledge is provided by asserting a prior probability
for each candidate hypotheses and a probability distribution over observed data for each
possible hypothesis.
• Even in cases where Bayesian methods prove computationally intractable, they can provide
a standard of optimal decision making against which other practical methods can be
measured.
• Bayesian methods are used in applications such as medical diagnosis.
• A Bayesian linear regression analysis proceeds as follows:
i) Specify priors for the model parameters.
ii) Create a model mapping the training inputs to the training outputs.
iii) Have a Markov Chain Monte Carlo (MCMC) algorithm draw samples from the posterior
distributions for the parameters.
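• As a concrete illustration of the coin example above, here is a minimal sketch of a Bayesian update with a Beta prior peaked around one-half (the prior strength of 10 pseudo-counts per side is an arbitrary assumption). Unlike the MLE, the posterior mean does not jump to p = 1 after three heads in three flips:

# Beta-Bernoulli model: with prior Beta(a, b), after h heads and t tails
# the posterior is Beta(a + h, b + t) -- no MCMC is needed for this conjugate case.
a, b = 10.0, 10.0   # prior peaked at 0.5 (strength chosen arbitrarily)
h, t = 3, 0         # observed: three heads in three flips

post_a, post_b = a + h, b + t
posterior_mean = post_a / (post_a + post_b)

mle = h / (h + t)   # maximum likelihood estimate: 1.0
print(f"MLE = {mle:.2f}, posterior mean = {posterior_mean:.3f}")  # about 0.565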
• First and second derivatives of the objective function or the constraints play an important
role in optimization. The first order derivatives are called the gradient and the second order
derivatives are called the Hessian matrix.
Two classical gradient-based optimization methods are:
1. Steepest descent
2. Newton-Raphson method
Gradient Descent:
• Gradient descent is popular for very large-scale optimization problems because it is easy to
implement, can handle black box functions, and each iteration is cheap.
• Given a differentiable scalar field f(x) and an initial guess x1, gradient descent iteratively
moves the guess toward lower values of f by taking steps in the direction of the negative
gradient −∇f(x).
• Locally, the negated gradient is the steepest descent direction, i.e., the direction in which x
would need to move in order to decrease f the fastest. The algorithm typically converges to a
local minimum, but may rarely reach a saddle point, or not move at all if x1 lies at a local
maximum.
• The gradient gives the slope of the curve at that x, and its direction points toward an
increase in the function. So we change x in the opposite direction to lower the function value:
xk+1 = xk − λ∇f(xk)
where λ > 0 is a small number that forces the algorithm to make small jumps.
• Gradient descent is relatively slow close to the minimum; technically, its asymptotic rate of
convergence is inferior to that of many other methods.
• For poorly conditioned convex problems, gradient descent increasingly 'zigzags' as the
gradients point nearly orthogonally to the shortest direction to the minimum point.
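• A minimal sketch of the update xk+1 = xk − λ∇f(xk) on a simple poorly conditioned quadratic (the objective function, step size and stopping rule are illustrative assumptions):

import numpy as np

def grad_f(x):
    # Gradient of f(x) = x1^2 + 10*x2^2, a poorly conditioned quadratic
    return np.array([2.0 * x[0], 20.0 * x[1]])

x = np.array([5.0, 2.0])   # initial guess x1
lam = 0.05                 # small step size lambda > 0
for k in range(200):
    g = grad_f(x)
    if np.linalg.norm(g) < 1e-6:   # stop when the gradient is nearly zero
        break
    x = x - lam * g                # step in the negative gradient direction
print("approximate minimizer:", x)  # close to the true minimum at (0, 0)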
Steepest Descent:
• This method is based on first order Taylor series approximation of objective function. This
method is also called saddle point method. Fig. 8.3.5 shows steepest descent method.
• Steepest descent is the simplest of the gradient methods. The choice of direction is
where f decreases most quickly, which is the direction opposite to ∇f(xi). The search
starts at an arbitrary point x0 and then goes down the gradient until it reaches close to the solution.
• The method of steepest descent is the discrete analogue of gradient descent, but the best
move is computed using a local minimization rather than computing a gradient. It is typically
able to converge in a few steps, but it is unable to escape local minima or plateaus in the
objective function.
• The gradient is everywhere perpendicular to the contour lines. After each line minimization
the new gradient is always orthogonal to the previous step direction. Consequently, the
iterates tend to zig-zag down the valley in a very inefficient manner.
• The method of steepest descent is simple, easy to apply, and each iteration is fast. It is also
very stable; if minimum points exist, the method is guaranteed to locate them, though
possibly only after a very large number of iterations.
• A linear classifier makes a classification decision based on the value of a linear combination of
the characteristics. Imagine that the linear classifier merges into its weights all the
characteristics that define a particular class.
• Linear classifiers can represent a lot of things, but they can't represent everything. The
classic example of what they can't represent is the XOR function.
• Linear Discriminant Analysis (LDA) is the most commonly used dimensionality reduction
technique in supervised learning. Basically, it is a preprocessing step for pattern classification
and machine learning applications. LDA is a powerful algorithm that can be used to determine
the best separation between two or more classes.
• LDA is a supervised learning algorithm, which means that it requires a labelled training set
of data points in order to learn the linear discriminant function.
• The main purpose of LDA is to find the line or plane that best separates data points
belonging to different classes. The key idea behind LDA is that the decision boundary should
be chosen such that it maximizes the distance between the means of the two classes while
simultaneously minimizing the variance within each class's data (the within-class scatter).
This criterion is known as the Fisher criterion.
• LDA is one of the most widely used machine learning algorithms due to its accuracy and
flexibility. LDA can be used for a variety of tasks such as classification, dimensionality
reduction, and feature selection.
• Suppose we have two classes and we need to classify them efficiently, then using LDA,
classes are divided as follows:
a) The first step is to calculate the means and standard deviation of each feature.
b) The within-class scatter matrix and between-class scatter matrix are calculated.
c) These matrices are then used to calculate the eigenvectors and eigenvalues.
d) The eigenvectors corresponding to the largest eigenvalues form a transformation matrix.
e) LDA uses this transformation matrix to transform the data into a new space with k
dimensions.
f) Once the transformation matrix transforms the data into the new space with k
dimensions, LDA can then be used for classification or dimensionality reduction.
• LDA is not as susceptible to the "curse of dimensionality" as many other machine
learning algorithms.
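• A minimal sketch of these steps using scikit-learn's LinearDiscriminantAnalysis, which computes the scatter matrices and eigendecomposition internally (the two-class data values are made-up assumptions):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Two classes of 2-D points (made-up data)
X = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2],   # class 0
              [6.0, 6.5], [6.5, 7.0], [7.0, 6.8]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

# For two classes there is at most one discriminant axis (k = 1)
lda = LinearDiscriminantAnalysis(n_components=1)
X_proj = lda.fit_transform(X, y)          # dimensionality reduction
print("projected data:", X_proj.ravel())
print("predicted class of [2, 3]:", lda.predict([[2.0, 3.0]]))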
• Logistic regression is a form of regression analysis in which the outcome variable is binary
or dichotomous. It is a statistical method used to model dichotomous or binary outcomes
using predictor variables.
• Logistic component: Instead of modeling the outcome Y directly, the method models the
log odds of Y using the logistic function.
Logistic regression: ln[P(Y) / (1 − P(Y))] = β0 + β1X1 + β2X2 + ... + βkXk
Linear regression: Y = β0 + β1X1 + β2X2 + ... + βkXk + ε
• With logistic regression, the response variable is an indicator of some characteristic, that is,
a 0/1 variable. Logistic regression is used to determine whether other measurements are
related to the presence of some characteristic, for example, whether certain blood measures
are predictive of having a disease.
• If analysis of covariance can be said to be a t-test adjusted for other variables, then logistic
regression can be thought of as a chi-square test for homogeneity of proportions adjusted for
other variables. While the response variable in a logistic regression is a 0/1 variable, the
logistic regression equation, which is a linear equation, does not predict the 0/1 variable itself.
Linear regression: p = a0 + a1X1 + a2X2 + ... + akXk
Logistic regression: ln[p / (1 − p)] = a0 + a1X1 + a2X2 + ... + akXk
• The linear model assumes that the probability p is a linear function of the regressors, while
the logistic model assumes that the natural log of the odds p/(1 - p) is a linear function of the
regressors.
• The major advantage of the linear model is its interpretability. In the linear model, if a1 is
0.05, that means that a one-unit increase in X1 is associated with a 5-percentage-point increase
in the probability that Y is 1.
• The logistic model is less interpretable. In the logistic model, if b1 is 0.05, that means that a
one-unit increase in X1 is associated with a 0.05 increase in the log odds that Y is 1. And
what does that mean? I've never met anyone with any intuition for log odds.
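• A minimal sketch of the log odds model on a made-up binary outcome (the blood-measure data and disease labels are assumptions); scikit-learn's LogisticRegression fits the coefficients of the linear log odds equation above:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: does a blood measure X predict having a disease (Y = 1)?
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)
b0, b1 = model.intercept_[0], model.coef_[0][0]
print(f"log odds of Y = {b0:.2f} + {b1:.2f} * X")       # linear in the log odds
print("P(Y = 1 | X = 3.5):", model.predict_proba([[3.5]])[0, 1])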
• Generative models are a class of statistical models that generate new data instances. These
models are used in unsupervised machine learning to perform tasks such as probability and
likelihood estimation, modelling data points, and distinguishing between classes using these
probabilities.
• Generative models rely on the Bayes theorem to find the joint probability. Generative
models describe how data is generated using probabilistic models. They predict P(y | x), the
probability of y given x, by calculating P(x, y), the joint probability of x and y.
• A Naive Bayes Classifier is a program which predicts a class value given a set of attributes.
2. Use the product rule to obtain a joint conditional probability for the attributes.
3. Use Bayes rule to derive conditional probabilities for the class variable.
• Once this has been done for all class values, output the class with the highest probability.
• Naive Bayes simplifies the calculation of probabilities by assuming that the probability of
each attribute belonging to a given class value is independent of all other attributes. This is a
strong assumption but results in a fast and effective method.
• The probability of a class value given a value of an attribute is called the conditional
probability. By multiplying the conditional probabilities together for each attribute for a given
class value, we have a probability of a data instance belonging to that class.
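• A minimal sketch of this multiplication of conditional probabilities, computed by hand on a toy categorical dataset (the attributes, values and counts are assumptions for illustration):

# Toy training data: (outlook, windy) -> play
data = [("sunny", "no", "yes"), ("sunny", "yes", "no"),
        ("rainy", "no", "yes"), ("rainy", "yes", "no"),
        ("sunny", "no", "yes"), ("rainy", "yes", "no")]

def cond_prob(attr_index, value, cls):
    # P(attribute = value | class = cls), estimated from counts
    in_class = [row for row in data if row[2] == cls]
    return sum(1 for row in in_class if row[attr_index] == value) / len(in_class)

def score(outlook, windy, cls):
    prior = sum(1 for row in data if row[2] == cls) / len(data)
    # Naive assumption: attributes are conditionally independent given the class
    return prior * cond_prob(0, outlook, cls) * cond_prob(1, windy, cls)

# Output the class with the highest probability score for a new instance
for cls in ("yes", "no"):
    print(cls, score("sunny", "no", cls))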
Conditional Probability
• Let A and B be two events such that P(A) > 0. We denote by P(B | A) the probability of B given
that A has occurred. Since A is known to have occurred, it becomes the new sample space,
replacing the original S. From this, the definition is:
P(B | A) = P(A ∩ B) / P(A)
or equivalently, P(A ∩ B) = P(A) P(B | A).
• The notation P(B | A) is read "the probability of event B given event A". It is the probability
of an event B given the occurrence of the event A.
• We say that, the probability that both A and B occur is equal to the probability that A occurs
times the probability that B occurs given that A has occurred. We call P(B | A) the conditional
probability of B given A, i.e., the probability that B will occur given that A has occurred.
P(A | B) = P(A ∩ B) / P(B)
• The probability P(A | B) simply reflects the fact that the probability of an event A may
depend on a second event B. If A and B are mutually exclusive, then A ∩ B = ∅ and P(A | B) = 0.
Joint Probability
• A joint probability is a probability that measures the likelihood that two or more events will
happen concurrently.
• If there are two independent events A and B, the probability that A and B will occur is found
by multiplying the two probabilities. Thus for two events A and B, the special rule of
multiplication, shown symbolically, is:
P(A ∩ B) = P(A) P(B)
• The general rule of multiplication is used to find the joint probability that two events will
occur. Symbolically, the general rule of multiplication is
P(A ∩ B) = P(A) P(B | A).
• The probability P(A ∩ B) is called the joint probability for two events A and B which
intersect in the sample space. A Venn diagram readily shows that
P(A ∪ B) = P(A) + P(B) − P(A ∩ B),
or equivalently, P(A ∩ B) = P(A) + P(B) − P(A ∪ B).
• The probability of the union of two events never exceeds the sum of the event probabilities.
• A tree diagram is very useful for portraying conditional and joint probabilities. A tree
diagram portrays outcomes that are mutually exclusive.
Bayes Theorem
• Bayes' theorem is a result in probability theory that relates conditional probabilities. If A and
B denote two events, P(A | B) denotes the conditional probability of A occurring, given that B
occurs. The two conditional probabilities P(A | B) and P(B | A) are in general different.
• Bayes theorem gives a relation between P(A | B) and P(B | A). An important application of
Bayes' theorem is that it gives a rule how to update or revise the strengths of evidence-based
beliefs in light of new evidence a posterior.
• A prior probability is an initial probability value originally obtained before any additional
information is obtained.
• A posterior probability is a probability value that has been revised by using additional
information that is later obtained.
• Suppose that B1, B2, B3, ..., Bn partition the outcomes of an experiment and that A is another
event. For any number k, with 1 ≤ k ≤ n, we have the formula:
P(Bk | A) = P(A | Bk) P(Bk) / [P(A | B1) P(B1) + P(A | B2) P(B2) + ... + P(A | Bn) P(Bn)]
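• A small worked example of this formula with assumed numbers, P(B1) = 0.3, P(B2) = 0.7, P(A | B1) = 0.8 and P(A | B2) = 0.1:

# Bayes theorem over a partition B1, B2 (all probabilities are assumed values)
P_B = [0.3, 0.7]           # prior probabilities P(B1), P(B2)
P_A_given_B = [0.8, 0.1]   # likelihoods P(A | B1), P(A | B2)

# Total probability: P(A) = sum over i of P(A | Bi) * P(Bi)
P_A = sum(pa * pb for pa, pb in zip(P_A_given_B, P_B))

# Posterior: P(Bk | A) = P(A | Bk) * P(Bk) / P(A)
posteriors = [pa * pb / P_A for pa, pb in zip(P_A_given_B, P_B)]
print("P(A) =", P_A)                     # 0.31
print("P(B1|A), P(B2|A):", posteriors)   # about 0.774 and 0.226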
• Given a set of training examples, each marked as belonging to one of two classes, an SVM
algorithm builds a model that predicts whether a new example falls into one class or the other.
Simply speaking, we can think of an SVM model as representing the examples as points in
space, mapped so that each of the examples of the separate classes are divided by a gap that is
as wide as possible.
• New examples are then mapped into the same space and classified to belong to the class
based on which side of the gap they fall on.
• Many decision boundaries can separate these two classes. Which one should we choose?
• Perceptron learning rule can be used to find any decision boundary between class 1 and class
2.
• The line that maximizes the minimum margin is a good bet. The model class of "hyperplanes
with a margin of m" has a low VC dimension if m is big.
• This maximum-margin separator is determined by a subset of the data points. Data points in
this subset are called "support vectors". It will be useful computationally if only a small
fraction of the data points are support vectors, because we use the support vectors to decide
which side of the separator a test case is on.
• SVMs are primarily two-class classifiers with the distinct characteristic that they aim to find
the optimal hyperplane such that the expected generalization error is minimized. Instead of
directly minimizing the empirical risk calculated from the training data, SVMs perform
structural risk minimization to achieve good generalization.
• The empirical risk is the average loss of an estimator for a finite set of data drawn from P.
The idea of risk minimization is not only to measure the performance of an estimator by its
risk, but to actually search for the estimator that minimizes risk over the distribution P. Because
we don't know the distribution P, we instead minimize empirical risk over a training dataset
drawn from P. This general learning technique is called empirical risk minimization.
• The decision boundary should be as far away from the data of both classes as possible. If
data points lie very close to the boundary, the classifier may be consistent but is more "likely"
to make errors on new instances from the distribution. Hence, we prefer classifiers that
maximize the minimal distance of data points to the separator.
1. Margin (m): the gap between the data points and the classifier boundary. The margin is the
minimum distance of any sample to the decision boundary. If this hyperplane is in the
canonical form, the margin can be measured by the length of the weight vector. The margin is
given by the projection of the distance between these two points on the direction
perpendicular to the hyperplane.
2. Maximal margin classifier: a classifier in the family F that maximizes the margin.
Maximizing the margin is good according to intuition and PAC theory. Implies that only
support vectors matter; other training examples are ignorable.
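• A minimal sketch using scikit-learn's SVC with a linear kernel; after fitting, the support vectors (the subset of points that determine the separator) can be inspected. The data values are made-up assumptions, and the very large C approximates a hard margin:

import numpy as np
from sklearn.svm import SVC

# Two linearly separable classes (made-up data)
X = np.array([[1, 1], [2, 1], [1, 2],
              [5, 5], [6, 5], [5, 6]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)   # very large C approximates a hard margin
clf.fit(X, y)
print("support vectors:\n", clf.support_vectors_)
print("prediction for [3, 3]:", clf.predict([[3.0, 3.0]]))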
Solution:
2. Extend the above definition to non-linearly separable problems: include a penalty term for
misclassifications.
3. Map the data to a high-dimensional space where it is easier to classify with linear decision
surfaces: reformulate the problem so that the data is mapped implicitly to this space.
1. Use a single hyperplane which subdivides the space into two half-spaces, one which
is occupied by Class 1 and the other by Class 2
2. They maximize the margin of the decision boundary using quadratic optimization
techniques which find the optimal hyperplane.
5. When used in practice, SVM approaches frequently map the examples to a higher
dimensional space and find margin maximal hyperplanes in the mapped space,
obtaining decision boundaries which are not hyperplanes in the original space.
6. The most popular versions of SVMs use non-linear kernel functions and map the
attribute space into a higher dimensional space to facilitate finding "good" linear
decision boundaries in the modified space.
SVM Applications
2. Image classification
Limitations of SVM
1. It is sensitive to noise.
4. The optimal design for multiclass SVM classifiers is also a research area.
• For the very high dimensional problems common in text classification, sometimes the data
are linearly separable. But in the general case they are not, and even if they are, we might
prefer a solution that better separates the bulk of the data while ignoring a few weird noise
documents.
• What if the training set is not linearly separable? Slack variables can be added to allow
misclassification of difficult or noisy examples; the resulting margin is called a soft margin.
• A soft-margin allows a few variables to cross into the margin or over the hyperplane,
allowing misclassification.
• We penalize the crossover by looking at the number and distance of the misclassifications.
This is a trade-off between the hyperplane violations and the margin size. The slack variables
are bounded by some set cost. The farther they are from the soft margin, the less influence
they have on the prediction.
1. Slack variable = 0: the point is correctly classified, on or outside the margin.
2. Slack variable > 0: the point lies in the margin or on the wrong side of the hyperplane.
3. C is the trade-off between the slack variable penalty and the margin.
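• A minimal sketch of the trade-off controlled by C: a small C tolerates slack (a wider margin with some violations), while a large C penalizes violations heavily. The data, including the single noisy point, are assumptions:

import numpy as np
from sklearn.svm import SVC

# Separable data plus one noisy point labeled on the "wrong" side
X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6], [1.5, 1.5]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1, 1])   # the last label is noise

for C in (0.1, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C: the soft margin tolerates the noisy point; large C: the fit contorts to it
    print(f"C = {C}: support vectors = {len(clf.support_vectors_)}, "
          f"training accuracy = {clf.score(X, y):.2f}")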
Example 8.6.2: From the following diagram, identify which data points (1, 2, 3, 4, 5) are
support vectors (if any), slack variables on the correct side of the classifier (if any) and slack
variables on the wrong side of the classifier (if any). Mention which point will have the
maximum penalty and why.
• A decision tree is a simple representation for classifying examples. Decision tree learning is
one of the most successful techniques for supervised classification learning.
• In decision analysis, a decision tree can be used to visually and explicitly represent decisions
and decision making. As the name goes, it uses a tree-like model of decisions.
• Learned trees can also be represented as sets of if-then rules to improve human readability.
A decision tree has two kinds of nodes:
1. Each leaf node has a class label, determined by majority vote of the training examples
reaching that leaf.
2. Each internal node is a question on features. It branches out according to the answers.
• Decision tree learning is a method for approximating discrete-valued target functions. The
learned function is represented by a decision tree.
• A learned decision tree can also be re-represented as a set of if-then rules. Decision tree
learning is one of the most widely used and practical methods for inductive inference.
• Goal: Build a decision tree for classifying examples as positive or negative instances of a
concept
C. Each arc has associated with it one of the possible values of the attribute at the node from
which the arc is directed.
• Internal node denotes a test on an attribute. Branch represents an outcome of the test. Leaf
nodes represent class labels or class distribution.
• A decision tree is a flow-chart-like tree structure, where each node denotes a test on an
attribute value, each branch represents an outcome of the test, and tree leaves represent
classes or class distributions. Decision trees can easily be converted to classification rules.
Input:
1. Data partition D, which is a set of training tuples and their associated class labels
2. Attribute list
3. Attribute selection method
Algorithm:
4. If attribute list is empty then return N as a leaf node labeled with the majority class in D
5. Apply attribute selection method (D, attribute list) to find the "best" splitting criterion;
11. If Dj is empty then attach a leaf labeled with the majority class in D to node N;
12. Else attach the node returned by Generate decision tree (Dj, attribute list) to node N;
14. Return N;
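• A minimal sketch of this recursive construction using scikit-learn's DecisionTreeClassifier (the Iris dataset and the depth limit are illustrative choices; the library implements an optimized variant of the pseudocode above rather than the exact steps):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Internal nodes test an attribute; leaves hold class labels
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X, y)

# Print the learned tree as nested if-then rules
print(export_text(tree, feature_names=["sepal_len", "sepal_wid",
                                       "petal_len", "petal_wid"]))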
• Decision tree generation consists of two phases: tree construction and tree pruning.
• In the tree construction phase, all the training examples are at the root; examples are then
partitioned recursively based on selected attributes.
• The tree pruning phase identifies and removes branches that reflect noise or outliers.
• There are various paradigms that are used for learning binary classifiers which include:
1. Decision Trees
2. Neural Networks
3. Bayesian Classification
Solution: Left side: A feature tree combining two Boolean features. Each internal node or
split is labelled with a feature, and each edge emanating from a split is labelled with a feature
value. Each leaf therefore corresponds to a unique combination of feature values. Also
indicated in each leaf is the class distribution derived from the training set.
• Right Side: A feature tree partitions the instance space into rectangular regions, one for each
leaf.
• The leaves of the tree in the above figure could be labelled, from left to right, as ham - spam
- spam, employing a simple decision rule called majority class.
• Left side: A feature tree with training set class distribution in the leaves.
• Right side: A decision tree obtained using the majority class decision rule.
• Decision tree learning is generally best suited to problems with the following characteristics:
1. Instances are represented by attribute-value pairs. Fixed set of attributes, and the attributes
take a small number of disjoint possible values.
2. The target function has discrete output values. Decision tree learning is appropriate for a
boolean classification, but it easily extends to learning functions with more than two possible
output values.
3. Disjunctive descriptions may be required.
4. The training data may contain errors. Decision tree learning methods are robust to errors,
both errors in classifications of the training examples and errors in the attribute values that
describe these examples.
5. The training data may contain missing attribute values. Decision tree methods can be used
even when some training examples have unknown values.
6. Decision tree learning has been applied to problems such as learning to classify medical
patients by their disease, equipment malfunctions by their cause, and loan applicants by their
likelihood of defaulting on payments.
Advantages:
3. Decision trees are capable of handling datasets that may have errors.
4. Decision trees are capable of handling datasets that may have missing values.
Disadvantages:
1. Most of the algorithms require that the target attribute will have only discrete
values.
3. Decision trees are less appropriate for estimation tasks where the goal is to predict
the value of a continuous attribute.
4. Decision trees are prone to errors in classification problems with many classes and a
relatively small number of training examples.
• As the name indicates, "Random forest is a classifier that contains a number of decision trees
on various subsets of the given dataset and takes the average to improve the predictive
accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the
prediction from each tree and, based on the majority vote of predictions, predicts the final
output.
• A greater number of trees in the forest leads to higher accuracy and prevents the problem of
overfitting.
• Random forest works in two phases: the first is to create the random forest by combining N
decision trees, and the second is to make predictions for each tree created in the first phase.
The working process can be explained in the following steps:
Step 1: Select K random data points from the training set.
Step 2: Build the decision trees associated with the selected data points (subsets).
Step 3: Choose the number N of decision trees that you want to build.
Step 4: Repeat Steps 1 and 2.
Step 5: For new data points, find the predictions of each decision tree, and assign the new
data points to the category that wins the majority of votes.
• The working of the algorithm can be better understood by the following example:
Example: Suppose there is a dataset that contains multiple fruit images. This dataset is given
to the random forest classifier. The dataset is divided into subsets and given to each decision
tree. During the training phase, each decision tree produces a prediction result, and when a
new data point occurs, then based on the majority of results, the random forest classifier
predicts the final decision.
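• A minimal sketch of the two phases using scikit-learn's RandomForestClassifier, where n_estimators corresponds to the number N of trees (the dataset and parameter values are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Phase 1: build N decision trees on bootstrap subsets of the training data
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Phase 2: each tree votes; the majority vote gives the final prediction
print("test accuracy:", forest.score(X_test, y_test))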
Random Forest Applications
1. Banking: The banking sector mainly uses this algorithm for the identification of loan
risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the disease can be
identified.
3. Land use: We can identify areas of similar land use with this algorithm.
• It enhances the accuracy of the model and prevents the overfitting problem.
• Although random forest can be used for both classification and regression tasks, it is not as
well suited to regression tasks.
• What is the best way to reduce the learning task to one or more function approximation
problems?
• How can the learner automatically alter its representation to improve its learning ability?
Q.7 What is decision tree?
Ans.:
• Decision tree learning is a method for approximating discrete-valued target functions, in
which the learned function is represented by a decision tree.
• A decision tree is a tree where each node represents a feature (attribute), each link (branch)
represents a decision (rule) and each leaf represents an outcome (categorical or continuous
value).
• A decision tree or a classification tree is a tree in which each internal node is labeled with an
input feature. The arcs coming from a node labeled with a feature are labeled with each of the
possible values of the feature.
Q.8 What are the nodes of decision tree?
Ans.: A decision tree has two kinds of nodes
1. Each leaf node has a class label, determined by majority vote of training examples reaching
that leaf.
2. Each internal node is a question on features. It branches out according to the answers.
• Decision tree learning is a method for approximating discrete-valued target functions. The
learned function is represented by a decision tree
Q.9 Why tree pruning useful in decision tree induction?
Ans.: When a decision tree is built, many of the branches will reflect anomalies in the training
data due to noise or outliers. Tree pruning methods address this problem of overfitting the
data. Such methods typically use statistical measures to remove the least reliable branches.
Q.10 What is tree pruning?
Ans.: Tree pruning attempts to identify and remove such branches, with the goal of improving
classification accuracy on unseen data
Q.11 What is RULE POST-PRUNING?
Ans.:
• It is a method for finding high-accuracy hypotheses.
• Rule post-pruning involves the following steps:
1. Infer decision tree from training set
2. Convert tree to rules - one rule per branch
3. Prune each rule by removing preconditions that result in improved estimated accuracy
4. Sort the pruned rules by their estimated accuracy and consider them in this sequence when
classifying unseen instances
Q.12 Why convert the decision tree to rules before pruning?
Ans.:
• Converting to rules allows distinguishing among the different contexts in which a decision
node is used.
• Converting to rules removes the distinction between attribute tests that occur near the root of
the tree and those that occur near the leaves.
• Converting to rules improves readability. Rules are often easier for people to understand.
Q.13 What do you mean by least square method?
Ans.: Least squares is a statistical method used to determine a line of best fit by minimizing
the sum of squares created by a mathematical function. A "square" is determined by squaring
the distance between a data point and the regression line or the mean value of the data set.
Q.14 What is linear discriminant function?
Ans.: LDA is a supervised learning algorithm, which means that it requires a labelled training
set of data points in order to learn the Linear Discriminant function.
Q.15 What is a support vector in SVM?
Ans.: Support vectors are data points that are closer to the hyperplane and influence the
position and orientation of the hyperplane. Using these support vectors, we maximize the
margin of the classifier.
Q.16 What is support vector machines?
Ans.: A Support Vector Machine (SVM) is a supervised machine learning model that uses
classification algorithms for two-group classification problems. After giving an SVM model
sets of labeled training data for each category, they're able to categorize new text.
Q.17 Define logistic regression.
Ans.: Logistic regression is a supervised learning technique. It is used for predicting the
categorical dependent variable using a given set of independent variables.
Q.18 List out types of machine learning.
Ans.: The types of machine learning are supervised, semi-supervised, unsupervised and
reinforcement learning.
Q.19 What is random forest?
Ans.: Random forest is an ensemble learning technique that combines multiple decision trees,
implementing the bagging method and results in a robust model with low variance.
Q.20 What are the five popular algorithms of machine learning?
Ans.: Popular algorithms are Decision Trees, Neural Networks (back propagation),
Probabilistic networks, Nearest Neighbor and Support vector machines.
Q.21 What is the function of 'Supervised Learning'?
Ans.: Functions of supervised learning include classification, speech recognition, regression,
time series prediction and string annotation.