

UNIT III
SUPERVISED LEARNING
SYLLABUS

Introduction to machine learning - Linear Regression Models: Least squares, single &
multiple variables, Bayesian linear regression, gradient descent, Linear Classification Models:
Discriminant function - Probabilistic discriminative model - Logistic regression, Probabilistic
generative model - Naive Bayes, Maximum margin classifier - Support vector machine,
Decision Tree, Random forests

3.1 Introduction to Machine Learning


• Machine Learning (ML) is a sub-field of Artificial Intelligence (AI) which is concerned with
developing computational theories of learning and building learning machines.

• Learning is a phenomenon and a process that manifests itself in various aspects.


The learning process includes gaining new symbolic knowledge and developing cognitive
skills through instruction and practice. It also includes the discovery of new facts and theories
through observation and experiment.

• Machine Learning Definition: A computer program is said to learn from experience E with
respect to some class of tasks T and performance measure P, if its performance at tasks in T,
as measured by P, improves with experience E.

• Machine learning is programming computers to optimize a performance criterion using
example data or past experience. Application of machine learning methods to large databases
is called data mining.

• It is very hard to write programs that solve problems like recognizing a human face. We do
not know what program to write because we don't know how our brain does it. Instead of
writing a program by hand, it is possible to collect lots of examples that specify the correct
output for a given input.


• A machine learning algorithm then takes these examples and produces a program that does
the job. The program produced by the learning algorithm may look very different from a
typical hand-written program. It may contain millions of numbers. If we do it right, the
program works for new cases as well as the ones we trained it on.

• Main goal of machine learning is to devise learning algorithms that do the learning
automatically without human intervention or assistance. The machine learning paradigm can
be viewed as "programming by example." Another goal is to develop computational models
of human learning process and perform computer simulations.

• The goal of machine learning is to build computer systems that can adapt and learn from
their experience.

• An algorithm is used to solve a problem on a computer. An algorithm is a sequence of
instructions that should be carried out to transform the input to the output. For example, the
addition of four numbers is carried out by giving the four numbers as input to the algorithm;
the output is the sum of all four numbers. For the same task there may be various algorithms,
and it is of interest to find the most efficient one, requiring the least number of instructions or
memory or both.

• For some tasks, however, we do not have an algorithm.

How Do Machines Learn?

Machine learning typically follows three phases:

1. Training: A training set of examples of correct behavior is analyzed and some
representation of the newly learnt knowledge is stored. This is often in some form of rules.

2. Validation: The rules are checked and, if necessary, additional training is given. Sometimes
additional test data are used, but instead, a human expert may validate the rules, or some other
automatic knowledge-based component may be used. The role of the tester is often called the
opponent.

3. Application: The rules are used in responding to some new situation.


Why is Machine Learning Important?

• Machine learning algorithms can figure out how to perform important tasks by generalizing
from examples.

• Machine learning provides business insight and intelligence. Decision makers are provided
with greater insights into their organizations. This adaptive technology is being used by global
enterprises to gain a competitive edge.

• Machine learning algorithms discover the relationships between the variables of a system
(input, output and hidden) from direct samples of the system.

• Following are some of the reasons:

1) Some tasks cannot be defined well, except by examples. For example: Recognizing people.

2) Relationships and correlations can be hidden within large amounts of data. Machine
learning and data mining may be able to find these relationships.

3) Human designers often produce machines that do not work as well as desired in the
environments in which they are used.

4) The amount of knowledge available about certain tasks might be too large for explicit
encoding by humans.


5) Environments change from time to time.

6) New knowledge about tasks is constantly being discovered by humans.

• Machine learning also helps us find solutions to many problems in computer vision, speech
recognition and robotics. Machine learning uses the theory of statistics in building
mathematical models, because the core task is making inference from a sample.

• Learning is used when:

1. Human expertise does not exist (navigating on Mars),

2. Humans are unable to explain their expertise (speech recognition)

3. Solution changes in time (routing on a computer network)

4. Solution needs to be adapted to particular cases (user biometrics)

Ingredients of Machine Learning

The ingredients of machine learning are as follows:

1. Tasks: The problems that can be solved with machine learning. A task is an abstract
representation of a problem. The standard methodology in machine learning is to learn one
task at a time. Large problems are broken into small, reasonably independent sub-problems
that are learned separately and then recombined.

• Predictive tasks perform inference on the current data in order to make predictions.
Descriptive tasks characterize the general properties of the data in the database.

2. Models: The output of machine learning. Different models are geometric models,
probabilistic models, logical models, grouping and grading.

• The model-based approach seeks to create a solution tailored to each new application.
Instead of having to transform your problem to fit some standard algorithm, in model-based
machine learning you design the algorithm precisely to fit your problem.

• A model is just a set of assumptions, expressed in a precise mathematical form. These
assumptions include the number and types of variables in the problem domain, which
variables affect each other, and what the effect of changing one variable is on another
variable.

• Machine learning models are classified as: Geometric model, Probabilistic model and
Logical model.


3. Features: The workhorses of machine learning. A good feature representation is central to
achieving high performance in any machine learning task.

• Feature extraction starts from an initial set of measured data and builds derived values
intended to be informative and non-redundant, facilitating the subsequent learning and
generalization steps.

• Feature selection is a process that chooses a subset of features from the original features so
that the feature space is optimally reduced according to a certain criterion.

3.1.1 TYPES OF LEARNING

• Learning is essential for unknown environments, i.e. when the designer lacks omniscience.
Learning simply means incorporating information from the training examples into the system.

• Learning is any change in a system that allows it to perform better the second time on
repetition of the same task or on another task drawn from the same population. One part of
learning is acquiring knowledge and new information; and the other part is problem-solving.

• Supervised and Unsupervised Learning are the different types of machine learning methods.
A computational learning model should be clear about the following aspects:

1. Learner: Who or what is doing the learning. For example: Program or algorithm.

2. Domain: What is being learned?

3. Goal: Why the learning is done?

4. Representation: The way the objects to be learned are represented.

5. Algorithmic technology: The algorithmic framework to be used.

6. Information source: The information (training data) the program uses for learning.

7. Training scenario: The description of the learning process.

• Learning is constructing or modifying representations of what is being experienced. To learn
means to gain knowledge by study, experience or being taught.

• Machine learning is a scientific discipline concerned with the design and development of
algorithms that allow computers to evolve behaviors based on empirical data, such as data
from sensors or databases.


• Machine learning is usually divided into two main types: Supervised Learning and
Unsupervised Learning.

Why do Machine Learning?

1. To understand and improve efficiency of human learning.

2. Discover new things or structure that is unknown to humans (Example: Data mining).

3. Fill in skeletal or incomplete specifications about a domain.

Supervised Learning

• Supervised learning is the machine learning task of inferring a function from supervised
training data. The training data consist of a set of training examples. The task of the
supervised learner is to predict the output behavior of a system for any set of input values,
after an initial training phase.

• In supervised learning, the network is trained by providing it with input and matching
output patterns. These input-output pairs are usually provided by an external teacher.

• Human learning is based on the past experiences. A computer does not have experiences.

• A computer system learns from data, which represent some "past experiences" of an
application domain.

• The goal is to learn a target function that can be used to predict the values of a discrete class
attribute, e.g., approved or not-approved, or high-risk or low-risk. The task is commonly
called supervised learning, classification or inductive learning.

• Training data includes both the input and the desired results. For some examples the correct
results (targets) are known and are given in input to the model during the learning process.
The construction of a proper training, validation and test set is crucial. These methods are
usually fast and accurate.

• Have to be able to generalize: give the correct results when new data are given in input
without knowing a priori the target.

• Supervised learning is the machine learning task of inferring a function from supervised
training data. The training data consist of a set of training examples. In supervised learning,
each example is a pair consisting of an input object and a desired output value.


• A supervised learning algorithm analyzes the training data and produces an inferred
function, which is called a classifier or a regression function.

• The learned model helps the system to perform the task better as compared to no learning.

• Each input vector requires a corresponding target vector.

Training Pair = (Input Vector, Target Vector)


• Supervised learning denotes a method in which some input vectors are collected and
presented to the network. The output computed by the network is observed and the deviation
from the expected answer is measured. The weights are corrected according to the magnitude
of the error in the way defined by the learning algorithm.

• Supervised learning is further divided into methods which use reinforcement or error
correction. The perceptron learning algorithm is an example of supervised learning with
reinforcement.

In order to solve a given problem of supervised learning, the following steps are performed:

1. Find out the type of training examples.

2. Collect a training set.

3. Determine the input feature representation of the learned function.

4. Determine the structure of the learned function and corresponding learning algorithm.

5. Complete the design and then run the learning algorithm on the collected training set.

6. Evaluate the accuracy of the learned function. After parameter adjustment and learning, the
performance of the resulting function should be measured on a test set that is separate from
the training set.

Unsupervised Learning

• The model is not provided with the correct results during the training. It can be used to
cluster the input data in classes on the basis of their statistical properties only. Cluster
significance and labeling must then be determined separately.

• The labeling can be carried out even if the labels are only available for a small number of
objects representative of the desired classes. All similar input patterns are grouped together
as clusters.

• If matching pattern is not found, a new cluster is formed. There is no error feedback.

• No external teacher is used; learning is based upon only local information. It is also referred
to as self-organization.

• They are called unsupervised because they do not need a teacher or supervisor to label a set
of training examples. Only the original data is required to start the analysis.

• In contrast to supervised learning, unsupervised or self-organized learning does not require
an external teacher. During the training session, the neural network receives a number of
different input patterns, discovers significant features in these patterns and learns how to
classify input data into appropriate categories.

• Unsupervised learning algorithms aim to learn rapidly and can be used in real-time.
Unsupervised learning is frequently employed for data clustering, feature extraction etc.


• Another mode of learning, called recording learning by Zurada, is typically employed for
associative memory networks. An associative memory network is designed by recording
several ideal patterns into the network's stable states.

Difference between Supervised and Unsupervised Learning
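
In brief:

• Supervised learning uses labeled training data (input-output pairs); unsupervised learning uses only unlabeled inputs.

• Supervised learning predicts a known target (classification, regression); unsupervised learning discovers structure in the data (clustering, feature extraction).

• Supervised models are evaluated against known correct outputs on a separate test set; unsupervised results must be judged by cluster quality or domain knowledge.

• Supervised learning relies on an external teacher to supply targets; unsupervised learning needs no teacher.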

Semi-supervised Learning

• Semi-supervised learning uses both labeled and unlabeled data to improve supervised
learning. The goal is to learn a predictor that predicts future test data better than the predictor
learned from the labeled training data alone.

• Semi-supervised learning is motivated by its practical value in learning faster, better and
cheaper.

In many real world applications, it is relatively easy to acquire a large amount of unlabeled
data x.


• For example, documents can be crawled from the Web, images can be obtained from
surveillance cameras, and speech can be collected from broadcasts. However, their
corresponding labels y for the prediction task, such as sentiment orientation, intrusion
detection and phonetic transcripts, often require slow human annotation and expensive
laboratory experiments.

• In many practical learning domains, there is a large supply of unlabeled data but limited
labeled data, which can be expensive to generate. For example: text processing, video
indexing, bioinformatics etc.

• Semi-supervised Learning makes use of both labeled and unlabeled data for training,
typically a small amount of labeled data with a large amount of unlabeled data. When
unlabeled data is used in conjunction with a small amount of labeled data, it can produce
considerable improvement in learning accuracy.

• Semi-supervised learning sometimes enables predictive model testing at reduced cost.

• Semi-supervised classification: Training on labeled data exploits additional unlabeled data,
frequently resulting in a more accurate classifier.

• Semi-supervised clustering: Uses small amount of labeled data to aid and bias the clustering
of unlabeled data.

Reinforced Learning

• The user gets immediate feedback in supervised learning and no feedback in unsupervised
learning. In reinforced learning, the feedback is a delayed scalar reward.

• Reinforcement learning is learning what to do and how to map situations to actions. The
learner is not told which actions to take.

• Reinforced learning deals with agents that must sense and act upon their environment. It
combines classical Artificial Intelligence and machine learning techniques.


• It allows machines and software agents to automatically determine the ideal behavior within
a specific context, in order to maximize its performance. Simple reward feedback is required
for the agent to learn its behavior; this is known as the reinforcement signal.

• The two most important distinguishing features of reinforcement learning are trial-and-error
search and delayed reward.

• With reinforcement learning algorithms an agent can improve its performance by using the
feedback it gets from the environment. This environmental feedback is called the reward
signal.

• Based on accumulated experience, the agent needs to learn which action to take in a given
situation in order to obtain a desired long term goal. Essentially, actions that lead to long term
rewards need to be reinforced. Reinforcement learning has connections with control theory,
Markov decision processes and game theory.

• Example of reinforcement learning: A mobile robot decides whether it should enter a new
room in search of more trash to collect or start trying to find its way back to its battery
recharging station. It makes its decision based on how quickly and easily it has been able to
find the recharger in the past.

Difference between Supervised, Unsupervised and Reinforcement Learning
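
In brief:

• Supervised learning: learns from labeled examples with immediate feedback from known targets; typical tasks are classification and regression.

• Unsupervised learning: learns from unlabeled data with no feedback; typical tasks are clustering and feature extraction.

• Reinforcement learning: learns by interacting with an environment and receiving delayed scalar rewards; typical tasks are control and sequential decision making.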


3.2 REGRESSION
• Regression finds correlations between dependent and independent variables. If the desired
output consists of one or more continuous variables, then the task is called regression.

• Therefore, regression algorithms help predict continuous variables such as house prices,
market trends, weather patterns, oil and gas prices etc.


• When the targets in a dataset are real numbers, the machine learning task is known as
regression and each sample in the dataset has a real-valued output or target.

• Regression analysis is a set of statistical methods used for the estimation of relationships
between a dependent variable and one or more independent variables. It can be utilized to
assess the strength of the relationship between variables and for modelling the future
relationship between them.

• The two basic types of regression are linear regression and multiple linear regression.

3.2.1 Linear Regression Models


• Linear regression is a statistical method that allows us to summarize and study relationships
between two continuous (quantitative) variables.

• The objective of a linear regression model is to find a relationship between the input
variables and a target variable.

1. One variable, denoted x, is regarded as the predictor, explanatory or independent variable.

2. The other variable, denoted y, is regarded as the response, outcome or dependent variable.

• Regression models predict a continuous variable, such as the sales made on a day or the
temperature of a city. Let's imagine that we fit a line with the training points that we have. If
we want to add another data point, then to fit it we need to change the existing model.

• This will happen with each data point that we add to the model; hence, linear regression isn't
good for classification models.

• Regression estimates are used to explain the relationship between one dependent variable
and one or more independent variables. Classification predicts categorical labels (classes),
while prediction models continuous-valued functions. Classification is considered to be
supervised learning.

• Classification classifies data based on the training set and the values in a classifying
attribute, and uses this in classifying new data. Prediction models continuous-valued
functions, i.e. it predicts unknown or missing values.

• The regression line gives the average relationship between the two variables in mathematical
form.

• For two variables X and Y, there are always two lines of regression.

• Regression line of X on Y: Gives the best estimate for the value of X for any specific given
values of Y:

X = a + bY

where a = X-intercept

b = Slope of the line

X = Dependent variable

Y = Independent variable

• Regression line Y on X: Gives the best estimate for the value of Y for any specific given
values of X:

Y = a + bX

where a = Y-intercept

b = Slope of the line

Y = Dependent variable

X = Independent variable

• By using the least squares method (a procedure that minimizes the vertical deviations of
plotted points surrounding a straight line) we are able to construct a best fitting straight line to
the scatter diagram points and then formulate a regression equation in the form:

ŷ = a + bx

Equivalently, ŷ = ȳ + b(x − x̄), where x̄ and ȳ are the sample means.


• Regression analysis is the art and science of fitting straight lines to patterns of data. In a
linear regression model, the variable of interest ("dependent" variable) is predicted from k
other variables ("independent" variables) using a linear equation. If Y denotes the dependent
variable and X1,..., Xk, are the independent variables, then the assumption is that the value of
Y at time t in the data sample is determined by the linear equation :

Yt = β0 + β1X1t + β2X2t + ... + βkXkt + εt

where the betas are constants and the epsilons are independent and identically distributed
normal random variables with mean zero.

• At each split point, the "error" between the predicted value and the actual values is squared
to get a "Sum of Squared Errors (SSE)". The split point errors across the variables are
compared and the variable/point yielding the lowest SSE is chosen as the root node/split
point. This process is recursively continued.

• Error function measures how much our predictions deviate from the desired answers.

Mean-squared error: Jn = (1/n) Σ i=1..n (yi − f(xi))²
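
• A minimal Python sketch of these formulas, assuming NumPy is available; the data arrays below are hypothetical, purely for illustration:

```python
import numpy as np

# Hypothetical data: x is the predictor, y the response.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least squares estimates: b = cov(x, y) / var(x), a = y_bar - b * x_bar.
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

# Mean-squared error Jn = (1/n) * sum((yi - f(xi))^2).
y_hat = a + b * x
mse = np.mean((y - y_hat) ** 2)
print(f"a = {a:.3f}, b = {b:.3f}, MSE = {mse:.4f}")
```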

Advantages:

a. Training a linear regression model is usually much faster than methods such as neural
networks.

b. Linear regression models are simple and require minimum memory to implement.

c. By examining the magnitude and sign of the regression coefficients you can infer how
predictor variables affect the target outcome.

3.2.2 LEAST SQUARES

• The method of least squares is about estimating parameters by minimizing the squared
discrepancies between observed data, on the one hand, and their expected values on the
other.


• Consider an arbitrary straight line, y = b0 + b1x, to be fitted through these data points. The
question is: "Which line is the most representative?"

• What are the values of b0 and b1 such that the resulting line "best" fits the data points? And
what goodness-of-fit criterion should be used to decide among all possible combinations of
b0 and b1?

• The Least Squares (LS) criterion states that the sum of the squares of errors is minimum.
The least-squares solution yields y(x) whose elements sum to 1, but does not ensure the
outputs to be in the range [0, 1].

• How do we draw such a line based on the observed data points? Suppose an imaginary line
y = a + bx.

• Imagine the vertical distance between the line and a data point, E = Y − E(Y).

• This error is the deviation of the data point from the imaginary line, the regression line.
Then what are the best values of a and b? The a and b that minimize the sum of such errors.


• Deviations alone do not have good properties for computation. Then why do we use squares
of deviations? Let us get a and b that can minimize the sum of squared deviations rather than
the sum of deviations. This method is called least squares.

• The least squares method minimizes the sum of squares of errors. Such a and b are called
least squares estimators, i.e. estimators of the parameters a and b.

• The process of getting parameter estimators (e.g., a and b) is called estimation. The least
squares method is the estimation method of Ordinary Least Squares (OLS).

Disadvantages of least squares:

1. It lacks robustness to outliers.

2. Certain datasets are unsuitable for least squares classification.

3. The decision boundary corresponds to the maximum likelihood solution under a Gaussian
noise assumption, which is a poor match for binary targets.

Example 8.3.1 Fit a straight line to the points in the table. Compute m and b by least squares.


Solution: Represent in matrix form:
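
• The original table of points is not reproduced here, so the sketch below uses hypothetical points; it sets up the same matrix form A·[b, m]ᵀ = y and solves it with NumPy's least squares routine:

```python
import numpy as np

# Hypothetical points standing in for the example's table.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 1.9, 3.2, 3.8])

# Matrix form: A @ [b, m] = y, where the first column of A (ones) carries
# the intercept b and the second column carries the slope m.
A = np.column_stack([np.ones_like(x), x])
(b, m), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"m = {m:.3f}, b = {b:.3f}")
```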

3.2.3 MULTIPLE REGRESSION

• Regression analysis is used to predict the value of one or more responses from a set of
predictors. It can also be used to estimate the linear association between the predictors and
responses. Predictors can be continuous or categorical or a mixture of both.

• If multiple independent variables affect the response variable, then the analysis calls for a
model different from that used for the single predictor variable. In a situation where more than
one independent factor (variable) affects the outcome of process, a multiple regression model
is used. This is referred to as multiple linear regression model or multivariate least squares
fitting.


• Let z1, z2, ..., zr be a set of r predictors believed to be related to a response variable Y. The
linear regression model for the jth sample unit has the form

Yj = β0 + β1zj1 + β2zj2 + ... + βrzjr + εj

where εj is a random error and the βi, i = 0, 1, ..., r, are unknown regression coefficients.

• With n independent observations, we can write one model for each sample unit so that the
model is now

Y = Zβ + ε

where Y is n × 1, Z is n × (r + 1), β is (r + 1) × 1 and ε is n × 1.

• In order to estimate β, we take a least squares approach that is analogous to what we did in
the simple linear regression case.

• In matrix form, the least squares estimate is obtained by solving the normal equations:

β̂ = (ZᵀZ)⁻¹ ZᵀY

where β̂ contains the estimates of the regression coefficients.
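
• A brief sketch of multiple regression, assuming scikit-learn is available; the synthetic data and coefficient values below are illustrative only:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: 100 samples, r = 3 predictors, known coefficients.
rng = np.random.default_rng(0)
Z = rng.normal(size=(100, 3))
Y = 4.0 + Z @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# fit() solves the least squares problem, i.e. the normal equations above.
model = LinearRegression().fit(Z, Y)
print(model.intercept_)  # estimate of beta0 (close to 4.0)
print(model.coef_)       # estimates of beta1..beta3
```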

Difference between Simple Regression and Multiple Regression
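
In brief:

• Simple regression uses a single independent variable, Y = β0 + β1X + ε; multiple regression uses two or more, Y = β0 + β1X1 + ... + βrXr + ε.

• In simple regression the fit is a line in two dimensions; in multiple regression it is a plane or hyperplane in higher dimensions.

• Multiple regression must additionally contend with relationships among the predictors themselves (multicollinearity).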


3.2.4 Bayesian Linear Regression


• Bayesian linear regression provides a useful mechanism to deal with insufficient data, or
poorly distributed data. It allows the user to put a prior on the coefficients and on the noise so
that in the absence of data, the priors can take over. A prior is a distribution on a parameter.

• If we could flip the coin an infinite number of times, inferring its bias would be easy by the
law of large numbers. However, what if we could only flip the coin a handful of times?
Would we guess that a coin is biased if we saw three heads in three flips, an event that
happens one out of eight times with unbiased coins? The MLE would overfit these data,
inferring a coin bias of p = 1.

• A Bayesian approach avoids overfitting by quantifying our prior knowledge that most coins
are unbiased, that the prior on the bias parameter is peaked around one-half. The data must
overwhelm this prior belief about coins.

• Bayesian methods allow us to estimate model parameters, to construct model forecasts and
to conduct model comparisons. Bayesian learning algorithms can calculate explicit
probabilities for hypotheses.

• Bayesian classifiers use a simple idea that the training data are utilized to calculate an
observed probability of each class based on feature values.

• When Bayesian classifier is used for unclassified data, it uses the observed probabilities to
predict the most likely class for the new features.

• Each observed training example can incrementally decrease or increase the estimated
probability that a hypothesis is correct.

• Prior knowledge can be combined with observed data to determine the final probability of a
hypothesis. In Bayesian learning, prior knowledge is provided by asserting a prior probability
for each candidate hypotheses and a probability distribution over observed data for each
possible hypothesis.

• Bayesian methods can accommodate hypotheses that make probabilistic predictions. New
instances can be classified by combining the predictions of multiple hypotheses, weighted by
their probabilities.

• Even in cases where Bayesian methods prove computationally intractable, they can provide
a standard of optimal decision making against which other practical methods can be
measured.

Uses of Bayesian classifiers are as follows:

1. Used in text-based classification for finding spam or junk mail filtering.

2. Medical diagnosis.

3. Network security such as detecting illegal intrusion.

• The basic procedure for implementing Bayesian linear regression is:

i) Specify priors for the model parameter.

ii) Create a model mapping the training inputs to the training outputs.

iii) Have a Markov Chain Monte Carlo (MCMC) algorithm draw samples from the posterior
distributions for the parameters.
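
• In practice step iii) is carried out with an MCMC library; as a simplified sketch, when the prior on the weights and the noise are both Gaussian with known precisions (the values alpha and beta below are hypothetical choices), the posterior over the weights has a closed form and can be computed directly:

```python
import numpy as np

def bayes_linear_posterior(X, y, alpha=1.0, beta=25.0):
    """Posterior over weights w for y = X @ w + noise.

    Assumes a zero-mean Gaussian prior N(0, (1/alpha) I) on w and Gaussian
    noise with known precision beta; alpha and beta are hypothetical choices.
    """
    d = X.shape[1]
    # Standard conjugate-Gaussian result: S = (alpha*I + beta*X^T X)^-1,
    # posterior mean m = beta * S @ X^T y.
    S = np.linalg.inv(alpha * np.eye(d) + beta * X.T @ X)
    m = beta * S @ X.T @ y
    return m, S

# Hypothetical 1-D data with a bias column.
x = np.linspace(0.0, 1.0, 10)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + 0.2 * np.random.default_rng(1).normal(size=10)

mean, cov = bayes_linear_posterior(X, y)
print("posterior mean of [intercept, slope]:", mean)
```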

3.2.5 Gradient Descent


• Goal: Solving nonlinear minimization problems using derivative information.

• First and second derivatives of the objective function or the constraints play an important
role in optimization. The first order derivatives are called the gradient and the second order
derivatives are called the Hessian matrix.

• Derivative-based optimization is also called nonlinear optimization. It is capable of
determining search directions according to an objective function's derivative information.

Derivative-based optimization methods are used for:

1. Optimization of nonlinear neuro-fuzzy models

2. Neural network learning

3. Regression analysis in nonlinear models

Basic descent methods are as follows:

1. Steepest descent

2. Newton-Raphson method


Gradient Descent:

• Gradient descent is a first-order optimization algorithm. To find a local minimum of a
function using gradient descent, one takes steps proportional to the negative of the gradient of
the function at the current point.

• Gradient descent is popular for very large-scale optimization problems because it is easy to
implement, can handle black box functions, and each iteration is cheap.

• Given a differentiable scalar field f(x) and an initial guess x1, gradient descent iteratively
moves the guess toward lower values of f by taking steps in the direction of the negative
gradient −∇f(x).

• Locally, the negative gradient is the steepest descent direction, i.e., the direction in which x
would need to move in order to decrease f the fastest. The algorithm typically converges to a
local minimum, but may rarely reach a saddle point, or not move at all if x1 lies at a local
maximum.

• The gradient gives the slope of the curve at that x, and its direction points toward an
increase in the function. So we change x in the opposite direction to lower the function value:

xk+1 = xk − λ∇f(xk)

where λ > 0 is a small number that forces the algorithm to make small jumps.
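
• A minimal Python sketch of this update rule; the quadratic objective below is a hypothetical example:

```python
import numpy as np

def gradient_descent(grad, x0, lam=0.1, tol=1e-6, max_iter=1000):
    """Minimize f via x_{k+1} = x_k - lam * grad(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = lam * grad(x)
        x = x - step
        if np.linalg.norm(step) < tol:  # stop once the jumps become tiny
            break
    return x

# Hypothetical objective f(x, y) = (x - 3)^2 + (y + 1)^2,
# with gradient [2(x - 3), 2(y + 1)] and minimum at (3, -1).
grad_f = lambda v: np.array([2 * (v[0] - 3), 2 * (v[1] + 1)])
print(gradient_descent(grad_f, x0=[0.0, 0.0]))  # converges near [3, -1]
```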

Limitations of Gradient Descent:

• Gradient descent is relatively slow close to the minimum: technically, its asymptotic rate of
convergence is inferior to many other methods.

• For poorly conditioned convex problems, gradient descent increasingly 'zigzags' as the
gradients point nearly orthogonally to the shortest direction to a minimum point.

Steepest Descent:

• Steepest descent is also known as the gradient method.

• This method is based on a first order Taylor series approximation of the objective function.
It is also called the saddle point method.


• Steepest descent is the simplest of the gradient methods. The choice of direction is where f
decreases most quickly, which is the direction opposite to ∇f(xi). The search starts at an
arbitrary point x0 and then goes down the gradient until it reaches close to the solution.

• The method of steepest descent is the discrete analogue of gradient descent, but the best
move is computed using a local minimization rather than computing a gradient. It is typically
able to converge in a few steps but it is unable to escape local minima or plateaus in the
objective function.

• The gradient is everywhere perpendicular to the contour lines. After each line minimization
the new gradient is always orthogonal to the previous step direction. Consequently, the
iterates tend to zig-zag down the valley in a very inefficient manner.

• The method of steepest descent is simple, easy to apply, and each iteration is fast. It is also
very stable; if minimum points exist, the method is guaranteed to locate them, though
possibly only after a very large number of iterations.

3.3 LINEAR CLASSIFICATION MODELS


• A classification algorithm (classifier) makes its classification based on a linear predictor
function combining a set of weights with the feature vector.

• A linear classifier makes its classification decision based on the value of a linear
combination of the characteristics. Imagine that the linear classifier merges into its weights
all the characteristics that define a particular class.

• Linear classifiers can represent a lot of things, but they can't represent everything. The
classic example of what they can't represent is the XOR function.

3.3.1 Discriminant Function

• Linear Discriminant Analysis (LDA) is the most commonly used dimensionality reduction
technique in supervised learning. Basically, it is a preprocessing step for pattern classification
and machine learning applications. LDA is a powerful algorithm that can be used to determine
the best separation between two or more classes.

• LDA is a supervised learning algorithm, which means that it requires a labelled training set
of data points in order to learn the linear discriminant function.

• The main purpose of LDA is to find the line or plane that best separates data points
belonging to different classes. The key idea behind LDA is that the decision boundary should
be chosen such that it maximizes the distance between the means of the two classes while
simultaneously minimizing the within-class variance (within-class scatter). This criterion is
known as the Fisher criterion.

• LDA is one of the most widely used machine learning algorithms due to its accuracy and
flexibility. LDA can be used for a variety of tasks such as classification, dimensionality
reduction, and feature selection.

• Suppose we have two classes and we need to classify them efficiently; LDA then finds a
projection that separates the two classes as well as possible.


The LDA algorithm works based on the following steps:

a) The first step is to calculate the means and standard deviation of each feature.

b) The within-class scatter matrix and the between-class scatter matrix are calculated.

c) These matrices are then used to calculate the eigenvectors and eigenvalues.

d) LDA chooses the k eigenvectors with the largest eigenvalues to form a transformation
matrix.

e) LDA uses this transformation matrix to transform the data into a new space with k
dimensions.

f) Once the transformation matrix transforms the data into the new space with k
dimensions, LDA can then be used for classification or dimensionality reduction.
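
• A short sketch of these steps using scikit-learn's LDA implementation on the classic iris data (an illustrative choice, not from the text):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# Fit LDA and project the 4-D data onto k = 2 discriminant directions.
lda = LinearDiscriminantAnalysis(n_components=2)
X_2d = lda.fit_transform(X, y)

print(X_2d.shape)       # (150, 2): the data in the new k-dimensional space
print(lda.score(X, y))  # LDA can also be used directly as a classifier
```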

Benefits of using LDA:

a) LDA is used for classification problems.

b) LDA is a powerful tool for dimensionality reduction.

c) LDA is not susceptible to the "curse of dimensionality" like many other machine
learning algorithms.

3.3.2 Logistic Regression

• Logistic regression is a form of regression analysis in which the outcome variable is binary
or dichotomous. A statistical method used to model dichotomous or binary outcomes using
predictor variables.

• Logistic component: Instead of modeling the outcome, Y, directly, the method models the
log odds of Y using the logistic function.

• Regression component: Methods used to quantify the association between an outcome and
predictor variables. It could be used to build predictive models as a function of predictors.

• Simple logistic regression is logistic regression with one predictor variable.

Logistic regression:

ln[P(Y)/(1 − P(Y))] = β0 + β1X1 + β2X2 + ... + βkXk

Compare with the linear model: Y = β0 + β1X1 + β2X2 + ... + βkXk + ε


• With logistic regression, the response variable is an indicator of some characteristic, that is,
a 0/1 variable. Logistic regression is used to determine whether other measurements are
related to the presence of some characteristic, for example, whether certain blood measures
are predictive of having a disease.

• If analysis of covariance can be said to be a t-test adjusted for other variables, then logistic
regression can be thought of as a chi-square test for homogeneity of proportions adjusted for
other variables. While the response variable in a logistic regression is a 0/1 variable, the
logistic regression equation, which is a linear equation, does not predict the 0/1 variable itself.

The linear and logistic probability models are:

Linear Regression :

p = a0 + a1X1 + a2X2 + ... + akXk

Logistic Regression:

ln[p/(1 − p)] = b0 + b1X1 + b2X2 + ... + bkXk

• The linear model assumes that the probability p is a linear function of the regressors, while
the logistic model assumes that the natural log of the odds p/(1 - p) is a linear function of the
regressors.

• The major advantage of the linear model is its interpretability. In the linear model, if a1 is
0.05, that means that a one-unit increase in X1 is associated with a 5 percentage point increase
in the probability that Y is 1.

• The logistic model is less interpretable. In the logistic model, if b1 is 0.05, that means that a
one- unit increase in X1 is associated with a 0.05 increase in the log odds that Y is 1. And
what does that mean? I've never met anyone with any intuition for log odds.
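
• A small sketch, assuming scikit-learn is available, showing that a fitted logistic regression coefficient lives on the log-odds scale; the data below are synthetic:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary data: one predictor, true log odds = 0.5 + 2.0 * x.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
p = 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * X[:, 0])))
y = rng.binomial(1, p)

model = LogisticRegression().fit(X, y)
b1 = model.coef_[0, 0]
# b1 is on the log-odds scale; exp(b1) is the multiplicative change in odds.
print(f"b1 = {b1:.2f}, odds ratio per unit of X = {np.exp(b1):.2f}")
```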


3.3.3 Probabilistic Generative Model

• Generative models are a class of statistical models that generate new data instances. These
models are used in unsupervised machine learning to perform tasks such as probability and
likelihood estimation, modelling data points, and distinguishing between classes using these
probabilities.

• Generative models rely on Bayes' theorem to find the joint probability. Generative models
describe how data is generated using probabilistic models. They predict P(y | x), the
probability of y given x, by first modeling P(x, y), the joint probability of x and y.

3.4 NAIVE BAYES


• Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying
Bayes' theorem with strong independence assumptions between the features. It is highly
scalable, requiring a number of parameters linear in the number of variables
(features/predictors) in a learning problem.

• A Naive Bayes Classifier is a program which predicts a class value given a set of attributes.

• For each known class value,

1. Calculate probabilities for each attribute, conditional on the class value.

2. Use the product rule to obtain a joint conditional probability for the attributes.

3. Use Bayes rule to derive conditional probabilities for the class variable.

• Once this has been done for all class values, output the class with the highest probability.

• Naive Bayes simplifies the calculation of probabilities by assuming that the probability of
each attribute belonging to a given class value is independent of all other attributes. This is a
strong assumption but results in a fast and effective method.

• The probability of a class value given a value of an attribute is called the conditional
probability. By multiplying the conditional probabilities together for each attribute for a given
class value, we have a probability of a data instance belonging to that class.
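
• A brief sketch of this procedure, assuming scikit-learn; the iris data is an illustrative choice, not from the text:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# GaussianNB multiplies per-feature conditional probabilities (independence
# assumption) with the class prior and predicts the most probable class.
clf = GaussianNB().fit(X_train, y_train)
print(clf.predict(X_test[:5]))        # most likely class for new samples
print(clf.predict_proba(X_test[:1]))  # posterior probability of each class
print(clf.score(X_test, y_test))      # accuracy on held-out data
```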

Conditional Probability

• Let A and B be two events such that P(A) > 0. We denote by P(B | A) the probability of B
given that A has occurred. Since A is known to have occurred, it becomes the new sample
space, replacing the original S. From this, the definition is,


P(B | A) = P(A ∩ B)/P(A)

OR

P(A ∩ B) = P(A) P(B | A)

• The notation P(B | A) is read "the probability of event B given event A". It is the probability
of an event B given the occurrence of the event A.

• We say that, the probability that both A and B occur is equal to the probability that A occurs
times the probability that B occurs given that A has occurred. We call P(B | A) the conditional
probability of B given A, i.e., the probability that B will occur given that A has occurred.

• Similarly, the conditional probability of an event A, given B by,

P(A | B) = P(A ∩ B)/P(B)

• The probability P(A | B) simply reflects the fact that the probability of an event A may
depend on a second event B. If A and B are mutually exclusive, then A ∩ B = ∅ and
P(A | B) = 0.

• Another way to look at the conditional probability formula is :

P(Second | First) = P(First choice and second choice)/P(First choice)

• Conditional probability is a defined quantity and cannot be proven.

• The key to solving conditional probability problems is to:

1. Define the events.

2. Express the given information and question in probability notation.

3. Apply the formula.

Joint Probability

• A joint probability is a probability that measures the likelihood that two or more events will
happen concurrently.

• If there are two independent events A and B, the probability that A and B will occur is found
by multiplying the two probabilities. Thus for two events A and B, the special rule of
multiplication shown symbolically is :

P(A and B) = P(A) P(B).


• The general rule of multiplication is used to find the joint probability that two events will
occur. Symbolically, the general rule of multiplication is,

P(A and B) = P(A) P(B | A).

• The probability P(A ∩ B) is called the joint probability of two events A and B which
intersect in the sample space. A Venn diagram readily shows that

P(A ∩ B) = P(A) + P(B) − P(A ∪ B)

Equivalently:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B) ≤ P(A) + P(B)

• The probability of the union of two events never exceeds the sum of the event probabilities.

• A tree diagram is very useful for portraying conditional and joint probabilities. A tree
diagram portrays outcomes that are mutually exclusive.

Bayes Theorem

• Bayes' theorem is a method to revise the probability of an event given additional
information. Bayes' theorem calculates a conditional probability called a posterior or revised
probability.

• Bayes' theorem is a result in probability theory that relates conditional probabilities. If A and
B denote two events, P(A | B) denotes the conditional probability of A occurring, given that B
occurs. The two conditional probabilities P(A | B) and P(B | A) are in general different.

• Bayes' theorem gives a relation between P(A | B) and P(B | A). An important application of
Bayes' theorem is that it gives a rule for how to update or revise the strengths of
evidence-based beliefs in light of new evidence, a posteriori.

• A prior probability is an initial probability value originally obtained before any additional
information is obtained.

• A posterior probability is a probability value that has been revised by using additional
information that is later obtained.

• Suppose that B1, B2, B3, ..., Bn partition the outcomes of an experiment and that A is
another event. For any number k, with 1 ≤ k ≤ n, we have the formula:

P(Bk | A) = P(A | Bk) P(Bk) / [P(A | B1) P(B1) + P(A | B2) P(B2) + ... + P(A | Bn) P(Bn)]


Difference between Generative and Discriminative Models
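
In brief:

• Generative models learn the joint probability P(x, y) and use Bayes' theorem to obtain P(y | x); discriminative models learn P(y | x), or the decision boundary, directly.

• Generative models can generate new data instances; discriminative models cannot.

• Naive Bayes is a generative model; logistic regression is a discriminative model.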

3.5 MAXIMUM MARGIN CLASSIFIER: SUPPORT VECTOR MACHINE

• Support Vector Machines (SVMs) are a set of supervised learning methods which learn
from data and are used for classification. SVM is a classifier derived from the statistical
learning theory of Vapnik and Chervonenkis.


• An SVM is a kind of large-margin classifier: it is a vector space based machine learning
method where the goal is to find a decision boundary between two classes that is maximally
far from any point in the training data.

• Given a set of training examples, each marked as belonging to one of two classes, an SVM
algorithm builds a model that predicts whether a new example falls into one class or the other.
Simply speaking, we can think of an SVM model as representing the examples as points in
space, mapped so that each of the examples of the separate classes are divided by a gap that is
as wide as possible.

• New examples are then mapped into the same space and classified to belong to the class
based on which side of the gap they fall on.

Two Class Problems

• Many decision boundaries can separate these two classes. Which one should we choose?

• Perceptron learning rule can be used to find any decision boundary between class 1 and class
2.

• The line that maximizes the minimum margin is a good bet. The model class of
"hyperplanes with a margin of m" has a low VC dimension if m is big.

• This maximum-margin separator is determined by a subset of the data points. Data points in
this subset are called "support vectors". It will be useful computationally if only a small
fraction of the data points are support vectors, because we use the support vectors to decide
which side of the separator a test case is on.

Example of Bad Decision Boundaries

• SVM are primarily two-class classifiers with the distinct characteristic that they aim to find
the optimal hyperplane such that the expected generalization error is minimized. Instead of
directly minimizing the empirical risk calculated from the training data, SVMs perform
structural risk minimization to achieve good generalization.

• The empirical risk is the average loss of an estimator for a finite set of data drawn from P.
The idea of risk minimization is not only measure the performance of an estimator by its risk,
but to actually search for the estimator that minimizes risk over distribution P. Because we
don't know distribution P we instead minimize empirical risk over a training dataset drawn
from P. This general learning technique is called empirical risk minimization.

Good Decision Boundary: Margin Should Be Large

• The decision boundary should be as far away from the data of both classes as possible. If
data points lie very close to the boundary, the classifier may be consistent but is more "likely"
to make errors on new instances from the distribution. Hence, we prefer classifiers that
maximize the minimal distance of data points to the separator.


1. Margin (m): the gap between the data points and the classifier boundary. The margin is the
minimum distance of any sample to the decision boundary. If the hyperplane is in canonical
form, the margin can be measured by the length of the weight vector: the margin is given by
the projection of the distance between two boundary points onto the direction perpendicular
to the hyperplane.

The margin of the separator is the distance between the support vectors:

Margin (m) = 2/‖w‖

2. Maximal margin classifier: a classifier in the family F that maximizes the margin.
Maximizing the margin is good according to intuition and PAC theory. Implies that only
support vectors matter; other training examples are ignorable.


Solution approach:

1. Define what an optimal hyperplane is: maximize the margin.

2. Extend the above definition for non-linearly separable problems: have a penalty term for
misclassifications.

3. Map data to a high-dimensional space where it is easier to classify with linear decision
surfaces: reformulate the problem so that data is mapped implicitly to this space.


Key Properties of Support Vector Machines

1. Use a single hyperplane which subdivides the space into two half-spaces, one which
is occupied by Class 1 and the other by Class 2

2. They maximize the margin of the decision boundary using quadratic optimization
techniques which find the optimal hyperplane.

3. Ability to handle large feature spaces.

4. Overfitting can be controlled by the soft margin approach.

5. When used in practice, SVM approaches frequently map the examples to a higher
dimensional space and find margin maximal hyperplanes in the mapped space,
obtaining decision boundaries which are not hyperplanes in the original space.

6. The most popular versions of SVMs use non-linear kernel functions and map the
attribute space into a higher dimensional space to facilitate finding "good" linear
decision boundaries in the modified space.

SVM Applications

• SVM has been used successfully in many real-world problems,

1. Text (and hypertext) categorization

2. Image classification

3. Bioinformatics (Protein classification, Cancer classification)

4. Hand-written character recognition

5. Determination of SPAM email.

Limitations of SVM

1. It is sensitive to noise.

2. The biggest limitation of SVM lies in the choice of the kernel.

3. Another limitation is speed and size.

4. The optimal design for multiclass SVM classifiers is also a research area.


Soft Margin SVM

• For the very high dimensional problems common in text classification, sometimes the data
are linearly separable. But in the general case they are not, and even if they are, we might
prefer a solution that better separates the bulk of the data while ignoring a few weird noise
documents.

• What if the training set is not linearly separable? Slack variables can be added to allow
misclassification of difficult or noisy examples; the resulting margin is called a soft margin.

• A soft-margin allows a few variables to cross into the margin or over the hyperplane,
allowing misclassification.

• We penalize the crossover by looking at the number and distance of the misclassifications.
This is a trade-off between the hyperplane violations and the margin size. The slack variables
are bounded by some set cost. The farther they are from the soft margin, the less influence
they have on the prediction.

• All observations have an associated slack variable, as illustrated in the sketch after this list:

1. Slack variable = 0: the point lies on the correct side, at or beyond the margin.

2. Slack variable > 0: the point lies within the margin or on the wrong side of the hyperplane.

3. C is the trade-off between the slack variable penalty and the margin.
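
• A small sketch, assuming scikit-learn, showing how the cost parameter C controls the softness of the margin on synthetic, overlapping data:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping classes, so a hard margin is impossible.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)

# Small C tolerates more margin violations (softer margin); large C
# penalizes slack heavily, approaching a hard margin.
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C = {C}: {clf.n_support_.sum()} support vectors")
```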

Comparison of SVM and Neural Networks
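
In brief:

• SVM training solves a convex quadratic optimization problem with a unique global optimum; neural network training is non-convex and may get stuck in local minima.

• SVMs obtain non-linear decision boundaries through kernel functions; neural networks do so through hidden layers and non-linear activations.

• SVMs have relatively few hyperparameters (kernel, C); neural networks require choosing an architecture, learning rate and other settings.

• SVMs often generalize well from small datasets; neural networks typically need more data.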

Example 8.6.2 : From the following diagram, identify which data points (1, 2, 3, 4, 5) are
support vectors (if any), slack variables on correct side of classifier (if any) and slack

variables on wrong side of classifier (if any). Mention which point will have maximum
penalty and why?

Solution:

• Data points 1 and 5 will have maximum penalty, because they lie on the wrong side of the
classifier and therefore have the largest slack values.

• Margin (m) is the gap between data points and the classifier boundary. The margin is the
minimum distance of any sample to the decision boundary. If this hyperplane is in the
canonical form, the margin can be measured by the length of the weight vector.

• Maximal margin classifier: A classifier in the family F that maximizes the margin.
Maximizing the margin is good according to intuition and PAC theory. Implies that only
support vectors matter; other training examples are ignorable.

• What if the training set is not linearly separable? Slack variables can be added to allow
misclassification of difficult or noisy examples; the resulting margin is called a soft margin.

• A soft-margin allows a few variables to cross into the margin or over the hyperplane,
allowing misclassification.

• We penalize the crossover by looking at the number and distance of the misclassifications.
This is a trade-off between the hyperplane violations and the margin size. The slack variables
are bounded by some set cost. The farther they are from the soft margin, the less influence
they have on the prediction.

• All observations have an associated slack variable


1. Slack variable = 0: the point lies on the correct side, at or beyond the margin.

2. Slack variable > 0: the point lies within the margin or on the wrong side of the hyperplane.

3. C is the trade-off between the slack variable penalty and the margin.

3.6 DECISION TREE

• A decision tree is a simple representation for classifying examples. Decision tree learning is
one of the most successful techniques for supervised classification learning.

• In decision analysis, a decision tree can be used to visually and explicitly represent decisions
and decision making. As the name goes, it uses a tree-like model of decisions.

• Learned trees can also be represented as sets of if-then rules to improve human readability.

• A decision tree has two kinds of nodes

1. Each leaf node has a class label, determined by majority vote of training examples reaching
that leaf.

2. Each internal node is a question on features. It branches out according to the answers.

• Decision tree learning is a method for approximating discrete-valued target functions. The
learned function is represented by a decision tree.

• A learned decision tree can also be re-represented as a set of if-then rules. Decision tree
learning is one of the most widely used and practical methods for inductive inference.

• It is robust to noisy data and capable of learning disjunctive expressions.

• Decision tree learning methods search a completely expressive hypothesis space.

Decision Tree Representation

• Goal: Build a decision tree for classifying examples as positive or negative instances of a
concept

• Supervised learning, batch processing of training examples, using a preference bias.

• A decision tree is a tree where

a. Each non-leaf node has associated with it an attribute (feature).

b. Each leaf node has associated with it a classification (+ or -).


c. Each arc has associated with it one of the possible values of the attribute at the node from
which the arc is directed.

• Internal node denotes a test on an attribute. Branch represents an outcome of the test. Leaf
nodes represent class labels or class distribution.

• A decision tree is a flow-chart-like tree structure, where each node denotes a test on an
attribute value, each branch represents an outcome of the test, and tree leaves represent
classes or class distributions. Decision trees can easily be converted to classification rules.

Decision Tree Algorithm

• To generate decision tree from the training tuples of data partition D.

Input:

1. Data partition (D)

2. Attribute list

3. Attribute selection method

Algorithm:

1. Create a node (N)

2. If tuples in D are all of the same class then

3. Return node (N) as a leaf node labeled with the class C.

4. If attribute list is empty then return N as a leaf node labeled with the majority class in D

5. Apply attribute selection method (D, attribute list) to find the "best" splitting criterion;

6. Label node N with splitting criterion;

7. If splitting attribute is discrete-valued and multiway splits allowed

8. Then attribute list ← attribute list − splitting attribute

9. For (each outcome j of splitting criterion)

10. Let Dj be the set of data tuples in D satisfying outcome j;

11. If Dj is empty then attach a leaf labeled with the majority class in D to node N;

12. Else attach the node returned by Generate decision tree (Dj, attribute list) to node N;

13. End of for loop

14. Return N;
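
• A short sketch of decision tree construction using scikit-learn (an illustrative choice; the numbered algorithm above is the generic procedure, while scikit-learn uses its own CART implementation):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()

# Tree construction: recursively partition on the "best" splitting attribute,
# here chosen by information gain (criterion="entropy").
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3)
tree.fit(data.data, data.target)

# The learned tree re-represented as readable if-then tests.
print(export_text(tree, feature_names=list(data.feature_names)))
```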

• Decision tree generation consists of two phases: Tree construction and pruning

• In tree construction phase, all the training examples are at the root. Partition examples
recursively based on selected attributes.

• In tree pruning phase, the identification and removal of branches that reflect noise or
outliers.

• There are various paradigms that are used for learning binary classifiers which include:

1. Decision Trees

2. Neural Networks

3. Bayesian Classification

4. Support Vector Machines

• Example (feature tree for a ham/spam filter; figure not reproduced). Left side: A feature tree
combining two Boolean features. Each internal node or split is labelled with a feature, and each
edge emanating from a split is labelled with a feature value. Each leaf therefore corresponds to
a unique combination of feature values. Also indicated in each leaf is the class distribution
derived from the training set.

• Right side: A feature tree partitions the instance space into rectangular regions, one for each
leaf.

• The leaves of the tree in this figure could be labelled, from left to right, as ham - spam -
spam, employing a simple decision rule called majority class.

• Left side: A feature tree with training set class distribution in the leaves.

• Right side: A decision tree obtained using the majority class decision rule.

Appropriate Problem for Decision Tree Learning

• Decision tree learning is generally best suited to problems with the following characteristics:

1. Instances are represented by attribute-value pairs. Fixed set of attributes, and the attributes
take a small number of disjoint possible values.

2. The target function has discrete output values. Decision tree learning is appropriate for a
boolean classification, but it easily extends to learning functions with more than two possible
output values.

3. Disjunctive descriptions may be required. Decision trees naturally represent disjunctive
expressions.

4. The training data may contain errors. Decision tree learning methods are robust to errors,
both errors in classifications of the training examples and errors in the attribute values that
describe these examples.

5. The training data may contain missing attribute values. Decision tree methods can be used
even when some training examples have unknown values.

6. Decision tree learning has been applied to problems such as learning to classify medical
patients by their disease, equipment malfunctions by their cause, and loan applicants by their
likelihood of defaulting on payments.

Advantages and Disadvantages of Decision Tree

Advantages:

1. Rules are simple and easy to understand.

2. Decision trees can handle both nominal and numerical attributes.

3. Decision trees are capable of handling datasets that may have errors.

4. Decision trees are capable of handling datasets that may have missing values.

5. Decision trees are considered to be a non-parametric method.

6. Decision trees are self-explanatory.

Disadvantages:

1. Most of the algorithms require that the target attribute has only discrete values.

2. Some problems, such as XOR, are difficult to solve.

3. Decision trees are less appropriate for estimation tasks where the goal is to predict
the value of a continuous attribute.

4. Decision trees are prone to errors in classification problems with many classes and a
relatively small number of training examples.

3.7 RANDOM FORESTS


• Random forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both classification and regression problems in ML. It is
based on the concept of ensemble learning, which is a process of combining multiple
classifiers to solve a complex problem and to improve the overall performance of the model.

• As the name indicates, "Random forest is a classifier that contains a number of decision trees
on various subsets of the given dataset and takes the average to improve the predictive
accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the
prediction from each tree and, based on the majority vote of predictions, predicts the final
output.

• A greater number of trees in the forest leads to higher accuracy and prevents the problem
of overfitting.

How Does Random Forest Algorithm Work?

• Random forest works in two phases: the first is to create the random forest by combining N
decision trees, and the second is to make predictions for each tree created in the first phase.

The working can be explained in the following steps (a sketch in code follows the steps):

Step 1: Select K random data points from the training set.

Step 2: Build the decision trees associated with the selected data points
(subsets).

Step 3: Choose the number N of decision trees to build.

Step 4: Repeat steps 1 and 2.

Step 5: For new data points, find the predictions of each decision tree, and assign the new
data points to the category that wins the majority vote.
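
A minimal sketch of these steps in Python, assuming NumPy arrays and scikit-learn are
available; the helper names random_forest_fit and random_forest_predict are illustrative, not
a library API:

import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def random_forest_fit(X, y, n_trees=10, seed=0):
    rng = np.random.default_rng(seed)
    trees, n = [], len(X)
    for _ in range(n_trees):                # steps 3-4: repeat until N trees are built
        idx = rng.integers(0, n, size=n)    # step 1: random data points (with replacement)
        tree = DecisionTreeClassifier(max_features='sqrt')  # step 2: tree on the subset
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def random_forest_predict(trees, X_new):
    votes = np.array([t.predict(X_new) for t in trees])    # step 5: each tree votes
    return [Counter(col).most_common(1)[0][0] for col in votes.T]  # majority vote per point

For example, random_forest_predict(random_forest_fit(X, y, n_trees=25), X_new) returns one
majority-vote label per row of X_new.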

• The working of the algorithm can be better understood by the following example:

Example: Suppose there is a dataset that contains multiple fruit images. This dataset is given
to the random forest classifier. The dataset is divided into subsets and given to each decision
tree. During the training phase, each decision tree produces a prediction result, and when a
new data point occurs, the random forest classifier predicts the final decision based on the
majority of results.

Applications of Random Forest

There are mainly four sectors where random forest is commonly used:

1. Banking: The banking sector mainly uses this algorithm for the identification of loan
risk.

2. Medicine: With the help of this algorithm, disease trends and risks of the
disease can be identified.

3. Land use: Areas of similar land use can be identified with the help of this
algorithm.

4. Marketing: Marketing trends can be identified using this algorithm.

Advantages of Random Forest

• Random forest is capable of performing both classification and regression tasks.

• It is capable of managing large datasets with high dimensionality.

• It enhances the accuracy of the model and prevents the overfitting problem.

Disadvantages of Random Forest

Although random forest can be used for both classification and regression tasks, it is less
suitable for regression tasks.

Two Marks Questions with Answers


Q.1 Define learning.
Ans.: Learning is a phenomenon and process which has manifestations of various aspects.
Learning process includes gaining of new symbolic knowledge and development of cognitive
skills through instruction and practice. It is also discovery of new facts and theories through
observation and experiment.
Q.2 Define machine learning.
Ans.: A computer program is said to learn from experience E with respect to some class of
tasks T and performance measure P, if its performance at tasks in T, as measured by P,
improves with experience E.
Q.3 What is an influence of information theory on machine learning?
Ans.: Information theory influences machine learning through measures of entropy and
information content, minimum description length approaches to learning, and optimal codes
and their relationship to optimal training sequences for encoding a hypothesis.
Q.4 What is meant by target function of a learning program?
Ans.: Target function is a method for solving a problem that an AI algorithm parses its
training data to find. Once an algorithm finds its target function, that function can be used to
predict results. The function can then be used to find output data related to inputs for real
problems where, unlike training sets, outputs are not included.
Q.5 Define useful perspective on machine learning.
Ans.: One useful perspective on machine learning is that it involves searching a very large
space of possible hypotheses to determine one that best fits the observed data and any prior
knowledge held by the learner.
Q.6 Describe the issues in machine learning?
Ans.: Issues of machine learning are as follows:
• What learning algorithms to be used?
• How much training data is sufficient?
• When and how prior knowledge can guide the learning process?
• What is the best strategy for choosing a next training experience?

• What is the best way to reduce the learning task to one or more function approximation
problems?
• How can the learner automatically alter its representation to improve its learning ability?
Q.7 What is decision tree?
Ans.:
• Decision tree learning is a method for approximating discrete-valued target functions, in
which the learned function is represented by a decision tree.
• A decision tree is a tree where each node represents a feature (attribute), each link (branch)
represents a decision (rule) and each leaf represents an outcome (categorical or continuous
value).
• A decision tree or a classification tree is a tree in which each internal node is labeled with an
input feature. The arcs coming from a node labeled with a feature are labeled with each of the
possible values of the feature.
Q.8 What are the nodes of decision tree?
Ans.: A decision tree has two kinds of nodes
1. Each leaf node has a class label, determined by majority vote of training examples reaching
that leaf.
2. Each internal node is a question on features. It branches out according to the answers.
• Decision tree learning is a method for approximating discrete-valued target functions. The
learned function is represented by a decision tree.
Q.9 Why tree pruning useful in decision tree induction?
Ans.: When a decision tree is built, many of the branches will reflect anomalies in the training
data due to noise or outliers. Tree pruning methods address this problem of overfitting the
data. Such methods typically use statistical measures to remove the least reliable branches.
Q.10 What is tree pruning?
Ans.: Tree pruning attempts to identify and remove such branches, with the goal of improving
classification accuracy on unseen data.
Q.11 What is RULE POST-PRUNING?
Ans.:
• It is a method for finding high accuracy hypotheses.
• Rule post-pruning involves the following steps:
1. Infer decision tree from training set
2. Convert tree to rules - one rule per branch
3. Prune each rule by removing preconditions that result in improved estimated accuracy
4. Sort the pruned rules by their estimated accuracy and consider them in this sequence when
classifying unseen instances
Q.12 Why convert the decision tree to rules before pruning?
Ans.:
• Converting to rules allows distinguishing among the different contexts in which a decision
node is used.
• Converting to rules removes the distinction between attribute tests that occur near the root of
the tree and those that occur near the leaves.
• Converting to rules improves readability. Rules are often easier for people to understand.
Q.13 What do you mean by least square method?
Ans.: Least squares is a statistical method used to determine a line of best fit by minimizing
the sum of squares created by a mathematical function. A "square" is determined by squaring
the distance between a data point and the regression line or mean value of the data set.
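
A minimal NumPy sketch on made-up data points, fitting the line of best fit by minimizing the
sum of squared residuals:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])          # illustrative data only
y = np.array([2.1, 4.2, 5.9, 8.1])
A = np.column_stack([x, np.ones_like(x)])   # design matrix [x, 1]
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)
print(slope, intercept)                     # approximately 1.97 and 0.15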
Q.14 What is linear discriminant function?
Ans.: A linear discriminant function classifies examples using a linear combination of the
input features. LDA is a supervised learning algorithm, which means that it requires a labelled
training set of data points in order to learn the linear discriminant function.
Q.15 What is a support vector in SVM?
Ans.: Support vectors are data points that are closer to the hyperplane and influence the
position and orientation of the hyperplane. Using these support vectors, we maximize the
margin of the classifier.
Q.16 What is support vector machines?
Ans.: A Support Vector Machine (SVM) is a supervised machine learning model that uses
classification algorithms for two-group classification problems. After giving an SVM model
sets of labeled training data for each category, it is able to categorize new text.
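
A minimal scikit-learn sketch on made-up two-dimensional points; after fitting, the
support_vectors_ attribute exposes the support vectors described in Q.15:

from sklearn import svm

X = [[0, 0], [1, 1], [2, 2], [3, 3]]   # illustrative training points
y = [0, 0, 1, 1]
clf = svm.SVC(kernel='linear')
clf.fit(X, y)
print(clf.support_vectors_)            # points closest to the separating hyperplane
print(clf.predict([[2.5, 2.5]]))       # classify a new point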
Q.17 Define logistic regression.
Ans.: Logistic regression is a supervised learning technique. It is used for predicting the
categorical dependent variable using a given set of independent variables.
Q.18 List out types of machine learning.
Ans.: Types of machine learning are supervised, semi-supervised, unsupervised and
reinforcement learning.
Q.19 What is random forest?
Ans.: Random forest is an ensemble learning technique that combines multiple decision trees,
implementing the bagging method and results in a robust model with low variance.
Q.20 What are the five popular algorithms of machine learning?
Ans.: Popular algorithms are Decision Trees, Neural Networks (back propagation),
Probabilistic networks, Nearest Neighbor and Support vector machines.
Q.21 What is the function of 'Supervised Learning'?
Ans.: Functions of supervised learning include classification, speech recognition, regression,
time-series prediction and string annotation.

Q.22 What are the advantages of Naive Bayes?


Ans.: The Naïve Bayes classifier will converge quicker than discriminative models like logistic
regression, so you need less training data. Its main disadvantage is that it cannot learn
interactions between features.
Q.23 What is regression?
Ans.: Regression is a method to determine the statistical relationship between a dependent
variable and one or more independent variables.
Q.24 Explain linear and non-linear regression model.
Ans.: In linear regression models, the dependence of the response on the regressors is defined
by a linear function, which makes their statistical analysis mathematically tractable. On the
other hand, in nonlinear regression models, this dependence is defined by a nonlinear
function, hence the mathematical difficulty in their analysis.
Q.25 What is regression analysis used for?
Ans.: Regression analysis is a form of predictive modelling technique which investigates the
relationship between a dependent (target) and independent variable(s) (predictor). This
technique is used for forecasting, time series modelling and finding the causal effect
relationship between the variables.
Q.26 List two properties of logistic regression.
Ans.:
1. The dependent variable in logistic regression follows the Bernoulli distribution.
2. Estimation is done through maximum likelihood.
Q.27 What is the goal of logistic regression?
Ans.: The goal of logistic regression is to correctly predict the category of outcome for
individual cases using the most parsimonious model. To accomplish this goal, a model is
created that includes all predictor variables that are useful in predicting the response variable.
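
A minimal scikit-learn sketch on made-up one-dimensional data; the model is fitted by
maximum likelihood and predicts the categorical outcome:

from sklearn.linear_model import LogisticRegression

X = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]   # single illustrative predictor
y = [0, 0, 0, 1, 1, 1]                           # Bernoulli-distributed outcome
model = LogisticRegression().fit(X, y)
print(model.predict([[2.0]]))                    # predicted category
print(model.predict_proba([[2.0]]))              # class probabilities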
Q.28 Define supervised learning.
Ans.: Supervised learning is learning in which the network is trained by providing it with input
and matching output patterns. These input-output pairs are usually provided by an external
teacher.

PART B

1. Explain the different Types of Machine Learning?


2. What is Overfitting, and How Can You Avoid It?
3. What are the 'training set' and 'test set' in a machine learning model? How much data
will you allocate for your training, validation, and test sets?
4. How do you handle missing or corrupted data in a dataset?
5. How do you interpret a linear regression model?
6. What are the basic assumptions of the Linear Regression Algorithm?
7. Explain the Gradient Descent algorithm with respect to linear regression.
8. Justify the cases where the linear regression algorithm is suitable for a given dataset.
9. In linear regression, what is the value of the sum of the residuals for a given dataset?
Explain with proper justification.
10. Explain the CART Algorithm for Decision Trees.
11. List down the attribute selection measures used by the ID3 algorithm to construct a
decision tree.
12. Which should be preferred among Gini impurity and Entropy?
13. What is the role of decision trees in artificial intelligence and machine learning?
