W8-Supervised Learning Methods

The document discusses artificial intelligence and predictive modeling techniques. It covers Bayesian inference, including the naïve Bayes classifier, a simple probabilistic classifier based on Bayes' theorem, and works through an example of classifying a new data point with a naïve Bayes classifier trained on a sample dataset. It also covers predictive regression techniques: linear regression, which finds the line of best fit to model the relationship between variables and predict continuous values, and logistic regression, which estimates the probability that the dependent variable will take a given value.

Artificial Intelligence

RSCI
Dr. Ayesha Kashif
• Bayesian Inference
– Naïve Bayes Classifier
• Predictive Regression
– Linear Regression
– Logistic Regression
Bayesian Classification: Why?
– A statistical classifier:
• performs probabilistic prediction, i.e., predicts class membership
probabilities
– Foundation:
• Based on Bayes’ Theorem.
– Performance:
• A simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable to decision tree and selected neural network classifiers
– Incremental:
• Each training example can incrementally increase/decrease the
probability that a hypothesis is correct
• prior knowledge can be combined with observed data
Bayes’ Theorem: Basics
– Bayes’ Theorem:
    P(H | X) = P(X | H) P(H) / P(X)
• Let X be a data sample (“evidence”): class
label is unknown
• Let H be a hypothesis that X belongs to class C
• Classification is to determine P(H|X), (i.e.,
posteriori probability): the probability that
the hypothesis holds given the observed data
sample X
• P(H) (prior probability): the initial probability
– E.g., X will buy computer, regardless of age,
income, …
• P(X): probability that sample data is observed
• P(X|H) (likelihood): the probability of
observing the sample X, given that the
hypothesis holds
– E.g., Given that X will buy computer, the prob.
that X is 31..40, medium income
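As a minimal sketch of how these quantities combine, the Python snippet below evaluates Bayes' theorem with made-up numbers (the prior, likelihood, and evidence values are assumptions for illustration, not figures from the slides):

```python
# Minimal sketch of Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)

def posterior(likelihood, prior, evidence):
    """Return P(H|X) given P(X|H), P(H), and P(X)."""
    return likelihood * prior / evidence

# Hypothetical example: H = "customer buys a computer",
# X = "customer is 31..40 with medium income"
p_h = 0.60          # prior P(H), assumed value
p_x_given_h = 0.30  # likelihood P(X|H), assumed value
p_x = 0.25          # evidence P(X), assumed value

print(posterior(p_x_given_h, p_h, p_x))  # posterior P(H|X) = 0.72
```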
Prediction Based on Bayes’ Theorem
• Given training data X, posteriori probability of a hypothesis H,
P(H|X), follows the Bayes’ theorem

    P(H | X) = P(X | H) P(H) / P(X)

    P(Ci | X) = P(X | Ci) P(Ci) / P(X)
• Informally, this can be viewed as
posteriori = likelihood x prior/evidence
• Predicts X belongs to Ci iff the probability P(Ci|X) is the highest
among all the P(Ck|X) for all the k classes
• Practical difficulty: It requires initial knowledge of many
probabilities, involving significant computational cost
Classification Is to Derive the
Maximum Posteriori
• Let D be a training set of tuples and their associated class labels, and
each tuple is represented by an n-D attribute vector X = (x1, x2, …,
xn)
• Suppose there are m classes C1, C2, …, Cm.
• Classification is to derive the maximum posteriori, i.e., the maximal
P(Ci|X)
• This can be derived from Bayes’ theorem
    P(Ci | X) = P(X | Ci) P(Ci) / P(X)

• Since P(X) is constant for all classes, only P(X | Ci) P(Ci) needs to be maximized
Naïve Bayes Classifier
• A simplified assumption: attributes are conditionally independent
(i.e., no dependence relation between attributes):
    P(X | Ci) = ∏(k = 1..n) P(xk | Ci) = P(x1 | Ci) × P(x2 | Ci) × … × P(xn | Ci)

• This greatly reduces the computation cost: only the class distribution needs to be counted
• If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk
for Ak divided by |Ci, D| (# of tuples of Ci in D)
• If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ
    g(x, μ, σ) = (1 / (√(2π) · σ)) · e^(−(x − μ)² / (2σ²))

and P(xk|Ci) is computed as

    P(xk | Ci) = g(xk, μCi, σCi)
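A short sketch of this Gaussian likelihood in Python may help; the mean, standard deviation, and test value below are assumed for illustration and do not come from the slides:

```python
import math

def gaussian(x, mu, sigma):
    """Gaussian density g(x, mu, sigma), used as P(xk|Ci) for a continuous attribute."""
    return (1.0 / (math.sqrt(2 * math.pi) * sigma)) * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Hypothetical example: ages in class Ci have mean 38 and standard deviation 12;
# likelihood of observing age = 35 for that class:
print(gaussian(35, mu=38, sigma=12))
```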
Naïve Bayes Classifier: Training Dataset
Class:
  C1: buys_computer = ‘yes’
  C2: buys_computer = ‘no’

    P(Ci | X) ∝ P(X | Ci) P(Ci)

Data to be classified:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31…40   medium   no        excellent       yes
31…40   high     yes       fair            yes
>40     medium   no        excellent       no
Naïve Bayes Classifier: An Example
• X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
• P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
• Compute P(X|Ci) for each class
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
• P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
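The same calculation can be reproduced with a short counting-based sketch in Python. The data set is the one from the training-data slide; the code itself is only an illustrative implementation of the counting scheme, not the library routine one would use in practice:

```python
from collections import Counter, defaultdict

# The 14 training tuples from the slide: (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30", "high",   "no",  "fair",      "no"),
    ("<=30", "high",   "no",  "excellent", "no"),
    ("31…40","high",   "no",  "fair",      "yes"),
    (">40",  "medium", "no",  "fair",      "yes"),
    (">40",  "low",    "yes", "fair",      "yes"),
    (">40",  "low",    "yes", "excellent", "no"),
    ("31…40","low",    "yes", "excellent", "yes"),
    ("<=30", "medium", "no",  "fair",      "no"),
    ("<=30", "low",    "yes", "fair",      "yes"),
    (">40",  "medium", "yes", "fair",      "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31…40","medium", "no",  "excellent", "yes"),
    ("31…40","high",   "yes", "fair",      "yes"),
    (">40",  "medium", "no",  "excellent", "no"),
]

classes = Counter(row[-1] for row in data)   # class counts: {'yes': 9, 'no': 5}
cond = defaultdict(Counter)                  # (class, attribute index) -> value counts
for *attrs, label in data:
    for k, value in enumerate(attrs):
        cond[(label, k)][value] += 1

def classify(x):
    """Return P(X|Ci)*P(Ci) for each class and the most probable class."""
    scores = {}
    n = len(data)
    for label, count in classes.items():
        score = count / n                              # prior P(Ci)
        for k, value in enumerate(x):
            score *= cond[(label, k)][value] / count   # P(xk|Ci) by counting
        scores[label] = score
    return scores, max(scores, key=scores.get)

x = ("<=30", "medium", "yes", "fair")
scores, best = classify(x)
print(scores)   # roughly {'no': 0.007, 'yes': 0.028}, matching the slide
print(best)     # 'yes' -> buys_computer = yes
```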
Exercise

• Given the table above, predict the classification of the new sample X = {1, 2, 2, class = ?}.
Predictive Regression
Linear Regression
• The prediction of continuous values can be modeled by a statistical technique called regression.
• Regression analysis is the process of determining how a variable Y is related to one or more other variables x1, x2, ..., xn.
• The relationship that fits a set of data is characterized by a prediction model called a regression equation.

• Common reasons for performing regression analysis include:
1. the output is expensive to measure but the inputs are not, and so a
cheap prediction of the output is sought;
2. the values of the inputs are known before the output is known, and a
working prediction of the output is required;
3. by controlling the input values, we can predict the behavior of the corresponding outputs; and
4. there might be a causal link between some of the inputs and the
output, and we want to identify the links.
Regression And Model Building
• An engineer visits 25 randomly chosen retail outlets having vending machines, and the in-outlet delivery time (in minutes) and the volume of product delivered (in cases) are observed for each.
• Plotting these observations gives a scatter diagram; this display clearly suggests a relationship between delivery time and delivery volume.
Regression And Model Building
• Correlation coefficients measure the strength and sign of a
relationship, but not the slope.
• There are several ways to estimate the slope; the most
common is a linear least squares fit.
• A “linear fit” is a line intended to model the relationship
between variables.
• A “least squares” fit is one that minimizes the mean
squared error (MSE) between the line and the data.
Linear Regression
• Equation of a straight line:
    Y = b + mX
• where Y represents the dependent variable
• X represents the independent variable
• ‘b’ represents the Y-intercept (i.e., the value of Y when X equals zero)
• ‘m’ represents the slope of the line (i.e., the value of tan θ, where θ is the angle between the line and the horizontal axis)
Linear Regression
• Linear regression with one input variable is the
simplest form of regression. It models a random
variable Y (called a response variable) as a linear
function of another random variable X (called a
predictor variable).
• Given n samples or data points of the form (x1, y1),
(x2, y2),…,(xn, yn), where xi∈X and yi∈Y, linear
regression can be expressed as
    Y = α + βX + ε
• where intercept α and slope β are unknown constants or regression coefficients, and ε is a random error component.
Linear Regression
– Find the Least Square Error
• minimizes the error between the actual
data points and the estimated line
• LS minimizes the sum of the squared differences (errors), SSE:
    SSE = Σi (yi − yi’)²
• where yi is the real output value given in the data set, and yi’ is the response value obtained from the model.
• Squaring has the obvious feature of
treating positive and negative residuals
the same.
Linear Regression: Regression coefficients
Differentiating SSE with respect to α and β, setting the partial derivatives equal to zero (minimization of the total error), and rearranging the terms gives two equations that may be solved simultaneously to yield computing formulas for α and β. Using standard relations for the mean values, the regression coefficients for this simple case of optimization are

    Slope:     β = Σ (xi − mean(x)) · (yi − mean(y)) / Σ (xi − mean(x))²
    Intercept: α = mean(y) − β · mean(x)

Beta equals the covariance between x and y divided by the variance of x.
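A brief Python sketch of these formulas follows, applied to a made-up data set (the slide's own training data are not reproduced here, so the xs and ys values are assumptions for illustration):

```python
# Least-squares fit for one input variable:
# slope = Cov(x, y) / Var(x), intercept = mean(y) - slope * mean(x)

def linear_fit(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    beta = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
           / sum((x - mean_x) ** 2 for x in xs)   # slope
    alpha = mean_y - beta * mean_x                # intercept
    return alpha, beta

xs = [1, 2, 3, 4, 5]            # hypothetical inputs
ys = [2.1, 3.9, 6.2, 8.0, 9.8]  # hypothetical outputs
alpha, beta = linear_fit(xs, ys)
print(alpha, beta)              # approximately 0.15 and 1.95
```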
Linear Regression: Example
– Training Data
• where the α and β coefficients can be calculated based on the previous formulas (using meanA = 5 and meanB = 6)
• The optimal regression line follows from the computed α and β values.
Linear Regression: Goodness of fit
– Mean Square Error
• Suppose that you are trying to guess someone’s weight. If you didn’t know anything about them, your best strategy would be to guess the mean ȳ; in that case the MSE of your guesses would be Var(Y).
Linear Regression: Goodness of fit
• A number given by MSE is still hard to immediately intuit. Is this a good
prediction?

• To measure the predictive power of a model, we can compute the coefficient of determination, more commonly known as “R-squared”:
    R² = 1 − Var(ε) / Var(Y)
Linear Regression: Goodness of fit
• To measure the predictive power of a model, we can compute the coefficient of determination, more commonly known as “R-squared”:
    R² = 1 − Var(ε) / Var(Y)
• So the term Var(ε)/Var(Y) is the ratio of the mean squared error with and without the explanatory variable, which is the fraction of variability left unexplained by the model.
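A matching sketch of the R-squared computation on the same hypothetical data, using the α and β values produced by the earlier least-squares sketch:

```python
def r_squared(xs, ys, alpha, beta):
    """R^2 = 1 - Var(residuals) / Var(y) for a fitted line y' = alpha + beta * x."""
    n = len(ys)
    mean_y = sum(ys) / n
    residuals = [y - (alpha + beta * x) for x, y in zip(xs, ys)]
    var_e = sum(e ** 2 for e in residuals) / n      # MSE with the explanatory variable
    var_y = sum((y - mean_y) ** 2 for y in ys) / n  # MSE of always guessing the mean
    return 1 - var_e / var_y

# Hypothetical data and the line fitted to it in the earlier sketch:
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]
print(r_squared(xs, ys, alpha=0.15, beta=1.95))   # close to 1 for nearly linear data
```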
Linear Regression
– Quality of the linear regression model
• One parameter, which shows the strength of the linear association between two variables by means of a single number, is called the correlation coefficient r:
    r = Cov(x, y) / (StdDev(x) · StdDev(y))
• A correlation coefficient of r = 0.85 indicates a good linear relationship between the two variables.
Logistic Regression
Logistic Regression
– Probability of dependent variable
• Rather than predicting the value of the dependent variable, the logistic regression method tries to estimate the probability that the dependent variable will have a given value.
– Customer Credit Rating example
• If the estimated probability is greater than 0.50 then the
prediction is closer to YES (a good credit rating),
• otherwise the output is closer to NO (a bad credit rating is
more probable).
Logistic Regression
– Odds Ratio
• Logistic regression uses the concept of odds ratios to
calculate the probability.
• For example, the probability of a sports team winning a certain match might be 0.75.
• The probability of that team losing would then be 1 – 0.75 = 0.25.
• The odds ratio for that team winning would be 0.75/0.25 = 3.
• In other words, the odds of the team winning are 3 to 1.
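The probability/odds arithmetic from this slide can be written out in a couple of lines; this is a purely illustrative sketch:

```python
# Probability <-> odds conversions used by logistic regression.

def odds(p):
    """Odds in favour of an event with probability p."""
    return p / (1 - p)

def prob_from_odds(o):
    """Probability corresponding to the given odds."""
    return o / (1 + o)

print(odds(0.75))            # 3.0  -> the team's odds of winning are 3 to 1
print(prob_from_odds(3.0))   # 0.75 -> back to the original probability
```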
Logistic Regression
– Linear Logistic Model
• Suppose that output Y has two possible categorical values coded as 0 and 1, and let pj denote the probability that the j-th output equals 1. Then
    log(pj / [1 − pj]) = α + β1·xj1 + β2·xj2 + … + βn·xjn
• This equation is known as the linear logistic model. The function log(pj / [1 − pj]) is often written as logit(p).
• The main reason for using the logit form of output is to prevent the predicted probabilities from becoming values outside the required range [0, 1].
Logistic Regression
– Example
• Suppose that the new sample for classification has input values {x1, x2, x3} = {1, 0, 1}.
• Using the linear logistic model, it is possible to estimate the probability of the output value 1, p(Y = 1), for this sample.
• First, calculate the corresponding logit(p), and then the probability of the output value 1 for the given inputs.
• Based on the final value for the probability p, we may conclude that the output value Y = 1 is more probable than the other categorical value Y = 0.
• Curves of this form are called sigmoidal because they are S-shaped and nonlinear.
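Since the fitted coefficients of the example are not reproduced on these slides, the sketch below uses assumed values for α and β1…β3 purely to illustrate the two steps, computing logit(p) from the linear model and then converting it back to a probability with the sigmoid function:

```python
import math

alpha = 1.5                  # assumed intercept (not from the slides)
betas = [0.6, -1.0, 0.8]     # assumed coefficients for x1, x2, x3 (not from the slides)

x = [1, 0, 1]                # the new sample from the slide

logit_p = alpha + sum(b * xi for b, xi in zip(betas, x))   # log(p / (1 - p))
p = 1 / (1 + math.exp(-logit_p))                           # sigmoid: p(Y = 1)

print(logit_p, p)   # 2.9 and about 0.948 -> Y = 1 is the more probable class
```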
References
• Allen B. Downey, Think Stats, O'Reilly Media, Inc. (2018)
• https://en.wikipedia.org/wiki/Numerical_methods_for_linear_least_squares
• https://towardsdatascience.com/linear-regression-derivation-d362ea3884c2
