
Bayesian Decision Theory

Chapter 2 (Duda et al.) – Sections 2.1-2.10

CS479/679 Pattern Recognition


Dr. George Bebis
Bayesian Decision Theory
• Design classifiers to make decisions that minimize an expected "risk".
– The simplest risk is the classification error.
– When misclassification errors are not equally important, the risk can
include the cost associated with different misclassification errors.
Terminology
• State of nature ω (class label):
– e.g., ω1 for sea bass, ω2 for salmon

• Probabilities P(ω1) and P(ω2) (priors):


– e.g., prior knowledge of how likely it is to get a sea bass
or a salmon

• Probability density function p(x) (evidence):


– e.g., how frequently we will measure a pattern with
feature value x (e.g., x corresponds to lightness)
Terminology (cont’d)
• Conditional probability density p(x/ωj) (likelihood):
– e.g., how frequently we will measure a pattern with
feature value x given that the pattern belongs to class ωj

[Figure: lightness distributions for the salmon and sea-bass populations]
Terminology (cont’d)

• Conditional probability P(ωj /x) (posterior):


– e.g., the probability that the fish belongs to class
ωj given feature x.
Decision Rule Using Prior
Probabilities Only
Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2

P(error) = P(ω1) if we decide ω2
P(error) = P(ω2) if we decide ω1

or P(error) = min[P(ω1), P(ω2)]

• Favours the most likely class.


• This rule makes the same decision every time.
– i.e., optimum if no other information is available
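A minimal sketch in Python (function and variable names are illustrative, not from the slides) of the prior-only rule and its error:

```python
# Prior-only Bayes decision: always choose the more probable class.
def decide_from_priors(p_w1, p_w2):
    """Return the chosen class and the resulting probability of error."""
    decision = "w1" if p_w1 > p_w2 else "w2"
    p_error = min(p_w1, p_w2)       # P(error) = min[P(w1), P(w2)]
    return decision, p_error

print(decide_from_priors(2/3, 1/3))   # always decides w1; P(error) = 1/3
```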
Decision Rule Using
Conditional Probabilities
• Using Bayes’ rule:
P(ωj /x) = p(x/ωj) P(ωj) / p(x)      (posterior = likelihood × prior / evidence)

where p(x) = Σ_{j=1}^{2} p(x/ωj) P(ωj)   (i.e., a scale factor so that the posteriors sum to 1)

Decide ω1 if P(ω1 /x) > P(ω2 /x); otherwise decide ω2


or
Decide ω1 if p(x/ω1)P(ω1)>p(x/ω2)P(ω2); otherwise decide ω2
or
Decide ω1 if p(x/ω1)/p(x/ω2) >P(ω2)/P(ω1) ; otherwise decide ω2
(the left-hand side is the likelihood ratio; the right-hand side acts as a threshold)
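The three equivalent forms above can be checked with a short sketch (illustrative numbers; the assert verifies that the posterior comparison and the likelihood-ratio test agree):

```python
# Bayes decision from likelihoods and priors (two-class case).
def decide(px_w1, px_w2, p_w1, p_w2):
    evidence = px_w1 * p_w1 + px_w2 * p_w2      # p(x) = sum_j p(x/wj) P(wj)
    post_w1 = px_w1 * p_w1 / evidence           # P(w1/x)
    post_w2 = px_w2 * p_w2 / evidence           # P(w2/x)
    # likelihood-ratio form: p(x/w1)/p(x/w2) > P(w2)/P(w1)
    assert (post_w1 > post_w2) == (px_w1 / px_w2 > p_w2 / p_w1)
    return "w1" if post_w1 > post_w2 else "w2"

print(decide(px_w1=0.8, px_w2=0.3, p_w1=2/3, p_w2=1/3))   # -> w1
```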
Decision Rule Using Conditional
Probabilities (cont’d)

[Figure: class-conditional densities p(x/ωj) and posteriors P(ωj /x) for priors P(ω1) = 2/3, P(ω2) = 1/3]
Probability of Error
• The probability of error is defined as:

P(error/x) = P(ω1/x) if we decide ω2
P(error/x) = P(ω2/x) if we decide ω1

or P(error/x) = min[P(ω1/x), P(ω2/x)]

• What is the average probability of error?

P(error) = ∫ P(error, x) dx = ∫ P(error/x) p(x) dx    (integrating over all x)

• The Bayes rule is optimum, that is, it minimizes the average probability of error!
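A numerical sanity check, under an assumed 1-D two-Gaussian example that is not from the slides: the Bayes error ∫ min[p(x/ω1)P(ω1), p(x/ω2)P(ω2)] dx is no larger than the error of an arbitrary fixed-threshold rule.

```python
# Compare the Bayes error with the error of a hand-picked threshold rule.
import numpy as np

p1, p2 = 0.5, 0.5
x = np.linspace(-10, 10, 20001)
dx = x[1] - x[0]
gauss = lambda t, m, s: np.exp(-0.5 * ((t - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
px_w1, px_w2 = gauss(x, -1.0, 1.0), gauss(x, 2.0, 1.5)

# Bayes error: integrate min[p(x/w1)P(w1), p(x/w2)P(w2)] over x
bayes_err = np.sum(np.minimum(px_w1 * p1, px_w2 * p2)) * dx
# Non-optimal rule: decide w1 whenever x < 0, w2 otherwise
thresh_err = np.sum(np.where(x < 0.0, px_w2 * p2, px_w1 * p1)) * dx
print(bayes_err, thresh_err)   # bayes_err <= thresh_err
```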
Where do Probabilities come from?
• There are two competing answers:

(1) Relative frequency (objective) approach.


– Probabilities can only come from experiments.

(2) Bayesian (subjective) approach.


– Probabilities may reflect degree of belief and can be
based on opinion.
Example (objective approach)
• Classify cars according to whether they cost more or less than $50K:
– Classes: C1 if price > $50K, C2 if price <= $50K
– Features: x, the height of a car

• Use the Bayes’ rule to compute the posterior probabilities:

P(Ci /x) = p(x/Ci) P(Ci) / p(x)
• We need to estimate p(x/C1), p(x/C2), P(C1), P(C2)
Example (cont’d)
• Collect data
– Ask drivers how much their car cost and measure its height.
• Determine prior probabilities P(C1), P(C2)
– e.g., 1209 samples: #C1=221 #C2=988

P(C1) = 221/1209 = 0.183
P(C2) = 988/1209 = 0.817
Example (cont’d)
• Determine class conditional probabilities (likelihood)
– Discretize car height into bins and use normalized histogram

[Figure: normalized histograms of car height giving the likelihoods p(x/Ci)]
Example (cont’d)
• Calculate the posterior probability for each bin, e.g.:
P(C1 /x=1.0) = p(x=1.0/C1) P(C1) / [ p(x=1.0/C1) P(C1) + p(x=1.0/C2) P(C2) ]
            = (0.2081 × 0.183) / (0.2081 × 0.183 + 0.0597 × 0.817) = 0.438

[Figure: posterior probabilities P(Ci /x) for each height bin]
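The numbers above can be reproduced with a few lines (the bin likelihoods 0.2081 and 0.0597 are the values quoted on the slide):

```python
# Priors from relative frequencies, posterior from Bayes' rule.
n_c1, n_c2 = 221, 988
p_c1 = n_c1 / (n_c1 + n_c2)          # 0.183
p_c2 = n_c2 / (n_c1 + n_c2)          # 0.817

lik_c1, lik_c2 = 0.2081, 0.0597      # p(x=1.0/C1), p(x=1.0/C2) from the histograms
post_c1 = lik_c1 * p_c1 / (lik_c1 * p_c1 + lik_c2 * p_c2)
print(round(p_c1, 3), round(p_c2, 3), round(post_c1, 3))   # 0.183 0.817 0.438
```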
Example (subjective approach)

• Use the Bayes’ rule to compute the posterior probabilities:


P(Ci /x) = p(x/Ci) P(Ci) / p(x)

[Figure: assumed Gaussian densities N(μ,Σ)]

• p(x/C1) ~ N(μ1,Σ1)
• p(x/C2) ~ N(μ2,Σ2)
• P(C1) = P(C2) = 0.5
A More General Theory
• Use more than one feature.
• Allow more than two categories.
• Allow actions other than classifying the input to
one of the possible categories (e.g., rejection).
• Employ a more general error function (i.e.,
expected “risk”) by associating a “cost” (based
on a “loss” function) with different errors.
Terminology
• Features form a vector x  R d
• A set of c categories ω1, ω2, …, ωc
• A finite set of l actions α1, α2, …, αl
• A loss function λ(αi / ωj)
– the cost associated with taking action αi when the correct
classification category is ωj

Bayes rule (using vector notation):


P(ωj /x) = p(x/ωj) P(ωj) / p(x)

where p(x) = Σ_{j=1}^{c} p(x/ωj) P(ωj)
Conditional Risk (or Expected Loss)
• Suppose we observe x and take action αi

• The conditional risk (or expected loss) of taking action αi is defined as:
R(αi /x) = Σ_{j=1}^{c} λ(αi /ωj) P(ωj /x)
Overall Risk
• The overall risk is defined as:

R = ∫ R(α(x)/x) p(x) dx

where α(x) is a general decision rule that determines


which action α1, α2, …, αl to take
for every x.

• The optimum decision rule is the Bayes rule


Overall Risk (cont’d)
• The Bayes rule minimizes R by:
(i) Computing R(αi /x) for every αi given an x
(ii) Choosing the action αi with the minimum R(αi /x)

• The resulting minimum R* is called Bayes risk and


is the best performance that can be achieved:

R* = min R
Example: Two-category
classification
• Define
– α1: decide ω1
– α2: decide ω2
– λij = λ(αi /ωj)

• The conditional risks are:


R(αi /x) = Σ_{j=1}^{2} λ(αi /ωj) P(ωj /x), i.e.,

R(α1/x) = λ11 P(ω1/x) + λ12 P(ω2/x)
R(α2/x) = λ21 P(ω1/x) + λ22 P(ω2/x)
Example: Two-category
classification (cont’d)
• Minimum risk decision rule:

Decide ω1 if R(α1/x) < R(α2/x)

or

Decide ω1 if (λ21 − λ11) p(x/ω1) P(ω1) > (λ12 − λ22) p(x/ω2) P(ω2)

or

Decide ω1 if p(x/ω1)/p(x/ω2) > [(λ12 − λ22) P(ω2)] / [(λ21 − λ11) P(ω1)]
(likelihood ratio on the left, threshold on the right)


Special Case:
Zero-One Loss Function
• Assign the same loss to all errors:

λ(αi /ωj) = 0 if i = j, and 1 if i ≠ j    (i, j = 1, …, c)

• The conditional risk corresponding to this loss function:

R(αi /x) = Σ_{j≠i} P(ωj /x) = 1 − P(ωi /x)

Special Case:
Zero-One Loss Function (cont’d)
• The decision rule becomes:

Decide ω1 if P(ω1/x) > P(ω2/x)
or
Decide ω1 if p(x/ω1) P(ω1) > p(x/ω2) P(ω2)
or
Decide ω1 if p(x/ω1)/p(x/ω2) > P(ω2)/P(ω1)
• In this case, the overall risk becomes the average probability of error!
Example
Assuming a general loss:
Decide ω1 if p(x/ω1)/p(x/ω2) > θb; otherwise decide ω2

Assuming zero-one loss:
Decide ω1 if p(x/ω1)/p(x/ω2) > θa; otherwise decide ω2

where θa = P(ω2)/P(ω1)
and  θb = [P(ω2)(λ12 − λ22)] / [P(ω1)(λ21 − λ11)]

assume: λ12 > λ21

[Figure: likelihood ratio p(x/ω1)/p(x/ω2) with thresholds θa and θb and the corresponding decision regions]
Discriminant Functions
• A classifier can be represented through discriminant functions
gi(x), i = 1, . . . , c
• A feature vector x is assigned to class ωi if:
gi(x) > gj(x) for all j ≠ i

[Figure: discriminant-function classifier – the category with the maximum gi(x) is selected]
Discriminants for Bayes Classifier
• Assuming a general loss function:

gi(x)=-R(αi / x)

• Assuming the zero-one loss function:

gi(x)=P(ωi / x)
Discriminants for Bayes Classifier
(cont’d)
• Is the choice of gi unique?
– Replacing gi(x) with f(gi(x)), where f() is monotonically
increasing, does not change the classification results.

p (x / i ) P(i )
g i ( x) 
p ( x)
gi(x)=P(ωi/x)
gi (x)  p(x / i ) P(i )
gi (x)  ln p (x / i )  ln P(i )

we’ll use this


discriminant extensively!
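A quick check that the three equivalent discriminants above produce the same decision (illustrative likelihoods and priors):

```python
# Monotonic transformations of the posterior do not change the decision.
import numpy as np

likelihoods = np.array([0.05, 0.20, 0.10])   # p(x/wi), i = 1..3
priors      = np.array([0.5, 0.2, 0.3])      # P(wi)

g_post = likelihoods * priors / np.sum(likelihoods * priors)   # gi(x) = P(wi/x)
g_prod = likelihoods * priors                                  # gi(x) = p(x/wi) P(wi)
g_log  = np.log(likelihoods) + np.log(priors)                  # gi(x) = ln p(x/wi) + ln P(wi)

print(np.argmax(g_post), np.argmax(g_prod), np.argmax(g_log))  # same class index
```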
Case of two categories
• More common to use a single discriminant function
(dichotomizer) instead of two:
Decide ω1 if g(x) > 0; otherwise decide ω2

• Examples:
g(x) = P(ω1/x) − P(ω2/x)
g(x) = ln [p(x/ω1)/p(x/ω2)] + ln [P(ω1)/P(ω2)]
Decision Regions and Boundaries
• Discriminants divide the feature space into decision regions
R1, R2, …, Rc, separated by decision boundaries.

Decision boundary
is defined by:
g1(x)=g2(x)
Discriminant Function for
Multivariate Gaussian Density

x ~ N(μ,Σ):  p(x) = 1 / [(2π)^(d/2) |Σ|^(1/2)] exp[ −½ (x − μ)^t Σ^(-1) (x − μ) ]

• Consider the following discriminant function:

gi(x) = ln p(x/ωi) + ln P(ωi)

• With p(x/ωi) ~ N(μi,Σi), this becomes:

gi(x) = −½ (x − μi)^t Σi^(-1) (x − μi) − (d/2) ln 2π − ½ ln|Σi| + ln P(ωi)
Multivariate Gaussian Density: Case I

• Σi = σ²I (diagonal covariance matrix)


– Features are statistically independent
– Each feature has the same variance
Multivariate Gaussian Density:
Case I (cont’d)

• The discriminant is linear: gi(x) = wi^t x + wi0, where
wi = μi / σ²
wi0 = −(1/(2σ²)) μi^t μi + ln P(ωi)

• The decision boundary between ωi and ωj is the hyperplane w^t(x − x0) = 0, where
w = μi − μj
x0 = ½(μi + μj) − [σ² / ||μi − μj||²] ln[P(ωi)/P(ωj)] (μi − μj)
Multivariate Gaussian Density:
Case I (cont’d)
• Properties of decision boundary:
– It passes through x0
– It is orthogonal to the line linking the means.
– What happens when P(ωi) = P(ωj)?  (x0 is halfway between the means)
– If P(ωi) ≠ P(ωj), then x0 shifts away from the most likely category.
– If σ is very small, the position of the boundary is insensitive to P(ωi)
and P(ωj)
Multivariate Gaussian Density:
Case I (cont'd)

If P(ωi) ≠ P(ωj), then x0 shifts away from the most likely category.

[Figures: decision boundaries for equal and unequal priors P(ωi), P(ωj)]
Multivariate Gaussian Density:
Case I (cont’d)
• Minimum distance classifier
– When P(ωi) are equal, then the discriminant becomes:

gi(x) = −||x − μi||²

– This is the Euclidean distance!

– Assumptions: statistically independent features, same variance!
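A minimal minimum-distance classifier sketch under these assumptions (equal priors, Σi = σ²I; the means are illustrative):

```python
# Minimum Euclidean distance classifier: assign x to the nearest class mean.
import numpy as np

means = np.array([[0.0, 0.0],     # mu_1
                  [3.0, 3.0]])    # mu_2

def classify(x, means):
    # gi(x) = -||x - mu_i||^2  <=>  pick the nearest mean
    dists = np.sum((means - x) ** 2, axis=1)
    return np.argmin(dists)

print(classify(np.array([1.0, 0.5]), means))   # -> 0 (class w1)
```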


Multivariate Gaussian Density: Case II

• Σi = Σ (all classes share the same covariance matrix)
Multivariate Gaussian Density:
Case II (cont'd)

• The discriminant is again linear: gi(x) = wi^t x + wi0, where
wi = Σ^(-1) μi
wi0 = −½ μi^t Σ^(-1) μi + ln P(ωi)

• The decision boundary between ωi and ωj is the hyperplane w^t(x − x0) = 0, where
w = Σ^(-1) (μi − μj)
x0 = ½(μi + μj) − [ ln(P(ωi)/P(ωj)) / ((μi − μj)^t Σ^(-1) (μi − μj)) ] (μi − μj)
Multivariate Gaussian Density:
Case II (cont’d)
• Properties of hyperplane (decision boundary):
– It passes through x0
– It is not orthogonal to the line linking the means.
– What happens when P(ωi) = P(ωj)?  (x0 is halfway between the means)
– If P(ωi) ≠ P(ωj), then x0 shifts away from the most likely category.
Multivariate Gaussian Density:
Case II (cont'd)

If P(ωi) ≠ P(ωj), then x0 shifts away from the most likely category.

[Figures: hyperplane decision boundaries for equal and unequal priors]
Multivariate Gaussian Density:
Case II (cont’d)
• Mahalanobis distance classifier
– When the P(ωi) are equal, the discriminant reduces to:
gi(x) = −(x − μi)^t Σ^(-1) (x − μi)
i.e., assign x to the category whose mean is closest in Mahalanobis distance.
Multivariate Gaussian Density: Case III

• Σi = arbitrary

• The discriminant is quadratic: gi(x) = x^t Wi x + wi^t x + wi0, where
Wi = −½ Σi^(-1)
wi = Σi^(-1) μi
wi0 = −½ μi^t Σi^(-1) μi − ½ ln|Σi| + ln P(ωi)

• The decision boundaries are hyperquadrics;
e.g., hyperplanes, pairs of hyperplanes, hyperspheres,
hyperellipsoids, hyperparaboloids, etc.
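A sketch of the general Gaussian discriminant gi(x) = ln p(x/ωi) + ln P(ωi) with arbitrary Σi (Cases I and II are special cases); all parameters are illustrative:

```python
# Gaussian (quadratic) discriminant for arbitrary covariance matrices.
import numpy as np

def gaussian_discriminant(x, mu, Sigma, prior):
    d = len(mu)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(Sigma) @ diff
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(prior))

mus    = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Sigmas = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
priors = [0.6, 0.4]

x = np.array([1.0, 1.5])
scores = [gaussian_discriminant(x, m, S, p) for m, S, p in zip(mus, Sigmas, priors)]
print(np.argmax(scores))   # index of the chosen category
```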
Multivariate Gaussian Density:
Case III (cont’d)

[Figures: examples of non-linear (hyperquadric) decision boundaries for arbitrary Σi]
Example - Case III

[Figure: decision boundary for a two-category example with P(ω1) = P(ω2); the boundary does not pass through the midpoint of μ1 and μ2]
Error Bounds
• Exact error calculations can be difficult – it is often easier to
estimate error bounds!

P(error) = ∫ min[P(ω1/x), P(ω2/x)] p(x) dx

Using min[a, b] ≤ a^β b^(1−β) (for a, b ≥ 0 and 0 ≤ β ≤ 1):

P(error) ≤ P^β(ω1) P^(1−β)(ω2) ∫ p^β(x/ω1) p^(1−β)(x/ω2) dx
Error Bounds (cont’d)
• If the class-conditional densities are Gaussian, then

∫ p^β(x/ω1) p^(1−β)(x/ω2) dx = e^(−k(β))

where:

k(β) = [β(1−β)/2] (μ2 − μ1)^t [βΣ1 + (1−β)Σ2]^(-1) (μ2 − μ1) + ½ ln[ |βΣ1 + (1−β)Σ2| / (|Σ1|^β |Σ2|^(1−β)) ]
Error Bounds (cont’d)
• The Chernoff bound is obtained by minimizing e^(−k(β)) with respect to β.
– This is a 1-D optimization problem, regardless of the dimensionality
of the class-conditional densities.
Error Bounds (cont’d)
• The Bhattacharyya bound is obtained by setting β = 0.5
– Easier to compute than the Chernoff bound, but looser.

• Note: the Chernoff and Bhattacharyya bounds will not be


good bounds if the densities are not Gaussian.
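A sketch of the Bhattacharyya bound for two Gaussian classes, assuming the k(β) expression above evaluated at β = 0.5 (parameters are illustrative):

```python
# Bhattacharyya bound: P(error) <= sqrt(P(w1)P(w2)) * exp(-k(1/2)).
import numpy as np

def bhattacharyya_bound(mu1, S1, mu2, S2, p1, p2):
    S = 0.5 * (S1 + S2)
    diff = mu2 - mu1
    k = (0.125 * diff @ np.linalg.inv(S) @ diff
         + 0.5 * np.log(np.linalg.det(S)
                        / np.sqrt(np.linalg.det(S1) * np.linalg.det(S2))))
    return np.sqrt(p1 * p2) * np.exp(-k)

mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
S1, S2 = np.eye(2), np.array([[2.0, 0.5], [0.5, 2.0]])
print(bhattacharyya_bound(mu1, S1, mu2, S2, 0.5, 0.5))
```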
Example (cont’d)

Bhattacharyya bound:
k(0.5) = 4.06
P(error) ≤ 0.0087
Receiver Operating
Characteristic (ROC) Curve
• Every classifier typically employs some kind of threshold, e.g.:

θa = P(ω2)/P(ω1)
θb = [P(ω2)(λ12 − λ22)] / [P(ω1)(λ21 − λ11)]
• Changing the threshold can affect the performance of the
classifier.
• ROC curves allow us to evaluate/compare the
performance of a classifier using different thresholds.
Example: Person Authentication
• Authenticate a person using biometrics (e.g., fingerprints).
• There are two possible distributions (i.e., classes):
– Authentic (A) and Impostor (I)

[Figure: score distributions for the Authentic (A) and Impostor (I) classes]
Example: Person Authentication
(cont’d)
• Possible decisions:
– (1) correct acceptance (true positive): X belongs to A, and we decide A
– (2) incorrect acceptance (false positive): X belongs to I, and we decide A
– (3) correct rejection (true negative): X belongs to I, and we decide I
– (4) incorrect rejection (false negative): X belongs to A, and we decide I

[Figure: the A and I distributions with the regions corresponding to correct acceptance, correct rejection, false positive, and false negative]
Error vs Threshold

[Figure: FAR and FRR as a function of the decision threshold x*]

FAR: False Accept Rate (False Positive)
FRR: False Reject Rate (False Negative)

False Negatives vs False Positives (ROC Curve)

[Figure: ROC curve – FRR plotted against FAR as the threshold x* varies]
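A sketch of how FAR and FRR trace out an ROC curve as the threshold x* varies, assuming Gaussian score distributions for A and I (all values are illustrative):

```python
# Sweep a threshold over 1-D scores and record (FAR, FRR) pairs.
import numpy as np

rng = np.random.default_rng(0)
authentic = rng.normal(2.0, 1.0, 5000)   # scores for class A
impostor  = rng.normal(0.0, 1.0, 5000)   # scores for class I

thresholds = np.linspace(-3, 5, 81)      # decide "A" when score > x*
far = [(impostor  > t).mean() for t in thresholds]   # false accept rate
frr = [(authentic <= t).mean() for t in thresholds]  # false reject rate

# the ROC curve is the set of (FAR, FRR) points traced out as x* varies
for t, a, r in zip(thresholds[::20], far[::20], frr[::20]):
    print(f"x*={t:+.1f}  FAR={a:.3f}  FRR={r:.3f}")
```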
Bayes Decision Theory:
Case of Discrete Features

• Replace ∫ p(x/ωj) dx with Σx P(x/ωj)

• See section 2.9


Missing Features
• Suppose x = (x1, x2) is a test vector where x1 is missing and x2 = x̂2; how would we classify it?
– If we set x1 equal to its average value, we will classify x as ω3
– But p(x̂2/ω2) is larger; should we classify x as ω2?

[Figure: example class-conditional densities illustrating the missing-feature problem]
Missing Features (cont’d)
• Suppose x = [xg, xb] (xg: good features, xb: bad/missing features)
• Derive the Bayes rule using the good features only:

P(ωi /xg) = ∫ p(ωi, xg, xb) dxb / p(xg) = ∫ P(ωi /xg, xb) p(xg, xb) dxb / ∫ p(xg, xb) dxb

i.e., marginalize the posterior probability over the bad features.
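A sketch of classifying with a missing feature by marginalizing it out, under the simplifying assumption of independent Gaussian features per class (parameters are illustrative):

```python
# Missing-feature classification: integrate the missing feature out of the likelihood.
import numpy as np

# per-class parameters for features (x1, x2): means, std devs, priors
means  = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 0.0]])
stds   = np.ones((3, 2))
priors = np.array([1/3, 1/3, 1/3])

def gauss(t, m, s):
    return np.exp(-0.5 * ((t - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

x2_hat = 2.1                     # observed good feature; x1 is missing
# With independent features, integrating p(x1, x2_hat/wi) over x1 leaves p(x2_hat/wi)
p_good = gauss(x2_hat, means[:, 1], stds[:, 1])
posterior = p_good * priors / np.sum(p_good * priors)    # P(wi/xg)
print(posterior, np.argmax(posterior))
```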
Compound Bayesian
Decision Theory
• Sequential decision
(1) Decide as each pattern (e.g., fish) emerges.

• Compound decision
(1) Wait for n patterns (e.g., fish) to emerge.
(2) Make all n decisions jointly.
– Could improve performance when consecutive states
of nature are not statistically independent.
Compound Bayesian
Decision Theory (cont’d)
• Suppose X=(x1, x2, …, xn) are n observed
vectors.
• Suppose Ω=(ω(1), ω(2), …, ω(n)) denotes the n
states of nature.
– ω(i) can take one of c values ω1, ω2, …, ωc (i.e., c
categories)
• Suppose P(Ω) is the prior probability of the n
states of nature.
Compound Bayesian
Decision Theory (cont’d)

• Compute the joint posterior of the n states of nature:

P(Ω/X) = p(X/Ω) P(Ω) / p(X)

• Assuming p(X/Ω) = Π_{i=1}^{n} p(xi /ω(i)) is usually acceptable, i.e., consecutive states of nature may
not be statistically independent, but that dependence can still be captured by the prior P(Ω)!
