Bayesian Classification: Why?
◼ A statistical classifier: performs probabilistic prediction, i.e.,
predicts class membership probabilities
◼ Foundation: Based on Bayes’ Theorem.
◼ Performance: A simple Bayesian classifier, the naïve Bayesian
classifier, performs comparably to decision tree and selected
neural network classifiers
◼ Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct —
prior knowledge can be combined with observed data
◼ Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision
making against which other methods can be measured
Bayes’ Theorem: Basics
◼ Total probability theorem: P(B) = Σ_{i=1}^{M} P(B|Ai) P(Ai)
◼ Bayes’ theorem: P(H|X) = P(X|H) P(H) / P(X)
◼ Let X be a data sample (“evidence”): class label is unknown
◼ Let H be a hypothesis that X belongs to class C
◼ Classification is to determine P(H|X) (i.e., the posterior probability): the
probability that the hypothesis holds given the observed data sample X
◼ P(H) (prior probability): the initial probability
◼ E.g., X will buy computer, regardless of age, income, …
◼ P(X): probability that sample data is observed
◼ P(X|H) (likelihood): the probability of observing the sample X, given that
the hypothesis holds
◼ E.g., given that X will buy a computer, the probability that X is
aged 31…40 with medium income
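As a quick check of these two formulas, here is a minimal Python sketch; all numeric values below are hypothetical, chosen only to exercise the identities.

```python
# Minimal sketch of the two identities above; all numbers are hypothetical.

def total_probability(likelihoods, priors):
    """P(B) = sum_i P(B|Ai) * P(Ai), over a partition A1..AM."""
    return sum(l * p for l, p in zip(likelihoods, priors))

def bayes(likelihood, prior, evidence):
    """P(H|X) = P(X|H) * P(H) / P(X)."""
    return likelihood * prior / evidence

# Two mutually exclusive hypotheses H and not-H with hypothetical values:
p_x = total_probability([0.8, 0.3], [0.4, 0.6])  # P(X) = 0.8*0.4 + 0.3*0.6 = 0.50
print(bayes(0.8, 0.4, p_x))                      # P(H|X) = 0.32 / 0.50 = 0.64
```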
Prediction Based on Bayes’ Theorem
◼ Given training data X, posteriori probability of a hypothesis H,
P(H|X), follows the Bayes’ theorem
P(H|X) = P(X|H) P(H) / P(X)
◼ Informally, this can be viewed as
posterior = likelihood × prior / evidence
◼ Predicts that X belongs to Ci iff the probability P(Ci|X) is the
highest among P(Ck|X) over all k classes
◼ Practical difficulty: It requires initial knowledge of many
probabilities, involving significant computational cost
Classification Is to Derive the Maximum Posteriori
◼ Let D be a training set of tuples and their associated class
labels, and each tuple is represented by an n-D attribute vector
X = (x1, x2, …, xn)
◼ Suppose there are m classes C1, C2, …, Cm.
◼ Classification is to derive the maximum a posteriori (MAP) class,
i.e., the class Ci with maximal P(Ci|X)
◼ This can be derived from Bayes’ theorem
P(Ci|X) = P(X|Ci) P(Ci) / P(X)
◼ Since P(X) is constant for all classes, only
P(X|Ci) P(Ci)
needs to be maximized (a minimal sketch of this rule follows below)
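A minimal Python sketch of this decision rule; the scores dict below is hypothetical (its values happen to match the worked example later in this section).

```python
# MAP decision: P(X) is the same for every class, so comparing the
# unnormalized products P(X|Ci) * P(Ci) is enough.
def map_class(scores):
    """scores maps class label -> P(X|Ci) * P(Ci)."""
    return max(scores, key=scores.get)

print(map_class({"yes": 0.028, "no": 0.007}))  # -> "yes"
```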
Naïve Bayes Classifier
◼ A simplifying assumption: attributes are conditionally
independent given the class (i.e., no dependence relation
between attributes within a class):
P(X|Ci) = ∏_{k=1}^{n} P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)
◼ This greatly reduces the computation cost: only the per-class
value counts need to be collected
◼ If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk
for Ak divided by |Ci, D| (# of tuples of Ci in D)
◼ If Ak is continuous-valued, P(xk|Ci) is usually computed from a
Gaussian distribution with mean μ and standard deviation σ:
g(x, μ, σ) = (1 / (√(2π) σ)) e^(−(x−μ)² / (2σ²))
and P(xk|Ci) = g(xk, μCi, σCi)
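A minimal sketch of the continuous case, assuming μ and σ have already been estimated from the class-Ci training tuples; the income numbers below are hypothetical.

```python
import math

def gaussian(x, mu, sigma):
    """g(x, mu, sigma) = 1 / (sqrt(2*pi)*sigma) * exp(-(x - mu)**2 / (2*sigma**2))"""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Hypothetical: incomes in class Ci have mean 110 and standard deviation 25,
# so the density used for P(income = 120 | Ci) is:
print(gaussian(120, 110, 25))  # ~0.0147
```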
Naïve Bayes Classifier: Training Dataset
Example:
age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31…40   medium   no        excellent       yes
31…40   high     yes       fair            yes
>40     medium   no        excellent       no

Class:
C1: buys_computer = ‘yes’    C2: buys_computer = ‘no’
Data to be classified:
X = (age <=30, Income = medium, Student = yes, Credit_rating = Fair)
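For reproducibility, here is a sketch encoding the 14 training tuples above in Python; the field names are illustrative shorthand for the attribute names, and the later sketches in this section reuse DATA and X.

```python
# The 14 training tuples above, encoded as records (field names are
# illustrative shorthand for the attribute names in the table).
DATA = [
    {"age": "<=30",   "income": "high",   "student": "no",  "credit": "fair",      "buys": "no"},
    {"age": "<=30",   "income": "high",   "student": "no",  "credit": "excellent", "buys": "no"},
    {"age": "31..40", "income": "high",   "student": "no",  "credit": "fair",      "buys": "yes"},
    {"age": ">40",    "income": "medium", "student": "no",  "credit": "fair",      "buys": "yes"},
    {"age": ">40",    "income": "low",    "student": "yes", "credit": "fair",      "buys": "yes"},
    {"age": ">40",    "income": "low",    "student": "yes", "credit": "excellent", "buys": "no"},
    {"age": "31..40", "income": "low",    "student": "yes", "credit": "excellent", "buys": "yes"},
    {"age": "<=30",   "income": "medium", "student": "no",  "credit": "fair",      "buys": "no"},
    {"age": "<=30",   "income": "low",    "student": "yes", "credit": "fair",      "buys": "yes"},
    {"age": ">40",    "income": "medium", "student": "yes", "credit": "fair",      "buys": "yes"},
    {"age": "<=30",   "income": "medium", "student": "yes", "credit": "excellent", "buys": "yes"},
    {"age": "31..40", "income": "medium", "student": "no",  "credit": "excellent", "buys": "yes"},
    {"age": "31..40", "income": "high",   "student": "yes", "credit": "fair",      "buys": "yes"},
    {"age": ">40",    "income": "medium", "student": "no",  "credit": "excellent", "buys": "no"},
]

# The unseen tuple X to classify:
X = {"age": "<=30", "income": "medium", "student": "yes", "credit": "fair"}
```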
Naïve Bayes Classifier: An Example
Class:
C1:buys_computer = ‘yes’ C2:buys_computer = ‘no’
◼ Compute P(Ci) for each class:
◼ P(C1) = P(buys_computer = “yes”) = 9/14 = 0.643
◼ P(C2) = P(buys_computer = “no”) = 5/14 = 0.357
Naïve Bayes Classifier: An Example
Class:
C1:buys_computer = ‘yes’ C2:buys_computer = ‘no’
◼ Compute P(X|Ci) for each class as the product of the per-attribute
conditionals (a counting sketch follows below):
P(X|C1) = P(x1|C1) × P(x2|C1) × … × P(xn|C1)
P(X|C2) = P(x1|C2) × P(x2|C2) × … × P(xn|C2)
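Each factor P(xk|Ci) is a simple count ratio over the training set. A sketch continuing the DATA encoding above (cond_prob is an illustrative helper, not from the slides):

```python
# Estimate P(value | class) by counting, per the categorical rule above.
def cond_prob(attr, value, cls, data):
    in_class = [t for t in data if t["buys"] == cls]
    return sum(t[attr] == value for t in in_class) / len(in_class)

print(cond_prob("age", "<=30", "yes", DATA))  # 2/9 ~ 0.222
print(cond_prob("age", "<=30", "no",  DATA))  # 3/5 = 0.600
```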
Naïve Bayes Classifier: Conditional Probabilities (Age)
Class:
C1:buys_computer = ‘yes’ C2:buys_computer = ‘no’
Data to be classified:
X = (age <=30, Income = medium, Student = yes, Credit_rating = Fair)
Age     Buys Computer   Count   Total   P(Age | Buys Computer)
<=30    Yes             2       9       2/9 ≈ 0.222
<=30    No              3       5       3/5 = 0.600
31…40   Yes             4       9       4/9 ≈ 0.444
31…40   No              0       5       0/5 = 0.000
>40     Yes             3       9       3/9 ≈ 0.333
>40     No              2       5       2/5 = 0.400
Naïve Bayes Classifier: Conditional Probabilities (Income)
Class:
C1:buys_computer = ‘yes’ C2:buys_computer = ‘no’
Data to be classified:
X = (age <=30, Income = medium, Student = yes, Credit_rating = Fair)
Income   Buys Computer   Count   Total   P(Income | Buys Computer)
High     Yes             2       9       2/9 ≈ 0.222
High     No              2       5       2/5 = 0.400
Medium   Yes             4       9       4/9 ≈ 0.444
Medium   No              2       5       2/5 = 0.400
Low      Yes             3       9       3/9 ≈ 0.333
Low      No              1       5       1/5 = 0.200
Naïve Bayes Classifier: Conditional Probabilities (Student)
Class:
C1:buys_computer = ‘yes’ C2:buys_computer = ‘no’
Data to be classified:
X = (age <=30, Income = medium, Student = yes, Credit_rating = Fair)
Student   Buys Computer   Count   Total   P(Student | Buys Computer)
Yes       Yes             6       9       6/9 ≈ 0.667
Yes       No              1       5       1/5 = 0.200
No        Yes             3       9       3/9 ≈ 0.333
No        No              4       5       4/5 = 0.800
Naïve Bayes Classifier: Conditional Probabilities (Credit Rating)
Class:
C1:buys_computer = ‘yes’ C2:buys_computer = ‘no’
Data to be classified:
X = (age <=30, Income = medium, Student = yes, Credit_rating = Fair)
Credit Rating   Buys Computer   Count   Total   P(Credit Rating | Buys Computer)
Fair            Yes             6       9       6/9 ≈ 0.667
Fair            No              2       5       2/5 = 0.400
Excellent       Yes             3       9       3/9 ≈ 0.333
Excellent       No              3       5       3/5 = 0.600
Naïve Bayes Classifier: An Example
Class:
C1:buys_computer = ‘yes’ C2:buys_computer = ‘no’
◼ Compute P(X|Ci) for each class
P(X|C1) = P(X|buys_computer = “yes”)
= P(age <=30 | yes) × P(income = medium | yes) × P(student = yes | yes) × P(credit_rating = fair | yes)
= 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X|C2) = P(X|buys_computer = “no”)
= 0.6 × 0.4 × 0.2 × 0.4 = 0.019
Naïve Bayes Classifier: An Example
Class:
C1:buys_computer = ‘yes’ C2:buys_computer = ‘no’
◼ Compute P(X|Ci) * P(Ci) for each class
P(X|C1) * P(C1) = 0.044 * 0.643 = 0.028
P(X|C2) * P(C2) = 0.019 * 0.357 = 0.007
◼ Decision
P(X|C1) P(C1) = 0.028 > P(X|C2) P(C2) = 0.007
Therefore, X belongs to class C1, i.e., “buys_computer = yes”
(the sketch below ties the steps together)
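An end-to-end sketch of the prediction, reusing DATA, X, and cond_prob from the earlier sketches:

```python
from math import prod

# Naive Bayes prediction: pick the class maximizing P(X|Ci) * P(Ci).
def predict(x, data, target="buys"):
    scores = {}
    for cls in {t[target] for t in data}:
        prior = sum(t[target] == cls for t in data) / len(data)              # P(Ci)
        likelihood = prod(cond_prob(a, v, cls, data) for a, v in x.items())  # P(X|Ci)
        scores[cls] = likelihood * prior                                     # P(X|Ci) * P(Ci)
    return max(scores, key=scores.get), scores

label, scores = predict(X, DATA)
print(scores)  # {'yes': ~0.028, 'no': ~0.007}
print(label)   # 'yes'
```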
Solved Example on Bayes Theorem
◼ Researchers investigated the effectiveness of using the
Hologic Sahara Sonometer, a portable device that
measures bone mineral density (BMD) in the ankle, in
predicting a fracture. They used a Hologic estimated
bone mineral density value of 0.57 as a cutoff. The
investigation yielded the following data (2×2 table
recovered from the calculations on the next slides):

                     Fracture (+D)   No Fracture (−D)   Total
Test positive (+T)        214              670            884
Test negative (−T)         73              330            403
Total                     287             1000           1287
Solved Example on Bayes Theorem
a) Calculate the sensitivity of using a BMD value of 0.57
as a cutoff value for predicting fracture.
b) Calculate the specificity of using a BMD value of 0.57
as a cutoff value for predicting fracture.
c) If it is estimated that 10 percent of the U.S.
population have a confirmed bone fracture, what is
the predictive value positive of using a BMD value of
0.57 as a cutoff value for predicting fracture? That is,
we wish to estimate the probability that a subject
who tests positive at the 0.57 cutoff has a confirmed
bone fracture.
Solved Example on Bayes Theorem
a) Sensitivity = P(+T | +D) = 214/287 = 0.7456 = 74.56%
b) Specificity = P(−T | −D) = 330/1000 = 0.33 = 33%
c) Predictive value positive:
P(+D | +T) = P(+T | +D) P(+D) / P(+T)
Solved Example on Bayes Theorem
c) Predictive value positive
P(+T) = P(+T | +D) P(+D) + P(+T | −D) P(−D)
= (214/287)(0.1) + (670/1000)(0.9) = 0.6776
◼ P(+D | +T) = P(+T | +D) P(+D) / P(+T) = (0.7456 × 0.1) / 0.6776 ≈ 0.11
Thus, even with a positive test, the estimated probability of a
confirmed fracture is only about 11%, because the assumed
prevalence P(+D) is low.
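The same computation as a Python sketch, using the counts from the reconstructed 2×2 table and the assumed 10 percent prevalence:

```python
# Predictive value positive via Bayes' theorem.
sensitivity = 214 / 287       # P(+T | +D)
false_pos   = 670 / 1000      # P(+T | -D) = 1 - specificity
prior_d     = 0.10            # P(+D), assumed prevalence of fracture

p_t = sensitivity * prior_d + false_pos * (1 - prior_d)  # total probability, P(+T)
ppv = sensitivity * prior_d / p_t                        # Bayes: P(+D | +T)
print(round(p_t, 4), round(ppv, 2))                      # 0.6776 0.11
```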