Naïve Bayes Classifier
Generative vs. Discriminative Classifiers
Training classifiers involves estimating f: X → Y, or P(Y|X)
Discriminative classifiers:
1. Assume some functional form for P(Y|X)
2. Estimate parameters of P(Y|X) directly from training data
Generative classifiers (also called 'informative' by Rubinstein & Hastie):
1. Assume some functional form for P(X|Y), P(Y)
2. Estimate parameters of P(X|Y), P(Y) directly from training data
3. Use Bayes rule to calculate P(Y|X = x_i)
Bayes Formula
[Figure: Bayes rule relating posterior, likelihood, prior, and evidence]
Generative Model
[Figure: a generative model over the features Color, Size, Texture, Weight, ...]
Discriminative Model
[Figure: a discriminative model (e.g., Logistic Regression) mapping the features Color, Size, Texture, Weight, ... to the class]
Comparison
• Generative models
  – Assume some functional form for P(X|Y), P(Y)
  – Estimate parameters of P(X|Y), P(Y) directly from training data
  – Use Bayes rule to calculate P(Y|X = x)
• Discriminative models
  – Directly assume some functional form for P(Y|X)
  – Estimate parameters of P(Y|X) directly from training data
Probability Basics
• Prior, conditional and joint probability for random variables
  – Prior probability: P(X)
  – Conditional probability: P(X_1|X_2), P(X_2|X_1)
  – Joint probability: X = (X_1, X_2), P(X) = P(X_1, X_2)
  – Relationship: P(X_1, X_2) = P(X_2|X_1) P(X_1) = P(X_1|X_2) P(X_2)
  – Independence: P(X_2|X_1) = P(X_2), P(X_1|X_2) = P(X_1), P(X_1, X_2) = P(X_1) P(X_2)
• Bayesian Rule
  P(C|X) = P(X|C) P(C) / P(X)
  (Posterior = Likelihood × Prior / Evidence)
Probability Basics
• Quiz: We have two fair six-sided dice. When they are rolled, the following events may occur: (A) die 1 lands on side "3", (B) die 2 lands on side "1", and (C) the two dice sum to eight. Answer the following questions:
  1) P(A) = ?
  2) P(B) = ?
  3) P(C) = ?
  4) P(A|B) = ?
  5) P(C|A) = ?
  6) P(A, B) = ?
  7) P(A, C) = ?
  8) Is P(A, C) equal to P(A) P(C)?
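A quick enumeration over the 36 equally likely outcomes of the two dice confirms the answers (a small Python sketch; the helper names are just illustrative):

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes (die1, die2) of two fair six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event, given as a predicate over (die1, die2)."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

A = lambda o: o[0] == 3            # die 1 lands on "3"
B = lambda o: o[1] == 1            # die 2 lands on "1"
C = lambda o: o[0] + o[1] == 8     # the two dice sum to eight

P_A, P_B, P_C = prob(A), prob(B), prob(C)
P_AB = prob(lambda o: A(o) and B(o))
P_AC = prob(lambda o: A(o) and C(o))

print(P_A, P_B, P_C)               # 1/6, 1/6, 5/36
print(P_AB / P_B)                  # P(A|B) = 1/6
print(P_AC / P_A)                  # P(C|A) = 1/6
print(P_AB, P_AC)                  # 1/36, 1/36
print(P_AC == P_A * P_C)           # False: A and C are not independent
```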
Probabilistic Classification
• Establishing a probabilistic model for classification
  – Discriminative model: P(C|X), where C = c_1, ..., c_L and X = (X_1, ..., X_n)
    [Figure: a single discriminative probabilistic classifier maps an input x = (x_1, x_2, ..., x_n) to the posteriors P(c_1|x), P(c_2|x), ..., P(c_L|x)]
Probabilistic Classification
• Establishing a probabilistic model for classification (cont.)
  – Generative model: P(X|C), where C = c_1, ..., c_L and X = (X_1, ..., X_n)
    [Figure: one generative probabilistic model per class (Class 1, Class 2, ..., Class L); each maps an input x = (x_1, x_2, ..., x_n) to its likelihood P(x|c_1), P(x|c_2), ..., P(x|c_L)]
Probabilistic Classification
• MAP classification rule
  – MAP: Maximum A Posteriori
  – Assign x to c* if
    P(C = c*|X = x) > P(C = c|X = x) for all c ≠ c*, c = c_1, ..., c_L
• Generative classification with the MAP rule
  – Apply Bayes rule to convert the likelihoods into posterior probabilities:
    P(C = c_i|X = x) = P(X = x|C = c_i) P(C = c_i) / P(X = x)
                     ∝ P(X = x|C = c_i) P(C = c_i), for i = 1, 2, ..., L
  – Then apply the MAP rule (see the sketch below)
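As a minimal sketch of generative classification with the MAP rule (the `prior` and `likelihood` arguments are hypothetical stand-ins for whatever estimates the generative model provides):

```python
def map_classify(x, classes, prior, likelihood):
    """Assign x to the class c* maximizing P(X=x|C=c) * P(C=c).

    prior[c] is the estimate of P(C=c); likelihood(x, c) returns P(X=x|C=c).
    The evidence P(X=x) is the same for every class, so it can be dropped.
    """
    return max(classes, key=lambda c: likelihood(x, c) * prior[c])
```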
Naïve Bayes
• Bayes classification
  P(C|X) ∝ P(X|C) P(C) = P(X_1, ..., X_n|C) P(C)
  Difficulty: learning the joint probability P(X_1, ..., X_n|C)
• Naïve Bayes classification
  – Assumption: all input attributes are conditionally independent given the class!
    P(X_1, X_2, ..., X_n|C) = P(X_1|X_2, ..., X_n; C) P(X_2, ..., X_n|C)
                            = P(X_1|C) P(X_2, ..., X_n|C)
                            = P(X_1|C) P(X_2|C) ... P(X_n|C)
  – MAP classification rule: for x = (x_1, x_2, ..., x_n), assign the label c* if
    [P(x_1|c*) ... P(x_n|c*)] P(c*) > [P(x_1|c) ... P(x_n|c)] P(c), for all c ≠ c*, c = c_1, ..., c_L
Naïve Bayes
• Naïve Bayes Algorithm (for discrete input attributes)
  – Learning Phase: Given a training set S,
      For each target value c_i (c_i = c_1, ..., c_L)
        P̂(C = c_i) ← estimate P(C = c_i) with examples in S;
        For every attribute value x_jk of each attribute X_j (j = 1, ..., n; k = 1, ..., N_j)
          P̂(X_j = x_jk|C = c_i) ← estimate P(X_j = x_jk|C = c_i) with examples in S;
      Output: conditional probability tables; for each X_j, N_j × L elements
  – Test Phase: Given an unknown instance X' = (a_1, ..., a_n),
      look up the tables to assign the label c* to X' if
      [P̂(a_1|c*) ... P̂(a_n|c*)] P̂(c*) > [P̂(a_1|c) ... P̂(a_n|c)] P̂(c), for all c ≠ c*, c = c_1, ..., c_L
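A minimal sketch of the learning and test phases for discrete attributes, assuming simple counting estimates and no smoothing (class and method names are illustrative only):

```python
from collections import Counter, defaultdict

class DiscreteNaiveBayes:
    def fit(self, X, y):
        """X: list of attribute tuples, y: list of class labels (the training set S)."""
        n = len(y)
        self.class_counts = Counter(y)
        # P^(C = c_i): relative frequency of each class in S
        self.prior = {c: cnt / n for c, cnt in self.class_counts.items()}
        # P^(X_j = x_jk | C = c_i): per-class counts of each attribute value
        self.cond = defaultdict(Counter)
        for xs, c in zip(X, y):
            for j, v in enumerate(xs):
                self.cond[(c, j)][v] += 1
        return self

    def predict(self, xs):
        """MAP rule: pick the class maximizing P^(a_1|c) ... P^(a_n|c) * P^(c)."""
        def score(c):
            s = self.prior[c]
            for j, v in enumerate(xs):
                s *= self.cond[(c, j)][v] / self.class_counts[c]
            return s
        return max(self.prior, key=score)
```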
Example
• Example: Play Tennis (training set of 14 examples; data table not shown)
Example
• Learning Phase
  Outlook    Play=Yes  Play=No      Temperature  Play=Yes  Play=No
  Sunny      2/9       3/5          Hot          2/9       2/5
  Overcast   4/9       0/5          Mild         4/9       2/5
  Rain       3/9       2/5          Cool         3/9       1/5

  Humidity   Play=Yes  Play=No      Wind         Play=Yes  Play=No
  High       3/9       4/5          Strong       3/9       3/5
  Normal     6/9       1/5          Weak         6/9       2/5

  P(Play=Yes) = 9/14
  P(Play=No) = 5/14
Example
• Test Phase
  – Given a new instance,
    x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
  – Look up the tables:
    P(Outlook=Sunny|Play=Yes) = 2/9        P(Outlook=Sunny|Play=No) = 3/5
    P(Temperature=Cool|Play=Yes) = 3/9     P(Temperature=Cool|Play=No) = 1/5
    P(Humidity=High|Play=Yes) = 3/9        P(Humidity=High|Play=No) = 4/5
    P(Wind=Strong|Play=Yes) = 3/9          P(Wind=Strong|Play=No) = 3/5
    P(Play=Yes) = 9/14                     P(Play=No) = 5/14
  – MAP rule
    P(Yes|x') ∝ [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) = 0.0053
    P(No|x') ∝ [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) = 0.0206
    Given the fact P(Yes|x') < P(No|x'), we label x' to be "No".
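Plugging the looked-up probabilities into the MAP rule reproduces the two scores (a small check in Python):

```python
from fractions import Fraction as F

# Unnormalized posteriors for x' = (Sunny, Cool, High, Strong)
score_yes = F(2, 9) * F(3, 9) * F(3, 9) * F(3, 9) * F(9, 14)
score_no  = F(3, 5) * F(1, 5) * F(4, 5) * F(3, 5) * F(5, 14)

print(float(score_yes))  # ~0.0053
print(float(score_no))   # ~0.0206
print("No" if score_no > score_yes else "Yes")  # "No"
```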
Relevant Issues
• Violation of the Independence Assumption
  – For many real-world tasks, P(X_1, ..., X_n|C) ≠ P(X_1|C) ... P(X_n|C)
  – Nevertheless, naïve Bayes works surprisingly well anyway!
• Zero Conditional Probability Problem
  – If no training example contains the attribute value X_j = a_jk, then P̂(X_j = a_jk|C = c_i) = 0
  – In this circumstance, P̂(x_1|c_i) ... P̂(a_jk|c_i) ... P̂(x_n|c_i) = 0 during test
  – As a remedy, conditional probabilities are estimated with the m-estimate:
    P̂(X_j = a_jk|C = c_i) = (n_c + m p) / (n + m)
    n_c: number of training examples for which X_j = a_jk and C = c_i
    n: number of training examples for which C = c_i
    p: prior estimate (usually, p = 1/t for t possible values of X_j)
    m: weight given to the prior (number of "virtual" examples, m ≥ 1)
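A small sketch of this m-estimate as a function (the function name and the example numbers are illustrative):

```python
def m_estimate(n_c, n, p, m):
    """Smoothed estimate of P(X_j = a_jk | C = c_i).

    n_c: count of training examples with X_j = a_jk and C = c_i
    n:   count of training examples with C = c_i
    p:   prior estimate, e.g. 1/t for t possible values of X_j
    m:   weight given to the prior ("virtual" examples, m >= 1)
    """
    return (n_c + m * p) / (n + m)

# Example: an unseen value (n_c = 0) no longer forces the whole product to zero.
print(m_estimate(0, 9, 1/3, 3))  # 0.0833..., instead of 0
```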
Relevant Issues
• Continuous-valued Input Attributes
  – An attribute can take on a continuum of values
  – The conditional probability is modeled with the normal distribution:
    P̂(X_j|C = c_i) = 1/(√(2π) σ_ji) · exp(−(X_j − μ_ji)² / (2 σ_ji²))
    μ_ji: mean (average) of the attribute values X_j of the examples for which C = c_i
    σ_ji: standard deviation of the attribute values X_j of the examples for which C = c_i
  – Learning Phase: for X = (X_1, ..., X_n), C = c_1, ..., c_L
    Output: n × L normal distributions and P(C = c_i), i = 1, ..., L
  – Test Phase: for X' = (x_1, ..., x_n)
    • Calculate conditional probabilities with all the normal distributions
    • Apply the MAP rule to make a decision
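A minimal Gaussian naïve Bayes sketch along these lines, assuming per-class means and standard deviations are estimated from the training data (all names are illustrative):

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate P(C=c_i) and (mu_ji, sigma_ji) for every attribute X_j and class c_i."""
    by_class = defaultdict(list)
    for xs, c in zip(X, y):
        by_class[c].append(xs)
    prior, params = {}, {}
    for c, rows in by_class.items():
        prior[c] = len(rows) / len(y)
        params[c] = []
        for col in zip(*rows):  # values of one attribute for class c
            mu = sum(col) / len(col)
            sigma = (sum((v - mu) ** 2 for v in col) / len(col)) ** 0.5
            params[c].append((mu, sigma))  # assumes sigma > 0 for every attribute/class
    return prior, params

def gaussian(x, mu, sigma):
    """Normal density evaluated at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def predict_gaussian_nb(xs, prior, params):
    """MAP rule with Gaussian class-conditional densities."""
    return max(prior, key=lambda c: prior[c] *
               math.prod(gaussian(x, mu, s) for x, (mu, s) in zip(xs, params[c])))
```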
Conclusions
• Naïve Bayes is based on the independence assumption
  – Training is very easy and fast; it only requires considering each attribute in each class separately
  – Testing is straightforward; it only requires looking up tables or calculating conditional probabilities with normal distributions
• A popular generative model
  – Performance is competitive with most state-of-the-art classifiers even when the independence assumption is violated
  – Many successful applications, e.g., spam mail filtering
  – A good candidate as a base learner in ensemble learning
  – Apart from classification, naïve Bayes can do more...
Extra Slides
Naïve Bayes (1)
• Revisit: P(C|X) = P(X|C) P(C) / P(X)
• Which is equal to P(X_1, ..., X_n|C) P(C) / P(X)
• Naïve Bayes assumes conditional independence: P(X_1, ..., X_n|C) = P(X_1|C) ... P(X_n|C)
• Then the inference of the posterior is P(C|X) ∝ P(C) P(X_1|C) ... P(X_n|C)
Naïve Bayes (2)
• Training: the observations are multinomial; learning is supervised, with label information
  – Maximum Likelihood Estimation (MLE)
  – Maximum a Posteriori (MAP): put a Dirichlet prior on the multinomial parameters
• Classification: apply the MAP rule with the estimated probabilities
Naïve Bayes (3)
• What if we have continuous X_i?
• Generative training: estimate a class-conditional density for each X_i (e.g., the per-class normal distributions above)
• Prediction: apply the MAP rule with the estimated densities
Naïve Bayes (4)
• Problems
  – Features may overlap
  – Features may not be independent
    • e.g., the size and weight of a tiger
  – A joint distribution estimate (P(X|Y), P(Y)) is used to solve a conditional problem (P(Y|X = x))
• Can we train discriminatively? (see the sketch below)
  – Logistic regression
  – Regularization
  – Gradient ascent
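For contrast with the generative training above, a tiny sketch of discriminative training: binary logistic regression for P(Y|X), fit by gradient ascent on an L2-regularized log-likelihood (all names, the learning rate, and the regularization strength are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(X, y, lr=0.1, lam=0.01, epochs=200):
    """Gradient ascent on sum_i [y_i log p_i + (1 - y_i) log(1 - p_i)] - lam/2 * ||w||^2."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [-lam * wj for wj in w]   # gradient of the L2 penalty
        grad_b = 0.0
        for xs, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xs)) + b)
            for j, xj in enumerate(xs):
                grad_w[j] += (yi - p) * xj  # d(log-likelihood)/dw_j
            grad_b += yi - p
        w = [wj + lr * g for wj, g in zip(w, grad_w)]
        b += lr * grad_b
    return w, b

def predict_prob(xs, w, b):
    """P(Y=1 | X=xs) modeled directly, with no model of P(X|Y)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, xs)) + b)
```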