Classification in Machine Learning
Lecture 07: Discussion
Establishing the prayer means fixing it where it needs repair & maintaining it once you do.
The prayer is our reminder that we’ll all stand before Allah on Judgment Day.
Agenda:
• A Quick Recap (Important Concepts)
• Naïve Bayes Classifier
• Principle of Naïve Bayes
• Bayes’ Theorem
• Why Bayes Classification
• Example
• Advantages and Disadvantages
• Conclusion
What is a Classifier?
A classifier is a machine learning model that is used to discriminate between different objects based on certain features.
What is a Naïve Bayes Classifier?
Naive Bayes is a supervised learning algorithm used for classification tasks; hence, it is also called the Naive Bayes Classifier.
Principle of Naive Bayes Classifier:
A Naive Bayes classifier is a probabilistic machine learning model that is used for classification tasks. The crux of the classifier is based on Bayes’ theorem.
Why Naïve?
• Using Bayes theorem, we can find the
probability of A happening, given that B has
occurred.
• Here, B is the evidence and A is the
hypothesis.
• The assumption made here is that the
predictors/features are independent.
• That is, the presence of one particular feature does not affect the others. Hence, it is called naïve.
A Quick Recap:
• Probability simply means the likelihood of an event to occur and
always takes a value between 0 and 1 (0 and 1 inclusive).
• Conditional probability is the likelihood that an event A occurs given that another event related to A has already occurred.
• The probability of event A given that event B has occurred is denoted as p(A|B).
• Joint probability is the probability of two events occurring together and is denoted as p(A and B). It can be written as:
• p(A and B) = p(A).p(B) ……… (1) Independent events
• p(A and B) = p(A).p(B|A) ……… (2) Dependent events
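As a small illustration of equations (1) and (2), here is a minimal Python sketch; the probability values are assumed purely for illustration and are not taken from the lecture:

# Equations (1) and (2) with illustrative (assumed) numbers.
p_A = 0.3          # p(A)
p_B = 0.5          # p(B)
p_B_given_A = 0.4  # p(B|A)

# (1) Independent events: p(A and B) = p(A) * p(B)
joint_independent = p_A * p_B          # 0.15

# (2) Dependent events: p(A and B) = p(A) * p(B|A)
joint_dependent = p_A * p_B_given_A    # 0.12

print(joint_independent, joint_dependent)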
Bayes’ Theorem
We will start with the fact that joint probability is commutative for
any two events. That is:
p(A and B) = p(B and A) ……… (3)
From equation 2, we know that:
p(A and B) = p(A).p(B|A)
p(B and A) = p(B).p(A|B)
We can rewrite equation 3 as:
p(A).p(B|A) = p(B).p(A|B)
Dividing both sides by p(B) gives us Bayes’ Theorem:
p(A|B) = p(B|A).p(A) / p(B)
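A quick numeric check of the theorem in Python, reusing the same illustrative (assumed) values as above:

# Bayes' theorem: p(A|B) = p(B|A) * p(A) / p(B)
p_A = 0.3
p_B = 0.5
p_B_given_A = 0.4

p_A_given_B = p_B_given_A * p_A / p_B   # 0.24

# The joint probability is the same whichever way it is factored (equation 3):
assert abs(p_A * p_B_given_A - p_B * p_A_given_B) < 1e-12
print(p_A_given_B)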
Example:
• Let us take an example to get a better intuition.
• Consider the problem of playing golf.
Example:
• We classify whether the day is suitable for playing golf,
given the features of the day.
• If we take the first row of the dataset, we can observe that the day is not suitable for playing golf if outlook = rainy, temperature = hot, humidity = high and windy = false.
• Assumption I: we consider that these predictors are independent
- if the temperature is hot, it does not necessarily mean that the humidity is high
• Assumption II: all the predictors have an equal effect on the outcome
- the day being windy is not more important than the other features in deciding whether to play golf or not
Example:
• According to this example, Bayes’ theorem can be rewritten as:
P(y|X) = P(X|y).P(y) / P(X)
• The variable y is the class variable (play golf), which represents whether it is suitable to play golf or not given the conditions. The variable X represents the parameters/features.
• X is given as:
X = (x_1, x_2, …, x_n)
• Here x_1, x_2, …, x_n represent the features, i.e. they can be mapped to outlook, temperature, humidity and windy. By substituting for X and expanding using the chain rule together with the independence assumption, we get:
P(y|x_1, …, x_n) = P(x_1|y).P(x_2|y) … P(x_n|y).P(y) / (P(x_1).P(x_2) … P(x_n))
Example:
• Now, you can obtain the values for each term by looking at the dataset and substituting them into the equation.
• For all entries in the dataset, the denominator does not change; it remains constant. Therefore, the denominator can be removed and a proportionality introduced:
P(y|x_1, …, x_n) ∝ P(y).P(x_1|y).P(x_2|y) … P(x_n|y)
• In our case, the class variable (y) has only two outcomes, yes or no. There could be cases where the classification is multiclass. Therefore, we need to find the class y with maximum probability:
y = argmax_y P(y).∏ P(x_i|y)
• Using the above function, we can obtain the class, given the predictors, as sketched below.
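Below is a minimal Python sketch of this counting-based computation. The play-golf table is not reproduced in these notes, so the rows used here are assumed for illustration only (14 rows, 9 “yes” and 5 “no”, with the first row matching the one described above):

from collections import Counter, defaultdict

# Assumed play-golf table: (outlook, temperature, humidity, windy, play)
data = [
    ("rainy", "hot", "high", False, "no"),
    ("rainy", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("sunny", "mild", "high", False, "yes"),
    ("sunny", "cool", "normal", False, "yes"),
    ("sunny", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("rainy", "mild", "high", False, "no"),
    ("rainy", "cool", "normal", False, "yes"),
    ("sunny", "mild", "normal", False, "yes"),
    ("rainy", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("sunny", "mild", "high", True, "no"),
]

# Count the classes for P(y) and the per-feature values for P(x_i | y).
class_counts = Counter(row[-1] for row in data)
feature_counts = defaultdict(Counter)   # key: (feature index, class)
for row in data:
    y = row[-1]
    for i, value in enumerate(row[:-1]):
        feature_counts[(i, y)][value] += 1

def predict(x):
    """Return the class y that maximises P(y) * prod_i P(x_i | y)."""
    best_class, best_score = None, -1.0
    for y, n_y in class_counts.items():
        score = n_y / len(data)                            # P(y)
        for i, value in enumerate(x):
            score *= feature_counts[(i, y)][value] / n_y   # P(x_i | y)
        if score > best_score:
            best_class, best_score = y, score
    return best_class

print(predict(("sunny", "cool", "high", True)))

Note that a feature value never seen with a class makes one of the counts zero and the whole product zero; this is the zero-probability problem discussed later.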
Bayesian Classification: Why?
• A statistical classifier: performs probabilistic prediction, i.e.,
predicts class membership probabilities
• Foundation: Based on Bayes’ Theorem.
• Performance: A simple Bayesian classifier, naïve Bayesian classifier,
has comparable performance with decision tree and selected neural
network classifiers
• Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct — prior
knowledge can be combined with observed data
• Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision making
against which other methods can be measured
Naïve Bayes Classifier: Training Dataset
age          income   student   credit_rating   buys_computer
youth        high     no        fair            no
youth        high     no        excellent       no
middle-aged  high     no        fair            yes
senior       medium   no        fair            yes
senior       low      yes       fair            yes
senior       low      yes       excellent       no
middle-aged  low      yes       excellent       yes
youth        medium   no        fair            no
youth        low      yes       fair            yes
senior       medium   yes       fair            yes
youth        medium   yes       excellent       yes
middle-aged  medium   no        excellent       yes
middle-aged  high     yes       fair            yes
senior       medium   no        excellent       no

Naïve Bayes Classifier: An Example
• P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14 = 0.357
• Compute P(X|Ci) for each class
P(age = “youth” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “youth ” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
• X = (age = youth, income = medium, student = yes, credit_rating = fair)
P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
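The same figures can be checked with a few lines of Python; the numbers below simply reproduce the hand computation above:

# X = (age = youth, income = medium, student = yes, credit_rating = fair)
p_yes, p_no = 9 / 14, 5 / 14

p_x_given_yes = (2 / 9) * (4 / 9) * (6 / 9) * (6 / 9)   # ≈ 0.044
p_x_given_no = (3 / 5) * (2 / 5) * (1 / 5) * (2 / 5)    # ≈ 0.019

print(p_x_given_yes * p_yes)   # ≈ 0.028
print(p_x_given_no * p_no)     # ≈ 0.007  -> predict buys_computer = "yes"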
Avoiding the Zero-Probability Problem
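A common remedy here is the Laplacian (add-one) correction: because the naïve Bayes prediction multiplies all the conditional probabilities together, a single zero count forces the whole product to zero, so 1 is added to every count. A minimal sketch, assuming categorical features:

def smoothed_conditional(count_value_and_class, count_class, n_values):
    """Laplacian (add-one) estimate of P(x_i = value | y).

    count_value_and_class : training rows with this feature value and this class
    count_class           : training rows with this class
    n_values              : number of distinct values this feature can take
    """
    return (count_value_and_class + 1) / (count_class + n_values)

# Example: a value never seen with class "yes" no longer gets probability 0.
print(smoothed_conditional(0, 9, 3))   # 1/12 ≈ 0.083 instead of 0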
Advantages and Disadvantages:
Conclusion:
• Naive Bayes algorithms are mostly used in sentiment analysis, spam filtering, recommendation systems, etc.
• They are fast and easy to implement, but their biggest disadvantage is the requirement that the predictors be independent.
• In most real-life cases, the predictors are dependent, which hinders the performance of the classifier.
Thank you
Any Questions?