Theory of probability
Definition of Probability
Let A be an event of Ω. If, in n(Ω) repetitions of the experiment, the event A occurs n(A) times, then the frequency ratio of A is f(A) = n(A)/n(Ω).
If the random experiment Ω is repeated a large number of times under identical or uniform conditions, then the frequency ratio of A will be approximately equal to its probability, i.e., f(A) ≈ P(A), with P(A) = lim f(A) as n(Ω) → ∞.
Thus, f(A) can be taken to be an experimentally measured value of the idealised number P(A).
The longer the sequence of repetitions of Ω, the more accurate the measured value.
The definition of probability in terms of the frequency interpretation restricts the class of random experiments: the random experiment must be repeatable a large number of times under uniform conditions.
Example-1
What is the probability that a positive integer selected
at random from the set of positive integers not
exceeding 100 is divisible by (i) 5, (ii) 5 or 3, (iii) 5 and 3?
Solution: Ω = {1, 2, …, 100}, so n(Ω) = 100.
Let A be the event that the number is divisible by 5, so A = {5, 10, …, 100} and n(A) = 20. So, P(A) = n(A)/n(Ω) = 20/100 = 1/5.
Let B be the event that the number is divisible by 3, so B = {3, 6, …, 99} and n(B) = 33. So, P(B) = n(B)/n(Ω) = 33/100.
Let C be the event that the number is divisible by 5 or 3, so C = A ∪ B and n(A ∪ B) = 47. So, P(C) = 47/100.
Let D be the event that the number is divisible by 5 and 3, so D = A ∩ B and n(A ∩ B) = 6. So, P(D) = 6/100 = 3/50.
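These counts can be verified by direct enumeration; a minimal Python sketch (the variable names are illustrative):

```python
# Verify the counts used in Example-1 by direct enumeration.
omega = range(1, 101)  # positive integers not exceeding 100

n_A = sum(1 for x in omega if x % 5 == 0)                 # divisible by 5
n_B = sum(1 for x in omega if x % 3 == 0)                 # divisible by 3
n_C = sum(1 for x in omega if x % 5 == 0 or x % 3 == 0)   # divisible by 5 or 3
n_D = sum(1 for x in omega if x % 15 == 0)                # divisible by 5 and 3

print(n_A, n_B, n_C, n_D)                 # 20 33 47 6
print(n_A / 100, n_C / 100, n_D / 100)    # 0.2, 0.47, 0.06
```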
Deduction of some important
rules
1. For any event A, 0 ≤ n(A) ≤ n(Ω), so 0 ≤ f(A) ≤ 1.
In the limit as n(Ω) → ∞, 0 ≤ P(A) ≤ 1.
2. For the certain event S, n(S) = n(Ω), so f(S) = 1; hence in the limit P(S) = 1.
3. For impossible event O, n(O)=0. Hence,
P(O)=0
4. For an event A, A+ A′=S (certain event)
So, P(A+ A′)=P(S) => P(A) + P(A′)=1
=> P(A)=1-P(A′)
Addition rule for pairwise mutually exclusive
events
For two mutually exclusive events A and B, n(A + B) = n(A ∪ B) = n(A) + n(B).
So, f(A + B) = f(A) + f(B); therefore, in the limit, P(A + B) = P(A) + P(B).
If A, B, C are pairwise mutually exclusive events, then AB = O, BC = O, and CA = O.
Also A(B + C) = AB + AC = O, so A and B + C are mutually exclusive,
so we have P(A + B + C) = P(A) + P(B + C) = P(A) + P(B) + P(C).
In general, if A1, A2, …, An are n pairwise mutually exclusive events, then we have the following addition rule:
P(A1 + A2 + … + An) = P(A1) + P(A2) + … + P(An).
Conditional Probability
Let us consider two events A and B of Ω. Let us make the
hypothesis that the event A has occurred.
Suppose the event A has occurred n(A) times out of n(Ω) repetitions.
Let, among these n(A) occurrences of A, the event B also occur (along with A) n(AB) times.
The ratio n(AB)/n(A) is called the conditional frequency ratio of B on the hypothesis that A has occurred, and is denoted by f(B/A), i.e., f(B/A) = n(AB)/n(A) = [n(AB)/n(Ω)] / [n(A)/n(Ω)] = f(AB)/f(A).
By the empirical or statistical definition, f(AB)/f(A) → P(AB)/P(A) as n(Ω) → ∞.
We assume that this limit exists; it is called the conditional probability of B on the hypothesis that A has occurred.
So, as n(Ω) → ∞, f(B/A) → P(B/A) = P(AB)/P(A), provided P(A) ≠ 0.
Similarly, P(A/B) = P(AB)/P(B), provided P(B) ≠ 0.
Hence, if P(A), P(B) ≠ 0, we have P(AB) = P(A)P(B/A) = P(B)P(A/B).
This is the Multiplication Rule.
Addition rule: P(A + B) = P(A) + P(B) for mutually exclusive events A and B.
Multiplication Rule: P(AB) = P(A)P(B/A) = P(B)P(A/B).
General Addition Rule
Let us consider two events A and B of Ω. In general,
they are not mutually exclusive.
But the events, A-AB, AB, and B-AB are always
pairwise mutually exclusive.
So, A=(A-AB)+AB, B=(B-AB)+AB, and A+B=(A-AB)+AB+
(B-AB)
By Addition rule for mutually exclusive events,
P(A)= P(A-AB)+P(AB), P(B)=P(B-AB)+P(AB), and
P(A+B)=P(A-AB)+P(AB)+P(B-AB)
= P(A)-P(AB)+P(AB)+P(B)-P(AB) = P(A)+P(B)-P(AB)
i.e., P(A+B)= P(A)+P(B)-P(AB)
For three events A, B, and C
P(A+B+C) = P(A+(B+C)) = P(A) + P(B+C) - P(A(B+C))
= P(A) + P(B) + P(C) - P(BC) - P(AB+AC)
= P(A) + P(B) + P(C) - P(BC) - [P(AB) + P(AC) - P(AB·AC)]
= P(A) + P(B) + P(C) - P(BC) - P(AB) - P(AC) + P(AB·AC)
= P(A) + P(B) + P(C) - P(AB) - P(BC) - P(CA) + P(ABC)   [since AB·AC = ABC]
Generalising for n events A1, A2, …, An, we get
P(A1 + A2 + … + An) = Σ P(Ai) - Σ P(AiAj) + Σ P(AiAjAk) - … + (-1)^(n-1) P(A1A2…An),
where the sums run over i, over pairs i < j, over triples i < j < k, and so on.
Examples
1. A coin is tossed 3 times in succession. Find the
probability of (a) 2 heads (b) 2 consecutive
heads
Sol: (a) Here, n(Ω) = 2³ = 8.
Let A be the event that 2 heads occur; then A = {HHT, HTH, THH}, so n(A) = 3.
So, P(A) = n(A)/n(Ω) = 3/8.
(b) Let B be the event that 2 consecutive heads occur; then n(B) = 3 - 1 = 2 [as heads in the first and third positions are not consecutive, HTH is excluded].
So, P(B) = n(B)/n(Ω) = 2/8 = 1/4.
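A quick enumeration of the 2³ = 8 outcomes confirms both answers; a minimal Python sketch, where "two consecutive heads" is read, as in the solution, as exactly two heads standing side by side:

```python
from itertools import product

# Enumerate the 2^3 = 8 outcomes of three coin tosses.
outcomes = ["".join(t) for t in product("HT", repeat=3)]   # 'HHH', 'HHT', ...

# (a) exactly two heads
A = [o for o in outcomes if o.count("H") == 2]

# (b) exactly two heads and they are adjacent (HTH is excluded)
B = [o for o in outcomes if o.count("H") == 2 and "HH" in o]

print(len(A) / len(outcomes))   # 3/8 = 0.375
print(len(B) / len(outcomes))   # 2/8 = 0.25
```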
Examples
2. Two dice are thrown. Find the probability that the sum of the
faces equals or exceeds 10.
Sol: Here, n(Ω) = 6 × 6 = 36.
Let, A, B, and C denote the events ‘Sum 10’, ‘Sum 11’, and
‘Sum 12’ respectively. So, A+B+C is the required event,
where A, B, and C are pairwise mutually exclusive.
So, P(A+B+C) = P(A)+P(B)+P(C).
Now, P(A)= 3/36 as (4,6), (5,5), and (6,4) lie in A,
P(B)= 2/36 as (5,6) and (6,5) lie in B, and
P(C)= 1/36 as only (6,6) lies in C
So, P(A+B+C) = P(A)+P(B)+P(C)=3/36+2/36+1/36=6/36=1/6
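The six favourable outcomes can be checked by enumerating all 36 pairs; a minimal Python sketch:

```python
from itertools import product

# Enumerate the 36 equally likely outcomes of throwing two dice.
outcomes = list(product(range(1, 7), repeat=2))

favourable = [o for o in outcomes if sum(o) >= 10]   # sum is 10, 11 or 12
print(len(favourable), len(outcomes))                # 6 36
print(len(favourable) / len(outcomes))               # 1/6 ≈ 0.1667
```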
Generalisation of Conditional Probability
Frequency ratio: f(B/A) = n(AB)/n(A) = f(AB)/f(A).
For a long sequence of repetitions of the random experiment
under uniform conditions, the conditional frequency ratio
f(B/A) is taken to be an approximate value of the conditional
probability P(B/A).
So, the conditional probability of B on the hypothesis that A has occurred is P(B/A) = P(AB)/P(A),
which gives the multiplication rule: P(AB) = P(A)P(B/A).
For three events A, B, C, we have P(ABC) = P(A)P(B/A)P(C/AB).
Proof: R.H.S. = P(A) · [P(AB)/P(A)] · [P(ABC)/P(AB)] = P(ABC) = L.H.S.
In general, for n events A1, A2, …, An the multiplication rule is:
P(A1A2…An) = P(A1) P(A2/A1) P(A3/A1A2) … P(An/A1A2…A(n-1)).
Examples
1. A die is rolled. If the result is either an even face or a multiple
of 3, then you win. What is the probability that multiple of 3
occurs on the hypothesis that even face occurs?
Sol: Here, Ω = {1, 2, 3, 4, 5, 6}, so n(Ω) = 6.
Let A and B denote the events ‘even face’ and ‘multiple of 3’, respectively. So, B/A is the required event.
So, P(B/A) = P(AB)/P(A).
Now, P(A)= 3/6 as (2), (4), and (6) lie in A,
P(B)= 2/6 as (3) and (6) lie in B
P(AB)= 1/6 as only (6) lies in AB
So, P(B/A) = P(AB)/P(A)=(1/6)/ (3/6)=1/3
Similarly, P(A/B)=P(AB)/P(B)=1/2
Examples
2. Two cards are drawn successively from a pack without
replacing the first. If the first card is a spade, find the
probability that the second card is also a spade.
Sol1: Let A = first card is a spade, B=second card is a spade.
So, AB=both cards are spades.
n(Ω) = 52 × 51 (ordered pairs), n(A) = 13 × 51, n(AB) = 13 × 12.
So, P(B/A) = P(AB)/P(A) = n(AB)/n(A) = (13 × 12)/(13 × 51) = 12/51 = 4/17.
Sol2: When the first card is seen to be a spade, 51 cards remain in the pack, of which 12 are spades.
Hence, the probability that the second card is also a spade is 12/51 = 4/17.
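A counting check of Sol1 in Python (ordered draws without replacement):

```python
# Counting check for Sol1 (ordered draws without replacement).
n_omega = 52 * 51   # ordered pairs of distinct cards
n_A = 13 * 51       # first card is a spade
n_AB = 13 * 12      # both cards are spades

print(n_AB / n_A)   # 12/51 = 4/17 ≈ 0.2353
```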
Bayes’ Theorem
Theorem: If A1, A2, …, An is a given set of n pairwise mutually exclusive events, one of which certainly occurs, i.e.,
A1 + A2 + … + An = S and AiAj = O for i ≠ j,
then for any arbitrary event X,
(i) P(X) = P(A1)P(X/A1) + P(A2)P(X/A2) + … + P(An)P(X/An)
(ii) Bayes’ theorem: if P(X) ≠ 0, P(Ai/X) = P(Ai)P(X/Ai) / [P(A1)P(X/A1) + … + P(An)P(X/An)]
Proof: For any event X, we have X = SX = (A1 + A2 + … + An)X = A1X + A2X + … + AnX.
Since AiAj = O for i ≠ j, we also have (AiX)(AjX) = O for i ≠ j.
Therefore A1X, A2X, …, AnX are pairwise mutually exclusive events, and hence
P(X) = P(A1X) + P(A2X) + … + P(AnX) [Addition Rule]
Since P(AiX) = P(Ai)P(X/Ai) [Multiplication Rule],
therefore P(X) = P(A1)P(X/A1) + P(A2)P(X/A2) + … + P(An)P(X/An). [(i) is proved]
We have already proved that P(X) = P(A1)P(X/A1) + P(A2)P(X/A2) + … + P(An)P(X/An).
Now we have to prove Bayes’ theorem, i.e., if P(X) ≠ 0, P(Ai/X) = P(Ai)P(X/Ai)/P(X).
Proof: P(AiX) = P(X)P(Ai/X) [Multiplication Rule]
Also, P(AiX) = P(Ai)P(X/Ai) [Multiplication Rule]
Hence, if P(X) ≠ 0, P(Ai/X) = P(AiX)/P(X) = P(Ai)P(X/Ai)/P(X) = P(Ai)P(X/Ai) / [P(A1)P(X/A1) + … + P(An)P(X/An)].
Thus Bayes’ theorem is proved.
Example on Bayes’ Theorem
Example-1: There are three identical urns containing white and black balls. The first urn contains 2 white and 3 black balls, the second urn 3 white and 5 black balls, and the third urn 5 white and 2 black balls. An urn is chosen at random, and a ball is drawn from it. If the ball drawn is white, what is the probability that the second urn is chosen?
Solution:
Let A= the event that the ball is drawn from the first urn, B= the
event that the ball is drawn from the second urn, and C= the
event that the ball is drawn from the third urn.
The events A, B, and C are pairwise mutually exclusive, and one of
these necessarily occurs.
P(A) is the probability that 1st urn is chosen, and so on. So,
P(A)=P(B)=P(C)=1/3
Let X=the event that ball drawn is white. So, P(X/A)=2/5,
P(X/B)=3/8, P(X/C)=5/7.
We have to compute P(B/X)
Example-1 Cont…
Compute P(B/X)
P(X)=P(A)P(X/A) + P(B)P(X/B)+P(C)P(X/C)
=(1/3)(2/5)+ (1/3)(3/8)+ (1/3)(5/7)
=(1/3)[2/5+3/8+5/7]
=(1/3)(417/280) = 139/280
So, By Bayes’ theorem:
P(B/X) = [P(X/B) P(B)]/ P(X)
= [(3/8)(1/3)] / (139/280)
=35/139 (Ans.)
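The same calculation as a short Python sketch; the dictionary keys A, B, C mirror the event names used in the solution:

```python
# Bayes' theorem for the three-urn example.
priors = {"A": 1/3, "B": 1/3, "C": 1/3}            # urn chosen at random
white_given_urn = {"A": 2/5, "B": 3/8, "C": 5/7}   # P(white | urn)

# Total probability of drawing a white ball.
p_white = sum(priors[u] * white_given_urn[u] for u in priors)

# Posterior probability that the second urn was chosen, given a white ball.
p_B_given_white = priors["B"] * white_given_urn["B"] / p_white

print(p_white)           # 139/280 ≈ 0.4964
print(p_B_given_white)   # 35/139 ≈ 0.2518
```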
Bayesian Classifier
Bayesian Classification: Why?
A statistical classifier: performs probabilistic
prediction, i.e., predicts class membership
probabilities
Foundation: Based on Bayes’ Theorem.
Performance: A simple Bayesian classifier,
naïve Bayesian classifier, has comparable
performance with decision tree and selected
neural network classifiers
Bayes’ Theorem: Basics
Bayes’ Theorem: P(H|X) = P(X|H)·P(H) / P(X)
Let X be a data sample: class label is unknown
Let H be a hypothesis that X belongs to class C
Classification is to determine P(H|X), (i.e., posteriori
probability): the probability that the hypothesis holds given
the observed data sample X
P(H) (prior probability): the initial probability
E.g., X will buy computer, regardless of age, income,
student, credit_rating.
P(X): probability that sample data is observed
P(X|H) : the probability of observing the sample X, given
that the hypothesis holds
E.g., Given that X will buy computer, the prob. that X is
31..40, medium income
Prediction Based on Bayes’ Theorem
Given training data X, posteriori probability of a
hypothesis H, P(H|X), follows the Bayes’ theorem
P(H|X) = P(X|H)·P(H) / P(X)
Predicts X belongs to Ci iff the probability P(Ci|X) is
the highest among all the P(Ck|X) for all the k
classes
Practical difficulty: It requires initial knowledge of
many probabilities, involving significant
computational cost
Classification Is to Derive the Maximum Posteriori
Let D be a training set of tuples and their
associated class labels, and each tuple is
represented by an n-D attribute vector X = (x1, x2,
…, xn)
Suppose there are m classes C1, C2, …, Cm.
Classification is to derive the maximum posteriori,
i.e., the maximal P(Ci|X)
This can be derived from Bayes’ theorem:
P(Ci|X) = P(X|Ci)·P(Ci) / P(X)
Since P(X) is constant for all classes, only
P(X|Ci)·P(Ci)
needs to be maximized
Naïve Bayes Classifier
A simplified assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):
P(X|Ci) = ∏(k=1..n) P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)
This greatly reduces the computation cost: only the class distribution needs to be counted.
If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk for Ak divided by |Ci,D| (# of tuples of Ci in D).
If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ,
g(x, μ, σ) = (1/(√(2π)·σ)) · e^(-(x-μ)²/(2σ²)),
where P(xk|Ci) = g(xk, μCi, σCi).
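For a continuous-valued attribute, the Gaussian density above translates directly into code; a minimal Python sketch (the numbers in the example call are hypothetical):

```python
import math

def g(x, mu, sigma):
    """Gaussian density g(x, mu, sigma) used to estimate P(xk | Ci)
    for a continuous-valued attribute Ak."""
    return (1.0 / (math.sqrt(2 * math.pi) * sigma)) * \
           math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Hypothetical example: within class Ci the attribute has mean 38 and
# standard deviation 12; likelihood of observing the value 35.
print(g(35, 38, 12))
```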
Naïve Bayes Classifier: Training Dataset
Class: C1: buys_computer = ‘yes’, C2: buys_computer = ‘no’
Data to be classified: X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
>40      medium   no        excellent       no
An Example
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
P(Ci):
P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14 = 0.357
Compute P(X|Ci) for each class:
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
P(X|Ci):
P(X|buys_computer = “yes”) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
P(X|Ci) × P(Ci):
P(X|buys_computer = “yes”) × P(buys_computer = “yes”) = 0.044 × 0.643 = 0.028
P(X|buys_computer = “no”) × P(buys_computer = “no”) = 0.019 × 0.357 = 0.007
Since 0.028 > 0.007, X is classified as buys_computer = “yes”.
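The whole computation can be reproduced by counting over the 14 training tuples; a minimal Python sketch in which the list `data` is an illustrative transcription of the table above:

```python
# Training tuples: (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30",  "high",   "no",  "fair",      "no"),
    ("<=30",  "high",   "no",  "excellent", "no"),
    ("31…40", "high",   "no",  "fair",      "yes"),
    (">40",   "medium", "no",  "fair",      "yes"),
    (">40",   "low",    "yes", "fair",      "yes"),
    (">40",   "low",    "yes", "excellent", "no"),
    ("31…40", "low",    "yes", "excellent", "yes"),
    ("<=30",  "medium", "no",  "fair",      "no"),
    ("<=30",  "low",    "yes", "fair",      "yes"),
    (">40",   "medium", "yes", "fair",      "yes"),
    ("<=30",  "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no",  "excellent", "yes"),
    ("31…40", "high",   "yes", "fair",      "yes"),
    (">40",   "medium", "no",  "excellent", "no"),
]
X = ("<=30", "medium", "yes", "fair")   # tuple to classify

for c in ("yes", "no"):
    rows = [r for r in data if r[-1] == c]
    prior = len(rows) / len(data)                     # P(Ci)
    likelihood = 1.0                                  # P(X|Ci) under independence
    for k, value in enumerate(X):
        likelihood *= sum(1 for r in rows if r[k] == value) / len(rows)
    print(c, round(prior, 3), round(likelihood, 3), round(prior * likelihood, 3))

# yes: 0.643 * 0.044 ≈ 0.028,  no: 0.357 * 0.019 ≈ 0.007  ->  predict "yes"
```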
Avoiding the Zero-Probability Problem
Naïve Bayesian prediction requires each
conditional prob. be non-zero. Otherwise, the
predicted prob. will be zero:
P(X|Ci) = ∏(k=1..n) P(xk|Ci)
Ex. Suppose a dataset with 1000 tuples,
income=low (0), income= medium (990), and
income = high (10)
Use Laplacian correction (or Laplacian
estimator)
Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
The “corrected” prob. estimates are close to
their “uncorrected” counterparts.
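A minimal sketch of the Laplacian correction for the income example above; note that the denominator grows by the number of distinct attribute values (here 3):

```python
# Laplacian correction for the income attribute (1000 tuples in the class).
counts = {"low": 0, "medium": 990, "high": 10}
total = sum(counts.values())                # 1000

# Add 1 to each count; the denominator grows by the number of
# distinct values (3), giving 1003.
corrected = {v: (c + 1) / (total + len(counts)) for v, c in counts.items()}

print(corrected)   # low: 1/1003, medium: 991/1003, high: 11/1003
```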
Naïve Bayes Classifier: Comments
Advantages
Easy to implement
Good results obtained in most of the cases
Disadvantages
Assumption: class conditional independence,
therefore loss of accuracy
Practically, dependencies exist among variables
E.g., hospital patients => Profile: age, family history, etc.; Symptoms: fever, cough, etc.; Disease: lung cancer, diabetes, etc.
Dependencies among these cannot be
modeled by Naïve Bayes Classifier
How to deal with these dependencies? Bayesian Belief Networks
kNN Algorithm
kNN Classifier
The k-nearest neighbours (kNN) algorithm is a type of supervised algorithm that can be used for both classification and regression predictive problems.
There are three categories of learning algorithms:
i) Lazy learning algorithm: kNN is a lazy learning algorithm because it does not have a specialized training phase or model and uses all of the training data at classification time.
ii) Non-parametric learning algorithm: kNN is also a non-parametric learning algorithm because it does not make assumptions about the distribution of the underlying data (as opposed to other algorithms such as the Gaussian Mixture Model (GMM), which assumes a Gaussian distribution of the data).
iii) Eager learning algorithm: Eager learners, when
given a set of training tuples, will construct a
generalization model before receiving new (e.g., test)
tuples to classify.
kNN Classifier
The kNN algorithm begins with a training
dataset made up of examples that are
classified into several categories.
Assume that we have a test dataset
containing unlabeled examples that
otherwise have the same features as the
training data.
For each example (i.e., record) in the test
dataset, kNN identifies k examples in the
training data that are the "nearest" in
similarity, where k is an integer specified in
advance.
The unlabeled test instance is assigned the
class of the majority of the k nearest
neighbors
KNN: Classification Approach
Locating the unlabeled instance’s nearest
neighbors requires a distance function, or a
formula that measures the similarity
between two instances.
There are many different ways to calculate
distance. Traditionally, the kNN algorithm
uses Euclidean distance.
The K-NN algorithm works by finding the K
nearest neighbors to a given data point
based on a distance metric, such as
Euclidean distance.
The class or value of the data point is then
determined by the majority vote (for
classification) or average (for regression) of
the K neighbors.
KNN: Classification Approach
Classified by “MAJORITY VOTES” of its neighbor classes:
assigned to the most common class amongst its K-nearest neighbors.
KNN: Pseudocode
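A minimal Python sketch of the kNN procedure, assuming Euclidean distance and simple majority voting; the function names and the tiny dataset are illustrative:

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two numeric feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # 1. Distance from the query to every training example.
    distances = [(euclidean(x, query), label) for x, label in zip(train_X, train_y)]
    # 2. Keep the k closest examples.
    nearest = sorted(distances, key=lambda d: d[0])[:k]
    # 3. Majority vote over their class labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Tiny illustrative dataset: two numeric features, two classes.
train_X = [(1.0, 1.1), (1.2, 0.9), (3.0, 3.2), (3.1, 2.9)]
train_y = ["A", "A", "B", "B"]
print(knn_classify(train_X, train_y, (1.1, 1.0), k=3))   # -> "A"
```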
Advantages of kNN
kNN algorithm is a versatile and widely used
machine learning algorithm that is primarily
used for its simplicity and ease of
implementation.
It does not require any assumptions about
the underlying data distribution.
It can also handle both numerical and
categorical data, making it a flexible choice
for various types of datasets in classification
and regression tasks.
It is a non-parametric method that makes
predictions based on the similarity of data
points in a given dataset.
Advantages of kNN
K-NN is less sensitive to outliers compared
to other algorithms.
Few Hyperparameters – the only parameters required in the training of a kNN algorithm are the value of k and the choice of the distance metric, which we select based on our evaluation metric.
Disadvantages of the KNN Algorithm
Does not scale – It is a lazy Algorithm. The main
significance of this term is that this takes lots of computing
power as well as data storage. This makes this algorithm both
time-consuming and resource exhausting.
Curse of Dimensionality – The KNN algorithm is affected by
the curse of dimensionality which implies the algorithm faces
a hard time classifying the data points properly when the
dimensionality is too high.
The curse of dimensionality can be particularly problematic in
several ways:
(i) Distance Metrics Become Less Informative: In high-
dimensional spaces, the concept of "distance" becomes less
meaningful because the distances between all pairs of points
tend to become more similar. This reduces the effectiveness
of KNN, which relies on distance metrics to determine the
nearest neighbors.
(ii) Sparsity of Data: As the number of dimensions increases,
the volume of the space grows exponentially. This means that
data points become sparse, making it harder for KNN to find
enough neighbors that are truly representative of the
underlying distribution of the data.
Disadvantages of the KNN Algorithm
(iii) Increased Computational Complexity: The
computational cost of calculating distances between points
increases with dimensionality. This can make KNN inefficient
in practice when dealing with very high-dimensional data.
(iv) Overfitting: In high-dimensional spaces, KNN can become
overly sensitive to noise in the data, leading to overfitting.
This happens because with many dimensions, there’s a
higher likelihood that some of the features will not be
relevant, but they can still influence the nearest neighbor
calculations.
To mitigate these issues, various techniques can be
used:
(i) Dimensionality Reduction
(ii) Feature Selection
By addressing the curse of dimensionality through these methods, you can make KNN more effective even in high-dimensional settings.
Variation In kNN
How to Choose the value of k?
The value of k is very crucial in the KNN
algorithm to define the number of
neighbors in the algorithm.
The value of k in the k-nearest neighbors (k-
NN) algorithm should be chosen based on
the input data. If the input data has more
outliers or noise, a higher value of k would
be better.
It is recommended to choose an odd value
for k to avoid ties in classification.
Value of k
A small k value isn’t suitable for classification.
As a rule of thumb, setting k to the square root of the number of training samples can lead to a better result. If the square root comes out to 10, we can choose either k = 9 or k = 11 just to make sure that k is odd.
Use an error plot or accuracy plot to find the most favorable k value (see the sketch after this list).
kNN performs well with multi-label classes,
but you must be aware of the outliers.
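One way to produce the accuracy plot mentioned above is to evaluate a range of odd k values on a held-out split; a sketch assuming scikit-learn and matplotlib are available, with the Iris data standing in for an actual dataset:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

ks = list(range(1, 40, 2))      # odd values of k only, to avoid ties
accuracies = []
for k in ks:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    accuracies.append(model.score(X_test, y_test))   # test accuracy for this k

plt.plot(ks, accuracies, marker="o")
plt.xlabel("k")
plt.ylabel("test accuracy")
plt.show()
```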
How to Choose the value of k?
• We got an accuracy of 0.41 at k = 37. As we got the minimum error at k = 37, we will get better efficiency at that value of k.
Is Naïve Bayes a lazy learner?
The Naive Bayes algorithm is not a lazy learner. It
is an eager learner. It is different from the nearest
neighbor algorithm.
Real learning takes place in Naive Bayes. The
parameters that are learned in Naive Bayes are
the prior probabilities of different classes, as well
as the likelihood of different features for each
class.
In the test phase, these learned parameters are
used to estimate the probability of each class for
the given sample.
In other words, in Naive Bayes, for each sample in
the test set, the parameters determined during
training are used to estimate the probability of
that sample belonging to different classes.
For example, P(c|x) ∝ P(c) P(x1|c) P(x2|c) … P(xn|c), where c is a class and x is a test sample.
All quantities P(c) and P(xi|c) are
parameters which are determined during
training and are used during testing.
This is similar to the nearest neighbor (NN) approach, but the kind of learning and the way the learned model is applied are different.
THANK YOU