
Chapter 5

Overview of Probability
Overview
• Statistical tools in Machine Learning
• Concepts of probability, random variables, discrete distributions, continuous distributions, multiple random variables
• Central limit theorem
• Sampling distributions
• Hypothesis testing
• Monte Carlo approximation
Statistical Tools in Machine Learning
Machine learning consists of a set of methods that can automatically detect patterns in data. These patterns can then be used:

- To predict future data.
- To perform other kinds of decision making under uncertainty.

The best way to perform machine learning activities is to use the tools of probability theory, because probability theory can be applied to any situation involving uncertainty.
In Machine Learning, uncertainties can arise in various
ways.
Forms:
• Prediction Uncertainty
• Model Uncertainty
• Confidence Level Uncertainty
Importance of Statistical Tools in ML

Knowledge and uncertainty

• Probability theory provides a mathematical foundation for quantifying the uncertainty in this knowledge.

• Since knowledge about the training data comes in the form of interdependent feature sets, conditional probability forms the basis for deriving the required confidence level from the training data.

• The mathematical function behind each distribution lets us understand how the data is spread out from its average value, summarized by the mean and the variance.
Concept of Probability
• Frequentist interpretation of probability:
Represents the long-run frequencies of events.
Example: Tossing a coin

• Bayesian interpretation of probability:
Focuses on information (degrees of belief) rather than repeated trials.
Joint Probabilities

• The probability of the joint event A and B is given by the product rule:

p(A, B) = p(A|B) p(B)

where p(A|B) is the conditional probability of event A happening given that event B happens.

• Marginal distribution: Distribution of a single variable from a joint probability distribution.

• Summing over all probable states of B gives the total probability formula, which is also called the sum rule or the rule of total probability:

p(A) = Σb p(A, B = b) = Σb p(A|B = b) p(B = b)

• Chain rule of probability:

p(X1, X2, …, Xn) = p(X1) p(X2|X1) p(X3|X1, X2) … p(Xn|X1, …, Xn−1)
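To make the product, sum, and chain rules concrete, here is a minimal Python sketch; the joint table values are invented for illustration:

import numpy as np

# Hypothetical joint distribution p(A, B) over two binary variables;
# rows index A, columns index B; entries must sum to 1.
p_AB = np.array([[0.10, 0.30],
                 [0.20, 0.40]])

# Sum rule: marginalize out B to get p(A), and A to get p(B).
p_A = p_AB.sum(axis=1)          # [0.4, 0.6]
p_B = p_AB.sum(axis=0)          # [0.3, 0.7]

# Product rule rearranged: p(A|B) = p(A, B) / p(B).
p_A_given_B = p_AB / p_B        # divides each column by the matching p(B)

print(p_A, p_B)
print(p_A_given_B)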
Conditional Probability
• The conditional probability of event A, given that event B is true, is

P(A|B) = P(A ∩ B) / P(B), provided P(B) > 0

Similarly,

P(B|A) = P(A ∩ B) / P(A), provided P(A) > 0
In a toy-making shop, the automated machine produces a few defective pieces. It is observed that in a lot of 1,000 toy parts, 25 are defective. Two random samples are selected for testing without replacement from the lot.

Calculate the probability that both samples are defective.
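A worked solution, sketched in Python: by the product rule, P(both defective) = P(first defective) × P(second defective | first defective), and the second factor changes because sampling is without replacement:

# P(both defective) when drawing 2 parts without replacement
# from a lot of 1,000 parts containing 25 defectives.
p_first = 25 / 1000            # P(first sample is defective)
p_second = 24 / 999            # P(second defective | first was defective)
p_both = p_first * p_second
print(p_both)                  # ~0.0006, i.e. about 0.06%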
Bayes Rule (Bayes Theorem)
• Can be derived by combining the definition of conditional probability with the product and sum rules:

P(A|B) = P(B|A) P(A) / P(B), where P(B) = Σa P(B|A = a) P(A = a)
Let's assume we have knowledge about the reliability of the rule that emails whose sender name contains 'mass' or 'bulk' are spam: it is 80%, meaning that if an email has a sender name containing 'mass' or 'bulk', then the email is spam with probability 0.8. The probability of a false alarm is 10%, meaning that the agent will flag an email as spam even when it is not spam with probability 0.1. Also, we have the prior knowledge that only 0.4% of the total emails received are spam.

Find the probability that an email is really spam based on the name of the sender.
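A minimal Python sketch of the Bayes-rule computation for this example (the variable names are mine; the numbers come from the problem statement):

# Bayes rule: P(spam | flagged) = P(flagged | spam) P(spam) / P(flagged)
p_spam = 0.004                         # prior: 0.4% of emails are spam
p_flag_given_spam = 0.8                # reliability of the sender-name rule
p_flag_given_ham = 0.1                 # false-alarm rate

# Rule of total probability for the evidence P(flagged).
p_flag = p_flag_given_spam * p_spam + p_flag_given_ham * (1 - p_spam)

p_spam_given_flag = p_flag_given_spam * p_spam / p_flag
print(p_spam_given_flag)               # ~0.031: only about 3.1%, because spam is rare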
Random Variables
• In an example of tossing a coin, we can associate random variables X and Y as:

X(H) = 1, X(T) = 0, which means this variable is associated with the outcome of the coin facing heads.

Y(H) = 0, Y(T) = 1, which means this variable is associated with the outcome of the coin facing tails.
• The sample space S is the set of outcomes related to tossing the coin once.

• On S, a random variable is a single-valued real function X(ζ) that assigns a real number, called its value, to each sample point of S.

• A random variable is not a variable but a function.

• The sample space S is called the domain of the random variable X, and the collection of all the numbers, i.e. the values of X(ζ), is termed the range of the random variable.
• We can define the event (X = x), where x is a fixed real number, as

(X = x) = {ζ : X(ζ) = x}

(It is the set of all sample points ζ that, when passed into the function X, yield the result x.)

• Accordingly, we can define the following events for fixed numbers x, x1, and x2:

(X ≤ x) = {ζ : X(ζ) ≤ x}
(X > x) = {ζ : X(ζ) > x}
(x1 < X ≤ x2) = {ζ : x1 < X(ζ) ≤ x2}

• The probabilities of these events are denoted by

P(X = x) = P{ζ : X(ζ) = x}
P(X ≤ x) = P{ζ : X(ζ) ≤ x}
P(X > x) = P{ζ : X(ζ) > x}
P(x1 < X ≤ x2) = P{ζ : x1 < X(ζ) ≤ x2}
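To make the "a random variable is a function" point concrete, here is a small Python sketch assuming a fair coin:

from fractions import Fraction

# Sample space for one coin toss, with a fair-coin probability measure.
S = ['H', 'T']
P = {'H': Fraction(1, 2), 'T': Fraction(1, 2)}

# Random variables are just functions from sample points to real numbers.
X = {'H': 1, 'T': 0}   # X = 1 if the coin faces heads
Y = {'H': 0, 'T': 1}   # Y = 1 if the coin faces tails

# P(X = x) = P({zeta : X(zeta) = x}): sum the measure over matching points.
def prob_X_equals(x):
    return sum(P[z] for z in S if X[z] == x)

print(prob_X_equals(1))   # 1/2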
Cumulative Distribution Function (CDF)
• Fundamental concept used to describe and analyse random variables.

• The CDF of a random variable X, denoted F(x), is a function that gives the probability that X takes on a value less than or equal to a specific value x:

F(x) = P(X ≤ x), −∞ < x < ∞

• Important properties of F(x) are:

1. 0 ≤ F(x) ≤ 1
2. F(x) is non-decreasing: F(x1) ≤ F(x2) if x1 < x2
3. lim x→−∞ F(x) = 0 and lim x→∞ F(x) = 1
4. F(x) is continuous from the right
5. P(x1 < X ≤ x2) = F(x2) − F(x1)
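These properties can be spot-checked numerically; a quick sketch using SciPy's standard normal as an example distribution:

from scipy.stats import norm

F = norm.cdf                       # cdf of a standard normal X
print(F(-10), F(10))               # near 0 and near 1 at the extremes
print(F(1.0) >= F(0.5))            # True: F is non-decreasing
print(F(1.0) - F(0.0))             # P(0 < X <= 1) = F(1) - F(0) ~ 0.3413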
Discrete Random Variables
• Let X be a random variable with cdf F(x). If F(x) changes values only in jumps (a countable number of them) and remains constant between the jumps, then X is called a discrete random variable.

• Equivalently, a discrete random variable is one whose range, the set of values of X, contains a finite or countably infinite number of points.

Let us consider that the jumps of F(x) of the discrete random variable X occur at the points x1, x2, x3, …, where this sequence represents a finite or countably infinite set and xi < xj for i < j. Then the pmf is

p(xi) = P(X = xi) = F(xi) − F(xi−1)
Continuous Random Variables
• Most real-life events are continuous in nature.

Example: If we have to measure the actual time taken to finish an activity, then there is an infinite number of possible values for the measurement (the measurement is continuous).

• For a continuous random variable, the cdf FX(x) describing the event is continuous and has a derivative dFX(x)/dx that exists and is piecewise continuous.

• The probability density function (pdf) of the continuous random variable X is defined as

fX(x) = dFX(x)/dx
How to extend probability to reason about uncertain continuous quantities:

• The probability that X lies in any interval a < X ≤ b can be computed as follows.

• We can define the events A = (X ≤ a), B = (X ≤ b), and W = (a < X ≤ b).

• We have that B = A ∪ W, and since A and W are mutually exclusive, the sum rule gives

p(B) = p(A) + p(W), and hence p(a < X ≤ b) = p(B) − p(A)
• The cumulative distribution function (cdf) of the random variable X can be obtained by

F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(t) dt

which is a monotonically increasing function. Using this notation, we have

• The probability that the continuous variable falls in an interval:

P(a < X ≤ b) = F(b) − F(a) = ∫_{a}^{b} f(x) dx

• As the size of the interval gets smaller, it is possible to write

P(x ≤ X ≤ x + dx) ≈ f(x) dx
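A small numeric check of P(a < X ≤ b) = F(b) − F(a) = ∫_{a}^{b} f(x) dx, sketched with SciPy (the standard normal and the interval (0, 1] are arbitrary example choices):

from scipy.stats import norm
from scipy.integrate import quad

a, b = 0.0, 1.0
# Integrate the pdf over (a, b] ...
area, _ = quad(norm.pdf, a, b)
# ... and compare with the cdf difference.
print(area, norm.cdf(b) - norm.cdf(a))   # both ~0.3413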
Mean and Variance
• The mean in statistical terms represents the weighted average of all the possible values of the random variable X, where each value is weighted by its probability.

• Denoted by µ or E(X) and defined as

E(X) = Σx x p(x) for a discrete X, or E(X) = ∫_{−∞}^{∞} x f(x) dx for a continuous X

• The variance of a random variable X measures the spread or dispersion of X.

• If E(X) = µ is the mean of the random variable X, then the variance is given by

Var(X) = E[(X − µ)²] = E(X²) − µ²
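A minimal sketch computing both quantities for a small made-up discrete distribution:

import numpy as np

# Hypothetical discrete random variable: values and their probabilities.
x = np.array([0, 1, 2, 3])
p = np.array([0.1, 0.4, 0.3, 0.2])   # must sum to 1

mu = np.sum(x * p)                    # E(X): probability-weighted average
var = np.sum((x - mu) ** 2 * p)       # Var(X) = E[(X - mu)^2]
print(mu, var)                        # 1.6 and 0.84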
Some Common Discrete Distributions
• Bernoulli distribution:
• When we have a situation where the outcome of a trial is categorized as 'success' or 'failure', the behaviour of the random variable X can be represented by the Bernoulli distribution.

• The Bernoulli distribution is a discrete probability distribution that models a single experiment with two possible outcomes, typically denoted as 1 for success and 0 for failure.

• A random variable X is called a Bernoulli random variable with parameter p when its PMF takes the form

P(X = 1) = p, P(X = 0) = 1 − p, i.e. pX(x) = p^x (1 − p)^(1−x) for x ∈ {0, 1}

where 0 ≤ p ≤ 1. The cdf FX(x) of a Bernoulli random variable is expressed as

FX(x) = 0 for x < 0; 1 − p for 0 ≤ x < 1; 1 for x ≥ 1

• The mean and variance of a Bernoulli random variable X are

E(X) = p, Var(X) = p(1 − p)

Here, the probability of success is p and the probability of failure is 1 − p.
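A short sketch with scipy.stats (p = 0.3 is an arbitrary example value):

from scipy.stats import bernoulli

p = 0.3
X = bernoulli(p)
print(X.pmf(1), X.pmf(0))        # p and 1 - p
print(X.mean(), X.var())         # p and p * (1 - p)
print(X.rvs(size=10))            # ten simulated trials of 0s and 1s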


Binomial Distribution
• If n independent Bernoulli trials are performed and X represents the number of successes in those n trials, then X is called a binomial random variable.

• For this reason, a Bernoulli random variable is a special case of a binomial random variable, with parameters (1, p).

• The PMF of X with parameters (n, p) is given by

P(X = k) = C(n, k) p^k (1 − p)^(n−k), k = 0, 1, …, n

where 0 ≤ p ≤ 1 and

C(n, k) = n! / (k! (n − k)!)

is also called the binomial coefficient, which is the number of ways to choose k items from n.

• If we toss a coin n times, let X ∈ {0, …, n} be the number of heads. If the probability of heads is p, then we say X has a binomial distribution, written as X ~ Bin(n, p).

• The corresponding CDF of X is

F(x) = P(X ≤ x) = Σ_{k=0}^{⌊x⌋} C(n, k) p^k (1 − p)^(n−k)

• Mean and variance:

E(X) = np, Var(X) = np(1 − p)
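A short sketch with scipy.stats, using the coin-tossing example (n = 10 fair tosses):

from scipy.stats import binom

n, p = 10, 0.5                    # e.g. 10 fair-coin tosses
X = binom(n, p)
print(X.pmf(5))                   # P(exactly 5 heads) ~ 0.246
print(X.cdf(5))                   # P(X <= 5)
print(X.mean(), X.var())          # n*p = 5 and n*p*(1-p) = 2.5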
The Multinomial and Multinoulli Distributions
• The binomial distribution can be used to model the outcomes of coin tosses, or of experiments where the outcome can be either success or failure. But to model the outcomes of tossing a K-sided die, or of experiments where there can be multiple outcomes, we can use the multinomial distribution.

• Defined as: let x = (x1, …, xK) be a random vector, where xj is the number of times side j of the die occurs. Then x has the following pmf:

Mu(x | n, θ) = (n! / (x1! x2! ⋯ xK!)) θ1^(x1) θ2^(x2) ⋯ θK^(xK)

where θj is the probability of side j, and x ranges over the set of all non-negative integers x1, x2, …, xK whose sum is n.
• Consider the special case of n = 1, which is like rolling a K-sided die once, so x will be a vector of 0s and 1s (a bit vector) in which only one bit can be turned on.

• Meaning that if the die shows face k, then the k'th bit will be on.

• We can consider x as being a scalar categorical random variable with K states or values, and x is then its dummy encoding, x = [I(x = 1), …, I(x = K)], where I(·) is the indicator function.

• For example, if K = 3, the states 1, 2, and 3 can be encoded as (1, 0, 0), (0, 1, 0), and (0, 0, 1). This is also called a one-hot encoding, as we interpret that only one of the K 'wires' is 'hot' or on.

• This very common special case is known as a categorical or discrete distribution.

• Because of the analogy with the binomial/Bernoulli distinction, Gustavo Lacerda suggested that this be called the multinoulli distribution.
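A small NumPy sketch of both cases, with example probabilities θ = (0.2, 0.5, 0.3) for a K = 3-sided die:

import numpy as np

rng = np.random.default_rng(0)
theta = [0.2, 0.5, 0.3]                 # probabilities for a K = 3-sided die

# Multinomial: counts for each face after n = 10 rolls.
print(rng.multinomial(10, theta))       # e.g. [2 5 3]

# Multinoulli (categorical): n = 1 gives a one-hot bit vector.
print(rng.multinomial(1, theta))        # e.g. [0 1 0]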
Poisson Distribution
• The Poisson random variable has a wide range of applications. It may be used as an approximation to the binomial with parameters (n, p) when n is large and p is small, so that np is of moderate size.

• Example:
If a fax machine has a faulty transmission line, then the probability of receiving an erroneous digit within a certain page transmitted can be calculated using a Poisson random variable.

• A random variable X is called a Poisson random variable with parameter λ (> 0) when its pmf looks like

P(X = k) = e^(−λ) λ^k / k!, k = 0, 1, 2, …
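A short sketch with scipy.stats (λ = 3 is an arbitrary example rate):

from scipy.stats import poisson

lam = 3.0                          # rate parameter lambda
X = poisson(lam)
print(X.pmf(2))                    # P(X = 2) = e^-3 * 3^2 / 2! ~ 0.224
print(X.mean(), X.var())           # both equal lambda for a Poisson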
SOME COMMON CONTINUOUS DISTRIBUTIONS
• Uniform distribution:
The pdf of a uniform random variable is given by

f(x) = 1 / (b − a) for a ≤ x ≤ b, and 0 otherwise

The cdf of X is

F(x) = 0 for x < a; (x − a) / (b − a) for a ≤ x ≤ b; 1 for x > b

Example: If X is a continuous random variable with X ~ Uniform(a, b), find E(X).

Solution: E(X) = ∫_{a}^{b} x / (b − a) dx = (b² − a²) / (2(b − a)) = (a + b) / 2
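A quick numeric check with scipy.stats (the interval [2, 8] is an arbitrary example); note that SciPy parametrizes the uniform by loc = a and scale = b − a:

from scipy.stats import uniform

a, b = 2.0, 8.0
X = uniform(loc=a, scale=b - a)        # X ~ Uniform(a, b)
print(X.mean())                        # (a + b) / 2 = 5.0
print(X.pdf(5.0))                      # 1 / (b - a) inside the interval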
Gaussian (normal) distribution:
• Most widely used distribution in statistics and machine learning.

• PDF and CDF:

f(x) = (1 / √(2πσ²)) exp(−(x − µ)² / (2σ²))

F(x) = P(X ≤ x) = Φ((x − µ) / σ), where Φ is the standard normal cdf (it has no closed form and is tabulated or computed numerically)

The Gaussian or normal distribution is the most widely used distribution in the study of random phenomena in nature and in statistics for a few reasons:

• It has two parameters that are easy to interpret, and which capture some of the most basic properties of a distribution, namely its mean and variance.

• The central limit theorem provides the result that sums of independent random variables have an approximately Gaussian distribution, which makes it a good choice for modelling residual errors or 'noise'.

• The Gaussian distribution makes the least number of assumptions, subject to the constraint of having a specified mean and variance, and thus is a good default choice in many cases.

• Its simple mathematical form is easy to implement, but often highly effective.
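A short sketch with scipy.stats for the standard normal (µ = 0, σ = 1):

from scipy.stats import norm

mu, sigma = 0.0, 1.0
X = norm(loc=mu, scale=sigma)
print(X.pdf(0.0))                  # peak density 1/sqrt(2*pi) ~ 0.3989
print(X.cdf(1.96))                 # ~0.975: the familiar 95% two-sided bound
print(X.mean(), X.var())           # mu and sigma^2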
Central Limit Theorem
• One of the most important theorems in probability theory.

• It states that if X1, …, Xn is a sequence of independent, identically distributed random variables, each having mean µ and variance σ², and we define the standardized sum

Zn = (X1 + … + Xn − nµ) / (σ√n)

• then as n → ∞, Zn tends to the standard normal:

lim n→∞ P(Zn ≤ z) = Φ(z)

where Φ(z) is the cdf of a standard normal random variable.

• According to the central limit theorem, irrespective of the distribution of the individual Xi's, the distribution of the sum Sn = X1 + … + Xn is approximately normal for large n.
Significance:
• Even if our data doesn't perfectly follow a normal distribution, the CLT tells us that as the sample size increases, the distribution of sample means (or other statistics) becomes closer to a normal distribution, allowing us to apply normal-theory methods with confidence.

• Hypothesis testing: Hypothesis testing in machine learning often relies on the normal distribution. The CLT ensures that for large sample sizes, the distributions of sample statistics become normal, enabling valid statistical inferences.
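A simulation sketch of the theorem: sums of decidedly non-normal (exponential) draws, once standardized, look standard normal for large n. The sample sizes are arbitrary choices:

import numpy as np

rng = np.random.default_rng(0)

# Exponential with scale 1 has mu = 1 and sigma = 1, and is far from normal.
n, trials = 1000, 10000
samples = rng.exponential(scale=1.0, size=(trials, n))

# Standardize each trial's sum: Zn = (Sn - n*mu) / (sigma * sqrt(n)).
z = (samples.sum(axis=1) - n * 1.0) / (1.0 * np.sqrt(n))

# If the CLT holds, z should look standard normal: mean ~0, std ~1,
# and roughly 95% of values within +/-1.96.
print(z.mean(), z.std())
print(np.mean(np.abs(z) < 1.96))   # ~0.95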
MONTE CARLO APPROXIMATION
• In practical situations, it is often difficult to compute the distribution of a function of a random variable using the change-of-variables formula.

• Solution: Monte Carlo approximation.

• First generate S samples from the distribution, x1, …, xS. From these samples, we can approximate the distribution of f(X) by using the empirical distribution of {f(xs) : s = 1, …, S}.

• In principle, Monte Carlo methods can be used to solve any problem which has a probabilistic interpretation.

• A widely used sampler is the Markov chain Monte Carlo (MCMC) sampler for parametrizing the probability distribution of a random variable.

• The main idea is to design a judicious Markov chain model with a prescribed stationary probability distribution.
• Using the Monte Carlo technique, we can approximate the expected value of any function of a random variable by simply drawing samples from the population of the random variable, and then computing the arithmetic mean of the function applied to the samples, as follows:

E[f(X)] ≈ (1/S) Σ_{s=1}^{S} f(xs), where xs are samples drawn from p(X)

(Also known as Monte Carlo integration.)

• This method evaluates the function only in places where there is non-negligible probability.
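A minimal sketch of Monte Carlo integration: estimating E[X²] for X ~ N(0, 1), whose exact value is Var(X) + E(X)² = 1. The sample count S is an arbitrary choice:

import numpy as np

rng = np.random.default_rng(0)

S = 100_000
xs = rng.standard_normal(S)            # draw S samples from p(x)
estimate = np.mean(xs ** 2)            # arithmetic mean of f applied to samples
print(estimate)                        # ~1.0, the exact value of E[X^2]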
