
Chapter 5

Overview of Probability
Overview
• Statistical tools in Machine Learning
• Concepts of probability, random variables, discrete distributions, continuous distributions, multiple random variables
• Central limit theorem
• Sampling distributions
• Hypothesis testing
• Monte Carlo approximation
Statistical Tools in Machine Learning
Machine learning consists of a set of methods that can automatically detect patterns in data. These patterns can then be used:

- To predict future data.
- To perform other kinds of decision making under uncertainty.

The best way to perform machine learning activities is to use the tools of probability theory, because probability theory can be applied to any situation involving uncertainty.
In Machine Learning, uncertainties can arise in various
ways.
Forms:
• Prediction Uncertainty
• Model Uncertainty
• Confidence Level Uncertainty
Importance of Statistical Tools in ML

Knowledge and uncertainty

• Probability theory provides a mathematical foundation for quantifying the uncertainty in this knowledge.

• Since knowledge about the training data comes in the form of interdependent feature sets, conditional probability forms the basis for deriving the required confidence level from the training data.

• The mathematical function behind each distribution lets us understand how the data is spread out from its average value, summarized by the mean and the variance.
Concept of Probability
• Frequentist interpretation of probability:
Represents the long-run frequencies of events.
Example: Tossing a coin

• Bayesian interpretation of probability:
Focuses on information (degrees of belief) rather than repeated trials.
Joint Probabilities

• The probability of the joint event A and B is given by the product rule:

p(A, B) = p(A|B) p(B)

where p(A|B) is the conditional probability of event A happening given that event B happens.

• Marginal distribution: Distribution of a single variable from a joint probability distribution.

• Summing over all probable states of B gives the total probability formula, which is also called the sum rule or the rule of total probability:

p(A) = Σb p(A, B = b) = Σb p(A|B = b) p(B = b)

• Chain rule of probability:

p(X1, X2, …, Xn) = p(X1) p(X2|X1) p(X3|X1, X2) … p(Xn|X1, …, Xn−1)
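To make the product, sum, and chain rules concrete, here is a minimal Python sketch; the joint table values are invented for illustration:

import numpy as np

# Hypothetical joint distribution p(A, B) over two binary variables;
# rows index A, columns index B; entries must sum to 1.
p_AB = np.array([[0.10, 0.30],
                 [0.20, 0.40]])

# Sum rule: marginalize out B to get p(A), and A to get p(B).
p_A = p_AB.sum(axis=1)          # [0.4, 0.6]
p_B = p_AB.sum(axis=0)          # [0.3, 0.7]

# Product rule rearranged: p(A|B) = p(A, B) / p(B).
p_A_given_B = p_AB / p_B        # divides each column by the matching p(B)

print(p_A, p_B)
print(p_A_given_B)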
Conditional Probability
• The conditional probability of event A, given that event B is true, is

P(A|B) = P(A ∩ B) / P(B), provided P(B) > 0

Similarly,

P(B|A) = P(A ∩ B) / P(A), provided P(A) > 0
In a toy-making shop, the automated machine produces a few defective pieces. It is observed that in a lot of 1,000 toy parts, 25 are defective. Two random samples are selected for testing without replacement from the lot.

Calculate the probability that both samples are defective.
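A worked solution, sketched in Python: by the product rule, P(both defective) = P(first defective) × P(second defective | first defective), and the second factor changes because sampling is without replacement:

# P(both defective) when drawing 2 parts without replacement
# from a lot of 1,000 parts containing 25 defectives.
p_first = 25 / 1000            # P(first sample is defective)
p_second = 24 / 999            # P(second defective | first was defective)
p_both = p_first * p_second
print(p_both)                  # ~0.0006, i.e. about 0.06%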
Bayes Rule (Bayes Theorem)
• Can be derived by combining the definition of conditional probability with the product and sum rules:

P(A|B) = P(B|A) P(A) / P(B), where P(B) = Σa P(B|A = a) P(A = a)
Let's assume we have knowledge about the reliability of the rule that emails whose sender name contains 'mass' or 'bulk' are spam: it is 80%, meaning that if an email has a sender name containing 'mass' or 'bulk', then the email is spam with probability 0.8. The probability of a false alarm is 10%, meaning that the agent will flag an email as spam even when it is not spam with probability 0.1. Also, we have the prior knowledge that only 0.4% of the total emails received are spam.

Find the probability that an email is really spam based on the name of the sender.
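A minimal Python sketch of the Bayes-rule computation for this example (the variable names are mine; the numbers come from the problem statement):

# Bayes rule: P(spam | flagged) = P(flagged | spam) P(spam) / P(flagged)
p_spam = 0.004                         # prior: 0.4% of emails are spam
p_flag_given_spam = 0.8                # reliability of the sender-name rule
p_flag_given_ham = 0.1                 # false-alarm rate

# Rule of total probability for the evidence P(flagged).
p_flag = p_flag_given_spam * p_spam + p_flag_given_ham * (1 - p_spam)

p_spam_given_flag = p_flag_given_spam * p_spam / p_flag
print(p_spam_given_flag)               # ~0.031: only about 3.1%, because spam is rare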
Random Variables
• In an example of tossing a coin, we can associate random variables X and Y as:

X(H) = 1, X(T) = 0, which means this variable is associated with the outcome of the coin facing heads.

Y(H) = 0, Y(T) = 1, which means this variable is associated with the outcome of the coin facing tails.
• The sample space S is the set of outcomes related to tossing the coin once.

• On S, a random variable is a single-valued real function X(ζ) that assigns a real number, called its value, to each sample point of S.

• A random variable is not a variable but a function.

• The sample space S is called the domain of the random variable X, and the collection of all the numbers, i.e. the values of X(ζ), is termed the range of the random variable.
• We can define the event (X = x), where x is a fixed real number, as

(X = x) = {ζ : X(ζ) = x}

(It is the set of all sample points ζ that, when passed into the function X, yield the result x.)

• Accordingly, we can define the following events for fixed numbers x, x1, and x2:

(X ≤ x) = {ζ : X(ζ) ≤ x}
(X > x) = {ζ : X(ζ) > x}
(x1 < X ≤ x2) = {ζ : x1 < X(ζ) ≤ x2}

• The probabilities of these events are denoted by

P(X = x) = P{ζ : X(ζ) = x}
P(X ≤ x) = P{ζ : X(ζ) ≤ x}
P(X > x) = P{ζ : X(ζ) > x}
P(x1 < X ≤ x2) = P{ζ : x1 < X(ζ) ≤ x2}
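To make the "a random variable is a function" point concrete, here is a small Python sketch assuming a fair coin:

from fractions import Fraction

# Sample space for one coin toss, with a fair-coin probability measure.
S = ['H', 'T']
P = {'H': Fraction(1, 2), 'T': Fraction(1, 2)}

# Random variables are just functions from sample points to real numbers.
X = {'H': 1, 'T': 0}   # X = 1 if the coin faces heads
Y = {'H': 0, 'T': 1}   # Y = 1 if the coin faces tails

# P(X = x) = P({zeta : X(zeta) = x}): sum the measure over matching points.
def prob_X_equals(x):
    return sum(P[z] for z in S if X[z] == x)

print(prob_X_equals(1))   # 1/2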
Cumulative Distribution Function (CDF)
• Fundamental concept used to describe and analyse random variables.

• The CDF of a random variable X, denoted F(x), is a function that gives the probability that X takes on a value less than or equal to a specific value x:

F(x) = P(X ≤ x), −∞ < x < ∞

• Important properties of F(x) are:

1. 0 ≤ F(x) ≤ 1
2. F(x) is non-decreasing: F(x1) ≤ F(x2) if x1 < x2
3. lim x→−∞ F(x) = 0 and lim x→∞ F(x) = 1
4. F(x) is continuous from the right
5. P(x1 < X ≤ x2) = F(x2) − F(x1)
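These properties can be spot-checked numerically; a quick sketch using SciPy's standard normal as an example distribution:

from scipy.stats import norm

F = norm.cdf                       # cdf of a standard normal X
print(F(-10), F(10))               # near 0 and near 1 at the extremes
print(F(1.0) >= F(0.5))            # True: F is non-decreasing
print(F(1.0) - F(0.0))             # P(0 < X <= 1) = F(1) - F(0) ~ 0.3413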
Discrete Random Variables
• Let X be a random variable with cdf F(x). If F(x) changes values only in jumps (a countable number of them) and remains constant between the jumps, then X is called a discrete random variable.

• Equivalently, a discrete random variable is one whose range, the set of values of X, contains a finite or countably infinite number of points.

Let us consider that the jumps of F(x) of the discrete random variable X occur at the points x1, x2, x3, …, where this sequence represents a finite or countably infinite set and xi < xj for i < j. Then the pmf is

p(xi) = P(X = xi) = F(xi) − F(xi−1)
Continuous Random Variables
• Most real-life events are continuous in nature.

Example: If we have to measure the actual time taken to finish an activity, then there is an infinite number of possible values for the measurement (the measurement is continuous).

• For a continuous random variable, the cdf FX(x) describing the event is continuous and has a derivative dFX(x)/dx that exists and is piecewise continuous.

• The probability density function (pdf) of the continuous random variable X is defined as

fX(x) = dFX(x)/dx
How to extend probability to reason about uncertain continuous quantities:

• The probability that X lies in any interval a < X ≤ b can be computed as follows.

• We can define the events A = (X ≤ a), B = (X ≤ b), and W = (a < X ≤ b).

• We have that B = A ∪ W, and since A and W are mutually exclusive, the sum rule gives

p(B) = p(A) + p(W), and hence p(a < X ≤ b) = p(B) − p(A)
• The cumulative distribution function (cdf) of the random variable X can be obtained by

F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(t) dt

which is a monotonically increasing function. Using this notation, we have

• The probability that the continuous variable falls in an interval:

P(a < X ≤ b) = F(b) − F(a) = ∫_{a}^{b} f(x) dx

• As the size of the interval gets smaller, it is possible to write

P(x ≤ X ≤ x + dx) ≈ f(x) dx
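A small numeric check of P(a < X ≤ b) = F(b) − F(a) = ∫_{a}^{b} f(x) dx, sketched with SciPy (the standard normal and the interval (0, 1] are arbitrary example choices):

from scipy.stats import norm
from scipy.integrate import quad

a, b = 0.0, 1.0
# Integrate the pdf over (a, b] ...
area, _ = quad(norm.pdf, a, b)
# ... and compare with the cdf difference.
print(area, norm.cdf(b) - norm.cdf(a))   # both ~0.3413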
Mean and Variance
• The mean in statistical terms represents the weighted average of all the possible values of the random variable X, where each value is weighted by its probability.

• Denoted by µ or E(X) and defined as

E(X) = Σx x p(x) for a discrete X, or E(X) = ∫_{−∞}^{∞} x f(x) dx for a continuous X

• The variance of a random variable X measures the spread or dispersion of X.

• If E(X) = µ is the mean of the random variable X, then the variance is given by

Var(X) = E[(X − µ)²] = E(X²) − µ²
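A minimal sketch computing both quantities for a small made-up discrete distribution:

import numpy as np

# Hypothetical discrete random variable: values and their probabilities.
x = np.array([0, 1, 2, 3])
p = np.array([0.1, 0.4, 0.3, 0.2])   # must sum to 1

mu = np.sum(x * p)                    # E(X): probability-weighted average
var = np.sum((x - mu) ** 2 * p)       # Var(X) = E[(X - mu)^2]
print(mu, var)                        # 1.6 and 0.84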
Some Common Discrete Distributions
• Bernoulli distribution:
• When we have a situation where the outcome of a trial is categorized as 'success' or 'failure', the behaviour of the random variable X can be represented by the Bernoulli distribution.

• The Bernoulli distribution is a discrete probability distribution that models a single experiment with two possible outcomes, typically denoted as 1 for success and 0 for failure.

• A random variable X is called a Bernoulli random variable with parameter p when its PMF takes the form

P(X = 1) = p, P(X = 0) = 1 − p, i.e. pX(x) = p^x (1 − p)^(1−x) for x ∈ {0, 1}

where 0 ≤ p ≤ 1. The cdf FX(x) of a Bernoulli random variable is expressed as

FX(x) = 0 for x < 0; 1 − p for 0 ≤ x < 1; 1 for x ≥ 1

• The mean and variance of a Bernoulli random variable X are

E(X) = p, Var(X) = p(1 − p)

Here, the probability of success is p and the probability of failure is 1 − p.
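A short sketch with scipy.stats (p = 0.3 is an arbitrary example value):

from scipy.stats import bernoulli

p = 0.3
X = bernoulli(p)
print(X.pmf(1), X.pmf(0))        # p and 1 - p
print(X.mean(), X.var())         # p and p * (1 - p)
print(X.rvs(size=10))            # ten simulated trials of 0s and 1s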


Binomial Distribution
• If n independent Bernoulli trials are performed and X represents the number of successes in those n trials, then X is called a binomial random variable.

• For this reason, a Bernoulli random variable is a special case of a binomial random variable, with parameters (1, p).

• The PMF of X with parameters (n, p) is given by

P(X = k) = C(n, k) p^k (1 − p)^(n−k), k = 0, 1, …, n

where 0 ≤ p ≤ 1 and

C(n, k) = n! / (k! (n − k)!)

is also called the binomial coefficient, which is the number of ways to choose k items from n.

• If we toss a coin n times, let X ∈ {0, …, n} be the number of heads. If the probability of heads is p, then we say X has a binomial distribution, written as X ~ Bin(n, p).

• The corresponding CDF of X is

F(x) = P(X ≤ x) = Σ_{k=0}^{⌊x⌋} C(n, k) p^k (1 − p)^(n−k)

• Mean and variance:

E(X) = np, Var(X) = np(1 − p)
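A short sketch with scipy.stats, using the coin-tossing example (n = 10 fair tosses):

from scipy.stats import binom

n, p = 10, 0.5                    # e.g. 10 fair-coin tosses
X = binom(n, p)
print(X.pmf(5))                   # P(exactly 5 heads) ~ 0.246
print(X.cdf(5))                   # P(X <= 5)
print(X.mean(), X.var())          # n*p = 5 and n*p*(1-p) = 2.5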
The Multinomial and Multinoulli Distributions
• The binomial distribution can be used to model the outcomes of coin tosses, or of experiments where the outcome can be either success or failure. But to model the outcomes of tossing a K-sided die, or of experiments where there can be multiple outcomes, we can use the multinomial distribution.

• Defined as: let x = (x1, …, xK) be a random vector, where xj is the number of times side j of the die occurs. Then x has the following pmf:

Mu(x | n, θ) = (n! / (x1! x2! ⋯ xK!)) θ1^(x1) θ2^(x2) ⋯ θK^(xK)

where θj is the probability of side j, and x ranges over the set of all non-negative integers x1, x2, …, xK whose sum is n.
• Consider the special case of n = 1, which is like rolling a K-sided die once, so x will be a vector of 0s and 1s (a bit vector) in which only one bit can be turned on.

• Meaning that if the die shows face k, then the k'th bit will be on.

• We can consider x as being a scalar categorical random variable with K states or values, and x is then its dummy encoding, x = [I(x = 1), …, I(x = K)], where I(·) is the indicator function.

• For example, if K = 3, the states 1, 2, and 3 can be encoded as (1, 0, 0), (0, 1, 0), and (0, 0, 1). This is also called a one-hot encoding, as we interpret that only one of the K 'wires' is 'hot' or on.

• This very common special case is known as a categorical or discrete distribution.

• Because of the analogy with the binomial/Bernoulli distinction, Gustavo Lacerda suggested that this be called the multinoulli distribution.
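A small NumPy sketch of both cases, with example probabilities θ = (0.2, 0.5, 0.3) for a K = 3-sided die:

import numpy as np

rng = np.random.default_rng(0)
theta = [0.2, 0.5, 0.3]                 # probabilities for a K = 3-sided die

# Multinomial: counts for each face after n = 10 rolls.
print(rng.multinomial(10, theta))       # e.g. [2 5 3]

# Multinoulli (categorical): n = 1 gives a one-hot bit vector.
print(rng.multinomial(1, theta))        # e.g. [0 1 0]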
Poisson Distribution
• The Poisson random variable has a wide range of applications. It may be used as an approximation to the binomial with parameters (n, p) when n is large and p is small, so that np is of moderate size.

• Example:
If a fax machine has a faulty transmission line, then the probability of receiving an erroneous digit within a certain page transmitted can be calculated using a Poisson random variable.

• A random variable X is called a Poisson random variable with parameter λ (> 0) when its pmf looks like

P(X = k) = e^(−λ) λ^k / k!, k = 0, 1, 2, …
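A short sketch with scipy.stats (λ = 3 is an arbitrary example rate):

from scipy.stats import poisson

lam = 3.0                          # rate parameter lambda
X = poisson(lam)
print(X.pmf(2))                    # P(X = 2) = e^-3 * 3^2 / 2! ~ 0.224
print(X.mean(), X.var())           # both equal lambda for a Poisson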
SOME COMMON CONTINUOUS DISTRIBUTIONS
• Uniform distribution:
The pdf of a uniform random variable is given by

f(x) = 1 / (b − a) for a ≤ x ≤ b, and 0 otherwise

The cdf of X is

F(x) = 0 for x < a; (x − a) / (b − a) for a ≤ x ≤ b; 1 for x > b

Example: If X is a continuous random variable with X ~ Uniform(a, b), find E(X).

Solution: E(X) = ∫_{a}^{b} x / (b − a) dx = (b² − a²) / (2(b − a)) = (a + b) / 2
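A quick numeric check with scipy.stats (the interval [2, 8] is an arbitrary example); note that SciPy parametrizes the uniform by loc = a and scale = b − a:

from scipy.stats import uniform

a, b = 2.0, 8.0
X = uniform(loc=a, scale=b - a)        # X ~ Uniform(a, b)
print(X.mean())                        # (a + b) / 2 = 5.0
print(X.pdf(5.0))                      # 1 / (b - a) inside the interval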
Gaussian (normal) distribution:
• Most widely used distribution in statistics and machine learning.

• PDF and CDF:

f(x) = (1 / √(2πσ²)) exp(−(x − µ)² / (2σ²))

F(x) = P(X ≤ x) = Φ((x − µ) / σ), where Φ is the standard normal cdf (it has no closed form and is tabulated or computed numerically)

The Gaussian or normal distribution is the most widely used distribution in the study of random phenomena in nature and in statistics for a few reasons:

• It has two parameters that are easy to interpret, and which capture some of the most basic properties of a distribution, namely its mean and variance.

• The central limit theorem provides the result that sums of independent random variables have an approximately Gaussian distribution, which makes it a good choice for modelling residual errors or 'noise'.

• The Gaussian distribution makes the least number of assumptions, subject to the constraint of having a specified mean and variance, and thus is a good default choice in many cases.

• Its simple mathematical form is easy to implement, but often highly effective.
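A short sketch with scipy.stats for the standard normal (µ = 0, σ = 1):

from scipy.stats import norm

mu, sigma = 0.0, 1.0
X = norm(loc=mu, scale=sigma)
print(X.pdf(0.0))                  # peak density 1/sqrt(2*pi) ~ 0.3989
print(X.cdf(1.96))                 # ~0.975: the familiar 95% two-sided bound
print(X.mean(), X.var())           # mu and sigma^2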
Central Limit Theorem
• One of the most important theorems in probability theory.

• It states that if X1, …, Xn is a sequence of independent, identically distributed random variables, each having mean µ and variance σ², and we define the standardized sum

Zn = (X1 + … + Xn − nµ) / (σ√n)

• then as n → ∞, Zn tends to the standard normal:

lim n→∞ P(Zn ≤ z) = Φ(z)

where Φ(z) is the cdf of a standard normal random variable.

• According to the central limit theorem, irrespective of the distribution of the individual Xi's, the distribution of the sum Sn = X1 + … + Xn is approximately normal for large n.
Significance:
• Even if our data doesn't perfectly follow a normal distribution, the CLT tells us that as the sample size increases, the distribution of sample means (or other statistics) becomes closer to a normal distribution, allowing us to apply normal-theory methods with confidence.

• Hypothesis testing: Hypothesis testing in machine learning often relies on the normal distribution. The CLT ensures that for large sample sizes, the distributions of sample statistics become normal, enabling valid statistical inferences.
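A simulation sketch of the theorem: sums of decidedly non-normal (exponential) draws, once standardized, look standard normal for large n. The sample sizes are arbitrary choices:

import numpy as np

rng = np.random.default_rng(0)

# Exponential with scale 1 has mu = 1 and sigma = 1, and is far from normal.
n, trials = 1000, 10000
samples = rng.exponential(scale=1.0, size=(trials, n))

# Standardize each trial's sum: Zn = (Sn - n*mu) / (sigma * sqrt(n)).
z = (samples.sum(axis=1) - n * 1.0) / (1.0 * np.sqrt(n))

# If the CLT holds, z should look standard normal: mean ~0, std ~1,
# and roughly 95% of values within +/-1.96.
print(z.mean(), z.std())
print(np.mean(np.abs(z) < 1.96))   # ~0.95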
MONTE CARLO APPROXIMATION
• In practical situations, it is often difficult to compute the distribution of a function of a random variable using the change-of-variables formula.

• Solution: Monte Carlo approximation.

• First generate S samples from the distribution, x1, …, xS. From these samples, we can approximate the distribution of f(X) by using the empirical distribution of {f(xs) : s = 1, …, S}.

• In principle, Monte Carlo methods can be used to solve any problem which has a probabilistic interpretation.

• A widely used sampler is the Markov chain Monte Carlo (MCMC) sampler for parametrizing the probability distribution of a random variable.

• The main idea is to design a judicious Markov chain model with a prescribed stationary probability distribution.
• Using the Monte Carlo technique, we can approximate the expected value of any function of a random variable by simply drawing samples from the population of the random variable, and then computing the arithmetic mean of the function applied to the samples, as follows:

E[f(X)] ≈ (1/S) Σ_{s=1}^{S} f(xs), where xs are samples drawn from p(X)

(Also known as Monte Carlo integration.)

• This method evaluates the function only in places where there is non-negligible probability.
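A minimal sketch of Monte Carlo integration: estimating E[X²] for X ~ N(0, 1), whose exact value is Var(X) + E(X)² = 1. The sample count S is an arbitrary choice:

import numpy as np

rng = np.random.default_rng(0)

S = 100_000
xs = rng.standard_normal(S)            # draw S samples from p(x)
estimate = np.mean(xs ** 2)            # arithmetic mean of f applied to samples
print(estimate)                        # ~1.0, the exact value of E[X^2]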
