Chapter 4: Probability and Probabilistic Models: Statistics I
Summary
- Probability:
  - Random experiments, sample space, elementary and composite events.
  - Axioms of probability.
  - Conditional probability and its properties.
  - Random variables (RVs) and their properties.
  - Stochastic models for discrete RVs: the Bernoulli and other related models.
  - Stochastic models for continuous RVs: the Normal (or Gaussian) and related models.
Basic concepts
- Random experiment: the process of observing an outcome that cannot be predicted with certainty.
- Sample space: the set of all possible outcomes of a random experiment, denoted by

Ω = {e1, e2, . . . , en, . . .}

whose elements are called elementary events. These are disjoint (i.e. they cannot occur at the same time).
- Event: a collection of elementary events, e.g.

A = {e1, e3}
Examples:
- The result of a coin toss.
- The closing price of stock x at the end of next Monday.
Events: basic concepts
Union of events: Let A and B be two events of the sample space Ω; the union A ∪ B is the set of all elementary events of Ω that belong to A or to B (or to both).
Events: basic concepts
Trivial events:
- Sure event Ω: the event equal to the whole sample space.
- Impossible event ∅: the empty set.

Complementary event
The complement of an event A, denoted Ā, is the set of all elementary events of Ω that do not belong to A.
Example: throw of a die
Consider the outcome of throwing a regular die (i.e. a die with k = 6 faces). Since all the faces are equally likely, for any event A:

P(A) = (1/k) × |A|
Postulates:
1. 0 ≤ P(A) ≤ 1.
2. If A = {e1, e2, . . . , en}, then P(A) = Σ_{i=1}^{n} P(ei).
3. P(Ω) = 1.
Consequently:
- Complement: P(Ā) = 1 − P(A).
- P(∅) = 0.
- Probability of an even score: A = {2, 4, 6}, then

P(A) = P("2") + P("4") + P("6") = 1/6 + 1/6 + 1/6 = 1/2

- Probability of a score greater than 3: B = {4, 5, 6}, then

P(B) = P("4") + P("5") + P("6") = 1/6 + 1/6 + 1/6 = 1/2

- Probability of an odd score: the complement of A, so

P(Ā) = 1 − P(A) = 1 − 1/2 = 1/2
Example: throw of a die
- Probability of an even score or a score greater than 3: since A ∩ B = {4, 6}, P(A ∩ B) = 1/3 and

P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = 1/2 + 1/2 − 1/3 = 4/6 = 2/3

- Probability of an even score or face 1: the events A = {2, 4, 6} and C = {1} are incompatible (A ∩ C = ∅), therefore

P(A ∪ C) = P(A) + P(C) = 1/2 + 1/6 = 4/6 = 2/3
Example: conditional probability
- Therefore, the probability of winning is P(A) = 3/37.
- Suppose that before playing we were told that the roulette is unfair, as only odd numbers come out. What is the probability of winning when including this information? Is it the same as before?
Notion of conditional probability
Conditional Probability
Let A and B be two events such that P(B) > 0; the conditional probability of A given B is defined as:

P(A|B) = P(A ∩ B) / P(B)

- Note that, once we know that the roulette is unfair, the sample space changes from the initial one, as no even number can come out: it becomes Ω* = B = {1, 3, 5, . . . , 35}. The probability of A over Ω* is then 1/9.
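The definition can also be checked by enumeration. A minimal sketch reusing the die events above (rather than the roulette, whose bet A is not fully specified in these slides):

```python
# P(A|B) = P(A & B) / P(B), defined only when P(B) > 0
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

def prob(event):
    return Fraction(len(event & omega), len(omega))

def cond_prob(a, b):
    # conditioning on B shrinks the effective sample space to B
    return prob(a & b) / prob(b)

A = {2, 4, 6}   # even score
B = {4, 5, 6}   # score greater than 3
print(cond_prob(A, B))   # 2/3: within B, two of the three outcomes are even
```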
Events B1, B2, . . . form a partition of the sample space if B1 ∪ B2 ∪ · · · = Ω and they are pairwise disjoint:

Bi ∩ Bj = ∅, ∀i ≠ j.

- For the pack of Spanish cards, the following sets form a partition of the sample space:
- Assuming that he draws twice, and knowing that the second one is black, what is the probability that the first one was yellow?
Discrete r.v.
If X takes values in a finite or countably infinite set S ⊆ R, we say that X is a discrete r.v.
Continuous r.v.
If X takes values in an uncountably infinite set S ⊆ R, we say that X is a continuous r.v.
Examples
- X = "score of throwing a die" is a discrete r.v., with S = {1, 2, 3, 4, 5, 6}.
- Y = "number of cars crossing a bridge in a week" is a discrete r.v., with S = {0, 1, 2, . . .} = N ∪ {0}, as it is countably infinite.
- Z = "the height of a student" is a continuous r.v., with S = [0, +∞).
Discrete r.v.
Probability function
Let X be a discrete r.v. with values {x1, x2, . . .}. We call probability function, or probability mass function, the set of probabilities with which X takes its values, that is, pi = P[X = xi], for i = 1, 2, . . . .
Example
X = the score of throwing a die. The probability mass function for a fair die is:

x          1     2     3     4     5     6
P[X = x]  1/6   1/6   1/6   1/6   1/6   1/6

- The c.d.f. follows by accumulation: P[X ≤ x] = Σ_{i : xi ≤ x} P[X = xi].
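A short sketch of how this pmf and its accumulated c.d.f. can be tabulated numerically (numpy assumed available; not part of the original slides):

```python
import numpy as np

# pmf of a fair die and its c.d.f. F(x) = P[X <= x], a running sum of the pmf
xs  = np.arange(1, 7)
pmf = np.full(6, 1/6)
cdf = np.cumsum(pmf)

for x, F in zip(xs, cdf):
    print(f"F({x}) = {F:.4f}")   # 0.1667, 0.3333, ..., 1.0000
```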
Ω = {(f, f, f), (a, f, f), (f, a, f), (f, f, a), (a, a, f), (a, f, a), (f, a, a), (a, a, a)}

P(X ≥ 0) = P(X = 1) + P(X = 3) + P(X = 27) = 0.243 + 0.027 + 0.001 = 0.271
or, via the complement: P(X ≥ 0) = 1 − P(X = −3) = 1 − 0.729 = 0.271.
Distribution Function
The cumulative distribution function (c.d.f.) of a r.v. X is a function F : R → [0, 1] that assigns to each x ∈ R the probability:

F(x) = P[X ≤ x] = Σ_{xi ∈ S, xi ≤ x} P(X = xi)
The variance is Var[X] = E[X²] − E[X]², where:

E[X²] = (−3)² × 0.729 + 1² × 0.243 + 3² × 0.027 + 27² × 0.001 = 7.776

so Var[X] = 7.776 − (−1.836)² = 4.405, and therefore the standard deviation is S[X] = √4.405 = 2.0988.
Example
Let X count the number of tails in tossing a coin twice. The probability
function of X is
x          0     1     2
P[X = x]  1/4   1/2   1/4
P(|X + 1.836| ≥ 3) ≤ 4.405 / 3² = 0.4894

Considering the probability function, the exact probability is:

P(|X + 1.836| ≥ 3) = P(X = 3) + P(X = 27) = 0.027 + 0.001 = 0.028,

which shows that Chebyshev's bound can be quite far from the exact probability.
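The bound-versus-exact comparison can be reproduced directly from the probability function of the running example. A sketch (numpy assumed available):

```python
import numpy as np

# values and probabilities of the running example
xs = np.array([-3.0, 1.0, 3.0, 27.0])
ps = np.array([0.729, 0.243, 0.027, 0.001])

mu  = (xs * ps).sum()                 # -1.836
var = (xs**2 * ps).sum() - mu**2      #  4.405 (E[X^2] = 7.776)

k = 3
bound = var / k**2                      # Chebyshev: P(|X - mu| >= 3) <= 0.4894
exact = ps[np.abs(xs - mu) >= k].sum()  # 0.027 + 0.001 = 0.028
print(mu, var, bound, exact)
```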
Summary Example
- The support of the r.v. is S = {−3, −1, 1, 3}, since:
X (e1 ) = 3 − 0 = 3
X (e2 ) = X (e3 ) = X (e4 ) = 2 − 1 = 1
X (e5 ) = X (e6 ) = X (e7 ) = 1 − 2 = −1
X (e8 ) = 0 − 3 = −3
Description
This probability model describes the outcome of an experiment with only two possible outcomes, which we can denote (for instance) success and failure. The corresponding random variable is

X = 1   if success
    0   if failure

and we write X ∼ Ber(p), where p = P(success).
Bernoulli model
Example
For the throw of a fair coin, let

X = 1   if heads
    0   if tails

This is a Bernoulli experiment and X follows a Bernoulli distribution with p = 1/2.

Example
A certain airline assumes that a passenger has probability 0.05 of not showing up at check-in. Let

Y = 1   if the passenger checks in
    0   if not

Then Y follows a Bernoulli distribution with parameter p = 0.95.
Bernoulli model
Probability function:

P[X = 0] = 1 − p,   P[X = 1] = p

c.d.f.:

F(x) = 0        if x < 0
       1 − p    if 0 ≤ x < 1
       1        if x ≥ 1

Properties
- E[X] = p × 1 + (1 − p) × 0 = p
- E[X²] = p × 1² + (1 − p) × 0² = p
- V[X] = E[X²] − E[X]² = p − p² = p(1 − p)
- S[X] = √(p(1 − p))
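A simulation sketch (numpy assumed; the seed and sample size are arbitrary choices, not from the slides) showing that the empirical mean and variance of Bernoulli draws approach p and p(1 − p):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                               # e.g. the fair-coin example above
x = rng.binomial(1, p, size=100_000)  # Bernoulli = binomial with n = 1

print(x.mean())   # close to E[X] = p = 0.5
print(x.var())    # close to V[X] = p(1 - p) = 0.25
```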
Binomial model
Description
This model describes the total number of successes in n identical Bernoulli experiments repeated independently. The r.v. represents the number of successes and follows a binomial distribution with parameters n ∈ N and p ∈ [0, 1].
Definition
A discrete r.v. X follows a binomial distribution with parameters n and p, written X ∼ B(n, p), if

P[X = x] = C(n, x) p^x (1 − p)^(n−x),   for x = 0, 1, . . . , n,

where C(n, x) = n! / (x! (n − x)!) is the binomial coefficient.
Example
Suppose that the previous airline sold 80 tickets for a certain flight and that the probability of each passenger not showing up at check-in is 0.05. Let X = the number of passengers who check in. Then (assuming independence between passengers)

X ∼ B(80, 0.95)

Properties
- E[X] = np
- Var[X] = np(1 − p)
- S[X] = √(np(1 − p))
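A sketch of how such binomial quantities can be computed with scipy (the specific probabilities queried, e.g. P[X = 80] and P[X ≤ 75], are illustrative and not computed in the slides):

```python
from scipy import stats

# airline example: X ~ B(80, 0.95), number of passengers who check in
X = stats.binom(n=80, p=0.95)

print(X.mean())    # E[X] = np       = 76.0
print(X.var())     # V[X] = np(1-p)  = 3.8
print(X.pmf(80))   # P[all 80 show up] = 0.95**80
print(X.cdf(75))   # P[X <= 75], e.g. for overbooking questions
```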
Poisson model
Description
It models the number of rare events occurring in a certain domain such as, for instance, an interval of time or a region of space.
Examples: telephone calls in an hour, typos in a page, traffic accidents in a week, particles in a m³ of air, "Prussian soldiers killed by horse kicks", . . .
Definition
A r.v. X follows a Poisson distribution of parameter λ > 0 if

P[X = x] = λ^x e^(−λ) / x!,   for x = 0, 1, 2, . . . ,

and we write X ∼ P(λ).
Poisson model
Properties (1)
- E[X] = λ
- Var[X] = λ
- S[X] = √λ

Property (2)
Let X ∼ P(λ) represent the number of events in a unit of time, with mean λ. If Y represents the number of events in a time interval of length t, then

Y ∼ P(tλ)
Poisson model
Example
The mean number of typos per slide is 0.2; let X represent this number, then

X ∼ P(0.2)

What is the probability of having no typos?

P[X = 0] = 0.2⁰ e^(−0.2) / 0! = e^(−0.2) = 0.8187.

What is the probability of having one typo in 4 slides?
Let Y be the number of typos in t = 4 slides; then Y ∼ P(0.2 × 4) = P(0.8) and

P[Y = 1] = 0.8¹ e^(−0.8) / 1! = 0.8 e^(−0.8) = 0.3595.
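The two Poisson probabilities can be verified with scipy. A sketch:

```python
from scipy import stats

X = stats.poisson(mu=0.2)        # typos in one slide
Y = stats.poisson(mu=0.2 * 4)    # typos in t = 4 slides: P(t * lambda)

print(X.pmf(0))   # 0.8187... = exp(-0.2)
print(Y.pmf(1))   # 0.3595... = 0.8 * exp(-0.8)
```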
Continuous r.v.
Distribution function
For a continuous r.v. X , the distribution function is
F (x) = P[X ≤ x], ∀x ∈ R
Properties
- 0 ≤ F(x) ≤ 1, for all x ∈ R.
- F(−∞) = 0.
- F(∞) = 1.
- If x1 ≤ x2, then F(x1) ≤ F(x2); that is, F(x) is non-decreasing.
- For all x1, x2 ∈ R, P(x1 ≤ X ≤ x2) = F(x2) − F(x1).
- F(x) is continuous.
Density function
For a continuous r.v. X with distribution function F(x), the density function of X is:

f(x) = dF(x)/dx = F′(x)

Properties
- f(x) ≥ 0, ∀x ∈ R
- P(a ≤ X ≤ b) = ∫_a^b f(x) dx, ∀a, b ∈ R
- F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(u) du
- ∫_{−∞}^{+∞} f(x) dx = 1
Continuous r.v.
Example
For a r.v. X with density function

f(x) = 12x²(1 − x)   if 0 < x < 1
       0             otherwise

we have

P(X ≤ 0.5) = ∫_{−∞}^{0.5} f(u) du = ∫_0^{0.5} 12u²(1 − u) du = 0.3125

P(0.2 ≤ X ≤ 0.5) = ∫_{0.2}^{0.5} f(u) du = ∫_{0.2}^{0.5} 12u²(1 − u) du = 0.2853

F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(u) du =
       0                   if x ≤ 0
       12 (x³/3 − x⁴/4)    if 0 < x ≤ 1
       1                   if x > 1
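The three integrals can be checked numerically. A sketch using scipy's quadrature:

```python
from scipy.integrate import quad

f = lambda x: 12 * x**2 * (1 - x)    # density on (0, 1)

print(quad(f, 0, 1)[0])      # 1.0     (the density integrates to one)
print(quad(f, 0, 0.5)[0])    # 0.3125  = P(X <= 0.5)
print(quad(f, 0.2, 0.5)[0])  # 0.2853  = P(0.2 <= X <= 0.5)
```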
Expectation of a continuous r.v.
The variance is Var[X] = E[X²] − E[X]², where:

E[X²] = ∫_R x² f(x) dx = ∫_0^1 12x⁴(1 − x) dx = (12/5) x⁵ |_{x=0}^{x=1} − (12/6) x⁶ |_{x=0}^{x=1} = 12/5 − 2 = 2/5

Since E[X] = 3/5, we get Var[X] = 2/5 − (3/5)² = 1/25, and therefore the standard deviation is S[X] = √(1/25) = 1/5.
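The moments of the same density can be checked in the same way. A sketch:

```python
from scipy.integrate import quad

f = lambda x: 12 * x**2 * (1 - x)

EX  = quad(lambda x: x * f(x), 0, 1)[0]      # 0.6  = 3/5
EX2 = quad(lambda x: x**2 * f(x), 0, 1)[0]   # 0.4  = 2/5
var = EX2 - EX**2                            # 0.04 = 1/25
print(EX, EX2, var, var**0.5)                # standard deviation 0.2 = 1/5
```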
Uniform distribution
Description
For the uniform distribution, every subset of the same length has the same probability; that is, the density is constant over the bounded set where the r.v. takes its values.
Definition
A continuous r.v. X follows a uniform distribution over the interval (a, b) (where a and b are the parameters of the distribution) if

f(x) = 1/(b − a)   if a < x ≤ b
       0           otherwise

Properties
- Expectation: E[X] = (a + b)/2
- Variance: V[X] = (b − a)²/12
- Standard deviation: S[X] = (b − a)/√12
Example of uniform distribution
Distribution function
For X ∼ U(3, 5):

F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(u) du = . . .

- If x > 5, then F(x) = P(X ≤ x) = ∫_3^5 (1/2) du = u/2 |_3^5 = (5 − 3)/2 = 1.

Summarizing, we have:

F(x) = 0            if x ≤ 3
       (x − 3)/2    if 3 < x ≤ 5
       1            if x > 5
Example of uniform distribution
Expectation

E[X] = ∫_R x · f(x) dx = ∫_3^5 x · (1/2) dx = x²/4 |_3^5 = (5² − 3²)/4 = 4

Variance

Var[X] = ∫_R x² · f(x) dx − E[X]² = ∫_3^5 (x²/2) dx − 4² = x³/6 |_3^5 − 16 = (5³ − 3³)/6 − 16 = 0.33
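A sketch checking these values against scipy's uniform distribution (note that scipy parametrizes U(a, b) as loc = a, scale = b − a):

```python
from scipy import stats

X = stats.uniform(loc=3, scale=2)   # U(3, 5)

print(X.mean())    # (a+b)/2    = 4.0
print(X.var())     # (b-a)^2/12 = 0.333...
print(X.cdf(4.0))  # F(4) = (4-3)/2 = 0.5
print(X.cdf(6.0))  # 1.0 (beyond b = 5)
```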
Exponential distribution
Description
The exponential distribution models the time between two consecutive events that occur independently and uniformly over time (i.e. the time between two Poisson events).
Definition
We say that X follows an exponential distribution of parameter λ > 0, X ∼ E(λ), if its density function is

f(x) = λ e^(−λx)   if x ≥ 0
       0           otherwise

Examples
- Time between the arrivals of two trucks at the discharge point.
- Time between two emergency calls.
- Lifetime of a lightbulb.
Exponential distribution
Properties
- Expectation: E[X] = 1/λ
- Variance: V[X] = 1/λ²
- Standard deviation: S[X] = 1/λ

c.d.f.:

F(x) = 1 − e^(−λx)   if x ≥ 0
       0             otherwise
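A sketch relating the formulas above to scipy's exponential distribution (the rate λ = 2 is an arbitrary illustration, not from the slides; scipy uses scale = 1/λ):

```python
import numpy as np
from scipy import stats

lam = 2.0                              # assumed rate, for illustration
X = stats.expon(scale=1/lam)           # scipy parametrizes by scale = 1/lambda

print(X.mean(), X.var())               # 1/lambda = 0.5, 1/lambda^2 = 0.25
x = 1.3
print(X.cdf(x), 1 - np.exp(-lam * x))  # both give F(x) = 1 - e^(-lambda*x)
```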
Description
The normal distribution models the measurement errors of a continuous quantity and approximates very well many real situations. Statistics makes wide use of this model and of the models derived from it.
Definition
The r.v. X follows a normal or Gaussian distribution with parameters µ ∈ R and σ ∈ R⁺, X ∼ N(µ, σ), if

f(x) = (1 / (σ√(2π))) exp( −(x − µ)² / (2σ²) )

Properties
E[X] = µ,   V[X] = σ²
If X ∼ N(µ, σ), the density f(x) is symmetric around µ, which is also the median.
Normal or Gaussian distribution
[Figure: density function for 3 different values of µ and σ]
Normal or Gaussian distribution
Property
If X ∼ N(µ, σ):
- P(µ − σ < X < µ + σ) ≈ 0.683
- P(µ − 2σ < X < µ + 2σ) ≈ 0.955
- P(µ − 3σ < X < µ + 3σ) ≈ 0.997
Chebyshev’s inequality
Chebyshev’s inequality applies also to continuous variables knowing only
its mean and standard deviation. In the case where X is Gaussian with
mean µ and standard deviation σ, we have that:
σ2
P (µ − k < X < µ + k) = P (|X − µ| < k) ≥ 1 −
k2
1
therefore, if k = cσ, we have that P (µ − cσ < X < µ + cσ) ≥ 1 − c2 .
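A sketch comparing the Chebyshev bound 1 − 1/c² with the exact Gaussian probabilities for c = 1, 2, 3 (scipy assumed available):

```python
from scipy import stats

for c in (1, 2, 3):
    exact = stats.norm.cdf(c) - stats.norm.cdf(-c)  # P(mu-c*sigma < X < mu+c*sigma)
    bound = 1 - 1/c**2                              # Chebyshev lower bound
    print(c, round(exact, 3), round(bound, 3))
# 1  0.683  0.0
# 2  0.955  0.75
# 3  0.997  0.889
```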
Normal or Gaussian distribution
Linear transformation
If X ∼ N (µ, σ), then:
Y = aX + b ∼ N (aµ + b, |a|σ)
Standardization
If X ∼ N(µ, σ), it is possible to consider the standardized r.v.

Z = (X − µ)/σ ∼ N(0, 1)

The special case N(0, 1) is called the standard normal distribution. It is symmetric around 0, and its c.d.f. (whose analytical expression is not available in closed form) is tabulated.
Table of N (0, 1)
Example of Normal Distribution
- Pr(Z < −1.5) = Pr(Z > 1.5) = 1 − Pr(Z < 1.5) = 1 − 0.9332 = 0.0668. Why not ≤? Because Z is continuous, Pr(Z = −1.5) = 0, so < and ≤ give the same probability.
- Pr(−1.5 < Z < 1.5) = Pr(Z < 1.5) − Pr(Z < −1.5) = 0.9332 − 0.0668 = 0.8664.
Example of Normal Distribution
Let X ∼ N(µ = 2, σ = 3); we want to calculate Pr(X < 4) and Pr(−1 < X < 3.5) using the table of the standard normal:
- First we rewrite the statement in terms of Z via standardization, then we use the table:

Pr(X < 4) = Pr( (X − 2)/3 < (4 − 2)/3 ) = Pr(Z < 0.6̇) ≈ 0.7454
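A sketch reproducing the standardization with scipy; it also evaluates the second probability, Pr(−1 < X < 3.5), which the slides leave to the reader. (The table lookup at z = 0.66 gives 0.7454, while the untruncated z = 2/3 gives ≈ 0.7475.)

```python
from scipy import stats

Z = stats.norm(0, 1)
print(Z.cdf((4 - 2) / 3))        # Pr(X < 4) via standardization, ~ 0.7475

X = stats.norm(loc=2, scale=3)   # or let scipy standardize internally
print(X.cdf(4))                  # same value
print(X.cdf(3.5) - X.cdf(-1))    # Pr(-1 < X < 3.5) ~ 0.5328
```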
If the sample were made of 5 packs, what is the probability that at least one has a loss between 3% and 5%? In this case we have n = 5 and p = 0.6827, then Y ∼ B(5, 0.6827) and

P(Y ≥ 1) = 1 − P(Y = 0) = 1 − (1 − 0.6827)⁵ ≈ 0.9968
The Central Limit Theorem (CLT) states that, in the limit for n large, the distribution of X̄ is Gaussian independently of the distribution of X. Given its generality, it is called "central".
Theorem
Let X1, X2, . . . , Xn be independent r.v., identically distributed, with mean µ and standard deviation σ (both finite). For n large enough,

(X̄ − µ) / (σ/√n) ∼ N(0, 1)
Approximations with the CLT
Binomial
Let X ∼ B(n, p) with n large enough (that is, n ≥ 30 and 0.1 ≤ p ≤ 0.9, or np ≥ 5 and n(1 − p) ≥ 5); then:

(X − np) / √(np(1 − p)) ∼ N(0, 1)

Poisson
If X ∼ P(λ) with λ large enough (λ > 5),

(X − λ) / √λ ∼ N(0, 1)

Equivalently, P(λ) ≈ N(λ, √λ).
Approximations with the CLT: Example
- Let X ∼ B(100, 1/3). Suppose we want to calculate Pr(X < 40); as the exact computation is heavy by hand, we use the CLT: X ∼ B(100, 1/3) ≈ N(33.3̇, 4.714), because

E[X] = 100 × (1/3) = 33.3̇
V[X] = 100 × (1/3) × (2/3) = 22.2̇
S[X] = √22.2̇ = 4.714

- Therefore,

Pr(X < 40) = Pr( (X − 33.3̇)/4.714 < (40 − 33.3̇)/4.714 ) ≈ Pr(Z < 1.414) ≈ 0.921, where Z ∼ N(0, 1).

- The exact value, computed with a PC, is 0.934, so the CLT approximation is not very far from the exact value.
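A sketch comparing the exact binomial value with the CLT approximation (scipy assumed; the continuity-corrected variant is included for comparison, though the slides do not use it):

```python
from scipy import stats

n, p = 100, 1/3
X = stats.binom(n, p)
Z = stats.norm(loc=n * p, scale=(n * p * (1 - p)) ** 0.5)  # N(33.33, 4.714)

print(X.cdf(39))    # exact Pr(X < 40) = Pr(X <= 39)
print(Z.cdf(40))    # plain CLT approximation, ~ 0.921
print(Z.cdf(39.5))  # with a continuity correction, usually closer
```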
Distributions related to the normal one
χ² (Chi-squared)
Let X1, X2, . . . , Xn be i.i.d. N(0, 1) r.v. The distribution of

S = Σ_{i=1}^{n} Xi²

is called the chi-squared distribution with n degrees of freedom, χ²_n.

Student's t
Let Y, X1, X2, . . . , Xn be i.i.d. N(0, 1) r.v. The distribution of

T = Y / √( Σ_{i=1}^{n} Xi² / n )

is called Student's t distribution with n degrees of freedom, t_n.

Fisher's F_{n,m}
Let X1, X2, . . . , Xn and Y1, Y2, . . . , Ym be independent N(0, 1) r.v. The distribution of

F = ( Σ_{i=1}^{n} Xi² / n ) / ( Σ_{i=1}^{m} Yi² / m ) = (m Σ_{i=1}^{n} Xi²) / (n Σ_{i=1}^{m} Yi²)

is called the Fisher distribution F_{n,m}. It depends on two parameters, n and m, called degrees of freedom (d.f.), with the following properties:
- E[F] = m/(m − 2) (for m > 2)
- Var[F] = 2m²(n + m − 2) / (n(m − 2)²(m − 4)) (for m > 4)
- 1/F ∼ F_{m,n}
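A simulation sketch (numpy/scipy assumed; the degrees of freedom n = 5, m = 10 and the seed are arbitrary choices) checking that the F ratio built from its definition has mean close to m/(m − 2):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, m = 5, 10

# build F_{n,m} from its definition: ratio of scaled sums of squared N(0,1)
X = rng.standard_normal((200_000, n))
Y = rng.standard_normal((200_000, m))
F = ((X**2).sum(axis=1) / n) / ((Y**2).sum(axis=1) / m)

print(F.mean(), m / (m - 2))   # simulated mean vs m/(m-2) = 1.25
print(stats.f(n, m).mean())    # scipy's F distribution agrees
```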