Probability and Statistics
1. Random variables
The language of events is sometimes cumbersome. It is often more convenient to describe events numerically, identifying outcomes with numbers. For example, in dice rolling the outcomes are identified with the numbers Ω = {1, 2, 3, 4, 5, 6}. This leads to the definition of random variables.
Definition 1.1. A random variable on a probability space (Ω, A, Pr) is a map
$$X : \Omega \to \mathbb{R}$$
with the property that, for every x ∈ R, the preimage $X^{-1}((-\infty, x]) = \{\omega \in \Omega : X(\omega) \le x\}$ is an event in A.
A random variable simply translates elements of the sample space into numbers, with the condition that intervals of the form (−∞, x] have preimages in the σ–algebra A of the probability space, where probabilities are defined.
Example 1.2. Let A = {∅, A, Ā, Ω} be a Bernoulli algebra on a sample space Ω. The map
$$X : \Omega \to \mathbb{R}$$
defined by
$$X(\omega) = \begin{cases} 0 & \omega \notin A \\ 1 & \omega \in A \end{cases}$$
is a random variable: we have
$$X^{-1}((-\infty, x]) = \begin{cases} \emptyset & x < 0 \\ \bar{A} & 0 \le x < 1 \\ \Omega & x \ge 1. \end{cases}$$
Such random variables are called indicator random variables for the set A and are usually
written as 1A .
The purpose of the definition of random variables is to translate the probability func-
tion from the probability space to R. This is generally accomplished by the cumulative
distribution function.
Definition 1.3. Let X be a random variable on a probability space (Ω, A, Pr). The cumulative distribution function of X is
$$F_X : \mathbb{R} \to \mathbb{R}$$
defined as
$$F_X(x) = \Pr(X^{-1}((-\infty, x])) = \Pr(X \le x).$$
Example 1.4. Let 1_A be the indicator function of the event A in a probability space (Ω, A, Pr) with Pr(A) = p. The cumulative distribution function of 1_A is
$$F_{1_A}(x) = \begin{cases} 0 & x < 0 \\ 1 - p & 0 \le x < 1 \\ 1 & x \ge 1. \end{cases}$$
The distribution function of any random variable is nondecreasing, right–continuous, and satisfies $\lim_{x \to -\infty} F_X(x) = 0$ and $\lim_{x \to +\infty} F_X(x) = 1$. These properties identify the class of real functions which are distribution functions of some random variable.
Remark 1.6. Usually capital letters like X, Y, Z, . . . are used to denote random variables in
the probability setting. For a subset A ⊂ R we use the shorthand X ∈ A for X^{-1}(A). Thus, we write Pr(X < 0), Pr(0 < X ≤ 1) or Pr(X = 2) instead of Pr(X^{-1}((−∞, 0))), Pr(X^{-1}((0, 1])), Pr(X^{-1}({2})) or Pr({ω ∈ Ω : X(ω) = 2}).
Probabilities of general events X ∈ A can be expressed in terms of the distribution function (sometimes not in a simple way). Some examples are collected in the following proposition.
Proposition 1.7. Let X be a random variable in a probability space with distribution function F_X. Then, for a < b,
$$\Pr(X > a) = 1 - F_X(a), \qquad \Pr(a < X \le b) = F_X(b) - F_X(a), \qquad \Pr(X = a) = F_X(a) - \lim_{x \uparrow a} F_X(x).$$
We now come to an important part of the course. A large part of probability theory is developed upon simple models which are used time and again to build more complex ones. We next describe some of the most important ones. Each is identified as a model and gives its name to the corresponding probability distribution and, by abuse of language, to the random variables that follow it.
3.1. Bernoulli model. The Bernoulli model is the simplest one, associated with the Bernoulli algebra and with indicator functions. In this model we are only interested in a single event A ⊂ Ω and want to determine whether this event occurs. Formally,
Definition 3.1. A random variable X has the Bernoulli distribution with parameter p ∈ [0, 1]
if X(Ω) = {0, 1} and
Pr(X = 0) = 1 − p, Pr(X = 1) = p.
Equivalently, X = 1A for some A ⊂ Ω. We write X ∼ B(p).
There is not much to say about the Bernoulli distribution except that it is the building
block of many more complicated models.
3.2. Binomial model. We again focus on a single event A in the sample space, but now we make n independent repetitions of the experiment associated with the probability space, and we are interested in counting how many times the event occurs in these n repetitions. By independence, any possible output of the n experiments in which there are k occurrences of A has probability $p^k q^{n-k}$, where p = Pr(A) and q = 1 − p. There are $\binom{n}{k}$ outputs with precisely k occurrences of A. This gives the probability distribution of our random variable.
Definition 3.2. A random variable X has the Binomial distribution with parameters n and
p if
$$\Pr(X = k) = \binom{n}{k} p^k q^{n-k}, \qquad k = 0, 1, \dots, n.$$
We write X ∼ Bin(n, p).
The Binomial model is quite a fundamental one. Among other interesting properties, we have
$$\Pr(X = k) \le \Pr(X = k+1), \qquad 0 \le k \le np - q,$$
and
$$\Pr(X = k) \ge \Pr(X = k+1), \qquad np - q \le k \le n - 1.$$
In other words, the sequence Pr(X = 0), Pr(X = 1), . . . , Pr(X = n) is unimodal, with a maximum at k = ⌊(n + 1)p⌋ (and also at (n + 1)p − 1 when (n + 1)p is an integer).
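As a quick numeric check of this unimodality (a minimal sketch in Python; the parameters n = 20, p = 0.3 are chosen here only for illustration), one can tabulate the Binomial probability function and locate its maximum:

from math import comb, floor

def binom_pmf(n, p, k):
    """Pr(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 20, 0.3                      # illustrative parameters
pmf = [binom_pmf(n, p, k) for k in range(n + 1)]

mode = max(range(n + 1), key=lambda k: pmf[k])
print(mode, floor((n + 1) * p))     # both print 6 for these parameters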
3.3. Poisson Model. The values of the probability function of a Binomial variable are somewhat cumbersome to compute. A useful simplification is the limiting model for large n with np kept essentially constant. More precisely, if X_n ∼ Bin(n, p_n) with lim_{n→∞} np_n = λ, then the probability function of X_n has a limit which can be written in a simpler way:
\begin{align*}
\Pr(X_n = k) &= \binom{n}{k} p_n^k (1 - p_n)^{n-k} \\
&\sim \frac{1}{k!}\,(np_n)^k (1 - p_n)^{-k} (1 - p_n)^{n} \\
&\sim \frac{1}{k!}\,\lambda^k e^{-\lambda},
\end{align*}
where in the last line we have used that
$$\lim_{n \to \infty} (1 - \lambda/n)^n = e^{-\lambda}.$$
This limiting probability function defines the Poisson distribution: a random variable X has the Poisson distribution with parameter λ > 0 if
$$\Pr(X = k) = e^{-\lambda}\,\frac{\lambda^k}{k!}, \qquad k = 0, 1, 2, \dots$$
We write X ∼ Pois(λ). As an illustration of the approximation, the following table compares the first values of the probability function of X ∼ Bin(30, 0.1) with those of Y ∼ Pois(3) (λ = np = 3):

k    Pr(X = k)    Pr(Y = k)
0    0.042        0.049
1    0.141        0.149
2    0.227        0.224
3    0.236        0.224
4    0.177        0.168
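The comparison in the table is easy to reproduce (a small Python sketch; the parameters n = 30, p = 0.1, hence λ = 3, are the ones that match the values above, up to rounding):

from math import comb, exp, factorial

def binom_pmf(n, p, k):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(lam, k):
    return exp(-lam) * lam**k / factorial(k)

n, p, lam = 30, 0.1, 3.0            # lambda = n*p = 3
for k in range(5):
    print(k, round(binom_pmf(n, p, k), 3), round(poisson_pmf(lam, k), 3))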
The number of radioactive particles emitted, the number of phone calls received, or the number of raindrop impacts are examples where the Poisson distribution fits well: they correspond to a large number of trials with a small probability of success, which can reasonably be imagined to be independent.
3.4. Geometric model. We now count the number X of independent repetitions of the experiment needed until the event A occurs for the first time. Writing p = Pr(A) and q = 1 − p, the first 'success' happens at the k–th trial precisely when the first k − 1 trials are 'failures', so
$$\Pr(X = k) = q^{k-1} p, \qquad k = 1, 2, \dots,$$
and we write X ∼ Geom(p). The name of the distribution comes from the fact that the sum of the probabilities, which must add up to one, is the geometric series
$$\sum_{k \ge 1} \Pr(X = k) = \sum_{k \ge 1} q^{k-1} p = p\,\frac{1}{1 - q} = 1.$$
We note that a slight variation of the geometric distribution Y = X − 1, which counts the
number of ‘failures’ before the first ‘success’ is usually also called geometric, with probability
distribution
Pr(Y = k) = q k p, k = 0, 1, 2, . . .
starting with k = 0 instead of k = 1.
The geometric distribution has a characteristic property, the 'lack of memory'. If we know that in the first r repetitions there has been no success, then the probability of having to wait at least s additional repetitions is the same as from the start:
$$\Pr(X \ge r + s \mid X > r) = \frac{\Pr(X \ge r + s)}{\Pr(X > r)} = \frac{q^{r+s-1}}{q^{r}} = q^{s-1} = \Pr(X \ge s),$$
where we have used $\Pr(X \ge m) = \sum_{k \ge m} q^{k-1} p = q^{m-1}$. In other words, if we are waiting for
Heads in coin tossing, the fact that we have waited 103 tosses without seeing the event does
not mean that Heads are approaching faster in the future.
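The lack of memory is easy to observe in a simulation (a minimal sketch in Python; the values p = 0.2, r = 4, s = 3 are arbitrary choices for illustration):

import random

random.seed(1)

def geometric(p):
    """Number of independent Bernoulli(p) trials up to and including the first success."""
    k = 1
    while random.random() >= p:
        k += 1
    return k

p, r, s = 0.2, 4, 3
samples = [geometric(p) for _ in range(200_000)]

# Pr(X >= s): waiting at least s repetitions from the start.
p_start = sum(x >= s for x in samples) / len(samples)

# Pr(X >= r + s | X > r): no success in the first r repetitions, then at least s more.
no_success = [x for x in samples if x > r]
p_cond = sum(x >= r + s for x in no_success) / len(no_success)

print(round(p_start, 3), round(p_cond, 3), round((1 - p) ** (s - 1), 3))   # all close to q^(s-1)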
Proposition 3.6. Let X be a discrete random variable taking values in the positive integers
with Pr(X = 1) = p. If X has the ‘lack of memory’ property then X ∼ Geom(p).
3.5. Negative Binomial model. If, instead of waiting for the first 'success' as in the geometric distribution, one waits until the appearance of the r–th 'success', then the probability of having to wait exactly k repetitions is that of having r − 1 successes in the first k − 1 trials (a binomial probability) and then having the r–th one in the k–th trial.
Definition 3.7. A random variable X has the negative binomial distribution with parameters p and r if
$$\Pr(X = k) = \binom{k-1}{r-1}\, p^r q^{k-r}, \qquad k = r, r+1, \dots$$
We write X ∼ NegBin(p, r).
3.7. Uniform model. The basic distribution we have repeatedly seen on a finite sample space Ω = {1, 2, . . . , n} is the uniform one, where each outcome gets the same probability.
Definition 3.9. A random variable X has the uniform distribution with parameter n if
$$\Pr(X = k) = \frac{1}{n}, \qquad k = 1, 2, \dots, n.$$
We write X ∼ U (n).
We simply note that a discrete random variable cannot have the uniform distribution on an infinite countable set, say Ω = N. This would lead to Pr(X = k) = 0 for all k and then, by σ–additivity, $\Pr(\mathbb{N}) = \sum_{x \in \mathbb{N}} \Pr(X = x) = 0$, contradicting the first axiom of a probability measure.
4. Expectation
Definition 4.1 (Expectation). Let X be a discrete random variable taking values x1, x2, . . .. The expectation of X is
$$E(X) = \sum_i x_i \Pr(X = x_i),$$
whenever the sum is absolutely convergent.
In other words, the expectation is the sum of the values of the random variable weighted by their probabilities. We will see later on that E(X) is the single number which best represents a probability distribution, a clear intuitive fact. Of course the expectation can be large because X takes very large values, even if only with small probabilities, so the expectation (as the mean) may be a misleading representative of a random variable. A large amount of probability and statistics is devoted to clarifying this statement.
The caution in the definition about the convergence of the sum is not superfluous: there are random variables which do not have an expectation, although sometimes the value ∞ is accepted.
Example 4.2. Let X be a random variable taking values on the positive integers with probability
$$\Pr(X = k) = \frac{6}{\pi^2 k^2}.$$
One can check (it is a famous problem in the history of mathematics) that $\sum_k \Pr(X = k) = 1$. However, the series $\sum_k k \Pr(X = k) = (6/\pi^2) \sum_k 1/k$ is not convergent.
Distribution               Expectation
X ∼ B(p)                   p
X ∼ Bin(n, p)              np
X ∼ Pois(λ)                λ
X ∼ Geom(p)                1/p
X ∼ NegBin(p, r)           r/p
X ∼ HypGeom(n, n1, r)      r n1/n
Proposition 4.4 (the linearity of the expectation) is particularly useful. For example, it provides a simple derivation of the expectation of the Binomial and Negative Binomial distributions.
A second important property of the expectation is related to functions of random variables.
Proposition 4.5 (Functions of random variables). Let X be a discrete random variable
on a probability space and let g : R → R be a function such that the preimage of each
interval (−∞, x], x ∈ R belongs to the Borel σ–algebra (the class of such functions is called
‘measurable’ and it includes continuous functions).
Then the composition Y = g(X) is a discrete random variable on the same probability
space.
One can sometimes obtain explicitly the distribution of Y = g(X). However the following
result, usually called the theorem of expectation or the formula of change of variables for
expectation, is often useful.
Theorem 4.6. Let X be a discrete random variable on a probability space taking values
x1 , x2 , . . . and let g : R → R be a measurable function. Then the expectation of Y = g(X)
satisfies
$$E(Y) = \sum_i g(x_i) \Pr(X = x_i).$$
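The point of the theorem is that one can average g over the values of X without ever computing the law of Y = g(X). A minimal sketch in Python (the choices X ∼ Bin(10, 0.4) and g(x) = (x − 3)² are arbitrary illustrations) compares the two computations:

from math import comb

n, p = 10, 0.4
pmf = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}

g = lambda x: (x - 3) ** 2                       # any measurable function

# Theorem 4.6: average g over the values of X.
e_g = sum(g(k) * pk for k, pk in pmf.items())

# Direct computation: build the law of Y = g(X) explicitly and average over it.
law_Y = {}
for k, pk in pmf.items():
    law_Y[g(k)] = law_Y.get(g(k), 0.0) + pk
e_g_direct = sum(y * py for y, py in law_Y.items())

print(round(e_g, 6), round(e_g_direct, 6))       # identical values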
Particularly important functions are the powers $x^k$, which lead to the following definition.
Definition 4.7 (Moments). Let X be a discrete random variable taking values x1, x2, . . .. The k–th moment of X is
$$E(X^k) = \sum_i x_i^k \Pr(X = x_i),$$
whenever the sum is absolutely convergent. The k–th central moment of X is
$$E\big((X - E(X))^k\big) = \sum_i (x_i - E(X))^k \Pr(X = x_i).$$
The second central moment is the variance of X,
$$\operatorname{Var}(X) = E\big((X - E(X))^2\big) = E(X^2) - (E(X))^2,$$
which measures the mean quadratic deviation of X with respect to the expected value. The smaller the variance, the more concentrated are the values of X around its mean (there is less probability that it takes values far from its mean). A quantitative measure of this concentration is given by the Chebyshev inequality.
Theorem 4.9 (Markov and Chebyshev inequalities). Let X be a discrete random variable taking only nonnegative values, with finite expectation. Then, for each a ∈ R+,
$$\Pr(X \ge a) \le \frac{E(X)}{a} \qquad \text{(Markov inequality)}.$$
Let X be a discrete random variable with first and second moments. Then, for each a ∈ R+,
$$\Pr(|X - E(X)| \ge a) \le \frac{\operatorname{Var}(X)}{a^2} \qquad \text{(Chebyshev inequality)}.$$
One particular consequence is that, if Var(X) = 0, then X takes the single value E(X) with probability one: such a variable is a constant.
Example 4.10. Let X be a random variable with uniform distribution U(n). Then
$$E(X) = \sum_{i=1}^{n} \frac{i}{n} = \frac{1}{n}\cdot\frac{n(n+1)}{2} = \frac{n+1}{2}.$$
Computing the variance requires the formula $\sum_{i=1}^{n} i^2 = n(n+1)(2n+1)/6$, which can be proved by induction on n:
$$E(X^2) = \sum_{i=1}^{n} \frac{i^2}{n} = \frac{(n+1)(2n+1)}{6}.$$
Hence,
$$\operatorname{Var}(X) = E(X^2) - (E(X))^2 = \frac{n^2 - 1}{12}.$$
Chebyshev's inequality gives
$$\Pr\big(|X - (n+1)/2| \ge k\big) \le \frac{n^2 - 1}{12 k^2},$$
while the actual value is
$$\Pr\big(|X - (n+1)/2| \ge k\big) = 1 - \Pr\Big(\frac{n - 2k + 1}{2} < X < \frac{n + 2k + 1}{2}\Big) = 1 - \frac{2k - 1}{n} = \frac{n - 2k + 1}{n}.$$
For example, for n = 13 the two values are
k 1 2 3 4 5 6
Chebyshev 14 7/2 14/9 7/8 14/25 7/18
Actual value 12/13 10/13 8/13 6/13 4/13 2/13
which shows that the bounds can be rather poor. However, the Chebyshev estimation is valid
for any probability distribution and it can be tight.
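The table is straightforward to reproduce exactly (a small Python sketch using exact rational arithmetic for n = 13):

from fractions import Fraction

n = 13
var = Fraction(n * n - 1, 12)                  # Var(X) = (n^2 - 1)/12 = 14
for k in range(1, 7):
    chebyshev = var / (k * k)                  # Chebyshev bound on Pr(|X - (n+1)/2| >= k)
    actual = Fraction(n - 2 * k + 1, n)        # exact value for X ~ U(13)
    print(k, chebyshev, actual)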
5. Law of averages
We now come to one of the connections between the axiomatic approach to probability
theory and the frequency one. Let A be an event in a probability space with probability p =
Pr(A). If we make n independent repetitions of the experiment associated with the probability space, the number X of times the event A appears follows a binomial distribution X ∼ Bin(n, p). The proportion of successes in the n trials is X/n, the relative frequency of appearance of A. By Chebyshev's inequality, for every a > 0,
$$\Pr(|X/n - p| \ge a) = \Pr(|X - E(X)| \ge na) \le \frac{pq}{a^2}\,\frac{1}{n} \to 0 \qquad (n \to \infty).$$
In words, the relative frequency of appearance of A approaches its probability for n large. This is precisely the intuitive meaning of probability and matches its axiomatic presentation. The above fact is the simplest expression of what is called the Law of Large Numbers, already obtained by Bernoulli in 1692.
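A simulation makes the statement tangible (a minimal Python sketch; p = 0.3 and the sample sizes are arbitrary illustrative choices):

import random

random.seed(0)
p = 0.3                                              # probability of the event A

for n in (100, 10_000, 1_000_000):
    x = sum(random.random() < p for _ in range(n))   # X ~ Bin(n, p)
    print(n, x / n)                                  # the relative frequency X/n approaches p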
It is interesting to note that, for the particular case presented before, one can obtain a
much better estimation than the one given by Chebyshev inequality. The next bound was
first obtained by Bernstein and belongs to the family of what are known as Chernoff bounds. The proof illustrates an interesting technique worth seeing.
Let $m = \lceil n(p + \varepsilon) \rceil$. For each λ > 0 and each k ≥ m we have $e^{\lambda k} \ge e^{\lambda n(p+\varepsilon)}$. Therefore,
\begin{align*}
\Pr(X/n \ge p + \varepsilon) &\le \sum_{k=m}^{n} e^{\lambda(k - n(p+\varepsilon))} \binom{n}{k} p^k q^{n-k} \\
&\le e^{-\lambda n \varepsilon} \sum_{k=0}^{n} \binom{n}{k} (p e^{\lambda q})^k (q e^{-\lambda p})^{n-k} \\
&= e^{-\lambda n \varepsilon} \big(p e^{\lambda q} + q e^{-\lambda p}\big)^{n}.
\end{align*}
By using $e^x \le x + e^{x^2}$, valid for all x, one can turn both exponents into positive ones:
\begin{align*}
\Pr(X/n \ge p + \varepsilon) &\le e^{-\lambda n \varepsilon} \big(p e^{\lambda^2 q^2} + q e^{\lambda^2 p^2}\big)^{n} \\
&\le e^{\lambda^2 n - \lambda n \varepsilon},
\end{align*}
where in the last inequality we have used $e^{\lambda^2 q^2}, e^{\lambda^2 p^2} \le e^{\lambda^2}$. This inequality, valid for every λ > 0, is optimized when λ = ε/2, giving
$$\Pr(X/n \ge p + \varepsilon) \le e^{-n \varepsilon^2 / 4}.$$
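The following sketch (Python; p = 0.5 and ε = 0.2 are illustrative choices) compares the exact binomial tail with the Chebyshev bound pq/(nε²) and with the bound e^{−nε²/4} obtained above; for moderate n Chebyshev may still be smaller, but as n grows the exponential bound improves dramatically:

from math import ceil, comb, exp

def binom_tail(n, p, m):
    """Exact Pr(X >= m) for X ~ Bin(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(m, n + 1))

p, eps = 0.5, 0.2
q = 1 - p
for n in (100, 400, 800):
    exact = binom_tail(n, p, ceil(n * (p + eps)))    # Pr(X/n >= p + eps)
    chebyshev = p * q / (eps**2 * n)                 # bound for the two-sided event
    chernoff = exp(-n * eps**2 / 4)                  # bound derived above
    print(n, exact, chebyshev, chernoff)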
6. Continuous random variables
Continuous random variables are roughly identified by the fact that the distribution function is continuous. As it happens, this requirement is not enough to identify the class of continuous random variables. Instead we ask for the stronger requirement that the distribution function can be obtained by integration of a density function: a random variable X is continuous if there is a nonnegative function f such that
$$F_X(x) = \int_{-\infty}^{x} f(t)\, dt \qquad \text{for every } x \in \mathbb{R}.$$
The function f is called the probability density function of X and is usually denoted by f_X.
By the fundamental theorem of Calculus, we have
$$f_X(x) = F_X'(x)$$
at each point x where FX has a derivative. It is a result from Calculus that a continuous
random variable has a continuous distribution function. This in particular shows that
$$\Pr(X = x) = F_X(x) - \lim_{t \uparrow x} F_X(t) = 0$$
for all x. In particular, Pr(X ∈ A) = 0 for every countable set A ⊂ R. By the Fundamental
theorem of Calculus,
Pr(a < X < a + h) = fX (a)h + o(h),
which can be written, for small h, as
$$\frac{\Pr(a < X < a+h)}{h} \approx f_X(a),$$
which explains the name ‘probability density’ for fX . So, large values of fX indicate large
probability of being locally around the argument. In this sense one can interpret continuous
random variables as limit versions of discrete ones.
Example 6.2. Let X_n have the uniform discrete distribution on n equally spaced points of the interval [0, 1]. As n → ∞ the distribution function F_{X_n} tends to the function F with F(x) = x on [0, 1] (and F(x) = 0 for x < 0, F(x) = 1 for x > 1), which is the distribution function of a continuous random variable Y. The density of Y is $f_Y = \mathbf{1}_{[0,1]}$. The density function is constant on the points where it is nonzero, indicating that the distribution is uniform. This is called the uniform distribution U(0, 1) and corresponds to the random choice of a point in the interval.
The probability that X lies in a set A can be obtained from the density function as
$$\Pr(X \in A) = \int_A f_X(t)\, dt.$$
In particular,
$$\Pr(a < X \le b) = \Pr(a < X < b) = \int_a^b f_X(t)\, dt.$$
The density function of a continuous random variable has the following properties:
Proposition 6.3. Let f_X be the density function of a continuous random variable X. Then $f_X(x) \ge 0$ for every x ∈ R, and $\int_{-\infty}^{\infty} f_X(x)\, dx = 1$.
The above properties characterize the class of functions which are density functions of
some continuous random variable.
7. Probability models
7.1. Uniform distribution. We choose a random point in an interval (a, b). All intervals
with the same length have the same probability. This leads to:
Definition 7.1. A random variable X has the uniform distribution on an interval (a, b) if
its density function is
$$f_X(x) = \frac{1}{b-a}\,\mathbf{1}_{(a,b)}(x) = \begin{cases} 1/(b-a) & x \in (a,b) \\ 0 & x \notin (a,b). \end{cases}$$
The distribution function of X is
$$F_X(x) = \begin{cases} 0 & x \le a \\ \dfrac{x-a}{b-a} & a < x < b \\ 1 & x \ge b. \end{cases}$$
We write X ∼ U(a, b).
7.2. Exponential distribution. The exponential distribution can be seen as the limiting distribution of a geometric one. The model corresponds to the time at which a random event occurs, when its occurrence in a small interval behaves like a Bernoulli variable independent of the occurrences in other disjoint intervals.
Definition 7.2. A random variable X has the exponential distribution with parameter λ if
its density function is
fX (x) = λe−λx 1(0,∞) (x)
The distribution function of X is, for x > 0,
$$F_X(x) = \int_0^x \lambda e^{-\lambda t}\, dt = 1 - e^{-\lambda x}.$$
We write X ∼ Exp(λ).
If X_n is a discrete random variable with the geometric distribution, X_n ∼ Geom(1/n), then
$$\Pr(X_n \le kn) = 1 - (1 - 1/n)^{kn} \to 1 - e^{-k} \qquad (n \to \infty).$$
This illustrates how the exponential distribution can be seen as the limit of geometric distributions. In particular, the exponential distribution also has the memoryless property:
Proposition 7.3. Let X be a random variable with the exponential distribution. Then, for
s, t > 0
Pr(X > t|X > s) = Pr(X > t − s).
It can be shown that a continuous random variable taking values in (0, ∞) with the
memoryless property has an exponential distribution.
An interesting connection with the Poisson distribution is worth mentioning. Suppose
that the number of events in a time interval [0, t] is a random variable X with a Poisson law
X ∼ Pois(λt). There are natural assumptions which lead to such a distribution. Then the
waiting time for the first event to happen is a random variable T with distribution
Pr(T ≤ t) = Pr(X ≥ 1) = 1 − Pr(X = 0) = 1 − e−λt ,
so that T follows an Exponential distribution T ∼ Exp(λ).
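This connection is easy to check by simulation (a minimal Python sketch; λ = 1.5 and t = 2 are arbitrary illustrative choices): generating events whose waiting times are independent Exp(λ) variables, the number of events in [0, t] behaves like a Pois(λt) variable.

import math
import random

random.seed(2)
lam, t, trials = 1.5, 2.0, 100_000

counts = []
for _ in range(trials):
    s, n = random.expovariate(lam), 0
    while s <= t:                         # accumulate Exp(lam) inter-arrival times up to time t
        n += 1
        s += random.expovariate(lam)
    counts.append(n)

mean_count = sum(counts) / trials
p_zero = counts.count(0) / trials
print(round(mean_count, 3), lam * t)                      # close to E(X) = lambda * t
print(round(p_zero, 4), round(math.exp(-lam * t), 4))     # close to Pr(X = 0) = e^{-lambda t}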
7.3. Normal distribution. The Normal distribution is among the most important ones
in Probability and Statistics. One of the reasons is that it can be seen as the limiting
distribution of the Binomial distribution. As such, it models (after normalization) the sum of many independent Bernoulli variables. It occurs in random phenomena which are the sum of many
independent inputs. It was observed by Gauss as the law of errors in measurements, and for
that reason it is also known as the Gaussian distribution.
Definition 7.4. A random variable X has the normal distribution with parameters m, σ if
its density function is
$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-m)^2}{2\sigma^2}}.$$
We write X ∼ N (m, σ 2 ).
As it happens, the density function of a normal distribution does not have a primitive
which can be expressed as a finite combination of elementary functions. For that reason,
the values of the normal distribution were historically recorded in tables. These values are accessible through most standard mathematical software, in particular in R.
For m = 0 and σ 2 = 1 the corresponding normal distribution N (0, 1) is called the standard
one and has the more transparent density function
$$f_X(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}.$$
This is a symmetric function with respect to the origin and has particularly small tails.
Proposition 7.5. Let X ∼ N (0, 1). Then, for x > 0,
$$\Pr(X > x) = \int_x^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt < \frac{1}{x\sqrt{2\pi}}\, e^{-x^2/2}.$$
In the celebrated treatise on probability by Laplace one can already find what is known as the De Moivre–Laplace theorem, which states that the binomial probabilities tend to the normal density. More precisely,
Theorem 7.6 (De Moivre-Laplace). For n large, X ∼ Bin(n, p) and k close to np we have
$$\Pr(X = k) = \binom{n}{k} p^k q^{n-k} \sim \frac{1}{\sqrt{2\pi npq}}\, e^{-\frac{(k-np)^2}{2npq}},$$
which is the value of the density of the normal distribution N (np, npq) at k.
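A quick numeric check of the approximation (a minimal Python sketch; n = 100, p = 0.3 and the values of k are arbitrary illustrative choices):

from math import comb, exp, pi, sqrt

n, p = 100, 0.3
q = 1 - p
for k in (20, 25, 30, 35, 40):
    exact = comb(n, k) * p**k * q**(n - k)
    approx = exp(-((k - n * p) ** 2) / (2 * n * p * q)) / sqrt(2 * pi * n * p * q)
    print(k, round(exact, 4), round(approx, 4))    # the two columns are close near k = np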
The Galton board (bean machine) is a physical device which illustrates the above Theorem.
The de Moivre-Laplace Theorem is the first form of the celebrated Central Limit Theorem,
one of the central results in Probability and Statistics. The general form of this basic result
will be discussed later on in this course.
Expectation, and general moments, can be defined for general random variables. They
may be expressed in analogous forms for discrete and for continuous random variables, with
the same meaning.
Definition 8.1 (Expectation and Moments). Let X be a continuous random variable with density f_X. The expectation of X is
$$E(X) = \int_{-\infty}^{\infty} x f_X(x)\, dx,$$
whenever the integral is absolutely convergent. The k–th moment of the random variable X is
$$E(X^k) = \int_{-\infty}^{\infty} x^k f_X(x)\, dx,$$
and the k–th central moment is
$$E\big((X - m)^k\big) = \int_{-\infty}^{\infty} (x - m)^k f_X(x)\, dx,$$
where m = E(X). The second central moment is the variance of X,
$$\sigma_X^2 = \operatorname{Var}(X) = E\big((X - m)^2\big) = E(X^2) - (E(X))^2,$$
whenever the corresponding integrals are absolutely convergent.
The following table summarizes the expectation and variance of the most common distri-
butions.
Distribution       Mean value       Variance
X ∼ U(a, b)        (a + b)/2        (b − a)²/12
X ∼ Exp(λ)         1/λ              1/λ²
X ∼ N(m, σ²)       m                σ²
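These values can be checked by Monte Carlo simulation (a minimal Python sketch; the distributions U(0, 2) and Exp(2) are arbitrary illustrative choices):

import random

random.seed(3)
N = 200_000

u = [random.uniform(0.0, 2.0) for _ in range(N)]    # U(0, 2): mean 1, variance 1/3
e = [random.expovariate(2.0) for _ in range(N)]     # Exp(2): mean 1/2, variance 1/4

def mean_var(xs):
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / len(xs)
    return round(m, 3), round(v, 3)

print(mean_var(u))   # approximately (1.0, 0.333)
print(mean_var(e))   # approximately (0.5, 0.25)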
Applying a continuous (or, more generally, measurable) function to a continuous random variable produces, under mild conditions, another continuous random variable. The following is a useful result for computing density functions and expectations of such functions of random variables.
Theorem 9.1 (Change of Variable). Let X be a continuous random variable with density f_X. Let g : R → R be a differentiable function with g′(x) > 0 for each x. Then,
$$f_Y(y) = \frac{1}{g'(x)}\, f_X(x),$$
where Y = g(X) and y = g(x). Moreover, if g′(x) < 0 for each x, then
$$f_Y(y) = \frac{1}{|g'(x)|}\, f_X(x).$$
Example 9.2 (Linear Transformations). One simple example of a transformation is a linear
one, g(x) = ax + b, and, for Y = aX + b, one gets
$$f_Y(y) = \frac{1}{|a|}\, f_X\Big(\frac{y-b}{a}\Big).$$
One important case is converting a random variable to a standard one. If m = E(X) and Var(X) = σ², the linear transformation Y = (X − m)/σ turns X into a variable with expectation E(Y) = 0 and variance Var(Y) = 1. Its density function is
$$f_Y(y) = \sigma f_X(\sigma y + m).$$
The above Theorem can be extended to differentiable functions g which are not necessarily
strictly monotone. The case Y = X 2 is an important example which illustrates the general
situation.
Proposition 9.4. Let X be a continuous random variable with density f_X. The density of Y = X² is, for y > 0,
$$f_Y(y) = \frac{1}{2\sqrt{y}}\,\big(f_X(\sqrt{y}) + f_X(-\sqrt{y})\big).$$
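For instance, if X ∼ U(−1, 1) then f_X = 1/2 on (−1, 1) and the proposition gives f_Y(y) = 1/(2√y) on (0, 1), hence F_Y(y) = √y. A minimal simulation sketch in Python confirms this:

import random

random.seed(4)
N = 200_000
ys = [random.uniform(-1.0, 1.0) ** 2 for _ in range(N)]   # samples of Y = X^2, X ~ U(-1, 1)

for y in (0.04, 0.25, 0.64):
    empirical = sum(v <= y for v in ys) / N
    print(y, round(empirical, 3), round(y ** 0.5, 3))     # empirical F_Y(y) vs sqrt(y)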
As in the discrete case, the Markov and Chebyshev inequalities are valid for continuous
random variables.
Theorem 10.1 (Markov and Chebyshev inequalities). Let X be a continuous random variable taking only nonnegative values, with finite expectation. Then, for each a ∈ R+,
$$\Pr(X \ge a) \le \frac{E(X)}{a} \qquad \text{(Markov inequality)}.$$
Let X be a continuous random variable with first and second moments. Then, for each a ∈ R+,
$$\Pr(|X - E(X)| \ge a) \le \frac{\operatorname{Var}(X)}{a^2} \qquad \text{(Chebyshev inequality)}.$$
Most programming languages and software packages have a primitive to obtain a random
number with the uniform distribution in the interval [0, 1]. It is invoked as random() in
Python, or rand() in C++. Actually these numbers are produced by computational means
which are called Pseudo Random Number Generators. They produce sequences of numbers
which are not random but do have statistics close to what a truly random number would
have. Among the most common devices are the Linear Congruential Generators, which are based on a congruence recurrence of the form
$$x_{n+1} = (a x_n + b) \pmod{m},$$
for suitably chosen a, b and m. The sequence starts at an initial value, called the seed, which is often taken from the computer clock.
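A minimal linear congruential generator in Python (the constants a = 1103515245, b = 12345, m = 2³¹ are those of a classical C library generator and are used here only as an illustration, not as a recommendation):

def lcg(seed, a=1103515245, b=12345, m=2**31):
    """Linear congruential generator: x_{n+1} = (a*x_n + b) mod m, rescaled to [0, 1)."""
    x = seed
    while True:
        x = (a * x + b) % m
        yield x / m

g = lcg(seed=2024)
print([round(next(g), 4) for _ in range(5)])   # five pseudo-random numbers in [0, 1)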
The Mersenne Twister is based on an analogous linear recurrence over a finite field, with period the large Mersenne prime $2^{19937} - 1$, and it is the default random generator used by Python and R, among many other programming languages and mathematical software systems.
Once the uniform distribution is available, one can easily produce random samples from other distributions. The most common way is based on the following fact: if U ∼ U(0, 1) and F is a distribution function, then F^{-1}(U) is a random variable with distribution function F. Thus, if F_X is invertible (strictly monotone), one can obtain the distribution of X from a uniform random number simply by writing $X = F_X^{-1}(U)$.
Example 11.2. In order to sample the exponential distribution X ∼ Exp(λ), one can obtain a sample U of the uniform distribution in [0, 1] and then apply the function
$$X = -\frac{1}{\lambda}\,\ln(1 - U).$$
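A minimal sketch of this inverse-transform sampling in Python (λ = 2 is an arbitrary illustrative choice), together with a quick check of the resulting mean and distribution function:

import math
import random

random.seed(5)
lam = 2.0

def sample_exponential(lam):
    """Inverse transform: X = -ln(1 - U)/lam with U ~ U(0, 1)."""
    u = random.random()
    return -math.log(1.0 - u) / lam

xs = [sample_exponential(lam) for _ in range(100_000)]
print(round(sum(xs) / len(xs), 3))             # close to E(X) = 1/lam = 0.5

x0 = 0.7
empirical = sum(x <= x0 for x in xs) / len(xs)
print(round(empirical, 3), round(1 - math.exp(-lam * x0), 3))   # empirical vs F(x0) = 1 - e^{-lam x0}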
There are other specific functions, particularly to sample the Normal distribution, whose
distribution function is not expressible in simple analytic terms.
Discrete distributions can also be sampled with analogous methods. If X is a discrete random variable which takes integer values with distribution function F_X, then we sample X = k if F_X(k − 1) < U ≤ F_X(k).
Sampling in R according to a given distribution can be done directly with the sample function.
Simulation of random phenomena is a very common tool, and it can become an art full of subtleties and technical difficulties.