
Probability and Statistics

Data Science Engineering


Chapter 2: Random variables

Random Variables. Cumulative distribution function. Discrete random variables. Probability function. Probability models. Expectation and Moments. Law of averages. Continuous random variables. Probability density. Continuous distributions. Normal distribution. Simulation of random variables.

1. Random variables

The language of events is sometimes cumbersome. It is often more convenient to describe events numerically, identifying outcomes with numbers. For example, in dice rolling the outcomes are identified with the numerical values Ω = {1, 2, 3, 4, 5, 6}. This leads to the definition of random variables.

Definition 1.1. A random variable on a probability space (Ω, A, Pr) is a function

X:Ω→R

with the property that, for every x ∈ R, the preimage X −1 ((−∞, x]) = {ω ∈ Ω : X(ω) ≤ x}
is an event in A.

A random variable simply translates elements in the sample spaces to numbers, with
the condition that intervals of the form (−∞, x] have preimages in the σ–algebra A of the
probability space, where probabilities are defined.

Example 1.2. Let A = {∅, A, Ā, Ω} be a Bernoulli algebra on a sample space Ω. The map

X : Ω → R

defined by

X(ω) = 0 if ω ∉ A,    X(ω) = 1 if ω ∈ A,

is a random variable: we have

X⁻¹((−∞, x]) = ∅ for x < 0,    X⁻¹((−∞, x]) = Ā for 0 ≤ x < 1,    X⁻¹((−∞, x]) = Ω for x ≥ 1.

Such random variables are called indicator random variables for the set A and are usually written as 1A.

The purpose of the definition of random variables is to translate the probability func-
tion from the probability space to R. This is generally accomplished by the cumulative
distribution function.

Definition 1.3. Let X be a random variable on a probability space (Ω, A, Pr). The cumulative distribution function of X is
FX : R → R
defined as
FX(x) = Pr(X⁻¹((−∞, x])) = Pr(X ≤ x).

Example 1.4. Let 1A be the indicator function of the event A in a probability space (Ω, A, Pr) with Pr(A) = p. The cumulative distribution function of 1A is

F1A(x) = 0 for x < 0,    F1A(x) = 1 − p for 0 ≤ x < 1,    F1A(x) = 1 for x ≥ 1.

Distribution functions are identified by the following properties:

Proposition 1.5. Let FX be the distribution function of a random variable X. Then

(1) 0 ≤ FX (x) ≤ 1 for all x ∈ R.


(2) FX is non-decreasing: x < y implies FX (x) ≤ FX (y).
(3) limx→∞ FX (x) = 1 and limx→−∞ FX (x) = 0.
(4) FX is right–continuous: limx↓a FX (x) = FX (a).

The above properties identify the class of real functions which are distribution functions
of some random variable.

Remark 1.6. Usually capital letters like X, Y, Z, . . . are used to denote random variables in the probability setting. For a subset A ⊂ R we use the shorthand X ∈ A for X⁻¹(A). Thus, we write Pr(X < 0), Pr(0 < X ≤ 1) or Pr(X = 2) instead of Pr(X⁻¹((−∞, 0))), Pr(X⁻¹((0, 1])), Pr(X⁻¹({2})) or Pr({ω ∈ Ω : X(ω) = 2}).
Probabilities of general events X ∈ A can be expressed in terms of the distribution function (though sometimes not in a simple way). Some examples are:
Proposition 1.7. Let X be a random variable in a probability space with distribution func-
tion FX . Then

(i) For a, b ∈ R, a < b, we have Pr(a < X ≤ b) = FX (b) − FX (a).


(ii) For a ∈ R, we have Pr(X = a) = FX (a) − limx↑a FX (x).

2. Discrete Random variables

A random variable is discrete if it takes a countable number of values.


Definition 2.1. A random variable X on the probability space (Ω, A, Pr) is discrete if X(Ω)
is countable.

An equivalent definition is that the distribution function FX of X is a step function: there is a countable set {x1, x2, . . .} ⊂ R of real numbers such that FX is constant outside this set, and FX has jump discontinuities at the points in this set.
For discrete random variables it is usually preferred to identify the probability distribution
by the so–called probability function instead of the cumulative distribution function.
Definition 2.2. The probability function of a discrete random variable is
PX : R → [0, 1]
defined as PX (x) = Pr(X = x).
Example 2.3. When throwing a die we naturally identify the outcomes with a discrete random variable X which takes values in {1, 2, 3, 4, 5, 6}. Its probability function is Pr(X = i) = 1/6 for 1 ≤ i ≤ 6. This shows how natural the use of random variables is in many situations.
Example 2.4. An indicator function 1A is obviously a discrete random variable taking values in {0, 1} with probability function Pr(1A = 0) = 1 − Pr(A) and Pr(1A = 1) = Pr(A). Actually every discrete random variable can be written as a linear combination of indicator variables: if Ai is the event that X = xi for each xi ∈ X(Ω), then X = Σi xi 1Ai.
Example 2.5. We toss a coin until Heads shows up. The random variable X which counts the number of tosses takes values in N, a countably infinite set. Its probability function is Pr(X = i) = (1/2)^i (and certainly, Σi Pr(X = i) = 1).


3. Discrete Probability models

We now come to one important part of the course. A large part of probability theory is developed upon simple models which are used again and again to build more complex ones. We next describe some of the most important ones. Each is identified as a model and gives its name to the corresponding probability distribution and, by abuse of language, to the random variables.

3.1. Bernoulli model. The Bernoulli model is the simplest one, associated to the Bernoulli algebra and to the indicator functions. In this model we are only interested in a single event A ⊂ Ω and want to determine if this event occurs. Formally,

Definition 3.1. A random variable X has the Bernoulli distribution with parameter p ∈ [0, 1] if X(Ω) = {0, 1} and
Pr(X = 0) = 1 − p,    Pr(X = 1) = p.
Equivalently, X = 1A for some A ⊂ Ω. We write X ∼ B(p).

There is not much to say about the Bernoulli distribution except that it is the building block of many more complicated models.

3.2. Binomial model. We again focus on a single event A in the sample space, but now we make n independent repetitions of the experiment associated to the probability space, and we are interested in counting how many times the event occurs in these n repetitions. By independence, any possible output of the n experiments in which there are k occurrences of A has probability p^k q^(n−k), where p = Pr(A) and q = 1 − p. There are (n choose k) outputs with precisely k occurrences of A. This gives the probability distribution of our random variable.

Definition 3.2. A random variable X has the Binomial distribution with parameters n and p if

Pr(X = k) = (n choose k) p^k q^(n−k),    k = 0, 1, . . . , n.

We write X ∼ Bin(n, p).
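These probabilities are easy to compute directly. A minimal sketch in Python (the helper name binomial_pmf is our own choice):

```python
from math import comb

def binomial_pmf(n: int, p: float, k: int) -> float:
    """Pr(X = k) for X ~ Bin(n, p), with q = 1 - p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# The probabilities over k = 0, ..., n must add up to one.
total = sum(binomial_pmf(10, 0.3, k) for k in range(11))
```

That the probabilities sum to one is just the binomial theorem applied to (p + q)^n = 1.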

The Binomial model is quite a fundamental one. Among other interesting properties, we have:

Proposition 3.3. Let X ∼ Bin(n, p). Then

Pr(X = k) ≤ Pr(X = k + 1),    0 ≤ k ≤ np − 1,

and

Pr(X = k) ≥ Pr(X = k + 1),    np − 1 < k ≤ n.

In other words, the sequence Pr(X = 0), Pr(X = 1), . . . , Pr(X = n) is unimodal with a maximum at ⌊np⌋.
3.3. Poisson Model. The values of the probability function of a Binomial variable are somewhat cumbersome to compute. A useful simplification is the limiting model for n large as long as np is kept constant. More precisely, if Xn ∼ Bin(n, pn) with lim_{n→∞} npn = λ, then the probability function of Xn has a limit which can be written in a simpler way:

Pr(Xn = k) = (n choose k) pn^k (1 − pn)^(n−k)
           ∼ (1/k!) (npn)^k (1 − pn)^(−k) (1 − pn)^n
           ∼ (1/k!) λ^k e^(−λ),

where in the last line we have used that

lim_{n→∞} (1 − λ/n)^n = e^(−λ).

This leads to the following definition.


Definition 3.4. A random variable X has the Poisson distribution with parameter λ if

Pr(X = k) = (λ^k / k!) e^(−λ),    k = 0, 1, 2, . . .

and we write X ∼ Pois(λ).

Even if it is the result of an asymptotic approximation, the Poisson probability function approximates the Binomial one quite well. Here are some values for X ∼ Bin(n, p) and Y ∼ Pois(np) when n = 30 and p = 0.1:

k Pr(X = k) Pr(Y = k)
0 0.042 0.049
1 0.141 0.149
2 0.227 0.224
3 0.236 0.224
4 0.177 0.168
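The table above can be reproduced with a few lines of Python (a sketch; the function names are ours):

```python
from math import comb, exp, factorial

def binom_pmf(n: int, p: float, k: int) -> float:
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(lam: float, k: int) -> float:
    return lam**k * exp(-lam) / factorial(k)

# Compare Bin(30, 0.1) with Pois(3), where 3 = np.
n, p = 30, 0.1
rows = [(k, binom_pmf(n, p, k), poisson_pmf(n * p, k)) for k in range(5)]
```

Rounding each entry to three decimals recovers the table.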

The number of radioactive particle emissions, the number of phone calls, or the number of raindrop impacts are examples where the Poisson distribution fits well: they are associated to a large number of trials with a small probability of success which can be imagined to be independent.

3.4. Geometric model. We again repeat independently an experiment associated to a Bernoulli algebra, but now we count the number of repetitions until the first occurrence of A (the waiting time until the first ‘success’). By independence, the probability that we wait exactly k repetitions is q^(k−1) p.
Definition 3.5. A random variable X has the geometric distribution with parameter p if

Pr(X = k) = q^(k−1) p,    k = 1, 2, 3, . . .

We write X ∼ Geom(p).

The name of the distribution comes from the fact that the sum of probabilities, which must add up to one, is a geometric series:

Σ_{k≥1} Pr(X = k) = Σ_{k≥1} q^(k−1) p = p · 1/(1 − q) = 1.

We note that a slight variation of the geometric distribution, Y = X − 1, which counts the number of ‘failures’ before the first ‘success’, is usually also called geometric, with probability distribution

Pr(Y = k) = q^k p,    k = 0, 1, 2, . . .

starting with k = 0 instead of k = 1.
The geometric distribution has a characteristic property, the ‘lack of memory’. If we know that up to the r–th repetition we have had no success, then the probability of having to wait more than s additional repetitions is the same as from the start:

Pr(X > r + s | X > r) = Pr(X > r + s, X > r) / Pr(X > r) = q^(r+s) / q^r = q^s = Pr(X > s),

where we have used Pr(X > r) = Σ_{k>r} q^(k−1) p = q^r. In other words, if we are waiting for Heads in coin tossing, the fact that we have waited 10^3 tosses without seeing the event does not mean that Heads is approaching faster in the future.

Proposition 3.6. Let X be a discrete random variable taking values in the positive integers with Pr(X = 1) = p. If X has the ‘lack of memory’ property then X ∼ Geom(p).
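The lack-of-memory property can be checked numerically with the tail formula of the geometric distribution; a sketch (the parameter values are arbitrary):

```python
p = 0.3
q = 1 - p

def tail(r: int) -> float:
    """Pr(X > r) = q**r for X ~ Geom(p)."""
    return q**r

# Pr(X > r + s | X > r) coincides with Pr(X > s): no memory.
r, s = 5, 7
conditional = tail(r + s) / tail(r)

# Sanity check of the tail formula against the pmf (truncated series).
direct = sum(q**(k - 1) * p for k in range(r + 1, 2000))
```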

3.5. Negative Binomial model. If instead of waiting for the first ‘success’ as in the geometric distribution, one waits until the appearance of the r–th ‘success’, then the probability of waiting exactly k repetitions is that of having r − 1 successes in the first k − 1 trials (a binomial probability) and then having the r–th one in the k–th trial.

Definition 3.7. A random variable X has the negative binomial distribution with parameters p and r if

Pr(X = k) = (k−1 choose r−1) p^r q^(k−r),    k = r, r + 1, . . .

We write X ∼ NegBin(p, r).

Of course, when r = 1 we obtain the geometric distribution.

3.6. Hypergeometric model. In most of the above examples we consider independent repetitions of a simple Bernoulli experiment. When sampling without replacement from a population, the successive trials are no longer independent, because the result of the k–th trial affects the probability distribution in the (k + 1)–th trial. The typical example is drawing balls from an urn without replacement (as opposed to drawing with replacement, where the trials are independent).
In the hypergeometric model, we extract samples of size r out of a population of size n
which has n1 individuals of one type and n2 = n − n1 individuals of a second type. We are
then interested in counting the number of individuals of type 1 in the sample. This leads to
the following definition.
Definition 3.8. A random variable X has the hypergeometric distribution with parameters n, n1 and r if

Pr(X = k) = (n1 choose k)(n2 choose r−k) / (n choose r),    k = 0, 1, 2, . . .

where n2 = n − n1. We write X ∼ HypGeom(n, n1, r).

The range of values of k for which the above definition is meaningful is

max{0, r − n2} ≤ k ≤ min{r, n1}.

In order not to worry about these boundary values we may adopt the (reasonable) convention that a binomial coefficient (a choose b) equals zero whenever b > a or b < 0. The name of the distribution comes from the fact that the sum of probabilities is a hypergeometric (finite) series.
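The convention on binomial coefficients is easy to implement; a sketch in Python (math.comb already returns 0 when b > a, but raises an error for negative arguments, hence the guard; the function names are ours):

```python
from math import comb

def c(a: int, b: int) -> int:
    """Binomial coefficient with the convention that it is 0 for b < 0 or b > a."""
    return comb(a, b) if 0 <= b <= a else 0

def hypergeom_pmf(n: int, n1: int, r: int, k: int) -> float:
    """Pr(X = k) for X ~ HypGeom(n, n1, r)."""
    return c(n1, k) * c(n - n1, r - k) / comb(n, r)

# Samples of size r = 4 from a population of n = 12 with n1 = 5 of type 1.
total = sum(hypergeom_pmf(12, 5, 4, k) for k in range(5))
```

With the convention in place, values of k outside the meaningful range simply get probability zero.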

3.7. Uniform model. The basic distribution we have repeatedly seen on a finite sample space Ω = {1, 2, . . . , n} is the uniform one, where each outcome gets the same probability.

Definition 3.9. A random variable X has the uniform distribution with parameter n if

Pr(X = k) = 1/n,    k = 1, 2, . . . , n.

We write X ∼ U(n).

We simply note that a discrete random variable cannot have the uniform distribution on an infinite countable set, say Ω = N. This would lead to Pr(X = k) = 0 for all k and then, by σ–additivity, Pr(N) = Σ_{x∈N} Pr(X = x) = 0, contradicting the first axiom of a probability measure.

4. Expectation and moments

Expectation is a central concept in probability and statistics. The mean of a sequence x = (x1, . . . , xn) of real numbers is

x̄ = (Σi xi) / n.

If there are repetitions in the sequence then we can collect the values yj which are repeated nj times and rewrite the mean as

x̄ = Σj yj (nj / n).

The expectation of a random variable mimics the definition and the spirit of the mean of a sequence of numbers, by substituting probabilities for the relative frequencies nj/n:

Definition 4.1 (Expectation). The expectation of a discrete random variable X is

E(X) = Σi xi Pr(X = xi),

if the sum is absolutely convergent.

In other words, the expectation is the sum of the values of the random variable weighted by their probabilities. We will see later on that E(X) is the single number which best represents a probability distribution, a clear intuitive fact. Of course the expectation can be large because X takes very large values even with small probabilities, so the expectation (as the mean) may be a misleading representative of a random variable. A large amount of probability and statistics is devoted to clarifying the above statement.
The caution in the definition about the convergence of the sum is not superfluous: there
are random variables which do not have expectation, although sometimes the value ∞ is
accepted.

Example 4.2. Let X be a random variable taking values on the positive integers with probability

Pr(X = k) = 6 / (π² k²).

One can check (it is a famous problem in the history of mathematics) that Σk Pr(X = k) = 1. However, the series Σk k Pr(X = k) = (6/π²) Σk 1/k is not convergent, so X has no (finite) expectation.


The expectation of the basic distributions we have seen so far is as follows:

Distribution                Expectation
X ∼ B(p)                    p
X ∼ Bin(n, p)               np
X ∼ Pois(λ)                 λ
X ∼ Geom(p)                 1/p
X ∼ NegBin(p, r)            r/p
X ∼ HypGeom(n, n1, r)       r n1 / n

One important property of expectation is linearity. For this it is meaningful to have a look at the distribution of the sum of two discrete random variables.
Definition 4.3 (Sum of random variables). Let X, Y be two discrete random variables on the same probability space. The random variable Z = X + Y, defined as Z(ω) = X(ω) + Y(ω) for each ω ∈ Ω, has probability function

Pr(Z = k) = Σi Pr(X = i, Y = k − i),

where (X = i, Y = k − i) is shorthand for {X = i} ∩ {Y = k − i}.


Proposition 4.4. Let X, Y be two random variables on the same probability space. Then

E(X + Y) = E(X) + E(Y),

provided that the involved expectations exist. Moreover, for each λ ∈ R,

E(λX) = λ E(X).

Proposition 4.4 is particularly useful. For example, it provides a simple derivation of the
expectation of Binomial and Negative Binomial distributions.
A second important property of the expectation is related to functions of random variables.
Proposition 4.5 (Functions of random variables). Let X be a discrete random variable on a probability space and let g : R → R be a function such that the preimage of each interval (−∞, x], x ∈ R, belongs to the Borel σ–algebra (the class of such functions is called ‘measurable’ and it includes continuous functions).
Then the composition Y = g(X) is a discrete random variable on the same probability
space.

One can sometimes obtain explicitly the distribution of Y = g(X). However the following
result, usually called the theorem of expectation or the formula of change of variables for
expectation, is often useful.
Theorem 4.6. Let X be a discrete random variable on a probability space taking values x1, x2, . . . and let g : R → R be a measurable function. Then the expectation of Y = g(X) satisfies

E(Y) = Σi g(xi) Pr(X = xi).

Particularly important functions are the powers x^k, which lead to the following definition.

Definition 4.7 (Moments). Let X be a discrete random variable taking values x1, x2, . . .. The k–th moment of X is

E(X^k) = Σi xi^k Pr(X = xi),

whenever the sum is absolutely convergent. The k–th central moment of X is

E((X − E(X))^k) = Σi (xi − E(X))^k Pr(X = xi).

Among the moments, the second one has a particular importance.

Definition 4.8. The variance of the discrete random variable X is

Var(X) = E((X − E(X))²) = E(X²) − (E(X))².

The standard deviation of X is

σ(X) = +√Var(X).

The variance of X measures the mean squared deviation of X with respect to the expected value. The smaller the variance, the more concentrated are the values of X around its mean (there is less probability that it takes values far from its mean). A quantitative measure of this deviation is given by the Chebyshev inequality.

Theorem 4.9 (Markov and Chebyshev inequalities). Let X be a discrete random variable taking only nonnegative values, with finite expectation. Then, for each a ∈ R⁺,

Pr(X ≥ a) ≤ E(X)/a.    (Markov inequality)

Let X be a discrete random variable with first and second moments. Then, for each a ∈ R⁺,

Pr(|X − E(X)| ≥ a) ≤ Var(X)/a².    (Chebyshev inequality)

One particular consequence is that, if Var(X) = 0, then X takes the single value E(X) with probability one: such a variable is a constant.

Example 4.10. Let X be a random variable with uniform distribution U(n). Then

E(X) = Σ_{i=1}^{n} i/n = (1/n) · n(n + 1)/2 = (n + 1)/2.

Computing the variance requires the formula Σ_{i=1}^{n} i² = n(n + 1)(2n + 1)/6, which can be proved by induction on n:

E(X²) = Σ_{i=1}^{n} i²/n = (n + 1)(2n + 1)/6.

Hence,

Var(X) = E(X²) − (E(X))² = (n² − 1)/12.
Chebyshev's inequality gives

Pr(|X − (n + 1)/2| ≥ k) ≤ (n² − 1)/(12k²),

while the actual value is

Pr(|X − (n + 1)/2| ≥ k) = 1 − Pr((n − 2k + 1)/2 < X < (n + 2k + 1)/2) = 1 − (2k − 1)/n = (n − 2k + 1)/n.

For example, for n = 13 the two values are

k                1      2      3      4      5      6
Chebyshev        14     7/2    14/9   7/8    14/25  7/18
Actual value     12/13  10/13  8/13   6/13   4/13   2/13

which shows that the bounds can be rather poor. However, the Chebyshev estimation is valid for any probability distribution and it can be tight.
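The table for n = 13 can be verified exactly with rational arithmetic (a sketch using Python's fractions module):

```python
from fractions import Fraction

n = 13
mean = Fraction(n + 1, 2)

def exact_tail(k: int) -> Fraction:
    """Pr(|X - (n+1)/2| >= k), computed directly from the uniform pmf."""
    return sum(Fraction(1, n) for i in range(1, n + 1)
               if abs(Fraction(i) - mean) >= k)

chebyshev = {k: Fraction(n * n - 1, 12 * k * k) for k in range(1, 7)}
actual = {k: exact_tail(k) for k in range(1, 7)}
```

The exact tails reproduce the closed form (n − 2k + 1)/n and always stay below the Chebyshev bound.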

The variance of the basic distributions we have seen so far is as follows:

Distribution                Variance
X ∼ B(p)                    pq
X ∼ Bin(n, p)               npq
X ∼ Pois(λ)                 λ
X ∼ Geom(p)                 q/p²
X ∼ NegBin(p, r)            rq/p²
X ∼ HypGeom(n, n1, r)       r n1 (n − n1)(n − r) / (n²(n − 1))

5. Law of averages

We now come to one of the connections between the axiomatic approach to probability theory and the frequency one. Let A be an event in a probability space with probability p = Pr(A). If we make n independent repetitions of the experiment associated to the probability space, the number X of times the event A appears follows a binomial distribution X ∼ Bin(n, p). The proportion of successes in the n trials is X/n, the relative frequency of appearance of A. By Chebyshev's inequality, for every a > 0,

Pr(|X/n − p| ≥ a) = Pr(|X − E(X)| ≥ na) ≤ npq/(na)² = pq/(a²n) → 0    (n → ∞).

In words, the relative frequency of appearance of A approaches its probability for n large. This is precisely the intuitive meaning of probability and matches its axiomatic presentation. The above fact is the simplest expression of what is called the Law of Large Numbers, already obtained by Bernoulli in 1692.
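The convergence of relative frequencies to p can be watched in a short simulation (a sketch; the seed and sample sizes are arbitrary choices):

```python
import random

random.seed(42)  # fixed seed for reproducibility
p = 0.3
freqs = {}
for n in (100, 10_000):
    x = sum(1 for _ in range(n) if random.random() < p)  # X ~ Bin(n, p)
    freqs[n] = x / n  # relative frequency of the event
```

By Chebyshev, the deviation |X/n − p| is typically of order √(pq/n), so the frequency for n = 10 000 should land within a few hundredths of p.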
It is interesting to note that, for the particular case presented before, one can obtain a much better estimate than the one given by Chebyshev's inequality. The next bound was first obtained by Bernstein and belongs to the family known as Chernoff bounds. The proof illustrates an interesting technique worth seeing.

Theorem 5.1. Let X ∼ Bin(n, p). For each ε > 0 we have

Pr(|X/n − p| ≥ ε) ≤ 2 e^(−nε²/4).

Proof. We simply estimate one tail,

Pr(X/n ≥ p + ε) = Σ_{k≥n(p+ε)} Pr(X = k) = Σ_{k≥n(p+ε)} (n choose k) p^k q^(n−k).

Let m = ⌈n(p + ε)⌉. For each λ > 0 and each k ≥ m we have e^(λk) ≥ e^(λn(p+ε)). Therefore,

Pr(X/n ≥ p + ε) ≤ Σ_{k=m}^{n} e^(λ(k−n(p+ε))) (n choose k) p^k q^(n−k)
              ≤ e^(−λnε) Σ_{k=0}^{n} (n choose k) (p e^(λq))^k (q e^(−λp))^(n−k)
              = e^(−λnε) (p e^(λq) + q e^(−λp))^n.

By using e^x ≤ x + e^(x²), valid for all x, one can turn both exponents into positive ones:

Pr(X/n ≥ p + ε) ≤ e^(−λnε) (p e^(λ²q²) + q e^(λ²p²))^n ≤ e^(λ²n − λnε),

where in the last inequality we use e^(λ²q²), e^(λ²p²) ≤ e^(λ²). This inequality, valid for every λ > 0, is optimized when λ = ε/2, giving

Pr(X/n ≥ p + ε) ≤ e^(−nε²/4).

The other tail is bounded in the same way. □
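One can compare the exact binomial tail with this bound numerically (a sketch; the parameter values are arbitrary):

```python
from math import ceil, comb, exp

n, p, eps = 100, 0.5, 0.1
q = 1 - p
m = ceil(n * (p + eps))

# Exact one-sided tail Pr(X/n >= p + eps) for X ~ Bin(n, p).
exact = sum(comb(n, k) * p**k * q**(n - k) for k in range(m, n + 1))

# One-sided Chernoff-type bound from the proof above.
bound = exp(-n * eps**2 / 4)
```

The bound holds, although for these parameters it is far from tight; its value is its exponential decay in n.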
6. Continuous random variables

Continuous random variables are roughly identified by the fact that their distribution function is continuous. As it happens, this requirement is not enough to identify the class of continuous random variables. Instead, we impose the stronger requirement that the distribution function can be obtained by integration of a density function.

Definition 6.1 (Continuous random variable). A random variable is continuous if there is a function f : R → R such that, for each x ∈ R,

FX(x) = ∫_{−∞}^{x} f(t) dt.

The function f is called the probability density function of X and is usually denoted by fX.
By the fundamental theorem of Calculus, we have

fX(x) = F′X(x)

at each point x where FX has a derivative. It is a result from Calculus that a continuous random variable has a continuous distribution function. This in particular shows that

Pr(X = x) = FX(x) − lim_{t↑x} FX(t) = 0

for all x. In particular, Pr(X ∈ A) = 0 for every countable set A ⊂ R. By the fundamental theorem of Calculus,

Pr(a < X < a + h) = fX(a) h + o(h),

which can be written, for small h,

Pr(a < X < a + h)/h ≈ fX(a),

which explains the name ‘probability density’ for fX. So, large values of fX indicate a large probability of being locally around the argument. In this sense one can interpret continuous random variables as limit versions of discrete ones.
Example 6.2. Let Xn be the uniform discrete distribution on the n points 1/n, 2/n, . . . , 1 of the interval [0, 1]. For n → ∞ the distribution function FXn tends to the function F with F(x) = x for 0 ≤ x ≤ 1 (and F(x) = 0 for x < 0, F(x) = 1 for x > 1), which is the distribution function of a continuous random variable Y. The density of Y is fY = 1_{[0,1]}. The density function is constant on the points where it is nonzero, indicating that the distribution is uniform. This is called the uniform distribution U(0, 1) and corresponds to the random choice of a point in the interval.

Figure 1. The continuous uniform distribution as a limit of the discrete one.

The probability that X lies in a set A can be obtained from the density function as

Pr(X ∈ A) = ∫_A fX(t) dt.

In particular,

Pr(a < X ≤ b) = Pr(a < X < b) = ∫_a^b fX(t) dt.
The density function of a continuous random variable has the following properties:
Proposition 6.3. Let fX be the density function of a continuous random variable X. The following holds:

(i) fX(x) ≥ 0 for all x, and
(ii) ∫_{−∞}^{∞} fX(t) dt = 1.

The above properties characterize the class of functions which are density functions of some continuous random variable.

7. Probability models

The following are some of the important continuous distributions.

7.1. Uniform distribution. We choose a random point in an interval (a, b). All subintervals with the same length have the same probability. This leads to:

Definition 7.1. A random variable X has the uniform distribution on an interval (a, b) if its density function is

fX(x) = (1/(b − a)) 1_{(a,b)}(x),

that is, fX(x) = 1/(b − a) for x ∈ (a, b) and fX(x) = 0 otherwise. The distribution function of X is

FX(x) = 0 for x ≤ a,    FX(x) = (x − a)/(b − a) for a < x < b,    FX(x) = 1 for x ≥ b.

We write X ∼ U(a, b).

7.2. Exponential distribution. The exponential distribution can be seen as the limiting distribution of a geometric one. The model corresponds to the time at which a random event occurs, when the occurrence in a small interval is a Bernoulli event independent of the occurrences in other disjoint intervals.

Definition 7.2. A random variable X has the exponential distribution with parameter λ if its density function is

fX(x) = λ e^(−λx) 1_{(0,∞)}(x).

The distribution function of X is, for x > 0,

FX(x) = ∫_0^x λ e^(−λt) dt = 1 − e^(−λx).

We write X ∼ Exp(λ).
If Xn is a discrete random variable with the geometric distribution, Xn ∼ Geom(1/n), then

Pr(Xn ≤ kn) = 1 − (1 − 1/n)^(kn) → 1 − e^(−k).

This illustrates how the exponential distribution can be seen as the limit of a geometric distribution. In particular, the exponential distribution also has the memoryless property:

Proposition 7.3. Let X be a random variable with the exponential distribution. Then, for t > s > 0,

Pr(X > t | X > s) = Pr(X > t − s).
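The limit Pr(Xn ≤ kn) = 1 − (1 − 1/n)^(kn) → 1 − e^(−k) is easy to observe numerically (a sketch):

```python
from math import exp

k = 2
target = 1 - exp(-k)  # the exponential limit

# Geometric tail values for increasing n approach the target.
approximations = [1 - (1 - 1 / n) ** (k * n) for n in (10, 100, 10_000)]
errors = [abs(a - target) for a in approximations]
```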

It can be shown that a continuous random variable taking values in (0, ∞) with the
memoryless property has an exponential distribution.
An interesting connection with the Poisson distribution is worth mentioning. Suppose that the number of events in a time interval [0, t] is a random variable X with a Poisson law X ∼ Pois(λt); there are natural assumptions which lead to such a distribution. Then the waiting time for the first event to happen is a random variable T with distribution

Pr(T ≤ t) = Pr(X ≥ 1) = 1 − Pr(X = 0) = 1 − e^(−λt),

so that T follows an exponential distribution T ∼ Exp(λ).

7.3. Normal distribution. The Normal distribution is among the most important ones in Probability and Statistics. One of the reasons is that it can be seen as the limiting distribution of the Binomial distribution. As such, it models the sum of (infinitely many) independent Bernoulli variables. It occurs in random phenomena which are the sum of many independent inputs. It was observed by Gauss as the law of errors in measurements, and for that reason it is also known as the Gaussian distribution.

Definition 7.4. A random variable X has the normal distribution with parameters m, σ if its density function is

fX(x) = (1/(√(2π) σ)) e^(−(x−m)²/(2σ²)).

We write X ∼ N(m, σ²).

As it happens, the density function of a normal distribution does not have a primitive which can be expressed as a finite combination of elementary functions. For that reason, the values of the normal distribution were historically recorded in tables. These values are accessible through most standard mathematical software, particularly in R. For m = 0 and σ² = 1 the corresponding normal distribution N(0, 1) is called the standard one and has the more transparent density function

fX(x) = (1/√(2π)) e^(−x²/2).
This is a symmetric function with respect to the origin and has particularly small tails.

Proposition 7.5. Let X ∼ N(0, 1). Then, for x > 0,

Pr(X > x) = (1/√(2π)) ∫_x^∞ e^(−t²/2) dt < (1/(x√(2π))) e^(−x²/2).

In Laplace's celebrated treatise on probability one can already find what is known as the De Moivre–Laplace theorem, which states that the binomial probability tends to the Normal one. More precisely:

Theorem 7.6 (De Moivre–Laplace). For n large, X ∼ Bin(n, p) and k close to np, we have

Pr(X = k) = (n choose k) p^k q^(n−k) ∼ (1/√(2πnpq)) e^(−(k−np)²/(2npq)),

which is the value of the density of the normal distribution N(np, npq) at k.

The Galton board (bean machine) is a physical experiment which illustrates the above theorem.

The de Moivre-Laplace Theorem is the first form of the celebrated Central Limit Theorem,
one of the central results in Probability and Statistics. The general form of this basic result
will be discussed later on in this course.

8. Expectation and Moments

Expectation, and moments in general, can be defined for general random variables. They may be expressed in analogous forms for discrete and for continuous random variables, with the same meaning.

Definition 8.1 (Expectation and Moments). Let X be a continuous random variable with density fX. The expectation of X is

E(X) = ∫_{−∞}^{∞} x fX(x) dx,

whenever the integral is absolutely convergent. The k–th moment of the random variable X is

E(X^k) = ∫_{−∞}^{∞} x^k fX(x) dx,

and the central k–th moment is

E((X − m)^k) = ∫_{−∞}^{∞} (x − m)^k fX(x) dx,

where m = E(X). The second central moment is the variance of X,

σ_X² = Var(X) = E((X − m)²) = E(X²) − (E(X))²,

whenever the corresponding integrals are absolutely convergent.

The following table summarizes the expectation and variance of the most common continuous distributions.

Distribution        Mean Value      Variance
X ∼ U(a, b)         (a + b)/2       (b − a)²/12
X ∼ Exp(λ)          1/λ             1/λ²
X ∼ N(m, σ²)        m               σ²

9. Functions of random variables

Functions of continuous random variables through continuous (and more general) functions are again random variables. The following is a useful result for computing density functions and expectations of such functions of random variables.

Theorem 9.1 (Change of Variable). Let X be a continuous random variable with density fX. Let g : R → R be a differentiable function with g′(x) > 0 for each x. Then

fY(y) = (1/g′(x)) fX(x),

where Y = g(X) and y = g(x). Moreover, if g′(x) < 0 for each x, then

fY(y) = (1/|g′(x)|) fX(x).
Example 9.2 (Linear Transformations). One simple example of a transformation is a linear one, g(x) = ax + b, and, for Y = aX + b, one gets

fY(y) = (1/|a|) fX((y − b)/a).

One important case is converting a random variable to a standard one. If m = E(X) and Var(X) = σ², the linear transformation Y = (X − m)/σ turns X into a variable with expectation E(Y) = 0 and variance Var(Y) = 1. Its density function is

fY(y) = σ fX(σy + m).

Example 9.3 (Lognormal distribution). Let X ∼ N(0, 1) be a standard normal random variable. Consider the random variable Y = e^X. In this case Y = g(X) with g(x) = e^x, a differentiable function with positive derivative everywhere. If y = g(x) then x = ln(y), and therefore g′(x) = e^x = e^(ln y) = y. Theorem 9.1 gives, for y > 0,

fY(y) = (1/y) fX(ln y) = (1/(y√(2π))) e^(−(ln y)²/2).

The random variable Y is said to have the lognormal distribution (because log Y ∼ N(0, 1)). It is an important distribution with interesting properties which arises naturally in several random phenomena.

The above theorem can be extended to differentiable functions g which are not necessarily strictly monotone. The case Y = X² is an important example which illustrates the general situation.

Proposition 9.4. Let X be a continuous random variable with density fX. The density of Y = X² is, for y > 0,

fY(y) = (1/(2√y)) (fX(√y) + fX(−√y)).

Sometimes it is simpler to obtain the expectation of a function Y = g(X) of a random variable X in terms of the density fX directly, instead of first computing the density of Y. This can be achieved as follows.

Theorem 9.5 (Expectation of a function). Let X be a continuous random variable with density fX. Let g : R → R be a continuous function. Then

E(g(X)) = ∫_{−∞}^{∞} g(x) fX(x) dx,

if the integral is absolutely convergent.

The computation of the k–th moment of a random variable is an example of application of the above theorem, applied to the function g(x) = x^k.

10. Markov and Chebyshev inequalities

As in the discrete case, the Markov and Chebyshev inequalities are valid for continuous
random variables.
Theorem 10.1 (Markov and Chebyshev inequalities). Let X be a continuous random variable taking only nonnegative values, with finite expectation. Then, for each a ∈ R⁺,

Pr(X ≥ a) ≤ E(X)/a.    (Markov inequality)

Let X be a continuous random variable with first and second moments. Then, for each a ∈ R⁺,

Pr(|X − E(X)| ≥ a) ≤ Var(X)/a².    (Chebyshev inequality)

11. Simulation of random variables

Most programming languages and software packages have a primitive to obtain a random number with the uniform distribution in the interval [0, 1]. It is invoked as random() in Python, or rand() in C++. Actually, these numbers are produced by computational means called Pseudo Random Number Generators. They produce sequences of numbers which are not random but do have statistics close to what a truly random sequence would have. Among the most common devices are the Linear Congruential Generators, which are based on a congruence recurrence of the form

x_{n+1} = (a x_n + b) (mod m),

for suitably chosen a, b and m. The sequence starts at an initial point, called a seed, which is often taken from the computer clock.
The Mersenne Twister is based on an analogous linear recurrence over a finite field using a large Mersenne prime (2^19937 − 1 is used), and it is the random generator used by Python and R, among many other programming languages and mathematical software systems. Once the uniform distribution is at our disposal, one can easily produce random samples from other distributions. The most common method is based on the following proposition.

Proposition 11.1. Let X be a continuous random variable with distribution function FX. Then

U = FX(X) ∼ U([0, 1]).

Thus, if FX is invertible (strictly monotone), then one can obtain the distribution of X from a uniform distribution simply by writing X = FX⁻¹(U).

Example 11.2. In order to sample the exponential distribution X ∼ Exp(λ), one can obtain a sample U of the uniform distribution in [0, 1] and then apply the function

X = −(1/λ) ln(1 − U).

There are other specific methods, particularly to sample the Normal distribution, whose distribution function is not expressible in simple analytic terms. Discrete distributions can also be sampled with analogous methods. If X is a discrete random variable which takes integer values with distribution function FX, then we sample

X = k if FX(k − 1) < U ≤ FX(k).
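A sketch of this discrete inverse-transform method for a fair die (the function name is ours):

```python
import random

random.seed(1)  # fixed seed for reproducibility

def sample_discrete(cdf: list[float]) -> int:
    """Return k such that F(k-1) < U <= F(k), where cdf[k-1] = F(k)."""
    u = random.random()
    for k, f in enumerate(cdf, start=1):
        if u <= f:
            return k
    return len(cdf)  # guard against rounding at the top end

cdf = [i / 6 for i in range(1, 7)]  # fair die: F(k) = k/6
draws = [sample_discrete(cdf) for _ in range(60_000)]
freq_one = draws.count(1) / len(draws)
```

Each face comes out with empirical frequency close to 1/6, as it should.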

Sampling with R according to some distribution can be done directly with the sample function. Simulation of random phenomena is a very common tool and it can become an art with many subtleties and technical difficulties.
