100% found this document useful (1 vote)

100 views

Elements of Probability and Statis

Uploaded by

Aleks Guevara Palacios

Available Formats

Download as PDF, TXT or read online on Scribd

100% found this document useful (1 vote)

100 views

Elements of Probability and Statis

Uploaded by

Aleks Guevara Palacios

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 246

Francesca Biagini

Massimo Campanino

Elements

NITEXT
of Probability
and Statistics
An Introduction to Probability
with de Finetti’s Approach
and to Bayesian Statistics
UNITEXT - La Matematica per il 3+2

Volume 98

Editor-in-chief
A. Quarteroni

Series editors
L. Ambrosio
P. Biscari
C. Ciliberto
M. Ledoux
W.J. Runggaldier
More information about this series at http://www.springer.com/series/5418
Francesca Biagini Massimo Campanino
•

Elements of Probability
and Statistics
An Introduction to Probability
with de Finetti’s Approach
and to Bayesian Statistics

123
Francesca Biagini Massimo Campanino
Department of Mathematics Department of Mathematics
Ludwig-Maximilians-Universität Università di Bologna
Munich Bologna
Germany Italy

ISSN 2038-5722 ISSN 2038-5757 (electronic)

UNITEXT - La Matematica per il 3+2
ISBN 978-3-319-07253-1 ISBN 978-3-319-07254-8 (eBook)
DOI 10.1007/978-3-319-07254-8

Library of Congress Control Number: 2015958841

Translation from the Italian language edition: Elementi di Probabilità e Statistica di Francesca Biagini e
Massimo Campanino, © Springer-Verlag Italia, Milano 2006. All rights reserved.
© Springer International Publishing Switzerland 2016
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained herein or
for any errors or omissions that may have been made.

Cover design: Simona Colombo, Giochi di Graﬁca, Milano, Italy

Printed on acid-free paper

This Springer imprint is published by SpringerNature

The registered company is Springer International Publishing AG Switzerland
Nous ne possédons une ligne, un surface,
un volume que si notre amour l’occupe.
M. Proust
To Thilo and Oskar
Francesca Biagini

To my brother Vittorio
Massimo Campanino
Preface

This book is based on the lectures notes for the course, Probability and
Mathematical Statistics, taught for many years by one of the authors (M.C.) and
then, divided into two sections, by both authors at the University of Bologna (Italy).
We follow the approach of de Finetti, see de Finetti [1] for a complete detailed
exposition. Although de Finetti [1] was conceived as a textbook of probability for
mathematics students, it was also meant to illustrate the point of view of the author
on the foundations of probability and mathematical statistics and discuss it in
relation to prevalent approaches, resulting often of difficult access for beginners.
This was the main reason that prompted us to arrange the lectures notes of our
courses into a more organic way and to write a textbook for an initial class on
probability and mathematical statistics.
The first five chapters are devoted to elementary probability. After that in the
next three chapters we develop some elements of Markov chains in discrete and
continuous time also in connection with queueing processes, and introduce basic
concepts in mathematical statistics in the Bayesian approach. Then we propose six
chapters of exercises, which cover most of the topics treated in the theoretical
part. In the appendices we have inserted summary schemes and complementary
topics (two proofs of Stirling formula). We also informally recall some elements of
calculus, as this has often proved useful for the students.
This book offers a comprehensive but concise introduction to probability and
mathematical statistics without requiring notions of measure theory; hence it can be
used in basic classes on probability for mathematics students and is particularly
suitable for computer science, physics and engineering students.

ix
x Preface

We are grateful to Springer for allowing us to publish the English version of the
book. We wish to thank Elisa Canova, Alessandra Cretarola, Nicola Mezzetti and
Quirin Vogel for their fundamental help with latex, for both the Italian and the
English version.

Munich Francesca Biagini

Bologna Massimo Campanino
June 2015
Contents

Part I Probability
1 Random Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Probability of Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.5 Uniform Distribution on Partitions . . . . . . . . . . . . . . . . . . . . . 12
1.6 Conditional Probability and Expectation . . . . . . . . . . . . . . . . . 14
1.7 Formula of Composite Expectation and Probability. . . . . . . . . . 15
1.8 Formula of Total Expectation and Total Probability . . . . . . . . . 16
1.9 Bayes Formula. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.10 Correlation Between Events. . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.11 Stochastic Independence and Constituents . . . . . . . . . . . . . . . . 20
1.12 Covariance and Variance. . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.13 Correlation Coefﬁcient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.14 Chebychev’s Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.15 Weak Law of Large Numbers . . . . . . . . . . . . . . . . . . . . . . . . 25
2 Discrete Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.1 Random Numbers with Discrete Distribution . . . . . . . . . . . . . . 27
2.2 Bernoulli Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3 Binomial Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4 Geometric Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.5 Poisson Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.6 Hypergeometric Distribution . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.7 Independence of Partitions. . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.8 Generalized Bernoulli Scheme . . . . . . . . . . . . . . . . . . . . . . . . 33
2.9 Multinomial Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.10 Stochastic Independence for Random Numbers
with Discrete distribution . . . . . . . . . . . . . . . ............ 35
2.11 Joint Distribution . . . . . . . . . . . . . . . . . . . . . ............ 35

xi
xii Contents

2.12 Variance of Discrete Distributions . . . . . . . . . . . . . . . . . . . . . 36

2.13 Non-correlation and Stochastic Independence. . . . . . . . . . . . . . 38
2.14 Generating Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3 One-Dimensional Absolutely Continuous Distributions . . . . . . . . . . 43
3.1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2 Cumulative Distribution Function . . . . . . . . . . . . . . . . . . . . . . 43
3.3 Absolutely Continuous Distributions . . . . . . . . . . . . . . . . . . . . 44
3.4 Uniform Distribution in ½0; 1. . . . . . . . . . . . . . . . . . . . . . . . . 46
3.5 Uniform Distribution on an Arbitrary Interval ½a; b . . . . . . . . . 47
3.6 Exponential Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.7 A Characterization of Exponential Distribution. . . . . . . . . . . . . 49
3.8 Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.9 Normal Tail Estimate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.10 Gamma Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.11 χ2 -Distribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.12 Cauchy Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.13 Mixed Cumulative Distribution Functions . . . . . . . . . . . . . . . . 55
4 Multi-dimensional Absolutely Continuous Distributions . . . . . . . . . 57
4.1 Bidimensional Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.2 Marginal Cumulative Distribution Functions . . . . . . . . . . . . . . 58
4.3 Absolutely Continuous Joint Distributions . . . . . . . . . . . . . . . . 58
4.4 The Density of Z ¼ X þ Y . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.5 Beta Distribution Bðα; βÞ . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.6 Student Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.7 Multi-dimensional Distributions . . . . . . . . . . . . . . . . . . . . . . . 64
4.8 Absolutely Continuous Multi-dimensional Distributions . . . . . . 65
4.9 Multi-dimensional Gaussian Distribution . . . . . . . . . . . . . . . . . 66
5 Convergence of Distributions . . . . . . . . . . . . . . . . . . . . . . ...... 73
5.1 Convergence of Cumulative Distribution Functions. . . . ...... 73
5.2 Convergence of Geometric Distribution to Exponential
Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ...... 75
5.3 Convergence of Binomial Distribution
to Poisson Distribution . . . . . . . . . . . . . . . . . . . . . . . ...... 76
5.4 De Moivre-Laplace Theorem . . . . . . . . . . . . . . . . . . . ...... 77
6 Discrete Time Markov Chains . . . . . . . . . . . . . . ............. 81
6.1 Homogeneous Discrete Time Markov Chains
with Finite State Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.2 Transition Probability in n Steps . . . . . . . . . . . . . . . . . . . . . . 83
6.3 Equivalence Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.4 Ergodic Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Contents xiii

7 Continuous Time Markov Chains . . . . . . . . . . . . . . ........... 89

7.1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . ........... 89
7.2 Homogeneous Continuous Time Markov Chains
with Countable State Space . . . . . . . . . . . . . . . . . . . . . . . . . . 90
7.3 Poisson Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
7.4 Queueing Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
7.5 M=M=1 Queueing Systems . . . . . . . . . . . . . . . . . . . . . . . . . 95
7.6 M=M=1 Queueing Systems . . . . . . . . . . . . . . . . . . . . . . . . . . 97
7.7 M=M=n Queueing Systems . . . . . . . . . . . . . . . . . . . . . . . . . . 98
7.8 Queueing Systems in Stationary Regime
and Little’s Formulas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
8 Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
8.1 Bayesian Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
8.2 Conditional Density for Two Random Numbers . . . . . . . . . . . . 104
8.3 Statistical Induction on Bernoulli Distribution . . . . . . . . . . . . . 105
8.4 Statistical Induction on Expectation of Normal Distribution . . . . 107
8.5 Statistical Induction on Variance of Normal Distribution . . . . . . 108
8.6 Improper Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
8.7 Statistical Induction on Expectation and Variance
of Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
8.8 Bayesian Conﬁdence Intervals and Hypotheses' Testing . . . . . . 111
8.9 Comparison of Expectations for Normal Distribution . . . . . . . . 111

Part II Exercises
9 Combinatorics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Exercise 9.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Exercise 9.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
Exercise 9.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
Exercise 9.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Exercise 9.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
Exercise 9.6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
10 Discrete Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Exercise 10.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Exercise 10.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
Exercise 10.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
Exercise 10.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
Exercise 10.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
Exercise 10.6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
Exercise 10.7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
xiv Contents

11 One-Dimensional Absolutely Continuous Distributions . . . . . . . . . . 143

Exercise 11.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
Exercise 11.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
Exercise 11.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
Exercise 11.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
12 Absolutely Continuous and Multivariate Distributions . . . . . . . . . . 151
Exercise 12.1 . . . . . . . ............. ........... . . . . . . . . . 151
Exercise 12.2 . . . . . . . ............. ........... . . . . . . . . . 153
Exercise 12.3 . . . . . . . ............. ........... . . . . . . . . . 156
Exercise 12.4 . . . . . . . ............. ........... . . . . . . . . . 162
Exercise 12.5 . . . . . . . ............. ........... . . . . . . . . . 167
Exercise 12.6 . . . . . . . ............. ........... . . . . . . . . . 170
Exercise 12.7 . . . . . . . ............. ........... . . . . . . . . . 174
Exercise 12.8 . . . . . . . ............. ........... . . . . . . . . . 176
13 Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
Exercise 13.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
Exercise 13.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
Exercise 13.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
Exercise 13.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
14 Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
Exercise 14.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
Exercise 14.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
Exercise 14.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
Exercise 14.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
Exercise 14.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
Exercise 14.6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
Exercise 14.7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
Exercise 14.8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211

Appendix A: Elements of Combinatorics . . . . . . . . . . . . . . . . . . . . . . . 213

Appendix B: Relations Between Discrete and Absolutely Continuous

Distributions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217

Appendix C: Some Discrete Distributions . . . . . . . . . . . . . . . . . . . . . . 219

Appendix D: Some One-Dimensional Absolutely

Continuous Distributions . . . . . . . . . . . . . . . . . . . . . . . . 221

Appendix E: The Normal Distribution. . . . . . . . . . . . . . . . . . . . . . . . . 223

Appendix F: Stirling’s Formula. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225

Contents xv

Appendix G: Elements of Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 231

Appendix H: Bidimensional Integrals . . . . . . . . . . . . . . . . . . . . . . . . . 235

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
Part I
Probability
Chapter 1
Random Numbers

1.1 Introduction

Probability Theory deals with the quantification of our degree of uncertainty. Its main
object of interest are random entities and, in particular, random numbers. What is
meant by random number?
A random number is a well defined number, whose value is not necessarily known.
For example we can use random numbers to describe the result of a determined exper-
iment, or the value of an option at a prefixed time, or the value of a meteorological
magnitude at a given time. All these quantities have a well defined value, but may
not be known either because they refer to the future and there are no means to predict
their values with certainty or, even if they refer to the past, there is no available
information at the moment.
We shall denote random numbers with capital letters. Even if the value of a random
number is in general not known, we can speak about the set of its possible values,
that will be denoted by I (X ). Certain numbers can be considered as particular cases
of random numbers, whose set of possible values consists of a single element.
Example 1.1.1 Let the random numbers X, Y represent respectively the results of
throwing a coin and a die. If we denote head and tail by 0 and 1 and the sides of the
die with the numbers from 1 to 6, we have:

I (X ) = {0, 1} ,
I (Y ) = {1, 2, 3, 4, 5, 6} .

The random number X is:

• upper bounded if I(X) is upper bounded (sup I (X ) < +∞);
• lower bounded if I(X) is lower bounded (inf I (X ) > −∞);
• bounded if I(X) is both upper and lower bounded (sup I (X ) < +∞, inf I (X ) >
−∞).

© Springer International Publishing Switzerland 2016 3

F. Biagini and M. Campanino, Elements of Probability and Statistics,
UNITEXT - La Matematica per il 3+2 98, DOI 10.1007/978-3-319-07254-8_1
4 1 Random Numbers

Given two random numbers X and Y, we denote by I (X, Y ) the set of pairs of
values that (X, Y ) can attain. In general given n random numbers X 1 , . . . , X n , we
denote by I (X 1 , . . . , X n ) the set of possible values that (X 1 , . . . , X n ) can attain.
The random numbers X and Y are said to be logically independent if

I (X, Y ) = I (X ) × I (Y ) ,

where I (X ) × I (Y ) denotes the Cartesian product of I (X ) and I (Y ).

Similarly the random numbers (X 1 . . . , X n ) are said to be logically independent
if I (X 1 , . . . , X n ) = I (X 1 ) × · · · × I (X n ).

Example 1.1.2 In a lottery two balls are consecutively drawn without substitution
from an urn that contains 90 balls numerated from 1 to 90. Let X and Y represent
the random numbers corresponding respectively to the first and the second drawing.
The set of possible pairs is then

I (X, Y ) = {(i, j)|1 ≤ i ≤ 90, 1 ≤ j ≤ 90, i = j}.

Clearly I (X, Y ) = I (X ) × I (Y ) as I (X, Y ) does not contain pairs of the type (i, i),
with i ∈ {1, . . . , 90}. The random numbers X and Y therefore are not logically
independent.

By using random numbers we can perform usual arithmetic operations, obtaining

again random numbers. We introduce the following operations that we will apply to
random numbers. For real x and y
1. x ∨ y := max(x, y);
2. x ∧ y := min(x, y);
3. x̃ := 1 − x.
As it is easy to verify, these operations satisfy the following properties:
1. distributive property

x ∨ (y ∧ z) = (x ∨ y) ∧ (x ∨ z), (1.1)

x ∧ (y ∨ z) = (x ∧ y) ∨ (x ∧ z); (1.2)

2. associative property
x ∨ (y ∨ z) = (x ∨ y) ∨ z, (1.3)

x ∧ (y ∧ z) = (x ∧ y) ∧ z; (1.4)

3. commutative property
x ∨ y = y ∨ x, (1.5)

x ∧ y = y ∧ x; (1.6)
1.1 Introduction 5

4. furthermore

x̃˜ = x, (1.7)
˜
(x ∨ y) = x̃ ∧ ỹ, (1.8)
(x ∧ y)˜ = x̃ ∨ ỹ . (1.9)

These properties are easily extended to operations to n real numbers x1 , . . . , xn .

1.2 Events

Events are a particular case of random numbers. An event E is a random number

such that I (E) ⊆ {0, 1}. In the case of two events E and F, E ∨ F is called logical
sum and E ∧ F logical product. It is easy to verify that:
1. E ∨ F = E + F − E F;
2. E ∧ F = E F.
Given an event E, one defines the complementary event E by

Ẽ = 1 − E.

From (1.7) we have Ẽ˜ = E. From (1.8) we have

(E ∨ F)˜ = Ẽ ∧ F̃ = (1 − E)(1 − F) = 1 − E − F + E F,

so that
E ∨ F = E + F − E F.

Analogously

(E ∨ F ∨ G)˜ = Ẽ ∧ F̃ ∧ G̃ = (1 − E)(1 − F)(1 − G)

= 1 − E − F − G + E F + E G + F G − E F G,

so that
E ∨ F ∨ G = E + F + G − E F − E G − F G + E F G.

Other two operations on events are:

1. Difference of E and F: E \ F = E − E F.
2. Symmetric difference of E and F: E F = (E \ F) ∨ (F \ E) = E + F(mod 2).
From now on we shall use the symbol
to indicate that what follows is certainly
true. For example,
X ≤ Y indicates that I (X, Y ) ⊂ {(x, y)| x ≤ y}.
6 1 Random Numbers

We use the notation

E ⊂ F for
E ≤ F,

and

E = F for E ≡ F

that is equivalent to E ⊂ F and F ⊂ E. When an event E is equal to 1 we say that

E happens, when E is equal to 0 we say that it does not happen. The logical sum
E ∨ F happens if and only if at least one of the events E and F takes place, whereas
the logical product E ∧ F = E F happens if and only if both E and F take place. The
complementary event Ẽ happens if and only if E does not happen. Note that E ⊂ F
means that E implies F, i.e. when E takes place also F does.

Definition 1.2.1 We define the following relations for events:

1. incompatibility: E, F are said to be incompatible if
E F = 0;
2. exhaustivity: E 1 ,…, E n are said to be exhaustive if
E 1 + · · · + E n ≥ 1;
3. partition: E 1 ,…, E n are said to be a partition if
E 1 + · · · + E n = 1 (i.e. they
are exhaustive and two by two incompatible).

Example 1.2.2 An event E and its complementary Ẽ are a partition.

Given n events E 1 , . . . , E n , we can always build up a partition combining them

and their complementary sets. This partition is called partition of constituents. We
introduce the following notation. Given an event E, we put

Ei
E i∗ =
Ẽ i .

A constituent of E 1 , . . . , E n is a product

Q = E 1∗ · · · E n∗ .

It easy to check that the set of all constituents are a partition.

In general, not all constituents are possible. If I (E i ) = {0, 1} for i = 1, . . . , n,
all constituents are possible if and only if E 1 , . . . , E n are logically independent. The
possible constituents are a partition. Indeed

1 = (E 1 + Ẽ 1 ) . . . (E n + Ẽ n ) = Q.
Q constituent

Impossible constituents can be obviously skipped in the sum.

1.2 Events 7

If E 1 , . . . , E n are already a partition, then the possible constituents are:

E 1 Ẽ 2 . . . Ẽ n ,
Ẽ 1 E 2 Ẽ 3 . . . Ẽ n ,
··· ,
Ẽ 1 . . . Ẽ n−1 E n ,

in this case the constituents can be identified with the events themselves.
Let us now introduce the concept of logical dependence and independence of an
event E from n given events E 1 , . . . , E n . The constituents Q of E 1 , . . . , E n can be
classified in the following way with respect to a given event E:
(i) constituent of I type if Q ⊂ E;
(ii) constituent of II type if Q ⊂ Ẽ;
(iii) constituent of III type otherwise.
We say that the event E is:
• logically dependent from E 1 ,…,E n if all constituents of E 1 ,…,E n are of I or II
type;
• logically independent from E 1 ,…,E n if all constituents of E 1 ,…,E n are of the III
type;
• logically semidependent from E 1 ,…,E n otherwise.
If E is logically dependent from E 1 ,…,E n , then we can write

E= Q.
Q of I type
Q⊂E

Example 1.2.3 Let us consider two events E 1 , E 2 . The logical sum (E 1 ∨ E 2 ) can
be written as

E 1 ∨ E 2 = E 1 E 2 + E˜1 E 2 + E 1 E˜2 .

In general an event E is logically dependent from E 1 ,…,E n if and only if E can be

written as E = Φ(E 1 , . . . , E n ) for some function Φ.

Example 1.2.4 Let us throw five times a coin. Let E i be the event that we get head
at the ith trial, i.e. E i = 1. Set Y = E 1 + E 2 + E 3 + E 4 + E 5 (Y is the total number
of heads in the five throws) and consider the event

E = (Y ≥ 3).
8 1 Random Numbers

Then E is logically semidependent from E 1 E 2 E 3 . Indeed there are constituents of

the
I type: E 1 E 2 E 3 ⊂ E;
II type: Ẽ 1 Ẽ 2 Ẽ 3 ⊂ Ẽ;
III type: Ẽ 1 E 2 Ẽ 3 .

1.3 Expectation

Given a random number X , we look for a non-random number that expresses our
evaluation of X . We call this quantity expectation of X . In economic terms, if we
think of the expectation of X as a non-random gain that we judge equivalent to X .
Following de Finetti [1] the expectation P(X ) assigned to the random number X
can be defined in an operative way as follows.
Two equivalent operative definitions can be used to define the expectation:
1. Bet method: we think of X as a random gain (or loss, if it is negative). We have
to choose a value P(X ) (non-random) that we judge equivalent to X .
After this choice is made, we must accept any bet with gain (or loss) given by

λ(X − x̄),

where λ ∈ R is a constant. The corresponding coherence principle is that no

choice is allowed for which there is a bet giving a certain loss. The chosen value
x̄ is our evaluation for the expectation of X .
2. Penalty method: in this case we choose a value X − x̄¯ and we accept to pay a
penalty given by
−λ(X − x̄)¯ 2,

where λ ∈ R+ is a proportionality coefficient. In this case the coherence principle

is that x̄¯ is not allowed if there exits a different value X − x̄¯ such that λ(X − x̄¯ )2
is certainly less than λ(X − x̄) ¯ 2 . The value x̄¯ that we can choose is our evaluation
of the expectation P(X ).
It can be shown that these two operative definitions are equivalent (see [1]).

Proposition 1.3.1 (Properties of the expectation) Given a random number X , the

expectation P(X ) has the following properties:
1. monotonicity: inf I (X ) ≤ P(X ) ≤ sup I (X );
2. linearity: if X = α1 X 1 + · · · + αn X n , then P(X ) = α1 P(X 1 ) + · · · + αn P(X n ).

Proof 1. Monotonicity: Assume that x̄ < inf I (X ), then for λ < 0:

λ(X − x̄) < 0.
1.3 Expectation 9

If x̄ > sup I (X ), then for λ > 0 we again get:

λ(X − x̄) < 0 ,

i.e. a certain loss. It follows that these choices are not coherent according to the
first criterium. If
inf I (X ) ≤ x̄ ≤ sup I (X ),
2 2
then
X − x̄¯ < (X − inf I (x)) or
X − x̄¯ < (X − sup I (x)) respec-
tively. In this case these choices are not coherent according to the second cri-
terium.

2. Linearity: Let Z = X + Y . Assume that we choose z̄ = P(Z ), x̄ = P(X ), ȳ =

P(Y ), then according to the bet method we are ready to accept any combination
of bets on X , Y and Z that gives a total gain

G = c1 (X − x̄) + c2 (Y − ȳ) + c3 (Z − z̄)

= (c1 + c3 )X + (c2 + c3 )Y − c1 x̄ − c2 ȳ − c3 z̄

where c1 , c2 , c3 are arbitrary constants. If we choose

c1 = c2 = −c3 ,

(so that the random part of G cancels), then we have that the total gain is: G =
c3 (x̄ + ȳ − z̄). Then if x̄ + ȳ − z̄ = 0, one can choose c3 so that
G < 0. In this
case this choice is not coherent according to the first criterium. On the other side
if we follow the penalty method we will pay a penalty proportional to

−[(X − x̄)2 + (Y − ȳ)2 + (Z − z̄)2 ] = −[(X − x̄)2 + (Y − ȳ)2 + (X + Y − z̄)2 ].

The orthogonal projection P of P = (x̄, ȳ, z̄) on the plane z = x + y has a

distance less or equal to the distance of P from every possible (X, Y, Z ), that lies
on the plane, with a strict inequality if P does not lie on the plane. Therefore by
the second criterium we obtain z̄ = x̄ + ȳ. The proof that Z = αX , α ∈ R, by
the first or the second criterium is completely analogous.
In general, if X = α1 X 1 + · · · + αn X n , it follows that

P(X ) = α1 P(X 1 ) + · · · + αn P(X n ).

The monotonicity of expectation implies that:

X ≥ c =⇒ P(X ) ≥ c;
If c1 ≤ c2 ,
c1 ≤ X ≤ c2 =⇒ c1 ≤ P(X ) ≤ c2 ;

X = c =⇒ P(X ) = c.
10 1 Random Numbers

Remark 1.3.2 For unbounded random numbers X (for which inf I (X ) = −∞, or
sup I (X ) = ∞, or both) an evaluation of P(X ) is not necessarily finite or even may
not exist. We refer to [1] for a discussion on the definition of the expectation for
unbounded random numbers.

1.4 Probability of Events

If E is an event, i.e. a random number such that I (E) ⊂ {1, 0}, then its expectation
P(E) is also called probability of E. From monotonicity it follows that:
1. the probability of an event E is a number between 0 and 1, 0 ≤ P(E) ≤ 1;
2. E ≡ 0 =⇒ P(E) = 0;
3. E ≡ 1 =⇒ P(E) = 1.
When E ≡ 1, E is called certain event. If E ≡ 0, E is called impossible event.
Furthermore for any given events E 1 , E 2 we have that

P(E 1 ∨ E 2 ) = P(E 1 + E 2 − E 1 E 2 ) ≤ P(E 1 + E 2 )

and that
P(E 1 + E 2 ) = P(E 1 ) + P(E 2 ).

In general for a partition E 1 , . . . , E n , i.e. if

E 1 + · · · + E n = 1, we have

n
P(E i ) = 1.
i=1

The function that assigns to the events of a partition their probabilities is called
probability distribution of the partition. If E is logically dependent from the events
{E 1 , . . . , E n } of a partition, then we can express the probability of E in terms of the
probabilities of E 1 , . . . , E n . Indeed we have

E= Ei
E i ⊂E

so that

P(E) = P(E i ).
E i ⊂E

Let us now compute the expectation of a random number X with a finite number of
possible values I (X ) = {x1 , . . . , xn } in terms of the probabilities of events E i :=
(X = xi ). We use the convention that some proposition within brackets represents a
quantity which is 1 when the proposition is true and 0 when it is false. We have:
1.4 Probability of Events 11

n
P(X ) = xi P(X = xi ). (1.10)
i=1

Indeed

P(X ) = P(X (E 1 + · · · + E n ))
= P(X E 1 ) + · · · + P(X E n )

n
n
= P(X E i ) = P(xi E i )
i=1 i=1
n n
= xi P(E i ) = xi P(X = xi ),
i=1 i=1

where we have used the fact that X E i is a random number that is equal to xi when
E i = 1 and to 0 when E i = 0, i.e. X E i = xi E i .

In general, if φ is any function φ : R → R, we have

n
P(φ(X )) = φ(xi )P(X = xi ). (1.11)
i=1

The proof is completely analogous to the one of (1.10), which deals with the particular
case φ(x) = x.
Example 1.4.1 Let X be a random number representing the result of throwing a
symmetric die with faces numbered from 1 to 6. By symmetry it is natural to assign
the same probability (that must be 16 ) to all possible values. In this case:

1
6
6·7 7
P(X ) = i= = .
6 i=1 6·2 2

Note that in this case the expectation does not coincide with one of the possible
values of X .

Example 1.4.2 Let us throw a symmetric coin. Let X = 1 if the result is head and
X = 0 if we obtain tail. Also in this case by symmetry it is natural to assign the same
probability (that must be equal to 21 ) to both values. In this case

1 1 1
P(X ) = ·0+ ·1= .
2 2 2
12 1 Random Numbers

1.5 Uniform Distribution on Partitions

In some situations, for reasons of symmetry, it is natural to assign the same prob-
ability to all events of a partition. This is the case of hazard games. If the events
E 1 , . . . , E n are assigned the same probability, we say that the partition has the
uniform distribution. Since the probabilities of a partition add up to 1, we have then

1
P(E i ) = .
n
Let E an event which depends logically from the partition E 1 , . . . , E n , then the
probability of E is given by:
⎛ ⎞
{i|E i ⊂ E}
P(E) = P ⎝ Ei ⎠ = .
E i ⊂E
n

In the case of uniform distribution on the partition, we have

{i| E i ⊂ E}
P(E) = .
n
This formula is commonly expressed by saying that the probability is given by the
number of favorable cases (i.e. the elements E i contained in E), divided by the
number of possible cases (i.e. the total number of E i ), as shown below:

f avorable cases
P(E) = . (1.12)
possible cases

This identity is valid only if the events of the partition are judged equiprobable.

Example 1.5.1 A symmetric coin is thrown n times. Let X be the random number
that counts the number of heads in the n throws and let E i be the event that the ith
throw gives head. We consider the event

E := (X = k) = Q,
Q⊂E

where Q ranges over all constituents E 1∗ . . . E n∗ of E 1 , . . . , E n . The symmetry of

the coin leads to assign the same probability to all constituents. The probability of
E is then obtained by formula (1.12). The possible cases are 2n since a constituent
is determined by n two-valued choices.
1.5 Uniform Distribution on Partitions 13

n
The favorable cases are , since they are determined by choosing k elements
k
out of n where the result head is obtained. Therefore

n 1
P(E) = .
k 2n

It follows from the properties of binomial coefficients that when n is even, the largest
value for P(E) is obtained for k = n2 . If n is odd, the largest value for P(E) is obtained
for k = n−1
2
and k = n+1
2
.

Example 1.5.2 We perform n drawings with replacement from an urn containing N

identical balls. In the urn there are H white balls and (N − H ) black balls. Let X
be the random number of white balls which is obtained after n drawings. The set
I (x) of possible values of X is clearly {0, . . . , n}. In order to compute P(X = k) for
0 k n we can use formula (1.12), provided that we assign by symmetry reasons
the same probability to the N n sequences of length n that have exactly k white balls;
their number is

n
H k (N − H )n−k ,
k

n
since the position of the k white balls can be chosen in ways and after that we
k
must choose a sequence of length k from the set of H white balls and one of lenght
n − k from the set of N − H black balls. We have therefore

k
n H (N − H )n−k
P(X = k) = .
k Nn

Let us now consider the same problem, in the case when the drawings are made
without replacement. In this case n must be less than or equal to N , as we cannot
perform more than N drawings without replacement. Also X has some extra con-
straints, as the number X of the extracted white balls must be less than or equal to H
and the number n − X of extracted black balls must be less than or equal to N − H .
Therefore

I (X ) = {0 ∨ (n − (N − H )), . . . , n ∧ H }.

In this case the possible cases are represented by all possible sets of extracted balls.
An event corresponds to a set of extracted balls. The number of possible cases is then

N
.
n
14 1 Random Numbers

Also here by symmetry it is natural to assign the same probability to all events. If
we do so, we can apply formula (1.12) and get

H N−H
k n−k
P(X = k) =
,
N
n

for k ∈ I (X ), as the favorable cases are determined by a choice of k elements from

the H white balls and n − k from the N − H black balls.
We could instead consider as possible cases the set of sequences of length n with
distinct elements, i.e. we can take into account the order of the drawings. Of course
in this case we have to take into account the order also when we count the favorable
cases. The final result is the same.

1.6 Conditional Probability and Expectation

Conditional expectation and probability are very important concepts of probability.

We now introduce the definition of expectation and probability under the condition
that an event takes place. Let X be a random number and H an event. Conditional
expectation can be defined in an operative way as ordinary expectation using bets or
penalties.
1. Bet method: we have to choose a quantity with the agreement that we must be
ready to accept any bet with gain

G = cH (X − x̄) ,

where c is a constant (positive or negative). The chosen value is then our evaluation
of the conditional expectation of X given by H and denoted by P(X |H ).
2. Penalty method: Here we have to choose a value x̄¯ with the condition that we
accept to pay a penalty.
P = λH (X − x̄) ¯ 2,

where λ is a positive constant. Note that the penalty is null when the event H
does not take place, similarly as in the definition based on bets. According to this
definition x̄¯ is our evaluation of the conditional expectation P(X |H ) of X .
It can be shown, as in the case of ordinary expectation, that the two definitions are
equivalent.
In the particular case when we consider an event E we speak about the conditional
probability P(E|H ) of E given H .
1.6 Conditional Probability and Expectation 15

Let I (X |H ) ⊆ I (X ) denote the set of possible values of X when H takes place.

1.7 Formula of Composite Expectation and Probability

Let X be a random number and H an event, then

P(X H ) = P(H )P(X |H ). (1.13)

We call (1.13) the formula of composite expectation. If X is also an event, (1.13)

is said to be the formula of composite probability. In order to show that it follows
from the coherence principle, let us put z = P(X H ), x = P(H ) and y = P(X |H ).
Following the definition based on bets, this means that we are willing to accept any
combination of bets with total gain:

G = c1 (H − x) + c2 H (X − y) + c3 (X H − z)
= H (c1 + (c2 + c3 )X − c2 y) − c1 x − c3 z ,

where c1 , c2 and c3 are arbitrary constants. As in previous cases, let us fix c1 , c2 and
c3 in such a way that the random part of G cancels: c2 = −c3 and c1 = c2 y. Then

G = −c1 x − c3 z = c2 (z − x y).

If z = x y, then it is possible to choose c2 so that

G < 0. Therefore by coherence
principle
z = x y.

Analogously this equality follows by using the definition based on penalty. If P(H ) >
0, then
P(X H )
P(X |H ) = .
P(H )

In the case of an event E the formula

P(E H )
P(E|H ) =
P(H )
16 1 Random Numbers

has a logical meaning, as E H is the logical product of E and H , i.e. the event that
both E and H take place. In particular:
P(E)
1. E ⊂ H ⇒ P(E|H ) = ;
P(H )
2. H ⊂ E, that means I (E|H ) = {1} ⇒ P(E|H ) = 1;
3. H ⊂ Ẽ, that means I (E|H ) = {0} ⇒ P(E|H ) = 0.

1.8 Formula of Total Expectation and Total Probability

Given X a random number and H1 , . . . , Hn a partition, then

n
P(X ) = P(X |Hi )P(Hi ) . (1.14)
i=1

We call (1.14) the fomula of total expectation. If X is also an event, (1.14) is said to
be the formula of total probability. Indeed,

P(X ) = P(X · 1) = P(X (H1 + . . . + Hn ))

= P(X H1 + X H2 + · · · + X Hn )

n
n
= P(X Hi ) = P(X |Hi )P(Hi ).
i=1 i=1

1.9 Bayes Formula

Let E, H be events with P(H ) > 0. By applying twice the formula of total probability
we obtain Bayes’ formula:

P(E H ) P(H |E)P(E)

P(E|H ) = = .
P(H ) P(H )

This formula is a fundamental tool in statistical inference.

Example 1.9.1 Consider an urn contain N identical balls of which some are white
and some are black. Let Y be the random number of the white balls present in the
urn (the composition of the urn is unknown).
The events Hi = (Y = i), for i = 0, . . . , N form a partition. Let E be the event
that we obtain a white ball in a drawing from the urn. Using the formula of total
probability (1.14) we obtain:
1.9 Bayes Formula 17

N N
i
P(E) = P(E|Hi )P(Hi ) = P(Hi ) .
i=0 i=0
N

Indeed if the composition of the urn is known, i.e. if we condition with respect to Hi
for some i, we can apply usual symmetry considerations and get P(E|Hi ) = Ni .
In the case we assign to the partition H0 , . . . , HN the uniform distribution

1
P(Hi ) = , i = 0, . . . , N
N +1
we get

N
i 1
P(E) = = .
i=0
N (N + 1) 2

We now evaluate the probability that the urn contains i white balls if we have extracted
a white ball. This question is answered by Bayes’ formula:
i 1
P(E|Hi )P(Hi ) N N +1 2i
P(Hi |E) = = = .
P(E) 1
2
N (N + 1)

We see that distribution on the partition conditional to the event that a white ball is
drawn is no longer uniform, but it gives higher probabilities to compositions with a
large number of white balls.

1.10 Correlation Between Events

An event E is said to be positively correlated with the the event H if

P(E|H ) > P(E).

Analogously E is said to be negatively correlated with H if

P(E|H ) < P(E).

If P(E|H ) = P(E), we say that E is non-correlated with H .

If E is positively (resp. negatively) correlated with H , the information that H
takes place increases (resp. decreases) our evaluation of the probability of E. When
E is not correlated with H , our evaluation does not change.
When P(H ) > 0 and P(E) > 0, one can give a symmetric formulation of
correlation as it follows from the formula of composite probability. E and H are said
to be:
18 1 Random Numbers

• positively correlated if P(E H ) > P(E)P(H );

• negatively correlated if P(E H ) < P(E)P(H );
• non-correlated if P(E H ) = P(E)P(H ).
If E is positively correlated with H , so is Ẽ. Indeed in this case

P( Ẽ|H ) = 1 − P(E|H ) < 1 − P(E) = P( Ẽ).

In the same way, if E is non-correlated with H , so is Ẽ.

Example 1.10.1 We consider an urn with H white balls and N − H black balls. We
perform two drawings. Let E i be the event that a white ball is extracted at the ith
extraction, i = 1, 2. For drawings with replacement we have

H
P(E 1 ) = P(E 2 ) = .
N
Indeed the urn composition in the two drawings is the same. In this case E 1 and E 2
are non-correlated, as by (1.12)

H2
P(E 1 E 2 ) = = P(E 1 )P(E 2 ) .
N2
Let us now consider the case of drawings without replacement. We use again formula
H
(1.12) to compute probabilities and conditional probabilities. We have P(E 1 ) =
N
and by the formula of total probability (1.14) applied to the event E 2 and the partition
E 1 , E˜1 we get

P(E 2 ) = P(E 2 |E 1 )P(E 1 ) + P(E 2 | Ẽ 1 )P( Ẽ 1 )

H −1 H H H H
= + (1 − ) = .
N −1 N N −1 N N

Here P(E 1 ) and P(E 2 ) are both equal to HN and P(E 1 ), P(E 2 ) are negatively corre-
lated, as
H −1 H
P(E 2 |E 1 ) = < = P(E 2 )
N −1 N

if 0 < H < N .

We say that two events are stochastically independent if

P(E 1 E 2 ) = P(E 1 )P(E 2 ) .

When P(E 1 ) > 0 and P(E 2 ) > 0 this definition coincides with non-correlation.
When one or both of E 1 and E 2 have 0 probability, then E 1 and E 2 are stochastically
independent, as in this case P(E 1 )P(E 2 ) = 0 and
1.10 Correlation Between Events 19

P(E 1 E 2 ) ≤ P(E 1 ) ∧ P(E 2 ) = 0.

The definition of stochastic independence extends to the case of an arbitrary number

of events.
Definition 1.10.2 The events E 1 , . . . , E n are said to be stochastically independent
if for every subset {i 1 , . . . , i k } in {1, . . . , n} we have

P(E i1 · · · E ik ) = P(E i1 ) · · · P(E ik ). (1.15)

We remark that in general n events are not stochastically independent if the events
are only pairwise stochastically independent.
We shall see that if the events E 1 , . . . , E n are stochastically independent, then
the events E 1∗ , . . . , E n∗ are stochastically independent for every possible choice of
E i∗ between E i and Ẽ i , for i = 1, . . . , n.
Definition 1.10.3 Let H = {H1 , . . . , Hn } be a partition. The events E 1 , E 2 are said
to be stochastically independent conditionally to the partition H if

P(E 1 E 2 |Hi ) = P(E 1 |Hi )P(E 2 |Hi ) for all i = 1, . . . n .

Example 1.10.4 Let us consider an urn with unknown composition containing N

identical balls, of which some are white and some are black. Let Y be the random
number of white balls in the urn. We perform two drawings with replacement. Let
E i , i = 1, 2, be the event that in the ith drawing we extract a white ball.
Consider the partition

Hi = (Y = i) i = 0, . . . N .

It is easy to see that the events E 1 and E 2 are stochastically independent conditionally
to the partition H. We want to see whether E 1 and E 2 are stochastically independent,
assuming that we assign the uniform distribution to H , i.e. P(Hi ) = N 1+1 for i =
0, 1, . . . , N . We compute:
1. the probability of the first drawing:

N
P(E 1 ) = P(E 1 |Hi )P(Hi )
i=0

1 i
N
=
N + 1 i=0 N
1 N (N + 1)
=
N +1 2N
1
= ;
2
20 1 Random Numbers

2. the probability of the second drawing:

1
P(E 2 ) = P(E 1 ) = ;
2
3. the probability that we draw a white ball in both drawings:

N
P(E 1 E 2 ) = P(E 1 E 2 |Hi )P(Hi )
i=0

1
N
= P(E 1 |Hi )P(E 2 |Hi )
N + 1 i=0

1 i2
N
= .
N + 1 i=0 N 2

Using the fact that

(i + 1)3 − i 3 = 3i 2 + 3i + 1

we have

N N
(i + 1)3 − i 3
N N
1 (N + 1)3 N (N + 1) N
i =
2
− i− = − − ,
i=0 i=0
3 i=0 i=0
3 3 2 3

and
(N + 1)2 1 1
P(E 1 E 2 ) = − − .
3N 2 2N 3N (N + 1)

1
For N → +∞, P(E 1 E 2 ) tends to . Therefore at least for large N , E 1 and E 2
3
are positively correlated. This shows that stochastic independence conditionally
to a partition does not imply stochastic independence.

1.11 Stochastic Independence and Constituents

Proposition 1.11.1 The events E 1 , . . . , E n are stochastically independent if and

only if

P(Q) = P(E 1∗ ) · · · P(E n∗ ) (1.16)

for every constituent Q = E 1∗ · · · E n∗ of E 1 , . . . , E n .

1.11 Stochastic Independence and Constituents 21

Proof ⇒) Let Q = E 1∗ · · · E n∗ be a constituent of E 1 , . . . , E n . Developing the

products, we can express Q as a polynomial φ of E 1 , . . . , E n where the degree in
every variable is 1:
E 1∗ · · · E n∗ = φ(E 1 , . . . , E n ).

For example, consider the constituent Q of the events E 1 , E 2 , E 3 given by

Q = Ẽ 1 E 2 E 3 = (1 − E 1 )E 2 E 3 = E 2 E 3 − E 1 E 2 E 3 .

Here φ(x1 , x2 , x3 ) = x2 x3 − x1 x2 x3 = (1 − x1 )x2 x3 .

If the events E 1 , . . . , E n are stochastically independent, the probabilities of prod-
ucts factorize into products of probabilities so that

P(Q) = P (φ(E 1 , . . . , E n ))
= φ (P(E 1 ), . . . , P(E n ))
= P(E 1∗ ) · · · P(E n∗ ),

where the last equality is obtained by collecting terms in φ and using that P( Ẽ i ) =
1 − P(E i ). In the example Q = Ẽ 1 E 2 E 3 . We have

P(Q) = P Ẽ 1 E 2 E 3
= P (E 2 E 3 − E 1 E 2 E 3 )
= P(E 2 )P(E 3 ) − P(E 1 )P(E 2 )P(E 3 ) = φ(P(E 1 ), P(E 2 ), P(E 3 ))
= (1 − P(E 1 ))P(E 2 )P(E 3 ) = P( E˜1 )P(E 2 )P(E 3 ).

⇐) We assume that (1.16) holds for all constituents of the events E 1 , . . . , E n . Let
{i 1 , . . . , i k } ⊂ {1, . . . , n} and { j1 , . . . , jn−k } = {1, . . . , n} \ {i 1 , . . . , i k }. Then
⎛ ⎞

P(E i1 · · · E ik ) = P ⎝ Q⎠
Q⊂E i1 ···E ik

= P(E i1 ) · · · P(E ik ) P(E j1 . . . E jn−k )

where the sum ranges over all possible choices of E jl for l = 1, . . . , n − k. By
collecting terms we get:

P(E i1 . . . E ik ) = P(E i1 ) . . . P(E ik )[(P(E j1 ) + P( Ẽ j1 )] . . . [P(E jn−k ) + P( Ẽ jn−k )]

= P(E i1 ) . . . P(E ik ),

since the last n − k factors are all equal to 1.

22 1 Random Numbers

1.12 Covariance and Variance

Given two random numbers X and Y , the covariance between X and Y is defined by

cov(X, Y ) = P ((X − P(X ))(Y − P(Y ))) .

X and Y are said to be:

• positively correlated if cov(X, Y ) > 0;
• negatively correlated if cov(X, Y ) < 0;
• non-correlated if cov(X, Y ) = 0.
By developing the product in the definition of the covariance, we obtain:

cov(X, Y ) = P(X Y − P(X )Y − X P(Y ) + P(X )P(Y )) = P(X Y ) − P(X )P(Y ).

The variance of a random number X is defined by

σ 2 (X ) = cov(X, X ).

Other notations for the variance of X are var(X ) and D(X ). From the two expressions
for the covariance
we get two expressions
for the variance: σ 2 (X ) = P(X 2 ) − P(X )2
and σ (X ) = P (X − P(X )) . From the second expression we see that
2 2

σ 2 (X ) ≥ 0,

as it is the expectation of a non-negative random number. We also define:

• quadratic expectation:
PQ (X ) = P(X 2 );

• standart deviation:

σ(X ) = σ 2 (X ) = P Q (X − P(X )).

Proposition 1.12.1 (Properties of covariance and variance) Covariance and vari-

ance satisfy the following properties:
1. bilinearity:
cov(X + Y, Z ) = cov(X, Z ) + cov(Y, Z ); (1.17)

2. behavior with respect to linear transformations:

cov(a X + b, cY + d) = ac cov(X, Y ), (1.18)

σ 2 (a X + b) = a 2 σ 2 (X ). (1.19)
1.12 Covariance and Variance 23

Proof 1. From the definition of covariance we have

cov(X + Y, Z ) = P((X + Y ) − P(X + Y ), Z − P(Z )

= P((X + Y ) − P(X ) − P(Y ))(Z − P(Z ))
= P(((X − P(X ))(Z − P(Z ))) + P(((Y − P(Y ))(Z − P(Z )))
= cov(X, Z ) + cov(Y, Z ) .

2. Again from the definition of covariance and the linearity of the expectation we
have:

cov(a X + b, cY + d) = P ((a X + b − P(a X + b)) (cY + d − P(cY + d)))

= P ((a X + b − aP(X ) − b) (cY + d − cP(Y ) − d))
= P (a (X − P(X )) c (Y − P(Y )))
= ac cov(X, Y ).

Proposition 1.12.2 (Variance of the sum of random numbers) Let X 1 , . . . , X n n be

random numbers, then:

n
σ 2 (X 1 + · · · + X n ) = σ 2 (X i ) + cov(X i , X j )
i=1 i, j
i= j

n
= σ 2 (X i ) + 2 cov(X i , X j ).
i=1 i< j

Proof By the bilinearity property (1.17) we have:

σ 2 (X 1 + · · · + X n ) = cov(X 1 + · · · + X n , X 1 + · · · + X n )

n
= cov(X i , X i ) + cov(X i , X j )
i=1 i= j

n
= σ 2 (X i ) + cov(X i , X j ).
i=1 i, j
i= j

1.13 Correlation Coefficient

It is useful to introduce an index of the correlation of two random numbers X, Y ,

called correlation coefficient. As we shall see, it has the property that if X and Y
correspond to observed quantities, it does not depend on the units of measure of X
and Y .
24 1 Random Numbers

Definition 1.13.1 For X, Y random numbers with σ(X ) > 0, σ(Y ) > 0 the corre-
lation coefficient of X and Y is defined by

cov(X, Y )
ρ(X, Y ) = .
σ(X ) σ(Y )

Let us state two important properties of the correlation coefficient:

1. If X, Y are random numbers with σ(X ) > 0, σ(Y ) > 0 and a, b, c, d are con-
stants with a = 0 and c = 0, we have

ρ(a X + b, cY + d) = sgn(ac) ρ(X, Y ),

where sgn(x) = 1 for x > 0 and sgn(x) = −1 for x < 0.

Proof By using the properties (1.18) and (1.19) we get

cov(a X + b, cY + d)
ρ(a X + b, cY + d) =
σ 2 (a X + b) σ 2 (cY + d)
ac cov(X, Y )
=
|ac| σ 2 (X ) σ 2 (Y )

= sgn(ac) ρ(X, Y ).

2. −1 ≤ ρ(X, Y ) ≤ 1.
Let
X − P(X ) Y − P(Y )
X∗ = , Y∗ = .
σ(X ) σ(Y )

These are the so-called standardized random numbers: they are obtained from
X, Y by means of suitable linear transformation such that P(X ∗ ) = 0, P(Y ∗ ) = 0
and σ 2 (X ∗ ) = 1, σ 2 (Y ∗ ) = 1 by using linearity of the expectation and (1.19).
By (1.18) we get

P (X ∗ Y ∗ )
cov(X ∗ , Y ∗ ) = = ρ(X, Y ).
σ(X ) σ(Y )

Computing the variance of X ∗ + Y ∗ using Proposition 1.12.1 we get:

0 ≤ σ 2 (X ∗ + Y ∗ ) = σ 2 (X ∗ ) + σ 2 (Y ∗ ) + 2 cov(X ∗ , Y ∗ )
= 2 + 2ρ(X, Y ),
1.13 Correlation Coefficient 25

so that ρ(X, Y ) ≥ −1. Similarly computing the variance of X ∗ − Y ∗ , we obtain

0 ≤ σ 2 (X ∗ − Y ∗ ) = σ 2 (X ∗ ) + σ 2 (−Y ∗ ) + 2 cov(X ∗ , −Y ∗ )
= 2 − 2ρ(X, Y ).

so that ρ(X, Y ) ≤ 1.

1.14 Chebychev’s Inequality

The Chebychev’s inequality allows to estimate the probability that a random number
takes value far from its expectation. It can be formulated in two ways:
1. Let X be a random number with PQ (X ) > 0. For every t > 0

1
P |X | ≥ t PQ (X ) ≤ 2 .
t

2. Let X be a random number with σ 2 (X ) > 0. Let m = P(X ), ∀t > 0:

1
P (|X − m| ≥ σ(X )t) ≤ .
t2

Proof 1. Let E be the event E = |X | ≥ t PQ (X ) . We compute P X 2 using the
formula of total expectation with respect to the partition E, Ẽ:

P X 2 = P X 2 |E P(E) + P X 2 | Ẽ P Ẽ .

Since X 2 is non-negative, the last term on the right-hand side is non-negative.

Moreover inf I (X 2 |E) ≥ t 2 . Then PQ (X )2 = t 2 P(X )2 in force of the defin-
ition of E. Therefore we have P(X )2 ≥ t 2 P(X 2 )P(E). This implies the first
inequality.
2. The second inequality follows from the first by applying it to the random number
Y = X − m and using that P Q (Y ) = σ(X ).

1.15 Weak Law of Large Numbers

Theorem 1.15.1 (Weak law of large numbers). Let (X n )n=1,2,... be a sequence of

random numbers such that all have the same expectation, P(X i ) = m, the same
variance σ 2 (X i ) = σ 2 and cov(X i , X j ) = 0, ∀i, j with i = j. If we put Sn =
X 1 + · · · + X n , we have that for all λ > 0
26 1 Random Numbers

Sn
lim P | − m| ≥ λ = 0 .
n→+∞ n

Proof The proof is based on the second form of Chebychev’s inequality. First we
Sn
compute the expectation of :
n

Sn 1
P = (P(X 1 ) + · · · + P(X n )) = m
n n

and its variance

Sn 1 1 σ2
σ 2
= 2 σ 2 (Sn ) = 2 (σ 2 (X 1 ) + · · · + σ 2 (X n )) = ,
n n n n

where we have used Proposition 1.12.2 and the fact that random numbers of the
sequence are pairwise uncorrelated. From the second form of Chebychev’s inequality
we get

Sn σ 1
P | − m| ≥ √ t ≤ 2 .
n n t

σ 1 σ2
Putting λ = √ t, we obtain 2 = . Therefore
n t nλ2

Sn σ2
P | − m| ≥ λ ≤ ,
n nλ2

that tends to 0 as n → +∞.

Sn E1 + · · · + En
The quantity = is called frequence. In this case the weak law
n n
of large numbers shows that for a large sequence of trials (events) the frequence of
success is close to the probability of a single event with large probability.

Example 1.15.2 In particular one can apply the weak law of large numbers to the
case of a sequence of uncorrelated events (E i )i=1,2,... with the same probability
P(E i ) = p. Note that for an event E i ,

σ 2 (E i ) = P(E i2 ) − P(E i )2 = P(E i ) − P(E i )2 = p(1 − p)

so the E i ’s have automatically the same variance. Hence for all λ > 0 we have

Sn
P | − p| ≥ λ → 0
n

for n → ∞.
Chapter 2
Discrete Distributions

2.1 Random Numbers with Discrete Distribution

The distribution of a random number X is said to be discrete if there is a finite or

enumerable set A ⊂ I (X ) such that P(X ∈ A) = 1. This is obviously the case
when I (X ) is itself finite or enumerable, since in this case we may take A = I (X ).
Let A = {x1 , x2 , . . .} and define p(xi ) = P(X = xi ). In the examples of discrete
distributions that we shall consider, we always have
∞

p(xi ) = 1 .
i=1

This property is not a consequence of the basic properties of expectation that

we have derived
∞from the coherence principles (from linearity and monotonicity we
only get that i=1 p(xi ) ≤ 1). It can be considered as a regularity property of the
expectation. See [dF] for a thorough discussion of this problem. In the following we
introduce some of the most common discrete distributions.

2.2 Bernoulli Scheme

A simple and useful model from which some discrete distributions can be derived is
the Bernoulli scheme. It can be thought of as a potentially infinite sequence of trials,
each of them with two possible outcomes called success and failure. Each trial is
performed in the same known conditions and we assume that there is no influence
between different trials. Formally a Bernoulli scheme with parameter p, 0 < p < 1,
is a sequence E 1 , E 2 , . . . of stochastically independent equiprobable events with
P(E 1 ) = p.

© Springer International Publishing Switzerland 2016 27

F. Biagini and M. Campanino, Elements of Probability and Statistics,
UNITEXT - La Matematica per il 3+2 98, DOI 10.1007/978-3-319-07254-8_2
28 2 Discrete Distributions

Example 2.2.1 A concrete example for which one can use as a model a Bernoulli
scheme with p = 21 is a sequence of throws of a symmetric coin, where E i is the
event that one gets head at the ith throw.

2.3 Binomial Distribution

Given a Bernoulli scheme (E i )i∈N with P(E i ) = p, let Sn the random number of
successes in the first n trials. Sn can be written as

Sn = E 1 + · · · + E n .

The set of possible values of Sn is I (Sn ) = {0, . . . , n}.

Let us compute, using the constituents of the events E 1 , . . . , E n , the probability
distribution of Sn :
P(Sn = k) = P(Q) .
Q⊂(Sn =k)

We must determine the probability of a constituent of I type with respect to the event
(Sn = k). An example of such a constituent is

Q = E 1 . . . E k Ẽ k+1 . . . Ẽ n , (2.1)

that is the event that k successes are obtained in the first k trials, whereas the remaining
n − k trials yield failures.
Analogously, any other constituent of I type will be a product of the same kind as in
(2.1). Since the events are stochastically independent, in force of Proposition 1.11.1,
every constituent Q of I type has the same probability, given by

P(Q) = p · · · p (1 − p) · · · (1 − p) = p k (1 − p)n−k .

k times (n−k) times

In order to compute P(Sn = k) we must therefore multiply

this value times the
n
number of constituents of I type. This is equal to , that is the number of ways
k
of choosing a subset of k elements out of n trials. Therefore we have

n
P(Sn = k) = p k (1 − p)n−k .
k

Sn is said to have binomial distribution

Bn(n, p) with parameters n, p.
It is easy to check that nk=0 P(Sn = k) = 1, as it must be since the events
(Sn = k), k = 0, . . . , n, make up a partition. Indeed, using Newton’s formula, we
have:
2.3 Binomial Distribution 29

n
n
1 = ( p + 1 − p) = n
p k (1 − p)n−k .
k
k=0

The simplest way to compute the expectation of Sn is through the linearity of expec-
tation:

n
P(Sn ) = P(E 1 + · · · + E n ) = P(E i ) = np .
i=1

Example 2.3.1 Consider an urn containing N identical balls, of which H are white
and N − H are black. We perform a sequence of n drawings with replacement.
It is easy to check that by symmetry the sequence of events (E i )i = 1, 2, ... where
E i = (a white ball is drawn at the ith drawing) makes up a Bernoulli scheme, with
parameters p = HN . Indeed for 1 ≤ i 1 < i 2 < . . . < i k
k
Hk H
P(E i1 . . . E in ) = k = ,
N N

where the possible cases correspond to the N k sequences of balls that may be drawn in
the drawings i 1 , . . . , i k , whereas the favorable cases correspond to the H k sequences
where white balls are drawn.

2.4 Geometric Distribution

Let (E i )i = 1, 2, ... be a Bernoulli scheme; let T be the random number representing

the number of the trial when the first success is obtained, i.e. T = min{n | E n = 1}.
The set of possible values of T is given by:

I (T ) = N \ {0} ∪ {∞} .

It is easy to see that P(T = ∞) = 0 since for all n > 0, (T = ∞) ⊆ E˜1 . . . E˜n
so that P(T = ∞) ≤ P( E˜1 . . . E˜n ) = (1 − p)n for every n. Let us compute the
probability distribution of T for finite values:

P (T = i) = P Ẽ 1 . . . Ẽ i−1 E i = P Ẽ 1 . . . P Ẽ i−1 P (E i ) = (1 − p)i−1 p .

T is said to have geometric distribution with parameter p. Using the formula for the
sum of geometric series (see Appendix G.1), one verifies that

+∞
+∞
+∞
1
P(T = i) = (1 − p)i−1 p = p (1 − p)k = p · = 1.
i=1 i=1 k=0
1 − (1 − p)
30 2 Discrete Distributions

The expectation of T can be computed by an extension of formula (1.10) to the case

of enumerable set of values. This can be justified (providing that the series converges)
as a regularity property, thinking that T can be approximated with random numbers
with a finite but arbitrarily large number of values. We then get

+∞
+∞
+∞
p 1
P(T ) = iP(T = i) = i(1 − p)i−1 p = p i(1 − p)i−1 = = ,
i=1 i=1 i=1
p2 p

where we used that for |x| < 1

+∞ +∞
+∞
d i d d 1 1
ix i−1
= [x ] = x i
= = .
i=1 i=1
d x d x i=0
dx 1−x (1 − x)2

The geometric distribution is said to be “memoryless”. Indeed for m > 0, n > 0

P(T > m + n | T > n) = P(T > m),

i.e. the conditional probability of no success up and including the (m +n)th trial given
that there was no success up and including the nth trial is equal to the probability
of no success up and including the m trial: everything starts from scratch. We have
namely that

P(T > m + n, T > n) P(T > m + n)

P(T > m + n | T > n) = = .
P(T > n) P(T > n)

But P(T > n) = (1 − p)n since (T > n) = E˜1 − E˜n . Hence

(1 − p)m+n
P(T > m + n | T > n) = = (1 − p)m = P(T > m) .
(1 − p)n

2.5 Poisson Distribution

A random number X is said to have Poisson distribution with parameters λ, λ ∈ R+ ,

if I (X ) = N and
λi
P(X = i) = e−λ .
i!
+∞
As in the case of a geometric distribution i=0 P(X = i) = 1. Indeed

+∞
+∞ i
+∞ i

λ λ
P(X = i) = e−λ = e−λ = e−λ eλ = 1 .
i=0 i=0
i! i=0
i!
2.5 Poisson Distribution 31

In order to compute the expectation, we use the extension of the formula for random
numbers with a finite number of possible values to the case of a enumerable set of
possible values as we did for geometric distribution and as we will do in similar cases
(provided that the series is convergent). We obtain

+∞
+∞
+∞

λi λi−1
P(X ) = iP(X = i) = i e−λ = λe−λ
i=0 i=0
i! i=1
(i − 1)!
+∞ k
λ
= λe−λ = λe−λ eλ = λ .
k=0
k!

2.6 Hypergeometric Distribution

Consider an urn containing N balls of which H are white and N − H black, where
0 < H < N . We perform n drawings without replacement from the urn with n ≤ N .
Let X be the random number that counts the number of white balls in the sample
that we draw.
Since we perform drawings without replacement, X is less than or equal to H and
n − X , the number of black balls in the sample, is less than or equal to N − H . From
this it follows that the set of possible values of X is given by

I (X ) = {0 ∨ n − (N − H ), . . . , n ∧ H } .

Let i ∈ I (X ). Due to the symmetry of the situation with respect to interchange of

balls, we evaluate P(X = i) using formula (1.12). When defining possible cases and
consequently favorable cases, we can consider the set of the n drawn balls, i.e. we
can avoid to consider the order of drawings, as the event does not involve the order.
In this way the possible cases correspond to the subset of size n from a set of N
elements:
N
possible cases = .
n

A sample with i white balls contains (n − i) black balls. The number of favorable
cases that correspond to such samples is therefore given by:

H N−H
f avorable cases = .
i n−i

The random number X is said to have hypergeometric distribution with parameters

n, H, N . By the former discussion we have:
32 2 Discrete Distributions

H N−H
i n−i
P(X = i) = .
N
n

In order to compute the expectation of X , it is convenient to decompose it as

n
X= Ei
i=1

where E i is the event that a white ball is chosen at ith drawing. Therefore by the
linearity of the expectation

P(X ) = P(E 1 ) + · · · + P(E n ).

In the evaluation of P(E i ) we can still use symmetry by interchange of balls, but when
defining possible cases we must take into account the order, since the event depends
on the order of the drawings. Possible cases correspond to sequences of length n
of distinct elements from a set of N elements. Their number is DnN = (N )n =
N (N − 1) − (N − n + 1). Favorable cases correspond to those sequences that have
a white ball at the ith place. This ball can be chosen in H ways. The remaining balls
form a sequence of lenght n − 1 of distinct elements from a set of N − 1 elements.
Therefore
N −1
f avorable cases H Dn−1 H
P(E i ) = = =
possible cases Dn N N

and
H
P(X ) = n .
N

2.7 Independence of Partitions

Two partitions H = (H1 , . . . , Hm ), L = (L 1 , . . . , L n ) are said to be stochastically

independent if for every i, j with 1 ≤ i ≤ m, 1 ≤ j ≤ n

P Hi L j = P (Hi ) P L j .

Stochastic independence can be extended to the case of r partitions H1 , . . . , Hr .

Consider the partitions
Hl = (H1(l) , . . . , Hn(l)
l
)

for 1 ≤ l ≤ r . H1 , . . . , Hr are said to be stochastically independent if for every

i 1 , . . . , ir with 1 ≤ il ≤ n l , . . . , 1 ≤ l ≤ r
2.7 Independence of Partitions 33

P Hi(1)
1
. . . Hi
(r )
r
= P Hi
(1)
1
. . . P Hi
(r )
r
.

Partitions can be thought as pluri-events, with a certain number of possible results,

such as in the case of drawings from an urn containing balls of several colors. In the
case of partitions with two events, one can select an event from each partition. In this
case stochastic independence of partitions is equivalent to stochastic independence
of the selected events.

2.8 Generalized Bernoulli Scheme

H1 , H2 , . . .be a sequence of partitions, each composed by r events, Hi =

Let
E 1(i) , . . . , Er(i) for i ≥ 1. We assume that H1 , . . . , Hn are stochastically inde-
pendent for every n and that P(E k(i) ) = pk , k = 1, . . . , r , for all i ≥ 1, with
p1 + · · · + pr = 1. The sequence H1 , H2 , . . . is called a generalized Bernoulli
scheme. In the case r = 2 a generalized Bernoulli scheme is equivalent to the ordi-
nary Bernoulli scheme (F1 , F2 , . . .) where Fi = E 1(i) with parameter p = p1 . We
can represent a generalized Bernoulli scheme in an array:

E 1(1) , . . . , Er(1)
E 1(2) , . . . , Er(2)
.. .
. , . . . , ..
.. .
. , . . . , ..
E 1(n) , . . . , Er(n) ,

where the events belonging to the same column are equiprobable, whereas the events
of each row constitute stochastically independent partitions.

2.9 Multinomial Distribution

Starting from a generalized Bernoulli scheme, as defined in Sect. 2.2, we can now
define the multinomial distribution in the same way as the binomial distribution can
be defined starting from an ordinary Bernoulli scheme. Given n > 0, let us consider
the random numbers Y1 , . . . , Yr defined by

n
Yl = El(i) , l = 1, . . . , r .
i=1
34 2 Discrete Distributions

In the array of the previous section, the Yl ’s are obtained by adding up the events
along the columns. We have

r
r
n
n
r
Yl = El(i) = El(i) = n .
l=1 l=1 i=1 i=1 l=1

1

The idea of constituents can be extended in a natural fashion from events to partitions.
A constituent of the partition H1 , . . . , Hn is an event of the form

Q = i=1
n
H∗i ,

where H∗i is an event of the partition Hi . If H1 , . . . , Hn are stochastically independent

(as in the case of generalized Bernoulli scheme) we have:

P(Q) = P(H∗1 ) . . . P(H∗n ) .

We want to compute
P (Y1 = k1 , . . . , Yr = kr )

for k1 ≥ 0, kr ≥ 0 such that k1 + · · · + kr = n. We can decompose this probability

in terms of constituents of I type:

P(Y1 = k1 , . . . , Yr = kr ) = P(Q),
Q

where Q varies among the constituents of I type contained in the event (Y1 =
k1 , . . . , Yr = kr ). In the product defining a constituent of I type there will be kl
events of index l with 1 ≤ l ≤ r . Therefore since the partitions are stochastically
independent, the probability of a constituent of I type is given by:

P(Q) = p1k1 , . . . , prkr .

The number of constituents of I type is equal to the way of partitioning a set of n

n!
elements into r subsets with k1 , . . . , kr elements, i.e. . We have therefore:
k 1 ! . . . kr !
n!
P (Y1 = k1 , . . . , Yr = kr ) = P(Q) = p k 1 . . . p kr .
k !...k ! 1 r
Q I type 1 r P(Q)
number of constituents

multinomial distribution depends on the parameters r, p1 , . . . , pr −1 , since pr =

The
−1
1 − ri=1 pi . For r = 2 the multinomial distribution reduces to the binomial one.
2.10 Stochastic Independence for Random Numbers … 35

2.10 Stochastic Independence for Random Numbers

with Discrete distribution

Let X and Y be two random numbers with I (X ) = {x1 , . . . , xm } and I (Y ) =

{y1 , . . . , yn }. We consider the partitions H and K generated by the events Hi =
(X = xi ), for i = 1, . . . , m, and K j = (Y = y j ), for j = 1, . . . , n.
The random numbers X and Y are said to be stochastically independent if the
partitions H and K are stochastically independent.

2.11 Joint Distribution

Let us consider two random numbers X and Y , that we can look at as a random vector
(X, Y ), assuming a finite number of possible values I (X, Y ). If I (X ) = {x1 , . . . , xm }
and I (Y ) = {y1 , . . . , yn } we define the joint distribution of X and Y . This is the
function
p(xi , y j ) = P(X = xi , Y = y j )

defined on I (X ) × I (Y ). We can associate to it the matrix

⎛ ⎞
p(x1 , y1 ) . . . p(x1 , yn )
⎜ .. .. .. ⎟
⎝ . . . ⎠.
p(xm , y1 ) . . . p(xm , yn )

The marginal distribution of X is the function

p1 (xi ) = P(X = xi )

for i = 1, . . . , m. The marginal distribution can be obtained from the joint distribu-
tion:
n
n
p1 (xi ) = P(X = xi ) = P(X i , Y j ) = p(xi , y j ),
j=1 j=1

i.e. adding up the elements on the rows of the matrix. It is called marginal because it
is customarily written at the margin of the matrix. Similarly the marginal distribution
of Y is defined by:

m
p2 (y j ) = P(Y = y j ) = p(xi , y j ) .
i=1
36 2 Discrete Distributions

It follows that two random numbers X and Y are stochastically independent if and
only if
p(xi , y j ) = p1 (xi ) p2 (y j ) (2.2)

for i = 1, . . . , m and j = 1, . . . , n. Given ψ : R2 −→ R, the expectation of the

random number Z = ψ(X, Y ) can be obtained from the joint distribution of X, Y :

m
n
P(Z ) = P(ψ(X, Y )) = ψ(xi , y j ) p(xi , y j ) . (2.3)
i=1 j=1

The proof is completely analogous to that one in the case of a single random number.
For example, we can compute P(X Y ):

m
n
P(X Y ) = xi y j p(X = xi , Y = y j ).
i=1 j=1

If X and Y are stochastically independent and φ1 , φ2 are two real functions φi :

R −→ R with i = 1, 2, we have that

P(φ1 (X )φ2 (Y )) = P(φ1 (X ))P(φ2 (Y )). (2.4)

Indeed

m
n
P(φ1 (X )φ2 (Y )) = φ1 (xi )φ2 (y j )P(X = xi , Y = y j )
i=1 j=1

= φ1 (xi )φ2 (y j ) p1 (xi ) p2 (y j )
(xi ,y j )∈I (X )×I (Y )

= φ1 (xi ) p1 (xi ) φ2 (y j ) p2 (y j )
xi ∈I (X ) y j ∈I (Y )

= P(φ1 (X ))P(φ2 (Y )) .

2.12 Variance of Discrete Distributions

We compute the variances of the distributions that we have previously introduced.

1. Variance of an event:

σ 2 (E) = P E 2 − P (E)2 = P(E) − P(E)2 = P(E)(1 − P(E)) ,

where we use that for an event E 2 = E since E can take only values 0 and 1.
2.12 Variance of Discrete Distributions 37

2. Binomial distribution: For X with binomial distribution with parameters n x and

p, we use the representation X = E 1 +. . . + E n , where the E i ’s are stochastically
independent and hence pairwise uncorrelated. We get:

n
σ 2 (E 1 + . . . + E n ) = σ 2 (E i ) = np(1 − p) .
i=1

3. Geometric distribution: we need to compute P(X 2 ) as we have already computed:

+∞
1
P(X ) = i p(1 − p)i−1 = .
i=1
p

Hence
+∞
+∞
+∞

P(X 2 ) = p i 2 (1 − p)i−1 = p i(i − 1)(1 − p)i−1 + p i(1 − p)i−1
i=1 i=1 i=1
+∞
1
= p(1 − p) i(i − 1)(1 − p)i−2 +
p
i=2
+∞
d2 1
= p(1 − p) (1 − p)i
+
d2 p p
i=2
2
d 1 1
= p(1 − p) − 1 − (1 − p) +
2
d p 1 − (1 − p) p
2(1 − p) 1
= +
p2 p
2 1
= 2 − .
p p

Therefore the variance of the geometric distribution is given by

(1 − p)
σ 2 (X ) = P[X 2 ] − P(X )2 = .
p2

4. Poisson distribution: if X has Poisson distribution with parameter λ, we have:

+∞
+∞
+∞ λi +∞
λi −λ λi
P(X 2 ) = i 2 P(X = i) = i2 e = e−λ i(i − 1) + λe−λ
i=0 i=0
i! i=0
i! i=0
i!
+∞
+∞ k

λi−2 λ
= λ2 e−λ + λ = λ2 e−λ + λ = λ2 + λ
i=2
(i − 2)! k=0
k!

where we have used the computation of the expectation of the Poisson distribution.
38 2 Discrete Distributions

We have then

σ 2 (X ) = P(X 2 ) − P(X )2 = λ2 + λ − λ2 = λ .

5. Hypergeometric distribution: with the notation of Sect. 2.6, we use the represen-
tation X = E 1 +· · · + E n . The events E i ’s in this case are not stochastically inde-
pendent and are actually pairwise negatively correlated. Indeed, for 0 < H < N
for every pair i, j with i = j, we have:

H H−N
cov(E i , E j ) = P(E i E j ) − P(E i )P(E j ) = <0
N2 N − 1
as
N −2 N −2
H (H − 1)Dn−2 H (H − 1) Dn−2 H (H − 1)
P(E i E j ) = = −2
= .
DnN N (N − 1) Dn−2
N N (N − 1)

Here we have used formula (1.12); possible cases are sequences with no repetition
of length n from a set of N elements, whereas in counting favorable cases we
first select two different white balls for the ith and the jth drawings and then the
remaining n − 2 balls from a set of N − 2 elements.
The variance of X is then obtained by means of the formula for the variance of
the sum of n random numbers:

n
σ 2 (X ) = σ 2 (E i ) + cov(E i , E j )
i=1 i, j
i = j
H H H H−N N −n H H
=n (1 − ) + n(n − 1) 2 =n (1 − ) ,
N N N N −1 N −1 N N

where n(n − 1) is the number of ordered pairs i, j, with i = j, which can be chosen
out of {1, . . . , n}.

2.13 Non-correlation and Stochastic Independence

Let us consider two random numbers X and Y with discrete joint distribution given
by:
p(X i , Y j ) = P(X = xi , Y = y j ) = pi, j

and marginal distributions given by:

p1 (xi ) = P(X = xi ) = pi i = 1, . . . , m ,

p2 (y j ) = P(Y = y j ) = q j j = 1, . . . , n .
2.13 Non-correlation and Stochastic Independence 39

X and Y are non-correlated if

P(X Y ) = P(X )P(Y )

i.e. if
xi y j pi, j = xi pi yjqj .
i j i j

Moreover, the following relations are satisfied:

p = 1 and p = pi for i = 1, . . . , m ,
i i j i, j
j q j = 1 and i pi, j = q j for j = 1, . . . , n ,

pi, j = 1 .
i j

Assume that we want to find values pi, j of the joint distribution, such thatX and Y are
non-correlated and have two fixed marginal distributions { pi }i=1,...,m and q j j=1,...,n .

We observe first of all that pi, j must satisfy the relation i, j pi, j = 1. In order to
determine the marginal distributions we must verify other additional (m −1)+(n −1)
linear relations. We have (m −1)+(n −1) and not m +n, since once (m −1)+(n −1)
relations
are satisfied, the last two follow from the fact that p
i, j i, j = 1, p
i i = 1,
q
j j = 1. Finally in order to impose non-correlation, an extra linear relation must
be verified on the pi, j ’s:
pi, j xi y j = m 1 m 2 ,
i j

m n
where m 1 = i x i pi and m 2 = j y j q j . We have therefore a system of 1 +
(m − 1) + (n − 1) + 1 = m + n linear equations for mn unknowns. This system has
the solution pi, j = pi q j , for which X and Y are stochastically independent. This will
be the only solution if the number of linearly independent equations is equal to the
number of the unknowns, i.e. if m + n = mn, or mn−m−n = (m−1)(n−1)−1 = 0.
This happens only if m = n = 2. It follows that non-correlation does not imply in
general stochastic independence. If m = n = 2, then there is just one solution so
that non-correlation and stochastic independence coincide. This is the case of events:
two events are non-correlated if and only if they are stochastically independent.
In Sect. 2.11 we have shown that stochastic independence implies non-correlation
and that in fact it implies non-correlation of any two functions of the random numbers.

2.14 Generating Function

Let X be a random number with discrete distribution on a subset of N. The generating

function of X is defined for u ∈ C, |u| ≤ 1, by
40 2 Discrete Distributions

φ X (u) := P(u X ) = u k P(X = k). (2.5)
k∈I (X )

The expectation of a complex random variable is defined as the expectation of the

real part plus i times the expectation of the imaginary part. The condition |u| ≤ 1
guarantees that the series (2.5) is convergent in the case of infinitely many possible
values. We will use characteristic functions just for real values of u. We have that

φ X (0) = P(X = 0).

In general, computing the nth derivative of (2.5) in u = 0, we obtain

1 d n φ X (u)
P(X = n) = ,
n! d x n u=0

for every n ∈ N. This shows that the probability distribution of X can be obtained
from its generating function.

Proposition 2.14.1 If P(X ) = k∈I (X ) kP(X = k) < ∞, then P(X ) =

limu→1− φX (u). Moreover P(X ) = k∈I (X ) kP(X = k) = +∞ if and only if
limu→1 φX (u)= ∞.
This is a particular case of the following result.

Proposition 2.14.2 If P(X (X − 1) . . . (X − k + 1)) = k∈I (X ) (k(k − 1) . . . (k −
n + 1))P(X = k) < ∞, then

P(X (X − 1) . . . (X − k + 1)) = lim− φ(n)

X (u) .
u→1

Furthermore k∈I (X ) (k(k − 1) . . . (k − n + 1))P(X = k) = ∞ if and only if
limu→1− φ(n)
X (u) = ∞.

Previous results are easily obtained by taking the derivatives of the generating func-
tion. In particular the variance of X can be obtained from the generating function:

2
σ 2 (X ) = P(X 2 ) − P(X )2 = lim− φX (u) + φX (u) − φX (u) ,
u→1

where φX and φX denote respectively the first and the second derivatives of φ X .
Generating functions of some common discrete distributions are easily obtained:

1. Event E with probability p

φ E (u) = up + (1 − p).
2.14 Generating Function 41

2. Binomial distribution Bn(n, p) with parameters n, p:

n
n
φ X (u) = ukp k (1 − p)n−k
k
k=0
n
n
= (up)k (1 − p)n−k = (up + (1 − p))n ,
k
k=0

where Newton’s binomial formula has been used.

3. Geometric distribution with parameter p:
∞

φ X (u) = u k p(1 − p)k−1
k=1
∞
up
= up [u(1 − p)]k−1 = ,
k=1
(1 − u(1 − p))

where the formula for the sum of geometric series has been used.
4. Poisson distribution with parameter λ:
∞
λk −λ
φ X (u) = uk e
k=0
k!
∞
(uλ)k
= e−λ = e−λ(1−u) .
k=0
k!

If X and Y are two stochastically independent random numbers with values in N, i.e.
P(X = i, Y = j) = P(X = i)P(Y = j) for all i, j ∈ I (X ) × I (Y ), then it is easy
to show that
φ X +Y (u) = φ X (u)φY (u).

Indeed:

φ X +Y (u) = P(u X +Y ) = P(u X u Y )

= u i u j P(X = i, Y = j)
i j

= u i u j P(X = i)P(Y = j)
i j
⎛ ⎞

= u i P(X = i) ⎝ u j P(Y = j)⎠
i j

= φ X (u)φY (u) .
42 2 Discrete Distributions

Of course if we have n stochastically independent random numbers we obtain sim-

ilarly: φ X 1 +···+X n (u) = φ X 1 (u) . . . φ X n (u). One can also consider the case of the
sum of a random number N of stochastically independent random numbers. Let
X 1 , X 2 , . . . be an infinite sequence of stochastically independent random numbers
with values in N. This means that if we take any finite number of them, they are
stochastically independent. We assume that X 1 , X 2 , . . . are identically distributed.
Let N be a random number with values in N, such that

N , X 1, X 2, . . .

are stochastically independent. Let S N be defined by

SN = X 1 + · · · + X N .

We now compute the generating function of S N :

φ SN (u) = P(u SN ) = P(u SN |N = k)P(N = k)
k∈I (N )

= P(u Sk )P(N = k)
k∈I (N )

= P(N = k)P(u X 1 +···+X k )
k∈I (N )

= P(N = k)φ X 1 (u) . . . φ X k (u)
k∈I (N )

= P(N = k)φ X 1 (u)k
k∈I (N )

= φ N φ X 1 (u) ,

where φ N is the generating function of X and we have used the fact that the random
numbers X i have the same distribution and hence the same generating function. See
e.g. [3] or [6] for a more complete treatment of generating functions.
Chapter 3
One-Dimensional Absolutely Continuous
Distributions

3.1 Introduction

For random numbers with discrete distribution, the distribution is completely spec-
ified by the probabilities of taking single values. If we want to introduce random
numbers that take values on intervals or on the whole line, then the specification
of the probabilities of taking single values is no longer sufficient to determine their
distributions. For example for a random number corresponding to a random choice
in an interval [a, b], the probabilities of taking single values must be clearly equal
to 0, but that in no way specifies the probability of taking value in a subinterval of
[a, b]. In the following we will see how it is possible to describe the distribution of
a random number in general.

3.2 Cumulative Distribution Function

Given a random number X , its cumulative distribution function (c.d.f) is defined by:

F(x) = P(X ≤ x), for x ∈ R.

The cumulative distribution function F(x) verifies the following properties:

1. 0 ≤ F(x) ≤ 1 since it is the probability of an event.
2. It is non-decreasing: for a < b we have F(b) − F(a) = P(a < X ≤ b) ≥ 0, so
that F(a) ≤ F(b).
We introduce now some further properties that are usually assumed to be verified
by cumulative distribution function. They can be thought of as regularity properties,
as they state that the probability of an event E is equal to the limit of the sequence
P(E n ), where E n is a monotonic sequence converging to E. In particular:

© Springer International Publishing Switzerland 2016 43

F. Biagini and M. Campanino, Elements of Probability and Statistics,
UNITEXT - La Matematica per il 3+2 98, DOI 10.1007/978-3-319-07254-8_3
44 3 One-Dimensional Absolutely Continuous Distributions

1. continuity from the right: F(x) = lim+ F(y);

y→x
2. limit from the left: lim− F(y) = P(X < x);
y→x
3. lim F(x) = 1;
x→ + ∞
4. lim F(x) = 0.
x→−∞

In all examples of p.d.f.’s these extra properties will be satisfied, even if it is

possible to consider cases where they do not hold true. It follows from 1 and 2 that

P(X = x0 ) = P((X ≤ x0 ) − (X < x0 )) = F(x0 ) − F(x0− )

where F(x0− ) denotes lim x→x0− F(x). This limit always exists as F(x) is bounded
non-decreasing.
Example 3.2.1 (Discrete case) In the case of a random number X with discrete
distribution I (X ) = {x1 , x2 , . . .} one has:

F(x) = P(X ≤ x) = P(X = xi ).
xi ≤x

The probability that a random number X takes value in an interval (a, b] can be
obtained from its c.d.f. F by:

P(a < X ≤ b) = P((X ≤ b) − (X ≤ a))

= P(X ≤ b) − P(X ≤ a)
= F(b) − F(a).

3.3 Absolutely Continuous Distributions

Let X be a random number. We say that X has absolutely continuous distribution

if there exists a function f : R → R+ with such that the c.d.f. F(x) of X can be
written as: x
F(x) = f (t)dt.
−∞

The function f is the called a probability density function (p.d.f.) of X . Note that
f is not unique. Indeed if the values of f are changed on a finite set of points, the
new function is still a density of X , as its integrals are the same. It follows from
fundamental theorem of calculus that if x is a continuity point of f , then

f (x) = F (x).
3.3 Absolutely Continuous Distributions 45

Since lim x→∞ F(x) = 1, then if f is a p.d.f. of X we have

∞ x
f (x)d x = lim f (y)dy = lim F(x) = 1.
−∞ x→∞ −∞ x→∞

If x is a continuity point of f , then f (x) ≥ 0. Indeed assume that f (x) < 0, then by
continuity there would be a neighborhood (a, b) of x where f is still strictly negative
but then b
F(b) = F(a) + f (x)d x < F(a),
a

so that F would not be non-decreasing. We have that for a < b

b a b
P(a < X ≤ b) = F(b) − F(a) = f (x)d x − f (x)d x = f (x)d x.
−∞ −∞ a

Let us now see how to compute the expectation of X from the p.d.f. f . We consider
the particular case when I (X ) is contained in some interval [a, b] and the p.d.f.
f is continuous (and zero outside [a, b]). We subdivide [a, b] into n intervals Ii ,
i = 1, . . . , n of length b−an
. It is not important that the extremes are included: we
assume that the intervals are closed on the r.h.s. and open on the l.h.s. except for I1
that is closed on both sides. We define two random numbers with discrete distribution
(n) (n) (n)
X− and X + : if X takes value in Ii , then X − is equal to the left endpoint of Ii ,
(n) (n) (n)
X + is equal to the right endpoint. Since X − and X + have discrete distribution
with a finite number of possible values, we can compute their expectations using the
formula (1.10). They are given by:

n−1
a+( j+1) b−a
(n) b−a n
P(X − ) = a+ j f (x)d x;
j=0
n b−a
a+ j n

n−1
a+( j+1) b−a
(n) b−a n
P(X + ) = a + ( j + 1) f (x)d x.
j=0
n b−a
a+ j n

(n) (n)
Since X − ≤ X ≤ X+ , then

(n) (n)
P(X − ) ≤ P(X ) ≤ P(X + ).

(n)
It is easy to see, using the continuity of f (x), that as n → ∞ both P(X − ) and
(n)
P(X + ) converge to
b
x f (x)dx = x f (x)dx,
a R
46 3 One-Dimensional Absolutely Continuous Distributions

that is hence the value of P(X ). Approximation arguments lead to extend this formula
to the case of a general X with absolutely continuous distribution with probability
density f (x) provided that
|x| f (x)dx < ∞, (3.1)
R

i.e. one assume that, when (3.1) holds true, the expectation of X in the absolutely
continuous case is given by:
+∞
P(X ) = x f (x)dx.
−∞

Analogously if ψ : R → R is a real function such that ψ(x) f (x) is integrable, we

are lead to assign to P(ψ(X )) the value
+∞
P(ψ(X )) = ψ(x) f (x)dx. (3.2)
−∞

It follows that the variance can be obtained by:

σ 2 (X ) = P(X 2 ) − P(X )2
+∞ +∞ 2
= x f (x) dx −
2
x f (x) dx ,
−∞ −∞

provided that the integrals exist. In the following sections we shall introduce some
of the most common one-dimensional absolutely continuous distributions.

3.4 Uniform Distribution in [0, 1]

A random number X has uniform distribution in [0, 1] if its c.d.f. is given by:
⎧
⎨ 0 x ≤ 0,
F(x) = x 0 < x < 1,
⎩
1 x ≥ 1.

It is a continuous distribution since

P(X = x) = F(x) − F(x − ) = 0

for every x ∈ R. Indeed it is easy to check that it is an absolutely continuous distri-

bution with p.d.f. f (x) given by:
3.4 Uniform Distribution in [0,1] 47
⎧
⎨ 0 x ≤ 0,
f (x) = 1 0 < x < 1,
⎩
0 x ≥ 1.

As in the following examples the values of the p.d.f. in discontinuity points can be
chosen in an arbitrary way. The expectation is given by
1
1
x2 1
P(X ) = x f (x) dx = x dx = = ,
R 0 2 0 2

and the variance by

1 3
1
1 x 1 1
σ 2 (X ) = x 2 dx − = − = .
0 4 3 0 4 12

3.5 Uniform Distribution on an Arbitrary Interval [a, b]

A random number X has uniform distribution in [a, b] if its c.d.f. is given by:
⎧
⎨0 x ≤ a,
F(x) = c(x − a) a < x < b,
⎩
1 x ≥ 1.

In order to compute the constant c, we impose the continuity in the point x = b and
get c(b − a) = 1, that is:
1
c= .
b−a

The expectation is given by:

b
b
x x2 a+b
P(X ) = dx = = ,
a b−a 2(b − a) a 2

and the variance by:

2
b
1 a+b
σ 2 (X ) = P((X − P(X ))2 = x− dx
a b−a 2
3 b
1 1 a+b
= x−
b−a 3 2
a
(b − a)2
= .
12
48 3 One-Dimensional Absolutely Continuous Distributions

3.6 Exponential Distribution

A random number X has exponential distribution with parameter λ if its c.d.f. is

given by:
1 − e−λx x ≥ 0,
F(x) =
0 x < 0.

If X is the time when a certain fact happens (for example when the atom of some
isotope decays), the exponential distribution has the property of absence of memory.
Given x, y ≥ 0 we have:

P(X > x + y | X > y) = P(X > x). (3.3)

i.e. the probability that the fact does not occur for an extra amount of time x, given
that has not occurred up to time y, is the same as the probability starting from the
initial time. We obtain (3.3) by using the formula of composite probability:
P(X > x + y, X > y)
P(X > x + y | X > y) =
P(X > y)
P(X > x + y)
=
P(X > y)
e−λ(x+y)
=
e−λy
−λx
= e
= P(X > x).

In the following we shall see that the exponential distribution can be obtained as
limit of suitably rescaled geometric distributions. Geometric distribution has also the
property of absence of memory for discrete times, as we have remarked in Sect. 2.4.
The expectation of exponential distribution with parameter λ is equal to
+∞ +∞
+∞ 1
P(X ) = λxe−λx dx = −xe−λx 0 + e−λx dx = .
0 0 λ

The variance is equal to

σ 2 (X ) = P(X 2 ) − P(X )2
+∞
1
= λx 2 e−λx dx − 2
0 λ
+∞
2 −λx +∞ 1
= −x e + 2 xe−λx dx − 2
0
0 λ
2 1
= 2− 2
λ λ
1
= 2.
λ
3.7 A Characterization of Exponential Distribution 49

3.7 A Characterization of Exponential Distribution

The exponential distribution can be characterized in terms of its hazard rate.

Given a non-negative random variable with absolutely continuous distribution
that describes the time of occurence of some fact, its hazard rate h(x) at time x is
defined by:
P(x < X < x + h|X > x)
h(x) = lim .
h→0 h

We can express h(x) in terms of the probability density. Let

x
F(x) = P(X ≤ x) = f (y)dy.
−∞

Then
P(x < X < x + h) f (x) d
lim = =− log(1 − F(x)).
h→0 hP(X > x) 1 − F(x) dx

For exponential distribution with parameter λ, it is easy to see that the hazard rate is
equal to λ for all x. Indeed:

f (x) λe−λx
h(x) = = −λx = λ.
1 − F(x) e

Exponential distribution can be characterized as the unique distribution with constant

hazard rate. To see that, we first show that c.d.f. can be obtained from the hazard
rate.
Since X is assumed to be non-negative and with absolutely continuous distribu-
tion, we have: F(0) = P(X ≤ 0) = 0. Using that

d
h(x) = − log(1 − F(x)),
dx
we have that for x ≥ 0
x
log(1 − F(x)) = − h(y)dy (3.4)
0
x
= 1 − F(x) = exp − h(y)dy (3.5)
0 x
= F(x) = 1 − exp − h(y)dy . (3.6)
0

If the hazard rate is constant equal to λ > 0, then

F(x) = 1 − e−λx , x > 0,

50 3 One-Dimensional Absolutely Continuous Distributions

since X is non-negative, F(x) = 0 for x < 0. Therefore X has exponential distribu-

tion with parameter λ.

3.8 Normal Distribution

A random number X has standard normal distribution N (0, 1) if its probability

density function is:
x2
n(x) = K e− 2 , x ∈ R.
x2
Although the indefinite integral of e− 2 cannot be expressed in terms of elementary
functions, it can still be computed over the whole line and so the constant K . We
have:
+∞ 2 +∞ +∞
2 x2 y2
− x2
e dx = e− 2 e− 2 dxdy
−∞
−∞
−∞
x 2 +y 2
= e− 2 dxdy
2π +∞
r2
= e− 2 r dr dθ
0 0
+∞
r2
= 2π e− 2 r dr
0

r 2 +∞
= 2π −e− 2
0
= 2π,

where a change to polar coordinates x = r cos θ, y = r sin θ has been used. The
Jacobian determinant of this change of variable is r (see Appendix H).
+∞ x 2 √
It follows that −∞ e− 2 = 2π and so

1
K =√ .
2π

The cumulative distribution function will be denoted by N (x):

x
N (x) := n(t) dt.
−∞

Since n is an even function and its integral over the whole line is equal to 1, we
have:
N (−x) = 1 − N (x).
3.8 Normal Distribution 51

Therefore in tables of N (x), only values for positive values of x are usually tabulated.
The expectation of standard normal distribution is
+∞
P(X ) = x n(x) dx = 0,
−∞

as it follows immediately since f (x) = − f (x), where f (x) = xn(x), x ∈ R. The

variance of standard normal distribution is obtained by integration by parts, using
the fact that n (x) = −xn(x):
+∞ +∞
σ 2 (X ) = P(X 2 ) = x 2 n(x) dx = [−xn(x)]+∞
−∞ + n(x) dx = 1.
−∞ −∞

We introduce now the general normal distribution which has two parameters m, σ 2
and will be denoted by N (m, σ 2 ). We start with X ∼ N (0, 1) and consider Y =
m + σ X , where σ > 0. Then Y has normal distribution N (m, σ 2 ). The c.d.f. of Y is
given by:

FY (y) = P(Y ≤ y)
= P(m + σ X ≤ y)

y−m
= P X≤
σ

y−m
=N .
σ

The probability density function of Y is obtained by chain rule for the derivative of
a composite function:

d y−m 1 y−m 1 (y−m)2
f Y (y) = N = n = √ e− 2σ 2 .
dy σ σ σ σ 2π

The expectation and the variance of Y are obtained as follows:

P(Y ) = P(σ X + m) = σP(X ) + m = m

σ 2 (Y ) = σ 2 (σ X + m) = σ 2 σ 2 (X ) = σ 2 .

3.9 Normal Tail Estimate

As we have said, there is no formula in terms of elementary functions for N (x) and
therefore for the probability that a random number X ∼ N (0, 1) is greater than some
x > 0. It is however possible to give asymptotic estimates for this probability as x
tends to infinity.
52 3 One-Dimensional Absolutely Continuous Distributions

Proposition 3.9.1 Let X be a random number with standard normal distribution.

For every x > 0, we have:

n(x) n(x) n(x)

− 3 < P(X ≥ x) < ,
x x x
1 x2
where n(x) := √ e− 2 .
2π
The upper bound is obtained by integration by parts:
+∞ +∞
n(t)
P(X ≥ x) = n(t) dt = t dt
x x t

+∞
n(t) +∞ n(t) n(x)
= − − 2
dt < .
t x x t
x

n(x) >0
x

A second integration by parts gives the lower bound:

+∞
n(x) n(t)
P(X ≥ x) = − t 3 dt
x x t

+∞ +∞
n(x) n(t) 3n(t) n(x) n(x)
= − − 3 + 4
dt > − 3 .
x t x t
x x
x

n(x) >0
x3

3.10 Gamma Distribution

Let α and λ be strictly positive real numbers. The random number X is said to have
gamma distribution Γ (α, λ) if its probability density function is given by

K x α−1 e−λx x > 0,
gα,λ (x) =
0 x ≤ 0.

Note that exponential distribution is a particular case of gamma distribution corre-

sponding to the choice α = 1.
The normalizing constant K can be expressed in terms of Euler’s gamma function
Γ (α): +∞
Γ (α) = x α−1 e−x dx
0
3.10 Gamma Distribution 53

for α > 0. The function Γ satisfies the recursive property:

1. Γ (α + 1) = αΓ (α), since
+∞
Γ (α + 1) = x α e−x dx
0
+∞
+∞
= −x α e−x 0 + αx α−1 e−x dx
0
= α Γ (α).

2. It follows by iteration that for integer α > 0

Γ (α) = (α − 1)!
+∞
since Γ (1) = 0 e−x dx = 1.
Now for the p.d.f. gα,λ we have
+∞ +∞ +∞
K K
1= gα,λ (x) dx = K x α−1 e−λx dx = α y α−1 e−y dy = α Γ (α).
−∞ 0 λ 0 λ

Hence
λα
K = .
Γ (α)

The expectation and the variance of the gamma distribution can be computed
using the recurrence property of gamma function:
+∞
P(X ) = xgα,λ (x) dx
−∞
α +∞
λ
= x α e−λx dx
Γ (α) 0
λα Γ (α + 1)
=
Γ (α) λα+1
α
= .
λ
It follows that:

α(α + 1) α2 α2
σ 2 (X ) = P(X 2 ) − P(X )2 = − 2 = 2.
λ 2 λ λ
54 3 One-Dimensional Absolutely Continuous Distributions

3.11 χ2 -Distribution

From the normal distribution we can derive another distribution of wide use in sta-
tistics, the χ2 -distribution. In this section we introduce the χ2 -distribution with
parameter ν = 1. In Chap. 4 we shall consider general χ2 -distributions with parame-
ter ν ∈ N \ {0}.
Let X be a random number with standard normal distribution N (0, 1) and let
Y = X 2 . We first consider the c.d.f. of Y . If y < 0

FY (y) = P(Y ≤ y) = 0

since Y is non-negative. If y ≥ 0, then

FY (y) = P(Y ≤ y) = P(X 2 ≤ y)

√ √
= P(− y ≤ X ≤ y)
√ √
= N ( y) − N (− y)
√ √
= N ( y) − (1 − N ( y))
√
= 2N ( y) − 1.

The c.d.f. of Y is therefore

0 for y < 0,
FY (y) = √
2N ( y) − 1 for y ≥ 0.

Let us compute the p.d.f. f Y of Y (for y > 0):

1
f Y (y) = FY (y) = 2n(y) √
y
1 1 y 1
= √ √ e− 2 = √ y 2 −1 e− 2 y ,
1 1

2π y 2π

where the derivative has been computed by using chain rule for the derivative of
composite functions. The density f Y (y) is of course zero for negative y. It follows
that Y has distribution Γ ( 21 , 21 ). Moreover by comparing the normalizing constants,
we get
21
1 1 1 1
√ √ = ,
2 π 2 Γ 21

so that
1 √
Γ = π.
2
3.11 χ2 -Distribution 55

By using the recurrence formula Γ (α + 1) = αΓ (α), we have:

√
2k + 1 (2k − 1)(2k − 3) · · · 1 π
Γ =
2 2k 2

for k = 1, 2, . . ..

3.12 Cauchy Distribution

We now consider a distribution for which the expectation defined in Sect. 3.3 does not
exist. This is the Cauchy distribution. This is the distribution of a random number
Y = tan Θ, where the random number Θ has uniform distribution in the interval
π π
[− , ]. We have for y ∈ R that:
2 2

FY (y) = P(Y ≤ y) = P(tan Θ ≤ y)

= P(Θ ≤ arctan y).

The p.d.f. of Y f Y is obtained by deriving FY :

1
f Y (y) = .
π 1 + y2

The formula for the expectation of Y gives an integral

y
dy
π 1 + y2

which is undefined, as the integrand behaves like 1

y
for y → ∞.

3.13 Mixed Cumulative Distribution Functions

In addition to discrete and absolutely continuous c.d.f.’s, there are continuous but
not absolutely continuous c.d.f.’s. These will be not considered in this elementary
book. Here we briefly speak about mixed c.d.f.’s that are convex linear combinations
of discrete and absolutely continuous c.d.f.’s.
For 0 < p < 1, let F1 be a discrete c.d.f. and F2 be an absolutely continuous c.d.f.
Then we can consider a c.d.f. F(x):

F(x) = p F1 (x) + (1 − p)F2 (x),

56 3 One-Dimensional Absolutely Continuous Distributions

which is neither of discrete nor of absolutely continuous type. F(x) is said to be

a mixed c.d.f. If X is a random number with c.d.f. F(x), it is easy to see that the
expectation of a function φ(X ) is given by

P(φ(X )) = pP(φ(X 1 )) + (1 − p)P(φ(X 2 )),

where X 1 and X 2 are random numbers with c.d.f. F1 and F2 respectively, provided
that the terms on the right-hand side both make sense. The first term is expressed by
a sum or a series, while the second by an integral.
An example of random number with mixed c.d.f is the time T of function of some
device, for example a lamp, when there is a positive probability p that the device
does not work already at the initial time and otherwise the distribution is absolutely
continuous, for example exponential with parameter λ. The c.d.f of T is then given
by:
0 for t < 0,
FT (t) = −λt
p + (1 − p)(1 − e ) for t ≥ 0.

1− p
It is easy to check that P(T ) = .
λ
Chapter 4
Multi-dimensional Absolutely Continuous
Distributions

4.1 Bidimensional Distributions

Let X, Y be two random numbers that we can consider as a random vector (X, Y ).
The joint cumulative distribution function ( j.c.d.f.) is defined as:

F(x, y) = P(X ≤ x, Y ≤ y).

Then F is a map from R2 to [0, 1]:

F : R2 −→ [0, 1].

The probability that (X, Y ) belong to the rectangle (a1 , b1 ] × (a2 , b2 ] is given by:

P(a1 < X ≤ b1 , a2 < Y ≤ b2 ) = P [((X ≤ b1 ) − (X ≤ a1 )) ((Y ≤ b2 ) − (Y ≤ a2 ))]

= P(X ≤ b1 , Y ≤ b2 ) − P(X ≤ a1 , Y ≤ b2 )
− P(X ≤ b1 , Y ≤ a2 ) + P(X ≤ a1 , Y ≤ a2 )
= F(b1 , b2 ) − F(a1 , b2 ) − F(b1 , a2 ) + F(a1 , a2 ).
(4.1)

We shall always assume that the following continuity properties are verified:
1. lim F(x, y) = 1;
x→+∞
y→+∞
2. lim F(x, y) = lim F(x, y) = 0;
x→−∞ y→−∞
3. lim+ F(x, y) = F(x0 , y0 );
x→x0
y→y0+
4. P(X = x0 , Y = y0 ) = F(x0 , y0 ) − F(x0− , y0 ) − F(x0 , y0− ) + F(x0− , y0− ),

© Springer International Publishing Switzerland 2016 57

F. Biagini and M. Campanino, Elements of Probability and Statistics,
UNITEXT - La Matematica per il 3+2 98, DOI 10.1007/978-3-319-07254-8_4
58 4 Multi-dimensional Absolutely Continuous Distributions

where F(x0− , y0 ) := lim x→x0− F(x, y0 ), F(x0 , y0− ) := lim y→y0− F(x0 , y) and F(x0− ,
y0− ) := lim− F(x, y).
x→x0
y→y0−

Other analogous properties will also be assumed. We shall quote them when they
will be needed.

4.2 Marginal Cumulative Distribution Functions

Given two random numbers X, Y with j.c.d.f. F(x, y), the c.d.f.’s F1 , F2 of X and
Y are called marginal cumulative distribution functions (m.c.d.f.’s).
The m.c.d.f. of X is obtained from the j.c.d.f. by taking the limit:

F1 (x) = P(X 1 ≤ x) = lim F(x, y),

y→+∞

as follows by usual continuity hypothesis. Similarly the m.c.d.f. of Y is obtained by:

F2 (y) = P(Y ≤ y) = lim F(x, y).

x→+∞

Two numbers are said to be stochastically independent if:

F(x, y) = F1 (x)F2 (y)

for every (x, y) ∈ R2 .

4.3 Absolutely Continuous Joint Distributions

Two random numbers X, Y or equivalently the random vector (X, Y ) has an

absolutely continuous distribution if there exists a function f

f : R2 −→ R

such that the j.c.d.f. F of X, Y can be expressed as:

x y
f (X, Y ) = f (s, t) dsdt.
−∞ −∞

Such function f is called joint probability density (j.p.d.). Applying formula (4.1)
for the probability that (X, Y ) belong to a rectangle (a, b] × (c, d], we get:
4.3 Absolutely Continuous Joint Distributions 59

P(a < X ≤ b, c < Y ≤ d) = F(b, d) − F(a, d) − F(c, b) + F(a, c)

b d a d
= f (s, t) dsdt − f (s, t) dsdt
∞ ∞ ∞ ∞
b c a c
− f (s, t) dsdt + f (s, t) dsdt
∞ ∞ ∞ ∞
b d
= f (s, t) dsdt.
a c

By usual limiting procedure one gets that the probability that a random vector
(X, Y ) belongs to a sufficiently regular region A of R2 is given by the integral of the
j.p.d.f. over A, i.e.
P((X, Y ) ∈ A) = f (s, t) dsdt.
A

Moreover, if ψ : R2 → R is a sufficiently regular function such that the function ψ f

is integrable, then, as in the one-dimensional case, we have that for Z = ψ(X, Y )

P(Z ) = ψ(s, t) f (s, t) dsdt. (4.2)
R2

For example, if Z = X Y we get

P(X Y ) = st f (s, t) dsdt,
R2

if the integrand function st f (s, t) is integrable. In order to derive probability densities

of X and Y , that are called marginal probability densities, we start by deriving their
c.d.f.’s: +∞ x
FX (x) = P(X ≤ x) = f (s, t) dsdt.
−∞ −∞

It follows that the marginal probability density of X is given by:

+∞
f X (x) = f (x, t) dt.
−∞

Analogously f Y , the marginal probability density of Y , is given by

+∞
f Y (y) = f (s, y) ds.
−∞

It is easy to check out that, if f (X, Y ) can be expressed as a product of two functions,

f (x, y) = u(x)v(y),
60 4 Multi-dimensional Absolutely Continuous Distributions

then X and Y are stochastically independent and their marginal probability densities
are proportional to u(x) and v(y). Conversely if X and Y are stochastically indepen-
dent and their joint distribution is absolutely continuous, then their joint probability
density can be expressed as the product of their marginal probability densities:

f (x, y) = f X (x) f Y (y). (4.3)

As in the case of discrete distributions, it follows from (4.3) that if X, Y are

stochastically independent and φ1 , φ2 are real functions such that φ1 f X and φ2 f Y are
integrable, then by Fubini’s theorem we obtain:

P(φ1 (X )φ2 (Y )) = P(φ1 (X ))P(φ2 (Y )).

4.4 The Density of Z = X + Y

Let X and Y be two random numbers with joint probability density f (x, y). We want
to determine the density of
Z = X + Y.

First we compute the c.d.f. of Z :

+∞ z−x
FZ (z) = P(Z ≤ z) = P(X + Y ≤ z) = f (x, y) dydx
−∞ −∞
+∞ z z +∞
= f (x, t − x) dtdx = f (x, t − x) dxdt,
−∞ −∞ −∞ −∞

where we have made the change of variable t = x + y for fixed x, that allows then
to exchange the order of integration in the final equality. It follows from the last
expression that
z
f Z (z) = f z (t) dt
−∞

with +∞
f Z (z) = f (x, z − x) dx,
−∞

i.e. f Z is the density of Z . In particular when X and Y are stochastically independent

and f (x, y) = f X (x) f Y (y), then
+∞
f Z (z) = f X (x) f Y (z − x) dx.
−∞
4.4 The Density of Z = X + Y 61

Hence f Z is obtained by the convolution of f X and f Y and is denoted by f X ∗ f Y . An

example of application of this formula is the sum of two stochastically independent
gamma distributed random numbers with parameters respectively α, λ and β, λ.
Using the previous formula we obtain the probability density of Z = X + Y :
+∞
f Z (z) = f X (x) f Y (z − x) dx
−∞
+∞
λα α−1 −λx λβ
= x e I{x > 0} (z − x)β−1 e−λ(z−x) I{(z−x) > 0} dx,
−∞ Γ (α) Γ (β)

where I A denotes the indicator function of the set A. The integral can be written as

λα+β z
e−λz x α−1 (z − x)β−1 dx
Γ (α)Γ (β) 0

if z > 0 and it is equal to 0 if z ≤ 0. For z > 0 we make the change of variable

dx = zdt and obtain
z
λα+β
f Z (z) = e−λz x α−1 (z − x)β−1 dx
Γ (α)Γ (β) 0
1
λα+β
= e−λz (zt)α−1 (z − zt)β−1 z dt
Γ (α)Γ (β) 0
1
λα+β α+β−1 −λz
= z e t α−1 (1 − t)β−1 dt
Γ (α)Γ (β) 0
1
λα+β
= t α−1 (1 − t)β−1 dt z α+β−1 e−λz
Γ (α)Γ (β) 0
= K z α+β−1 e−λz ,

with
λα+β 1
K = t α−1 (1 − t)β−1 dt. (4.4)
Γ (α) Γ (β) 0

It follows that Z has distribution Γ (α + β, λ).

Remark 4.4.1 Since the constant K must be equal to the normalizing constant of the
distribution Γ (a + b, λ), by (4.4) we obtain

λα+β 1
λα+β
K = t α−1 (1 − t)β−1 dt = ,
Γ (α + β) 0 Γ (α + β)

so that 1
Γ (α) Γ (β)
t α−1 (1 − t)β−1 dt = .
0 Γ (α + β)
62 4 Multi-dimensional Absolutely Continuous Distributions

4.5 Beta Distribution B(α, β)

Let α > 0 and β > 0. A random number X is said to have beta distribution B(α, β)
if its density f (x) is given by
⎧
⎨ K x α−1 (1 − x)β−1 x ∈ [0, 1],
f (x) =
⎩
0 otherwise.

It follows from the computation at the end of the previous section that

1 Γ (α + β)
K = 1 = . (4.5)
x α−1 (1 − x)β−1 dx Γ (α) Γ (β)
0

The expectation can be obtained from the recursion property of Euler’s gamma
function. If X has B(α, β) distribution, then

Γ (α + β) 1
P(X ) = x f (x) dx.
Γ (α) Γ (β) 0

The value of the integral is obtained by (4.5) by replacing α with α + 1 so that

Γ (α + β) Γ (α + 1)Γ (β) αΓ (α) Γ (α + β) α

P(X ) = = = .
Γ (α) Γ (β) Γ (α + β + 1) Γ (α) (α + β)Γ (α + β) α+β

Similarly we can compute P(X 2 ):

1
Γ (α + β)
P(X 2 ) = x α+1 (1 − x)β−1 dx
Γ (α) Γ (β) 0
Γ (α + β) Γ (α + 2) Γ (β)
= ,
Γ (α) Γ (β) Γ (α + β + 2)

where the integral is obtained by replacing α with α + 2 in formula (4.4). By using

the recursion property of the Gamma function we get

Γ (α + 2) = (α + 1)α Γ (α)
Γ (α + β + 2) = (α + β + 1) (α + β) Γ (α + β)

so that
α(α + 1)
P(X 2 ) =
(α + β) (α + β + 1)
4.5 Beta Distribution B(α, β) 63

and

σ 2 (X ) = P(X 2 ) − P(X )2
(α + 1) α α2 αβ
= − = .
(α + β + 1) (α + β) (α + β) 2 (α + β) (α + β + 1)
2

4.6 Student Distribution

We now introduce the Student distribution of parameter ν. Let Z and U be sto-

chastically independent random numbers. Weassume that Z has standard normal
ν 1
distribution and U has gamma distribution Γ , where ν ∈ N. The latter dis-
2 2
tribution is called χ2 -distribution with ν degrees of freedom and plays an important
− 21
U
role in statistics. Let T = Z . In order to obtain the probability density of
ν
T , we first derive its c.d.f.
∞ √u
U ν
FT (t) = P(T ≤ t) = P(Z ≤ t )= f (z, u)dzdu,
ν 0 −∞

where
1 z2 ν

ν e− 2 u 2 −1 e− 2 .
u
f (z, u) = √ ν
2 2πΓ 2
2

By taking the derivative of FT (t) with respect to t, it follows from the fundamental
calculus theorem that for density of the Student distribution is given for t > 0 by
∞

u u
f T (t) = FT (t) = f (t , u) du
ν ν
∞
0

1 ν+1 t2
u 2 −1 e− 2 (1+ ν ) du
u
= √ ν
ν
2 2πνΓ 2 0
2

Γ ν+1 t 2 ν+1
= √ 2

ν (1 + )− 2 ,
πνΓ 2 ν

where the integral has been computed by using the formula for the normalizing
constant of the gamma distribution. Note that for ν = 1 the Student distribution
coincides with the Cauchy distribution. Since
+∞
|t|
t 2 ν+1
dt
−∞ (1 + ν
) 2
64 4 Multi-dimensional Absolutely Continuous Distributions

must be finite for the existence of P(T ), we have that P(T ) exists and is finite if and
only if ν > 1. We have that

+∞
Γ ν+1 t 2 ν+1
P(T ) = √ 2

ν t (1 + )− 2 dt = 0,
πνΓ 2 −∞ ν

since the integrand is an odd function.To compute the variance, we calculate

2
νZ 1 1
σ(T ) = P(T 2 ) = P = νP(Z 2 )P = νP
U U U
∞ ∞
ν 1 ν −1 − u ν ν−2
u 2 −1 e− 2 du
u
= ν
ν u 2 e 2 du = ν
ν
22 Γ 2 0 u 22 Γ 2 0

ν ν−2 ν−2 ν
= ν
ν 2 2 Γ = .
2 Γ 2
2 2 ν − 2

Hence the variance exists finitely if ν > 2.

4.7 Multi-dimensional Distributions

Let (X 1 , X 2 , . . . , X n ) be an n-dimensional random vector. The function

F : Rn −→ [0, 1]

defined by:

F(x1 , x2 , . . . , xn ) = P(X 1 ≤ x1 , X 2 ≤ x2 , . . . , X n ≤ xn )

is called joint cumulative distribution function (j.c.d.f.) of (X 1 , X 2 , . . . , X n ). In the

following we shall always assume that the following continuity properties are satis-
fied by j.c.d.f.’s:
1. lim F(x1 , x2 , . . . , xn ) = 1;
x1 ,...,xn →+∞
2. lim F(x1 , x2 , . . . , xn ) = 0
xi →−∞
3. If {i 1 , . . . , i k } ⊂ {1, 2, . . . , n} and { j1 , . . . , jn−k } = {1, 2, . . . , n}\{i 1 , . . . , i k }
then lim F(x1 , . . . , xn ) = P(X i1 ≤ xi1 ,...,X ik ≤ xik ).
x j1 ,...,x jn−k →+∞

Here Fi1 ,...,ik (xi1 , . . . , xik ) := P(X i1 ≤ xi1 , . . . , X ik ≤ xik ), xi1 , . . . , xik ∈ R, is
called the marginal cumulative distribution function (m.c.d.f.) of X i1 , . . . , X ik . As in
the two-dimensional case the probability that X 1 , . . . , X n belongs to some intervals
(a1 , b1 ], . . . , (an , bn ] can be computed using the j.c.d.f. Precisely:

P(a1 < X 1 ≤ b1 , . . . , an < X n ≤ bn ) = (−1)
(c) F(c1 , . . . , cn )
c
4.7 Multi-dimensional Distributions 65

with c = (c1 , . . . , cn ), where ci can be ai or bi , and

(c) is equal to the number of
i’s such that ci = ai . The proof of this formula is completely analogous to the one
for (4.1) in the two-dimensional case.
The random numbers X 1 , . . . , X n are said to be stochastically independent if

F(x1 , . . . , xn ) = F1 (x1 ) . . . Fn (xn ),

where Fi is the m.c.d.f. of X i for i = 1, . . . , n. If X 1 , . . . , X n are stochastically

independent, then

P(a1 < X 1 ≤ b1 , . . . , an < X 1 ≤ bn ) = (−1)
(c) F1 (c1 ) . . . Fn (cn )
c
= Πin= 1 (Fi (bi ) − Fi (ai ))
= Πin= 1 P(ai < X i ≤ bi ).

4.8 Absolutely Continuous Multi-dimensional

Distributions

The random vector (X 1 , . . . , X n ) has an absolutely continuous distribution if there

exists a function
f : Rn −→ R

such that the j.c.d.f. F of (X 1 , X 2 , . . . , X n ) is given by:

x1 x2 xn
F(x1 , . . . , xn ) = ··· f (t1 , t2 , . . . , tn ) dt1 dt2 . . . dtn .
−∞ −∞ −∞

It follows from Property 1 of Sect. 4.7 that

+∞ +∞
··· f (t1 , . . . , tn ) dt1 . . . dtn = 1.
−∞ −∞

Moreover it can be shown that one can always choose a non-negative f . The function
f is called joint probability density ( j.p.d.) of (X 1 , . . . , X n ). What we have said
about two-dimensional joint probability density generalizes in a natural way to the
n-dimensional case.
If A is a sufficiently regular region A ⊂ Rn then

P((X 1 , . . . , X n ) ∈ A) = ··· f (t1 , . . . , tn )dt1 · · · dtn .
A
66 4 Multi-dimensional Absolutely Continuous Distributions

If ψ is a function ψ : Rn −→ R such that ψ f is integrable, then

+∞ +∞
P(ψ(X 1 , . . . , X n )) = ··· ψ(t1 , . . . , tn )dt1 . . . dtn .
−∞ −∞

If f (t1 , . . . , tn ) = g1 (t1 ) · · · gn (tn ), then X 1 , . . . , X n are stochastically independent

and the marginal density of X i can be taken proportional to gi for i = 1, . . . , n.
Conversely if X 1 , . . . , X n are stochastically independent with absolutely continuous
distribution, the j.p.d. of X 1 , . . . , X n can be taken as f (t1 , . . . , tn ) = f 1 (t1 ) . . . f n (tn ),
where f 1 , . . . , f n are marginal probability density functions of X 1 , X 2 , . . . , X n .

4.9 Multi-dimensional Gaussian Distribution

A random vector (X 1 , X 2 , . . . , X n ) has n-dimensional Gaussian distribution if its

density has the form:

f (x1 , x2 , . . . , xn ) = K e− 2 Ax·x+b·x
1

where x = (x1 , x2 , . . . , xn )t ∈ Rn , b = (b1 , b2 , . . . , bn )t ∈ Rn and A ∈ Rn × n is a

symmetric positive definite matrix.1 The symbol At denotes the transpose matrix of
A, with elements
[At ]i, j = [A] j,i .

We remind that b · x is the scalar product of b and x, given by

n
b·x = bi xi
i=1

and that Ax is the vector with elements

[Ax]i = ai j x j .
j

Let ai, j denote [A]i, j . The expression Ax · x is a quadratic form

ai j xi x j .
i, j

1 Recallthat a matrix A ∈ Rn × n is
• symmetric if At = A, i.e. ai j = a ji ,
• positive definite if Ax · x > 0 for all x
= 0, x ∈ R.
4.9 Multi-dimensional Gaussian Distribution 67

If we have a quadratic form

Bx · x = bi j xi x j
i, j

we can always replace the matrix B with a symmetric matrix A such that

Ax · x = ai j xi x j = Bx · x,
i, j

where ai j is defined by
⎧
⎨ bii for i = j,
ai j =
⎩
(bi j + b ji )/2 for i
= j.

We consider first the simplest case:

Case 1: A diagonal and b = 0

Let ⎛ ⎞
λ1 0 · · · 0
⎜ . ⎟
⎜ 0 λ2 . . . .. ⎟
A=⎜
⎜ . . .
⎟
⎟
⎝ .. . . . . 0 ⎠
0 · · · 0 λn

and b = 0. We obtain2

x12 x22 xn2
f (x1 , x2 , . . . , xn ) = K exp − λ1 + λ2 + · · · + λn .
2 2 2

By computing the marginal densities, it is easy to get

f (x1 , x2 , . . . , xn ) = f X 1 (x1 ) f X 2 (x2 ) · · · f X n (xn )

where
λi λi xi2
f X i (xi ) = exp −
2π 2

2 Here the notation exp (x) is introduced to denote the exponential function e x .
68 4 Multi-dimensional Absolutely Continuous Distributions

is the marginal density of X i . It follows that

1. X 1 , . . . , X n are stochastically
independent;

1
2. X i has gaussian density N 0, ;
λi
3. the normalizing constant is given by:

λ1 λ2 λn det A
K = ··· = .
2π 2π 2π (2π)n

The expectation vector is given by

(P(X 1 ), . . . , P(X n )) = (0, . . . , 0)

and the covariance matrix is:

⎛ ⎞
σ 2 (X 1 ) cov(X 1 , X 2 ) ··· cov(X 1 , X n )
⎜ ⎟
⎜ ⎟
⎜ .. .. ⎟
⎜ cov(X 2 , X 1 ) σ 2 (X 2 ) . . ⎟
⎜ ⎟
C=⎜ ⎜ ⎟
⎟
⎜ . . . ⎟
⎜ .. .. .. cov(X n−1 , X n ) ⎟
⎜ ⎟
⎝ ⎠
cov(X n , X 1 ) ··· cov(X n , X n−1 ) σ (X n )
2

⎛ ⎞
1
λ1
0 ··· 0
⎜ 1 ...
.. ⎟
⎜ 0 . ⎟
=⎜
⎜ ..
λ2 ⎟
⎟
⎝ .. ..
. . . 0⎠
0 ··· 0 1
λn

= A−1 .

Case 2: Computation of the expectation vector in the general case

Let now A be symmetric and positive definite and b
= 0. By making a translation
we can reduce the density to the case b = 0. Let U = X − c with c ∈ R. The j.c.d.f.
of the random vector U can be expressed in terms of that of X :

FU (u) = P(U ≤ u) = P(X − c ≤ u) = P(X ≤ u + c) = FX (u + c).

4.9 Multi-dimensional Gaussian Distribution 69

It follows that the joint probability density can be similarly obtained from that
of X :

fU (u 1 , u 2 , . . . , u n ) = f X (u 1 + c1 , u 2 + c2 , . . . , u n + cn )

1
= K exp − A(u + c) · (u + c) + b · (u + c)
2

1 1 1 1
= K exp − Au · u − Au · c − Ac · u − Ac · c + b · u + b · c
2 2 2 2

1 1
= K exp − Ac · c + b · c exp − Au · u + (b − Ac) · u ,
2 2

constant

where we have used the fact that

Ac · u = Au · c,

since A is symmetric. In order to reduce the density to the case b = 0, we must

choose c so that the first degree part cancels, i.e.:

b − Ac = 0.

We choose therefore
c = A−1 b.

Note that A is invertible since it is positive definite. For this choice of c the density
fU (u 1 , u 2 , . . . , u n ) is given by:

fU (u 1 , u 2 , . . . , u n ) = f X (u 1 + c1 , u 2 + c2 , . . . , u n + cn )

−1 A(A−1 b) · A−1 b 1
= K exp A b · b − exp − Au · u
2 2

1 −1 1
= K exp A b · b exp − Au · u
2 2

K

1
= K exp − Au · u .
2

It is easy to see that P(Ui ) = 0 for i = 1, 2, . . . , n, since the density of −U and U

are the same. Using previous results, we obtain that

P(X i ) = P(Ui + ci ) = P(Ui ) + ci = ci = (A−1 b)i ,

i.e. in vectorial notation:

P(X ) = A−1 b ;
70 4 Multi-dimensional Absolutely Continuous Distributions

where the expectation of a random vector is defined as the vector of the expectations
of its components. The normalizing constant is

1 −1

K = K exp A b·b ,
2

where K is the normalizing constant for the case with b = 0. The covariance matrix
of X is equal to one of U , as a translation leaves variances and covariances unchanged:

⎛ ⎞
σ 2 (X 1 ) cov(X 1 , X 2 ) ··· cov(X 1 , X n )
⎜ ⎟
⎜ ⎟
⎜ .. .. ⎟
⎜ cov(X 2 , X 1 ) σ 2 (X 2 ) . . ⎟
⎜ ⎟
C=⎜
⎜
⎟
⎟
⎜ .. .. .. ⎟
⎜ . . . cov(X n−1 , X n ) ⎟
⎜ ⎟
⎝ ⎠
cov(X n , X 1 ) ··· cov(X n , X n−1 ) σ (X n )
2

⎛ ⎞
σ 2 (U1 ) cov(U1 , U2 ) ··· cov(U1 , Un )
⎜ ⎟
⎜ ⎟
⎜ .. .. ⎟
⎜ cov(U2 , U1 ) σ 2 (U2 ) . . ⎟
⎜ ⎟
=⎜
⎜
⎟.
⎟
⎜ .. .. .. ⎟
⎜ . . . cov(Un−1 , Un ) ⎟
⎜ ⎟
⎝ ⎠
cov(Un , U1 ) ··· cov(Un , Un−1 ) σ 2 (Un )

Case 3: computation of covariance matrix and normalization constant in the

general case
As it is shown we can reduce to the case b = 0 by making a translation. Since A
is symmetric, there exists an orthogonal matrix O, i.e. such that O t AO = D, where
D is diagonal.
If U is the random vector U = O −1 X , its density is given by

1
f (u 1 , . . . , u n ) = K exp − A Ou · Ou
2

1 t
= K exp − O AOu · u
2

1
= K exp − Du · u .
2
4.9 Multi-dimensional Gaussian Distribution 71

Now for U we are in the situation of a diagonal matrix that we have already consid-
ered. The covariance matrix of X , is given by:

C = P(X X t ) = P(OU (OU )t )

= O P(U U t ) O t = O D −1 O t = A−1 .

Here the expectation of a random matrix denotes a matrix whose entries are the
expectations of the corresponding entries. We have used the easily verifiable fact
that if Z is a random matrix and A, B are constant matrices, such that the product
AZ B is defined, then P(AZ B) = AP(Z )B.
We have found that in the general case
1. the normalization constant is

det A − 1 A−1 b·b
K = e 2 ;
(2π)n

2. the expectation is
P(X ) = A−1 b ;

3. the covariance matrix is

C = A−1 .

Remark 4.9.1 It is easy to check that the marginal distribution of the X i ’s and of
subsets of the X i ’s are gaussian. In particular, if cov(X i , X j ) = 0 for some i, j i
= j,
then the covariance matrix of (X i , X j ) is diagonal, so that X i and X j are stochastically
independent as it is shown in the next remark.
Remark 4.9.2 When n = 2, the covariance matrix is given by:
⎛ ⎞
σ12 ρ σ1 σ2
C =⎝ ⎠
ρ σ1 σ2 σ22

where σ12 = σ 2 (X 1 ), σ22 = σ 2 (X 2 ) and ρ = ρ(X 1 , X 2 ). The matrix A can be

obtained as:
⎛ ⎞
σ22 −ρ σ1 σ2
1 ⎝ ⎠
A = C −1 =
det C −ρ σ σ σ12
1 2
⎛ ⎞
σ22 −ρ σ1 σ2
1 ⎝ ⎠
= 2 2
σ1 σ2 − ρ2 σ12 σ22 −ρ σ σ σ2
1 2 1
72 4 Multi-dimensional Absolutely Continuous Distributions
⎛ ⎞
1 ρ
⎜ −
1 ⎜ σ12 σ1 σ2 ⎟
⎟
= ⎜ ⎟.
1 − ρ2 ⎜ ⎟
⎝ ρ 1 ⎠
−
σ1 σ2 σ22

The density of two-dimensional gaussian distribution with parameters m 1 = P(X 1 ),

m 2 = P(X 2 ), σ12 = σ 2 (X 1 ), σ22 = σ 2 (X 2 ), ρ = ρ(X 1 , X 2 ) is therefore given by:

1
f (x, y) = ·
2πσ1 σ2 1 − ρ2

1 (x − m 1 )2 ρ(x − m 1 )(y − m 2 ) (y − m 2 )2
exp − −2 + .
2(1 − ρ2 ) σ12 σ1 σ2 σ22
Chapter 5
Convergence of Distributions

5.1 Convergence of Cumulative Distribution Functions

It is natural to introduce a notion of convergence for sequences of cumulative dis-

tribution functions, i.e. to give a meaning to the expression Fn → F. One possible
meaning could be pointwise convergence, i.e.: Fn (x) → F(x) for every x ∈ R. How-
ever this notion of convergence turns out to be too restrictive. For example consider
the sequence Fn (x) defined as:

1 for x ≥ n1 ,
Fn (x) =
0 for x < 0.

If the random number X n has c.d.f. Fn , then P(X n = 1

n
) = 1. For a reasonable
convergence notion we should have Fn → F, where

1 for x ≥ 0,
F(x) =
0 for x < 0.

However it is not true in this case that Fn (x) → F(x) for every x ∈ R. Indeed
Fn (0) = 0 for every n, whereas F(0) = 1. Therefore it is natural to introduce a
weaker definition of convergence.

Definition 5.1.1 We say that Fn → F for every x if for every > 0 there exists N
such that for n ≥ N

F(x − ) − < Fn (x) < F(x + ) + .

If x is a continuity point of F, this definition implies that

lim Fn (x) = F(x).

n→∞

© Springer International Publishing Switzerland 2016 73

F. Biagini and M. Campanino, Elements of Probability and Statistics,
UNITEXT - La Matematica per il 3+2 98, DOI 10.1007/978-3-319-07254-8_5
74 5 Convergence of Distributions

Conversely if for every continuity point x of F lim Fn (x) = F(x), then Fn → F.

n→∞
For a cumulative distribution, continuity points make up an everywhere dense set
since discontinuity points are denumerable. Indeed there cannot be more than n
discontinuity points with jump larger than or equal to n1 because F is bounded by 1
from above. Let then x ∈ R and > 0. There exist two continuity points x0 , x of F
such that x − < x0 < x < x1 < x + . We have then

F(x − ) ≤ lim Fn (x0 ) = F(x0 )

n→∞

and also

lim Fn (x1 ) = F(x1 ) ≤ F(x + ).

n→∞

On the other side for every n

Fn (x0 ) ≤ Fn (x) ≤ Fn (x1 ).

Therefore for n sufficiently large we have

F(x − ) − < Fn (x) < F(x + ) + .

It is easy to build up examples of sequences of absolutely continuous √ c.d.f.’s con-

verging to a discrete (pure jump) c.d.f. For example if Fn (x) = N ( nx), the c.d.f.
of normally distributed X n with P(X ) = 0 and σ 2 (X n ) = n1 , then Fn → F with

1 for x ≥ 0,
F(x) =
0 for x < 0.

Conversely we can build up examples of discrete c.d.f’s converging to an absolutely

continuous c.d.f. For example if
⎧
⎪
⎨0 for x ≤ 0,
[nx]
Fn (x) = for 0 < x ≤ 1,
⎪ n
⎩
1 for x > 1,

where [x] denotes the integer part of x, then Fn → F, where F is the c.d.f. of the
uniform distribution in [0, 1]:
⎧
⎪
⎨0 for x ≤ 0,
F(x) = x for 0 < x ≤ 1,
⎪
⎩
1 for x > 1.
5.2 Convergence of Geometric Distribution to Exponential Distribution 75

5.2 Convergence of Geometric Distribution to Exponential

Distribution

We have seen that geometric and exponential distributions share the property of
absence of memory, the former among discrete distributions, the latter among ab-
solutely continuous distributions. Let us now consider a sequence (X n )n∈N of random
numbers with geometric distributions with parameters pn :

P(X n = k) = pn (1 − pn )k−1 , ∀k ≥ 1.

We assume that npn converges to λ > 0, as n → ∞. We put Yn = Xn

n
and denote by
FYn the c.d.f. of Yn . We have that

FYn → F,

where F is the c.d.f. of exponential distribution with parameter λ > 0, i.e.:

0 for x < 0,
F(x) = −λx
1−e for x ≥ 0.

Indeed for x < 0, FYn ≡ 0, as

Yn ≥ 0. For x ≥ 0

FYn (x) = P(Yn ≤ x) = P(X n ≤ nx)

∞

=1− pn (1 − pn )k−1
k=[nx]+1
∞

[nx]
= 1 − pn (1 − pn ) (1 − pn )i
i=0

[nx] 1
= 1 − pn (1 − pn )
1 − (1 − pn )
= 1 − (1 − pn )[nx] ,

where we have used the formula for the sum of geometric series. We write

nx = [nx] + δn , with 0 ≤ δn < 1.

We obtain therefore

FYn (x) = 1 − (1 − pn )nx−δn ,

76 5 Convergence of Distributions

which tends to 1 − e−λx for n → ∞, since

log (1 − pn )nx = nx log (1 − pn ) = −xnpn + o(npn )

which tends to −λx for n → ∞, whereas

(1 − pn )δn −−−→ 1,
n→∞

as 0 ≤ δn < 1 and pn −−−→ 0.

n→∞

5.3 Convergence of Binomial Distribution to Poisson

Distribution

We now provide an approximation of the binomial distribution when we consider

the number of successes in a large number of trials.
Let (X n )n∈N be a sequence of binomially distributed random numbers with para-
meters n, pn such that npn → λ with λ > 0 as n → ∞. For example X n represent
the number of successes in n Bernoulli trials with parameter pn . As the number of
trials grows to infinity we send to 0 the probability of success in a single trial. For
0 ≤ k ≤ n:

n
P(X n = k) = pnk (1 − pn )n−k
k
n! nk
= pnk (1 − pn )n−k k
k!(n − k)! n

multiplication and division by n k

1 1 k−1
= 1− ··· 1 − (npn )k (1 − pn )n−k .
k! n n

We observe that:

1 k−1
• 1− ··· 1 − tends to 1 as n → ∞;
n n
• (npn )k tends to λk for n → ∞;
• (1 − pn )−k tends to 1 for n → ∞;
• (1 − pn )n tends to e−λ for n → ∞ as

log(1 − pn )n = nlog(1 − pn ) = −npn + o(npn )

tends to −λ.
5.3 Convergence of Binomial Distribution to Poisson Distribution 77

It follows that for k ∈ N

λk −λ
P(X n = k) −−−→ e ,
n→∞ k!
and therefore the sequence of binomial c.d.f.’s with parameters n, pn tends to Poisson
c.d.f. with parameter λ.

5.4 De Moivre-Laplace Theorem

We consider now another type of convergence for sequences of binomial c.d.f’s. We

send the number of trials to infinity but this time we keep fixed the probability of
success in a single trial. In order to obtain convergence we need to perform a linear
rescaling.

Theorem 5.4.1 Let (X n )n∈N be a sequence of random numbers with binomial dis-
tribution Bn(n, p) with 0 < p < 1 and let X n∗ be the corresponding standardized
random numbers given by

X n − P(X n ) X n − np
X n∗ = =
σ(X n ) np p̃

for n ∈ N \ {0}. Where p̃ = 1 − p. Then we have for all n ∈ N \ {0}

hn x2
P(X n∗ = x) = √ e− 2 e En (x) ,
2π

1
where h n = and the error E n (x) tends uniformly to 0 when x ranges on
np p̃
I (X n∗ ) [−K , K ] for any fixed constant K .

Proof The set I (X n ) of the possible values of X n is I (X n ) = {0, 1, . . . , n}. Therefore

I (X n∗ ) = {h n (−np), h n (1 − np), . . . , h n (n − np)}

1
where h n = is the spacing between possible values of X n∗ .
np p̃
We define φn (x) = log P(X n∗ = x) for x ∈ I (X n∗ ) and consider its incremental
ratio:
φn (x + h n ) − φn (x) 1 P(X n∗ = x + h n )
= log .
hn hn P(X n∗ = x)
78 5 Convergence of Distributions

Putting k = np + x np p̃, we obtain

1 P(X n∗ = x + h n ) 1 P(X n = k + 1)
log = log
hn P(X n∗ = x) hn P(X n = k)
1 (n − k) p
= log
hn (k + 1) p̃

n p̃ − x np p̃ p
= np p̃log
n + 1 + x np p̃ p̃

p
1 − x
n p̃
= np p̃log .
1 p̃
1+ +x
np np

Using the 1-order expansion of the logarithm log(1 + x) = x + O(x 2 ), we obtain

p
1−x
n p̃
np p̃log
1 p̃
1+ +x
np np

p x2 p x2 + 1
= np p̃ −x + O( ) − x + O( )
n p̃ n n p̃ n
x2 + 1
= −x p − x p̃ + O( √ )
n
x2 + 1
− x + O( √ ) .
n

The function φn (x) is not defined everywhere, but only for x in I (X n∗ ). We can extend
it to values between two elements of I (X n∗ ) by linear interpolation. In this way we
can write x
φn (x) = φn (0) + φn (y)dy .
0

x2 + 1 x2 + 1
If x ≤ y ≤ x +h n , then φn (y) = h n φn (x) = −x + O( √ ) = −y + O( √ )
n n
so that:
x
φn (x) = φn (0) + φn (y)dy
0
x
|x|3 + |x|
= φn (0) + (−y)dy + O( √ )
0 n
x2 |x|3 + |x|
= φn (0) − + O( √ ).
2 n
5.4 De Moivre-Laplace Theorem 79

Since φn (x) = log P(X n∗ = x), we obtain

x2
log P(X n∗ = x) = eφn (0) e− 2 e En (x)

|x|3 + |x|
where E n (x) = O √ .
n
We can estimate eφn (0) in the following way: X n∗ is a standardized random number,
i.e. P(X n∗ ) = 0 and σ 2 (X n∗ ) = 1. By the Chebychev inequality, we have that:

1
P(|X n∗ | ≥ K ) ≤ .
K2
K can be chosen so that this probability is arbitrary small, that is for every > 0
there is K such that:
1
1−=1− ≤ P(|X n∗ | < K ) ≤ 1.
K2

Since P(|X n∗ | < K ) = x,|x|<K P(|X n∗ | = x), it follows that:

1−≤ P(X n∗ = x) ≤ 1 .
x,|x|<K

Moreover
x2
P(|X n∗ | < K ) = P(|X n∗ | = x) = h n e− 2 .
x,|x|<K x,|x|<K

x2
Since E n (X ) tends uniformly to 0 on bounded interval and x,|x|<K h n e− 2 is the
x2 K x2
Riemann sum for the function e− 2 and tends to −K e− 2 d x, we have for n suffi-
ciently large that:

eφn (0) K − x 2
1 − 2 ≤ e 2 dx ≤ 1 .
h n −K

Let K tending to infinity, we obtain

eφn (0) √
1 − 3 ≤ 2π ≤ 1
hn

so that

eφn (0) √
2π −−−→ 1 .
hn n→∞
80 5 Convergence of Distributions

It follows that
hn x2
P(|X n∗ | = x) = √ e− 2 e En (x) ,
2π

where E n (x) is an error that tends uniformly to 0 for x ranging on the possible values
of X n∗ in a bounded interval.

As application of the theorem, one obtains an approximation of the c.d.f. of the

binomial distribution. Given a, b, a < b:
hn x2
P(a ≤ X n∗ ≤ b) = P(X n∗ = x) = √ e− 2 e En (x) .
a≤x≤b a≤x≤b
2π

1 x2
This is the Riemann sum of n(x) = √ e− 2 , therefore it converges to
2π
b
1 x2
√ e− 2 d x = N (b) − N (a) ,
2π a

where N (x) is the c.d.f. of standard Gaussian distribution. The c.d.f. Fn (x) of X n∗
converges to N (x) since

Fn (x) = P(X n∗ ≤ x) = P(−k < X n∗ ≤ x) + P(X n∗ ≤ −k)

= N (x) − N (−k) + P(X n∗ ≤ −k) + E n (x)

with lim E n (x) = 0. The Chebychev inequality states that P(X n∗ ≤ −k) can be
n→∞
made arbitrarily small. Also N (−k) tends to 0 for k −→ ∞. Therefore the c.d.f. of
standardized binomial distributions tend to N .
Chapter 6
Discrete Time Markov Chains

6.1 Homogeneous Discrete Time Markov Chains

with Finite State Space

We define a homogeneous Markov chain with finite state space S ⊂ R as a sequence

of random numbers (X i )i∈N ⊂ S for i ∈ N such that:

P(X 0 = s0 , X 1 = s1 , . . . , X n = sn ) = ρs0 ps0 ,s1 ps1 ,s2 · · · psn−1 ,sn ,

where
1. ρsi , si ∈ S, is called initial distribution:

ρsi = P(X 0 = si ) and ρs = 1 .
s∈S

2. ps,s = [P]s,s , are called transition probabilities and satisfy:

•
0 ≤ pi j ≤ 1;
• s ∈S ps,s = 1 for every s ∈ S.
They can be arranged in a matrix P called transition probability matrix of entries

[P]s,s =: ps,s .

The Markov chain (X i )i∈N can be seen as representing the evolution of a system that
moves from one state to another in a random fashion. We have assumed that S ⊂ R,
but it may be convenient to consider in some situations a general finite set S. In this
case X i are not random numbers, but random entities. However what follows goes
through without any change.

© Springer International Publishing Switzerland 2016 81

F. Biagini and M. Campanino, Elements of Probability and Statistics,
UNITEXT - La Matematica per il 3+2 98, DOI 10.1007/978-3-319-07254-8_6
82 6 Discrete Time Markov Chains

We show now that ps,s is the probability to go from state s to state s . Moreover
we show that the probability that X r +1 = s conditional to all previous history
X 0 = s0 , . . . , X r −1 = sr −1 , X r = s depends just on s and is equal to ps,s (Markov
property). Indeed:

P(X r +1 = s |X r = s, X r −1 = sr −1 , . . . , X 0 = s0 )
P(X r +1 = s , X r = s, X r −1 = sr −1 , . . . , X 0 = s0 )
=
P(X r = s, X r −1 = sr −1 , . . . , X 0 = s0 )
ρs0 ps0 ,s1 · · · psr −1 ,Ps,s
=
ρs0 ps0 ,s1 · · · psr −1 ,s , ps,s
= ps,s ,

provided that the probability of the conditioning event, which is at denominator, is

positive (this is required to compute the conditional probability).
Example 6.1.1 (Random walk). A random walk in an integer interval [a, b] ⊂ Z,
with absorbing boundary conditions is a Markov chain with state space S = [a, b]
and transition probability matrix:
⎛ ⎞
1 0 0 ··· ···0
⎜ .. .. ⎟
⎜1 − p 0 p . .⎟
⎜ ⎟
⎜ .. ⎟
⎜ 0 ... ..
.
..
.
..
.⎟.
P=⎜
⎜ . .
⎟,
⎟
⎜ .. .. .. .. ..
⎜ . . 0⎟ .⎟
⎜ . .. ⎟
⎝ .. . 1− p 0 p⎠
0 ··· ··· 0 0 1

where 0 < p < 1. Boundary conditions are determined by the transition probabil-
ities from state a and b. Other boundary conditions can be considered: reflecting,
mixed, …In the case p = 21 we speak of symmetric random walk.
Example 6.1.2 (Bernoulli-Laplace chain). Let us consider two urns A and B, each
containing N balls. The balls are assumed to be identical apart from their colors.
Among the balls there are N white balls and N black balls. At each integer time we
choose one ball from each urn and exchanges them.
Let X i be the random number of white balls in A at time i. The state space is

S = {0, 1, . . . , N }.

The transition probability from state k to state l is given by:

pk,k = P(two white balls or two black balls are drawn)

k N −k N −k k k N −k
= + =2 ; (6.1)
N N N N N N
6.1 Homogeneous Discrete Time Markov Chains with Finite State Space 83

pk,k+1 = P(1 black ball from urn A and 1 white ball from B )
N −k N −k (N − k)2
= = ; (6.2)
N N N2
pk,k−1 = P(1 white ball from A and 1 black ball from B )
k k k2
= = 2. (6.3)
N N N
The transition probabilities to other states are zero. This applies also to the case k = 0
and k = N . The transition matrix is therefore:
⎛ ⎞
0 1 0 0 ··· 0

⎜ 1 2(N −1) N −1 2 ⎟
⎜ N2 N2 0 ··· 0⎟
⎜ N
4(N −2)

2 ⎟
P=⎜ ··· 0⎟
4 N −2
⎜ 0 N2 N2 N ⎟.
⎜ .. .. ⎟
⎝ . . ⎠
0 ··· 0 ··· 1 0

6.2 Transition Probability in n Steps

By using composite probability formula we can compute the probability for a Markov
chain to go from state s to state s in n steps. Let s0 , s1 , . . . , sm−1 , s be a sequence of
states such that ρs0 ps0 ,s1 ps1 ,s2 · · · psm−1 ,s are strictly positive. We have:

P(X m+n = s |X m = s, X m−1 = sm−1 , . . . , X 0 = s0 )

P(X m+n = s , X m = s, X m−1 = sm−1 , . . . , X 0 = s0 )

=
P(X m = s, X m−1 = sm−1 , . . . , X 0 = s0 )

P(X m+n = s , X m+n−1 = sm+n−1 , . . . , X 0 = s0 )
sm+1 ,...,sm+n−1
=
P(X m = s, X m−1 = sm−1 , . . . , X 0 = s0 )

ρs0 ps0 ,s1 · · · psm−1 ,s ps,sm+1 · · · psm+n−1 ,s
sm+1 ,...,sm+n−1
=
ρs0 ps0 ,s1 ps1 ,s2 · · · psm−1 ,s

= ps,sm+1 · · · psm+n−1 ,s
sm+1 ,...,sm+n−1

= P n s,s .
84 6 Discrete Time Markov Chains

This probability does not depend on m, but just on n, that is the number of intermediate
steps. It is obtained as the element with coordinates s, s of the n-th power of the
(n)
transition matrix P. In the following we will use the common notation ps,s for this

probability: n
(n)
ps,s := P(X m+n = s |X m = s) = P s,s .

By convention one defines:

⎧
⎨ 1 if s = s ,
(0)
ps,s := ps,s =
⎩
0 otherwise.

6.3 Equivalence Classes

Let (X i )i∈N be a homogeneous Markov chain. We say that the state s communicates
with the state s if there exists n > 0 such that
(n)
ps,s > 0,

that is if there exists a path s, s1 , . . . , sn−1 , s such that all transition probabilities
ps,s1 , ps1 ,s2 , . . . , psn−1 ,sn are strictly positive. We will use the notation s ≺ s to
indicate that s communicates with s .
Two states s, s , are said to be equivalent if s ≺ s and s ≺ s. This is an equivalence
relation, i.e. it is reflexive, symmetric and transitive. The first two properties are
evident. Transitivity follows from transitivity of communication. Assume that s ≺ s
and s ≺ s . Then there are n 1 , n 2 such that ps,s n1 n2
> 0 and ps ,s > 0. It follows that

s ≺ s . Indeed:
(n 1 +n 2 )
(n 1 ) (n 2 ) (n 1 ) (n 2 )
ps,s = P n 1 +n 2 s,s = ps,s ps1 ,s ≥ ps,s ps ,s > 0 .
s1
1

>0 >0

The communication relation ≺ between states can be extended without ambiguity

to equivalence classes. We indicate with [s] the equivalence class of the state s, i.e.
the set of all states s equivalent to s according to the previously introduced relation.
When s ≺ s we say that s follows s. We say that [s] communicates with [s ] and
write [s] ≺ [s ] if s ≺ s . Using the transitivity property it is easy to check that this is
a well-posed definition, i.e. it does not depend on the choices of the representatives
in the equivalence classes.
An equivalence class is said to be maximal if it is not followed by any other
class with respect to the communication relation. If a Markov time is in a state of
a maximal equivalence class, then at all subsequent times, it will be in states of the
same class with probability 1.
6.3 Equivalence Classes 85

Another characteristic of a state of a Markov chain is its period. Let s ∈ S be a

state of a Markov chain and let:

A+ (n)
s = { n > 0 | ps,s > 0} .

If A+s = ∅, we define the period of s as the greatest common divisor (GCD) of the
elements of A+ s . If the period of s is 1, we say that s is an aperiodic state. For example,
in the random walk on the interval [a, b] with absorbing boundary conditions all states
s with a < s < b have period 2.
All states of an equivalence class have the same period. Therefore one can speak
of the period of an equivalence class.

Proof Let us consider two equivalent states s ∼ s , and q, q their periods. It is

enough to show that q divides every n ∈ A+ s . In force of the equivalence, there is n 1
such that ps,s > 0 and there is n 2 such that ps(n ,s2 ) > 0. Then (n 1 + n 2 ) ∈ A+
(n 1 )
s since

(n 1 ) (n 2 )
(n 1 +n 2 ) (n 1 ) (n 2 )
ps,s = ps,s 1 s1 ,s
≥ ps,s ps ,s > 0.

Similarly (n 1 + n 2 ) ∈ A+
s ; hence q and q both divide (n 1 + n 2 ). Moreover for all
+ +
n ∈ As , (n + n 1 + n 2 ) ∈ As , since

ps(n+n
,s
1 +n 2 )
≥ ps(n ,s2 ) ps,s
(n) (n 1 )
ps,s > 0 .

+
Hence, q and q divide (n + n 1 + n 2 ) for all n ∈ A+
s and for all n ∈ As . Since n 1 , n 2
are divisible by q and by q , q and q are both common divisors of As and of A+
+
s , so
that
q = q .

An equivalence class C of period q < ∞ can be decomposed in q subsets:

C = C0 ∪ C1 ∪ · · · ∪ Cq−1

(n)
with the property that if s ∈ Ci , s ∈ C j and ps,s > 0 then

n ≡ ( j − i) (mod q) .

If a maximal equivalence class C has period q, C0 , C1 , . . . , Cq−1 are cyclically visited

by the Markov chain: i.e. if X 0 ∈ Ci , then X 1 ∈ C[i+1]modq, X 2 ∈ C[i+2]modq with
probability 1, where we use the notation [k]q for the element of the set {0, . . . , q − 1}
that is equivalent to k modulo q.
86 6 Discrete Time Markov Chains

6.4 Ergodic Theorem

We want to study the behavior of a Markov chain as time proceeds.

An important result states that a Markov chain with finite state space and a sin-
gle aperiodic equivalence class has the property that the distribution on state space
converges to a limit that does not depend on the initial state. This is the result of the
following theorem called ergodic theorem (see e.g. Gnedenko (1997) for a proof ).

Theorem 6.4.1 (Ergodic theorem) Let (X i )i∈N be a homogeneous Markov chain

with a finite state space. If the chain is irreducible (i.e. there is a unique equivalence
class) and aperiodic (i.e. the period is 1), then there is a probability distribution
Π = (πs )s∈S on the state space and constants C > 0 and 0 ≤ δ < 1 such that for
all s , s ∈ S :
| ps(n)
,s − πs | ≤ Cδ .
n

In other words there are πs , s ∈ S, such that:

0 ≤ πs ≤ 1;
1.
2. s∈S πs = 1,

and ∀ s ∈ S
lim ps(n)
,s = πs
n→+∞

with exponential speed.

This theorem can be used also in the case when the period q is strictly larger than 1,
by considering the Markov chain with transition matrix P q . Indeed, the restriction of
this chain to each of the subsets C0 , C1 , . . . , Cq−1 satisfies the hypothesis of ergodic
theorem.
The probability distribution Π that appears in the statement of the ergodic theorem
is an invariant (or stationary) distribution for the Markov chain: this means that if
we take it as initial distribution, so that P(X 0 = s) = πs for every s ∈ S, then for
every s ∈ S and for every n ≥ 0

P(X n = s) = πs .

This property allows us to compute πs as the solution of a system of linear equations.

Indeed:

πs = P(X 1 = s)

= P(X 0 = s ) ps ,s
s ∈S

= πs ps ,s .
s ∈S
6.4 Ergodic Theorem 87

Moreover, since πs is a probability distribution, we have

πs = 1 .
s∈S

Under the hypothesis of the ergodic theorem one can show that there is one and
only one solution for this system of |S| + 1 equations in |S| unknowns; one of the
equations, in this case one of the first |S| equations, is a linear combination of the
others and therefore it can be skipped in the solution of the system:

Πt = Πt P
(6.4)
πs ∈S πs = 1 ,

where we have represented Π as |S|-dimensional vector. The ergodic theorem tells

us that as time advances the Markov chain forgets the initial state and reaches an
equilibrium. We show now the uniqueness of the invariant measure.

Proof Let us assume that (μs )s∈S is another probability distribution on the state space
satisfying system (6.4). We have

μt = μt P

s∈S μs = 1 ,

where we have represented the distribution (μs )s∈S as the |S|-dimensional column
vector μ. We have

μt = μt P ⇒ μt = μt P = μt P 2 = · · · = μt P n .

If n tends to infinity, by ergodic theorem P n converges to the matrix:

⎛ ⎞
π1 π2 · · · πn
⎜ π1 π2 · · · πn ⎟
⎜ ⎟
⎜ .. .. .. ⎟
⎝ . . . ⎠
π1 π2 · · · πn

therefore for s ∈ S
μs = μs ps ,s = μs ps(n)
,s .

s s

By taking the limit limn→+∞ ps(n)

,s = πs , we have

μs = μs πs = πs μs = πs .
s s

1
Chapter 7
Continuous Time Markov Chains

7.1 Introduction

In this chapter we shall introduce some simple queueing systems. For further reading,
we refer to [7, 8].
A queueing system can be described in terms of servers and a flow of clients
who access servers and are served according to some pre-established rules. The
clients after service can either stay in the system or leave it, also according to some
established rules.
The simplest case is when there is a single set of servers and a flow of clients
accessing to it. If there is at least one free server, then an incoming client is served
right away. Otherwise, i.e. if all servers are engaged, he is put in a queue and waits
for his turn. Once a client is served, he leaves the system.
Usual hypotheses are that service times are stochastically independent, identically
distributed, and moreover that they are stochastically independent from the flow of
clients’ arrivals. One would like to obtain the probabilities that, at given times, there
are some numbers of clients in the system. For this one needs to introduce a random
number for each time t; this leads us to introduce the notion of stochastic process.

Definition 7.1.1 A stochastic process (X t )t∈I with I interval of R, is a family of

random numbers with index varying in some interval I of R.

Speaking of stochastic processes therefore one refers to continuous index space,

where the index is usually interpreted as time. Markov chains, introduced in previous
chapter, can be considered as discrete time stochastic processes.
We model the flow of incoming clients by a stochastic process Nt , representing
the number of clients arrived before time t, which is assumed to be stochastically
independent from service times. For fixed t, X t represents the random number of the
clients who are present in the system at time t.

© Springer International Publishing Switzerland 2016 89

F. Biagini and M. Campanino, Elements of Probability and Statistics,
UNITEXT - La Matematica per il 3+2 98, DOI 10.1007/978-3-319-07254-8_7
90 7 Continuous Time Markov Chains

In order to characterize a system such as that we have described, one needs to

specify:
1. the stochastic process ruling the flow of incoming clients;
2. the distribution of service times;
3. the number of servers.

It is customary to adopt the following notation to indicate the specifications of a

given queueing system:

1. M denotes the Poisson process for the flow of incoming clients or exponential
distribution for service times;
2. Er denotes the Erlang distribution with parameter r for the inter-arrival times
of clients (that are supposed to be stochastically independent and identically
distributed) or for service times. The Erlang distribution with parameter r is the
distribution of a sum of r stochastically independent exponential random numbers
with the same parameter;
3. D denotes deterministic (non-random) inter-arrival times or service times;
4. G indicates that one does not make any particular hypothesis on the inter-arrival
times or service times (that however are always assumed to be stochastically
independent).

A process of the type we have described will be indicated by three symbols separated
by two slashes. The first symbol refers to the distribution of inter-arrival times, always
assumed to be stochastically independent and identically distributed (i.i.d.). The
second symbol refers to the distribution of service times. The third symbol indicates
the number of servers; it can possibly take the value ∞.
We shall consider three examples of queueing systems and precisely the systems
M/M/1, M/M/n with n > 1 and M/M/∞. Before that we shall speak about
continuous time Markov chains with countable state space, and in particular introduce
the Poisson process Nt , t ≥ 0, that for these queueing systems represents the number
of clients who entered the system before time t.

7.2 Homogeneous Continuous Time Markov Chains

with Countable State Space

An homogeneous continuous time Markov chain is a stochastic process (X t )t≥0 with

I (X t ) = N characterized by the initial distribution (ρs )s∈N , and for every t > 0 a
transition matrix p(t)s,s = [Π ]ss (t). As in the case of discrete time case they must
respectively satisfy

0 ≤ ρs ≤ 1, ρs = 1,
s∈N

0 ≤ p(t)s,s ≤ 1, p(t)s,s = 1,
s ∈N
7.2 Homogeneous Continuous Time Markov Chains with Countable State Space 91

for every t > 0. If 0 = t0 < t1 < · · · < tn−1 < tn , then

P(X 0 = s, X t1 = s1 , . . . , X tn = sn )
= ρs0 ps0 ,s1 (t1 ) ps1 ,s2 (t2 − t1 ) . . . psn−1 ,sn (tn − tn−1 ).

It follows from conditions of compatibility that transition matrices are related by

Chapman-Kolmogorov equations that can be expressed in synthetic form by:

Π (t + t ) = Π (t) Π (t ) ∀ t, t ≥ 0

or explicitly:
ps,s (t + t ) = ps,s (t) ps ,s (t ).
s

In order to treat interesting examples, such as those arising from queueing theory,
we need to consider the case of strictly denumerable state spaces. In this case Π (t)
is a matrix with infinitely many rows and columns with non-negative entries, such
that the sum of the series of the elements of each row is equal to 1.
The product of two matrices of this kind can be defined according to the usual row
times column rule, where the finite sum is replaced by a series. It is easy to check
that the result is still a matrix of this kind.
In the case of discrete time the transition probabilities in more steps can be obtained
from those in one step. In the case of continuous time analogously transition prob-
abilities in a finite time t can be obtained starting from their behavior as t becomes
infinitely small. The simplest case is the Poisson process.

7.3 Poisson Process

A Poisson process is a continuous time Markov chain with state space S = N. In the
following we shall use a Poisson process as a model for the flow of clients entering a
queueing system. For the quantities that we shall consider the order in which clients
are served does not matter. A Poisson process N = (Nt )t≥0 with parameter λ, where
λ > 0, is characterized by the following properties:

1. ps,s (h) = 1 − λh + o(h);

2. ps,s+1 (h) = λh + o(h);
3. ps,s (h) = o(h) for s ∈
/ {s, s + 1},

where o(h) is infinitesimal of order larger than h, uniformly in s and s .

Starting from this hypothesis we can obtain the Kolmogorov forward equations,
a system of infinitely many differential equations for transition probabilities. Let us
fix s̄ = 0 and the initial distribution ρ0 = 1, ρs = 0 for s = 0, i.e. P(N0 = 0) = 1.
92 7 Continuous Time Markov Chains

We put:
μs (t) = p0,s (t) for s ∈ N

and denote by μs the first derivative of μs . The functions μs verify the system of
equations:
μ0 (t) = −λμ0 (t)
(7.1)
μs (t) = −λμs (t) + λμs−1 (t) for s ≥ 1,

μs (t + h) − μs (t)
as we now show. Consider for s > 0 the incremental ratio for
h
h > 0. We have:
μs (t + h) − μs (t) p0,s (t + h) − p0,s (t)
=
h h
p
j 0, j (t) p j,s (h) − p0,s (t)
=
h
1
= ((1 − λh + o(h)) p0,s (t) + (λh + o(h)) p0,s−1 (t))
h
⎛ ⎞
⎜ ⎟
1 ⎜ ⎟
+ ⎜ p0, j (t) p j,s (h) − p0,s (t)⎟
h ⎝ ⎠
j
j=s, j=s−1
o(h) o(h)
= −λ p0,s (t) + λ p0,s−1 (t) + = −λμs (t) + λμs−1 (t) + .
h h

By taking the limit h ↓ 0, we obtain an equation for the right derivative:

μs (t) = −λμs (t) + λμs−1 (t) for s ≥ 1 ,

where we have used the notation for the derivative since it is easy to show that it
exists. For s = 0, we obtain for h > 0:

μ0 (t + h) − μ0 (t) p0,0 (t + h) − p0,0 (t)

=
h h
p
j 0, j (t) p j,0 (h) − p0,0 (t)
=
h
(1 − λh + o(h)) p0,0 (t) + j=0 p0, j (t) p j,0 (h) − p0,0 (t)
=
h
o(h) o(h)
= −λ p0,0 (t) + = −λμ0 (t) +
h h
7.3 Poisson Process 93

that in the limit h ↓ 0 converges to the equation

μ0 (t) = −λμ0 (t).

As we show below the solution of the system is given by

(λt)s −λt
μs (t) = p0,s (t) = e ,
s!

i.e. for each t we have that Nt has Poisson distribution with parameter λt.
If we take ρs̄ = 1 and ρs = 0 for s = s̄ i.e. assume that P(N0 = s̄) for some
arbitrary state s̄, then we obtain the transition probabilities starting from s̄:
⎧
⎨ ps̄,s (t) = 0 ¯
for s < s̄,
(λt) s−s̄
¯ (7.2)
⎩ ps̄,s (t) = e−λt for s ≥ s̄.
(s − s̄)!

Let us prove that (7.2) provides a solution for the system with initial state s̄. Let us
consider the generating function:

Φ(z, t) = ps̄,s (t)z s .
s

We derive Φ(z, t) with respect to t. It is easy to see that the derivative can be
exchanged with the series. By applying the system of equation for μs (t) = ps̄,s (t),
we obtain

∞ ∞
∂
Φ(z, t) = μs (t)z s = −λ μs (t)z s + λ μs−1 (t)z s = λ(z − 1)Φ(z, t).
∂t s s=0 s=1

Therefore
1 ∂ ∂
Φ(z, t) = log Φ(z, t) = λ(z − 1),
Φ(z, t) ∂t ∂t

so that
log Φ(z, t) = λ(z − 1)t + K ,

that is Φ(z, t) = e K eλ(z−1)t . Since μs̄ (0) = 1 and μs (0) = 0 for s = s̄, we have
Φ(z, 0) = z s̄ . We have therefore:

(λt)k
Φ(z, t) = z s̄ eλ(z−1)t = e−λt z s̄ eλzt = e−λt z s̄+k .
k
k!
94 7 Continuous Time Markov Chains

Fig. 7.1 Scheme of Poisson λ λ

λ λ
process with parameter λ

0 1 2 3

(λt)s−s̄ −λt
It follows that ps̄,s (t) = 0 for s < s̄, ps̄,s (t) = e for s ≥ s̄. The Poisson
(s − s̄)!
process is non-decreasing with probability 1. It can be represented as in Fig. 7.1,
when an arrow connecting two states with superscript λ indicates that the transition
intensity from one state to the other one is equal to λ. We observe that an arrow
enters every state s with s ≥ 1. These two arrows, one in-coming and one exiting,
correspond to two terms, one with plus sign and one with minus sign, on the right-
hand side of the differential equation. For s = 0 there is just an out-coming arrow,
corresponding to the single term, with minus sign, on the right-hand side of the
differential equation.
If we indicate with Ps (t) = P(Nt = s) the probability that Poisson process at
time t is in the state s, then we have

Ps (t) = ρs̄ ps̄,s (t) ,
s∈N

where ρs is the initial distribution. It follows that for every initial distribution the
functions (Ps (t))s∈N satisfy the same system of differential equations.

P0 (t) = −λP0 (t)
Ps (t) = −λPs (t) + λPs−1 (t) for s ≥ 1.

The functions ( ps̄,s (t))s∈N can be considered as particular cases in which ρs̄ = 1
and ρs = 0 for s = s̄.

7.4 Queueing Processes

We now consider some examples of continuous time Markov chains that serve as
models of queueing processes. As we have said in Sect. 7.1, in queueing theory there
is a symbolic notation to indicate the type of a queueing system. In the examples we
consider the flow of incoming clients follows a Poisson process with parameter λ.
Clients who find a free server start a service time and after service leave the system.
When an arriving client finds all servers engaged, he is put in a queue. When a server
becomes free, if there are clients waiting in queue, one of them starts its service time.
For what we are interested in, the order in which clients access the service does
not matter; we can assume, for example, that the order is randomly chosen, but
other possible choices would not change the results. We assume that service times
7.4 Queueing Processes 95

are stochastically independent, identically distributed and stochastically independent

from the Poisson process ruling the flow of arrivals. We also assume that service times
are exponentially distributed with some parameter μ.
A process of this type will be indicated with the symbol M/M/n. The first M
means that the flow of arrivals is Poisson, the second M means that service times are
exponentially distributed, while n denotes the number of servers and can vary from
1 to ∞ (∞ is an admissible value).

7.5 M/M/∞ Queueing Systems

We consider an idealized situation in which there are infinitely many servers. The
flow of arrivals is ruled by a Poisson process with parameter λ and service times are
exponentially distributed with parameter μ.
Let X = (X t )t≥0 be the process indicating the number of clients who are in the
system at time t. As initial distribution we assume that:

P(X 0 = 0) = 1 ,
P(X 0 = i) = 0 for i > 0,

i.e. no client is present in the system at time 0. As stated in previous section, ser-
vice times are stochastically independent between themselves and from the arrivals’
process. In order to compute the intensity of service process, we obtain the probabil-
ity that a client is served in time interval (t, t + h), given that he has not been served
up to time t. If T is service time for a client, we have:

P(t < T ≤ t + h)
P(T ≤ t + h|T > t) =
P(T > t)
e−μt − e−μ(t+h)
=
e−μt
−μh
= 1−e
= 1 − (1 − μh + o(h))
= μh + o(h),

where we have used first order expansion of the exponential e−μh = 1 − μh + o(h)
for small h. Assume that there are n clients in the system. If no one of them has been
served up to time t, the probability that at least one of them is served in time interval
(t, t + h) is then:

1 − P(T1 > t + h, . . . , Tn > t + h|T1 > t, . . . , Tn > t)

= 1 − P(T > t + h|T > t)n = 1 − e−nμh = nμh + o(h) ,
96 7 Continuous Time Markov Chains

Fig. 7.2 Graphical

λ λ λ λ
representation of a M/M/∞
queueing system
0 1 2 3

μ 2μ 3μ 4μ

where T1 , . . . , Tn denote the service times of the clients and we have used the fact
that they are stochastically independent and identically distributed. Therefore a client
exits the system with an intensity which is proportional to the number of clients
present in the system. The process can be represented as in Fig. 7.2.
Putting p0,s (t) = μs (t), we can write forward Kolmogorov equations by using
the rule described in Sect. 7.3:

μ0 (t) = μμ1 (t) − λμ0 (t)
μi (t) = −(λ + iμ) μi (t) + λ μi−1 (t) + (i + 1)μ μi+1 (t)

for i ≥ 1, where μi (t) denotes the derivative of μi .

We have seen that n-steps transition probabilities of discrete time Markov chains
satisfying the hypothesis of ergodic theorem converge as n → ∞ to the stationary
distribution. Analogous results hold for continuous time Markov chains. Therefore
we look for a stationary solution ( pi )i≥0 of the system of equations, that is a solution
which does not depend on the time. We impose μi (t) = 0 so that μi (t) = pi and
obtain: ⎧
⎪
⎪ 0 = μ p1 − λ p0
⎪
⎨ 0 = −(λ + iμ) p + λ p + (i + 1)μ p , for i ≥ 1
i i−1 i+1
+∞

⎪
⎪
⎪
⎩ pi = 1 .
i=0

By adding up the equations up to the i-th one, we obtain the recursive formula:
i
λ 1 λ
pi = pi−1 = p0 .
iμ i! μ

+∞

1 λ i
By imposing the condition p0 = 1, we obtain:
i=0
i! μ

+∞

1 λ i
p0 = 1.
i=0
i! μ
7.5 M/M/∞ Queueing Systems 97

+∞

1 λ i λ
Since = e μ , therefore
i=0
i! μ i
− μλ 1 λ λ
pi = e and pi = e− μ ,
i! μ

which is Poisson distribution with parameter μλ . We come to the conclusion that for
M/M/∞ the queueing system stationary distribution exists for all values of λ and μ.

7.6 M/M/1 Queueing Systems

Also for M/M/1 service times are assumed to be stochastically independent and
identically distributed with exponential distribution with parameter μ. The arrival
flow of clients is ruled by a Poisson process with parameter λ which is stochastically
independent from service times.
For this system there is just one server. Therefore the intensity for a client to
exit the system is equal to μ independently from the number of clients present in
the system. M/M/1 queueing system can be graphically represented as shown in
Fig. 7.3.
The system of differential equations for the function μs (t) = ps̄,s (t), where s̄ is
some fixed state, is then:

μ0 (t) = μμ1 (t) − λμ0 (t)
μ1 (t) = −(λ + μ) μ1 (t) .

Also in this case we look for a stationary solution, i.e. such that μi (t) = 0 for i ∈ N
with μi (t) = pi , where pi is a probability distribution. We obtain then the system of
linear equations: ⎧
⎪
⎪ 0 = μ p1 − λ p0
⎪
⎨ 0 = −(λ + μ) pi + λ pi−1 + μ pi+1
+∞

⎪
⎪
⎪
⎩ pi = 1 .
i=0

Fig. 7.3 Scheme of M/M/1

λ λ λ λ
queueing system

0 1 2 3

μ μ μ μ
98 7 Continuous Time Markov Chains

From this system we obtain, by adding up the first n equations, the recursive relation
n
λ λ
pn = pn−1 = p0 .
μ μ
+∞
By imposing the condition i=0 pi = 1, we obtain
∞ i

λ
p0 = 1.
i=0
μ

λ
This series is convergent if < 1. In this case we get
μ

λ
p0 = 1 − .
μ

The stationary probability distribution is then

i
λ λ λ
pi = 1− , for i = .
μ μ μ

The stationary probability distribution is a shifted geometric distribution with para-

meter μλ (the set of possible values is N instead of N \ {0}). It exists if and only if
λ
μ
< 1 or λ < μ, i.e. if the intensity of arrivals of clients is strictly less than the
parameter of the exponential distribution of service times.

7.7 M/M/n Queueing Systems

We finally consider M/M/n queueing systems with n ≥ 2, i.e. with a finite number
of servers larger than 1. From considerations similar to those developed for the
other cases we obtain the following system of equations for transition probabilities
μs (t) = ps̄,s (t), where s̄ is some fixed state:
⎧
⎪
⎪ μ0 (t) = μμ1 (t) − λμ0 (t)
⎪
⎪
⎪
⎪ μ1 (t) = −(λ + μ) μ1 (t) + λ μ0 (t) + 2μ μ2 (t)
⎪
⎪
⎨···
μn−1 (t) = −(λ + (n − 1)μ) μn−1 (t) + λ μn−1 (t) + nμ μn (t)
⎪
⎪
⎪
⎪ μn (t) = −(λ + nμ) μn (t) + λ μn−1 (t) + nμ μn+1 (t)
⎪
⎪
⎪
⎪ μ (t) = −(λ + nμ) μn+1 (t) + λ μn (t) + nμ μn+2 (t)
⎩ n+1
...,
7.7 M/M/n Queueing Systems 99

λ λ λ λ λ λ

0 1 2 3 n−1 n n+1 n+2

μ 2μ 3μ nμ nμ nμ
Fig. 7.4 Scheme of a M/M/n queueing system with initial state in 0

where λ and μ are, as in previous cases, respectively the parameter of the Poisson
process ruling the arrival of clients and of the exponential distribution of service
times. The system is graphically represented in Fig. 7.4.
Let us now look for the stationary distribution by imposing μi (t) = 0 for all
i ∈ N. If we denote pi ≡ μi (t), we obtain the following system of linear equations:
⎧
⎪
⎪ 0 = μ p1 − λ p0
⎪
⎪ = 2μ p2 − λ p1
⎪
⎪ 0
⎪
⎪ · · ·
⎪
⎪
⎪ 0 = (n − 1)μ p
⎪
⎪
⎨ n−1 − λ pn−2
0 = nμ pn − λ pn−1
⎪
⎪ 0 = nμ pn+1 − λ pn
⎪
⎪
⎪
⎪ ···
⎪
⎪
⎪
⎪ +∞

⎪
⎪
⎪
⎩ pi = 1 .
i=0

We obtain the following recursive equations:

λ
pi = pi−1 for i = 1, . . . , n;
iμ
λ
pi = pi−1 for i ≥ n + 1.
nμ

Therefore we have:
i
λ 1
pi = p0 for i = 0, . . . , n,
μ i!
i
λ 1
pi = p0 for i ≥ n + 1.
μ n!n i−n

A solution of the system exists if

n−1 i
∞
λ 1 λ i 1
+ < +∞.
i=0
μ i! i=n μ n!n i−n
100 7 Continuous Time Markov Chains

The first term on the left-hand side is a finite sum. The series of the second term can
be rewritten by putting j = i − n as
n ∞
1 λ λ j
.
n! μ j=0
nμ

λ
The condition of convergence is therefore < 1, i.e. λ < nμ. This result answers
nμ
the problem of how many servers are needed for a queueing system with some fixed
Poisson flow of incoming clients so that the queue stabilizes (so that a stationary
distribution exists). For λ < nμ we have:
n−1 −1
λ i 1 1 λ n 1
p0 = + λ
(7.3)
i=0
μ i! n! μ 1 − nμ
i
λ 1
pi = p0 for i = 1, . . . , n, (7.4)
μ i!
i
λ 1
pi = p0 for i ≥ n + 1. (7.5)
μ n!n i−n

7.8 Queueing Systems in Stationary Regime and Little’s

Formulas

For Markov queueing systems introduced in the previous sections the existence of
an invariant distribution allows us to consider a stationary regime for the process X
representing the number of clients present in the system. In the stationary regime
probabilistic characteristics of the process don’t vary in time. The stationary regime
is obtained by taking as initial distribution the stationary distribution.
It can be shown that, when a stationary distribution exists, these queueing systems
evolve towards stationary regime and moreover temporal averages of observables
tend, as the length of the temporal interval tends to infinity, to the expectations of
the observables computed in stationary regime. All this should be precisely stated
and supported with proofs. We limit ourselves to accept it and to reason at intuitive
level. We now consider some quantities or observables, which are relevant for the
study of queueing systems and their efficiency, and establish some useful relations.
From now on we shall always refer to queueing systems in stationary regime.
In order to evaluate the efficiency of a queueing system, we introduce the utiliza-
tion factor ρ. This quantity is defined as the client’s average arrival rate λ times the
average service time T̄ divided by the number m of servers. It can be shown that the
utilization factor is equal to the average percentage rate of utilization of servers. For
a non-deterministic system in stationary regime it is known that ρ < 1, see also [12],
7.8 Queueing Systems in Stationary Regime and Little’s Formulas 101

i.e. that with probability one servers do not work full time. A server will be free for
a positive percentage of time. Other interesting quantities are:
1. the average number L of clients present in the system;
2. the average number L q of clients waiting in queues;
3. the average time W that a client spends in the system;
4. the average time Wq that a client spends waiting in queues.
The last two quantities are related by the equation

W = Wq + T̄ ,

where T̄ is the equation of service time.

Let us assume that every client pays an amount equal to the time he spends in the
system. In a time interval of length t the expectation of the amount paid by clients is
given, apart from quantities of order smaller than t, by λt (expectation of the number
of clients entering the system in a time interval of length t) times W (expectation
of the time a client spends in the system). Alternatively the same quantity is given
by Lt. By equating the expressions and letting t tend to infinity, we get first Little’s
formula L = λW .
Analogously if we assume that a client pays an amount equal to the time he spends
in queue, we get the second Little’s formula L q = λWq .
Little’s formulas apply to a large class of queueing systems in stationary regime.
Let us consider the case of M/M/1 queueing system. As we have seen, this system
has an invariant distribution if and only if λ < μ, where λ is the parameter of Poisson
process of incoming clients and μ is the parameter of the exponential distribution of
service time. In this case the stationary distribution for the number of clients present
in the system is given by:
k
λ λ
ρk = 1− .
μ μ

We have therefore
∞
∞ k
λ λ λ
L= kρk = k 1− =
k=1 k=1
μ μ μ − λ

and
∞
∞ k
λ λ λ2
Lq = (k − 1)ρk = (k − 1) 1− = .
k=2 k=1
μ μ μ(μ − λ)

Therefore by using Little’s formulas we have:

1 λ
W = , Wq = ,
μ−λ μ(μ − λ)
102 7 Continuous Time Markov Chains

1
that satisfy the equation W = Wc + T̄ , where T̄ = (expectation of exponential
μ
distribution with parameter μ).
λ
In this case the utilization factor is . We observe that, as ρ tends to 1, the average
μ
number of clients present in the system and waiting in queue, as well as the average
time spent by a client in the system, all tend to infinity. This is a general characteristics
of random queueing systems. If one tries to increase utilization factor, one has to pay
the price of an increase of the number of clients in queue and of their typical waiting
times. Value 1 for the utilization factor is not reachable by a random queueing system
in stationary regime, but it can be obtained by a deterministic system with one server
where clients arrive at regular time intervals equal to the service time.
Chapter 8
Statistics

We now introduce some basic notions in Bayesian statistics. For further reading, we
refer to [5, 9, 10].

8.1 Bayesian Statistics

Assume that we know the value xi of some characteristics, for example the height for
every individual i of a population i = 1, . . . , N . We can then build up a cumulative
distribution function F(x) defined by

{i| xi ≤ x}
F(x) = .
N

F(x) can be interpreted as the c.d.f. of a random number X , where X is the height of
an individual randomly chosen from the population (every individual is chosen with
equal probability N1 ). Some relevant quantities can be extracted from F(x), such as
the expectation, the variance, the median and others.
F(x) (called empirical c.d.f.) will always be of discrete type, but for large N it is
possible that it is well approximated by an absolutely continuous c.d.f. Similarly for
two quantities xi , yi for example height and weight relative to each individual, we
can obtain the joint c.d.f. F(x, y) defined by

{i| xi ≤ x, yi ≤ y}
F(x, y) = .
N

F(x, y) is the joint c.d.f. of the random vector (X, Y ), where X and Y are respectively
the height and the weight of a randomly chosen individual in the population. Also
in this case relevant indices such as covariance, correlation coefficient, etc. can be

© Springer International Publishing Switzerland 2016 103

F. Biagini and M. Campanino, Elements of Probability and Statistics,
UNITEXT - La Matematica per il 3+2 98, DOI 10.1007/978-3-319-07254-8_8
104 8 Statistics

extracted from F(x, y). The study of empirical c.d.f.’s is part of descriptive statistics
and is obviously related to the study of probability distributions.
Often the data about the entire population we are interested in are not available. In
this case one tries to form an evaluation of the distributions of quantities in the whole
population starting from results obtained by sampling (that is by randomly extracting
a subset of individuals of the population). These methods are part of what is called
statistical inference or statistical induction, in the Bayesian approach, that we shall
follow in this chapter. They are an application of Bayes’ Formula and therefore are
part of Probability Theory. We deal here just with a few relevant examples in which
a model based on some distribution is assumed to be fixed and one makes inference
on one or a certain number of unknown parameters, that in Bayesian approach are
treated as random numbers.

8.2 Conditional Density for Two Random Numbers

We now introduce the conditional density of a random number Y given another

random number X . Let f (x, y) be the joint probability density function of (X, Y )
and f X , f Y the probability density functions of X, Y , respectively. The conditional
probability of the event (a ≤ Y ≤ b) given (x − h ≤ X ≤ x + h) is then given by
x+h b
x−h a f (s, t)dsdt
P(a ≤ Y ≤ b| x − h < X < x + h) = .
P(x − h < X < x + h)

In order to give a meaning to the conditional probability given (X = x), we let h

tend to 0. Assume that f (x, y) satisfies the following conditions:
1. f (x, y) is continuous;
2. f X (x) is continuous.
Then it is easy to see that if f X (x) > 0
b
f (x, t)
lim P(a ≤ Y ≤ b| x − h < X < x + h) = dt.
h→0 a f X (x)

Previous argument justifies the definition of conditional density f X (y|x) of Y given

X = x under the condition f X (x) > 0 as given by

f (x, y)
f Y |X (y|x) =
f X (x)

We obtain then Bayes’ formula for densities. From

f (x, y) = f Y |X (y|x) f X (x)

8.2 Conditional Density for Two Random Numbers 105

and
f (x, y) = f X |Y (x|y) f Y (y),

we get
f X |Y (x|y)
f Y |X (y|x) = f Y (y) .
f X (x)

These formulas generalize to the n-dimensional case. Let X 1 , . . . , X n be random

numbers with joint probability density f (x1 , . . . , xn ). Let {i 1 , . . . , i k } be a proper sub-
set of {1, . . . , n} and assume that the marginal density function f i1 ,...,ik (xi1 , . . . , xik )
of X i1 , . . . , X ik is strictly positive at the point (xi1 , . . . , xik ). Let { j1 , . . . , jn−k } =
{1, . . . , n}\{i 1 , . . . , i k }. Then the conditional density of X j1 , . . . , X n−k given (X i1 =
xi1 , . . . , X ik = xik ), provided that f i1 ,...,xik (xi1 , . . . , xik ) > 0, is defined by

f j1 ,..., jn−k |i1 ,...,ik (x j1 , . . . , x jn−k |xi1 , . . . , xik )

f (x1 , . . . , xn )
= .
f i1 ,...,ik (xi1 , . . . , xik )

As in the two-dimensional case we get Bayes’ formula for densities

f j1 ,..., jn−k |i1 ,...,ik (x j1 , . . . , x jn−k |xi1 , . . . , xik )

f i ,...,i | j ,..., j (xi , . . . , xik |x j1 , . . . , x jn−k ) f j1 ,..., jn−k (x j1 , . . . , x jn−k )
= 1 k 1 n−t 1
f i1 ,...,ik (xi1 , . . . , xik )

This formula is applied to statistical inference in the Bayesian approach that will be
treated in following sections.

8.3 Statistical Induction on Bernoulli Distribution

Let us consider a sequence of events (E i )i=1,2,... stochastically independent condi-

tionally on the knowledge of a parameter Θ such that

P(E i = 1|Θ = θ) = θ

where 0 < θ < 1.

The events E i can be thought of as the result of experiments; their stochastic
independence conditionally on the knowledge of the value of Θ means that

n
P(E 1 = 1 , . . . , E n = n |Θ = θ) = P(E i = i |Θ = θ)
i=1

for any i ∈ {0, 1} for i = 1, . . . , n.

106 8 Statistics

Let Θ have an a priori probability density. We want to find out how the distribution
of Θ changes after n experiments are performed. Assume that the results are E 1 =
1 , . . . , E n = n . The conditional density of Θ given E 1 = 1 , . . . , E n = n is
denoted by
πn (θ|E 1 = 1 , . . . , E n = n )

and it is called a posteriori density. By the composite probability law we have, given
0 ≤ a < b ≤ 1 that

P(θ ∈ [a, b], E 1 = 1 , . . . , E n = n )

P(θ ∈ [a, b]|E 1 = 1 , . . . , E n = n ) = .
P(E 1 = 1 , . . . , E n = n )
(8.1)
By using the formula of total probabilities, that can be easily extended to this
continuous case, and the conditional independence of E 1 , . . . , E n given Θ = θ we
can rewrite the right-hand side of (8.1) as
b
θ1 +···+n (1 − θ)n−(1 +···+n ) π0 (θ)dθ
a
1 .
0 θ1 +···+n (1 − θ)n−(1 +···+n ) π0 (θ)dθ

Therefore we have

πn (θ|E 1 = 1 , . . . , E n = n )
1
= π0 (θ)θ1 +···+n (1 − θ)n−1 −···−n
c

for 0 ≤ θ ≤ 1 where
1
c = P(E 1 = 1 , . . . , E n = n ) = θ1 +···+n (1 − θ)n−1 −···−n π0 (θ) d(θ).
0

In particular, if a priori distribution of Θ is beta B(α, β) with parameters α and β,

the a posteriori distribution will also be beta B(α , β ) with parameters

n
n
α = α + i and β = β + n − i
i=1 i=1

n n
where i=1 i and n − i=1 i are respectively the number of events that have and
have not taken place. Therefore
⎧ Γ (α +β ) α −1
⎨ Γ (α ) Γ (β ) θ (1 − θ)β −1 θ ∈ [0, 1],
πn (θ|E 1 = 1 , . . . , E n = n ) =
⎩
0 otherwise.
8.4 Statistical Induction on Expectation of Normal Distribution 107

8.4 Statistical Induction on Expectation of Normal

Distribution

Let (X i )i = 1,2,... be a sequence of random numbers that are stochastically independent

given the knowledge of a parameter Θ with conditional probability density

1 (x − θ)2
f (x|θ) = √ exp −
σ 2π 2σ 2

for some σ > 0.

By using Bayes’ formula for densities on X 1 , . . . , X n , Θ we get an expression for
the a posteriori density of Θ, i.e. the conditional density given X 1 = x1 , . . . , X n =
xn :
n
π0 (θ) i =1 f (xi |θ)
πn (θ|x1 , . . . , xn ) = pn (x1 , . . . , xn )

n
= K π0 (θ) f (xi |θ),
i =1

where pn (x1 , . . . , xn ) is the marginal density of X 1 , . . . , X n and we have denoted by a

constant K the quantity pn (x1 , . . . , xn )−1 , since it does not depend on θ and can there-
fore thought of as a normalizing constant for the probability density πn (θ|x1 , . . . , xn ).
In the future we shall denote any normalization constant by K , even if its value
changes from one formula to the other, in order not to introduce too many constants.
If the a priori distribution of Θ is Gaussian N (μ0 , σ02 ), we obtain

πn (θ|x1 , . . . , xn ) :=

n
= K π0 (θ) f (xi |θ)
i=1
(θ − μ0 )2 n
−
2 (xi − θ)2
= Ke 2σ0 exp −
i=1
2σ 2

n
1 1 n μ0 i=1 x i
= K exp − + 2 θ − 2θ
2
+
2 σ02 σ σ02 σ2

1 (θ − m n ) 2
= K exp − ,
2 σn2

where
108 8 Statistics
n
μ0 i=1 x i
+

σ02 σ2 1 n −1
mn = , σn2 = +
1 n σ02 σ2
+ 2
σ0
2 σ

x1 +···+xn
and K is the normalizing constant. If x̄ denotes the sample average x̄ = n
,
the a posteriori distribution of Θ is Gaussian

μ0 σ0−2 + x̄nσ −2 1
N , .
σ0−2 + nσ −2 σ0−2 + nσ −2

The expectation can be thought of as a weighted average of μ0 and x̄ with weights

σ0−2 and nσ −2 .

8.5 Statistical Induction on Variance of Normal

Distribution

We consider now statistical induction on the variance of normal distribution. It is

convenient to use as parameter the inverse of the variance, called precision; it is
clear that precision carries the same amount of information as the variance. The
term precision is related to the interpretation of random numbers as measurements
of some quantity. Let (X n )n=1,2,... be a sequence of random numbers stochastically
independent conditionally on the knowledge of the value of the parameter Φ.
Assume that the conditional probability density of each of the X i , given that
(Φ = φ), is equal to
1

φ2 φ
f (x|φ) = f (xi |φ) = √ exp − (x − μ)2 ,
2π 2

where μ is some constant. The conditional density of X 1 , . . . , X n given (Φ = φ)

called the likelihood factor is given by

n
n φ
n
f (xi |φ) = K φ 2 exp − (xi − μ)2

i=1
2 i=1

n nS 2 φ
= K φ exp −
2 ,
2

where
n
i=1 (x i − μ)2
S :=
2
n
8.5 Statistical Induction on Variance of Normal Distribution 109

is the average of the squares of the deviations of the xi ’s from μ. If we assume that
the a priori distribution of Φ is Γ (α0 , λ0 ), then the a posteriori density of Φ, given
that X 1 = x1 , . . . , X n = xn , is given by:

nS 2
2 +α0 −1
n
πn (φ|x1 , . . . , xn ) = K φ exp −φ(λ0 + )
2

for φ > 0 and 0 otherwise. That is the a posteriori distribution of Φ is gamma

2
Γ (α0 + n2 , λ0 + nS2 ).

8.6 Improper Distributions

Let us go back to the induction on the expectation of normal distribution. We want

to describe a vague initial state of information. This can be achieved by choosing an
a priori distribution with large variance. We can let the variance tend to infinity. In
the limit we do not get a probability distribution.
Nonetheless we observe that the corresponding a posteriori distributions converge.
The limiting a posteriori distribution can be alternatively obtained by introducing as
a priori distribution the so called uniform improper distribution density π0 (θ) = K .
This π0 does not correspond to a probability distribution, but it must be interpreted
in terms of the limiting procedure we have just described.

8.7 Statistical Induction on Expectation and Variance

of Normal Distribution

Let us now consider the case of statistical induction on both expectation and variance
of a normal distribution. Assume that we are in a state of vague information, that, as
we have said, can be described by means of an improper distribution. We have now
two unknown parameters Θ and Φ, respectively the expectation and the precision,
that is the inverse of the variance. Since Φ can take only positive values, we con-
sider as a priori distribution an improper uniform distribution for Θ and log Φ. This
corresponds to the improper density:

π0 (θ, φ) = K φ−1 , (φ > 0).

Assume that we have a sequence of random numbers that are stochastically indepen-
dent conditionally on the event that Θ and Φ take some definite values θ and φ and
that their conditional density is:

1 1 φ
f (x|θ, φ) = √ φ exp − (x − θ) .
2 2
2π 2
110 8 Statistics

The conditional joint density of X 1 , . . . , X n , given Θ = θ, Φ = φ, which is called

likelihood factor, is then

φ
n
n
f (x1 , . . . , xn |θ, φ) = K φ 2 exp − (xi − θ)2
2 i=1

n φ
= K φ exp − (x̄ − θ) + νs
2 2 2
,
2
n n
xi (xi − x̄)2
where x̄ = i=1
, ν = n − 1, s = i=1
2
. The joint a posteriori density
n ν
of Θ, Φ is obtained by Bayes’ formula for densities and is given by:

n
−1 φ
πn (θ, φ|x1 , . . . , xn ) = K φ 2 exp − (x̄ − θ) + νs 2 2
.
2

From joint a posteriori probability density of Θ and Φ we can get their marginal
densities by integrating with respect to the other variable. The integral with respect
to φ reduces to the integral of the gamma function. After collecting in the constant
K all factors that do not depend on θ, we obtain
+∞
K
πn (θ|x1 , . . . , xn ) = πn (θ, φ|x1 , . . . , xn )dφ = .
0 ( x̄ − θ) 2 + νs 2

x̄ − θ
From this it follows that the random number T = √ has Student t density with
s ν
ν degrees of freedom

− ν +2 1
t2
f T (t) = K 1 + .
ν

Analogously we obtain the a posteriori density of Φ by integrating the conditional

density πn (θ, φ|x1 , . . . , xn ) with respect to θ. It is a Gaussian integral that, apart from
constant factors, gives a factor φ− 2 . The a posteriori marginal probability density of
1

Φ is
+∞

ν νs 2 φ
πn (φ|x1 , . . . , xn ) = πn (θ, φ|x1 , . . . , xn )dθ = K φ 2 −1 exp − ,
0 2

with φ > 0. By making a linear change of variable we see that the random number
νs 2 Φ has a posteriori distribution with density
ν
u
K u 2 −1 exp − ,
2

with u > 0, i.e. is χ2 -distribution with ν degrees of freedom. The normalizing

1
constant is therefore given by K = ν .
2 2 Γ ( ν2 )
8.8 Bayesian Confidence Intervals and Hypotheses’ Testing 111

8.8 Bayesian Confidence Intervals and Hypotheses’ Testing

A synthetic description of a posteriori distribution can be achieved by means of

confidence intervals or, in the multidimensional case, confidence regions. Given
0 < α < 1, an α-level confidence interval or confidence region is an interval
or respectively a region whose a posteriori probability is 1 − α. The choice of an
interval or a region with this property is clearly arbitrary. In concrete situations one
can base the choice on symmetry criteria if the a posteriori density is symmetric or
alternatively one can choose the region with minimal volume in parameters’ space.
In the Bayesian approach to statistics, hypotheses’ testing can be related to the
definitions of confidence intervals or regions. The hypotheses that parameters have
a given value is rejected if the value does not belong to the confidence interval or
region. This procedure, it must be stressed, is arbitrary, since, as we have said, the
interval or region can be arbitrarily chosen. Nevertheless, since in many situations
there is a preferential choice, the use of hypotheses’ testing in Bayesian approach
can be accepted as a shortened and less precise form of induction with respect to the
complete analysis based on a posteriori distribution.

8.9 Comparison of Expectations for Normal Distribution

Assume that we have two samples of size respectively n 1 and n 2 that, conditionally
on the knowledge that the parameters Θ1 and Θ2 are equal respectively to θ1 and
θ2 , are stochastically independent samples with Gaussian distribution N (θ1 , σ12 ) and
N (θ2 , σ22 ) respectively. If the a priori density of Θ1 and Θ2 is uniform improper,
Θ1 and Θ2 are stochastically independent a posteriori with Gaussian distribution
σ2 σ2
N (x̄1 , n1 ) and N (x̄2 , n2 ) respectively, where x̄1 , x̄2 are the sample averages of the
samples.
Indeed, since the samples are stochastically independent and Θ1 and Θ2 are sto-
chastically independent in the a priori distribution, we can separately apply to the
samples the results on the induction on the expectation of normal distribution in the
case of uniform improper a priori distribution. If we define Θ = Θ2 − Θ1 , then the
σ2 σ2
a posteriori distribution of Θ is in N (x̄2 − x̄1 , 2 + 1 ).
n2 n1
Let us now consider the case when there is an extra parameter Φ such that con-
ditionally on the knowledge that Φ = φ and Θ1 = θ1 , Θ2 = θ2 , the two sam-
ples are stochastically independent with distributions respectively N (θ1 , φ−1 ) and
N (θ2 , φ−1 ). The conditional probability densities of the random numbers of the first
and the second sample are then respectively
112 8 Statistics

1 1 φ
f 1 (x|θ1 , θ2 , φ) = √ φ 2 exp − (x − θ1 )2 ,
2π 2

1 1 φ
f 2 (x|θ1 , θ2 , φ) = √ φ 2 exp − (x − θ2 )2 .
2π 2

Also here we consider the case of improper a priori distribution and precisely we
assume that Θ1 , Θ2 , log Φ are stochastically independent with uniform improper
distribution on R. This corresponds for Θ1 , Θ2 , Φ to an a priori improper density

π0 (θ1 , θ2 , φ) = K φ−1 (φ > 0).

Consider first statistical induction for Φ. Here we can apply without any essential
change what we have seen about the induction for normal distributions with two
unknown parameters and obtain that the a posteriori density of Φ is given by:

2
ν1 +ν2
−1 s φ
Kφ 2 exp − ,
2

where s 2 = ν1 s12 + ν2 s12 with νi = n i − 1, i = 1, 2,

ni
j=1 (x i, j − x̄i )2
si2 = ,
vi

and xi, j is the j-th value of the i-th sample. By combining these results we can obtain
the a posteriori probability density of Θ = Θ2 − Θ1 in the case when Φ is unknown.
Indeed we have:
⎛ ⎞

2
φ ν1 + ν2 s φ
φ exp ⎝− (θ − (x̄2 − x̄1 ))2 ⎠ φ
1
−1
π(θ|x1 , x2 ) = K 2 2 exp − dφ
R+ 2 n11 + 1 2
n2
⎡ ⎤− ν 1 + ν 2 + 1
2
ν1 + ν2 + 1 (θ −
⎣ 2 ( x̄ − x̄ )) 2 ν1 + ν2 + 1
+ s2⎦ −1 −y
1
= K2 2 y 2 e dy
+
1
n1 + 1
n2
R

⎡ ⎤− ν1 + ν2 + 1
2
ν1 + ν2 + 1 ν1 + ν2 + 1 ⎣ (θ − (x̄2 − x̄1 ))2
= K2 2 Γ( ) + s2⎦ ,
2 1
+ 1 n1 n2
⎡ ⎤− ν1 + ν2 +1
2
ν1 +ν2 +1 ν1 + ν2 + 1 ⎣ (θ − (x̄2 − x̄1 ))2
= K (2/s 2 ) 2 Γ( ) + 1⎦
2 s2 1 + 1 n1 n2
8.9 Comparison of Expectations for Normal Distribution 113

x̄2 −x̄1))
(θ−( 2
where we have used the change of variable y = 1
2
+ s 2 φ to express
n +n
1 1
1 2
the integral in terms of a Gamma function. We obtain
⎡ ⎤− ν +2 1
(θ − ( x̄ − x̄ ))2
π(θ|x̄1 , x̄2 ) = K ⎣ + 1⎦
2 1
,
νs 2 n11 + n12

where ν = ν1 + ν2 and K is now a suitable normalization constant. If we define

θ − (x̄2 − x̄1 )
T = 21 ,
s n11 + n12

we see that the a posteriori distribution of T is Student with ν = ν1 + ν2 degrees

of freedom. This allows us to use Student distribution’s table to obtain confidence
intervals for Θ.
Part II
Exercises
Chapter 9
Combinatorics

Exercise 9.1 The game of bridge is played with 52 cards. Compute:

1. The number of different ways a player can receive an handful of 13 cards.
2. The number of different ways the cards can be distributed among 4 players.
3. The number of different ways a player can receive an handful of 13 cards all
different in values. Which is the number of different ways in which all 4 players
receive cards all different in values?
4. The number of different ways a player can receive flush number cards of the same
sign. In how many ways can a player obtain at least 2 cards with equal value?

Solution 9.1 1. The number of different ways a player can receive an handful of
13 cards is given by the simple combinations

52
.
13

Namely one has to choose 13 elements out of 52 without repetitions and without
taking in account of the order.
2. For the first player we have already computed the number of different ways she
can receive an handful of 13 cards. For the second player we can choose 13 cards
out of the 52 − 13 = 39 remaining ones. Analogously for the third player. The
fourth player receives the remaining 13 cards. The number of different ways in
which all 4 players receive cards all different in values is then

52 39 26 13 52!
= .
13 13 13 13 (13!)4

The multinomial coefficient counts the number of ways of making 4 groups of

13 elements each out of a set of 52 cards.
3. Since the cards are all different in values, we can think that they are in increasing
order. For the first card, we can choose one of the four aces. For the second one,
one of the 4 twos, and so on. The number of different ways a player can receive
© Springer International Publishing Switzerland 2016 117
F. Biagini and M. Campanino, Elements of Probability and Statistics,
UNITEXT - La Matematica per il 3+2 98, DOI 10.1007/978-3-319-07254-8_9
118 9 Combinatorics

an handful of 13 cards all different in values is then in

4 · 4 ·· · · · 4 = 4
13

13 times

different ways. If we consider all 4 players, we have that for the second player
the choices for each card will reduce to

3 · 3 ·· · · · 3 = 3
13

13 times

different ways. Then the 4 players receive cards all different in values in

413 · 313 · 213 · 113 = (4!)13

different ways.
4. A player can receive flush number cards of the same sign in 4 different ways,
since there exist flush number cards of 4 different signs. If we consider 4 players,
the number of ways of assigning them flush number cards of the same sign is
given by the number of permutations of the 4 signs, i.e.

4 · 3 · 2 · 1 = 4!.

The number of ways in which a player can obtain at least 2 cards with equal
value is equal to
52
− 413
13

that is the number of all possible choices minus the number of ways of obtaining
an handful of all different cards.

Exercise 9.2 At the ticket counter of a theatre there are available tickets with num-
bers from 1 to 100. The tickets are randomly distributed among the buyers. Four
friends A, B, C, D buy separately a ticket each.
1. Which is the probability that they have received the tickets with numbers
31, 32, 33 and 34?
2. Which is the probability that they have received the tickets 31, 32, 33 and 34 in
this order?
3. Which is the probability that they have received tickets with 4 consecutive num-
bers?
4. Which is the probability that A, B, C receive tickets with a number greater than
50?
9 Combinatorics 119

Solution 9.2 1. To compute this probability we use the formula

favorable cases
.
possible cases

The possible cases are all the ways of choosing 4 numbers out of 100, i.e.

100
.
4

There exists only 1 favorable case, i.e. to choose the numbers 31, 32, 33 and
34. Hence the probability that the three friends have received the tickets with
numbers 31, 32, 33, 34 is given by

1
p = .
100
4

2. Here the number of possible cases is given by

100!
D4100 = .
96!
The probability that the 4 friends receive the tickets 31, 32, 33, 34 in this order
is then
1 96!
p = 100 = .
D4 100!

3. One can obtain tickets with consecutive numbers in

100 − 3 = 97

different ways. We need also to consider the case {97, 98, 99, 100}. The proba-
bility of receiving 4 consecutive tickets is then

97 97!4!
= .
100 100!
4

4. The probability that A, B and C receive tickets with numbers greater than 50 is

50 49 48
p = .
100 99 98
For the first case the are 50 favorable cases (all tickets with number from 51
up to 100) out of 100. For the second ticket there are 49 possibilities out of the
99 tickets left. And so on.
120 9 Combinatorics

Exercise 9.3 A credit card PIN consists of 5 numbers. We assume that every
sequence of 5 digits is generated with the same probability. Compute :
1. The probability that the numbers composing the PIN are all different.
2. The probability that the PIN contains at least 2 numbers which are equal.
3. The probability that the numbers composing the PIN are all different if the first
digit is different from 0.
4. The probability that the PIN contains exactly 2 numbers which are equal, if the
first digit is different from 0.

Solution 9.3 1. A PIN differs from another one if the digits are in different order.
The possible cases are given by
105 .

The favorable cases, when all digits are different, are

10!
D510 = .
5!
The probability that the numbers composing the PIN are all different is then

D510
p1 = .
105
2. The probability that the PIN contains at least 2 numbers which are equal is

10!
p = 1 − p1 = 1 − ,
5! 105
where p1 is the probability that the numbers composing the PIN are different.
3. In this case the number of possible cases is

9 · 10 · 10 · 10 · 10.

For the first digit we have 9 possibilities (all numbers from 1 to 9). We need to
choose the remaining digits without repetitions and taking in account the order:
we have D49 ways. The number of favorable cases is then

9 · 9 · 8 · 7 · 6 = 9 · D49 .

The probability that the numbers composing the PIN are all different if the first
digit is different from 0 is then

9 · D49 D9
= 44 .
9 · 10 4 10
9 Combinatorics 121

4. The possible cases are still given by

9 · 104 .

In order to compute the number of ways in which the PIN contains exactly 2
numbers which are equal, if the first digit is different from 0, we can proceed as
follows:
(a) For the digit that is repeated: without loss of generality we can think that
it is equal to the first digit in the string. There are 9 ways of choosing it
(remember: the 0 is now excluded).
(b) We choose the place of the repeated digit in the string: there are

4
1

positions where it can be placed.

D39

different ways in the string.

In total we have
4
9· · D39
1

possibilities. The procedure illustrated in (a),(b), and (c) will be called in

the sequel the string rule.
(d) If the repeated digit is different from the first one, we have
• 9 ways of choosing the first digit;
• 9 ways
of choosing the repeated digit;
4
• ways of choosing the place in the string;
2
• D2 ways of placing the remaining digits.
8

Totally we have
4
9 · 9 · D2 ·
8
.
2

The total number of favorable cases is then

4 4 4 4 5
9· · D 3 + 9 · 9 · D2 ·
9 8
= 9· + · D3 = 9 ·
9
· D39 ,
1 2 1 2 2
122 9 Combinatorics

where we have used the formula

n n−1 n−1
= + .
r r r −1

Exercise 9.4 Four fair dice are thrown at the same time. Their faces are numbered
from 1 to 6. Compute:
(a) The probability of obtaining four different faces.
(b) The probability of obtaining at least 2 equal faces.
(c) The probability of obtaining exactly 2 equal faces.
(d) The probability that the sum of the faces is equal to 5.
(e) We throw only 2 dice. Compute the probability that the sum of the faces is an
odd number.

Solution 9.4 (a) To compute the probability we use the formula

favorable cases
p= . (9.1)
possible cases

The possible cases are given by

possible cases = 6 · 6 · 6 · 6 = 64 .

This is given by the number of all possible dispositions of 4 elements out of 6.

The favorable cases are given by the simple dispositions of 4 elements out of 6,
since the faces are required to be different from each other:

favorable cases = 6 · 5 · 4 · 3 = D46 .

The probability of obtaining four different faces is then

D46 5
P(all the thrown dies have different faces) = = .
64 18
(b) The probability of obtaining at least 2 equal faces can be computed by using the
probability obtained above, since:

P(the thrown dice have at least 2 same faces)

= 1 − P(all the thrown dice have different faces)
D46 13
=1− 4
= .
6 18
9 Combinatorics 123

c) Also in this case we use the string rule as in Exercise 9.3. The number of the
ways of obtaining exactly 2 equal faces is then:

4
· 6 · D25 ,
2

where

4
= ways of choosing 2 dice with equal faces,
2
6 = ways of choosing the face which is repeated,
D25 = ways of choosing the remaining faces.

Recall that the remaining faces must be different among each other and with
respect to the one which is repeated.
(d) In order to have the sum of the faces equal to 5, the only possibility is that 3
faces present the number 1 and one the number 2, since we are dealing with 4
dice. We compute first the favorable cases. After having chosen the places for
the number 1, it remains only one possibility for the number 2, i.e. we have

4
·1=4
3

favorable cases. The possible cases are given by 64 ways of having a configuration
of 4 dice. Hence the probability that the sum of the faces is equal to 5 is given
by
4
p = 4.
6
(e) The sum of the faces is odd if one of the faces presents an odd number and the
other one an even number. Hence

2
favorable cases = · 3 · 3 = 18,
1

2
where counts the number of ways for a die to come out with an even face.
1
Hence

2 · 32 1
P(the sum of the faces is given by an odd number) = = .
62 2
More simply, one can consider that the sum of the faces can be either odd or
even. Hence:
124 9 Combinatorics

possible cases = 2
favorable cases = 1,

and consequently

1
P(the sum of the faces is given by an odd number) = .
2

Exercise 9.5 Two factories A and B produce garments for the same trademark Y .
For the factory A, 5 % of the garments present some production defect; for the
factory B, 7 % of the garments present some production defect. Furthermore 75 %
of the garments sold by Y derive from the the factory A, while the remaining 25 %
comes from the factory B. We suppose that a garment is chosen randomly with equal
probability among all the garments on sale. Compute:
1. The probability of purchasing a garment of the trademark Y which presents some
production defect.
2. The probability that the garment comes from the factory A, subordinated to the
fact that it presents some production defect.

Solution 9.5 We denote by:

• with A the event

A = {the garment comes from the factory A};

• with B the event

B = {the garment comes from the factory B};

• with D the event

D = {the garment presents some production defect}.

1. The probability of purchasing a garment of the trademark Y which presents some

production defect can be computed with the formula of the total probabilities,
since we do not know whether it comes from factory A or B. Hence

P(D) = P(D|A) P(A) + P(D|B) P(B)

5 75 7 25 11
= + = .
100 100 100 100 200
9 Combinatorics 125

2. The probability that the garment comes from the factory A, if it presents some
production defect, is given by:

P(D|A) P(A) 15
P(A|D) = = .
P(D) 22

This subordinated probability has been computed with Bayes’ Formula.

Exercise 9.6 We consider 3 different elementary schools E, M, S. The percentage
of pupils wearing glasses is 10 % in the school E, 25 % in the school M and 40 % in
the school S. Compute:
1. The probability that by choosing randomly 3 pupils, one out of each school, at
least one of them wears glasses.
2. The probability that a pupil wears glasses, if we randomly choose her or him out
of the three schools (each school can be picked up with the same probability).
3. The probability that the pupil belongs to school E, if she wears glasses.
Solution 9.6 1. The quickest method to compute the probability that by choosing
by chance 3 pupils, one out of each school, at least one of them wears glasses, is
to evaluate the probability that none of them wears glasses. If B is the event that
at least one of the 3 pupils wears glasses, then

P(B) = 1 − P B̃ .

90 75 60 81
In this case P B̃ = = , from which
100 100 100 200
81 119
P(B) = 1 − = .
200 200
2. Let O be the event
O = {the pupil wears glasses}.

The probability of O can be computed by using the formula of the total probability,
since we do not know which school the pupil belongs to. We set
• E = {the pupil belongs to school E};
• M = {the pupil belongs to school M};
• S = {the pupil belongs to school S}.
We then have:

P(O) = P(O|E) P(E) + P(O|M) P(M) + P(O|S) P(S)

1 1 1 1 2 1 1
= + + = .
10 3 4 3 5 3 4
126 9 Combinatorics

Note that we have assumed that each school can be picked up with the same
probability.
3. The probability that the pupil belongs to school E, if she wears glasses, can be
computed by using Bayes’ formula:

P(O|E) P(E) 2
P(E|O) = = .
P(O) 15

Chapter 10
Discrete Distributions

Exercise 10.1 Two friends A and B are playing with a deck of cards consisting of
52 cards, 13 for each sign. They choose out 2 cards each. Player A starts. In order to
win, the player has to be the first to extract the ace of spade or 2 cards of diamonds.
After having chosen the 2 cards, they put the 2 cards back in the deck and mix it.
Compute the probability that:
(a) Player A wins after 3 trials (i.e. after each player has done 2 extractions).
(b) Player A wins, player B wins, nobody wins.
(c) Let T be the random number representing the number of the trial, when one of
the player first wins. Compute the expectation of T .
(d) Which is the probability distribution of T ?

Solution 10.1 (a) The trials of the 2 players can be represented as a sequence of
stochastically independent and equally distributed random trials. The probability
that player A wins after 3 trials (i.e. after each player has done 2 extractions) is
then equal to the probability of first success after

2+2+1=5

trials. The player A wins if she extracts the ace of spade or 2 cards of diamonds.
The probability of this event is given by

13
51 2
p= + (10.1)
52 52
2 2

where we have used the fact that the events are incompatible and that:

© Springer International Publishing Switzerland 2016 127

F. Biagini and M. Campanino, Elements of Probability and Statistics,
UNITEXT - La Matematica per il 3+2 98, DOI 10.1007/978-3-319-07254-8_10
128 10 Discrete Distributions
⎛ ⎞
51 ⎠
1·⎝
1
1. The probability of extracting the ace of spade, is given by ⎛ ⎞ .
⎝
52 ⎠
2
⎛ ⎞
⎝
13 ⎠
2
2. The probability of extracting 2 cards of diamonds, is given by ⎛ ⎞ .
⎝
52 ⎠
2
Let T be the random number representing the first time of success. The proba-
bility that A wins at the third trial is

P(T = 5) = p(1 − p)4 ,

where p is given by (10.1).

(b) If A wins, the game stops with an odd trial. The probability that A wins is then
∞

P(A wins) = P(T = 2k + 1)
k=0
∞
1
= p(1 − p)2k = p .
k=0
1 − (1 − p)2

If B wins, the game stops with an even trial. The probability that B wins is then
∞

P(B wins) = P(T = 2k)
k=1
∞
= p (1 − p)2k−1
k=1

p 1
= − 1
1 − p 1 − (1 − p)2
p (1 − p)2
=
1 − p 1 − (1 − p)2
p(1 − p) 1− p
= = .
1 − (1 − p) 2 2− p

The probability that nobody wins is given by

10 Discrete Distributions 129

P(nobody wins) = 1 − P(A wins) − P(B wins)

∞ ∞
= 1− P(T = 2k + 1) − P(T = 2k)
k=0 k=1
∞

= 1− P(T = k)
k=1
= 0.

(c)–(d) The random number T that represents the time when the game is decided,
has a geometric distribution of parameter p since it denotes the first time of
success in a sequence of stochastically independent and identically distrib-
uted trials. Hence the expectation of T is given by:

52
1 2
P(T ) = = .
p 13
1+
2

Exercise 10.2 Let X, Y be two stochastically independent random numbers with

Poisson distribution with parameters μ and σ, respectively.
1. Let Z = X + Y . Compute the expectation and the variance of Z .
2. What is the set I (Z ) of possible values for Z ?
3. Compute P(Z = i), for i ∈ I (Z ).
4. Compute cov(Z , X,).
5. Let u > 0; compute the generating function φ Z (u) = P(u Z ) of Z .

Solution 10.2 1. By the linearity of the expectation we obtain

P(Z ) = P(X + Y ) = P(X ) + P(Y ) = μ + σ .

To compute the variance, we use the formula of the variance of the sum

σ 2 (X + Y ) = σ 2 (X ) + σ 2 (Y ) + 2 cov(X, Y ) .

Since X, Y are stochastically independent, we have

cov(X, Y ) = 0 .

Hence

σ 2 (X + Y ) = σ 2 (X ) + σ 2 (Y ) = μ + λ .
130 10 Discrete Distributions

2. The set I (Z ) of possible values for Z is given by

I (Z ) = N = {inf(X ) + inf(Y ), . . . } .

3. We now compute the probability distribution of Z . The event {Z = i} can be

written as

{Z = i} = {X = 0, Y = i} + {X = 1, Y = i − 1} + · · · + {X = i, Y = 0}
i
= {X = k, Y = i − k},
k=0

since the events {X = k, Y = i − k} are disjoint for k = 0, . . . , i. By the linearity

of the expectation we obtain

i
P(Z = i) = P(X = k, Y = i − k) .
k=0

Furthermore X, Y are stochastically independent, hence

P(X = k, Y = i − k) = P(X = k) P(Y = i − k),

so that

i
P(Z = i) = P(X = k) P(Y = i − k)
k=0

i
μk σ (i−k) −σ
= e−μ e
k=0
k! (i − k)!

e−(μ+σ)
i
i!
= μk σ (i−k)
i! k=0
k! (i − k)!
(μ + σ)i −(μ+σ)
= e ,
i!
where we have used Newton’s binomial formula. Therefore Z has Poisson distri-
bution with parameter μ + λ.
4. In order to compute the covariance between Z and X , we proceed as follows:

cov(Z , X ) = P(Z X ) − P(Z ) P(X )

= P ((X + Y )X ) − (P(X ) + P(Y )) P(X )
= P(X 2 ) + P(X Y ) − P(X )2 − P(Y )P(X )
= σ 2 (X )
= μ.
10 Discrete Distributions 131

5. For μ > 0, the generating function of Z is given by

φ Z (u) = P(u Z ) = P(u X +Y ) .

Since X, Y are stochastically independent, we have

P(u X +Y ) = P(u X · u Y ) = P(u X ) · P(u Y ) .

We now compute P(u X ) by using the formula for the expectation of a function
of X :
+∞

P(u X ) = u i P(X = i)
i=0
+∞
(uμ)i
= e−μ
i=0
i!
(u−1) μ
=e ,

where in the last step we have used the series:

+∞ i
x
= ex .
i=0
i!

It follows that

φ Z (u) = P(u X ) · P(u Y )

= e(u−1) μ e(u−1) σ
= e(u−1) (μ+σ) .

Since the generating function uniquely identifies the distribution, this proves that
Z has Poisson distribution with the parameter μ + σ.

Exercise 10.3 In a small village with 200 inhabitants, 5 inhabitants are affected by
a particular genetic disease. A sample of 3 individuals is chosen randomly among
the population (all subsets have the same probability of being chosen). Let X be the
number of individuals in the sample who are affected by the disease.
1. Determine the set I (X ) of possible values for X .
2. Determine the probability distribution of X .
3. Compute the expectation and the variance of X .
132 10 Discrete Distributions

Solution 10.3 1. The possible values of X are 0, 1, 2 and 3, i.e. the minimum
number of people affected by the disease in the sample is 0 and the maximum
number is 3.
2. Consider the event {X = i}, i ∈ I (X ). To determine the probability distribution
of X , we need to compute

P(X = i), i ∈ I (X ) .

To this purpose we use the formula

favorable cases
.
possible cases

The number of possible cases is given by the number of ways of choosing 3

people out of 200 inhabitants, i.e.

200
.
3

The number of favorable cases is given by the number of ways of choosing i

people out of the group of inhabitants affected by the disease and (3 − i) people
out of the group of ‘healthy’ people, i.e.

5 195
.
i 3−i

We obtain
5 195
i 3−i
P(X = i) = .
200
3

The distribution of X is then hypergeometric.

3. We can compute directly the expectation of X , since I (X ) consists only of 4
values:

3
P(X ) = i P(X = i)
i=0

1 195 195
= 5 + 20 + 30
200 2 1
3
3
= .
40
10 Discrete Distributions 133

For the variance the computations is analogous. It is sufficient to apply the

formula

σ 2 (X ) = P(X 2 ) − P(X )2

3
and to compute P(X 2 ) = i 2 P(X = i).

i=0

Exercise 10.4 At a horse race there are 10 participants. Gamblers can win if they
correctly predict the first 3 horses in order of arrival. We suppose that all the orders
have the same probability of occurrence and that the gamblers choose independently
of each other and with the same probability the 3 horses on which to bet.
1. Compute the probability that one of the gamblers wins.
2. If the gamblers are 100 in total, let X be the random numbers counting the number
of gamblers who win. Determine I (X ) and P(X = i) for i = 1, 2, 3.
3. Compute expectation and variance of X .
4. Suppose that the gamblers are numbered from 1 to 100. Compute the probability
that there is at least one winner and that the winner with the minimal number has
a number greater or equal to 50.

Solution 10.4 1. The probability that a gambler wins can be computed with
the formula
favorable cases
.
possible cases

In this case, the possible cases are given by the simple dispositions of 3 elements
out of 10. They represent the number of ways of assuming the first 3 positions for
the 10 horses. Only one is the winning triplet, hence the probability of winning
for a gambler is given by

1 7! 1
p = = = .
D310 10! 720

2. If X is the random numbers counting the number of gamblers who win, we can
write

X = E 1 + E 2 + · · · + E 100 ,

where the event E i is verified if the i-th gambler wins. The events E i , i =
1, . . . , 100, are stochastically independent and identically distributed since the
gamblers choose independently of each other and with the same probability the 3
horses on which to bet. Hence X has binomial distribution Bn(n, p) of parameters
1
n = 100 and p = . The set of possible values is then
720
134 10 Discrete Distributions

I (X ) = {0, 1, . . . , 100}

and
i 100−i
100 1 1
P(X = i) = 1− .
i 720 720

In particular, we obtain:
i=1 99
1 719
P(X = 1) = 100 · · .
720 720

i=2 2
100 1 719 98
P(X = 2) = · .
2 720 720

i=3 3
100 1 719 97
P(X = 3) = · .
3 720 720

3. The expectation of X is given by linearity by

P(X ) = P(E 1 + · · · + E 100 )

100
1 5
= P(E i ) = 100 · = .
i=1
720 36

Analogously by the formula of the variance of the sum of n random numbers, we

have:

σ 2 (X ) = σ 2 (E 1 + · · · + E 100 )

100 100
= σ 2 (E i ) + cov(E i , E j )
i=1 i, j=1

0

1 1
= 100 · · 1− .
720 720

Here we have used that the events E i are stochastically independent.

4. In order to have a winner with minimal number greater than or equal to 50, we
need that the first 49 gamblers do not win and that at least one of the gamblers
with number from 50 to 100 wins. Let E be the event that all the all the gamblers
with number from 1 to 49 lose and F the event that at least one of the gamblers
10 Discrete Distributions 135

with number from 50 to 100 wins. The probability that there is at least one winner
and that the winner with the minimal number has a number greater than or equal
to 50 is then

719 49 719 51
P(E F) = P(E)P(F) = 1− ,
720 720

and F
where P(F) = 1 − P( F) is the event that no gambler with number from
50 to 100 wins.

Exercise 10.5 In an opinion poll 100 people are asked to answer a questionnaire
with 5 questions. Each question can be answered only yes or no. For each person the
probability of all possible answers is the same and their choices are stochastically
independent. Let N be the number of interviewed people that answer yes to the first
questions or answer yes at least to 4 questions.
1. Which is the probability distribution of N ?
2. Compute the expectation, the variance and the generating function of N .

Solution 10.5 1. Let E i be the event that the i-th interviewed person has answered
yes to the first questions or yes at least to 4 questions. We can rewrite N as

N = E 1 + E 2 + · · · + E 100 .

The events are stochastically independent and identically distributed since every
person answers independently of the other ones. Furthermore we have assumed
that the probability of all possible answers is the same. It is sufficient to compute
the probability of each E i . We put:
• Fi = {the i-th interviewed person answers yes to the first question};
• G i = {the i-th interviewed person answers yes at least to 4 questions}.
We obtain that E i = Fi ∨ G i e

P(E i ) = P(Fi ) + P(G i ) − P(Fi ∧ G i ) .

The probability of Fi , G i e Fi ∧ G i are given by:

(a)
1 1 1
P(Fi ) = · = .
2 2 4
For all question we have 2 possible cases (yes and no), while for the first
question we have only one possible choice (yes).
(b)
5 5
5 1 5 1
P(G i ) = + .
4 2 5 2
136 10 Discrete Distributions

A person answers yes at least to 4 questions if she answers yes to exactly 4

questions or to exactly 5 questions.
(c)
5 5
3 1 3 1
P(Fi ∧ G i ) = + .
2 2 3 2

In the case the events happen at the same time, we need to choose only the
other 2, respectively 3 questions to which the candidate answer yes.
Finally

1 5 1 1 3 1 1 1 5
P(E i ) = + + 5− 5− 5 = 2+ 4 = .
4 4 2 5 2 2 2 2 2 16

We obtain that I (N ) = {0, . . . , 100} and

i
100 5 5 100−i
P(N = i) = 1− ,
i 16 16

5
i.e. N has binomial distribution Bn(100, 16 ).
2. The expectation of N is given by

100
5 125
P(N ) = iP(E i ) = 100 · = .
i=1
16 4

The variance of N is given by

100
100
σ 2 (N ) = σ 2 (E i ) + cov(E i , E j )
i=1 i, j=1

0

5 5
= 100 · · 1− .
16 16

The generating function of N is given by

φ N (t) = P(t N )
100
i
100 5t 5 100−i
= · · 1−
i 16 16
i=0
100
5t 5
= +1− ,
16 16

where we have used Newton’s binomial formula.

10 Discrete Distributions 137

Exercise 10.6 A box contains 8 balls: 4 white and 4 black. We draw 4 balls. Let E i
be the event that the i-th ball extracted is white. Let X = E 1 + E 2 , Y = E 3 + E 4 .
(a) Compute the joint distribution of X and Y .
(b) Compute P(X ), P(Y ), σ 2 (X ), σ 2 (Y ).
(c) Compute cov(X, Y ), the correlation coefficient ρ(X, Y ). Are X and Y stochas-
tically independent?

Solution 10.6 (a) Consider the random vector (X, Y ). The set of possible values
for (X, Y ) is given by

I (X, Y ) = {(i, j)| i = 0, 1, 2, j = 0, 1, 2} .

To compute the joint distribution of (X, Y ), we need to calculate

P(X = i, Y = j) = P(Y = j| X = i)P(X = i)

for all (i, j) ∈ I (X, Y ). The probability of extracting a white ball in the first 2
extractions is given by

4 4
i 2−i
P(X = i) = .
8
2

8
Here the possible cases are since we consider only the first 2 extractions.
2
Moreover

4−i
4 − (2 − i)
j 2− j
P(Y = j| X = i) =
6
2

4−i 2+i
j 2− j
= .
6
2

After the first 2 extractions, only 6 balls are left in the box. We have to draw 2
more balls, j among the remaining white ones (4 − i) and (2 − j) among the
remaining black ones 4 − (2 − i) = 2 + i. The joint distribution of X and Y
is then
138 10 Discrete Distributions

4−i 2+i 4 4
j 2− j i 2−i
P(X = i, Y = j) = · .
6 8
2 2

(b) To compute P(X ) and P(Y ) we use the fact that the events E i have equal prob-
ability (but they are not stochastically independent!), hence

4
P(X ) = P(E 1 ) + P(E 2 ) = 2 · =1
8
and

P(X ) = P(Y ) = 1 .

The events E 1 and E 2 (and consequently also E 3 e E 4 ) are negatively correlated

with covariance:

cov(E 1 , E 2 ) = P(E 1 E 2 ) − P(E 1 )P(E 2 )

= P(E 2 |E 1 )P(E 1 ) − P(E 1 )P(E 2 )
1
=− .
28
The variance of X is then

σ 2 (X ) = σ 2 (E 1 + E 2 ) = σ 2 (E 1 ) + σ 2 (E 2 ) + 2cov(E 1 , E 2 )
1 1 1 13
= + − = .
4 4 28 28
13
Also in this case σ 2 (Y ) = σ 2 (X ) = .
28
(c) We have:

cov(X, Y ) = cov(E 1 + E 2 , E 3 + E 4 )
= cov(E 1 , E 3 ) + cov(E 1 , E 4 ) + cov(E 2 , E 3 ) + cov(E 2 , E 4 )

1 1
= 4· − =− .
28 7

Here we have used that fact that the covariance is a bilinear function. Finally,
the coefficient of correlation between X and Y is equal to:

1
cov(X, Y ) − 4
ρ(X, Y ) = = 7 =− .
σ(X )σ(Y ) 13 13 13
·
28 28

10 Discrete Distributions 139

Exercise 10.7 Let E 1 , E 2 , F1 , F2 be stochastically independent events with P(E 1 ) =

1 1
P(E 2 ) = , P(F1 ) = P(F2 ) = . Let X = E 1 + E 2 , Y = F1 + F2 .
4 3
(a) Compute the set of possible values and the probability distributions of X and Y .
(b) Compute P(X + Y ), σ 2 (X + Y ).
(c) Compute P(X = Y ), P(X = −Y ).

Solution 10.7 (a) Since E 1 , E 2 are events, i.e. random numbers that can assume
only the values 0 and 1, we have that the set of possible values of X is given by

I (X ) = {0, 1, 2} .

Analogously for Y
I (Y ) = {0, 1, 2} .

To compute the probability distribution of X means that we have to calculate

with which probability X assumes each of the possible values. For example, we
have that

P(X = 0) = P(E 1 + E 2 = 0)
9
= P(E 1 = E 2 = 0) = P( Ẽ 1 )P( Ẽ 2 ) = .
16
Since X is equal to the sum of 2 stochastically independent events with the
same probability, we can immediately say that the distribution of X is binomial
1
Bn(n, p) with parameters n = 2 and p = . Analogously Y has binomial
4
1
distribution Bn(2, ) and we have that
3
i 2−i
2 1 3
P(X = i) = , i = 0, 1, 2
i 4 4
j 2− j
2 1 2
P(Y = j) = , j = 0, 1, 2 .
j 3 3

(b) To compute the expectation, we can use the linearity

P(X + Y ) = P(E 1 + E 2 + F1 + F2 )
= P(E 1 ) + P(E 2 ) + P(F1 ) + P(F2 )
1 1 7
= 2· +2· = .
4 3 6
For the variance, we use the formula of the variance of a sum:

σ 2 (X + Y ) = σ 2 (X ) + σ 2 (Y ) + 2 cov(X, Y ).
140 10 Discrete Distributions

Since X and Y have binomial distribution, we have

1 3 3
σ 2 (X ) = 2 · · = ,
4 4 8
1 2 4
σ 2 (Y ) = 2 · · = .
3 3 9
To compute the covariance between X and Y we use the fact that the events
E 1 , E 2 , F1 , F2 are stochastically independent in the following way:

cov(X, Y ) = cov(E 1 + E 2 , F1 + F2 )
= cov(E 1 , F1 ) + cov(E 1 , F2 ) + cov(E 2 , F1 ) + cov(E 2 , F2 )
= 0.

Hence
3 4 59
σ 2 (X + Y ) = + = .
8 9 72
(c) To compute P(X = Y ) we note that the event

(X = Y )

is given by

(X = Y ) = (X = 0, X = 0) + (X = 1, Y = 1) + (X = 2, Y = 2) .

Hence

2
P(X = Y ) = P(X = i, Y = i)
i=0

2
= P(X = i)P(Y = i)
i=0
2 i 2−i i 2−i
2 1 3 2 1 2
=
i4 4 i 3 3
i=0

2 2
2 1 i 1 2−i
=
i 12 2
i=0

1 2 2 1 i
2
61
= = .
4 i=0 i 6 144
10 Discrete Distributions 141

On the other side the event

(X = −Y )

is verified only if

(X = −Y ) = (X = 0, Y = 0) .

Hence
1
P(X = −Y ) = P(X = 0, Y = 0) = P(X = 0)P(Y = 0) = .

4
Chapter 11
One-Dimensional Absolutely Continuous
Distributions

Exercise 11.1 The random numbers X, Y and Z are stochastically independent with
exponential distribution of parameter λ = 2.
(a) Compute the density of the probability of X + Y and of X + Y + Z .
(b) Let E, F, G be the events E = (X ≤ 2), F = (X + Y > 2), G = (X + Y +
Z ≤ 3). Compute P(E), P(F), P(G) e P(E F).
(c) Determine if E, F and G are stochastically independent.

Solution 11.1 (a) The exponential distribution is a particular case of the gamma
distribution with parameter 1, λ. If X, Y and Z are stochastically independent
random numbers with exponential distribution of parameter λ = 2, i.e. Gamma
distribution Γ (1, 2), we can use the following property of the the sum of sto-
chastically independent random numbers with Gamma distribution

Γ (α, λ) + Γ (β, λ) ∼ Γ (α + β, λ).

Hence W1 = X + Y has distribution Γ (2, 2). We can iterate this procedure and
obtain that
W2 = X + Y + Z = W1 + Z

has distribution Γ (3, 2).

(b) We have:
2
P(E) = P(X ≤ 2) = 2e−2x dx
0
= 1 − e−4 ;

+∞
P(F) = P(X + Y > 2) = 4xe−2x dx
2
+∞
+∞
= −2xe−2x 2 + 2 e−2x dx
2

© Springer International Publishing Switzerland 2016 143

F. Biagini and M. Campanino, Elements of Probability and Statistics,
UNITEXT - La Matematica per il 3+2 98, DOI 10.1007/978-3-319-07254-8_11
144 11 One-Dimensional Absolutely Continuous Distributions
+∞
= 4e−4 + −e−2x 2 = 5e−4 ;
3
P(G) = P(X + Y + Z ≤ 3) = 4x 2 e−2x dx
0
+∞
2 −2x
=1− 4x e dx
3
+∞
+∞
= 1 − −2x 2 e−2x 3 + 4 xe−2x dx
3
−6
= 1 − 25e ,

and

P(E F) = P(X ≤ 2, X + Y > 2)

= P(X ≤ 2, Y > 2 − X ) = P(X ≤ 2, Y > 0)
= P(X ≤ 2)P(Y > 0) = P(X ≤ 2) .

Here we have used the fact that X and Y are assumed to be stochastically inde-
pendent, as well as that the product of 2 events denotes that both conditions must
be simultaneously satisfied.
(c) To determine if E, F, G are stochastically independent, we need to verify all the
following conditions:

P(E F) = P(E)P(F);
P(E G) = P(E)P(G);
P(F G) = P(F)P(G);
P(E F G) = P(E)P(F)P(G) .

If one of them is not verified, then the events are not stochastically independent.
We can immediately see that

P(E F) = P(E)P(F)

by using the results above. Hence the three events are not stochastically inde-
pendent.

Exercise 11.2 Let X be a random number with standard normal distribution. Let
Y = 3X + 2 and Z = X 2 .
1. Compute the c.d.f. and the density of Y .
2. Estimate P(Y ≥ y), where y > 0.
3. Compute the expectation and the variance of Z .
4. Compute the c.d.f. and the density of Z .
11 One-Dimensional Absolutely Continuous Distributions 145

Solution 11.2 1. Put

1 t2
n(t) = √ e− 2
2π

We compute the c.d.f. FY of Y = 3X + 2. Given y ∈ R

FY (y) = P(Y ≤ y) = P(3X + 2 ≤ y) =

y−2 y
y−2 3 1 (z−2)2
P X≤ = n(t)dt = √ e− 18 dz ,
3 −∞ −∞ 3 2π

z−2
where we have used the change of variable t = . The density f Y of Y is
3
obtained by the derivation of FY :

d 1 1 (y−2)2
f Y (y) = FY (y) = √ e− 2·9 .
dy 3 2π

It follows that Y has normal distribution N (2, 9).

2. To estimate the probability P(Y ≥ y), y > 0, we use that

1 x2 1 1 1 1 x2
√ e− 2 − 3 e− 2 ,
≤ P(X ≥ x) ≤
√
2π x x x 2π

if X has standard normal distribution. Since P(Y ≥ y) = P X ≥ y−2

3
for
y > 0, we obtain

1 (y−2)2 3 27 3 1 (y−2)2
√ e− 2·9 − ≤ P(Y > y) ≤ √ e− 2·9 .
2π y − 2 (y − 2)3 y −2 2π

3. The expectation of Z is given by:

+∞
1 x2
P(Z ) = P(X ) = 2
x2 √ e− 2 dx = σ 2 (X ) = 1,
−∞ 2π

where we have used the formula P(ψ(x)) = ψ(x) f X (x) dx. To compute the
variance of Z , we use the formula

σ 2 (Z ) = P(Z 2 ) − P(Z )2 .

It remains to compute
146 11 One-Dimensional Absolutely Continuous Distributions

P(Z 2 ) = P((X 2 )2 ) = P(X 4 )

+∞
1 x2
= x4 √ e− 2 dx
−∞ 2π
+∞ +∞
1 2
− x2 1 x2
= −x √3
e +3 x2 √ e− 2 dx
2π −∞ −∞ 2π
= 3.

4. To compute the c.d.f. FZ of Z , we proceed as above, i.e.

FZ (z) = P(Z ≤ z) = P(X 2 ≤ z) .

Since Z = X 2 is a non negative random number, we can distinguish 2 cases:

(a) for z < 0 we have that FZ (z) = 0;
(b) if z ≥ 0

FZ (z) = P X 2 ≤ z
√ √
=P − z≤X≤ z
√ √
= P X ≤ z −P X ≤− z
√z −√z
1 − t2
2 1 t2
= √ e dt − √ e− 2 dt
−∞ 2π −∞ 2π
√z
1 t2
= √ √ e− 2 dt .
− z 2π

Finally we get ⎧
⎪
⎪ 0 z < 0,
⎨
FZ (z) = √z
⎪
⎪ 1 t2
⎩ −√z √ e− 2 dt z ≥ 0.
2π

To compute the density f Z , we can take the derivative of the c.d.f.. For z ≥ 0
√ −√z
z
d 1 2
− t2 1 2
− t2
f Z (z) = √ e dt − √ e dt
dz −∞ 2π −∞ 2π

1 1 √ 1 1 √
= z − 2 n( z) − − z − 2 n(− z)
2 2
− 21 √
= z · n( z)
1
= z− 2 √ e− 2 .
1 z

2π
11 One-Dimensional Absolutely Continuous Distributions 147

We obtain ⎧
⎪
⎪ 0 z < 0,
⎨
f Z (z) =
⎪
⎪ 1
z − 2 e− 2
1 z
⎩√ z ≥ 0.
2π

Hence Z has Gamma distribution of parameters Γ ( 21 , 21 ), i.e. χ2 -distribution of

parameter 1.

Exercise 11.3 Let X be a random number with exponential distribution with para-
meter λ = 2.
1. Compute the moments of order n of X , i.e. P(X n ), n ∈ N.
2. Consider the family of random numbers Z u = eu X , u < λ. Given a fixed u < λ,
compute the expectation Ψ X (u) = P(eu X ) of Z u . The function Ψ X (u) is called
moment generating function of X .

Solution 11.3 1. The moment of order n ∈ N for X can be computed with the
formula

P(Ψ (x)) = Ψ (x) f X (x) dx ,

for a given function Ψ : R −→ R such that the integral above exists and is finite.
In this case Ψ (x) = x n . We then obtain

+∞
P(X n ) = x n λ e−λx dx
0
+∞
Γ (n + 1)
=λ x n e−λx dx = λ
0 λn+1
n!
= n.
λ
1
In particular for n = 1 we have that P(X ) =.
λ
2. We compute the expectations of Z u = e , u ∈ R.
uX

P(Z u ) = P(eu X )
+∞
= λ eux e−λx dx
0
+∞
= λ e(u−λ)x dx .
0
148 11 One-Dimensional Absolutely Continuous Distributions

Note that here u is a given parameter. The integral is well-defined since u < λ.
We obtain that
λ (u−λ)x +∞ λ
P(Z u ) = e = .
u−λ 0 u−λ

Exercise 11.4 The random number X has uniform distribution on the interval
[−1, 1].
(a) Write the density of X .
Let Z = log |X |.
(b) Compute I (Z ) e P(Z ).
(c) Compute the c.d.f. and the density of Z .
(d) Calculate P(Z < − 21 |X > − 21 ).

Solution 11.4 (a) The density of X is equal to

1
per x ∈ (−1, 1),
f (x) = 2
0 otherwise.

(b) The random number X has as set of possible values

I (X ) = [−1, 1],

hence the set of possible values for Z = log |X | is given by

I (Z ) = (−∞, 0] .

The random number Z is not defined if X assumes the value 0 ∈ I (X ). To

compute the expectation we can proceed as follows:

P(Z ) = P(Z |X > 0)P(X > 0) + P(Z |X < 0)P(X < 0)

1 1
= P(log X |X > 0) · + P(log(−X )|X < 0) · ,
2 2
where we have used the fact that
1
P(X > 0) = P(X < 0) = .
2
Verify this by direct computation!
11 One-Dimensional Absolutely Continuous Distributions 149

We need only to calculate

1
P(log X |X > 0) = log xdx (11.1)
0
= [x log x − x]10 = −1 ,

0
P(log(−X )|X < 0) = log(−x)dx (11.2)
−1
1
= log ydy = −1,
0

hence
P(Z ) = P(log X ) = −1 .

FZ (z) = P(Z ≤ z)
= P(Z ≤ z, X > 0) + P(Z ≤ z, X < 0) .

If z ≥ 0, then FZ (z) = 1. Let z < 0. We obtain:

FZ (z) = P(log X ≤ z, X > 0) + P(log(−X ) ≤ z, X < 0) .

We now compute

P(log X ≤ z, X > 0) = P(X ≤ e z , X > 0) (11.3)

= P(0 < X ≤ e z )
ez
1 1
= dx = e z
0 2 2

and

P(log(−X ) ≤ z, X < 0) = P(X ≥ −e z , X < 0) (11.4)

= P(−e z ≤ X < 0)
0
1 1
= dx = e z .
−e z 2 2

Hence
FZ (z) = e z if z < 0 .
150 11 One-Dimensional Absolutely Continuous Distributions

The density of Z is given by

e z for z < 0,
f Z (z) =
0 otherwise .

1 1
(d) We evaluate P(Z < − |X > − ) by using the formula of the conditional prob-
2 2
ability:

1 1 P Z < − 21 , X > − 21
P Z < − X > − = ,
2 2 P X > − 21

where

1 1 1 1
P Z <− ,X >− = P log |X | < − , X > − =
2 2 2 2

1 1 1
P log X < − , X > 0 + P log(−X ) < − , − < X < 0 ,
2 2 2

Here we have used that

1 1
X >− = (X > 0) + − < X < 0 .
2 2

It follows that

e− 21
1
P log X < − , X > 0 = P 0 < X < e− 2 =
1

2 2

and furthermore

1 1 − 21 1
P log(−X ) < − , − < X < 0 = P X > −e , − < X < 0
2 2 2
0
1 1 1
=P − <X <0 = dx = .
2 −2 2
1 4

Finally
1 1 1 1 1
P Z < − |X > − = √ + .
2 2 2 e 2
Chapter 12
Absolutely Continuous and Multivariate
Distributions

Exercise 12.1 Let X be the random number with density

K x 2 for − 1 ≤ x ≤ 1,
f (x) =
0 otherwise.

(a) Compute K .
(b) Compute the c.d.f., the expectation and the variance of X .
(c) Let Y be a random number which is stochastically independent and has expo-
nential distribution with parameter λ = 2. Write the joint density function and
the joint c.d.f. of (X, Y ).

Solution 12.1 (a) The normalization constant K is such that

1
K x 2 dx = 1 .
−1

Hence
1 3
K = 1 = .
x 2 dx 2
−1

(b) The c.d.f. of X is given by

x
F(x) = P(X ≤ x) = f (t)dt .
−∞

© Springer International Publishing Switzerland 2016 151

F. Biagini and M. Campanino, Elements of Probability and Statistics,
UNITEXT - La Matematica per il 3+2 98, DOI 10.1007/978-3-319-07254-8_12
152 12 Absolutely Continuous and Multivariate Distributions

Hence ⎧
⎪
⎪ 0 for x ≤ 1,
⎨ x
3 2 1 3
F(x) = t dt = (x + 1) for x ∈ [−1, 1]
⎪
⎪ −1 2 2
⎩
1 for x ≥ 1.

Furthermore the expectation of X is equal to

1
3 3
P(X ) = t f (t)dt = x dx = 0 .
R −1 2

The variance is given by

σ 2 (X ) = P(X 2 ) − P(X )2
1
3
= P(X ) =
2
x 2 · x 2 dx
−1 2

3 1 4 3
= x dx = .
2 −1 5

(c) The density of Y is given by

⎧ −2y
⎨ 2e for y ≥ 0,
g(y) =
⎩
0 otherwise.

If X and Y are stochastically independent, then the joint density is given by the
product of the marginal densities:
⎧
⎪ 3
⎨ 2e−2y x 2 = 3e−2y x 2 for x ∈ [−1, 1] and y ≥ 0,
f (x, y) = f X (x)gY (y) = 2
⎪
⎩
0 otherwise.

Analogously the joint c.d.f. coincides with the product of the marginal distribu-
tion functions:
⎧
−2y x + 1
3
⎪
⎪
⎪ (1 − e )
⎪ for x ∈ [−1, 1] and y ≥ 0,
⎪
⎨ 2
F(x, y) = FX (x)FY (y) = −2y
⎪
⎪ 1−e for x > 1 and y ≥ 0,
⎪
⎪
⎪
⎩
0 otherwise.
12 Absolutely Continuous and Multivariate Distributions 153

Exercise 12.2 Let (X, Y ) be a random vector with uniform distribution on the disk
of radius 1 and center at the origin of the axes.
1. Compute the joint density function f (x, y) of (X, Y ).
2. What is the marginal density f X of X ?
1
3. Let Z = X 2 + Y 2 , compute P( ≤ Z ≤ 1).
4
4. Compute the c.d.f. and the density of Z .

Solution 12.2 1. Since (X, Y ) have uniform distribution on the disk

D1 = (x, y) : x 2 + y 2 ≤ 1 .

the joint density f (x, y) is constant on D1 and 0 outside. We obtain that

⎧
⎪ 1
⎨ area1 D1 = for (x, y) ∈ D1 ,
f (x, y) = π
⎪
⎩
0 otherwise.

The density domain is shown in Fig. 12.1.

The value of the density f on D1 can be determined by imposing that

1 = f (x, y) dx dy = c dx dy,
R2 D1

i.e.
1 1 1
c = = = .
D1 dx dy area D 1 π

2. To compute the marginal density of X , we distinguish 4 cases as follows.

Fig. 12.1 Representation of

y
the area D1 on the plane

O x
154 12 Absolutely Continuous and Multivariate Distributions

Fig. 12.2 Case 0 ≤ x ≤ 1

O x x

Fig. 12.3 Case −1 ≤ x ≤ 0

x O x

• Case x > 1: f X (x) = 0.

• Case 1 ≥ x ≥ 0: set the x coordinate; y varies along the line orthogonal to
the x-axis and passing through (x, 0). The extremes are the points where this
line intersects the graph of D1 as shown in Fig. 12.2. We obtain:
√ √ √
1−x 2 1−x 2
1 2 1 − x2
f X (x) = √ f (x, t) dt = √ dt = .
− 1−x 2 − 1−x 2 π π

• Case −1 ≤ x < 0: by symmetry, we obtain as shown in Fig. 12.3, that

√
2 1 − x2
f X (x) = .
π
• Case x < −1: also here we have f X (x) = 0.
12 Absolutely Continuous and Multivariate Distributions 155

Summing up:
⎧ √ 2
⎨ 2 1−x
π
for x ∈ [−1, 1]
f X (x) =
⎩
0 otherwise.

3. Let Z = X 2 + Y 2 ; to compute P 41 ≤ Z ≤ 1 is equivalent to calculate the prob-
ability that the random vector (X, Y ) belongs to the region A of the plane between
the disk with center O and radius 21 and the disk with center O and radius 1, i.e.

1 1
P ≤Z ≤1 = P ≤ X2 + Y 2 ≤ 1 .
4 4

Hence

1
P ≤ X +Y ≤1 =
2 2
f (x, y) dx dy .
4 A

We can compute this integral by passing to the polar coordinates

x = ρ cos θ, y = ρ sin θ .

To perform the change of variables in the integral, we need to take account of

the absolute value of the Jacobian determinant (Fig. 12.4). In the case of polar
coordinates, this is equal to
|J | = ρ .

Fig.
12.41 Area2 of the region

y
(x, y)| 4 ≤ x + y 2 ≤ 1

O x
156 12 Absolutely Continuous and Multivariate Distributions

It follows that

f (x, y) dx dy = f (ρ, θ) dρ dθ
A(x,y) A(ρ,θ)

2π 1
1 1 1 3
= dθ dρ = 2ρ dρ = ρ2 1 = .
0 1
2
π 1
2
2 4

4. To compute the c.d.f. FZ (z) of Z we use again spherical symmetry.

• z < 0: In this case FZ (z) = 0.
• 1 ≥ z ≥ 0:

FZ (z) = P(Z ≤ z)
= P(X 2 + Y 2 ≤ z)

= f (x, y) dx dy,
Dz

where Dz = {(x, y) : x 2 + y 2 ≤ z}. It follows that

√ √
2π z
1 z √z
FZ (z) = ρ dρ dθ = 2ρ dρ = ρ2 0 = z .
0 0 π 0

• z > 1: In this case FZ (z) = P X 2 + Y 2 ≤ z = 1.
Summing up:
⎧
⎪
⎪ 0 for z < 0,
⎪
⎪
⎨
FZ (z) = z for 0 ≤ z < 1,
⎪
⎪
⎪
⎪
⎩
1 for z > 1 .

The density function of Z is given by

⎧
⎨ 1 for 0 ≤ z ≤ 1,
f Z (z) =
⎩
0 otherwise.

The random number Z has therefore a uniform density in [0, 1].

Exercise 12.3 Let (X, Y ) be a random vector with joint density

⎧
⎨ k x y (x, y) ∈ T,
f (x, y) =
⎩
0 otherwise.
12 Absolutely Continuous and Multivariate Distributions 157

where T = {(x, y) ∈ R2 | 0 ≤ y ≤ −x + 2, 0 < x < 2}.

1. Compute the normalization constant k.
1
2. Compute the probability P(X > 1, Y < ) and the conditional probability P(X >
2
1
1|Y < ).
2
3. Let Z = X + Y . Compute the probability that P(0 < Z < 1).
4. Compute the p.d.f. and the density of Z .

Solution 12.3 1. To compute the normalization constant k we impose that

f (x, y) dx dy = 1 .
R2

The integral of f can be computed by using Fubini-Tonelli Theorem:

2 −x+2
f (x, y) dx dy = k x y dy dx
R2 0 0

2 −x+2 2
y2 1
=k x dx = k x (2 − x)2 dx
0 2 0 0 2

2 2
k k 4 1 2
= (4x − 4x + x ) dx =
2 3
2x − x 3 + x 4
2
= k.
2 0 2 3 4 0 3

It follows that
3
k = .
2
1
2. The probability P(X > 1, Y < ) is given by the integral of the joint density on
2
the region D given by the intersection

1
D = {(x, y) ∈ R2 | x > 1, y < }∩T ,
2
see Figs. 12.5 and 12.6.
To find the extremes, it is easier this time to fix y and let x vary. The extremes
are given by the intersection of the border of D with the line passing in (0, y)
which is parallel to the x-axis, as we can see in Fig. 12.7.
158 12 Absolutely Continuous and Multivariate Distributions

Fig. 12.5 Representation of y

the area T on the plane

O x

Fig. 12.6 Representation of

y
the area D on the plane

O x

E = {(x,y)| x>1, y<1/2}

Fig. 12.7 Extremes of

y
variation of x

O x
12 Absolutely Continuous and Multivariate Distributions 159

1
P(X > 1, Y < )= f (x, y) dx dy
2 D
1 −y+2
2 3
= y x dx dy
0 1 2
1 −y+2
2 3 2
= y x dy
0 4 1
1
3 2
= y (3 − 4y + y 2 ) dy
4 0
1
3 3 2 4 3 1 4 2
= y − y + y
4 2 3 4 0
43
= .
256
1
The conditional probability P(X > 1|Y < ) can be obtained as follows:
2

1 P(X > 1, Y < 21 )

P(X > 1|Y < ) = .
2 P(Y < 21 )

We simply need to compute P(Y < 21 ). To this purpose we do not necessarily

need to know the marginal density of Y . This probability is given by the integral
of the joint probability f (x, y) on the domain D1 given by the intersection of
E 1 = {(x, y) ∈ R2 | y < 21 } and of T , i.e.

D1 = E 1 ∩ T ,

see Fig. 12.8.

We can obtain the probability that Y is less than 21 by computing the joint prob-
ability that there are no restrictions on X and that Y is less than 21 . We obtain:

1
P(Y < ) = f (x, y) dx dy
2 D1
1 −y+2 1
2 3 3 2
= y x dx dy = y (4 − 4y + y 2 ) dy
0 20 4 0
21
3 4 1 67
= 2y 2 − y 3 + y 4 = .
4 3 4 0 256

The conditional probability is then given by

1 P(X > 1, Y < 21 ) 43

P(X > 1|Y < ) = = .
2 P(Y < 21 ) 67
160 12 Absolutely Continuous and Multivariate Distributions

Fig. 12.8 Representation of

y
the area D1 on the plane

O x

3. We now consider the random number Z = X + Y . To compute the probability

P(0 < Z < 1) we can use the joint density of (X, Y ). We obtain

P(0 < Z < 1) = P(0 < X + Y < 1)

= P(−Y < X < 1 − Y )
= P(0 < X < 1 − Y ).

Note that in this case X and Y are both positive, hence the condition X > −Y
reduces to X > 0. In Fig. 12.9 we represent the region where the integral of the
joint density of X, Y must be calculated to obtain P(0 < Z < 1).

Fig. 12.9 Region where

y
0<Z <1

O x
X+Y=1
12 Absolutely Continuous and Multivariate Distributions 161
1 1−y
3
P(0 < X < 1 − Y ) = y x dx dy
0 2 0

3 1
= y (1 − y)2 dy
4 0

3 1
= (y − 2y 2 + y 3 ) dy
4 0

3 1 2 2 3 1 4 1
= y − y + y
4 2 3 4 0
1
= .
16

4. The m.d.f. of Z is given by

FZ (z) = P(Z ≤ z) = P(X + Y ≤ z) = P(X ≤ z − Y ) .

If we consider the line x + y − z = 0, the distribution function of Z is given by

the integral of the joint density of X, Y on the region R delimited by this line on
T , as shown by Fig. 12.10.
We obtain:
• for z < 0: P(Z < z) = 0;
• for z > 2: P(Z < z) = 1;
• for 0 ≤ z ≤ 2:

Fig. 12.10 Region R

O x

X+Y=Z
162 12 Absolutely Continuous and Multivariate Distributions
z z−y
3
P(Z < z) = y x dx dy
2
0
0
3 z
= y (z − y)2 dy
4 0

3 z 2
= (z y − 2zy 2 + y 3 ) dy
4 0

3 1 2 2 2 3 1 4 z
= z y − zy + y
4 2 3 4
0
3 1 4 2 4 1 4
= z − z + z
4 2 3 4
z4
= .
16

Summing up:
⎧
⎪ 0 for z < 0,
⎪
⎪
⎪
⎪
⎨ 4
z
FZ (z) = for 0 ≤ z ≤ 2,
⎪
⎪ 16
⎪
⎪
⎪
⎩
1 for z > 2 .

The density can be obtained by deriving the distribution function

⎧
⎪ 0 for z < 0,
⎪
⎪
⎪
⎪
⎨ 3
z
f Z (z) = for 0 ≤ z ≤ 2,
⎪
⎪ 4
⎪
⎪
⎪
⎩
0 for z > 2,

or by means of the formula

f Z (z) = f (x, z − x)d x .
R

Exercise 12.4 Let X, Y be two random numbers with joint distribution function

K x for y ≤ x ≤ y + 1, 0 ≤ y ≤ 2,
f (x, y) =
0 otherwise.

(a) Compute K .
(b) Compute the m.d.f. and the expectation of X .
12 Absolutely Continuous and Multivariate Distributions 163

1 2

−1

Fig. 12.11 Region R of definition of the density

(c) Compute cov(X, Y ).

(d) Compute P(0 < X − Y < 1).

Solution 12.4 (a) As in previous exercises, first we draw the picture of the region
R of definition of the joint density, as shown by Fig. 12.11.
Since the integral of a density must be equal 1, the constant of normalization is
given by
1
K = ,
R2 xdxdy

where
2 y+1
xdxdy = dy xdx
R2 0 y
2 y+1
x2
= dy
0 2 y

(y + 1)2
2
y2
= − dy
0 2 2
1 2
= (y + 1)3 − y 3 0 = 3 .
6
164 12 Absolutely Continuous and Multivariate Distributions

Fig. 12.12 Extremes of variation y

1
We conclude that K = .
3
(b) To compute the marginal density of X we apply the formula

f X (x) = f (x, y)dy .
R

To find the extremes of integration, we apply the general method as shown in

Fig. 12.1.
We have to pay attention, since the expressions for the extremes of integration
vary if 0 < x < 1, 1 < x < 2, 2 < x < 3 (see Fig. 12.12).
We have that if 0 < x < 1, then y varies between the lines

y=0 e y=x.

If 1 < x < 2, then y varies between the lines

y = x −1 e y = x.

If 2 < x < 3, then y varies between

y = x − 1 and y = 2.
12 Absolutely Continuous and Multivariate Distributions 165

• For 0 < x < 1: x

1 1
f X (x) = xdy = x 2 .
0 3 3

• For 1 < x < 2: x

1 1
f X (x) = xdy = x .
x−1 3 3

• For 2 < x < 3: 2

1 1
f X (x) = xdy = x(3 − x) .
x−1 3 3

Summing up:
⎧
⎪ 1 2
⎪
⎪ x for 0 < x < 1,
⎪3
⎪
⎪
⎪
⎪
⎪
⎪
⎪ 1
⎪
⎨ x for 1 < x < 2,
f X (x) = 3
⎪
⎪1
⎪
⎪
⎪
⎪ x(3 − x) for 2 < x < 3,
⎪
⎪
⎪
⎪ 3
⎪
⎪
⎩
0 otherwise.

We now verify that f X (x) is a probability density. We need to have that

f X (x)dx = 1,
R

Indeed
1 2 3
1 2 1 1
x dx + xdx + x(3 − x)dx =
0 3 1 3 2 3
2 3
1 3 1 1 2 2 x x3
= x + x + −
9 0 6 1 2 9 2
= 1.

The expectation of X is given by:

166 12 Absolutely Continuous and Multivariate Distributions

P(X ) = x f (x)dx
R
1 2 3
1 1 1
= x x 2 dx + x xdx + x x(3 − x)dx
0 3 1 3 2 3
1 2 3 4 3
1 4 1 3 x x
= x + x + −
12 0 9 1 3 12 2
16
= .
9
(c) The covariance cov(X, Y ) is given by:

cov(X, Y ) = P(X Y ) − P(X )P(Y ),

where

P(X Y ) = x y f (x, y)dxdy =
R2
2 y+1 2
1 1
= dy x y xdx = y (y + 1)3 − y 3 dy
0 y 3 0 9
2
1 4 1 3 1 2 22
= y + y + y = .
12 9 18 0 9

To compute the expectation of Y, we do not need to compute the marginal

distribution of Y . In fact it holds that

P(Y ) = y f Y (y)dy
R
= y f (x, y)dx dy
R R
= y f (x, y)dxdy .
R

Hence

P(Y ) = y f (x, y)dxdy
R
2 y+1
1
= dy x ydx
0 y 3
2
1
= y (y + 1)2 − y 2 dy
0 6
3 2
y y2 5
= + = .
6 12 0 3
12 Absolutely Continuous and Multivariate Distributions 167

Fig. 12.13 The region R1

We obtain
22 11 5 11
cov(X, Y ) = − × = ,
9 12 3 12
i.e. X and Y are positively correlated.
(d) To compute P(0 < X − Y < 1), we note that

P(0 < X − Y < 1) = P(Y < X < Y + 1) = 1 ,

since the region

R1 = {(x, y)| y < x < y + 1}

contains entirely the domain of definition of the density, see Fig. 12.13.

Exercise 12.5 Let X, Y be two stochastically independent random numbers with

the following marginal density:

K (x 3 − 1) for 1 ≤ x ≤ 2,
f (x) =
0 otherwise.

(a) Compute K .
(b) Compute the joint c.d.f., the expectation, the variance and the covariance of X
and Y .
168 12 Absolutely Continuous and Multivariate Distributions

(c) Let Z = X 2 . Compute the c.d.f., the expectation and the variance of Z .
(d) Compute the correlation coefficients ρ(X, Z ), ρ(X + Y, Z ).

Solution 12.5 (a) Since K is the normalization constant, we obtain

1 1 4
K = 2 = 2 = .
0 (x 3 − 1)dx x4
−x
11
4 1

(b) Since the random numbers X and Y are stochastically independent, their joint
c.d.f. is given by the product of the marginal c.d.f.’s:

F(x, y) = P(X ≤ x, Y ≤ y) = FX (x)FY (y) .

It is sufficient to compute
x
F(x) = P(X ≤ x) = f (t)dt .
−∞

We obtain
⎧
⎪
⎪ 0
⎪ x < 1,
⎪
⎨
x
4 3 4 x4 3
(t − 1)dt = −x+ x ∈ [1, 2],
FX (x) = 11 11 4 4
⎪
⎪
1
⎪
⎪
⎩
1 x ≥ 2.

Hence
⎧
⎪
⎪ 0 for x < 1 or y < 1,
⎪
⎪
⎪
⎪ 4 2 4
⎪
⎪
⎪
⎪ (x − x + 43 )(y 4 − y + 43 ) for (x, y) ∈ [1, 2] × [1, 2],
⎪
⎪
11
⎨
F(x, y) = 4
(x 4 − x + 34 ) for x ∈ [1, 2], y > 2,
⎪
⎪ 11
⎪
⎪
⎪
⎪ 4 (y 4 − y + 3 )
⎪
⎪ for x > 2, y ∈ [1, 2],
⎪
⎪ 11 4
⎪
⎪
⎩
1 for x > 2, y > 2 .

Since X and Y are stochastically independent, we have immediately

cov(X, Y ) = 0 .
12 Absolutely Continuous and Multivariate Distributions 169

Finally, we compute the expectation and the variance as follows:

1. Expectation

P(X ) = P(Y ) = t f (t)dt
R
2 2
4 4 t5 t2 94
= t (t 3 − 1)dt = − = .
11 1 11 5 2 1 55

2. For the variance, we need first to calculate

2
4
P(X 2 ) = P(Y 2 ) = t 2 (t 3 − 1)dt
11 1
3 2
4 t6 t 98
= − = .
11 6 3 1 33

Hence 2
98 94
σ 2 (X ) = P(X 2 ) − P(X )2 = − .
33 55

(c) We now compute the c.d.f. of Z = X 2 :

FZ (z) = P(Z ≤ z) = P(X 2 ≤ z) .

√
For z < 1, we have immediately FZ (z) = 0. For 1 ≤ z < 4, i.e. for 1 ≤ z < 2,
we have
√ √
FZ (z) = P(Z ≤ z) = P(X 2 ≤ z) = P(− z ≤ X ≤ z) =
√z √z
4 4 t4 1 2 √
(t − 1)dt =
3
−t = (z − 4 z + 3) .
11 1 11 4 1 11

For z ≥ 4, FZ (z) = 1. Summing up:

⎧
⎪
⎪ 0 for z < 1,
⎪
⎨ 1 2 √
(z − 4 z + 3) for z ∈ [1, 4],
FZ (z) = 11
⎪
⎪
⎪
⎩
1 for z ≥ 4.

The expectation of Z coincides with the expectation of X 2 , i.e.

98
P(Z ) = P(X 2 ) = .
33
170 12 Absolutely Continuous and Multivariate Distributions

To compute the variance, we note that

2
4 4 3
P(Z ) = P(X ) =
2 4
t (t − 1)dt =
1 11
8 5 2
4 t t 1027
− = .
11 8 5 1 110

Hence the variance is given by

2
1027 98
σ (Z ) = P(Z ) − P(Z ) =
2 2
− 2
.
110 33

(d) We now compute the correlation coefficient ρ(X, Z ):

cov(X, Z )
ρ(X, Z ) = .
σ(X )σ(Z )

Since we have already determined σ 2 (X ), σ 2 (Z ), we have immediately the mean

square deviations σ(X ), σ(Z ). It remains to compute

cov(X, Z ) = P(X Z ) − P(X )P(Z )

= P(X 3 ) − P(X )P(Z )
2
4 94 98
= t 3 (t 3 − 1)dt − ×
11 1 55 33
2
4 t7 t4 94 98 403 94 98
= − − × = − × .
11 7 4 1 55 33 77 55 33

Finally to obtain the correlation coefficient ρ(X + Y, Z ) we simply note that

ρ(X + Y, Z ) = ρ(X, Z ) + ρ(Y, Z ) = ρ(X, Z ) ,

since X and Y (hence Z and Y ) are stochastically independent.

Exercise 12.6 The random numbers X and Y are stochastically independent. The
probability density f X (x) of X is given by:

2x for 0 ≤ 1,
f X (x) =
0 otherwise,

while the probability density of Y is given by

e−y for y ≥ 0,
f Y (y) =
0 otherwise .
12 Absolutely Continuous and Multivariate Distributions 171

(a) Compute P(X ), P(Y ), σ 2 (X ), σ 2 (Y ).

(b) Determine the joint c.d.f. and the joint density of (X, Y ).
(c) Let Z = X + Y . Compute P(Z ), σ 2 (Z ) the c.d.f. and the density of Z .

Solution 12.6 (a) We compute the first moments of X and Y :

1
2
P(X ) = x f X (x)dx = 2x 2 dx = ;
R 0 3
1
4 1
σ 2 (X ) = P(X 2 ) − P(X )2 = 2x 3 dx − = .
0 9 18

The random number Y has exponential density of parameter λ = 1, hence we

can immediately write

1 1
P(Y ) = = 1, σ 2 (Y ) = = 1.
λ λ2
(b) The random numbers X and Y are stochastically independent, hence their joint
density is equal to
f (x, y) = f X (x) f Y (y),

i.e.
2xe−y for 0 ≤ x ≤ 1 and y ≥ 0,
f (x, y) =
0 otherwise.

We compute the joint c.d.f.

x y
F(x, y) = f (s, t)dsdt
−∞ −∞

after having identified the domain D of definition of the joint density as shown
in Fig. 12.14.
We obtain that
⎧ x y −t −y
0 0 2se dsdt = x (1 − e ) for 0 ≤ x ≤ 1 e y ≥ 0,
2
⎪
⎪
⎪
⎪
⎨
F(x, y) = 1 y −t −y
⎪ 0 0 2se dsdt = 1 − e for x > 1 e y ≥ 0,
⎪
⎪
⎪
⎩
0 otherwise.

(c) Consider now Z = X + Y . To compute P(Z ) e σ 2 (Z ) we use:

(i) the linearity property of the expectation:
2 5
P(Z ) = P(X ) + P(Y ) = +1= ;
3 3
172 12 Absolutely Continuous and Multivariate Distributions

Fig. 12.14 The domain D of

definition of the joint density

0 1

(ii) the formula for the variance of the sum of 2 random numbers:

σ 2 (Z ) = σ 2 (X + Y ) = σ 2 (X ) + σ 2 (Y ) + 2cov(X, Y )
19
= σ 2 (X ) + σ 2 (Y ) = .
18
To compute the distribution function of Z = X + Y , we use the fact that

FZ (z) = P(Z ≤ z) = P(X + Y ≤ z)

= P(Y ≤ z − X )

= f (s, t)dsdt ,
Dz

where for all fixed z, Dz is the region of the plane determined by the intersection
of the domain D of definition of the density and of the semi-plane

Sz = {(x, y)|y ≤ z − x}.

Figures 12.15 and 12.16 show the region intersected by Sz on D when z varies.
We obtain that:
(i) for z < 0, Fz (z) = 0;
(ii) for 0 < z < 1,
z z−x
FZ (z) = 2x e−y dydx

0 0
z
= 2x 1 − e−(z−x) dx = z 2 + 2(1 − z) − 2e−z ;
0
12 Absolutely Continuous and Multivariate Distributions 173

0 1

Fig. 12.15 Case 0 < z < 1

Fig. 12.16 Case z > 1

0 1
174 12 Absolutely Continuous and Multivariate Distributions

(iii) for z > 1

1 z−x
FZ (z) = 2xe−y dydx =
0 0
1
2x 1 − e−(z−x) dx = 1 − 2e−z .
0

We obtain the density of Z by deriving the c.d.f. of Z , i.e.:

⎧
⎨ 2z − 2 + 2e−z for 0 ≤ z < 1,
f Z (z) = 2e−z for z > 1,
⎩
0 otherwise.

Exercise 12.7 The random numbers X and Y have bidimensional Gaussian density

1 − 1 (x 2 +y 2 )
p(x, y) = e 2 .
2π
Let U = 2X + 3Y and V = X − Y . Compute:
1. The covariance matrix of U and V .
2. The joint density of U and V .
Solution 12.7 1. We compute the covariance matrix of U and V :
⎛ ⎞
σ 2 (U ) cov(U, V )
C = ⎝ ⎠.
cov(U, V ) σ 2 (V )

In order to compute C we use the formula of the variance of the sum of 2 random
numbers and the bilinearity of the covariance:
• σ 2 (U )

σ 2 (U ) = σ 2 (2X + 3Y )
= 4 σ 2 (X ) + 9 σ 2 (Y ) + 2 × 6 cov(X, Y )
= 13 ;

• σ 2 (V )

σ 2 (V ) = σ 2 (X − Y )
= σ 2 (X ) + σ 2 (Y ) − 2 cov(X, Y )
= 2;
12 Absolutely Continuous and Multivariate Distributions 175

• cov(U, V )

cov(U, V ) = cov(2X + 3Y, X − Y )

= 2 σ 2 (X ) − 2 cov(X, Y ) + 3 cov(X, Y ) − 3 σ 2 (Y )
= −1 .

The covariance matrix is ⎛ ⎞

13 −1
C = ⎝ ⎠.
−1 2

2. To compute the joint density of (U, V ), we first compute the joint c.d.f. of (U, V )
given by

F(u, v) = P(U ≤ u, V ≤ v) = P(2X + 3Y ≤ u, X − Y ≤ v) .

This probability is given by the integral of the joint density on the domain Du,v
of R2 where

Du,v = {(x, y) ∈ R2 | 2x + 3y ≤ u, x − y ≤ v} .

We obtain
F(u, v) = f (x, y) dx dy.
Du,v

To solve the integral, we perform the change of variables

z = 2x + 3y, t = x − y,

to transform the domain Du,v into the region

D̂u,v = {(x, y) ∈ R2 | z ≤ u, t ≤ v} .

with sides which are parallel to the axes. If we now compute x, y as function of
z and t, we obtain
1 1
x= (z + 3t), y= (z − 2t) .
5 5
It follows that the Jacobian matrix is equal to
⎛ ∂Ψ1 ∂Ψ1 ⎞ ⎛ 1 3
⎞
∂z ∂t 5 5
JΨ = ⎝ ⎠=⎝ ⎠,
∂Ψ2 ∂Ψ2
∂z ∂t
1
5
− 25
176 12 Absolutely Continuous and Multivariate Distributions

z + 3t z − 2t
where (x, y) = Ψ (z, t) = (Ψ1 (z, t), Ψ2 (z, t)) = , , with
5 5
determinant
1
|det JΨ | = .
5
We obtain:

F(u, v) = f (x, y) dx dy
Du,v

= f (ψ(z, t)) |det JΨ | dz dt
D̂u,v
u v
1 − 21 ( z+3t
5 ) +( 5 )
2 z−2t 2 1
= e dz dt
−∞ −∞ 2π 5
u v
1 − 1 · 1 (2z 2 +13t 2 +2zt )
= e 2 25 dz dt .
−∞ −∞ 10π

The joint density of (U, V ) is then

1 − 1 (2z 2 +13t 2 +2zt )

e 50 , z, t ∈ R2 .
10π
Note that (U, V ) have again joint Gaussian distribution with covariance matrix
equal to C.
To verify these results, compute the inverse matrix of A, where
⎛ 2 1
⎞
25 25
A = ⎝ ⎠.
1 13
25 25

Exercise 12.8 A random vector (X, Y, Z ) has joint density given by

f (x, y, z) = k e− 2 (2x −2x y+y 2 +z 2 +2x−6y )

1 2
.

1. Compute k.
2. Compute the expectations P(X ), P(Y ) and P(Z ).
3. Compute the density of the random vector (X, Z ).
4. Compute the correlation coefficient between X and Z and between X and Y .
5. Let W = X + Z ; compute the probability density of W .
12 Absolutely Continuous and Multivariate Distributions 177

Solution 12.8 1. If we write the density in the standard form

f (x, y, z) = k e− 2
1
Av·v+b·v

where A is the symmetric matrix

⎛ ⎞
2 −1 0
⎜ ⎟
⎜ ⎟
A = ⎜
⎜ −1 1 0⎟⎟,
⎝ ⎠
0 0 1

b is the vector in R3 ⎛ ⎞
−1
b = ⎝ 3 ⎠,
0

and v is given by ⎛ ⎞
x
v = ⎝y⎠ .
z

We can compute the normalization constant k as follows:

det A − 1 A−1 b·b
k = e 2 .
(2π)3

It is now sufficient to calculate the determinant and the inverse matrix of A. We

obtain:

det A = 1,
⎛ ⎞
1 1 0
⎜ ⎟
⎜ ⎟
A−1 = ⎜ ⎜1 2 0⎟⎟
⎝ ⎠
0 0 1

from which ⎛ ⎞
2
A−1 b = ⎝ 5 ⎠
0
178 12 Absolutely Continuous and Multivariate Distributions

and
− 21 A−1 b·b det A 1
= e− 2
13
k=e .
(2π) 3 (2π)3

2. The expectations of X, Y, Z are given respectively by

P(X ) = A−1 b 1 = 2 ,

P(Y ) = A−1 b 2 = 5 ,

P(Z ) = A−1 b 3 = 0 .

3. The random vector (X, Z ) has bidimensional Gaussian density of covariance

matrix D given by
⎛ ⎞ ⎛ ⎞
A−1 11
A−1 13
1 0
D = ⎝
⎠ = ⎝ ⎠
−1 −1
A 31
A 33
0 1

and vector d of expectations

2
d = .
0

To prove this, we derive the joint density f X,Z (x, z) from f (x, y, z) as follows:

f X,Z (x, z) = f (x, y, z) dy
R

k e− 2 (2x −2x y+y 2 +z 2 +2x−6y)
1 2
= dy
R

= e− 2 (2x +z 2 +2x)
k e− 2 (y −2x y)+3y
1 2 1 2
dy
R

= e− 2 (2x +z 2 )−x
k e− 2 y +(3+x)y
1 2 1 2
dy .
R

Here we can consider

k e− 2 y +(3+x)y
1 2
I = dy
R

as the integral of a one-dimensional Gaussian distribution with coefficients

depending on the parameter x. In the same notation as above, we obtain

A = 1 and b = 3+x,
12 Absolutely Continuous and Multivariate Distributions 179

from which we obtain

− 21 y 2 +(3+x)y 2π 1 A−1 b·b
ke dy = e2
R det A
√
2πe 2 (3+x) .
1 2
=

We can obtain the same result also by completing the square

1
− y 2 + (3 + x)y
2
in the integral I. It follows that
√
f X,Z (x, z) = k e− 2 (2x +z 2 )−x
2π e 2 (3+x)
1 2 1 2
·
e− 2 + 2 − 1 (x 2 +z 2 )+2x
13 9

= e 2
2π
e−2 − 1 (x 2 +z 2 )+2x
= e 2 .
2π

4. The correlation coefficient between X and Z can be obtained by the formula

cov(X, Z )
ρ(X, Z ) = .
σ(X ) σ(Z )

From the covariance matrix we have that

cov(X, Z ) = 0 ,

hence √
1 2
ρ(X, Y ) = √ √ = .
2 1 2

5. The probability density of W can be computed via the formula

f W (w) = f X,Z (x, w − x) dx .
R
180 12 Absolutely Continuous and Multivariate Distributions

Hence with the same method used above:

−2
e
e− 2 (x +(w−x) )+2x dx
1 2 2
f W (w) =
R 2π

e−2 − 1 w2
e− 2 (2x )+(2+w)x dx
1 2
= e 2
2π R
e−2 − 1 w2 √ 1 × 1 (2+w)2
= e 2 πe 2 2
2π
e−2+1 1 2
= √ e− 4 w +w
2 π
1
= √ e− 4 (w−2) .
1 2

2 π

The random number W has normal density with expectation

P(W ) = P(X ) + P(Z ) = 2

and variance

σ 2 (W ) = σ 2 (X ) + σ 2 (Z ) + 2cov(X, Z ) = 2 .
Chapter 13
Markov Chains

Exercise 13.1 A Markov chain (X n )n ∈ N with states S = {1, 2, 3, 4} has the

following transition matrix
⎛1 3 ⎞
4 4
0 0
⎜ ⎟
⎜ ⎟
⎜ 0 0 2 1⎟
⎜ 3 3⎟
⎜ ⎟
⎜1 ⎟
⎜ 0 3 0⎟
⎜4 4 ⎟
⎝ ⎠
0 13 0 23

and initial distribution

1
μ(1) = μ(2) = μ(3) = μ(4) = .
4
(a) Determine the equivalence classes of the states and their periods.
(2) (2) (2)
(b) Compute p2,1 , p1,4 , p1,1 .
(c) Check the existence of the following limits and compute them, if they exist:
(n)
lim p1,3 and lim P(X n = 2).
n→∞ n→∞

Solution 13.1 (a) To determine equivalence classes of the states, we can draw a
graph of the transition probabilities by using the matrix P. We first represent
the states (see Fig. 13.1) and then connect with an arrow two states such that
the transition probability from one to the other is strictly positive. For example,
since
3
[P]1,2 = ,
4

F. Biagini and M. Campanino, Elements of Probability and Statistics,
UNITEXT - La Matematica per il 3+2 98, DOI 10.1007/978-3-319-07254-8_13
182 13 Markov Chains

Fig. 13.1 The states

1 2 3 4

Fig. 13.2 The chain has

positive probability of going 1 2
from 1 to 2

Fig. 13.3 The chain have

positive probability of
1
remaining in the state 1

Fig. 13.4 Graph of the

relations among the states
1 2 3 4

the chain has positive probability to go from the state 1 to the state 2 in one step.
We represent this on the graph by connecting 1 and 2 with an arrow going from
1 to 2, see Fig. 13.2.
1
Analogously, [P]1,1 = means that the chain has positive probability of remain-
4
ing in the state 1. This can be represented as illustrated in Fig. 13.3.
By using this procedure we can construct the graph of Fig. 13.4.
From the graph we deduce that all elements can communicate with each other,
i.e. there exists paths that connect each state to all the other ones with positive
probability. We conclude that there exists only one equivalence class [1].
Furthermore we can deduce from the graph that the period of the chain is 1, since
there exists a path of length 1 from state 1 to itself, i.e.
(n)
1 ∈ {n | p1,1 > 0}.

(2)
(b) To compute p2,1 , i.e. the probability of going in 2 steps from the state 2 to the
state 1, we write
(2)
(1) (1)
p2,1 = p2,i pi,1 .
i∈S

This formula shows how the probability of going in 2 steps from the state 2 to
the state 1 can be computed as the sum of the probabilities of all possible paths
from 2 to 1.
From the graph relative to the matrix P we obtain

(2) (1) (1) 2 1 1

p2,1 = p2,3 p3,1 = · = .
3 4 6
13 Markov Chains 183

(2)
Note that we can compute p2,1 by taking the product of the matrix column with
the matrix row
(2)
p2,1 = P2 · P 1

where P2 denotes the second row and P 1 the first column of the matrix P.
(2) (2)
Analogously we compute p1,4 and p1,1 .
(c) Since the chain is irreducible and aperiodic, the ergodic theorem guarantees the
existence of the limit
(n)
lim p1,3 = π3 ,
n→∞

where π3 can be obtained by the solution of the linear system

π = tπ P
⎧
⎪
⎪ π1 = πi pi,1
⎪
⎪
⎪
⎪
⎪
⎪
⎨ π2 = πi pi,2
⎪
⎪
⎪ π3 = πi pi,3
⎪
⎪
⎪
⎪
⎪
⎩
π1 + π2 + π3 + π4 = 1.

In this case
⎧
⎪ π1 = 4 π1 + 4 π3
1 1
⎪
⎪
⎪
⎪
⎪
⎪
⎪ π2 = 43 π1 + 13 π4
⎪
⎪
⎪
⎨
⎪ π3 = 23 π2 + 43 π3
⎪
⎪
⎪
⎪
⎪
⎪
⎪
⎪ 4
⎪
⎪ πi = 1.
⎩
i=1

By using standard methods for the solution of linear systems of equations, we

obtain:
12
π3 = .
25
Hence

(n) 12
lim p1,3 = .
n→∞ 25
184 13 Markov Chains

To compute lim P(X n = 2), we note that

n→∞

4
1 (n)
4
P(X n = 2) = P(X n = 2|X 0 = i)P(X 0 = i) = p .
i=1
4 i=1 i,2

Since for all i

(n)
lim pi,2 = π2 ,
n→∞

we have
1 (n)
4
1
lim P(X n = 2) = lim pi,2 = · 4π2 = π2 ,
n→∞ n→∞ 4 4
i=1

9
where π2 = .

50
Exercise 13.2 A Markov chain X n , n = 0, 1, 2 . . . with states

S = {1, 2, 3, 4, 5, 6}

has the following transition matrix

⎛ 1 1 ⎞
0 3
0 3
0 13
⎜1 0 1
0 0 0⎟
⎜2 1 2 ⎟
⎜0 0 23 0 0 ⎟
⎜ 3 ⎟
⎜0 0 2
0 13 0 ⎟
⎜ 3 ⎟
⎝0 0 0 21 0 21 ⎠
1 1
3
0 3
0 13 0

and initial distribution

1 2
μ(1) = , μ(2) = , μ(3) = μ(4) = μ(5) = μ(6) = 0.
3 3
1. Determine the equivalence classes of the states and their periods.
2. Check the existence of the following limits and compute them, if they exist:
(2n) (n) (2n)
lim p1,5 , lim p3,5 , lim p2,5 and lim P(X n = 5).
n→∞ n→∞ n→∞ n→∞

3. Compute P(X 2 < 3).

Solution 13.2 1. As in the previous exercise, we draw the graph of the states as
in Fig. 13.5, in order to determine the equivalence classes of the states and their
periods. We connect the states with an arrow in the case there exists a positive
13 Markov Chains 185

Fig. 13.5 Graph of the states

1 2 3 4 5 6

Fig. 13.6 Graph of the

probabilities of transition

1 2 3 4 5 6

probability to pass from the state, where the arrow starts, to the state where the
arrow ends.
By using the transition matrix P we obtain the graph shown in Fig. 13.6, where
we can see that there exists only one equivalence class. We note that the number
of steps needed to come back to the state from where we started is always even.
(2)
Furthermore p1,1 > 0, hence it follows that the period of the equivalence class is
2. Namely

2 = MCD A+
s

where A+ (n)
s = {n| ps,s > 0}.
2. To study the limits, we consider the equivalence classes of the matrix P 2 ; we obtain
two equivalence classes, each of period 1. To derive the equivalence classes, it
is not necessary to compute the whole matrix P 2 ; for example, the equivalence
class of 1 will be given by all states that communicate with 1 with an even number
of steps. We obtain

[1] = {1, 3, 5},

[2] = {2, 4, 6}.

Since 2 and 5 do not communicate with an even number of steps, we immediately

have that
(2n)
p2,5 = 0

for all n, hence

(2n)
lim p2,5 = 0.
n→∞
186 13 Markov Chains

The state 5 belongs to the class [1] calculated with respect to P 2 . Hence we can
apply the ergodic theorem to that class, since it has period 1 with respect to P 2 .
The submatrix of P 2 relative to [1] is given by:
⎛ 5 9 2
⎞
18 18 9
⎜ ⎟
⎜ ⎟
⎜ 1 11 2⎟
.
⎜ 6 18 9⎟
⎝ ⎠
1 1 1
6 2 3

By the ergodic theorem we have that

(2n)
lim p1,5 = π5
n→∞

where π5 is the solution of the system

⎧
⎪ 5 1 1
⎪
⎪ π1 = 18 π1 + 6 π3 + 6 π5
⎪
⎪
⎨
⎪
⎪ π3 = 189
π1 + 11 π + 21 π5
18 3
⎪
⎪
⎪
⎩
π1 + π3 + π5 = 1.

We obtain
3 9 1
π1 = π3 = π5 = .
16 16 4
Hence
(2n) 1
lim P1,5 = .
n→∞ 4
(n)
To obtain the asymptotic behavior of p3,5 for n going to infinity, we note that
(a) on the even steps, i.e. for n = 2k, we have
(2k)
p3,5 −−−→ π5 ;
k→∞

(b) on the odd steps, i.e. for n = 2k + 1, we have

(2k+1)
p3,5 = 0

since the probability of going from the state 3 to the state 5 in an odd number of
steps is zero. Since the limit on two subsequences is different, we can conclude
that
(n)
lim p3,5
n→∞
13 Markov Chains 187

does not exist. To compute

lim P(X n = 5),
n→∞

we use the formula of total probability:

6
P(X n = 5) = P(X n = 5|X 0 = i) P(X 0 = i)
i=1

6
(n) 1 (n) 2 (n)
pi,5 μi = p + p .
i=1
3 1,5 3 2,5

We need to distinguish the following 2 cases:

(a) what happens on the even steps, i.e. for n = 2k. We have:

1 (2k) 2 (2k) 1 (2k) π5

p1,5 + p2,5 = p1,5 −−−→ ;
3 3 3 k→∞ 3

(b) what happens on the odd steps, i.e. for n = 2k + 1. We have:

2 (1) (2k)
6
1 (2k+1) 2 (2k+1) 2 (2k+1)
p1,5 + p2,5 = p2,5 = p p ,
3 3 3 3 i=1 2,i i,5

2 6
(1) 2 (1)
that tends to π5 p2,i = π5 for k → ∞, since we have that p2,i = 0
3 i=1
3
(2k)
for the i such that we have lim pi,5 = π5 .
k→∞

Since we obtain different limits, we can conclude that the limit

lim P(X n = 5)
n→∞

does not exist.

3. To compute P(X 2 < 3) we note that

2
P(X 2 < 3) = P(X 2 = i),
i=1

since the event (X 2 < 3) = (X 2 = 1) + (X 2 = 2). It is then sufficient to compute

188 13 Markov Chains

6
P(X 2 = 1) = P(X 2 = 1|X 0 = i) P(X 0 = i)
i=1

6
(2)
= pi,1 μi
i=1
1 (2) 2 (2) 5
= p1,1 + p2,1 =
3 3 54
and

6
P(X 2 = 2) = P(X 2 = 2|X 0 = i) P(X 0 = i)
i=1

6
(2)
= pi,2 μi
i=1
1 (2) 2 (2) 2
= p1,2 + p2,2 = .
3 3 9
17
Finally, P(X 2 < 3) = .

54
Exercise 13.3 A Markov chain X n , n = 0, 1, 2 . . . with states

S = {1, 2, 3, 4}

has the following transition matrix

⎛ 1 1
⎞
0 2 2
0
⎜ ⎟
⎜ ⎟
⎜ 2
0 0 1⎟
⎜ 3 3⎟
P=⎜
⎜
⎟
⎟
⎜ 1
0 0 5⎟
⎜ 6 6⎟
⎝ ⎠
3 1
0 4 4
0

and initial distribution

1 1 1
μ(1) = , μ(2) = , μ(3) = , μ(4) = 0.
3 3 3
1. Determine the equivalence classes of the states and their periods.
(2)
2. Compute P(X 5 = 2|X 2 = 3), p1,4 and P(X 2 ).
3. Check the existence of the following limits and compute them, if they exist:
(2n) (2n) (n)
lim p1,3 , lim p1,4 , lim p2,3 and lim P(X n = 2).
n→∞ n→∞ n→∞ n→∞
13 Markov Chains 189

Fig. 13.7 Graph of the states

1 2 3 4

Solution 13.3 1. To find the equivalence classes we draw the graph of the states
as shown in Fig. 13.7.
Since we can reach all other states by starting from the state 1 and the state 1
can be reached from all other states, there exists only one equivalence class.
Furthermore by starting from 1, we return to it always with only an even number
(2)
of steps and p1,1 > 0. We conclude that the period of the class is equal to 2.

2. To compute the conditional probability P(X 5 = 2|X 2 = 3) we use the fact that
the chain is homogeneous. It holds that
(3)
P(X 5 = 2|X 2 = 3) = p3,2 = [P 3 ]3,2 = 0.

To compute this probability, we need to calculate the element on the row 3 and
column 2 of the matrix P 3 , We can obtain this element by multiplicating the third
row of P 2 with the second column of P.

Anyways without any computation we can immediately state that the probability
of going from the state 3 to the state 2 in a odd number of steps is 0!

Furthermore
(2) 7
p1,4 = [P 2 ]1,4 = .
12
We now compute the expectation of the state of the chain at time t = 2 by using
the formulas of the expectation of a random number with discrete distribution
and of the total probabilities:

4
P(X 2 ) = i P(X 2 = i)
i=1

4
4
= i P(X 2 = i|X 0 = j) P(X 0 = j)
i=1 j=1

4
i
= (P(X 2 = i|X 0 = 1) + P(X 2 = i|X 0 = 2) + P(X 2 = i|X 0 = 3))
i=1
3
4
i 2
= [P ]1,i + [P 2 ]2,i + [P 2 ]3,i .
i=1
3
190 13 Markov Chains

The matrix of P 2 is given by:

⎛ 5 7
⎞
12
0 0 12
⎜ ⎟
⎜ ⎟
⎜ 0 7 5
0⎟
⎜ 12 12 ⎟
P2 =⎜
⎜
⎟.
⎟
⎜ 0 17 7
0⎟
⎜ 24 24 ⎟
⎝ ⎠
13 11
24
0 0 24

Hence the expectation of X 2 is then:

1 5 7 17 5 7 41
P(X 2 ) = +2 + +3 + = .
3 12 12 24 12 24 24

3. We now compute the limits. The Markov chain observed on the even steps can
be considered as a Markov chain with transition matrix equal to P 2 . We can
immediately see that the state 3 cannot be reached from the state 1 with an even
number of steps. In fact, the equivalence classes relative to P 2 are given by

[1] = {1, 4},

[2] = {2, 3}.

It follows that
(2n)
lim p1,3 = 0.
n→∞

The state 4 belongs to the equivalence class [1] relative to P 2 . This class has
period 1. Hence we can apply the ergodic theorem to this irreducible aperiodic
subchain to compute
(2n)
lim p1,4 .
n→∞

(2n) (2n)
If we put π4 = lim p1,4 and π1 = lim p1,1 , by the ergodic theorem we
n→∞ n→∞
obtain
⎧
⎨ π1 + π4 = 1
⎩ 5
12
π1 + 13
24
π4 = π1 .
13 Markov Chains 191

The solution of the system is

13 14
π1 = , π4 = .
27 27
It follows that

(2n) 14
lim p1,4 = .
n→∞ 27
(n)
To compute lim p2,3 we observe the behavior of the chain on the even steps and
n→∞
on the odd ones.
(a) First we note that 2 ∈ [3] relative to P 2 .

By the ergodic theorem we have that on the even steps (i.e. if n = 2k)
(2k)
p2,3 −−−→ π3
k→∞

where π3 is the solution of the system

⎧
⎨ π2 + π3 = 1
⎩ 5
12
π2 + 7
24
π3 = π3 .

(b) There exists no path with an odd number of steps from the state 2 to the state
3, hence
p2,3 (2k + 1) = 0 ∀ k.

We can obtain the same result by computing

(2k+1)
(2k)
p2,3 = p2, j p j,3 (1)
j
(2k) (2k)
= p2,1 p1,3 (1) + p2,4 p4,3 (1) = 0.

Summing up
(2k)
p2,3 −−−→ π3 > 0
k→∞
(2k+1)
p2,3 −−−→ 0.
k→∞

(n)
Hence the limit lim p2,3 does not exist.
n→∞
192 13 Markov Chains

Finally, to compute lim P(X n = 2) we proceed as in the previous case. First we

n→∞
use the formula of the total probability to compute P(X n = 2):

4
P(X n = 2) = P(X n = 2|X 0 = i) P(X 0 = i)
i=1
1
= (P(X n = 2|X 0 = 1) + P(X n = 2|X 0 = 2) + P(X n = 2|X 0 = 3))
3
1 (n) (n) (n)

= p1,2 + p2,2 + p3,2 .
3

By using the results above, we obtain

(a) if n = 2k

1 (2k) (2k) (2k)

1
(2k) (2k)

p1,2 + p2,2 + p3,2 = p2,2 + p3,2
3 3
2
which tends to π2 for k → ∞;
3
(b) if n = 2k + 1

1 (2k+1) (2k+1) (2k+1)

1
(2k+1)
p1,2 + p2,2 + p3,2 = p1,2
3 3
1 (1) (2k)
4
= p p
3 i=1 1,i i,2
1 (1) (2k) (1) (2k)

= p1,2 p2,2 + p1,3 p3,2
3

(1) (1)
which tends to 13 π2 p1,2 + p1,3 = 13 π2 for k → ∞.

Summing up, if we put pn = P(X n = 2), we have that

2
p2n −−−→ π2 ,
n→∞ 3
1
p2n+1 −−−→ π2 ,
n→∞ 3

i.e. the limit lim P(X n = 2) does not exist.

n→∞

Exercise 13.4 A Markov chain (X n )n ∈ N with states S = {1, 2, 3, 4, 5} has the fol-
lowing transition matrix
13 Markov Chains 193

Fig. 13.8 Graph of states

1 2 3 4 5

⎛ 1 1 ⎞
0 2 2
00
⎜ ⎟
⎜1 1 ⎟
⎜ 0 0 0⎟
⎜2 2 ⎟
⎜ ⎟
⎜ 2 1 ⎟
P = ⎜ 0 3 0 3 0⎟
⎜
⎟
⎜ ⎟
⎜ ⎟
⎜ 0 0 2 0 1⎟
⎜ 3 3⎟
⎝ ⎠
2 1
3
00 3 0

and initial distribution

2 1
μ(1) = 0, μ(2) = , μ(3) = , μ(4) = μ(5) = 0.
3 3
(a) Determine the equivalence classes of the states and their periods.
(b) Check the existence of the following limits and, if they exist, compute them:
(n) (n) (n) (n)
lim p1,5 , lim p3,5 , lim ( p2,3 + p3,5 ), lim P(X n = 5).
n→∞ n→∞ n→∞ n→∞

(c) Compute P(X 1 ≤ 2) and P(X 2 = 5).

Solution 13.4 (a) To determine the equivalence classes of the states and their peri-
ods we draw the graph of the states, see Fig. 13.8.

We first note that all the states comunicate among each other. Consider the set
(n)
A+
1 = {n| p11 > 0}

i.e. the set of the lengths of the paths that starts and ends in 1. We note that there
exists a path of length 2 (for example, from 1 to 2 and from 2 to 1) and of length
3 (from 1 to 3, from 3 to 2, from 2 to 1). We have

2, 3 ∈ A+
1.

The period d of the equivalence class [1] is given by

d = MC D(A+
1 ),
194 13 Markov Chains

hence d must be equal to 1 since it must divide 2 and 3. We conclude that there
exists only one equivalence class of period 1.
(b) By the ergodic theorem it follows that all the limits exist since the chain has a
unique equivalence class of period 1. First we note that
(n) (n)
lim p1,5 = lim p3,5 = π5
n→∞ n→∞

since the starting state (1 to 3) does not count. Furthermore

(n) (n) (n) (n)
lim ( p2,3 + p3,5 ) = lim p2,3 + lim p3,5
n→∞ n→∞ n→∞
= π3 + π5

and finally

5
lim P(X n = 5) = lim P(X n = 5|X 0 = i)P(X 0 = i)
n→∞ n→∞
i=1

5
(n)
= lim μ(i) pi,5
n→∞
i=1

5
= π5 μ(i) = π5 · 1 = π5 ,
i=1

(n) 5
since lim pi,5 = π5 , ∀i = 1, . . . , 5 and i=1 μ(i) = 1. To obtain πi , it is suf-
n→∞
ficient to solve the system
π = πt P
5
i=1 πi = 1

i.e. ⎧
⎪
⎪ π1 = 1
π
2 2
+ 23 π5
⎪
⎪
⎪
⎪
⎪
⎪ π2 = 1
π + 23 π3
⎪
⎪ 2 1
⎪
⎪
⎨
π4 = 1
π
3 3
+ 13 π5
⎪
⎪
⎪
⎪
⎪
⎪
⎪
⎪ π5 = 1
π
3 4
⎪
⎪
⎪
⎪
⎩ 5
i=1 πi = 1.

In this system we have already taken out a redundant equation. We obtain

7 14
π3 = e π5 = .
540 135
13 Markov Chains 195

Hence, summing up

(n) (n) 14
lim p1,5 = lim p3,5 = lim P(X n = 5) = π5 = ,
n→∞ n→∞ n→∞ 135
7 14 7
(n) (n)
lim p2,3 + p3,5 = π3 + π5 = + = .
n→∞ 540 135 60
(c) To compute the probabilities, we note that

P(X 1 ≤ 2) = P(X 1 = 1) + P(X 1 = 2)

5
5
= P(X 1 = 1|X 0 = i)μ(i) + P(X 1 = 2|X 0 = i)μ(i)
i=1 i=1
2 1 2 1
= p2,1 + p3,1 + p2,2 + p3,2
3 3 3 3
1 2 5
= + = .
3 9 9
The second probability can be computed by using the formula of the total prob-
ability:

5
P(X 2 = 5) = P(X 2 = 5|X 0 = i)μ(i)
i=1
2 (2) 1 (2)
= p + p
3 2,5 3 3,5
2 1
= [P 2 ]2,5 + [P 2 ]3,5
3 3
1 1 1
= · = .
3 9 27

Chapter 14
Statistics

Exercise 14.1 The events E 1 , E 2 , . . . are stochastically independents subordinately

to the random parameter Θ with P(E i |Θ = θ) = θ. The a priori density of Θ is
given by
⎧ 2
⎨ 3 θ 0 ≤ θ ≤ 1,
π0 =
⎩
0 otherwise.

We observe the values of the first 4 events:

E 1 = 0, E 2 = 1, E 3 = 1, E 4 = 1.

1. Compute the a posteriori density π4 (Θ|E 1 = 0, E 2 = 1, E 3 = 1, = 1) of Θ.

E4
1
2. Compute the a priori probability that Θ belongs to the interval ,1 .
2
1
3. Compute the a posteriori probability that Θ belongs to the interval ,1 .
2
4. Compute arg max π4 (Θ|E 1 = 0, E 2 = 1, E 3 = 1, E 4 = 1). 1

5. Compute the a posteriori expectation of E = E 5 ∧ E 6 .

Solution 14.1 1. The a posteriori density can be computed by using the formula

π4 (Θ|E 1 = 0, E 2 = 1, E 3 = 1, E 4 = 1)
= kP(E 1 = 0, E 2 = 1, E 3 = 1, E 4 = 1|Θ = θ)π0 (θ).

1 Here and in the sequel, for a given function f we denote by arg max( f ) the points where f achieves
its maximum.
© Springer International Publishing Switzerland 2016 197
F. Biagini and M. Campanino, Elements of Probability and Statistics,
UNITEXT - La Matematica per il 3+2 98, DOI 10.1007/978-3-319-07254-8_14
198 14 Statistics

Since the events E i are stochastically independent subordinately to the random

parameter Θ, the probability

P(E 1 = 0, E 2 = 1, E 3 = 1, E 4 = 1|Θ = θ)

can be factorized in

Hence the a posteriori density is given by

π4 (Θ|E 1 = 0, E 2 = 1, E 3 = 1, E 4 = 1) = kθ5 (1 − θ),

where k is a normalization constant. Since the a posteriori density π4 corresponds

to a beta distribution B(6, 2), we have

Γ (6 + 2) 7!
k = = = 42.
Γ (6)Γ (2) 5!

1
2. The a priori probability that Θ belongs to the interval , 1 is given by:
2
1
1
P( ≤ Θ ≤ 1) = π0 (θ) dθ
2 1

2
1 1 7
= 3 θ2 dθ = θ3 1 = .
1
2
2 8

1
3. The a posteriori probability that Θ belongs to the interval , 1 is given by:
2

1
P( ≤ Θ ≤ 1|E 1 = 0, E 2 = E 3 = E 4 = 1)
2
1
= π4 (θ|E 1 = 0, E 2 = E 3 = E 4 = 1) dθ
1
2
1
1
θ6 θ7 15
= 42 (θ5 − θ6 ) dθ = 42 − = .
1
2
6 7 1 16
2

4. By calculating the derivative

d
π4 (θ|E 1 = 0, E 2 = 1, E 3 = 1, E 4 = 1) = 42 θ4 (5 − 6θ),
dθ
14 Statistics 199

5
we have that it is equal 0 in θ̄ = . Since
6

π4 θ= 5 = 42 θ3 (20 − 30θ)
θ= 5 < 0,
d θ
2 6 6

we can conclude that arg max π4 (Θ|E 1 = 0, E 2 = 1, E 3 = 1, E 4 = 1) = 56 .

5. The a posteriori expectation of the event

E = E 5 ∧ E 6 = min(E 5 , E 6 ) = E 5 E 6

coincides with its a posteriori probability

P(E 5 E 6 |E 1 = 0, E 2 = E 3 = E 4 = 1)
1
= P(E 5 E 6 |θ)P(θ|E 1 = 0, E 2 = E 3 = E 4 = 1) dθ
0
1
= P(E 5 E 6 |θ)π4 (θ|E 1 = 0, E 2 = E 3 = E 4 = 1) dθ
0
1
= P(E 5 |θ)P(E 6 |θ) · 42 θ5 (1 − θ) dθ
0
1
= θ2 42 θ5 (1 − θ) dθ
0
1
= 42 θ7 (1 − θ) dθ
0
Γ (8) Γ (8) Γ (2)
= ·
Γ (6) Γ (2) Γ (10)
Γ (8) 7 · 6 · Γ (6)
= ·
Γ (6) 9 · 8 · Γ (8)
7
= .
12

Here we have used the following formula2

1
Γ (α) Γ (β)

θα−1 (1 − θ)β−1 dθ = .
0 Γ (α + β)

Exercise 14.2 The events E 1 , E 2 , . . . are stochastically independents subordinately

to Θ with P(E i |Θ = θ) = θ. The a priori density Θ is given by

2 For this, see the proof for the density of the sum of Γ (α, λ) + Γ (β, λ), where Γ (α, λ), Γ (β, λ)
are stochastically independent.
200 14 Statistics
⎧
⎨ K θ2 (1 − θ) 0 ≤ θ ≤ 1,
π0 =
⎩
0 otherwise.

We observe the values of the first 5 events:

E 1 = 0, E 2 = 1, E 3 = 1, E 4 = 0, E 5 = 1.

1. Compute the normalization constant K .

2. Compute the a posteriori density of the event Θ and the a posteriori probability
of (Θ < 21 ).
3. Compute the a posteriori expectation of X = E 6 + E 7 and the a posteriori
probabilities of E = E 6 E 7 and F = E 6 ∨ E 7 .

Solution 14.2 1. To compute the constant k, we impose that the integral of the
density is equal to 1, i.e.
1
π0 (θ) dθ = 1.
0

It follows that
1
k = 1 .
0 θ2 (1 − θ) dθ

The value of this integral is well-known and equal to:

1
Γ (3) Γ (2)
θ2 (1 − θ) dθ = ,
0 Γ (3 + 2)

hence
Γ (3 + 2) 4!
k = = = 12.
Γ (3) Γ (2) 2! · 1!

2. The a posteriori density is given by

π5 (Θ|E 1 = 0, E 2 = E 3 = 1, E 4 = 0, E 5 = 1)
= π0 (θ)P(E 1 = 0, E 2 = E 3 = 1, E 4 = 0, E 5 = 1|θ)
= π0 (θ)P(E 1 = 0|θ)P(E 2 = 1|θ)P(E 3 = 1|θ)P(E 4 = 0|θ)P(E 5 = 1|θ)
= c · θ2 (1 − θ) · (1 − θ)2 Θ 3
= c · θ5 (1 − θ)3 .

Here we denote with c the normalization constant of the a posteriori density. In

this case the a posteriori probability distribution is a beta B(6, 4). It follows that
14 Statistics 201

Γ (6 + 4) 9!
c = = = 7 · 8 · 9 = 504.
Γ (6) Γ (4) 5! 3!

If we put W = E˜1 E 2 E 3 E˜4 E 5 , the a posteriori density is given by

⎧
⎨ 504 θ5 (1 − θ)3 0 ≤ θ ≤ 1,
π5 (θ) =
⎩
0 otherwise.

To find the a posteriori probability of the event (Θ < 21 ) it is sufficient to integrate

1
the a posteriori density between 0 and :
2
1 1
1 2 2 65
P(Θ < ) = 504 θ (1 − θ) dθ = 504
5 3
(θ5 + 3θ7 − 3θ6 − θ8 )dθ = .
2 0 0 256

3. The a posteriori expectation of X = E 6 + E 7 is given by

2
P(X |W ) = i P(X = i|W )
i=0
= P(X = 1|W ) + 2 P(X = 2|W ).

We obtain:

P(X = 1|W ) = P(E 6 = 1, E 7 = 0|W ) + P(E 6 = 0, E 7 = 1|W )

1
=2 P(E 6 = 0, E 7 = 1|Θ = θ)π5 (Θ = θ|W ) dθ
0
1
Γ (10)
=2 θ · (1 − θ) · θ5 · (1 − θ)3 dθ
0 Γ (6) Γ (4)
1
Γ (10)
=2 θ6 (1 − θ)4 dθ
Γ (6) Γ (4) 0
Γ (10) Γ (7) Γ (5)
=2
Γ (6) Γ (4) Γ (12)
24
=
55
and

P(X = 2|W ) = P(E 6 = 1, E 7 = 1|W )

1
= P(E 6 = 1, E 7 = 1|θ = Θ)π5 (Θ = θ|W ) dθ
0
202 14 Statistics
1
Γ (10)
= θ2 · θ5 · (1 − θ)3 dθ
Γ (6) Γ (4) 0
1
Γ (10)
= θ7 (1 − θ)3 dθ
Γ (6) Γ (4) 0
Γ (10) Γ (8) Γ (4)
=
Γ (6) Γ (4) Γ (12)
21
= .
55
The a posteriori expectation of X is equal to

21 24 6
P(X |W ) = 2 · + = .
55 55 5
Note that X = E 6 + E 7 is a random number, but not an event since it can assume
3 possible values: 0, 1 or 2.
The a posteriori probability of the events E = E 6 E 7 e F = E 6 ∨ E 7 can be
calculated in the same way:

21
P(E|W ) = P(E 6 E 7 = 1|W ) = P(E 6 = E 7 = 1|W ) =
55
and

P(F|W ) = P(E 6 ∨ E 7 = 1|W )

= P(E 6 = 1, E 7 = 0|W )
+ P(E 6 = 0, E 7 = 1|W )
+ P(E 6 = 1 = E 7 = 1|W )
9

= .
11

Exercise 14.3 The events E 1 , E 2 , . . . are stochastically independents subordinately

to Θ with P(E i |Θ = θ) = θ. The a prior density Θ is given by

K θ2 (1 − θ)2 for 0 ≤ θ ≤ 1,
π0 (θ) =
0 otherwise.

We observe the values of the first 4 events: E 1 = 0, E 2 = 1, E 3 = E 4 = 1.

(a) Compute the normalization constant K .
(b) Compute the a posteriori density and the a posteriori expectation of Θ.
(c) Compute the a posteriori probability of the event F = E 52 and the a posteriori
variance of expectation of Ẽ 6 .
14 Statistics 203

Solution 14.3 (a) To compute K we impose that

1
π0 (θ)dθ = 1
0

i.e.
1
K = 1 ,
0 θ2 (1 − θ)2 dθ

since the integral of a probability density must be equal to 1. The integral appear-
ing at the denominator is well-known and equal to
1
Γ (3)2
θ2 (1 − θ)2 dθ =
0 Γ (6)

hence
Γ (6) 5!
K = = = 30.
Γ (3)2 (2!)2

(b) The a posteriori density of Θ given the events E 1 = 0, E 2 = E 3 = E 4 = 1 is

given by the formula

π4 (θ|E 1 = 0, E 2 = 1, E 3 = 1, E 4 = 1)
= K π0 (θ)P(E 1 = 0, E 2 = 1, E 3 = 1, E 4 = 1|θ)
= P(E 1 = 0|θ) · P(E 2 = 1|θ) · P(E 3 = 1|θ) · P(E 4 = 1|θ)
= K θ5 (1 − θ)3 ,

Γ (10)
where K = = 504 and θ ∈ [0, 1]. For θ ∈ / [0, 1], the a posteriori
Γ (6)Γ (4)
density is equal to 0. To compute the a posteriori expectation of Θ, we apply the
formula of the expectation for absolutely continuous distributions, i.e.

P(Θ|E 1 = 0, E 2 = E 3 = E4 = 1)
1
= θπ4 (θ|E 1 = 0, E 2 = 1, E 3 = 1, E 4 = 1)dθ
0
1
Γ (10)
= θ6 (1 − θ)3 dθ
Γ (6)Γ (4) 0
Γ (10) Γ (7)Γ (4) 3
= · = .
Γ (6)Γ (4) Γ (11) 5
204 14 Statistics

(c) The event F = E 52 coincides with E 5 since it assumes only the value 0 or 1. The
a posteriori probability of F is given by

P(F|E 1 = 0, E 2 = 1, E 3 = 1, E 4 = 1)
= P(E 52 |E 1 = 0, E 2 = 1, E 3 = 1, E 4 = 1)
= P(E 5 |E 1 = 0, E 2 = 1, E 3 = 1, E 4 = 1)
1
3
= θπ4 (θ|E 1 = 0, E 2 = 1, E 3 = 1, E 4 = 1)dθ = .
0 5

To compute the a posteriori variance of Ẽ 6 , we consider the usual formula for

the variance. To simplify the notations, we put A = E1 E 2 E 3 E 4 . We obtain

where we have used that

Ẽ 62 = Ẽ 6 .

We need only to compute the a posteriori expectation of Ẽ 6 . For this purpose we

apply the formula of the total probabilities. Hence

Exercise 14.4 The events E 1 , E 2 , . . . are stochastically independent subordinately

to Θ with P(E i |Θ = θ) = θ. The a priori density of Θ is given by
√
K θ2 1 − θ for 0 ≤ θ ≤ 1,
π0 (θ) =
0 otherwise.

We observe the values of the first 4 events: E 1 = 1, E 2 = 0, E 3 = 0, E 4 = 1.

(a) Compute the normalization constant K .
(b) Compute the a posteriori density π4 (θ|E 1 = 1, E 2 = E 3 = 0, E 4 = 1) of Θ
and arg max π4 (θ|E 1 = 1, E 2 = E 3 = 0, E 4 = 1).
(c) Compute the a posteriori covariance of the events E 6 and E 7 .

Solution 14.4 (a) The normalization constant K makes the integral of the density
equal to 1, hence
1
K = 1 1
.
0 θ2 (1 − θ) 2 dθ

We know that

1 Γ (3)Γ 23
1
θ (1 − θ) dθ =
2

2 ,
0 Γ 3 + 23

hence
9 3
Γ 7
· 5
· 3
·Γ 105
K = 2
3 = 2 2 2
3 2
= .
Γ (3)Γ 2
2!Γ 2
16

(b) We compute the a posteriori density by using the fact that the events are stochas-
tically independent subordinately to Θ. We have

π4 (θ|E 1 = 1, E 2 = E 3 = 0, E 4 = 1)
= K P(E 1 = 1, E 2 = E 3 = 0, E 4 = 1|θ)π0 (θ)
= K P(E 1 = 1|θ) · P(E 2 = 0|θ) · P(E 3 = 0|θ) · P(E 4 = 1|θ)π0 (θ)
5
= K θ4 (1 − θ) 2 ,
206 14 Statistics

where 7
Γ 5 + 27 15
· 13
· 11
· 29 Γ 6435
K = = 2 2 2
7 2
= .
Γ (5)Γ 27 Γ (5)Γ 2
128

Hence
⎧
⎪ 6435 4 5
⎨ θ (1 − θ) 2 θ ∈ [0, 1],
π4 (θ|E 1 = 1, E 2 = E 3 = 0, E 4 = 1) = 128
⎪
⎩
0 otherwise.

The arg max can be computed by finding the zeros of the first derivative. We
have

5
π4 (θ|E 1 = 1, E 2 = E 3 = 0, E 4 = 1) = K 4θ3 (1 − θ) 2 − θ4 (1 − θ) 2
5 3

2

3 5
= K θ3 (1 − θ) 2 4(1 − θ) − θ
2
θ 3
3
= K (1 − θ) 2 [8 − 13θ].
2
The derivative is equal to 0 in the extremes of the interval as well in

8
θ̄ = .
13

8 8
Since π4
> 0 for θ ∈ 0,
and π4 < 0 for θ ∈ , 1 , we have that
13 13
8
arg max π4 (θ|E 1 = 1, E 2 = E 3 = 0, E 4 = 1) = .
13
(c) The a posteriori covariance of the events E 6 and E 7 is given by

cov(E 6 , E 7 |E 1 = 1, E 2 = E 3 = 0, E 4 = 1)
= P(E 6 E 7 |E 1 = 1, E 2 = E 3 = 0, E 4 = 1)
− P(E 6 |E 1 = 1, E 2 = E 3 = 0, E 4 = 1)P(E 7 |E 1 = 1, E 2 = E 3 = 0, E 4 = 1).

In Exercise 14.3 we have proved that

P(E 6 |E 1 = 1, E 2 = E 3 = 0, E 4 = 1)
= P(E 7 |E 1 = 1, E 2 = E 3 = 0, E 4 = 1)
1
10
= θπ4 (θ|E 1 = 1, E 2 = E 3 = 0, E 4 = 1)dθ = .
0 17
14 Statistics 207

Analogously

P(E 6 E 7 |E 1 = 1, E 2 = E 3 = 0, E 4 = 1)
1
120
= θ2 π4 (θ|E 1 = 1, E 2 = E 3 = 0, E 4 = 1)dθ = .
0 323

We conclude that
2
120 10
cov(E 6 , E 7 |E 1 = 1, E 2 = E 3 = 0, E 4 = 1) = − .

323 17

Exercise 14.5 The random numbers X 1 , X 2 , . . . are stochastically independents

subordinately to Θ with the same conditional marginal density given by

1 (x − θ)2
f (x|θ) = √ exp − , x ∈ R.
2π 2

We assume that Θ has standard normal distribution. We observe the values of the
first 4 experiments:

x1 = 0.1, x2 = 2, x3 = −1, x4 = 0.5.

(a) Write the a priori density of Θ.

(b) Compute the a posteriori density π4 (θ| x1 = 0.1, x2 = 2, x3 = −1, x4 = 0.5)
of Θ and arg max π4 (θ| x1 = 0.1, x2 = 2, x3 = −1, x4 = 0.5).
(c) Compute the a posteriori expectation and variance of Θ.

Solution 14.5 (a) Since Θ has a standard normal distribution as a priori distribution,
we can write immediately the a priori density

1 θ2
π0 (θ) = √ e− 2 , θ ∈ R.
2π

(b) We compute the a posteriori density by using the fact that the random numbers
are stochastically independent subordinately to Θ:

π4 (θ| x1 = 0.1, x2 = 2, x3 = −1, x4 = 0.5)

= k f (x1 , x2 , x3 , x4 |θ)π0 (θ)

4
i=1 (x i − θ) + θ
2 2
= 4
ki=1f (xi |θ)π0 (θ) = k exp −
2

5 8
= k exp − (θ − )2 ,
2 25
208 14 Statistics

Note that all factors which are independent of θ are now included in the constant k.
8 1
We obtain that the a posteriori distribution is Gaussian N ( , ), hence
25 5
√
5
k=√ .
2π

8
The graph of the a posteriori density is bell shaped with symmetry axis x = .
25
Then arg max π4 (θ| x1 = 0.1, x2 = 2, x3 = −1, x4 = 0.5) = 25
8
. Verify this by
computing the derivatives of the density function.
(c) The parameters of the a posteriori density provide us with:
1. the a posteriori expectation

8
P(Θ| x1 = 0.1, x2 = 2, x3 = −1, x4 = 0.5) = ;
25
2. the a posteriori variance

σ 2 (Θ| x1 = 0.1, x2 = 2, x3 = −1, x4 = 0.5) =

1
.

5

Exercise 14.6 The random numbers X 1 , X 2 , . . . are stochastically independent sub-

ordinately to Θ with the same conditional marginal density given by

1 (x − θ)2
f (x|θ) = √ exp − , x ∈ R.
2 2π 8

The a priori distribution of Θ is given by

1 (θ − 1)2
π0 (θ) = √ exp − , θ ∈ R.
4π 4

We observe the values of the first 3 experiments:

x1 = 1, x2 = 0.5, x3 = −1.

(a) Compute the likelihood factor.

(b) Compute the a posteriori density of Θ.
(c) Estimate the a posteriori probability of the event (Θ > 1000).
14 Statistics 209

Solution 14.6 (a) By definition, the likelihood factor is given by

f (x1 , x2 , x3 |θ) = i=1

3
f (xi |θ)

3
i=1 (x i − θ)
2
1
= exp −
8 (2π)3 8

1 1 9
= exp − 3θ2 − θ + .
8 (2π)3 8 4

(b) By the computations for the likelihood factor, we immediately obtain the a
posteriori density as follows

π3 (θ| x1 = 1, x2 = 0.5, x3 = −1)

= k f (x1 , x2 , x3 , x4 |θ)π0 (θ)

1 9 (θ − 1)2
= k exp − 3θ2 − θ + −
8 4 4

5 1
= k exp − (θ − )2 ,
8 2

where we have put in the constant k all terms which are independent of θ. We
1 4
obtain that the a posteriori distribution is a normal distribution N ( , ) with
√ 2 5
5
normalization constant k = √ .
2 2π
(c) To estimate the a posteriori probability of the event (Θ > 1000) we use the
tail estimation for the Gaussian distribution. To this purpose we need first to
express Θ as function of a random variable Y with distribution N (0, 1). Since
1 4
the a posteriori distribution of Θ is Gaussian N ( , ), we have
2 5
√
2 5 1
Θ= Y+ ,
5 2
where Y ∼ N (0, 1). Hence
√ √
2 5 1 5
P(Θ > 1000) = P( Y + > 1000) = P(Y > · 999, 5).
5 2 2
By the tail estimation of the standard normal distribution, we obtain that

n(x) n(x) n(x)

− 3 < P(Y > x) < ,
x x x
210 14 Statistics

1 x2
where x > 0, n(x) = √ e− 2 . To obtain an upper bound for P(Θ > 1000)
2π
n(x)
we can compute in the point
x
√
x=
5
· 999, 5.

2

Exercise 14.7 The random numbers X 1 , X 2 , . . . are stochastically independent sub-

ordinately to Φ with the same conditional marginal density given by

1 1 φ(x − 1)2
f (x|φ) = √ φ 2 exp − , x ∈ R.
2π 2

The a priori distribution of Φ is given by a Gamma distribution Γ (2, 1). We observe

the values of the first 3 experiments:

x1 = 1.5, x2 = 0.5, x3 = 2.

(a) Write the a priori density of Φ.

(b) Compute the a posteriori density π3 (φ| x1 = 1.5, x2 = 0.5, x3 = 2) of Φ and
arg max π3 (φ| x1 = 1.5, x2 = 0.5, x3 = 2).
(c) Compute the a posteriori expectation and variance of Φ.

Solution 14.7 (a) Since the a priori distribution of Φ is given by a Gamma distri-
bution Γ (2, 1), we can write immediately the a priori density:

φ e−φ φ ≥ 0,
π0 (φ) =
0 φ < 0.

(b) The a posteriori density is given by

π3 (φ| x1 = 1.5, x2 = 0.5, x3 = 2)

= k f (x1 , x2 , x3 |φ)π0 (φ)

3
5 (x i − 1) 2
= kφ 2 exp − i=1
+1 φ
2
= kφ 2 e− 4 φ
5 7

for φ ≥ 0, 0 otherwise. Note that we have put in the constant k all factors which
are independent of φ. The a posteriori distribution is then a Gamma distribution
7 7
Γ ( , ) with normalization constant
2 4
14 Statistics 211

27 √
7 1 77
k= = √ .
4 Γ (2)
7
240 π

Furthermore we have that

d 5 7
π3 (φ| x1 = 1.5, x2 = 0.5, x3 = 2) = kφ 2 e− 4 φ ( − φ) = 0
3 7

dφ 2 4

10
if φ = . We immediately obtain that arg max π3 (φ| x1 = 1.5, x2 = 0.5, x3 =
7
2) = 10
7
by analyzing the sign of the first derivative.
(c) The parameters of the a posteriori density provide us with
1. the a posteriori expectation

P(Φ| x1 = 1.5, x2 = 0.5, x3 = 2) = 2 ;

2. the a posteriori variance

σ 2 (Θ| x1 = 0.1, x2 = 2, x3 = −1, x4 = 0.5) =

8
.

7

Exercise 14.8 The random numbers X 1 , X 2 , . . . are stochastically independent sub-

ordinately to Φ with the same conditional marginal density given by

1 1 φx 2
f (x|φ) = √ φ 2 exp − , x ∈ R.
2π 2

The a priori distribution of Φ is given by an exponential distribution with parameter

λ = 2. We observe the values of the first 4 experiments:
√
x1 = 1, x2 = 2, x3 = 0.5, x4 = 2.

(a) Write the a priori density of Φ and the a priori probability of the event (Φ > 2).
(b) Compute the a posteriori density of Φ and the a posteriori probability of the
event (Φ > 2).
(c) Compute the a posteriori expectation of Z = Φ 2 .

Solution 14.8 (a) Since the a priori distribution of Φ is given by an exponential

distribution with parameter λ = 2, i.e. a Gamma distribution Γ (1, 2), we can
write immediately the a priori density:

2e−2φ φ ≥ 0,
π0 (φ) =
0 φ < 0.
212 14 Statistics

The a priori probability of the event (Φ > 2) is given by

+∞
P(Φ > 2) = 2e−2φ dφ = e−4 .
2

(b) The a posteriori density is given by

√
π4 (φ| x1 = 1, x2 = 2, x3 = 0.5, x4 = 2)
= k f (x1 , x2 , x3 , x4 |φ)π0 (φ)

4 2
i=1 x i
= kφ exp −
2
+2 φ
2
= kφ2 e− 8 φ
45

for φ ≥ 0, 0 otherwise. Note that we have put in the constant k all factors which
are independent of φ. The a posteriori distribution is then a Gamma distribution
45
Γ (α4 , λ4 ) = Γ (3, ) with normalization constant
8
3
45 1 453
k= = 10 .
8 Γ (3) 2

The a posteriori probability of the event (Φ > 2) is given by

√
P(Φ > 2| x1 = 1, x2 = 2, x3 = 0.5, x4 = 2)
+∞ √
= π4 (φ| x1 = 1, x2 = 2, x3 = 0.5, x4 = 2)dφ
2
+∞
φ2 e− 8 φ dφ
45
=k
2
8 2 45 +∞ 34 53 45
=k − φ + 2φ + 2 e− 8 φ = 6 e− 4 .
45 2 2

(c) To compute the a posteriori expectation of Z = Φ 2 it is sufficient to note that

√
P(Φ 2 | x1 = 1, x2 = 2, x3 = 0.5, x4 = 2)
√
= σ 2 (Φ| x1 = 1, x2 = 2, x3 = 0.5, x4 = 2)
√
+ P(Φ| x1 = 1, x2 = 2, x3 = 0.5, x4 = 2)
α4 α42 256
= + = .

λ4
2
λ4
2 675
Appendix A
Elements of Combinatorics

n
Consider a set Ω = {a1 , . . . , an } of n elements. We recall that the symbol is
r

n n!
called binomial coefficient and that = .
r r !n − r !

A.1 Dispositions

We count the number of ways of choosing r elements out of a set of n elements with
repetitions and taking in account of their order, i.e. the number of dispositions of r
elements out of n. We have:

1st element −→ n choices,

2nd elements −→ n choices,

· ·
· ·
· ·
r th elements −→ n choices .

Totally, the dispositions are n · n . . . n = nr . They count the number of functions

from a set of r elements to a set of n elements.

F. Biagini and M. Campanino, Elements of Probability and Statistics,
UNITEXT - La Matematica per il 3+2 98, DOI 10.1007/978-3-319-07254-8
214 Appendix A: Elements of Combinatorics

A.2 Simple Dispositions

We count the number of ways of choosing r elements out of a set of n elements

without repetitions and taking in account of their order, i.e. the number of simple
dispositions of r elements out of n. We have:

1o element −→ n choices,

2o elements −→ (n − 1) choices,

3o elements −→ (n − 2) choices,

· ·
· ·
· ·
r elements −→ (n − r + 1) choices .
o

n!
Totally, the simple dispositions are n · (n − 1) . . . (n − r + 1) = and are
(n − r )!
denoted by the symbol Drn or (n)r . They count the number of injective functions from
a set of r elements to a set of n elements. If r = n, they are called permutations.

A.3 Simple Combinations

We count the number of ways of choosing r elements out of a set of n elements

without repetitions and without taking in account of their order, i.e. the number of
simple combinations of r elements out of n. Given a simple combination of r elements
out of n, we obtain r ! dispositions by permutating the r elements. The number of
simple combinations is then

1 n n! n
Dr = = .
r! r !(n − r )! r

They count the number of injective functions from a set of r elements to a set of n
elements which have a different image.
Appendix A: Elements of Combinatorics 215

A.4 Combinations

We count the number of ways of choosing r elements out of a set of n elements without
taking in account of their order, i.e. the number of combinations of r elements out
of n. Given a combination {a1 , . . . , ar }, without loss of generality, we can suppose
that a1 ≤ · · · ≤ ar . Starting from this combination, we now construct a simple
combination of r elements out of n + r − 1 elements in the following way:

b1 = a1 ,
b2 = a2 + 1 ,
· ·
· ·
· ·
br = ar + r − 1 .

On the other way round, we can always associate a combination to a simple com-
bination. Hence the r -combinations
are
as many as the r -simple combinations in
n +r −1
n + r − 1, elements, i.e. .
r

A.5 Multinomial Coefficient

The number of ways of forming k groups of r1 , . . . , rk elements respectively, where

r1 + · · · + rk = n is given by the multinomial coefficient

n!
.
r1 !r2 ! . . . rk !

n
To form the first group of r1 elements, we have possibilities. For the second
r1

n − r1
group, we have ways. Analogously we proceed for the remaining groups.
r2
We obtain

n n − r1 n − r1 − · · · − rk−1 n!
··· = .
r1 r2 rk r1 !r2 ! . . . rk !
Appendix B
Relations Between Discrete and Absolutely
Continuous Distributions

In Table B.1 we summarize some analogies between discrete and absolutely contin-
uous distributions.

Table B.1 Some analogies between discrete and absolutely continuous distributions
C. Discrete C. Abs. Continuous
Probability Density
P(X = x) −→ f (x)
Cumulative distribution function P(X ≤ x)
x
P(X = i) −→ −∞ f (s) ds
i∈I (X ),i≤x
Expectation of X
+∞
i P(X = i) −→ −∞ s f (s) ds
i∈I (X )
Expectation of Y = Ψ (X )
+∞
Ψ (i) P(X = i) −→ −∞ Ψ (s) f (s) ds
i∈I (X )
P(X ∈ A)

P(X = i) −→ A f (s) ds
i∈I (X ),i∈A

F. Biagini and M. Campanino, Elements of Probability and Statistics,
UNITEXT - La Matematica per il 3+2 98, DOI 10.1007/978-3-319-07254-8
Appendix C
Some Discrete Distributions

We present in Table C.1 an overview of the discrete distributions presented in Chap. 2.

Table C.1 Some discrete distributions

Distribution I (X ) P(X = k) P(X ) σ 2 (X )
Bernoulli p {0, 1} P(X = 1) = p p p(1 − p)

n
Binomial {0, . . . , n} p k (1 − p)n−k np np (1 − p)
Bn(n, p) k
1− p
Geometric p {1, 2, . . . } p (1 − p)k−1 1
p p2
⎛ ⎞⎛ ⎞
⎝
b ⎠⎝ N − b ⎠
k n−k
N −n
Hypergeometric {0 ∨ (n − (N − ⎛ ⎞ n Nb n N −1
b
N 1− b
N
(n, N , b) b)), . . . , n ∧ b} ⎝
N⎠
n
λk −λ
Poisson λ N k! e λ λ

F. Biagini and M. Campanino, Elements of Probability and Statistics,
UNITEXT - La Matematica per il 3+2 98, DOI 10.1007/978-3-319-07254-8
Appendix D
Some One-Dimensional Absolutely
Continuous Distributions

We recall in Table D.1 the most common one-dimensional absolutely continuous

distributions.

Table D.1 Some one-dimensional absolutely continuous distributions

Distribution I (X ) Density P(X ) σ 2 (X )
(b−a)2
Uniform [a, b] [a, b]a 1
b−a I[a,b]
a+b
2 12

Exponential λ R+ λ e−λx I{x≥0} 1

λ
1
λ2
x2
Std. normal N (0, 1) R √1 e− 2 0 1
2π
2
− (x−μ)
Gen. normal N (μ, σ 2 ) R √1 e 2σ2 μ σ2
2πσ 2
λα α α
Gamma Γ (α, β) R+ Γ (α) x
α−1 e−λx I
{x≥0} λ λ2
Γ (α+β) α−1 α αβ
Beta B(α, β) [0, 1] Γ (α)Γ (β) x (1 − x)β−1 α+β (α+β)2 (α+β+1)
ab >a

F. Biagini and M. Campanino, Elements of Probability and Statistics,
UNITEXT - La Matematica per il 3+2 98, DOI 10.1007/978-3-319-07254-8
Appendix E
The Normal Distribution

We present in Table E.1 a summary on the normal distribution.

Table E.1 The normal distribution in a nutshell

. , xn ) = k e(− 2 Ax·x+b·x)
1
Density f (x1⎛, . . ⎞ ⎛ ⎞
x1 b1
⎜ . ⎟ ⎜ . ⎟
x = ⎝ .. ⎠ , A ∈ S(n), b = ⎝ .. ⎟
⎜ ⎟ ⎜
⎠
xn bn

det A − 21 A−1 b·b
Normalization constant k= (2π)n e
Expectation P(X )= A−1 b ⇒ P(X −1
i ) = (A b)i
Variance and covariance matrix C = A −1

Marginal distribution of X i X i ∼ N (A−1 b)i , [A−1 ]ii

F. Biagini and M. Campanino, Elements of Probability and Statistics,
UNITEXT - La Matematica per il 3+2 98, DOI 10.1007/978-3-319-07254-8
Appendix F
Stirling’s Formula

In this chapter we present Stirling’s formula, which describes the asymptotic behavior
of n! with n increasing. It holds that:
√
2π n n+ 2 e−n (1 + O(n −1 )) .
1
Stirling’s formula: n! =

Different kinds of proofs can be used to prove this formula. Here we present the
classical proof and a more general result, of which the Stirling’s formula represents
a particular case.
We start with the classical proof which can be found in [2]. Here we recall it for
reader’s convenience.

F.1 First Proof

Here we obtain Stirling’s √formula modulo a multiplicative constant. This value can
be shown to be equal to 2π, as a consequence of Theorem 5.4.1 by approximating
the probability that a random number with binomial distribution Bn(2n, 21 ) assumes
the value n.
The Stirling’s formula is equivalent to

n!
lim √ 1 = 1.
n→∞ 2π n n+ 2 e−n

In order to compute this limit, we look for an estimation of

log n! = log(1 · 2 · . . . n) = log 1 + log 2 + · · · + log n .

Since log x is an increasing function, it can be approximated as follows:

F. Biagini and M. Campanino, Elements of Probability and Statistics,
UNITEXT - La Matematica per il 3+2 98, DOI 10.1007/978-3-319-07254-8
226 Appendix F: Stirling’s Formula
k k+1
log xdx < log k · 1 < log xdx .
k−1 k

Hence summing up
n
k
n n
k+1
log xdx < log k < log xdx,
k=1 k−1 k=1 k=1 k

we have
n log n − n < log n! < (n + 1) log(n + 1) − n .

This inequality suggests to use log n! to approximate

1
(n + ) log n − n .
2

We can namely think that (n + 21 ) log n represents a sort of average. If we put

1 n!
dn = log n! − n + log n − n = log 1 ,
2 n n+ 2 e−n

we have
1 n+1
dn − dn+1 = n + log − 1, (F.1)
2 n

however
1
1+
n+1 2n + 1
=
n 1
1−
2n + 1

and
∞
xn
log(x + 1) = (−1)n+1 . (F.2)
n=1
n

Since
⎛ 1 ⎞
1+
n+1 ⎜ 2n + 1 ⎟ = log 1 + 1 1
= log ⎝ − − ,
1 ⎠
log log 1
n 2n + 1 2n + 1
1−
2n + 1

1
using (F.2) with x = ± we obtain
2n + 1
Appendix F: Stirling’s Formula 227

1 1 1
dn − dn+1 = (2n + 1) log 1 + − log 1 − −1
2 2n + 1 2n + 1

1 2 2 2
= (2n + 1) + + + · · · −1
2 (2n + 1) 3(2n + 1)3 5(2n + 1)5
1 1
= + + ··· ,
3(2n + 1) 2 5(2n + 1)4

from which it follows

dn − dn+1 > 0 .

Hence dn is decreasing. It follows that the limit of dn exists (finite or infinite). To

prove that the limit is finite, we note that
⎡ ⎤
∞
2k
1 1 1⎢
⎢ 1 ⎥
0 < dn − dn+1 < = − 1⎥
3 2n + 1 3⎣ 1 ⎦
k=1 1−
(2n + 1)2
1 1 1 1
= = − ,
3 (2n + 1) − 1
2 12n 12(n + 1)

i.e. the sequence

1
an = dn −
12n
is increasing. Since
an ≤ dn ∀n ∈ N, n = 0,

and it holds that

1
lim an = lim dn − = lim dn ,
n→∞ n→∞ 12n n→∞

we obtain that the limit of dn exists finite since the two sequences an e dn are bounded
by each other.

F.2 Proof by Using the Gamma Function

Consider the Gamma function given by

+∞
Γ (α) = x α e−x d x ,
0

where α > 0. It represents a generalization of factorial n, since for all α > 0 it holds
that
228 Appendix F: Stirling’s Formula

Γ (α + 1) = αΓ (α).

This can be easily verified by integration by parts. If α is a natural number, by iteration

we obtain
Γ (n + 1) = n!.

To prove Stirling’s formula we show the more general result that

√
2π αα+ 2 e−α (1 + O(α−1 )) .
1
Γ (α + 1) =

We consider logarithm of φ(x) = log (x α e−x ) = α log x − x. We compute the Taylor

expansion of φ(x) at the maximum point α:

1 n
(−1)k−1 (x − α)k (−1)n (x − α)n+1
φ(x) = α log α − α − (x − α)2 + +α ,
2α k=3
k α k−1 n+1 ξ n+1

where ξ ∈ [α, x]. In the integral we perform the change of variable

x −α √
u= √ , dx = αdu .
α

We obtain +∞
u2
Γ (α + 1) = αα+ 2 e−α e− 2 +ψ(u) du,
1

√
− α

where
n
(−1)k−1 u k n+3 (−1)
n
u n+1
ψ(u) = +α 2 √
k=3
k α 2 −1
k
n + 1 (α + ξ α)n+1

with ξ ∈ [0, u]. We divide the integral in three parts:

√
I1 = [− α, −αδ ], I2 = [−αδ , αδ ], I3 = [αδ , +∞],

where δ > 0 s sufficiently small constant. For what concerns I1 , I3 we note that φ(u)
is a concave function. Hence we obtain that also the function

u2
θ(u) = − + ψ(u),
2
obtained by φ by adding a constant and by a linear transformation of the underlying
variable, is concave. For u ≤ −αδ we have
u
θ(u) ≤ − θ(−αδ )
αδ
Appendix F: Stirling’s Formula 229

and for u ≥ αδ
u
θ(u) ≤ θ(αδ ).
αδ
1
By the expansion of ψ(u) with n = 2 we note that for α sufficiently big and δ <
6
α2δ α2δ
we have θ(−αδ ) < − , θ(αδ ) < − Hence for |u| ≥ αδ it holds that
4 4

αδ
θ(u) ≤ −|u| .
4
It follows that

αδ
eθ(u) du + eθ(u) du ≤ e−|u| 4 du
I1 I3 |u|≥αδ

8 −|u| αδ +∞ 8 α2δ
= − δe 4 = δ e− 4 .
α αδ α

We now consider I2 . If we choose n = 3 we obtain

1 u3 1 α3 u 4 1 u3 u4
eψ(u) = exp 1 − √ =1+ 1 + O( )
3 α2 4 (α + ξ α)4 3 α2 α

with ξ ∈ [0, u] ⊂ I2 and for |u| < αδ . It follows that

α− 2
1
u2 u2 u2 u2
e− 2 +ψ(u) du = e− 2 du − e− 2 du + + u 3 e− 2 du + O(α−1 )
I2 I2c 3 I2
√ −1
= 2π + O(α )

and hence √
2π αα+ 2 e−α (1 + O(α−1 )) .
1
Γ (α + 1) =
Appendix G
Elements of Analysis

In this appendix we recall some definitions and results of analysis in one variable to
facilitate the theoretical comprehension and the execution of the exercises.

G.1 Limit of a Sequence

Let (an )n ∈ N be a sequence of real numbers. This is called

1. convergent if
lim an = L < ∞,
n→∞

i.e. if for all > 0 there exists N = N () such that for all n > N

|an − L| < ;

2. divergent if
lim an = +∞,
n→∞

i.e. if for all M > 0 there exists N = N (M) such that for all n > N

an > M ,

or limn→∞ an = −∞ respectively, i.e. if for all M > 0 there exists N =

N (M) such that for all n > N

an < M .

A sequence may neither be convergent nor divergent. For example, the sequence
an = (−1)n oscillates between 1 and −1.

F. Biagini and M. Campanino, Elements of Probability and Statistics,
UNITEXT - La Matematica per il 3+2 98, DOI 10.1007/978-3-319-07254-8
232 Appendix G: Elements of Analysis

G.2 Limit of Functions

A function f : R −→ R has:
1. finite limit in x if
lim f (y) = L < ∞ ,
y→x

i.e. if for all > 0 there exists δ = δ() such that for all y with |y − x| < δ

| f (y) − L| < ;

2. infinite limit in x if
lim f (y) = +∞ ,
y→x

i.e. if for all M > 0 there exists δ = δ(M) such that for all y with |y − x| < δ

f (y) > M ,

or lim y→x f (y) = −∞ meaning that, for all M > 0 there exists δ =
δ(M) such that for all y with |y − x| < δ

f (y) < M .

G.3 Limits of Special Interest

We recall the following limits of special interest:

1.
1 n
lim 1+ = e;
n→∞ n

2. ∀ x ∈ R x n
lim 1+ = ex ;
n→∞ n
3.
log(1 + x)
lim = 1.
x→0 x
Appendix G: Elements of Analysis 233

G.4 Series

We recall the following series:

1. the geometric series
∞
1
xn =
n=0
1−x

for all |x| < 1;

2. the series
∞
1
nx n−1 =
n=1
(1 − x)2

for all |x| < 1, which is obtained as derivative of the geometric series;
3. the exponential series

∞
xn
= ex
n=0
n!

for all x ∈ R.

G.5 Continuity

A function is said to be continuous in the point x0 if

lim f (x) = lim+ f (x) = f (x0 ) ,

x→x0− x→x0

where lim− f (x), lim+ f (x) are called left limit and right limit respectively. The
x→x0 x→x0
left limit is taken over x < x0 , the right limit is taken over x > x0 .

G.6 Table of the Principal Rules of Derivation

We summarize the most common derivatives as well as the principal rules of deriva-
tion in Table G.1 and in Table G.2, respectively.
234 Appendix G: Elements of Analysis

Table G.1 Derivatives Function f (x) Derivative f (x)

xn n x n−1
ex ex
1
log x x
sin x cos x
cos x − sin x
x2 x2
e− 2 −x e− 2

Table G.2 Rules of d

dx [ f (x) + g(x)] f (x) + g (x)
derivation
d
dx [ f (x) g(x)] f (x) g(x) + f (x) g (x)

d f (x) f (x) g(x)− f (x) g (x)
dx g(x) g 2 (x)
d
dx [ f (g(x))] f (g(x)) · g (x)

G.7 Integrals

1. Integration by parts formula

b b
f (x) g (x) dx = [ f (x) g(x)]ab − f (x) g(x) dx .
a a

2. Change of variable
x = g(y) ⇒ d x = g (y) dy,
b g −1 (b)
f (x) dx = f (g(y)) g (y) dy .
a g −1 (a)
Appendix H
Bidimensional Integrals

In this appendix we recall some notions of analysis in several variables to facilitate

the comprehension of the text and the execution of the exercises.

H.1 Areas of Bidimensional Regions

Let A be a region of the plane (Fig. H.1). The area of A is given by

area A = dxdy .
A

This is analogous to the one-dimensional case, where the length of a segment [a, b]
is given by
b
l([a, b]) = dx .
a

H.2 Integrals of Functions of Two Variables

Let f : R2 → R and put z = f (x, y). A function in two variables describes a surface
in R3 of coordinates (x, y, f (x, y)). We want to calculate the volume between the
surface described by the function and the plane x y. This volume is given by the
double integral

f (x, y) dxdy ,
R2

F. Biagini and M. Campanino, Elements of Probability and Statistics,
UNITEXT - La Matematica per il 3+2 98, DOI 10.1007/978-3-319-07254-8
236 Appendix H: Bidimensional Integrals

Fig. H.1 A region of the

plane y

O x

if f is sufficiently regular (for example, if f is continuous). We can compute a double

integral as two nested one-dimensional integrals, i.e.

dx f (x, y) dy = dy f (x, y) dx = f (x, y) dxdy .

This result holds for sufficiently regular functions f (for example, if f is continuous)
and is known as Fubini-Tonelli theorem. We refer to [11] for further details.

Example H.2.1 Let A = {1 < x < 2, 3 < y < 4}. We compute the following
integral on A.
2 4
x 2 y dxdy = dx x 2 y dy
A 1
! "
3

x is a parameter!
2 4
= x 2
y dy dx
1 3
2 4
1 2
= x2 y dx
1 2 3

7 2 2
= x dx
2 1
49
= .
6
Appendix H: Bidimensional Integrals 237

Fig. H.2 The region B

O x

Example H.2.2 Let B = {0 < x < 1, x − 1 < y < x + 1}, see Fig. H.2.
We calculate the following integral on B.
1 x+1
−y
e dxdy = dx e−y dy
B 0 x−1
1 # −y $x+1
= e x−1 dx
0
1 −(x+1)
= e − e−(x−1) dx
0

−1 1
= e − e e−x dx
0−1
= e − e 1 − e−1 .

Example H.2.3 To perform a double integral, it is convenient to divide the domain

of integration in a suitable way. Consider the example of Fig. H.3, where D = {0 <
y < 1, y − 1 < x < −y + 1}.
We compute the integral of a function f (x, y), which is assumed to be sufficiently
regular, on D:
1 −y+1
f (x, y) dxdy = dy f (x, y) dx
D 0 y−1
1 −x+1 0 x+1
= dx f (x, y) dy + dx f (x, y) dy.
0 0 −1 0
238 Appendix H: Bidimensional Integrals

Fig. H.3 Region D

y=
1
x+

−x
y=

+1
−1 O 1 x

In the first passage the extremes of integration can be found by drawing the parallels
to the x-axis and finding the intersections with the border of the domain D. In the
second step the integral has been split in two parts and the extremes have been found
by drawing the parallels to the y-axis.

H.3 Partial Derivatives with Respect to a Single Variable

Let f : R2 → R, z = f (x, y). We call partial derivative of f with respect to the

variable x and write ∂∂xf the derivative of f obtained by considering the function as
depending only by the variable x and considering the other variables as parameters.
Analogously we can define the partial derivatives of a function with respect to the
other variables.

Example H.3.1 (Partial derivatives)

1. f (x, y) = x 2 y:
∂f ∂f
= 2x y , = x2 ;
∂x ∂y

2. f (x, y) = log(x y):

∂f 1 ∂f 1
= , = .
∂x x ∂y y

H.4 Change of Variables

Let Ψ : R2 → R2 , (x, y) = (Ψ1 (x, y), Ψ2 (x, y)). We call Jacobian Jψ of the
function ψ the matrix
Appendix H: Bidimensional Integrals 239
⎛ ⎞
∂Ψ1 ∂Ψ1
∂x ∂y
⎜ ⎟
JΨ = ⎝ ⎠.
∂Ψ2 ∂Ψ2
∂x ∂y

A change of coordinates in R2 is given by a function

Ψ : R2 → R2

(u, v) −→ (x, y)

with particular regularity properties (diffeomorphismus). To change the variables in

an integral, we use then the following rule:

f (x, y) dxdy = f (Ψ (u, v)) |det JΨ | dxdy
A Ψ −1 (A)

with the help of the following diagram:

Ψ / 2
R2(u,v) R(x,y)
GG
GGf ◦Ψ
GG
GG f
G#
R

Example H.4.1 In this example we consider the computation of the normalization

constant for the standard normal distribution in 2 dimension. To this purpose we need

Fig. H.4 Extremes of

integration as x varies y

O x
240 Appendix H: Bidimensional Integrals

Fig. H.5 Extremes of

integration as y varies y

O x

to use a change of variable (Figs. H.4 and H.5). Consider

e− 2 (x +y 2 )
1 2
dxdy .
R2

To compute this integral, we use the polar coordinates:

x = ρ cos θ, y = ρ sin θ

Ψ
(θ, ρ) → (x, y) = (ρ cos θ, ρ sin θ) .

The Jacobian of this transformation is given by

⎛ ∂ ∂ ⎞ ⎛ ⎞
∂θ
ρ cos θ ∂ρ
ρ cos θ −ρ sin θ cos θ
JΨ = ⎝ ⎠ = ⎝ ⎠.
∂ ∂
∂θ
ρ sin θ ∂ρ
ρ sin θ ρ cos θ sin θ

The Jacobian determinant is then

det JΨ = −ρ (sin2 θ + cos2 θ) = −ρ,

i.e.

| det JΨ | = ρ .
Appendix H: Bidimensional Integrals 241

It follows that
+∞ 2π
e− 2 (x +y 2 )
ρ e− 2 ρ dθ
1 2 1 2
dxdy = dρ
R2 0 0
+∞ 2π
− 21 ρ2
= ρe dρ dθ
0 0

1 2 +∞
= 2π −e− 2 ρ
! 0 "
1
= 2π .

Hence
+∞ 2 +∞ +∞
x2 x2 Y2
e− 2 dx = e− 2 dx e− 2 dy
−∞ −∞ −∞

e− 2 (x +y 2 )
1 2
= dxdy = 2π .
R2

Finally +∞
x2 √
e− 2 dx = 2π .
−∞
References

1. de Finetti, B.: Theory of Probability. A Critical Introduction Treatment, vol. 1, 2. Wiley,

New York (1974, 1975)
2. Feller, W.: An Introduction to Probability Theory and Its Applications, vol. 1. Wiley, New York
(1957)
3. Foatà, D., Fuchs, A.: Calcul des probabilités, 2nd edn. Dunod, Paris (1998)
4. Gnedenko, B.: The Theory of Probability. Gordon and Breach Science Publishers, Amsterdam
(1997)
5. Hogg, R.V., Tanis, E.A.: Probability and Statistical Inference. Prentice Hall, New York (2001)
6. Jacod, J., Protter, P.: Probability Essentials. Springer, Berlin (2003)
7. Kleinrock, L.: Queueing Systems 1. Wiley, New York (1975)
8. Kleinrock, L., Gail, R.: Queueing Systems: Problems and Solutions. Wiley, New York (1996)
9. Lee, P.M.: Bayesian Statistics: An Introduction. Edward Arnold, London (1994)
10. Lindley, D.V.: Introduction to Probability and Statistics, vol. 1, 2. Cambridge University Press,
New York (1965)
11. Munkres, J.R.: Analysis On Manifolds. Advanced Books Classics. Westview Press, Boulder
(1997)
12. Ross, S.M.: Introduction to Probability Models. Elsevier, New York (2010)

F. Biagini and M. Campanino, Elements of Probability and Statistics,
UNITEXT - La Matematica per il 3+2 98, DOI 10.1007/978-3-319-07254-8
Index

A a posteriori, 106
Absolutely continuous distribution, 44 conditional, 104
Absorbing boundary conditions, 82 joint probability, 58
Area, 235 marginal probability, 59
Derivative, 234
Discrete distribution, 27
B Dispositions, 213
Bayes’ formula, 16 simple, 214
Bernoulli scheme, 27 Distribution
Bet, 8 a priori, 106
Binomial distribution, 28 beta, 62
Bounded, 3 Cauchy, 55
χ2 , 54
exponential, 48
C gamma, 52
Change of variables, 239 Gaussian n-dimensional, 66
Client, 89 initial, 81
Coefficient normal, 50
binomial, 213 stationary, 96
multinomial, 215 Student, 63
Coherence, 8 Double integral, 235
Combinations, 215
Complementary, 5
Conditional probability, 14 E
Confidence Equations
intervals, 111 Chapman-Kolmogorov, 91
region, 111 Kolmogorov forward, 91
Constituent, 6 Event, 5
Continuity, 233 Exhaustivity, 6
Convergence for sequences of cumulative Expectation, 8
distribution functions, 73
Covariance, 22
Cumulative distribution function, 43 F
Formula
of composite expectation, 15
D Stirling’s, 225
Density Function
© Springer International Publishing Switzerland 2016 245
F. Biagini and M. Campanino, Elements of Probability and Statistics,
UNITEXT - La Matematica per il 3+2 98, DOI 10.1007/978-3-319-07254-8
246 Index

joint cumulative distribution, 57 P

marginal cumulative distribution, 58 Partial derivatives, 238
Partition, 6
Penalty, 8
G Permutations, 214
Generating function, 39 Pluri-events, 33
Geometric distribution, 29 Poisson distribution, 30
Polar coordinates, 240
Positively
H correlated, 22
Homogeneous Markov chain, 81 Possible values, 3
Hypergeometric distribution, 31 Precision, 108
Process
Poisson, 91
I stochastic, 89
Incompatibility, 6

Q
J Queueing system, 89
Jacobian, 238
Joint distribution, 35
R
Random
L number, 3
Law of large numbers, 25 vector, 35
Likelihood factor, 108 walk, 82
Limit, 231
Linearity, 8
Little’s formula, 101 S
Logical Series, 233
product, 5 Server, 89
sum, 5 Service times, 89
Logically Simple combinations, 214
dependent event, 7 State space, 81
independent event, 7 Stationary regime, 100
semidependent event, 7 Statistical
Lower bounded, 3 induction, 104
inference, 104
Successes, 28
M
Marginal distribution, 35
Memoryless, 30 T
Monotonicity, 8 Theorem
Multinomial distribution, 33 De Moivre-Laplace, 77
Transition probability matrix, 81

N
Negatively U
correlated, 22 Uniform distribution, 46
Non-correlated, 22 Upper bounded, 3

Resampling Methods For Dependent Data
No ratings yet
Resampling Methods For Dependent Data
382 pages
Combinatorial Optimization: Alexander Schrijver
No ratings yet
Combinatorial Optimization: Alexander Schrijver
34 pages
Stable Convergence and Stable Limit Theorems: Erich Häusler Harald Luschgy
No ratings yet
Stable Convergence and Stable Limit Theorems: Erich Häusler Harald Luschgy
231 pages
Bosq Nguyen A Course in Stochastic Processes PDF
100% (1)
Bosq Nguyen A Course in Stochastic Processes PDF
354 pages
Formal Language A Practical Introduction by Adam Brooks Webber
100% (5)
Formal Language A Practical Introduction by Adam Brooks Webber
400 pages
Parametric Test
100% (1)
Parametric Test
8 pages
Basic Elements of Queueing Theory Lec Notes Philippe NAIN
No ratings yet
Basic Elements of Queueing Theory Lec Notes Philippe NAIN
110 pages
(Zurich Lectures in Advanced Mathematics) Guus Balkema-High Risk Scenarios and Extremes A Geometric Approach - European Mathematical Society (2007)
No ratings yet
(Zurich Lectures in Advanced Mathematics) Guus Balkema-High Risk Scenarios and Extremes A Geometric Approach - European Mathematical Society (2007)
391 pages
All in Likelihood
No ratings yet
All in Likelihood
546 pages
LargeScaleInference PDF
No ratings yet
LargeScaleInference PDF
273 pages
Mastering Probabilistic Graphical Models Using Python - Sample Chapter
No ratings yet
Mastering Probabilistic Graphical Models Using Python - Sample Chapter
36 pages
Lectnotemat 2
No ratings yet
Lectnotemat 2
348 pages
Introduction To Optimisation PDF
No ratings yet
Introduction To Optimisation PDF
264 pages
Test
100% (1)
Test
297 pages
Math PHD Eui
No ratings yet
Math PHD Eui
312 pages
LectureNotes RBF
No ratings yet
LectureNotes RBF
58 pages
Girard - Linear Logic (1987)
No ratings yet
Girard - Linear Logic (1987)
101 pages
Bause Kritzinger SPN Book Screen
No ratings yet
Bause Kritzinger SPN Book Screen
216 pages
2018 Book ProbabilityAndStatisticsForCom
No ratings yet
2018 Book ProbabilityAndStatisticsForCom
374 pages
Operations Research
No ratings yet
Operations Research
118 pages
Causal Inference For The Brave and True - Causal Inference For The Brave and True
No ratings yet
Causal Inference For The Brave and True - Causal Inference For The Brave and True
2 pages
Theorem Proving in Lean
100% (1)
Theorem Proving in Lean
173 pages
Matlab Notes
No ratings yet
Matlab Notes
189 pages
22-Lecture Notes On Probability Theory and Random Processes
100% (2)
22-Lecture Notes On Probability Theory and Random Processes
302 pages
MStat Bog
100% (2)
MStat Bog
259 pages
R 271 Who Gets Power
No ratings yet
R 271 Who Gets Power
3 pages
0 387 28982 8
No ratings yet
0 387 28982 8
656 pages
OSPF Packet Types
No ratings yet
OSPF Packet Types
10 pages
Topological and Statistical Methods For Complex Data: Janine Bennett Fabien Vivodtzev Valerio Pascucci Editors
100% (1)
Topological and Statistical Methods For Complex Data: Janine Bennett Fabien Vivodtzev Valerio Pascucci Editors
297 pages
R Lesson (1 of 2) PDF
No ratings yet
R Lesson (1 of 2) PDF
182 pages
Probability and Computing Lecture Notes
No ratings yet
Probability and Computing Lecture Notes
252 pages
Getting Started With MATLAB: Notes
100% (4)
Getting Started With MATLAB: Notes
27 pages
Practical Linear Algebra
100% (1)
Practical Linear Algebra
253 pages
Automatic Design of Decision-Tree Induction Algorithms - Rodrigo C. Barros PDF
100% (1)
Automatic Design of Decision-Tree Induction Algorithms - Rodrigo C. Barros PDF
184 pages
Cvitanovic Et Al. Classical and Quantum Chaos Book (Web Version 9.2.3, 2002) (750s) - PNC
No ratings yet
Cvitanovic Et Al. Classical and Quantum Chaos Book (Web Version 9.2.3, 2002) (750s) - PNC
750 pages
Statistical Inference in Science
No ratings yet
Statistical Inference in Science
262 pages
Jan de Vries - Topological Dynamical Systems, An Introduction To The Dynamics of Continuous Mappings (2014, de Gruyter) - Libgen - Li
No ratings yet
Jan de Vries - Topological Dynamical Systems, An Introduction To The Dynamics of Continuous Mappings (2014, de Gruyter) - Libgen - Li
515 pages
Introduction To Probability For Ds
No ratings yet
Introduction To Probability For Ds
180 pages
(Series in Mathematical Analysis and Applications 9) Leszek Gasinski, Nikolaos S. Papageorgiou - Nonlinear Analysis-Chapman & Hall - CRC (2006)
100% (1)
(Series in Mathematical Analysis and Applications 9) Leszek Gasinski, Nikolaos S. Papageorgiou - Nonlinear Analysis-Chapman & Hall - CRC (2006)
960 pages
Interaction Particle System Ligget
No ratings yet
Interaction Particle System Ligget
514 pages
Sanet - ST - Pragmatic Type Level Design
No ratings yet
Sanet - ST - Pragmatic Type Level Design
331 pages
Introduction To Random Graphs
100% (1)
Introduction To Random Graphs
583 pages
Fake News Detection Natural Language Processing
No ratings yet
Fake News Detection Natural Language Processing
62 pages
Topics in Random Matrix Theory
No ratings yet
Topics in Random Matrix Theory
342 pages
Complex Systems A Survey
100% (1)
Complex Systems A Survey
10 pages
Numerical Continuation Methods For Dynamical Systems
0% (1)
Numerical Continuation Methods For Dynamical Systems
411 pages
Point Set Theory
100% (1)
Point Set Theory
296 pages
(Mathematics and Its Applications) Malempati M. Rao, Randall J. Swift - Probability Theory With Applications - Springer (2006)
No ratings yet
(Mathematics and Its Applications) Malempati M. Rao, Randall J. Swift - Probability Theory With Applications - Springer (2006)
536 pages
ALAFF
100% (1)
ALAFF
642 pages
Optimal Transport Old and New
No ratings yet
Optimal Transport Old and New
998 pages
Pierre Bremaud - Mathematical Principles of Signal Processing
No ratings yet
Pierre Bremaud - Mathematical Principles of Signal Processing
262 pages
Computer Simulation Techniques: The Definitive Introduction!
No ratings yet
Computer Simulation Techniques: The Definitive Introduction!
176 pages
Parallel Coordinates Visual Multidimensional Geometry and Its Applications
No ratings yet
Parallel Coordinates Visual Multidimensional Geometry and Its Applications
581 pages
Mathematical Methods Modelling and Applications
No ratings yet
Mathematical Methods Modelling and Applications
412 pages
(MPS-SIAM series on optimization) James Renegar - A mathematical view of interior-point methods in convex optimization-Society for Industrial and Applied Mathematics _, Mathematical Programming Societ.pdf
No ratings yet
(MPS-SIAM series on optimization) James Renegar - A mathematical view of interior-point methods in convex optimization-Society for Industrial and Applied Mathematics _, Mathematical Programming Societ.pdf
126 pages
Who Really Matters: The Core Group Theory of Power, Privilege, and Success
From Everand
Who Really Matters: The Core Group Theory of Power, Privilege, and Success
Art Kleiner
4/5 (8)
Simulation: A Modeler's Approach
From Everand
Simulation: A Modeler's Approach
James R. Thompson
No ratings yet
Theoretical Numerical Analysis: An Introduction to Advanced Techniques
From Everand
Theoretical Numerical Analysis: An Introduction to Advanced Techniques
Peter Linz
No ratings yet
Nonnegative Matrices and Applicable Topics in Linear Algebra
From Everand
Nonnegative Matrices and Applicable Topics in Linear Algebra
Alexander Graham
No ratings yet
Elementary Theory and Application of Numerical Analysis: Revised Edition
From Everand
Elementary Theory and Application of Numerical Analysis: Revised Edition
David G. Moursund
No ratings yet
Exercises of Power, Taylor and Fourier Series
From Everand
Exercises of Power, Taylor and Fourier Series
Simone Malacrida
No ratings yet
2ruun Biomechanics and Motor Control of Human Movement Fourth Edition
100% (1)
2ruun Biomechanics and Motor Control of Human Movement Fourth Edition
383 pages
Interdisciplinary Applications of Kinematics
No ratings yet
Interdisciplinary Applications of Kinematics
196 pages
Head Injury Simulation in Road Traf C Accidents
No ratings yet
Head Injury Simulation in Road Traf C Accidents
107 pages
A Primer On Stochastic Epidemic Models
No ratings yet
A Primer On Stochastic Epidemic Models
15 pages
Climate Changes and Air Pollution PDF
No ratings yet
Climate Changes and Air Pollution PDF
428 pages
Same Same, But Different?: Should Football Boot Selection Be A Consideration After Aclr
No ratings yet
Same Same, But Different?: Should Football Boot Selection Be A Consideration After Aclr
6 pages
A Modelling System For Predicting Urban Air Pollut
No ratings yet
A Modelling System For Predicting Urban Air Pollut
12 pages
Does Metabolic Rate Increase Linearly With Running Speed in All Distance Runners?
No ratings yet
Does Metabolic Rate Increase Linearly With Running Speed in All Distance Runners?
8 pages
Advances in Optimization and Decision Science For Society, Services and Enterprises
No ratings yet
Advances in Optimization and Decision Science For Society, Services and Enterprises
493 pages
Advances in Information Systems and Technologies PDF
No ratings yet
Advances in Information Systems and Technologies PDF
1,143 pages
Applied Scientific Computing
100% (2)
Applied Scientific Computing
280 pages
Aerodynamic Characteristics and RNA Concentration of SARS-CoV-2 Aerosol in Wuhan Hospitals During COVID-19 Outbreak
No ratings yet
Aerodynamic Characteristics and RNA Concentration of SARS-CoV-2 Aerosol in Wuhan Hospitals During COVID-19 Outbreak
9 pages
Chapter 3-Multiple Regression Model
No ratings yet
Chapter 3-Multiple Regression Model
26 pages
Wave Scattering 2015
No ratings yet
Wave Scattering 2015
29 pages
Z - Test P8A, PS8B
No ratings yet
Z - Test P8A, PS8B
13 pages
Chapter 9 HoH1
No ratings yet
Chapter 9 HoH1
10 pages
Hypothesis
No ratings yet
Hypothesis
11 pages
Application of The Calculus of Variation
No ratings yet
Application of The Calculus of Variation
4 pages
Introduction To Discrete Choice Models
No ratings yet
Introduction To Discrete Choice Models
6 pages
0705.1617v1-Non Computability of Consciousness
No ratings yet
0705.1617v1-Non Computability of Consciousness
10 pages
1 Sequential Stern-Gerlach Experiment
No ratings yet
1 Sequential Stern-Gerlach Experiment
5 pages
Chapter 4
No ratings yet
Chapter 4
4 pages
Schrodinger Eq
No ratings yet
Schrodinger Eq
13 pages
Hidden Variable Theory
No ratings yet
Hidden Variable Theory
12 pages
NPTEL Phase II - Physics - Nonequilibrium Statistical Mechanics
No ratings yet
NPTEL Phase II - Physics - Nonequilibrium Statistical Mechanics
4 pages
Clabe Problem Sheet 6 Solution
No ratings yet
Clabe Problem Sheet 6 Solution
5 pages
Ece201 Probability Theory and Random Process TH 1.10 Ac29
No ratings yet
Ece201 Probability Theory and Random Process TH 1.10 Ac29
2 pages
GR Cheat Sehe 8
No ratings yet
GR Cheat Sehe 8
2 pages
Var Model Validation: Laura Garc Ia Jorcano February 2018
No ratings yet
Var Model Validation: Laura Garc Ia Jorcano February 2018
9 pages
CHAPTER 4 Scientific Theories
No ratings yet
CHAPTER 4 Scientific Theories
5 pages
DPP - 02 Quantum Chemistry: CSIR-NET - IIT-GATE - IIT-JAM - Other Msc. Entrance
100% (1)
DPP - 02 Quantum Chemistry: CSIR-NET - IIT-GATE - IIT-JAM - Other Msc. Entrance
2 pages
Econometrics: Chapter 6: Multiple Regression Model
No ratings yet
Econometrics: Chapter 6: Multiple Regression Model
23 pages
Problem Sheet 2
No ratings yet
Problem Sheet 2
4 pages
Chapter Two: Bivariate Regression Mode
100% (1)
Chapter Two: Bivariate Regression Mode
54 pages
Solution For Quantum Computation and Quantum Information by Nielsen and Chuang
No ratings yet
Solution For Quantum Computation and Quantum Information by Nielsen and Chuang
7 pages
Many Body Problem
No ratings yet
Many Body Problem
7 pages
Liberalism: Summary of Andrew Heywood's Discussion On "Ideologies" 5 Edition
100% (1)
Liberalism: Summary of Andrew Heywood's Discussion On "Ideologies" 5 Edition
21 pages
RQM PDF 15296
No ratings yet
RQM PDF 15296
40 pages
Teoryang Behaviorism
100% (1)
Teoryang Behaviorism
4 pages
My Einstein: George F. Smoot
No ratings yet
My Einstein: George F. Smoot
8 pages
Personality Types by Eric Partaker
No ratings yet
Personality Types by Eric Partaker
1 page