On Continuous Distributions
Tristan Chaang
February 8, 2022
In this article I will write about continuous distributions, including their properties and their derivations, without using advanced concepts such as measure theory.
Contents
1 PDFs and CDFs
3 Normal Distribution
  3.1 Properties of the Normal Distribution
  3.2 Deriving the pdf for N(0, 1) from B(n, p)
4 Sample Statistics
5 Chi-Squared Distribution
  5.1 Deriving the pdf for the Chi-Squared Distribution
6 Student's t-Distribution
  6.1 Deriving the pdf for the t-Distribution
1 PDFs and CDFs
Suppose we want to randomly choose a number in the interval [0, 100] so that ‘every number is equally
likely to be chosen’. The natural question to ask is, what do we mean exactly by equally likely? The set
[0, 100] has infinitely many elements, so the probability of choosing an exact given value is zero. How do
we resolve this? The answer is by using arbitrary intervals: The probability of choosing a number in the
interval [a, b] is (b − a)/100 for any 0 ≤ a ≤ b ≤ 100.
The distribution above is called the uniform distribution on [0, 100], denoted as $U_{[0,100]}$. However, that is not the only distribution out there. We can have distributions on other intervals, and even if we are just taking a distribution on [0, 100], there can be distributions where it is more likely to choose one number than another, e.g. choosing a number near 0 being more likely than choosing one near 100.
The way we describe a distribution is by using probability density functions (pdf). For a continuous random variable X with distribution D (written as X ∼ D), the pdf $f_X(x)$ of X is the function¹ such that
$$\mathbb{P}(a \le X \le b) = \int_a^b f_X(x)\,dx$$
[Figure: pdf of $U_{[0,100]}$, with $f_X(x) = 1/100$ on [0, 100]; the shaded area between a and b equals $\mathbb{P}(a \le X \le b) = (b-a)/100$.]
The cumulative distribution function (cdf) of X is defined as $F_X(x) = \mathbb{P}(X \le x) = \int_{-\infty}^{x} f_X(t)\,dt$. Notice we used t in the integrand instead of x because we have already used the symbol x in the bound.
The cdf for $X \sim U_{[0,100]}$ is thus
$$F_X(x) = \begin{cases} 0 & \text{if } x < 0;\\ x/100 & \text{if } 0 \le x \le 100;\\ 1 & \text{if } x > 100.\end{cases}$$
¹We assume these pdfs are Riemann integrable. You can ignore this if you haven't heard of it; it roughly means the pdf cannot be a pathological function such as one equal to 0 at every rational number and 1 otherwise, which cannot be integrated in the usual way.
A simple corollary is that $F_X'(x) = f_X(x)$. Using this corollary, we can find the pdf of $g(X)$, where g is a function taking X as input:
$$f_{g(X)}(x) = \frac{d}{dx}\,\mathbb{P}(g(X) \le x).$$
The proof of this is complicated. I will write a new article about this.
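To make the corollary concrete, here is a minimal numerical sketch (assuming Python with numpy, which is not part of this article; the helper names are mine): for X ∼ U[0, 1] and g(X) = X², we have P(X² ≤ x) = √x, so the formula predicts f_{X²}(x) = 1/(2√x), and a finite-difference derivative of a simulated cdf agrees.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.uniform(0.0, 1.0, size=1_000_000) ** 2   # draws of g(X) = X^2, X ~ U[0,1]

def empirical_cdf(values, x):
    """Monte Carlo estimate of P(g(X) <= x)."""
    return np.mean(values <= x)

for x in [0.1, 0.3, 0.5, 0.9]:
    h = 1e-3
    # differentiate the (estimated) cdf, as in f_{g(X)}(x) = d/dx P(g(X) <= x)
    numeric_pdf = (empirical_cdf(samples, x + h) - empirical_cdf(samples, x - h)) / (2 * h)
    exact_pdf = 1.0 / (2.0 * np.sqrt(x))               # since P(X^2 <= x) = sqrt(x) on [0,1]
    print(f"x={x:.1f}  numeric={numeric_pdf:.3f}  exact={exact_pdf:.3f}")
```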
For two continuous random variables X and Y, we describe their behaviour jointly with a joint pdf $f_{X,Y}(x, y)$, defined so that $\mathbb{P}((X, Y) \in A) = \iint_A f_{X,Y}(x, y)\,dy\,dx$ for a region A. This means, instead of calculating a probability by finding the area under the curve of a 2D pdf graph, we calculate it by finding the volume under the surface of a 3D pdf graph. Such a pdf must also satisfy similar properties: the total volume under the surface is 1 and the pdf is always nonnegative.
If $f_{X,Y}(x, y) = f_X(x)\cdot f_Y(y)$ for all x, y, we say that the random variables X and Y are independent. We will normally deal with independent variables.
If $X \sim D_1$ and $Y \sim D_2$ are independent variables, we would like to find the pdf of $Z = X + Y$:
$$
\begin{aligned}
f_Z(z) &= \frac{d}{dz}\,\mathbb{P}(Z \le z)\\
&= \frac{d}{dz}\,\mathbb{P}(Y \le z - X)\\
&= \frac{d}{dz}\int_{-\infty}^{\infty}\int_{-\infty}^{z-x} f_{X,Y}(x, y)\,dy\,dx\\
&= \int_{-\infty}^{\infty} f_X(x)\cdot\frac{d}{dz}\int_{-\infty}^{z-x} f_Y(y)\,dy\,dx\\
&= \int_{-\infty}^{\infty} f_X(x)\, f_Y(z - x)\,dx
\end{aligned}
$$
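As a quick sanity check of this convolution formula (a sketch assuming numpy; the uniform example and names are my own choice): for independent X, Y ∼ U[0, 1], the integral gives the triangular pdf on [0, 2], and a histogram of simulated values of X + Y matches it.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=1_000_000)
y = rng.uniform(0.0, 1.0, size=1_000_000)
z = x + y                                   # Z = X + Y

def f_uniform(u):
    """pdf of U[0,1]."""
    return np.where((u >= 0.0) & (u <= 1.0), 1.0, 0.0)

grid = np.linspace(-1.0, 3.0, 4001)
dx = grid[1] - grid[0]

def f_sum(zval):
    """Convolution integral  ∫ f_X(x) f_Y(z - x) dx, evaluated by a Riemann sum."""
    return np.sum(f_uniform(grid) * f_uniform(zval - grid)) * dx

hist, edges = np.histogram(z, bins=40, range=(0.0, 2.0), density=True)
centres = 0.5 * (edges[:-1] + edges[1:])
for c, h in zip(centres[::10], hist[::10]):
    print(f"z={c:.2f}  histogram={h:.3f}  convolution={f_sum(c):.3f}")
```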
This convolution formula will be important later on. Also, for some positive constant c, the pdf of $Z = cX$ is
$$f_Z(z) = \frac{d}{dz}\,\mathbb{P}(Z \le z) = \frac{d}{dz}\,\mathbb{P}\!\left(X \le \frac{z}{c}\right) = \frac{1}{c}\,f_X\!\left(\frac{z}{c}\right).$$
Finally, let's find the pdf of $Z = X/Y$, assuming Y only takes positive values (which is the case whenever we use this later; for a general Y the factor y below becomes |y|):
$$
\begin{aligned}
f_Z(z) &= \frac{d}{dz}\,\mathbb{P}(Z \le z)\\
&= \frac{d}{dz}\,\mathbb{P}(X \le Yz)\\
&= \frac{d}{dz}\int_{-\infty}^{\infty}\int_{-\infty}^{yz} f_{X,Y}(x, y)\,dx\,dy\\
&= \int_{-\infty}^{\infty} f_Y(y)\cdot\frac{d}{dz}\int_{-\infty}^{yz} f_X(x)\,dx\,dy\\
&= \int_{-\infty}^{\infty} y\, f_Y(y)\, f_X(yz)\,dy
\end{aligned}
$$
Linearity of Expectation
$$
\begin{aligned}
\mathbb{E}(X + Y) &= \iint_{\forall x,y} (x + y)\, f_{X,Y}(x, y)\,dy\,dx\\
&= \iint_{\forall x,y} x\, f_{X,Y}(x, y)\,dy\,dx + \iint_{\forall x,y} y\, f_{X,Y}(x, y)\,dy\,dx\\
&= \int_{\forall x} x \int_{\forall y} f_{X,Y}(x, y)\,dy\,dx + \int_{\forall y} y \int_{\forall x} f_{X,Y}(x, y)\,dx\,dy\\
&= \int_{\forall x} x\, f_X(x)\,dx + \int_{\forall y} y\, f_Y(y)\,dy\\
&= \mathbb{E}(X) + \mathbb{E}(Y),
\end{aligned}
$$
and $\displaystyle \mathbb{E}(cX) = \int_{\forall x} cx\, f_X(x)\,dx = c\int_{\forall x} x\, f_X(x)\,dx = c\,\mathbb{E}(X)$.
If X and Y are two independent variables, then $\mathbb{E}(XY) = \mathbb{E}(X)\,\mathbb{E}(Y)$, and consequently $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$.
3 Normal Distribution
We come to our first famous distribution: the normal distribution. Any continuous random variable X having a pdf of the form²
$$f_X(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\!\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{\!2}\right) \quad \text{for some } \mu \text{ and positive } \sigma$$
is said to belong to a normal distribution, denoted as X ∼ N(µ, σ²). Figure 2 below shows the pdf of N(µ, σ²). Even though this distribution is determined by the mean µ and variance σ², we can always apply a transformation to the pdf of N(µ, σ²): denoting Z = (X − µ)/σ transforms X ∼ N(µ, σ²) into Z ∼ N(0, 1), as shown in Figure 2.
This transformation is called standardisation. The resultant pdf is much simpler, given by
$$f_Z(z) = \frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{1}{2}z^{2}\right)$$
²Here exp(x) means $e^x$.
[Figure 2: the pdf of X ∼ N(µ, σ²), centred at x = µ, and the pdf of the standardised variable Z ∼ N(0, 1), centred at z = 0.]
The Sum of Two Independent Normal Variables is Normal
Suppose $X \sim N(\mu_1, \sigma_1^2)$ and $Y \sim N(\mu_2, \sigma_2^2)$ are independent, and let $Z = X + Y$. Applying the convolution formula from section 1 and completing the square in the exponent shows that $f_Z(z)$ again has the form $C\exp\!\left(-(z - m)^2/(2v)\right)$ for some constants C, m and v. Therefore Z is normal. We do not have to specifically find out what the constants are, because we know $\mathbb{E}(Z) = \mathbb{E}(X) + \mathbb{E}(Y) = \mu_1 + \mu_2$ and $\mathrm{Var}(Z) = \mathrm{Var}(X) + \mathrm{Var}(Y) = \sigma_1^2 + \sigma_2^2$. Thus $X + Y \sim N(\mu_1 + \mu_2,\, \sigma_1^2 + \sigma_2^2)$.
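A quick simulation sketch of the boxed fact (assuming numpy and scipy, which are not part of this article; the parameter values are arbitrary): a histogram of X + Y with X ∼ N(1, 2²) and Y ∼ N(−3, 1.5²) matches the pdf of N(−2, 2² + 1.5²).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
mu1, sigma1 = 1.0, 2.0
mu2, sigma2 = -3.0, 1.5

z = rng.normal(mu1, sigma1, size=1_000_000) + rng.normal(mu2, sigma2, size=1_000_000)

# predicted distribution: N(mu1 + mu2, sigma1^2 + sigma2^2)
pred = norm(loc=mu1 + mu2, scale=np.sqrt(sigma1**2 + sigma2**2))

hist, edges = np.histogram(z, bins=60, density=True)
centres = 0.5 * (edges[:-1] + edges[1:])
print("max |histogram - predicted pdf| =", np.max(np.abs(hist - pred.pdf(centres))))
```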
3.2 Deriving the pdf for N(0, 1) from B(n, p)
Let $B_{n,p}(x)$ denote the probability of getting exactly x successes under the binomial distribution B(n, p), and write q = 1 − p. If we plot the graphs of $B_{n,p}(x)$ with fixed p and varying n, we get the following:
[Plots of $B_{n,p}(x)$ for fixed p and increasing n.]
As we see above, the graph gets wider as n increases because the domain enlarges. The height of the graph also falls because, as the domain enlarges, the sum of the probabilities must still equal 1. We know that the mean and standard deviation of B(n, p) are $\mu = np$ and $\sigma = \sqrt{npq}$ respectively. So, in order to get a converging sequence of graphs, let's shift the distribution of B(n, p) to the left by µ, shrink it horizontally by σ, and then stretch it vertically by σ to maintain the sum-equals-one property:
[Plots of the adjusted graphs $\sigma B_{n,p}(\sigma x + \mu)$ for increasing n.]
Now we’re converging (or at least we seem to)! Suppose they converge to N (x). For fixed n, the point
(x, σBn,p (σx+µ)) on the adjusted Binomial graph corresponds to (u, Bn,p (u)) = (σx+µ, Bn,p (σx+µ)) on the
original Binomial graph. Therefore, the gradient of the line connecting (u, Bn,p (u)) and (u + 1, Bn,p (u + 1))
is
$$\frac{B_{n,p}(u+1) - B_{n,p}(u)}{(u+1) - u} = \binom{n}{u+1}p^{u+1}q^{\,n-u-1} - \binom{n}{u}p^{u}q^{\,n-u}$$
This line, after transforming it onto the adjusted Binomial graph, is shrunk horizontally by σ but stretched vertically by σ, so its gradient is σ² times larger than the expression above, i.e.
$$N'(x) = \lim_{n\to\infty}\sigma^{2}\left[\binom{n}{u+1}p^{u+1}q^{\,n-u-1} - \binom{n}{u}p^{u}q^{\,n-u}\right]$$
This is hard to handle, so let's divide it by $N(x) = \displaystyle\lim_{n\to\infty}\sigma B_{n,p}(\sigma x + \mu) = \lim_{n\to\infty}\sigma B_{n,p}(u)$:
$$
\begin{aligned}
\frac{N'(x)}{N(x)} &= \lim_{n\to\infty}\sigma^{2}\cdot\frac{\binom{n}{u+1}p^{u+1}q^{\,n-u-1} - \binom{n}{u}p^{u}q^{\,n-u}}{\sigma\binom{n}{u}p^{u}q^{\,n-u}}\\
&= \lim_{n\to\infty}\sqrt{npq}\cdot\frac{np - u - q}{q(u+1)}\\
&= \lim_{n\to\infty}\sqrt{npq}\cdot\frac{np - x\sqrt{npq} - np - q}{qx\sqrt{npq} + npq + q}\\
&= \lim_{n\to\infty}\frac{-(pqx)\,n - \left(q\sqrt{pq}\right)n^{1/2}}{(pq)\,n + \left(qx\sqrt{pq}\right)n^{1/2} + q}\\
&= -x
\end{aligned}
$$
but $\dfrac{N'(x)}{N(x)} = \dfrac{d}{dx}\ln(N(x))$, so
$$\ln(N(x)) = -\frac{x^2}{2} + c \quad\Longrightarrow\quad N(x) = C\exp\!\left(-\frac{x^2}{2}\right)$$
where C is a constant. Now we just have to solve for C in
$$\int_{-\infty}^{\infty} C\exp\!\left(-\frac{x^2}{2}\right)dx = 1.$$
However, $\int_{-\infty}^{\infty} e^{-x^2}\,dx = \sqrt{\pi}$ is a classic calculus problem (perhaps I will write about it). Substituting $x \mapsto x/\sqrt{2}$ gives $\int_{-\infty}^{\infty} e^{-x^2/2}\,dx = \sqrt{2\pi}$, so in this case C must be $1/\sqrt{2\pi}$. In conclusion,
$$N(x) = \frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{x^2}{2}\right)$$
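Here is a small numerical check of the convergence we just argued (a sketch assuming Python with scipy; p = 0.3 and the values of n are arbitrary choices of mine): the rescaled pmf values σ·B_{n,p}(σx + µ) approach N(x) as n grows.

```python
import numpy as np
from scipy.stats import binom

def rescaled_binomial(x, n, p):
    """sigma * B_{n,p}(sigma*x + mu): the shifted and scaled binomial pmf from the text."""
    q = 1.0 - p
    mu, sigma = n * p, np.sqrt(n * p * q)
    u = np.rint(sigma * x + mu)                  # nearest integer argument for the pmf
    return sigma * binom.pmf(u, n, p)

def standard_normal_pdf(x):
    return np.exp(-x**2 / 2.0) / np.sqrt(2.0 * np.pi)

p = 0.3
for n in [10, 100, 1000, 10000]:
    xs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
    err = np.max(np.abs(rescaled_binomial(xs, n, p) - standard_normal_pdf(xs)))
    print(f"n={n:5d}  max deviation from N(x) = {err:.4f}")
```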
4 Sample Statistics
Suppose the variables X1 , · · · , Xn are independent but all have the same distribution D. We say that
X1 , · · · , Xn are independent observations of X ∼ D. A set of independent observations {X1 , · · · , Xn } is a
sample. Given a sample S, we can construct any function taking inputs from S. These functions are called
statistics. For example,
$$\text{Sample mean:}\quad \overline{X} = \frac{X_1 + \cdots + X_n}{n}$$
$$\text{Sample variance:}\quad s^2 = \frac{(X_1 - \overline{X})^2 + \cdots + (X_n - \overline{X})^2}{n-1}$$
are commonly-used statistics. From section 2, we know that we can find the distribution of $\overline{X}$ by finding its pdf.
Suppose $D = N(\mu, \sigma^2)$. Since the sum of two normal variables is normal, by induction $X_1 + \cdots + X_n$ is normal. By the linearity of expectation and variance,
$$\mathbb{E}(\overline{X}) = \frac{1}{n}\sum \mathbb{E}(X) = \frac{1}{n}\cdot n\mu = \mu,$$
$$\mathrm{Var}(\overline{X}) = \frac{1}{n^2}\sum \mathrm{Var}(X) = \frac{1}{n^2}\cdot n\sigma^2 = \frac{\sigma^2}{n}.$$
Therefore $\overline{X} \sim N(\mu, \sigma^2/n)$.
Even when X is not normal, if $X_1, \cdots, X_n$ are independent observations of X, then the distribution of $\overline{X}$ converges to $N(\mu, \sigma^2/n)$ as $n \to \infty$.
This is also a hard theorem to prove; I will write about it in the future.
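A sketch of this claim (assuming numpy and scipy; the Exp(1) population and n = 50 are illustrative choices of mine): the sample mean of 50 independent Exp(1) observations, for which µ = 1 and σ² = 1, already has a distribution close to N(µ, σ²/n).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n = 50                                           # observations per sample
mu, sigma2 = 1.0, 1.0                            # mean and variance of Exp(1)

xbars = rng.exponential(1.0, size=(100_000, n)).mean(axis=1)    # many sample means

approx = norm(loc=mu, scale=np.sqrt(sigma2 / n))                # claimed limit N(mu, sigma^2/n)
hist, edges = np.histogram(xbars, bins=60, density=True)
centres = 0.5 * (edges[:-1] + edges[1:])
print("max |histogram - N(mu, sigma^2/n) pdf| =", np.max(np.abs(hist - approx.pdf(centres))))
```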
Suppose we want to estimate the mean µ and variance σ² of a population. If we were given access to every possible data value of the population, then we could compute µ and σ² easily. However, when we only take a sample, we can only evaluate $\overline{X}$ and $s^2$. First of all, why is the denominator of s², the sample variance, n − 1 instead of n, the number of values taken? Second of all, how can we be sure that $\overline{X}$ and $s^2$ give good estimates of µ and σ²? Third of all, how confident are we in these estimates?
We answer the first and second questions together. We say that a statistic T is an unbiased estimate of a value θ if $\mathbb{E}(T) = \theta$. For example, $\overline{X}$ is an unbiased estimate of µ because $\mathbb{E}(\overline{X}) = \mu$, as shown above. Since the formula for $\overline{X}$ looks exactly the same as that for µ, one would expect that $S^2 = \sum (X_i - \overline{X})^2/n$
is also an unbiased estimate for σ². However, since $\sigma^2 = \mathbb{E}(X^2) - \mathbb{E}(X)^2 = \mathbb{E}(X^2) - \mu^2$, the computation
$$
\begin{aligned}
\mathbb{E}(S^2) &= \mathbb{E}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i^2\right) - \mathbb{E}\!\left(\left(\frac{\sum_{i=1}^{n} X_i}{n}\right)^{\!2}\right)\\
&= \mathbb{E}(X^2) - \frac{1}{n^2}\,\mathbb{E}\!\left(\sum_{i=1}^{n} X_i^2 + 2\sum_{i<j} X_i X_j\right)\\
&= \mathbb{E}(X^2) - \frac{1}{n}\,\mathbb{E}(X^2) - \frac{2}{n^2}\cdot\binom{n}{2}\cdot\mathbb{E}(X)\,\mathbb{E}(X)\\
&= \left(1 - \frac{1}{n}\right)(\sigma^2 + \mu^2) - \frac{n-1}{n}\,\mu^2\\
&= \frac{n-1}{n}\,\sigma^2 \;\ne\; \sigma^2
\end{aligned}
$$
suggests otherwise. However, in light of the (n − 1)/n factor, we see that by replacing the denominator of $S^2$ with n − 1, we get $\mathbb{E}(s^2) = \sigma^2$ instead. Therefore $s^2$ is an unbiased estimator for σ², while $S^2$ is not.
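The following simulation sketch (assuming numpy; the normal population and n = 5 are arbitrary) estimates E(S²) and E(s²): dividing by n undershoots σ² by the factor (n − 1)/n, while dividing by n − 1 does not.

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, n = 10.0, 3.0, 5
samples = rng.normal(mu, sigma, size=(500_000, n))   # many samples of size n

S2 = samples.var(axis=1, ddof=0)    # divide by n      (biased)
s2 = samples.var(axis=1, ddof=1)    # divide by n - 1  (unbiased)

print("sigma^2            =", sigma**2)
print("mean of S^2        =", S2.mean())             # close to (n-1)/n * sigma^2
print("(n-1)/n * sigma^2  =", (n - 1) / n * sigma**2)
print("mean of s^2        =", s2.mean())             # close to sigma^2
```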
To answer the third question: we previously proved that $\overline{X} \sim N(\mu, \sigma^2/n)$, so for large values of n the standard deviation is very small, and hence we are very confident that $\overline{X}$ is close to µ. The problem is, how confident is confident, especially when n is small? If we want to find an interval in which µ lies with probability 0.9, we just standardise and solve
$$\mathbb{P}\!\left(-z_{0.95} < \frac{\overline{X} - \mu}{\sigma/\sqrt{n}} < z_{0.95}\right) = 0.9$$
$$\mathbb{P}\!\left(\overline{X} - z_{0.95}\frac{\sigma}{\sqrt{n}} < \mu < \overline{X} + z_{0.95}\frac{\sigma}{\sqrt{n}}\right) = 0.9$$
where $z_{0.95} \approx 1.645$ is the value of z such that $\mathbb{P}(Z < z) = 0.95$, with Z standard normal. This is mathematically correct, but we have no idea what the value of σ is in the first place. We will resolve this problem in the t-distribution section.
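For completeness, here is a small sketch of this interval when σ happens to be known (assuming scipy, which the article does not otherwise use; the data values and σ are made up for illustration):

```python
import numpy as np
from scipy.stats import norm

z95 = norm.ppf(0.95)                 # = 1.6448..., the z with P(Z < z) = 0.95
print("z_0.95 =", z95)

# hypothetical sample, with sigma pretended to be known
data = np.array([4.2, 5.1, 3.8, 4.9, 5.5, 4.4, 4.7, 5.0])
sigma = 0.6
xbar, n = data.mean(), len(data)

half_width = z95 * sigma / np.sqrt(n)
print(f"90% interval for mu: ({xbar - half_width:.3f}, {xbar + half_width:.3f})")
```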
5 Chi-Squared Distribution
In section 1, we found that the pdf of $Z = X^2$ is
$$f_Z(z) = \begin{cases} 0 & \text{if } z \le 0;\\[4pt] \dfrac{f_X(\sqrt{z}) + f_X(-\sqrt{z})}{2\sqrt{z}} & \text{if } z > 0.\end{cases}$$
In this section, we will study the case when $X \sim N(0, 1)$, so $f_X(x) = \exp(-x^2/2)/\sqrt{2\pi}$, giving
$$f_Z(z) = \begin{cases} 0 & \text{if } z \le 0;\\[4pt] \dfrac{\exp(-z/2)}{\sqrt{2\pi z}} & \text{if } z > 0.\end{cases}$$
Therefore, the distribution of X² looks like Figure 6 below (it has an asymptote at x = 0).
This is called the chi-squared distribution with 1 degree of freedom. In general, the chi-squared distribution with n degrees of freedom, denoted as $\chi^2_n$, is the distribution of $Q_n = X_1^2 + \cdots + X_n^2$, where $X_1, \cdots, X_n$ are independent observations of $X \sim N(0, 1)$.
Firstly, the sum of two independent chi-squared distributed variables with a and b degrees of freedom respectively is chi-squared distributed with a + b degrees of freedom. This is simply because $(X_1^2 + \cdots + X_a^2) + (Y_1^2 + \cdots + Y_b^2)$ is just the sum of squares of a + b standard normal variables.
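A small numerical sketch of these facts (assuming numpy and scipy, neither of which the article relies on): interval probabilities implied by the pdf exp(−z/2)/√(2πz) match simulated values of X² for X ∼ N(0, 1), and Q₅ behaves like scipy's χ²₅.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(5)

# one degree of freedom: Z = X^2 with X ~ N(0,1), against the pdf derived above
z = rng.standard_normal(1_000_000) ** 2

def derived_pdf(t):
    return np.exp(-t / 2.0) / np.sqrt(2.0 * np.pi * t)

for a, b in [(0.1, 0.5), (0.5, 1.0), (1.0, 2.0), (2.0, 4.0)]:
    grid = np.linspace(a, b, 2001)
    integral = np.sum(derived_pdf(grid)) * (grid[1] - grid[0])   # approximate ∫_a^b f_Z
    empirical = np.mean((z >= a) & (z < b))                      # P(a <= Z < b) from simulation
    print(f"P({a} <= Z < {b}):  derived {integral:.4f}   simulated {empirical:.4f}")

# n degrees of freedom: Q_n = X_1^2 + ... + X_n^2 follows chi-squared with n df
n = 5
q = (rng.standard_normal((1_000_000, n)) ** 2).sum(axis=1)
print("P(Q_5 < 10):  chi2 cdf", chi2.cdf(10.0, df=n), "  simulated", np.mean(q < 10.0))
```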
[Figure 6: pdf of $X^2 \sim \chi^2_1$, with a vertical asymptote at x = 0.]
$$\frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1} \qquad \text{if } X_1, \cdots, X_n \sim N(\mu, \sigma^2) \text{ are } n \text{ independent observations.}$$
Standardise $Z_i = (X_i - \mu)/\sigma$. Then $(n-1)s^2/\sigma^2$ can be written as a sum of n − 1 squares:
$$
\begin{aligned}
\frac{(n-1)s^2}{\sigma^2} &= \sum_{1\le i\le n} Z_i^2 - \frac{(Z_1 + \cdots + Z_n)^2}{n}\\
&= \frac{n-1}{n}\sum_{1\le i\le n} Z_i^2 - \frac{2}{n}\sum_{1\le i<j\le n} Z_i Z_j\\
&= \left(\sqrt{\frac{n-1}{n}}\,Z_n - \frac{1}{\sqrt{n(n-1)}}Z_{n-1} - \cdots - \frac{1}{\sqrt{n(n-1)}}Z_1\right)^{\!2}\\
&\qquad + \frac{n-2}{n-1}\sum_{1\le i\le n-1} Z_i^2 - \frac{2}{n-1}\sum_{1\le i<j\le n-1} Z_i Z_j\\
&= \sum_{k=2}^{n}\left(\sqrt{\frac{k-1}{k}}\,Z_k - \frac{1}{\sqrt{k(k-1)}}Z_{k-1} - \cdots - \frac{1}{\sqrt{k(k-1)}}Z_1\right)^{\!2} \quad \text{by induction}
\end{aligned}
$$
It is a fun exercise to prove that the expression inside each of the squares at the end is standard normal. However, it is extremely difficult to prove that they are mutually independent using the tools we've learnt so far (we could proceed in a way similar to how we proved X and X + Y are dependent in section 4). We will accept the fact that the above expression is a sum of squares of n − 1 independent standard normal variables, and hence has the chi-squared distribution with n − 1 degrees of freedom.
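A simulation sketch of the boxed claim (assuming numpy and scipy; µ, σ and n are arbitrary choices of mine): a histogram of (n − 1)s²/σ² over many samples matches the χ²_{n−1} pdf.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(6)
mu, sigma, n = 2.0, 1.5, 6

samples = rng.normal(mu, sigma, size=(500_000, n))
s2 = samples.var(axis=1, ddof=1)                     # sample variance with n - 1 denominator
stat = (n - 1) * s2 / sigma**2                       # should follow chi-squared with n - 1 df

hist, edges = np.histogram(stat, bins=60, range=(0.0, 20.0), density=True)
centres = 0.5 * (edges[:-1] + edges[1:])
print("max |histogram - chi2(n-1) pdf| =", np.max(np.abs(hist - chi2.pdf(centres, df=n - 1))))
```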
5.1 Deriving the pdf for the Chi-Squared Distribution
From above, $g_1(x) = \dfrac{\exp(-x/2)}{\sqrt{2\pi x}}$. Let's find the pdf of $Q_2 \sim \chi^2_2$ for x > 0, which is
$$
\begin{aligned}
g_2(x) &= \int_{-\infty}^{\infty} f_{X_1^2}(t)\, f_{X_2^2}(x-t)\,dt\\
&= \int_0^x g_1(t)\, g_1(x-t)\,dt\\
&= \int_0^x \frac{\exp(-t/2)}{\sqrt{2\pi t}}\cdot\frac{\exp(-(x-t)/2)}{\sqrt{2\pi (x-t)}}\,dt\\
&= \frac{\exp(-x/2)}{2\pi}\int_0^x \frac{1}{\sqrt{t(x-t)}}\,dt\\
&= \frac{\exp(-x/2)}{2\pi}\left[\arcsin\!\left(\frac{2t}{x}-1\right)\right]_{t=0}^{x}\\
&= \frac{1}{2}\exp(-x/2).
\end{aligned}
$$
Now that we have $g_1$ and $g_2$, we can construct a recursive relation for $g_n$. Since $X_1^2 + \cdots + X_n^2 = (X_1^2 + \cdots + X_{n-2}^2) + (X_{n-1}^2 + X_n^2)$, and $X_1^2 + \cdots + X_{n-2}^2$ and $X_{n-1}^2 + X_n^2$ are independent, the random variable $Q_n$ is just the sum of the two independent variables $Q_{n-2}$ and $Q_2$. Hence
$$
\begin{aligned}
g_n(x) &= \int_{-\infty}^{\infty} f_{Q_{n-2}}(t)\cdot f_{Q_2}(x-t)\,dt\\
&= \frac{1}{2}\int_0^x g_{n-2}(t)\exp\!\left(-\frac{x-t}{2}\right)dt
\end{aligned}
$$
Finding the pdf for Chi-Squared Distributions with Even Degrees of Freedom
Write $g_{2k}(x) = \dfrac{\exp(-x/2)}{2^{k}}\, h_{2k}(x)$; the recursion above then becomes $h_{2(k+1)}(x) = \int_0^x h_{2k}(t)\,dt$. Since $h_2(x) = 1$, we can prove by induction that $h_{2k}(x) = \dfrac{x^{k-1}}{(k-1)!}$, because
$$h_{2k}(x) = \frac{x^{k-1}}{(k-1)!} \;\Longrightarrow\; h_{2(k+1)}(x) = \int_0^x \frac{t^{k-1}}{(k-1)!}\,dt = \frac{x^k}{k!}.$$
In other words, $g_{2k}(x) = \dfrac{x^{k-1}\exp(-x/2)}{2^{k}\,(k-1)!}$ for x > 0. Here, we found the pattern by integrating $h_2(x)$ repeatedly. A similar method can be used to find the pdf for odd degrees of freedom.
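A quick check of the even-degree formula $g_{2k}(x) = x^{k-1}e^{-x/2}/(2^k (k-1)!)$ against scipy's chi-squared pdf (a sketch; scipy is my assumption, not something the article uses):

```python
import numpy as np
from math import factorial
from scipy.stats import chi2

def g_even(x, k):
    """pdf of chi-squared with 2k degrees of freedom, from the induction above."""
    return x ** (k - 1) * np.exp(-x / 2.0) / (2 ** k * factorial(k - 1))

xs = np.linspace(0.5, 15.0, 6)
for k in [1, 2, 3, 5]:
    err = np.max(np.abs(g_even(xs, k) - chi2.pdf(xs, df=2 * k)))
    print(f"2k = {2*k:2d} degrees of freedom: max difference = {err:.2e}")
```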
Finding the pdf for Chi-Squared Distributions with Odd Degrees of Freedom
The odd case can be handled by a similar induction starting from $g_1$; using the Gamma function described below, both cases can be written uniformly as
$$g_n(x) = \frac{x^{n/2-1}\exp(-x/2)}{2^{n/2}\,\Gamma\!\left(\frac{n}{2}\right)} \qquad \text{for } x > 0,$$
which is the form we will use in the next section.
The Gamma Function
$$\Gamma(x) = \int_0^{\infty} t^{x-1}e^{-t}\,dt \qquad (x > 0)$$
satisfies:
• $\Gamma(x) = (x-1)\,\Gamma(x-1)$
• $\Gamma(n+1) = n!$
• $\Gamma\!\left(n + \tfrac{1}{2}\right) = \dfrac{\sqrt{\pi}\,(2n)!}{2^{2n}\, n!}$
[Figure: pdfs $g_n(x)$ of the chi-squared distributions $\chi^2_n$ for several values of n.]
6 Student’s t-Distribution
We answer the question raised at the end of section 4: Let $X_1, \cdots, X_n$ be n independent observations of $N(\mu, \sigma^2)$. How do we find an interval, in terms of $\overline{X}$, such that µ has 0.9 probability to lie in it? Previously, we considered
$$\frac{\overline{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$
$$\Longrightarrow\quad \mathbb{P}\!\left(\overline{X} - z_{0.95}\frac{\sigma}{\sqrt{n}} < \mu < \overline{X} + z_{0.95}\frac{\sigma}{\sqrt{n}}\right) = 0.9$$
but we noticed that σ is unknown. Therefore, we will consider the distribution of
$$\frac{\overline{X} - \mu}{s/\sqrt{n}} \sim t_{n-1}$$
instead, where s is the sample standard deviation. This distribution differs according to n and is known as Student's t-distribution, or the t-distribution for short, with ν = n − 1 degrees of freedom.
6.1 Deriving the pdf for the t-Distribution
Write $R = \dfrac{\overline{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$ and $T = \dfrac{(n-1)s^2}{\sigma^2} \sim \chi^2_{\nu}$ with $\nu = n - 1$; then $\dfrac{\overline{X} - \mu}{s/\sqrt{n}} = \dfrac{R}{\sqrt{T/\nu}}$. Taking for granted that R and T are independent, let us find the pdf of $W = R/\sqrt{T/\nu}$ (note that $\sqrt{T/\nu} > 0$):
$$
\begin{aligned}
f_W(w) &= \frac{d}{dw}\,\mathbb{P}\!\left(R \le w\sqrt{T/\nu}\right)\\
&= \frac{d}{dw}\int_0^{\infty} f_T(t)\int_{-\infty}^{w\sqrt{t/\nu}} f_R(r)\,dr\,dt\\
&= \int_0^{\infty} f_T(t)\cdot\sqrt{\frac{t}{\nu}}\; f_R\!\left(w\sqrt{\frac{t}{\nu}}\right)dt\\
&= \int_0^{\infty} \frac{t^{\nu/2-1}\exp(-t/2)}{2^{\nu/2}\,\Gamma\!\left(\frac{\nu}{2}\right)}\cdot\sqrt{\frac{t}{\nu}}\cdot\frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{w^2}{2}\cdot\frac{t}{\nu}\right)dt\\
&= \frac{1}{2^{\nu/2}\,\Gamma\!\left(\frac{\nu}{2}\right)\sqrt{2\pi\nu}}\int_0^{\infty} t^{(\nu-1)/2}\exp\Bigl(-\underbrace{\tfrac{1}{2}\bigl(1 + \tfrac{w^2}{\nu}\bigr)}_{K}\,t\Bigr)\,dt\\
&= \frac{1}{2^{\nu/2}\,\Gamma\!\left(\frac{\nu}{2}\right)\sqrt{2\pi\nu}\;K^{(\nu+1)/2}}\int_0^{\infty} (Kt)^{(\nu-1)/2}\exp(-Kt)\,d(Kt)\\
&= \frac{1}{2^{\nu/2}\,\Gamma\!\left(\frac{\nu}{2}\right)\sqrt{2\pi\nu}}\cdot K^{-\frac{\nu+1}{2}}\cdot\Gamma\!\left(\frac{\nu+1}{2}\right)\\
&= \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)\sqrt{\pi\nu}}\left(1 + \frac{w^2}{\nu}\right)^{-\frac{\nu+1}{2}}
\end{aligned}
$$
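Finally, a small sketch (again assuming scipy; the sample data are invented) that checks the derived pdf against scipy.stats.t and uses the t-distribution to build the 90% interval that section 4 could not:

```python
import numpy as np
from math import gamma
from scipy.stats import t

def t_pdf(w, nu):
    """The pdf derived above for Student's t with nu degrees of freedom."""
    return gamma((nu + 1) / 2) / (gamma(nu / 2) * np.sqrt(np.pi * nu)) \
        * (1 + w**2 / nu) ** (-(nu + 1) / 2)

ws = np.linspace(-4.0, 4.0, 9)
print("max difference from scipy, nu=4:", np.max(np.abs(t_pdf(ws, 4) - t.pdf(ws, df=4))))

# 90% interval for mu with sigma unknown, using s and the t-distribution
data = np.array([4.2, 5.1, 3.8, 4.9, 5.5, 4.4, 4.7, 5.0])   # hypothetical sample
n, xbar, s = len(data), data.mean(), data.std(ddof=1)
t95 = t.ppf(0.95, df=n - 1)                                  # analogue of z_0.95
half_width = t95 * s / np.sqrt(n)
print(f"90% interval for mu: ({xbar - half_width:.3f}, {xbar + half_width:.3f})")
```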
[Figure 9: pdfs of the t-distributions $t_1, t_2, t_3, t_4, t_5, t_6$.]
In Figure 9, the curve with the lower y-intercept has the lower degree of freedom.