Math Stats Lecture 2020F
Math Stats Lecture 2020F
Shuyang Ling
1 Probability 4
1.1 Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Important distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.1 Uniform distribution . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.2 Normal distribution/Gaussian distribution . . . . . . . . . . . . . 7
1.2.3 Moment-generating function (MGF) . . . . . . . . . . . . . . . . 8
1.2.4 Chi-squared distribution . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.5 Exponential distribution . . . . . . . . . . . . . . . . . . . . . . . 9
1.2.6 Bernoulli distributions . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2.7 Binomial distribution . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2.8 Poisson distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3 Limiting theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.1 Law of large number . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.2 Central limit theorem . . . . . . . . . . . . . . . . . . . . . . . . . 13
2 Introduction to statistics 15
2.1 Population . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.1 Important statistics . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.2 Probabilistic assumption . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 Evaluation of estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3 Confidence interval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3 Nonparametric inference 21
3.1 Cumulative distribution function . . . . . . . . . . . . . . . . . . . . . . 21
3.1.1 Estimation of CDF . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.2 Plug-in principle . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4 Parametric inference 30
4.1 Method of moments (M.O.M) . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2 Consistency and asymptotic normality of MM estimators . . . . . . . . . 33
4.2.1 Generalized method of moments . . . . . . . . . . . . . . . . . . . 35
4.3 Maximum likelihood estimation . . . . . . . . . . . . . . . . . . . . . . . 35
4.4 Properties of MLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.4.1 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.4.2 Asymptotic normality . . . . . . . . . . . . . . . . . . . . . . . . 41
4.4.3 Equivariance of MLE . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.5 Delta method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.6 Cramer-Rao bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.7 Multiparameter models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
1
4.7.1 Multiparameter MLE . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.7.2 Bivariate normal distribution . . . . . . . . . . . . . . . . . . . . 49
4.7.3 Asymptotic normality of MLE . . . . . . . . . . . . . . . . . . . . 52
4.7.4 Multiparameter Delta method . . . . . . . . . . . . . . . . . . . . 53
4.7.5 Multiparameter normal distribution . . . . . . . . . . . . . . . . . 54
4.7.6 Independence between sample mean and variance . . . . . . . . . 57
5 Hypothesis testing 59
5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.1.1 Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.1.2 Test statistics and rejection region . . . . . . . . . . . . . . . . . 60
5.1.3 Type I and II error . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.2 More on hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.2.1 Composite hypothesis testing . . . . . . . . . . . . . . . . . . . . 64
5.2.2 Wald test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.2.3 p-value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.3 Likelihood ratio test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.3.1 Asymptotics of LRT . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.3.2 General LRT and asymptotics . . . . . . . . . . . . . . . . . . . . 72
5.4 Goodness-of-fit test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.4.1 Likelihood ratio tests . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.4.2 Pearson χ2 -test . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.4.3 Test on Independence . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.5 Kolmogorov-Smirnov test . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.5.1 KS test for Goodness-of-fit . . . . . . . . . . . . . . . . . . . . . . 78
5.5.2 Two-sample test . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
2
6.6.2 Inference in logistic regression . . . . . . . . . . . . . . . . . . . . 114
6.6.3 Hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.6.4 Repeated observations - Binomial outcomes . . . . . . . . . . . . 116
6.6.5 General logistic regression . . . . . . . . . . . . . . . . . . . . . . 118
This lecture note draft is prepared for MATH-SHU 234 Mathematical Statistics I am
teaching at NYU Shanghai. It covers the basics of mathematical statistics at undergrad-
uate level.
3
Chapter 1
Probability
1.1 Probability
Probability theory is the mathematical foundation of statistics. We will review the basics
of concepts in probability before we proceed to discuss mathematical statistics.
The core idea of probability theory is studying the randomness. The randomness is
described by random variable X, a function from sample space to a number. Each
random variable X is associated with a distribution function.
We define the cumulative distribution function (cdf) of X as:
A cdf uniquely determines a random variable; it can be used to compute the probability
of X belonging to a certain range
fX (i) = pi = P(X = ai )
4
and we require
∞
X
pi ≥ 0, pi = 1.
i=1
This discrete function fX (i) is called probability mass function (pmf). It is natural to
see that the connection between cdf and pmf
∞
X
FX (x) = pi · 1{ai ≤ x}
i=1
It is easily seen that the cdf of a discrete random variable is not continuous. In fact, it is
piecewise continuous.
The expectation (mean) of a random variable is given by
∞
X
EX = ai p i
i=1
given that
∞
X
|ai |pi < ∞.
i=1
More generally, suppose we have a function ϕ(x) : X → R, then the expectation of ϕ(X),
a new random variable, is
X∞
E ϕ(X) = ϕ(ai ) · pi .
i=1
We say X is a continuous random variable if there exists a function fX (x) such that
ˆ x
FX (x) = fX (t) dt.
−∞
The function fX (x) is called the probability density function (pdf). To get pdf from cdf,
we simply take the derivative of FX (x),
d
fX (x) = FX (x).
dx
• Continuous random variables takes uncountably many values.
• The probability of X = a, i.e., P(X = a) = 0 since FX is continuous.
What does pdf mean?
P(x − ≤ X ≤ x + ) FX (x + ) − FX (x − )
lim+ = lim+ = fX (x).
→0 2 →0 2
5
For a function ϕ(x) : X → R, we have
ˆ
E ϕ(X) = ϕ(x)fX (x) dx.
R
Var(X) = E X 2 − (E X)2 .
This is equivalent to
i.e., the joint cdf of (X, Y ) equals the product of its marginal distributions.
Suppose X and Y are independent, then f (X) and g(Y ) are also independent for two
functions f and g. As a result, we have
6
1.2 Important distributions
1.2.1 Uniform distribution
If the pdf of a random variable X satisfies
1
fX (x) = , a ≤ x ≤ b.
b−a
This is called the uniform distribution, denoted by Unif[a, b]. Its cdf is
0,
x≤a
x−a
FX (x) = b−a , a < x < b
1, x ≥ b.
This can be extended to the sum of n independent Gaussian random variables. For
example,
Xn
Xi ∼ N (0, n)
i=1
if Xi ∼ N (0, 1) are i.i.d. random variables.
7
1.2.3 Moment-generating function (MGF)
Pn
Why does i=1 Xi ∼ N (0, n) hold? The moment generating function is
M (t) := E etX .
n dn M (t)
EX = .
dtn t=0
Suppose two random variables have the same moment generating functions, they are of
the same distribution. MGF uniquely determines the distribution. For X ∼ N (µ, σ 2 ), it
holds that
Exercise: Suppose we are able to generate uniform random samples, can we generate
normal random variables?
8
√
In particular, if n is a positive integer, Γ(n) = (n − 1)! and Γ(1/2) = π.
Chi-squared distribution is closely connected to normal distribution. Suppose Z ∼
N (0, 1). Now we take a look at X = Z 2 :
√ √
P(X ≤ x) = P(Z 2 ≤ x) = P(− x ≤ Z ≤ x)
√
= 2P(0 ≤ Z ≤ x)
r ˆ √x
2 2
= e−z /2 dz.
π 0
The pdf of X is obtained by differentiating the cdf,
r
2 1 1
fX (x) = · √ · e−x/2 = √ x−1/2 e−x/2 , x > 0.
π 2 x 2π
Now if {Zi }ni=1 is a sequence of n independent standard normal random variables, then
n
X
X= Zi2 ∼ χ2n .
i=1
Recall that the left side involves conditional probability. For two events A and B, the
conditional probability of A given B is
P(A ∩ B)
P(A|B) = .
P(B)
9
Here
P(X ≥ t + s, X ≥ t) P(X ≥ t + s)
P(X ≥ t + s|X ≥ t) = =
P(X ≥ t) P(X ≥ t)
since {X ≥ t + s} is contained in {X ≥ t}
Exercise: Verify the memoryless properties and think about what does it mean?
Exercise: What is the distribution of ni=1 Xi if Xi ∼ E(β)?
P
P(X = 1) = p, P(X = 0) = 1 − p.
and
Var(X) = E X 2 − (E X)2 = E X − p2 = p(1 − p).
where
n n!
=
k k!(n − k)!
10
distribution. Suppose {Xi }ni=1 are
Binomial distribution is closely related to Bernoulli P
n
n i.i.d. Bernoulli(p) random variables, then X = i=1 Xi ∼Binomial(n, p). In par-
ticular, if X ∼Binomial(n, p), Y ∼Binomial(m, p), and X is independent of Y , then
X + Y ∼Binomial(n + m, p).
Exercise: Use the idea of moment generating function to show that ni=1 Xi ∼ Binomial(n, p)
P
if Xi ∼Bernoulli(p).
Exercise: What is the mean and variance of Binomial(n, p)?
Exercise: Use the mgf to obtain the mean and variance of Binomial(n, p).
11
1.3 Limiting theorem
1.3.1 Law of large number
Law of large numbers, along with central limit theorem (CLT), plays fundamental roles
in statistical inference and hypothesis testing.
Theorem 1.3.1 (Weak law of large number). Let Xi be a sequence of i.i.d. random
variables,
n
1X p
lim Xi −→ µ,
n→∞ n
i=1
Law of large number basically says that the sample average converges to the expected
value as the sample size grows to infinity.
We can prove the law of large number easily if assuming Xi has a finite second moment.
The proof relies on Chebyshev’s inequality.
Theorem 1.3.2 (Chebyshev’s inequality). For a random variable X with finite second
moment, then
E |X − µ|2 σ2
P(|X − µ| ≥ ) ≤ = .
2 2
12
1.3.2 Central limit theorem
Theorem 1.3.3 (Central limit theorem). Let Xi , 1 ≤ i ≤ n be a sequence of i.i.d. random
variables with mean µ and finite variance σ 2 , then
Pn
Xi − nµ d
Zn := i=1√ −→ N (0, 1),
nσ 2
i.e., convergence in distribution.
Sometimes, we also use √
n(X n − µ)
Zn = .
σ
We say a sequence of random variable Zn converges to Z in distribution if
a = zα/2 , b = z1−α/2 .
13
Proof: The mgf of Zn is
n
! n
t X Y t
E exp(tZn ) = E exp √ (Xi − µ) = E exp √ (Xi − µ)
nσ i=1 i=1
nσ
t2
t t
E exp √ (Xi − µ) = 1 + √ E(Xi − µ) + · E(Xi − µ)2 + o(n−1 )
nσ nσ 2nσ 2
t2
=1+ + o(n−1 )
2n
As a result,
n
t
lim E exp(tZn ) = lim E exp √ (Xi − µ)
n→∞ n→∞ nσ
n
t2
−1
= lim 1 + + o(n )
n→∞ 2n
= exp(t2 /2).
Repeat M times and collect the data. Plot the histogram or the empirical cdf. Here the
empirical cdf is defined as
M
1 X
FY (y) = 1{Yi ≤ y}.
M j=1
14
Chapter 2
Introduction to statistics
2.1 Population
One core task of statistics is making inferences about an unknown parameter θ associated
to a population. What is a population? In statistics, a population is a set consisting of
the entire similar items we are interested in. For example, a population may refer to
all the college students in Shanghai or all the residents in Shanghai. The choice of the
population depends on the actual scientific problem.
Suppose we want to know the average height of all the college students in Shanghai or
want to know the age distribution of residents in Shanghai. What should we do? Usually,
a population is too large to deal with directly. Instead, we often draw samples from the
population and then use the samples to estimate a population parameter θ such as mean,
variance, median, or even the actual distribution.
This leads to several important questions in statistics:
• How to design a proper sampling procedure to collect data? Statistical/Experimental
design.
• How to use the data to estimate a particular population parameter?
• How to evaluate the quality of an estimator?
• Variance: n
1 X
Sn2 = (xi − xn )2
n − 1 i=1
15
• Standard deviation: v
u n
u 1 X
Sn = t (xi − xn )2
n − 1 i=1
• Median:
median(xi ) = x[(n+1)/2]
where [(n + 1)/2] means the closest integer to (n + 1)/2. More generally, the α
quantile is x[α(n+1)] .
• Range, max/min:
Range = max xi − min xi .
1≤i≤n 1≤i≤n
• Empirical cdf:
n
1X
Fn (x) = 1{xi ≤ x}.
n i=1
16
2.2 Evaluation of estimators
There are several ways to evaluate the quality of point estimators.
Definition 2.2.1. The bias of θbn is
bias(θbn ) = E θbn − θ.
As a result, it holds
1
E Sn2 nσ 2 − σ 2 = σ 2 .
=
n−1
However, bias is not usually a good measure of a statistic. It is likely that two unbiased
estimators of the same parameter are of different variance. For example, both T1 = X1
and T2 = X n are both unbiased estimators of θ. However, the latter is definitely preferred
as it uses all the samples: the sample mean is consistent.
Definition 2.2.2. We say θbn is a consistent estimator of θ if θbn converges to θ in prob-
ability, i.e.,
lim P(|θbn − θ| ≥ ) = 0
n→∞
for any fixed > 0. The probability is taken w.r.t. the joint distribution of (X1 , · · · , Xn ).
The consistency of sample mean is guaranteed by the law of large number. How about
sample variance Sn2 ?
Let’s take a closer look at Sn2 :
n
!
1 X 2
Sn2 = Xi2 − nX n .
n−1 i=1
We are interested in the limit of Sn2 as n → ∞. First note that by law of large number,
we have n
1X 2 p p
Xi −→ E X 2 , X n −→ E X.
n i=1
Recall that variance of X equals E X 2 − (E X)2 . As a result, we can finish the proof if
2
X n → µ2 . Does it hold?
17
Theorem 2.2.1 (Continuous mapping theorem). Suppose g is a continuous function and
p p
Xn −→ X, then g(Xn ) −→ g(X). This also applies to convergence in distribution.
Remark: this is also true for random vectors. Suppose Xn = (Xn1 , · · · , Xnd ) ∈ Rd is a
p
random vector and Xi −→ X, i.e., for any > 0,
lim P(kXn − Xk ≥ ) = 0
n→∞
p
where kXn − Xk denotes the Euclidean distance between Xn and X, then g(Xn ) −→
g(X) for a continuous function g.
2 p
This justifies X n −→ µ2 . Now, we have
n
!
n X 2
lim · n−1 Xi2 − Xn = E X 2 − (E X)2 = σ 2
n→∞ n − 1
i=1
convergence in probability.
Exercise: Complete the proof to show that Sn2 and Sn are consistent estimators of σ 2
and σ.
Another commonly-used quantity to evaluate the quality of estimator is MSE (mean-
squared-error).
Definition 2.2.3 (MSE: mean-squared-error). The mean squared error is defined as
The MSE is closely related to bias and variance of θbn . In fact, we have the following
famous bias-variance decomposition
MSE(θbn ) = bias(θbn )2 + Var(θbn ).
18
Lemma 2.2.2. Convergence in MSE implies convergence in probability.
The proof of this lemma directly follows from Chebyshev’s inequality.
P(θ ∈ Cn ) ≥ 1 − α,
19
• As α decreases (1 − α increases), z1−α/2 increases, making CI larger.
• Suppose we have the samples, the CI is
z1−α/2 σ z1−α/2 σ
xn − √ , xn + √ .
n n
The meaning of CI is: suppose we repeat this experiments many times, the fre-
quency of this interval containing µ is close to 1 − α.
Now, we focus on another question: what if σ is unknown? A simple remedy is: use
sample standard deviation Sn to replace σ and we have the following potential candidate
of CI:
z1−α/2 Sn z1−α/2 Sn
Xn − √ , Xn + √
n n
Does it work? Does this interval cover µ with probability approximately 1 − α. This
question is equivalent to ask if
√
n(X n − µ) d
−→ N (0, 1)
Sn
as n → ∞.
This is indeed true, which is guaranteed by Slutsky’s theorem.
Theorem 2.3.1 (Slutsky’s theorem). Suppose
d p
Xn −→ X, Yn −→ c
d d
Exercise: Construct a counterexample. Suppose Xn −→ X, Yn −→ Y , then Xn + Yn
does not necessarily converge to X + Y in distribution.
d p
Exercise: Show that if Xn −→ c, then Xn −→ c where c is a fixed value.
Exercise: Challenging! Prove Slutsky’s theorem.
p
Note that Sn −→ σ (Why?). By Slutsky’s theorem, we have
√ √
n(X n − µ) n(X n − µ) σ d
= · −→ N (0, 1).
Sn | σ
{z S n
} |{z}
d p
−→N (0,1) −→σ
z1−α/2 Sn z1−α/2 Sn
The argument above justifies why X n − √
n
, Xn + √
n
is a 1 − α confidence
interval of µ.
20
Chapter 3
Nonparametric inference
where (
1, Xi ≤ x,
1{Xi ≤ x} =
0, Xi > x.
1 1
0.9 0.9
0.8 0.8
0.7 0.7
0.6 0.6
0.5 0.5
0.4 0.4
0.3 0.3
0.2 0.2
Figure 3.1: CDF v.s. empirical CDF for n = 100 and n = 500
21
Figure 3.1, the empirical cdf gets closer to cdf as the number of data increases. In fact,
Fn,X (x) is a consistent estimator of FX (x). Why?
Proof: Note that {Xi }ni=1 are i.i.d. random variables. Therefore, 1{Xi ≤ x} are also
i.i.d. random variables. n
1X
Fn,X (x) = 1{Xi ≤ x}
n i=1
is the sample mean of 1{Xi ≤ x}. By law of large number, it holds that
p
Fn,X (x) −→ E 1{Xi ≤ x}.
Since we have
E 1{Xi ≤ x} = P(Xi ≤ x) = FX (x),
Fn,X (x) is a consistent estimator of FX (x).
as n → ∞.
In other words, the empirical cdf is a consistent estimator of the actual cdf.
Here Fn,X (x) is not continuous everywhere since it jumps at Xi . One way to understand
this integral is to treat Fn,X (x) as the cdf of a discrete random variable which takes value
Xi with probability 1/n. By law of large number, θbn is a consistent estimator of θ if the
population parameter θ exists.
The plug-in principle can be also extended to more complicated scenarios.
22
• α-quantile:
θ = FX−1 (α) := inf{x : FX (x) ≥ α}.
The plug-in estimator for α-quantile is
−1
Fn,X (α) := inf{x : Fn,X (x) ≥ α}.
Note that ˆ ˆ ˆ
σXY = xy dF (x, y) − x dF (x, y) y dF (x, y).
R2 R R
and thus
ˆ ˆ ˆ
σ
bXY,n = xy dFn (x, y) − x dFn (x, y) y dFn (x, y)
R2 R R
n
1X
= X i Yi − X n Y n .
n i=1
Exercise: Is σ
bXY,n an unbiased estimator of σXY ? Why or why not?
Exercise: Show that Fn (x, y) is an unbiased/consistent estimator of FX,Y (x, y).
Exercise: Show that
lim Fn,XY (x, y) = Fn,X (x),
y→∞
23
3.2 Bootstrap
We have discussed how to evaluate the quality of an estimator θbn = T (X1 , · · · , Xn ) and
construct a confidence interval for a population parameter. From the previous discussion,
we may realize that the core problem in understanding θbn is deriving its distribution. Once
we know its distribution and connection to the population parameter θ, we can easily
evaluate the its quality and construct confidence interval.
However, it is usually not easy to characterize the distribution of an estimator. Suppose
we observe a set of samples X1 , · · · , Xn ∼ FX . We want to estimate its mean, variance,
and median, as well as a 1 − α confidence interval for these population parameters.
What should we do? Our current toolbox is able to provide an interval estimation for µ;
however, it is less clear for variance and median.
Let’s start with providing estimators for µ, σ 2 , and median.
T (Fn ) = X n ,
n
1 X
T (Fn ) = (Xi − X n )2 ,
n − 1 i=1
T (Fn ) = median(X1 , · · · , Xn ).
All these estimators are consistent, i.e., as n → ∞
p
T (Fn ) −→ T (F ) = θ.
Suppose we know the actual distribution of θbn = T (Fn ), it will be much easier to find a
confidence interval for θ. Why? First we can find the α/2 and 1 − α/2 quantiles, denoted
by a = qα/2 and b = q1−α/2 , for θbn − θ, then
P(a ≤ θbn − θ ≤ b) ≥ 1 − α.
Then a 1 − α confidence interval of θ is
θbn − q1−α/2 , θbn − qα/2
However, we don’t know either the distribution of θbn or θ. What is the solution?
Bootstrap was invented by Bradley Efron in 1979. It is widely used in various applications
due to its simplicity and effectiveness. The idea of bootstrap is quite simple: we use the
empirical distribution to approximate the actual population distribution plus resampling.
Recall that
θ = T (F ), θbn = T (Fn ).
How to find the distribution of θbn ? Assume we have access to a random number generator
of FX . Then we can sample X1,k , · · · , Xn,k , and compute
θbn,k = T (X1,k , · · · , Xn,k ), 1 ≤ k ≤ B.
Suppose we repeat this procedure many times (say B times), the distribution of θbn will
be well approximated by {θbn,k }B
k=1 .
However, we still haven’t resolved the issue that FX is unknown. The solution is: we
replace FX by Fn,X , i.e., instead of sampling data from FX , we sample from Fn,X which
is known. Then use the obtained data to approximate the distribution of θbn . For the
unknown parameter θ, we simply approximate it by θbn .
Essentially, the idea is summarized as follows:
24
• If we know θ and FX , to estimate the distribution of θbn = T (Fn ), we sample
(X1 , · · · , Xn ) from FX and calculate θb = T (Fn ). Repeat it many times and the
obtained samples approximate the distribution of θbn .
• In reality, we only have one sample x1 , · · · , xn without knowing the underlying
FX . Thus we approximate FX by Fn from x1 , · · · , xn ; use Fn to generate new data
points X1∗ , · · · , Xn∗ from Fn,X ; and get θbn∗ = T (Fn∗ ) where Fn∗ is the empirical cdf of
X1∗ , · · · , Xn∗ . Use the simulated data θbn∗ to approximate the distribution of θbn .
Here we actually have a two-stage approximation:
• The approximation error due to Fn,X and FX may not be small.
• The approximation error due to the resampling is small if B is large; where B is
the number of copies of θbn∗ .
Now we are ready to present the bootstrap method for the construction of confidence
interval.
• Step 1: Given the empirical cdf Fn,X (one realization):
n
1X
Fn,X = 1{xi ≤ x}.
n i=1
∗ ∗
• Step 2: Generate n samples X1,k , · · · , Xn,k from Fn,X and compute
∗ ∗ ∗ ∗
θbn,k = T (X1,k , · · · , Xn,k ) = T (Fn,X,k ), 1 ≤ k ≤ B,
∗ n
where {Xi,k }i=1 are n independent samples from Fn,X (x) and they form the empir-
∗
ical cdf Fn,X,k . Note that generating samples from Fn,X is equivalent to uniformly
picking data from {x1 , · · · , xn } with replacement, i.e., it is the cdf of the following
random variable Z:
1
P(Z = xi ) =
n
25
∗
• Step 3a (Basic bootstrap): Compute Rk = θbn,k − θbn , and find the α/2 and 1 − α/2
empirical quantiles of {Rk }B b∗
k=1 , i.e., R(α) = θn,k (α) − θn . A (1 − α)-confidence
b
interval is given by
α α b
∗ ∗
θbn,k − θbn ≤ θbn − θ ≤ θbn,k 1− − θn
2 2
which equals
∗
α ∗
α
2θbn − θbn,k 1− ≤ θ ≤ 2θbn − θbn,k
2 2
∗
In other words, we use the empirical quantile R(α) = θbn,k (α) − θbn as an α-quantile
estimator of the random variable θbn − θ.
• Step 3b (Percentile intervals): Another way is to use the empirical quantile of
∗
{θbn,k }B
k=1 , α α
∗ ∗
θbn,k , θbn,k 1−
2 2
as a 1 − α confidence interval for θ.
• Estimation of standard deviation:
B n
!2
1 X b∗ 1 X b∗
b∗2
σ = θ − θ
B − 1 k=1 n,k B j=1 n,j
Pn
Exercise: The empirical cdf Fn,X = n−1 i=1 1{xi ≤ x} defines a discrete random
variable Z with pmf:
1
P(Z = xi ) = .
n
Exercise: Show that drawing i.i.d. samples X1∗ , · · · , Xn∗ from Fn,X is equivalent to draw
n samples from {x1 , · · · , xn } uniformly with replacement.
Exercise: Suppose x1 , · · · , xn are observed i.i.d. data from FX . Assume that they are
distinct, i.e., xi 6= xj for all i 6=
Pj. Compute the mean and variance for the random
∗ −1 n
variable X with cdf Fn,X = n i=1 1{xi ≤ x}.
4, 0, 6, 5, 2, 1, 2, 0, 4, 3
Yk = T (X1∗ , · · · , Xn∗ )
26
• The 95% CI via central limit theorem is (1.4247, 3.9753):
Sn Sn
X n − 1.96 √ , X n + 1.96 √
n n
where n = 10.
• The 95% CI via Bootstrap methods is (1.5, 3.9)
Example: (Wasserman Example 8.6). Here is an example first used by Bradley Efron to
illustrate the bootstrap. The data consist of the LSAT scores and GPA, and one wants
to study the correlation between LSAT score and GPA.
The population correlation is
E(X− µX )(Y − µY )
ρ= p p
E(X − µX )2 E(Y − µY )2
27
3.5
3.4
3.3
3.2
3.1
2.9
2.8
2.7
540 560 580 600 620 640 660 680
The scatterplot implies that higher LSAT score tends to give higher GPA. The empir-
ical correlation is ρb = 0.776 which indicates high correlation between GPA and LSAT
scores.
How to obtain a 1 − α confidence interval for ρ? We apply bootstrap method to obtain
a confidence interval for ρ.
1. Independently sample (Xi∗ , Yi∗ ), i = 1, 2, · · · , n, from {(Xi , Yi )}ni=1 uniformly with
replacement.
2. Let Pn ∗ ∗
∗
− X n )(Yi − Y n )
i=1 (Xi
ρb∗k = qP qP
n ∗ ∗ 2 n ∗ ∗ 2
i=1 (Xi − X n ) i=1 (Yi − Y n )
where
B
∗ 1 X ∗
ρbB = ρb .
B k=1 k
ρ∗k }B
• The CI can be obtained via computing the empirical quantile of {b k=1 .
Let B = 1000 and we have the histogram in Figure 3.3. A 95% confidence interval for
the correlation (0.4646, 0.9592).
Exercise: Show that drawing n i.i.d. samples from the 2D empirical cdf Fn (x, y) =
n−1 ni=1 1{Xi ≤ x, Yi ≤ y} is equivalent to sampling n points uniformly with replacement
P
from {Xi , Yi }ni=1 .
In this note, we briefly introduce the bootstrap method and apply it to construct con-
fidence interval. However, we didn’t cover the theory for bootstrap. Under certain reg-
ularity condition, the CI from bootstrap covers the actual parameter θ with probability
28
ρ∗k }B
Figure 3.3: Histogram of {b k=1
approximately 1 − α as n → ∞. For more details, you may refer to [3, Chapter 8] and [1,
Chapter 6.5].
29
Chapter 4
Parametric inference
We have discussed the estimation of some common population parameters such as mean,
variance, and median. You may have realized that we did not impose any assumption
on the underlying population distribution: the analysis hold for quite general distribu-
tions. However, in many applications, we have more information about the underlying
distribution, e.g., the population distribution may belong to a family of distributions
S = {fθ (x)|θ ∈ Θ} where Θ is the parameter space and fθ (x), also denoted by f (x; θ), is
the pdf or pmf. This is referred to as the parametric models. The population distribution
is uniquely determined by the hidden parameter θ. Of course, the validity of such an as-
sumption remains verified: one may need to check if the population is indeed consistent
with the assumed distribution. This is a separated important question which we will deal
with later.
Here are several examples of parametric models.
Normal data model:
1 −
(x−µ)2
S= √ e 2σ 2 : σ > 0, µ ∈ R .
2πσ 2
Gamma distribution:
1 α−1 −x/β
S= x e , x>0:θ∈Θ , θ = (α, β), Θ = {(α, β) : α > 0, β > 0}.
β α Γ(a)
30
4.1 Method of moments (M.O.M)
Method of moments is a convenient way to construct point estimators with wide appli-
cations in statistics and econometrics. Given a set of i.i.d. samples X1 , · · · , Xn from
f (x; θ), how to find a suitable θ such that the data X1 , · · · , Xn well fit f (x; θ)? Suppose
two distributions are close, we would expect their mean, second moment/variance, and
higher moments to be similar. This is the key idea of the method of moments: moment
matching.
Suppose the distribution f (x; θ) depends on θ = (θ1 , · · · , θk ) ∈ Rk .
• First compute the jth moment of F (x; θ), i.e.,
αj (θ) = E X j , 1 ≤ j ≤ k.
where α
bj converges to the population moment αj in probability provided the higher
order moment exists by the law of large number.
The method of moments estimator θbn is the solution to a set of equations:
αj (θbn ) = α
bj , 1 ≤ j ≤ k.
In other words, the method of moments estimator matches sample moments. It is ob-
tained by solving these equations.
Question: How to choose k? We usually choose the smallest k such that θbn is uniquely
determined, i.e., the solution to αj (θ) = α
bj is unique. We prefer lower moments since
higher moment may not exist and also have a higher variance.
Example: Suppose X1 , · · · , Xn be i.i.d. samples from Poisson(λ). Find the method of
moments estimators.
The first moment of Poisson(λ) is
EX =λ
Thus n
b= 1
X
λ Xi .
n i=1
It is a consistent estimator and converges to λ in mean squared error.
However, the method of moments estimator is not unique:
E Xi2 = Var(Xi ) + (E Xi )2 = λ2 + λ.
Another choice is n
b= 1
X
λ (Xi − X n )2 .
n i=1
31
since λ = Var(Xi ). How to compare these three estimators? Which one is the best?
Example: For X1 , · · · , Xn ∼Poisson(λ), X n and Sn2 are both unbiased estimators of λ.
Which one has smaller MSE? What is the optimal MSE? You are encouraged to try but
the calculation can be long and involved.
Example: Suppose X1 , · · · , Xn be i.i.d. samples from N (µ, σ 2 ). Find the method of
moments estimators.
The means of first two moments are
α1 (θ) = E(Xi ) = µ
and
α2 (θ) = E(Xi )2 = Var(Xi ) + (E(Xi ))2 = µ2 + σ 2 .
32
Method of moment does not exist as E(X) = ∞. It is also possible that the first moment
exist while the second moment does not.
Secondly, the MOM estimators may not satisfy some natural constraints of the population
distribution. Here is one such example.
Example: Suppose we have a sample X1 , · · · , Xn ∼ Binomial(n, p) where n and p are
unknown. Find the method of moment estimator for n and p.
First we compute the expectation and second moment.
We approximate the mean and second moment via its empirical mean and second mo-
ment: n
2 2 1X 2
np = X n , np(1 − p) + n p = X .
n i=1 i
Solving for n and p:
2 Pn
X − X n )2
i=1 (Xi
n
b= −1
Pn n 2
, pb = 1 −
Xn − n i=1 (Xi − X n) nX n
The estimation of n is not necessarily an integer, even negative if the rescaled empirical
variance is larger than the empirical for a small sample.
αj (θ) = E X j
33
is uniquely determined. However, it could be tricky to determine if a function is one-to-
one. For a single variable function, h(θ) is one-to-one if it is strictly increasing/decreasing.
Moreover, if h(θ) and h−1 (α) are also continuous at θ and α respectively, then by con-
tinuous mapping theorem, it holds
p
θbn = h−1 (α)
b −→ h−1 (α) = θ.
Theorem 4.2.1 (Consistency). Suppose h(θ) is one-to-one with its inverse function h−1
continuous at α, then θb = h−1 (α)
b is a consistent estimator of θ provided that the corre-
sponding moments exist.
Question: Can we construct a confidence interval for θ? Yes, under certain mild condi-
tions. Let’s consider a simple case: the single variable case, and the analysis can be easily
extended to multivariate case which will involve multivariate Gaussian distribution and
CLT.
For θ ∈ R, we have
h(θ) = E X.
Suppose h(θ) is one-to-one, then the inverse function exists. The MOM estimator satis-
fies:
h(θbn ) = X n .
In other words,
h(θbn ) − h(θ) = X n − E X.
Recall that θbn is a consistent estimator of θ, then θbn is quite close to θ for large n and by
linearization, we have
for some function R(x) which goes to zero as x → θ. This is the Taylor’s theorem with a
remainder. Suppose h0 (θ) 6= 0 holds, then we have the following approximation:
1
θbn − θ ≈ X n − E X
h0 (θ)
When constructing a confidence interval, we can use plug-in principle to estimate σ 2 and
h0 (θ) by Sn2 and h0 (θbn ) respectively.
By Slutsky’s theorem, we have
√
nh0 (θbn )(θbn − θ) d
−→ N (0, 1) .
Sn
34
A 1 − α confidence interval for θ is
z1−α/2 Sn
θ − θn < √ .
b
n|h0 (θbn )|
pb = 1/X n , EX = 1/p.
Can we derive a confidence interval for p? Let h(p) = p−1 and then h0 (p) = −p−2 . Thus
a 1 − α CI for p is
z1−α/2 Sn −1 z1−α/2 Sn 2
|p − pb| < √ =⇒ |p − X n | < √ · X n.
nh0 (θbn ) n
exists for some function r(·). Then we perform “moment matching” by solving θbn
from n
1X
αr (θ) = r(Xi ).
n i=1
Intuitively, this approach also would work if αr (·) is one-to-one. This idea is actually
called generalized method of moments. We can derive similar results for GMM by using
the tools we have just covered.
35
Here we have a few remarks. Firstly, the likelihood function is just the joint density/pmf
of the data, except that we treat it as a function of the parameter θ. Similarly, we can
define the likelihood function for non i.i.d. samples. Suppose X1 , · · · , Xn are samples
from a joint distribution fX1 ,··· ,Xn (x|θ), then
L(θ|Xi = xi , 1 ≤ i ≤ n) = fX1 ,··· ,Xn (x; θ).
The likelihood function for non-i.i.d. data will be very useful in linear regression. We will
also briefly discuss an interesting example from matrix spike model later in this lecture.
Another thing to bear in mind is: in general, L(θ|x) is not a density function of θ even
though it looks like a conditional pdf of θ given x.
Now it is natural to ask: what is maximum likelihood estimation? As the name has
suggested, it means the maximizer of L(θ|x).
Definition 4.3.2 (Maximum likelihood estimation). The maximum likelihood estimator
MLE denoted by θbn is the value of θ which maximizes L(θ|x), i.e.,
θbn = argmaxθ∈Θ L(θ|x).
This matches our original idea: we want to find a parameter such that the corresponding
population distribution is most likely to produce the observed samples {x1 , · · · , xn }.
In practice, we often maximize log-likelihood function instead of likelihood function.
Log-likelihood function `(θ|x) := log L(θ|x) is defined as the logarithm of likelihood
L(θ|x).
There are two reasons to consider log-likelihood function:
• The log-transform will not change the maximizer since natural logarithm is a strictly
increasing function.
• The log-transform enjoys the following property:
n
Y n
X
`(θ|x) = log fXi (xi ; θ) = log fXi (xi ; θ)
i=1 i=1
where Xi are independent with the pdf/pmf fXi (·; θ). This simple equation usually
makes the calculation and analysis much easier.
Example: Suppose that x1 , · · · , xn ∼ Bernoulli(p). The pmf is
f (x; p) = px (1 − p)1−x
where x ∈ {0, 1}.
First let’s write down the likelihood function. Note that xi is an independent copy of
Bernoulli(p); their distribution equals the product of their marginal distributions.
n
Y P P
L(p|x) := pxi (1 − p)1−xi = p i xi
(1 − p)n− i xi
i=1
nxn
=p (1 − p)n(1−xn )
where xn = n−1 ni=1 xi . Next we take the logarithm of L(p) (Here we omit x if there is
P
no confusion) and we have
n
!
X
`(p) = nxn log(p) + n 1 − xi log(1 − p).
i=1
36
How to maximize it? We can differentiate it, find the critical point, and use tests to see if
the solution is a global maximizer. We differentiate `n (p) w.r.t. p and obtain the critical
point:
d`(p) nxn n(1 − xn )
= − = 0 =⇒ pb = xn
dp p 1−p
Is pb a global maximizer?
d2 `(p) nxn n(1 − xn )
2
=− 2 − < 0.
dp p (1 − p)2
This implies that the likelihood function is concave. All the local maximizers of a concave
function are global. Therefore, pb = xn is the MLE. If we treat each xi as a realization of
Xi , then the statistic
pb = X n
√ d
is a consistent estimator of p and enjoys asymptotic normality n(b p − p) −→ N (0, 1) by
CLT. From now on, we replace xi by Xi since xi is a realization of Xi .
-0.5
-1
-1.5
-2
-2.5
-3
-3.5
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Pn
Figure 4.1: Plot of `(p) with n = 100 and i=1 xi = 75.
Example: Suppose X1 , · · · , Xn are sampled from N (µ, σ 2 ) where µ and σ 2 > 0 are
unknown. What is the MLE of (µ, σ 2 )?
n
(Xi − µ)2
2
Y 1
L(µ, σ |X1 , · · · , Xn ) = √ exp −
i=1 2πσ 2 2σ 2
n
!
1 X (Xi − µ)2
= exp − .
(2πσ 2 )n/2 i=1
2σ 2
Taking the logarithm of the likelihood function leads to
n
n 1 X
`(µ, σ 2 ) = − log σ 2 − 2 (Xi − µ)2 + C
2 2σ i=1
where C contains no information about µ and σ 2 . Taking the partial derivative w.r.t. µ
and σ 2 (we treat σ 2 as a variable) gives
n
∂` 1 X
=− 2 (µ − Xi ) = 0
∂µ σ i=1
n
∂` n 1 X
2
=− 2 + 4 (Xi − µ)2 = 0
∂σ 2σ 2σ i=1
37
Solving for µ and σ 2 :
n
1X
2 n−1 2
b = X n,
µ σ
b = (Xi − X n )2 = Sn .
n i=1 n
n
∂ 2` n 1 X
2 2
= 4− 6 b)2
(Xi − µ
∂(σ ) 2b
σ σ
b i=1
n n n
= 4 − 4 = − 4.
2b
σ σ
b 2b
σ
Thus the Hessian matrix is
n
2 2 b2
0
∇ `(b b )=−
µ, σ σ
n .
0 σ4
2b
It is negative definite which is equivalent to the statement that all of its eigenvalues
are negative. As a result, (b b2 ) is a local maximizer. The Hessian of log-likelihood
µ, σ
function plays an important role in statistics, which is also called Fisher information
matrix. Obviously, the MLE is a consistent estimator of (µ, σ 2 ).
Example: Let X1 , · · · , Xn ∼ Unif(0, θ) where θ is unknown. Recall that
(
θ−1 , if 0 ≤ x ≤ θ,
f (x; θ) =
0, otherwise.
38
Note Ln (θ) is zero if max1≤i≤n Xi > θ and is decreasing as θ increases. Therefore, the
MLE is max1≤i≤n Xi .
Exercise: Is θb = max1≤i≤n Xi a consistent and unbiased estimator of θ?
Exercise: Does θb = max1≤i≤n Xi enjoy asymptotic normality? Or ask what is the distri-
bution of θ?
b We actually have derived the distribution in one homework problem.
All the examples we have seen have an MLE of closed form. However, it is often that we
don’t have an explicit formula for the MLE. Here is one such example.
Example: Consider
1
f (x; α, β) = xα−1 e−x/β , x > 0.
Γ(α)β α
How to find out the MLE?
n
!α−1 n
1 Y X
L(α, β|Xi , 1 ≤ i ≤ n) = Xi exp(− Xi /β), x > 0.
Γ(α)n β nα i=1 i=1
Maximizing the log-likelihood function over α is quite tricky as it will involve the deriva-
tive of Gamma function. On the other hand, it is quite simple to derive the method of
moments estimator for Gamma(α, β).
Here is another very interesting but challenging problem related to the matrix spike
model, a famous model in high dimensional statistics. Feel free to try it.
Challenging Exercise: Suppose we observe the data
Yij = θi θj + σWij ,
where {Wij }i≤j are independent Gaussian random variables and satisfy
(
N (0, 1), i 6= j,
Wij =
N (0, 2), i = j.
and Wij = Wji . Here θ = (θ1 , · · · , θn )> ∈ Rn is the hidden parameter to be estimated and
kθk = 1. Write down the log-likelihood function and show that the MLE of θ is the top
eigenvector of Y . (Note that Yij are independent but not identically distributed.)
In fact, finding the MLE in some statistical models is quite challenging and may even be
NP-hard.
• Optimization tools are needed to maximize
max `n (θ).
θ∈Θ
39
4.4 Properties of MLE
We have discussed that MOM estimators satisfy consistency and asymptotic normality
property under certain mild conditions. Does these two important properties also hold
for MLE?
4.4.1 Consistency
First let’s recall some examples:
1. For Xi ∼Bernoulli(p), the MLE is pb = X n . By LLN, pb is a consistent estimator of
p.
2. For Xi ∼Unif[0, θ], the MLE is θbn = max Xi ≤ θ. We have calculated the distribu-
tion of θbn
P(θbn ≤ x) = θ −n xn , 0 ≤ x ≤ θ.
n
Now P(|θ − θbn | ≥ ) = P(θbn ≤ θ − ) = (1 − θ−1 ) → 0 as n goes to infinity.
Therefore, θbn is a consistent estimator of θ in the two examples described above. Can we
extend it to more general scenarios?
Theorem 4.4.1 (Consistency). Under certain regularity condition, the MLE is consis-
tent, i.e.,
lim P(|θbn − θ| ≥ ) = 0
n→∞
Each log f (Xi ; θ) is an independent copy of the random variable f (X; θ). By the law of
large number, it holds that
1 p
`(θ) −→ Eθ0 log f (X; θ)
n
as n → ∞ where ˆ
Eθ0 log f (X; θ) = (log f (x; θ))f (x; θ0 ) dx.
X
40
since ˆ
f (X; θ)
Eθ0 −1 = (f (x; θ) − f (x; θ0 )) dx = 1 − 1 = 0.
f (X; θ0 ) X
Now we summarize:
• θbn is the MLE, i.e., the global maximizer of `(θ);
• θ0 is the unique maximizer of Eθ0 log f (X; θ) due to identifiability;
• For any θ, n−1 `(θ) converges to its expectation Eθ0 log f (X; θ).
Since `(θ) and Eθ0 log f (X; θ) are getting closer, their points of maximum should also get
closer.
The proof is not completely rigorous: in order to make the argument solid, we actually
would ask the uniform convergence of n−1 `(θ) to Eθ0 log f (X; θ).
41
Theorem 4.4.2. Suppose θbn is the MLE, i.e., the global maximizer of `(θ). Then
d
θbn − θ0 −→ N (0, In (θ0 )−1 )
where θ0 is the true parameter provided that `(θ) has a continuous second derivative in a
neighborhood around θ0 and In (θ) = nI(θ) is continuous at θ0 .
Once we have the asymptotic normality of MLE, we can use it to construct a (1 − α)
confidence interval. Since θbn is a consistent estimator of θ0 , Slutsky theorem implies
that √ b
n(θ − θ ) d
rh n i 0 −→ N (0, 1)
−1
I(θbn )
p
since I(θbn ) −→ I(θ0 ) if I(θ) is continuous at θ0 . Note that In (θbn ) = nI(θbn ). Therefore,
this asymptotic distribution can also be written into
q
d d
In (θbn ) · (θbn − θ) −→ N (0, 1) or θbn − θ −→ N (0, In (θbn )−1 ).
42
Proof of asymptotic normality. By Taylor theorem, it holds that
Here o(|θ0 − θbn |) means this term has a smaller order than |θ0 − θbn |.
Note that `0 (θbn ) = 0 since θbn is the maximizer. We have
0
` (θ0 )
−`0 (θ0 ) = `00 (θ0 )(θbn − θ0 ) + o(|θbn − θ0 |) =⇒ θbn − θ0 ≈ − 00
` (θ0 )
which implies
√ √1 `0 (θ0 )
n
n(θn − θ0 ) = − 1 00
b + small error.
n
` (θ0 )
We will see why we need these terms containing n immediately.
Note that
Eθ0 `0 (θ0 ) =0
since θ0 is global maximizer to Eθ0 log f (X; θ). By CLT, we have
n 2 !
1 1 X d
d d
√ `0 (θ0 ) = √ log f (Xi ; θ) −→ N 0, E log f (Xi ; θ) = N (0, I(θ0 ))
n n i=1 dθ θ=θ0 dθ θ=θ0
d
where each dθ
log f (Xi ; θ) is an i.i.d. random variable. On the other hand, by law of large
number,
1 00 p d2
` (θ0 ) −→ E 2 log f (X; θ0 ) = −I(θ0 ).
n dθ
√ b
in probability. Therefore, we have n(θn − θ0 ) converges in distribution:
√ √1 `0 (θ0 )
n d
n(θn − θ0 ) = − 1 00
b → N (0, I(θ0 )−1 )
n
` (θ0 )
Proof: Let τ = g(θ) and θb be the MLE of L(θ|x). We prove the simplest case where g
is one-to-one. Define
L∗ (τ |x) := L(g −1 (τ )|x) ≤ L(θ|x)
b
Thus τb = g(θ)
b is the MLE of τ = g(θ).
43
It is the induced likelihood function of τ . This definition does not depend on whether
g is one-to-one or not. Since θb is the MLE of L(θ|x), L∗ (τ |x) ≤ L(θ|x).
b On the other
hand,
L∗ (b
τ |x) = sup L(θ|x) ≥ L(θ|x).b
{θ:g(θ)=b
τ}
Regarding the construction of confidence interval for τ , we need to know the asymptotic
distribution of g(θbn ).
Theorem 4.5.1 (Delta method). Suppose a sequence of random variables θbn satisfies
√
that n(θbn − θ) converges to N (0, σ 2 ) in distribution. For a given function g(x) such
that g 0 (θ) exists and is nonzero, then
√ d
n(g(θbn ) − g(θ)) −→ N (0, [g 0 (θ)]2 σ 2 ).
44
√ d
Exercise: Show that n(θbn − θ) −→ N (0, 1) implies that θbn converges to θ in probabil-
ity.
Following from Delta method, we immediately have:
Theorem 4.5.2. Suppose θbn is the MLE of θ, then τb satisfies
√ [g 0 (θ)]2
n(bτn − τ ) → N 0,
I(θ)
where I(θ) is the Fisher information of θ and g(·) has non-vanishing derivative at θ.
Exercise: what if g 0 (θ) = 0 but g 00 (θ) exists? Show that if g 0 (θ) = 0 and g 00 (θ) 6= 0,
then
d σ 2 g 00 (θ) 2
n(g(θbn ) − g(θ)) −→ χ1
2
where √ d
n(θbn − θ) −→ N (0, σ 2 ).
(Hint: Use 2nd-order Taylor approximation:
1
g(θbn ) − g(θ) = g 0 (θ)(θbn − θ) + g 00 (θ)(θbn − θ)2 + R
2
1 00 2
= g (θ)(θbn − θ) + R
2
√
Note that Z = n(θσn −θ) converges to a standard normal random variable. Z 2 is actually
b
χ21 distribution (chi-square distribution of degree 1). One can derive the corresponding
distribution of n(g(θbn ) − g(θ)) by using this fact.)
Exercise: Show that the asymptotic distribution
L(θbn )
λ(X1 , · · · , Xn ) = 2 log = 2(`(θbn ) − `(θ0 )) ∼ χ21
L(θ0 )
where Xi ∼ f (x; θ0 ), i.e., θ0 is the true parameter, and θbn is the MLE. This is called the
likelihood ratio statistic. We will see this again in likelihood ratio test.
Example: Suppose we observe X1 , · · · , Xn ∼Bernoulli(p) random variables. We are
interested in the odds
p
ψ= .
1−p
The MLE of ψ is ψb = pb/(1 − pb) where pb = X n .
The variance of ψb is
45
Example: Suppose X1 , · · · , X√n ∼Geo(p). Find the MLE of p. Use Delta method to find
an asymptotic distribution of n(b p − p).
We start with the log-likelihood function:
n
Y Pn
L(p|Xi , 1 ≤ i ≤ n) = (1 − p)Xi −1 p = (1 − p) i=1 Xi −n pn
i=1
and
`(p) = n(X n − 1) log(1 − p) + n log p.
Let’s compute the critical point and show it is a global maximizer.
n(X n − 1) n 1
`0 (p) = − + =⇒ pb = .
1−p p Xn
Note that by CLT, we have
√ d
n(X − 1/p) −→ N (0, (1 − p)/p2 )
1−p p2 (1 − p)
Var(g(X n )) ≈ [g 0 (1/p)]2 · Var(X n ) = p4 · = .
np2 n
Therefore,
√ d
p − p) −→ N 0, (1 − p)p2 .
n(b
n(X n − 1) n
`00 (p) = − 2
− 2
(1 − p) p
So the Fisher information is
n(1/p − 1) n n n n
In (p) = Ep `00 (p) = 2
+ 2 = + 2 = 2
(1 − p) p p(1 − p) p p (1 − p)
46
Theorem 4.6.2 (Cramer-Rao inequality). Suppose θbn is an estimator of θ with finite
variance. Then 2
d
E θ
dθ θ n
b
Varθ (θbn ) ≥ .
nI(θ)
where ˆ 2
d
I(θ) = log f (x; θ) f (x; θ) dx
R dθ
with f (x; θ) as the pdf.
Proof: Note that Eθ θbn is a function of θ. Assume that the integration and differentia-
tion can be exchanged, i.e.,
ˆ
d d
Eθ (θ)
b = θ(x)
b f (x; θ) dx
dθ X dθ
Differentiating it w.r.t. θ:
ˆ ˆ
df (x; θ) d Eθ θbn
(θ(x)
b − Eθ (θbn )) dx − f (x; θ) dx = 0
X dθ X dθ
which gives ˆ
d df (x; θ)
Eθ (θbn ) = (θ(x)
b − Eθ (θbn )) dx.
dθ X dθ
Applying Cauchy-Schwarz inequality, we have
2 ˆ 2
d p 1 df (x; θ)
Eθ (θbn ) = (θ(x) − ( θ )) f (x; θ) · dx
dθ
b E θ n
b p
f (x; θ) dθ
X
ˆ ˆ 2
2 1 df (x; θ)
≤ (θ(x)
b − Eθ (θbn )) f (x; θ) dx · dx
X X f (x; θ) dθ
2
d log f (x; θ)
= Varθ (θbn ) · Eθ
dθ
= Varθ (θbn ) · nI(θ).
e−λ λx
f (x; λ) = , x ∈ {0, 1, · · · }.
x!
The Fisher information is I(λ) = λ−1 . Note that the MLE of λ is X n .
λ 1 λ
Varλ (X n ) = , = .
n nI(λ) n
47
Thus X n is the best unbiased estimator of λ since Var(X n ) = λ/n.
Exercise: Suppose X1 , · · · , Xn ∼ N (0, σ 2 ). Find the MLE of σ 2 and the Fisher infor-
mation I(σ 2 ). Show that
n
1X 2
X
n i=1 i
is the unbiased estimator of σ 2 with the smallest possible variance.
Exercise: Suppose X1 , · · · , Xn ∼ N (0, σ 2 ). Find the MLE of σ and the Fisher informa-
tion I(σ). What is the actual and approximate variance of σ b?
Exercise: Suppose X1 , · · · , Xn are samples from
(
θxθ−1 , 0 < x < 1,
f (x; θ) =
0, otherwise.
for some variance σc2v ? This requires us to find out what the joint distribution of (b b2 ),
µ, σ
which is made more clearly later.
48
4.7.1 Multiparameter MLE
Suppose a family of distributions depends on several parameters θ = (θ1 , · · · , θk )> (a
column vector). By maximizing the likelihood function, we obtain the MLE:
θb = argmaxθ∈Θ L(θ).
Two questions:
• Is the MLE consistent?
• Does asymptotic normality hold?
Theorem 4.7.1. The MLE is consistent under certain regularity condition.
If you are interested in the rigorous proof of consistency, you may refer to advanced
textbook such as [2].
Exercise: Can you generalize the argument in the single parameter scenario to multi-
parameter scenario?
How about asymptotic normality of MLE? Note that the MLE is no longer a vector, we
will naturally use multivariate normal distribution.
49
Example: If X and Y are two independent standard normal random variables, then
their joint pdf is 2
x + y2
1
fX,Y (x, y) = exp − .
2π 2
This is essentially N (0, I2 ), i.e., N (0, 0, 1, 1, 0).
Example: No correlation implies independence for joint normal distribution. Suppose
(X, Y ) ∼ N (µ, Σ) with zero correlation ρ = 0. Then it holds
1 (x − µX )2 (y − µY )2
1
fX,Y (x, y; µ, Σ) = exp − 2
+
2πσX σY 2 σX σY
2
(y − µY )2
1 (x − µX ) 1
=p exp − 2
·p exp −
2πσX2 2σX 2πσY2 2σY2
= fX (x)fY (y).
In other words, they are independent.
Question: Is fX,Y indeed
˜ a probability density function? To justify it, we need to show
fX,Y ≥ 0 (obvious) and R fX,Y (x, y) dx dy = 1.
˜
Proof: We will show that R fX,Y (x, y) dx dy = 1. To do this, we first introduce a few
notations:
x µ
z= , µ= X
y µY
Then
¨ ˆ
1 1 > −1
fX,Y (x, y) dx dy = p exp − (z − µ) Σ (z − µ) dz
R2 2π det(Σ) R2 2
where dz equals dx dy. Now we perform a change of variable:
√
−1/2 1/2 > λ1 √0
w=Σ 1/2
(z − µ), Σ = U Λ U = U U>
0 λ2
where Σ = U ΛU > is the spectral decomposition (eigen-decomposition) of Σ, i.e., U is
orthogonal (U U > = U > U = I2 ) and
λ1 0
Λ= , λ1 , λ2 > 0
0 λ2
consists of all eigenvalues of Σ. This change of variable maps R2 to R2 and is one-to-one.
Now we substitute z = Σ1/2 w + µ into the integral. Note that
p p
dz = | det(Σ1/2 )| dw = λ1 λ2 = det(Σ) dw
h i
∂zi
where Σ1/2 is essentially the Jacobian matrix ∂w j
and
2×2
> −1 >
(z − µ) Σ (z − µ) = w w = w12 + w22
where w = (w1 , w2 )> . Thus the integral equals
¨ ˆ
1 1 > p
fX,Y (x, y) dx dy = p exp − w w det(Σ) dw
R2 2π det(Σ) R2 2
ˆ
1 1 >
= exp − w w dw
2π R2 2
ˆ ˆ
1 1 2 1 1 2
=√ exp − w1 dw1 · √ exp − w2 dw2
2π R 2 2π R 2
= 1.
50
Essentially, the argument above proves the following statement.
Lemma 4.7.2. Suppose Z = (X, Y )> ∼ N (µ, Σ). Then
Σ−1/2 (Z − µ) ∼ N (0, I2 ).
Proof: First note that aX + bY is normal. Thus it suffices to determines its mean and
variance.
E(aX + bY ) = a E X + b E Y = aµX + bµY
and
Var(aX + bY ) = E [a(X − E X) + b(Y − E Y )] [a(X − E X) + b(Y − E Y )]
= a2 E(X − E X)2 + 2ab E(X − E X)(Y − E Y ) + b2 E(Y − E Y )2
= a2 Var(X) + 2ab Cov(X, Y ) + b2 Var(Y )
= a2 σX
2
+ 2abρσX σY + b2 σY2 .
Thus we have the result.
Question: Why is aX + bY normal? We can use the moment generating function. For
simplicity, we can let µX and µY be zero. We let
a x w1
λ= , z= w= = Σ−1/2 z.
b y w2
Then
ˆ
t(aX+bY ) tλ> z 1 tλ> z 1 > −1
M (t) := E e = Ee = p e exp − z Σ z dz
2π det(Σ) R2 2
Note that z = Σ1/2 w and thus dz = | det(Σ1/2 )| dw.
ˆ
1 tλ> z 1 > −1
M (t) := p e exp − z Σ z dz
2π det(Σ) R2 2
ˆ
1 tλ> Σ1/2 w 1 >
= p e exp − w w | det(Σ1/2 )| dw
2π det(Σ) R2 2
ˆ
1 1 > > 1/2
= exp − w w + tλ Σ w dw
2π R2 2
ˆ
1 > 1 1 1/2 > 1/2
= exp − λ Σλ · exp − (w − Σ λ) (w − Σ λ) dw
2 2π R2 2
51
Note that the integral is 1 since it is a pdf, i.e., the pdf of N (Σ1/2 λ, I2 ). Thus
1 >
M (t) = exp − λ Σλ .
2
λ> Σλ = a2 σX
2
+ 2abρσX σY + b2 σY2 .
where Xi ∈ Rk are i.i.d. random variable (vectors). Define the Fisher information matrix
as
∂2 2 2
Eθ ∂θ∂∂θ `(θ) Eθ ∂θ∂∂θ `(θ)
Eθ ∂θ 2 `(θ)
1 2
··· 1 k
1
∂ 2 ∂2 2
∂2 Eθ ∂θ1 ∂θ2 `(θ) `(θ) ··· Eθ ∂θ∂∂θ `(θ)
Eθ ∂θ 2
2 k
In (θ) = − Eθ `(θ) = − ..
2
.. .. ..
∂θi ∂θj
. . . .
∂2 ∂2 ∂2
Eθ ∂θ ∂θ `(θ) Eθ ∂θ ∂θ `(θ) · · · Eθ ∂θ 2 `(θ)
1 k 2 k k
Fisher information matrix is equal to the Hessian matrix of −`(θ|X) under expectation.
Is it always positive semidefinite?
Theorem 4.7.4 (Asymptotic normality of MLE). Under certain regularity condi-
tion, it holds that
d
θb − θ −→ N (0, In−1 (θ))
where In−1 (θ) is the inverse of Fisher information matrix.
If Xi ∈ Rk are i.i.d. random vectors, then
√ d
n(θb − θ) −→ N (0, I −1 (θ))
∂2
[I(θ)]ij = − E log f (X; θ) .
∂θi ∂θj
52
The second order derivative of log-pdf is
∂` 1 ∂` 1 1
= − 2 (µ − X), 2
= − 2 + 4 (X − µ)2
∂µ σ ∂σ 2σ 2σ
and
∂ 2` 1 ∂ 2` 1 1 ∂ 2` 1
2
= − 2, 2 2
= 4 − 6 (X − µ)2 , 2
= 4 (µ − X)
∂µ σ ∂(σ ) 2σ σ ∂µ∂σ σ
The Fisher information is
1
2 σ2
0
I(µ, σ ) = 1
0 2σ 4
and its inverse is 2
2 −1 σ 0
I(µ, σ ) =
0 2σ 4
where the error diminishes to 0 as n → ∞. Here h·, ·i denotes the inner product between
√
two vectors. This is approximately Gaussian since n(θb − θ) is asymptotically normal
√
N (0, I −1 (θ)). The variance of n(g(θ)
b − g(θ)) is given by (∇g(θ))> I −1 (θ)∇g(θ).
Example Consider τ = g(µ, σ 2 ) = σ/µ where the samples are drawn from Gaussian
Xi ∼ N (µ, σ 2 ). The goal is to find out the asymptotic distribution of MLE of τ.
b2 = n1 ni=1 (Xi −
P
Note that the MLE of τ is given by τb = σ b/bµ where µb = X n and σ
X n )2 .
∂g σ ∂g 1
= − 2, = .
∂µ µ ∂σ 2 2µσ
53
Thus
√ σ4 σ2
τ − τ) ∼ N
n(b 0, 4 + 2
µ 2µ
where
σ2 2 1 σ4 σ2
(∇g(θ))> I −1 (θ)∇g(θ) = · σ + · 2σ 4
= +
µ4 4µ2 σ 2 µ4 2µ2
We can see that the pdf involves the inverse of covariance matrix. Usually, finding the
matrix inverse is tricky. In some special cases, the inverse is easy to obtain.
1. If a matrix Σ is 2 × 2, then its inverse is
−1 1 σ22 −σ12
Σ = 2
σ11 σ22 − σ12 −σ12 σ11
2. If Σ is a diagonal matrix (
σii , i = j,
Σij =
0, i=6 j,
then its inverse is (
σii−1 , i = j,
[Σ−1 ]ij =
0, i=6 j,
54
In fact, we have seen multivariate normal distribution before. For example, if X1 , · · · , Xk
are independent standard normal random variables, then the random vector X = (X1 , · · · , Xk )>
has a joint distribution
1 1 > −1
f (x; 0; In ) = √ k p exp − x In x
2π det(In ) 2
1 1
= √ k exp − kxk2
2π 2
k
x2i
Y 1
= √ exp −
i=1
2π 2
In other words, Σij = E(Xi − µi )(Xj − µj ) is the covariance between Xi and Xj and Σii
is the variance of Xi .
An
Pn important property of joint normal distribution is that: > the linear combination
i=1 ai Xi is still normal for any deterministic a = (a1 , · · · , ak ) . How to find its distri-
bution? Since it is normal, we only need to compute its mean and variance.
Theorem 4.7.7. Suppose X ∼ N (µ, Σ), then ki=1 ai Xi obeys
P
k k
!
X X X
ai X i ∼ N ai µ i , Σij ai aj
i=1 i=1 i,j
or equivalently,
ha, Xi ∼ N ha, µi, a> Σa
55
Its variance is
k
! k k
!
X X X
Var ai X i =E ai (Xi − E Xi ) aj (Xj − E Xj )
i=1 i=1 j=1
k X
X k
= ai aj E(Xi − E Xi )(Xj − E Xj )
i=1 j=1
k X
X k k X
X k
= ai aj Cov(Xi , Xj ) = ai aj Σij = a> Σa.
i=1 j=1 i=1 j=1
Exercise: Moment-generating function for N (µ, Σ). What is the moment generating
function for X ∼ N (µ, Σ), Pk
M (t) = E e i=1 ti Xi
where t = (t1 , · · · , tk )> ? Hints: by definition, it holds
M (t) = E exp(t> X)
ˆ
1 > 1 > −1
= √ p exp t x − (x − µ) Σ (x − µ) dx.
( 2π)n/2 det(Σ) Rn 2
Then
ˆ
1 > 1/2 1 > p
M (t) = √ p exp t (Σ z + µ) − z z det(Σ) dz
( 2π)n/2 det(Σ) Rn 2
ˆ
exp(t> µ)
> 1/2 1 >
= √ exp t Σ z − z z dz
( 2π)n/2 Rn 2
ˆ
> 1 > 1 1 1/2 > 1/2
= exp t µ + t Σt · √ exp − (z − Σ t) (z − Σ t) dz
2 ( 2π)n/2 Rn 2
1
= exp t> µ + t> Σt
2
In fact, multivariate normal distribution is still multivariate norm under linear trans-
form.
Lemma 4.7.8. Suppose A is any deterministic matrix in Rl×k . Then
AX ∼ N (Aµ, AΣA> )
56
Thus E AX = A E X. For the covariance, by definition, we have
For a special case, if A is orthogonal, i.e. AA> = A> A = Ik , then for X ∼ N (0, Ik ),
then
AX ∼ N (0, AIk A> ) = N (0, Ik ).
√
2
b−µ
µ d σ 0
n 2 −→ N 0,
σb − σ2 0 2σ 4
Pn
It seems that µ b2 = n−1
b = X n and σ i=1 (Xi − X n )2 are “near” independent. Is it
true?
Pn
Theorem 4.7.9. The sample mean µ b2 = n−1
b = X n and variance σ i=1 (Xi − X n )2 are
independent. Moreover,
n
X
2
nb
σ = (Xi − X n )2 ∼ σ 2 χ2n−1 .
i=1
But how to justify this theorem? Now we let X = (X1 , · · · , Xn )> ∼ N (0, In ), i.e.,
n
!
1 1 X
fX (x; 0, In ) = n/2
− x2i .
(2π) 2 i=1
57
With u and P , we have
1 > 1
Xn = v X = X >v
n n
n
1 X
b2 =
σ (Xi − X n )2
n i=1
n
!
1 X 2
= Xi2 − nX n
n
i=1
1 > 1 > >
= X X − X vv X
n n
1
= X > P X.
n
Assume U ∈ Rn×(n−1) consists of n−1 orthonormal eigenvectors of P w.r.t. the eigenvalue
1. Then we know that U > v = 0 and moreover P = U U > , i.e.,
n−1
X
P = ui u>
i , ui ⊥ uj , i 6= j
i=1
where ui is the ith column of U . Also ui ⊥ v holds since they belong to eigenvectors
w.r.t. different eigenvectors.
Now
n−1
1 1 1X > 2
b = X > U U > X = kU > Xk2 =
σ 2
|u X|
n n 2 i=1 i
where U > X ∈ Rn−1 .
Key: If we are able to show that v > X and U > X are independent, then X n and σ
b2 are
independent.
What is the joint distribution of v > X and U > X? Consider
−1/2 >
n v
Π= ∈ Rn×n .
U>
The term n−1/2 is to ensure that kn−1/2 vk = 1. By linear invariance of normal distribution,
ΠX is also jointly normal. It is not hard to see that
−1 >
−n−1/2 v > U
> n v v 1 0
ΠΠ = = = In .
−n−1/2 U v > U >U 0 In−1
where E X = µv and ui ⊥ v. Therefore, u>i X/σ are independent standard normal and
we have
n−1
X
>
X PX = [U > X]2i ∼ σ 2 χ2n−1 .
i=1
58
Chapter 5
Hypothesis testing
5.1 Motivation
In many applications, we are often facing questions like these:
1. Motivation: In 1000 tosses of a coin, 560 heads and 440 tails appear. Is the coin
fair?
2. Whether two datasets come from the same distribution? Do the data satisfy normal
distribution?
3. Clinical trials: does the medicine work well for one certain type of disease?
These questions are called hypothesis testing problem.
5.1.1 Hypothesis
The first question is: what is a hypothesis?
Definition 5.1.1. A hypothesis is a statement about a population parameter.
For example, X1 , · · · , Xn
The two complementary hypothesis in a hypothesis testing problem are
• the null hypothesis, denoted by H0
• the alternative hypothesis, denoted by H1
Example: Suppose X1 , · · · , Xn ∼ N (µ, σ 2 ) with known σ 2 . We are interested in testing
if µ = µ0 . The hypothesis H0 : µ = µ0 is called the null hypothesis, i.e.,
H0 : µ = µ0 , H1 : µ 6= µ0 .
In statistical practice, we usually treat these two hypotheses unequally. When we perform
a testing, we actually design a procedure to decide if we should reject the null hypothesis
59
or retain (not to reject) the null hypothesis. It is important to note that rejecting the
null hypothesis does not mean we should accept the alternative hypothesis.
Now, let’s discuss how to design a test procedure. Apparently, this procedure will depend
on the observed data X1 , · · · , Xn . We focus on the example in which X1 , · · · , Xn ∼
N (µ, σ 2 ) and want to test if µ = 0. Naturally, we can first obtain an estimator of µ
from the data and see if it is close to µ = 0 (compared with the standard deviation as
well).
1. Compute the sample average T (X) = X n .
2. If T (X) is far away from µ = 0, we should reject H0 ; if T (X) is close to µ = 0, we
choose not to reject H0 . Namely, we reject H0 if
|X n − µ0 | ≥ c
T (X) = X n , R = {x : |x − µ0 | ≥ c}.
However, is it possible that T (X) and the choice of R give you a wrong answer?
β(θ) = Pθ (T (X) ∈ R)
60
Example - continued: For X_1, \cdots, X_n \sim N(\mu, \sigma^2) with known \sigma^2, let's compute the power function:
\beta(\mu) = P_\mu(|\bar{X}_n - \mu_0| \geq c)
= P_\mu(\bar{X}_n - \mu_0 \geq c) + P_\mu(\bar{X}_n - \mu_0 \leq -c)
= P_\mu(\bar{X}_n - \mu \geq c - \mu + \mu_0) + P_\mu(\bar{X}_n - \mu \leq -c - \mu + \mu_0)
= P_\mu\left( \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \geq \frac{\sqrt{n}(c - \mu + \mu_0)}{\sigma} \right) + P_\mu\left( \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \leq \frac{\sqrt{n}(-c - \mu + \mu_0)}{\sigma} \right)
= 1 - \Phi\left( \frac{\sqrt{n}(c - \mu + \mu_0)}{\sigma} \right) + \Phi\left( \frac{\sqrt{n}(-c - \mu + \mu_0)}{\sigma} \right),
where \Phi(\cdot) is the cdf of the standard normal distribution.
How do we quantify the Type-I error? Setting \mu = \mu_0,
\beta(\mu_0) = 1 - \Phi\left( \frac{\sqrt{n}c}{\sigma} \right) + \Phi\left( -\frac{\sqrt{n}c}{\sigma} \right) = 2\left( 1 - \Phi\left( \frac{\sqrt{n}c}{\sigma} \right) \right),
since \Phi(x) + \Phi(-x) = 1 for any x > 0. To keep the Type-I error under \alpha, we require
c = \frac{z_{1-\alpha/2}\,\sigma}{\sqrt{n}}.
How about the Type-II error? By definition, the Type-II error is the probability of retaining the null (not rejecting the null, i.e., T(X) \notin R) while the alternative is true. Suppose the true parameter is \mu_A \neq \mu_0; then
Type-II error at \mu_A = 1 - \beta(\mu_A).
Is it possible to control both the Type-I and Type-II errors? It can be tricky. Since we don't know the actual true parameter, the conservative way to control the Type-II error is to find a uniform bound on the Type-II error over all \mu \neq \mu_0:
\sup_{\mu \neq \mu_0} (1 - \beta(\mu)) = 1 - \beta(\mu_0),
where the supremum is attained as \mu \to \mu_0. In other words, we cannot make both the Type-I and Type-II errors simultaneously small in this case. In fact, in most testing problems, the asymmetry between H_0 and H_1 is natural. We usually put a tighter control on the more serious error, namely the Type-I error.
Exercise: Show that \sup_{\mu \neq \mu_0}(1 - \beta(\mu)) = 1 - \beta(\mu_0).
How do we control the Type-I error? Sometimes, the parameter space \Theta_0 of H_0 is not a singleton (unlike H_0 : \theta = \theta_0). To handle this, we introduce the size of a test.
Definition 5.1.3 (Size of a test). The size of a test is defined as
\alpha = \sup_{\theta \in \Theta_0} \beta(\theta).
A test is said to have level \alpha if its size is less than or equal to \alpha.
• The size of a test is the maximal probability of rejecting the null hypothesis when
the null hypothesis is true.
• If the level α is small, it means the Type-I error is small.
Example - continued: If \Theta_0 = \{\mu_0\}, then the size equals \beta(\mu_0). To make the size equal to a given \alpha, we need
\beta(\mu_0) = 2\left( 1 - \Phi\left( \frac{\sqrt{n}c}{\sigma} \right) \right) = \alpha,
which gives
\frac{\sqrt{n}c}{\sigma} = z_{1-\alpha/2} \iff \sqrt{n}c = z_{1-\alpha/2}\sigma.
Given the number of samples n, \sigma, and the size \alpha, we can determine c such that the Type-I error is at most \alpha:
c = \frac{z_{1-\alpha/2}\,\sigma}{\sqrt{n}}.
In other words, we reject the null hypothesis H_0 : \mu = \mu_0 if
|\bar{X}_n - \mu_0| \geq \frac{z_{1-\alpha/2}\,\sigma}{\sqrt{n}} \iff \frac{\sqrt{n}|\bar{X}_n - \mu_0|}{\sigma} \geq z_{1-\alpha/2}.
Figure 5.1: The pdf of Student-t distribution. Source: wikipedia
5.2 More on hypothesis testing
5.2.1 Composite hypothesis testing
So far, we have discussed simple null hypotheses, i.e., |\Theta_0| = 1. On the other hand, we often encounter composite hypotheses, in which the parameter space \Theta_0 contains multiple or even infinitely many parameters.
Example: Let X_1, \cdots, X_n \sim N(\mu, \sigma^2) where \sigma is known. We want to test
H_0 : \mu \leq \mu_0, \quad H_1 : \mu > \mu_0.
Hence
\Theta_0 = (-\infty, \mu_0] \quad \text{versus} \quad \Theta_1 = (\mu_0, \infty).
Note that T(X) = \bar{X}_n is the MLE of \mu. We reject H_0 if T(X) > c for some number c. The power function of this one-sided test is
\beta(\mu) = P_\mu(\bar{X}_n > c) = 1 - \Phi\left( \frac{\sqrt{n}(c - \mu)}{\sigma} \right);
we return to the choice of c below.

5.2.2 Wald test
Consider testing H_0 : \theta = \theta_0 versus H_1 : \theta \neq \theta_0. Suppose \hat{\theta}_n is an asymptotically normal estimator of \theta (e.g., the MLE), so that under H_0, approximately,
W := \frac{\hat{\theta}_n - \theta_0}{\widehat{se}(\hat{\theta}_n)} \sim N(0, 1),
where \theta_0 is the parameter under the null and \widehat{se}(\hat{\theta}_n) is an estimate of the standard deviation of \hat{\theta}_n. Then a size-\alpha Wald test is: reject H_0 if |W| \geq z_{1-\alpha/2}.
This can be extended to one-sided hypotheses as well:
1. For H_0 : \theta < \theta_0 and H_1 : \theta \geq \theta_0, we reject the null hypothesis if
\frac{\hat{\theta}_n - \theta_0}{\widehat{se}(\hat{\theta}_n)} > z_{1-\alpha}.
2. For H_0 : \theta > \theta_0 and H_1 : \theta \leq \theta_0, we reject the null hypothesis if
\frac{\hat{\theta}_n - \theta_0}{\widehat{se}(\hat{\theta}_n)} < -z_{1-\alpha}.
Exercise: Show that the size is approximately α for the two one-sided hypothesis testing
problems above.
Why? Let's focus on the first one-sided hypothesis. We reject the null hypothesis if \hat{\theta}_n > c for some c. How do we compute its power function?
\beta(\theta) = P_\theta(\hat{\theta}_n \geq c)
= P_\theta\left( \frac{\hat{\theta}_n - \theta}{\widehat{se}(\hat{\theta}_n)} \geq \frac{c - \theta}{\widehat{se}(\hat{\theta}_n)} \right)
\approx 1 - \Phi\left( \frac{c - \theta}{\widehat{se}(\hat{\theta}_n)} \right).
Note that the power function is an increasing function of \theta; thus the supremum over \Theta_0 is attained at \theta = \theta_0. Setting
\sup_{\theta \in \Theta_0} \beta(\theta) = 1 - \Phi\left( \frac{c - \theta_0}{\widehat{se}(\hat{\theta}_n)} \right) = \alpha
gives
c = \theta_0 + z_{1-\alpha}\,\widehat{se}(\hat{\theta}_n).
Thus we reject H_0 if
\hat{\theta}_n \geq c \iff \frac{\hat{\theta}_n - \theta_0}{\widehat{se}(\hat{\theta}_n)} > z_{1-\alpha}.
Example: Let X_1, \cdots, X_n \sim \text{Bernoulli}(\theta) and consider
H_0 : \theta = \theta_0, \quad H_1 : \theta \neq \theta_0.
The MLE is \hat{\theta}_n = \bar{X}_n and, under H_0, its standard deviation is \sqrt{\theta_0(1-\theta_0)/n}, so the Wald test rejects H_0 : \theta = \theta_0 if
\frac{|\bar{x}_n - \theta_0|}{\sqrt{\theta_0(1-\theta_0)/n}} > z_{1-\alpha/2}.
In other words, we reject the null hypothesis if
\left| \frac{\sqrt{n}(\bar{X}_n - \theta_0)}{\sqrt{\theta_0(1-\theta_0)}} \right| > z_{1-\alpha/2}.
Question: What is the connection between the confidence interval and the rejection region?
The confidence interval is closely related to the rejection region. The size-\alpha Wald test rejects H_0 : \theta = \theta_0 v.s. H_1 : \theta \neq \theta_0 if and only if \theta_0 \notin C where
C = \left( \hat{\theta}_n - z_{1-\alpha/2}\cdot\widehat{se}(\hat{\theta}_n),\; \hat{\theta}_n + z_{1-\alpha/2}\cdot\widehat{se}(\hat{\theta}_n) \right),
which is equivalent to
|\hat{\theta}_n - \theta_0| \geq z_{1-\alpha/2}\cdot\widehat{se}(\hat{\theta}_n).
Thus, testing the null hypothesis is equivalent to checking whether the null value lies in the confidence interval for this simple hypothesis. When we reject H_0, we say that the result is statistically significant.
5.2.3 p-value
What is p-value? Let’s first give the formal definition of p-value and then discuss its
meaning.
Definition 5.2.1 (p-value). Suppose that for every \alpha \in (0, 1), we have a size-\alpha test with rejection region R_\alpha. Then the p-value is defined as
p\text{-value} = \inf\{\alpha : T(x) \in R_\alpha\},
where T(x) is the observed value of T(X). In other words, the p-value is the smallest level at which we can reject H_0. (Recall that the smaller \alpha is, the smaller the rejection region is.)
The definition of the p-value does not look obvious at first glance, so let's work through a concrete example.
Example: Suppose X_1, \cdots, X_n \sim N(\mu, \sigma^2) with known \sigma^2. Consider the testing problem H_0 : \mu = \mu_0 v.s. H_1 : \mu \neq \mu_0. We reject the null hypothesis with size \alpha if
\left| \frac{\sqrt{n}(\bar{X}_n - \mu_0)}{\sigma} \right| > z_{1-\alpha/2} \iff |\bar{X}_n - \mu_0| > \frac{z_{1-\alpha/2}\,\sigma}{\sqrt{n}}.
As we can see, if \alpha decreases, then z_{1-\alpha/2} increases to infinity and the rejection region shrinks. Now suppose we observe the data and calculate the test statistic T(x) = \bar{x}_n. We try to find the smallest possible \alpha^* such that the rejection region just includes \bar{x}_n, i.e.,
|\bar{x}_n - \mu_0| = \frac{z_{1-\alpha^*/2}\,\sigma}{\sqrt{n}},
which means
\Phi\left( \frac{\sqrt{n}|\bar{x}_n - \mu_0|}{\sigma} \right) = 1 - \frac{\alpha^*}{2} \iff \alpha^* = 2\left( 1 - \Phi\left( \frac{\sqrt{n}|\bar{x}_n - \mu_0|}{\sigma} \right) \right).
This gives an example of how to compute the p-value (i.e., the value of \alpha^* by definition) for an observed value of the statistic T(x) = \bar{x}_n. But what does it mean? It becomes clearer if we write \alpha^* in another form:
p\text{-value} = \alpha^*
= P_{\mu_0}\left( \left| \frac{\sqrt{n}(\bar{X}_n - \mu_0)}{\sigma} \right| \geq \left| \frac{\sqrt{n}(\bar{x}_n - \mu_0)}{\sigma} \right| \right)
= P_{\mu_0}\left( |\bar{X}_n - \mu_0| \geq |\bar{x}_n - \mu_0| \right).
In other words, p-value equals the probability under H0 of observing a value of the test
statistic the same as or more extreme than what was actually observed.
What is the point of computing p-value? If p-value is small, say smaller than 0.05, we
say the result is significant: the evidence is strong against H0 and we should reject the
null hypothesis.
p-value evidence
< 0.01 very strong evidence against H0
[0.01, 0.05] strong evidence against H0
[0.05, 0.1] weak evidence against H0
> 0.1 little or no evidence against H0
More generally, consider a test of
H_0 : \theta \in \Theta_0 \quad \text{v.s.} \quad H_1 : \theta \notin \Theta_0
which rejects the null when T(x) \geq c_\alpha, where c_\alpha depends on \alpha and c_\alpha increases as \alpha decreases, since the rejection region shrinks as the size/level \alpha decreases. For the observed data x, we find the smallest \alpha^* such that
T(x) = c_{\alpha^*};
this \alpha^* is the p-value.
For the Wald test, let w = |W(x)| denote the observed absolute value of the Wald statistic W(X). Then the p-value is given by
p\text{-value} = P_{\theta_0}(|W(X)| \geq |W(x)|) \approx P(|Z| \geq |w|) = 2\Phi(-|w|),
where Z \sim N(0, 1). In other words, |w| = z_{1-\alpha^*/2} and \alpha^* is the p-value.
For example, in the coin tossing example,
W(x) = \frac{\hat{\theta}(x) - \theta_0}{\widehat{se}(x)} = \frac{0.56 - 0.5}{\sqrt{\frac{0.56(1-0.56)}{1000}}} = 3.8224,
and the p-value is equal to P(|Z| \geq 3.8224), which is approximately 0.0001 < 0.01. In other words, the observed data are strongly against the null hypothesis H_0 : \theta = 1/2 and we should reject the null.
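A quick check of these numbers in R (a short sketch added here, not part of the original notes):

theta_hat <- 560 / 1000   # observed proportion of heads
theta0    <- 0.5
se_hat    <- sqrt(theta_hat * (1 - theta_hat) / 1000)

w  <- (theta_hat - theta0) / se_hat   # Wald statistic, about 3.82
pv <- 2 * pnorm(-abs(w))              # two-sided p-value, about 1e-4
c(w, pv)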
5.3 Likelihood ratio test
We have spent a lot of time discussing the basics of hypothesis testing. You may have
realized that the key components in hypothesis testing are: (a) find a proper testing
statistic; (b) identify the rejection region with a given size/level α. In all the examples
we have covered so far, the construction of the rejection region and testing statistics rely
on our intuition. In this lecture, we will introduce a systematic way to tackle the two
aforementioned problems, which is based on the likelihood function.
Now let’s consider the following testing problem:
H_0 : \theta \in \Theta_0 \quad \text{versus} \quad H_1 : \theta \notin \Theta_0.
Suppose we observe samples X_1, \cdots, X_n from f(x; \theta). We define the likelihood ratio statistic as
\lambda(X) = 2\log\frac{\sup_{\theta\in\Theta} L(\theta)}{\sup_{\theta\in\Theta_0} L(\theta)} = 2(\ell(\hat{\theta}) - \ell(\hat{\theta}_0)),
where \hat{\theta} is the MLE, \hat{\theta}_0 is the MLE when \theta is restricted to \Theta_0, and \ell(\theta) = \log L(\theta) is the log-likelihood function.
What properties does λ(X) satisfy?
1. Since \Theta (the natural parameter space) contains \Theta_0, we have \sup_{\theta\in\Theta} L(\theta) \geq \sup_{\theta\in\Theta_0} L(\theta), implying that \lambda(X) \geq 0.
2. Suppose H0 is true, then the MLE is likely to fall into Θ0 if n is sufficiently large
because of the consistency of MLE.
Therefore, we can use λ(X) as a testing statistic: if λ(x1 , · · · , xn ) is close to 0, we should
retain H0 ; if λ(x1 , · · · , xn ) is large, we reject H0 .
Example: Testing of normal mean with known variance. Suppose X1 , · · · , Xn ∼ N (µ, σ 2 )
and we want to test
H0 : µ = µ0 v.s. H1 : µ 6= µ0 .
Now we consider the likelihood ratio statistic. Note that the log-likelihood function is
\ell(\mu, \sigma^2) = -\frac{1}{2\sigma^2}\sum_{i=1}^n (X_i - \mu)^2 + C,
where C is a scalar which does not depend on \mu. There is no need to take \sigma^2 into consideration since it is assumed known.
Next, we compute the MLE of \ell(\mu, \sigma^2) on \Theta_0 and \Theta:
\Theta_0 := \{\mu : \mu = \mu_0\}, \quad \Theta = \{\mu : \mu \in \mathbb{R}\}.
Note that
\sum_{i=1}^n (X_i - \mu_0)^2 = \sum_{i=1}^n (X_i - \bar{X}_n)^2 + n(\bar{X}_n - \mu_0)^2.
Thus
\lambda(X) = \frac{1}{\sigma^2}\left( \sum_{i=1}^n (X_i - \mu_0)^2 - \sum_{i=1}^n (X_i - \bar{X}_n)^2 \right) = \frac{n(\bar{X}_n - \mu_0)^2}{\sigma^2}.
Since we reject the null hypothesis if λ(X) is large, it is equivalent to rejecting the null
if |X n − µ0 | is large which matches our previous discussion. Now, how to evaluate this
test? What is the power function?
5.3.1 Asymptotics of LRT
Theorem 5.3.1 (Asymptotic behavior). Consider the simple test H_0 : \theta = \theta_0 versus H_1 : \theta \neq \theta_0. Suppose X_1, \cdots, X_n are i.i.d. f(x; \theta) and \hat{\theta} is the MLE of \theta. Under H_0, as n \to \infty,
\lambda(X) \xrightarrow{d} \chi^2_1.
Proof: Let \hat{\theta}_n be the MLE and \theta_0 the parameter under the null hypothesis. A Taylor expansion gives
\ell(\theta_0) - \ell(\hat{\theta}_n) = \ell'(\hat{\theta}_n)(\theta_0 - \hat{\theta}_n) + \frac{\ell''(\hat{\theta}_n)}{2}(\hat{\theta}_n - \theta_0)^2 + \text{remainder}.
Since \hat{\theta}_n is the MLE, \ell'(\hat{\theta}_n) = 0, so
\lambda(X) = 2(\ell(\hat{\theta}_n) - \ell(\theta_0)) \approx -\ell''(\hat{\theta}_n)(\hat{\theta}_n - \theta_0)^2.
If the null is true, \ell''(\hat{\theta}_n) \approx -nI(\theta_0), where I(\theta_0) is the Fisher information. By the asymptotic normality of the MLE, \sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} N(0, 1/I(\theta_0)). Therefore, by Slutsky's theorem, \lambda(X) \approx nI(\theta_0)(\hat{\theta}_n - \theta_0)^2 \xrightarrow{d} \chi^2_1.
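The following R sketch (added here, not from the original notes) illustrates the theorem by simulating \lambda(X) for the normal-mean example above and comparing its upper quantile with that of \chi^2_1; the values of n, mu0, and sigma are arbitrary.

set.seed(1)
n <- 50; mu0 <- 0; sigma <- 1
lambda <- replicate(5000, {
  x <- rnorm(n, mean = mu0, sd = sigma)   # data generated under H0
  n * (mean(x) - mu0)^2 / sigma^2         # lambda(X) for the normal-mean test
})
c(quantile(lambda, 0.95), qchisq(0.95, df = 1))   # the two 95% quantiles should be close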
The p-value of the LRT is
p\text{-value} = P(\chi^2_1 \geq \lambda(x_1, \cdots, x_n)),
where \lambda(x_1, \cdots, x_n) is the observed value of the likelihood ratio statistic. Or we can follow another definition: the p-value is the smallest size/level at which we reject the null hypothesis. Thus the p-value equals the \alpha^* which satisfies
\lambda(x_1, \cdots, x_n) = \chi^2_{1, 1-\alpha^*}.
If this p-value is small, we reject the null hypothesis since the evidence is strong against it.
5.3.2 General LRT and asymptotics
Theorem 5.3.2. Under certain regularity conditions on the pdf/pmf of the sample (X_1, \cdots, X_n), it holds that under H_0 : \theta \in \Theta_0,
\lambda(X) \xrightarrow{d} \chi^2_{r-q}
as n \to \infty, where
• r is the number of free parameters specified by \theta \in \Theta;
• q is the number of free parameters specified by \theta \in \Theta_0.
Example: Suppose \theta = (\theta_1, \cdots, \theta_q, \theta_{q+1}, \cdots, \theta_r) \in \Theta \subseteq \mathbb{R}^r. Let
\Theta_0 = \{\theta \in \mathbb{R}^r : (\theta_{q+1}, \cdots, \theta_r) = (\theta_{q+1,0}, \cdots, \theta_{r,0})\},
which means the degree of freedom of \Theta_0 is q. Then under H_0, it holds that \lambda(X) \xrightarrow{d} \chi^2_{r-q}.
Question: How do we calculate the degrees of freedom? Suppose the parameter space \Theta contains an open subset of \mathbb{R}^r and \Theta_0 contains an open subset of \mathbb{R}^q. Then r - q is the degree of freedom of the test statistic.
We reject H_0 at size \alpha if
\lambda(X) > \chi^2_{r-q, 1-\alpha},
where \chi^2_{r-q,1-\alpha} is the 1-\alpha quantile of the \chi^2_{r-q} distribution.
The p-value is calculated via
p\text{-value} = P_{\theta_0}(\chi^2_{r-q} \geq \lambda(x)),
where \lambda(x) is the observed value of the statistic.
Example: Mendel’s pea. Mendel bred peas with round yellow and wrinkled green
seeds. The progeny has four possible outcomes:
{Yellow, Green} × {Wrinkled, Round}
His theory of inheritance implies that P(Yellow) = 3/4, P(Green) = 1/4, P(Round) = 3/4, and P(Wrinkled) = 1/4, which gives
p_0 = \left( \frac{9}{16}, \frac{3}{16}, \frac{3}{16}, \frac{1}{16} \right).
What he observed in the trials is X = (315, 101, 108, 32) with n = 556.
We want to test if Mendel's conjecture is valid:
H_0 : p = p_0, \quad H_1 : p \neq p_0.
It means:
\Theta = \{p : p \geq 0, \sum_{i=1}^4 p_i = 1\}, \quad \Theta_0 := \{p : p = p_0\}.
The number of each type follows a multinomial distribution. The multinomial distribution has pmf
f(x_1, \cdots, x_k \mid p) = \binom{n}{x_1, \cdots, x_k}\, p_1^{x_1}\cdots p_k^{x_k}, \quad x_i \in \mathbb{Z}_+, \; \sum_{i=1}^k x_i = n, \; \sum_{i=1}^k p_i = 1.
• Suppose there are k different types of coupons. The probability of getting coupon
i is pi . Collect n coupons and ask what is the distribution for the count of each
coupon Xi ?
• It satisfies the multinomial distribution with parameter n and p, denoted by M(n, p).
• It is a generalization of binomial distribution.
What is the MLE of p? We have shown in the homework that
\hat{p}_i = \frac{x_i}{n}.
Therefore,
\hat{p} = (0.5665, 0.1817, 0.1942, 0.0576).
Consider the LRT:
\lambda(X) = 2\log\frac{\sup_{p : p\geq 0, \sum_i p_i = 1} L(p)}{\sup_{p : p = p_0} L(p)}
= 2\log\frac{\prod_{i=1}^k \hat{p}_i^{x_i}}{\prod_{i=1}^k p_{0i}^{x_i}}
= 2\sum_{i=1}^k x_i \log\frac{\hat{p}_i}{p_{0i}}.
If the null is true, then it holds that \lambda(X) \xrightarrow{d} \chi^2_{(4-1)-0} = \chi^2_3. In this case, \lambda(X) = 0.4754 and the p-value is
P_{p_0}(\chi^2_3 \geq 0.4754) = 0.9243,
so the data provide essentially no evidence against Mendel's theory and we retain the null hypothesis.
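A short R sketch (added here, not in the original notes) reproducing this computation:

x     <- c(315, 101, 108, 32)   # observed counts
p0    <- c(9, 3, 3, 1) / 16     # probabilities under Mendel's theory
n     <- sum(x)
p_hat <- x / n                  # MLE under the full model

lambda <- 2 * sum(x * log(p_hat / p0))               # likelihood ratio statistic
p_val  <- pchisq(lambda, df = 3, lower.tail = FALSE) # p-value
c(lambda, p_val)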
In many applications, we are interested in whether the data follow a specific distribution, i.e.,
H_0 : F_X = F_0 \quad \text{versus} \quad H_1 : F_X \neq F_0,
where F_0 is the cdf of a specific distribution. This problem gives rise to an important topic in hypothesis testing: the goodness-of-fit test. In this lecture, we will discuss a few approaches to test the goodness-of-fit.
One natural approach is to discretize the data. Choose points a_1 < \cdots < a_k and let a_0 and a_{k+1} be -\infty and \infty respectively. Then we count the number of samples falling into each interval,
n_i = |\{X_j : X_j \in (a_i, a_{i+1}],\; 1 \leq j \leq n\}|, \quad n = \sum_i n_i,
and test
H_0 : F_X = F_0 \quad \text{versus} \quad H_1 : F_X \neq F_0.
One natural way is to first compute p_{i0} = F_0(a_{i+1}) - F_0(a_i) and then test whether the cell probabilities of the counts equal (p_{10}, \cdots, p_{k0}), using the multinomial LRT as above. Therefore, we can reject the null hypothesis at level \alpha if the observed value \lambda(x) of the LRT is too large:
\lambda(x) > \chi^2_{k-1, 1-\alpha}.
Let O_i = n\hat{p}_i be the observed number of samples in the ith category and E_i = np_i the expected number of samples in the ith category. The Pearson \chi^2-test uses the following statistic:
\chi^2 = \sum_{i=1}^k \frac{(O_i - E_i)^2}{E_i} = n\sum_{i=1}^k \frac{(\hat{p}_i - p_i)^2}{p_i}.
Now the question is: what is the distribution of χ2 if the null is true?
Theorem 5.4.1. Under the null hypothesis, it holds that
\chi^2 = n\sum_{i=1}^k \frac{(\hat{p}_i - p_{i0})^2}{p_{i0}} \xrightarrow{d} \chi^2_{k-1}.
Sketch of the argument: one can show that \sqrt{n}(\hat{p} - p) is asymptotically normal with covariance matrix
\Sigma = \mathrm{diag}(p) - pp^\top.
Note that this covariance matrix is not strictly positive definite since one eigenvalue is zero. The limit is a degenerate multivariate normal distribution, i.e., one coordinate of the random vector can be represented as a linear combination of the other coordinates.
Now let's study why
\chi^2 = n\sum_{i=1}^k \frac{(\hat{p}_i - p_i)^2}{p_i} \sim \chi^2_{k-1}
holds. First of all, the \chi^2-statistic can be written as the squared norm of a vector:
\chi^2 = \|\sqrt{n}\,\mathrm{diag}(p)^{-1/2}(\hat{p} - p)\|^2,
where diag(p) is a diagonal matrix whose diagonal entries are given by p.
What is the distribution of \sqrt{n}[\mathrm{diag}(p)]^{-1/2}(\hat{p} - p)? Since \sqrt{n}(\hat{p} - p) is asymptotically normal, so is \sqrt{n}[\mathrm{diag}(p)]^{-1/2}(\hat{p} - p). Its covariance matrix is given by
\mathrm{Cov}\left( \sqrt{n}[\mathrm{diag}(p)]^{-1/2}(\hat{p} - p) \right) = [\mathrm{diag}(p)]^{-1/2}\Sigma[\mathrm{diag}(p)]^{-1/2} = I_k - \sqrt{p}\sqrt{p}^\top.
Thus
\sqrt{n}[\mathrm{diag}(p)]^{-1/2}(\hat{p} - p) \xrightarrow{d} N(0, I_k - \sqrt{p}\sqrt{p}^\top).
Now we can see that I_k - \sqrt{p}\sqrt{p}^\top is actually a projection matrix. Since it is a projection matrix of rank k-1, we can find a matrix U (the eigenvectors of I_k - \sqrt{p}\sqrt{p}^\top w.r.t. eigenvalue 1) of size k \times (k-1) such that I_k - \sqrt{p}\sqrt{p}^\top = UU^\top and U^\top U = I_{k-1}. The limit W \sim N(0, UU^\top) can therefore be written as W = U\xi with \xi \sim N(0, I_{k-1}), so \chi^2 \approx \|W\|^2 = \|\xi\|^2 \sim \chi^2_{k-1}.
Exercise: Show that I_k - \sqrt{p}\sqrt{p}^\top is an orthogonal projection matrix and compute its eigenvalues (with multiplicities). Remember that \sum_{i=1}^k p_i = 1.
Independence test. Suppose each sample is classified according to two categorical variables, X \in \{1, \cdots, r\} and Y \in \{1, \cdots, c\}. Let \theta_{ij} = P(X = i, Y = j), let p_i = P(X = i) and q_j = P(Y = j) be the marginal probabilities, and let n_{ij} be the number of samples in cell (i, j). To test independence, we are interested in the following hypothesis:
H_0 : \theta_{ij} = p_i q_j \;\; \forall i, j \quad \text{versus} \quad H_1 : \theta_{ij} \neq p_i q_j \text{ for some } i, j.
The null hypothesis H_0 states that X and Y are independent.
How do we construct a test statistic? We can first use the LRT. Note that \{n_{ij}\}_{1\leq i\leq r, 1\leq j\leq c} follows the multinomial distribution \mathcal{M}(n, \{\theta_{ij}\}_{1\leq i\leq r, 1\leq j\leq c}), whose pmf is
f(n_{11}, n_{12}, \cdots, n_{rc}; n, \theta) = \binom{n}{n_{11}, n_{12}, \cdots, n_{rc}} \prod_{i=1}^r\prod_{j=1}^c \theta_{ij}^{n_{ij}},
where n = \sum_{i,j} n_{ij}.
Exercise: Show that the MLE for p and q under H_0 : \theta_{ij} = p_i q_j is given by
\hat{p}_i = \frac{1}{n}\sum_{j=1}^c n_{ij}, \quad \hat{q}_j = \frac{1}{n}\sum_{i=1}^r n_{ij}.
Exercise: Show that the likelihood ratio statistic for this hypothesis testing problem is
\lambda(X) = 2\sum_{i=1}^r\sum_{j=1}^c n_{ij}\log\frac{\hat{\theta}_{ij}}{\hat{p}_i\hat{q}_j} = 2n\sum_{i=1}^r\sum_{j=1}^c \hat{\theta}_{ij}\log\frac{\hat{\theta}_{ij}}{\hat{p}_i\hat{q}_j},
where \hat{\theta}_{ij} = n_{ij}/n is the MLE of \theta_{ij} under the full model.
If \hat{\theta}_{ij} deviates from \hat{p}_i\hat{q}_j by a large margin, we are likely to reject the null hypothesis. Therefore, we introduce the following \chi^2-statistic:
\chi^2 = n\sum_{i=1}^r\sum_{j=1}^c \frac{(\hat{\theta}_{ij} - \hat{p}_i\hat{q}_j)^2}{\hat{p}_i\hat{q}_j}.
Exercise: Show that the \chi^2 statistic is approximately equal to the likelihood ratio statistic (Hint: use a Taylor approximation).
How to construct a rejection region? To do that, we need to know the asymptotic
distribution of χ2 under the null.
Theorem 5.4.2. Under the null hypothesis,
\chi^2 \xrightarrow{d} \chi^2_{(r-1)(c-1)}.
Example: (DeGroot and Schervish, Ex. 10.3.5) Suppose that 300 persons are selected at random from a large population, and each person in the sample is classified according to blood type (O, A, B, or AB) and also according to Rh factor (Rhesus, a type of protein on the surface of red blood cells; positive is the most common type), positive or negative. The observed numbers are given in Table 10.18 of that book. Test the hypothesis that the two classifications of blood types are independent.
Here we have n = 300, and we compute \hat{p}_i, \hat{q}_j, and \hat{\theta}_{ij} from the observed table. The result is \chi^2 = 8.6037. Under the null hypothesis, \chi^2 is approximately \chi^2_3, since (r-1)(c-1) = 3 with r = 4 and c = 2.
The p-value is
P(χ23 > 8.6037) = 0.0351
We reject the null hypothesis that Rh factor and ABO system are independent at the
level 0.05.
Exercise: Perform the independence test using the LRT and see whether you obtain a similar result.
Another important goodness-of-fit problem is to test whether the data come from a specified distribution:
H_0 : F_X = F_0, \quad H_1 : F_X \neq F_0.
Can we design a testing procedure purely based on the empirical cdf? Recall that the empirical cdf equals
F_n(x) = \frac{1}{n}\sum_{i=1}^n 1\{X_i \leq x\}.
Under the null hypothesis, we know that F_n(x) converges to F_0(x) for any fixed x as n \to \infty, even uniformly, as implied by the Glivenko-Cantelli theorem.
Therefore, if F_n(x) is far away from F_0(x), we are more likely to reject the null hypothesis. One intuitive way to measure the difference between F_n and F_0 is the supremum of |F_n(x) - F_0(x)|, i.e., the Kolmogorov-Smirnov statistic
T_{KS} = \sup_x |F_n(x) - F_0(x)|.
If T_{KS} is large, we are likely to reject the null hypothesis. The tricky part is: what is the distribution of T_{KS} under the null?
Figure 5.2: Left: empirical cdf v.s. population cdf; Right: cdf of Kolmogorov distribution
Under the null hypothesis, K = \sqrt{n}\,T_{KS} satisfies the Kolmogorov distribution,
P(K \leq x) = 1 - 2\sum_{k=1}^{\infty}(-1)^{k-1} e^{-2k^2 x^2}.
The KS test can also be used to compare two samples. Suppose X_1, \cdots, X_n \sim F_X and Y_1, \cdots, Y_m \sim F_Y, with empirical cdfs \hat{F}_{X,n} and \hat{F}_{Y,m}, and let T_{KS} = \sup_x |\hat{F}_{X,n}(x) - \hat{F}_{Y,m}(x)|. We test
H_0 : F_X = F_Y, \quad H_1 : F_X \neq F_Y.
We reject the null hypothesis if
\sqrt{\frac{mn}{m+n}}\, T_{KS} > K_{1-\alpha},
where \sqrt{\frac{mn}{m+n}}\, T_{KS} satisfies the Kolmogorov distribution under H_0 and K_{1-\alpha} is its 1-\alpha quantile.
Exercise: Under the null hypothesis, show that
\frac{mn}{m+n}\,\mathrm{Var}\left( \hat{F}_{X,n}(x) - \hat{F}_{Y,m}(x) \right) = F_X(x)(1 - F_X(x)).
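In R, both versions of the KS test are available through ks.test(); a short sketch on simulated data (added here, not from the original notes):

set.seed(1)
x <- rnorm(100)              # sample to be tested
y <- rnorm(120, mean = 0.3)  # second sample

# one-sample KS test: H0: F_X = N(0, 1)
ks.test(x, "pnorm", 0, 1)

# two-sample KS test: H0: F_X = F_Y
ks.test(x, y)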
Chapter 6
Linear regression
Suppose we observe a pair of random variables (X, Y) and want to predict Y from X using a function g, chosen so that the average loss, i.e., the risk function
R(g) = E(Y - g(X))^2,
is as small as possible. Can we find the global minimizer of this risk function? If so, what is it? Perhaps surprisingly, the global minimizer is the conditional expectation of Y given X, i.e., the optimal choice of g is
g(x) = E(Y \mid X = x) = \int_{\mathbb{R}} y\, f_{Y|X}(y|x)\, dy,
where f_{Y|X}(y|x) denotes the conditional pdf of Y given X,
f_{Y|X}(y|x) = \frac{f_{X,Y}(x, y)}{f_X(x)}.
Here E(Y|X) is called the regression of Y on X, the "best" predictor of Y conditioned on X, which is also a function of X. We provide a proof to justify why the conditional expectation is the best predictor of Y under the quadratic loss function.
Theorem 6.1.1. Suppose the random vector (X, Y) has finite second moments. Then for any function g,
E(Y - g(X))^2 \geq E(Y - E(Y|X))^2.
Proof: Write
(Y - g(X))^2 - (Y - E(Y|X))^2 = (2Y - g(X) - E(Y|X))(E(Y|X) - g(X)).
Note that given X, E(Y|X) and g(X) are both known. Thus
E_Y\left[ (2Y - g(X) - E(Y|X))(E(Y|X) - g(X)) \mid X \right] = (E(Y|X) - g(X))\,E_Y\left[ 2Y - g(X) - E(Y|X) \mid X \right] = (E(Y|X) - g(X))^2 \geq 0.
Taking expectation over X on both sides finishes the proof.
It seems that we have already found the solution to the aforementioned prediction problem: the answer is the conditional expectation. However, the reality is not so easy. What is the main issue here? Apparently, in practice, we do not know the actual population distribution f_{X,Y} directly. Instead, we only have access to a set of n data points (X_i, Y_i), which can be viewed as realizations from a bivariate distribution. Therefore, we cannot write down the loss function explicitly since it involves an expectation taken w.r.t. the joint distribution f_{X,Y}.
6.1.2 Empirical risk minimization
How do we resolve this problem? Recall that in the bootstrap, we replace the population cdf by the empirical cdf. It seems that we can follow a similar idea here: replace the risk by the empirical risk
\ell(g) = \frac{1}{n}\sum_{i=1}^n (Y_i - g(X_i))^2.
It suffices to minimize this loss function over a proper family of functions g. The value of this loss function is called the training error and \ell(g) is called the empirical loss (risk) function. However, this cannot work directly. Why? In fact, there are infinitely many ways to minimize this loss function. Let's look at a simple case: suppose the (X_i, Y_i) are mutually distinct; then we can always find a curve (a function g) going through all the observed data. Such an interpolating g has zero training error, yet its risk E(Y - g(X))^2 on new data can be large.
Therefore, in practice, we usually restrict g to a class of functions. For example, the function class \mathcal{G} could be
• polynomials: \mathcal{G} = \{a_n x^n + \cdots + a_1 x + a_0 \mid a_i \in \mathbb{R}\};
• logistic functions:
\mathcal{G} = \left\{ \frac{1}{\exp(\theta x) + 1} \;\Big|\; \theta \in \mathbb{R} \right\};
• linear functions:
\mathcal{G} = \{g(x) = \alpha x + \beta : \alpha, \beta \in \mathbb{R}\}.
If \mathcal{G} is the class of linear functions, we say the regression is linear, which is the focus of this chapter: linear regression. You may ask what the point of studying linear regression is, since it seems too naive. In fact, linear regression is extremely useful in statistics. Statisticians have developed a comprehensive theory for linear models. In practice, linear models already provide a satisfactory explanation for many datasets. Moreover, for data which are close to a multivariate normal distribution, using a linear model is sufficient.
Exercise: For (X, Y) \sim N(\mu_X, \mu_Y, \sigma_X^2, \sigma_Y^2, \rho), show that E(Y|X) is a linear function of X.
In simple linear regression we model the mean response as a linear function of the predictor,
E(Y_i \mid X_i = x_i) = \beta_0 + \beta_1 x_i, \quad \text{i.e.,} \quad Y_i = \beta_0 + \beta_1 X_i + \epsilon_i,
and we minimize the empirical risk over
\mathcal{G} = \{g(x) : g(x) = \beta_0 + \beta_1 x, \; \beta_0, \beta_1 \in \mathbb{R}\}.
This approach is the well-known linear least squares method. The minimizers are called the linear least squares estimators.
Definition 6.2.1 (The least squares estimates). The least squares estimates are the values (\hat{\beta}_0, \hat{\beta}_1) such that the residual sum of squares (RSS) is minimized, i.e.,
(\hat{\beta}_0, \hat{\beta}_1) = \mathrm{argmin}_{\beta_0, \beta_1} \sum_{i=1}^n (Y_i - (\beta_0 + \beta_1 X_i))^2.
Exercise: The empirical risk function R(\beta_0, \beta_1) = \sum_{i=1}^n (Y_i - (\beta_0 + \beta_1 X_i))^2 is a convex function of \beta_0 and \beta_1.
Lemma 6.2.1. The least squares estimator is given by
\hat{\beta}_0 = \bar{Y}_n - \hat{\beta}_1\bar{X}_n, \quad \hat{\beta}_1 = \frac{\sum_i X_i Y_i - n\bar{X}_n\bar{Y}_n}{\sum_i X_i^2 - n\bar{X}_n^2} = \frac{\sum_i (X_i - \bar{X}_n)(Y_i - \bar{Y}_n)}{\sum_i (X_i - \bar{X}_n)^2}.
Proof: Define R(\beta_0, \beta_1) = \sum_{i=1}^n (Y_i - (\beta_0 + \beta_1 X_i))^2. Setting the partial derivatives to zero,
\frac{\partial R}{\partial \beta_0} = 2\sum_{i=1}^n (\beta_0 + \beta_1 X_i - Y_i) = 0, \quad \frac{\partial R}{\partial \beta_1} = 2\sum_{i=1}^n X_i(\beta_0 + \beta_1 X_i - Y_i) = 0.
Solving for (\beta_0, \beta_1) gives the least squares estimator. Substituting \beta_0 = \bar{Y}_n - \beta_1\bar{X}_n into the second equation,
n(\bar{Y}_n - \beta_1\bar{X}_n)\bar{X}_n + \beta_1\sum_{i=1}^n X_i^2 = \sum_{i=1}^n X_i Y_i,
which gives
\left( \sum_{i=1}^n X_i^2 - n\bar{X}_n^2 \right)\beta_1 = \sum_{i=1}^n X_i Y_i - n\bar{X}_n\bar{Y}_n \iff \hat{\beta}_1 = \frac{\sum_i (X_i - \bar{X}_n)(Y_i - \bar{Y}_n)}{\sum_i (X_i - \bar{X}_n)^2}.
Remarks:
• \hat{\beta}_0 is determined by the sample means of the X_i and Y_i (together with \hat{\beta}_1).
• \hat{\beta}_1 is the ratio of the sample covariance of (X_i, Y_i) to the sample variance of X_i.
The residuals (prediction errors) are
\hat{\epsilon}_i = Y_i - \hat{Y}_i = Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_i).
In other words, the least squares approach aims to minimize the sum of squares of the prediction errors.
Example: Predicting grape crops. Grape vines produce clusters of berries, and a count of these clusters can be used to predict the final crop yield at harvest time.
• Predictor: cluster count, \{X_i\}_{i=1}^n
• Response: yield, \{Y_i\}_{i=1}^n
The LS estimator of (\beta_0, \beta_1) can be computed from the data in Table 6.1; see the sketch after the table.
Year Cluster count (X) Yields (Y )
1971 116.37 5.6
1973 82.77 3.2
1974 110.68 4.5
1975 97.5 4.2
1976 115.88 5.2
1977 80.19 2.7
1978 125.24 4.8
1979 116.15 4.9
1980 117.36 4.7
1981 93.31 4.1
1982 107.46 4.4
1983 122.3 5.4
Table 6.1: Data source: [Casella, Berger, 2001]. The data for 1972 are missing due to a hurricane.
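A short R sketch (added here; the numbers quoted in the comments are computed from Table 6.1, not taken from the original notes) that evaluates the closed-form formulas of Lemma 6.2.1 and checks them against lm():

X <- c(116.37, 82.77, 110.68, 97.5, 115.88, 80.19,
       125.24, 116.15, 117.36, 93.31, 107.46, 122.3)   # cluster counts
Y <- c(5.6, 3.2, 4.5, 4.2, 5.2, 2.7,
       4.8, 4.9, 4.7, 4.1, 4.4, 5.4)                    # yields

beta1 <- sum((X - mean(X)) * (Y - mean(Y))) / sum((X - mean(X))^2)
beta0 <- mean(Y) - beta1 * mean(X)
c(beta0, beta1)      # roughly -1.03 and 0.051

coef(lm(Y ~ X))      # should agree with the closed-form estimates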
(Figure: scatterplot of yield versus cluster count, together with the fitted least squares line and confidence bounds.)
Let us now state the model in general form. Consider Y_i, the value of the response variable in the ith case, and X_i, the value of the predictor variable. We assume that
Y_i = \beta_0 + \beta_1 X_i + \epsilon_i, \quad 1 \leq i \leq n,
where the errors satisfy
E(\epsilon_i) = 0, \quad \mathrm{Var}(\epsilon_i) = \sigma^2, \quad \mathrm{Cov}(\epsilon_i, \epsilon_j) = 0, \; 1 \leq i \neq j \leq n.
The following parameters are unknown and to be estimated:
• \beta_0: intercept
• \beta_1: slope
• \sigma^2: error variance
An estimator of \beta is essentially a function of the responses Y_1, \cdots, Y_n. Now let's focus on a small set of estimators: the linear estimators of (\beta_0, \beta_1), i.e., estimators of the form
\left\{ \sum_{i=1}^n \alpha_i Y_i : \alpha_i \in \mathbb{R}, \; 1 \leq i \leq n \right\}.
In other words, we are looking at estimators given by linear combinations of the Y_i, where the coefficients \alpha_i are to be determined. In particular, we are interested in finding an unbiased linear estimator of (\beta_0, \beta_1) with the smallest variance. What is it?
We take \beta_1 as an example; the same argument applies to \beta_0 accordingly. In order to ensure unbiasedness of the estimator, we need
E\left( \sum_{i=1}^n \alpha_i Y_i \right) = \sum_{i=1}^n \alpha_i E Y_i = \sum_{i=1}^n \alpha_i(\beta_0 + \beta_1 X_i) = \beta_0\sum_{i=1}^n \alpha_i + \beta_1\sum_{i=1}^n \alpha_i X_i = \beta_1.
This gives
\sum_{i=1}^n \alpha_i = 0, \quad \sum_{i=1}^n \alpha_i X_i = 1.
What is the variance of \sum_{i=1}^n \alpha_i Y_i? Note that the Y_i are uncorrelated, so
\mathrm{Var}\left( \sum_{i=1}^n \alpha_i Y_i \right) = \sum_{i=1}^n \alpha_i^2\mathrm{Var}(Y_i) + \sum_{i<j}\alpha_i\alpha_j\mathrm{Cov}(Y_i, Y_j) = \sigma^2\sum_{i=1}^n \alpha_i^2.
Now, to find the best linear unbiased estimator (BLUE), it suffices to solve
\min_{\alpha_i}\; \frac{1}{2}\sum_{i=1}^n \alpha_i^2 \quad \text{s.t.} \quad \sum_{i=1}^n \alpha_i = 0, \; \sum_{i=1}^n \alpha_i X_i = 1.
We resort to the method of Lagrange multipliers to find the optimal \alpha_i. Let \lambda and \mu be the multipliers, and consider
L(\alpha, \lambda, \mu) = \frac{1}{2}\sum_{i=1}^n \alpha_i^2 - \lambda\sum_{i=1}^n \alpha_i - \mu\left( \sum_{i=1}^n \alpha_i X_i - 1 \right).
Setting \partial L/\partial\alpha_i = \alpha_i - \lambda - \mu X_i = 0 gives \alpha_i = \lambda + \mu X_i. Plugging this into the two constraints yields
\lambda + \bar{X}_n\mu = 0, \quad n\bar{X}_n\lambda + \sum_{i=1}^n X_i^2\,\mu = 1,
whose solution is
\mu = \frac{1}{\sum_{i=1}^n X_i^2 - n\bar{X}_n^2}, \quad \lambda = -\frac{\bar{X}_n}{\sum_{i=1}^n X_i^2 - n\bar{X}_n^2}.
Therefore \alpha_i = (X_i - \bar{X}_n)/\left( \sum_i X_i^2 - n\bar{X}_n^2 \right), and \sum_i \alpha_i Y_i coincides with \hat{\beta}_1: the least squares estimator is the BLUE.
It is convenient to rewrite the model in matrix form:
Y = X\beta + \epsilon,
where Y = (Y_1, \cdots, Y_n)^\top, \beta = (\beta_0, \beta_1)^\top, \epsilon = (\epsilon_1, \cdots, \epsilon_n)^\top, and X is the n\times 2 design matrix whose ith row is (1, X_i).
Recall that the least squares estimators of \beta_0 and \beta_1 satisfy the normal equations (6.2.1),
\beta_0 + \beta_1\bar{X}_n = \bar{Y}_n, \quad \beta_0\, n\bar{X}_n + \beta_1\sum_{i=1}^n X_i^2 = \sum_{i=1}^n X_i Y_i.
Therefore, we have
\hat{\beta} = (X^\top X)^{-1}X^\top Y = \frac{1}{n\sum_{i=1}^n X_i^2 - (\sum_i X_i)^2}
\begin{bmatrix} \sum_{i=1}^n X_i^2 & -\sum_{i=1}^n X_i \\ -\sum_{i=1}^n X_i & n \end{bmatrix}
\begin{bmatrix} \sum_{i=1}^n Y_i \\ \sum_{i=1}^n X_i Y_i \end{bmatrix}.
Lemma 6.2.2. Under the assumptions of simple linear regression (the \epsilon_i are uncorrelated, zero-mean random variables with equal variance \sigma^2), the LS estimator \hat{\beta} has mean and covariance
E(\hat{\beta}) = \beta, \quad \mathrm{Cov}(\hat{\beta}) = \sigma^2(X^\top X)^{-1} = \frac{\sigma^2}{n\left( \sum_i X_i^2 - n\bar{X}_n^2 \right)}\begin{bmatrix} \sum_i X_i^2 & -n\bar{X}_n \\ -n\bar{X}_n & n \end{bmatrix}.
Proof: Note that
\hat{\beta} = (X^\top X)^{-1}X^\top Y = (X^\top X)^{-1}X^\top(X\beta + \epsilon) = \beta + (X^\top X)^{-1}X^\top\epsilon,
so E(\hat{\beta}) = \beta and \mathrm{Cov}(\hat{\beta}) = (X^\top X)^{-1}X^\top\mathrm{Cov}(\epsilon)X(X^\top X)^{-1} = \sigma^2(X^\top X)^{-1}.
Example: We fit a simple linear model to a data set with response eruptions and predictor waiting.
(Figure: scatterplot of eruptions versus waiting.)
Two rows of the data frame look as follows:
3  3.333  74
4  2.283  62
> lmfit <- lm(eruptions~waiting, data = data)
> lmfit$coefficients
(Intercept) waiting
-1.87401599 0.07562795
We obtain \hat{\beta}_0 = -1.874 and \hat{\beta}_1 = 0.0756. Then we have the fitted linear model
Y = -1.874 + 0.0756\,X + \epsilon.
We now consider the simple linear regression model with normal error:
Y_i = \beta_0 + \beta_1 X_i + \epsilon_i, \quad 1 \leq i \leq n,
where the \epsilon_i \sim N(0, \sigma^2) are i.i.d. normal random variables with unknown \sigma^2. In other words, the Y_i \sim N(\beta_0 + \beta_1 X_i, \sigma^2) are independent random variables with equal variance. In this case the least squares estimator is the BLUE, i.e., the best linear unbiased estimator. The reason is simple: the noise terms \epsilon_i are i.i.d., and thus their covariance satisfies
E\,\epsilon_i\epsilon_j = \begin{cases} 0, & i \neq j, \\ \sigma^2, & i = j, \end{cases} \iff E\,\epsilon\epsilon^\top = \sigma^2 I_n.
6.3.1 MLE under normal error model
Since we observe a set of random variables Y_i with unknown parameters \beta_0, \beta_1, and \sigma^2, it is natural to use maximum likelihood estimation to estimate the parameters.
Lemma 6.3.1. The MLE of \beta = (\beta_0, \beta_1)^\top matches the least squares estimator. The MLE of \sigma^2 is
\hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^n \hat{\epsilon}_i^2 = \frac{1}{n}\sum_{i=1}^n (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2.
Proof: The log-likelihood function is
\ell(\beta_0, \beta_1, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i)^2.
To maximize it, we first maximize over \beta_0 and \beta_1; for any fixed \sigma^2 the maximizer equals the minimizer of the least squares risk function,
(\hat{\beta}_0, \hat{\beta}_1) = \mathrm{argmin}_{(\beta_0, \beta_1)}\sum_{i=1}^n (Y_i - (\beta_0 + \beta_1 X_i))^2.
How about the MLE of \sigma^2? Once \beta is fixed, we take the derivative w.r.t. \sigma^2:
\frac{\partial\ell}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2 = 0,
and the MLE of \sigma^2 is
\hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^n (Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_i))^2 = \frac{1}{n}\sum_{i=1}^n \hat{\epsilon}_i^2.
In matrix notation, \hat{\beta} = (X^\top X)^{-1}X^\top Y, where
X^\top X = \begin{bmatrix} n & \sum_{i=1}^n X_i \\ \sum_{i=1}^n X_i & \sum_{i=1}^n X_i^2 \end{bmatrix}, \quad X^\top Y = \begin{bmatrix} \sum_{i=1}^n Y_i \\ \sum_{i=1}^n X_i Y_i \end{bmatrix}.
Note that \epsilon \sim N(0, \sigma^2 I_n) is a normal random vector in \mathbb{R}^n. By linearity of the multivariate normal distribution, \hat{\beta} is also multivariate normal, with mean and covariance
E\hat{\beta} = \beta, \quad \mathrm{Cov}(\hat{\beta}) = E(\hat{\beta} - \beta)(\hat{\beta} - \beta)^\top = E\left[ (X^\top X)^{-1}X^\top\epsilon\epsilon^\top X(X^\top X)^{-1} \right] = \sigma^2(X^\top X)^{-1}.
Define the error sum of squares
SSE = \sum_{i=1}^n \hat{\epsilon}_i^2,
where \hat{\epsilon}_i = Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_i) are the residuals. The mean squared error (MSE) is
MSE = \frac{SSE}{n-2}.
Theorem 6.3.2. Under the normal error model, the error sum of squares SSE satisfies
• SSE/\sigma^2 \sim \chi^2_{n-2};
• SSE is independent of \hat{\beta}.
As a consequence, MSE is an unbiased and consistent estimator of \sigma^2, since SSE = \sum_{i=1}^n \hat{\epsilon}_i^2 \sim \sigma^2\chi^2_{n-2} and the consistency follows from the law of large numbers.
Proof: Now let's prove the theorem. The proof is very straightforward. Note that
\hat{\epsilon} = Y - \hat{Y} = Y - X\hat{\beta} = Y - X(X^\top X)^{-1}X^\top Y = (I_n - X(X^\top X)^{-1}X^\top)(X\beta + \epsilon) = (I_n - X(X^\top X)^{-1}X^\top)\epsilon,
where we used (I_n - X(X^\top X)^{-1}X^\top)X = 0 and \hat{Y} = X\hat{\beta}.
In fact, the matrix P = I_n - X(X^\top X)^{-1}X^\top is a projection matrix of rank n - 2. Therefore,
SSE = \hat{\epsilon}^\top\hat{\epsilon} = \epsilon^\top(I_n - X(X^\top X)^{-1}X^\top)\epsilon \sim \sigma^2\chi^2_{n-2}.
We leave it as an exercise to show that P is a projection matrix. The rank of P follows from
\mathrm{Tr}(P) = \mathrm{Tr}(I_n - X(X^\top X)^{-1}X^\top) = \mathrm{Tr}(I_n) - \mathrm{Tr}((X^\top X)^{-1}X^\top X) = n - \mathrm{Tr}(I_2) = n - 2.
Next, \hat{\beta} and \hat{\epsilon} are uncorrelated:
\mathrm{Cov}(\hat{\beta}, \hat{\epsilon}) = E(\hat{\beta} - \beta)\hat{\epsilon}^\top = E\left[ (X^\top X)^{-1}X^\top\epsilon\epsilon^\top(I_n - X(X^\top X)^{-1}X^\top) \right] = \sigma^2(X^\top X)^{-1}X^\top(I_n - X(X^\top X)^{-1}X^\top) = 0.
Since two jointly Gaussian random vectors that are uncorrelated are independent, SSE (a function of \hat{\epsilon}) is independent of \hat{\beta}.
Theorem 6.3.3. Under the normal error model,
\frac{\hat{\beta}_0 - \beta_0}{\widehat{se}(\hat{\beta}_0)} \sim t_{n-2}, \quad \frac{\hat{\beta}_1 - \beta_1}{\widehat{se}(\hat{\beta}_1)} \sim t_{n-2}.
Proof: We only prove the claim for \hat{\beta}_0 since the justification applies to \hat{\beta}_1 similarly. Write
\frac{\hat{\beta}_0 - \beta_0}{\widehat{se}(\hat{\beta}_0)} = \left( \frac{\hat{\beta}_0 - \beta_0}{se(\hat{\beta}_0)} \right)\Big/\left( \frac{\widehat{se}(\hat{\beta}_0)}{se(\hat{\beta}_0)} \right),
where
\frac{\hat{\beta}_0 - \beta_0}{se(\hat{\beta}_0)} \sim N(0, 1), \quad \frac{\widehat{se}(\hat{\beta}_0)}{se(\hat{\beta}_0)} = \sqrt{\frac{MSE}{\sigma^2}} = \sqrt{\frac{SSE}{(n-2)\sigma^2}}, \quad \frac{SSE}{\sigma^2} \sim \chi^2_{n-2}.
The numerator and denominator are independent by Theorem 6.3.2, so the ratio follows a t_{n-2} distribution.
Theorem 6.3.4 (Hypothesis testing for \beta). Under the normal error model,
• a 1-\alpha confidence interval for \beta_i (i = 0, 1) is
\hat{\beta}_i \pm t_{n-2, 1-\frac{\alpha}{2}}\cdot\widehat{se}(\hat{\beta}_i);
• to test H_0 : \beta_i = 0 v.s. H_1 : \beta_i \neq 0, we use w = \hat{\beta}_i/\widehat{se}(\hat{\beta}_i); the p-value is P(|t_{n-2}| > |w|), where t_{n-2} denotes a Student-t random variable with n-2 degrees of freedom.
If n is large, we have approximately
\frac{\hat{\beta}_0 - \beta_0}{\widehat{se}(\hat{\beta}_0)} \sim N(0, 1), \quad \frac{\hat{\beta}_1 - \beta_1}{\widehat{se}(\hat{\beta}_1)} \sim N(0, 1).
Thus with large samples, it is safe to replace t_{n-2} by N(0, 1) (e.g., t_{n-2,1-\frac{\alpha}{2}} by z_{1-\frac{\alpha}{2}}).
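In R, these standard errors, t-tests, and confidence intervals are reported directly; a brief sketch using the lmfit object fitted above (added here, not part of the original notes):

summary(lmfit)$coefficients    # estimates, standard errors, t values, p-values
confint(lmfit, level = 0.95)   # 95% confidence intervals for beta0 and beta1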
Next we consider estimating the mean response at a new value X_*, i.e., Y_* = \beta_0 + \beta_1 X_*, using \hat{Y}_* = \hat{\beta}_0 + \hat{\beta}_1 X_*. What is the distribution of \hat{Y}_*? Since \hat{Y}_* is a linear combination of \hat{\beta},
\hat{Y}_* \sim N\left( Y_*, \mathrm{Var}(\hat{Y}_*) \right).
Replacing \sigma^2 by \hat{\sigma}^2 in \mathrm{Var}(\hat{Y}_*), we have
\frac{\hat{Y}_* - Y_*}{\widehat{se}(\hat{Y}_*)} \sim t_{n-2},
where
\widehat{se}(\hat{Y}_*) = \sqrt{MSE}\,\sqrt{ \frac{1}{n} + \frac{(\bar{X}_n - X_*)^2}{\sum_i X_i^2 - n\bar{X}_n^2} }.
Therefore, an approximate 1-\alpha confidence interval for the mean response is
\hat{Y}_* \pm t_{n-2, 1-\frac{\alpha}{2}}\cdot\widehat{se}(\hat{Y}_*).
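In R this interval is produced by predict() with interval = "confidence"; a sketch for the fit above, at an assumed new value of the predictor:

new_x <- data.frame(waiting = 70)   # hypothetical new predictor value
predict(lmfit, newdata = new_x, interval = "confidence", level = 0.95)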
6.4 Multiple linear regression
In multiple regression we allow several predictor variables:
Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_{p-1}X_{i,p-1} + \epsilon_i, \quad 1 \leq i \leq n,
where
• Y_i: value of the response variable Y in the ith case;
• X_{i1}, \cdots, X_{i,p-1}: values of the variables X_1, \cdots, X_{p-1};
• \beta_0, \cdots, \beta_{p-1}: regression coefficients; p is the number of regression coefficients (in simple regression p = 2);
• error terms: E(\epsilon_i) = 0, \mathrm{Var}(\epsilon_i) = \sigma^2, and \mathrm{Cov}(\epsilon_i, \epsilon_j) = 0 for i \neq j.
The mean response is E(Y_i) = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_{p-1}X_{i,p-1}. In matrix form,
\underbrace{Y}_{n\times 1} = \underbrace{X}_{n\times p}\,\underbrace{\beta}_{p\times 1} + \underbrace{\epsilon}_{n\times 1},
where
X := \begin{bmatrix} 1 & X_{11} & X_{12} & \cdots & X_{1,p-1} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & X_{i1} & X_{i2} & \cdots & X_{i,p-1} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & X_{n1} & X_{n2} & \cdots & X_{n,p-1} \end{bmatrix}, \quad \beta := \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{bmatrix}.
Under the model assumptions, E(\epsilon) = 0 and \mathrm{Cov}(\epsilon) = \sigma^2 I_n.
Now let's consider the least squares estimator for multiple regression. The LS estimator is given by
\hat{\beta}_{LS} = \mathrm{argmin}_{\beta\in\mathbb{R}^p}\|X\beta - Y\|^2,
where \|X\beta - Y\|^2 = (X\beta - Y)^\top(X\beta - Y).
How do we find the global minimizer of this program? A few facts on gradients of vector-valued functions:
Lemma 6.4.1. Let \langle x, v\rangle = v^\top x denote the inner product of two column vectors x and v, and let A be a symmetric matrix. Then
f_1(x) = x^\top v, \quad \frac{\partial f_1}{\partial x} = v,
f_2(x) = x^\top x, \quad \frac{\partial f_2}{\partial x} = 2x,
f_3(x) = x^\top Ax, \quad \frac{\partial f_3}{\partial x} = 2Ax, \quad \frac{\partial^2 f_3}{\partial x^2} = 2A.
Proof: For f_1(x), we know f_1(x) = \sum_{i=1}^n v_i x_i is a linear function of x:
\frac{\partial f_1}{\partial x_i} = v_i, \quad \frac{\partial f_1}{\partial x} = (v_1, \cdots, v_n)^\top = v.
For f_2(x), f_2(x) = \sum_{i=1}^n x_i^2 is a quadratic function:
\frac{\partial f_2}{\partial x_i} = 2x_i, \quad \frac{\partial f_2}{\partial x} = (2x_1, \cdots, 2x_n)^\top = 2x.
For f_3(x) = \sum_{i,j} a_{ij}x_i x_j with A symmetric,
\frac{\partial f_3}{\partial x_i} = \frac{\partial}{\partial x_i}\left( \sum_{j\neq i} a_{ij}x_j x_i + \sum_{j\neq i} a_{ji}x_j x_i + a_{ii}x_i^2 \right) = \sum_{j\neq i} a_{ij}x_j + \sum_{j\neq i} a_{ji}x_j + 2a_{ii}x_i = 2\sum_{j=1}^n a_{ij}x_j = 2[Ax]_i.
The Hessian of f_3 is defined by
\nabla^2 f_3 = \left( \frac{\partial^2 f_3}{\partial x_i\partial x_j} \right)_{1\leq i, j\leq n}.
Then
\frac{\partial^2 f_3}{\partial x_i\partial x_j} = \frac{\partial}{\partial x_j}\left( \frac{\partial f_3}{\partial x_i} \right) = \frac{\partial}{\partial x_j}\left( 2\sum_{k=1}^n a_{ik}x_k \right) = 2a_{ij}.
Theorem 6.4.2. Suppose X is of rank p, i.e., all the columns are linearly independent. Then the least squares estimator is
\hat{\beta}_{LS} = (X^\top X)^{-1}X^\top Y.
Proof: Expanding f(\beta) = \|X\beta - Y\|^2 = \beta^\top X^\top X\beta - 2\beta^\top X^\top Y + Y^\top Y and applying Lemma 6.4.1,
\frac{\partial f}{\partial\beta} = 2X^\top X\beta - 2X^\top Y = 0 \iff \hat{\beta} = (X^\top X)^{-1}X^\top Y, \quad \frac{\partial^2 f}{\partial\beta^2} = 2X^\top X \succ 0.
Why is X^\top X \succ 0? Its quadratic form satisfies v^\top X^\top Xv = \|Xv\|^2 > 0 for any v \neq 0.
Example: polynomial regression. Suppose we fit a polynomial of degree p to the data, i.e., Y_i = \beta_0 + \beta_1 X_i + \cdots + \beta_p X_i^p + \epsilon_i. The design matrix is
X = \begin{bmatrix} 1 & X_1 & \cdots & X_1^p \\ 1 & X_2 & \cdots & X_2^p \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_n & \cdots & X_n^p \end{bmatrix} \in \mathbb{R}^{n\times(p+1)}
and
Y = X\beta + \epsilon, \quad \beta \in \mathbb{R}^{p+1}.
From the result discussed before, we know that the least squares estimator of \beta is
\hat{\beta} = (X^\top X)^{-1}X^\top Y.
Why is X^\top X invertible? X is called a Vandermonde matrix and is of full rank as long as there are p + 1 distinct values of X_i.
In particular, \hat{\beta} is unbiased:
E\hat{\beta} = (X^\top X)^{-1}X^\top E(Y) = (X^\top X)^{-1}X^\top X\beta = \beta.
6.4.2 Geometric meaning of least squares estimator
Here we assume X \in \mathbb{R}^{n\times p} has rank p, i.e., all the columns are linearly independent. What does least squares mean geometrically?
Let's first understand the meaning of minimizing \|X\beta - Y\|^2:
• find the vector in the range of X whose distance to Y is minimized;
• equivalently, project Y onto the linear subspace spanned by the columns of X;
• the residual \hat{\epsilon} = Y - X\hat{\beta} is perpendicular to every vector in the column space of X, i.e., X^\top\hat{\epsilon} = 0 \iff X^\top X\hat{\beta} = X^\top Y.
Projection matrix
Note that \hat{\beta}_{LS} = (X^\top X)^{-1}X^\top Y. The fitted value \hat{Y} is
\hat{Y} = X\hat{\beta}_{LS} = X(X^\top X)^{-1}X^\top Y = HY.
The matrix H := X(X^\top X)^{-1}X^\top is called the hat matrix, i.e., the projection matrix onto the column (range) space of X. It has the following properties:
• symmetric: H^\top = H;
• idempotent: H^2 = H (applying the projection twice doesn't change the outcome);
• all of its eigenvalues are either 0 (multiplicity n - p) or 1 (multiplicity p).
Exercise: show that I - H is also a projection matrix.
Question: how do we represent the fitted values and residuals using H?
• The fitted value \hat{Y} is the projection of Y onto the range of X, i.e., \hat{Y} = HY.
• The residual \hat{\epsilon} is equal to
\hat{\epsilon} = Y - \hat{Y} = (I - H)Y = (I - H)\epsilon.
6.4.3 Inference under the normal error model
Question: how do we represent SSE and MSE using H?
The SSE and MSE are defined by
SSE := \sum_{i=1}^n \hat{\epsilon}_i^2 = \|\hat{\epsilon}\|^2 = \|(I - H)\epsilon\|^2 = \epsilon^\top(I - H)(I - H)\epsilon = \epsilon^\top(I - H)\epsilon,
MSE := \frac{SSE}{n - p},
where I_n - H is a projection matrix.
Question: What are E(SSE) and E(MSE) under the normal error model, i.e., \epsilon_i \sim N(0, \sigma^2)? Here we use the fact that all eigenvalues of projection matrices are either 0 or 1, and that
\mathrm{Tr}(H) = \mathrm{Tr}(X(X^\top X)^{-1}X^\top) = \mathrm{Tr}(X^\top X(X^\top X)^{-1}) = \mathrm{Tr}(I_p) = p,
so \mathrm{Tr}(I_n - H) = n - p and E(SSE) = E[\epsilon^\top(I_n - H)\epsilon] = \sigma^2\,\mathrm{Tr}(I_n - H) = (n - p)\sigma^2, hence E(MSE) = \sigma^2. Moreover, under the normal error model,
SSE \sim \sigma^2\chi^2_{n-p}
and SSE is independent of \hat{\beta}.
To see this, write I_n - H = UU^\top where U \in \mathbb{R}^{n\times(n-p)} has orthonormal columns (eigenvectors of I_n - H w.r.t. eigenvalue 1), so that SSE = \|U^\top\epsilon\|^2. Note that U^\top\epsilon \sim N(0, \sigma^2 I_{n-p}). Therefore, SSE/\sigma^2 is the sum of n - p independent squared standard normal random variables, i.e., \chi^2_{n-p}.
On the other hand, \hat{\epsilon} = (I - H)\epsilon and \hat{\beta} are jointly normal. Why? Simply speaking, \hat{\epsilon} and \hat{\beta} can be obtained by applying a linear transform to \epsilon:
\begin{bmatrix} \hat{\epsilon} \\ \hat{\beta} \end{bmatrix} = \begin{bmatrix} I - H \\ (X^\top X)^{-1}X^\top \end{bmatrix}\epsilon + \begin{bmatrix} 0 \\ \beta \end{bmatrix}.
By the invariance of normal random vectors under linear transforms, \hat{\epsilon} and \hat{\beta} are jointly normal.
Moreover, they are independent since they are uncorrelated:
\mathrm{Cov}(\hat{\beta}, \hat{\epsilon}) = E\left[ (X^\top X)^{-1}X^\top\epsilon\epsilon^\top(I - H) \right] = \sigma^2(X^\top X)^{-1}X^\top(I - H) = 0.
Exercise: Are the \hat{\epsilon}_i mutually independent?
Note that under the normal error model, MSE is an unbiased and consistent estimator of \sigma^2. Therefore, replacing \sigma^2 by MSE, we can derive the distribution of each \hat{\beta}_i:
\frac{\hat{\beta}_i - \beta_i}{\sqrt{MSE\,[(X^\top X)^{-1}]_{ii}}} \sim t_{n-p}.
With this, we are able to construct confidence intervals for the \beta_i and perform hypothesis testing.
• An approximate 1-\alpha confidence interval for each \beta_i is
\left( \hat{\beta}_i - t_{n-p, 1-\frac{\alpha}{2}}\sqrt{MSE\,[(X^\top X)^{-1}]_{ii}},\; \hat{\beta}_i + t_{n-p, 1-\frac{\alpha}{2}}\sqrt{MSE\,[(X^\top X)^{-1}]_{ii}} \right).
6.5 Model diagnostics
We have discussed statistical inference for linear models: interval estimation and hypothesis testing for the coefficients. The key ingredient is the assumption that the underlying model is a linear model with normal error. Recall the simple linear model with normal error:
Y_i = \beta_0 + \beta_1 X_i + \epsilon_i, \quad \epsilon_i \overset{i.i.d.}{\sim} N(0, \sigma^2), \quad 1 \leq i \leq n.
All stated assumptions are crucial: the linear regression relation and the i.i.d. normal errors. However, in practice, the data may not satisfy these assumptions, which can make our inference inaccurate. This is the reason why we need to consider model diagnostics and come up with remedies for these issues.
There are a few typical violations of the simple linear model with normal error:
• nonlinearity of the regression relation;
• nonconstant variance of the error terms;
• non-normality of the error terms;
• dependence of the error terms;
• existence of outliers.
In this lecture, we will briefly discuss model diagnostics by using residual plots. Due to
the time limitation, we will focus on the first three issues and propose solutions. We will
not discuss how to handle outliers and influential cases, which are also very important in
practice.
If we simply look at Figure 6.1, the scatterplot of (X_i, Y_i), it seems a linear model is quite adequate. Let's fit a linear model to the data. Under simple regression with normal error, the residuals \hat{\epsilon}_i = Y_i - \hat{Y}_i are multivariate normal and independent of the fitted values \hat{Y}_i:
\hat{\epsilon} = (I - H)\epsilon, \quad \hat{Y} = X\hat{\beta}.
Therefore, the residuals v.s. fitted values plot should show no distinct pattern, i.e., the residuals should spread evenly around a horizontal line.
What do we observe in Figure 6.2? We can see a nonlinear pattern between the residuals and the fitted values which is not captured by a linear model. Therefore, one needs to take the nonlinearity into consideration, for instance by adding a nonlinear predictor, i.e., considering polynomial regression
Y_i = \beta_0 + \beta_1 X_i + \cdots + \beta_p X_i^p + \epsilon_i, \quad 1 \leq i \leq n,
which is an important example of multiple regression.
Adding the extra quadratic term to the simple linear model indeed makes the pattern disappear, i.e., the red line becomes horizontal and the residuals spread evenly around the fitted values; see the sketch below.
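A sketch in R (added here, not from the original notes) of this diagnose-and-remedy loop, on simulated data of the same form as in Figure 6.2 (the design used here is an arbitrary choice):

set.seed(1)
x <- runif(100, 0, 10)
y <- 10 * x + 0.2 * x^2 + rnorm(100)   # true relation is quadratic

fit1 <- lm(y ~ x)                      # misspecified linear fit
plot(fitted(fit1), resid(fit1))        # residuals show a curved pattern

fit2 <- lm(y ~ x + I(x^2))             # add the quadratic term
plot(fitted(fit2), resid(fit2))        # pattern disappears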
Figure 6.2: Residuals v.s. fitted values for Y_i = 10X_i + 0.2X_i^2 + \epsilon_i. Left: without the quadratic term; right: after adding the quadratic term.
Next, consider a simulated example with nonconstant error variance, where \epsilon_i \sim N(0, 1) and X_i \sim \mathrm{Unif}[0, 10]. We first obtain the scatterplot in Figure 6.3, then fit a linear model and examine the residuals v.s. fitted values plot. We can see that the residuals spread out more as the fitted value increases.
(Figure 6.3: left, scatterplot of y_nc versus x_nc; right, residuals v.s. fitted values plot, with the spread of the residuals increasing in the fitted value.)
To check the normality of the error terms in a linear model, we can simply obtain the residuals and draw their QQ-plot. The reference line in the QQ-plot passes through the first and third quartiles.
Figure 6.4: QQ-plot for normal data and for the residuals in Model 6.5.2
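In R, the QQ-plot of the residuals can be produced as follows (a short sketch added here, using the fit2 object from the earlier sketch):

r <- rstandard(fit2)   # standardized residuals of an lm fit
qqnorm(r)              # sample quantiles vs. standard normal quantiles
qqline(r)              # reference line through the first and third quartiles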
(Figure: probability density functions of the distributions listed below, compared with the standard normal.)
• Student-t: heavy-tailed
• χ23 − 3: right skewed, the right tail is longer; the mass of the distribution is concen-
trated on the left of the figure
• 3−χ23 : left skewed, the left tail is longer; the mass of the distribution is concentrated
on the right of the figure.
• N (0, 1): standard normal distribution.
QQ-plot Example 1:
(Figure: QQ plot of sample data versus standard normal.)
The QQ plot shows more probability in both tails than a normal distribution; the sample comes from a Student-t distribution.
QQ-plot Example 2:
(Figures: two QQ plots of sample data versus standard normal.)
Figure 6.6: QQ plot for \chi^2_3 - 3. Figure 6.7: QQ plot for 3 - \chi^2_3.
In Figure 6.6, the QQ plot shows more probability in the right tail and less in the left tail; the sample comes from the \chi^2_3 - 3 distribution. In Figure 6.7, the QQ plot shows more probability in the left tail and less in the right tail; the sample comes from the 3 - \chi^2_3 distribution.
A common remedy for nonconstant variance or non-normality is to transform the response such that the resulting model is close to a linear model with normal error. More precisely, we perform a power transform on the response Y_i of the following form (the Box-Cox transform):
Y' = Y^\lambda, \quad \text{i.e.,} \quad Y_i^\lambda = \beta_0 + \beta_1 X_i + \epsilon_i.
Essentially, we treat \lambda as an additional parameter and hope to identify the best \lambda (as well as \beta and \sigma^2) such that the model fits the data.
Question: How do we identify a suitable \lambda?
Denote Y_i(\lambda) = Y_i^\lambda. Then under the normal error model, the joint pdf of (Y_1(\lambda), \cdots, Y_n(\lambda)) is
f_{Y(\lambda)}(Y(\lambda); \beta, \lambda, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left( -\frac{1}{2\sigma^2}\sum_{i=1}^n (\beta_0 + \beta_1 X_i - Y_i(\lambda))^2 \right).
Recall that we only observe Y_i instead of Y_i(\lambda); to derive a likelihood function based on Y_i, we need to perform a change of variables, i.e., obtain the joint pdf of the Y_i. This can be done by introducing a Jacobian factor:
f_Y(Y; \beta, \lambda, \sigma^2) = f_{Y(\lambda)}(Y(\lambda); \beta, \lambda, \sigma^2)\cdot\prod_{i=1}^n \left| \frac{dY_i(\lambda)}{dY_i} \right|
= \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left( -\frac{1}{2\sigma^2}\sum_{i=1}^n (\beta_0 + \beta_1 X_i - Y_i(\lambda))^2 \right)\prod_{i=1}^n \lambda Y_i^{\lambda-1},
where
\frac{dY_i(\lambda)}{dY_i} = \frac{d}{dY_i}Y_i^\lambda = \lambda Y_i^{\lambda-1}.
Denote K = \left( \prod_i Y_i \right)^{1/n}, the geometric mean of \{Y_i\}_{i=1}^n. The log-likelihood function is
\ell(\lambda, \beta, \sigma^2) = -\frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\|X\beta - Y(\lambda)\|^2 + n\log(K^{\lambda-1}\lambda).
Note that for fixed \lambda, the MLE of \beta is (X^\top X)^{-1}X^\top Y(\lambda) and the MLE of \sigma^2 is
\hat{\sigma}^2(\lambda) = \frac{1}{n}\|(I_n - H)Y(\lambda)\|^2.
Therefore, plugging in these MLEs, we have
\ell(\lambda) = -\frac{n}{2}\log\hat{\sigma}^2(\lambda) + n\log(K^{\lambda-1}\lambda) + C,
where C is a constant. For each \lambda, the log-likelihood thus equals, up to an additive constant,
\ell(\lambda) = -\frac{n}{2}\log SSE(\lambda),
where
SSE(\lambda) = \left\| \frac{(I_n - H)Y(\lambda)}{K^{\lambda-1}\lambda} \right\|^2.
If we choose to normalize Y_i^\lambda via
Y_i'(\lambda) = \begin{cases} K^{1-\lambda}(Y_i^\lambda - 1)/\lambda, & \lambda \neq 0, \\ K\log(Y_i), & \lambda = 0, \end{cases}
where K = \left( \prod_i Y_i \right)^{1/n} is the geometric mean of \{Y_i\}_{i=1}^n, then we can equivalently identify a suitable \lambda such that the SSE associated with Y_i'(\lambda), i.e., SSE(\lambda), is smallest.
Many statistical packages provide a built-in function to compute this likelihood as a function of \lambda. In R, one such function is boxCox().
Example: A marketing researcher studied annual sales of a product that had been introduced 10 years ago. The data are as follows:
Year: 0 1 2 3 4 5 6 7 8 9
Sales (thousands of units): 98 135 162 178 221 232 283 300 374 395
If we fit a linear model, we may feel that a linear model is quite adequate for the data. Now let's perform model diagnostics by plotting the residuals v.s. fitted values and the QQ plot.
(Figure: scatterplot of Y_sales versus X_year.)
What do you observe? There is a pattern (nonlinearity) in the residuals v.s. fitted values plot. How do we deal with it? We perform a Box-Cox transform.
Figure 6.10 indicates that \lambda = 1/2 is the best choice. We use \lambda = 1/2, i.e., Y' = \sqrt{Y}, and fit the following new model:
\sqrt{Y_i} = \beta_0 + \beta_1 X_i + \epsilon_i.
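A sketch of this Box-Cox analysis in R using the sales data above (added here, not from the original notes; MASS::boxcox is used in place of the boxCox() call mentioned earlier):

library(MASS)
X_year  <- 0:9
Y_sales <- c(98, 135, 162, 178, 221, 232, 283, 300, 374, 395)

fit <- lm(Y_sales ~ X_year)
boxcox(fit, lambda = seq(-2, 2, 0.1))    # profile log-likelihood over lambda

fit_sqrt <- lm(sqrt(Y_sales) ~ X_year)   # refit with the sqrt-transformed response
summary(fit_sqrt)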
(Figure: residuals v.s. fitted values and normal QQ plot for the linear fit to the sales data.)
(Figure 6.10: Box-Cox profile log-likelihood as a function of \lambda, with the 95% interval marked.)
Equivalently, in the original scale the model reads
Y_i = (\beta_0 + \beta_1 X_i + \epsilon_i)^2.
We fit the model again and perform model diagnostics. Figure 6.11 implies that the model fits the data well: the residuals are approximately normal and also spread evenly around the fitted values.
Finally, we briefly discuss what happens if some of the data Y_i are negative. In fact, a more general transform is
Figure 6.11: Residuals v.s. fitted and QQ plot for the transformed model
Y' = (Y + \lambda_2)^{\lambda_1},
which is a two-parameter family of transforms. Then we can follow a similar procedure to obtain the (\lambda_1, \lambda_2) that maximize the likelihood function, and use them to transform the response.
Figure 6.12: Data from Example 1.13 [Robert, Casella, 2004]: 1 stands for failure, 0 for
success.
(Figure: scatterplot of the binary response Y versus the predictor X for the data in Figure 6.12.)
The model of logistic regression is as follows: the response is a binary random variable whose distribution depends on the predictor,
P(Y_i = 1 \mid X_i) = \frac{e^{\beta_0+\beta_1 X_i}}{1 + e^{\beta_0+\beta_1 X_i}}, \quad P(Y_i = 0 \mid X_i) = \frac{1}{1 + e^{\beta_0+\beta_1 X_i}}.
Denote
p(X) = P(Y = 1 \mid X) = \frac{e^{\beta_0+\beta_1 X}}{1 + e^{\beta_0+\beta_1 X}}.
The following function is called the logistic function:
f(x) = \frac{e^x}{1 + e^x}.
In other words, p(X) = f(\beta_0 + \beta_1 X). What does f(x) look like?
• f (x) is increasing,
• f (x) takes value between 0 and 1.
Logistic regression is an example of a generalized linear model:
\beta_0 + \beta_1 X_i = \log\frac{p_i}{1 - p_i} = \mathrm{logit}(p_i),
where
\mathrm{logit}(x) = \log\frac{x}{1 - x}, \quad 0 < x < 1,
is called the logit function, which is the canonical link function for a binary response.
(Figure: the logistic function f(x) = e^x/(1 + e^x) plotted for x between -4 and 4.)
Given data (X_i, Y_i), the log-likelihood function is
\ell(\beta_0, \beta_1) = \sum_{i=1}^n \left[ Y_i\log p_i + (1 - Y_i)\log(1 - p_i) \right] = \sum_{i=1}^n \left[ Y_i(\beta_0 + \beta_1 X_i) - \log(1 + e^{\beta_0+\beta_1 X_i}) \right].
For the Hessian matrix, we have
\frac{\partial^2\ell}{\partial\beta_0^2} = -\sum_{i=1}^n \frac{e^{\beta_0+\beta_1 X_i}}{(1 + e^{\beta_0+\beta_1 X_i})^2}, \quad \frac{\partial^2\ell}{\partial\beta_1^2} = -\sum_{i=1}^n \frac{e^{\beta_0+\beta_1 X_i}}{(1 + e^{\beta_0+\beta_1 X_i})^2}X_i^2,
and
\frac{\partial^2\ell}{\partial\beta_0\partial\beta_1} = -\sum_{i=1}^n \frac{e^{\beta_0+\beta_1 X_i}}{(1 + e^{\beta_0+\beta_1 X_i})^2}X_i.
Since e^{\beta_0+\beta_1 X_i}/(1 + e^{\beta_0+\beta_1 X_i})^2 = p_i(1 - p_i), the Hessian matrix equals
\nabla^2\ell = -\begin{bmatrix} \sum_{i=1}^n p_i(1-p_i) & \sum_{i=1}^n p_i(1-p_i)X_i \\ \sum_{i=1}^n p_i(1-p_i)X_i & \sum_{i=1}^n p_i(1-p_i)X_i^2 \end{bmatrix}.
The log-likelihood is therefore concave, but the MLE has no closed form; it is computed numerically, e.g., by Newton's method, whose iterates are shown below.
(Figure: iterates of the numerical maximization of the log-likelihood versus the iteration count.)
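A minimal Newton-Raphson sketch for the logistic MLE (added here under the assumption that the data are stored in vectors X and Y; this is not necessarily the exact routine used to produce the figure, and in practice step-size control or better starting values may be needed):

logistic_newton <- function(X, Y, iters = 10) {
  beta <- c(0, 0)                                   # start from (beta0, beta1) = (0, 0)
  for (t in 1:iters) {
    eta  <- beta[1] + beta[2] * X
    p    <- 1 / (1 + exp(-eta))                     # fitted probabilities
    grad <- c(sum(Y - p), sum((Y - p) * X))         # gradient of the log-likelihood
    W    <- p * (1 - p)
    hess <- -matrix(c(sum(W), sum(W * X),
                      sum(W * X), sum(W * X^2)), 2, 2)  # Hessian as derived above
    beta <- beta - solve(hess, grad)                # Newton update
  }
  beta
}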
Most statistical programs have built-in routines to estimate \beta. In R, we call the glm() function:
glm_fit <- glm(Y~X, data = data, family = binomial)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 15.0429 7.3786 2.039 0.0415 *
X -0.2322 0.1082 -2.145 0.0320 *
The estimated coefficients for the Challenger data are
\hat{\beta}_0 = 15.0429, \quad \hat{\beta}_1 = -0.2322.
The predicted probability at X_i is
\hat{p}_i = \frac{e^{\hat{\beta}_0+\hat{\beta}_1 X_i}}{1 + e^{\hat{\beta}_0+\hat{\beta}_1 X_i}}.
(Figure 6.15: the data (X_i, Y_i) and the fitted probabilities \hat{p}_i versus X_i.)
We plot (X_i, \hat{p}_i) in Figure 6.15, where \hat{p}_i is the fitted probability; the curve decreases as X gets larger.
How do we interpret \beta_0 and \beta_1? The odds are given by
\mathrm{odds}(X) = \frac{p(X)}{1 - p(X)} = e^{\beta_0+\beta_1 X}.
In other words, the odds at X + 1 are about e^{\beta_1} times the odds at X. In the Shuttle Challenger example, \hat{\beta}_1 = -0.232 and e^{\hat{\beta}_1} \approx 0.79, so each unit increase in X multiplies the estimated odds of failure by about 0.79.
Suppose we want to test whether the failure probability indeed decreases in X, e.g.,
H_0 : \beta_1 < 0 \quad \text{versus} \quad H_1 : \beta_1 \geq 0.
To derive the rejection region, we need to know the distribution of the MLE.
Theorem 6.6.1. The MLE enjoys
• consistency: as n \to \infty,
\hat{\beta}_n \xrightarrow{p} \beta,
where \beta is the underlying parameter;
• asymptotic normality:
\hat{\beta}_n - \beta \xrightarrow{d} N(0, [I(\beta)]^{-1}),
where I(\beta) is the Fisher information.
The Fisher information matrix is the negative of the second-order derivative (Hessian) of the log-likelihood function evaluated at \beta, i.e., I(\beta) = -\nabla^2\ell(\beta), where
\nabla^2\ell(\beta) = -\begin{bmatrix} \sum_{i=1}^n p_i(1-p_i) & \sum_{i=1}^n p_i(1-p_i)X_i \\ \sum_{i=1}^n p_i(1-p_i)X_i & \sum_{i=1}^n p_i(1-p_i)X_i^2 \end{bmatrix}
and p_i = \frac{e^{\beta_0+\beta_1 X_i}}{1 + e^{\beta_0+\beta_1 X_i}}. However, the exact value of \beta = [\beta_0, \beta_1]^\top is unknown. In practice, we can approximate \beta by \hat{\beta} and get an estimate of the information matrix:
I(\hat{\beta}) = \begin{bmatrix} \sum_{i=1}^n \hat{p}_i(1-\hat{p}_i) & \sum_{i=1}^n \hat{p}_i(1-\hat{p}_i)X_i \\ \sum_{i=1}^n \hat{p}_i(1-\hat{p}_i)X_i & \sum_{i=1}^n \hat{p}_i(1-\hat{p}_i)X_i^2 \end{bmatrix},
where \hat{p}_i = \frac{e^{\hat{\beta}_0+\hat{\beta}_1 X_i}}{1 + e^{\hat{\beta}_0+\hat{\beta}_1 X_i}}.
The standard errors are estimated by
\widehat{se}(\hat{\beta}_0) = \sqrt{[I(\hat{\beta})^{-1}]_{11}}, \quad \widehat{se}(\hat{\beta}_1) = \sqrt{[I(\hat{\beta})^{-1}]_{22}},
and, approximately,
\frac{\hat{\beta}_0 - \beta_0}{\widehat{se}(\hat{\beta}_0)} \sim N(0, 1), \quad \frac{\hat{\beta}_1 - \beta_1}{\widehat{se}(\hat{\beta}_1)} \sim N(0, 1).
For the Challenger data, this gives
\widehat{se}(\hat{\beta}_0) = 7.3786, \quad \widehat{se}(\hat{\beta}_1) = 0.108.
6.6.3 Hypothesis testing
Question: How do we perform hypothesis testing?
z-test: Consider the hypothesis testing problem
H_0 : \beta_i = 0 \quad \text{v.s.} \quad H_1 : \beta_i \neq 0.
The z-value is
z = \frac{\hat{\beta}_i}{\widehat{se}(\hat{\beta}_i)}.
Why do we use the z-value? The z-value is used as a test statistic: recall that under H_0, approximately,
z = \frac{\hat{\beta}_i}{\widehat{se}(\hat{\beta}_i)} \sim N(0, 1), \quad i = 0 \text{ or } 1.
In the Challenger example, the z-value for \hat{\beta}_1 is
z = \frac{\hat{\beta}_1}{\widehat{se}(\hat{\beta}_1)} = \frac{-0.2322}{0.1082} \approx -2.145,
matching the glm output above.
What is the p-value? We reject the null if |z| is too large, so the p-value equals
P(|Z| \geq |z|) = 2\Phi(-|z|) \approx 0.032,
which agrees with the column Pr(>|z|) in the output. We can also test H_0 : \beta_1 = 0 with the likelihood ratio test: the maximized log-likelihoods of the full model and of the null (intercept-only) model are -10.16 and -14.13 respectively, so
\lambda(X) = 2(-10.16 + 14.13) = 7.95.
The p-value is
P(\chi^2_1 \geq 7.95) = 0.005 < 0.05.
Therefore, we should reject H_0.
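A quick check of these p-values in R (an added sketch; glm_fit is the fitted model from above and the null fit uses only an intercept):

2 * pnorm(-abs(-0.2322 / 0.1082))        # Wald/z-test p-value, about 0.032

glm_null <- glm(Y ~ 1, data = data, family = binomial)   # intercept-only model
lambda <- 2 * (logLik(glm_fit) - logLik(glm_null))        # likelihood ratio statistic
pchisq(as.numeric(lambda), df = 1, lower.tail = FALSE)    # LRT p-value, about 0.005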
Example (grouped binary responses): For each price reduction level X_i, a number n_i of households received coupons, and the number of coupons redeemed was recorded:
Level i   Price reduction X_i   # of households n_i   # of coupons redeemed
1         5                     100                   15
2         10                    120                   33
3         15                    110                   38
4         20                    200                   100
5         30                    160                   110
(Figure: empirical log-odds versus price reduction X_i.)
First, let's derive the log-likelihood function. Let Y_{ij} \in \{0, 1\} indicate whether the jth household at level i redeemed the coupon, with P(Y_{ij} = 1) = p_i. Then
\ell(\beta) = \log\prod_{i=1}^m\prod_{j=1}^{n_i} p_i^{Y_{ij}}(1 - p_i)^{1-Y_{ij}}
= \sum_{i=1}^m\sum_{j=1}^{n_i}\left[ Y_{ij}\log p_i + (1 - Y_{ij})\log(1 - p_i) \right]
= \sum_{i=1}^m\left[ \left( \sum_{j=1}^{n_i} Y_{ij} \right)\log p_i + \left( n_i - \sum_{j=1}^{n_i} Y_{ij} \right)\log(1 - p_i) \right]
= \sum_{i=1}^m n_i\left( \hat{p}_i\log p_i + (1 - \hat{p}_i)\log(1 - p_i) \right),
where
\hat{p}_i = \frac{\sum_{j=1}^{n_i} Y_{ij}}{n_i}.
Call:
glm(formula = cbind(N_s, N_f) ~ X, family = "binomial", data = mydata)
Deviance Residuals:
1 2 3 4 5
-0.7105 0.4334 -0.3098 0.6766 -0.4593
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.02150 0.20908 -9.669 <2e-16 ***
X 0.09629 0.01046 9.203 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The fitted probabilities can be extracted via fitted.values(glm_fit). The estimate \hat{\beta} of \beta is
\hat{\beta}_0 = -2.02, \quad \hat{\beta}_1 = 0.096.
(Figure: empirical log-odds versus price reduction X, with the fitted model.)
Recall that, writing the predictors as a vector X_i (including the intercept), the log-likelihood function is
\ell(\beta) = \sum_{i=1}^n\left[ Y_i\log\frac{e^{X_i^\top\beta}}{1 + e^{X_i^\top\beta}} + (1 - Y_i)\log\frac{1}{1 + e^{X_i^\top\beta}} \right]
= \sum_{i=1}^n\left[ X_i^\top\beta\cdot Y_i - \log(1 + e^{X_i^\top\beta}) \right],
and
\frac{\partial^2\ell}{\partial\beta\,\partial\beta^\top} = -\sum_{i=1}^n \frac{e^{X_i^\top\beta}}{(1 + e^{X_i^\top\beta})^2}X_iX_i^\top \preceq 0.
This is actually a concave function.
Example: Continuation of the Shuttle Challenger data. We want to see if adding more predictors helps fit the data better. Now we consider
Y_i \sim \mathrm{Bernoulli}\left( \frac{e^{\beta_0+\beta_1 X_i+\beta_2 X_i^2}}{1 + e^{\beta_0+\beta_1 X_i+\beta_2 X_i^2}} \right)
and test
H_0 : \beta_2 = 0 \quad \text{v.s.} \quad H_1 : \beta_2 \neq 0
with the likelihood ratio test: we compute the maximized log-likelihoods under the full model (with three parameters) and under the null hypothesis, and compare \lambda(X) = 2(\ell_{\text{full}} - \ell_{\text{null}}) with the \chi^2_1 distribution.
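A sketch of this model comparison in R (added here; it assumes the Challenger data frame data with columns Y and X used earlier):

glm_quad <- glm(Y ~ X + I(X^2), data = data, family = binomial)
anova(glm_fit, glm_quad, test = "Chisq")   # likelihood ratio test of beta2 = 0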
Bibliography