
Outline

1 Chapter 00: Measurable Space, Measure, and Probability


2 Chapter 02: Conditional Expectation and Variance
3 Chapter 06: Large Sample Asymptotics
4 Chapter 07: Asymptotic Theory for Least Squares
5 Chapter 10: Resampling Methods
6 Chapter 11: Factor Models and Max-linear Regressions
7 Chapter 12: Instrumental Variables
8 Chapter 13: Generalized Method of Moments
9 Chapter 18: Difference in Differences
10 Chapter 24: Quantile Regression
11 Chapter 25: Binary Choice
12 Topic 01: Copula methods for time series
13 Topic 02: Semi-parametric dynamic max-copula model for multivariate time series



Introduction
Often, a multivariate response has relatively few dominant modes
of variation.
E.g. length, width, and height of turtle shells: overall size is a
dominant mode of variation, making all three responses larger or
smaller simultaneously.
The method of principal components is a technique for extracting
linear combinations from multivariate data which capture most of
the variability in the data.
It is used in numerous different ways. One interpretation is that it is
largely a descriptive technique - given a large array of
high-dimensional data which we do not know what to do with, a
principal components analysis (PCA) will help us identify key
components which can then be subjected to more detailed
examination.
Another application is to the formation of indices, e.g. given annual
statistics of crimes committed in a number of different categories,
how can we best combine the different numbers into an overall
index of criminal behavior?


Introduction
Continue...
A third interpretation is that PCA is a dimension reduction technique
to be applied prior to some other form of analysis. For example,
one way to reduce the dimensionality of a multiple regression
problem is to perform an initial PCA on the regressor variables,
followed by an ordinary multiple regression on some of the leading
components. This is called principal components regression.
Basic idea: find linear combinations of responses that have most
of the variance.
More precisely: given $X$ with $E(X) = 0$ and $\mathrm{Cov}(X) = \Sigma$, find $a_1$ to maximize $\mathrm{Var}(a_1'X)$, subject to $a_1'a_1 = 1$. $a_1'X$ is called the first principal component of $X$.



Introduction
Notes:
a1 must be constrained, otherwise the variance could be made
arbitrarily large just by making a1 large.
Why this constraint? Because the problem has a convenient
solution.
Solution:
$$\mathrm{Var}(a_1'X) = a_1'\Sigma a_1,$$
which is maximized at $a_1 = e_1$, the first eigenvector of $\Sigma$.
The maximized value is $\lambda_1$, the associated eigenvalue.
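As a quick illustration (not from the slides), this eigen solution can be checked numerically in R; the data matrix X below is purely hypothetical:

set.seed(1)
X <- matrix(rnorm(200), nrow = 50, ncol = 4)   # hypothetical data, p = 4
S <- cov(X)                                    # sample covariance as a stand-in for Sigma
eig <- eigen(S, symmetric = TRUE)              # eigenvalues come out in decreasing order
a1 <- eig$vectors[, 1]                         # maximizes Var(a'X) subject to a'a = 1
eig$values[1]                                  # the maximized variance, lambda_1
var(X %*% a1)                                  # agrees with lambda_1 up to rounding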



Introduction
Often, the elements of e1 are all positive and similar in magnitude
⇒ a mode in which all responses vary together:
not often interesting;
a useful summary or composite variable.
What about other modes of variation? If we want to go beyond the first PC, we repeat the optimization in the space orthogonal to $a_1$: find $a_2$ to maximize $a_2'\Sigma a_2$ subject to $a_2'a_2 = 1$ and $a_2'a_1 = 0$. Then $a_2'X$ is the second principal component.
The process continues iteratively: given $a_1, \ldots, a_{i-1}$ for some $i \le p$, find $a_i$ to maximize $a_i'\Sigma a_i$ subject to $a_i'a_i = 1$ and $a_i'a_k = 0$ for $k = 1, \ldots, i-1$. Then $a_i'X$ is the $i$th principal component. In principle we could go on to find all $p$ PCs, though in practice it is usual to stop after selecting enough PCs to capture most of the variability in the data.


Proposition (JW Results 8.1, 8.2)
Let $\Sigma$ be the covariance matrix associated with the random vector $X' = [X_1, X_2, \ldots, X_p]$. Let $\Sigma$ have the eigenvalue-eigenvector pairs $(\lambda_1, e_1), (\lambda_2, e_2), \ldots, (\lambda_p, e_p)$, where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$. Then the $i$th principal component is given by
$$Y_i = e_i'X = e_{i1}X_1 + e_{i2}X_2 + \cdots + e_{ip}X_p, \quad i = 1, 2, \ldots, p. \tag{5}$$
With these choices,
$$\mathrm{Var}(Y_i) = e_i'\Sigma e_i = \lambda_i, \quad i = 1, 2, \ldots, p,$$
$$\mathrm{Cov}(Y_i, Y_k) = e_i'\Sigma e_k = 0, \quad i \ne k.$$
If some $\lambda_i$ are equal, the choices of the corresponding coefficient vectors $e_i$, and hence $Y_i$, are not unique.
Furthermore, the sum of the variances of the PCs equals the sum of the variances of the individual components of $X$.


Remarks
The above has been presented as if it all applied to the population covariance matrix $\Sigma$.
In practice, we usually don’t know Σ and have to estimate it by the
sample covariance matrix S based on a sample of n values of X .
The preceding operations are then performed on S instead of Σ to
produce the sample PCs. One side comment here is that for a
continuous probability distribution with nondegenerate Σ, the
sample covariance matrix S has distinct eigenvalues with
probability 1, so the sample PCs are uniquely defined even though
the population PCs may not be.
In the case of multivariate normal data, there is a rich sampling
theory of how well the sample ai ’s and λi ’s approximate the
corresponding population values, but it is more usual in practice to
treat PCA as largely a data-descriptive technique without giving
particular attention to sampling issues.


PCA based on the correlation matrix
Scaling in PCA: one difficulty associated with the covariance matrix is that the problem is not scale invariant:
If the original data were lengths measured in inches and weights measured in pounds, and we then changed the scales of measurement to centimeters and kilograms, the PCs and the corresponding $\lambda_i$'s would change.
Variables in different units must be scaled.
A way to avoid this difficulty is to rescale the problem prior to computing the PCs, so that each component of $X$ has either population variance or sample variance equal to 1.
Variables in the same units but with very different variances are usually scaled as well.
Simplest scaling: divide each variable by its standard deviation ⇒ covariances become correlations.
In other words: use the eigenstructure of the correlation matrix $\rho$, not the covariance matrix $\Sigma$.
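A small check of this point in R (toy data, nothing from the lecture): standardizing the columns turns the covariance matrix into the correlation matrix, so a correlation-based PCA is just prcomp with scale. = TRUE:

set.seed(2)
X <- cbind(len = rnorm(100, sd = 1), wt = rnorm(100, sd = 10))  # very different scales
all.equal(cov(scale(X)), cor(X))         # TRUE: cov of standardized data = correlation
p_cor <- prcomp(X, scale. = TRUE)        # PCA of the correlation matrix
e <- eigen(cor(X), symmetric = TRUE)
all.equal(p_cor$sdev^2, e$values)        # same eigenvalues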


Example and graphical illustration

Example
Suppose the random variables $X_1$, $X_2$ and $X_3$ have the covariance matrix
$$\Sigma = \begin{pmatrix} 1 & -2 & 0 \\ -2 & 5 & 0 \\ 0 & 0 & 2 \end{pmatrix}.$$
Calculate the population principal components:
$$\lambda_1 = 5.83, \quad e_1' = [.383, -.924, 0],$$
$$\lambda_2 = 2.00, \quad e_2' = [0, 0, 1],$$
$$\lambda_3 = 0.17, \quad e_3' = [.924, .383, 0].$$
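These eigenpairs are easy to verify in R, a direct check of the numbers above:

Sigma <- matrix(c( 1, -2, 0,
                  -2,  5, 0,
                   0,  0, 2), nrow = 3, byrow = TRUE)
eigen(Sigma, symmetric = TRUE)
# $values: 5.8284 2.0000 0.1716  -- i.e. 5.83, 2.00, 0.17 after rounding
# $vectors: columns match e1, e2, e3 above, up to an arbitrary sign flip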



Figure: JW Figure 8.1: Illustration of coordinates of PCs



Normal theory
Suppose $X$ is distributed as $N_p(\mu, \Sigma)$. We know that the density of $X$ is constant on the $\mu$-centered ellipsoids
$$(x - \mu)'\Sigma^{-1}(x - \mu) = c^2,$$
which have axes $\pm c\sqrt{\lambda_i}\, e_i$, $i = 1, 2, \ldots, p$, where the $(\lambda_i, e_i)$ are the eigenvalue-eigenvector pairs of $\Sigma$. Assuming $\mu = 0$, the equation above can be rewritten as
$$c^2 = x'\Sigma^{-1}x = \frac{1}{\lambda_1}(e_1'x)^2 + \frac{1}{\lambda_2}(e_2'x)^2 + \cdots + \frac{1}{\lambda_p}(e_p'x)^2 = \frac{1}{\lambda_1}y_1^2 + \frac{1}{\lambda_2}y_2^2 + \cdots + \frac{1}{\lambda_p}y_p^2,$$
where $e_1'x, e_2'x, \ldots, e_p'x$ are recognized as the principal components of $x$. The equation above defines an ellipsoid in a coordinate system with axes $y_1, y_2, \ldots, y_p$ lying in the directions $e_1, e_2, \ldots, e_p$, respectively.



PCA for some Special Cases
Diagonal matrix: if
$$\Sigma = \begin{pmatrix} \sigma_{11} & 0 & \cdots & 0 \\ 0 & \sigma_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_{pp} \end{pmatrix}$$
then the principal components are just the original variables.



Compound symmetry: if
$$\Sigma = \begin{pmatrix} \sigma^2 & \rho\sigma^2 & \cdots & \rho\sigma^2 \\ \rho\sigma^2 & \sigma^2 & \cdots & \rho\sigma^2 \\ \vdots & \vdots & \ddots & \vdots \\ \rho\sigma^2 & \rho\sigma^2 & \cdots & \sigma^2 \end{pmatrix}$$
then (if $\rho > 0$):
$\lambda_1 = \sigma^2\{1 + (p-1)\rho\}$ and $e_1 = p^{-1/2}(1, 1, \ldots, 1)'$;
$\lambda_k = \sigma^2(1 - \rho)$, $k > 1$;
$e_2, e_3, \ldots, e_p$ are an arbitrary orthonormal basis for the rest of $\mathbb{R}^p$.
If $\rho < 0$ the order is reversed, but note that $\rho$ must satisfy
$$1 + (p-1)\rho > 0 \;\Rightarrow\; \rho > -1/(p-1).$$
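A numerical sanity check of this eigenstructure (the values of p, sigma^2 and rho are illustrative):

p <- 5; sigma2 <- 2; rho <- 0.4
Sigma <- sigma2 * ((1 - rho) * diag(p) + rho * matrix(1, p, p))
ev <- eigen(Sigma, symmetric = TRUE)$values
c(ev[1], sigma2 * (1 + (p - 1) * rho))   # largest eigenvalue vs. formula: both 5.2
round(ev[-1], 10)                        # remaining p-1 eigenvalues: sigma2*(1-rho) = 1.2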



Time series (1st order autoregression):
$$\Sigma = \begin{pmatrix} \sigma^2 & \phi\sigma^2 & \cdots & \phi^{p-1}\sigma^2 \\ \phi\sigma^2 & \sigma^2 & \cdots & \phi^{p-2}\sigma^2 \\ \vdots & \vdots & \ddots & \vdots \\ \phi^{p-1}\sigma^2 & \phi^{p-2}\sigma^2 & \cdots & \sigma^2 \end{pmatrix}$$
No closed form, but for large $p$ the eigenvectors are like sines and cosines.
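This is easy to see numerically; a sketch (phi and p are arbitrary choices):

p <- 100; phi <- 0.7
Sigma <- toeplitz(phi^(0:(p - 1)))         # AR(1) covariance, taking sigma^2 = 1
E <- eigen(Sigma, symmetric = TRUE)
matplot(E$vectors[, 1:3], type = "l",      # leading eigenvectors look like
        ylab = "loading")                  # low-frequency sines/cosines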



Sample PCA
Essentially the eigen analysis of $S$ (or $R$):
$$S\hat e_k = \hat\lambda_k \hat e_k,$$
and
$$\hat y_k = X_{dev}\hat e_k,$$
where
$$X_{dev} = X - \frac{1}{n}\mathbf{1}\mathbf{1}'X = \left(I - \frac{1}{n}\mathbf{1}\mathbf{1}'\right)X.$$
Recall:
$$S = \frac{1}{n-1}X_{dev}'X_{dev} = \left(\frac{1}{\sqrt{n-1}}X_{dev}\right)'\left(\frac{1}{\sqrt{n-1}}X_{dev}\right).$$
Singular value decomposition:
$$\frac{1}{\sqrt{n-1}}X_{dev} = UDV'.$$
The diagonal entries of $D$ are the square roots of the largest $p$ eigenvalues of both $(n-1)^{-1}X_{dev}'X_{dev} = S$ and $(n-1)^{-1}X_{dev}X_{dev}'$.
The columns of $V$ are the eigenvectors of $X_{dev}'X_{dev}$.
Also
$$X_{dev}V = [\hat y_1, \hat y_2, \ldots, \hat y_p] = \sqrt{n-1}\,UD,$$
so the singular value decomposition of $(n-1)^{-1/2}X_{dev}$ provides all the details of the sample principal components:
the coefficients $V$;
the values $UD$.
Similarly, if $X^*$ is $X_{dev}$ with its columns normalized (sum of squares = 1), then
$$R = X^{*\prime}X^*,$$
and the singular value decomposition of $X^*$ gives the PCA of $R$.
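A minimal R sketch of these identities on a hypothetical data matrix, compared against prcomp:

set.seed(3)
X <- matrix(rnorm(300), nrow = 60, ncol = 5)       # hypothetical data matrix
n <- nrow(X)
Xdev <- scale(X, center = TRUE, scale = FALSE)     # X - (1/n) 11'X
sv <- svd(Xdev / sqrt(n - 1))                      # (n-1)^{-1/2} X_dev = U D V'
p1 <- prcomp(X)
all.equal(sv$d, p1$sdev)                           # singular values = sqrt(eigenvalues of S)
all.equal(abs(Xdev %*% sv$v), abs(p1$x),           # scores agree up to column signs
          check.attributes = FALSE)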



R - output interpretation
By default, prcomp is based on the sample covariance matrix, not the correlation matrix. Set scale = TRUE to work on the correlation scale.
The Standard deviations are the square roots of the eigenvalues.
The Cumulative Proportions are the scaled inertia values.
The default plot method for prcomp produces a scree plot, a bar chart of the PC variances.
The prcomp method is doing nothing more than computing the singular value decomposition $(X - \mathbf{1}\bar x') = UD^{1/2}V'$. The matrix $V$ is the same as the "rotation" matrix computed above. The "rotated data", or principal component scores, are
$$(X - \mathbf{1}\bar x')V = UD^{1/2}V'V = UD^{1/2},$$
that is, the left singular vectors multiplied by the singular values (square roots of the eigenvalues of $S$, up to the factor $\sqrt{n-1}$). These are returned in the component p1$x.



Example
(Summarizing sample variability with two sample principal components.) A census provided information, by tract, on five socioeconomic variables for the Madison, Wisconsin, area. The data from 61 tracts are listed in Table 8.5. These data produced the following summary statistics:
$$\bar x' = [4.47,\; 3.96,\; 71.42,\; 26.91,\; 1.64],$$
where the components are: total population (thousands); professional degree (percent); employed age over 16 (percent); government employment (percent); median home value ($100,000s).



> ses <- read.table("T8-5.dat")
> ses
V1 V2 V3 V4 V5
1 2.67 5.71 69.02 30.3 1.48
....
61 6.48 4.93 74.23 20.9 1.98
> colnames(ses) <- c("TOT", "prof", "emp16", "gov", "medS")
> colMeans(ses)
TOT prof emp16 gov medS
4.469016 3.962295 71.419836 26.914754 1.635574
> cov(ses)
TOT prof emp16 gov medS
TOT 3.396 -1.102 4.305 -2.078 0.027
prof -1.102 9.672 -1.513 10.953 1.203
emp16 4.305 -1.513 55.625 -28.937 -0.043
gov -2.078 10.953 -28.937 89.066 0.957
medS 0.027 1.203 -0.043 0.957 0.318



> print(cor(ses), digits = 3)
TOT prof emp16 gov medS
TOT 1.000 -0.192 0.313 -0.119 0.026
prof -0.192 1.000 -0.065 0.373 0.685
emp16 0.313 -0.065 1.000 -0.411 -0.010
gov -0.119 0.373 -0.411 1.000 0.179
medS 0.026 0.685 -0.010 0.180 1.000
> print(p1 <- prcomp(~., ses, scale = T), digits = 3)
Standard deviations:
[1] 1.411 1.169 0.930 0.731 0.491
Rotation:
PC1 PC2 PC3 PC4 PC5
TOT 0.263 -0.463 0.7839 -0.217 0.235
prof -0.593 -0.326 -0.1641 0.145 0.703
emp16 0.326 -0.605 -0.2249 0.663 -0.194
gov -0.479 0.252 0.5507 0.572 -0.277
medS -0.493 -0.500 -0.0688 -0.407 -0.580
> print(head(p1$x), digits = 3)
PC1 PC2 PC3 PC4 PC5
1 -0.730 0.6919 -0.5685 0.397 0.289
2 -0.987 0.9996 -0.0324 1.552 -0.511
3 -2.351 -0.0797 -0.4723 -0.158 0.786
4 -0.638 -0.7766 -0.0612 -0.230 0.725
5 -1.429 -1.5252 0.2363 0.251 0.509
6 -1.914 1.8106 1.9799 -0.294 0.148
> summary(p1)
Importance of components:
PC1 PC2 PC3 PC4 PC5
Standard deviation 1.411 1.169 0.930 0.731 0.491
Proportion of Variance 0.398 0.274 0.173 0.107 0.048
Cumulative Proportion 0.398 0.672 0.845 0.952 1.000
> plot(p1)
> biplot(p1)


> sesPCAcov = prcomp(ses);
> print(sesPCAcov)
Standard deviations:
[1] 10.344 6.298 2.893 1.693 0.39334
Rotation:
PC1 PC2 PC3 PC4 PC5
TOT 0.038 -0.071 0.187 0.977 -0.057
prof -0.105 -0.129 -0.960 0.171 -0.138
emp16 0.492 -0.864 0.045 -0.091 0.004
gov -0.863 -0.480 0.153 -0.029 0.006
medS -0.009 -0.014 -0.124 0.081 0.988
> plot(sesPCAcov)
> biplot(sesPCAcov)
> screeplot(sesPCAcov)
> screeplot(p1)


[Scree plots of the PC variances, panels sesPCAcov and p1; vertical axis: Variances.]
Figure: JW Example 8.3: Eigenvalues of the PCs



[Two biplot panels of the first two PCs (axes PC1, PC2), with variable arrows TOT, prof, emp16, gov, medS and tract labels.]
Figure: JW Example 8.3: Biplots of PCs



Exercise 4.39 Page 207
> Peru <- read.table("T4-6.dat")
> colnames(Peru) <- c("Indep", "Supp", "Benev", "Conform", "Leader", "Gender", "Socio")
> Peru
Indep Supp Benev Conform Leader Gender Socio
1 27 13 14 20 11 2 1
.....
> colMeans(Peru)
Indep Supp Benev Conform Leader Gender Socio
15.669 17.076 18.784 15.500 11.7307 1.523 1.446
> print(cor(Peru), digits = 3)
         Indep   Supp  Benev Conform Leader Gender  Socio
Indep    1.000 -0.173 -0.561  -0.471  0.187  0.127  0.219
Supp    -0.173  1.000  0.018  -0.327 -0.401  0.172  0.016
Benev   -0.561  0.018  1.000   0.297 -0.492  0.061 -0.211
Conform -0.471 -0.327  0.297   1.000 -0.333 -0.088 -0.391
Leader   0.187 -0.401 -0.491  -0.333  1.000 -0.284  0.145
Gender   0.127  0.172  0.061  -0.088 -0.285  1.000  0.082
Socio    0.219  0.016 -0.211  -0.391  0.145  0.082  1.000
> print(p1 <- prcomp(~. - Socio, Peru, scale = T), digits = 3)
Standard deviations:
[1] 1.484 1.242 0.988 0.796 0.745 0.300
Rotation:
PC1 PC2 PC3 PC4 PC5 PC6
Indep -0.509 0.200 -0.398 0.312 -0.488 -0.455
Supp 0.139 0.623 0.535 0.216 0.233 -0.452
Benev 0.547 -0.033 0.013 -0.495 -0.554 -0.385
Conform 0.425 -0.427 -0.328 0.445 0.354 -0.452
Leader -0.485 -0.364 0.197 -0.498 0.331 -0.486
Gender 0.077 0.503 -0.638 -0.406 0.408 -0.021



> print(head(p1$x), digits = 3)
PC1 PC2 PC3 PC4 PC5 PC6
1 -1.117 0.002 -2.191 0.853 -0.061 -0.403
2 2.016 -0.590 -1.627 0.021 0.156 0.125
3 0.415 1.176 -0.339 0.511 0.802 0.463
4 0.067 1.667 -0.414 0.327 -0.041 0.422
5 2.077 0.959 -0.050 0.199 0.862 -0.160
6 0.583 -0.255 -1.681 0.788 0.671 -0.342
> summary(p1)
Importance of components:
PC1 PC2 PC3 PC4 PC5
Standard deviation 1.484 1.242 0.988 0.796 0.745 0
Proportion of Variance 0.367 0.257 0.163 0.106 0.092 0
Cumulative Proportion 0.367 0.624 0.787 0.892 0.985 1
> plot(p1)
> library(car)
> scatterplot.matrix(~p1$x[, 1:3] | as.factor(Peru$Socio))


[Scatterplot matrix of the first three PC scores (PC1, PC2, PC3), with points grouped by Socio level (1, 2).]
Figure: JW Exercise 4.39: Scatter plots of PCs



The number of Principal Components
How many components are important?
No definitive answer in the PCA framework.
The factor analysis model allows maximum likelihood estimation, hence hypothesis testing.
Things to consider:
the amount of total sample variance explained,
the relative sizes of the eigenvalues (the variances of the sample components),
the subject-matter interpretation of the components.
In addition, a component associated with an eigenvalue near zero, and hence deemed unimportant, may indicate an unsuspected linear dependency in the data.



Principle
To reduce the dimensionality of the problem, we would like to restrict attention to the first $k$ PCs, where $k$ is much less than $p$; but to avoid losing too much of the variability in the original data, we would also like to choose $k$ so that the proportion of variance explained by the first $k$ PCs, which may be expressed as
$$\psi_k = \frac{\lambda_1 + \cdots + \lambda_k}{\lambda_1 + \cdots + \lambda_p},$$
is close to 1. The question is how we should choose $k$ to balance these two criteria.
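In R this trade-off is a one-liner; here the built-in USArrests data set is just a convenient stand-in:

lambda <- prcomp(USArrests, scale. = TRUE)$sdev^2  # eigenvalues of R
psi <- cumsum(lambda) / sum(lambda)                # psi_k for k = 1, ..., p
k <- which(psi >= 0.9)[1]                          # smallest k with psi_k >= 0.9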



Three methods are widely proposed
1 The "screeplot": plot the ordered λk against k and decide visually
when the plot has flattened out. The name comes from an analogy
with rocks on a mountain - the initial part of the plot, in which λk is
decreasing rapidly with k , is like the side of a mountain, while the
flat portion, in which each λk is only slightly smaller than its
predecessor λk −1 , is like the rough scree at the bottom. The task
of the data analyst is to decide when the "scree" begins.
2 Choose $k$ so that $\psi_k \ge c$, for some arbitrary cutoff $c$. For some reason, everybody uses $c = 0.9$ when applying this rule. Presumably this is no less arbitrary than the convention that all tests of significance should be based on $\alpha = .05$.



Continue
3. Kaiser’s rule: exclude all PCs with eigenvalues less than
the overall average of the eigenvalues (which, in the case
of a correlation-based PCA, is always 1). This rule also
seems to be arbitrary, e.g. we could with no less logic set
the cutoff at twice, or half, or 0.9354 times, the mean
eigenvalue. (In fact it seems to be widely believed that
Kaiser’s rule leads to the inclusion of too few PCs,
whereas the screeplot often tempts one to include too
many. This would seem to be an argument in favor of
using a smaller multiplying factor than 1 in Kaiser’s rule.)

The biplot is another useful device for visualizing the interaction between the first two PCs, obtained by plotting the scores of each subject on the first two PCs as a scatterplot, overlaid with arrows for the variable loadings.



Principal components in regression
Consider a regression model of the form
$$y_i = \sum_{j=1}^{p} \beta_j x_i^{(j)} + \varepsilon_i, \tag{6}$$
in which $\{\varepsilon_i\}$ satisfy the usual assumptions (for example, uncorrelated, mean 0, common variance) but there is a large number $p$ of possible regressors $\{x_i^{(j)}\}$. The idea of PC regression is to use PCA to reduce the number of regressors prior to fitting a model of the form (6).
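A minimal sketch of principal components regression in R, on hypothetical data (the choice k = 3 is arbitrary and would in practice come from the criteria of the previous section):

set.seed(4)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - 2 * X[, 2] + rnorm(n)    # toy response
pc <- prcomp(X, scale. = TRUE)         # PCA on the regressors
k <- 3
fit <- lm(y ~ pc$x[, 1:k])             # OLS on the leading, mutually orthogonal scores
summary(fit)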



A further reason for doing this is that since the PCs are
orthogonal, the X matrix in the transformed regression problem
will be orthogonal, thus avoiding all the problems which often arise
in regression analysis due to multicollinearity. Indeed, PC
regression is sometimes cited as an alternative to ridge
regression, which has also been proposed as a way of dealing
with multicollinearity in high-dimensional regression analysis, but
which goes about the problem in a quite different way.
The main problem posed by this approach is, once again, the
selection of which PCs to include. The methods proposed in the
previous section can of course be applied, but there are additional
possibilities based on the correlation between the PCs and the
dependent variable yi . We can, for example,



Strategy
(a) Order the PCs according to their sample variances, choosing some $k$ such that we ignore all PCs after the $k$th;
(b) Order the PCs according to their correlations with $y$, again choosing a cutoff $k$;
(c) A compromise between (a) and (b), in which we order $\lambda_1 \ge \cdots \ge \lambda_p$ as usual and then test in reverse order for the significance of $a_p'X,\ a_{p-1}'X, \ldots$, stopping as soon as one is significant (Jolliffe's rule).



Strategy – continued
(d) Another strategy entirely is to use the $y_i$'s in defining the components: for example, defining $t_i^{(1)} = \sum_j x_i^{(j)} c_j^{(1)}$ with weights $c_j^{(1)}$ such that $\sum_j (c_j^{(1)})^2 = 1$, to maximize the sample correlation between $\{y_i\}$ and $\{t_i^{(1)}\}$; then choosing $t_i^{(2)} = \sum_j x_i^{(j)} c_j^{(2)}$ with $\sum_j (c_j^{(2)})^2 = 1$ to maximize the correlation with $\{y_i\}$ among all linear combinations orthogonal to $\{t_i^{(1)}\}$, and so on; followed by ordinary least squares regression of $y_i$ on $\{t_i^{(1)}\}, \{t_i^{(2)}\}, \ldots$ This is, however, really a quite different method, known as partial least squares regression, but also one which has been studied widely in recent years.
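For contrast, the first component of standard PLS1 (which maximizes sample covariance with y, a slightly different criterion than the correlation described above) has a closed form; a toy sketch:

set.seed(5)
X <- scale(matrix(rnorm(800), 100, 8))     # hypothetical standardized regressors
y <- rnorm(100)                            # hypothetical response
c1 <- drop(crossprod(X, y))                # weights proportional to X'y
c1 <- c1 / sqrt(sum(c1^2))                 # normalize so sum of c^2 = 1
t1 <- X %*% c1                             # first PLS component
fit <- lm(y ~ t1)                          # then ordinary regression on t1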



Large samples
Suppose that the rows of the data matrix $X$ are a random sample of size $n$ from $N_p(\mu, \Sigma)$.
Assume that $\Sigma$ has distinct eigenvalues
$$\lambda_1 > \lambda_2 > \cdots > \lambda_p > 0.$$
Then, approximately for large $n$,
$$\sqrt{n}(\hat\lambda - \lambda) \sim N_p\!\left(0,\; 2 \times \mathrm{diag}\{\lambda_i^2\}\right).$$
In other words, $\hat\lambda_i$ and $\hat\lambda_k$ are approximately independent for $k \ne i$, and
$$\hat\lambda_i \sim N\!\left(\lambda_i, \frac{2\lambda_i^2}{n}\right).$$



Note: if
$$\frac{n\hat\lambda}{\lambda} \sim \chi^2_n,$$
then similarly, approximately,
$$\hat\lambda \sim N\!\left(\lambda, \frac{2\lambda^2}{n}\right).$$
So we could also state that, approximately,
$$\frac{n\hat\lambda_i}{\lambda_i} \sim \chi^2_n.$$
Simulations suggest that this is a better approximation for small $n$, if the eigenvalues are well separated.
Asymptotics suggest that the degrees of freedom for $\hat\lambda_i$ could be $n - i + 1$ instead of $n$.
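The normal approximation translates directly into an approximate confidence interval for lambda_i; a sketch with illustrative numbers:

n <- 103                                  # illustrative sample size
z <- qnorm(0.975)
lambda_hat <- 2.5                         # an illustrative sample eigenvalue
lambda_hat * c(1 - z * sqrt(2 / n),       # approximate 95% interval based on
               1 + z * sqrt(2 / n))       # lambda_hat ~ N(lambda, 2 lambda^2 / n)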



Also $\sqrt{n}(\hat e_i - e_i)$ is approximately $N_p(0, E_i)$, where
$$E_i = \lambda_i \sum_{k=1,\,k\ne i}^{p} \frac{\lambda_k}{(\lambda_k - \lambda_i)^2}\, e_k e_k'.$$



Nonlinear PCA: Kernel PCA
Kernels: A useful tool for modern multivariate analysis
Kernel PCA as a nonlinear feature extractor has proven powerful
as a preprocessing step for classification algorithms.
By the use of integral operator kernel functions, one can efficiently
compute principal components in high dimensional feature
spaces, related to input space by some nonlinear map.



Gaussian process
A stochastic process $(Z_t)_{t\in T}$ on $T$ is called a Gaussian process if, for each finite subset $A = \{t_1, \ldots, t_k\} \subset T$, the random vector
$$(Z_{t_1}, \ldots, Z_{t_k}) \sim N_k(\mu_A, \Sigma_A).$$
The matrix
$$\Sigma_{A,ij} = \mathrm{Cov}(Z_{t_i}, Z_{t_j}) = R(t_i, t_j)$$
is the covariance matrix and is determined by a function $R$ called the covariance kernel of $Z$.



Properties of kernels
Kernels which have successfully been used in Support Vector Machines include polynomial kernels
$$k(x, y) = (x \cdot y)^d;$$
radial basis functions (Gaussian kernel)
$$k(x, y) = \exp(-\|x - y\|^2/(2\sigma^2));$$
and sigmoid kernels
$$k(x, y) = \tanh(\kappa(x \cdot y) + \Theta).$$
The key property of the kernel is that it is positive semi-definite. That is, for each subset $A = \{t_1, \ldots, t_k\} \subset T$ and sequence $\{c_1, \ldots, c_k\}$,
$$\sum_{i,j=1}^{k} c_i R(t_i, t_j) c_j = \mathrm{Var}\!\left(\sum_{i=1}^{k} c_i Z_{t_i}\right) \ge 0.$$
Properties of kernels
Assume for the moment that our data mapped into feature space, $\Phi(x_1), \ldots, \Phi(x_n)$, are centered, i.e. $\sum_{k=1}^{n}\Phi(x_k) = 0$. We want to do PCA for the covariance matrix
$$C = \frac{1}{n}\sum_{k=1}^{n} \Phi(x_k)\Phi(x_k)',$$
where the function $\Phi(x)$ maps the data nonlinearly into a feature space.
We want $C\varepsilon = \lambda\varepsilon$. We have
$$C\varepsilon = \frac{1}{n}\sum_{k=1}^{n} \Phi(x_k)\Phi(x_k)'\varepsilon = \frac{1}{n}\sum_{k=1}^{n} \Phi(x_k)\alpha_k, \quad \alpha_k = \Phi(x_k)'\varepsilon,$$
i.e. all solutions $\varepsilon$ (with $\lambda \ne 0$) lie in the span of $\Phi(x_1), \ldots, \Phi(x_n)$.




This implies that we may consider the equivalent system
$$\lambda(\Phi(x_k)\cdot\varepsilon) = (\Phi(x_k)\cdot C\varepsilon) \quad \text{for all } k = 1, \ldots, n, \tag{7}$$
and that there exist coefficients $\alpha_1, \ldots, \alpha_n$ such that
$$\varepsilon = \sum_{k=1}^{n} \alpha_k \Phi(x_k).$$
Defining
$$K_{ij} = (\Phi(x_i)\cdot\Phi(x_j)),$$
we arrive at
$$n\lambda K\alpha = K^2\alpha \tag{8}$$
or equivalently
$$n\lambda\alpha = K\alpha. \tag{9}$$



Define
$$\tilde K = \left(I - \frac{1}{n}\mathbf{1}\mathbf{1}'\right)K.$$
The first kernel PC is to maximize the sample variance of the projections $(\Phi(x_i)\cdot\varepsilon)$ with $\varepsilon = \sum_j \alpha_j\Phi(x_j)$, namely
$$\frac{1}{n-1}\,\alpha'\tilde K'\tilde K\alpha = \frac{1}{n-1}\,\alpha' K\left(I - \frac{1}{n}\mathbf{1}\mathbf{1}'\right)K\alpha,$$
subject to $\alpha'\tilde K\alpha = 1$.
The $j$th kernel PC eigenvector $\alpha_j$ is to maximize
$$\frac{1}{n-1}\,\alpha' K\left(I - \frac{1}{n}\mathbf{1}\mathbf{1}'\right)K\alpha$$
subject to $\alpha'\tilde K\alpha = 1$ and $\alpha'\tilde K\alpha_l = 0$, $0 < l < j$.
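A compact R sketch of kernel PCA with the Gaussian kernel. This version double-centers K, a common convention that differs slightly from the one-sided centering above; the bandwidth sigma and the iris data are purely illustrative:

kernel_pca <- function(X, sigma = 1, m = 2) {
  X <- as.matrix(X)
  n <- nrow(X)
  D2 <- as.matrix(dist(X))^2                  # squared Euclidean distances
  K <- exp(-D2 / (2 * sigma^2))               # K_ij = k(x_i, x_j)
  H <- diag(n) - matrix(1, n, n) / n          # centering matrix I - 11'/n
  Kc <- H %*% K %*% H                         # kernel matrix of centered features
  eg <- eigen(Kc, symmetric = TRUE)
  alpha <- sweep(eg$vectors[, 1:m, drop = FALSE], 2,
                 sqrt(pmax(eg$values[1:m], 1e-12)), "/")   # normalize coefficients
  Kc %*% alpha                                # scores on the first m kernel PCs
}
scores <- kernel_pca(iris[, 1:4], sigma = 1, m = 2)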



Factor Analysis
Introduction
Factor analysis is an alternative to principal components analysis
with which it is often confused, though the underlying principles
behind the two methods are completely different.
Principal Components Analysis, e.g. of stock price movements,
sometimes suggests that several variables may be responding to
a small number of underlying forces.
Like PCA, the essential purpose of factor analysis is to describe, if
possible, the covariance relationships among many variables in
terms of a few underlying, but unobservable, random quantities
called factors which are assumed to exist.
Unlike PCA, which is basically a model-free descriptive technique, factor analysis is model-based and its credibility depends a lot both on the model itself and on the extent to which a reasonable interpretation can be placed on the factors which are identified.


The Orthogonal Factor Model
The observable random vector X , with p components, has mean
µ and covariance matrix Σ.
The factor model postulates that X is linearly dependent upon a
few unobservable random variables F1 , F2 , . . . , Fm , called common
factors, and p additional sources of variation ε1 , ε2 , . . . , εp , called
errors, sometimes, specific factors.



The Orthogonal Factor Model
In particular, the factor analysis model is
$$\begin{aligned}
X_1 - \mu_1 &= l_{11}F_1 + l_{12}F_2 + \cdots + l_{1m}F_m + \varepsilon_1\\
X_2 - \mu_2 &= l_{21}F_1 + l_{22}F_2 + \cdots + l_{2m}F_m + \varepsilon_2\\
&\;\;\vdots\\
X_p - \mu_p &= l_{p1}F_1 + l_{p2}F_2 + \cdots + l_{pm}F_m + \varepsilon_p
\end{aligned} \tag{10}$$
or in matrix notation
$$X - \mu = LF + \varepsilon. \tag{11}$$
The coefficient $l_{ij}$ is called the loading of the $i$th variable on the $j$th factor, so the matrix $L$ is the matrix of factor loadings.



Assumptions
To make this identifiable, we further assume, with no loss of generality,
$$E(F) = 0, \quad \mathrm{Cov}(F) = I,$$
$$E(\varepsilon) = 0, \quad \mathrm{Cov}(\varepsilon, F) = 0,$$
and, with serious loss of generality, $\mathrm{Cov}(\varepsilon) = \Psi$, where $\Psi$ is a diagonal matrix.
In terms of the observable variables $X$, these assumptions mean that
$$E(X) = \mu, \quad \mathrm{Cov}(X) = \Sigma = LL' + \Psi,$$
or
$$\mathrm{Var}(X_i) = l_{i1}^2 + \cdots + l_{im}^2 + \psi_i = h_i^2 + \psi_i,$$
$$\mathrm{Cov}(X_i, X_k) = l_{i1}l_{k1} + \cdots + l_{im}l_{km}.$$
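A toy numerical check of this covariance structure in R (the loadings and specific variances below are invented so that the implied variances come out as 1):

L <- matrix(c(0.9, 0.0,
              0.8, 0.3,
              0.1, 0.7,
              0.0, 0.8), nrow = 4, byrow = TRUE)  # p = 4, m = 2 loadings
Psi <- diag(c(0.19, 0.27, 0.50, 0.36))            # specific variances
Sigma <- L %*% t(L) + Psi
diag(Sigma)                                       # h_i^2 + psi_i: all equal to 1 here
rowSums(L^2)                                      # the communalities h_i^2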


Remarks
Usually $X$ is standardized, so $\Sigma = R$. $h_i^2 = l_{i1}^2 + \cdots + l_{im}^2$ is termed the $i$th communality, the portion of the variance of the $i$th variable contributed by the $m$ common factors. $\psi_i$ is called the specific variance.

Assumption
The observable $X$ and the unobservable $F$ are related by
$$\mathrm{Cov}(X, F) = L,$$
or
$$\mathrm{Cov}(X_i, F_j) = l_{ij}.$$
Note that if $T$ is $(m \times m)$ orthogonal, then $(LT)(LT)' = LL'$, so loadings $LT$ generate the same $\Sigma$ as $L$: loadings are not unique.



Existence of Factor Representation
For any $p$, every $(p \times p)$ $\Sigma$ can be factorized as
$$\Sigma = LL'$$
for a $(p \times p)$ $L$, which is a factor representation with $m = p$ and $\Psi = 0$; however, $m = p$ is not much use – we usually want $m \ll p$.
For $p = 3$, every $(3 \times 3)$ $\Sigma$ can be represented as
$$\Sigma = LL' + \Psi$$
for a $(3 \times 1)$ $L$, which is a factor representation with $m = 1$, but $\Psi$ may have negative elements.

Illustrate Example 9.1, Page 484.



Nonexistence of a proper solution
The primary question in factor analysis is whether the data are
consistent with a prescribed structure.
Unfortunately for the factor analyst, most covariance matrices cannot be factored as $LL' + \Psi$, where the number of factors $m$ is much less than $p$.
Illustrate Example 9.2 on Page 486.



Methods of Estimation
In general, we can only approximate $\Sigma$ by $LL' + \Psi$.
Principal components method: the spectral decomposition of $\Sigma$ is
$$\Sigma = E\Lambda E' = (E\Lambda^{1/2})(\Lambda^{1/2}E') = LL'$$
with $m = p$.
If $\lambda_1 + \lambda_2 + \cdots + \lambda_m \gg \lambda_{m+1} + \cdots + \lambda_p$, and $L^{(m)}$ is the first $m$ columns of $L$, then
$$\Sigma \approx L^{(m)}L^{(m)\prime}$$
gives such an approximation with $\Psi = 0$.



Methods of Estimation
The remainder term $\Sigma - L^{(m)}L^{(m)\prime}$ is non-negative definite, so its diagonal entries are non-negative; we can get a closer approximation as
$$\Sigma \approx L^{(m)}L^{(m)\prime} + \Psi^{(m)},$$
where $\Psi^{(m)} = \mathrm{diag}(\Sigma - L^{(m)}L^{(m)\prime})$.

A modified approach: Principal Factor Solution
We can sometimes achieve higher communalities $(= \mathrm{diag}(LL'))$ by either:
specifying an initial estimate of the communalities,
iterating the solution, or both.



Methods of Estimation
Suppose we are working with $R$. Given initial communalities $h_i^{*2}$, form the reduced correlation matrix
$$R_r = \begin{pmatrix} h_1^{*2} & r_{12} & \cdots & r_{1p} \\ r_{12} & h_2^{*2} & \cdots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{1p} & r_{2p} & \cdots & h_p^{*2} \end{pmatrix}.$$
Now use the spectral decomposition of $R_r$ to find its best rank-$m$ approximation
$$R_r \approx L_r^* L_r^{*\prime}.$$



Methods of Estimation
New communalities are
$$\tilde h_i^{*2} = \sum_{j=1}^{m} l_{ij}^{*2}.$$
Find $\Psi$ by equating the diagonal terms:
$$\psi_i^* = 1 - \tilde h_i^{*2},$$
or
$$\Psi = I - \mathrm{diag}(L_r^* L_r^{*\prime}).$$
This is the Principal Factor solution.
The Principal Component solution is the special case where the initial communalities are all 1.


Iterated Principal Factors
One issue with both Principal Components and Principal Factors: even if $S$ or $R$ is exactly of the form $LL' + \Psi$ (or, more likely, approximately of that form), neither method reproduces $L$ and $\Psi$ (unless you specify the true communalities).
Solution: iterate!
Use the new communalities as initial communalities to get another set of Principal Factors.
Repeat until nothing much changes, as in the sketch below.
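A minimal R sketch of the iterated principal-factor loop, with SMC starting communalities (the function and argument names are mine, not from any package):

principal_factors <- function(R0, m, tol = 1e-6, maxit = 500) {
  h2 <- 1 - 1 / diag(solve(R0))            # SMC starting communalities
  for (it in 1:maxit) {
    Rr <- R0
    diag(Rr) <- h2                         # reduced correlation matrix
    eg <- eigen(Rr, symmetric = TRUE)
    L <- eg$vectors[, 1:m, drop = FALSE] %*%
      diag(sqrt(pmax(eg$values[1:m], 0)), m)   # rank-m loadings
    h2_new <- rowSums(L^2)                 # updated communalities
    if (max(abs(h2_new - h2)) < tol) break
    h2 <- h2_new
  }
  list(loadings = L, psi = 1 - h2_new)
}
# e.g. principal_factors(cor(stock), m = 2)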



Example (JW Example 9.3)
The weekly rates of return for five stocks (JP Morgan, Citibank, Wells Fargo, Royal Dutch Shell, and Exxon Mobil) listed on the New York Stock Exchange were determined for the period January 2004 through December 2005. The weekly rates of return are defined as (current week closing price − previous week closing price)/(previous week closing price), adjusted for stock splits and dividends. The data are listed in Table 8.4. The observations in 103 successive weeks appear to be independently distributed, but the rates of return across stocks are correlated, because, as one expects, stocks tend to move together in response to general economic conditions. Do factor analysis for this data.



Correlation matrix
$$R = \begin{pmatrix}
1.0000 & 0.6323 & 0.5105 & 0.1146 & 0.1545 \\
0.6323 & 1.0000 & 0.5741 & 0.3223 & 0.2127 \\
0.5105 & 0.5741 & 1.0000 & 0.1825 & 0.1462 \\
0.1146 & 0.3223 & 0.1825 & 1.0000 & 0.6834 \\
0.1545 & 0.2127 & 0.1462 & 0.6834 & 1.0000
\end{pmatrix}$$

Estimation of factor loadings
The decomposition $\Sigma = LL' + \Psi$ is to be estimated to match the sample correlation matrix $R$ as closely as possible. In particular we want to reproduce the large correlations in this matrix, between J P Morgan and Citibank, and between Royal Dutch Shell and Exxon Mobil. Each of these will require a separate factor (column of the $L$ matrix), so a solution of at least two factors is probably needed, and we will try a two-factor solution.


SAS program:
/* Table 8.4 of J&W: Stock return data */
/* ods html file = ’stock.html’; */
options nodate nonumber linesize = 80;
data stock;
infile ’T8_4a.txt’;
input JPM CITI WFargo RDShell Exxon;
run;
proc factor data = stock method = prin priors = smc;
title ’Method = Principal Factors’;
var JPM -- Exxon;
run;



SAS parameter specifications
In proc factor, use method = prin as for the Principal Component
solution, but also specify the initial communalities:
the priors = . . . option on the proc factor statement specifies a
method, such as squared multiple correlations (priors = SMC);
the priors statement provides explicit numerical values.



Method = Principal Factors
The FACTOR Procedure
Initial Factor Method: Principal Factors

Prior Communality Estimates: SMC


JPM CITI WFargo RDShell Exxon
0.45171141 0.53436120 0.36636018 0.51724158 0.47826295

Eigenvalues of the Reduced Correlation Matrix:


Total = 2.34793731 Average = 0.46958746

Eigenvalue Difference Proportion Cumulative

1 1.90936967 1.02131465 0.8132 0.8132


2 0.88805502 0.97304209 0.3782 1.1914
3 -.08498706 0.03510722 -0.0362 1.1552
4 -.12009429 0.12431175 -0.0511 1.1041
5 -.24440604 -0.1041 1.0000

2 factors will be retained by the PROPORTION criterion.



Factor Pattern

Factor1 Factor2

JPM 0.63698 -0.35710


CITI 0.75467 -0.25567
WFargo 0.60547 -0.28118
RDShell 0.55603 0.55392
Exxon 0.50827 0.55612



Method = Principal Factors

The FACTOR Procedure


Initial Factor Method: Principal Factors

Variance Explained by Each Factor

Factor1 Factor2

1.9093697 0.8880550

Final Communality Estimates: Total = 2.797425

JPM CITI WFargo RDShell Exxon

0.53325801 0.63489595 0.44565972 0.61599835 0.56761267




Likelihood Methods:
If we assume that $X \sim N_p(\mu, \Sigma)$ with $\Sigma = LL' + \Psi$, we can fit by maximum likelihood:
$$\begin{aligned}
L(\mu, \Sigma) &= (2\pi)^{-\frac{np}{2}}|\Sigma|^{-\frac{n}{2}} \exp\Big\{-\tfrac12 \mathrm{tr}\Big[\Sigma^{-1}\Big(\sum_{j=1}^{n}(x_j-\bar x)(x_j-\bar x)' + n(\bar x-\mu)(\bar x-\mu)'\Big)\Big]\Big\}\\
&= (2\pi)^{-\frac{(n-1)p}{2}}|\Sigma|^{-\frac{n-1}{2}} \exp\Big\{-\tfrac12 \mathrm{tr}\Big[\Sigma^{-1}\sum_{j=1}^{n}(x_j-\bar x)(x_j-\bar x)'\Big]\Big\} \tag{12}\\
&\quad\times (2\pi)^{-\frac{p}{2}}|\Sigma|^{-\frac12} \exp\Big\{-\tfrac{n}{2}(\bar x-\mu)'\Sigma^{-1}(\bar x-\mu)\Big\}.
\end{aligned}$$
$\hat\mu = \bar x$.
$L$ is not identified without a constraint (uniqueness condition) such as
$$L'\Psi^{-1}L = \Delta, \quad \text{a diagonal matrix.}$$



Likelihood Methods:
Still no closed-form equation for $\hat L$; numerical optimization is required.
The R function factanal does maximum likelihood factor analysis, adding the assumption that $X \sim N_p(\mu, \Sigma)$.

Output description:
By default the program will convert the sample covariance matrix $S$ to a correlation matrix before computing. If you want to override this behavior, you can choose the matrix yourself using the covmat argument.
The uniquenesses are the estimates of the diagonal elements of $\Psi$. In the textbook, these are called specific variances. The larger the specific variance, the less a particular variable is determined by the latent factors.
At the foot of the loadings, the SS loadings are the column sums of squares $\sum_i \hat l_{ij}^2$.


> stock <- read.table(’T8-4.DAT’)
> colnames(stock) <- c("J P Morgan", "Citibank", "Wells Fargo",
+                      "Royal Dutch Shell", "Exxon Mobil")
> f2 <- factanal(stock, factor = 2, rotation = "none")
> f2

Call:
factanal(x = stock, factors = 2, rotation = "none")

Uniquenesses:
J P Morgan Citibank Wells Fargo Royal Dutch Shell
0.417 0.275 0.542 0.005
Exxon Mobil
0.530



Loadings:
Factor1 Factor2
J P Morgan 0.121 0.754
Citibank 0.328 0.786
Wells Fargo 0.188 0.650
Royal Dutch Shell 0.997
Exxon Mobil 0.685

Factor1 Factor2
SS loadings 1.622 1.610
Proportion Var 0.324 0.322
Cumulative Var 0.324 0.646

Test of the hypothesis that 2 factors are sufficient.

The chi square statistic is 1.97 on 1 degree of freedom.
The p-value is 0.16


The maximum likelihood estimates of the communalities are
$$\hat h_i^2 = \hat l_{i1}^2 + \hat l_{i2}^2 + \cdots + \hat l_{im}^2, \quad i = 1, 2, \ldots, p, \tag{13}$$
so
$$\left(\begin{array}{c}\text{Proportion of total sample}\\ \text{variance due to } j\text{th factor}\end{array}\right) = \frac{\hat l_{1j}^2 + \hat l_{2j}^2 + \cdots + \hat l_{pj}^2}{s_{11} + s_{22} + \cdots + s_{pp}}. \tag{14}$$
The loadings are the estimate of $L$, in this case computed as if $k = 2$ factors were sufficient. The communality, which is one minus the specific variance, is also the row sum of squares $\sum_j \hat l_{ij}^2$, and so gives the same information as the specific variance. Entries in $\hat L$ that are shown as blank are really just small: the default is to display a blank if $|\hat l_{ij}| < 0.1$.
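These quantities can be recovered from the f2 object above; a quick check (the unname() calls just avoid attribute mismatches):

L <- loadings(f2)                          # p x m matrix of estimated loadings
h2 <- rowSums(L^2)                         # communalities, equation (13)
all.equal(unname(h2), unname(1 - f2$uniquenesses))  # communality = 1 - specific variance
colSums(L^2) / nrow(L)                     # equation (14) with standardized variables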



$\hat\Sigma = \hat L\hat L' + \hat\Psi$
> L <- loadings(f2)
> Psi <- diag(f2$uniquenesses)
> (sighat <- (L %*% t(L) + Psi))
J P Morgan Citibank WFargo Shell Exxon Mobil
J P Morgan 0.9999999 0.6322803 0.5130616 0.1149345 0.1024805
Citibank 0.6322803 1.0000000 0.5725336 0.3220805 0.2457536
Wells Fargo 0.5130616 0.5725336 0.9999999 0.1825087 0.1456520
Shell 0.1149345 0.3220805 0.1825087 1.0000016 0.6832558
Exxon Mobil 0.1024805 0.2457536 0.1456520 0.6832558 0.9999997



Correlation matrix R
If a two-factor solution is adequate, the fitted matrix above should approximate R, so the residual matrix below should be near zero:

> round(R - sighat, 3)


J P Morgan Citibank Wells Fargo Shell Exxon Mobi
J P Morgan 0.000 0.000 -0.003 0 0.052
Citibank 0.000 0.000 0.002 0 -0.033
Wells Fargo -0.003 0.002 0.000 0 0.001
Royal Dutch Shell 0.000 0.000 0.000 0 0.000
Exxon Mobil 0.052 -0.033 0.001 0 0.000



Testing hypothesis of dimensionality
If a few of the correlations are not well-approximated, this suggests that the two-factor solution may not be adequate.
We can also test hypotheses about $m$ with the likelihood ratio test (Bartlett's correction improves the $\chi^2$ approximation):
$$H_0: m = m_0 \quad \text{vs.} \quad H_A: m > m_0.$$
The log-likelihood ratio method:
$$-2\ln\Lambda = -2\ln\left(\frac{\text{maximized likelihood under } H_0}{\text{maximized likelihood}}\right) = -2\ln\left(\frac{|\hat\Sigma|}{|S_n|}\right)^{-n/2} + n\left[\mathrm{tr}(\hat\Sigma^{-1}S_n) - p\right] \tag{15}$$
with $\frac12[(p - m_0)^2 - p - m_0]$ degrees of freedom.
Degrees of freedom $> 0 \iff m_0 < \frac12\left(2p + 1 - \sqrt{8p + 1}\right)$.



Bartlett's correction
It can be viewed as a test of dimensionality: a small p-value would suggest that the two-factor model is not adequate.
Supplement 9A indicates that $\mathrm{tr}(\hat\Sigma^{-1}S_n) - p = 0$ provided that $\hat\Sigma = \hat L\hat L' + \hat\Psi$ is the maximum likelihood estimate of $\Sigma = LL' + \Psi$. Thus we use
$$-2\ln\Lambda = n\ln\left(\frac{|\hat\Sigma|}{|S_n|}\right).$$
Using Bartlett's correction, we reject $H_0$ at the $\alpha$ level of significance if
$$\left(n - 1 - (2p + 4m + 5)/6\right)\ln\frac{|\hat L\hat L' + \hat\Psi|}{|S_n|} > \chi^2_{[(p-m)^2 - p - m]/2}(\alpha). \tag{16}$$

> det(R)
[1] 0.1752028
> det(L%*%t(L)+Psi)
[1] 0.1788174
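Plugging these determinants into (16), with n = 103, p = 5, m = 2, reproduces the test reported by factanal up to rounding:

n <- 103; p <- 5; m <- 2
stat <- (n - 1 - (2 * p + 4 * m + 5) / 6) * log(0.1788174 / 0.1752028)
df <- ((p - m)^2 - p - m) / 2              # = 1
c(statistic = stat,                        # about 2.0, close to the 1.97 above
  p.value = pchisq(stat, df, lower.tail = FALSE))  # about 0.16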
Scaling and the Likelihood
If the maximum likelihood estimates for a data matrix $X$ are $\hat L$ and $\hat\Psi$, and
$$Y_{n\times p} = X_{n\times p} D$$
is a scaled data matrix, with the columns of $X$ scaled by the entries of the diagonal matrix $D$, then the maximum likelihood estimates for $Y$ are $D\hat L$ and $D^2\hat\Psi$.
That is, the MLEs are invariant to scaling:
$$\hat\Sigma_Y = D\hat\Sigma_X D.$$
Proof: $L_Y(\mu, \Sigma) = L_X(D^{-1}\mu, D^{-1}\Sigma D^{-1})$.
No distinction between covariance and correlation matrices.



Weighting and the Likelihood
Recall the uniqueness condition
$$L'\Psi^{-1}L = \Delta, \quad \text{a diagonal matrix.}$$
Write
$$\Sigma^* = \Psi^{-1/2}\Sigma\Psi^{-1/2} = \Psi^{-1/2}(LL' + \Psi)\Psi^{-1/2} = (\Psi^{-1/2}L)(\Psi^{-1/2}L)' + I_p = L^*L^{*\prime} + I_p.$$
Call $\Sigma^*$ the weighted covariance matrix.



Weighting and eigenvectors
Note:
$$\Sigma^* L^* = L^*L^{*\prime}L^* + L^* = L^*\Delta + L^* = L^*(\Delta + I_m),$$
so the columns of $L^*$ are the (unnormalized) eigenvectors of $\Sigma^*$, the weighted covariance matrix.
Also
$$(\Sigma^* - I_p)L^* = L^*\Delta,$$
so the columns of $L^*$ are also the eigenvectors of
$$\Sigma^* - I_p = \Psi^{-1/2}(\Sigma - \Psi)\Psi^{-1/2},$$
the weighted reduced covariance matrix.
Since the likelihood analysis is transparent to scaling, the weighted reduced covariance matrix is the same as the weighted reduced correlation matrix.
Factor Rotation
In the orthogonal factor model, factor loadings are not always easily interpreted.
Ideally, we should like to see a pattern of loadings such that each variable loads highly on a single factor and has small to moderate loadings on the remaining factors.
That is, each row of $L$ should have a single large entry.
We can choose $T$ to make the rotated loadings $LT$ more readily interpreted. Among the orthogonal $T$, a common choice is the varimax rotation proposed by Kaiser. Define $\tilde l_{ij}^* = \hat l_{ij}^*/\hat h_i$ to be the rotated coefficients scaled by the square root of the communalities. Then the (normal) varimax procedure selects the orthogonal transformation $T$ that makes
$$V = \frac{1}{p}\sum_{j=1}^{m}\left[\sum_{i=1}^{p}\tilde l_{ij}^{*4} - \left(\sum_{i=1}^{p}\tilde l_{ij}^{*2}\right)^2\!\Big/\, p\right] \tag{17}$$
as large as possible.
Factor Rotation
Scaling the rotated coefficients $\hat l_{ij}^*$ has the effect of giving variables with small communalities relatively more weight in the determination of simple structure. After the transformation $T$ is determined, the loadings $\tilde l_{ij}^*$ are multiplied by $\hat h_i$ so that the original communalities are preserved.
Note that rotation changes neither $\hat\Sigma$ nor $\hat\Psi$, and hence the communalities are also unchanged.
Note that the term in brackets is proportional to the variance of the $\tilde l_{ij}^{*2}$ in column $j$. Making this variance large tends to produce two clusters of scaled loadings, one of small values and one of large values.
So each column of the rotated loading matrix tends to contain:
a group of large loadings, which identify the variables associated with the factor;
the remaining loadings, which are small.


R code
> varimax(loadings(f2))
$loadings
Loadings:
Factor1 Factor2
J P Morgan 0.763
Citibank 0.232 0.819
Wells Fargo 0.108 0.668
Royal Dutch Shell 0.991 0.113
Exxon Mobil 0.677 0.108
Factor1 Factor2
SS loadings 1.507 1.725
Proportion Var 0.301 0.345
Cumulative Var 0.301 0.646
$rotmat
[,1] [,2]
[1,] 0.9927706 0.1200276
[2,] -0.1200276 0.9927706



Factor scores
Interpretation of a factor analysis is usually based on the factor loadings.
The estimated values of the common factors, called factor scores, may also be required. These quantities are often used for diagnostic purposes, as well as inputs to a subsequent analysis.
Factor scores are not estimates of unknown parameters in the usual sense. Rather, they are estimates of values for the unobserved random factor vectors $F_j$, $j = 1, 2, \ldots, n$. That is, the factor score $\hat f_j$ = estimate of the value $f_j$ attained by $F_j$ (the $j$th case).
Normally the factor score approaches have two elements in common:
They treat the estimated factor loadings $\hat l_{ij}$ and specific variances $\hat\psi_i$ as if they were true values.
They involve linear transformations of the original data, perhaps centered or standardized. Typically, the estimated rotated loadings, rather than the original estimated loadings, are used to compute factor scores.
Two estimation methods:
Bartlett's Weighted Least Squares
$$X - \mu = LF + \varepsilon \;\Rightarrow\; \hat f = (L'\Psi^{-1}L)^{-1}L'\Psi^{-1}(X - \mu).$$
With $L$, $\Psi$, and $\mu$ replaced by estimates, and for the $j$th observation $x_j$, this gives
$$\hat f_j = (\hat L'\hat\Psi^{-1}\hat L)^{-1}\hat L'\hat\Psi^{-1}(x_j - \bar x)$$
as estimated values of the factors.
The sample mean of the scores is 0.
If the factor loadings are ML estimates, $\hat L'\hat\Psi^{-1}\hat L$ is a diagonal matrix $\hat\Delta$, and the sample covariance matrix of the scores is
$$\frac{n}{n-1}\left(I + \hat\Delta^{-1}\right).$$
In particular, the sample correlations of the factor scores are zero.
Two estimation methods:
Regression Method
$X$ and $F$ have a joint multivariate normal distribution:
$$\mathrm{Var}\begin{pmatrix} X - \mu \\ F \end{pmatrix} = \begin{pmatrix} \Sigma = LL' + \Psi & L \\ L' & I_m \end{pmatrix},$$
and so the regression estimate of the factor score is
$$E(F \mid X = x) = L'S^{-1}(x - \bar x),$$
which leads to
$$\hat f_j = \hat L'(\hat L\hat L' + \hat\Psi)^{-1}(x_j - \bar x) = (I + \hat L'\hat\Psi^{-1}\hat L)^{-1}\hat L'\hat\Psi^{-1}(x_j - \bar x).$$
The two methods are related by
$$\hat f_j^{LS} = \left[I + (\hat L'\hat\Psi^{-1}\hat L)^{-1}\right]\hat f_j^{R}.$$
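A hand computation of the regression scores for the first observation, using the standardized data (this follows the formula above; factanal's internal conventions may differ in small details):

L <- loadings(f2)
Psi <- diag(f2$uniquenesses)
z1 <- scale(stock)[1, ]                          # standardized first case
drop(t(L) %*% solve(L %*% t(L) + Psi, z1))       # should be close to f2$scores[1, ]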
> f2 <- factanal(stock, factor = 2, rotation = "none",
scores="regression")
> pairs(f2$scores)

[Pairs plot of the two factor-score columns, Factor1 vs. Factor2, each roughly on the range −2 to 2.]
Figure: Scatter plot of factor scores.




Perspectives and a Strategy for Factor Analysis
1 At the present time, factor analysis still maintains the flavor of an art, and no single strategy should yet be "chiseled into stone". We suggest and illustrate one reasonable option:
2 Perform a principal component factor analysis. This method is particularly appropriate for a first pass through the data. (It is not required that R or S be nonsingular.)
(a) Look for suspicious observations by plotting the factor scores. Also, calculate standardized scores for each observation and squared distances.
(b) Try a varimax rotation.
3 Perform a maximum likelihood factor analysis, including a varimax rotation.



4 Compare the solutions obtained from the two factor analyses.
(a) Do the loadings group in the same manner?
(b) Plot factor scores obtained for principal components against scores from the maximum likelihood analysis.
5 Repeat the first three steps for other numbers of common factors m. Do extra factors necessarily contribute to the understanding and interpretation of the data?
6 For large data sets, split them in half and perform a factor analysis on each part. Compare the two results with each other and with that obtained from the complete data set to check the stability of the solution.



Further thoughts: Nonlinear PCA and FA

Quotient correlation PCA and M4 factor analysis
We may consider using the quotient correlation matrix to perform PC analysis.
M4 factor analysis: consider the following model
$$Y_i = (a_{i1}Z_1 \vee a_{i2}Z_2 \vee \cdots \vee a_{im}Z_m)S_i,$$
where the $Z_j$ are unit Fréchet random variables and $S_i$ is a Lévy random variable. Taking a logarithm transformation on both sides, we have a factor-model-like decomposition.
