Cause and Correlation in Biology - A User's Guide To Path Analysis, Structural Equations and Causal Inference
When only one variable is measured on each observational unit, then one obtains a univariate distribution. When one measures more than one variable on each observational unit (say, both the protein content and the average seed weight per plant) then one obtains a multivariate distribution (in this case, a bivariate normal distribution). If one obtains the relative frequencies of values of each unique set of multivariate observations, then one has a multivariate probability distribution. Again, there are many multivariate mathematical functions that approximate such multivariate probability distributions. Figure 2.7 shows two versions of a bivariate normal distribution.
2.6 Probabilistic independence
By definition, two random variables (X, Y) are (unconditionally) independent if the joint probability density of X and Y is the product of the probability density of X and the probability density of Y. Thus:

If I(X, ∅, Y) then P(X,Y) = P(X)P(Y)

For instance, if X and Y are each distributed as a standard normal distribution and they are also independent (Figure 2.7A), then the joint probability distribution can be obtained as follows:

f(X; 0, 1) = (1/√(2π)) e^(−X²/2)
f(Y; 0, 1) = (1/√(2π)) e^(−Y²/2)
f(X, Y) = f(X; 0, 1) · f(Y; 0, 1) = (1/2π) e^(−(X² + Y²)/2)
If two random variables (X, Y) are not (unconditionally) independent then the joint probability density of X and Y is not the product of the two univariate probability densities. If the variables are dependent then one can't simply multiply one univariate probability density by the other, because we have to take into consideration the interaction between the two (Figure 2.7B).

Figure 2.7. Two different versions of a bivariate normal probability distribution. (A) The joint distribution of two independent, normally distributed random variables. (B) The joint distribution of two normally distributed random variables that are not independent.

Figure 2.7A shows the bivariate normal density function of two independent variables. Note that the mean value of Y is the same (0) no matter what the value of X, and vice versa; the value of one variable doesn't change the average value (expected value) of the other variable. Figure 2.7B shows the bivariate normal density function of two dependent variables. Here, the mean value of Y is not independent of the value of X.
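As a quick numerical check of the factorisation above, here is a minimal sketch (assuming Python with NumPy and SciPy, which are not used in the text) comparing the bivariate standard normal density of two independent variables with the product of the two univariate densities.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# A few arbitrary (x, y) points at which to evaluate the densities.
points = np.array([[0.0, 0.0], [1.0, -0.5], [2.0, 1.5]])

# Joint density of two independent standard normal variables (Figure 2.7A):
# zero means, unit variances, zero covariance.
joint = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.0], [0.0, 1.0]])

for x, y in points:
    product = norm.pdf(x) * norm.pdf(y)   # f(X;0,1) * f(Y;0,1)
    print(f"joint={joint.pdf([x, y]):.6f}  product={product:.6f}")
    # The two columns agree because, under independence,
    # the joint density factors into the product of the marginals.
```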
Similarly, X and Y are independent, conditional on (given) a set of other variables Z, if the joint probability density of X and Y given Z equals the product of the probability density of X given Z and the probability density of Y given Z, for all values of X, Y and Z for which the probability density of Z is not equal to zero (this can be generalized to joint distributions of sets of variables X and Y conditional on another set Z). The notion of conditional independence will be explained in more detail in Chapter 3. Thus:

If I(X, Z, Y) then P(X,Y | Z) = P(X | Z)P(Y | Z)
2.7 Markov condition
Many ecologists, especially those who study vegetation dynamics, are familiar with Markov chain models (Van Hulst 1979). These models predict vegetation dynamics based on a transition matrix. The transition matrix gives the probability that a location that is occupied by a species s_i at time t will be replaced by species s_j at time t+1. The model is Markovian because of the assumption that changes in the vegetation at time t+1 depend at most on the state of the vegetation at time t, but not on states of the vegetation at earlier times. Stated another way, these models are Markovian because they assume that the more distant past (t−1) affects the immediate future (t+1) only indirectly through the present (t), thus: (t−1) → (t) → (t+1).
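To make the transition-matrix idea concrete, here is a minimal sketch (Python with NumPy assumed; the three species names and the transition probabilities are invented for illustration, not taken from Van Hulst) that simulates the state of one location through time.

```python
import numpy as np

rng = np.random.default_rng(1)

species = ["grass", "shrub", "tree"]          # hypothetical states
# P[i, j] = probability that species i at time t is replaced by species j at t+1
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.2, 0.7]])

state = 0                                      # start as "grass"
history = [species[state]]
for t in range(10):
    # The next state depends only on the current state (the Markov assumption),
    # not on any earlier states stored in `history`.
    state = rng.choice(3, p=P[state])
    history.append(species[state])

print(" -> ".join(history))
```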
In the context of causal models, the Markov condition is a property both of a directed acyclic (causal) graph and the joint probability distribution that is generated by the graph. The condition is satisfied if, given a vertex v_i in the graph, or a random variable v_i in the probability distribution, v_i is independent of all ancestral causes given its causal parents (i.e. P(v_i | ancestors(v_i)) = P(v_i | parents(v_i))). In the context of a causal model, this assumption is simply the reasonable claim that, once we know the direct causes of an event, then knowledge of more distant (indirect) causes provides no new information. To use a previous example (fertiliser → photosynthetic enzymes → photosynthetic rate), assume that the only cause of an increased concentration of photosynthetic enzymes in a leaf is the added fertiliser that was put on the ground, and that the only cause of an increased photosynthetic rate is the increased concentration of photosynthetic enzymes. Then, knowing how
much fertiliser was added gives us no new information about the photosyn-
thetic rate once we already know the concentration of photosynthetic
enzymes in the leaf.
Figure 2.8. A causal graph involving four variables and the joint probability distribution that is generated by it.

An important property of probability distributions that obey the Markov condition is that they can be decomposed into conditional probabilities involving only variables and their causal parents. For example, Figure 2.8 shows a causal graph and the joint probability distribution that is generated by it. This decomposition states that to know the probability distribution of D, we need only know the value of C; i.e. P(D|C). To know the probability distribution of C we need only know the values of A and B; i.e. P(C|{A,B}). A and B are independent and so to know the joint probability distribution of A and B we need only know the marginal distributions of A and B; i.e. P(A)P(B).
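Written out explicitly for the graph just described (A and B are the causal parents of C, and C is the sole parent of D), this decomposition is:

P(A,B,C,D) = P(D|C) P(C|{A,B}) P(A) P(B)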
2.8 The translation from causal models to observational
models
Although causal models and observational models are not the same thing, there is a remarkable relationship between the two. Consider first the case of causal graphs that do not have feedback relationships, that is, graphs in which no directed path leads from a vertex back to that same vertex. Theorem 10 of Pearl (1988) states that for any causal graph without feedback loops (a directed acyclic graph, or DAG), every d-separation statement obtained from the graph implies an independence relation in the joint probability distribution of the random variables represented by its vertices.
This central insight has been a long time in coming, and I imagine that many readers will wonder whether the effort was worth the return, so let me rephrase it:
Once we have specified the acyclic causal graph, then every d-separation relation that
exists in our causal graph must be mirrored in an equivalent statistical independency
in the observational data if the causal model is correct.
The above statement is incredibly general; it does not depend on
any distributional assumptions of the random variables or on the functional
form of the causal relationships. In the same way, if even one statistical inde-
pendency in the data disagrees with what d-separations of the causal graph
predict, then the causal model must be wrong. This is the translation device
that we needed in order to properly translate the causal claims represented
in the directed graph into the official language of probability theory used
by statisticians to express observational models. After wading through the
jargon developed above, I hope that the reader will recognise the elegant
simplicity of this strategy (Figure 2.9). First, express one's causal hypothesis
in a mathematical language (directed graphs) that can properly express the
asymmetric types of relationship that scientists imply when they use the lan-
guage of causality. Second, use the translation device (d-separation) to trans-
late from this directed graph into the well-known mathematical language
(probability theory) that is used in statistics to express notions of association.
Finally, determine the types of (conditional) independence relationship that must occur in the resulting joint probability distribution. Continuing with the analogy of a correlation as being an observational shadow of the underlying causal process, the translation device (d-separation) is the method by which one can predict these shadows. The shadows are in the form of conditional independence relationships that the joint probability distribution (and therefore the observational model) must possess if the data are really generated by the hypothesised directed graph.

Figure 2.9. The strategy used to translate from a causal model to an observational model.
2.9 Counterintuitive consequences and limitations of
d-separation: conditioning on a causal child
Although d-separation can also be used to obtain predictions concerning how a causal system will respond following an external manipulation (this is explained later in this chapter), d-separation is really only a mathematical operation that gives the correlational consequences of conditioning on a variable in a causal system. One non-intuitive consequence is that two causally independent variables will be correlated if one conditions on any of their common children. This is because conditioning on a collider vertex along a path between vertices X
and Y means that X and Y are not d-separated. This has important conse-
quences for applied regression analysis and shows how such a method can
give very misleading results if these are interpreted as giving information
about causal relationships.
Consider a causal system in which two causally independent variables (X and Y) jointly cause variable Z: X → Z ← Y. To be more specific, let's assume that the nitrogen content (X) and the stomatal density (Y) of the leaves of individuals of a particular species jointly cause the observed net photosynthetic rate (Z). Further, assume that leaf nitrogen content and stomatal density are causally independent. So, the causal graph is: leaf nitrogen → net photosynthetic rate ← stomatal density. Let the functional relationships between these variables be as follows:

leaf nitrogen = N(0, 1)
stomatal density = N(0, 1)
net photosynthesis = 0.5(leaf nitrogen) + 0.5(stomatal density) + N(0, 0.707)
These three equations can be used to conduct numerical simulations (such simulations are often called Monte Carlo simulations, after the famous gambling city, because they make use of random number generators to simulate a random process) that can demonstrate the consequences of conditioning on a common causal child (net photosynthetic rate). Since I use this method repeatedly in this book, I will explain how it is done in some detail. The first equation states that the leaf nitrogen concentration of a particular plant has causes not included in the model. Since the plant is chosen at random, the leaf nitrogen concentration is simulated by choosing at random from a normal distribution whose population mean is zero and whose population standard deviation is 1. The second equation states that the stomatal density of the same leaf of this individual also has causes not included in the model (not the same unknown causes, since otherwise it would not be causally independent) and its value is simulated by choosing another (independent) number from the same probability distribution. The third equation states that the net photosynthetic rate of this same leaf is jointly caused by the two previous variables. The quantitative effect of these two causes on the net photosynthetic rate is obtained by adding 0.5 times the leaf nitrogen concentration plus 0.5 times the stomatal density plus a new (independent) random number taken from a normal distribution whose population mean is zero, whose population variance is 1 − 2(0.5²) = 0.5, and whose population standard deviation is therefore the square root of this value; this third random variable represents all those other causes of net photosynthetic rate other than leaf nitrogen and stomatal density, and these other unspecified causes are not causally connected to either of the specified causes. By repeating this process a large number of times, one obtains a random sample of observations that agree with the generating process specified by the equations (many commercial statistical packages can generate random numbers from specified probability distributions; a good reference, along with FORTRAN subroutines, is Press et al. 1986). As is described in Chapter 3, this model is actually a very simple path model. After generating 1000 independent observations that agree with these equations, and respecting the causal relationships specified by our causal system, here are the regression equations that are obtained:

leaf nitrogen = N(0.035, 1.006)
stomatal density = N(0.031, 1.017)
net photosynthesis = 0.003 + 0.527(leaf nitrogen) + 0.498(stomatal density) + N(0, 0.693)
Happily, the partial regression coefficients as well as the means and standard deviations of the random variables are what we should find, given sampling variation with a sample size of 1000. What happens if we give these data to a friend who mistakenly thinks that leaf nitrogen concentration is actually caused by net photosynthetic rate and stomatal density? That is, she mistakenly thinks that the causal graph is: net photosynthetic rate → leaf nitrogen ← stomatal density. We know, because we generated the numbers, that leaf nitrogen and stomatal density are actually independent (the Pearson correlation coefficient between them is 0.037) but this is the set of regression equations that results from this incorrect causal hypothesis:

net photosynthesis = N(0.001, 0.994)
stomatal density = N(0.031, 1.017)
leaf nitrogen = 0.023 + 0.70(net photosynthesis) − 0.366(stomatal density) + N(0, 0.799)
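The whole argument is easy to reproduce. The following is a minimal NumPy-only sketch (Python assumed; this is not the author's original code) that generates data from the true collider structure and then fits both the correct and the mistaken regressions; the spurious stomatal-density coefficient appears in the second fit.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Generating equations: leaf nitrogen and stomatal density are causally
# independent, and both are causes of net photosynthesis (their common child).
nitrogen = rng.normal(0.0, 1.0, n)
stomata = rng.normal(0.0, 1.0, n)
photo = 0.5 * nitrogen + 0.5 * stomata + rng.normal(0.0, 0.707, n)

def ols(y, *predictors):
    """Ordinary least squares; returns (intercept, slopes...)."""
    X = np.column_stack([np.ones_like(y)] + list(predictors))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef.round(3)

# Correct causal model: regress the child on its two parents.
print("photo ~ nitrogen + stomata:", ols(photo, nitrogen, stomata))

# Mistaken model: regress one parent on the child and the other parent,
# i.e. condition on the common causal child.
print("nitrogen ~ photo + stomata:", ols(nitrogen, photo, stomata))
# The stomatal-density coefficient in the second fit is clearly non-zero
# (negative), even though nitrogen and stomatal density are causally independent.
```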
Tests of significance for the two partial regression coefficients show that each is significantly different from zero at a probability of less than 1 × 10⁻⁶. Why would the multiple regression mistakenly report a highly significant effect of stomatal density on leaf nitrogen when we know that they are both statistically and causally independent (because we made them that way in the simulation)? There is no mistake in the statistics; rather it is our friend's interpretation that is mistaken. The regression equation is an observational model and it is simply telling us that knowing something about the net photosynthetic rate gives us extra information about (or helps to predict) the amount of nitrogen in the leaf, when we compare leaves with the same stomatal density (remember that a partial regression coefficient is a function of the partial correlation coefficient; the partial correlation coefficient measures the degree of linear association between two variables upon conditioning on some other set of variables; see Chapter 3). This is exactly what d-separation, applied to the correct causal graph, tells us will happen: leaf nitrogen and stomatal density, while unconditionally d-separated, are not d-separated (therefore observationally associated) upon conditioning on their causal child (net photosynthetic rate).
This counterintuitive claim is easier to understand with an every-
day example. Consider again the simple causal world consisting only of rain,
watering pails and mud, related as: rain → mud ← watering pails. Now, in this
world there are no causal links between watering pails and rain. Knowing
that no one has dumped water from the watering pail tells us nothing about
whether or not it is raining; we can predict nothing about the occurrence
of rain by knowing something about the watering pail. On the other hand,
if we see that there is mud (the causal child of the two independent causes),
and we know that no one has dumped water from the watering pail (i.e.
conditional on this variable) then we can predict that it has rained.
Conditioning on a common child of the two causally independent variables
(rain and watering pails) renders them observationally dependent. This is
because information, unlike causality, is symmetrical.
Many researchers believe that the more variables that can be statistically controlled in a multiple regression, the less biased and the more reliable the resulting model. The above example shows this to be wrong and warns against such methods as stepwise multiple regression if the resulting model is to be interpreted as something more than simply a prediction device (even as a prediction device, such models are only valid if no manipulations are done to the population). This point is almost never mentioned in most statistics texts.
2.10 Counterintuitive consequences and limitations of
d-separation: conditioning due to selection bias
There is also an interesting consequence of d-separation that might occur in experiments using artificial selection. Body condition is a somewhat vague concept that is sometimes used to refer to the general health and vigour of an animal. It is occasionally operationalized as an index based on a weighting of such things as the amount of subcutaneous fat, the parasite load, or other variables judged relevant to the health of the species. Imagine a wildlife manager who wants to select for an improved body condition of Bighorn Sheep. His measure of body condition is obtained by adding together the thickness of subcutaneous fat in the autumn (in centimetres) and a score for parasite load (0 = none, 1 = average load, 2 = above-average load) as follows: body condition = 0.5(fat) + parasite load. These two components of body condition are causally unrelated. He decides to protect all individuals whose body condition is greater than 3 and removes all others from the population by allowing hunters to kill them. The causal graph of this process is: fat thickness → body condition ← parasite load. If someone were to then measure the fat thickness and parasite load in the remaining population after the selective hunt, she would find that these two variables were correlated, even though there is, in reality, no causal link between the two (on the other hand, if this process were to be repeated for a number of generations and the two attributes were heritable, then there would develop a causal link, since the average values of the attributes in the next generation would depend on who survives, and this is caused by the same attributes in the previous generation). This occurs because the selection process has removed all those individuals not meeting the selection criterion and this effectively results in conditioning on body condition.
We can simulate this with the following generating equations (here Gamma(shape = 2) is the incomplete Gamma distribution, which gives values greater than zero with a right-tailed skew, and Multinomial(1/3, 1/3, 1/3) means a multinomial distribution with equal probability of values being 0, 1 or 2):

fat thickness = Gamma(shape = 2)
parasite load = Multinomial(p = 1/3, 1/3, 1/3)
body condition = 0.5(fat thickness) + parasite load

After generating 1000 independent sheep following this process we find the Spearman non-parametric correlation coefficient between fat thickness and parasite load in the original population before artificial selection to be 0.018, consistent with independence. There were 493 sheep
whose body condition was at least 3, and so these are kept to represent the
post-selection population, the rest being killed. The Spearman non-
parametric correlation coefficient between fat thickness and parasite load for
this post-selection population was 0.593. This occurs even though these
two variables are causally independent.
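The selection step is easy to reproduce. The sketch below (Python with NumPy and SciPy assumed, neither of which is used in the text) follows the generating equations above, keeps only the sheep whose body condition exceeds 3, and compares the rank correlations before and after selection.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n = 1000

# Generating equations: the two causes of body condition are independent.
fat = rng.gamma(shape=2.0, size=n)            # fat thickness, cm
parasites = rng.choice([0, 1, 2], size=n)     # equal probabilities of 1/3 each
condition = 0.5 * fat + parasites             # body condition index

rho_before, _ = spearmanr(fat, parasites)
print("before selection:", round(rho_before, 3))

# Artificial selection: only animals with body condition above 3 survive.
kept = condition > 3
rho_after, _ = spearmanr(fat[kept], parasites[kept])
print("kept:", int(kept.sum()), "sheep;  after selection:", round(rho_after, 3))
# Among the survivors, fat thickness and parasite load are correlated even
# though they are causally (and, before selection, statistically) independent.
```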
2.11 Counterintuitive consequences and limitations of
d-separation: feedback loops and cyclic causal graphs
The relationship between d-separation in an acyclic causal model (a directed
acyclic graph) and independencies in a probability distribution is therefore
very general. What happens if there are feedback loops in the causal model?
We don't know for sure, although this is an area of active research
(Richardson 1996b). Spirtes (1995) has shown that d-separation in a cyclic
causal model still implies independence in the joint probability distribution
that it generates, but only if the relationships are linear. Pearl and Dechter
(1996) have also shown that the relationship between d-separation and pro-
babilistic independence also holds if all variables are discrete without any
restriction on the functional form of the relationships. Unfortunately,
Spirtes (1995) has also shown, by a counter-example, that d-separation does
not always imply probabilistic independence when the functional relation-
ships are non-linear and the variables are continuous. There are some gram-
matical constructs in the language of causality for which no one has yet
found a good translation.
There are other curious properties of causal models with feedback loops. Consider Figure 2.10. Such a causal model seems to violate many properties of causes. The relationship is no longer asymmetrical, since X causes Z (indirectly through Y) and Z also causes X. The relationship is no longer irreflexive, since X seems to cause itself through its effects on Y and Z.

Figure 2.10. A cyclic causal graph that seemingly violates many of the properties of causal relationships.

These counterintuitive aspects of feedback loops can be resolved if we remember that causality is a process that must follow time's arrow but causal graphs do not explicitly include this time dimension. Causal graphs with feedback loops represent either a time slice of an ongoing dynamic process or a description of this dynamic process at equilibrium, an interpretation that appears to have been first proposed by F. M. Fisher (1970). Richardson's very interesting Ph.D. thesis (Richardson 1996b) provides a history of the use and interpretation of such cyclic, or feedback, models in economics (in the literature of structural equation modelling, cyclic or feedback models are called 'non-recursive'; this whole subject area is replete with confusing and intimidating jargon). A more complete causal description of the process shown in
Figure 2.10 is given in Figure 2.11; the subscripts on the vertices index the
state of that vertex at a given time. From Figure 2.11 we see that, once the
explicit time dimension is included in the directed graph, the apparent par-
adoxes disappear. Rather than the circles we see when we ignore the time dimension (as in Figure 2.10), we have spirals that never close on themselves when the time dimension is included. Just as the 20-year-old Bill Shipley is not the same individual as I am as I write these words, the X that causes Y at time t₁ will not be that same X that is caused by Z at time t₄ in Figure 2.11.

Figure 2.11. The causal relationships between X, Y and Z from Figure 2.10 when the time dimension is included in the causal graph.
Conceived in this way, both acyclic and cyclic causal models repre-
sent time slices of some causal process. Samuel Mason, described by Heise
(1975), provided a general treatment of feedback loops in causal graphs over
40 years ago for the case of linear relationships between variables. None the
less, trying to model causal processes with feedback using directed graphs
that ignore this time dimension is more complicated and requires that we
make assumptions about the linearity of the functional relationships.
2.12 Counterintuitive consequences and limitations of
d-separation: imposed conservation relationships
Relationships derived from imposed (as opposed to dynamic) conservation constraints are superficially similar to cyclic relationships, but they are conceptually quite different. By 'conservation' I mean variables that are constrained to maintain some conserved property. For instance, if I purchase fruits and vegetables in a shop and then count the total amount of money that I have spent, I can represent this as: money spent on fruits → total money spent ← money spent on vegetables. If the total amount of money that I can spend is not fixed, then the amount that I spend on fruits and the amount that I spend on vegetables are causally independent. However, if the total amount of money is fixed, or conserved, due to some influence outside of the causal system, then every dollar that I spend on fruit causes a decrease in the amount of money that I spend on vegetables. There is now a causal link between the amount of money spent on fruits and on vegetables due only to the requirement that the total amount of money be conserved.
There is no obvious way to express such relationships in a causal graph. One might be tempted to modify our original acyclic graph by adding a cyclic path between fruits and vegetables but, if we do this, then we can't interpret such a cyclic graph as a static graph of a dynamic process; the conservation constraint is imposed from outside and is not due to a dynamic equilibrium that results from the prior interaction of money spent on fruits and money spent on vegetables. In other words, it is not as if spending one dollar more on fruits at time t₁ causes me to spend one dollar less on vegetables at time t₂, which then causes me to spend one dollar less on fruits at time t₃, and so on until some dynamic equilibrium is attained. The conservation of the total amount of money spent is imposed from outside the causal system.
One might also be tempted to interpret the conservation requirement as equivalent to physically fixing the total amount of money at a constant value. If this were true, then one could maintain the causal graph money spent on fruits → total money spent ← money spent on vegetables, but with the variable 'total money spent' being fixed due to the imposed conservation requirement. Because total money spent is now viewed as being fixed rather than being allowed to vary randomly, money spent on fruits would not be d-separated from money spent on vegetables (remember d-separation); this is because total money spent is the causal child of each of money spent on fruits and money spent on vegetables. This would indeed imply a correlation between fruits and vegetables.
Unfortunately, our causal system does not imply simply that the money
spent on fruits is correlated with the money spent on vegetables, but that there
is actually a causal connection between them that exists only when the con-
servation requirement is in place. d-separation upon conditioning on a
common causal child does not imply that any new causal connections form
between the causal parents. Perhaps the best causal representation is to consider that the causal graph money spent on fruits → total money spent ← money spent on vegetables is actually replaced by the causal graph money spent on fruits → total money spent → money spent on vegetables, with the convention that 'total money spent' is not random.
Systems that contain imposed conservation laws (conservation of energy, mass, volume, number, etc.) cannot yet be properly expressed using directed graphs and d-separation. In fact, such causal relationships resemble Plato's notion of formal causes rather than the efficient causes with which scientists are used to working. It is important to keep in mind, however, that this does not apply to conservation relationships that are due to a dynamic equilibrium, for which cyclic graphs can be used, but rather to conservation relationships that are imposed independently of the causal parents of the conserved variable.
2.13 Counterintuitive consequences and limitations of
d-separation: unfaithfulness
Let's go back to the relationship between d-separation and probabilistic independence. We now know that once we have specified the acyclic causal
model, then every d-separation relation that exists in our causal model
must be mirrored in an equivalent statistical independency in the observa-
tional data if the causal model is correct. This does not depend on any
distributional assumptions of the random variables or on the functional form
of the causal relationships. Is the contrary also true? Can there be indepen-
dencies in the data that are not predicted by the d-separation criterion?
Yes, but only as limiting cases. For instance, this can occur if the quantitative causal effects of two variables along different directed paths exactly cancel each other out. Two examples are shown in Figure 2.12. In these causal models we see that no vertex is unconditionally d-separated from any other vertex. Assume that the joint probability distribution over the three vertices is multivariate normal and that the functional relationships between the variables are linear. Under these conditions, we can use Pearson's partial correlation to measure probabilistic independence (Pearson partial correlations are explained more fully in Chapter 3). By definition, the partial correlation between X and Z, conditioned on Y (see p. 84 for a more detailed explanation of the notation ρ_XZ.Y), is given by:

ρ_XZ.Y = (ρ_XZ − ρ_XY ρ_ZY) / √((1 − ρ²_XY)(1 − ρ²_ZY))

This partial correlation can equal zero (implying that X and Z are probabilistically independent given Y) even though X and Z are not d-separated given Y, if the correlations between each pair of variables exactly cancel each other. Using the rules of path analysis (Chapter 4), this will happen only if Y is perfectly correlated with X in the first model in Figure 2.12, or if the indirect effect of X on Z is exactly equal in strength but opposite in sign to the direct effect of X on Z.

Figure 2.12. Two causal graphs for which special combinations of causal strengths can result in unfaithful probability distributions.
When this occurs, we say that the probability distribution is unfaithful to the causal graph (Pearl 1988; Spirtes, Glymour and Scheines 1993). I will call such probabilistic independencies that are not predicted by d-separation, and that depend on a particular combination of quantitative effects, 'balancing independencies', to emphasise that such independencies require a very peculiar balancing of the positive and negative effects between the variables along different paths. Clearly, this can occur only under very special conditions, and anyone who wanted to link a causal model with such an unfaithful probability distribution would require strong external evidence to support such a delicate balance of causal effects. This is not to say that these things are impossible. It sometimes occurs that an organism attempts to maintain some constant set-point value by balancing different causal effects; an example is the control of the internal CO₂ concentration of a leaf, as described in Chapter 3. Essentially, in proposing such a claim we are saying that nature is conspiring to give the impression of independence by exactly balancing the positive and negative effects.
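A small numerical illustration of such a balancing independency, assuming Python with NumPy (not used in the text) and a linear structure like the second model in Figure 2.12 (X causes Z both directly and indirectly through Y; the particular coefficients are invented for illustration): the direct effect of X on Z is set exactly equal and opposite to its indirect effect through Y, so X and Z are (almost exactly) uncorrelated even though X is a direct cause of Z.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

b = 0.6            # effect of X on Y
d = 0.5            # effect of Y on Z
c = -b * d         # direct effect of X on Z, chosen to cancel the indirect path

x = rng.normal(size=n)
y = b * x + rng.normal(size=n)
z = c * x + d * y + rng.normal(size=n)

# Marginal correlation between X and Z is ~0 despite the direct causal arrow,
# because the direct and indirect effects exactly cancel (an unfaithful case).
print("corr(X, Z)     =", np.corrcoef(x, z)[0, 1].round(3))

# Conditioning on Y reveals the hidden direct effect (partial association != 0).
sx, ix = np.polyfit(y, x, 1)
sz, iz = np.polyfit(y, z, 1)
resid_x = x - (sx * y + ix)
resid_z = z - (sz * y + iz)
print("corr(X, Z | Y) =", np.corrcoef(resid_x, resid_z)[0, 1].round(3))
```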
2.14 Counterintuitive consequences and limitations of
d-separation: context-sensitive independence
Another way in which independencies can occur in the joint probability distribution without being mirrored in the d-separation criterion is due to context-sensitive independence. An example of this in biology is enzyme induction (a classic example is the lac operon of Escherichia coli, whose transcription in the presence of lactose induces the production of β-galactosidase, lac permease and transacetylase, thus converting lactose into galactose and glucose; De Robertis and De Robertis 1980). Imagine a case in which the number (G) of functional copies of a gene determines the rate (E) at which some enzyme is produced. If there are no functional copies of the gene then the enzyme is never produced. However, the rate at which these genes are transcribed is determined by the amount (I) of some environmental inducer. If the environment completely lacks the inducer, then no genes are transcribed and the enzyme is still never produced. It is possible to arrange an experimental set-up in which the number (G) of functional genes is causally independent of the concentration (I) of the inducer in the environment (whether this would be true in the biological population is an empirical question: perhaps the presence of a functional gene was selected based on the presence of the inducer, in which case the inducer would be a cause of the presence, and perhaps the number of copies, of the gene). Both the number of functional genes and the concentration of the inducer are causes of enzyme production. We can construct a causal graph of this process (Figure 2.13).

Figure 2.13. A biological example of a causal process that can potentially result in context-sensitive independence.

Now, applying d-separation to the causal graph in Figure 2.13 predicts that G is independent of I, but that E is dependent on both G and I.
However, if there are no copies of G (i.e. G = 0) then the concentration of the inducer will be independent of the amount of enzyme that is produced (which will be zero). Similarly, if there is no inducer (i.e. I = 0) then the number of copies of the gene will be independent of the amount of enzyme that is produced (which will be zero). In other words, for the special cases of G = 0 and/or I = 0, d-separation predicts a dependence when, in fact, there is independence. Note that the d-separation theorem still holds; d-separation does not predict any independence relations that do not exist. So long as the experiment involves experimental units, at least some of which include G ≠ 0 and I ≠ 0, the d-separation criterion still predicts both probabilistic independence and dependence. Similarly, if both G and I were true random variables (i.e. in which the experimenter did not fix their values), then any reasonably large random sample would include such cases.
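A sketch of this situation (Python with NumPy assumed; the multiplicative form E = G × I plus noise is an invented illustration, not the author's model) shows that I predicts E in the full sample but carries no information about E among the units with G = 0.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

G = rng.integers(0, 3, size=n)        # number of functional gene copies (0, 1 or 2)
I = rng.uniform(0, 1, size=n)         # concentration of the environmental inducer
# Enzyme production requires both functional genes and the inducer; a
# multiplicative form is one convenient way to encode that requirement.
E = G * I + rng.normal(0, 0.05, size=n)

print("corr(I, E), all units:      ", np.corrcoef(I, E)[0, 1].round(3))
mask = G == 0
print("corr(I, E), units with G=0: ", np.corrcoef(I[mask], E[mask])[0, 1].round(3))
# In the G = 0 subset the enzyme is (apart from noise) always absent, so the
# inducer concentration tells us nothing about E: context-sensitive independence.
```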
2.15 The logic of causal inference
Now that we have our translation device and are aware of some of the
counterintuitive results and limitations that can occur with d-separation, we
have to be able to infer causal consequences from observational data by using
this translation device. The details of how to carry out such inferences will
occupy most of Chapters 3 to 7. Before looking at the statistical details,
however, we must first consider the logic of causal and statistical inferences.
Since we are talking about the logic of inferences from empirical
experience, it is useful to briefly look at what philosophers of science have
had to say about valid inference. Logical positivism, itself being rooted in
the British empiricism of the last century that so influenced people like Karl Pearson (this is explored in more detail in Chapter 3), was dominant in this century up to the mid 1960s. This philosophical school was based on the verifiability theory of meaning; to be meaningful, a statement had to be of a kind that could be shown to be either true or false. For logical positivism, there were only two kinds of meaningful statement. The first kind was composed of analytical statements (tautologies, mathematical or logical statements) whose truth could be determined by deducing them from axioms or definitions. The second kind was composed of empirical statements that were either self-evident observations ('the water is 23 °C') or could be logically deduced from combinations of basic observations whose truth was self-evident (that even such simple observational or experiential statements cannot be considered objectively self-evident was shown at the beginning of the twentieth century by Duhem 1914). Thus logical positivists emphasised the hypothetico-deductive method: a hypothesis was formulated to explain some phenomenon by showing that it followed deductively from the hypothesis. The scientist attempted to validate the hypothesis by deducing logical consequences of the hypothesis that were not involved in its formulation and testing these against additional observations. A simplified version of the argument goes like this:
If my hypothesis is true, then consequence C must also be true.
Consequence C is true.
Therefore my hypothesis is true.
Readers will immediately recognise that such an argument commits
the logical fallacy of affirming the consequent. It is possible for the conse-
quence to be true even though the hypothesis that deduced it is false, since
there can always be other reasons for the truth of C.
Popper (1980) pointed out that, although we cannot use such an
argument to verify hypotheses, we can use it to reject them without com-
mitting any logical fallacy:
If my hypothesis is true, then consequence C must also be true.
Consequence C is false.
Therefore my hypothesis is false.
Practising scientists would quickly recognise that this argument,
although logically acceptable, has important shortcomings when applied to
empirical studies. It was recognised as long ago as the turn of the century
(Duhem 1914) that no hypothesis is tested in isolation. Every time that we
draw a conclusion from some empirical observation we rely on a whole set
of auxiliary hypotheses (A₁, A₂, . . .) as well. Some of these have been repeat-
edly tested so many times and in so many situations that we scarcely doubt
their truth. Other auxiliary assumptions may be less well established. These
auxiliary assumptions will typically include those concerning the experi-
mental or observational background, the statistical properties of the data,
and so on. Did the experimental control really prevent the variable from
changing? Were the data really normally distributed, as the statistical test
assumes? Such auxiliary assumptions are legion in every empirical study,
including the randomised experiment, the controlled experiment or the
methods described in this book involving statistical controls. A large part of
every empirical investigation involves checking, as best one can, such aux-
iliary assumptions so that, once the result is obtained, blame or praise can
be directed at the main hypothesis rather than at the auxiliary assumptions.
So, Popper's process of inference might be simplistically paraphrased (simplistically because it is wrong; Popper did not make such a claim) as:

If auxiliary hypotheses A₁, A₂, . . ., Aₙ are true, and
if my hypothesis is true, then consequence C must be true.
Consequence C is false.
Therefore, my hypothesis is false.
Unfortunately, to argue in such a manner is also logically fallacious. Consequence C might be false, not because the hypothesis is false, but rather because one or more of the auxiliary hypotheses are false. The empirical researcher is now back where he started: there is no way of determining either the truth or falsity of his or her hypothesis in any absolute sense from logical deduction. This conclusion applies just as well to the randomised experiment, the controlled experiment or the methods described in this book. Yet, most biologists would recognise the falsifiability criterion as important to science and would probably modify the simplistic paraphrase of Popper's inference by attempting to judge which, the auxiliary hypotheses and background conditions, or the hypothesis under scrutiny, is on firmer empirical ground. If the auxiliary assumptions seem more likely to be true than the hypothesis under scrutiny, yet the data do not accord with the predicted consequences, then the hypothesis would be tentatively rejected. If there are no reasoned arguments to suggest that the auxiliary assumptions are false, and the data also accord with the predictions of the hypothesis under scrutiny, then the hypothesis would be tentatively accepted.
Pollack (1986) called such reasoning 'defeasible' reasoning (defeasible because it can be defeated with subsequent evidence). Revealingly, practising scientists have explicitly described their inferences in such
terms for a long time. At the turn of the century T. H. Huxley likened the
decision to accept or reject a scientific hypothesis to a criminal trial in a
court of law (reproduced in Rapport and Wright 1963) in which guilt must
be demonstrated beyond reasonable doubt.
Let's apply this reasoning to the examples in Chapter 1 involving
the randomised and the controlled experiments. Later, I will apply the same
reasoning to the methods involving statistical control.
Here is the logic of causal inference with respect to the randomised
experiment to test the hypothesis that fertiliser addition increases seed yield:
If the randomisation procedure was properly done so that the alter-
native causal explanations were excluded;
if the experimental treatment was properly applied;
if the observational data do not violate the assumptions of the sta-
tistical test;
if the observed degree of association was not due to sampling fluctuations;
then by the causal hypothesis the amount of seed produced will be
associated with the presence of the fertiliser.
There is/is not an association between the two variables.
Therefore, the fertiliser addition might have caused/did not cause
the increased seed yield.
This list of auxiliary assumptions is only partial. In particular, we
still have to make the basic assumption linking causality to observational
associations, as described in Chapter 1. At this stage we must either reject
one of the auxiliary assumptions or tentatively accept the conclusion con-
cerning the causal hypothesis. If the probability associated with the test for
the association is sufficiently large, traditionally above 0.05, then we are willing to reject one of the auxiliary assumptions (the observed measure of association was not due to sampling fluctuations) rather than accept the causal hypothesis.

(Footnote: See Cowles and Davis (1982b) for a history of the 5% significance level. The first edition of Fisher's (1925) classic book states: 'It is convenient to take this point as a limit in judging whether a deviation is to be considered significant or not. Deviations exceeding twice the standard deviation are thus formally regarded as significant.' The words 'convenient' and 'formal' emphasise the somewhat arbitrary nature of this value. In fact, this level can be traced back even further to the use of three times the probable error (about 2/3 of a standard deviation). Strictly speaking, twice the standard deviation of a normal distribution gives a probability level of 0.0456; perhaps Fisher simply rounded this up to 0.05 for his tables. E. S. Pearson and Kendall (1970) record Karl Pearson's reasons at the turn of the century: p = 0.5586, 'thus we may consider the fit remarkably good'; p = 0.28, 'fairly represented'; p = 0.10, 'not very improbable'; p = 0.01, 'this very improbable result'. Note that some doubt began at 0.1 and Pearson was quite convinced at p = 0.01. The midpoint between 0.1 and 0.01 is 0.05. Cowles and Davis (1982a) conducted a small psychological experiment by fooling students into believing that they were participating in a real betting game (with money) that was, in reality, fixed. The object was to see how unlikely a result people would accept before they began to doubt the fairness of the game. They found that 'on average, people do have doubts about the operation of chance when the odds reach about 9 to 1 [i.e. 0.09], and are pretty well convinced when the odds are 99 to 1 [i.e. 0.0101] . . .' If these data are accepted, the 5% level would appear to have the appealing merit of having some grounding in common sense.)

Thus we reject our causal hypothesis. This rejection must
remain tentative. This is because another of the auxiliary assumptions (not listed above) is that the sample size is large enough to permit the statistical test to differentiate between sampling fluctuations and systematic differences. Note, however, that it is not enough to propose any old reason to reject one of the auxiliary assumptions; we must propose a reason that has empirical support. We must produce reasonable doubt; in the context of the assumption concerning sampling fluctuations, scientists generally require a probability above 0.05. Here it is useful to cite from the first edition of Fisher's (1925) influential Statistical methods for research workers: 'Personally, the writer prefers to set a low standard of significance at the 5 per cent point, and ignore entirely all results which fail to reach this level. A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance.' It is clear that Fisher was demanding reasonable doubt concerning the null hypothesis, since he asks only that a result rarely fail to reject it. What if the probability of the statistical test was sufficiently small, say 0.01, that we do not have reasonable grounds to reject our auxiliary assumption concerning sampling fluctuations? What if we do not have reasonable grounds to reject the other auxiliary assumptions? What if the sampling variation was small compared with a reasonable effect size? Then we must tentatively accept the causal hypothesis. Again, this acceptance must remain tentative, since new empirical data might provide such reasonable doubt. Is there any automatic way of measuring the relative support for or against each of the auxiliary assumptions and of the principal causal hypothesis? No. Although the support (in terms of objective probabilities) for some assumptions can be obtained (for instance, those concerning normality or linearity of the data), there are many other assumptions that deal with experimental procedure or lack of confounding variables for which no objective probability can be calculated. This is one reason why so many contemporary philosophers of science prefer Bayesian methods to frequency-based interpretations of probabilistic
inference (see, for example, Howson and Urbach 1989). Such Bayesian
methods suffer from their own set of conceptual problems (Mayo 1996). In the end, even the randomised experiment requires subjective decisions on the part of the researcher. This is why the independent replication of experiments in different locations, using slightly different environmental or experimental conditions and therefore having different sets of auxiliary
assumptions, is so important. As the causal hypothesis continues to be
accepted in these new experiments, it becomes less and less reasonable to
suppose that incorrect auxiliary assumptions are conspiring to give the illu-
sion of a correct causal hypothesis.
Here is the logic of our inferences with respect to the controlled
experiment to test the hypothesis that renal activity causes the change in the
colour of the renal vein blood, as described in Chapter 1:
If the activity of the kidney was effectively controlled;
if the colour of the blood was accurately determined;
if the experimental manipulation did not change some other
uncontrolled attribute besides kidney function that is a common
cause of the colour of blood in the renal vein before entering, and
after leaving the kidney;
if there was not some unknown (and therefore uncontrolled)
common cause of the colour of blood in the renal vein before
entering, and after leaving the kidney;
if a rare random event did not occur;
then by the causal hypothesis, blood will change colour only when
the kidney is active.
The blood did change colour in relation to kidney activity.
Therefore, kidney activity does cause the change in the colour of
blood leaving the renal vein.
Again, this list of auxiliary assumptions is only partial. Again, one
must either produce reasonable evidence that one or more of the auxiliary
assumptions is false or tentatively accept the hypothesis. In particular, more
of these auxiliary assumptions concern properties of the experiment or of
the experimental units for which we cannot calculate any objective prob-
ability concerning their veracity. This was one of the primary reasons why
Fisher rejected the controlled experiment as inferior. In the controlled
experiment these auxiliary assumptions are more substantial, but it is still not enough simply to raise any possible doubt; there must be some empirical evidence to
support the decision to reject one of these assumptions. Since we want the
data to cast doubt or praise on the principal causal hypothesis and not on the
auxiliary assumptions, we will ask only for evidence that casts reasonable
doubt. It is not enough to reject the causal hypothesis simply because
experimental manipulation might have changed some other uncontrolled
attribute besides kidney function that is a common cause of the colour of
blood in the renal vein before entering, and after leaving the kidney. We
must advance some evidence to support the idea that such an uncontrolled
factor actually exists. For instance, a critic might reasonably point out that
some other attribute is also known to be correlated with blood colour and
that the experimental manipulation was known to have changed this attrib-
ute. Although such evidence would certainly not be sufficient to demonstrate that this other attribute definitely was the cause, it might be enough to cast doubt on the veracity of the principal hypothesis. This is the same criterion as we used before to choose a significance level in our statistical
test. Rejecting a statistical hypothesis because the probability associated with
it was, say, 0.5 would not be reasonable. Certainly, this gives some doubt
about the truth of the hypothesis but our doubt is not sufficiently strong that
we would have a clear preference for the contrary hypothesis. It is the same
defeasible argument that might be raised in a murder trial. If the prosecu-
tion has demonstrated that the accused had a strong motive, if it produced a
number of reliable eyewitnesses and if it produced physical evidence impli-
cating the accused, then it would not be enough for the defence to claim
simply that maybe someone else did it. If, however, the defence could
produce some contrary empirical evidence implicating someone else, then
reasonable doubt would be cast on the prosecution's argument. In fact, I think that the analogy between testing a scientific hypothesis and testing the
innocence of the accused in a criminal trial can be stretched even further.
There is no objective definition of reasonable doubt in a criminal trial; what is reasonable is decided by the jury in the context of legal precedent. In the same way, there is no objective definition of reasonable doubt in a scientific claim. In the first instance reasonable doubt is decided by the peer reviewers of the scientific article and, ultimately, reasonable doubt is decided by the entire scientific community. One should not conclude from this that such decisions are purely subjective acts and that scientific claims are therefore simply relativistic stories whose truth is decided by fiat by a power elite. Judgements concerning reasonable doubt and statistical significance are con-
strained in that they must deliver predictive agreement with the natural
world in the long run.
Now let's look at the process of inference with respect to causal
graphs.
If the data were generated according to the causal model;
if the causal process generating the data does not include non-linear
feedback relationships;
if the statistical test used to test the independence relationships is
appropriate for the data;
if a rare sampling uctuation did not occur;
then each d-separation statement will be mirrored by a probabilis-
tic independence in the data.
At least one predicted probabilistic independence did not exist;
therefore, the causal model is wrong.
By now, you should have recognised the similarity of these infer-
ences. We can prove by logical deduction that d-separation implies probab-
ilistic independence in such directed acyclic graphs. We can prove that,
barring the case of non-linear feedback with non-normal data (an auxiliary
assumption), every d-separation statement obtained from any directed graph
must be mirrored by a probabilistic independence in any data that were gen-
erated according to the causal process that was coded by this directed graph.
We can prove that, barring a non-faithful probability distribution (another
auxiliary assumption, but one that is only relevant if the causal hypothesis is
accepted, not if it is rejected), there can be no independence relation in the
data that is not mirrored by d-separation. So, if we have used a statistical test
that is appropriate for our data and have obtained a probability that is
sufficiently low to reasonably exclude a rare sampling event, then we must tentatively reject our causal model. As in the case of the controlled experiment, if we are led to tentatively accept our causal model, then this will require that we can't reasonably propose an alternative causal explanation that also fits our data as well. As always, it is not sufficient to simply claim
that maybe there is such an alternative causal explanation. One must be able
to propose an alternative causal explanation that has at least enough empir-
ical support to cast reasonable doubt on the proposed explanation.
2.16 Statistical control is not always the same as physical
control
We have now seen how to translate from a causal hypothesis into a statisti-
cal hypothesis. First, transcribe the causal hypothesis into a causal graph
showing how each variable is causally linked to other variables in the form
of direct and indirect effects. Second, use the d-separation criterion to
predict what types of probabilistic independence relationship must exist
when we observe a random sample of units that obey such a causal process.
In Chapter 1 I alluded to the fact that the key to a controlled experiment is
control over variables, not how the control is produced. It is time to look at
this more carefully. The relationship between control through external
(experimental) manipulation and probability distributions is given by the
Manipulation Theorem (Spirtes, Glymour and Scheines 1993). Let me
introduce another definition in Box 2.2.
Box 2.2. Definition of a backdoor path
Given two variables, X and Y, and a variable F that is a causal ancestor of both X and Y, a backdoor path goes from F to each of X and Y. Thus:
X ← F → Y
Whenever someone directly physically controls some set of variables through experimental manipulation, he or she is changing the causal process that is generating the data. Whenever someone physically fixes some variable at a given level the variable stops being random (the notion of randomness is another example of a concept that is regularly invoked in science even though it is extraordinarily difficult to define) and is then under the complete control of the experimenter. In other words, whatever causes might have determined the random values of the variable before the manipulation have been removed by the manipulation. The only direct cause of the controlled variable after the manipulation has been performed is the will of the experimenter.
Imagine that someone has randomly sampled herbaceous plants
growing in the understorey of an open stand of trees. The measured vari-
ables are the light intensities experienced by the herbaceous plants, their
photosynthetic rates and the concentration of anthocyanins (red-coloured
pigments) in their leaves. Each of these three are random variables, since
they are outside the control of the researcher. One cause of variation in light
intensity at ground level is the presence of trees. The researcher proposes
two alternative causal explanations for the data (Figure 2.14).

Figure 2.14. Two different causal scenarios linking the same four variables.
To test between these two explanations, the researcher experi-
mentally manipulates light intensity by installing a neutral-shade cloth
between the trees and the herbs, and then adds an artificial source of light-
ing. Remembering that this is a controlled experiment, the researcher
would want to take precautions to ensure that other environmental variables
(temperature, humidity and so on) are not changed by this manipulation.
The Manipulation Theorem, in graphical terms²⁸, states that the probability
distribution of this new causal system can be described by taking the
original (unmanipulated) causal graphs, removing any arrows leading into
the manipulated variable (light intensity) and adding a new variable
representing the new causes of the manipulated variable (Figure 2.15).

28. The Manipulation Theorem also predicts how the joint probability distribution in the new
manipulated causal system differs, if at all, from the original distribution before the
manipulation.
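In graph terms the operation just described is purely mechanical: copy the graph, delete every arrow pointing into the manipulated variable, and add one new arrow representing the experimenter's actions. The following is a minimal sketch of that surgery, assuming Python with the networkx library; the node names are only stand-ins for the variables of Figure 2.14 (the figure itself is not reproduced in the text), so treat the example as illustrative.

import networkx as nx

# One hypothesised (unmanipulated) causal graph, with illustrative node names:
# trees -> light intensity -> photosynthetic rate; light intensity -> anthocyanins
g = nx.DiGraph()
g.add_edges_from([
    ("trees", "light intensity"),
    ("light intensity", "photosynthetic rate"),
    ("light intensity", "anthocyanin concentration"),
])

def manipulate(graph, variable, new_cause="experimenter"):
    """Return the manipulated graph: delete every arrow into `variable`
    and add a single new arrow from the experimenter's actions."""
    manipulated = graph.copy()
    manipulated.remove_edges_from(list(manipulated.in_edges(variable)))
    manipulated.add_edge(new_cause, variable)
    return manipulated

g_manipulated = manipulate(g, "light intensity")
print(sorted(g_manipulated.edges()))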
d-separation will predict the pattern of probabilistic independencies
in this new causal system. Notice that anthocyanin concentration is
d-separated from photosynthetic rate according to the first hypothesis in both
the manipulated system (Figure 2.15), when light intensity is experimentally
fixed, and in the unmanipulated system (Figure 2.14), when light
intensity is statistically fixed by conditioning. The same d-connection
relationships between anthocyanin concentration and photosynthetic rate hold
in the second scenario whether based on physically or on statistically
controlling light intensity. In other words, statistical and experimental controls
are alternative ways of doing the same thing: predicting how the associations
between variables will change once other sets of variables are held constant.
This does not mean that the two types of control always predict the
same types of observational independency in our data; remember the
example of d-separation upon conditioning on a causal child, described
previously. Once we have a way of measuring how closely the predictions agree
with the observations, then we have a way of testing, and potentially
falsifying, causal hypotheses even in cases in which we cannot physically control
the variables of interest.

Figure 2.14. Two different causal scenarios linking the same four variables.

Figure 2.15. Experimental manipulation of the causal systems that are shown in Figure 2.14.
With these notions we can now go back and look again at the randomised
experiment in Chapter 1. Let's consider an example involving an
agricultural researcher who is interested in determining whether, and how,
the addition of a nitrate fertiliser can increase plant yield. To be more
specific, imagine that the plant is Alfalfa, which contains a bacterium in its roots
that is capable of directly fixing atmospheric nitrogen (N₂). The researcher
meets a farmer who tells him that adding such a nitrate fertiliser in the past
had increased the yield of Alfalfa. After further questioning, the researcher
learns that the farmer had tended to add more fertiliser to those parts of the
field that, in previous years, had produced the lowest yields. The researcher
knows that other things can also affect the amount of fertiliser that a farmer
will add to different parts of a field. For instance, parts of the field that cause
the farmer to slow down the speed of his tractor will therefore tend to
receive more fertiliser, and so on²⁹. Imagine that, unknown to the researcher,
the actual causal processes are as shown in Figure 2.16. There are only three
sources of nitrogen: the nitrate that is added to the soil by the fertiliser, by
NOₓ deposition, and from N₂ fixation by the bacterium. The amount of
fertiliser added by the farmer in different parts of the field is determined by
the yield of plants the previous year as well as the contours of the field. In
reality, all the sources of nitrogen and the soil phosphate level [P] are causes
of yield.

29. Readers with experience with tractors will have to assume that the governor is not
functioning!

Figure 2.16. A hypothetical causal system before experimental manipulation.
Before experimenting with this system, the researcher has previous
causal knowledge of only part of it, shown by the thicker arrows in Figure
2.16. He knows that the bacterium will increase Alfalfa yield. He knows
that the bacterium will increase the nitrate concentration in the soil. He
knows that the yield of Alfalfa in previous years has affected the amount of
nitrate fertiliser that the farmer had added, and he knows that the amount
of added nitrate fertiliser is associated with increased yields. What he doesn't
know is whether or not nitrogen added to the soil is the cause of the
subsequent plant yield.
Since the experiment has not yet begun, the random numbers in
Figure 2.16 do not affect any actions by the researcher and the researcher
has no causal effect on any variable in the system. The random numbers
and the researcher's actions are therefore causally independent of each
other and of every other variable in the system.
Based only on the partial knowledge shown by the thick arrows, can
the researcher use d-separation and statistical control to confidently infer
that the added nitrate fertiliser causes an increase in plant yield? No. He
knows that the yields of previous years were a cause of the farmer's fertiliser
addition and not vice versa; therefore he knows that he can block any possible
backdoor path between the amount of fertiliser added and plant
yield that passes through the variable 'plant yield the previous year'.
Unfortunately, he also knows that this was not the only possible cause of the
amount of fertiliser added by the farmer to different parts of the field.
Therefore, he can't exclude the possibility that there is some backdoor path
that does not include the variable 'plant yield the previous year' and that is
generating the association between present plant yield and the amount of
fertiliser added by the farmer. Remember that, to invoke such a possibility,
one must be able to present some empirical evidence that such a backdoor
path might exist, but this would be easy to do. For instance, if the tractor
slows down as it begins to go up a slope (and therefore deposits more
fertiliser), and if water (which is known to increase plant yield) tends to
accumulate at the bottom of the slope, then we have a possible backdoor path
(fertiliser added ← tractor slowed down ← hill → water accumulation →
plant yield).
The researcher knows that it is possible to randomly assign different
levels of nitrate fertiliser to plots of ground in a way that is not caused by
any attribute of these plots. He convinces the farmer not to add any fertiliser.
The previous cause of the amount of fertiliser added has been erased
in this new context and so the arrow from 'plant yield previous year' to
'fertiliser added by farmer' is removed from the causal graph. Since the farmer
has agreed not to add any fertiliser, the value of this variable is fixed at zero,
and so all arrows coming out of this variable are also erased. The researcher
decides to add nitrate fertiliser to different plots at either 0 or 20 kg/hectare,
based only on the value of randomly chosen numbers. Therefore we add an
arrow from 'random numbers' to 'researcher's actions' and also an arrow
from 'researcher's actions' to 'nitrate added to soil'. Remember that an arrow
signifies a direct cause, i.e. a causal effect that is not mediated through other
variables in the causal explanation. Therefore we can't add an arrow from
'researcher's actions' to 'plant yield this year' unless we believe that the
researcher's actions do cause a change in plant yield this year and that this
cause is not completely mediated by some other set of variables in the causal
system. Therefore, the causal structure that exists after the experimental
manipulation is shown in Figure 2.17.
Figure 2.17. Experimental manipulation of the causal system shown in Figure 2.16 based on a randomised experiment.

Given this new causal scenario, we can now use d-separation to
determine whether there is a causal relationship between the amount of
nitrate fertiliser added by the researcher and the plant yield that year. If one
can trace a directed path beginning at 'researcher's actions' and passing
through 'plant yield this year' by following the direction of the arrows, then
the two are not d-separated. This necessarily implies that there will be a
statistical association between the two variables. If no such directed path exists,
then the addition of nitrate fertiliser by the researcher does not cause a
change in plant yield this year. In fact, these two variables are not d-separated
in this causal graph and so such a randomised experiment would detect an
effect of fertiliser addition on plant yield. In Chapter 1 I said that if there is
a statistical association between two variables, X and Y, then there can be
only three elementary (but not mutually exclusive) causal explanations: X
causes Y (shown by a directed path leading from X and passing into Y), or
Y causes X (shown by a directed path leading from Y and passing into X), or
there is some other variable (F) that is a cause of both X and Y (shown by a
backdoor path from F and into both X and Y). Because the researcher has
agreed to act completely in accordance with the results of the randomisation
process, we know that no arrows point into 'researcher's actions' except the
one coming from 'random numbers'. The random numbers are not caused
by any attribute of the system. Therefore the researcher knows that there can
be no backdoor paths confounding the results because he knows that there
are no arrows pointing into 'researcher's actions' except for the one coming
from 'random numbers'. If there is a statistical association between
'researcher's actions' and 'plant yield this year' that can't reasonably be attributed
to random sampling fluctuations then the researcher knows that the
association must be due to a directed path coming from 'researcher's actions'
and passing through 'plant yield this year'. This is why such a randomised
experiment, in conjunction with a way of calculating the probability of
observing such a random event, can provide a strong inference concerning
a causal effect. The reader should note that even the randomisation process
might not allow the researcher to conclude that nitrate added to the soil is
a direct cause of increased plant yield. In Figure 2.17 the researcher has already
concluded that there is a backdoor path between these two variables emanating
from the presence of the nitrogen-fixing bacterium, and so to make such a
claim he would have to provide evidence beyond a reasonable doubt that his
actions did not somehow affect the abundance or activity of these bacteria.
Now, let's modify the causal scenario a bit. Imagine that the farmer
has agreed to let the researcher conduct an experiment and promises not to
add any fertiliser while the experiment is in progress, but insists that the parts
of the field that had produced the lowest plant yield last year must absolutely
receive more fertiliser this year. The researcher decides to allocate the fertiliser
treatment in the following way: after choosing the random numbers as
before, he also adds 5 kg/ha to those plots whose previous yields were below
the median value. Figure 2.18 shows this causal scenario. By doing so he is
no longer conducting a true randomised experiment.
Now, using d-separation we see that there would be an association
between 'researcher's actions' and 'plant yield this year' even if there were
no causal effect of the amount of nitrate fertiliser added on the plant yield
that follows. The reason is that there is now a backdoor path linking the
two variables through the common cause '[P] in soil the previous year'. This
path has been created by allowing plant yield the previous year to be a cause
of the researcher's actions. Yet all is not lost. He systematically assigned fertiliser
levels based only on the yield data of the previous year plus the random
numbers. This means that he knows that there are only two independent
causes determining how much fertiliser each plot received. He also knows,
because of d-separation, that any causal signal passing from any unknown
variable into 'researcher's actions' through 'plant yield the previous year' is
blocked if he statistically controls for plant yield the previous year. He can
make this causal inference without knowing anything else about the causal
system. Therefore he knows that once he statistically conditions on plant
yield the previous year then any remaining statistical association, if it exists,
must be due to a causal signal coming from 'researcher's actions' and following
a directed path into 'plant yield this year'. This causal inference is just as
solid as in the previous example in which treatment allocation was due only
to random numbers. What allows him to do this in this controlled, but not
strictly randomised, experiment but not in the original non-manipulated
system in which the farmer applied the fertiliser based on previous yield
data? If you compare Figures 2.16 (non-manipulated) and 2.18 (controlled,
non-randomised manipulation) you will see that in Figure 2.16 there were
other causes, besides yield, that influenced the farmer's actions. These other
causes were both unknown and unmeasured, thus preventing the researcher
from statistically controlling for them, and this left open the possibility of
other backdoor paths that would confound the causal inference. In Figure
2.18 the experimental design ensured that the only cause (i.e. previous
yields) was already known and measured.

Figure 2.18. Experimental manipulation of the causal system shown in Figure 2.16 that is not based on a randomised experiment.
Using either randomised experiments or this controlled approach,
the researcher could conclude³⁰ that his action of adding nitrate fertiliser
does cause a change in Alfalfa yield and in the amount of nitrate in the soil.

30. Given the typical assumptions of the statistical test used, and assuming that he is not in
the presence of an unusual event.
Under what conditions could he infer that the soil nitrate levels (as
opposed to nitrate fertiliser addition) cause the change in Alfalfa yield? That
is, what would allow him to infer that the fertiliser addition increased soil
nitrate concentration, which, in turn, increased Alfalfa yield? Although he
was able to randomise and to exert experimental control over the amount
of fertiliser added to the soil, this is not the same as randomly assigning
values of soil nitrate to the plots and he has not exerted direct experimental
control over soil nitrate levels. Because of this he cannot unambiguously
claim that the experiment has demonstrated that soil nitrate levels cause an
increase in plant yield. In other words, there might be a backdoor path from
the fertiliser addition to each of soil nitrate and plant yield even though soil
nitrate levels may have had no direct effect on plant yield. For instance,
perhaps the fertiliser addition reduced the population level of some soil
pathogen whose presence was reducing plant growth?
He can test the hypothesis that the association between soil nitrate
levels and plant yield is due only to a backdoor path emanating from the
amount of added fertiliser by measuring soil nitrate levels and then statistically
controlling for this variable. d-separation predicts that, if this new
causal hypothesis is true, then the effect of fertiliser addition will still exist.
If the effect of fertiliser addition was due only to its effect on soil nitrate
levels, then d-separation predicts that the effect of fertiliser addition on plant
yield will disappear once the soil nitrate level is statistically controlled. Since
he knows, from previous biological knowledge, that there is at least one
backdoor path linking soil nitrate and plant yield (due to the effect of the
nitrogen-fixing bacteria in the root nodules) then he can determine whether
there is some other common cause generating a backdoor path if he can
measure and then control for the amount of this bacterium.
2.17 A taste of things to come
Up to now, we have been inferring the properties of the observational
model (the joint probability distribution) given the causal model that
generates it. Can we also do the contrary? If we know the entire pattern of
statistical independencies and conditional independencies in our observational
model, can we specify the causal structure that must have generated
it? No. It is possible for different causal structures to generate the same set
of d-separation statements and, therefore, the same pattern of independencies.
None the less, it is possible to specify a set of causal models that all
predict the same pattern of independencies that we find in the probability
distribution; these are called equivalent models, and these are described in
Chapter 8. By extension, we can exclude a vast group of causal models that
could not have generated the observational data. There are two important
consequences of this.
First, after proposing a causal model and finding that our observational
data are consistent with it (i.e. that the data do not contradict any of
the d-separation statements of our causal model), we can determine which
other causal models would also be consistent with our data³¹. By definition,
our data can't distinguish between such equivalent causal models and so we
will have to devise other sorts of observation to differentiate between them.
Second, we can exploit the independencies in our observational
data to generate such equivalent models even if we do not yet have a causal
model that is consistent with our data. This leads to the topic of exploratory
methods, which is also discussed in Chapter 8. Such exploratory
methods are very useful when theory is not sufficiently well developed to
allow us to propose a causal explanation, a condition that occurs often in
organismal biology.
However, before delving into these topics, we must first look at the
mechanics of fitting such observational models, generating their correlational
shadows, and comparing the observed shadows (the patterns of correlation
and partial correlation) with the predicted shadows. This leads into
the topic of path models and, more generally, structural equations. Chapters
3 to 7 deal with these topics.
31. This statement must be tempered due to practical problems involving statistical power.
3 Sewall Wright, path analysis and
d-separation
3.1 A bit of history
The ideal method of science is the study of the direct influence of one
condition on another in experiments in which all other possible causes of
variation are eliminated. Unfortunately, causes of variation often seem to
be beyond control. In the biological sciences, especially, one often has to
deal with a group of characteristics or conditions which are correlated
because of a complex of interacting, uncontrollable, and often obscure
causes. The degree of correlation between two variables can be calculated
with well-known methods, but when it is found it gives merely the resultant
of all connecting paths of influence.
The present paper is an attempt to present a method of measuring the
direct influence along each separate path in such a system and thus of
finding the degree to which variation of a given effect is determined by
each particular cause. The method depends on the combination of knowledge
of the degrees of correlation among the variables in a system with
such knowledge as may be possessed of the causal relations. In cases in
which the causal relations are uncertain the method can be used to find
the logical consequences of any particular hypothesis in regard to them.
So begins Sewall Wright's 1921 paper in which he describes his
method of path coefficients. In fact, he invented this method while still in
graduate school (Provine 1986) and had even used it, without presenting its
formal description, in a paper published the previous year (Wright 1920).
The 1920 paper used his new method to describe and measure the direct
and indirect causal relationships that he had proposed to explain the patterns
of inheritance of different colour patterns in Guinea Pigs. The paper came
complete with a path diagram (i.e. a causal graph) in which actual drawings
of the colour patterns of Guinea Pig coats were used instead of variable
names.
Wright was one of the most influential evolutionary biologists of the
twentieth century, being one of the founders of population genetics and intimately
involved in the modern synthesis of evolutionary theory and genetics.
Despite these other impressive accomplishments Wright viewed path
analysis as one of his more important scientific contributions and continued
to publish on the subject right up to his death (Wright 1984). The method
was described by his biographer (Provine 1986) as the quantitative backbone
of his work in evolutionary theory. His method of path coefficients is the
intellectual predecessor of all of the methods described in this book. It is
therefore especially ironic that path analysis, the backbone of his work in
evolutionary theory, has been almost completely ignored by biologists.
This chapter has three goals. First, I want to explore why, despite
such an illustrious family pedigree, path analysis and causal modelling have
been largely ignored by biologists. To do this I have to delve into the history
of biometry at the turn of the century but it is important to understand why
path analysis was ignored in order to appreciate why its modern incarnation
does not deserve such a fate. Next I want to introduce a new inferential test
that allows one to test the causal claims of the path model rather than only
measuring 'the direct influence along each separate path in such a system'.
The inferential method described in this chapter is not the first such test.
Another inferential test was developed quite independently by sociologists
in the early 1970s, based on a statistical technique called maximum likelihood
estimation. Since that method forms the basis of modern structural
equation modelling, I postpone its explanation until the next chapter.
Finally, I present some published biological examples of path analysis and
apply the new inferential test to them.
3.2 Why Wright's method of path analysis was ignored
I suspect that scientists largely ignored Wright's work on path analysis for
two reasons. First, it ran counter to the philosophical and methodological
underpinnings of the two main contending schools of statistics at the turn
of the twentieth century. Second, it was methodologically incomplete in
comparison with Fisher's (1925) statistical methods, based on the analysis of
variance combined with the randomised experiment, which had appeared
at about the same time.
Francis Galton invented the method of correlation. Karl Pearson
transformed correlation from a formula into a concept of great scientific
importance and championed it as a replacement for the primitive notion
of causality. Despite Pearson's long-term programme to provide mathematical
contributions to the theory of evolution (Aldrich 1995), he had little
training in biology, especially in its experimental form. He was educated as
a mathematician and became interested in the philosophy of science early
in his career (Norton 1975). Presumably his interest in heredity and genetics
came from his interest in Galton's work on regression, which was itself
applied to heredity and eugenics¹. In 1892 Pearson published a book entitled
The grammar of science (Pearson 1892). In his chapter entitled 'Cause and
effect' he gave the following definition: 'Whenever a sequence of perceptions
D, E, F, G is invariably preceded by the perception C . . ., C is said to
be the cause of D, E, F, G'. As will become apparent later, his use of the word
'perceptions' rather than 'events' or 'variables' or 'observations' was an
important part of his phenomenalist philosophy of science. He viewed the
relatively new concept of correlation as having immense importance to
science and the old notion of causality as so much metaphysical nonsense.
In the third edition of his book (Pearson 1911) he even included a section
entitled 'The category of association, as replacing causation'. In the third
edition he had this to say:

The newer, and I think truer, view of the universe is that all existences are
associated with a corresponding variation among the existences in a
second class. Science has to measure the degree of stringency, or looseness
of these concomitant variations. Absolute independence is the conceptual
limit at one end to the looseness of the link, absolute dependence is the
conceptual limit at the other end to the stringency of the link. The old
view of cause and effect tried to subsume the universe under these two
conceptual limits to experience and it could only fail; things are not in
our experience either independent or causative. All classes of phenomena
are linked together, and the problem in each case is how close is the degree
of association.

These words may seem curious to many readers because they
express ideas that have mostly disappeared from modern biology. None the
less, these ideas dominated the philosophy of science at the beginning of the
twentieth century and were at least partially accepted by such eminent scientists
as Albert Einstein. Pearson was a convinced phenomenalist and
logical positivist². This view of science was expressed by people such as
Gustav Kirchhoff, who held that science can only discover new connections
between phenomena, not discover the underlying reasons. Ernst Mach,
who dedicated one of his books to Pearson, viewed the only proper goal of
science as providing economical descriptions of experience by describing a
large number of diverse experiences in the form of mathematical formulae
(Mach 1883). To go beyond this and invoke unobserved entities such as
'atoms' or 'causes' or 'genes' was not science and such terms must be
removed from its vocabulary. So, Mach (and Pearson) held that a mature
science would express its conclusions as functional (i.e. mathematical)
relationships that can summarise and predict direct experience, not as causal
links that can explain phenomena (Passmore 1966).

1. Galton published his Hereditary genius in 1869 in which he studied the 'natural ability' of
men (women were presumably not worth discussing). He was interested in 'those qualities
of intellect and disposition, which urge and qualify a man to perform acts that lead
to reputation . . .'. He concluded that '[those] men who achieve eminence, and those who
are naturally capable, are, to a large extent, identical'. Lest we judge Galton and Pearson
too harshly, remember that such views were considered almost self-evident at the time.
Charles Darwin is reputed to have said of Galton's book: 'I do not think I ever in my life
read anything more interesting and original . . . a memorable work' (Forrest 1974).
2. It is more accurate to say that his ideas were a forerunner to logical positivism.
Pearson had thought long and hard about the notion of causality
and had concluded, in accord with British empiricist tradition and the
people cited above, that association was all that there was. Causality was an
outdated and useless concept. The proper goal of science was simply to
measure direct experiences (phenomena) and to economically describe
them in the form of mathematical functions. If a scientist could predict the
likely values of variable Y after observing the values of variable X, then he
would have done his job. The more simply and accurately he could do it,
the better his science. If we go back to Chapter 2, Pearson did not view the
equivalence operator of algebra (=) as an imperfect translation of a causal
relationship because he did not recognise causality as anything but correlation
in the limit³. By the time that Wright published his method of path
analysis, Pearson's British school of biometry was dominant. One of its fundamental
tenets was that 'it is this conception of correlation between two
occurrences embracing all relationships from absolute independence to
complete dependence, which is the wider category by which we have to
replace the old idea of causation' (Pearson 1911).
Given these strong philosophical views, imagine what happened
when Wright proposed using the biometrists' tools of correlation and
regression . . . to peek beneath direct observation and deduce systems of
causation from systems of correlation! In such an intellectual atmosphere
Wright's paper on path analysis was seen as a direct challenge to the
Biometrists. One has only to read the title ('Correlation and causation') and
the introduction of Wright's (1921) paper, cited at the beginning of this
chapter, to see how infuriating it must have seemed to the Pearson school.
The pagan had entered the temple and, like the Maccabees, someone
had to purify it. The reply came the very next year (Niles 1922). Said H. E.
Niles: 'We therefore conclude that philosophically the basis of the method
of path coefficients is faulty, while practically the results of applying it where
it can be checked prove to be wholly unreliable.' Although he found fault
in some of Wright's formulae (which were, in fact, correct) the bulk of
Niles' scathing criticism was openly philosophical: 'Causation has been
popularly used to express the condition of association, when applied to
natural phenomena. There is no philosophical basis for giving it a wider
meaning than partial or absolute association. In no case has it been proved
that there is an inherent necessity in the laws of nature. Causation is correlation
. . .' (Niles 1922).

3. And yet, citing the philosopher David Hume, Pearson did accept that associations could
be time-ordered from past to future. Nowhere in his writings have I found him express
unease that such asymmetries could not be expressed by the equivalence operator.
Any Mendelian geneticist during that time (of whom Wright was
one) would have accepted as self-evident that a mere correlation between
parent and offspring told nothing about the mechanisms of inheritance.
Therefore, concluded these biologists, a series of correlations between traits
of an organism told nothing of how these traits interacted biologically or
evolutionarily⁴. The Biometricians could never have disentangled the
genetic rules determining colour inheritance in Guinea Pigs, which Wright
was working on at the time, simply by using correlations or regressions.
Even if distinguishing causation from correlation appeared philosophically
faulty to the Biometricians, Wright and the other Mendelian geneticists
were experimentalists for whom statements such as 'causation is correlation'
would have seemed equally absurd. For Wright, his method of path analysis
was not a statistical test based on standard formulae such as correlation or
regression. Rather, his path coefficients were interpretative parameters for
measuring direct and indirect causal effects based on a causal system that had
already been determined. His method was a statistical translation, a mathematical
analogue, of a biological system obeying asymmetrical causal relationships.
As the fates would have it, path analysis soon found itself embroiled
in a second heresy. Three years after Wright's 'Correlation and causation'
paper, Fisher published his Statistical methods for research workers (1925). Fisher
certainly viewed correlation as distinct from causation. For him the distinction
was so profound that he developed an entire theory of experimental
design to separate the two. He viewed randomisation and experimental
control as the only reliable way of obtaining causal knowledge. Later in
his life Fisher wrote another book criticising the research that identified
tobacco smoking as a cause of cancer on the basis that such evidence was
not based on randomised trials⁵ (Fisher 1959). I have already described the
assumptions linking causality and probability distributions, unstated by
Fisher but needed to infer causation from a randomised experiment, as well
as the limitations of these assumptions, when one is studying different attributes
of organisms. Despite these limitations, Fisher's methods had one
important advantage over Wright's path analysis: they allowed one to rigorously
test causal hypotheses while path analysis could only estimate the
direct and indirect causal effects assuming that the causal relationships were
correct.

4. Pearson was strongly opposed to Mendelism and, according to Norton (1975), this opposition
was based on his philosophy of science; Mendelians insisted on using unobserved
entities (genes) and forces (causation).
5. I don't know whether Fisher was a smoker. If he was, I wonder what he would have
thought if, because of a random number, he was assigned to the non-smoker group in a
clinical trial?
Mulaik (1986) has described these two dominant schools of statistics
in the twentieth century. His phenomenalist and empiricist school starts
with Pearson. Examples of the statistical methods of this school were correlation,
regression⁶, common-factor and principal component analyses.
The purpose of these methods was primarily, as Mach directed, to provide
an economical description of experience by economically describing a large
number of diverse experiences in the form of mathematical formulae. The
second school was the Realist school begun by Fisher. It emphasised the
analysis of variance, experimental design based on the randomised experiment
and the hypothetico-deductive method. These Fisherian methods
were not designed to provide functional relationships but rather to ensure
conditions under which causal relationships could be reliably distinguished
from non-causal relationships.
With hindsight then, it seems that path analysis simply appeared at
the wrong time. It did not fit into either of the two dominant schools of
statistics and it contained elements that were objectionable to each. The
Phenomenalist school of Pearson disliked Wright's notion that one should
distinguish causes from correlations. The Realist school of Fisher disliked
Wright's notion that one could study causes by looking at correlations.
Professional statisticians therefore ignored it. Biologists found Fisher's
methods, complete with inferential tests of significance, more useful and
conceptually easier to grasp and so biologists ignored path analysis too. A
statistical method, viewed as central to the work of one of the most influential
evolutionary biologists of the twentieth century, was largely ignored by
biologists.

6. Regression based on least squares was, of course, developed well before Pearson by people
like Carl Friedrich Gauss and had been based on a more explicit causal assumption: that
the independent variable plus independent measurement errors were the causes of the
dependent variable. This distinction lives on under the guise of Type I and Type II regression.
3.3 d-sep tests
Wright's method of path analysis was so completely ignored by biologists
that most biometry texts do not even mention it. Those that do (Li 1975;
Sokal and Rohlf 1981) described it as Wright originally presented it,
without even mentioning that it was reformulated by others, primarily
economists and social scientists, such that it permitted inferential tests of the
causal hypothesis and allowed one to include unmeasured (or latent) variables.
The main weakness of Wright's method, that it required one to
assume the causal structure rather than being able to test it, had been corrected
by 1970 (Jöreskog 1970) but biologists are mostly unaware of this.
Two different ways of testing causal models will be presented in this
book. The most common method is called structural equations modelling
(SEM) and is based on maximum likelihood techniques. This method is
described in Chapters 4 to 7 and it does have a number of advantages when
testing models that include variables that cannot be directly observed and
measured (so-called latent variables) and for which one must rely on
observed indicator variables that contain measurement errors. SEM also has
some statistical drawbacks. The inferential tests are asymptotic and can
therefore require rather large sample sizes. The functional relationships must
be linear. Data that are not multivariate normal are difficult to treat.
These drawbacks led me to develop an alternative set of methods
that can be used for small sample sizes, non-normally distributed data or
non-linear functional relationships (Shipley 2000). Since these methods are
derived directly from the notion of d-separation that was described in
Chapter 2, I will call these d-sep tests. The main disadvantage of d-sep tests
is that they are not applicable to causal models that include latent (unmeasured)
variables.
The link between causal conditional independence, given by
d-separation, and probabilistic independence suggests an intuitive way of
testing a causal model: simply list all of the d-separation statements that are
implied by the causal model and then test each of these using an appropriate
test of conditional independence. There are a number of problems with
this naïve approach. First, even models with a small number of variables can
include a large number of d-separation statements. Second, we need some
way of combining all of these tests of independence into a single composite
test. For instance, if we had a model that implied 100 independent d-separation
statements and tested each independently at the traditional 5%
significance level we would expect, on average, that five of these tests
would reach significance simply as a result of random sampling fluctuations.
Even worse, the d-separation statements in a causal model are almost never
completely independent and so we would not even know what the true
overall significance level would be. Each of these problems can be solved.
3.4 Independence of d-separation statements
Given an acyclic⁷ causal graph, we can use the d-separation criterion to
predict a set of conditional probabilistic independencies that must be true if
the causal model is true. However, many of these d-separation statements
can themselves be predicted from other d-separation statements and are
therefore not independent. Happily, Pearl (1988) described a simple method
of obtaining the minimum number of d-separation statements needed to
completely specify the causal graph and proved that this minimum list of
d-separation statements is sufficient to predict the entire set of d-separation
statements. This minimum set of d-separation statements is called a basis set⁸.
The basis set is not unique. This method is illustrated in Figure 3.1.

Figure 3.1. A directed acyclic graph (DAG) involving six variables.

To obtain the basis set, the first step is to list each unique pair of
non-adjacent vertices. That is, list each pair of variables in the causal model
that do not have an arrow between them. So, in Figure 3.1 the list is: {(A,C),
(A,D), (A,E), (A,F), (B,E), (B,F), (C,D), (C,F), (D,F)}. Pearl's (1988)
basis set is given by d-separation statements consisting of each such pair
of vertices conditioned on the parents of the vertex having higher causal
order. The number of pairs of variables that don't have an arrow between
them is always equal to the total number of pairs minus the number of
arrows in the causal graph. In general, if there are V variables and A arrows
in the causal graph, then the number of elements in the basis set will be:

V!/(2(V − 2)!) − A

7. This restriction will be partly removed later. Remember that d-separation also implies
probabilistic independence in cyclic causal models in which all variables are discrete and
in cyclic causal models in which functional relationships are linear.
8. Let S be the set of d-separation facts (and therefore the set of conditional independence
relationships) that are implied by a directed acyclic graph. A basis set B for S is a set of
d-separation facts that (i) implies, using the laws of probability, all other elements of S,
and (ii) no proper subset of B sustains such implications.
Unfortunately the conditional independencies derived from such a
basis set are not necessarily mutually independent in finite samples (Shipley
2000). A basis set that does have this property is given by the set of unique
pairs of non-adjacent vertices, of which each pair is conditioned on the set
of causal parents of both (Shipley 2000). Remember that an exogenous variable
has no parents, so the set of parents of such a variable is empty (such
an empty set is written {} or ∅). The second step in getting the basis set
that will be used in the inferential test is to list all causal parents of each
vertex in the pair. Using Figure 3.1 and the notation for d-separation introduced
in Chapter 2⁹, Table 3.1 summarises the d-separation statements that
make up the basis set.
Each of the d-separation statements in Table 3.1 predicts a (conditional)
probabilistic independence. How you test each predicted conditional
independence depends on the nature of the variables. For instance, if the
two variables involved in the independence statement are normally and linearly
distributed, you could test the hypothesis that the Pearson partial correlation
coefficient is zero.

9. In other words, X _||_ Y | Q means that vertex X is d-separated from vertex Y, given the set
of vertices Q.
Table 3.1. A basis set for the DAG shown in Figure 3.1 along with the
implied d-separation statements

Non-adjacent     Parent variables of either     d-separation
variables        non-adjacent variable          statement
A, C             B                              A _||_ C | B
A, D             B                              A _||_ D | B
A, E             C, D                           A _||_ E | {C, D}
A, F             None                           A _||_ F
B, E             A, C, D                        B _||_ E | {A, C, D}
B, F             A                              B _||_ F | A
C, D             B                              C _||_ D | B
C, F             B                              C _||_ F | B
D, F             B                              D _||_ F | B
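The second rule described above (each unique pair of non-adjacent vertices, conditioned on the union of the causal parents of both) is mechanical enough to automate. Below is a minimal sketch in Python, assuming the networkx library. The edge list is only my guess at a DAG like Figure 3.1 (the figure itself is not reproduced in this text), so treat both the graph and the printed statements as illustrative rather than as a reproduction of Table 3.1.

from itertools import combinations
import networkx as nx

def basis_set(dag):
    """Return (x, y, conditioning set) for each pair of non-adjacent vertices,
    conditioning on the union of the causal parents of both vertices."""
    statements = []
    for x, y in combinations(sorted(dag.nodes()), 2):
        if dag.has_edge(x, y) or dag.has_edge(y, x):
            continue  # adjacent vertices are never d-separated
        parents = set(dag.predecessors(x)) | set(dag.predecessors(y))
        parents -= {x, y}
        statements.append((x, y, parents))
    return statements

# A hypothetical six-variable DAG with six arrows
dag = nx.DiGraph([("A", "B"), ("B", "C"), ("B", "D"),
                  ("C", "E"), ("D", "E"), ("E", "F")])
for x, y, cond in basis_set(dag):
    print(f"{x} _||_ {y} | {sorted(cond) if cond else '{}'}")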
Other tests of conditional independence are described below. At this point,
assume that you have used tests of independence that are appropriate for the
variables involved in each d-separation statement and that you have obtained
the exact probability level assuming such independence. By exact probability
levels, I mean that you can't simply look at a statistical table and find that the
probability is less than 0.05; rather, you must obtain the actual probability
level, say, p = 0.036.
Because the conditional independence tests implied by the basis set
are mutually independent, we can obtain a composite probability for the
entire set using Fisher's test. Since this test seems not to have a name, I have
called it Fisher's C (for 'combined') test. If there are a total of k independence
tests in the basis set, and pᵢ is the exact probability of the ith test assuming
independence, then the test statistic is:

C = −2 Σᵢ ln(pᵢ),  i = 1, . . ., k

If all k independence relationships are true, then this statistic will follow a
chi-squared distribution with 2k degrees of freedom. This is not an asymptotic
test unless you use asymptotic tests for some of the individual independence
hypotheses. Furthermore, you can use different statistical tests for
different individual independence hypotheses. In this sense, it is a very
general test.
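Because the statistic is just a sum of logged p-values, it is trivial to compute. Here is a minimal sketch, assuming Python with scipy; the nine p-values are made-up numbers standing in for the exact probabilities of the basis-set tests.

from math import log
from scipy.stats import chi2

def fishers_c(p_values):
    """Return (C, p) for a list of exact p-values from mutually independent tests."""
    c = -2.0 * sum(log(p) for p in p_values)
    df = 2 * len(p_values)              # 2k degrees of freedom
    return c, chi2.sf(c, df)

# Hypothetical p-values for nine independence tests, as in a basis set like Table 3.1
c, p = fishers_c([0.52, 0.31, 0.74, 0.08, 0.45, 0.62, 0.19, 0.88, 0.27])
print(f"C = {c:.2f}, P(chi-squared, 18 df) = {p:.3f}")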
3.5 Testing for probabilistic independence
In this section, I want to be more explicit concerning what independence
and conditional independence mean and the different ways that one can
test such hypotheses given empirical data. Let's first start with the simplest
case: that of unconditional independence.
The difference between the value of a random quantity Xᵢ and its
expected value is (Xᵢ − μ). Since these differences can be either negative
or positive, and we want to know simply the deviation around the expected
value, not the direction of the deviation, we can take the square of the
difference: (Xᵢ − μ)². The expected value of this squared difference¹⁰ is the
variance: E[(Xᵢ − μ_X)²] = E[(Xᵢ − μ_X)(Xᵢ − μ_X)].

10. The formula to estimate this in a sample is given in Box 3.1.
The covariance is simply a generalisation of the variance. If we have
two different random variables (X, Y) measured on the same observational
units, then the covariance between these two variables is defined as:
E[(Xᵢ − μ_X)(Yᵢ − μ_Y)]. If X and Y behave independently of each other, then
large positive deviations of X from its mean (μ_X) will be just as likely to be
paired with large or small, negative or positive, deviations of Y from its mean
(μ_Y). These will cancel each other out in the long run (remember, we are
envisaging a complete statistical population) and the expected value of the
product of these two deviations, E[(Xᵢ − μ_X)(Yᵢ − μ_Y)], will be zero. So,
probabilistic independence of X and Y implies a population zero covariance¹¹.
If X and Y tend to behave similarly, increasing or decreasing together, then
large positive values of X will often be paired with large positive values of
Y and large negative values of X will often be paired with large negative
values of Y. In such cases, the covariance will be large and positive. If X and
Y tend to behave in opposite ways, then the covariance between them will
be negative.
A Pearson correlation coefficient is simply a standardised covariance.
Neither a variance nor a covariance has any upper or lower bound.
Changing the units of measurement (say, from metres to millimetres) will
change both the variance and the covariance. If we divide the covariance
between two variables by the product of their variances (taking the square
root of this product in order to ensure that the range goes from −1 to 1),
then we obtain a Pearson correlation coefficient. Box 3.1 summarises these
points.

11. But not the converse!
Box 3.1. Variance, covariance and correlation

Population variance (sigma squared, σ²_X) of a random variable X:
σ²_X = E[(X − μ_X)²]

Variance (s²_X) of a random variable X from a sample of size n:
s²_X = Σᵢ (Xᵢ − X̄)² / (n − 1)

Population covariance (sigma_XY, σ_XY) between two random variables X, Y:
σ_XY = E[(X − μ_X)(Y − μ_Y)]

Covariance (s_XY) between two random variables X, Y from a sample of size n:
s_XY = Σᵢ (Xᵢ − X̄)(Yᵢ − Ȳ) / (n − 1)

Population Pearson correlation (rho_XY, ρ_XY) between two random variables X, Y:
ρ_XY = E[(X − μ_X)(Y − μ_Y)] / √(E[(X − μ_X)²] E[(Y − μ_Y)²]) = σ_XY / √(σ²_X σ²_Y)

Pearson correlation coefficient (r_XY) between two random variables X, Y
from a sample of size n:
r_XY = s_XY / √(s²_X s²_Y)
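For readers who like to check formulae numerically, here is a minimal sketch, assuming Python with numpy, of the sample formulae in Box 3.1; the data are simulated and the variable names are arbitrary.

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=30)
y = 0.5 * x + rng.normal(size=30)
n = len(x)

s_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)   # sample covariance
s2_x = np.sum((x - x.mean()) ** 2) / (n - 1)               # sample variances
s2_y = np.sum((y - y.mean()) ** 2) / (n - 1)
r_xy = s_xy / np.sqrt(s2_x * s2_y)                          # Pearson correlation

print(r_xy, np.corrcoef(x, y)[0, 1])   # the two values agree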
The formulae in Box 3.1 are valid so long as both X and Y are
random variables. If we want to conduct an inferential test of independence
using these formulae, we have to pay attention to the probability distributions
of X and Y and the form of the relationship between them in case
they are not independent. Different assumptions concerning these points
require different statistical methods.

Case 1: X and Y are both normally distributed and any relationship
between them is linear

Tests of the independence of X and Y involving this set of assumptions are
treated in any introductory statistics book. First, one can transform the
Pearson correlation coefficient so that it follows Student's t-distribution. If
X and Y, sampled randomly and measured on n units, are independent (so
the null hypothesis is that ρ = 0) then the following transformation will
follow a Student's t-distribution¹² with n − 2 degrees of freedom:

t = r √[(n − 2)/(1 − r²)]

This test is exact. So long as you have at least three independent observations
then you can test for the independence of X and Y¹³.
It is also possible to transform a Pearson correlation coefficient so
that it asymptotically follows a standard normal distribution (i.e. a normal
distribution with a mean of zero and a variance of 1). For sample sizes of at
least 50 (and approximately even for sample sizes as low as 25) one can use
Fisher's z-transform:

z = 0.5 ln[(1 + r)/(1 − r)] √(n − 3)
If X and Y are independent then the probability of z can be obtained from
a standard normal distribution. Finally, one can use Hotelling's (1953)
transformation¹⁴, which is acceptable for sample sizes as low as 10:

z = √(n − 1) {0.5 ln[(1 + r)/(1 − r)] − [1.5 ln((1 + r)/(1 − r)) + r] / [4(n − 1)]}

12. For partial correlations, described below, one simply replaces r with the value of the partial
correlation coefficient, and the numerator (n − 2) becomes (n − 2 − p), where p is the
number of conditioning variables.
13. Of course, with so few observations you would have so little statistical power that only
very strong associations would be detected.
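The three transformations above are easy to code directly from the formulae as reconstructed here. The following is a minimal sketch assuming Python with scipy; the value r = 0.45 with n = 30 is an arbitrary illustration, not an example from the text, and each function returns the statistic with a two-tailed probability under the null hypothesis ρ = 0.

import math
from scipy.stats import t as t_dist, norm

def t_test(r, n):
    t = r * math.sqrt((n - 2) / (1 - r**2))
    return t, 2 * t_dist.sf(abs(t), n - 2)

def fisher_z_test(r, n):
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - 3)
    return z, 2 * norm.sf(abs(z))

def hotelling_z_test(r, n):
    fz = 0.5 * math.log((1 + r) / (1 - r))
    z = (fz - (3 * fz + r) / (4 * (n - 1))) * math.sqrt(n - 1)
    return z, 2 * norm.sf(abs(z))

for name, test in [("t", t_test), ("Fisher z", fisher_z_test),
                   ("Hotelling z", hotelling_z_test)]:
    print(name, test(0.45, 30))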
Case 2: X and Y are continuous but not normally distributed and
any relationship between them is only monotonic

If X or Y are not normally distributed and any relationship between them
is not linear but is monotonic¹⁵, then we can use Spearman's correlation
coefficient. Although there exist statistical tables giving probability levels for
Spearman's correlation coefficient, one can use exactly the same formulae
as for Pearson's correlation coefficient so long as the sample size is greater
than 10 (Sokal and Rohlf 1981).
The first step is to convert X and Y to their ranks. In other words,
sort each value of X from smallest to largest and replace the actual value of
each X by its order in the rank; the smallest number becomes 1, the second
smallest number becomes 2, and so on. Do the same thing for Y. Now that
you have converted each X and each Y to its rank, you can simply put these
numbers into the formula for a Pearson's correlation coefficient and test as
before.
One complication is when there are ties. Spearman's coefficient
assumes that the underlying values of X and Y are continuous, not discrete.
Given such an assumption then equal values of X (or Y) will only occur due
to limitations in measurement. To correct for such ties, first sort the values
ignoring ties, and then replace the ranks of tied values by the mean rank of
these tied values. Box 3.2 gives an example of the calculation of a Spearman
rank correlation coefficient.
14. Both Fisher's and Hotelling's transformations can be used to test null hypotheses in which
ρ equals a value different from zero. This useful property allows one to compute confidence
intervals around the Pearson correlation coefficient.
15. A non-monotonic relationship is one in which X increases with increasing Y over part of
the range and decreases with increasing Y over another part of the range. If you think that
a graph of X and Y has hills and valleys, then the relationship is non-monotonic.
Box 3.2. Spearman's rank correlation coefficient

Here are 10 simulated pairs of values and the accompanying scatterplot
(Figure 3.2). The X values were drawn from a uniform distribution and
rounded to the nearest unit. The Y values were drawn from the following
equation: Yᵢ = Xᵢ^0.2 + εᵢ, where the random component εᵢ is drawn from a
distribution with shape parameters of 5 and 1.

Values of X, Y and their ranks

X     Y      Rank X   Rank Y   Rank X (ties corrected)   Rank Y (ties corrected)
 2    2.08    1        3        1                         3
 3    2.02    2        2        2.5                       2
15    2.68   10       10       10                        10
10    2.47    8        6        8                         6
 5    2.21    5        4        5                         4
12    2.23    9        5        9                         5
 3    1.86    3        1        2.5                       1
 4    2.25    4        7        4                         7
 9    2.31    6        9        6.5                       9
 9    2.28    7        8        6.5                       8

Figure 3.2. A scatterplot of randomly generated pairs of values from a bivariate non-normal distribution and possessing a non-linear monotonic relationship.

In the above table, X, Y are the original values. Columns 3 and 4 of the table
are the ranks of X and Y before correcting for ties (the underlined values).
Columns 5 and 6 are the ranks after correcting for the two pairs of tied values
of X (there were two values of 3 and two values of 9). To calculate the
Spearman rank correlation coefficient of X and Y, simply use the values in
columns 5 and 6 and enter them into the formula for the Pearson's correlation
coefficient. In the above example, the Spearman rank correlation
coefficient is 0.726. Assuming that X and Y are independent in the statistical
population, we can convert this to a standard normal variate using Hotelling's
z-transform, giving a value of 2.47. This value has a probability under the
null hypothesis of 0.014.
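Here is a minimal sketch of the same calculation in Python (numpy and scipy assumed), using the X and Y values listed above. scipy's rankdata applies exactly the tie correction described in the text; because the Y values are printed here to only two decimals, the coefficient computed from them need not reproduce the boxed 0.726 exactly.

import numpy as np
from scipy.stats import rankdata, spearmanr

x = np.array([2, 3, 15, 10, 5, 12, 3, 4, 9, 9])
y = np.array([2.08, 2.02, 2.68, 2.47, 2.21, 2.23, 1.86, 2.25, 2.31, 2.28])

rank_x = rankdata(x)        # the two 3s get rank 2.5, the two 9s get rank 6.5
rank_y = rankdata(y)
r_s = np.corrcoef(rank_x, rank_y)[0, 1]    # Spearman = Pearson applied to the ranks

rho, p = spearmanr(x, y)                   # cross-check against scipy's own routine
print(r_s, rho)                            # identical up to floating-point error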
Case 3: X and Y are continuous and any relationship between
them is not even monotonic
This case applies when the relationship between X and Y might have a very
complicated form, with X and Y being positively related in some parts of
the range and negatively related in other parts, and therefore when neither
a Pearson nor a Spearman correlation can be applied. This situation requires
more computationally demanding methods, including form-free regression
and permutation tests. Each of these topics is dealt with much more fully in
other publications but will be introduced intuitively here because these
notions are needed for the analogous case in conditional independence.
Form-free regression is a vast topic, which includes kernel smoothers, cubic-
spline smoothers (Wahba 1991) and local (loess) smoothers (Cleveland and
Devlin 1988; Cleveland, Devlin and Grosse 1988; Cleveland, Grosse and
Shyu 1992). Collectively, these methods form the basis of generalised addi-
tive models (Hastie and Tibshirani 1990). Permutation tests for association
are described by Good (1993, 1994).
3.6 Permutation tests of independence
To begin, consider a simple linear regression of Y on X, where both are
random variables. The correlation between X and Y is the same as the correlation
between the observed value of Y and the predicted value of Y given
X, that is: E[Y|X]. To test for an association between X and Y in this regression
context we need to do three things. First, we have to estimate the predicted
values of Y for each value of X. For linear regression we simply obtain
the slope and intercept to get these values and in the general case we would
use form-free regression methods. Second, we need to calculate a measure
of the association between the observed and predicted values of Y; we can
use a Pearson correlation coefficient, a Spearman correlation coefficient, or
any of a large number of other measures that can be found in the statistical
literature. Finally, we need to know the probability of having observed such
a value when, in fact, X and Y really are independent. This is where a permutation
test comes in handy.
Remembering the definition of probabilistic independence given
in Chapter 2, we know that if X and Y are independent then the probability
of observing any particular value of Y is the same whether or not we
know the value of X. In other words, any value of X is just as likely to be
paired with any other value of Y as with the particular Y that we happen to
observe. The permutation test works by making this true in our data. After
calculating our measure of association in our data, we randomly rearrange
the values of X and/or Y using a random number generator. In this new
randomly mixed data set the values of X and Y really are independent
because we forced them to be so; we have literally forced our null hypothesis
of independence to be true and the value of the association between X
and Y is due only to chance. We do this a very large number of times until
we have generated an empirical frequency distribution of our measure of
association¹⁶. The exact number of times that we randomly permute our
data will depend on the true probability level of our actual data and the
accuracy that we want to obtain in our probability estimate. Manly (1997)
showed how to determine this number, but it is typically between 1000 and
10000 times. On modern computers this will take only a few seconds. The
last step is to count the proportion of times that we observe at least as large
a value of association within the permuted data sets, or its absolute value for
a 2-tailed test, as we actually observed in our original data. Box 3.3 gives an
example of this permutation procedure.

16. For small samples one can generate all unique permutations of the data. The use of
random permutations, described here, is generally applicable and the estimated probabilities
converge on the true probabilities as the number of random permutations increases.
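Here is a minimal sketch of this permutation procedure, assuming Python with numpy and using the ordinary Pearson correlation as the measure of association; the simulated x and y are arbitrary. The same recipe applies to any other measure, including the loess-based one described in Box 3.3.

import numpy as np

def permutation_p(x, y, n_perm=5000, seed=1):
    """Two-tailed permutation probability for the Pearson correlation of x and y."""
    rng = np.random.default_rng(seed)
    observed = abs(np.corrcoef(x, y)[0, 1])
    count = 0
    for _ in range(n_perm):
        y_star = rng.permutation(y)          # force the null hypothesis to be true
        if abs(np.corrcoef(x, y_star)[0, 1]) >= observed:
            count += 1
    return count / n_perm

rng = np.random.default_rng(2)
x = rng.uniform(size=30)
y = x + rng.normal(scale=0.5, size=30)
print(permutation_p(x, y))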
3.7 Form-free regression
Box 3.3. Loess regression and permutation tests
The following three graphs (Figure 3.3) show a simulated data set generated
from a complicated non-linear function (solid line of the first graph) along
with a loess regression (broken line) using a local quadratic fit and a neighbourhood
size of one half the range of X. The middle graph shows the same
complicated non-linear function in the range 1 to 3 of the X values and the
graph to the right shows this in the range 1.5 to 2.5 of the X values.

Figure 3.3. The graph on the left shows a highly non-linear function (the solid line) between X and Y and the loess fit (dotted line, mostly superimposed on the solid line). The small rectangle is reproduced in the middle graph and the small rectangle in this middle graph is reproduced in the graph on the right.
The loess regression (the dotted line in the left graph) doesn't actually give a parametric function linking Y to X, but does give the predicted value of Y for each unique value of X; i.e. it gives the estimate of E[Y|X]; the solid and broken lines in the left-most graph completely overlap except in the range of X ≈ 2. To estimate a permutation probability of the non-linear correlation of X and Y, we can first calculate the Pearson correlation coefficient between the observed Y values (the circles in the figure) and the predicted values of Y given X (the loess estimates). In this example, r = 0.956. If we don't want to assume any particular probability distribution for the residuals, then we can generate a permutation frequency distribution for the correlation coefficient. To do this, we randomly permute the order of the observed Y values (or the predicted values, it doesn't matter which) to get a new set of Y* values and recalculate the Pearson correlation coefficient between Y* and E[Y*|X]. The following histogram (Figure 3.4) shows the relative frequency of the Pearson correlation coefficient in 5000 such permutations; the arrow indicates the value of the observed Pearson correlation coefficient. None of the 5000 permutation data sets had a Pearson correlation whose absolute value was at least 0.956. Since the residuals were actually generated
from a unit normal distribution, we can calculate the probability of observing a value of 0.956 with 101 observations. It is approximately 1 × 10⁻³⁹.

Figure 3.3. The graph on the left shows a highly non-linear function (the solid line) between X and Y and the loess fit (dotted line, mostly superimposed on the solid line). The small rectangle is reproduced in the middle graph and the small rectangle in this middle graph is reproduced in the graph on the right.
The first graph in Box 3.3 (Figure 3.3) shows a highly non-linear relationship between X and Y and it is unlikely that we would be able to deduce the actual function that generated these data¹⁷. On the other hand,
if we concentrate on smaller and smaller sections of the graph, the relation-
ship becomes simpler and simpler. The basic insight of form-free regression
methods is that even complicated functions can be quite well approximated
by simple linear, quadratic or cubic functions in the neighbourhood of a
given value of X. Within such a neighbourhood, shown by the boxes in the
graphs of Box 3.3, we can use these simpler functions to calculate the
expected value of Y at that particular value of X. We then go on to the next
value of X, move the neighbourhood so that it is centred around this new
value of X, and calculate the expected value of the new Y, and so on. In
this way, we do not actually estimate a parametric function predicting Y over the entire range of X but we do get very good estimates of the predicted values of Y given each unique value of X. To obtain the predicted values of Y given X, we use weighted regression (linear, quadratic or cubic) where each (X,Y) pair in the data set is weighted according to its distance from the value of X around which the neighbourhood is centred. In local, or loess¹⁸, regression the neighbourhood size can be chosen according to different criteria, such as minimising the residual sum of squares, and the weights are chosen based on the tricube weight function. Shipley and Hunt (1996) described this in more detail in the context of plant growth rates¹⁹.

¹⁷ The actual function was Y = X sin(X) + ε, where the error term ε comes from a unit normal distribution.
¹⁸ The word 'loess' comes from the geological term loess, which is a deposit of fine clay or silt along a river valley. I suppose that this evokes the image of a very wavy surface that traces the form of the underlying geological formation. At least some statisticians have a sense of the poetic.
¹⁹ The S-PLUS program performs multivariate form-free regression (StatSci 1995).

Figure 3.4. The frequency distribution of the Pearson correlation coefficient in 5000 random permutations of the simulated data set involving the observed Y values and the predicted loess values. The arrow shows the observed Pearson correlation in the original simulated data set.
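As an illustration of such a smoother (not the author's S-PLUS code), the lowess function in the Python statsmodels library returns non-parametric estimates of E[Y|X]; lowess is a locally linear variant of loess, and its frac argument plays a role analogous to the neighbourhood size:

import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 101)
y = x * np.sin(x) + rng.normal(size=x.size)   # a non-linear function plus unit normal noise

# Locally weighted regression: frac is the proportion of the data used
# as the neighbourhood around each value of x.
fitted = lowess(y, x, frac=0.5, return_sorted=True)
x_grid, e_y_given_x = fitted[:, 0], fitted[:, 1]   # estimates of E[Y|X]

The fitted values e_y_given_x can then be correlated with the observed Y values and tested by permutation, exactly as in Box 3.3.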
3.8 Conditional independence
So far we have been talking about unconditional independence; that is, the
independence of two variables without regard to the behaviour of any other
variables. Such unconditional independence is implied by two variables in
a causal graph that are d-separated without conditioning on any other var-
iable. d-separation upon conditioning implies conditional independence. The
notion of conditional independence seems paradoxical to many people.
How can two variables be dependent, even highly correlated, and still be
independent upon conditioning on some other set of variables?
Consider the following causal graph: ε₁ → X ← Z → Y ← ε₂. Does it seem equally paradoxical if I say that X and Y will behave similarly owing to the common causal effect of Z, but that they will no longer behave similarly if I prevent Z from changing? If Z doesn't change, then the only changes in X and Y will come from the changes in ε₁ and ε₂, and these two variables are d-separated and therefore unconditionally independent. A moment's reflection will convince you that if Z is allowed to change (vary)
both responding to Z. If the variables in the causal graph are random then
the correlation between X and Y will be due to the fact that both share
common variance due to Z. If we restrict the variance in Z more and more,
then X and Y will share a smaller and smaller amount of common variance.
In the limit, if we prevent Z from changing at all, then X and Y will no
longer share any common variance; the only variation in X and Y will come
from the independent error variables ε₁ and ε₂ and so X and Y will then be independent. In such a case we would be comparing values of X and Y
when Z is constant. This is the intuitive meaning of conditional indepen-
dence. To illustrate, I generated 10000 independent sets of ε₁, X, Z, Y and ε₂ according to the following generating equations:
ε₁ = N(0, 1 − 0.9²)
ε₂ = N(0, 1 − 0.9²)
Z = N(0, 1)
Y = 0.9Z + ε₁
X = 0.9Z + ε₂
Since X, Y and Z are all unit normal variables, the population correlations are ρ_XZ = 0.9, ρ_YZ = 0.9 and ρ_XY = 0.81. Notice that X and Y are
highly correlated even though neither X nor Y is a cause of the other. Figure
3.5 shows three scatterplots. The plot on the left shows the relationship
between X and Y when no restrictions are placed on the variance of Z. The
sample correlation between X and Y in this graph is 0.8016, compared with
the population value of 0.81. The graph in the middle plots only those values
of X and Y for which the value of Z is between −2 and +2, thus restrict-
ing the variance of Z a little bit. The sample correlation between X and Y
has been decreased slightly to 0.7591. The graph on the right plots those
values of X and Y for which the value of Z is between −0.5 and +0.5, thus
restricting the variance of Z much more. The sample correlation between
X and Y is now only 0.2294. Clearly, the degree of association between X
and Y is decreasing as Z is prevented more and more from varying.
If we calculate the correlation between X and Y as we restrict the
variation in Z more and more, we can get an idea of what happens to the
correlation between X and Y in the limit when the variance of Z is zero.
This limit is the correlation between X and Y when Z is fixed (or conditioned) to a constant value; this is called the partial correlation between X and Y, conditional on Z, and it is written ρ_XY·Z or ρ_XY|Z. Figure 3.6 plots
the sample correlation between X and Y as Z is progressively restricted in
its variance.
As expected, as the range of Z around its mean (zero) becomes
smaller and smaller, the correlation between X and Y also becomes smaller
and approaches zero. Given the causal graph that governed these data, we
know that X and Y are not unconditionally d-separated and therefore are
not unconditionally independent. However, X and Y are d-separated given
Z and therefore X and Y are independent conditional on Z.
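A short simulation along these lines (an illustrative Python sketch of the generating equations above, with the error standard deviation written explicitly so that X and Y are unit normal) shows the correlation between X and Y shrinking toward zero as the variance of Z is restricted:

import numpy as np

rng = np.random.default_rng(1)
n = 10000
z = rng.normal(0, 1, n)
y = 0.9 * z + rng.normal(0, np.sqrt(1 - 0.9**2), n)
x = 0.9 * z + rng.normal(0, np.sqrt(1 - 0.9**2), n)

for limit in (np.inf, 2.0, 0.5, 0.1):
    keep = np.abs(z) < limit                    # restrict the variance of Z more and more
    r = np.corrcoef(x[keep], y[keep])[0, 1]
    print(f"|Z| < {limit}: r(X,Y) = {r:.3f}")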
Figure 3.5. The graph on the left shows 10000 observations of X and Y that were generated from the causal graph ε₁ → X ← Z → Y ← ε₂ and parameterised as given in the text. The middle graph shows only those (X,Y) observations for which |Z| is less than 2. The graph on the right shows only those (X,Y) observations for which |Z| is less than 0.5.

If we remember that a regression of X on Z gives the expected value
of X conditional on Z, then the residuals around this regression are the
values of X for xed values of Z. This gives us another way of visualising
the partial correlation of X and Y conditional on Z: it is the correlation
between the residuals of X, conditional on Z, and the residuals of Y, con-
ditional on Z. If I regress, in turn, each of X and Y on Z in the above
example and calculate the correlation coefficient between the residuals of
these two regressions, I get a value of 0.0060.
This view of a conditional independence provides us with a very
general method of testing for it. If X and Y are predicted to be d-separated
given some other set of variables Q = {A, B, C, . . .} then regress (perhaps
using form-free regression) each of X and Y on the set Q and then test for
independence of the residuals using, if you want, any of the methods of
testing unconditional independence described above. If the residuals are
normally distributed and linearly related then you can use the test for
Pearson correlations. If the residuals appear, at most, to have a monotonic
relationship then you can use the test for a Spearman correlation. If the
residuals have a more complicated pattern then you can use one of the non-parametric smoothing techniques available, followed by a permutation test. The only difference is that you have to reduce the degrees of freedom in the tests by the number of variables in the conditioning set.

Figure 3.6. The Pearson correlation coefficient between X and Y in the data shown in Figure 3.5 (left) when the absolute value of Z is restricted to various degrees. The limiting value of the correlation coefficient when |Z| is restricted to a constant value is the partial correlation between X and Y.
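A sketch of this residual-based procedure in Python (illustrative, not from the original text); note that the p-value returned by a standard correlation test does not make the degrees-of-freedom correction mentioned above, so for formal use the degrees of freedom should be reduced by the number of conditioning variables:

import numpy as np
from scipy import stats

def residuals_given(target, q):
    """Least-squares residuals of `target` regressed on the variables in the list q."""
    design = np.column_stack([np.ones(len(target))] + [np.asarray(col, float) for col in q])
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ beta

def conditional_independence_test(x, y, q):
    rx = residuals_given(np.asarray(x, float), q)
    ry = residuals_given(np.asarray(y, float), q)
    # Pearson on the residuals; a Spearman correlation or a permutation test
    # could be substituted, depending on the form of the residuals.
    r, p = stats.pearsonr(rx, ry)
    return r, p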
Most of these tests can be performed using standard statistical programs²⁰. If your statistical program can invert a matrix, then there are faster ways of calculating partial Pearson or Spearman correlations. These are explained in Box 3.4.

²⁰ My Toolbox (Appendix) contains a program to calculate partial correlations of various orders.
Box 3.4. Calculating partial covariances and partial correlations
Given a sample covariance matrix S, the inverse of this matrix is called the
concentration matrix, C. The negative of the off-diagonal elements c_ij give the partial covariance between variables i and j, conditional on (holding constant) all of the other variables included in the matrix. This gives an easy way of estimating partial covariances and partial correlations of any order. To get the partial covariance between variables X and Y conditional on a set of other variables Q, simply create a covariance matrix in which the only variables are X, Y, and the remaining variables in Q. After inverting this matrix, this partial covariance is the negative of the element in the row pertaining to X and the column pertaining to Y, i.e. −c_XY. The partial correlation between X and Y
is given by:

r_X,Y|Q = −c_XY / √(c_XX · c_YY)

or, for a single conditioning variable Z, equivalently by the familiar first-order formula:

ρ_X,Y|Z = (ρ_XY − ρ_XZ·ρ_YZ) / √((1 − ρ²_XZ)(1 − ρ²_YZ))

For example, consider the following sample covariance matrix of four variables W, X, Y and Z:
W X Y Z
W 1.43347870 0.75265627 0.06269845 0.10179918
X 0.75265627 1.52762094 0.53911722 0.03777874
Y 0.06269845 0.53911722 1.71116716 0.90033856
Z 0.10179918 0.03777874 0.90033856 1.73196991
The inverse of the matrix (rounded to the nearest 100th) obtained by
extracting only the elements of the covariance matrix pertaining to W, X and
Y is:
W X Y
W 1.43 0.75 0.01
X 0.75 1.53 0.56
Y 0.01 0.56 1.24
The partial correlation between W and Y, conditional on X, is:

r_WY|X = −(−0.01)/√(1.43 × 1.24) = 0.0075

The same method can be used to obtain Spearman partial correlations, by simply ranking the variables as described in Box 3.2 and then proceeding in the same way as for Pearson partial correlations.
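A compact Python sketch of the concentration-matrix method of Box 3.4 (illustrative code; it is not the Toolbox program mentioned in the footnote):

import numpy as np

def partial_correlation(cov, i, j):
    """Partial correlation between variables i and j, conditioning on every other
    variable included in the covariance matrix `cov` (the concentration-matrix method)."""
    c = np.linalg.inv(np.asarray(cov, dtype=float))   # the concentration matrix
    return -c[i, j] / np.sqrt(c[i, i] * c[j, j])

# Usage sketch: for a covariance matrix S of (W, X, Y, Z), the partial correlation
# of W and Y given X is partial_correlation(S[np.ix_([0, 1, 2], [0, 1, 2])], 0, 2).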
3.9 Spearman partial correlations
This next section presents some Monte Carlo results to explore the degree
to which the sampling distribution of Spearman partial correlations, after
appropriate transformation, follows either a standard normal or a Student's
t-distribution. This section is not necessary to understand the application of
d-sep tests for path models, only to justify the use of Spearman partial cor-
relations in testing for conditional independence.
There has been remarkably little published in the primary literature concerning inferential tests related to non-parametric conditional independence²¹. It is known that the expected values of first-order Kendall or Spearman partial correlations need not be strictly zero even when two variables are conditionally independent given the third (Shirahata 1980; Korn 1984). On the other hand, Conover and Iman (1981) recommended the use of partial Spearman correlations for most practical cases in which the relationships between the variables are at least monotonic. A Spearman partial correlation is simply a Pearson partial correlation applied to the ranks of the variables in question. Therefore the conditional independence of non-normally distributed variables with non-linear, but monotonic, functional relationships between the variables can be tested with Spearman's partial rank correlation coefficient simply by ranking each variable (and correcting for ties as described in Box 3.2) and then applying the same inferential tests as for Pearson partial correlations. For instance, if one accepts Conover and Iman's (1981) recommendations, then a Spearman partial rank correlation will be approximately distributed as a standard normal variate when z-transformed.

²¹ Kendall and Gibbons (1990) briefly discuss Spearman and Kendall partial correlations and provide a table of significance values for first-order Kendall partial correlations for small sample sizes.
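The following Python sketch assembles such a test from ranks, the concentration-matrix formula of Box 3.4 and the usual Fisher z-transform; it is illustrative code, and the √(n − k − 3) scaling is the standard normal approximation assumed here rather than something quoted from the text:

import numpy as np
from scipy import stats

def spearman_partial_test(x, y, q):
    """Spearman partial correlation of x and y given the variables in the list q,
    with a two-tailed p-value from the z-transformed normal approximation."""
    data = np.column_stack([x, y] + list(q))
    ranks = np.apply_along_axis(stats.rankdata, 0, data)   # rank each variable, averaging ties
    c = np.linalg.inv(np.cov(ranks, rowvar=False))         # concentration matrix of the ranks
    r = -c[0, 1] / np.sqrt(c[0, 0] * c[1, 1])
    n, k = data.shape[0], len(q)
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - k - 3)  # Fisher z-transform
    p = 2 * stats.norm.sf(abs(z))
    return r, p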
How robust is this recommendation? To explore this question, Table 3.2 presents the results of some Monte Carlo simulations to determine the effects of sample size, the distributional form of the variables, and the effect of non-linearity on the sampling distribution of the z-transformed Spearman partial correlation coefficient. The random components of the generating equations (εᵢ) were drawn from four different probability distributions: normal, gamma, beta or binomial. I chose the shape parameters of the gamma and beta distributions to produce different degrees of skew and kurtosis. Gamma(1) is a negative exponential distribution. Gamma(5)²² is an asymmetrical distribution with a long right tail. Beta(1,1) is a uniform distribution, beta(1,5) is a highly asymmetrical distribution with a long right tail and beta(5,1) is a highly asymmetrical distribution with a long left tail. The final (discrete) probability distribution was symmetrical with an expected value of 2 and had ordered states of X = 0, 1, 2, 3 or 4; these were generated from a binomial distribution of the form C(5,X)·0.5^X·0.5^(5−X). Random numbers were generated using the random number generators given by Press et al. (1986). The generating equations were of the form:
X₁ = ε₁
Xᵢ = αᵢ X_{i−1}^γᵢ + εᵢ,   i > 1
These generating equations are based on a causal chain (X₁ → X₂ → X₃ → . . .) with sufficient variables (3, 4 or 5) to produce zero partial associations of orders 1 to 3. When γᵢ equals 1.0 the relationships between the variables are linear and when γᵢ is different from 1.0 then the relationships between the variables are non-linear but monotonic. The results in Table 3.2 are based on models with γᵢ = 1 (linear) and 0.5 (non-linear) but other values give similar results.

²² α is a constant affecting the shape of the distribution and is sometimes referred to as a waiting time for the event in a Poisson random process of unit mean.

All the simulation results in Table
3.2 are based on 1000 independent simulated data sets. In interpreting Table
3.2, remember that the z-transformed Spearman partial correlations should
be approximately distributed as a standard normal variate whose population
mean is zero, whose population standard deviation is 1.0, and whose 2-tailed 95% limits are ±1.96.

Generally, the sampling distribution of the z-transformed Spearman rank partial correlations is a very good approximation of a standard normal distribution. In fact, the only significant deviation from a standard normal distribution (based on a Kolmogorov–Smirnov test) was observed for the ranks of normally distributed variables, for which one would not normally use a Spearman partial correlation. The empirical standard deviations were always close to 1.0 and the empirical means only once differed significantly, but very slightly, from zero at high levels of replication. Approximate 95% confidence intervals for the empirical 0.05 significance level (i.e. the 2-tailed 95% quantiles), based on 1000 simulations, are 0.037 to 0.064 (Manly 1997).
The results of this simulation study support the recommendations
of Conover and Iman (1981). These results are also consistent with the theo-
retical values given by Korn (1984) for the special case of a Spearman first-order partial based on trivariate normal and trivariate log-normal distribu-
tions, where the limiting values of the Spearman partial correlation are less
than, or equal to, an absolute value of 0.012, thus giving an expected abso-
lute z-score of 0.024. Korn (1984) gave a pathological example in which
the above procedure will not work even after ranking the data because there
is a non-monotonic relationship between the variables; he recommended that one first check²³ to see whether the relationships between the ranks are approximately linear before using Spearman partial correlations.

²³ This can be done by simply plotting the scatterplots of the ranked data.

Table 3.2. Results of a Monte Carlo study of the distribution of z-transformed Spearman partial correlations. Four different distributional types were simulated for the random components. Sample size was the number of observations per simulated data set. Linear (L) and non-linear (NL) functional relationships were used. The empirical mean, standard deviation and the 2-tailed 95% limits of 1000 simulated data sets are shown.

Distribution of εᵢ   Sample size   Order of partial   Linear/non-linear   Mean of z   Standard deviation of z   2-tailed 95% quantile   Theoretical probability
Normal                25            1                  L                   0.08        1.03                      2.04                    0.04
Normal                50            1                  L                   0.08        0.97                      2.01                    0.05
Normal               400            1                  L                   0.08        1.04                      2.16                    0.03
Normal                50            2                  L                   0.03        0.99                      1.86                    0.06
Normal                50            3                  L                   0.07        1.00                      1.85                    0.06
Gamma(1)              25            3                  L                   0.01        1.05                      2.09                    0.04
Gamma(1)              50            3                  L                   0.02        0.96                      1.82                    0.07
Gamma(1)              50            3                  NL                  0.03        0.96                      2.02                    0.04
Gamma(5)              50            3                  NL                  0.07        0.99                      1.93                    0.05
Beta(1,1)             50            3                  L                   0.02        0.99                      2.00                    0.05
Beta(1,1)             50            3                  NL                  0.03        1.02                      2.08                    0.04
Beta(1,5)             50            3                  NL                  0.03        1.02                      2.08                    0.04
Beta(5,1)             50            3                  NL                  0.05        0.99                      1.78                    0.07
Beta(5,1)            400            3                  NL                  0.00        1.02                      2.01                    0.04
Binomial              50            3                  NL                  0.01        0.99                      1.95                    0.05
3.10 Seed production in St Lucie's Cherry
St Lucie's Cherry (Prunus mahaleb) is a small species of tree that is found in
the Mediterranean region and relies on birds for the dispersal of its seeds.
As in most plants, seedlings from seeds that are dispersed some distance from
the adult are more likely to survive, since they will not be shaded by their
own parent or eaten by granivores that are attracted to the parent tree. For
species whose seeds can survive the passage through the digestive tract of
the dispersing animal, it is also evolutionarily and ecologically advantageous
for the fruit to be eaten by the animal, since the seed will be deposited with
its own supply of fertiliser. Not all frugivores of St Lucie's Cherry are useful
fruit dispersers. Some birds just consume the pulp while either leaving the
naked seed attached to the tree or simply dropping the seed to the ground
directly beneath the parent. In order to estimate selection gradients, Jordano
(1995) measured six traits of 60 individuals of this species: the canopy pro-
jection area (a measure of photosynthetic biomass), average fruit diameter,
the number of fruits produced, average seed weight, the number of fruits
consumed by birds and the percentage of these consumed fruits that were
properly dispersed away from the parent by passage through the gut. Based
on five of these variables for which I had data (I was lacking the total number
of fruits consumed by birds) I proposed the path model shown in Figure 3.7
(Shipley 1997), using the exploratory path models described in Chapter 8.
We can use this model to illustrate the d-sep test. The first step is to obtain the d-separation statements in the basis set that are implied by the causal graph in Figure 3.7. There are six such statements, since there are five variables and four arrows. Table 3.3 lists these d-separation statements.
We next have to decide how to test the independencies that are
implied by these six d-separation statements. The original data showed het-
erogeneity of variance, as often happens with size-related variables, but
transforming each variable to its natural logarithm stabilises the variance.
Figure 3.8 shows the scatterplot matrix of these ln-transformed data.
Since the relationships appear to be linear and histograms of each
variable did not show any obvious deviations from normality, we can test
the predicted independencies using Pearson partial correlations. The results
are shown in Table 3.3. Fisher's C statistic is 7.73, with 12 degrees of
freedom (df), for an overall probability of 0.806. The difference between the observed and predicted (partial) correlations would occur in about 80% of data sets (in the long run) even if the data really were produced by the
causal structure in Figure 3.7. This doesn't mean that the data were produced by such a causal structure but it does mean that we have no reason to reject it on the basis of the statistical test. If we want to reject it anyway, then we will need to produce reasonable doubt. Perhaps the assumption of normality, upon which the test of the Pearson partial correlations is based, was producing incorrect probability estimates. Table 3.3 also lists the Spearman partial correlations. The overall probability of the model (χ² = 9.99, 12 df), based on the individual probability levels of these Spearman partial correlations, was 0.616. On the other hand, there are equivalent models that also produce non-significant probability estimates (Shipley 1997) and if any of these equally well-fitting other models do not contradict what is known of the biology of these trees, then they might constitute reasonable doubt²⁴.

Figure 3.7. Proposed causal relationships between five variables related to seed dispersal in St Lucie's Cherry.

Table 3.3. Shown are the d-separation statements in the basis set of the causal graph shown in Figure 3.7, along with the Pearson and Spearman partial correlations that are implied by the d-separation statements. The probabilities, assuming that the population partial correlations are zero, are listed as well.

                            Pearson partial correlations              Spearman partial correlations
d-separation statement      Estimate   Probability assuming           Estimate   Probability assuming
                                       independence                              independence
X4 _||_ X1 | X2             0.066      0.617                          0.063      0.635
X4 _||_ X3 | X2, X1         0.142      0.289                          0.144      0.279
X4 _||_ X5 | X2, X3         0.004      0.976                          0.075      0.574
X2 _||_ X3 | X1             0.021      0.873                          0.059      0.655
X2 _||_ X5 | X3, X1         0.155      0.244                          0.160      0.229
X1 _||_ X5 | X3             0.076      0.565                          0.102      0.443

²⁴ The model was actually developed using the exploratory methods of Chapter 8. This, too, should give us reason to question the model until independent data can be tested against it. At this point, all we can reasonably say is that the data are consistent with the model and so deserve further study.
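Fisher's C statistic combines the k individual probabilities of the basis set as C = −2 Σ ln(pᵢ), which follows a chi-squared distribution with 2k degrees of freedom when all k independencies hold. A short check in Python (illustrative code) reproduces the value quoted above from the Pearson column of Table 3.3:

import numpy as np
from scipy import stats

def fisher_c(p_values):
    """Fisher's C statistic and its overall probability for k independent tests."""
    p = np.asarray(p_values, dtype=float)
    c = -2.0 * np.sum(np.log(p))
    return c, stats.chi2.sf(c, df=2 * len(p))

p_table_3_3 = [0.617, 0.289, 0.976, 0.873, 0.244, 0.565]   # Pearson column of Table 3.3
print(fisher_c(p_table_3_3))   # approximately C = 7.73 with 12 df, p = 0.806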
The original data of Jordano (1995) was fit to a latent variable model using maximum likelihood methods²⁵. Neither the model chi-squared statistic nor the model degrees of freedom were given. It is therefore not possible to judge the fit of that original model²⁶, but it is possible to extract those
d-separation statements involving only the measured variables available to myself from the original latent-variable model. Jordano's published model implies four d-separation statements in the basis set that can be tested: {(canopy projection area _||_ average fruit diameter), (canopy projection area _||_ average seed weight), (number of fruits produced _||_ average fruit diameter), (number of fruits produced _||_ average seed weight)}. The d-sep test, based on Pearson correlations, gives a probability of 0.005 (χ² = 21.85, 8 df). Using Spearman correlations the probability is 0.019 (χ² = 18.24). These low probabilities, based on a subset of the original measured variables, provide reasonable doubt concerning Jordano's (1995) model.

²⁵ These methods are described in Chapters 4 to 7.
²⁶ One measured variable (total number of seeds dispersed) was not provided to me, so I can't fit his original model.
Figure 3.8. Scatterplot matrix of the empirical observations (all variables transformed to their natural logarithms).

3.11 Specific leaf area and leaf gas exchange
The leaves of most flowering plants are photosynthetic organs. Since carbon fixation is so central to the survival of plants, one might expect that there is a tight integration of leaf form and physiology to provide for this necessary function. However, land plants face a dilemma. They need to keep their tissues turgid but these humid tissues find themselves surrounded by air or soil that is not saturated with water. The leaves (and other tissues) are protected by a cuticle to prevent dehydration. Unfortunately, this severely
restricts not only the diffusion of water vapour, but also other gases, especially CO₂ that is required for photosynthesis, from diffusing into the leaves. The production of stomates is the evolutionary solution to this problem. Stomates are small openings on the surface of the leaves through which gases can diffuse and the size of the stomatal openings is controlled by guard cells.
As soon as the stomates begin to open, CO₂ begins to diffuse from the outside air into the intercellular spaces of the leaf through a process of passive diffusion. Since the leaf is photosynthesising, CO₂ is being removed from the intercellular spaces, creating a diffusion gradient. However, the air inside the leaf is always saturated with water vapour. As soon as the stomates begin to open, this water vapour also begins to diffuse out of the leaf, since the outside air is not saturated with water. In essence, the leaf has to accept a loss (water) in order to effect a gain (carbon). Cowan and Farquhar (1977) proposed a theoretical model of stomatal regulation to predict how the leaf should control its stomates in order to maximise carbon gain relative to water loss. The basic insight of this model was that the leaf should restrict carbon fixation below the maximum level because when the internal CO₂ level in the leaf reaches a certain level the main carboxylating enzyme (ribulose bisphosphate carboxylase (Rubisco)) becomes saturated, and further increases in carbon fixation require the regeneration of ATP from the light reaction of photosynthesis. The second stage results in a greatly reduced rate of increase of carbon fixation per increase in the internal CO₂ concentration, but the rate of water loss continues at its former rate. Thus Cowan and Farquhar's principal insight was that the leaf should maintain the intercellular CO₂ concentration at the break-point between Rubisco limitation and ribulose bisphosphate regeneration limitation so that the carboxylating capacity and the capacity to regenerate Rubisco are co-limiting.
On the basis of these theoretical notions, Martin Lechowicz and I (Shipley and Lechowicz 2000) proposed a path model based on five variables: specific leaf mass (SLM: leaf dry mass divided by leaf area, g/m²), leaf organic nitrogen concentration (mmol/m²), stomatal conductance to water (mmol/m² per s), net photosynthetic rate (µmol/m² per s) and internal CO₂ concentration (µl/l). The proposed model is shown in Figure 3.9. Our data were the mean values from 40 herbaceous species typical of wetland environments.
There are five outliers in the data in relation to the internal CO₂ concentration. These are C₄ species. The other 35 species are C₃ species. C₄ species have an additional metabolic pathway in which atmospheric carbon is first fixed by phosphoenol pyruvate carboxylase in the mesophyll cells to form malate or aspartate. This molecule, a 4-carbon acid, is then transferred into bundle-sheath cells deeper in the leaf. Here these C₄ acids are decarboxylated and the freed CO₂ enters the normal Calvin cycle of the dark reaction of photosynthesis. An advantage of C₄ photosynthesis is that plants exhibiting it are able to absorb CO₂ strongly from a lower concentration of CO₂ within the leaf. They can do this without Rubisco acting as an oxygenase, rather than a carboxylase, under conditions of low CO₂ and high O₂. This means that C₄ plants do not exhibit the wasteful process of photorespiration under conditions of high illumination and low availability of water. Because of this, they are able to maintain high rates of photosynthesis even when the stomates are nearly closed. The basis set implied by the model in Figure 3.9, along with the relevant statistics, is summarised in Table 3.4.
There is no strong evidence for any deviation of the data from the
predicted correlational shadow, as given by the d-separation statements.
However, a reasonable alternative model would be that the leaf nitrogen
content, which is due primarily to enzymes related to photosynthesis,
directly causes the net photosynthetic rate. In other words, what if Cowan and Farquhar's (1977) model of stomatal regulation is wrong, and the leaf is regulating its stomates to maximise the net rate of CO₂ fixation independently of water loss? In this case, the observed rate of stomatal conductance
would be a consequence of the net photosynthetic rate rather than its cause
and the net photosynthetic rate would be directly caused by leaf nitrogen
content. We can test this alternative model too and Table 3.5 summarises
the results.
This alternative model is clearly rejected when both the C₃ and C₄ species are analysed together, since there are only about 2 out of 10000 chances of observing such a large difference at random. This lack of fit is coming from the predicted independence between leaf nitrogen level (2) and stomatal conductance (3), conditioned jointly on specific leaf mass (1) and net photosynthetic rate (4). This, of course, is the critical distinction between the path model in Figure 3.9 and the alternative model.

Figure 3.9. Proposed causal relationships between five variables related to interspecific leaf morphology and gas exchange. SLM, specific leaf mass.

Table 3.4. The d-separation statements in the basis implied by the model in Figure 3.9, along with the Pearson and Spearman partial correlations and their 2-tailed probabilities. Results are shown for the full data set of 40 species and for the 35 species of C₃ species only. Numbers refer to the variables shown in Figure 3.9.

                  Both C3 and C4 species                 Only C3 species
                  Pearson           Spearman             Pearson           Spearman
d-sep             r      p(r)       r      p(r)          r      p(r)       r      p(r)
1 _||_ 3 | 2      0.286  0.0777     0.234  0.1523        0.298  0.0871     0.226  0.1986
1 _||_ 4 | 3      0.165  0.3163     0.217  0.1841        0.109  0.5392     0.188  0.2860
1 _||_ 5 | 3,4    0.035  0.8328     0.099  0.5560        0.043  0.8139     0.215  0.2303
2 _||_ 4 | 1,3    0.092  0.5837     0.069  0.6809        0.160  0.3743     0.156  0.3870
2 _||_ 5 | 1,3,4  0.262  0.1169     0.058  0.7327        0.079  0.6678     0.006  0.9758
Fisher's C:       13.15, 10 df,     9.713, 10 df,        9.301, 10 df,     10.62, 10 df,
                  p = 0.216         p = 0.466            p = 0.503         p = 0.388

Table 3.5. The d-separation statements implied by an alternative model in which the model in Figure 3.9 is changed to make leaf nitrogen cause net photosynthetic rate, which then causes stomatal conductance, along with the Pearson and Spearman partial correlations and their 2-tailed probabilities. Results are shown for the full data set of 40 species and for the 35 species of C₃ species only.

                  Both C3 and C4 species                 Only C3 species
                  Pearson           Spearman             Pearson           Spearman
d-sep             r      p(r)       r      p(r)          r      p(r)       r      p(r)
1 _||_ 3 | 2      0.286  0.0777     0.234  0.1523        0.298  0.0871     0.226  0.1986
1 _||_ 3 | 4      0.286  0.0777     0.279  0.0853        0.221  0.2092     0.172  0.3298
1 _||_ 5 | 3,4    0.035  0.8328     0.099  0.5560        0.043  0.8139     0.215  0.2303
2 _||_ 3 | 1,4    0.599  7×10⁻⁵     0.569  2×10⁻⁴        0.371  0.0338     0.339  0.0541
2 _||_ 5 | 1,3,4  0.262  0.1169     0.058  0.7327        0.079  0.6678     0.006  0.9758
Fisher's C:       33.96, 10 df,     27.59, 10 df,        16.00, 10 df,     14.27, 10 df,
                  p = 0.0002        p = 0.0021           p = 0.1000        p = 0.161

When
looking only at the C₃ species, the alternative model does not have a large degree of lack of fit although the critical prediction still shows a reasonably large lack of fit (r_23|14 = 0.371, p = 0.0338) and is always poorer than that provided by the structure shown in Figure 3.9.

Because of such results, and other reasons described in the original reference, I prefer the causal structure shown in Figure 3.9. Such a conclusion must remain tentative. After all, the conclusion is based on only 40 species and a larger sample size might detect some more subtle lack of fit that was too small to be found in the present data set.
Given the model in Figure 3.9, and given that we have not been
able to reject it, we can now fit the path equations. Although Wright's original method was based on standardised variables, I prefer to use the original variables because the variables each have well-established units of measurement. The least squares regression equations, using only the C₃ species, are shown below. The residual variation is indicated by N(0, σ).

ln(% nitrogen) = 0.78 + 0.90·ln(SLM) + N(0, 0.243),  R = 0.85
ln(conductance) = 6.60 + 1.15·ln(% nitrogen) + N(0, 0.56),  R = 0.69
ln(photo) = 3.08 + 0.55·ln(conductance) + N(0, 0.31),  R = 0.81
ln(CO₂ internal) = 6.42 + 0.14·ln(conductance) − 0.1·ln(photo) + N(0, 0.04),  R = 0.77

Each of the slopes is significant at a level below 10⁻⁴ and the sign
of each is in the predicted direction. With these path equations we can begin
to simulate how the entire suite of leaf traits would change if we change the
specic leaf mass (the exogenous variable in this model) or if we observe
species with dierent SLMs. We get the functional relationships by back-
converting the variables in the equations from their natural logarithms. Of
course, each of these variables may also change with changing environmen-
tal conditions. By including these environmental variables we could gener-
ate the response surfaces across which the suite of leaf traits would move as
the environment changes.
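To make this concrete, the sketch below chains the equations above and back-converts from the natural logarithms to predict the suite of trait values expected at a given SLM. It is illustrative Python only; the coefficients, and in particular the signs as reconstructed above, should be checked against the original source before being used for anything serious:

import numpy as np

def predicted_traits(slm):
    """Predicted mean trait values for a C3 species with specific leaf mass `slm` (g/m2),
    obtained by chaining the path equations and back-converting from natural logarithms.
    The residual (error) terms are omitted, so these are expected values only."""
    ln_n  = 0.78 + 0.90 * np.log(slm)
    ln_g  = 6.60 + 1.15 * ln_n
    ln_a  = 3.08 + 0.55 * ln_g
    ln_ci = 6.42 + 0.14 * ln_g - 0.10 * ln_a
    return {"nitrogen": np.exp(ln_n), "conductance": np.exp(ln_g),
            "photosynthesis": np.exp(ln_a), "internal_CO2": np.exp(ln_ci)}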
4 Path analysis and maximum likelihood
James Burke (1996), in his fascinating book, The pinball effect, demonstrates the curious and unexpected paths of influence leading to most scientific discoveries. People often speak of the marriage of ideas. If so then the most prolific intellectual offspring come, not from the arranged marriages preferred by research administrators, but from chance meetings and even illicit unions. The popular view of scientific discoveries as being linear causal chains from idea to solution is profoundly wrong; a better image would be a tangled web with many dead ends and broken strands. If most present knowledge depends on unlikely chains of events and personalities, then what paths of discovery have been deflected because the right people did
not come together at the right time? Which historical developments in
science have been changed because two people, each with half of the solu-
tion, were prevented from communicating due to linguistic or disciplinary
boundaries? The second stage in the development of modern structural
equation modelling is a case study in such historical contingencies and inter-
disciplinary incomprehension.
During the First World War, and in connection with the American
war effort, Sewall Wright was on a committee allocating pork production to various US states on the basis of the availability of corn¹. He was con-
fronted with a problem that had a familiar feel. Given a whole series of var-
iables related to corn availability and pork production, how do all these
variables interact to determine the relationship between supply and demand,
and the fluctuations between these two? It occurred to him that his new method of path analysis might help. He calculated the correlation coefficients between each pair of variables for five years, giving 510 separate
correlations. After much trial and error he developed a model involving
only four variables (corn price, summer hog price, winter hog price and hog
breeding) and only 14 paths that still gave a good match between observed
and predicted correlations. He described his results in a manuscript that was
submitted as a bulletin of the US Bureau of Animal Industry. It was
promptly rejected because officials at the Bureau of Agricultural Economics considered it to be an intrusion onto their turf. Happily for Wright, he had also shown it to the son of the secretary of agriculture (Henry A. Wallace) who was interested in animal breeding and quantitative modelling. Wallace, using his political influence, intervened to have the manuscript published as a US Department of Agriculture bulletin (Wright 1925).

¹ This next section is based on Wright's biography (Provine 1986).
Although economists later developed methods that were very
similar to path analysis, Wright's foray into economics does not seem to have been very influential. During the Second World War, Wright presented a
seminar on path analysis to the Cowles Commission, where economists
were developing methods that were the forerunner of SEM. Neither
Wright nor the economists recognised the link between the two approaches
or the usefulness of such a marriage (Epstein 1987). None the less, some
economists were independently trying to express causal processes in functional form² (Haavelmo 1943). In economics, constraints on the covariance
matrix (for example, zero partial correlations due to d-separation) were
called overidentifying constraints. Since most work in this area was in param-
eter estimation, not theory testing, such constraints were mostly avoided because they made consistent estimation difficult.
In the 1950s the political scientist Herbert Simon began to derive
the causal claims of a statistical model³. This led some social scientists to think about expressing causal processes as statistical models that implied certain structural, or overidentifying, constraints. One such person was H. M. Blalock, who began deriving overidentifying constraints, in the form of zero partial correlations, that were implied by the structure of the causal process (Blalock 1961, 1964). Wright's method of path analysis had been largely rediscovered by social scientists, with the important difference that the emphasis shifted from being an a posteriori description of an assumed causal process, as Wright viewed his method, to being a (tentative) test of an assumed causal process. The late 1960s and early 1970s saw many appli-
cations of path analysis in sociology, political science and related social
science disciplines.
The most important next step was the work of people such as
Jöreskog (1967, 1969, 1970, 1973) and Keesling (1972), who developed ways of combining confirmatory factor analysis (see Chapter 5) and path
analysis using maximum likelihood estimation techniques. The advance was
not simply in using a new method of estimating the path coecients. More
importantly, the use of maximum likelihood allowed the resulting series of
equations describing the hypothesised causal process (a series of structural equations) to be tested against data in order to see whether the overidentifying constraints (the zero partial correlation coefficients and other types of constraint) agreed with the observations. This advance solved the main weakness of Wright's original method of path analysis, since one did not simply have to assume the causal structure, as Wright did. Now, one could test the statistical consequences of the causal structure and therefore potentially falsify the hypothesised causal structure⁴. Unfortunately, by the 1970s most biologists had forgotten about Wright's method of path analysis and disciplinary boundaries prevented the new SEM approach from penetrating into biology.

² Some economists referred to Wright's work in passing (Goldberger 1972; Griliches 1974) but only for historical completeness.
³ Summarized by Simon (1977).
Wright's method was essentially the application of multiple regression based on standardised variables in the order specified by the path diagram (the causal graph). This, along with ANOVA and most other familiar statistical methods, consists of modelling the individual observations. In other words, the path coefficients were obtained using least square techniques by minimising the squared differences between the observed and predicted values of the individual observations, as is usual in multiple regression. Structural equations models, of which modern path analysis is a specialised version⁵, concentrate instead on the pattern of covariation between the variables and minimise the difference between the observed and predicted pattern of covariation among them. The basic steps are:
1. Specify the hypothesised causal structure of the relationships
between the variables.
2. Translate the causal model into an observational model. Write
down the set of linear equations that follow this structure and
specify which parameters (slopes, variances, covariances) are to be
estimated from the data (i.e. that are free) and which are fixed (i.e.
are not to be changed to accommodate the data) based on the causal
hypothesis.
3. Derive the predicted variance and the covariance between each pair
of variables in the model using covariance algebra. Covariance
algebra gives the rules of path analysis that Wright had already
derived.
⁴ The logical and axiomatic relationships between probability distributions and causal properties had not yet been developed. This led to much confusion concerning the causal interpretation of structural equations models (Pearl 1997). One reason why I discuss these points in detail is to prevent the same sterile debates from recurring among biologists.
⁵ There are a number of different names for this general class of models: structural equations models, LISREL models, covariance structure models.
4. Estimate these free parameters using maximum likelihood or related
methods, while respecting the values of the fixed parameters. This
estimation is done by minimising the dierence between the
observed covariances of the variables in the data and the covariances
of the variables that are predicted by the causal model.
5. Calculate the probability of having observed the measured
minimum dierence between the observed and predicted covari-
ances, assuming that the observed and predicted covariances are
identical except for random sampling variation.
6. If the calculated probability that the remaining differences between observed and predicted covariances are due only to sampling variation is sufficiently small (say below 0.05) then one concludes that the observed data were not generated by the causal process specified by the hypothesis and that the proposed model should be rejected. If, on the contrary, the probability is sufficiently large (say above
0.05) then one concludes that the data are consistent with such a
causal process.
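Steps 4 and 5 revolve around the discrepancy between the observed covariance matrix S and the model-implied covariance matrix Σ(θ). A hedged sketch of the standard maximum likelihood discrepancy function in Python (illustrative code; the estimation details are taken up in the following sections):

import numpy as np

def f_ml(S, Sigma):
    """Maximum likelihood discrepancy between an observed covariance matrix S and a
    model-implied covariance matrix Sigma, for p observed variables.
    (n - 1) * f_ml, at its minimum over the free parameters, is approximately
    chi-squared distributed when the model is correct."""
    S = np.asarray(S, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    p = S.shape[0]
    _, logdet_sigma = np.linalg.slogdet(Sigma)
    _, logdet_s = np.linalg.slogdet(S)
    return logdet_sigma + np.trace(S @ np.linalg.inv(Sigma)) - logdet_s - p

Minimising this function over the free parameters (step 4) and referring (n − 1) times its minimum value to the appropriate chi-squared distribution (step 5) gives the decision rule described in step 6.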
4.1 Testing path models using maximum likelihood
Step 1: Translate the hypothetical causal system into a path
diagram
This first step should be almost second nature by now, but there are a few notational conventions that must be introduced. Path diagrams contain three different types of variable. Variables that have been directly observed and measured are enclosed in squares; these variables are called manifest variables in SEM jargon. Variables that are hypothesised to have a causal role in the model, but which have not been directly observed or measured, are enclosed in circles; these variables are called latent variables in SEM jargon⁶. The third type of variable is the residual error variable⁷ and it is not enclosed at all. This type of variable represents all other unmodelled causes of the variable into which it points. It is also generally defined as a normally distributed random variable with a mean of zero and a variance of 1, although it
is possible to model it with a variance different from 1. A second common classification is between a variable that has no causal parents in the model, called exogenous, and a variable that is caused by some other variable in the model, called endogenous⁸. Finally, there are two types of arrow. A straight arrow indicates a causal relationship between two variables just as it does in the directed graphs of previous chapters. A curved, double-headed arrow indicates an unknown causal relationship linking the two variables. This means that there is a free covariance in the structural equations. These conventions are shown in Figure 4.1.

Figure 4.1. A path model involving five variables; S₄₅|₃ is a free covariance between ε₄ and ε₅.

⁶ By convention, a path model is simply a structural equations model that does not involve unmeasured, or latent, variables.
⁷ A common error is to assume that the residual error in a structural equations model is the same as the residual error in a regression model. The two are not necessarily the same. In a regression model the residuals are always uncorrelated (or orthogonal) with the predictors. The residuals in SEM need not be uncorrelated with the predictors or with each other.
⁸ Yes, I know; I didn't invent these terms. I will try to limit the jargon to a minimum but you will have to be aware of these terms in order to read the literature.
Step 2: Translate the causal model into an observational model in
the form of a set of structural equations
As the arrows in Figure 4.1 suggest, the hypothesised relationships are asym-
metrical causal ones. When we translate the causal model into mathemati-
cal equations we obtain an observational (statistical) model. Because we are
now dealing with a statistical model, we must make assumptions concern-
ing the form of the functional relationships and the sampling distribution of
the random variables. Contrary to the path diagram, this new model is not
strictly a causal model because it is expressed in the language of algebra using
the equivalence operator (=). It is an imperfect translation of the causal
model and we must not forget this and begin manipulating these algebraic
equations in ways contrary to the original asymmetrical causal relations
expressed in the path diagram. In almost all structural equations models, the
relationships are assumed to be additively linear. In most structural equations
models, the random variables are assumed to be multivariate normal.
Multivariate normal means that the variables follow a multivariate normal
distribution, which is a somewhat stronger assumption than assuming simply
that each variable is normally distributed. Dierent ways of assessing this
assumption, and the degree of non-normality that can be tolerated, are
described in Chapter 6.
If your causal model is sufficiently detailed that you are willing to hypothesise the numerical values of some parameters (path coefficients, variances or covariances) then you can include this information in the model by specifying the parameter to be fixed. If you are not able or willing to make such an assumption (except, of course, that the parameter is not zero) then the parameter is estimated from the data and is therefore free. Specifying that a variable is not a direct cause of another (i.e. that there is not an arrow from the one to the other in the path diagram) is the same as specifying that the path coefficient of this missing arrow is fixed at zero. Each parameter that is fixed adds one degree of freedom to the inferential test.

In such models the interest is in the relationships between the variables, not the mean values of the variables themselves⁹. For this reason, all variables are centred by subtracting the mean value of each variable from
each observation. For instance, if the mean of X₁ in Figure 4.1 were 6, then we would replace each value of X₁ by (X₁ − 6). This trick ensures that the mean of each transformed variable is zero and therefore that the intercepts are zero. Assuming that all of our variables are already centred, these are the structural equations corresponding to Figure 4.1, where Cov(X₁,X₂) means the population covariance between X₁ and X₂:
X₁ = N(0, σ₁)        ε₃ = N(0, σ₃)
X₂ = N(0, σ₂)        ε₄ = N(0, σ₄)
X₃ = a₁₃X₁ + a₂₃X₂ + b₃ε₃        ε₅ = N(0, σ₅)
X₄ = a₃₄X₃ + b₄ε₄
X₅ = a₃₅X₃ + b₅ε₅
Cov(X₁,X₂) = Cov(X₁,ε₃) = Cov(X₁,ε₄) = Cov(X₁,ε₅) = Cov(X₂,ε₃) = Cov(X₂,ε₄) = Cov(X₂,ε₅) = Cov(ε₃,ε₄) = Cov(ε₃,ε₅) = 0
Cov(ε₄,ε₅) = σ₄₅
⁹ Means can also be modelled but this requires a little bit more work.
Happily, most commercial SEM programs do most of this translation work for you; you have only to specify which parameters are free and which variables are direct causes of which other variables. In fact, all you have to do is draw the path model in the latest versions of most commercial SEM programs¹⁰. Notice that some parameters (σᵢ, aᵢⱼ) in the above equations do not have numerical values and therefore have to be estimated; before looking at the data they are free to take on any value.

¹⁰ The Appendix lists some common SEM programs.
Let's go through these equations more slowly to understand exactly how the causal model in Figure 4.1 has been represented in equation form. First, variables X₁ and X₂ are exogenous in the model; we don't know, or are not interested in explicitly modelling, the causal parents of these two variables. In the equations I have specified that X₁ and X₂ are each normally distributed random variables whose mean is zero and whose standard deviation is unknown. Therefore, these two standard deviations are free and must be estimated from the data¹¹. Next, X₃ is written as a linear function of both X₁ and X₂ in accordance with Figure 4.1. Since I don't know the numerical strength of the direct causal effects of these two variables, the path coefficients (a₁₃ and a₂₃) are also free and must be estimated from the data. If my causal hypothesis had been sufficiently well developed that I could specify what the values of these path coefficients were then I would have entered the predicted values rather than having to estimate them from the data. In addition, the combined direct effect of the other unknown causes of X₃ are not known either and so b₃, the path coefficient from the error variable (ε₃), is also free and must be estimated. Remember that all of the error variables (ε) are unit normal variables, i.e. with a zero mean and a standard deviation of 1. Multiplying a unit normal variable by a constant (b₃ in this case) makes its variance equal to the constant. Therefore, the part of the variance of X₃ that is not accounted for by X₁ and X₂ is b₃. In this particular equation the residual is exactly analogous to the residuals of a multiple regression, since it is made to be uncorrelated with either X₁ or X₂ but this is not always the case. Next, each of X₄ and X₅ are also written as linear functions of X₃ with the accompanying free path coefficients.

¹¹ Fixing the error variance at 1.0 and freely estimating the path coefficient associated with the error variable, or fixing the path coefficient to 1.0 and freely estimating the error variance, are two equivalent ways of doing the same thing.
Since there are five variables in the model, there are 10 different pairs of variables and therefore 10 different covariances between the unique pairs of variables. Since X₁ and X₂ are causally independent, the covariance between these two variables must be zero (remember d-separation). X₁, X₂, and the unknown other causes of X₃ (i.e. ε₃) are also independent of each other and of the unknown other causes of X₄ and X₅ and so each of these pairs of covariances must also be zero. Finally, the causal model in Figure
4.1 states that there is some causal influence linking X₄ and X₅ but the researcher does not know what it is. Perhaps X₄ causes X₅? Perhaps X₅ causes X₄? Perhaps there is a reciprocal causal relationship? Perhaps there is some unknown common cause of both X₄ and X₅? Adding a free covariance (the translation of a curved double-headed arrow) is an admission of ignorance as to the causal origin of the covariance. Each of the above causal relationships would generate a non-zero covariance between X₄ and X₅ even after controlling for X₃. Therefore, we allow ε₄ and ε₅ to have a non-zero covariance and the numerical value of this non-zero covariance must also be estimated from the data.
This completes the best translation of the causal model into the
observational (statistical) model as can be obtained consistent with the sta-
tistical assumptions that are needed to estimate the free parameters. It will
be important to evaluate these assumptions when judging whether the
results of the analysis can be trusted, as is true of any statistical method.
Step 3: Derive the predicted variance and the covariance
between each pair of variables in the model using
covariance algebra
Box 4.1. Basic rules of covariance algebra

The notation E(X) means the expected value of X. So, the population covariance between two variables, symbolised here as Cov(X1, X2), is defined as:

Cov(X1, X2) = E[(X1 − E(X1))(X2 − E(X2))] = E(X1X2) − E(X1)E(X2)

If the variables are centred about their expected values, this reduces to:

Cov(X1, X2) = E(X1X2)

Since a variance is simply the covariance of a variable with itself, we can write the population variance (Var) as:

Var(X1) = Cov(X1, X1)

If k is a constant and X1, X2, X3 are random variables, then we can also state the following useful rules:

(1) Cov(k, X1) = 0
(2) Cov(kX1, X2) = k·Cov(X1, X2)
(3) Cov(k1X1, k2X2) = k1k2·Cov(X1, X2)
(4) Cov(X1 + X2, X3) = Cov(X1, X3) + Cov(X2, X3)
The set of structural equations allows us to derive the predicted values for the covariances between each pair of variables. Since a Pearson correlation coefficient is simply a covariance that has been standardised, one can also derive the predicted values for the correlations between each pair of variables. This step uses the rules of path analysis that were derived originally by Wright. Box 4.1 summarises a few basic rules of covariance algebra that will be useful in discussing this section.

From Chapter 2 we know that two vertices in the path diagram that are d-separated correspond to two random variables that are independently distributed, meaning that the population covariance between them must be zero. If the two vertices are not d-separated, then the corresponding random variables are not independently distributed and so (given the assumption of linearity made by SEM) the covariance between them can't be zero. This justifies the list of zero covariances in the structural equations given above. For those vertices that are not unconditionally d-separated (and are therefore correlated in some way), we can use the rules of covariance algebra to obtain formulae giving their covariances. Take, for instance, variables X1 and X3 in Figure 4.1. We can write:

Cov(X1, X3) = Cov(X1, a13X1 + a23X2 + b3ε3) = a13·Cov(X1, X1) + a23·Cov(X1, X2) + b3·Cov(X1, ε3)

Looking at the path diagram and applying the d-separation operation, we see that X1 is independent of both X2 and ε3, and therefore the population covariances involving X1 and these two variables are zero. Therefore, the population covariance between X1 and X3 is simply a13·Cov(X1, X1), or a13·Var(X1). In this way we can obtain a formula for the expected covariance of each pair of variables in the model. These are shown in Table 4.1.
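The same result can be checked by brute force. The sketch below (mine, not from the original text) simulates the structural equation for X3 with arbitrary illustrative values of a13, a23 and b3 and compares the sample Cov(X1, X3) with a13·Var(X1).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000
a13, a23, b3 = 0.5, 0.5, 0.5              # illustrative values only
x1, x2, e3 = rng.standard_normal((3, n))  # mutually independent, unit variance
x3 = a13 * x1 + a23 * x2 + b3 * e3

print(np.cov(x1, x3)[0, 1])               # close to a13 * Var(X1)
print(a13 * np.var(x1, ddof=1))
```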
This must seem like a lot of work. Most commercial SEM programs do all of this work for you, and the important point at this stage is only that you have an intuitive understanding of why we can express the covariances between each pair of variables as a function of path coefficients, variances and covariances. For those who are used to working with matrix algebra, Box 4.2 gives a more formal derivation of the predicted covariance matrix based on the Bentler–Weeks model (for a concise description, see Bentler 1995).

If we go back to the analogy of correlations being the shadows that are cast by causal processes, then Table 4.1 is a description of the shape of the shadow that is cast by the hypothesised causal process shown in Figure 4.1. Imagine that we were describing the shadow cast by a solid square whose size was unknown to us (i.e. the lengths of whose sides are free parameters). We would describe the shadow as having four equal sides of unknown length (the first constraint) that meet in such a way that they make four corners with 90° angles (the second constraint). The general shape of the shadow is fixed (a square) but the numerical values (the lengths of the sides) are free parameters and can be estimated by measuring the real shadow.
Box 4.2. The Bentler–Weeks model

Definitions: let the endogenous (i.e. dependent) variables in the model be written in a column vector called η and let the exogenous (i.e. independent) variables, including the error variables, be written in a column vector called ξ. Let the coefficients of the effects of dependent causes on dependent effects be a matrix called B (rows are dependent effects and columns are dependent causes) and let the coefficients of the effects of independent causes on dependent effects be a matrix called Γ (rows are dependent effects and columns are independent causes). Then the system of structural equations can be written as: η = Bη + Γξ.

For instance, the path model in Figure 4.1, with η = (X3, X4, X5)′ and ξ = (X1, X2, ε3, ε4, ε5)′, would be written:

[X3]   [ 0    0   0 ][X3]   [a13  a23  b3  0   0 ][X1]
[X4] = [a34   0   0 ][X4] + [ 0    0   0   b4  0 ][X2]
[X5]   [a35   0   0 ][X5]   [ 0    0   0   0   b5][ε3]
                                                   [ε4]
                                                   [ε5]

In reduced form the equation is: η = (I − B)⁻¹Γξ, where I is the identity matrix. Writing Φ for the covariance matrix of the exogenous variables, the predicted covariances between exogenous variables are E[ξξ′] = Φ. Predicted covariances between endogenous and exogenous variables: E[ηξ′] = (I − B)⁻¹ΓΦ. Predicted covariances between endogenous variables: E[ηη′] = (I − B)⁻¹ΓΦΓ′[(I − B)⁻¹]′.
Step 4: Estimate the free parameters by minimising the difference between the observed and predicted variances and covariances

The hypothesised object was the solid square, and from this we have predicted the shape of the shadow that it would cast. Is our hypothesis correct? To decide, we look at the actual shadow, choose values for the lengths of the sides of our hypothesised square that make it as numerically close to the observed shadow as possible while respecting the constraints, and then measure the remaining lack of fit. This is the same basic logic used to fit and test a structural equation model. We first choose values for the free parameters in our predicted covariance matrix that make it as numerically close as possible to the observed covariance matrix, while respecting the constraints applied to the predicted covariance matrix. How this is done depends on the assumptions that have been made concerning the distributional form of the random variables; in SEM the usual assumption is that the random variables follow a multivariate normal distribution.

The general strategy for obtaining the best values for the free parameters is easy enough to grasp: choose values of the free parameters that make the numerical values of the predicted covariance matrix (i.e. the values in Table 4.1 after replacing the variables by their numerical values) as close as possible to the actual covariances measured in the data. This is usually done using maximum likelihood estimation (Fisher 1950). Eliason (1993) and Bollen (1989) described the mechanics of this technique, and Box 4.3 gives a brief introduction for those who are interested in this topic. In essence, the numerical algorithm used to maximise the likelihood is a bit like playing the game of 20 questions (Is it alive? Is it a mammal? Is it a carnivore? Does it live in Africa? . . .). Box 4.3 gives a more precise definition of the likelihood function, but this function can be intuitively understood to measure the discrepancy between the observed data and the sort of data that would have been observed had the free parameters been equal to our chosen values. We start with an initial guess of the values of the free parameters and calculate the likelihood of the data given the current parameter values. We then see whether we can modify our guess of the values of the free parameters in such a way as to improve the likelihood. We continue with this process until we find values such that any change to them will do worse than the present values.

Another analogy for maximising a likelihood function might be a person who is blindfolded and finds herself in a landscape with various hills and valleys. Her job is to walk down to the valley floor without peeking. She begins by taking an initial step in a direction based on her best guess. If she senses that she has moved down-slope then she takes a second step in the same direction. If not, she changes direction and tries again. She continues with this process until she finds herself at a position on the landscape in which every possible change in direction results in movement up-slope. She therefore knows that she is in a valley. Unfortunately, if the landscape is very complicated she may have found herself in a small depression rather than on the true valley floor. The only way to find out would be to start over at a different initial position and see whether she again ends up in the same place.
Box 4.3. Maximum likelihood estimation

The probability of occurrence of a random variable, Xi, is given by a probability function (for discrete variables) or a probability density function (for continuous variables). For instance, the probability density function of a univariate normal random variable is:

f(X; μ, σ) = (1/√(2πσ²)) · exp(−(X − μ)²/(2σ²))

The notation means that X is the random variable and μ and σ are population parameters that are fixed. Now, if we take a series of N independent observations of the random variable, then the joint probability density function for these N observations is: f(X1; μ, σ)·f(X2; μ, σ)·f(X3; μ, σ) · · · f(XN; μ, σ). The objective of maximising the likelihood of a parameter is to find a value of this parameter that maximises this joint probability density function. In other words, find a value for the parameter (for example μ) that maximises the likelihood of having observed the series of observations. This objective turns the probability density function on its head. Now the observed values (Xi) are fixed and we view the population parameters as variables. We are envisaging a whole series of different normal distributions and we want to choose the most likely one given our data. So, the likelihood function of the univariate normal distribution is:

L(μ, σ; X) = (1/√(2πσ²)) · exp(−(X − μ)²/(2σ²))

and the joint likelihood function of the entire set of data is:

L(μ, σ; X1)·L(μ, σ; X2)·L(μ, σ; X3) · · · L(μ, σ; XN).

The natural logarithm of a series of positive numbers is an increasing function of these numbers. Because it is difficult to maximise a product but easier to maximise a sum, we use the logarithm of the likelihood function. For instance, imagine that we have observed eight values (1.20, 0.08, 0.34, 0.57, 0.46, 0.48, 0.56, 1.01) from a normal distribution whose population variance (σ²) is 1, and we want to find the maximum likelihood value for the population mean (μ). Figure 4.2 shows a graph of the log-likelihood function over the range −4 to 4.

We see that this function is maximal at around 0.58. This is the sample mean of our eight values, showing that the standard formula for the sample mean, an unbiased estimator, is also a maximum likelihood estimator. Maximum likelihood estimates are not always unbiased in small samples (for instance, the maximum likelihood estimate of the variance is not) but they are consistent, meaning that such estimates converge on the true value as the sample size increases. In other words, maximum likelihood estimates are asymptotic estimates.

In general, the maximum (or minimum) of the likelihood function occurs when its first derivative is zero. To see whether one has found a maximum, one then checks to see whether the second derivative is negative.

Figure 4.2. The log-likelihood function for the population mean (μ), given eight values (1.20, 0.08, 0.34, 0.57, 0.46, 0.48, 0.56, 1.01) from a normal distribution whose population variance (σ²) is 1, over the range −4 to 4.
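Figure 4.2 can be reproduced with a few lines of code. This sketch (mine, not from the text) evaluates the normal log-likelihood of the eight values over a grid of candidate means and reports where it is maximised.

```python
import numpy as np
from scipy.stats import norm

x = np.array([1.20, 0.08, 0.34, 0.57, 0.46, 0.48, 0.56, 1.01])
mu_grid = np.linspace(-4, 4, 8001)

# log-likelihood of the data for each candidate mu, with sigma^2 fixed at 1
loglik = np.array([norm.logpdf(x, loc=mu, scale=1.0).sum() for mu in mu_grid])

best = mu_grid[np.argmax(loglik)]
print(best, x.mean())   # both equal the sample mean, about 0.59
```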
In SEM, the usual assumption is that the data are multivariate normal. In other words, we assume that the probability density function is multivariate normal. The likelihood function for this distribution is:

L(μ, Σ; X) = (2π)^(−n/2) |Σ|^(−1/2) · exp(−½ [X − μ]′ Σ⁻¹ [X − μ])

In SEM we usually centre the variables about their means, so the only parameters whose likelihood estimates we need to obtain are those in the population covariance matrix Σ. These parameters are the free parameters that we have already derived. With this much more complicated function we are not able to derive the maximum likelihood estimates directly, and so we have to use numerical methods that involve iteration.

For instance, in Table 4.1 we had free parameters for the variances of the two independent observed variables, the four path coefficients, the three coefficients for the error variables and the one free covariance between X4 and X5. Let's group all these free parameters together in a vector called θ. Now, if we take a first guess at the values of these free parameters then we can calculate the predicted covariance matrix based on these initial values; let's call the predicted covariance matrix that results from this guess Σ⁽¹⁾(θ), to emphasise that this matrix will change if we change the values in θ. We now calculate the value of the log-likelihood function. Next, we change our initial estimates of the free parameters and recalculate the predicted covariance matrix, Σ⁽²⁾(θ), in such a way as to increase the log-likelihood. We continue in this way until we can't increase the value of the log-likelihood any more.

Problems can occur if the log-likelihood function contains 'potholes', i.e. local maxima. If this happens then the iterative procedure can become trapped without finding the global maximum. The only way to determine whether one has found a global maximum is to try different starting values and see whether they all converge on the same values. Problems can also occur if the iterative procedure wanders into areas of parameter space that are illegitimate, for instance negative variances. Most computer programs will warn you when this happens, and this is usually a sign of a poorly fitting model.
Let S be the observed covariance matrix, involving p dependent (endogenous) and q independent (exogenous) variables. Let Σ̂ be the maximum likelihood estimate of the model covariance matrix. Since these maximum likelihood estimates depend on the values of the free parameters, which we group together in a vector θ, we will write the model covariance matrix as Σ(θ). The maximum likelihood fitting function, F_ML, that compares the difference between the observed and predicted covariance matrices is:

F_ML = ln|Σ(θ)| + trace(S Σ⁻¹(θ)) − ln|S| − (p + q)

This function has three important properties. First, the values of the free parameters, θ, that minimise it are also the values that make the predicted covariance matrix as similar as possible to the observed covariance matrix while respecting the constraints implied by the causal model. Second, the values of θ that minimise this function are the same values that maximise the multivariate normal likelihood function, and so such values of the free parameters that define the population covariance matrix Σ(θ) are called maximum likelihood estimates. Third, and most importantly, if the observed data (and therefore S) really were generated by the causal process that the structural equations are modelling, then the only remaining differences between Σ(θ) and S at the minimum of F_ML will be due to normally distributed random sampling variation. Given these assumptions, (N − 1)F_ML is asymptotically distributed as a chi-squared distribution.
I said that one probable reason why biologists did not accept Wright's method of path analysis was that his original method could derive the logical consequences of a causal model but could not test it. The method described above, developed by Jöreskog (1970), was the first to overcome this important shortcoming of path analysis.
Step 5: Calculate the probability of having observed the measured minimum difference, assuming that the observed and predicted covariances are identical except for random sampling variation

The central chi-squared distribution has only one parameter: the degrees of freedom. In testing a structural equations model we are comparing the fit between the observed and predicted elements of the covariance matrix. If we have v variables then there will be v² elements in the covariance matrix. Since this matrix is symmetrical about its diagonal, some of these elements are redundant. The number of unique elements is v(v + 1)/2. If we were to compare the observed and predicted values of these unique elements using (N − 1)F_ML, and all of the predicted values were obtained independently of the observed values, then this would define the degrees of freedom for the chi-squared test. However, we have had to use our data to estimate the free parameters that partly determine the predicted covariance matrix. Each free parameter that we have to estimate uses up one degree of freedom. The degrees of freedom available to test the model are:

v(v + 1)/2 − (p + q)

As before, q is the number of free variances of exogenous variables (including the error variables) in the model and p is the number of free path coefficients in the model. I say 'free' because it may sometimes be possible to specify the value of a variance or path coefficient based on theory or prior experience and therefore constrain the model to have the specified value no matter what the data say.

So we specify, as the null hypothesis, that there is no difference between the observed and predicted covariance matrices except what would be expected given the random sampling variation of N independent observations all taken from the same multivariate normal distribution. Given this hypothesis, the following statistic (the maximum likelihood chi-squared statistic) will asymptotically follow a central chi-squared distribution with the degrees of freedom given above:

(N − 1)F_ML ~ χ² with v(v + 1)/2 − (p + q) degrees of freedom

In practice one uses a computer program to do all of these calculations.
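For the model of Figure 4.1, a sketch of the arithmetic (mine; it uses scipy for the chi-squared tail probability, and the numerical values of N and F_ML are those reported later in this chapter):

```python
from scipy.stats import chi2

v = 5                                    # observed variables
unique_elements = v * (v + 1) // 2       # 15
free_parameters = 10                     # correct model of Figure 4.1
df = unique_elements - free_parameters   # 5

N, F_ML = 100, 0.04890
X2 = (N - 1) * F_ML                      # about 4.84
print(df, X2, chi2.sf(X2, df))           # p is about 0.44
```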
Step 6: If the calculated probability is sufficiently small (say, below 0.05) then one concludes that the model was wrong. If the probability is sufficiently large (say, above 0.05) then one concludes that the data are consistent with such a causal process

At first blush this step appears much easier to understand than the previous ones. In fact, it is the step that causes the greatest confusion. The previous steps are more mathematically involved but they are largely automated, and so the user does not need more than an intuitive grasp of what is happening. This last step requires that the user interpret the meaning of the resulting probability for the biological model. This interpretation can often lead to confusion.

In most of the statistical tests used by biologists, the biologically interesting hypothesis is the alternative hypothesis; the null hypothesis functions as a strawman that is erected only to see whether we have sufficiently strong evidence to knock it down. This is useful because it forces us to have strong evidence (evidence beyond reasonable doubt) before we can accept the biologically interesting alternative hypothesis. In SEM, on the other hand, models are constructed based on biological arguments in such a way as to reflect what we hypothesise to be correct. In other words, our model and the resulting predicted covariance matrix embody what we view to be biologically interesting. The null hypothesis, not the alternative, is therefore the biologically interesting hypothesis. A probability below the chosen significance level means that the predicted model is wrong and should be rejected (i.e. the null hypothesis should be rejected). Although this flipping of the null and alternative hypotheses might seem strange, it is exactly the same logic as testing the null hypothesis that the slope of a simple linear regression equals, say, 0.75. Notice that we are reversing the burden of proof: we are requiring strong evidence, evidence beyond reasonable doubt, before we are willing to reject our preferred hypothesis. This leads naturally to the temptation to conclude that the predicted model is correct simply because we have not obtained strong evidence to the contrary! In fact, all that we can conclude is that we have no good evidence to reject our model and that the data are consistent with it. The degree to which we have good evidence in favour of our model will depend on how well we can exclude other models that are also consistent with the data. This leads naturally to the subject of equivalent models (Chapter 8).
At this stage, some numerical examples will help. I will generate 100 independent observations following the causal graph shown in Figure 4.1. Here are the generating equations; these are the same as those shown previously except that the free parameters have been replaced by actual values:

X1 = N(0,1)
X2 = N(0,1)
X3 = 0.5X1 + 0.5X2 + 0.5ε3
X4 = 0.5X3 + 0.707ε4
X5 = 0.5X3 + 0.707ε5
Cov(X1,X2) = Cov(X1,ε3) = Cov(X1,ε4) = Cov(X1,ε5) = Cov(X2,ε3) = Cov(X2,ε4) = Cov(X2,ε5) = Cov(ε3,ε4) = Cov(ε3,ε5) = 0
Cov(ε4,ε5) = 0.5
First, we look at the observed covariance matrix obtained from these 100 observations. This matrix is the observational shadow that was cast by the causal process shown in Figure 4.1 and quantified by the above equations. Table 4.2 shows this covariance matrix.

Table 4.2. The observed unique variances and covariances between variables X1 to X5 from 100 simulated observations based on the causal process shown in Figure 4.1

        X1      X2      X3      X4      X5
X1    0.931
X2    0.171   1.094
X3    0.630   0.762   1.350
X4    0.384   0.368   0.743   1.265
X5    0.324   0.385   0.611   0.624   0.949

The first step is to specify the hypothesised causal model. Imagine that we actually had two different competing models and wished to test between them. The first model is the model shown in Figure 4.1; this is the correct model that generated these data, although the model contains free parameters that have not been specified by our theory. The second causal model that we want to test is shown in Figure 4.3. The next step is to translate our two hypothesised causal graphs into structural equations. The translation of our first model has already been given. The translation of this second (incorrect) model is the following:
X1 = N(0, σ1)
X2 = N(0, σ2)
X3 = a13X1 + a23X2 + b3ε3
X4 = a34X3 + b4ε4
X5 = a25X2 + b5ε5
Cov(X1,X2) = Cov(X1,ε3) = Cov(X1,ε4) = Cov(X1,ε5) = Cov(X2,ε3) = Cov(X2,ε4) = Cov(X2,ε5) = Cov(ε3,ε4) = Cov(ε3,ε5) = 0
Cov(ε4,ε5) = 0
Note the differences between these structural equations and the ones derived from the correct model. First, the population covariance between the residual errors of X4 and X5 (i.e. Cov(ε4,ε5)) is zero in this incorrect model. Second, X5 is hypothesised to be directly caused by X2 rather than being indirectly caused by both X1 and X2 through their effects on X3.
We next have to obtain the maximum likelihood estimates of the free parameters of each model. To do this we have to provide starting values for the iterative process. In this book I use the EQS program for structural equation models, although there are many other commercial programs on the market. In the path models discussed in this chapter the choice of starting values for the free parameters is usually not critical, and so I will use the default values of 1.0 for all of them. Remember that the fitting of these free parameters is an iterative process in which the estimates at each iteration are changed in such a way as to reduce the maximum likelihood fitting function, as described in Box 4.3. At the very first iteration, when all free parameters are equal to 1.0, both the correct model and the incorrect model produce a predicted covariance matrix that poorly fits the observed values; the maximum likelihood fitting function is 0.45707 for the correct model and 0.74821 for the incorrect model. The correct model took five iterations to converge on the maximum likelihood estimates, giving a final value of 0.04890 for the maximum likelihood fitting function. Since this value, multiplied by 99 (i.e. N − 1), is the maximum likelihood chi-squared statistic, the final value of the chi-squared statistic was 4.8411. The incorrect model took four iterations to converge on the maximum likelihood estimates, giving a final value of 0.39369 for the maximum likelihood fitting function. Therefore, the final value of the chi-squared statistic for this incorrect model was 38.98.

To see whether these chi-squared statistics are significantly different from what one would expect given a correct model, we next need to determine the degrees of freedom. In both models we had five measured variables, giving a total of 15 unique variances and covariances, i.e. v(v + 1)/2, or 5(6)/2. In the correct model we had to estimate the variances of X1 and X2 and the three error variances, as well as four path coefficients and one free covariance (between X4 and X5); this makes 10 free parameters to estimate. The correct model therefore had 15 − 10 = 5 degrees of freedom. In the incorrect model we had to estimate the variances of X1 and X2 and the three error variances, as well as four path coefficients but no free covariance; this makes nine free parameters to estimate. The incorrect model therefore had 15 − 9 = 6 degrees of freedom. SEM programs do all of these calculations for you.

Figure 4.3. An alternative path model involving five variables.
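The whole fitting step can be sketched with a general-purpose optimiser. The code below is mine, not the author's EQS run: it parameterises the correct model of Figure 4.1 with error variances and an error covariance rather than error-path coefficients (an equivalent parameterisation), builds the implied covariance matrix, and minimises F_ML against the observed matrix of Table 4.2. The estimates and the chi-squared statistic it produces should agree with those reported below up to the tolerance of the optimiser.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

# Observed covariance matrix from Table 4.2 (order: X1, X2, X3, X4, X5)
S = np.array([[0.931, 0.171, 0.630, 0.384, 0.324],
              [0.171, 1.094, 0.762, 0.368, 0.385],
              [0.630, 0.762, 1.350, 0.743, 0.611],
              [0.384, 0.368, 0.743, 1.265, 0.624],
              [0.324, 0.385, 0.611, 0.624, 0.949]])
N = 100

def implied_cov(theta):
    """Model-implied covariance matrix for the path model of Figure 4.1.
    theta = (Var(X1), Var(X2), a13, a23, a34, a35,
             error variances of X3, X4, X5, error covariance of X4 and X5)."""
    v1, v2, a13, a23, a34, a35, e3, e4, e5, c45 = theta
    C = np.zeros((5, 5))
    C[0, 0], C[1, 1] = v1, v2                      # X1, X2 exogenous, uncorrelated
    C[0, 2] = C[2, 0] = a13 * v1                   # Cov(X1, X3)
    C[1, 2] = C[2, 1] = a23 * v2                   # Cov(X2, X3)
    C[2, 2] = a13**2 * v1 + a23**2 * v2 + e3       # Var(X3)
    for j, a in ((3, a34), (4, a35)):              # X4 and X5 are children of X3
        C[0, j] = C[j, 0] = a * C[0, 2]
        C[1, j] = C[j, 1] = a * C[1, 2]
        C[2, j] = C[j, 2] = a * C[2, 2]
    C[3, 3] = a34**2 * C[2, 2] + e4                # Var(X4)
    C[4, 4] = a35**2 * C[2, 2] + e5                # Var(X5)
    C[3, 4] = C[4, 3] = a34 * a35 * C[2, 2] + c45  # Cov(X4, X5)
    return C

def f_ml(theta):
    Sigma = implied_cov(theta)
    sign, logdet = np.linalg.slogdet(Sigma)
    if sign <= 0:        # keep the search away from illegitimate parameter values
        return np.inf
    return logdet + np.trace(S @ np.linalg.inv(Sigma)) - np.linalg.slogdet(S)[1] - 5

# Starting values chosen so that the initial implied matrix is positive definite
start = np.array([1, 1, 0.5, 0.5, 0.5, 0.5, 1, 1, 1, 0.0])
fit = minimize(f_ml, start, method="Nelder-Mead",
               options={"maxiter": 20000, "xatol": 1e-8, "fatol": 1e-10})

X2_stat = (N - 1) * fit.fun
df = 15 - 10
print(np.round(fit.x, 3))
print(X2_stat, chi2.sf(X2_stat, df))   # roughly 4.84 with p of about 0.44
```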
Remember what we are testing. If the hypothesised model is correct, then the maximum likelihood chi-squared statistic will follow a chi-squared distribution. If the predicted and observed covariance matrices were identical then the maximum likelihood chi-squared statistic would be zero. The further the predicted covariance matrix deviates from the observed covariance matrix, the larger the maximum likelihood chi-squared statistic will be. Of course, even if our causal model were correct we would not expect the two matrices to be identical, because of sampling variation; the predicted covariance matrix contains the predicted population values but the observed matrix comes from a random sample of 100 observations. However, if the only differences were due to random sampling fluctuations then the maximum likelihood chi-squared statistic would closely follow (for large samples) a theoretical chi-squared distribution with the appropriate degrees of freedom. To evaluate our two models, we have only to hypothesise that each is the true model and then calculate the probability, based on this null hypothesis, of observing at least as large a difference between the observed and predicted covariance matrices as measured by our statistic.

First, let's look at the results for the correct model. The probability of observing a chi-squared value of at least 4.8411 with 5 degrees of freedom is 0.44. In other words, there is a probability of 0.44 of seeing such a result even if our null hypothesis were correct. In fact, our null hypothesis is correct, since we generated our data to agree with it. The result is telling us what we know to be true: the data are perfectly consistent with the model, given normally distributed sampling variation. On the other hand, the maximum likelihood chi-squared statistic for the incorrect model was 38.98 with 6 degrees of freedom. The probability of observing such a large difference between the observed and predicted covariance matrices, assuming that the data were actually generated according to the incorrect model, is 7.2 × 10⁻⁷. We either have to accept that an extremely rare event has occurred (one chance in about 1.5 million) or reject the hypothesis that our data were generated according to the incorrect model. Again, the result is telling us what we know to be true: the data are not consistent with the model.
Compare the two predicted covariance matrices with the observed covariance matrix to see where the differences lie (Table 4.3). The biggest differences involve X5. First, the predicted covariance between X1 and X5 is zero in the incorrect model while the observed value is 0.324. This is because X1 is d-separated from X5 in the incorrect model. The fitting procedure had to respect this constraint when fitting the incorrect model and so constrained this predicted covariance to be zero. The correct model allows X1 to be an indirect cause of X5 through its effect on X3. The fitting procedure had to respect the constraint that the partial covariance between X1 and X5 be zero when controlling for X3 but, since this constraint actually existed in the generating process, such a constraint did not distort the estimates. In the same way, the incorrect model required that the partial covariance between X3 and X5, as well as the partial covariance between X4 and X5, be zero when controlling for X2. Since neither of these constraints actually existed in the correct causal process, the fitting procedure was forced to distort the estimates in order to meet these incorrect constraints.

Let's look next at the maximum likelihood estimates for the free parameters in the two different models. Here again are the true population values used to generate the data:

X1 = N(0,1)
X2 = N(0,1)
X3 = 0.5X1 + 0.5X2 + 0.5ε3
X4 = 0.5X3 + 0.707ε4
X5 = 0.5X3 + 0.707ε5
Cov(X1,X2) = Cov(X1,ε3) = Cov(X1,ε4) = Cov(X1,ε5) = Cov(X2,ε3) = Cov(X2,ε4) = Cov(X2,ε5) = Cov(ε3,ε4) = Cov(ε3,ε5) = 0
Cov(ε4,ε5) = 0.5
Here are the maximum likelihood estimates with their asymptotic standard errors in parentheses, based on the true model:

X1 = N(0, 0.931)
          (0.132)
X2 = N(0, 1.094)
          (0.156)
X3 = 0.565 X1 + 0.608 X2 + 0.531 ε3
    (0.076)     (0.070)    (0.075)
X4 = 0.550 X3 + 0.856 ε4
    (0.084)     (0.122)
X5 = 0.452 X3 + 0.673 ε5
    (0.074)     (0.096)
Cov(ε4, ε5) = 0.287
             (0.082)

Notice that each estimate is close to the population value. The standard errors are asymptotic, not exact, but with 100 observations these are quite close to the actual sample standard errors, and so two times each value defines an approximate 95% confidence interval. For instance, the path coefficient from X1 to X3 is 0.565 with a standard error of 0.076, so an approximate 95% confidence interval would be 0.565 ± 2(0.076), or between 0.413 and 0.717; the true population value was 0.5. We could obtain the maximum likelihood estimates for the incorrect model as well but, since we already know that the data are very unlikely to have been generated by this incorrect model, at least some of the estimates will be incorrect.

Table 4.3. The observed covariance matrix for the 100 independent observations, along with the predicted maximum likelihood covariance matrices based on the correct model (Figure 4.1) and the incorrect model (Figure 4.3)

        X1      X2      X3      X4      X5
Observed covariance matrix
X1    0.931
X2    0.171   1.094
X3    0.630   0.762   1.350
X4    0.384   0.368   0.743   1.265
X5    0.324   0.385   0.611   0.624   0.949
Predicted values using the correct model
X1    0.931
X2    0.000   1.094
X3    0.526   0.665   1.233
X4    0.289   0.366   0.678   1.229
X5    0.238   0.301   0.557   0.594   0.925
Predicted values using the incorrect model
X1    0.931
X2    0.000   1.094
X3    0.526   0.666   1.234
X4    0.290   0.366   0.679   1.230
X5    0.000   0.385   0.234   0.129   0.949
We can place the estimates of the free parameters of the correct model directly on the path diagram (Figure 4.4). I prefer this because the path diagram makes explicit that these estimates are based on a causal model with asymmetric relationships. The estimates shown in Figure 4.4 are not the ones Sewall Wright would have used. First, his estimates were not based on maximum likelihood methods but rather on least squares methods. Second, he used standardised variables, so that the decomposition of direct and indirect effects was based on correlations rather than covariances. If the causal model is correct then the maximum likelihood and least squares estimates will be the same[12], since least squares (partial) regression coefficients are also maximum likelihood estimates, but if the causal model is wrong then the two types of estimate will differ. The standardised estimates are easily obtained by first standardising the variables to zero mean and unit variance. In fact, most SEM programs print these standardised estimates out. Figure 4.5 shows the path diagram for the correct model based on standardised variables.

[12] This assumes, of course, that the data are multivariate normal.

Figure 4.4. The fully parameterised path model of Figure 4.1. The numerical values are the maximum likelihood values of the free parameters based on centred, but not standardised, variables.

Figure 4.5. The fully parameterised path model of Figure 4.1. The numerical values are the maximum likelihood values of the free parameters based on centred and standardised variables. Units are therefore standard deviations from the mean.
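The conversion from Figure 4.4 to Figure 4.5 is just a rescaling. A sketch of the arithmetic for one coefficient (mine, using the model-implied variances from Table 4.3; values of this kind are what Figure 4.5 displays):

```python
import numpy as np

# Unstandardised path coefficient and model-implied variances (correct model)
a13 = 0.565
var_x1, var_x3 = 0.931, 1.233          # predicted values from Table 4.3

# A standardised path coefficient is the unstandardised one multiplied by
# sd(cause)/sd(effect)
a13_std = a13 * np.sqrt(var_x1) / np.sqrt(var_x3)
print(round(a13_std, 3))               # about 0.49 standard deviations per standard deviation
```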
4.2 Decomposing effects in path diagrams

One important use of path diagrams is to decompose an association between variables into different types of causal relationship. In fact, this was the main goal of Wright's original method of path coefficients. Remembering the notions of causal graphs that were introduced in Chapter 2, we can differentiate between types of effect: direct causal effects, indirect causal effects, effects due to shared causal ancestors and unknown causal relationships[13]. One way of visualising this classification of associations is shown in Figure 4.6.

[13] Of course, there can always be associations due to random sampling fluctuations.

This decomposition of a statistical association into different types of causal relationship is based on the fundamental association linking causality with probability distributions, as described in Chapters 1 and 2. The overall association between two variables is simply the overall correlation or covariance between them. This overall association can be generated by a number of different causal relationships at the same time. Since the consequences of interventions or manipulations will depend critically on these different types of relationship, it is important to be able to distinguish and quantify them.

Figure 4.6. A classification of associations (for example, correlations or covariances).

Ancestor–descendant relationships

If you can trace a path on the causal graph from a causal ancestor to a descendant by following the direction of the arrows, then this path defines an effect of the ancestor on its descendant. These effects can be of two different types. A direct effect is an effect of the ancestor on its descendant that is not transmitted through any other variable in the model; of necessity this means that the relationship is one of parent and child. In other words, it is the effect that would occur if all other variables in the model did not change[14]. The magnitude of this direct effect is measured by the path coefficient on the arrow going from the parent to the child. The units of this effect are the same as those used to measure the variables. If the variables are not standardised then the path coefficient measures the number of unit changes in the child per unit change in the parent. For instance, if X is measured in grams, Y is measured in millimetres, there is an arrow from X to Y (X→Y) and the path coefficient for this arrow equals 0.6, then this means that a 1 g change in X will provoke a 0.6 mm change in Y once X is d-separated from all other causes in the model. It is also the quantitative effect of X on Y when all other variables are held constant. If the variables are standardised, then the units are standard deviations from the mean. As usual, these points are much easier to grasp when looking at a causal graph. Consider Figure 4.7.

[14] This may also be true even if other variables do change, as described below.

Figure 4.7. A path diagram used to illustrate the decomposition of associations.
In Figure 4.7 there are six different direct effects. There are always as many direct effects as there are one-headed arrows in the path diagram. If we were to fit this model to data then the path coefficient from X1 to X2 would measure the direct effect of X1 on X2. If the other variables were held constant then this direct effect would quantify by how much X2 would change given a one-unit change in X1. However, since X2 has no other causal ancestors, this direct effect would also quantify by how much X2 would change given a one-unit change in X1 even if the other variables were not held constant.
Indirect effects are the effects of a causal ancestor on its descendant that are completely transmitted through some other variable. This intervening variable is sometimes called a mediator of the causal effect. For instance, the effect of X1 on X3 along the path X1→X2→X3 is an indirect effect of X1 that is mediated by X2. To quantify this effect one multiplies the path coefficients along this path. This indirect effect measures by how much X3 would change following a change in X1 if all causal parents of X3 except for X2 were held constant. In general, an indirect effect measures by how much the effect variable would change following a change in the indirect cause when this effect is transmitted only along the path in question. It is possible for the same causal variable to exert both a direct and an indirect effect on the same descendant. An example of this is the effect that X2 has on X4 in Figure 4.7; X2 has a direct effect on X4, since X2 is the causal parent of X4, but X2 also exerts an indirect effect on X4 through its effect on X3.
Both direct and indirect effects involve variables in which one is a causal ancestor of the other. In these cases there is a directed path from one variable to the other. The third way in which an association can be decomposed in a path diagram occurs when the association between the two variables is due to another variable that is a causal ancestor of both. In Figure 4.7 the association between X5 and X6 is due to the effect of X4 (their common ancestor) on both. To quantify this effect one multiplies the path coefficients along the path[15] from X4 to X5 and along the path from X4 to X6. Such effects do not measure any causal effect of one variable on the other and represent what Pearson might have called a 'spurious' association.

[15] The two paths together (X5←X4→X6) are sometimes called a trek.

Finally, path diagrams can include unresolved causal relationships between variables; these are shown by double-headed arrows. Including such an effect in the model is an admission of ignorance; we do not know which is the cause, which is the effect, or whether the association is due to a common cause that is not included in the model. Such unresolved effects are quantified simply by the covariance between the two variables[16]. In tracing indirect effects along paths that include such double-headed arrows one can go in either direction but can traverse the double-headed arrow only once. Table 4.4 summarises the rules for decomposing the overall covariance or correlation between two variables in the path model, and Table 4.5 lists the decomposition of Figure 4.7.

[16] Or a correlation coefficient if the variables are standardised, since a correlation is simply a standardised covariance.

Table 4.4. Given two variables (X and Y) in a path model, the overall covariance or correlation (if using standardised variables) between them can be decomposed into three different causal sources. Shown are the rules for the estimation of each source

  Direct effect: the value of the path coefficient on the arrow from X to Y.
  Indirect effect along a single path: the product of the path coefficients on the sequence of arrows along the path leading from X, through at least one intermediate variable, and into Y.
  Overall indirect effect along all paths: the sum of the indirect effects along all paths from X to Y.
  Effect due to a common causal ancestor (Z) of both X and Y: multiply the path coefficients along a single path from Z to Y and the path coefficients along a single path from Z to X (the two paths together form a trek). If there is more than one such trek linking X and Y due to common causes, sum these together.
  Effect due to an unresolved causal relationship: the path coefficient on the double-headed arrow between X and Y.
  Effect due to all common causal ancestors of both X and Y: sum together the effects due to each common causal ancestor of both X and Y.
  Overall effect: sum together the direct effect, the total indirect effects, the total effects due to common causal ancestors and any remaining unresolved causal relationship between X and Y. This will equal the covariance or correlation (if using standardised variables) between X and Y.

Table 4.5. Decomposition of the total association between each pair of variables in Figure 4.7 into direct effects, indirect effects, effects due to common causal ancestors and pure unresolved causal effects

  X1,X2: Direct: X1→X2. Indirect: none. Common causal ancestor: none. Unresolved: none.
  X1,X3: Direct: none. Indirect: (1) X1→X2→X3; (2) X1→X2→X4→X5↔X3. Common causal ancestor: none. Unresolved: none.
  X1,X4: Direct: none. Indirect: (1) X1→X2→X3→X4; (2) X1→X2→X4. Common causal ancestor: none. Unresolved: none.
  X1,X5: Direct: none. Indirect: (1) X1→X2→X3→X4→X5; (2) X1→X2→X3↔X5; (3) X1→X2→X4→X5. Common causal ancestor: none. Unresolved: none.
  X1,X6: Direct: none. Indirect: (1) X1→X2→X3→X4→X6; (2) X1→X2→X4→X6. Common causal ancestor: none. Unresolved: none.
  X2,X3: Direct: X2→X3. Indirect: (1) X2→X4→X5↔X3. Common causal ancestor: none. Unresolved: none.
  X2,X4: Direct: X2→X4. Indirect: (1) X2→X3→X4. Common causal ancestor: none. Unresolved: none.
  X2,X5: Direct: none. Indirect: (1) X2→X3↔X5; (2) X2→X4→X5. Common causal ancestor: none. Unresolved: none.
  X2,X6: Direct: none. Indirect: (1) X2→X3→X4→X6; (2) X2→X4→X6. Common causal ancestor: none. Unresolved: none.
  X3,X4: Direct: X3→X4. Indirect: none. Common causal ancestor: (1) X3←X2→X4. Unresolved: none.
  X3,X5: Direct: none. Indirect: (1) X3→X4→X5. Common causal ancestor: none. Unresolved: X3↔X5.
  X3,X6: Direct: none. Indirect: (1) X3→X4→X6. Common causal ancestor: (1) X3←X2→X4→X6. Unresolved: none.
  X4,X5: Direct: X4→X5. Indirect: none. Common causal ancestor: (1) X4←X3↔X5. Unresolved: none.
  X4,X6: Direct: X4→X6. Indirect: none. Common causal ancestor: none. Unresolved: none.
  X5,X6: Direct: none. Indirect: none. Common causal ancestor: X5←X4→X6. Unresolved: none.
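To make the rules in Table 4.4 concrete, here is a sketch (mine, with purely hypothetical path coefficients for Figure 4.7) that computes the indirect effects of X1 on X5 listed in Table 4.5 by multiplying the coefficients along each path and summing.

```python
# Hypothetical standardised path coefficients for Figure 4.7 (illustration only)
p12, p23, p24, p34, p45, p46 = 0.5, 0.4, 0.3, 0.6, 0.5, 0.7
r35 = 0.2   # the unresolved (double-headed) association between X3 and X5

# Indirect effects of X1 on X5: one product of coefficients per path (Table 4.5)
path1 = p12 * p23 * p34 * p45   # X1 -> X2 -> X3 -> X4 -> X5
path2 = p12 * p23 * r35         # X1 -> X2 -> X3 <-> X5
path3 = p12 * p24 * p45         # X1 -> X2 -> X4 -> X5
total_indirect = path1 + path2 + path3
print(path1, path2, path3, total_indirect)
```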
4.3 Multiple regression expressed as a path model
Since path analysis looks rather similar to multiple regression, let's look at how to represent a multiple regression as a path model. A multiple regression equation uses a series of predictor variables (say, X1, X2 and X3) to predict, or account for, the observed variation in the dependent variable Y. The predictor variables are often called the 'independent' variables but this term can be misleading, since they do not have to be independent of one another at all. Except when these predictor variables are measured in controlled experiments, they are often not independent of one another. Figure 4.8 shows such a multiple regression in the form of a path model.

Figure 4.8. A multiple regression of X1, X2 and X3 on Y, expressed as a path diagram.

It is clear from the path diagram that the partial regression coefficients that are estimated with multiple regression are the direct effects of each predictor on the dependent variable. The indirect effects, which are simply the unresolved causal relationships between the predictors, are ignored. If the free covariances (the sij in Figure 4.8) really are zero, then the direct effects will also be the overall effects, but the regression equation will not tell you this[17]. Furthermore, the model in Figure 4.8 can't be tested as a causal claim. There are only four observed variables in this model and therefore there are 4(5)/2 = 10 unique elements in the covariance matrix. There are also 10 free parameters that have to be estimated (the three path coefficients, the three free covariances, the error variance and the variances of the three predictor variables). In other words, we have used up all of our degrees of freedom in estimating our free parameters and have none left over to test the causal implications of the model[18]. If you use the inferential test described in Chapter 3 you will find that no variable is d-separated from any other variable, either unconditionally or after conditioning on any set of other observed variables. The regression model places no statistical constraints with which to test the causal implications of the model. Multiple regression can certainly be used to decide whether the path coefficients (i.e. the partial regression coefficients) are different from zero. Multiple regression can help us to decide whether the error (or residual) variance is less than the total variance of Y (this is the F ratio). Multiple regression can't help us to decide whether the causal assumptions of the model are correct; it can't tell us whether the predictor variables are causes of Y. Multiple regression can allow us to predict, but not to explain. Statistics texts are quite correct when they say that one can't draw causal conclusions from regression. The causal conclusions must come from somewhere else. The best way would be to conduct a controlled randomised experiment in which the values of the X variables are randomly assigned to the experimental units, since we would then have good reason to assume that the free covariances between them really are zero. If this is not possible then we have to construct our models, and collect our observations, in such a way that we can constrain the patterns of covariation based on our causal hypothesis and then test these constraints.

[17] If these covariances are not zero then one can run into problems of collinearity.
[18] When we get to the topic of identification, we will see that multiple regression is an example of a just-identified model.
4.4 Maximum likelihood estimation of the gas-exchange model

In Chapter 3 we looked at the model of Shipley and Lechowicz involving specific leaf mass (SLM), leaf nitrogen concentration, stomatal conductance, net photosynthetic rate and the internal concentration of CO2. Let's fit and test these same data (ln-transformed) to the proposed path model using maximum likelihood methods. Remember that 5 of the 40 species were actually C4 species and that these were clear outliers in the data set. Because we require approximate multivariate normality, we can't include these 5 species in the data set. The analysis will be restricted to the remaining 35 species. Since the resulting chi-squared statistic and the standard errors of the free parameters are only asymptotically correct, we can expect that the estimated standard errors are somewhat narrower than they should be and that the probability value of the chi-squared statistic will not be exact[19]. This model is reproduced in Figure 4.9.

[19] Chapter 6 describes the effects of sample size on the maximum likelihood chi-squared statistic.

The first step is to specify the structural equations and to indicate which parameters are free. There are five free path coefficients (a1 to a5) and five free variances (the variance of specific leaf mass and the four error variances, ε2 to ε5). Since there are five measured variables there will be five degrees of freedom[20]. The free parameters are shown in Figure 4.9. Next, I have to specify initial values for these free parameters. In my experience one rarely has problems with convergence of the maximum likelihood estimates when there are no latent variables in the model, unless parts of the model are underidentified, and so I will make all free parameters equal to 1 except a5, which I set at −1. I do this because I expect increasing photosynthetic rates to reduce the internal CO2 concentration. One can sometimes have problems with convergence if some variances are much larger (i.e. orders of magnitude) than others, but I know that this is not the case with these variables.

[20] 5(6)/2 − 10 = 5.

Figure 4.9. The proposed path model relating leaf morphology and leaf gas exchange. The letters with subscripts show the free parameters whose maximum likelihood estimates must be obtained.

The value of the chi-squared statistic, based on the initial values of the free parameters, was 150.96, obviously a very poor fit. The numerical algorithm searched for changes in these initial values that would improve the fit while respecting the constraints and came up with a second set of values. The value of the chi-squared statistic, based on this second set of values of the free parameters, was 65.96, obviously still a very poor fit but at least much better. Again the estimates of the free parameters were adjusted and after the third try the chi-squared statistic was 19.72. This process was repeated a fourth, and then a fifth time, giving a chi-squared statistic of 4.72. The sixth attempt made such a small improvement (from 4.71954 to 4.71648) that the algorithm stopped; it had reached the valley floor. The final maximum likelihood chi-squared value was therefore 4.72 and, with 5 degrees of freedom, the asymptotic probability under the null hypothesis was 0.45. At this point we can obtain the estimates of the free parameters and their asymptotic standard errors. These estimates, divided by their asymptotic standard errors, can be used to test whether they are significantly different from zero, using a z-test. Remember, because we have only 35 observations, such a z-test will be somewhat liberal, since the real standard errors will be a bit larger than their asymptotic estimates. Here are the maximum likelihood estimates. The asymptotic standard errors are given in round brackets and the z-value (whose absolute value will be less than 1.96 95% of the time if the true parameter value is zero) is given in square brackets.

ln(SLM) = N(0, 0.183)
               (0.044)
               [4.123]
ln(Nitrogen) = 0.898 ln(SLM) + N(0, 0.057)
              (0.096)              (0.014)
              [9.338]              [4.123]
ln(Conductance) = 1.146 ln(Nitrogen) + N(0, 0.300)
                 (0.206)                   (0.070)
                 [5.525]                   [4.123]
ln(Photosynthesis) = 0.548 ln(Conductance) + N(0, 0.091)
                    (0.069)                      (0.022)
                    [7.977]                      [4.123]
ln(CO2) = −0.162 ln(Photosynthesis) + 0.142 ln(Conductance) + N(0, 0.001)
          (0.020)                    (0.013)                     (0.0002)
5.1 Measurement error and the tests
d-sep test. If we test 500 independent data sets, each with 1000 independent observations, then the 95% confidence interval for our empirical rejection rate would be between approximately 2% and 8% (Manly 1997). Table 5.1 summarises the results of these simulations as we increase the measurement errors. We see that even when the measurement error variance of Y is 0.3, i.e. slightly less than 4% of the true variance of Y, the rejection rate is outside the 95% confidence limits. As the measurement error variance increases further, the rejection rate increases rapidly. In other words, even if the hypothesised causal process involving the theoretical variables were correct, we would tend to reject it too often if we were to incorrectly assume that our conditioning variable is measured without error. Here, ignoring measurement error increases the likelihood that one will incorrectly reject a model that is correct. The effect of measurement error on the accuracy of the probabilities associated with the maximum likelihood chi-squared statistic is the same in this example (Table 5.1).
5.2 Measurement error and the estimation of path coefficients

The effect of measurement error on the accuracy of the estimation of the path coefficients, using either the least squares regression methods of Chapter 3 or the maximum likelihood methods of Chapter 4, is perhaps somewhat better known to biologists.

Table 5.1. The empirical rejection rates of 500 independent data sets, each consisting of 1000 independent observations (X, X′, Y, Y′, Z, Z′). The true model is shown in Figure 5.3A with different amounts of measurement error, and the inferential tests are based on the incorrect model in Figure 5.3B. The population variances of the latent X, Y and Z are 1, 8 and 4.75. The population variances of the observed X′, Y′ and Z′ are 1 + σ1², 8 + σ2² and 4.75 + σ3²

Measurement error        Ratio of variances       d-sep      ML
σ1²    σ2²    σ3²        X/X′   Y/Y′   Z/Z′       rejection  rejection
                                                  rate       rate
0.01   0.01   0.01       0.990  0.999  0.998      0.052      0.064
10.0   0.01   0.01       0.091  0.999  0.998      0.060      0.052
0.01   0.10   0.01       0.990  0.988  0.998      0.060      0.050
0.01   0.30   0.01       0.990  0.964  0.998      0.088      0.102
0.01   0.40   0.01       0.990  0.952  0.998      0.152      0.144
0.01   0.01   10.00      0.990  0.998  0.322      0.052      0.060

Note: ML, maximum likelihood.
Let's first look at a simple example involving only two variables (X and Y) that are measured with error. When I say 'measured with error' I don't mean only the obvious case in which the measuring device (say, an analytical balance) has a certain degree of error.

Imagine that you wish to measure the nitrate availability of the soil in the rooting zone of a plant, but only measure the total nitrogen content of samples of this soil at a single time. In such a case the error of measurement will include not only the error involved in the analytical method for nitrate concentration, but also the error involved in using the sample measures of total nitrogen at one point in time as a proxy variable for the total nitrogen availability in the rooting zone. Let us imagine a causal process in which the nitrate absorption rate of a plant (Y) is caused by the amount of nitrate available in the rhizosphere of its roots (X). X is estimated as the average nitrate availability of a sample of soil cores taken directly beneath the plant at one point in time. Y is estimated as the change in the net total nitrogen concentration of the plant from the time the soil is sampled until the next day. Figure 5.4 shows the causal graph with the measurement error included (Figure 5.4A) and with the measurement error ignored (Figure 5.4B).

The path coefficient shown as a′ in Figure 5.4B is the regression slope of the measured Y′ on the measured X′. By definition, this is:

a′ = Cov(X′, Y′)/Var(X′)

Now, the true value of a′ in Figure 5.4B can be derived from the rules of path analysis applied to the true structure in Figure 5.4A, where b1 and b2 are the coefficients linking the latent variables to their measures and e2 is the measurement error of X′:

a′ = Cov(X′, Y′)/Var(X′) = a·b1·b2·Var(X) / (b1²·Var(X) + Var(e2))

Often the measured variables (X′ and Y′) will scale 1:1 with the underlying latent variable; that is, a unit increase in the underlying variable will result in a unit increase or decrease of the measured variable. In such a case (or if b1 = b2) the formula can be simplified to:

a′ = a·Var(X) / (Var(X) + Var(e2))

From this we see that the effect of measurement error (e2) is to decrease a′ relative to a. If we ignore the measurement error then a′ will be a biased estimate[2] of a. The formula also shows why it is important to sample in such a way as to allow the widest variation possible in the causal variable X. Presumably the measurement error will not change with the range of X, and so as the variance of X increases the difference between a and a′ will decrease. Furthermore, measurement error in the effect variable (Y) has no effect on the bias of the path coefficient.

[2] This is true in this simple bivariate case. When there is more than one cause of Y, each with measurement error, then the relationship between a and a′ will also depend on the covariances between the measurement errors. See Bollen (1989) for the exact formulae.
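A quick numerical illustration of this attenuation (my numbers, chosen only for illustration):

```python
a, var_x, var_e2 = 0.8, 2.0, 0.5                # hypothetical values
a_prime = a * var_x / (var_x + var_e2)
print(a_prime)                                   # 0.64: the slope is attenuated

# Quadrupling the spread of X shrinks the bias:
print(a * (4 * var_x) / (4 * var_x + var_e2))    # about 0.75, much closer to 0.8
```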
These measurement errors have different effects on the estimation of the path coefficients and on the probabilities of the overall inferential test of the causal model. For instance, in the causal model shown in Figure 5.3, measurement error in X (the air temperature) had no effect on the probability levels estimated in the inferential test when we ignored measurement error, but would bias the path coefficient from air temperature to metabolic rate. Measurement error in Y (metabolic rate) did have an effect on the estimated probabilities but would not bias the path coefficient[3] from X to Y.

[3] Measurement error in Y would bias the estimation of the path coefficient from Y to Z, if we ignore it.

Figure 5.4. The first directed graph (A) shows the true causal structure in this scenario. The second directed graph (B) shows the causal structure that is assumed when ignoring measurement error.
5.3 A measurement model
When measurement error cant be safely ignored, it must be explicitly
included in the model and estimated. Most biologists deal with measure-
ment error by ignoring it. Sometimes this is reasonable. After all, we are able
to measure temperature, mass or CO
2
concentration with great accuracy.
Sometimes ignoring measurement error is not at all reasonable. Trying to
estimate the fat reserves of a large free-ranging mammal by palpating its ribs
and giving it a score of 1 to 4 can hardly inspire condence in its accuracy,
yet this is a common measure of body condition. Similarly, we do not
possess the equivalent of a thermometer that we can put into the mouth of
our animals to measure their evolutionary tness. If we try to measure tness
using indirect measures then these measures will probably possess important
measurement errors.
Although not generally known to most biologists, methods for dealing with measurement error have been developed in the social sciences. Almost all important variables are latent in these sciences and they can be measured only with substantial error. For instance, one might reasonably hypothesise that the degree to which a person can empathise with suffering might determine their career choice. It seems reasonable to suppose that a person with more empathy might choose to become a nurse rather than a mercenary soldier. Yet how can one measure empathy? A common approach would be to devise a series of survey questions and develop an index of empathy based on the answers to these questions. It is obvious that choosing the answer 'A' in a multiple choice test does not cause one to become a nurse rather than a mercenary, nor does it cause one to become more empathic. Rather, a psychologist might say that one's empathic tendency is a common latent cause both of the answers on the survey and of one's choice of career. The survey answers are imperfect measures, or indicators, of the underlying latent variable, and the measurement model must separate those parts of the covariance between the answers that are due to the underlying latent cause from those parts of the covariance due to other causes. One simple type of measurement model is a factor model.
There is a huge literature devoted to measurement theory and to its many pitfalls. Many of these pitfalls are conceptual rather than statistical, so let's begin with an example (Dunn, Everitt and Pickles 1993) that doesn't pose any conceptual problems. You cut a number of pieces of string into different lengths and lay them on a table. Each string now has an attribute 'length' that you ask four different people to measure. One person uses a ruler graduated in centimetres. One person uses a hand and measures in hand lengths. The third person uses a ruler graduated in inches and the
fourth person simply looks at each string and tries to estimate it to the nearest centimetre. The length measurements have different units and each estimate has two different causes. The first cause is the true length of the string, since each person is trying to accurately measure this latent attribute. The other type consists of all those other causes that give rise to the measurement error (incorrectly calibrated rulers, tiredness, myopia . . .). Figure 5.5 shows the causal graph.
In this example everything is in plain sight. The true length of the string, although latent, is not hypothetical. We can see the strings on the table and know that each has a fixed value of the attribute 'length'. The only uncertainty is in knowing the actual length of each string. Let the j strings (j = 1, ..., n) measured by the four people be X_j1 to X_j4 and let the true length of each string be L_j. The structural equations representing this causal process are:
X_j1 = λ1·L_j + N(0, ε1)
X_j2 = λ2·L_j + N(0, ε2)
X_j3 = λ3·L_j + N(0, ε3)
X_j4 = λ4·L_j + N(0, ε4)
These structural equations, coupled with the path diagram in Figure 5.5, state that each person's measurement (X_jk) is a linear function of the true length of each string (L_j) plus a certain amount of other unknown
Figure 5.5. The causal structure generating the different measured lengths of the pieces of string.
causes whose variance is ε_k². The total variance in the kth person's measure of a given string is now separated into two parts: a part that is common to everyone (the common variance) that is due to the true lengths of the strings, and a part that is unique to each person (the unique variance, ε_k²) that is due to those other causes of a particular measurement. Since these unique variances are d-separated in Figure 5.5, we know that they must be uncorrelated. Since there are four observed variables there are 4(5)/2 = 10 unique variances and covariances in the covariance matrix⁴. There are only nine free parameters that we have to estimate: the four path coefficients (the four λ_i), the variances of the four measurement errors (the four ε_i²) and the variance of the latent variable. We can therefore fit this model by maximising its likelihood and then test it using the maximum likelihood chi-squared test, because we have 1 degree of freedom left. Before we do this, however, we have to overcome a problem of identification. Identification is a problem that is discussed in more detail in Chapter 6, after we have seen how to combine the measurement model with the structural model involving the latent variables, but it can be intuitively understood with the following example.
If you are given an equation, say y = 2x + z, and are told only that x equals 1, then there is more than one combination of values for y and z that will solve this equation; in fact, there are an infinite number of such values. The equation is said to be underidentified. If you are told both that x equals 1 and that z equals 3, then there is only one value of y that is admissible: y = 5. The equation is just identified⁵.
In our current example with the string lengths the underidentification arises because we have to estimate both the path coefficients and the variance of the latent variable. We see in Figure 5.5 that d-separation predicts that the partial correlations (thus, the partial covariances) of each pair of observed variables must be zero when conditioned on the latent variable. Maximum likelihood estimation fits the data to the structural equations while respecting this constraint. The predicted covariance between the latent L and each observed X_j is given by Cov(L, X_j) = λ_j·Var(L). Since there is an infinite number of combinations of λ_j and Var(L) that can solve the equations, we must choose one by imposing an additional constraint. In reality, the imposition of this constraint consists of choosing the units that you want for your latent variable.
[Footnote 4: See Chapter 4.]
[Footnote 5: If we are told that x equals 1 and that two different estimates of z are 2.5 and 3.5, then the equation is overidentified. There is no unique solution to overidentified equations and in any empirical problem the objective is to find the combination of estimates that gives the best solution; least squares regression and maximum likelihood estimation are both examples of this.]
At this point you have a choice. If you want the latent variable to have the same units as one of your measures, then you can fix the path coefficient from the latent to this measure to 1. By doing this you are stating that a one unit change in the latent changes the measured variable by one unit of the chosen scale. For instance, if we wish to scale our latent string lengths in centimetres then we could fix λ1 to 1. Think carefully before doing this. Your measure might systematically underestimate small values of the latent variable and overestimate large values; in this case the slope (i.e. the path coefficient) would be greater than 1. If this is the case, or if the scales of none of the measured variables are inherently more reasonable or useful than any of the others, then you can express the scale of the latent variable in units of standard deviations. This is done by fixing the variance of the latent variable to unity and allowing all the path coefficients to be freely estimated. Remember that standardisation (dividing a variable by its standard deviation so that its variance is unity) removes the original unit of the variable and replaces it with a scale of standard deviations from the mean. This has the effect of defining the scale of the latent variable by the measured variable and the path coefficient.
Here is the full set of structural equations that are used in the likelihood maximisation using a standard deviation scale for the latent:

X_j1 = λ1·L + N(0, ε1)
X_j2 = λ2·L + N(0, ε2)
X_j3 = λ3·L + N(0, ε3)
X_j4 = λ4·L + N(0, ε4)
Var(X_j) = λ_j²·Var(L) + ε_j²,  j = 1, ..., 4
Var(L) = 1
Cov(X_i, X_j) = λ_i·λ_j·Var(L),  i ≠ j
These structural equations decompose the observed variances of the four measured variables into one part that is the same for all of them, due to the common cause L, and one part that is different for each of them, due to the uncorrelated measurement errors. Let's simulate this causal process using the following generating equations:

X_j1 = 1·L + N(0, 0.5)
X_j2 = 0.07·L + N(0, 7)
X_j3 = 0.39·L + N(0, 3.3)
X_j4 = 1·L + N(0, 10)
L = N(50, 10)
Notice that the real scale for the true latent length is in centimetres in this simulation, since the path coefficients leading from the latent to X1 and to X4 are unity. So, I generate 100 independent strings, whose true length (L) is measured in centimetres. Since the first person used a ruler graduated in centimetres, the path coefficient is 1. Since she rounded her estimates to the nearest half centimetre, I have given her measurement error a standard deviation of 0.5 cm. The second person used her hand, which was 14 centimetres long. Her measurement scale was hand-lengths, rounded to the nearest half-hand, and so the path coefficient is 0.07, with the standard deviation of the measurement error being 7 centimetres. The third person used a ruler calibrated in inches, resulting in a path coefficient of 0.39, which is the conversion from centimetres to inches. He took little care in his readings and so the standard deviation of the measurement error was 3.3 centimetres. The last person simply visually estimated the true length in centimetres and so the path coefficient is 1. He was accurate only to within 10 centimetres, resulting in the standard deviation of his measurement error being 10 centimetres. Figure 5.6 shows the scatterplot matrix⁶ of the 100 strings.
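For readers who want to reproduce this kind of analysis without commercial software, the following sketch (not the code used in the book) generates data from the equations above with numpy and fits the one-factor model by maximum likelihood with scipy, with Var(L) fixed at 1 as in the text. Because it draws its own random numbers, its estimates will differ somewhat from those reported below.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n = 100
L = rng.normal(50, 10, n)                        # true string lengths, in cm
true_lambdas = np.array([1.0, 0.07, 0.39, 1.0])  # cm, hand-lengths, inches, guessed cm
true_sds = np.array([0.5, 7.0, 3.3, 10.0])       # measurement-error standard deviations
X = L[:, None] * true_lambdas + rng.normal(0, true_sds, (n, 4))

S = np.cov(X, rowvar=False)                      # observed covariance matrix
p = S.shape[0]

def implied_cov(theta):
    """Sigma = lambda*lambda' * Var(L) + diag(error variances), with Var(L) fixed at 1."""
    lam, log_err_var = theta[:p], theta[p:]
    return np.outer(lam, lam) + np.diag(np.exp(log_err_var))

def f_ml(theta):
    """Standard maximum likelihood discrepancy function for covariance structures."""
    Sigma = implied_cov(theta)
    return (np.linalg.slogdet(Sigma)[1] + np.trace(S @ np.linalg.inv(Sigma))
            - np.linalg.slogdet(S)[1] - p)

start = np.concatenate([np.sqrt(np.diag(S)) / 2, np.log(np.diag(S) / 2)])
fit = minimize(f_ml, start, method="BFGS")

chi2 = (n - 1) * fit.fun                         # maximum likelihood chi-squared statistic
df = p * (p + 1) // 2 - 2 * p                    # 10 unique (co)variances - 8 free parameters = 2
print("loadings:", np.round(fit.x[:p], 3))
print("error variances:", np.round(np.exp(fit.x[p:]), 3))
print("chi-squared:", round(chi2, 3), "df:", df)
```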
The measured lengths taken by the four people are all correlated, since they are all trying to measure the same thing. The residual scatter in the graphs between each measured variable is due to the measurement errors of both variables in the pair, and the magnitudes of these measurement errors differ from one variable to the other. Here are the maximum likelihood estimates of the free parameters of the structural equations after fixing the variance of the latent variable to unity:

X1 = 10.076·L + N(0, 3.416)
X2 = 1.829·L + N(0, 7.167)
X3 = 3.631·L + N(0, 3.480)
X4 = 12.120·L + N(0, 9.075)
The chi-squared statistic is 3.283 with 2 degrees of freedom,⁷ producing a probability of 0.19 under the null hypothesis, telling us that the
[Footnote 6: A scatterplot matrix is like a correlation matrix except that the actual scatterplot of each pair of variables in the matrix is shown and a histogram of each variable is shown on the diagonal panels.]
[Footnote 7: I said above that there was only 1 degree of freedom. However, fixing the variance of the latent variable to 1 means that we did not have to estimate this parameter, adding one extra degree of freedom.]
hypothesised causal structure is consistent with the data. The estimated variances of the measurement error of each variable agree well with the true values; the confidence intervals of these estimates, which are also given in most commercial SEM computer programs, include the true values. The estimated path coefficients are quite different from the true values. This is due to the fact that I fixed the variance of the latent variable to 1 even though we know that it is 100 (thus with a standard deviation of 10) in the simulations. This means that the path coefficients are proportional to the true values, with the constant of proportionality being the inverse of the true standard deviation of the latent variable. Since we know the true variance of the latent variable in this simulation, we can convert the structural equations, obtaining:

X1 = 1.0076·L + N(0, 3.416)
X2 = 0.1829·L + N(0, 7.167)
X3 = 0.3631·L + N(0, 3.480)
X4 = 1.2120·L + N(0, 9.075)
Figure 5.6. Scatterplot matrix of the simulated observations on the lengths of the four strings (X1 to X4); the histograms in the diagonal cells show the empirical distributions of the observations.
Both these equations and the ones before are identical up to a constant (1/10) and so the conversion simply changed the mean of the latent variable. Since SEM is generally concerned with the relationships between the variables, not their means, this conversion will have no consequence on the model. However, I said above that we could have fixed the scale of our latent variable to centimetres by fixing the path from the latent to X1 to unity and allowing the variance of the latent to be freely estimated. In an empirical study we would not know the true variance of the underlying latent variable and so this second strategy would be the one to use if the scale of the latent is important to you. We can calculate the correlation between the measured variables and the underlying latent variable in order to judge how well the measurement model has done. These correlation coefficients are routinely printed out in commercial SEM programs; Box 5.1 summarises the calculations.
Box 5.1. Correlating latents and indicators

By definition, the correlation coefficient between the latent variable, L, and its observed indicator variable, X_i, is:

ρ(L, X_i) = Cov(L, X_i)/√(Var(L)·Var(X_i)) = λ_i·Var(L)/√(Var(L)·Var(X_i)) = λ_i·√Var(L)/√Var(X_i)

where λ_i is the path coefficient from the latent (L) to its indicator (X_i). The coefficient of determination, ρ²(L, X_i), between the latent and its indicator is:

ρ²(L, X_i) = λ_i²·Var(L)/Var(X_i)

where ρ²(L, X_i) is called the reliability of X_i.

If you want to obtain estimates of the latent variable, up to a scaling constant, then you can form a weighting function of the observed variables; see Bollen (1989), page 305, for the formula. However, no weighting function can estimate the latent without error and, in practice, the various weighting functions that have been proposed do not improve the accuracy of the estimation of the latent much beyond that obtained by choosing the measured variable whose correlation with the latent is highest.
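As a small illustration of these formulae (not a calculation reported in the book), the reliabilities of the four string measures can be computed directly from the maximum likelihood estimates given earlier, assuming that the N(0, ·) terms in those fitted equations report error standard deviations:

```python
import numpy as np

lam = np.array([10.076, 1.829, 3.631, 12.120])   # estimated path coefficients (latent -> X_i), Var(L) = 1
err_sd = np.array([3.416, 7.167, 3.480, 9.075])  # estimated error standard deviations (assumed)

var_X = lam**2 * 1.0 + err_sd**2                 # Var(X_i) = lambda_i^2 * Var(L) + epsilon_i^2
reliability = lam**2 * 1.0 / var_X               # rho^2(L, X_i)
print(np.round(reliability, 3))
```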
The accuracy with which one can estimate the underlying latent
variable (up to a scaling constant) will depend both on the reliability of the
measured variables and on the number of such variables used to measure
the latent. However, this predictive ability is really quite secondary in the
context of testing causal models with measurement error. The most important point is that a measurement model allows one to explicitly account for measurement error, therefore providing unbiased estimates of the path coefficients. Because the predicted covariances between the observed variables are functions of the path coefficients linking them, we can obtain unbiased estimates of the predicted covariance matrix and, therefore, of the asymptotic probability of the model under the null hypothesis. For those who like to see the algebraic details, Box 5.2 gives the generic factor model.
Box 5.2. Standard factor model

Consider the following model (Figure 5.7) with six observed (manifest) variables and two latent variables with an unresolved covariance between them. Here are the structural equations:

y1 = a11·f1 + 0·f2 + e1
y2 = a21·f1 + 0·f2 + e2
y3 = a31·f1 + 0·f2 + e3
y4 = 0·f1 + a42·f2 + e4
y5 = 0·f1 + a52·f2 + e5
y6 = 0·f1 + a62·f2 + e6
Figure 5.7. A measurement model involving two latent variables (f1 and f2), each measured by three observed variables (y1 to y3 and y4 to y6).
In matrix form, the equation is Y = AF + E. Multiplying both sides by the transpose of Y gives: YY′ = (AF + E)Y′ = (AF + E)(AF + E)′. Expanding this gives: (AF)(AF)′ + (AF)E′ + E(AF)′ + EE′. Since the errors are independent of the latent factors (f1 and f2) and of each other, when we take expectations the following two terms are zero: (AF)E′ and E(AF)′. This gives E[YY′] = E[AFF′A′] + E[EE′]. From this comes the standard factor model equation:

Σ_YY = AΦA′ + Θ

where Σ_YY is the model covariance matrix between the observed (Y) variables, Φ is the model covariance matrix between the latent (F) variables, and Θ is the model covariance matrix between the errors (which don't have to be mutually independent, although they are in the current example). In order to avoid indeterminacies, we can set the factor variances to unity. Putting this all together for our model, the elements σ_ij (i, j = 1, ..., 6) of Σ_YY are obtained from Σ_YY = AΦA′ + Θ with

A = | a11   0  |
    | a21   0  |
    | a31   0  |
    |  0   a42 |
    |  0   a52 |
    |  0   a62 |

Φ = | 1    φ21 |
    | φ21   1  |

and Θ the (here diagonal) covariance matrix of the errors e1 to e6.
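The factor model equation is easy to evaluate numerically. A minimal numpy sketch for the two-factor, six-indicator model of Box 5.2 follows; the numerical values of the loadings, the factor correlation and the error variances are arbitrary and chosen purely for illustration:

```python
import numpy as np

A = np.array([[0.8, 0.0],   # a11   loadings of y1..y3 on f1
              [0.7, 0.0],   # a21
              [0.6, 0.0],   # a31
              [0.0, 0.9],   # a42   loadings of y4..y6 on f2
              [0.0, 0.5],   # a52
              [0.0, 0.4]])  # a62

Phi = np.array([[1.0, 0.3],   # factor covariance matrix; variances fixed at unity
                [0.3, 1.0]])

Theta = np.diag([0.36, 0.51, 0.64, 0.19, 0.75, 0.84])  # diagonal error covariance matrix

Sigma = A @ Phi @ A.T + Theta   # model-implied covariance matrix of y1..y6
print(np.round(Sigma, 3))
```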
The minimum number of observed variables needed to fit, and test, a measurement model will depend on the number of hypothesised latent variables and the ways in which these latents are related to one another. For instance, if you have only one measured variable, the structural equation is X_j = λL + ε. You have only one element in the covariance matrix (i.e. the variance of X) but you have three parameters to estimate: λ, Var(ε) and Var(L). You can fix λ to 1 to fix the scale of the latent, but this still leaves Var(L) and Var(ε). The equation is underidentified. If you can obtain an independent estimate of the error variance, then you can fix Var(ε) to this value. This can sometimes be done. For instance, one could physically extract the body lipids of a sample of animals to obtain a precise estimate of body fat and then regress an indirect measure of body fat on this direct measure to obtain the residual error variance. This will allow you to separate out measurement error in subsequent data (assuming that the measurement error doesn't change), but you still can't test this measurement model, since there would be no degrees of freedom.
What about two measures of a latent? With two observed variables we have three non-redundant elements of the covariance matrix (two variances and one covariance), but we also now have five free parameters (λ1, λ2, Var(ε1), Var(ε2) and Var(L)). We have not solved our problem of underidentification. With three measures of a latent we have six non-redundant elements of the covariance matrix and seven free parameters. Since we have to set the scale of the latent, for instance by fixing Var(L) at 1, we can now fit the structural equations, but we will have no degrees of freedom with which to test the measurement model. This is fine if we don't need to independently justify the measurement model (for instance, the relationship between a thermometer and the air temperature) but if there is any question about the causal relationships between the measured variables then we have defeated the whole purpose of modelling measurement error. Four measured variables per latent is the minimum number needed to both fit and test such a measurement model.
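The counting argument in the last two paragraphs can be condensed into a few lines of code (a sketch of the arithmetic only, not of any fitting procedure), for a single latent with k indicators and the scale set by fixing Var(L) = 1:

```python
def one_factor_df(k):
    """Degrees of freedom for a one-factor model with k indicators and Var(L) fixed at 1."""
    elements = k * (k + 1) // 2   # non-redundant variances and covariances among the indicators
    free = 2 * k                  # k path coefficients + k error variances
    return elements - free

for k in (1, 2, 3, 4):
    print(k, "indicators -> df =", one_factor_df(k))
# negative df for 1 or 2 indicators (underidentified), 0 for 3 (just identified), 2 for 4
```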
5.4 The nature of latent variables
So far, I have described latent variables simply as variables that we have not directly measured but that we can directly observe. In the previous examples there was no question but that the animals really did have lipid reserves or that the strings really did have a length. Our only concern was in accurately measuring these variables. In such situations the development of the measurement model involves choosing measurable indicator variables that are all linearly related to the same latent variable. Ideally, the only causal relationships between these indicators will be through the common effect of the latent variable. If there exist other causal relationships between the measured variables, through other latents or not, then these must also be included in the model.
Often nature is not that accommodating. What happens if we want to model latent variables that we cannot directly observe? In such cases, even the existence of the latent variable is hypothetical. The invocation of such theoretical entities presents much more difficult choices, since we can't rely on direct observation to know whether such things even exist, although the actual modelling is no different. None the less, the history of science is littered (or enriched, depending on your philosophical view) with such things. When Gregor Mendel invoked recessive and dominant alleles of genes to explain his patterns of inheritance in pea seeds, he did not measure or observe such things. Rather, he inferred them because the ratios of the resulting phenotypes agreed with the binomial proportions that would result if such things existed⁸. Genes were latent variables and still are; no one has ever directly observed a gene. Atoms, too, are latent; the Periodic Table was developed by inferring atomic structures from the numerical regularities that resulted from experiments. Ernst Mach, who was mentioned in Chapter 3 as one of the phenomenalists who influenced Karl Pearson's views, initially refused to accept the reality of shock waves caused by bullets going faster than the speed of sound. He accepted such waves only when he was able to devise an experiment in which a camera was rigged to take a picture just as a bullet cut a fine wire covered in soot, revealing a V-shaped pattern⁹. As these successful latent variables attest, scientists have regularly invoked things that can only be indirectly observed through the use of proxy measures. The problem is that scientists have also invoked unsuccessful latent variables. A classic example is the aether, through which light waves were supposed to cross outer space. The use of latent variables in measurement models or SEM is not so much a statistical controversy as a scientific and philosophical one. Think carefully before including latent variables in your models and be prepared to justify their existence.
Much of my personal discomfort with latent variable models comes from the causal claims that many (by no means all) latent variables make. It is one thing to invoke a theoretical unmeasured variable and quite another to demonstrate that such an entity has both a reality in nature and causal efficacy. Choosing, developing and justifying such latent variables is, perhaps, the most difficult aspect of structural equations modelling. I don't know of any set of rules that can unfailingly guide us in this task either. The exploratory methods, described in Chapter 8, can help to alert us to the existence of latent variables. The statistical tests based on maximum likelihood allow us to compare our data with such hypothesised latent variable models and therefore potentially to reject them. However, scientists generally demand stronger evidence than an acceptable statistical fit before accepting the physical reality of unmeasured variables. Before continuing further, it is again useful to look briefly at the history of latent variable modelling in statistics. The hornets' nest of confusion involving latent variable models is due in part, I believe, to the historical link between latent variable models and factor models in the social sciences.
In 1904 the English psychologist Charles Spearman combined
the new psychometric work of Alfred Binet on human intelligence with
[Footnote 8: There has been a long-running debate as to whether Mendel 'cooked' his data by ignoring outliers, since the observed and predicted ratios are so remarkably close as to be highly unlikely to occur by chance.]
[Footnote 9: Needless to say, he did not really observe the shock waves, only the indirect effect on the soot particles.]
correlation coefficients. In his provocatively titled paper (Spearman 1904), 'General intelligence objectively determined and measured', he hypothesised that the observed measures of intelligence of people, obtained from test questions, were all correlated because they were all due to a general latent intellectual capacity (g) that varied from person to person. If we have four different measures of intelligence (from an IQ test, say) then the causal graph would look like Figure 5.8.
Now, given this structure, the population correlation between any two observed variables, say X1 and X2, would be ρ12 = λ1·λ2. It follows that the following three equations must be true if the model in Figure 5.8 is true:

ρ12·ρ34 − ρ13·ρ24 = 0
ρ13·ρ24 − ρ14·ρ23 = 0
ρ14·ρ23 − ρ12·ρ34 = 0

Spearman called these 'vanishing tetrads' because each involves four correlation coefficients and they are, in modern terms, constraints on the correlations due to the causal structure of the model. He argued (incorrectly, as explained in Chapter 8) that data obeying such vanishing tetrads were evidence for this unmeasured latent cause (generalised intelligence). Spearman apparently viewed this latent variable as a real, causally efficacious attribute of people. As more measured variables were added, and more complicated latent structures were hypothesised, one could derive the vanishing tetrads that were implied by the model, but this quickly became difficult to do, both conceptually and computationally.
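A short numerical check (not part of Spearman's or the book's calculations) makes the constraint concrete: if a single latent cause g generates the correlations, so that ρij = λi·λj for every pair, then the tetrad differences vanish for any choice of loadings:

```python
import numpy as np

lam = np.array([0.9, 0.8, 0.7, 0.6])   # arbitrary loadings of X1..X4 on g
rho = np.outer(lam, lam)               # implied correlations: rho_ij = lambda_i * lambda_j (i != j)

t1 = rho[0, 1] * rho[2, 3] - rho[0, 2] * rho[1, 3]   # rho12*rho34 - rho13*rho24
t2 = rho[0, 2] * rho[1, 3] - rho[0, 3] * rho[1, 2]   # rho13*rho24 - rho14*rho23
t3 = rho[0, 3] * rho[1, 2] - rho[0, 1] * rho[2, 3]   # rho14*rho23 - rho12*rho34
print(t1, t2, t3)                      # all zero under the single-factor model
```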
Figure 5.8. A measurement model for general intelligence, a latent variable. Note that the errors for each of the observed measures (X1 to X4) are not shown but will exist unless the measures are all perfectly correlated with this latent variable.
In the 1930s Harold Hotelling invented principal components analysis and Louis Thurstone invented factor analysis. These methods were not based on any explicit causal model. Quite the contrary. Thurstone viewed science in much the same way as Pearson did: science consisted of erecting 'constructs' that could describe the data as simply as possible. Whether or not such constructs actually existed in nature was irrelevant; they could economically summarise the patterns of correlation and replace a large number of observed variables by a single construct. These constructs, or factors, had the property that they could partition the variances of each measured variable into one part (the construct) that was common to all and one part that was unique to each measured variable. Described in this way, we appear to be right back to our measurement model, as described above. However, such factor models (like principal component models) could perform this trick quite independently of whether the common variance was really due to some unmeasured common cause. Moreover, the method, if drawn as a graph, always has the arrows going from the construct (factor, common variance) to the measured variables. Such a structure was a requirement of the method. Interpreted as Thurstone had intended, this was not a problem, since the constructs were simply mathematical functions designed to summarise data. Interpreted as causal models (as Thurstone most emphatically did not intend) factor models had the bizarre property of requiring that the direction of causality always went from the latent construct to the observed variables. The obvious advantage of factor analysis or principal components analysis¹⁰ over Spearman's method of vanishing tetrads was that these former methods were easier to use and based on standard formulae. One had only to plug the data into the equations and out popped the construct or the principal component axis.
Thus vanishing tetrads became an historical footnote and factor analysis (with its requirement that the arrows go from the construct to the measured variables) took their place in psychometrics. Jöreskog (1967, 1969) applied maximum likelihood methods to factor analysis to develop an inferential test for such models, and then extended this to allow for cause-and-effect relationships between the latent variables based on econometric simultaneous equations models, giving rise to structural equations models (Jöreskog 1970, 1973). Although there is no longer any mathematical requirement that the arrows always go from the latents to the measured variables, as was the case with factor analysis, the formalism of factor analysis still persists in SEM along with some of its philosophical origins.
[Footnote 10: Principal components analysis, another multivariate data-summary method, requires that the path coefficients always go in the opposite way to factor analysis.]
As an example, consider the description given in Bollen's (1989) influential book on SEM. He stated that the measurement process begins with a concept and defines a concept as 'an idea that unites phenomena under a single term'. He gave the example of 'anger', which provides the common element tying together attributes such as screaming, throwing objects, having a flushed face and so on. The concept of anger, he said, acts as a summarising device to replace a list of specific traits that an individual may exhibit. This is a close paraphrase of Thurstone's original description of a factor, but remember that Thurstone's factor analysis was explicitly acausal. To Bollen's rhetorical question 'Do concepts really exist?', he answered '[c]oncepts have the same reality or lack of reality as other ideas . . . The concept identifies that thing or things held in common.' Latent variables are the representations of concepts in measurement models. If we are dealing with purely statistical models devoid of causal implications, then such a view might be fine. If our models are statistical translations of causal processes then the latent variables in our models must be something more than a mathematical summary; latent variables must represent variables with physical reality having causal relationships to the measured variables.
One of the early controversies amongst geneticists at the turn of the century concerned the inheritance of size differences in different body parts. W. E. Castle, an influential geneticist at the time and Sewall Wright's thesis supervisor, argued that there was a single size factor that was inherited and that determined the allometric scaling of different body parts. Part of this argument was based on correlation coefficients, calculated by Wright while still a graduate student, relating five different bone measurements of rabbits. Davenport (1917), studying human stature, argued that the patterns of correlation between different lengths of different body parts suggested that these attributes of size were inherited independently. In 1918 Wright published 'On the nature of size factors' based on the rabbit measures, in which he calculated a series of partial correlations. Based on these calculations he concluded that his own supervisor was wrong and that 'These three correlations¹¹ suggest the existence of growth factors which affect the size of the skull independently of the body, others which affect similarly the length of homologous long bones apart from all else, and others which affect similarly bones of the same limb'. Since no one knew what these size factors were, the entire argument concerned the number of latent variables controlling the inheritance of size in different body parts. The following example shows how one can test such claims.
[Footnote 11: Actually, they were partial correlations.]
5.5 Horn dimensions in Bighorn Sheep
Marco Festa-Bianchet and his students have been following a population of Bighorn Sheep from the Rocky Mountains of Alberta for many years. Horn size is very important in this species because, among other things, it affects the ability of males in combat during the rut and therefore their evolutionary fitness. Every year from 1981 until 1998 the researchers measured the total length and the circumference at the base of the two horns of the captured sheep. As one might expect, these four variables are highly correlated and display the sorts of allometric scaling pattern that are so ubiquitous in biology. Are these four variables simply responding to a single latent 'size' factor? In other words, are the patterns of correlation between the four variables simply due to a single common unmeasured cause that determines increases in linear dimensions, as Castle might have supposed?
Figure 5.9 shows this hypothesis, translated into a causal graph. To fix the scale of the latent, I fixed the variance of the latent variable to 1. The data do not follow a multivariate normal distribution, even after a ln-transformation, as shown by Mardia's normalised coefficient of kurtosis. Because of this, I use a robust estimation method for the chi-squared statistic (the Satorra-Bentler chi-squared); these statistics are explained in detail in Chapter 6. The data are clearly not consistent with the single common latent model in Figure 5.9, since the Satorra-Bentler chi-squared statistic is 759.106 with 2 degrees of freedom. The probability of observing this by chance is far lower than one in a million.
Figure 5.9. The hypothesised measurement model relating four
observed attributes of the horns of male Bighorn Sheep.
A look at the residuals shows why the fit is so bad. The residuals of the two length measures are highly correlated, indicating that there is something else that is affecting length independently of the basal circumference. Perhaps the 'General' size factor causes two 'Specific' size factors, one for length and one for circumference? In Chapter 6 I explain how one can statistically test these ideas; however, I can't point to any specific biological mechanism to justify this proliferation of hypothetical unmeasured variables.
At this point I apply the SQUIRM test. When the hypothesised latent variable begins to resemble 'a summarising device to replace a list of specific traits that an individual may exhibit' rather than a physical thing with causal efficacy, then I begin to squirm. The statistical model has wandered too far away from any causal model to which it was supposed to be a translation. I can't justify one of the auxiliary assumptions (that the latent variable is not simply a statistical construct) beyond reasonable doubt. Each person will have their own tolerance for this, but in my experience most biologists (and reviewers!) have a very low SQUIRM tolerance indeed. My own (highly personal) opinion is that this is a good thing.
5.6 Body size in Bighorn Sheep
Body size is another important attribute of Bighorn Sheep. Large animals are less likely to fall prey to predators. Animals that have been able to amass sufficient fat reserves in the autumn are more likely to survive the severe winter conditions at the top of a mountain in Canada. Larger males are more successful in the rut and therefore are able to copulate with more of the females. The reproductive success of a female is affected by her fat reserves. Now imagine that you are a field biologist, perched at the top of a steep rocky slope with a temporarily subdued animal, and you need to estimate its body size. You do not have a balance (and have to keep your own balance!) but you can quickly take measurements of attributes associated with body size. Can you construct a measurement model that will be able to estimate the unmeasured body size?
The following analysis, based on data provided by Festa-Bianchet, is from four indirect measures of body size based on 248 observations of Bighorn Sheep. The observed variables are the total length of the animal (snout to tail), the circumference of the neck, the circumference of the chest just behind the front legs, and a visual estimate (which sounds better than 'guess') of body weight. These data, transformed to their natural logarithms, are consistent with multivariate normality, based on Mardia's coefficient. The measurement model consists of a single latent variable, which I have labelled 'body size', that is the single common cause of the four indirect measures listed above. As always, one can set the scale of the exogenous latent variable either by fixing its variance to unity or by fixing the path coefficient from it to one of the observed variables. Since body size is usually interpreted as meaning mass, I have therefore chosen to fix the path coefficient from the latent to the estimated body weight at unity. This means that the latent body size is measured in ln(kilograms) (the units of the estimated body weight). Figure 5.10 shows the directed graph for this measurement model.
The chi-squared statistic for this model is 0.971 with 2 degrees of freedom, giving a probability under the null hypothesis of 0.615. The data are perfectly consistent with the model. Here are the structural equations and the proportion of the variation of each measured variable that is accounted for by the latent body size:

ln(Estimated weight) = 1·ln(Body size) + N(0, 0.023),  R² = 0.893
ln(Total length) = 0.370·ln(Body size) + N(0, 0.003),  R² = 0.911
ln(Neck circumference) = 0.424·ln(Body size) + N(0, 0.005),  R² = 0.883
ln(Chest circumference) = 0.387·ln(Body size) + N(0, 0.001),  R² = 0.982
ln(Body size) = N(0, 0.191)
We see that the estimated error variance of the ln(estimated body weight) is 0.023 and the latent body size accounts for 89.3% of the variance of this estimated weight. The 'guesses' were not so bad after all. In fact, these guesses of the true body weight appear to be just as tightly correlated
Figure 5.10. The hypothesised measurement model relating four
observed size attributes of Bighorn Sheep.
with the latent body size as the measures of neck circumference. The observed variable that was most highly correlated (R² = 0.982) with the latent body size was the chest circumference. If only one measurement is to be taken, the chest circumference should be the one to use.
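For readers working in open-source software rather than the commercial SEM programs mentioned in the book, a measurement model of this form could be specified roughly as follows. This is only a sketch: it assumes the Python package semopy (which accepts lavaan-style model descriptions) and invented column names for a data file that is not distributed with the book.

```python
import pandas as pd
import semopy  # assumed to be available; not the software used in the book

# One latent 'size'; its scale is set by fixing the loading on the estimated
# weight to 1, matching the text. Variables are assumed already ln-transformed.
desc = """
size =~ 1*ln_est_weight + ln_total_length + ln_neck_circ + ln_chest_circ
"""

data = pd.read_csv("bighorn_body.csv")  # hypothetical file holding the 248 observations
model = semopy.Model(desc)
model.fit(data)
print(model.inspect())                  # loadings, error variances and the variance of 'size'
```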
If we now wish to test a causal model involving body size in a new data set we can either use one of the measured variables and explicitly include our estimates of the error variance, or else use all four measured variables. The advantage of using all four variables is that we do not have to assume that the error variances will remain the same from one study to the next; we can instead estimate them. This might be a wise decision if, for instance, different people take these measurements from one study to the next and some people make more measurement errors than others.
The problem of different people making systematically biased estimates is the likely explanation for the lack of fit that occurs when another measured variable is included in the measurement model described above. This variable is the length of the hind foot. The chi-squared statistic for a new model that includes this new variable is 25.808 with 5 degrees of freedom, for a probability of about 1 × 10⁻⁴. This five-variable measurement model can be made to fit the data only by letting the length of the hind foot covary freely with the weight estimate and the neck diameter¹². In other words, there are other causes, independent of the latent body size, that are generating associations between these three measured variables. Although it is possible that these other causes are related to the biology of the animals, it seems more likely that the other causes are due to the way the data were collected.
The measurements were taken over many years by many different people, mostly graduate students, with different levels of ability in field work. Unlike the other length measurements, the length of the hind foot requires that the foot and hoof be consistently extended to the same degree. These measurements must be taken quickly while the animal is still subdued. Imagine that you are sitting at the edge of a steep precipice on the top of a mountain, with an adult Bighorn Sheep about to wake up. It is safe to assume that the care with which the foot is extended will vary from person to person. So, if there are any systematic biases between people in how they measure the three variables in question (the length of the hind foot, the weight estimate and the neck diameter) then this would be a cause of correlations between them independent of differences in body size.
[Footnote 12: Actually, one permits covariation between the error variables of these measured variables.]
5.7 Name calling
A certain Henry P. Crowell of Ravenna, Ohio, bought a bankrupt mill in 1881 and went into the business of convincing people to consume what previously only poverty-stricken Scotsmen, Germans and horses had eaten (Burke 1996). How did he accomplish this feat of marketing¹³? Since people associated the word 'Quaker' with honesty and a healthy life-style, he simply called his new product 'Quaker Oats'. Thus began one of the longest-running breakfast cereals in the USA. A certain mouthwash company proudly proclaimed that its product, besides making one's breath taste fresh, also cured halitosis. The dictionary definition of 'halitosis' is bad-smelling breath. So what is the latent variable, shown in Figure 5.10, that is the single independent cause of the estimated body weight, the total length, the neck and the chest circumference of the sheep? I have labelled it 'body size' but is this misleading advertising? Just as saying that something cures halitosis evokes connotations beyond simply curing bad breath, does calling my latent variable 'body size' evoke connotations beyond simply 'that theoretical variable that has the property of making the partial covariances between each unique set of measured variables equal to zero, when conditioned on it'?
Remember Bollen's (1989) claim that '[l]atent variables are the representations of concepts in measurement models', and that '[t]he concept identifies that thing or things held in common'. It is certainly reasonable to state that body size is that which is common to weight, length and circumference of the body (the measured variables). However, if Figure 5.10 is to be interpreted as a description of a causal process, then the latent variable also represents a single common cause of body weights, lengths and circumferences. The causal claim must be that there is a single biological process that determines all of these body dimensions. It would obviously be better if we knew enough about the genetic and developmental processes determining body size that we could label our latent variable as 'hormone X' or 'gene Y'. If we can't, then we could at least label it as the 'unknown cause of body size'. So long as both you and I understand the name 'body size' as being a short form of saying this, then we can properly translate between the causal claim and the statistical model. It is particularly important that we choose our words carefully when dealing with latent variables, and the burden of clarity is on the person proposing the model, not on the reader. If you see a latent variable in a structural equation and its meaning and causal justification are not clearly explained, think of bad breath.
[Footnote 13: To show what Crowell was up against, consider that the first edition (1755) of Samuel Johnson's classic Dictionary of the English Language defined oats as a grain that sustained horses in England and people in Scotland.]
6 The structural equations model
The structural equations model is commonly described as the combination of a 'measurement model' and a 'structural model'. These terms derive from the history of SEM as a union of the factor analytical, or 'measurement', models of psychology and sociology and the simultaneous structural equations of the econometricians. In its pure form it therefore explicitly assumes that every variable that we can observe is an imperfect measure of some underlying latent causal variable and that the causal relationships of interest are always between these latent variables. As in many other things, purity is more a goal than a requirement. Using the example in Chapter 5 of the effect of air temperature on metabolic rate (Figure 6.1), the things that we can measure (the height of the mercury in the thermometer or the change in CO2 in the metabolic chamber) always contain measurement error (εi). The measurement model, shown by the dotted squares in Figure 6.1, describes the relationship between the observed measures and the underlying latent variables (average kinetic energy of the molecules in the air and the metabolic rate of the animal). The structural model, shown by the dotted circle in Figure 6.1, describes the relationship between the true underlying causal variables. If we have only one measured variable per latent variable, and we assume that the measured variable contains no measurement error (i.e. the correlation between the measured variable and the underlying latent variable is perfect) then we end up with a path model. If we have a set of measured variables for each latent variable and we do not assume any causal relationships between the latent variables, then we have a series of measurement models. If we have more complicated combinations in which we assume causal relationships between the latent variables, then we have a full structural equations model. Therefore, if you have understood Chapters 1 to 5, then you already know how to construct and test a structural equations model; you simply have to put the pieces together.
The goal of this chapter is therefore to deal with some technical details that I have ignored up to now. The first detail is the problem of identification. In models involving more complicated combinations of latent and observed variables, how can we make sure that no model parameters are underidentified? The second detail involves the robustness of SEM to violations of two important assumptions: large sample sizes and multivariate normal distributions. What happens when our data do not agree with these assumptions, and what can be done about it?
6.1 Parameter identification
We have already met the problem of underidentification in Chapter 5. Intuitively, a model is underidentified when more than one combination of parameter values can account for the same pattern of covariance. If a model is underidentified then you can't trust the parameter estimates, their standard errors, the chi-squared value or its probability level. If a model is underidentified then most commercial SEM programs will print a warning. For instance, if you are told that a parameter estimate is a linear combination of some other set of parameters, or that an estimated variance is negative or set at zero, then this probably means that the model is underidentified.

A model can be structurally underidentified or empirically underidentified. Structural underidentification means that the model will be underidentified for any combination of parameter estimates; the problem is in the way the model itself is constructed. You will want to ensure that your model is not structurally underidentified before collecting data, in order to avoid wasting your time. Empirical underidentification means that the model is underidentified only for some particular sets of parameter estimates; the problem is not in the general construction of the model but rather with the particular values found in the data. These points will be
Figure 6.1. The relationship between the measurement model and the
structural model in the causal structure relating the ambient air
temperature and the metabolic rate of an animal.
illustrated with examples later. Let's start with some useful rules for avoiding this problem.
6.2 Structural underidentification with measurement models
Following the notions introduced in Chapter 5, let's call a 'measurement model' any factor analytical model consisting of a set of latent variables and a set of observed indicator (measurement) variables that are each caused by at least one of the latent variables. The latent variables can be allowed to freely covary (i.e. there can be curved double-headed arrows between them) but there are no cause-effect relationships between the latent variables (i.e. there can be no arrows from one latent to another). Bollen (1989) has summarised three rules to help in judging whether a measurement model is structurally identified. These rules cover many types of measurement model, but not all. All of these rules assume that the scale of each latent variable has been fixed, as described in Chapter 5, either by fixing one of the path coefficients to 1 or by fixing the variance of the latent variable to 1.
Rule 1: t ≤ n(n+1)/2, where n is the number of observed variables and t is the number of free parameters (i.e. free path coefficients, free error variances and free covariances either between the latents or between the error variables). This rule is necessary for identification; if the rule doesn't hold in your model then you can be sure that the model is not identified. Unfortunately, this rule doesn't ensure that your model will be identified; even if the rule holds, the model might still be underidentified. The following two rules are sufficient (i.e. if they hold then the model is identified) but not necessary (i.e. there are still identified models that violate these rules).

Rule 2: A measurement model is identified if, along with rule 1:
1. There are at least three indicator variables per latent variable.
2. Each indicator variable is caused by only one latent variable.
3. There are no correlations between the error variables.

Rule 3: A measurement model is identified if, along with rule 1:
1. There is more than one latent variable.
2. There are at least two indicator variables per latent variable.
3. Each indicator variable is caused by only one latent variable.
4. Each latent variable is correlated with at least one other latent variable.
5. There are no correlations between the error variables.
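Rule 1 is just counting, and can be checked mechanically. The small helper below (not from the book) applies it to the six models of Figure 6.2, using the n and t values given in the discussion that follows:

```python
def t_rule(n_observed, t_free):
    """Bollen's rule 1: t free parameters cannot exceed n(n+1)/2 covariance elements."""
    available = n_observed * (n_observed + 1) // 2
    return t_free <= available, available - t_free

# (model, n observed variables, t free parameters), as counted in the text
for label, n, t in [("A", 2, 4), ("B", 2, 3), ("C", 3, 6),
                    ("D", 4, 8), ("E", 4, 9), ("F", 4, 9)]:
    ok, spare = t_rule(n, t)
    print(label, "rule 1 satisfied:", ok, "| elements minus free parameters:", spare)
```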
To understand how these rules work, let's look at Figure 6.2, which shows six different measurement models. The scale of the latent variables
Figure 6.2. Six measurement models used to illustrate Bollen's rules for identification.
(L) have all been set by fixing the variances of the latent variables to unity. Free parameters (the path coefficients, the variances of the error variables, or the covariances indicated by curved double-headed arrows) are shown by an asterisk (*). The model in Figure 6.2A is underidentified; there are four free parameters (t = 4) and two observed indicator variables (n = 2), but rule 1 states that t ≤ n(n+1)/2. If we constrain the values of the two path coefficients to be the same during the iterative procedure that minimises the maximum likelihood chi-squared statistic (shown in Figure 6.2B as the dashed line between the two free path coefficients) then we have only three free parameters, and the model is just identified¹. This trick, while allowing us to identify the model, doesn't really allow us to get unbiased estimates of the measurement error or of the two path coefficients.
The model in Figure 6.2C is just identified. There are t = 6 free parameters and n = 3 observed variables, therefore 6 = 3(4)/2 and rule 1 is fulfilled. Since there are no degrees of freedom left, we cannot test such a model using the maximum likelihood chi-squared. Such a model can always be fitted even if the causal assumption of a single common latent cause is wrong, and we can't know whether or not the causal assumption is reasonable on the basis of statistical criteria.
The model in Figure 6.2D is overidentified. There are t = 8 free parameters and n = 4 observed variables, therefore 8 < 4(5)/2 and rule 1 is fulfilled. Since there are 4(5)/2 − 8 = 2 degrees of freedom left, we can also test such a model using the maximum likelihood chi-squared. Such a model can always be fitted, but if the causal assumption of a single common latent cause is wrong then we would obtain a significant probability estimate of the measured maximum likelihood chi-squared statistic and could therefore reject the model. Both the measurement models for the Bighorn Sheep horns and for the body dimensions, studied in Chapter 5, were of this form and we saw that the model for the horn dimensions was clearly rejected (p < 10⁻⁶) while the model for the body dimensions was not rejected (p = 0.615).
The model in Figure 6.2E is like the model in Figure 6.2D except that it has a free covariance between the error variables of X1 and X2. Rule 1 is still satisfied since t = 9, n = 4 and 9 < 4(5)/2. Rule 2 is not satisfied; although there are at least three indicator variables per latent (there are four) and each indicator variable is caused by only one latent, there is also a correlation between two of the error variables. Rule 3 can't be applied either, since there is only one latent variable. However, rules 2 and 3 are sufficient
[Footnote 1: In fact, this model is equivalent to the so-called 'major axis' (errors-in-variables) regression of Sokal and Rohlf (1981). Reduced major axis regression is obtained by simply standardising X1 and X2 to unit variance and a zero mean before fitting the model.]
conditions, not necessary conditions. We can't state that the model is definitely not identified, only that we can't tell one way or the other. In fact, model E in Figure 6.2 is structurally identified.
The model in Figure 6.2F looks a bit like two models from Figure 6.2A combined. Remember that model A wasn't identified because rule 1 was violated. What about model F? There are t = 9 free parameters and n = 4 observed variables. Since 9 < 4(5)/2, rule 1 is satisfied. Rule 2 is not satisfied (there are only two indicator variables per latent) but rule 3 is satisfied. Therefore, model F is structurally identified.
This model also provides a good example of how a model can be structurally identified but empirically underidentified. If, in reality, the covariance between the two latent variables (the curved, double-headed arrow) is close to zero then the estimated covariance in the data might be zero due to sampling fluctuations. If this occurs then the maximum likelihood procedure will be trying to fit two independent measurement models, each with only two indicators per latent. Since each separate measurement model has four free parameters (two error variances and two path coefficients) but only two indicator variables, rule 1 would be violated in this particular case.
Recall the measurement model for the length and basal diameter of the left and right horns of the Bighorn Sheep. We were quite confident that the correlations between these four measures were not due to a single common unknown cause because the measurement model with a single latent variable was strongly rejected. Perhaps the observed correlations are due to two correlated latent causes, as shown in Figure 6.2F? In bilaterally symmetrical organisms the left and right halves of the body should be mirror images in terms of size and shape. You have only to look into a mirror (when no one else is watching) to see that no one is really perfectly bilaterally symmetrical. Various environmental perturbations during embryonic development can cause random deviations from perfect bilateral symmetry, and the degree of this 'fluctuating asymmetry' is sometimes used as an index of pollution load or other forms of environmental stress. Perhaps the model with a single latent cause of horn dimensions was rejected because there were additional causes of the left and right horns, besides a single size factor, that generate deviations from bilateral symmetry? This hypothesis produces the model in Figure 6.2F and we know that this model is both structurally identified and has 1 degree of freedom left to test the model. The single size factor is reflected in the free covariance between the two latent variables. The two latent variables, according to our present hypothesis, should represent the different causes of the left and right horns generating deviations from bilateral symmetry. The two latent variables in Figure
6.2F would then represent the causes specific to the left and right horns. When I fit this model to the data (using the Satorra-Bentler chi-squared, since there is significant multivariate kurtosis in the data), I get a chi-squared value of 146.1205 with 1 degree of freedom. Clearly, this model, too, is wrong².
If we think of how the measurements were taken, we are led to another measurement model with two latent variables. The horns are strongly curved and their length is measured with a measuring tape that has to properly follow the curve of the horn. It is possible that longer horns, with a more pronounced curve, would be systematically underestimated as the exasperated researcher tries to make the measuring tape follow the curve of the horn before the sheep regains control. The longer the horn, and the more it is curved, the greater the degree of underestimation, since the measuring tape will have more chance to slip down a bit along the horn. A similar systematic bias might occur for the two measures of basal diameter, since this measurement too requires a subjective decision as to where the base of the horn begins. If these speculations are correct, then each of the length measures and each of the diameter measures might have a separate cause (i.e. the way in which they are measured) besides a common size factor as horn volume increases during development. When I fit this model to the data (again using the Satorra-Bentler chi-squared, since there is significant multivariate kurtosis in the data), I get a chi-squared value of 3.948 with 1 degree of freedom, giving a probability level of 0.05. This model has an ambiguous probability level and its true value is probably higher, given the large positive multivariate kurtosis of the data (Mardia's coefficient of multivariate kurtosis is 27.1); this point will be further discussed in the section on non-normality of the data. Therefore, I conclude that there is not sufficient evidence to reject it. The path coefficients from the latent diameter to the two diameter measures were 0.522 and 0.519 for the left and right horns. Since the square of the diameter of a circle is proportional to its area (a dimension of 2), one would expect these path coefficients to be 0.5. The path coefficients from the latent length to the two length measures were 0.875 and 0.869, with standard errors of about 0.03, for the left and right horns. Since length has a dimension of 1, one would expect these path coefficients to be 1. An approximate 95% confidence interval of the path coefficients is therefore about 0.872 ± 2(0.03), and 1.0 is clearly outside
THE STRUCTURAL EQUATI ONS MODEL
168
12
I used this example simply for pedagogical reasons. In reality, the notion of uctuating
asymmetry is that the deviations are random from individual to individual. In some indi-
viduals the asymmetry will be on the left side and on others the deviation will be on the
right side. In this case there would be no systematic dierence in the population and this
would not generate a latent cause that is systematic to either left or right horns. The causes
of the asymmetry would simply be subsumed into the unique error variances.
this interval. Therefore the measurement model suggests that the lengths of
the longer horns were systematically underestimated. The covariance (and correlation, since their variances were fixed at unity) between the two latents for length and diameter was 0.9995. If we accept this two-factor measurement model then all these points suggest that the horn dimensions are caused by a single latent size factor, but that there was a systematic bias in measuring the longer horns that introduced a second latent cause of the lengths independent of the diameters.

Now apply your personal SQUIRM test. Does my interpretation of these latent variables seem reasonable to you? The acceptable fit of the measurement model with two latents says nothing about what these two latent variables represent. The explanation that I have outlined, that the two latents represent the systematic errors made in measuring lengths and diameters and that the covariance between the latents is due to the common size factor, is an interpretation of the latent variables. This interpretation of the latent variables in the model is not supported by any statistical evidence; rather, my evidence comes from how the variables were measured and the sorts of error of measurement that might occur. The next step would be to search for an independent confirmation of this explanation. For instance, if the explanation is correct, then the two latent variables should disappear and be replaced by a single latent variable once we measure horn length in a way that does not systematically underestimate longer horns. One way would be to photograph the horns and then measure the lengths using image analysis.
Davis (1993) described a way of testing for identification in much more complicated measurement models, applicable to any measurement model in which each indicator is caused by only one latent variable. This method (the FC1 rule) requires that you be able to do matrix multiplication, but many statistical programs can do this³. A further requirement is that the scale of each latent be fixed by fixing one path coefficient to 1 rather than fixing the scale by fixing the variance of the latent variable to 1. Box 6.1 summarises the FC1 rule.

³ My Toolbox (Appendix) includes a program to carry out this test.
Box 6.1. FC1 rule for identification of a measurement model

The FC1 (Factor Complexity 1) rule for the structural identification of a measurement model assumes that each observed indicator variable is caused by only 1 latent variable (hence its name).

For each latent variable, L_i, in the measurement model, construct a binary matrix P_i with q_i rows and t columns; q_i is the number of observed indicator variables of L_i and t is the total number of observed indicator variables in the model. Each element (p_ij) of P_i has a 1 if the error variables of indicators i and j are d-separated, or if the covariance between them has been fixed, and if the covariance of the latents associated with indicator variables i and j is free; otherwise it has a 0.

Form the matrix D_i = P_i P_i′. Iteratively multiply D_i^(j+1) = D_i D_i^(j) until you get the matrix D_i^(q_i − 1).

The first requirement for structural identification is that every element of D_i^(q_i − 1) be non-zero in the row corresponding to the indicator of L_i that defines its scale. This must be true for all latent variables in the model.

The second requirement for structural identification of the full measurement model is that, for every pair of latent variables whose covariance is to be estimated (i.e. that are not d-separated or whose covariance is not fixed), there must be at least one pair of indicator variables (one for each latent) whose error variables are independent (i.e. d-separated) or whose covariance is fixed.

The third requirement for structural identification of the full measurement model is that, for every latent variable whose variance is to be estimated (i.e. is not fixed), there must be at least one pair of its indicator variables whose error variables are independent (i.e. d-separated) or whose covariance is fixed.
I will now show how this rule works with reference to Figure 6.3. In this model there are two latent variables, and so we need two P matrices:

P_1 = | 0 1 0 0 1 1 |        P_2 = | 0 1 1 0 1 1 |
      | 1 0 1 1 1 1 |              | 1 1 1 1 0 1 |
      | 0 1 0 1 1 0 |              | 1 1 0 0 1 1 |

Figure 6.3. A structural equations model used to illustrate the FC1 rule for identification.
Note that p_12 = p_21 = 1 in P_1 because the error variable of X_1 has a fixed (zero) covariance with the error variable of X_2, and similarly for the error variables of X_4 and X_5 in P_2. There are three indicator variables for each latent and so q_1 = q_2 = 3. We must form D_1^(3−1) and D_2^(3−1). These are the same in this example, although this is not true in general:

D_1^2 = D_2^2 = | 17 20 16 |
                | 20 33 20 |
                | 16 20 17 |
Now, the first requirement is that every element in the row of each matrix representing the scaling variable must be non-zero. The scaling variable of the first latent variable is X_1, and so every element in row 1 of the first matrix must be non-zero. The first part of this first requirement is fulfilled. The scaling variable of the second latent variable is X_4, and so every element in row 1 of the second matrix must be non-zero. The second part of this first requirement is also fulfilled.

The next requirement is that there be at least one pair of error variables (one associated with an indicator of each unique pair of latents) whose covariance is zero or fixed to some other value. The error variables of X_2 and X_5 fulfil this second requirement.

The final requirement is that there must be at least one pair of error variables (associated with indicators of the same latent) whose covariance is zero or fixed to some other value. The error variables of X_1 and X_2 fulfil this requirement for the first latent variable and the error variables of X_4 and X_5 fulfil this requirement for the second latent variable. Therefore this measurement model is structurally identified.
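For readers who prefer to script this check rather than use the Toolbox program mentioned above, the following sketch (assuming Python with numpy; it is not the Toolbox program itself) computes D_i = P_i P_i′, raises it to the power q_i − 1 and verifies the first FC1 requirement for the scaling indicator. The P matrices are entered as transcribed above; the second and third requirements still have to be checked by inspecting the error covariances directly.

```python
import numpy as np

def fc1_scale_row_check(P, scale_row):
    """First FC1 requirement for one latent: with D = P P', every element in
    the row of D**(q-1) corresponding to the scaling indicator is non-zero."""
    P = np.asarray(P)
    q = P.shape[0]
    D_pow = np.linalg.matrix_power(P @ P.T, q - 1)
    return bool(np.all(D_pow[scale_row, :] != 0)), D_pow

# The P matrices as transcribed above (rows: indicators of one latent,
# columns: all six indicators in the model).
P1 = [[0, 1, 0, 0, 1, 1],
      [1, 0, 1, 1, 1, 1],
      [0, 1, 0, 1, 1, 0]]
P2 = [[0, 1, 1, 0, 1, 1],
      [1, 1, 1, 1, 0, 1],
      [1, 1, 0, 0, 1, 1]]

ok1, D1 = fc1_scale_row_check(P1, scale_row=0)   # X1 scales the first latent
ok2, D2 = fc1_scale_row_check(P2, scale_row=0)   # X4 scales the second latent
print(ok1, ok2)
print(D1)   # [[17 20 16], [20 33 20], [16 20 17]]
```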
6.3 Structural underidentification with structural models

Obtaining identification of the measurement model is necessary to fit a structural equations model. However, SEM also includes the causal relationships between the latent variables. In fact, you can think of the structural model as the path model that is embedded in the full model. A path model is therefore also a structural model. The rules for ensuring structural identification that I will describe come from Rigdon (1995). Rigdon's rules do not apply to models in which there are cyclic relationships involving more than two variables (for example, if X causes Y, Y causes Z and Z causes X). On the other hand, these rules are both necessary and sufficient for acyclic or block-acyclic structural models; a block-acyclic model is defined below. This means that any acyclic or block-acyclic structural model that satisfies these rules is guaranteed to be structurally identified and any such structural model that does not satisfy these rules is guaranteed to be non-identified.

The first step is to conceptually divide the structural model into segmented blocks. The model is fully segmented when: (i) there are no cyclic relationships between the blocks and (ii) each block contains the minimum number of variables needed to satisfy (i). In other words, if the members of a set of variables in the model do not have a cyclic relationship then each variable defines a separate block. If, on the other hand, a set of variables does define a cyclic relationship (for example, A causes B, B causes C and C causes A) then they must be included in the same block. If, once this has been done, there are more than two variables in any block then the identification status of the model can't be determined. If this is not the case, then the identification status of the whole structural model can be determined by verifying the identification status of each block. If each block is identified then the whole structural model is also identified⁴. In evaluating these blocks, we don't need to consider the exogenous variables. Figure 6.4 illustrates these points.
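This segmentation step can also be automated. The sketch below (assuming Python with the networkx package) groups variables that share a cycle into one block and then merges blocks joined by a correlated-error pair. The arrow directions used for Figure 6.4A, other than the reciprocal X_3, X_4 pair and the correlated errors stated in the text, are illustrative assumptions only, since the figure itself is not reproduced here.

```python
import networkx as nx

def segment_into_blocks(directed_edges, correlated_errors):
    """Group variables that share a cycle into one block, then merge blocks
    that are joined by a correlated-error pair."""
    g = nx.DiGraph(directed_edges)
    comp_of = {}
    for i, comp in enumerate(nx.strongly_connected_components(g)):
        for v in comp:
            comp_of[v] = i
    merge = nx.Graph()
    merge.add_nodes_from(set(comp_of.values()))
    for a, b in correlated_errors:
        merge.add_edge(comp_of[a], comp_of[b])
    return [sorted(v for v, c in comp_of.items() if c in group)
            for group in nx.connected_components(merge)]

# Figure 6.4A as described in the text; arrow directions other than the
# reciprocal X3/X4 pair are illustrative assumptions.
edges = [("X1", "X3"), ("X2", "X3"), ("X3", "X4"), ("X4", "X3"),
         ("X3", "X5"), ("X4", "X6"), ("X5", "X7"), ("X6", "X7")]
errors = [("X1", "X2"), ("X5", "X6")]
print(segment_into_blocks(edges, errors))
# four blocks, e.g. [['X1', 'X2'], ['X3', 'X4'], ['X5', 'X6'], ['X7']]
```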
In Figure 6.4A there are seven variables. Since variables X_3 and X_4 have a reciprocal (cyclic) relationship they must be included in the same block. Since variables X_5 and X_6 have correlated errors they too must be included in the same block. Variables X_1 and X_2 also have correlated errors and form a block, but they are exogenous in this model and so we don't have to worry about them. Finally, X_7 is in a block all by itself. Therefore Figure 6.4A can be decomposed into four blocks, the causal relationships between the blocks have no cyclic patterns, and the model fulfils the requirements for Rigdon's test.

In Figure 6.4B there are three variables (X_1, X_2 and X_3) that possess a feedback relationship. Therefore all three variables must be included in a single block. The last variable, X_4, forms a second block. Because there are more than two variables in one of the blocks, we can't determine the identification status of this model using Rigdon's rules.
Once the structural model has been reduced to these blocks, then you simply have to determine the identification status of each block. To do this, refer to Figure 6.5, which shows eight different patterns. To interpret these diagrams, you will need some notational conventions. The two variables indicated as 1 and 2 are the two variables in the block (if there is only one variable in the block then it is automatically identified). The variables indicated as P are causal parents. If an arrow and a circle are shown with solid lines then the two P variables must be present. If an arrow and circle are shown with broken lines then the two P variables can be present but their existence is irrelevant to determining the identification status of the block. Finally, OR means that at least one of the two P variables must be present and BOTH means that both P variables must be present.

⁴ Such a model is called a block-recursive model.

Now, look at each block of the structural model that contains two variables in a cyclic relationship and classify it as belonging to one of the eight cases in Figure 6.5. If any of these blocks is not identified then the model is also not identified. The only complication is with case 8 in Figure 6.5. To determine the identification status of a block belonging to this case, ignore the common causal parent of 1 and 2 and then see which of the other seven cases corresponds to the block while ignoring the common causal parent.
6.4 Behaviour of the maximum likelihood chi-squared statistic with small sample sizes

Many of my in-laws like to make home-made wine. A superficial glance at bottles of these wines might convince you that they are the real thing. When you taste them you realise that they vary along a gradient from gut-rot through drinkable to divine.

Figure 6.4. An illustration of Rigdon's rules for structural identification with cyclic models. (After Rigdon 1995.)

Figure 6.5. Eight different cases used to evaluate Rigdon's rules for structural identification. (After Rigdon 1995.)

As with latent variables, giving something a
name doesn't make it so. The so-called maximum likelihood chi-squared statistic (MLX²) is the statistical equivalent of home-made wine. It is not really distributed as a chi-squared variate at all and, unfortunately, its true sampling distribution is unknown. However, as the size of the sample of independent observations increases, the sampling distribution of this statistic becomes closer and closer to the theoretical chi-squared distribution. At very small sample sizes the MLX² statistic is like gut-rot wine; it bears an approximate resemblance to the true χ² distribution but there is no confusing the two. At moderate sample sizes the MLX² is like drinkable home-made wine; it is a reasonable approximation of the real thing unless it is to be used for a special occasion. It is only when sample sizes are very large that one cannot distinguish between the two. So how big is 'big enough' and what can be done if one's sample is not big enough? In this section, I discuss the effects of sample size on the MLX², assuming that the data follow a multivariate normal distribution. To explore these questions I use simulations drawn from the path model shown in Figure 6.6, with all exogenous variables being drawn from a standard normal distribution (i.e. zero mean and unit standard deviation).

Figure 6.7 shows the empirical sampling distribution of the MLX² statistic, based on 1000 independent data sets. I fixed all path coefficients to their theoretical values (0.5) and all the error variances to their theoretical values (1). This way, the only free parameters were the variances of X_1 and X_2, and the model covariance matrix could be determined without iteratively minimising the MLX². There were therefore 13 degrees of freedom and the curve shown in Figure 6.7 is the theoretical χ² distribution with 13 degrees of freedom. The first histogram shows the distribution of the MLX² statistic in the 1000 data sets with 10 observations each. It is clear that this empirical distribution is not well approximated by the theoretical χ² distribution; the 95% quantile, corresponding to a 5% significance level, is shown by the arrow.

Figure 6.6. The path model used to simulate the data that are summarised in Figure 6.7.

The second histogram shows the distribution of the MLX² statistic in the 1000 data sets with 100 observations each. The empirical 90%, 95%, 97.5% and 99% quantiles in the second simulation were 19.68, 21.94, 24.33 and 27.13, corresponding to theoretical probabilities of 0.103, 0.056, 0.028 and 0.012. Now the empirical and theoretical distributions are quite close, and assuming that the MLX² is truly distributed as a χ² variate will introduce little error.
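To give a feel for how such a simulation can be set up, here is a minimal sketch in Python with numpy. Because Figure 6.6 itself is not reproduced here, the five-variable path structure in the code is a hypothetical stand-in with the same general features (path coefficients of 0.5, standard normal exogenous variables and errors, and free variances for X_1 and X_2, giving 13 degrees of freedom); plugging the sample variances of X_1 and X_2 into the implied covariance matrix stands in for the maximum likelihood estimates of the two free parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

def mlx2(S, Sigma, n):
    """Maximum likelihood chi-squared statistic, (n - 1) * F_ML."""
    p = S.shape[0]
    F = (np.log(np.linalg.det(Sigma)) + np.trace(S @ np.linalg.inv(Sigma))
         - np.log(np.linalg.det(S)) - p)
    return (n - 1) * F

# Hypothetical 5-variable recursive path model (stand-in for Figure 6.6):
# X3 = 0.5*X1 + 0.5*X2 + e3, X4 = 0.5*X3 + e4, X5 = 0.5*X3 + e5.
B = np.zeros((5, 5))
B[2, 0] = B[2, 1] = 0.5
B[3, 2] = B[4, 2] = 0.5
inv_ImB = np.linalg.inv(np.eye(5) - B)

def implied_sigma(psi_diag):
    return inv_ImB @ np.diag(psi_diag) @ inv_ImB.T

n, reps, values = 10, 1000, []
for _ in range(reps):
    E = rng.standard_normal((n, 5))       # exogenous variables and errors
    X = E @ inv_ImB.T                     # data generated by the true model
    S = np.cov(X, rowvar=False)
    # only the variances of X1 and X2 are free: plug in their sample values
    psi = np.array([S[0, 0], S[1, 1], 1.0, 1.0, 1.0])
    values.append(mlx2(S, implied_sigma(psi), n))

print(np.quantile(values, [0.5, 0.9, 0.95]))  # compare with chi-squared, 13 df
```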
In general, small sample sizes result in conservative probability estimates. In other words, the true probability level will be larger than the value obtained when assuming a χ² distribution. If your model produces a MLX² value that is judged to be significant using the χ² distribution, then you have an ambiguous result and you will have to use a different method of estimating the true probability level. For instance, Table 6.1 shows the empirical quantiles and theoretical probability levels using the model shown in Figure 6.6. For this particular model a sample size of only 30 provides a passable estimate of the tail probabilities, but with somewhat conservative probability estimates, and a sample size of 50 is quite acceptable. In general, the more free parameters in the model that need to be estimated, the larger the sample size required. More complicated models may require sample sizes of 200 or more. One rule of thumb is that there should be at least five times more observations than free parameters (Bentler 1995).

Figure 6.7. Empirical distributions of simulated data based on sample sizes of 10 (left) or 100 (right) observations per data set. The solid curve shows the theoretical chi-squared distribution and the arrow shows the 95% quantile corresponding to the 5% probability level.
What can be done if your sample size is too small to confidently assume that the sampling distribution of the MLX² statistic is close to the theoretical χ² distribution? If there are no latent variables in your model then you can use the method described in Chapter 3. If there are latent variables then you will need another way. One way is to use bootstrap methods; since this method is also useful in cases where the variables have other distributional problems, the bootstrap will be described later. Another way around the problem of small sample sizes (but not non-normal distributions in general) is to use Monte Carlo methods, as used in the simulations reported in Table 6.1.
The first step is to fit your model using any SEM program, obtain the MLX² statistic (call it X) and the degrees of freedom (df). Next, construct a model covariance matrix that has the same number of degrees of freedom as does your model. If your SEM program permits numerical simulations then just specify your original model with model parameters the same as those estimated by the program and whose random values are drawn from normal variates with the specified means and variances. Simulate a large number N (say 1000) of data sets, each with the sample size (n) of your original data, following this model. Next fit each simulated data set to your original model with the same pattern of free and fixed parameters and save the calculated MLX² values of each run. Finally count the number (x) of these simulated MLX² values that are greater than the value (X) obtained in your original data. The proportion x/N will estimate the probability value (p) that you are looking for. Since these simulated data sets are mutually independent and large in number you can obtain a 95% confidence interval (Manly 1997) around p by referring to a normal distribution whose mean is x and whose variance is Np(1 − p). Thus the 95% confidence interval is p ± 1.96·√(p(1 − p)/N).

Table 6.1. Empirical quantiles from 1000 independent data sets, with different numbers of observations (sample size), are shown along with the theoretical probability levels assuming a χ² distribution with 13 degrees of freedom (see Figure 6.6)

Sample size   50% quantile      90% quantile      95% quantile      97.5% quantile    99% quantile
10            16.18 (p=0.24)    26.80 (p=0.01)    30.58 (p=0.004)   33.70 (p=0.001)   38.65 (p=0.0002)
30            13.55 (p=0.41)    21.38 (p=0.07)    24.72 (p=0.025)   27.00 (p=0.012)   29.81 (p=0.005)
50            12.79 (p=0.46)    20.48 (p=0.08)    22.47 (p=0.05)    24.43 (p=0.027)   27.52 (p=0.011)
If your SEM program does not do Monte Carlo simulations, then you can still get an empirical probability estimate so long as you have access to a computer program that can generate standard normal random variates and do simple matrix operations (invert a matrix and calculate a determinant)⁵. Since the MLX² statistic requires that we calculate the determinant, and the inverse, of the model covariance matrix, it is useful to choose a matrix for which this can be easily done. The determinant of a square matrix whose non-zero values are all on the diagonal is simply the product of these diagonal values. Similarly, the inverse of such a diagonal matrix is simply a diagonal matrix whose diagonal values are the inverses of those of the original matrix. We therefore simulate data from a model consisting of v mutually independent variables, each of which is drawn from a standard normal distribution. The predicted covariance matrix, Σ, of such a model has non-zero values only on its diagonal. There are v(v + 1)/2 non-redundant elements. If we estimate the variance of c of the v variables, which will be on the diagonal of Σ, then there will be v(v + 1)/2 − c degrees of freedom. So, here are the steps needed to estimate an empirical probability level⁶ for a MLX² statistic of X:

1. Given the desired degrees of freedom (df), find the smallest integer value of v such that df ≤ v(v + 1)/2. This will be the smallest integer value of v such that v ≥ (−1 + √(1 + 8df))/2. For example, if df = 9 then we need the smallest integer value of v such that v ≥ (−1 + √(1 + 8(9)))/2 ≈ 3.8. Thus v = 4.
2. Find the integer value of c such that c = v(v + 1)/2 − df. So, if df = 9 and v = 4 then c = 1.
3. Construct a model covariance matrix Σ with v rows and columns. Estimate the variances of the first c variables and put these in the first c diagonal elements. Define all other diagonal elements (the remaining variances) to be 1 and all non-diagonal elements to be 0. This is the population covariance matrix of v mutually independent standard normal variates, of which the variances of the first c of these variables have been estimated from the data. This model covariance matrix will have df degrees of freedom.
4. Now, generate a large number N (say 1000) of independent data sets consisting of v mutually independent standard normal random variables with n observations in each data set.
5. For each of the i simulated data sets, calculate the sample covariance matrix S_i and also MLX_i² = (n − 1)(ln|Σ| + trace(S_i Σ⁻¹) − ln|S_i| − v), where the trace is the sum of the diagonal elements of the resulting matrix.
6. Count the number (x) of the N MLX² values that are greater than the MLX² value obtained in your real data (X).
7. The estimated empirical probability of your data will be x/N and the 95% confidence interval of this estimate can be calculated as described before.

⁵ Most commercial statistical programs can do this.
⁶ My Toolbox (Appendix) contains a program to do this.
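A minimal sketch of these seven steps, assuming Python with numpy (this is not the Toolbox program mentioned in the footnote), is given below.

```python
import numpy as np

def empirical_p(X2_obs, df, n, reps=1000, seed=0):
    """Monte Carlo estimate of the probability of observing X2_obs,
    following steps 1-7 above (diagonal model covariance matrix)."""
    rng = np.random.default_rng(seed)
    v = int(np.ceil((-1 + np.sqrt(1 + 8 * df)) / 2))   # step 1
    c = v * (v + 1) // 2 - df                          # step 2
    count = 0
    for _ in range(reps):
        Z = rng.standard_normal((n, v))                # step 4
        S = np.cov(Z, rowvar=False)
        sigma_diag = np.ones(v)                        # step 3
        sigma_diag[:c] = np.diag(S)[:c]
        F = (np.sum(np.log(sigma_diag))
             + np.trace(S @ np.diag(1.0 / sigma_diag))
             - np.log(np.linalg.det(S)) - v)
        if (n - 1) * F > X2_obs:                       # steps 5-6
            count += 1
    p = count / reps                                   # step 7
    half = 1.96 * np.sqrt(p * (1 - p) / reps)
    return p, (p - half, p + half)

print(empirical_p(X2_obs=20.0, df=9, n=30))
```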
6.5 Behaviour of the maximum likelihood chi-squared statistic with data that do not follow a multivariate normal distribution

A biologist, a physicist and a statistician are shipwrecked on a deserted island. Besides themselves, only a crate of canned food has been washed ashore. After staring hungrily at the cans for a number of hours, the biologist suggests that they break open some cans with a large rock. The physicist suggests instead that they climb to just the right height in a palm tree. She explains that the kinetic energy, as the can hits the ground, should be just enough to crack it open without losing any food. Glancing over to the statistician, who has just finished writing some equations in the sand, they see him shaking his head in disapproval at their crude methods. He announces that he has just found a more elegant method of opening the cans, and points proudly at his equations. 'Now,' he begins, pointing to the first equation, 'assume that we have a can opener . . .'.

Sometimes we don't have the statistical equivalent of a can opener. Knowing the assumptions of a statistical test is important but knowing what might happen if the assumptions are wrong can be just as important. Another assumption of the maximum likelihood chi-squared statistic is that the data follow a multivariate normal distribution⁷. We require methods of both testing and relaxing this assumption. First, let's look at how to test for departures from multivariate normality.

⁷ Actually, the assumption is that the endogenous variables follow a multivariate normal distribution. Exogenous variables (i.e. ones that are not caused by any others in the model) don't have this restriction (Bollen 1989).
Figure 6.8. Examples of curves showing positive or negative skew and kurtosis.
The normal distribution is fully characterised by its mean and variance. Departures from normality can be characterised by non-zero skew and kurtosis. The skew measures the degree of asymmetry of the distribution. A negative skew occurs when a univariate distribution has a longer tail to the left and its mode is to the right of centre. A positive skew occurs when a univariate distribution has a longer tail to the right and its mode is to the left of centre (Figure 6.8). An index of skew for a series of N observations of a random variable X is:

g_11 = Σ_{i=1}^{N} (X_i − X̄)³ / (N s³)

Box 6.2 summarises the steps for identifying significant deviations from normality with respect to kurtosis.
Box 6.2. Measures of skew and kurtosis

Univariate skew
The expected value of g_11 for a normal distribution is 0. The following statistic is approximately distributed as a standard normal variate for large (N > 149) sample sizes, and values greater than 1.96 in absolute value would indicate skew:

z = g_11 √( (N + 1)(N + 3) / (6(N − 2)) )

For tests applicable to small samples, the reader is directed to D'Agostino, Belanger and D'Agostino (1990) or Bollen (1989).

Univariate kurtosis
The corresponding index of kurtosis is:

g_21 = Σ_{i=1}^{N} (X_i − X̄)⁴ / (N s⁴)

The expected value of g_21 is 3 for normally distributed variables. For this reason, many computer programs often report the centred version of the g_21 statistic (g_21 − 3), even though this is not always well documented. The following statistic follows a standard normal distribution only in very large samples (N > 1500) but at least provides a rough guide. A more complicated statistic, applicable to small sample sizes (N > 19), is given by D'Agostino, Belanger and D'Agostino (1990) or Bollen (1989).

E(g_21) = 3(N − 1)/(N + 1),  Var(g_21) = 24N(N − 2)(N − 3) / ((N + 1)²(N + 3)(N + 5))  and  z = (g_21 − E(g_21)) / √Var(g_21)

Many SEM programs report either the standardised (z) or the asymptotic centred value of kurtosis (g_21 − 3) as a benchmark for normality. In using these tests on all variables in your model, you should use a Bonferroni correction to the significance levels. If you want to test at an overall level of α, then test each of the V variables at a level of α/V. For instance, if you want to test at a 95% level (α = 0.05), then test each variable at a level of 0.05/V.

Multivariate measures of skew and kurtosis
The above measures of skew and kurtosis are applied separately to each variable. Since it is possible for the joint distribution to have skew or kurtosis even though each individual variable shows no evidence of this, Mardia (1970, 1974) developed multivariate analogs of these statistics. They are based on a matrix of squared Mahalanobis distances. For a single variable (X_i) with N observations, the squared Mahalanobis distance of observation j is simply (X_ij − X̄_i)²/s_i². If each of the j observations consists of a series of V variables then the resulting data set, X, has N rows and V columns. The squared Mahalanobis distance matrix for the entire (mean-centred) data set is D = X S⁻¹ X′, where S is the covariance matrix of X. Looking at the Mahalanobis distance for each observation helps to identify outliers in the multivariate space. Based on the squared Mahalanobis distances d_ij (the elements of D), Mardia's multivariate measure of skew with V variables is:

g_1v = (1/N²) Σ_{i=1}^{N} Σ_{j=1}^{N} d_ij³

If the data follow a multivariate normal distribution then the expected value of this statistic is 0. The statistic N·g_1v/6 asymptotically follows a chi-squared distribution with V(V + 1)(V + 2)/6 degrees of freedom if the data are multivariate normal. Mardia's multivariate measure of kurtosis is:

g_2v = (1/N) Σ_{i=1}^{N} d_ii²

where the d_ii are the diagonal elements of D. If the data follow a multivariate normal distribution then the expected value of g_2v is V(V + 2) and the variance is 8V(V + 2)/N. The statistic

z = (g_2v − V(V + 2)) / √( 8V(V + 2)/N )

asymptotically follows a standard normal distribution. Bollen (1989) provides more complicated test statistics that are applicable to small data sets.
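A compact sketch of the Box 6.2 calculations, assuming Python with numpy, is given below. It implements only the large-sample z statistics; the small-sample corrections cited above are not included.

```python
import numpy as np

def univariate_skew_kurtosis(x):
    """g11, g21 and the large-sample z statistics from Box 6.2."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    d = x - x.mean()
    s = np.sqrt(np.mean(d ** 2))
    g11 = np.mean(d ** 3) / s ** 3
    g21 = np.mean(d ** 4) / s ** 4
    z_skew = g11 * np.sqrt((N + 1) * (N + 3) / (6 * (N - 2)))
    e21 = 3 * (N - 1) / (N + 1)
    v21 = 24 * N * (N - 2) * (N - 3) / ((N + 1) ** 2 * (N + 3) * (N + 5))
    z_kurt = (g21 - e21) / np.sqrt(v21)
    return g11, g21, z_skew, z_kurt

def mardia(X):
    """Mardia's multivariate skew and centred kurtosis with their tests."""
    X = np.asarray(X, dtype=float)
    N, V = X.shape
    Xc = X - X.mean(axis=0)
    S = np.cov(Xc, rowvar=False, bias=True)       # 1/N covariance matrix
    D = Xc @ np.linalg.inv(S) @ Xc.T              # squared Mahalanobis matrix
    g1v = np.sum(D ** 3) / N ** 2
    g2v = np.sum(np.diag(D) ** 2) / N
    chi2_skew = N * g1v / 6                       # df = V(V+1)(V+2)/6
    z_kurt = (g2v - V * (V + 2)) / np.sqrt(8 * V * (V + 2) / N)
    return g1v, chi2_skew, g2v - V * (V + 2), z_kurt

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
print(univariate_skew_kurtosis(X[:, 0]))
print(mardia(X))
```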
If any of the variables in your model have significant skew or kurtosis then the joint multivariate distribution also has significant skew or kurtosis. However, it is possible for the multivariate distribution to have either skew or kurtosis even though each variable, taken singly, is normally distributed. This means that we also require a multivariate version of our measures of skew and kurtosis. Such measures, along with their tests, are given by Mardia (1970, 1974). The calculations are explained in Box 6.2.

Most biologically oriented statistics texts describe the Box-Cox method of choosing a transformation to make data more closely follow a normal distribution. This is because statistical tests involving means (t-tests, ANOVA, etc.) are more sensitive to skew and the Box-Cox method helps to reduce skewness in data. However, tests involving variances and covariances, such as those used in SEM, are more sensitive to kurtosis than to skew (Mardia, Kent and Bibby 1979; Jobson 1992). Jobson (1992) described a modified power transformation that is designed to reduce kurtosis:

Y = SIGN·(|X − X_M| + 1)^λ   if λ ≠ 0
Y = SIGN·ln(|X − X_M| + 1)   if λ = 0

SIGN is the sign of the original value of (X − X_M) and X_M is the median value of X.

To find the value of λ that best reduces the kurtosis, I calculate the sum of squared differences between a series of quantiles of Y (say, 5%, 10%, 50%, 90% and 95%) for different values of λ and the quantiles of a normal distribution with the mean and standard deviation⁸ of Y. The value of λ that minimises this sum of squared differences will best reduce the kurtosis of the original values. Most statistics packages allow one to plot the empirical quantiles (cumulative percentage) against these theoretical quantiles. Using these graphs, you can try different values of λ and choose the one for which the resulting graph looks most like a straight line.
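A sketch of this transformation and of the quantile-matching choice of λ, assuming Python with numpy and scipy, is given below; the grid of λ values and the set of quantiles used are arbitrary choices.

```python
import numpy as np
from scipy import stats

def jobson_transform(x, lam):
    """Modified power transformation aimed at reducing kurtosis."""
    x = np.asarray(x, dtype=float)
    d = x - np.median(x)
    mag = np.abs(d) + 1.0
    return np.sign(d) * (np.log(mag) if lam == 0 else mag ** lam)

def quantile_sse(x, lam, probs=(0.05, 0.10, 0.50, 0.90, 0.95)):
    """Sum of squared differences between empirical quantiles of the
    transformed values and the quantiles of a normal distribution with
    the same mean and standard deviation."""
    y = jobson_transform(x, lam)
    emp = np.quantile(y, probs)
    theo = stats.norm.ppf(probs, loc=y.mean(), scale=y.std(ddof=1))
    return float(np.sum((emp - theo) ** 2))

rng = np.random.default_rng(2)
x = rng.standard_t(3, size=100)                 # heavy-tailed example data
lams = np.linspace(0.0, 1.0, 21)
best = min(lams, key=lambda lam: quantile_sse(x, lam))
print(best, stats.kurtosis(jobson_transform(x, best)))  # centred kurtosis
```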
Figure 6.9A shows a quantile plot for 100 values drawn from a t-distribution with 3 degrees of freedom. These values have a centred kurtosis of 4.12, the standardised value is 9.06 and the probability of this occurring in a normally distributed variable is 6.6 × 10⁻⁵. Notice the large deviations in the tails (the extreme values). Figure 6.9B shows the value of the sum of squared distances between the quantiles of these values and those of a standard normal distribution, as described above, for various values of λ between 0 and 1. The best value of λ is around 0, thus demanding a ln-transformation. Figure 6.9C shows the quantile plot for the values transformed using λ = 0. The centred kurtosis of this transformed variable is 0.10, the standardised value is 0.23 and the probability of observing such a kurtosis in a normally distributed variable is 0.95.

⁸ Alternatively, you can standardise your variable first and then refer to the quantiles of a standard normal distribution. These quantiles can be found in most tables of the standard normal distribution.

Figure 6.9. (A) A normal quantile plot for 100 values drawn from a t-distribution with 3 degrees of freedom. (B) The value of the sum of squared distances between the quantiles of these values and those of a standard normal distribution for various values of λ using Jobson's transformation. (C) The normal quantile plot after transforming the data using the best value of λ.
Non-normality can affect the accuracy of the maximum likelihood chi-squared statistic, the standard errors of the free parameters and the estimation of the free parameters themselves. As you might expect, researchers have spent a good deal of effort in exploring how different types and degrees of non-normality affect these statistics. To get an idea of the robustness of the maximum likelihood chi-squared statistic, I again generated data from the model shown in Figure 6.6. The path coefficients were fixed at 0.5. The exogenous variables (i.e. X_1, X_2 and the error variables ε_3, ε_4 and ε_5) were generated from different probability distributions with different degrees of skew and kurtosis: the standard normal distribution, a t-distribution with 3 degrees of freedom, a beta distribution with shape parameters of 2 and 5, a chi-squared distribution with 2 degrees of freedom and a uniform distribution between 0 and 1. The t-distribution has no skew but strong positive kurtosis. The uniform distribution has no skew but strong negative kurtosis. The chi-squared distribution with 2 degrees of freedom has both strong positive skew and kurtosis. The beta distribution with shape parameters of 2,5 has positive skew and negative kurtosis. For each distributional type, I simulated 1000 independent data sets of 50, 100 or 200 observations. Average values of Mardia's multivariate estimates of skew and centred kurtosis for these data were also estimated. The results are shown in Table 6.2.

The first thing to notice is that distributions with strong kurtosis produce conservative probability levels; there is a tendency for models to be rejected more often than they should be when one is assuming a theoretical χ² distribution. This is the same result as we saw for small sample sizes. The second thing to notice is that even models generated from very non-normal distributions produce quite acceptable probability levels as sample sizes increase. This shows the asymptotic robustness of the maximum likelihood chi-squared test statistic. If the errors are distributed independently of their non-descendants in the model, the test statistic should asymptotically follow a chi-squared distribution. The robustness conditions hold under independence and not necessarily under uncorrelatedness⁹; for example, if the variance of the error variable changes systematically with respect to any of its causal parents then this would undermine independence¹⁰. The robustness of the maximum likelihood chi-squared statistic depends on many different attributes of the model and the data: the number of free parameters, the distributional properties of each variable and (especially) non-independence of the error variables with respect to their non-descendants in the model¹¹.

⁹ I am sorry for using such an ugly word.

Table 6.2. Simulation results (1000 data sets each) based on sample sizes of N with exogenous variables drawn from a standard normal, a t-distribution with 3 degrees of freedom, a beta distribution with shape parameters of 2 and 5, a uniform distribution between 0 and 1, and a χ² distribution with 2 degrees of freedom. Shown are the 50, 90, 95 and 97.5% quantiles of the 1000 maximum likelihood statistics for each simulation as well as the theoretical probabilities assuming a χ² distribution with 10 degrees of freedom. The average of Mardia's multivariate centred estimate of kurtosis is also shown

N     Type        50% quantile    90% quantile    95% quantile    97.5% quantile   Kurtosis
50    Normal      10.00 (0.440)   16.58 (0.084)   18.73 (0.044)   21.17 (0.020)    −1.26
50    t (3 df)     9.72 (0.465)   17.87 (0.057)   20.89 (0.022)   24.81 (0.006)    17.83
50    Beta(2,5)    9.92 (0.448)   17.00 (0.074)   19.20 (0.038)   21.23 (0.020)    −1.51
50    χ² (2 df)    9.91 (0.448)   17.24 (0.069)   20.11 (0.028)   23.41 (0.009)    11.98
50    Uniform      9.48 (0.487)   16.59 (0.084)   19.92 (0.030)   22.04 (0.015)    −5.38
100   Normal       9.45 (0.490)   15.87 (0.103)   18.34 (0.050)   20.76 (0.023)    −0.70
100   t (3 df)     9.25 (0.509)   16.98 (0.075)   19.48 (0.035)   22.16 (0.014)    34.37
100   Beta(2,5)    9.68 (0.469)   16.43 (0.089)   19.10 (0.039)   20.79 (0.023)    −1.13
100   χ² (2 df)    9.30 (0.504)   17.09 (0.072)   19.48 (0.035)   22.09 (0.015)    17.86
100   Uniform      9.24 (0.509)   16.35 (0.090)   18.53 (0.047)   20.09 (0.028)    −5.68
200   Normal       9.38 (0.497)   16.19 (0.094)   18.50 (0.047)   20.67 (0.024)    −0.17
200   t (3 df)     9.25 (0.508)   16.45 (0.087)   19.61 (0.033)   22.74 (0.012)    64.24
200   Beta(2,5)    9.46 (0.489)   16.42 (0.088)   18.44 (0.048)   20.77 (0.023)    −0.76
200   χ² (2 df)    9.64 (0.472)   16.67 (0.082)   18.37 (0.049)   22.04 (0.015)    23.51
95% confidence intervals for the probabilities: 0.479 to 0.531, 0.081 to 0.119, 0.036 to 0.064, 0.015 to 0.035
6.6 Solutions for modelling non-normally distributed variables

Since non-normality can cause problems with the maximum likelihood chi-squared statistic, a number of alternative ways of fitting the model have been devised. Most commercial SEM programs will include statistics based on generalised least squares, elliptical estimators and distribution-free estimators, as well as a method of correcting for non-normality that produces robust chi-squared statistics and confidence intervals¹². The most popular, and best studied, correction method comes from Satorra and Bentler (1988).

¹⁰ This is similar to heteroscedastic error variances in ordinary regression.
¹¹ Except, of course, when the non-independence is explicitly modelled.
¹² Another correction method is found in Browne (1984). This consists of dividing the maximum likelihood chi-squared statistic by the ratio of Mardia's multivariate measure of kurtosis to its expected value given normality. Little simulation work seems to have been done on this correction.

There now exists an extensive literature that uses Monte Carlo methods to explore the relative merits of these different solutions for non-normality. Different studies have explored the effects of sample size, the number of free parameters, model type (measurement models, path models, full structural models) and distributional violations (kurtosis, skew and non-independence of errors and their causal non-descendants). Hoogland and Boomstra (1998) have done a meta-analysis of these studies. Their main recommendations are the following:

1. With respect to sample size, they recommend that there be at least five times as many observations as there are degrees of freedom in the model.
2. When the observed variables have an average positive kurtosis of 5 or more, the sample size may have to be increased by up to 10 times the degrees of freedom.
3. The generalised least squares chi-squared statistic has an acceptable performance for a sample size that is two times smaller than the sample size needed for an acceptable performance of the maximum likelihood chi-squared statistic.
4. With small samples the standard errors of the estimates of the free parameters are biased. Positive kurtosis results in estimates of the standard errors that are smaller than they should be. Negative kurtosis results in estimates of the standard errors that are larger than they should be.
5. The degree of skew has little effect on the bias of the estimators.
6. The asymptotic distribution-free estimator should not be used except for very large sample sizes (>1000).
7. The Satorra-Bentler robust estimator, upon which their robust (S-B) chi-squared statistic and standard errors are based, largely corrects for excessive kurtosis and for problems in which the errors are not independent of their causal non-descendants. This is particularly important for models that include latent variables and measurement models, since the S-B chi-squared statistic can correct for cases in which the latent variables and the measurement errors are not independent.
Basically, unless your data are very strongly kurtotic and your sample sizes are very low, you can still perform a reasonable test of your causal model. As a last resort, you can use bootstrap methods (Bollen and Stine 1993). The bootstrap has been included in some commercial SEM programs, and most will soon have this option. Note, however, that the original data must be transformed using a technique called a Cholesky factorisation¹³ and not all commercial SEM programs implement this step. The bootstrap is related to Monte Carlo methods except that, rather than sampling from some theoretical distribution (multivariate normal or otherwise), you sample from your own data to build up an empirical sampling distribution. See Manly (1997) for a discussion of bootstrap methods in biology. Box 6.3 summarises the steps required to generate a bootstrap distribution of the maximum likelihood chi-squared statistic. Note, however, that this method is very computer-intensive.

¹³ See Press et al. (1986) for a description of the Cholesky factorisation of a square positive definite matrix and numerical algorithms to calculate this.

Box 6.3. Bootstrapping the sampling distribution

Here are the steps to take in order to generate a bootstrap sampling distribution and perform an inferential test.

1. Given your original data set (Y) with N rows and p variables centred about their means, calculate the sample covariance matrix (S), obtain the predicted model covariance matrix (Σ) and the maximum likelihood chi-squared statistic, MLX².
2. Calculate the Cholesky factorisations of S and Σ to give S^(1/2) and Σ^(1/2).
3. Form a new data set: Z = Y S^(−1/2) Σ^(1/2).
4. Randomly choose N observations from Z with replacement to form a bootstrap sample Z*. Form the covariance matrix from this bootstrap sample, fit the model to these data, and save the bootstrap value of the maximum likelihood chi-squared statistic (MLX²*).
5. Repeat step 4 a large number of times (at least 1000).
6. Count the proportion of times that MLX²* is greater than MLX². This proportion is the empirical estimate of the probability of observing the data given the model. Note that this probability does not assume any particular sampling distribution.
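A sketch of Box 6.3 in Python (numpy only) is shown below. The function fit_mlx2 is a placeholder: it must refit your model to whatever data matrix it is given and return the resulting maximum likelihood chi-squared value, using whichever SEM software you have; everything else follows the steps above.

```python
import numpy as np

def bollen_stine_p(Y, Sigma, fit_mlx2, reps=1000, seed=0):
    """Bootstrap p-value for the maximum likelihood chi-squared statistic.
    fit_mlx2(data) is a user-supplied placeholder that refits the model and
    returns MLX^2; Sigma is the predicted model covariance matrix."""
    rng = np.random.default_rng(seed)
    Y = np.asarray(Y, dtype=float)
    Y = Y - Y.mean(axis=0)                     # centre about the means
    n = Y.shape[0]
    S = np.cov(Y, rowvar=False)
    X2_obs = fit_mlx2(Y)
    # steps 2-3: transform the data so that its covariance matrix equals Sigma
    L_s = np.linalg.cholesky(S)
    L_sig = np.linalg.cholesky(Sigma)
    Z = Y @ np.linalg.inv(L_s).T @ L_sig.T
    # steps 4-6: resample rows of Z with replacement and refit
    count = 0
    for _ in range(reps):
        idx = rng.integers(0, n, size=n)
        if fit_mlx2(Z[idx]) > X2_obs:
            count += 1
    return count / reps
```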
6.7 Alternative measures of approximate fit

This section deals with various methods of assessing the degree of approximate fit between data and a theoretical model. I don't like these methods and don't advise you to use them either, for reasons that I will explain below. However, they are popular with many users of SEM and are always printed out in commercial SEM programs. These measures of approximate fit are generally used once the model has already been rejected, and the purpose of these approximate fit measures is to determine the degree to which the rejected model is approximately correct.

The origin and rationale behind the use of these approximate fit indices comes from a consideration of statistical power. The power of a statistical test can be defined as the probability that the test will reject the null hypothesis when it is indeed false. To illustrate this notion, imagine that we wish to test the null hypothesis that two random variables, X and Y, are uncorrelated (H_0: ρ = 0). I generated 100 independent data sets each with 10, 50, 100 or 500 observations in which the true population correlation coefficient was either 0, 0.1, 0.2, 0.3, 0.4 or 0.5. We know that, if H_0 is true then we should reject about 1 out of 20 tests at the 0.05 level. If we had perfect statistical power then we should reject all data sets in which ρ is different from 0. In other words, we should reject a proportion α when the null hypothesis is correct and a proportion of 1 whenever ρ deviates, however slightly, from 0. Figure 6.10 shows the actual proportion of the 100 data sets for which the null hypothesis (ρ = 0) was rejected at α = 0.05.
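The simulation behind Figure 6.10 is easy to reproduce in outline; the sketch below (Python with numpy and scipy) estimates the rejection rate of the test of H_0: ρ = 0 for various true correlations and sample sizes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def rejection_rate(rho, n, reps=100, alpha=0.05):
    """Proportion of simulated data sets for which H0: rho = 0 is rejected."""
    cov = [[1.0, rho], [rho, 1.0]]
    hits = 0
    for _ in range(reps):
        x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
        if stats.pearsonr(x, y)[1] < alpha:
            hits += 1
    return hits / reps

for n in (10, 50, 100, 500):
    print(n, [rejection_rate(r, n) for r in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5)])
```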
In Figure 6.10 we see that when the sample size is very small (N = 10) then even when the null hypothesis is false (i.e. when the true correlation between X and Y is not zero) the null hypothesis won't be rejected a large proportion of the time; even when ρ = 0.5 only 33 out of 100 tests¹⁴ rejected the null hypothesis that ρ = 0. As the number of observations per data set increases then the number of times that the test correctly rejects the hypothesis that ρ = 0 also increases. The curves in Figure 6.10 are called power functions and the proportion of times that a test will reject H_0: ρ = 0 when, in fact, ρ ≠ 0, is called the power of the test. From Figure 6.10 we can see that if we have 50 observations then the test has at least a 90% chance of rejecting our null hypothesis (thus, a power of 0.9) when ρ is greater than about 0.5. If we have 100 observations then we have a power of 0.9 as soon as ρ is greater than about 0.3 and if we have 500 observations then we have a power of 0.9 as soon as ρ is greater than about 0.16. In other words, as the sample size increases we have a greater and greater chance of detecting a smaller and smaller difference between the hypothesised value and the true value. At very large sample sizes even minuscule differences (say ρ = 0.01) will almost surely be detected and the null hypothesis would be rejected almost always.

¹⁴ This is a simple example of why failing to reject a null hypothesis is not the same as showing that it is true.

Figure 6.10. The proportion of the 100 data sets of various sample sizes (N) for which the null hypothesis H_0 (ρ = 0) was rejected at α = 0.05 when the true population value of the correlation coefficient took various values between 0 and 0.5.
Usually, therefore, more power is a good thing. Tests of structural equations models, based on the chi-squared distribution, also have power properties. The justification for using alternative tests of fit is based on the premise that statistical power is not always such a good thing. If you remember the section in Chapter 2 dealing with the logic of inference in science then you will recall that no hypothesis is ever really tested in isolation. Every hypothesis contains within it many other auxiliary hypotheses. In the context of testing structural equations models with reference to a chi-squared distribution we are really interested in knowing whether the causal structure of the model is wrong. Unfortunately, when we conduct our statistical test we are testing all aspects of the model: the causal implications, the distributional properties of the variables, the linearity of the relationships and so on. Now, when we add the notion of statistical power to our argument we realise that, as sample size increases, we run a greater and greater risk of rejecting our models because of very minor deviations that might not even interest us. This point was raised early in the history of modern SEM by Jöreskog (1969).

What might these uninteresting and minor deviations be? They can't be minor deviations from multivariate normality, since the maximum likelihood chi-squared statistic is asymptotically robust against non-normality. In any case, we have already seen ways of dealing with this. Small amounts of non-linearity could be one such minor deviation that would not interest us. If some parameter values (for instance, path coefficients or error variances) are fixed to non-zero values in our model then small deviations from these fixed values might be another minor difference that would not interest us. For instance, we might have only a single indicator of some latent variable whose error variance we fix at 1.1, perhaps based on previous experience. If the true error variance of this indicator was 1.15 and we use a large enough sample size, then our model would be rejected. However, the principal minor deviation that is evoked in the justification for measures of approximate fit is a minor deviation in the causal structure of the model. The theoretical objective of the various indices of approximate fit is therefore somehow to quantify the degree of these deviations. The various alternative fit indices attempt to quantify the degree of such deviations by measuring the difference between the observed covariance matrix and the predicted (model) covariance matrix. The most popular fit indices do this in a way that standardises for differences in sample size.
At first blush then, these indices of approximate fit have a seductive quality. Wouldn't it be nice, after having found that one's preferred causal explanation (as translated by the structural equations model) has been rejected, to be able to say: 'but it is almost right! The remaining lack-of-fit is only due to minor errors that are not really very important anyway.' This, I suspect, is the real (psychological) objective of these fit indices. Even this weakness of the flesh could be tolerated if there were any justification for the implicit assumption that minor errors in specifying the causal structure will translate into only minor differences between the observed and predicted covariance matrices. Unfortunately, no such one-to-one relationship has ever been demonstrated for these indices of approximate fit. To me, evoking such an argument of approximate fit to justify accepting a causal model is like the old joke about the drunk in the parking lot¹⁵. The alternative fit indices measure different aspects of the ability of the observational model (the structural equations) to predict the data, not the explanatory ability of the causal model. As such, the indices of approximate fit commit the sort of subtle error of causal translation that I discussed in Chapter 3: small (but real) differences between the observed and predicted covariances of the observational model do not necessarily mean only small (but real) differences between the actual causal structure and the predicted causal structure.

Now that I have given my reasons why you should not use these alternative fit indices, you can read the justifications of those who promote them and decide for yourself (Bentler and Bonnett 1980; Browne and Cudeck 1993; Tanaka 1993). Below, I describe two of the more popular alternative fit indices. The book by Bollen and Long (1993) contains a number of chapters that deal with these alternative indices of approximate fit.

¹⁵ You enter a parking lot late at night and see a drunk causal modeller on his knees underneath the only street light. He explains that he is looking for his car keys. 'Are you sure that you lost your keys here?' you ask. 'No,' he answers. 'In fact, I have no idea where they are, but at least here I have enough light to see.'
6.8 Bentler's comparative fit index

Let's go back to the maximum likelihood chi-squared statistic for a moment. This statistic, and its inferential test, measure exact fit between the observed and predicted covariance matrices. The logic is that if the data are generated by the process specified by the structural equations (and therefore the causal structure of which these equations are a translation) then the observed and predicted covariance matrices will be identical except for random sampling variation. If this assumption is true then the maximum likelihood chi-squared statistic will asymptotically follow a chi-squared distribution with the appropriate degrees of freedom (ν). Actually, it is more precise to say that this statistic will asymptotically follow a central chi-squared distribution (χ²_ν) with the appropriate degrees of freedom. The central chi-squared distribution is a special case of a more general chi-squared distribution called the non-central chi-squared distribution. The non-central chi-squared distribution (χ²_ν,λ) has two parameters: the degrees of freedom (ν) and the non-centrality parameter (λ). A central chi-squared distribution is simply a non-central chi-squared distribution whose non-centrality parameter (λ) is zero.

Now, if the degree of mis-specification of the model covariance matrix is not zero (as assumed in the test for exact fit) but is small relative to the sampling variation in the observed covariance matrix, then the maximum likelihood chi-squared statistic actually asymptotically follows a non-central chi-squared (χ²_ν,λ) distribution and the non-centrality parameter (λ) measures the degree of mis-specification. The expected value of the non-central chi-squared distribution is simply the expected value of the central chi-squared distribution plus the non-centrality parameter: E[χ²_ν,λ] = E[χ²_ν] + λ. In practice, the non-centrality parameter is estimated as the value of the maximum likelihood chi-squared statistic (MLX²) minus the degrees of freedom of the model (i.e. the expected value that the maximum likelihood chi-squared statistic would have if there were no errors of mis-specification). Because the non-centrality parameter can't be less than zero, negative values are replaced with zero. Therefore λ̂ = max{(MLX² − ν), 0}.
The Bentler comparative fit index uses this fact to measure by how much the proposed model has reduced the non-centrality parameter (thus, the degree of mis-specification) relative to a baseline model. The most common baseline model is one that assumes that the variables are mutually independent. If λ̂_i is the estimate of the non-centrality parameter for the model of interest and λ̂_0 is the estimate of the non-centrality parameter for the baseline model, then the comparative fit index is defined as:

CFI = 1 − λ̂_i/λ̂_0

If the model of interest fit exactly then the expected value of its non-centrality parameter (λ_i) would be zero and the CFI value would be 1.0. Therefore, the CFI index varies from 0 (the proposed model fits no better than the baseline model) to 1.0. The sampling distribution of this index is unknown and users of this index consider a value of at least 0.95 as being an acceptable approximate fit. There is no theoretical justification for this value; it is simply a rule of thumb.
Actually, the description above is for the sample-based CFI. Although this is the index usually reported in most commercial SEM programs, it is known that the sample-based CFI is a biased estimator of the population-based CFI. The result of this bias is to exaggerate the degree of misfit. Steiger (1989) explained how to calculate the unbiased estimator of the population CFI from the information provided by most commercial programs: if p is the number of observed variables in the model, df is the degrees of freedom and n is the sample size, first get the model fit index from the program output, or calculate it as F̂ = (MLX² − df)/(n − 1); the corrected index is then p/(p + 2F̂).

6.9 The root mean square error of approximation (RMSEA)

A second popular measure of approximate fit is the root mean square error of approximation (RMSEA), which expresses the estimated amount of mis-specification per degree of freedom:

ε_a = √( max{(MLX² − ν), 0} / (nν) )

If you specify ε_a = 0 then you are doing a test of exact fit with reference to the central chi-squared distribution. Here are the steps:

1. Specify the null hypothesis H_0: ε_a = ε*_a.
2. Calculate the maximum likelihood chi-squared statistic (MLX²) and the non-centrality parameter λ* = nν(ε*_a)².
3. Find the probability of having observed MLX² given a non-central chi-squared distribution with parameters ν and λ*.
4. If the probability is less than your chosen significance level, reject the null hypothesis and conclude that ε_a is greater than that specified in the null hypothesis.
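The quantities discussed in these two sections can be computed directly from program output. The sketch below (Python with scipy) calculates the sample CFI, the RMSEA, the probability of the observed MLX² under the non-central chi-squared distribution implied by a hypothesised ε_a, and a 90% confidence interval for the RMSEA; the numerical values in the example call, including the baseline chi-squared and its degrees of freedom, are purely illustrative.

```python
import numpy as np
from scipy import stats
from scipy.optimize import brentq

def approximate_fit(X2, df, X2_base, df_base, n, eps0=0.05):
    """Sample CFI, RMSEA, a test of H0: eps_a = eps0, and a 90% CI for RMSEA."""
    lam_i = max(X2 - df, 0.0)            # non-centrality, model of interest
    lam_0 = max(X2_base - df_base, 0.0)  # non-centrality, baseline model
    cfi = 1.0 - lam_i / lam_0 if lam_0 > 0 else 1.0
    rmsea = np.sqrt(max(X2 - df, 0.0) / (n * df))
    # probability of the observed X2 under the non-central chi-squared
    # distribution implied by the hypothesised eps0
    p_close = stats.ncx2.sf(X2, df, n * df * eps0 ** 2)
    # 90% CI: non-centrality values whose 95% and 5% quantiles equal X2
    def bound(q):
        if stats.chi2.ppf(q, df) >= X2:
            return 0.0
        return brentq(lambda l: stats.ncx2.ppf(q, df, l) - X2,
                      1e-9, 10 * X2 + 100)
    lo, hi = bound(0.95), bound(0.05)
    ci = (np.sqrt(lo / (n * df)), np.sqrt(hi / (n * df)))
    return cfi, rmsea, p_close, ci

# Purely illustrative numbers (the baseline values are invented here).
print(approximate_fit(X2=52.4, df=46, X2_base=500.0, df_base=55, n=136))
```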
An obvious problem with this test is in choosing the null hypothesis. Remember that these indices of approximate fit are used when one has already rejected the null hypothesis of exact fit (i.e. ε_a = 0). We already know that there is something wrong with the model. Browne and Cudeck (1993) recommend the null hypothesis of ε_a ≤ 0.05, but this is only their rule of thumb. Again, there is no compelling reason for choosing this value as a reasonable level of approximate fit.

Quite apart from using the RMSEA to measure approximate fit, there is a very useful property of the inferential test for this fit statistic. If we have not been able to reject our model then it is still important to be able to estimate a confidence interval for the RMSEA. In such a case the confidence interval will have a lower bound of zero. The upper bound will reflect the statistical power of our test. A large upper bound indicates that the test had little statistical power to reject alternative models. A 90% confidence interval for the RMSEA would not reject the null hypothesis of exact fit at the 5% level. This interval can be calculated as the values of λ for a non-central chi-squared distribution whose 5% and 95% quantiles equal the calculated MLX² statistic (Browne and Cudeck 1993). These confidence intervals, and a whole plethora of approximate fit statistics, are provided by most SEM programs.
6.10 An SEM analysis of the Bumpus House Sparrow data

Natural selection was in the air during the last decade of the nineteenth century. According to Bumpus (1899) natural selection was literally in the air during a New England snow and ice storm one cold night. Many House Sparrows (Passer domesticus) were immobilised during that storm and 136 of the unfortunate birds were collected and transported to the Brown University Anatomical Laboratory. Seventy-two birds (51 males and 21 females) subsequently recovered but 64 birds (36 males and 28 females) died. Bumpus determined the sex of all 136 birds and also measured nine phenotypic attributes of each bird, alive or dead. He used these data to show the selective elimination of individuals in a population based on their characteristics.

These data have been subsequently analysed by many different people¹⁶. In particular, Lande and Arnold's (1983) influential paper on the statistical estimation of selection gradients used this particular data set as an example.

¹⁶ Pugesek and Tomer (1996) provide a brief history of these analyses.
Figure 6.11. Pugesek and Tomer's model of the classic Bumpus House Sparrow data. Panels (A), (B) and (C) show the models relating the seven body measurements (femur, tarsus, humerus, sternum, wing, head, skull) to the latent variables General size, Leg size, Head size, Fitness and Survival.
example. The method of Lande and Arnold was essentially an application
of multiple regression of a suite of correlated characters on a measure of
evolutionary fitness. The regression coefficients were interpreted as causal
measures of the selection gradient. In Chapter 2 we have seen the problems
that can occur when we use multiple regression in such a context.
Pugesek and Tomer (1996) reanalysed the Bumpus data using SEM.
Besides two binary variables representing sex (male/female) and survival
(alive/dead) they used seven other observed variables of various body meas-
urements, transformed to natural logarithms: length of femur, tarsus,
humerus, sternum, wing and head, and width of skull. They began with the
measurement model involving all birds, living or dead. The first model that they considered (Figure 6.11A) was that all seven length measures were due to a single latent variable. Since there were two sexes they actually used a two-group model with across-group constraints (see Chapter 7). This first measurement model did not fit well (MLX² = 43.59, 28 df, p = 0.03). Their second measurement model was a three-factor model (Figure 6.11B). Pugesek and Tomer interpreted these latents as a latent general size factor that is a common cause of all seven body measurements, a latent leg size factor that is an additional common cause only of the femur and tarsus lengths, and a latent head size factor¹⁷ that is an additional common cause only of the head length and skull width. As we have seen before, giving the latents these names doesn't necessarily mean that the names are accurate; it is also possible that some more mundane causes, systematic measurement errors for instance, are the source of these latent variables. Whatever the source of the latent variables, the model provided a good fit (MLX² = 28.82, 24 df, p = 0.227). A series of nested models (see Chapter 7) showed that there were no significant differences between the males and females in any of the free parameters. Fixing all these free parameters to be equal in the two sexes provided a final measurement model with an acceptable fit (MLX² = 49.29, 40 df, p = 0.149).
17. Since the two specific latent variables only had two observed variables each, identification of the model was obtained by constraining the two path coefficients of each latent to be equal. This is simply a mathematical trick and means that the actual values of the path coefficients cannot be interpreted.
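The p values quoted for these measurement models can be recovered directly from the central chi-squared distribution; a quick check (mine, assuming SciPy):

```python
from scipy.stats import chi2

# Upper-tail probabilities for the three measurement models described above
for mlx2, df in [(43.59, 28), (28.82, 24), (49.29, 40)]:
    print(f"MLX2 = {mlx2}, df = {df}, p = {chi2.sf(mlx2, df):.3f}")
# These should come out close to the reported p = 0.03, 0.227 and 0.149.
```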
The next step was to relate the measured and latent variables to survival. Pugesek and Tomer allowed the three latent size variables to be direct causes of a fourth latent variable that they call fitness and which then determines the death or survival of the individual bird. Since they fixed the path from the latent fitness to the observed survival at 1, and the residual error of survival at zero, the latent fitness variable is redundant since it will be perfectly correlated with survival¹⁸. This two-group model provided a marginal fit to the data (MLX² = 62.72, 48 df, p = 0.075), but the authors added an edge from wing length directly to the latent fitness that significantly improved the fit of the model (MLX² = 52.37, 46 df, p = 0.241). Finally, a nested sequence of models showed no significant differences in the path coefficients or error variances leading into, and out of, the latent fitness variable and so these were constrained to be equal in the males and females.
18. Unless there was some detail that was omitted from the paper, it was not necessary to fix the error variance of survival to zero. It is not clear either what the latent fitness means. Presumably it represents the propensity (or probability) of a bird to pass on a greater or lesser number of offspring to the next generation. If this is the case, then the causal interpretation of the model makes no sense. One's survival is not caused by such a propensity; rather, the propensity is partially caused by one's survival. Given the constraints that were put on the model, the latent fitness should be removed and replaced by the observed survival. If one also had data on reproductive output of these individuals, then survival and reproductive output could be modelled as causes of a latent fitness.
The final model is shown in Figure 6.11C. The parameter values of this final model can be found in Figure 8 of Pugesek and Tomer (1996). The path coefficients (based on standardised variables) allow one to determine by how much a change in one morphological variable will change the probability of survival of the bird. For instance, from their model one can calculate that an individual whose general size was one standard deviation larger than average increased its chances of survival by 0.564 standard deviations more than the average. On the basis of their model, it seems that larger birds were less likely to die during the storm than the smaller birds; birds whose legs and head were even larger than average, given their general size, were even less likely to die (although the path coefficients from these latent variables were not significant at the 5% level); but birds whose wings were shorter than average given their general size were also less likely to die.
7 Nested models and multilevel models
Like successful politicians, good statistical models must be able to lie without
getting caught. For instance, no series of observations from nature are really
normally distributed. The normal distribution is just a useful abstraction, a myth that makes life bearable. In constructing statistical models we
pretend that the normal distribution is real and then check to ensure that
our data do not deviate from it so much that the myth becomes a fairy tale.
In Chapter 6 we saw how far we could stretch the truth about the distrib-
utional properties of our data before our data called us a liar. The goal of
this chapter is to describe how SEM can deal with two other statistical myths
that people often relate with respect to their data.
Two important assumptions made by all of the models that we have studied up to now are that the observations in our data sets are (i) independent draws generated by (ii) the same causal process. Consider first the assumption of causal homogeneity. It is easy to imagine cases in which different groups of observations might be generated by partially different causal processes. For instance, a behavioural ecologist studying a series of variables related to aggression and social dominance in primates would not necessarily want to combine the observations from males and females, since it is possible that the behavioural responses of males and females are generated by different causal stimuli. When we sample from populations with different causal processes, either in terms of the causal structure or of the quantitative strengths between the variables, and we wish to compare the causal relationships across the different groups, we require a model that can explicitly take into account these differences between groups. Such a model is called multigroup SEM and this, in turn, requires the notion of nested models.
The assumption of independence of observations can often be violated as well. Natural selection itself suggests a way in which we can get non-independence of observations (Felsenstein 1985; Harvey and Pagel 1991). The attributes of organisms, if they have a genetic component, will tend to be more similar to those of close relatives than to genetic strangers. The process of speciation therefore generates a hierarchical structure to data
when we combine observations from different families, populations or species. If we ignore this hierarchical structure, and therefore ignore the non-independence of the observations, then we will obtain incorrect probability estimates. The application of multilevel SEM can deal with this complication.
7.1 Nested models
Given two SEM models with the same set of variables, one model is nested within a second one if (i) all of the fixed parameters in the first model are also fixed to the same values in the second, but (ii) some of the free parameters in the first are still fixed in the second. In other words, the fixed parameters in the first model are a subset of the fixed parameters in the second model. The notion of nesting can be grasped most easily by comparing some path diagrams. In Figure 7.1 model B is nested within A and model D is nested within C.
Model A has two fixed parameters. The path coefficients for the edges between X1 and X2 and between X1 and X3 have each been fixed to zero; therefore there is no edge between X1 and X2 or between X1 and X3. There are two fixed parameters in model A, all others being freely estimated¹. There is only one fixed parameter in model B (the path coefficient for the edge between X1 and X3 is still fixed to zero) and all others, including the path from X1 to X2, are freely estimated. So the fixed parameters of model B are a subset of those in model A and model B is nested within model A.
Model C also has two fixed parameters. The path coefficient for the edge between X1 and X3 is still fixed to zero and the path coefficient for the edge from X1 to X2 has been fixed to 0.5. Note that model C is not nested within model A; it is true that the path coefficients between X1 and X2 are both fixed but they are not fixed to the same value. Model D is, however, nested within model C. This is because every fixed parameter in model D (the path coefficient for the edge between X1 and X3) is also fixed in model C.
Nested models are useful because the difference in the maximum likelihood chi-squared values between nested models is, itself, asymptotically distributed as a chi-squared variate if the freely estimated parameters are equal to their associated fixed parameters. The degrees of freedom of this change in chi-squared are the number of parameters that have been freed in the nested model, which is the same as the change in the degrees of freedom between the nested models.
1. The free parameters representing the variances are not shown in Figure 7.1.
Figure 7.1. Four path models used to illustrate the concept of nesting.
Intuitively, the testing of a nested model uses the following logic. One starts with a model (call it model 1) in which a set of parameters are fixed to particular values (zero or otherwise). Now, we define a new nested model (call it model 2) by freeing some previously fixed parameters but without changing anything else relative to model 1. If we allow some of these previously fixed parameters to be freely estimated, but these newly freed parameters really do have the values to which they had previously been fixed, then the only difference in the estimated covariance matrices between models 1 and 2 will be due to random sampling variation. If this is true then the difference between the maximum likelihood chi-squared statistics will also follow a chi-squared distribution with degrees of freedom equal to the number of previously fixed parameters that have been freed in the nested model 2. Here are the steps (a small numerical sketch follows the list):
1. Fit the model at the top of the nested sequence, obtain its chi-squared value (MLX₁²) and its degrees of freedom (df₁).
2. Fit the model at the bottom of the nested sequence, obtain its chi-squared value (MLX₂²) and its degrees of freedom (df₂).
3. Calculate the change in the chi-squared value and the change in the degrees of freedom: ΔMLX² = MLX₁² − MLX₂² and Δdf = df₁ − df₂.
4. Determine the probability of having observed this change in the chi-squared value (ΔMLX²) assuming that the freed parameters in the second (nested) model are equal to those in the first model, except for random sampling variation.
5. If this probability is less than the chosen significance level, conclude that the freed parameters were not the same as those fixed in the first model.
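The numerical sketch promised above (mine, not the author's code; it assumes SciPy and borrows its numbers from the Meziane multigroup example discussed later in this chapter):

```python
from scipy.stats import chi2

def nested_model_test(mlx2_constrained, df_constrained, mlx2_freed, df_freed):
    """Chi-squared difference test between two nested models.

    The first model is the more constrained one (more fixed parameters);
    the second frees some of those parameters.
    """
    delta_chi2 = mlx2_constrained - mlx2_freed
    delta_df = df_constrained - df_freed
    return delta_chi2, delta_df, chi2.sf(delta_chi2, delta_df)

# Fully constrained four-group model (48.271, 31 df) against the model that
# frees the lamina thickness -> specific leaf area path across the groups
# (27.710, 28 df): the change is 20.561 with 3 df, p close to 0.0001.
print(nested_model_test(48.271, 31, 27.710, 28))
```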
Tests of nested models are used in a number of different research contexts. One reason might be if you want to test for the equality of a set of parameters to some theoretical values but don't care whether the model as a whole is acceptable. Two exploratory methods in SEM (the Wald and Lagrangian multiplier tests) are based on this logic. Perhaps the most useful application of nested models is in the context of multigroup models and multilevel models.
7.2 Multigroup models
Another assumption of the tests for structural equations models that have been described so far is that all of the observations come from the same statistical population. In other words, we are assuming that the same causal process has generated all of our observations even if we don't know what this causal process might be. Often we know (or suspect) that this is not the case. For instance, if we are studying attributes related to reproductive success then we might suspect that different causal processes are at work for males and females. Even if the causal structure is the same in males and females, it is possible that the two sexes differ in the numerical strength of the causal relationships. If we were to combine males and females into one data set then we would obtain incorrect parameter estimates and might incorrectly reject the model even though the qualitative structure of the model is correct. Perhaps our data come from three different geographical regions and we are not willing to assume that the same causal forces (with the same numerical strengths) apply to the observations in these different regions. Perhaps our data come from groups that we have subjected to different experimental treatments. All of these examples require that we explicitly include the group structure into our analysis. Such analyses are called multigroup SEM.
The first impulse (which is not always wrong) is to analyse the data in each group separately. The real strength of multigroup SEM is the ability to compare statistically between groups and determine which parts of the models in each group (i.e. which parameters) are the same and which parts differ. In this sense multigroup SEM is analogous to ANOVA except that, rather than testing for differences in the means between groups, we are testing for differences in the covariance structure between the groups. To do this we construct a series of nested multigroup models.
A multigroup model can be fit with a minor modification of the method that you already know. Since the standard structural equation model is simply a multigroup model with only one group, let's start there. With only one group we have only one observed covariance matrix (S₁). We then set up the model covariance matrix (Σ₁) using covariance algebra and iteratively find values of the free parameters of Σ₁ that minimise the maximum likelihood chi-squared statistic: (N₁ − 1)(ln|Σ₁(θ₁)| + trace(S₁Σ₁⁻¹(θ₁)) − ln|S₁| − p₁). This is the same formula that you saw in Chapter 4 (p is the number of variables in the model) except that I have added subscripts to emphasise that we are referring to group 1.
When our data are divided into g groups with N₁, N₂, . . ., N_g observations in the different groups then we have g sample covariance matrices (S₁, S₂, . . ., S_g) and also g population covariance matrices (Σ₁, Σ₂, . . ., Σ_g). Each population covariance matrix can potentially have different sets of free and fixed parameters or even different sets of variables. We iteratively choose values of all of these free parameters simultaneously to minimise:
[(N₁ − 1)(ln|Σ₁(θ₁)| + trace(S₁Σ₁⁻¹(θ₁)) − ln|S₁| − p₁)] + [(N₂ − 1)(ln|Σ₂(θ₂)| + trace(S₂Σ₂⁻¹(θ₂)) − ln|S₂| − p₂)] + . . . + [(N_g − 1)(ln|Σ_g(θ_g)| + trace(S_gΣ_g⁻¹(θ_g)) − ln|S_g| − p_g)]
Although this equation looks intimidating, it is simply the sum of the
maximum likelihood chi-squared statistics for each group.
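A compact numerical translation of this fitting function (my sketch, with assumed function names; the per-group model matrices Σ_g would come from whatever SEM parameterisation is being fitted):

```python
import numpy as np

def ml_chi2_one_group(S, Sigma, N):
    """(N - 1) * (ln|Sigma| + trace(S Sigma^-1) - ln|S| - p) for one group."""
    p = S.shape[0]
    _, logdet_Sigma = np.linalg.slogdet(Sigma)
    _, logdet_S = np.linalg.slogdet(S)
    F = logdet_Sigma + np.trace(S @ np.linalg.inv(Sigma)) - logdet_S - p
    return (N - 1) * F

def ml_chi2_multigroup(S_list, Sigma_list, N_list):
    """The multigroup statistic: the sum of the single-group statistics."""
    return sum(ml_chi2_one_group(S, Sig, N)
               for S, Sig, N in zip(S_list, Sigma_list, N_list))
```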
The value of this multigroup maximum likelihood chi-squared statistic at the minimum also asymptotically follows a chi-squared distribution with degrees of freedom equal to the sum of the degrees of freedom of the model in each group. Even if a particular parameter is free in each group, we can constrain the fitting procedure to choose the same value for all groups (this generates g − 1 extra degrees of freedom). In this way we are stating that, although we don't know what the numerical value of the free parameter is, it must be the same numerical value in all groups. Viewed in this way, we see that multigroup models define a continuum. If we propose the same causal structure and the same numerical values for all free parameters across the groups, then we get the same result as if we had centred each variable around its group mean and then put all of our data into one big group. If we allow all of the free parameters to differ between groups then we get the same result as if we had tested each group separately and summed the maximum likelihood chi-squared statistics and degrees of freedom. By constraining the estimation of different sets of free parameters across groups we can define a series of nested models. In this way, we can test for the equivalence of various free parameters in the different groups. If we do this more than once then we should adjust our significance level using a Bonferroni correction².
2. If we do this t times with a significance level of α then we should test the change in the chi-squared of each test at a significance level of α/t.
The following example comes from Meziane (1998). Although it is a path model without latent variables, the logic and approach are identical with full SEM models. The study consisted of 22 species of herbaceous plants grown under controlled conditions in four different environmental conditions: a high (N) and low (n) nutrient concentration in hydroponic culture crossed with a high (L) and low (l) light intensity. This gave four different groups of data corresponding to the four different environments: NL, Nl, nL, nl. Two leaves on each plant were harvested and a series of four morphological attributes were measured: the water content of the leaf, the thickness of the lamella, the thickness of the midvein and the specific leaf area (the ratio between the projected leaf area and its dry weight). The values of the two leaves per plant were averaged. Owing to a few missing values, there were a total of 80 independent observations in the final data set. A
previously published study (Shipley 1995) had described a path model relating these variables, and one objective of Meziane (1998) was to see whether the previous path model could be applied under different environmental conditions. If Meziane had simply combined the data from all four environments and tested his path model then he would have implicitly assumed that the different environments had no effect on the relationships between the four variables. By 'no effect' I mean both that the structure of the relationships (their presence or absence) and their numerical strengths do not change. If you remember that each variable is centred around its mean in the data set, then combining the data from all four environments would also implicitly require that the treatments did not affect the mean values of the variables either. By separating the data into the four groups, the variables are centred around their respective group means. In this way, the treatment effects on the means are removed and only the relationships between the variables are analysed.
In his multigroup analysis he specified four models, each with the same structure but each potentially differing in the numerical strengths of the free parameters. I have shown this in Figure 7.2, in which I have included the free parameters. In this model there are five free path coefficients and four free error variances in each of the four models. There are therefore from 9 (if all free parameters are constrained to be equal across groups) to potentially 36 different free parameters to estimate (if no free parameters are constrained to be equal across groups). Using the rule of thumb requiring five times more observations than free parameters, we see that any multigroup model with more than 16 free parameters will not be well approximated by a chi-squared distribution and will have true probability values that are somewhat higher than those obtained using a chi-squared distribution. The data, after transforming to natural logarithms, had reasonably low values of Mardia's multivariate index of multivariate kurtosis (4.37, 2.70, 3.34 and 1.40 for the NL, Nl, nL and nl groups, respectively).
tively).
The rst step was to t the data to the most constrained model;
namely, in which all nine free parameters are forced to be equal across the
four groups
3
. This fully constrained model gave a maximum likelihood chi-
squared statistic of 48.271 with 31 degrees of freedom (p0.02). Why 31
degrees of freedom? Each covariance matrix was composed of four vari-
ables, so there were 4(5)/210 non-redundant elements in each matrix.
There were four independent matrices for a total of 40 non-redundant
7. 2 MULTI GROUP MODELS
205
13
Note that, although the free parameters are constrained to be equal, the variables are still
centred about their group means, not the overall mean of all data taken together.
Therefore, dierences in the means between the groups are still removed.
Figure 7.2. Meziane's four-group path model relating four attributes of the leaves of herbaceous plants. Each group refers to plants grown in different environments of hydroponic nutrient solution and light intensity (see the text for symbols). Each of the four panels (NL, Nl, nL, nl) shows the same path diagram linking Water content, Lamina thickness, Midvein thickness and Specific leaf area, with its own free path coefficients (a) and error variances (e).
Since we fixed all free parameters to be equal across groups, we estimated only nine different free parameters. This gives a total of 40 − 9 = 31 degrees of freedom. Since we had 80 observations and 9 free parameters, and the data did not show strong kurtosis, we can be fairly confident that this fully constrained multigroup model has some errors in it. Since there were no obvious nonlinearities in these data and the distributional assumptions do not seem to cause any problems, the remaining problems reside either in the causal structure of the model or in the equality constraints that we have imposed on the data. If we remove all equality constraints across groups, then we will be testing only that the same qualitative structure applies to all four groups. The maximum likelihood chi-squared statistic, when all equality constraints are removed, is 3.224 with 4 degrees of freedom (p = 0.52). Even though we now have few observations per estimated free parameter (we have 36 free parameters now, so the ratio is only 2.2) the probability level gives us no good reason to reject this multigroup model with no between-group equality constraints for the parameter estimates⁴.
4. Remember that the effect of small sample sizes is to produce conservative probability estimates.
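The degrees-of-freedom bookkeeping and the two p values can be verified in a few lines (my sketch, assuming SciPy):

```python
from scipy.stats import chi2

p_vars, groups = 4, 4
nonredundant = groups * p_vars * (p_vars + 1) // 2    # 4 * 10 = 40 elements

df_constrained = nonredundant - 9             # 9 shared free parameters -> 31 df
df_unconstrained = nonredundant - 9 * groups  # 36 separate free parameters -> 4 df

print(df_constrained, chi2.sf(48.271, df_constrained))      # 31, p near 0.02
print(df_unconstrained, chi2.sf(3.224, df_unconstrained))   # 4, p near 0.52
```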
Since the lack of fit that was detected in the fully constrained multigroup model appears to be in the equality constraints between the four groups, we can use a series of nested models to detect which of the equality constraints is unreasonable. If we remove only one between-group equality constraint then this new model will be nested within the fully constrained model. There will be 3 degrees of freedom less in this new model, since we now have to independently estimate the value of this free parameter in all four groups, rather than simply estimating one value for all four groups. The difference in the two maximum likelihood chi-squared statistics, compared to a chi-squared distribution with 3 degrees of freedom (the difference in the degrees of freedom between the fully constrained model and the new model), will test for a difference in the value of this parameter between the four groups. We do this nine times, each time removing the equality constraint for a different free parameter. Since we have done this test nine times at a significance level of 5%, we adjust the overall significance level to 0.05/9 = 0.0056.
Table 7.1 summarises the results. The first column lists the free parameter whose between-group equality constraint has been removed. The second column lists the maximum likelihood chi-squared statistic for this new model (always with 28 degrees of freedom). The third column lists the change in the maximum likelihood chi-squared statistic relative to the model with all between-group equality constraints applied. The fourth column lists the (asymptotic) probability for the change in the maximum likelihood chi-squared statistic.
From Table 7.1 we see that only two of the path coefficients (those between the thickness of the lamina and midvein and the specific leaf area) differ between the four groups given our chosen significance level. Note that, although the path coefficient from leaf water content to specific leaf area had a probability level of 0.031, we had to adjust our individual significance levels to 0.05/9 = 0.0056 in order to maintain an overall significance level of 0.05. Our final multigroup model fixes all free parameters except for these two path coefficients to be equal across groups. This model has a maximum likelihood chi-squared statistic of 21.463 with 25 degrees of freedom, giving a probability level of 0.667. The confidence interval of the RMSEA for this model is (0, 0.074).
Table 7.1. The results of comparisons of a series of nested models based on the four-group model shown in Figure 7.2. The first row gives the results of the fully constrained model assuming all free parameters are the same in the four groups. The remaining rows show the result of relaxing one constraint at a time.

Free parameter whose between-group equality constraint was released | MLX² | Change in MLX² (ΔMLX²) | Probability of ΔMLX²
None | 48.271 | – | –
Variance of leaf water content | 42.141 | 6.130 | 0.105
Error variance of specific leaf area | 45.995 | 2.276 | 0.517
Error variance of lamina thickness | 46.387 | 1.884 | 0.597
Error variance of midvein thickness | 47.195 | 1.076 | 0.783
Path coefficient from leaf water content to specific leaf area | 39.411 | 8.860 | 0.031
Path coefficient from lamina thickness to specific leaf area | 27.710 | 20.561 | 0.0001
Path coefficient from midvein thickness to specific leaf area | 35.122 | 13.149 | 0.004
Path coefficient from leaf water content to leaf lamina thickness | 44.34 | 3.931 | 0.269
Path coefficient from leaf lamina thickness to midvein thickness | 47.646 | 0.625 | 0.891

Note: MLX², maximum likelihood chi-squared statistic.
Since the original purpose of the analysis of Meziane (1998) was to see whether the original path model that I had proposed (Shipley 1995) could be applied to plants growing in different resource environments, the conclusion is that the model appears to apply in its general structure, but that the numerical strengths of the two thickness measures on the specific leaf area change in the different environments. Of course, given the rather small number of observations and therefore low power, we must temper this conclusion, since small differences between groups might not have been detected.
In interpreting the results of a multigroup SEM it is important to remember that we are testing only for differences in the relationships between the variables within each group. Differences in the mean values of the variables between the groups are never detected because the variables are centred about their mean values within each group. In the example from Meziane (1998) the mean values of every one of the variables differed between the groups, on the basis of analyses of variance. In other words, the different levels of nutrients and light intensities did cause changes in the average values of the leaf attributes (from the ANOVA) and did change the numerical strengths of the effects of the two thickness measures on specific leaf area, but did not change the numerical strengths of the other relationships, the error variances or the causal structure (i.e. the topology) between the variables.
7.3 The dangers of hierarchically structured data
Let's now turn to the problem of analysing data when the observations are not independent. To illustrate the problems caused by partially dependent data, consider first the following naïve analysis⁵: I wish to test the hypothesis that people with blue eyes (such as myself) have shorter hair than do people with beautiful green eyes (such as my wife). To test this hypothesis I randomly choose 20 hairs from my head and 20 hairs from my wife's head, measure them, and then conduct a t-test on this set of 40 observations using (40 − 2) degrees of freedom. Of course, I find a highly significant difference in hair length between the two groups and, of course, the probability level associated with this test would be profoundly wrong. The problem is that both variables, eye colour and hair length, are nested within individuals, of which there are only two. A large proportion of the total variation in hair length and all of the total variation in eye colour resides at the level of individual people, not at the level of individual observations. Clearly, I do not have 40 independent observations of the two variables (eye colour, hair length).
5. I actually present this problem to students in a first-year undergraduate course in biometry. They instinctively recognise the nonsensical nature of the result, but are not able to explain why the result is flawed.
If two values of some variable X (say X₁ and X₂) are independent then knowing the value of X₁ tells us nothing about the likely value of X₂. Two independent values give us two pieces of information and n independent values give us n pieces of information. If, in some group, every individual had exactly the same values of X, then as soon as you knew X₁ you would also know the values of all other X in the group. No matter how many observations of X you took from such a group, you would have only one piece of non-redundant information.
Now, imagine that we create two groups. We randomly choose 20 values of X to form the first group and these values are independently and normally distributed with a mean of 1 and a standard deviation of 0.5. We randomly choose 20 values of X to form the second group and these values are also independently and normally distributed but with a mean of 2 and a standard deviation of 0.5. If the values within each group were exactly the same then we would still only have two pieces of information, but this is not the case. If knowing that an observation came from a particular group told us nothing about what values it might have, then we would have 40 pieces of information, but this is not the case either. So, we have more than 2 and fewer than 40 pieces of information⁶. This is the nature of hierarchically structured data, and we require a way of determining how many pieces of information different variables possess at different levels of the hierarchy. This is the goal of multilevel models, also called random coefficient models, variance component models or hierarchical linear models. Such models, in the generalised linear context, have a large literature⁷. Detailed discussions and statistical derivations can be found in a number of books (Bryk and Raudenbush 1992; Longford 1993; Goldstein 1995).
6. The ratio of the variation of a variable at a given level in a hierarchy to its total variation is given the unfortunate name of intraclass correlation (Muthén and Satorra 1995). If we let the estimate of the variance of a mean, when ignoring the hierarchical nature of data, be Var_SRS (simple random sampling), the correct variance estimate be Var_C, the number of observations per group be c and the intraclass correlation be ρ, then the relationship between them is Var_C = Var_SRS(1 + (c − 1)ρ). A similar formula exists for the variance of a linear regression slope with hierarchical structure.
7. These models were mostly developed for the field of educational research and commercial statistical programs are available. One such program is MLwiN (Goldstein et al. 1998).
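The variance-inflation formula quoted in footnote 6 gives a quick feel for how severe this redundancy can be; a tiny illustration (mine, with made-up numbers):

```python
def corrected_variance(var_srs, c, rho):
    """Var_C = Var_SRS * (1 + (c - 1) * rho): the corrected variance of a mean
    when there are c observations per group and intraclass correlation rho."""
    return var_srs * (1 + (c - 1) * rho)

# Forty observations in two groups of 20 with a large intraclass correlation:
# the naive variance estimate is almost twenty times too small.
print(corrected_variance(var_srs=1.0, c=20, rho=0.9))   # 18.1
```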
Hierarchies and partial dependence are the rule, rather than the exception, in nature. If this is so, then we need to incorporate this structure
into our models of nature. One way to do this is through the use of multilevel models. Before going on to the mechanics of fitting such models, or even of interpreting them, it is useful to have a simple concrete example of such a structure. A good example is the relationship between seed size and seedling relative growth rate. Relative growth rate (RGR) is the amount of new biomass produced by a plant over a unit time period, relative to the amount of biomass that the plant had as an initial capital at the beginning of the time period. If we plot the weight of different seeds of individuals within a single species against the RGR of the seedling that emerges, we often find a positive relationship between the two. For individuals within a given species, having larger seeds translates into more rapidly growing seedlings, with all of the attendant benefits. When we compare across species, however, we see that both average seed size and average potential RGR vary much more between species than within species. For instance, the seeds of some orchids are almost microscopic while it might require two hands to hold the seeds of a Coconut Palm. Under constant environmental conditions and resource levels, the variation of seedling RGR within a single species is usually less than 10% of the variation in the mean RGR values between different species. Curiously, the relationship between the average values of seed size and seedling RGR across species usually shows a negative relationship. Figure 7.3 plots some simulated data showing this pattern. There are 10 simulated species, each having a different plotting symbol. Figure 7.3A shows the relationship between the two variables for the first species, along with its regression line. Figure 7.3B shows the relationship between the two variables when the data for all 10 species are combined.
This example, although very simple, demonstrates many of the challenges of analysing data that are hierarchically ordered. It is clear that part of the variation in each variable, and the covariation between the two, is generated by differences between individuals within each species, as shown by Figure 7.3A. It is also clear that part of the variation and covariation involving these variables is generated by differences between the species means. We would like to model this covariation both within and between species, taking into account the fact that individuals within a given species tend to resemble each other more than they do individuals of different species. Finally, we would like to model how the different levels in the hierarchy interact and constrain each other.
Since there are two levels to this data hierarchy, let's call level 2 the level of species and call level 1 the individual level. If we had only one species, then we could write the regression equation as:
y_i = a + b·x_i + e_i
In this equation the subscript i refers to each individual, a is the value of the average individual when x = 0 (i.e. the intercept) and e_i is the amount by which the value of y for the ith individual deviates from its expected value given x_i. As usual, we assume that these deviations are random. Since we have 10 species in our data set, we could write:
y_i1 = a_01 + b·x_i1 + e_i1
y_i2 = a_02 + b·x_i2 + e_i2
. . .
y_i10 = a_0,10 + b·x_i10 + e_i10
In this simple example I have assumed a common slope for all species (thus b has no subscript) because that is how I generated these data. In general, multilevel models allow for random slopes as well as random intercepts. The assumption of a common slope across species in a structural equation model
Figure 7.3. Hypothetical relationships between the relative growth rate
(RGR) of 50 plants and their seed size. (A) This relationship for one
species. (B) The relationship when all 10 species are combined. Notice
that the relationship is positive within every species but the overall
trend shows a negative relationship because the relationship between
the species means is negative.
could be tested with a multigroup model, as described in the previous section, remembering that the slope is the path coefficient between x and y.
Here, we have a different regression equation for each species. We could, at this point, simply introduce a dummy variable and conduct an analysis of covariance. In the context of SEM, this would be the equivalent of doing a multigroup model. However, if we have chosen our 10 species at random, and want to extrapolate to a larger population of species, then our regression intercepts are, themselves, random variables and we might want to model how these species-level random variables change as well. In this case, the 10 intercept terms (a_01 to a_0,10) are random variables which we can model as: a_0j = a_00 + u_0j. Here, a_00 is the overall intercept term for the entire population of species and u_0j is the random deviation of the intercept for the jth species from this overall intercept term. Putting this all together we obtain:
y_ij = (a_00 + u_0j) + b·x_ij + e_ij
Here the i-subscript refers to the level-1 units (the individual plants) and the j-subscript refers to the level-2 units (the different species). Rearranging, we obtain:
y_ij = a_00 + b·x_ij + u_0j + e_ij
This equation expresses each y_ij as a function of a systematic component (a_00 + b·x_ij), a random component due to the differences in the mean values of y_ij between the species (u_0j) and a random component due to the differences of individuals within each different species (e_ij).
Up to now the model that we have developed is perhaps familiar to some readers, since it is simply a variance components model⁸ with a between-species variance component (the variance of the u_0j) and a within-species component (the variance of e_ij). However, since the intercepts (u_0j) are random variables, the relationships between these intercepts may be determined by other, species-level variables. For instance, in Figure 7.3 it is clear that as the mean value of seed size for a given species increases the mean value of RGR for that species decreases.
8. Actually, the classical variance component model for RGR would not include the seed size variable. Such a model would be of the form RGR_ij = a_00 + u_0j + e_ij, the variance of u_0j and of e_ij being the two variance components and the total variance of RGR_ij being the sum of these two variances. Using the simulated data for RGR in Figure 7.3 these values are: a_00 = 0.162 (0.020), Var(u_0j) = 0.0039 (0.0018) and Var(e_ij) = 0.0002 (0.00004). One usually expresses these variance components as percentages of the total variation. In this example, the species-level variance is 95% (i.e. 0.0039/(0.0039 + 0.0002)).
Secondary succession in plant communities starts with some major
disturbance event, for instance a field that has been cultivated and then abandoned. As different species reinvade the site the relative abundance of each species changes. Because of this one often finds that particular species tend to be most abundant in abandoned fields of a specific age. Immediately following a major disturbance event one typically finds annual species that have rapid relative growth rates and that produce a large quantity of small seeds. As secondary succession proceeds dominance shifts to species with larger seeds and slower relative growth rates (Grime 1979). This is a selection process in which different frequencies and intensities of density-independent mortality (the result of disturbance events) select for different suites of plant attributes. As such, selection pressures represented by such variables as average time since the last major disturbance event affect species-level properties by determining the mean and variation of individual-level attributes.
Actually, I generated the data shown in Figure 7.3 by simulating the scenario described above. I defined a species-level variable, the frequency of major disturbance events, that quantifies how often a habitat experiences a major event of density-independent mortality. The causal effect of this variable is to select for individuals with both larger seeds and lower RGRs as the frequency of disturbance events decreases and as successional age of the vegetation increases. In other words, although RGR increases with increasing seed size within each species, the average seed size and the average relative growth rates of a given species are both determined by the common cause, disturbance frequency. This is shown in Figure 7.4.
As Figure 7.4 makes clear, the relationship between individual seed size and individual relative growth rate consists of causes that operate at two different hierarchical levels. At the level of individual plants there is a positive direct effect (seed size → relative growth rate). At an interspecific level there is a negative indirect effect between the two variables that is generated by the common cause of selection for habitats experiencing different disturbance frequencies. Whether the overall relationship between seed size and relative growth rate is positive, negative or ambiguous depends on the relative strengths of these different paths. In the data that I have simulated, the species-level effect dominated, which is why the overall trend in Figure 7.3 is a downward sloping cloud of points. How can we incorporate these hierarchical effects into our models? We could ignore the individual level and simply work with the species means. If we did this then we would not only lose a great deal of information (by reducing our data set from 50 observations to 10) but we would also ignore the fact that the relationship at the individual level is quite different from the relationship at the species level. We could ignore the fact that disturbance frequency is a variable that
is only relevant to the species-level process and simply conduct a standard multiple regression of RGR on both seed size and average disturbance frequency. This is like correlating eye colour and hair length: the average disturbance frequency is the same for all individuals within a given species and so we would be inflating the number of pieces of information that we really possess. Instead, we should take an explicit multilevel approach.
First, we model the individual level relationship between RGR and
seed size:
RGR_ij = β_0j + β_1·seedsize_ij + e_ij
Here, we are specifying that RGR varies at both the individual (i-level) and the species (j-level). The intercept (β_0j) varies only at the level of species. The slope of the relationship between RGR and seed size (β_1) is constant across all species. Finally, the residual variation in RGR within each species (e_ij) is assumed to be normally distributed with a constant variance⁹.
9. Both the assumption of normality and the assumption of constant variance can be relaxed in multilevel modelling.
Figure 7.4. The causal process determining the relationship between
seed size and relative growth rate operated at two levels. At the level of
individuals, seed size causes relative growth rate. However, the average
seed size and relative growth rate of each species is caused by the
typical disturbance frequency of the habitat occupied by each species.
Since the intercepts of RGR are, themselves, random variables that change from species to species, we next model this species-level variation in these intercepts:
β_0j = γ_00 + γ_01·disturbance_j + u_0j
Now we are specifying that the species-level intercepts are functions of the average disturbance frequency, which is a species-level variable. There is a constant intercept term (γ_00) which represents the average seed size across all species in the statistical population. The species-level slope between average disturbance frequency and the intercepts of each species is γ_01. Finally, the deviation of each species' intercept from that predicted by average disturbance frequency is the random variable u_0j. Putting this all together we get:
RGR_ij = γ_00 + β_1·seedsize_ij + γ_01·disturbance_j + u_0j + e_ij
The standard error of the slope of seed size is based on the error variance at the level of individuals within species (e_ij), which has been corrected for the species-level variation. The slope of the average disturbance frequency is based on the error variance at the level of species (u_0j), which has been corrected for the error variance at the individual level.
We next specify a multilevel model for seed size. According to Figure 7.4, seed size is caused only by average disturbance frequency and this effect occurs only at the species level. Our multilevel model is therefore:
seedsize_ij = γ_00 + γ_01·disturbance_j + u_0j + e_ij
If I fit these models using the MLwiN software (Goldstein et al. 1998), I obtain the following results:
Seed size = 443.76 − 16.32 × disturbance frequency
RGR = 0.284 + 0.0006 × seed size + 0.0217 × disturbance frequency
The residual variation of the mean seed sizes per species was 55.28 (13%) and the residual variation of individual seed sizes was 361.76 (77%). The residual variation of the mean RGR values per species was 4.96 × 10⁻⁵ (64%) and the residual variation of the individual RGR values was 2.79 × 10⁻⁵ (46%). The big difference between this multilevel model and an
ordinary regression model is best seen in the standard errors of the parameters of the equation for RGR. If we were to do an ordinary regression then we get an estimate of the standard error of the slope for average disturbance frequency of 1 × 10⁻⁴ while the standard error estimated from the multilevel model was 8.3 × 10⁻⁴. In other words, by ignoring the hierarchical nature of RGR the ordinary regression overestimated the precision of the slope by eight times. Since significance tests of the effects of a variable in a regression are based on these standard errors, the effect of ignoring the partially dependent nature of observations is to produce probability estimates that are much smaller than they should be.
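The same kind of random-intercept model can be sketched with general-purpose software. The following is my own illustration, not the MLwiN analysis reported above; the file and column names are hypothetical.

```python
import pandas as pd
import statsmodels.formula.api as smf

# One row per individual plant, with columns 'RGR', 'seedsize',
# 'disturbance' (a species-level variable) and 'species'.
df = pd.read_csv("rgr_seedsize.csv")   # hypothetical file

# RGR_ij = g00 + b1*seedsize_ij + g01*disturbance_j + u_0j + e_ij,
# with a random intercept u_0j for each species.
model = smf.mixedlm("RGR ~ seedsize + disturbance", data=df, groups=df["species"])
result = model.fit()
print(result.summary())   # fixed-effect slopes plus the two variance components
```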
Hierarchically ordered data not only cause problems for parameter estimation and inferential tests of significance. The patterns that can be generated with multilevel data can often be downright counterintuitive. To give you a feeling for such patterns, I have generated data from a very simple two-level model, shown in Figure 7.5. In this scenario there are two attributes (X and Y) of each individual i (1 to 11) from each species j (1 to 10). Each species has a mean value of each variable (μ_Xj and μ_Yj) and each individual has a value of each variable that varies around its species mean (X_ij = μ_Xj + δX_ij and Y_ij = μ_Yj + 1·X_ij + δY_ij). Variable δX_ij takes values from −0.25 to 0.25. Since we are interested in comparing the within-species and between-species patterns, I will ignore the random variation of Y at the individual level (δY_ij) and concentrate on the expected value of Y (E[Y|X] = μ_Yj + 1·X_ij). Substituting for X_ij we get:
E[Y|X] = μ_Yj + 1·μ_Xj + 1·δX_ij
This equation represents the intraspecific (i.e. within-group) level. The interspecific (i.e. between-group) level will be generated in different ways¹⁰ and the combined data are then shown in Figure 7.6. We imagine that the mean values of variables X and Y (μ_X and μ_Y) differ randomly between the 10 species and that the interspecific relationship between these interspecific means follows the following equation: μ_Y = a·μ_X.
First, let's see what happens when there is no species-level variation or covariation. To simulate this, I set a in Figure 7.5 equal to zero and make the values of μ_X and μ_Y the same (zero) for all 10 species (thus, the variance of these two variables is zero). If we put these values into our generating equation we obtain:
E[Y|X] = (0) + 1·(0) + 1·δX_ij
Figure 7.6A shows the pattern that results. There are actually 10 lines in this figure but they are superimposed on each other, since we have exactly the same values for all 10 species (remember that I am plotting the expected values). The solid circles show the mean values of X and Y for each species.
10. Ignore for now the fact that the path coefficient from each μ to the observed variable is fixed at the square root of the number of individuals per species. This will be explained later.
Figure 7.5. A very simple two-level model. In this scenario there are two attributes (X and Y) of each individual from each species. Each species has a mean value of each variable (μ_Xj and μ_Yj) and each individual has a value of each variable that varies around its species mean (X_ij = μ_Xj + δX_ij and Y_ij = μ_Yj + 1·X_ij + δY_ij). Variable δX_ij takes values from −0.25 to 0.25.

Figure 7.6. Simulated data based on Figure 7.5 under six different scenarios involving the population mean values (μ_X, μ_Y), the species-level slope linking these population values, and the individual-level slope. The lines show the systematic relationships within each species and the solid circles show measured mean values of X and Y for each species.
Since all 10 species have the same means, these 10 circles are also super-
imposed.
In Figure 7.6B I simulate what happens if each species has a different mean value for X but has the same mean value for Y; that is, I allow μ_X to vary randomly but not μ_Y. All 10 lines, one for each species, appear to line up along the same trend as that observed at the intraspecific level. Notice that species whose mean for X (μ_Xj) is less than that of other species have their individual values of both X and Y (their line in the graph) in the lower left. Similarly, those species whose mean for X (μ_Xj) is greater than that of other species have their individual values of both X and Y (their line in the graph) in the upper right. In other words, there is a positive correlation between the mean values of X and Y of these 10 species even though there is no real relationship between the species means μ_Xj and μ_Yj. To see why, we have only to write the generating equation for this simulation:
E[Y|X] = (0) + 1·μ_Xj + 1·δX_ij
Whenever the mean value of X for a given species (μ_Xj) happens, by chance, to be less than average, this decreases the values of Y that individuals of this species will possess. Similarly, whenever the mean value of X for a given species (μ_Xj) happens, by chance, to be greater than average, this increases the values of Y that individuals of this species will possess. The interspecific correlation that we observe is simply an artefact of mixing together the two levels of variation.
In Figure 7.6C I simulate what happens if each species has a different mean value of Y but the same mean value of X; that is, I allow μ_Y to vary randomly but not μ_X. Returning to our generating equation and substituting, we get:
E[Y|X] = μ_Yj + 1·(0) + 1·δX_ij
The result is a series of lines stacked on top of each other. The overall correlation between X and Y is severely diluted.
In Figure 7.6D I simulate what happens if each species has a different mean value of both X and Y but there is still no true interspecific relationship between these mean values; that is, I allow both μ_X and μ_Y to vary randomly and independently. The result is intermediate between graphs B and C.
In Figure 7.6E I simulate what happens when each species has a different mean value of X and Y and there is also a positive interspecific relationship between these mean values. To do this, I allow μ_X and μ_Y to vary randomly but link them: μ_Yj = 1·μ_Xj + ε_j. Substituting this into the generating equation, I get:
E[Y|X] = μ_Yj + 1·μ_Xj + 1·δX_ij = 2·μ_Xj + ε_j + 1·δX_ij
The result is that the slope between the means appears twice as large as it really is.
Finally, in Figure 7.6F I simulate what happens when each species has a different mean value of X and Y and there is also a negative interspecific relationship between these mean values. In other words, the interspecific relationship is the opposite of the intraspecific relationship. To do this, I allow μ_X and μ_Y to vary randomly but link them: μ_Yj = −1·μ_Xj + ε_j. Substituting this into the generating equation, I get:
E[Y|X] = μ_Yj + 1·μ_Xj + 1·δX_ij = 0·μ_Xj + ε_j + 1·δX_ij
The result is that the correlation between the means disappears even though there are really strong (but opposite) relationships between the variables at both hierarchical levels. The moral of this simple set of simulations is that combining data that have relationships at different levels and analysing them as if the hierarchical structure did not exist can lead to incorrect conclusions.
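Scenario F is easy to reproduce. The following short simulation (mine, not the author's code) generates data with a within-species slope of +1 and a species-level link of −1, then recovers the two slopes:

```python
import numpy as np

rng = np.random.default_rng(1)
n_species, n_ind = 10, 11

mu_x = rng.normal(0.0, 1.0, n_species)               # species-level means of X
mu_y = -1.0 * mu_x                                    # negative interspecific link
dx = rng.uniform(-0.25, 0.25, (n_species, n_ind))     # individual deviations

X = mu_x[:, None] + dx
Y = mu_y[:, None] + 1.0 * X                            # within-species slope of +1

# Pooled within-species slope (after centring by species means)
Xw = X - X.mean(axis=1, keepdims=True)
Yw = Y - Y.mean(axis=1, keepdims=True)
within_slope = (Xw * Yw).sum() / (Xw ** 2).sum()

# Slope relating the observed species means of Y to those of X
between_slope = np.polyfit(X.mean(axis=1), Y.mean(axis=1), 1)[0]

print(within_slope, between_slope)   # close to +1 and close to 0, respectively
```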
There is much more to be said about multilevel regression than has been said so far. What I have described is not so much an introduction as an appetiser, and for more information the interested reader should consult the references given earlier in this chapter. Now that we recognise the problem that hierarchically organised data can cause, and have an intuitive understanding of how multilevel regression deals with it, let's see how these notions can be incorporated into SEM¹¹.
11. Although I do not discuss multilevel tests of path models based on the method presented in Chapter 3, the reader should be aware that multilevel regression methods can be used to obtain probability estimates of conditional independence by fitting a series of regressions in accordance with the hypothesised causal graph. These probability estimates can then be combined using Fisher's C statistic.
7.4 Multilevel SEM
Suppose that we have a variable that has been measured on N observations. These observations are organised into G groups with N₁, N₂, . . ., N_G observations in each group; for the moment we will assume that there are the same number (C) of observations in each group. I write Y_ij to mean the ith observation in group j and I write Ȳ_j to mean the mean value in the jth group. The deviation of this value from the overall mean is (Y_ij − Ȳ). We can decompose this deviation as follows: (Y_ij − Ȳ) = (Ȳ_j − Ȳ) + (Y_ij − Ȳ_j). This leads to the one-way ANOVA table that is fondly remembered by everyone who has taken an introductory course in statistics (Table 7.2).
The above decomposition has the useful property that the between-group deviations have zero correlation with the within-group deviations. Remembering that a variance is simply the covariance of a variable with itself, we can do the same trick with covariances. In this way, we can define both a pooled within-group covariance matrix (S_PW) and a between-group covariance matrix (S_B) for our data. The pooled within-group covariance matrix is constructed by first centring each variable by its group mean, calculating the sum of squares and cross-products of these group-centred variables, and then dividing by N − G. One easy way to obtain this matrix from any statistical program is simply to calculate the covariance matrix of the centred variables (which has a denominator of N − 1), and multiply by (N − 1)/(N − G). The between-group covariance matrix is constructed by calculating the sum of squares and cross-products of the group means and dividing by G − 1. An easy way to obtain this matrix from any statistical program is simply to calculate the covariance matrix of the group means, which has a denominator of G − 1.
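This recipe translates directly into code; a sketch (mine, assuming pandas, with illustrative function and column names):

```python
import pandas as pd

def within_between_covariances(df, group_col, var_cols):
    """Pooled within-group (S_PW) and between-group (S_B) covariance matrices."""
    N, G = len(df), df[group_col].nunique()
    # S_PW: centre each variable by its group mean, then rescale the usual
    # covariance (denominator N - 1) to a denominator of N - G.
    centred = df[var_cols] - df.groupby(group_col)[var_cols].transform("mean")
    S_PW = centred.cov() * (N - 1) / (N - G)
    # S_B: covariance matrix of the group means (denominator G - 1).
    S_B = df.groupby(group_col)[var_cols].mean().cov()
    return S_PW, S_B
```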
The sample pooled within-group covariance matrix (S_PW) is an unbiased estimate of the population within-group covariance matrix (Σ_PW). Unfortunately, the sample between-group covariance matrix (S_B) is not a simple estimator of the population between-group covariance matrix (Σ_B); instead S_B estimates Σ_PW + C·Σ_B. If you look carefully at this equation then you will notice that it looks suspiciously like the multigroup structural equations formulation that we studied earlier in this chapter, with two groups.
Table 7.2. Decomposition of the variance in a one-way analysis of variance

Source of variance   Sum of squares                                  Degrees of freedom   Variance
Total                Σ_{j=1..G} Σ_{i=1..N_j} (Y_ij − Ȳ)²             N − 1                SS_Total/(N − 1)
Between groups       C Σ_{j=1..G} (Ȳ_.j − Ȳ)²                        G − 1                SS_Between/(G − 1)
Within groups        Σ_{j=1..G} Σ_{i=1..N_j} (Y_ij − Ȳ_.j)²          N − G                SS_Within/(N − G)
We can therefore trick commercial SEM programs into fitting a multilevel SEM by treating it like a multigroup SEM with particular cross-group constraints.
To set up the analysis we tell the program that it is actually conducting a multigroup analysis with two groups. The first group represents the group-centred data, for which there is the pooled sample covariance matrix obtained from the group-centred variables (S_PW) based on N − G observations. This is the level 1 covariance matrix. We specify our within-group causal structure for this group. Next, we define a second group, for which we have the sample between-group matrix (S_B) based on G observations, where G is the number of groups in the multilevel model. For this second group we specify both the within-group causal structure and the between-group causal structure. These two causal structures are linked by latent variables that represent the true values of the group means in the statistical population; remember that our calculated group means are only estimates of these underlying parameters. Since the variances and covariances of the group means are multiplied by the constant C (the number of individuals within each group), we fix the path coefficients leading from these latent variables to the individual-level variables at √C¹². Finally, we must constrain all of the free parameters in the first group (i.e. the model at the level of the individual) to be equal to the equivalent parameters in the second group (i.e. those parameters in the second group dealing with the model for the individual level). When we fit this model to our data the estimation procedure will correct for the partial non-independence of our data due to its hierarchical nature.
When we have different numbers of observations per group, we can calculate an approximate scaling factor due to Muthén:

C = (N² − Σ_{i=1..G} N_i²) / (N(G − 1))

Estimates based on this scaling factor have been shown to be fairly accurate so long as the group sizes are not extremely different (Hox 1993; McDonald 1994; Muthén 1994b). Of course, when group sizes are equal this reduces to the common group size. The parameter estimates, standard errors and maximum likelihood chi-squared statistics are still asymptotic values, but now the requirement for sufficient sample sizes applies both at the level of
¹² If we have an equation Y = aX + e then the variances are Var(Y) = a²Var(X) + Var(e). By setting the path coefficient from the latent variable representing the group mean to the individual-level variable at √C (i.e. L = √C·X + e), then we obtain Var(L) = C·Var(X) + Var(e).
individuals and at the level of groups. Note that, at the level of groups, we
are considering random samples of means. This means that the central limit
theorem applies and the distribution of means will be closer to multivariate
normal than is the distribution of the actual values.
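The scaling factor itself is a one-line computation. A small sketch (my own code; the example group sizes are invented):

def muthen_scaling_factor(group_sizes):
    # C = (N**2 - sum of squared group sizes) / (N * (G - 1))
    N = sum(group_sizes)
    G = len(group_sizes)
    return (N**2 - sum(n**2 for n in group_sizes)) / (N * (G - 1))

print(muthen_scaling_factor([5] * 40))      # equal groups of 5: returns exactly 5.0
print(muthen_scaling_factor([3, 5, 5, 7]))  # unequal groups: a value close to, but below, 5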
In order to better understand how to interpret such multilevel structural equations models, I will analyse simulated data generated by different models. First, let's see what happens in the simplest case when all of the observations are really independent observations generated by the same causal process; in other words, there is really no group-level structure and all variances and covariances exist at the individual level. To do this, I generate 200 independent observations from the following equations:
X = N(0,1)
Y = 0.5X + N(0,0.75)
Z = 0.5Y + N(0,0.75).
Now, I randomly divide these 200 independent observations into 40 groups of 5 observations each. Since this assignment is completely random, the only variation between the group means is due to sampling variation and the systematic variation in the group means is zero. Figure 7.7 shows the multilevel model. The variables M_1, M_2 and M_3 are latent variables representing the population means of each variable, centred around the overall mean of each variable. Since there are 5 observations per group the scaling constant C is 5. Since there are no causal paths linking these latent variables, we are assuming that there is no covariance between them, although we could allow for such covariances; this will be shown in a later example.
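A sketch of how this first data set can be generated (my own Python code, not the author's; I read N(0, 0.75) as a normal error with variance 0.75, and the random seed is arbitrary):

import numpy as np

rng = np.random.default_rng(42)
N = 200

X = rng.normal(0.0, 1.0, N)
Y = 0.5 * X + rng.normal(0.0, np.sqrt(0.75), N)
Z = 0.5 * Y + rng.normal(0.0, np.sqrt(0.75), N)

# completely random assignment into 40 groups of 5: any group-mean differences are pure sampling noise
groups = rng.permutation(np.repeat(np.arange(40), 5))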
I now fit two models, one nested within the other. First, I fit the model shown in Figure 7.7 (remembering to constrain all the free parameters at level 1 to be equal in the model associated with the within-group covariance matrix and in the model associated with the between-group covariance matrix) while fixing the variances of the latents M_1, M_2 and M_3 to zero. The model in Figure 7.7 is the between-groups model that necessarily contains the within-groups model (X→Y→Z) nested within it. By fixing the variances of the latents M_1, M_2 and M_3 to zero I am assuming that the variances of the population means between groups are zero for all three variables and that any observed variance at this group level is due only to sampling variation. This assumption is, of course, correct for these data. The resulting model gives a maximum likelihood chi-squared statistic of 7.611 with 7 degrees of freedom, for a probability of 0.368.
There were two covariance matrices, the within-groups covariance matrix and the between-groups covariance matrix, and each had (3 × 4)/2 = 6 non-redundant elements. Therefore, we had 12 non-redundant elements in total. The within-groups model had to estimate 5 free parameters (a_YX, a_ZY, and the variances of X and of the errors of Y and Z). The between-groups model also had to estimate the same 5 free parameters, since the variances of the latents M_1, M_2 and M_3 were fixed at zero. I also had to constrain the 5 free parameters associated with the within-groups model to be equal to these same free parameters in the between-groups model. Therefore, I had 12 − 5 = 7 degrees of freedom.
Since this model, with the variances of the latents M_1, M_2 and M_3 fixed to zero, provides a non-significant probability level, we could stop there. However, to test the hypothesis that there is really no group-level variance contributing to X, Y or Z, I now re-fit the model while allowing the variances of the latents M_1, M_2 and M_3 to be freely estimated. This new model gives a maximum likelihood chi-squared statistic of 7.475 with 4 degrees of freedom, for a probability level of 0.113. Note that we have reduced the degrees of freedom from 7 to 4 because we are now estimating three parameters that were previously fixed. Since the first model is nested within this one, we can calculate the probability that the variances of M_1, M_2 and M_3 really were zero by calculating the difference in the chi-squared statistics (0.136) with 3 degrees of freedom. The resulting probability level (0.987) tells us that the observed variation in the group means was very likely to have been due only to sampling variation. If we go back to our first model and look at the estimated variances of M_1, M_2 and M_3 and the standard errors of these estimates, we find that each is very small and less than 1 standard error from zero. I have not explained the actual code needed to fit models in this book because each program does this differently, and most have user-
friendly interfaces that hide much of the code anyway. However, since the multilevel model is more complicated, I show some pseudo-code for the EQS program (Bentler 1995) in Box 7.1.

Figure 7.7. A two-level model involving three variables (X, Y and Z) and their population means.
Box 7.1. EQS program code for a multilevel model

The following is the program code of the EQS program needed to fit the multilevel model shown in Figure 7.7. Note that the actual code generation is done automatically in EQS from a user-friendly interface. My comments are shown in italics.

/TITLE
within-groups model
/SPECIFICATIONS
DATA=WITHIN.ESS;
VARIABLES=3; CASES=160;
GROUPS=2;
METHODS=ML;
MATRIX=COVARIANCE;

This section specifies that the first input data file, containing the pooled within-group covariance matrix, is called WITHIN.ESS, that there are 3 variables in this file, that it is a covariance matrix, and that it is based on 160 observations. Remember that there are really 200 observations, but the data are grouped into 40 groups. The within-group covariance matrix therefore has 200 − 40 = 160 degrees of freedom. Finally, the code tells us that the overall model has 2 groups and that the parameters are to be estimated using maximum likelihood techniques.
/LABELS
V1=X; V2=Y; V3=Z;

There are only four legal types of variable in EQS. Observed variables are called V, latent variables are called F, errors of observed variables are called E and the errors of latent variables are called D. The LABELS section just tells us how our variable names (X, Y, Z) map onto the EQS variables.

/EQUATIONS
V2 = 1*V1 + E2;
V3 = 1*V2 + E3;

These are the equations for the within-groups section of the overall model. Note that there are no latent variables here. The asterisk indicates that there is a free parameter to be estimated (a slope in this case), and that the starting guess for its value in the iterations is the number before the asterisk.

/VARIANCES
V1 = *;
E2 = *;
E3 = *;

These are the variances whose free values are to be estimated. Again, the asterisk indicates that it is a free parameter whose value must be estimated. Since X (V1) is exogenous, its variance is estimated. Since Y (V2) and Z (V3) are endogenous, their error variances must be estimated.

/COVARIANCES

Since no entries have been given in the COVARIANCES section, no free covariances are allowed.

/END

Now we enter the model for the second group of the multigroup model. Remember that we are actually trying to trick the program into fitting a multilevel model using the syntax of a multigroup model.
/TITLE
full between-group and within-group model
/SPECIFICATIONS
DATA=between.ESS;
VARIABLES=3; CASES=40;
METHODS=ML;
MATRIX=COVARIANCE;

This gives the same type of information as in the first section, except that we are using the file between.ESS, which holds the covariance matrix of the 40 group means.

/LABELS
V1=X; V2=Y; V3=Z;
/EQUATIONS
V1 = 2.236 F1 + E1;
V2 = 2.236 F2 + 1*V1 + E2;
V3 = 2.236 F3 + 1*V2 + E3;

Compare these equations with those in the first section; the variables F1, F2 and F3 are EQS code for three latent variables. These latents represent the population group means in this case. The difference is that both the within-group model and the between-group model are combined in this second group. The between-group part links X (V1), Y (V2) and Z (V3) at level 1 to their group means, which are represented by the three latent variables. The path coefficients from these latents to their respective level-1 variables are fixed at 2.236 (i.e. √5). You can tell that these are fixed values because there is no asterisk after them.

/VARIANCES
E1 = *;
E2 = *;
E3 = *;
F1 = *;
F2 = *;
F3 = *;

Notice that we now specify the variances of the three latent variables representing the group means. Here, these latent variances are allowed to be freely estimated. If we want to fix them at zero, we replace the asterisks with 0. Note also that we don't specify a variance for V1 since, in this model, it is no longer exogenous (it is now caused by F1, its group mean). Instead, we have to specify its error variance (E1).

/COVARIANCES

Again, we don't allow any free covariances. If we wanted to allow a free covariance between the population means of (say) X (V1) and Y (V2) then we would add the following line: F1,F2 = *;.

/CONSTRAINTS
(1,V1,V1) = (2,E1,E1);
(1,E2,E2) = (2,E2,E2);
(1,E3,E3) = (2,E3,E3);
(1,V1,V2) = (2,V1,V2);
(1,V3,V2) = (2,V3,V2);

This is a critical part of the overall model. This section specifies which free parameters in the two models are constrained to be equal. We must make all free parameters in the first model (the within-group model) equal to their equivalents in the second model. Thus, we state for instance that the error variance of V2 in model 1, (1,E2,E2), must equal the error variance of V2 in model 2, (2,E2,E2). Note also that V1 is exogenous in model 1 and so it has a variance, (1,V1,V1), but it is endogenous in model 2 so it is represented by its error variance, (2,E1,E1).

/END
The first simulation exercise was simply to show you that, if there really is no group-level variation that results in partial non-independence of the observations, then the multilevel model will detect this fact. Next, let's look at a case in which there really is group-level variation, but not group-level covariation. This time, we generate our 200 observations according to the following equations:

(A) For the 40 groups:
X̄_j = N(0,1)
Ȳ_j = N(0,1)
Z̄_j = N(0,1)
(B) For each of the five observations within each of the 40 groups:
X_ij = X̄_j + N(0,1)
Y_ij = Ȳ_j + 0.5X_ij + N(0,1)
Z_ij = Z̄_j + 0.5Y_ij + N(0,1)
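The corresponding sketch for this second generating process (again my own illustrative Python code; the seed and implementation details are arbitrary):

import numpy as np

rng = np.random.default_rng(7)
n_groups, n_per = 40, 5

# (A) independent group-level means
Xj = rng.normal(0.0, 1.0, n_groups)
Yj = rng.normal(0.0, 1.0, n_groups)
Zj = rng.normal(0.0, 1.0, n_groups)

# (B) individual-level observations within each group
X = np.repeat(Xj, n_per) + rng.normal(0.0, 1.0, n_groups * n_per)
Y = np.repeat(Yj, n_per) + 0.5 * X + rng.normal(0.0, 1.0, n_groups * n_per)
Z = np.repeat(Zj, n_per) + 0.5 * Y + rng.normal(0.0, 1.0, n_groups * n_per)
groups = np.repeat(np.arange(n_groups), n_per)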
Note that, in this simulation, each variable receives variation from two sources: the variation at the group level, due to the random variation of each of the group means, and the variation at the individual level. I again fit two nested models, as in the first example. In the first model I fix all of the group-level variances associated with the latents M_1, M_2 and M_3 to zero (see Figure 7.7). This model does poorly (MLX² = 197.079, 7 df, p < 10⁻⁷), as it should. Since the variance of the group-level latent means was fixed at zero, all of the group-level variance is incorrectly forced down to the within-group level. The resulting estimates of the three within-group error variances are 1.647, 1.904 and 1.857 instead of the correct value of 1. Now, I re-fit the model but allow the variances of the latent population means (M_1, M_2 and M_3) to be freely estimated. This time, the model provides an adequate fit (MLX² = 6.931, 4 df, p = 0.140), as it should. Since the first model is nested within this one, the difference in the maximum likelihood chi-squared statistics tests the hypothesis that there is group-level variance. This difference is highly significant (MLX² = 190.148, 3 df, p < 10⁻⁷). Both the group-level variances and the individual-level variances are correctly estimated. If we ignore the multilevel nature of these data and simply put all 200 observations into the same data set and fit the model X→Y→Z, then the model fails (MLX² = 6.206, 1 df, p = 0.013). This is a general result (Muthén 1994a).
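The chi-squared difference tests quoted in these comparisons are easy to reproduce once the two nested fits are in hand; a minimal sketch, assuming scipy is available (the numbers are those reported above):

from scipy.stats import chi2

def chi_square_difference(chi_constrained, df_constrained, chi_free, df_free):
    # Likelihood-ratio comparison of a constrained model with the freer model in which it is nested
    d_chi = chi_constrained - chi_free
    d_df = df_constrained - df_free
    return d_chi, d_df, chi2.sf(d_chi, d_df)

# group-level variances fixed at zero versus freely estimated (second simulation)
print(chi_square_difference(197.079, 7, 6.931, 4))   # (190.148, 3, p far below 0.001)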
Up to now, we have seen how ignoring the multilevel nature of data
can result in improper parameter estimates and probability levels. The real
strength of multilevel SEM is that we can actually model how the group-
level variables interact separately from the level 1 variables, and how these
two levels interact together. To show this, I will simulate data from the
process shown in Figure 7.8.
This model has four observed variables. Three of these variables,
the number of seeds produced per plant, the seed size for an individual plant
and the relative growth rate (RGR) of an individual plant, are properties of
individual plants, and form the within-group model. The fourth variable,
the average disturbance frequency of the habitat, is a property of each
species; that is, individuals within a given species tend to be found in habi-
tats with the same average frequency of disturbance. The model proposes
that, at the level of individuals, increasing seed size causes an increase in
RGR but a decrease in the number of seeds produced. At the level of
species, selection for habitats of different disturbance frequencies results in
species that are adapted to frequently disturbed habitats (early during secon-
dary succession) producing fewer seeds (because they tend to be smaller
plants), also smaller seeds on average, and seedlings having faster average
RGRs. Note that the relationship between seed size and RGR is positive
at the level of individuals but negative at an interspecific level. Figure 7.9
shows the simulated data set.
Before analysing these simulated data, I want you to notice some interesting trends. First, there is a negative relationship between seed size and RGR. This is because the species-level effect of selection, based on average disturbance frequency, dominates the within-species tendency for larger seeds to increase RGR. Second, notice that there is a positive relationship between seed number and disturbance frequency even though the direct effect of disturbance frequency on these two variables, at an interspecific level, is negative. This is because an increasingly disturbed habitat selects for species with smaller seeds on average and this, in turn, reduces the seed size within such species. However, smaller seeds increase seed number within a given species, resulting in the overall effect along this path being positive.
Figure 7.8. A hypothetical causal structure involving four observed
variables and three latents. One of these observed variables
(disturbance frequency) is a species-level variable that is a common
cause of the three latent species means. The other three observed
variables are individual-level variables that are caused by the latent
species means.
Figure 7.9. The scatterplot matrix of the simulated data generated by the causal structure shown in Figure 7.8.
Since this path dominates the direct effect at the species level, we see a positive overall relationship.
Now, we fit a series of nested models. First, we specify no variance or covariance at the species level. This model is rejected (MLX² = 746.249, 7 df, p < 10⁻¹⁰). Next, we allow variance, but not covariance, at the species level. This nested model is also rejected (MLX² = 51.385, 4 df, p < 10⁻¹⁰) but the change in the maximum likelihood chi-squared statistic is significant (MLX² = 694.864, 3 df, p < 10⁻¹⁰), showing that there is significant species-level variation. Finally, we allow the three latent variables to freely covary amongst themselves. This model, within which the second is nested, provides an acceptable fit (MLX² = 2.096, 1 df, p = 0.148) and the change in the maximum likelihood chi-squared statistic is significant (MLX² = 49.289, 3 df, p = 1.13 × 10⁻¹⁰), showing that there is also species-level covariation.

Since there is significant covariation between the latent species-level means, we introduce the species-level variable average disturbance frequency and specify that this variable is the sole common cause of these three species-level variables. This model (which is the true model that generated these data) also provides an acceptable fit (MLX² = 5.123, df = 4, p = 0.275). Figure 7.10 shows the final parameter estimates and their standard errors in parentheses.
The parameter estimates are all close to the true values that I had simulated, and all are within the approximate 95% confidence intervals (two times the standard errors). This model is quite remarkable. Not only have we been able to account for the partial non-independence of the data within each species due to their hierarchical nature, but we have also been able to separate the within-species structure from the between-species structure and link the two hierarchical processes together. Although the data were simulated, they are biologically realistic. Natural selection based on some average environmental property determines the average values of attributes shown by different species and the covariation between these average values. These average values then limit the range of values shown by particular individuals within each species but still allow variation between individuals and covariation of the attributes at the level of individuals.
The multigroup model can be extended to more than two levels. For simplicity, let's imagine that our data are grouped into G different genera, S different species and I different individuals. The value of our attribute for individual i of species j of genus k is Y_ijk. Now, we can write:

Y_ijk − Ȳ = (Ȳ_..k − Ȳ) + (Ȳ_.jk − Ȳ_..k) + (Y_ijk − Ȳ_.jk)

where Ȳ is the grand mean over all I observations, Ȳ_..k is the mean of Y for each genus, and Ȳ_.jk is the mean of Y for each species. It follows that we can
decompose the sum of squares of Y_ijk into a term representing the deviations of each genus mean from the grand mean, (Ȳ_..k − Ȳ), a term representing the deviations of each species mean from its genus mean, (Ȳ_.jk − Ȳ_..k), and a term representing the deviation of each individual from its species mean, (Y_ijk − Ȳ_.jk).
Following exactly the same logic as we used to derive the two-group multilevel model, we can therefore obtain a genus-level sample covariance matrix (S_G), a species-level pooled sample covariance matrix (S_PS) and an individual-level pooled sample covariance matrix (S_PI)¹³:
S_G = Σ_{k=1..G} N_..k (Ȳ_..k − Ȳ)² / (G − 1)

S_PS = Σ_{k=1..G} Σ_{j=1..S} N_.jk (Ȳ_.jk − Ȳ_..k)² / (S − G)

S_PI = Σ_{k=1..G} Σ_{j=1..S} Σ_{i} (Y_ijk − Ȳ_.jk)² / (N − S)
¹³ My Toolbox (see Appendix) contains a program to calculate these multilevel covariance matrices.
Figure 7.10. The maximum likelihood estimates of the free parameters
of Figure 7.8, based on the simulated data. Values in parentheses are
the standard errors of the parameter estimates.
If our data are completely balanced, with N_G species per genus and N_S individuals per species, the scaling constant at each level is simply the common number of observations per unit at that level. If this is not the case then we have to use Muthén's approximate scaling factor for each level of the hierarchy:

C_L = (N² − Σ_{i=1..N_L} N²_iL) / (N(N_L − 1))
Here, C_L is the scaling factor for level L of the hierarchy, N is the total number of observations in the data set, N_L is the total number of units at level L (for instance, the number of species or genera) and N_iL is the number of units within group i of level L. In this way, we can model structural relationships at various levels of organisation and account for partial non-independence due to common ancestry at various taxonomic levels. Of course, these levels do not have to represent traditional taxonomic classifications. For instance, you might have measures at different times for the same individual, defining a within-individual level. Some readers might have noticed that the above description looks much like a nested Type II ANOVA. Box 7.2 describes the relationship for those who are interested.
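As with the two-level case, these matrices can be computed directly from the raw data. The sketch below is my own Python code (the author's Toolbox program, mentioned in the footnote, is a separate implementation) and follows the formulas above, treating each row of the data array as one individual labelled by genus and species:

import numpy as np

def three_level_matrices(data, genus, species):
    # data: (N, p) array; genus and species: length-N label arrays (species labels unique across genera)
    data = np.asarray(data, dtype=float)
    genus, species = np.asarray(genus), np.asarray(species)
    N, p = data.shape
    grand = data.mean(axis=0)
    genera, specs = np.unique(genus), np.unique(species)
    G, S = genera.size, specs.size

    SS_G = np.zeros((p, p))
    for g in genera:
        idx = (genus == g)
        d = data[idx].mean(axis=0) - grand
        SS_G += idx.sum() * np.outer(d, d)            # weighted by individuals per genus

    SS_PS = np.zeros((p, p))
    SS_PI = np.zeros((p, p))
    for s in specs:
        idx = (species == s)
        genus_mean = data[genus == genus[idx][0]].mean(axis=0)
        species_mean = data[idx].mean(axis=0)
        d = species_mean - genus_mean
        SS_PS += idx.sum() * np.outer(d, d)           # weighted by individuals per species
        resid = data[idx] - species_mean
        SS_PI += resid.T @ resid

    return SS_G / (G - 1), SS_PS / (S - G), SS_PI / (N - S)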
Box 7.2. Variance components in a nested ANOVA

Consider a typical balanced ANOVA table for a nested analysis with three levels. There are 160 observations. These observations are grouped into N_1 = 40 level 1 groups, N_2 = 80 level 2 groups (i.e. two level 2 groups per level 1 group) and N_3 = 160 level 3 observations (two level 3 observations per level 2 group).
Source    df                     SS     MS                 Expected MS
Level 1   N_1 − 1 = 40 − 1       SS_1   SS_1/(N_1 − 1)     σ²_L3(L2) + C_2·σ²_L2(L1) + C_1·σ²_L1
Level 2   N_2 − N_1 = 80 − 40    SS_2   SS_2/(N_2 − N_1)   σ²_L3(L2) + C_2·σ²_L2(L1)
Level 3   N_3 − N_2 = 160 − 80   SS_3   SS_3/(N_3 − N_2)   σ²_L3(L2)
Total     N_3 − 1 = 160 − 1      SS_T   SS_T/(N_3 − 1)

Note: SS, sum of squares; MS, mean square.
Here, the notation σ²_L3(L2) means the variation of the level 3 units nested within the level 2 units. C_2 is the number of level 3 units within each level 2 unit and C_1 is the number of level 2 units within each level 1 unit. We see that the higher level mean squares (which are sample variances if the units are randomly sampled) do not estimate variation unique to that level but a weighted sum of variation at that level and all levels below it.
If we subtract from the variance (i.e. MS) at a given level the variance directly below it in the hierarchy, and divide by the number of observations per unit (i.e. C_i), then we obtain an estimate of the variance component at that level. For instance, to obtain the variance component at level 1 we write:

σ²_L1 = [(σ²_L3(L2) + C_2·σ²_L2(L1) + C_1·σ²_L1) − (σ²_L3(L2) + C_2·σ²_L2(L1))] / C_1

We estimate this variance component by calculating:

[MS_L1 − MS_L2] / C_1
The variance (i.e. MS) at a given level measures the total amount of
variation that is found at that level. However, such variation is due to the
combined eect of variation at lower levels and the added variation contrib-
uted at that level. The variance components measure the amount of added
variation at each level. Usually, one expresses these variance components as
percentages of the total variation.
A variance (or a sum of squares) is simply a special type of covariance
(or sum of cross-products); namely the covariance of a variable with itself.
We can therefore apply the same logic to each variance and covariance in a
covariance matrix. If we measure a whole set of variables on each observa-
tional unit instead of only one then we can produce a table that summarises
the decomposition of the entire covariance matrix. Rather than sums of
squares (SS), we would calculate sums of squares and cross-products (SSCP).
Rather than mean squares (variances) we would calculate mean squares and
cross-products (MSCP), i.e. covariances.
Source    df                     SSCP     MSCP                   Expected MSCP
Level 1   N_1 − 1 = 40 − 1       SSCP_1   SSCP_1/(N_1 − 1)       σ²_L3(L2) + C_2·σ²_L2(L1) + C_1·σ²_L1
Level 2   N_2 − N_1 = 80 − 40    SSCP_2   SSCP_2/(N_2 − N_1)     σ²_L3(L2) + C_2·σ²_L2(L1)
Level 3   N_3 − N_2 = 160 − 80   SSCP_3   SSCP_3/(N_3 − N_2)     σ²_L3(L2)
Total     N_3 − 1 = 160 − 1      SSCP_T   SSCP_T/(N_3 − 1)
The variance components can be extracted from the diagonal elements of
these covariance matrices.
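A small numerical sketch of this subtraction rule (the mean squares below are invented purely to show the arithmetic):

def variance_component(ms_this_level, ms_level_below, c):
    # added variance at a level = (MS at this level - MS one level below) / C
    return (ms_this_level - ms_level_below) / c

ms_l1, ms_l2, ms_l3 = 10.0, 4.0, 1.5   # hypothetical mean squares for the three-level design above
c1, c2 = 2, 2                          # units per higher-level unit, as in the box

var_l3 = ms_l3                                   # lowest level: its MS estimates its own variance
var_l2 = variance_component(ms_l2, ms_l3, c2)
var_l1 = variance_component(ms_l1, ms_l2, c1)
total = var_l1 + var_l2 + var_l3
print([round(100 * v / total, 1) for v in (var_l1, var_l2, var_l3)])   # components as percentages of the total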
One important type of multilevel model involves repeated measurements over time on the same set of individuals. In such a sampling design, the first level would be the intra-individual level (i.e. variation over time in the same individual); this is analogous to repeated-measures analyses with which many biologists are familiar. Unfortunately, I am not aware of any
multilevel models that have been used in a biological context and only very few in any other context (Muthén 1990, 1994a,b; Muthén and Satorra 1995). None the less, I suspect that multilevel models will become very important in biology, since hierarchies are so ubiquitous.
8 Exploration, discovery and equivalence
8.1 Hypothesis generation
If this were a textbook of statistics then this chapter would not exist.
Modern statistics is almost entirely concerned with testing hypotheses, not
developing them. This bureaucratic approach views science as a compartmen-
talised activity in which hypotheses are constructed by one group, data are
collected by another group and then the statistician confronts the hypothe-
sis with the data. Since this book is a user's guide to causal modelling, such
a compartmentalised approach will not do. One of the main challenges
faced by the practising biologist is not in testing causal hypotheses but in
developing causal hypotheses worth testing.
If this were a book about the philosophy of science then this chapter might not exist either. The philosophy of science mostly deals with questions such as: 'How can we know whether a scientific hypothesis is true or not?' or 'What demarcates a scientific hypothesis from a non-scientific hypothesis?'. For most philosophers of science the question of how one looks for a useful scientific hypothesis in the first place is someone else's problem. For instance, Popper's (1980) influential Logic of scientific discovery says that 'there is no such thing as a logical method of having new ideas, or a logical reconstruction of this process. My view may be expressed by saying that every discovery contains an irrational element, or a creative intuition . . .'. Later, he says that '[scientific laws] can only be reached by intuition, based on something like an intellectual love of the objects of experience'. Again, one gets the impression that science consists of two hermetically sealed compartments. One compartment, labelled hypothesis generation, consists of an irrational fog of thoughts and ideas, devoid of method, out of which a few gifted people are able to extract brilliant insights. The other compartment, labelled hypothesis testing, is the public face of science. Here, one finds method and logic, in which established rules govern how observations are to be taken, statistically manipulated and interpreted.
At a purely analytical level there is much to be gained by taking this
schizophrenic view of the scientific process. After all, how a scientific idea is developed is irrelevant to its truth. For instance, the history of science documents many important ideas whose genesis was bizarre¹. Archimedes reportedly discovered the laws of hydrostatics after jumping into a bathtub full of water. Kekulé discovered the ring structure of benzene after falling asleep before a fire and dreaming of snakes biting their tails. These curious stories are entertaining but we remember them only because the laws of hydrostatics hold and benzene really does have a ring structure. As a public activity, science is interested in the result of the creation, not in the creative act itself.
The day-to-day world of biology does not exist at such a purely analytical level. Although it is possible to conceptually divide science into distinct hypothesis-generation and hypothesis-testing phases, the two are often intimately intertwined in practice. When the two are not intertwined the science can even suffer. Peters (1991), in his A critique for ecology, pointed out that because empirical and theoretical ecology are often done by different people, the result is that much ecological theory is crafted in such a way that it can't be tested in practice and much of field ecology can't be generalised because it is not placed into a proper theoretical perspective. In this context I like the citation, attributed to W. H. George, given at the beginning of Beveridge's (1957) The art of scientific investigation: 'Scientific research is not itself a science; it is still an art or craft.' Unlike the assembly-line worker who receives a partly finished object, adds to it, and then passes it along to someone else, the craftsman must construct the object from start to finish. In the same way the craft of causal modelling consists as much of the generation of useful hypotheses as of their testing. Certainly hypothesis generation is more art than method, and hypothesis testing is more method than art, but this does not mean that we must relegate hypothesis generation to a mystical world of creative intuition in which there are no rules. The purpose of this chapter is to describe reliable methods of generating causal hypotheses.
8.2 Exploring hypothesis space
How does one go about choosing promising hypotheses concerning causal
processes? To place the problem in context, imagine that you have collected
data on N variables and at least some of these variables are not amenable to
controlled randomised experiments. Why you suspect that these N variables
¹ The appendix of The art of scientific investigation (Beveridge 1957) lists 19 cases in which the origin of important scientific ideas arose from bizarre or haphazard situations. In fact, Beveridge devotes an entire chapter to the importance of chance in scientific discovery.
possess interesting or important causal relationships may well be due to the
irrational creative intuition to which Popper referred, but you are still left
with the problem of forming a multivariate hypothesis specifying the causal
connections linking these variables.
To simplify things, let's assume that all of the data are generated by the same unknown causal process (i.e. causal homogeneity), that there are no latent variables responsible for some observed associations (i.e. causal sufficiency) and that the data are faithful² to the causal process. How many different causal graphs could exist under these conditions? Each pair of variables (X and Y) can have one of four different causal relationships: X directly causes Y, or Y directly causes X, or X and Y directly cause each other, or the two have no direct causal links. We now have to count up the number of different pairs of variables, which is just the number of combinations of two objects out of N. The combinatorial formula is therefore

4^(N!/(2!(N − 2)!))

Table 8.1 gives the number of different potential causal graphs of this type that can exist given N variables.
If we think of the full set of potential causal graphs having N variables as forming an hypothesis space, and your research program as a search through this space to find the appropriate causal graph, then Table 8.1 is bad news. Even if we could test one potential graph per second it would take us
² See Chapter 2 for the definition of faithfulness. In fact, much of the present chapter makes use of notions introduced in Chapter 2, and the reader might want to re-read that chapter before continuing.
Table 8.1. The number of different cyclic causal graphs without latent variables that can be constructed given N variables

N    Number of graphs
2    4
3    64
4    4096
5    1048576
6    1073741824
about 34 years to test every potential graph containing only six variables! If we were to restrict our problem to acyclic graphs then the numbers would be smaller, but still astronomical (Glymour et al. 1987). If it is true that the process of hypothesis generation (in this case, proposing one causal graph out of all those in the hypothesis space) is pure intuition, devoid of method, then it is a wonder that science has made any progress at all. That science has made progress shows that efficient methods of hypothesis generation, although perhaps largely unstated, do exist.
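The arithmetic behind Table 8.1 is a one-line calculation; a quick sketch:

from math import comb

def n_possible_graphs(n_vars):
    # four possible direct relationships for each unordered pair of variables
    return 4 ** comb(n_vars, 2)

for n in range(2, 7):
    print(n, n_possible_graphs(n))

print(n_possible_graphs(6) / (60 * 60 * 24 * 365.25))   # seconds converted to years: about 34 at one graph per second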
So how should we go about efficiently exploring this hypothesis space? To go back to my previous question: how does one go about generating promising hypotheses concerning causal processes (Shipley 1999)? One way would be to choose a graph at random and then collect data to test it. With five variables there is a bit less than one chance in a million of hitting on the correct structure. There is nothing logically wrong with such a search strategy; we will have proposed a falsifiable hypothesis and tested it. However, no thinking person would ever attempt such a search strategy because it is incredibly inefficient. We need search strategies that have a good chance of quickly finding those regions of hypothesis space that are likely to contain the correct answer. What would be our chances of hitting on the correct structure if we were to appeal only to pre-existing theory, as recommended by many SEM books? Clearly that would depend on the quality of the pre-existing theory. If, however, the theory really was so compelling that the researcher did not feel a need to search for alternatives then the problem would be firmly within the hypothesis-testing compartment and no question of a search strategy would be posed.
Very often biologists find themselves in the awkward position of straddling the hypothesis-generation and hypothesis-testing compartments. Often, we have some background knowledge that excludes certain causal relationships and suggests others, but not enough firmly established background knowledge to specify the full causal structure without ambiguity. In such situations the goal is not to test a pre-existing theory, which might not be sufficiently compelling to justify allocating scarce resources and time to testing it, but rather to develop a more complete causal hypothesis that would be worth testing with independent data. The real problem is less in testing hypotheses than in finding hypotheses that are worth testing in the first place. We need search strategies that can be proved to be efficient at exploring hypothesis space, at least given explicitly stated assumptions. Until very recently such search strategies, which are described in this chapter, did not exist. You will see that these search strategies rely heavily on the notion of d-separation and on how this notion allows a translation from causal graphs to probability distributions.
8.3 The shadow's cause revisited

I have repeatedly compared the relationship between cause and correlation to the relationship of an object and its shadow. There is something missing in this analogy when applied to actual research projects. When we measure a correlation in a sample of data we are almost never interested in the value as such. Rather, we use the value to infer what the correlation might be in the population from which we randomly chose our sample data. It is as if, in nature's Shadow Play, not only do the causal processes cast potentially ambiguous correlational shadows, but these shadows are randomly blurred as well. We therefore have two problems. First, we have to find a way of provably deducing causal processes from correlational shadows and, second, we have to take into account the inaccuracies caused by using sample correlations to infer population correlations. It is important to keep these two problems distinct. The second problem, that of dealing with sampling variation, is a typical problem of mainstream statistics. For this reason, we will first see how to go from correlations to causes when there is no sampling variation. In other words, we will consider asymptotic methods.
The history of the development of these exploratory methods, or search algorithms, is fascinating. The word 'history' has connotations of age but, in fact, all of these methods date to less than 10 years before the writing of this book. The mathematical relationships between graphs, d-separation and probability distributions were worked out in the mid 1980s by Judea Pearl and his students at the University of California at Los Angeles (UCLA) (Pearl 1988). This was the translation device between the language of causality and the language of probability distributions that had been missing for so long. As soon as it became possible to convert causal claims into probability distributions the dam burst and the conceptual flood came pouring out. It became immediately obvious that one could also convert statements concerning probabilistic independencies into causal claims. Pearl and his team at UCLA developed a series of algorithms to extract causal information from observational data during the period 1988–1992³. Interestingly, a group of people at the Philosophy Department at Carnegie Mellon University (Clark Glymour, Peter Spirtes, Richard Scheines and their students) had also been working on the same goal. In the late 1980s they had published a book (Glymour et al. 1987) in which zero partial correlations and vanishing tetrad differences were used to infer causal structure, but without the benefit of d-separation or the mathematical link between
³ This brief history, and the algorithms of Pearl and his students, are given in Chapter 2 of Pearl (2000).
causal graphs and probability distributions. As soon as the Carnegie-Mellon group encountered Pearl's work on d-separation (they didn't know about the discovery algorithms of Pearl) they immediately began to independently derive and prove almost identical search algorithms. These algorithms (and much more) were proved and published in Spirtes, Glymour and Scheines (1993) and incorporated into their TETRAD II program. An algorithm called the Inductive Causation algorithm was proved and published by Verma and Pearl (1991) and is very similar to the Causal Inference algorithm of the Carnegie-Mellon group that is presented in this chapter. I will leave it to the people involved to sort out questions of priority. I think that it is fair to say that once the d-separation criterion was developed the various algorithms were 'in the air' and had only to be brought down to earth by those with the knowledge. The philosopher's dream of inferring (partial knowledge of) causation from observational data had been realised.
In Chapter 2 I explained how to translate from the language of cau-
sality, with its inherently asymmetric relationships, to the language of prob-
ability distributions with its inherently symmetric relationships. The
Rosetta Stone allowing this translation was the notion of d-separation.
Using d-separation we could reliably convert the causal statements expressed
in a directed acyclic graph into probabilistic statements of dependence or
independence that are expressed as (conditional) associations. This transla-
tion strategy was used in Chapters 3 to 7 to allow us to test hypothesised
causal models using observational data.
Now that we are attempting to discover causal relationships, the
problem has been turned on its head. We have to start with probabilistic
statements of (conditional) dependence or independence and somehow
back-translate into the language of causality. As you will see, this back-
translation is almost always incomplete. There is almost always more than
one acyclic causal graph that implies the same set of probabilistic statements
of (conditional) dependence or independence. In other words, there are
almost always dierent acyclic causal graphs that make dierent causal pre-
dictions but exactly the same predictions concerning probabilistic depen-
dence or independence. This gives rise to the topic of equivalent models, a
topic that has been recognised in SEMfor a long time and generally ignored
for just as long.
The methods that I describe in this chapter are based on the strategy of back-translation that I described above. The first step is to obtain a list of probabilistic statements of (conditional) dependence or independence
involving the variables in question. From this list, we construct an undirected
dependency graph. An undirected dependency graph looks like a causal graph
in which all of the arrows have been converted into lines without arrow-
heads. However, the lines in the undirected dependency graph have a very different meaning. Two variables in this graph have a line between them if they are probabilistically dependent conditional on every subset of other variables in the graph. The lines in the undirected dependency graph express symmetrical associations, not asymmetrical causal relationships. Since we can't measure associations involving variables that we have not measured, the undirected dependency graph can't have latent variables. The next step is to convert as many of the symmetrical relationships in the undirected dependency graph as possible into asymmetrical causal relationships. This is called orienting the edges and uses the notion of d-separation⁴. Generally, not
all of the undirected lines will be converted into directed arrows and so we
do not end up with a directed graph. Rather, we end up with a partially
oriented graph.
8.4 Obtaining the undirected dependency graph

Before I explain how to obtain an undirected dependency graph from observational data, it is useful to explore how to convert a directed acyclic graph into an undirected dependency graph involving only measured variables. Doing this will help to underscore the difference between the undirected dependency graph and the causal graphs with which you are now familiar. In acyclic graphs without latent variables, the undirected dependency graph is simply the directed acyclic graph in which all of the arrows are replaced with lines lacking arrowheads. However, for the method to be useful in discovery, we can work only with those variables that we have actually measured. If the directed graph contains latent variables then the resulting undirected dependency graph, involving only observed variables, will usually require modifications, and these modifications help to illustrate the proper interpretation of such graphs. To get the undirected graph from a directed acyclic graph⁵ (Figure 8.1), or from a typical acyclic path diagram if it contains correlated errors, do the following things.
1. If there is not already an arrow or curved double-headed arrow between any two observed variables, but d-separation of the pair requires conditioning on latent variables, then draw a line (not an arrow) between the pair; see inducing paths (pp. 250–253).
2. If there are curved double-headed arrows between any pairs of variables (i.e. correlated errors), then replace these with a line (not an arrow).
3. Remove the latent variables and also any arrows going into, or out of, these latent variables.
4. Change all remaining arrows to lines.

⁴ Vanishing tetrad differences (Chapter 5) are also used, but these zero tetrad equations can be reduced to statements concerning d-separation involving latent variables.
⁵ The case of cyclic directed graphs will be dealt with later.
The top of Figure 8.1 shows a path diagram with both latent vari-
ables and correlated errors. The bottom of Figure 8.1 shows the undirected
dependency graph that results when considering only the observed vari-
ables. In Figure 8.1 there is a line between {B,C}, {B,D} and {C,D} in the
undirected dependency graph even though these pairs of variables were not
adjacent in the original path diagram. This is because, following the first rule, d-separation of each of these pairs required conditioning on a latent variable (A). Since we have not measured variable A we can't condition on
it, and so the three pairs of observed variables remain probabilistically asso-
ciated even after conditioning on any set of other observed variables.
Similarly, there is a line between {F,G} in the undirected dependency graph
even though the two variables were not adjacent in the original path
Figure 8.1. A path diagram (top) involving six observed variables and
one latent variable. Below is the undirected dependency graph
corresponding to this path diagram.
diagram. This is because, following the second rule, this pair of variables has
correlated errors. After changing all remaining arrows in the path diagram
to lines, we end up with the undirected graph.
A direct cause between two variables in a causal graph is a causal relationship between them that can't be blocked by other variables or a set of variables involved in the causal explanation. Similarly, we can define a direct association between two variables in an undirected dependency graph as an association that can't be removed upon conditioning on any other observed variable or set of variables⁶. The undirected dependency graph is therefore a graph that shows these direct associations. Of course, if we are attempting to discover causal relationships then we will not already have the directed acyclic graph. Our first task is therefore to discover the undirected dependency graph from the data alone (remembering that, for the moment, we are assuming that our sample size is so large that we can ignore sampling variation) when we don't know what the true directed acyclic graph looks like.
Let's begin with the following assumptions:

1. Every unit in the population is governed by the same causal process (i.e. causal homogeneity).
2. The probability distribution of the observed variables measured on each unit is faithful to some (possibly unknown) cyclic⁷ or acyclic causal graph.
3. For each possible association, or partial association, among the measured variables, we can definitely know whether the association or partial association exists (is different from zero) or does not exist (is equal to zero). This is simply the assumption that there is no sampling variation.
We don't have to assume that there are no unmeasured variables generating some associations (this assumption is called causal sufficiency), or that the variables follow any particular probability distribution, or that the causal relationships between the variables take any particular functional form. No assumptions of an acyclic structure are needed, although the algorithms for cyclic structures require linearity in the functional relationships between variables. The method uses d-separation, and we know that d-separation implies zero (partial) associations (Spirtes 1995; Pearl and Dechter 1996) under such conditions. Unfortunately, we don't yet know whether the converse is true; that is, whether there can be independencies generated
⁶ This will be more formally defined as an inducing path later on.
⁷ The subsequent orientation phases will differ depending on whether or not we assume an acyclic structure.
by cyclic causal processes that are not implied by d-separation (Spirtes 1995).
The assumption concerning causal homogeneity can be partly relaxed as
well, as is described later.
Given these assumptions, Pearl (1988) has proved that there will be
an edge (a line) in our undirected dependency graph between a pair of var-
iables (X and Y ) if X and Y are dependent conditional on every set of var-
iables in the graph that does not include X or Y. We can therefore discover
the undirected graph of the causal process that generated our data by apply-
ing the algorithm below. Following the definition of the order of a partial correlation, let's define the conditioning order of an association as the number of variables in the conditioning set. So, a zero-order association is an association between two variables without conditioning, a first-order association is an association between two variables conditioned on one other variable, and so on. How one measures these associations will depend on the nature of the data; the various methods described in Chapter 3 can be used for different types of data.
8.5 The undirected dependency graph algorithm⁸
The first step is to form the complete undirected graph involving the V observed variables. In other words, add a line between each variable and every other variable. Since latent variables are, by definition, unmeasured, we can't include them in our complete undirected graph. Now, for each unique pair of observed variables (X, Y) that have a line between them in the undirected dependency graph at any stage during the implementation of the algorithm, do the following:

1. Let the order of the association be zero.
2.1 Form every possible set of conditioning variables, containing the number of variables specified by the order, out of the remaining observed variables in the graph.
2.2 If the association between the pair of variables (X,Y) is zero when conditioned on any of these sets, then remove the line between X and Y from the undirected dependency graph, move on to a new pair of variables and then go to step 1.
2.3 If the association between the pair of variables (X,Y) is not zero when conditioned on all of these sets, then increase the order of the association by one, and go to step 2.1. If you cannot increase the conditioning order, then the line between your two variables is kept. Move on to a new pair of variables.
⁸ This algorithm is included in the SGS algorithm of Spirtes, Glymour and Scheines (1990).
Once you have applied this algorithm to every pair of observed variables, the result is the undirected dependency graph. Given the assumptions listed above, you are guaranteed to obtain the correct undirected dependency graph of the causal process that generated the data if the algorithm is properly implemented.
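A schematic implementation may make the loop structure clearer. This sketch is my own Python code, not the book's: it assumes you supply an is_independent(x, y, cond) oracle, which with unlimited data would be an exact test of (conditional) independence and in practice would be one of the statistical tests of Chapter 3.

from itertools import combinations

def undirected_dependency_graph(variables, is_independent):
    # Returns the set of edges (as frozensets) that survive every conditioning order.
    edges = {frozenset(pair) for pair in combinations(variables, 2)}
    for pair in combinations(variables, 2):
        x, y = pair
        others = [v for v in variables if v not in pair]
        removed = False
        for order in range(len(others) + 1):            # conditioning order 0, 1, 2, ...
            for cond in combinations(others, order):
                if is_independent(x, y, set(cond)):      # some conditioning set removes the association
                    edges.discard(frozenset(pair))
                    removed = True
                    break
            if removed:
                break
    return edges

Applied to the five observed variables of Figure 8.2 with an oracle based on d-separation in the true graph, this returns exactly the edges of Figure 8.5.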
To illustrate this algorithm, let's imagine that we have been given data (lots of it, so that we do not have to worry about sampling variation) that, unknown to us, was generated by the causal graph shown in Figure 8.2. Now, we don't know about Figure 8.2; this causal structure is hidden behind the screen of nature's Shadow Play. In fact, we might not even know of the existence of the latent variable (L), since, had we known about it, we probably would have measured it. All that we have is a (very large) data set containing observations on the variables A to E and a series of measures of association and partial association between them; these are the shadows that we can observe on the screen. Our task is to infer as much about the structure of Figure 8.2 as we can. To begin, we create the complete undirected dependency graph of these five variables (Figure 8.3).
Notice that the latent variable (L) doesn't appear in Figure 8.3 because we are dealing only with observed variables at this point. Let's begin with the pair (A,B) and apply the algorithm. Since A and B are adjacent in the true causal structure (Figure 8.2) then these two variables are not unconditionally d-separated. We will therefore find that the pair is associated in
our data when we test for a zero association (independence) without con-
ditioning (zero-order conditioning). Therefore the line between A and B in
Figure 8.3 remains after the zero-order step. We increase the conditioning
Figure 8.2. A directed graph, including one latent variable, used to
illustrate the undirected dependency graph algorithm. This causal graph
is unknown to the observer.
order to 1 and see whether A and B become independent upon conditioning on the following first-order sets: {C}, {D}, {E}. These are the only first-order conditioning sets that we can form from five variables while excluding variables A and B. From Figure 8.2 we know that A and B are not d-separated given any of these sets. Therefore they will not be independent in our data upon first-order conditioning and the line between them in Figure 8.3 remains after this step. We continue by increasing the conditioning order to 2 and test for a zero association relative to the following sets: {C,D}, {C,E}, {D,E}. These are the only second-order conditioning sets that we can form. Given the true causal structure in Figure 8.2 we will find that the second-order association between A and B remains. We increase the conditioning order to 3 and test for a zero association relative to the following conditioning set: {C,D,E}, but still the association between A and B would remain. Since we cannot increase the conditioning order any more, we conclude that there is a line between A and B in the final undirected dependency graph.
We then go on to a new pair of variables; in this case, A and C. When we apply the algorithm to the pair (A,C) we will find that A and C are still zero-order associated since they are d-connected in Figure 8.2. When we increase to order 1 and form the sets {B}, {D} and {E} we will find that A and C become independent upon conditioning on B. This is because, in Figure 8.2 (the true graph), A and C are d-separated given B and d-separation implies probabilistic independence. So, we remove the line between A and C in Figure 8.3, giving Figure 8.4. Since we have removed the line we don't have to go any further with this pair.
If we apply the algorithm to the pair (A,D) we would find that A
and D also become independent upon conditioning on B, and so we would
Figure 8.3. Step 1 in the construction of the undirected dependency
graph. There is an undirected edge between each pair of observed
variables.
remove the line from A to D in Figure 8.4. A and E would also become
independent either upon conditioning on B or on the sets {B,C}, {B,D}
or {B,C,D}. This is because A and E are d-separated by any of these con-
ditioning sets. The pair (C,D) would never become independent, since, in
Figure 8.2, they are both caused by a latent variable that will never there-
fore appear in any of the conditioning sets. Similarly, the pair {C,E} will
always remain associated as will the pair {D,E}. The undirected dependency
graph that results after applying the algorithm to every possible pair is shown
in Figure 8.5; this is the correct undirected dependency graph given the
causal process shown in Figure 8.2.
Figure 8.4. The undirected edge between A and C has been removed
because we have found a subset of observed variables {B} that
makes A and C independent upon conditioning.
Figure 8.5. The completed undirected dependency graph. The
undirected edges between A and D, between A and E, and between B
and E have been removed because we have found a subset of
observed variables that renders each pair independent upon
conditioning.
8.6 Interpreting the undirected dependency graph
The undirected dependency graph informs us of the pattern of direct associations in our data. It doesn't inform us of the pattern of direct causes in our data. For instance, there is a line between C and D in Figure 8.5 even though, peeking at the causal process that generated the data (Figure 8.2), we know that the association between C and D is due only to the effect of the latent variable (L). Just as the term direct cause can only have meaning relative to the other variables in the causal explanation, a direct association can only have meaning relative to the other variables that have been measured. However, we can infer from the undirected dependency graph that if two variables have a line between them then there is:
1. a direct causal relationship between the two and/or
2. a latent variable that is a common cause of the two and/or
3. a more complicated type of path between the two, called an inducing path; this will be explained in more detail later.
At the same time, we can exclude other types of latent variables.
For instance, we know that there is no latent variable that is a common cause
of A, B and C in Figure 8.5. If there were, then A and C would not be
d-separated given any set of other observed variables and there would there-
fore be a line between A and C in the undirected dependency graph.
The first two explanations for a direct association in an undirected dependency graph should be understandable by now. The third possibility is less obvious but can be illustrated by an example given by Spirtes, Glymour and Scheines (1993). Consider Figure 8.6. On the left is the directed acyclic graph with a latent variable F. Neither variable A nor variable B has any direct causal link with variable D. On the right is the undirected dependency graph. Notice that there is a line between A and D and also between B and D in the undirected dependency graph even though there are neither direct causal links between the pairs nor latent variables that are causes common to both. This is because A and D would never be d-separated given any subset of variables B and C and thus would always be probabilistically associated. Similarly, B and D would never be d-separated given any subset of variables A and C and thus would always be probabilistically associated. For instance, if we look at the pair (A,D) in the true causal graph then A and D would be unconditionally associated through the path A–C–D, associated conditional on B, associated conditional on C through the path A–C–F–D, and also associated conditional on both B and C for the same reason.
Understanding how this third possibility can arise in general requires a few more definitions.
Directed versus undirected paths
Look at the directed acyclic graph (DAG) in Figure 8.7. Imagine that this DAG is a road map consisting only of one-way streets whose direction is shown by the arrows. Normally, to go from one variable to another we have to respect the traffic rules and follow the arrows. If we can go from one variable to the other by following these rules then we will call our route a directed path. For instance, we can go from A to D by following the directed path A→B→C→D. The following is not a directed path: A→B←F→D, because we have gone the wrong way on a one-way road (F→B) when going from B to F. However, if we ignore the rules of the road and drive in whatever direction we want, irrespective of the direction of the arrows, then we can go from A to D along the path A→B←F→D. Such a path, in which the direction of the arrows is ignored, is called an undirected path. We haven't erased the arrows, we have simply decided to ignore them.[9] Of course, a directed path must also be an undirected path, but an undirected path might not also be a directed path. For instance, the following are undirected paths in Figure 8.7 but they are not directed paths: A→B←F→D, A→B→C←F→D.
[9] Pretend you are a diplomat who is working at the United Nations HQ in New York City. You can't change the laws about one-way streets but you can safely ignore them because of diplomatic immunity.
Figure 8.6. The true causal structure is shown on the left and the
resulting undirected dependency graph is shown on the right. Notice
that there are edges between A and D and between B and D in the
undirected dependency graph even though no such directed edges
exist in the directed graph. This is due to the presence of the latent
variable F generating an inducing path between these pairs of variables.
Inducing paths[10]
List all of the variables in the DAG and call it the set V. In Figure 8.7 the set V is {A,B,C,F,D}. We can call this complete DAG the graph G. Now choose some subset of variables in the DAG and call it O. For instance, you might choose the set O = {A,B,C,D}, thus leaving out the variable F. By doing this you will have a new graph (call it G′) in which the variable F is latent; that is, the variable F still has the same causal relationships to the other (O) variables as before, but variable F doesn't appear in G′. Because G′ doesn't show the variable F, it is not a complete description of the full causal process. Now, choose two variables in your chosen set (O) of variables and find an undirected path between them in the complete graph (G). For instance, if we choose A and D in Figure 8.7 then we can find the undirected paths A→B←F→D, A→B→C←F→D, A→B←F→C→D and A→B→C→D. Some of these undirected paths might be a special type of path called an inducing path relative to O. To determine whether a given undirected path is an inducing path relative to O, look at those variables in the undirected path in G that are also in your chosen set O. If (i) every variable in O along the undirected path except for the endpoints (here, A and D) is a collider along the path, and if (ii) every collider along this undirected path is an ancestor of either of the endpoints, then the path is an inducing path between the endpoints relative to O. Such an inducing path has the property that the endpoints will never be d-separated given any subset of other variables from the set O.
Let's look at the first undirected path between A and D (A→B←F→D) and choose O = {A,B,C,D}. The only other variable along this undirected path, except for A and D (the endpoints), that is in O is B, since F has been left out (i.e. is latent). Since B is a collider along this path and is
[10] The properties of inducing paths were described by Verma and Pearl (1991), but the name for such a path was introduced by Spirtes, Glymour and Scheines (1993).
Figure 8.7. A directed graph with one latent variable (F).
an ancestor (because of the path B→C→D) of D (one of the endpoints), the path A→B←F→D is an inducing path relative to O = {A,B,C,D}. To see that this inducing path results in A and D never being d-separated given any subset of variables from O, we have only to look at each possible conditioning set. The empty set (i.e. no conditioning) allows d-connection through the path A→B→C→D. The set {B} allows d-connection through the undirected path A→B←F→D. The set {C} allows d-connection through the undirected path A→B→C←F→D. The set {B,C} allows d-connection through either of these last two undirected paths.
None of the other undirected paths between A and D are inducing paths relative to O = {A,B,C,D}. For instance, the undirected path A→B→C←F→D has the variable B that is in O but is not a collider along the path; conditioning on B therefore blocks this path. Similar reasons exclude the paths A→B←F→C→D and A→B→C→D.
Notice that the variables at the ends of such an inducing path will
never be d-separated given the variables in O because one will always be
conditioning on a collider and thus opening a path through some variable
not in O. Therefore the undirected dependency graph involving the variables in O will always have a line between two variables if there is an induc-
ing path between them. Noting that O will usually consist of the set of
observed variables, you might start to see the usefulness of the notion of
an inducing path. If you see a line between two variables in the undirected
dependency graph then you will know that there is an inducing path
between them.
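As a concreteness check, here is a small Python sketch (my own illustration, not taken from any published package) that tests the two defining conditions directly: every non-endpoint variable on the path that belongs to O must be a collider on the path, and every collider on the path must be an ancestor of one of the endpoints. The example at the bottom assumes the Figure 8.7 structure as I have reconstructed it from the paths discussed above (A→B→C→D, with the latent variable F a parent of B, C and D).

    def ancestors(node, edges):
        """All ancestors of `node` in a DAG given as a set of (parent, child) pairs."""
        found, frontier = set(), {node}
        while frontier:
            parents = {p for (p, c) in edges if c in frontier and p not in found}
            found |= parents
            frontier = parents
        return found

    def is_inducing_path(path, edges, observed):
        """True if `path` (a sequence of nodes) is an inducing path between its
        endpoints relative to the set `observed`, in the DAG `edges`."""
        x, y = path[0], path[-1]
        end_ancestors = ancestors(x, edges) | ancestors(y, edges)
        for i in range(1, len(path) - 1):
            prev, v, nxt = path[i - 1], path[i], path[i + 1]
            collider = (prev, v) in edges and (nxt, v) in edges  # both arrows point into v
            if v in observed and not collider:
                return False      # an observed non-endpoint must be a collider
            if collider and v not in end_ancestors:
                return False      # every collider must be an ancestor of an endpoint
        return True

    # Figure 8.7 as reconstructed here, with F latent:
    edges = {("A", "B"), ("B", "C"), ("C", "D"), ("F", "B"), ("F", "C"), ("F", "D")}
    O = {"A", "B", "C", "D"}
    print(is_inducing_path(["A", "B", "F", "D"], edges, O))       # True
    print(is_inducing_path(["A", "B", "C", "F", "D"], edges, O))  # False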
One practical problem with the algorithm that I have presented for obtaining the undirected dependency graph is that, as the number of observed variables increases, the number of sets of conditioning variables increases geometrically. When faced with large numbers (say 50) of observed variables, even fast personal computers might take a long time to construct the undirected graph if the topology of the true causal graph is uncooperative. A slightly modified version of the algorithm[11] is presented by Spirtes, Glymour and Scheines (1993); it is more efficient when one is dealing with many observed variables. The two algorithms are equivalent given population measures of association, but the more efficient algorithm can make more mistakes in small data sets.
We sometimes have independent information about some of the
causal relationships governing our data. In such cases it is straightforward to
modify the algorithm for the undirected dependency graph to incorporate
[11] This modified algorithm is incorporated in their PC algorithm. The algorithm that I have described forms part of their SGS algorithm.
such information. If we know that the association between two observed
variables is due only to the fact that another measured variable, or set of
measured variables, is a common cause of both then we simply remove that
edge before applying the algorithm. Similarly, if we know that two observed
variables either have a direct causal relationship, or share at least one
common latent cause, then simply forbid the algorithm from considering
that pair. Note that it is not enough to know (say from a randomised experi-
ment) that one measured variable is a cause of another; we must know that
it is (or is not) a direct cause. A randomised experiment will not be able to
tell us this when some of the observed variables are attributes of the experi-
mental units, as explained in Chapter 1.
8.7 Orienting edges in the undirected dependency graph using
unshielded colliders assuming an acyclic causal structure
In Chapter 2 I discussed how d-separation predicts some counterintuitive results concerning statistical conditioning. Consider a simple causal graph of the form X→Z←Y. X and Y are causally independent and, since they are unconditionally d-separated, they are also probabilistically independent. However, if we condition on Z (the common causal descendant of both X and Y) then X and Y become conditionally dependent. This is because X and Y are not d-separated conditional on Z. In general[12], if we have two variables (X and Y) and condition on some set of variables Q that contains at least one common causal descendant of both X and Y, then X and Y will not be d-separated. Because of this X and Y will not be probabilistically independent upon conditioning on Q even if X and Y are causally independent.
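This counterintuitive prediction is easy to verify numerically. The following short simulation (a sketch of my own, using numpy) generates two causally independent variables X and Y, lets Z be a common causal child of both, and then compares the simple correlation of X and Y with their partial correlation conditional on Z.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 10_000
    x = rng.normal(size=n)
    y = rng.normal(size=n)            # X and Y are causally independent
    z = x + y + rng.normal(size=n)    # Z is a common causal descendant of X and Y

    def partial_corr(a, b, c):
        """Correlation of a and b after removing the linear effect of c from each."""
        res_a = a - np.polyval(np.polyfit(c, a, 1), c)
        res_b = b - np.polyval(np.polyfit(c, b, 1), c)
        return np.corrcoef(res_a, res_b)[0, 1]

    print(round(float(np.corrcoef(x, y)[0, 1]), 3))  # near 0: unconditionally independent
    print(round(float(partial_corr(x, y, z)), 3))    # strongly negative: dependent given Z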
This fact allows us to determine the causal direction of some lines in the undirected dependency graph. In Chapter 2 I defined an unshielded collider as a causal relationship between three variables (X, Y and Z) such that both X and Y are direct causes of Z (X→Z←Y) but there is no direct causal relationship between X and Y (i.e. there is no arrow going from one to the other[13]). Let's now define an unshielded pattern in an undirected dependency graph as one in which we have three variables (X, Y and Z) such that there is a line between X and Z, a line between Z and Y (X—Z—Y), but no line between X and Y. Since there is no line between X and Y we know that X
[12] This is true for acyclic causal structures but not for cyclic causal structures. This is discussed in more detail later.
[13] If we were dealing with a path diagram rather than a directed acyclic graph then there must not be any edge at all, either an arrow or a double-headed arrow, between X and Y.
and Y are d-separated given some subset of other variables in the undirected dependency graph. Given such an unshielded pattern we can decide whether or not there are arrowheads pointing into Z from both directions in the causal graph that generated the data. If there were arrowheads pointing into Z from both directions in the actual causal process generating the data, then X and Y would never be probabilistically independent conditional on any set of other observed variables that includes Z.
To illustrate this method of orienting our undirected paths in the undirected dependency graph, imagine that the unknown causal process generating our observed data is as shown in Figure 8.8. Even though the causal process is hidden from us, we will obtain the undirected dependency graph shown in Figure 8.8 once we apply the algorithm to our data. Now, since we don't know what the actual causal process looks like, we don't know whether there are latent variables generating some of the direct associations.
Before going on, let's introduce some more conventions for modifying our undirected dependency graph. A graph in which only some of the edges are oriented is called a partially directed graph or a partially oriented
Figure 8.8. The true (unknown) causal graph and the resulting
undirected dependency graph.
graph. Since we don't yet know whether or not there are arrowheads at the ends of any of the lines in our undirected dependency graph (i.e. we don't yet know the directions of the causal relationships shown in the causal graph at the top of Figure 8.8), let's admit this fact by adding an open circle (X o—o Y) at the end of each line (Figure 8.9). By doing this we are no longer dealing with an undirected graph; rather, we are dealing with a partially oriented graph whose directions are not yet known. An open circle simply means that we don't know whether or not there should be an arrowhead. Therefore, given X o—o Y the oriented edge in the true causal graph might be X→Y, X←Y or X↔Y. The final oriented edge (X↔Y) doesn't mean a feedback relationship between X and Y (remember our assumptions). Rather, it means that there is an unmeasured (latent) common cause generating the direct association between X and Y. It doesn't necessarily mean that there is a common latent cause of X and Y either, as Figure 8.6 makes clear.
The partially oriented graph in Figure 8.9 has six unshielded patterns, as given in Table 8.1. To orient some of the edges by detecting an unshielded collider, apply the following algorithm to each unshielded pattern (X o—o Z o—o Y).
8.8 Orientation algorithm[14] using unshielded colliders
Let the conditioning number (i ) be 1.
1. Form all possible conditioning sets of i observed variables consist-
ing of the variable in the middle of the unshielded pattern (Z) plus
any observed variables other than the variables at the ends of the
unshielded pattern (X and Y ). Call each such conditioning set Q.
2.1 If the partial association between X and Y, conditioned on any set Q of other variables, is zero then stop and conclude that the three variables forming the unshielded pattern do not form an unshielded collider in the true causal graph (i.e. not X→Z←Y). We can call such a pattern a definite non-collider.
2.2 If the partial association between X and Y, conditioned on every
set Q of other variables, is not zero, then increase the conditioning
number (i) by one and go to step 1.
[14] This algorithm is used in Pearl's IC (Inductive Causation) algorithm. The related algorithm in Spirtes, Glymour and Scheines (1993) uses a set called Sepset(X,Y) that reduces the computational burden. The output is identical in acyclic causal structures, but can be different in cyclic causal structures.
After cycling through all possible orders of i, if we have not declared the unshielded pattern to be a definite non-collider, then it is a collider. Orient the pattern as: X→Z←Y.
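A minimal Python sketch of this orientation step follows, again assuming a generic independence test independent(x, y, given) as in the earlier sketch; the names are my own. For an unshielded pattern it searches conditioning sets that contain the middle variable and declares a collider only if no such set renders the end variables independent.

    from itertools import combinations

    def orient_unshielded_pattern(x, z, y, variables, independent):
        """Classify the unshielded pattern x - z - y.

        Returns 'definite non-collider' if x and y become independent given some
        conditioning set containing z, and 'collider' (x -> z <- y) otherwise.
        """
        others = [v for v in variables if v not in (x, y, z)]
        for extra in range(len(others) + 1):       # first pass tests Q = {z} alone
            for rest in combinations(others, extra):
                q = (z,) + rest                    # every set Q must contain z
                if independent(x, y, q):
                    return "definite non-collider"
        return "collider"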
Since, in this example, we can peek at the true causal graph (top of Figure 8.9), we can use d-separation to predict what would happen if we applied the above algorithm to each of the six unshielded patterns that we found in our partially oriented graph. For instance, when we test the unshielded pattern A o—o B o—o C we would begin the algorithm by testing for a zero association (probabilistic independence) between A and C given B. The order of the conditioning set is initially 1 and we already have one variable (B). Since A and C are d-separated by B we will find A and C to be probabilistically independent given B and therefore stop right away, concluding that the causal effects do not collide at B. This unshielded pattern is a definite non-collider. The full results, and their explanations, are given in Table 8.2. You will notice one more new notation in Table 8.2. If we have concluded that an unshielded pattern is a definite non-collider (i.e. that there definitely are not arrowheads pointing into the middle variable), then underline the middle variable. Thus, the notation X o—o Y o—o Z with Y underlined means
Figure 8.9. The true causal graph and a corresponding partially oriented
graph with no orientations of the edges specied.
that we still don't know what the actual orientation is, but it is definitely not X→Y←Z.
The final partially oriented graph that results is shown in Figure 8.10. In fact, since the partially oriented edges indicate inducing paths, Spirtes, Glymour and Scheines (1993) called these partially oriented inducing path graphs or POIPGs.
At this point, other information about some of these partially oriented relationships might help us. If, for instance, we knew from previous work that the direct association between A and B was due at least in part to a common latent cause, then we could orient this as: A↔B. This would immediately restrict the orientations of the other lines, since we know that the three unshielded patterns of which B is the middle variable are all definite non-colliders. Therefore we can exclude A↔B←C and A↔B←D.
Because a randomised experiment, when it can be done, can give us information about causal direction, the combination of prior information from randomised experiments and these search algorithms can often be very useful. For instance, imagine that the five variables in Figure 8.10 represent five attributes of a plant and that we can't perform randomised
Table 8.2. Applying the orientation algorithm using unshielded colliders to the partially oriented graph in Figure 8.9

Unshielded pattern   Partially oriented pattern                    Explanation
A o—o B o—o C        A o—o B o—o C (B a definite non-collider)     B must be in Q. A and C are always d-separated given B and any other observed variable
A o—o B o—o D        A o—o B o—o D (B a definite non-collider)     B must be in Q. A and D are d-separated given B and any other observed variable
C o—o B o—o D        C o—o B o—o D (B a definite non-collider)     B must be in Q. C and D are d-separated given B and any other observed variable
B o—o C o—o E        B o—o C o—o E (C a definite non-collider)     C must be in Q. B and E are d-separated given {C,D} and any other observed variable
B o—o D o—o E        B o—o D o—o E (D a definite non-collider)     D must be in Q. B and E are d-separated given {D,C} and any other observed variable
C o—o E o—o D        C o→ E ←o D (a collider)                      E must be in Q. C and D are never d-separated given E plus any other observed variable
experiments to untangle the causal relationships between them. However, we can introduce a new variable that is a property of the external environment, for instance light intensity. It is possible to randomly allocate plants to the different treatment groups representing light intensity and so we can tell, for each of the five plant attributes, whether changes in light intensity cause changes in the attribute. For the reasons given in Chapter 1 we can't say that light intensity is a direct cause but we can say that, if the values of the attribute differ between treatments, then the causal signal (direct or indirect) goes from light intensity to the attribute and not the other way around. Now if, in an observational study, we measure the five attributes plus light intensity, and find that there is an edge in the partially oriented graph between light intensity and some of the plant attributes, then we can use this information to orient such edges. Once some edges are oriented this will usually help to orient others.
If we are willing to assume that there are no latent variables responsible for some of the lines in an undirected graph (i.e. causal sufficiency)
Figure 8.10. The true, but unknown, causal graph is shown at the top. The final partially oriented graph, with orientation using unshielded colliders, is shown below.
then we can further restrict the number of possible graphs. For instance, if we take the partially oriented graph in Figure 8.10 and assume causal sufficiency then there are only four different directed acyclic graphs that are compatible with the partially oriented graph (Figure 8.11). There were a huge number of potential causal graphs involving five variables in our initial hypothesis space and we have reduced this number to four.
8.9 Orienting edges in the undirected dependency graph using definite discriminating paths
In order to orient edges using unshielded colliders we require that the
pattern be unshielded. When we have a shielded pattern (three variables
Figure 8.11. On the left is the partially oriented graph of Figure 8.10.
On the right are the four possible completely directed acyclic graphs
without latent variables that are consistent with the partially oriented
graph.
with lines between each of the three pairs, forming a triangle) we can sometimes still orient the pattern if it is embedded within a special type of partially oriented path called a definite discriminating path.
Let's start with some undirected path (call it U) between two variables (X and Y) in a partly oriented graph that contains some other variable B. Even though the graph is only partially oriented (thus we don't know all of the asymmetrical relationships between the variables) there is a special type of undirected path that contains important information about the variable B. Before giving the formal definition, I have to introduce yet another symbol. If we look at a single variable and the edge coming into it then we can have three different symbols at the end of the edge. For instance, we can have →X, o—X or —X. The three different symbols are the arrowhead, the circle ('o') and the empty mark. Now, if I write X—* then the star is simply a placeholder that can refer to any of the three different symbols. So, if I say 'replace X *—* Y by X *—→ Y' then I mean keep whatever symbol was next to the X but change the symbol next to the Y to the symbol shown (here, an arrowhead).
Here is the definition of a definite discriminating path. An undirected path U is a definite discriminating path for variable B if and only if:
1. U is an undirected path between variables X and Y containing B.
2. X and Y are not adjacent.
3. B is different from both X and Y (i.e. B cannot be an endpoint of the path).
4. Every variable on U except for B and the endpoints (X,Y) is either a collider or a definite non-collider on U.
5. If two other variables on U (V and V′) are adjacent on U, and V′ is between V and B on U, then the orientation must be: V*→V′ on U.
6. If V is between X and B on U and V is a collider on U then the orientation must be either V→Y or else V←*X.
7. If V is between Y and B on U and V is a collider on U then the orientation must be either V→X or else V←*Y.
To see the usefulness of such definite discriminating paths, consider Figure 8.12. Each of the four unshielded patterns in the undirected dependency graph (X—V—V′, V—V′—A, V′—A—B and V′—A—Y) derived from this partially oriented graph allowed us to apply the algorithm for unshielded colliders; in this case all were determined to be definite non-colliders. Unfortunately, the shielded pattern involving A, B and Y can't be oriented this way. However, the undirected path X o—o V o—o V′ o—o A o—o B o—o Y is also a definite discriminating path for the variable B. What happens if, in the underlying causal graph, X and Y are d-separated
given A and B, but not given A alone? The only way that this could occur is if B were a definite non-collider along the undirected path (A o—o B o—o Y), since, if the orientation were really A→B←Y then conditioning on A and B would not d-separate X and Y. So we can definitely state that the partial orientation is A o—o B o—o Y with B a definite non-collider. Yet this is not all. Since we have assumed that the unknown causal graph is acyclic, there are only two different partially oriented acyclic causal graphs that accord with this information (Figure 8.13).
Now we can put all of the pieces together and state the Causal Inference algorithm of Spirtes, Glymour and Scheines (1993).
8.10 The Causal Inference algorithm[15]
1. Apply the algorithm to obtain the undirected dependency graph.
2. Orient each edge in the undirected dependency graph as o—o.
3. Apply the orientation algorithm using unshielded colliders. For each unshielded pattern (A o—o B o—o C) orient unshielded colliders as A o→ B ←o C and orient each definite non-collider as A o—o B o—o C with B underlined.
4. If there is a directed path from A to B, and an edge A*—*B, orient A*—*B as A*→B.
5. If B is a collider along a path A*→B←*C, B is also adjacent to another variable D (i.e. B*—*D), and A and C are conditionally independent[16] given D, then orient B*—*D as B←*D.
6. If there is an undirected path U that is a definite discriminating path between variables A and B for variable M, and variables P and R are adjacent to M along U, and P*—*M*—*R forms a triangle, then
[15] My description of the Causal Inference algorithm differs from the original formulation only in replacing Sepset sets with the actual d-separation claim.
[16] There was an error in Spirtes, Glymour and Scheines (1993) at this point, which was corrected in a subsequent Erratum.
Figure 8.12. A partially oriented graph involving six observed variables. The undirected path X o—o V o—o V′ o—o A o—o B o—o Y is a definite discriminating path for the variable B.
Figure 8.13. The partially oriented graph on the left implies only two alternative partially oriented acyclic graphs on the right.
i. If A and B are conditionally independent given M plus any other variables except A and B, then P*—*M*—*R along U is oriented as a definite non-collider: P*—*M*—*R with M underlined.
ii. If A and B are never conditionally independent given M plus any other variables except A and B, then P*—*M*—*R along U is oriented as a collider: P*→M←*R.
iii. If the triangle is already oriented as P*→M*—*R then orient it as P*→M→R.
Repeat steps 4-6 until no further changes can be made.
The result is a partially oriented inducing path graph. You should be able to understand steps 1 to 3 by now. Step 4 is justified by the assumption that there are no cyclic relationships in the causal structure. If there is a directed path from variable A to variable B and we were to also orient the direct edge as B→A then this would create a cyclic path. Step 5 is simply a generalisation of the reason for orienting an unshielded collider. Remember that two such variables will never be d-separated if we condition on one of their common causal descendants. Since we have already established that the orientation is A*→B←*C, that there is another edge B*—*D, and that A and C are conditionally independent given D (i.e. they are not d-connected given D), d-separation predicts that A and C would become probabilistically dependent when conditioned on D if the orientation were B→D, and remain probabilistically independent if the orientation were B←*D. Steps 6i and 6ii derive from the notion of a definite discriminating path, as described before. In step 6iii we have already established that M is a non-collider along P*→M*—*R. Therefore there cannot be arrowheads pointing into M from both directions and we can orient the triplet as P*→M—*R. There is now only one orientation possible, namely P*→M→R.
8.11 Equivalent models
The inferential testing of structural equations models, described in Chapters
3 to 7, consisted of deriving the observational predictions of the hypothe-
sised causal process (the correlational shadows) and then comparing the
observed and predicted patterns of correlation or covariation. I have empha-
sised that failing to reject such an hypothesised model provides support for
it, but does not allow us to accept it without other (non-statistical) evidence.
One reason might be that the sample size was too small to permit us to
detect a real (but small) deviation between the observed and predicted pat-
terns. However, the search algorithms in this chapter should alert us to
another reason: different causal processes can cast the same observational shadows.
This leads to the topic (rarely discussed in the SEM literature) of observationally equivalent models; that is, different causal models that can't be distinguished on the basis of observational data. Such equivalent models will produce exactly the same chi-squared values, and exactly the same probability levels, when tested against the same data set. This is true no matter how big your data set is. In fact, it will be true even if you have the population values rather than sample values. When we test a structural equations model we are really testing the entire set of observationally equivalent models against all non-equivalent models. In one sense this might be disappointing; we can't distinguish between some competing causal explanations. In another sense this is useful; when we reject a particular model we are also simultaneously rejecting all of the observationally equivalent models as well.
The search algorithms in this chapter can allow us to find all of the causal models that are observationally equivalent[17] to our hypothesised one.
Given your path diagram, here are the steps:
1. Change all arrows (even double-headed ones[18]) to lines.
2. Draw the 'o' symbol at either end of each line.
3. Redraw each unshielded collider that was in the original path diagram; that is, if there was an unshielded collider in the path diagram (X→Z←Y) then replace X o—o Z o—o Y with X o→ Z ←o Y.
4. For each non-collider triplet that was in the original path diagram, add an underline; that is, if there was either X→Z→Y, X←Z←Y or X←Z→Y in the path diagram then replace X o—o Z o—o Y with X o—o Z o—o Y with Z underlined.
Figure 8.14 summarises these steps.
At this point you can permute the different possible orientations so long as you never introduce an unshielded collider that was not in your original path diagram, and never remove an unshielded collider that was in your original path diagram.
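For small models these steps can be carried out by brute force. The sketch below (my own illustration, ignoring double-headed arrows for simplicity) takes a path diagram expressed as a set of directed edges, strips the orientations, and returns every acyclic reorientation of the same skeleton that has exactly the original unshielded colliders; by the Verma and Pearl (1991) result these are the observationally equivalent acyclic models. Because it tries every combination of edge directions it is only practical for small diagrams.

    from itertools import product, combinations

    def unshielded_colliders(edges):
        """Triples (x, z, y) with x -> z <- y where x and y are not adjacent."""
        adjacent = {frozenset(e) for e in edges}
        found = set()
        for (a, z1), (b, z2) in combinations(edges, 2):
            if z1 == z2 and frozenset((a, b)) not in adjacent:
                found.add((min(a, b), z1, max(a, b)))
        return found

    def is_acyclic(edges):
        nodes = {v for e in edges for v in e}
        remaining = set(edges)
        while nodes:
            sinks = {v for v in nodes if not any(p == v for (p, _) in remaining)}
            if not sinks:
                return False      # every remaining node has an outgoing edge: a cycle
            nodes -= sinks
            remaining = {e for e in remaining if e[1] not in sinks}
        return True

    def equivalent_dags(edges):
        """All acyclic orientations of the skeleton of `edges` that have the same
        unshielded colliders as the original path diagram."""
        skeleton = [tuple(sorted(e)) for e in {frozenset(e) for e in edges}]
        target = unshielded_colliders(set(edges))
        results = []
        for flips in product((False, True), repeat=len(skeleton)):
            candidate = {(b, a) if flip else (a, b)
                         for (a, b), flip in zip(skeleton, flips)}
            if is_acyclic(candidate) and unshielded_colliders(candidate) == target:
                results.append(candidate)
        return results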
[17] The algorithm for observational equivalence in acyclic models was first published by Verma and Pearl (1991).
[18] A model having two variables sharing correlated errors (i.e. a double-headed arrow between them) is equivalent in its d-separation consequences to a model having a latent variable that is a common cause of both (Spirtes et al. 1998).
8.12 Detecting latent variables
One practical problem with the Causal Inference algorithm is that it can be quite uninformative when many observed variables are all caused by a small number of latent variables. Consider the simple measurement model (graph A) shown in Figure 8.15 and the resulting output from the Causal Inference algorithm.
Figure 8.14. The top graph shows a path diagram. The following two
graphs show the steps in obtaining all models that are equivalent to the
path diagram.
The output of the Causal Inference algorithm tells us that each of the observed variables (A to D) is probabilistically associated with each of the others. Since there are no unshielded patterns among the observed variables in either of the two output graphs, we can't orient any of the edges. Whenever you see a set of observed variables that form such a pattern (I will call this a saturated pattern) you should suspect latent variables. However, it is possible for such saturated patterns to arise even without latent variables, as causal process B shows. Is there any way of differentiating between the two? Yes. For this, we need to look again at vanishing tetrad equations, which we studied briefly in Chapter 5.
You will recall that Spearman (1904) derived a set of equations, called vanishing tetrads, which must be true given the type of structure shown in the causal graph A in Figure 8.15. He argued that if such vanishing tetrad equations held then this was evidence for the presence of a common latent cause of the observed variables. As will be explained below, this claim is not true, but a modification of it can indeed be used to detect such a common latent cause if the relationships between the latent variable and the observed variables are linear.
Figure 8.15. On the top are directed acyclic graphs of two different
causal processes that both imply the same partially oriented graph on
the bottom.
A vanishing tetrad equation is a function of four correlation (or covariance) coefficients. Because of the causal structure of models like those in Figure 8.16, and because of the rules of path analysis, such a vanishing tetrad equation must be zero in the population regardless of the (non-zero) values of the path coefficients. For instance, the population correlation coefficient (ρ_AB) between the observed variables A and B is a·b. The population correlation coefficient (ρ_CD) between the observed variables C and D is c·d. Therefore, ρ_AB·ρ_CD = a·b·c·d. However, we also know that ρ_AC·ρ_BD = a·c·b·d. It follows that ρ_AB·ρ_CD − ρ_AC·ρ_BD = 0, since a·b·c·d − a·c·b·d = 0. The tetrad equation (ρ_AB·ρ_CD − ρ_AC·ρ_BD) becomes zero, or vanishes, because of the way the observed variables relate to the latent variable. The causal process shown in Figure 8.16 implies three different vanishing tetrad equations (of which only two are independent). In fact, every set of four variables can have three possible tetrad equations regardless of the true causal process, although they don't have to be zero.

ρ_AB·ρ_CD − ρ_AD·ρ_BC = 0
ρ_AC·ρ_BD − ρ_AD·ρ_BC = 0
ρ_AC·ρ_BD − ρ_AB·ρ_CD = 0
Unfortunately, a causal structure like the one in Figure 8.17 also implies vanishing tetrad equations. For instance, using the rules of path analysis we will find that ρ_AC·ρ_BD − ρ_AD·ρ_BC = (ab)(bc) − (abc)(b) = 0. Clearly, simply showing that a vanishing tetrad equation holds is not evidence for the presence of a common latent cause of the observed variables. Although
Figure 8.16. A directed graph involving four observed variables and one
latent variable (F).
it may not seem immediately obvious, there is a close relationship between d-separation and vanishing tetrad equations.[19]
A vanishing tetrad equation can be given a graphical interpretation. Let's define a trek between two variables (X,Y) as a pair of directed paths; one directed path goes from a source variable (S) to X and the other directed path goes from the same source variable to Y. One of the two directed paths can be of length 0 (i.e. S = X or S = Y). For instance, in Figure 8.18 there are three treks between X and Y. One is from the source variable S1 (X←Z←S1→Y), one is from the source variable S2 (X←S2→Y) and one is from the source variable X in which one directed path is of length zero
[19] Theorem 6.11 of Spirtes, Glymour and Scheines (1993) states that a vanishing tetrad equation of the type ρ_IJ·ρ_KL − ρ_IL·ρ_JK = 0 is linearly implied by a directed acyclic graph only if either ρ_IJ or ρ_KL equals zero and either ρ_IL or ρ_JK equals zero, or there is a (possibly empty) set Q of variables in the directed acyclic graph such that ρ_IJ.Q·ρ_KL.Q − ρ_IL.Q·ρ_JK.Q = 0.
Figure 8.17. This directed acyclic graph also implies a vanishing tetrad (ρ_AC·ρ_BD − ρ_AD·ρ_BC = (ab)(bc) − (abc)(b) = 0) even though there are no latent variables.
Figure 8.18. A directed acyclic graph used to illustrate the concept of a
trek.
(X→Y). I will write T(X,Y) to mean a trek between X and Y, T(X,Y) to mean the set of all treks between X and Y, and X(T(X,Y)) to mean the directed path in a trek between X and Y that goes into X.
In Figure 8.19 there are three different treks between X and Y: X←S→Y, X←S→V→Y and X←S→V→W→Y. Notice that all the directed paths in all these treks leading into X pass through S. When this occurs we say that S is a choke point for X(T(X,Y)). There was no choke point for X(T(X,Y)) in Figure 8.18.
To see what all this has to do with vanishing tetrads let's consider a set of four variables (I, J, K, L). If we have a set of treks T(I,J) between two variables (I,J) and a set of treks T(K,L) between two other variables (K,L), and all of the directed paths in T(I,J) that are into J (i.e. J(T(I,J))) and all of the directed paths in T(K,L) that are into L (i.e. L(T(K,L))) intersect at the same variable Q, then Q is called a JL choke point. The Tetrad Representation Theorem (Spirtes, Glymour and Scheines 1993) states that if we see a vanishing tetrad in the statistical population (ρ_IJ·ρ_KL − ρ_IL·ρ_JK = 0) then this means that there is either a JL choke point or an IK choke point.
How can vanishing tetrads help to detect the presence of latent variables? If you see a saturated pattern in your undirected dependency graph involving four variables[20], then test to see whether there are any vanishing tetrads between these variables. If vanishing tetrads exist then this is evidence for a latent variable. To see why, consider that if the choke point implied by this vanishing tetrad were an observed variable then the two variables (J,L) or (I,K) would be d-separated by this choke point and therefore could not form part of a saturated pattern.[21]
[20] If there are more than four variables forming a saturated pattern, then take each unique set of four variables.
[21] In Figure 8.17 there was a vanishing tetrad but no latent variable. The treks between each of the four pairs of variables (AC, BD, AD and BC) all had directed paths of zero length (A→B→C, B→C→D, A→B→C→D and B→C). The choke point for these four treks was the variable B, which was an observed variable. This is part of the reason why these four variables do not form a saturated pattern.
Figure 8.19. A directed acyclic graph used to illustrate the concept of a
choke point for a set of treks.
This fact provides a simple algorithm to test whether the observed correlations among a set of four observed variables are due to a common latent cause. These are the steps.
8.13 Vanishing Tetrad algorithm
Given a set O of observed variables and a set T of four observed variables from O that form a saturated pattern in the undirected dependency graph, assume that there is no reason to invoke a common latent cause for these four variables in T and then do the following:
1. Choose one of the three tetrad equations that are possible given the four chosen variables in T. If you have tried all three then stop.
2. If the tetrad equation does not equal zero, go to step 1.
3. If the tetrad equation does equal zero then there is a latent variable that forms the IK choke point or the JL choke point for the treks T(I,J), T(K,L), T(I,L) and T(J,K).
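With population correlations the three tetrad differences can be computed directly; with sample correlations one would additionally need a sampling-theory test of whether each sample tetrad differs from zero. Here is a bare-bones sketch of my own that simply computes the three tetrad differences for four named variables from a correlation table supplied as a nested Python dictionary; values near zero correspond to vanishing tetrad equations.

    def tetrad_differences(r, i, j, k, l):
        """The three tetrad differences for variables i, j, k and l.
        `r` is a symmetric correlation table, e.g. r['A']['B'] = 0.35."""
        return {
            f"r{i}{j}*r{k}{l} - r{i}{l}*r{j}{k}": r[i][j] * r[k][l] - r[i][l] * r[j][k],
            f"r{i}{k}*r{j}{l} - r{i}{l}*r{j}{k}": r[i][k] * r[j][l] - r[i][l] * r[j][k],
            f"r{i}{k}*r{j}{l} - r{i}{j}*r{k}{l}": r[i][k] * r[j][l] - r[i][j] * r[k][l],
        }

    # Population correlations implied by Figure 8.16 when a = b = c = d = 0.5:
    a = b = c = d = 0.5
    corr = {"A": {"B": a*b, "C": a*c, "D": a*d},
            "B": {"A": a*b, "C": b*c, "D": b*d},
            "C": {"A": a*c, "B": b*c, "D": c*d},
            "D": {"A": a*d, "B": b*d, "C": c*d}}
    print(tetrad_differences(corr, "A", "B", "C", "D"))  # all three differences are zero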
To illustrate this algorithm, let's go back to the two graphs in Figure 8.15. The undirected dependency graph will contain a saturated pattern for variables A, B, C and D. Here again are the three tetrad equations:

ρ_AB·ρ_CD − ρ_AD·ρ_BC = 0
ρ_AC·ρ_BD − ρ_AD·ρ_BC = 0
ρ_AC·ρ_BD − ρ_AB·ρ_CD = 0
All three tetrad equations vanish in graph A of Figure 8.15. Because the first equation vanishes we know that there is either an AC and/or a BD choke point. Because the second equation vanishes we know that there is either an AB and/or a CD choke point. Because the third equation vanishes we know that there is either an AD or a BC choke point. In fact all these choke points exist in this graph and all are the same variable (F). If we do the same thing to graph B of Figure 8.15 we will see that no tetrad equation vanishes.[22]
[22] Unless the graph is unfaithful. It is always possible to choose path coefficients in such a way as to make a particular tetrad equation vanish, but the vanishing tetrad equation is not implied by the topology of the graph.

Let's go on and apply the Vanishing Tetrad algorithm to a causal
graph involving two latent variables (Figure 8.20). Here are the three tetrad equations:

ρ_AB·ρ_CD − ρ_AD·ρ_BC = (ab)(cd) − (afd)(bfc) = abcd(1 − f²) ≠ 0
ρ_AC·ρ_BD − ρ_AD·ρ_BC = (afc)(bfd) − (afd)(bfc) = 0
ρ_AC·ρ_BD − ρ_AB·ρ_CD = (afc)(bfd) − (ab)(cd) = abcd(f² − 1) ≠ 0

Notice that the only tetrad equation that vanishes is the one with either an AB and/or a CD choke point. In fact, both choke points exist. All the directed paths leading into A and B of all treks between the four pairs of variables (AC, BD, AD and BC) pass through F1. All the directed paths leading into C and D of all treks between these four pairs of variables also pass through F2.
8.14 Separating the message from the noise
The ancients knew how to discover causal relationships. Things happened in the world because the gods willed them. One had only to ask and, if the gods were willing and the diviner gifted, the causes would be revealed. Unfortunately, the gods were capricious and their words couched in allegory. A good seer had to be able to separate the message from the noise, to know when a bump in a goat's intestine foretold war and when it was simply undigested grass.[23] If the methods presented in this chapter are the modern version of the diviner's art then we still need to separate the causal message from the sampling noise.
The various algorithms all require that we know whether or not sets of random variables are independent. We are constantly being asked: Is
[23] The ancient Greek philosophers were the first to conceive of a world governed by natural causes rather than by divine will. They then confronted the subject of this chapter. Democritus (460–370 BC) is reported to have said: 'I would rather discover one causal law than be King of Persia' (Pearl 2000).
Figure 8.20. A path diagram involving four observed variables and two latent variables; a, b, c, d and f are path coefficients.
the statistical association zero or different from zero? So far I have assumed that we can always answer such a question unambiguously because I have assumed that we have access to the entire statistical population. If correlations are the shadows cast by causes then I have assumed that these shadows are always crisp and well defined. Given such an assumption we can extract an amazing amount of causal information from purely observational data; certainly much more than is intimated by the old mantra that correlation does not imply causation.
Let's get back to reality. We almost never have access to the entire statistical population. Rather, we collect observations from random samples of the statistical population and these random samples are not perfect replicas of the entire population. If correlations are the shadows cast by causes then sample correlations are randomly blurred correlational shadows. We have to find a way of dealing with the imperfect information contained in these blurred correlational shadows. Inferring population values from sample values is the goal of inferential statistics and inferential statistics is the art of drawing conclusions based on imperfect information. In practice we can never unambiguously know whether the statistical association is zero or different from zero. How can we deal with this problem when applying the various discovery algorithms and what sort of errors might creep into our results? To see this, we first need to review some basic notions of hypothesis testing.
Consider the problem of determining whether the population value of a Pearson correlation coefficient (ρ_XY) between two random variables, X and Y, is zero based on the measured sample value (r_XY). There are only two possible choices: either it really is or it really isn't. Similarly, we can only give one of two answers based on our sample measure: either we think it is or we think it isn't. These define four different outcomes (Table 8.3).
Normally, the types of biological hypothesis that interest us are ones in which variables are associated, not independent. Because we want evidence beyond reasonable doubt before accepting this interesting hypothesis (see Chapter 2) we usually begin by assuming the contrary, that there is no association, and then look for strong evidence against this assumption before rejecting it and therefore accepting our biologically interesting one. In other words, we want to see a value of r_XY that is sufficiently large that there is very little probability that it would have come from a statistical population in which ρ_XY = 0 is true.
So what is a small enough probability that we would be willing to declare that ρ_XY really is different from zero? This is a somewhat subjective decision, as described in Chapter 2, but Table 8.3 shows that part of our decision will depend on how important it is for us to avoid either a Type I error (incorrectly declaring that ρ_XY ≠ 0 when, in reality, ρ_XY = 0) or a Type II error (incorrectly declaring that ρ_XY = 0 when, in reality, ρ_XY ≠ 0).
Because the presence of a real association usually (but not always) gives us useful biological information, and because we know that our ever-present sampling variation can sometimes fool us into observing a large value of r_XY even when the variables are independent, we usually place more importance in reducing our Type I errors than in reducing our Type II errors. Therefore, we usually choose a small probability before we are willing to declare our value of r_XY as being significantly different from zero. For instance, choosing a significance level of 0.05 means that we are only willing to accept a 5% chance of making a Type I error. Notice, however, that by decreasing our significance level to the low value of 0.05 we are simultaneously willing to accept a larger chance of making a Type II error. This is usually okay because we have already decided that it is more important to be quite sure that ρ_XY is not zero than to be quite sure that ρ_XY is zero.
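In practice, the decision of whether r_XY is far enough from zero is usually made by transforming the sample correlation. The sketch below (my own, using only the Python standard library) applies Fisher's z-transformation and returns the two-sided probability of observing a sample correlation at least as extreme as r_XY when ρ_XY = 0; the association is declared non-zero whenever this probability falls below the chosen significance level.

    from math import atanh, sqrt, erfc

    def p_value_zero_correlation(r, n):
        """Two-sided p value for H0: rho = 0, via Fisher's z-transformation.
        atanh(r) is approximately normal with standard deviation 1/sqrt(n - 3)
        when the true correlation is zero."""
        z = atanh(r) * sqrt(n - 3)
        return erfc(abs(z) / sqrt(2))   # equals 2 * (1 - Phi(|z|))

    # A sample correlation of 0.10 based on 300 observations:
    print(round(p_value_zero_correlation(0.10, 300), 3))  # about 0.084, not significant at 0.05

At a significance level of 0.05 and n = 300 this criterion corresponds to a cut-off of roughly |r| < 0.113, the value used in the worked example of section 8.15.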
In each of the algorithms described in this chapter you are repeatedly asked to decide whether the measure of association is zero or not. You have to make this choice based on a random sample of data and, therefore, you have to conduct statistical tests and choose how important it is to minimise either Type I or Type II errors. Here's the rub: by definition, these are discovery algorithms. You don't already have a preferred causal hypothesis that you wish to test. You can't have any a priori preference for either ρ_XY = 0 or ρ_XY ≠ 0 and neither outcome provides more information than the other to the various algorithms.[24]
This brings us to the notion of statistical power.
[24] One exception might be in the orientation phase of the algorithms. It might be better to plead ignorance, and leave edges unoriented, than to make a definite choice about declaring an unshielded pattern to be a collider or a definite non-collider.
Table 8.3. Possible combinations of decisions about a null hypothesis and its alternative, giving rise to Type I and Type II errors

                                            True value in the statistical population
Your answer after looking at r_XY           ρ_XY = 0          ρ_XY ≠ 0
and calculating its probability:
   ρ_XY = 0                                 Right choice      Type II error
   ρ_XY ≠ 0                                 Type I error      Right choice
The hypothesis that ρ_XY ≠ 0 is not really a single hypothesis at all; rather, it is a composite hypothesis that includes ρ_XY = 0.01, ρ_XY = 0.1, ρ_XY = 0.9 and an infinite number of other individual hypotheses. Intuitively, it is obvious that it would be much more difficult to distinguish between 0.0 and 0.01 than between 0.0 and 0.9 in any sample of data. If we had a huge data set (say a thousand observations) and the population value was 0.0 then our sample correlation would almost always be extremely close to zero (Figure 8.21). Sampling variation would only very rarely result, by chance, in a value greater than even a low number such as 0.1. Therefore, if the population value was even slightly different from 0.0 then we would almost always find a very small probability value for our measured r_XY and would almost never conclude that ρ_XY = 0 when, in reality, ρ_XY ≠ 0. Our test would be very powerful in detecting even very slight associations between X and Y. If, on the other hand, we had only a small data set (say 10 observations) and our population value was 0.0 then our sample correlation would fluctuate quite widely around zero simply due to sampling variation (Figure 8.21). Therefore, even if the population value was quite different from 0.0 (say ρ_XY = 0.4) we would often observe sample values close to zero simply due to sampling variation. Because of this, the probability of incorrectly concluding that ρ_XY = 0 would not be negligible. Our test would not be very powerful in detecting even moderate associations between X and Y and we would make Type II errors more often.
Figure 8.21. The probability of observing a Pearson correlation coefficient of various values when the population value is zero, at two different sample sizes.
The power of a statistical test is the probability of rejecting the null hypothesis when, in fact, it is false. It is defined as 1 − β, where β is the probability of a sample statistic, taken from a statistical population in which the null hypothesis is false, falling within the acceptance region of the null hypothesis. The power is affected by sample size, by the significance level chosen for rejecting the null hypothesis and by the difference (the effect size) between the true value of the test statistic and the value assumed by the null hypothesis. Figure 8.22 plots the statistical power to reject the null hypothesis that ρ = 0 when the true value varies from −0.9 to 0.9 at two sample sizes (30 and 300 observations) and three different significance levels (0.05, 0.10 and 0.20).
Figure 8.22 clearly shows the compromise that must be made. If the null hypothesis (ρ = 0) is true, then increasing the significance level from 0.05 to 0.2 increases our chances of incorrectly rejecting the null hypothesis (a Type I error); we will be incorrectly declaring associations to exist more often when they really don't. On the other hand, increasing the significance level from 0.05 to 0.2 increases our power to reject the null hypothesis when associations really do exist but are weak. As sample size increases then power increases irrespective of the chosen significance level. This is a good thing because now we can increase our power to detect real, but weak, associations without increasing our significance level and therefore without increasing our chances of falsely accepting associations that don't really exist. At large sample sizes we are best to set a low significance level (say 0.05 or even 0.01), since at such large sample sizes we will keep both Type I and Type II error rates low. At small sample sizes we are best to increase our significance level if, in fact, we don't have any preference for the presence or absence of a real association. This is because, at a low significance level, only very large values of the correlation coefficient would have a reasonable chance of being detected. As the sample size increases, the power approaches 1.0 even as α approaches 0, meaning that the chances of committing both Type I and Type II errors approach zero.
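The power curves of Figure 8.22 can be approximated with the same Fisher z machinery. The following sketch (my own; it relies only on the standard library's NormalDist) gives the approximate probability of rejecting ρ = 0 in a two-sided test at significance level alpha when the true population correlation is rho and the sample size is n.

    from math import atanh, sqrt
    from statistics import NormalDist

    def power_correlation_test(rho, n, alpha):
        """Approximate power to reject H0: rho = 0 (two-sided, Fisher z approximation)."""
        norm = NormalDist()
        crit = norm.inv_cdf(1 - alpha / 2)   # critical value on the standard normal scale
        shift = atanh(rho) * sqrt(n - 3)     # where the test statistic is centred
        return (1 - norm.cdf(crit - shift)) + norm.cdf(-crit - shift)

    # Power to detect a true correlation of 0.2 at a significance level of 0.05:
    print(round(power_correlation_test(0.2, 30, 0.05), 2))   # about 0.18 with 30 observations
    print(round(power_correlation_test(0.2, 300, 0.05), 2))  # about 0.94 with 300 observations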
You will see that significance levels of 0.2, 0.4 or even higher might be used with very small sample sizes. Clearly, applying these algorithms to small samples means accepting more and more errors due to sampling fluctuations. Remember that these are exploratory methods, not methods designed to test a preconceived hypothesis. If you set out to hike through an unfamiliar area then you would probably take a map. The search algorithms are like imperfect maps to a causal landscape. At large sample sizes these maps will give you all the detail that can be obtained even though no one map might be able to provide all the information that is wanted. At small sample sizes these maps will only give you the major hiking trails, may
Figure 8.22. The power to reject the null hypothesis that the Pearson correlation coefficient is zero in the population, when the true value of the population Pearson correlation takes on different values. This power is affected by the sample size and the chosen significance level (α) used to reject the null hypothesis.
quite possibly miss some of the smaller trails and even include some incorrect paths. Independent tests using new data are always important after applying the search algorithms but this is especially important as sample size decreases. None the less, you will see that the error rates are not that bad, especially for constructing the undirected dependency graph, even at small sample sizes. With these points in mind, let's look at the Causal Inference algorithm in the presence of sampling error.
8.15 The Causal Inference algorithm and sampling error
At this stage it is useful to look at a numerical example. Figure 8.23 shows a path model from which I will generate sample data and apply the Causal Inference algorithm. I will generate two different data sets, one with 30 observations and one with 300 observations. Each path coefficient equals 0.4. Let's begin with the larger sample size (N = 300). Table 8.4 shows the variances on the diagonal, the covariances below the diagonal, and the correlations above the diagonal, for a simulated data set with N = 300.
The first step is to obtain the undirected dependency graph.
Figure 8.23. Path model used to generate sample data.
I will choose a significance level[25] of 0.05. After constructing the saturated undirected graph, I then remove any lines between variables whose zero-order correlations are not significantly different from zero at a rejection level of 0.05; that is, if |r| < 0.113. There is one correlation coefficient in Table 8.4, between A and E, that is judged to be zero. Note that this decision is actually a Type II error, since A and E are really associated with a weak population coefficient of 0.128. However, this does not introduce any errors in the undirected dependency graph, since this graph is concerned only with direct associations. In other words, the algorithm is robust to these types of error.
I next look at each of the 10 pairs of variables ({A,B}, {A,C}, ..., {D,E}) and, for each, test for zero first-order partial correlations. That is, for each pair I calculate the partial correlation conditional on each of the remaining three variables in turn. With each test I see whether the absolute value of the partial correlation is less than 0.113 and, if it is, I remove the line joining the two variables in the pair. When I do this I find only four first-order partials that are judged to be zero: r_AC|B = 0.008 (p = 0.89), r_AD|B = 0.062 (p = 0.28), r_AE|B = 0.022 (p = 0.70) and r_CD|B = 0.069 (p = 0.23). Note that we can't remove the line between A and E, since it was already removed (by error) when looking at the zero-order correlations.[26] This is why I said that the algorithm is robust.
[25] This significance level refers to each individual statistical test, not the final partially oriented graph. From Figure 8.22 I know that I will have almost 100% power to detect correlations whose population values are greater than 0.2 in absolute value, although, in an empirical study, I would not know what the population values were.
[26] Actually the algorithm would not even calculate this partial correlation, since the two variables are not adjacent at this stage.
Table 8.4. Variances (diagonal), covariances (lower subdiagonal) and correlations (upper subdiagonal) of 300 simulated data from a multivariate normal distribution generated according to the causal structure shown in Figure 8.23

      A      B      C      D      E
A   0.93   0.35   0.14   0.08   0.10
B   0.35   1.03   0.42   0.38   0.35
C   0.15   0.45   1.11   0.22   0.52
D   0.07   0.37   0.22   0.94   0.52
E   0.11   0.38   0.59   0.54   1.12
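The first-order partial correlations reported above follow from the usual recursion formula. As a sketch (my own), here is the calculation of r_AC|B from the correlations in the upper triangle of Table 8.4; because the table is rounded to two decimals the result agrees with the reported 0.008 only in magnitude.

    from math import sqrt

    def first_order_partial(r_xy, r_xz, r_yz):
        """Partial correlation of x and y controlling for z."""
        return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz**2) * (1 - r_yz**2))

    # Correlations read from Table 8.4: r_AC = 0.14, r_AB = 0.35, r_BC = 0.42
    print(round(first_order_partial(0.14, 0.35, 0.42), 3))  # about -0.008: essentially zero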
I next look at each of the remaining pairs of variables that are still adjacent and, for each, test for a zero second-order partial correlation. That is, for each pair I calculate the partial correlation conditional on each possible pair of the remaining three variables in turn and remove the line between any pair whose absolute value of the second-order partial is less than 0.113. There is only one such zero second-order partial: r_BE|{C,D}.
Figure 8.24. Final output of the algorithm when using different significance levels for the orientation phase.
At this stage I can use my d-sep test to evaluate each of the three partially oriented graphs^27. When I do this I find that the equivalent graphs that result when the significance level used in the orientation stage is 0.3 or larger are clearly rejected. This is because these graphs all predict that A and C are unconditionally independent. In fact, r_AC = 0.35 and, with a sample size of 300, this would occur much less than once in a million times if the data were really generated according to this graph. Therefore, we have to choose between the first two partially oriented graphs. When we look at the first partially oriented graph we see that it is impossible for all of the unshielded patterns to be definite non-colliders and for there to also be no cycles in any equivalent DAG that is consistent with it. We are led to accept the middle partially oriented graph as the most consistent with our data.
Figure 8.25 shows the results when the undirected dependency graph algorithm is applied to a sample data set, generated from Figure 8.23, but with a small sample size of 30 observations. Now we see many errors. No equivalent model from the undirected dependency graph, obtained using α = 0.01, provides an acceptable fit to the data. However, all of the remaining undirected graphs have equivalent models that do produce an acceptable fit, based on the d-sep test and a significance level of 0.05. To go any further requires information beyond that which exists in this little data set. For instance, the undirected graphs at α = 0.05 to 0.2 all predict that A and B are independent of C, D and E^28. If, in other studies, either A or B was found to be correlated with C, D or E, then this undirected graph could be rejected^29. Once you decide upon a particular undirected graph as being most consistent with all of the information, then you can begin to explore the orientation phase.
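One practical response to this instability, mentioned in note 29, is to imbed the construction of the undirected dependency graph inside a bootstrap loop and retain only those lines that appear in most of the resampled data sets. A minimal sketch of the idea (Python; for brevity only the zero-order correlations are tested here, and the function names are mine):

    import numpy as np
    from itertools import combinations
    from scipy.stats import norm

    def edge_present(data, i, j, alpha):
        """Keep the i-j line if the zero-order correlation is non-zero at level
        alpha (Fisher z test); a full run would also test the higher-order
        partials, as in the Causal Inference algorithm."""
        n = data.shape[0]
        r_ij = np.corrcoef(data[:, i], data[:, j])[0, 1]
        z = np.arctanh(r_ij) * np.sqrt(n - 3)
        return 2 * (1 - norm.cdf(abs(z))) < alpha

    def bootstrap_edges(data, alpha=0.1, n_boot=200, rng=None):
        rng = np.random.default_rng(rng)
        n, p = data.shape
        counts = {pair: 0 for pair in combinations(range(p), 2)}
        for _ in range(n_boot):
            sample = data[rng.integers(0, n, size=n)]     # resample rows
            for i, j in counts:
                counts[(i, j)] += edge_present(sample, i, j, alpha)
        # proportion of bootstrap samples in which each line was retained
        return {pair: c / n_boot for pair, c in counts.items()}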
The best way to see what types of error these search algorithms will make at different sample sizes is to generate data with different characteristics and count the types of error that occur. Spirtes, Glymour and Scheines (1993) have conducted such simulation studies for the case of causal sufficiency and using both the algorithms described here and also others that are more efficient with large numbers of variables, but which commit more
27. Every possible partially oriented graph can be tested in this way. Simply choose one of the equivalent DAGs consistent with the partially oriented graph and apply the d-sep test. Since every equivalent graph will give the same probability level under the null hypothesis it doesn't matter which one you choose.
28. Since there are no undirected paths between these two sets of variables, there can be no directed paths either. Therefore they must be independent.
29. In Shipley (1997) I described how imbedding the algorithm for the undirected dependency graph inside a bootstrap loop helps to reduce the effects of small sample sizes. This option is included in the EPA program of my Toolbox (see Appendix).
Figure 8.25. Output of the undirected dependency graph algorithm when applied to a sample data set, generated from Figure 8.23, with a small sample size of 30 observations.
errors at small sample sizes. I have also explored the error rates of the Causal
Inference algorithm with small sample sizes, and using bootstrap techniques
(Shipley 1997). The general results that come from these simulation studies
are:
1. Error rates are lower for constructing the undirected dependency
graph than for orienting the edges.
2. The error rates for adding a line in the undirected dependency
graph when there shouldn't be one are quite low. Even at very small
sample sizes (say 30 observations), if a line appears then it probably
exists unless the rejection level is very high (say 0.5 or more).
3. The error rates for missing a line in the undirected dependency
graph when there should be one are higher. As the strength of the
direct causal relationships decrease, this error rate increases. As the
number of other variables to which a given variable is a direct cause
increases, this error rate increases. As the sample size increases, this
error rate decreases.
4. The error rates for orienting edges are higher than those related to
the undirected dependency graph. This is to be expected, since the
orientation phase depends on the number and types of unshielded
pattern; therefore any errors in the undirected dependency graph
will be propagated into the orientation phase.
5. The rejection level used in constructing the undirected dependency graph should increase as the sample size decreases. At very small sample sizes values of 0.2 or higher should be used. At sample sizes of around 100 to 300 a rejection level of 0.1 should be used. At higher sample sizes a value of 0.05 is fine, since statistical power does not have to be traded off against the ability to avoid Type I errors (see the sketch below).
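These guidelines can be condensed into a simple rule of thumb; a sketch (Python), with the thresholds taken directly from point 5:

    def rejection_level(n):
        """Suggested rejection level for building the undirected dependency graph,
        trading statistical power against Type I errors as the sample size falls."""
        if n < 100:
            return 0.2     # or even higher at very small sample sizes
        elif n <= 300:
            return 0.1
        else:
            return 0.05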
8.16 The Vanishing Tetrad algorithm and sampling variation
The Vanishing Tetrad algorithm has been much less studied than the other
algorithms that I have described. In part, this is because the assumption of
a linear relationship between the latents and the observed variables limits its
application. Another reason is perhaps because it is less informative than the
other algorithms; it can alert us to the presence of latent variables but can't
tell us exactly how these latents connect to the observed variables. Another
reason is that, unlike the tests for (conditional) independence that are used
in the other algorithms, the test for a zero tetrad equation is only asymp-
totic. In other words, in order to get accurate probability estimates based on
the null hypothesis that a tetrad equation is zero, you need a certain
minimum sample size. No one (to my knowledge) has formally studied the
asymptotic requirements for this test, and so I will present some Monte
Carlo results to give you some rules of thumb when interpreting the asymp-
totic probability levels of the test statistic.
The test statistic is τ = ρ_IJ ρ_KL − ρ_IL ρ_JK, where I, J, K and L are the four variables involved in the tetrad; remember that there are always three tetrad equations for each set of four variables. Under the null hypothesis this value will be zero in the population. Wishart (1928) derived the asymptotic sampling variance of this statistic in the first part of the twentieth century but no one (to my knowledge) has ever derived the exact sampling variance. The asymptotic sampling variance^30 is:

var(τ) = [D_IK D_JL (N + 1)/(N − 1) − D] / (N − 2)

where D is the determinant of the population correlation matrix of the four variables, D_IK is the determinant of the 2×2 matrix consisting of the population correlation matrix of variables I and K, D_JL is the determinant of the 2×2 matrix consisting of the population correlation matrix of variables J and L, N is the sample size and the four variables follow a multivariate normal distribution. There are six possible pairs of four variables; four of these pairs define a tetrad equation and the other two pairs define the 2×2 submatrices whose determinants are used in calculating the asymptotic variance. If the null hypothesis is true then the test statistic is asymptotically distributed as a normal variate with a zero mean and the given variance. Therefore, the value τ/√var(τ) asymptotically follows a standard normal distribution.

To conduct the statistical test you replace the population values by the sample values. In doing this you are only approximating the true probability level and so it is important to know how good (or bad) this approximation is. Table 8.5 shows some results of Monte Carlo simulations in which a four-variable measurement model, of the sort shown in Figure 8.16, was used to generate 500 independent data sets. Each such simulation used a different sample size per data set and a different value for the path coefficients (λ) between the latent variable and the observed variables. Remember that, according to the rules of path analysis, the population correlation coefficient (ρ) between each pair of observed variables in such a model is λ².
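A sketch of the test as just described (Python; the sample correlation matrix is simply substituted for the population values, and the function name is mine):

    import numpy as np
    from scipy.stats import norm

    def tetrad_test(R, n, I=0, J=1, K=2, L=3):
        """Test tau = rho_IJ*rho_KL - rho_IL*rho_JK = 0 using Wishart's
        asymptotic variance. R is the 4x4 sample correlation matrix of the
        variables I, J, K and L, and n is the sample size."""
        tau = R[I, J] * R[K, L] - R[I, L] * R[J, K]
        D = np.linalg.det(R)
        D_IK = np.linalg.det(R[np.ix_([I, K], [I, K])])
        D_JL = np.linalg.det(R[np.ix_([J, L], [J, L])])
        var_tau = (D_IK * D_JL * (n + 1) / (n - 1) - D) / (n - 2)
        z = tau / np.sqrt(var_tau)
        return tau, 2 * (1 - norm.cdf(abs(z)))   # statistic and two-sided p-value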
Figure 8.26. Output of the undirected dependency graph algorithm at different significance levels (rejection level = 0.05, 0.10; rejection level = 0.2 to 0.5; rejection level = 0.6), relating # seeds dispersed, # fruit produced, canopy projection, fruit diameter and seed weight. #, number of.
can be rejected without even orienting it. This is because it predicts that each of {seed weight, fruit diameter} is unconditionally d-separated from, and therefore independent of, each of {canopy projection, number of fruit produced, number of seeds dispersed}. Applying the d-sep test to only these independence statements yields a χ² value of 27.18 with 12 degrees of freedom (p = 0.007). If we then go to the second undirected graph, obtained using rejection levels of 0.2 to 0.5, we can apply the orientation phase of the algorithm. Until we get to a very high rejection level for this phase (0.4) we always find that each unshielded pattern is a definite non-collider. At a rejection level of 0.4 we are informed that fruit diameter is a collider. Figure 8.27 shows the partially oriented acyclic graph based on the middle undirected graph.
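The d-sep test invoked here combines the p-values of the k independence claims in a basis set into Fisher's statistic C = −2 Σ ln(p_i), which follows a χ² distribution with 2k degrees of freedom when all of the claims are true; six claims therefore give the 12 degrees of freedom quoted above. A minimal sketch (Python; the six p-values are placeholders for illustration, not the values behind the χ² of 27.18):

    import numpy as np
    from scipy.stats import chi2

    def dsep_test(p_values):
        """Fisher's C statistic for the independence claims of a basis set."""
        p = np.asarray(p_values, dtype=float)
        C = -2.0 * np.sum(np.log(p))
        df = 2 * len(p)
        return C, df, chi2.sf(C, df)      # statistic, degrees of freedom, p-value

    # e.g. six pairwise independence claims -> 12 degrees of freedom
    C, df, p = dsep_test([0.04, 0.20, 0.08, 0.50, 0.15, 0.30])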
Despite the small sample size (60 observations) we have already dis-
covered quite a lot of information about the possible causal relationships
between these variables. There is no evidence that there are latent variables
that are common causes of more than two observed variables; if there were
then we would see three or more variables with a saturated pattern between
them. Remember that we are in an exploratory mode. We are looking for
possible models that accord with our available evidence about the correla-
tional shadows but we also want our model to accord with any previous bio-
logical knowledge that we might possess. For instance, consider the
relationship between the number of cherry fruits produced and the number
of seeds dispersed by the birds (number of fruits produced o–o number of seeds dispersed). Since seeds can't be dispersed by birds before the fruit has been produced, we can exclude the orientation: (number of fruits produced ← number of seeds dispersed). It is possible that the orientation is: (number of fruits produced ↔ number of seeds dispersed), although it is difficult to conceive of a latent variable that determines both how many fruits the tree will produce and also how many of these fruits will be eaten by the birds. However, if we accept this orientation involving a latent variable then we must also exclude the orientation: (canopy projection o→ number of fruits produced), since this would produce a collider. This would therefore force us to accept the following orientation: (canopy projection ← number of fruits produced). Such an orientation disagrees not only with much empirical evidence but also with the time ordering of the phenomenon, since the
Figure 8.27. The final partially oriented graph that is produced based on the middle undirected graph of Figure 8.26. #, number of.
canopy is produced before the fruits are made. If we begin with the biolog-
ically reasonable hypothesis that the total photosynthetic capital of the tree,
of which the canopy projection area is a measure, determines both how
many fruit will be produced and the average size of each fruit, then we are
immediately led to the partially oriented directed acyclic graph in Figure
8.28.
Such a result, to me, is incredible. With five observed variables we
had a little over 59000 possible directed acyclic models. This algorithm,
combined with a few reasonable biological observations, reduced this huge
number to a few reasonable models. Since I know that the statistical power
to detect small non-zero correlations is not great with only 60 observa-
tions, I would not bet my salary on the accuracy of Figure 8.28, but I
would feel much more confident about proposing a model derived from
Figure 8.28 as a useful biological hypothesis to be tested with independent
data.
This is the real strength of these discovery algorithms. Unless pre-existing theory is already quite solid, then proposing a complete causal model from such theory often degenerates into asking: 'If I were God, and the world was a machine, then how would I construct it?'. Since few of us are gods and the world is not really a machine, such hypothesis generation can easily mask unbridled speculation. The discovery algorithms first show us the correlational shadows that our data contain, which causal processes might reasonably have cast them, and which causal processes were unlikely to have cast these correlational shadows. This constrains our speculation, forces us to consider different alternative models, and also forces us to explicitly justify any causal process that appears to contradict what the data seem to say.
The next empirical example shows that we shouldn't accept the output of these discovery algorithms blindly. Their purpose is to help us to develop useful causal hypotheses, not to replace the scientist with a computer algorithm. In Chapters 3 and 4 I presented a path model relating specific leaf area, leaf nitrogen content, stomatal conductance, net photosynthetic rate, and the CO2 concentration within the leaf. This model (Figure 8.29A) was based on the pre-existing model of stomatal regulation produced by Cowan and Farquhar (1977). When I apply the Causal
Figure 8.28. The partially oriented graph that is retained as most biologically plausible. #, number of.
Inference algorithm to the empirical data^32, the resulting partially oriented graphs make no biological sense. At low rejection levels (0.2 and lower) none of the suggested graphs fit the data. At higher rejection levels (0.2 to 0.5) a graph (Figure 8.29B) is suggested that does produce a path model with a non-significant ML χ² value but this graph contradicts some well-established biological knowledge of leaf gas exchange. Note that in this graph the net photosynthetic rate is causally independent of the CO2 concentration within the leaf, even though this is physically impossible. The amount of CO2 within the leaf is determined by the rate at which it is diffusing into the leaf across the concentration gradient from the higher outside
32. I use only the 35 species that have a C3 photosynthetic system, and each variable is transformed to its natural logarithm to ensure multivariate normality.
Figure 8.29. Model (A) is the one proposed in Chapter 3 based on
biological arguments. The partially oriented graph (B) is the output of
the Causal Inference algorithm when no constraints are placed on it.
The partially oriented graph (C) is the output of the Causal Inference
algorithm when simple constraints are placed on it based on well-
known physical laws.
concentration to the lower concentration within the leaf (i.e. stomatal con-
ductance) and the rate at which it is being removed from the intercellular
air by photosynthesis. The concentration within the leaf is therefore deter-
mined by the net rate at which it is being fixed by photosynthesis (by definition, the net photosynthetic rate) and the rate at which gases are diffusing across the stomates (measured by the stomatal conductance).
Why would the Causal Inference algorithm produce such an erro-
neous output? The reason is almost surely because this biological process vio-
lates one of the assumptions of the algorithm; namely that the probability
distribution of these variables is faithful to the causal process that generated
it^33. This means that independence or partial independence relationships are assumed to be due to the way in which the variables are causally linked together rather than to special numerical values of the strengths of the direct
causal relationships that manage to cancel each other. Imagine that you
glance out of the window and see a single person walking down the lane.
One hypothesis for this observation might be that there are really two people
but that one is positioned behind the other in such a way that she is perfectly
hidden by the person in front. A simpler, and more parsimonious, hypoth-
esis is that there is really only one person coming down the lane. Both
hypotheses are possible but the first requires that you are witnessing a very
special juxtaposition of distances, shapes and sizes of people. The illusion of
a single person would disappear as soon as any of those conditions change.
We could say that such special conditions are unfaithful to our general expec-
tations and so we would reject the hypothesis unless we had very good inde-
pendent reasons to believe that someone might be hiding due to such special
conditions. In the same way, these discovery algorithms assume that if we
observe an observational independence between two variables, then this
means that the two are causally independent. It is always possible that the two
variables only appear to be independent because positive and negative direct
and indirect relationships cancel each other out, but this would require a very
special balancing of causal effects. Just as with the example of two people
appearing to be a single person, unless we have good reasons for suspecting
such a curious observation, we would choose the more parsimonious expla-
nation.
In fact, the correlation coefficient between the ln-transformed net photosynthetic rates and the ln-transformed internal CO2 concentrations in these data was only 0.051 (p = 0.77). Therefore the line between these two variables would be immediately removed when we are constructing the undirected dependency graph. We know the net photosynthetic rate must
33. Unfaithfulness only properly applies to the population probability distribution.
be a cause of the amount of CO2 within the leaf and yet there appears to be no relationship between the two variables in these data. The reason for this apparent contradiction can be found in the Cowan and Farquhar (1977) model of stomatal regulation upon which the path model in Figure 8.29A is based. According to this theory, the stomates are regulated in order to maintain the internal CO2 concentration at the break point; that point at which carbon fixation is limited equally by the regeneration of Rubisco due to ATP production from the light reaction of photosynthesis and the amount of Rubisco available in the dark reaction of photosynthesis. In other words, the overall correlation between net photosynthetic rate and internal CO2 concentration is determined by two different causal paths. One path is the direct effect of net photosynthetic rate in reducing the internal CO2 concentration (net photosynthesis → internal CO2). The other path is the trek from stomatal conductance that is a common cause of both net photosynthetic rate and internal CO2 concentration (net photosynthesis ← stomatal conductance → internal CO2). Increasing the stomatal conductance increases the amount of CO2 that enters the leaf, thus increasing both the photosynthetic rate and the internal CO2 concentration. Furthermore, in order to maintain a constant internal CO2 concentration, the stomates must ensure that the increase in internal CO2 due to diffusion through the stomates is just enough to counter the decrease in CO2 that is caused by the resulting increase in the net photosynthetic rate. By balancing these positive and negative effects, such homeostatic control maintains a constant internal CO2 concentration but also produces an unfaithful probability distribution.
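A small numerical illustration of this cancellation may help. In the sketch below (Python; the coefficients are invented purely for illustration) conductance raises both the photosynthetic rate and the internal CO2 concentration, while photosynthesis has a direct negative effect on internal CO2 chosen to exactly balance the path through conductance; the marginal correlation is then near zero even though the direct causal link is strong.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 10_000

    g  = rng.normal(size=n)                                  # stomatal conductance
    A  = 0.8 * g + rng.normal(scale=0.6, size=n)             # net photosynthetic rate
    # the direct effect of A (-0.8) exactly cancels the path through g (0.8 * 1.0)
    ci = 1.0 * g - 0.8 * A + rng.normal(scale=0.3, size=n)   # internal CO2

    def pcorr(x, y, z):
        """First-order partial correlation of x and y given z."""
        rxy, rxz, ryz = (np.corrcoef(a, b)[0, 1] for a, b in [(x, y), (x, z), (y, z)])
        return (rxy - rxz * ryz) / np.sqrt((1 - rxz**2) * (1 - ryz**2))

    print(np.corrcoef(A, ci)[0, 1])   # close to zero: the unfaithful marginal shadow
    print(pcorr(A, ci, g))            # strongly negative once conductance is held constant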
Since the operational definition of net photosynthetic rate is the rate at which CO2 is being removed from the air within the leaf, we have good independent reasons to suspect that net photosynthetic rate will exert a direct negative effect on the internal CO2 concentration. We can now apply the Causal Inference algorithm again but add the constraint that stomatal conductance and net photosynthetic rates must each remain as direct causes of the internal CO2 concentration. This constraint is justified by simple physical laws of passive diffusion of gases across a concentration gradient. The resulting graph is shown in Figure 8.29C. The Causal Inference algorithm has suggested a partially oriented graph that is statistically equivalent to the path model that was proposed based on the Cowan–Farquhar theory of stomatal regulation. We have only to note that it is biologically more reasonable to suppose that the leaf nitrogen concentration is caused by specific leaf mass rather than the inverse^34 and we recover the path model in its entirety.
34. The concentration of nitrogen does not vary strongly through the depth of the leaf. Therefore, if there is more leaf biomass per leaf area (i.e. the leaf is thicker) then there will be more nitrogen per unit leaf area.
The assumption of faithfulness is really based on a parsimony argu-
ment. It says that, if the only causal information available is that obtained
from the observational data at hand and we have different possible causal
structures that are exactly equivalent in their predictions of (partial) inde-
pendence, then it is preferable to assume a causal structure whose indepen-
dence predictions are robust rather than to assume a causal structure whose
independence predictions require a special balancing of direct and indirect
causal effects. When we have good causal information exterior to the data
at hand then such information should be used. With small samples this is
especially important because low statistical power means that real, but weak,
effects might be incorrectly interpreted as independence.
8.18 Orienting edges in the undirected dependency graph
without assuming an acyclic causal structure
A recurring theme in science fiction stories is the Universal Translator; a
device that can infallibly translate back and forth between any set of lan-
guages. In the case of acyclic causal structures we had an imperfect, but still
quite serviceable, translation device. d-separation, applied to the directed
acyclic graph, could infallibly translate from the language of causality to the
language of probability distributions. We could not use it to translate infal-
libly backwards from a probability distribution to the causal graph because
different causal structures can generate the same joint probability distribu-
tion. This is why the discovery algorithms output a partially oriented acyclic
graph rather than a single DAG. Still, this is quite useful in reducing hypoth-
esis space down to a manageable set of possible DAGs. When we move on
to search algorithms for (possibly) cyclic causal processes then the problem
gets even more difficult because our translation device, d-separation, can't
be generally applied to non-linear cyclic causal processes. None the less,
Richardson (1996b) has produced an algorithm that is provably correct for
cyclic causal structures (given population measures of association and faith-
fulness) under the assumption that the functional relationships between the
variables are linear and that there are no latent variables generating associa-
tions between more than two observed variables.
As you might have already feared, this algorithm is both more com-
plicated and requires some new definitions and notational conventions.
Taken one at a time, each part of the algorithm is still intuitively compre-
hensible. The algorithm is based on the notion of a partial ancestral graph
(PAG). A PAG is an extension of the partially oriented inducing path graph
of acyclic models that was used in the Causal Inference algorithm. Here are
the conditions for a graph to be a PAG:
1. There is an edge between two variables, A and B, if and only if A
and B are d-connected given any subset of other observed variables
in the graph; i.e. if and only if there is an inducing path between A
and B. This is the same as for the graphs output from the Causal
Inference algorithm.
2. If there is an edge between A and B that is out of A, with the notation A –* B (but not necessarily into B), then A is an ancestor of B.
3. If there is an edge between A and B that is into B, with the notation A *→ B (but not necessarily out of A), then B is not an ancestor of A.
4. If there is an underlining at the middle variable of a triplet, with notation A *–* B *–* C (B underlined), then the edges do not collide at B. Therefore B is an ancestor of either A or C but not of both.
5. If there is an edge from A to B and from C to B (A→B←C) but B is not a descendant of a common child of A and C, then B is doubly underlined^35; thus A→B←C (with B doubly underlined).
... and record T plus Sepset(A,C) plus B in new sets called SupSepset(A,B,C) and SupSepset(C,B,A). The double underline means that X and Y do not both collide at Z^39.
5. Find a quadruple of variables (A,B,C,D) such that B and D are adjacent and the following patterns exist: A→B←C (with B doubly underlined) and either A→D←C or A*→D←*C. If such conditions exist then orient the edge between B and D as B←D if A and C are not d-separated given SupSepset(A,B,C) plus D.
This algorithm probably seems overwhelming. Since it is incorporated into the TETRAD III program you don't have to understand it sufficiently to actually program it, only well enough to have an intuitive knowledge of what it does. The most important part is to be able to interpret it. I'll go over each section of the algorithm and provide an intuitive explanation. Note, however, that this algorithm is the most general algorithm of all those presented so far^40. If the causal process is acyclic then this algorithm will give the same output as the Causal Inference algorithm even if the functional relationships are non-linear.
Section 1: This is simply the algorithm for constructing the undi-
rected dependency graph. If there is an edge between two variables (X,Y )
then there is an inducing path between them and no other variable, or set
of these other variables, can d-separate X and Y. The reason for construct-
ing Sepset(X,Y) and Sepset(Y,X) is simply so that we don't have to keep
conducting the independence tests in the other sections of the algorithm.
We could have done this in the Causal Inference algorithm as well. In fact,
the original formulation of the Causal Inference algorithm, as implemented
in TETRAD II and TETRAD III does use separation sets. The separation
sets provide useful information because every variable in Sepset(X,Y) is an
ancestor of either X or Y.
Section 2: This section is simply the algorithm for determining whether variable Z in an unshielded pattern (X–Z–Y) is a collider (thus X→Z←Y) or a non-collider (thus X–Z–Y with Z underlined). Now that we have Sepset(X,Y) we don't have to re-do all of the (conditional) independence tests. If we see that Z is in Sepset(X,Y) then Z is a non-collider and if Z isn't in Sepset(X,Y) then it is a collider. This uses the fact (above) that if Z is in Sepset(X,Y) then Z is an ancestor of either X or Y. Therefore, the orientation can't be X→Z←Y because this would imply that Z was a descendant of both X and Y. This also explains why causal processes having feedback relationships like that shown in Figure 8.25 produce different
results when we apply my version of the Causal Inference algorithm and the
original version that uses separation sets. In such feedback processes a vari-
40. There are a few propagation rules that can be added after the algorithm is finished. For instance, X→Y–Z implies X→Y→Z.
able can be both an ancestor and a descendant at the same time. This is not
possible in an acyclic causal structure.
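A sketch of the bookkeeping in sections 1 and 2 (Python; the adjacency sets and separation sets are assumed to have already been produced by the independence tests of section 1, and unshielded non-colliders are simply left unoriented here):

    def orient_unshielded_triples(adj, sepset):
        """adj: dict mapping each variable to the set of variables adjacent to it.
        sepset: dict mapping unordered pairs (frozensets) to the set that
        d-separated them (an empty set if unconditionally independent).
        Returns the triples X-Z-Y that are marked as colliders (X->Z<-Y)."""
        colliders = []
        for z in adj:
            neighbours = sorted(adj[z])
            for i, x in enumerate(neighbours):
                for y in neighbours[i + 1:]:
                    if y in adj[x]:
                        continue                      # shielded: X and Y adjacent
                    pair = frozenset((x, y))
                    if pair in sepset and z not in sepset[pair]:
                        colliders.append((x, z, y))   # Z not in Sepset(X,Y): collider
        return colliders

    # the example of Figures 8.33 and 8.34: A-B, A-C, B-C, B-D, C-D adjacent; Sepset(A,D) = {}
    adj = {'A': {'B', 'C'}, 'B': {'A', 'C', 'D'}, 'C': {'A', 'B', 'D'}, 'D': {'B', 'C'}}
    sepset = {frozenset(('A', 'D')): set()}
    print(orient_unshielded_triples(adj, sepset))     # [('A', 'B', 'D'), ('A', 'C', 'D')]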
Section 3: We know that X and Y are adjacent (X*–*Y) and that A is not adjacent to either X or Y by looking at the partially oriented graph. We also know that those variables that d-separate A and Y are not a subset of those variables that d-separate A and X; i.e. that A and X are still d-connected given Sepset(A,Y). Therefore X is not an ancestor of Y and we can orient X*–*Y as X←*Y.
Section 4: This section begins by looking for a triplet of variables that has already been oriented as A→B←C in section 2 of the algorithm. However, we have already seen that if there are feedback loops then a variable can be both an ancestor and a descendant of another variable. Therefore, this section tries to find some set of variables that d-separates A and C while including B. Remember that if we see A→B←C then this means that, in section 2, we had found A, B and C to form an unshielded pattern and that B wasn't a member of the separation set that d-separated A and C. So we will have found two separation sets, one with B and one without B, that d-separate A and C. This is the signal for a feedback loop. Since this section looks for the smallest set that includes B and Sepset(A,C), i.e. Supsepset(A,B,C), this means that every variable in Supsepset(A,B,C) is an ancestor of A, B, or C. The double underline that is added to B means that A and C cannot both collide at B; some equivalent graphs have A→B and C is not adjacent to B, while other equivalent graphs have C→B and A is not adjacent to B.
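A sketch of the search in section 4 (Python; `indep` stands for whatever conditional-independence test was used in section 1, so this is only an outline of the bookkeeping):

    from itertools import chain, combinations

    def find_supsepset(a, b, c, others, sepset_ac, indep):
        """Section 4: look for a set T of other variables such that A and C are
        d-separated given T plus Sepset(A,C) plus B. indep(x, y, given) is the
        conditional-independence test of section 1. Returns SupSepset(A,B,C),
        or None if no such set exists."""
        candidates = [v for v in others if v not in {a, b, c}]
        # try the smallest conditioning sets first
        subsets = chain.from_iterable(combinations(candidates, k)
                                      for k in range(len(candidates) + 1))
        for T in subsets:
            conditioning = set(T) | set(sepset_ac) | {b}
            if indep(a, c, conditioning):
                return conditioning      # B is then doubly underlined
        return None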
Sections 5 and 6: Since every member of Supsepset(A,B,C) is an
ancestor of A, B, or C we can now use this information to orient B←D.
The proof of the correctness of each section of this algorithm,
given the assumptions, is provided by Richardson (1996a,b). Let's apply this
algorithm to the causal structure shown in Figure 8.30, reproduced as Figure
8.33.
Assuming that we have a very large sample size, so that we can
ignore errors in determining probabilistic independence due to sampling
variations, the undirected dependency graph, obtained after section 1, is
shown in Figure 8.34.
There are only two unshielded patterns (A–B–D and A–C–D). Since A and D are unconditionally d-separated the Sepset(A,D) is empty; i.e. Sepset(A,D) = {null}. Therefore we orient these two unshielded patterns as A→B←D and A→C←D. Figure 8.35 shows the partially oriented ancestral graph after this step.
No changes are made after applying section 3 because the necessary patterns don't exist. When we apply section 4 we find that Sepset(A,D) =
Figure 8.33. A directed cyclic graph with a feedback relationship
between variables B and C.
Figure 8.34. The undirected dependency graph, obtained after section 1 of the Cyclic Causal Discovery algorithm.
Figure 8.35. The partially oriented graph that is obtained after orienting
based on the unshielded colliders.
{null} but that A and D are d-separated given {B,C}. Therefore we re-orient A→B←D to be A→B←D with B doubly underlined, and A→C