SOCY7706: Longitudinal Data Analysis Instructor: Natasha Sarkisian Two Wave Panel Data Analysis
In any longitudinal analysis, we can distinguish between analyzing trends and analyzing individual
change; that is, we can model the actual level of the DV (Y) or the change in the DV (ΔY). The
predictors can likewise be either actual levels (X = time-varying, Z = time-invariant) or measures
of change (ΔX only, because ΔZ = 0), as well as time itself (T).
We now turn to the main approaches to explaining change in a two-wave panel dataset. We will
review four of them.
The first approach is the lagged dependent variable (LDV) model, also known as the regressor
variable approach. The idea is to predict the time 2 outcome using time 1 independent variables
while controlling for stability in the outcome by including the time 1 dependent variable in the
model.
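As a sketch of the model just described, it can be written as a single regression equation. The notation is assumed here for illustration and does not come from the original notes: Y_i1 and Y_i2 are the outcome at the two waves, and X_i1 is a time 1 predictor.

```latex
Y_{i2} = \beta_0 + \beta_1 Y_{i1} + \beta_2 X_{i1} + \varepsilon_i
```

Here \beta_1 captures stability in the outcome, and \beta_2 is the effect of the time 1 predictor net of the starting level of the outcome.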
. reg rworkhours80 l.rworkhours80
------------------------------------------------------------------------------
rworkhours80 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
rworkhours80 |
L1. | .7368788 .0092827 79.38 0.000 .7186812 .7550763
|
_cons | 5.339778 .3551734 15.03 0.000 4.643507 6.036048
------------------------------------------------------------------------------
------------------------------------------------------------------------------
rworkhours80 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
rworkhours80 |
L1. | .7345166 .0094029 78.12 0.000 .7160834 .7529498
|
rallparhel~w |
L1. | -.1601855 .0719849 -2.23 0.026 -.3013029 -.0190681
|
_cons | 5.483749 .3637186 15.08 0.000 4.770724 6.196774
------------------------------------------------------------------------------
------------------------------------------------------------------------------
rworkhours80 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
rworkhours80 |
L1. | .668734 .010576 63.23 0.000 .6480009 .6894672
|
rallparhel~w |
L1. | -.0942385 .0734988 -1.28 0.200 -.2383254 .0498485
|
rpoorhealth |
L1. | -4.44369 .5954816 -7.46 0.000 -5.611072 -3.276308
|
rmarried |
L1. | .4209347 .612163 0.69 0.492 -.7791495 1.621019
|
rtotalpar |
L1. | .2755657 .2905194 0.95 0.343 -.2939684 .8450998
|
rsiblog |
L1. | -.42027 .374524 -1.12 0.262 -1.154487 .3139468
|
hchildlg |
L1. | -.5223844 .400087 -1.31 0.192 -1.306715 .2619461
|
raedyrs | .1235686 .0776308 1.59 0.111 -.0286189 .2757561
female | -3.392911 .46171 -7.35 0.000 -4.298048 -2.487775
age | -.7810018 .0711669 -10.97 0.000 -.9205174 -.6414862
minority | -.7411717 .5320883 -1.39 0.164 -1.784278 .3019342
_cons | 52.52523 4.385398 11.98 0.000 43.92809 61.12236
rpoorhealth -> r1poorhealth r2poorhealth
rmarried -> r1married r2married
rtotalpar -> r1totalpar r2totalpar
rsiblog -> r1siblog r2siblog
hchildlg -> h1childlg h2childlg
rallparhelptw -> r1allparhelptw r2allparhelptw
-----------------------------------------------------------------------------
This format also allows us to examine interactions of the effects of each of the variables of
interest with the lagged DV.
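A minimal sketch of such an interaction, assuming the wide-format variables used elsewhere in these notes (r1workhours80, r2workhours80, r1allparhelptw) are in memory. The choice of r1allparhelptw as the interacting predictor is only for illustration.

```stata
* Assumes wide-format two-wave data are already in memory.
* c. marks a variable as continuous; ## includes both main effects
* and their product term.
regress r2workhours80 c.r1workhours80##c.r1allparhelptw

* The product term shows how the effect of time 1 help to parents
* varies with the starting level of work hours.
display _b[c.r1workhours80#c.r1allparhelptw]
```

A significant product term would suggest that the effect of the predictor depends on where respondents started on the DV.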
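The second approach models the difference between the time 2 and time 1 outcomes (the difference
score) directly; it is also known as the change score approach. There has been a lot of
controversy surrounding this approach.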
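The command that creates the change score is not shown in these notes. A minimal sketch of how it could be built, assuming wide-format data with r1workhours80 and r2workhours80 in memory, is:

```stata
* Assumes wide-format two-wave data are already in memory.
* Change score: time 2 hours minus time 1 hours
* (missing whenever either wave is missing).
generate diff = r2workhours80 - r1workhours80

* Regress the change on a time 1 predictor.
regress diff r1allparhelptw
```

The variable name diff matches the dependent variable shown in the output tables for this approach.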
------------------------------------------------------------------------------
diff | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
r1allparhe~w | -.0147277 .076598 -0.19 0.848 -.1648885 .1354331
_cons | -2.792029 .2297434 -12.15 0.000 -3.242412 -2.341645
------------------------------------------------------------------------------
------------------------------------------------------------------------------
diff | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
r1allparhe~w | -.0267496 .0798046 -0.34 0.737 -.1831985 .1296994
r1poorhealth | .2642639 .6259046 0.42 0.673 -.9627592 1.491287
r1married | 1.383919 .6641307 2.08 0.037 .0819573 2.68588
r1totalpar | .0906871 .3155152 0.29 0.774 -.5278488 .7092229
r1siblog | -.7903476 .406629 -1.94 0.052 -1.587503 .0068077
h1childlg | -.4283254 .4345873 -0.99 0.324 -1.28029 .4236395
raedyrs | -.1313198 .0838629 -1.57 0.117 -.2957246 .033085
female | 1.381293 .4734211 2.92 0.004 .4531982 2.309387
age | -.4761804 .0765798 -6.22 0.000 -.6263073 -.3260534
minority | -.578333 .5779601 -1.00 0.317 -1.711366 .5546998
_cons | 25.22486 4.668661 5.40 0.000 16.07242 34.3773
------------------------------------------------------------------------------
For many years, difference scores were criticized. One reason is their presumed unreliability – if
the DV for time 1 and time 2 are positively correlated (which is pretty much always the case),
then the difference score will have lower reliability than each of the time points individually, and
if the correlation across time is high, that decrease in reliability will be substantial.
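For reference, the classical test theory formula behind this claim can be sketched as follows, assuming for simplicity that the variance is the same at both time points. The symbols are introduced here and are not in the original notes: \rho_{11} and \rho_{22} are the reliabilities at time 1 and time 2, \rho_{12} is the correlation between the two measurements, and \rho_D is the reliability of the difference score.

```latex
\rho_{D} = \frac{\tfrac{1}{2}\,(\rho_{11} + \rho_{22}) - \rho_{12}}{1 - \rho_{12}}
% Worked example with \rho_{11} = \rho_{22} = 0.8:
%   \rho_{12} = 0.5  =>  \rho_D = (0.8 - 0.5)/(1 - 0.5) = 0.60
%   \rho_{12} = 0.7  =>  \rho_D = (0.8 - 0.7)/(1 - 0.7) \approx 0.33
```

The higher the correlation across time, the further the reliability of the difference falls below that of the individual scores.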
But Paul Allison (1990) has argued that it is not a problem – “low reliability results from the fact
that in calculating the change score we differ out all the stable between-subject variation.” He
showed that what matters is measurement error, not unreliability – the same amount of error
variance that was contained in the individual scores just appears to be more prominent once the
stable component is removed, but in fact it has not changed.
Furthermore, change score models control for any unobserved factors as long as their effects are
stable over time, while the lagged dependent variable models do not, so this is a big advantage of
change score models.
The second critique is that difference score models do not account for the regression to the mean
effect—the trend wherein extremely low initial scores will be followed by an increase, and
extremely high scores – by a decrease. So the initial level might shape change, but if we add the
lagged DV to this change score model, we are back to the LDV model, so this strategy is not
useful:
. reg diff r1allparhelptw r1poorhealth r1married r1totalpar r1siblog h1childlg
raedyrs female age minority r1workhours80
------------------------------------------------------------------------------
diff | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
r1allparhe~w | -.0942385 .0734988 -1.28 0.200 -.2383254 .0498485
r1poorhealth | -4.44369 .5954816 -7.46 0.000 -5.611072 -3.276308
r1married | .4209347 .612163 0.69 0.492 -.7791495 1.621019
r1totalpar | .2755657 .2905194 0.95 0.343 -.2939684 .8450998
r1siblog | -.42027 .374524 -1.12 0.262 -1.154487 .3139468
h1childlg | -.5223844 .400087 -1.31 0.192 -1.306715 .2619461
raedyrs | .1235686 .0776308 1.59 0.111 -.0286189 .2757561
female | -3.392911 .46171 -7.35 0.000 -4.298048 -2.487775
age | -.7810018 .0711669 -10.97 0.000 -.9205174 -.6414862
minority | -.7411717 .5320883 -1.39 0.164 -1.784278 .3019342
r1workhou~80 | -.331266 .010576 -31.32 0.000 -.3519991 -.3105328
_cons | 52.52523 4.385398 11.98 0.000 43.92809 61.12236
------------------------------------------------------------------------------
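The equivalence can be checked with one line of algebra. The notation is generic and assumed here: Y_i1 and Y_i2 are hours at the two waves, and X_i1 stands for the time 1 predictors. Start from the LDV model and subtract Y_i1 from both sides:

```latex
Y_{i2} = \beta_0 + \beta_1 Y_{i1} + \beta_2 X_{i1} + \varepsilon_i
\;\Longrightarrow\;
Y_{i2} - Y_{i1} = \beta_0 + (\beta_1 - 1)\, Y_{i1} + \beta_2 X_{i1} + \varepsilon_i
```

In the output above, the coefficient on r1workhours80 is -.331266 = .668734 - 1, and every other coefficient and standard error is identical to the full LDV model reported earlier.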
But Allison argued that regression to the mean, although common, does not always happen; it
occurs mostly when there are ceiling and/or floor effects (e.g., when the variable is measured in
such a way that it cannot go below or above certain values, which is usually the case with
scales). The correlation between the initial score and the subsequent change does not have to be
negative: it can be positive, in which case the variance of scores increases over time. Allison
also argued that regression to the mean is not a problem when we compare stable groups, and that
in such cases the difference score approach may produce better results (less bias) than the LDV
approach.
We can evaluate regression to the mean empirically by taking a group with high scores (above the
75th percentile) at time 1 and comparing their distance from the mean at time 1 and at time 2:
. for var r1workhours80: sum X, det \ scalar Xmean1=r(mean) \ gen sample=1 if
X>r(p75)\ sum X if X>r(p75)\di r(mean)-Xmean1
1 rworkhours80
-------------------------------------------------------------
Percentiles Smallest
1% 0 0
5% 0 0
10% 0 0 Obs 6548
25% 0 0 Sum of Wgt. 6548
90% 57 80 Variance 507.5055
95% 63 80 Skewness -.175734
99% 80 80 Kurtosis 1.930742
-> di r(mean)-r1workhours80mean1
26.639072
-> di r(mean)-r2workhours80mean1
18.583979
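The high-scoring group was about 26.6 hours above the mean at time 1 but only about 18.6 hours
above it at time 2, so these individuals moved closer to the mean. We therefore conclude that
regression to the mean is a problem in our data and that LDV is the better choice, especially if
we want to document interactions between the starting level of the DV and the IVs.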
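The for command used above is an older Stata construct. A minimal sketch of the same check in current syntax follows, assuming wide-format data with r1workhours80 and r2workhours80 in memory. The names hightime1, grandmean1 and grandmean2 are hypothetical, and the means here use all nonmissing cases at each wave, so the numbers may differ slightly from those shown.

```stata
* Assumes wide-format two-wave data are already in memory.
* Overall mean and 75th percentile at time 1.
quietly summarize r1workhours80, detail
scalar grandmean1 = r(mean)
* Flag the group above the 75th percentile at time 1.
generate byte hightime1 = r1workhours80 > r(p75) if !missing(r1workhours80)

* Overall mean at time 2.
quietly summarize r2workhours80
scalar grandmean2 = r(mean)

* Distance of the high group from the overall mean at each wave.
quietly summarize r1workhours80 if hightime1 == 1
display "Time 1 distance: " r(mean) - grandmean1
quietly summarize r2workhours80 if hightime1 == 1
display "Time 2 distance: " r(mean) - grandmean2
```

A time 2 distance that is clearly smaller than the time 1 distance indicates that the high group moved toward the mean.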
------------------------------------------------------------------------------
diff | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
allparhelp~f | -.0796341 .0596778 -1.33 0.182 -.1966276 .0373593
poorhealth~f | -2.450367 .6794682 -3.61 0.000 -3.782409 -1.118325
marrieddiff | -.8902544 1.360583 -0.65 0.513 -3.557567 1.777058
totalpardiff | .5724302 .494059 1.16 0.247 -.3961321 1.540993
siblogdiff | -1.649011 2.908561 -0.57 0.571 -7.351007 4.052985
childlgdiff | 1.415648 1.658858 0.85 0.393 -1.836407 4.667703
_cons | -2.515716 .260116 -9.67 0.000 -3.025652 -2.00578
------------------------------------------------------------------------------
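The commands that created the differenced predictors are not shown in these notes. A minimal sketch, assuming wide-format data and the r1/r2 (and h1/h2) variable names listed in the reshape output earlier, could look like the following. Only the four predictors whose full names are visible in the table are included, so this regression will not reproduce the table exactly.

```stata
* Assumes wide-format two-wave data are already in memory and that
* diff (r2workhours80 - r1workhours80) has already been created.
* Difference each time-varying predictor: time 2 minus time 1.
generate marrieddiff  = r2married  - r1married
generate totalpardiff = r2totalpar - r1totalpar
generate siblogdiff   = r2siblog   - r1siblog
generate childlgdiff  = h2childlg  - h1childlg

* Regress change in hours on change in the predictors.
regress diff marrieddiff totalpardiff siblogdiff childlgdiff
```

The constant in such a model captures the average change in the outcome between waves for respondents whose predictors did not change.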
Once we have a first difference model, can we introduce time-invariant variables as well? We
can. By doing so, we assume that the effect of the time-invariant variable is not stable over
time, and we interpret its coefficient as an interaction between time and that variable. Because
the main effects are not in the model, however, such results are difficult to interpret, and such
models are rarely used.
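The logic can be sketched by writing out the level equations for the two waves and subtracting. The notation is assumed here for illustration: X_it is time-varying, Z_i is time-invariant, u_i collects unobserved stable factors, and \tau is the period effect absorbed by the constant.

```latex
Y_{i1} = \beta_0 + \beta_1 X_{i1} + \gamma_1 Z_i + u_i + \varepsilon_{i1} \\
Y_{i2} = \beta_0 + \tau + \beta_1 X_{i2} + \gamma_2 Z_i + u_i + \varepsilon_{i2} \\
\Delta Y_i = \tau + \beta_1 \Delta X_i + (\gamma_2 - \gamma_1)\, Z_i + \Delta\varepsilon_i
```

If the effect of Z is the same in both waves (\gamma_1 = \gamma_2), Z drops out along with u_i. A coefficient on Z in the differenced model therefore estimates \gamma_2 - \gamma_1, the change in its effect over time, which is why it is read as an interaction with time.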
------------------------------------------------------------------------------
diff | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
allparhelp~f | -.0779137 .0595081 -1.31 0.190 -.1945744 .038747
poorhealth~f | -2.417475 .6781968 -3.56 0.000 -3.747025 -1.087926
marrieddiff | -.7896093 1.355911 -0.58 0.560 -3.447763 1.868545
totalpardiff | .4298372 .4928697 0.87 0.383 -.5363938 1.396068
siblogdiff | -1.740446 2.905442 -0.60 0.549 -7.436328 3.955437
childlgdiff | 1.10057 1.654093 0.67 0.506 -2.142146 4.343286
raedyrs | -.0944175 .0800286 -1.18 0.238 -.2513072 .0624722
female | 1.262989 .4708219 2.68 0.007 .3399806 2.185997
age | -.4535023 .0760188 -5.97 0.000 -.602531 -.3044735
minority | -.9362349 .5703079 -1.64 0.101 -2.054277 .1818075
_cons | 23.36358 4.426971 5.28 0.000 14.68486 32.0423
------------------------------------------------------------------------------
The cross-lagged model, in many ways similar to LDV (in that it models levels rather than
change), is useful if you are interested in the mutual effects of two variables on one another.
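In equation form, a two-variable cross-lagged model can be sketched as a pair of regressions. The notation is assumed here: Y is one variable (for example, work hours), X is the other (for example, help to parents), and subscripts 1 and 2 denote the wave.

```latex
Y_{i2} = \beta_0 + \beta_1 Y_{i1} + \beta_2 X_{i1} + \varepsilon_{i}^{Y} \\
X_{i2} = \delta_0 + \delta_1 X_{i1} + \delta_2 Y_{i1} + \varepsilon_{i}^{X}
```

\beta_1 and \delta_1 are the stability paths, while \beta_2 and \delta_2 are the cross-lagged effects of each variable on the other.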
To establish causal predominance, we can compare standardized effects:
. reg r2allparhelptw r1allparhelptw r1workhours80, beta
Source | SS df MS Number of obs = 5697
-------------+------------------------------ F( 2, 5694) = 151.12
Model | 3376.80486 2 1688.40243 Prob > F = 0.0000
Residual | 63615.6175 5694 11.1723951 R-squared = 0.0504
-------------+------------------------------ Adj R-squared = 0.0501
Total | 66992.4223 5696 11.7613101 Root MSE = 3.3425
------------------------------------------------------------------------------
r2allparhe~w | Coef. Std. Err. t P>|t| Beta
-------------+----------------------------------------------------------------
r1allparhe~w | .261863 .0151334 17.30 0.000 .2240261
r1workhou~80 | -.0008848 .0019782 -0.45 0.655 -.0057909
_cons | 1.129847 .0764159 14.79 0.000 .
------------------------------------------------------------------------------
------------------------------------------------------------------------------
r2workhou~80 | Coef. Std. Err. t P>|t| Beta
-------------+----------------------------------------------------------------
r1workhou~80 | .7345166 .0094029 78.12 0.000 .7171015
r1allparhe~w | -.1601855 .0719849 -2.23 0.026 -.0204278
_cons | 5.483749 .3637186 15.08 0.000 .
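The command for the second table is not shown in these notes; presumably it mirrors the first. A minimal sketch of the pair, assuming the same wide-format data are in memory:

```stata
* Assumes wide-format two-wave data are already in memory.
* Equation 1: does time 1 work predict time 2 help to parents,
* net of time 1 help?
regress r2allparhelptw r1allparhelptw r1workhours80, beta

* Equation 2 (mirror image): does time 1 help predict time 2 work
* hours, net of time 1 hours?
regress r2workhours80 r1workhours80 r1allparhelptw, beta
```

The Beta column rescales each coefficient by sd(x)/sd(y), so the two cross-lagged effects (-.0058 for hours on later help, -.0204 for help on later hours) are on a common scale and can be compared in magnitude.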
A better way of modeling these same relationships is to perform simultaneous estimation with
correlated residuals. We can do this with structural equation modeling (SEM). [For a more in-
depth exploration of SEM, take my class next year.]
. sem (r1workhours80 -> r2workhours80, ) (r1workhours80 -> r2allparhelptw, )
(r1allparhelptw -> r2workhours80, ) (r1allparhelptw -> r2allparhelptw, ), cov(
r1allparhelptw*r1workhours80 e.r2workhours80*e.r2allparhelptw) nocapslatent
We can also request standardized coefficients in SEM by using the “standardized” option.
. sem (r1workhours80 -> r2workhours80, ) (r1workhours80 -> r2allparhelptw, )
(r1allparhelptw -> r2workhours80, ) (r1allparhelptw -> r2allparhelptw, ), cov(
r1allparhelptw*r1workhours80 e.r2workhours80*e.r2allparhelptw) nocapslatent stand
chi2( 1) = 0.74
Prob > chi2 = 0.3909
Here, neither standardized coefficient is significantly larger than the other - we cannot reject the
null hypothesis that they are equal (p = 0.39).
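The command that produced the chi-square test above is not shown in these notes. One way such a test of standardized coefficients could be requested after sem is sketched below; this is an assumption about the approach, not necessarily the command the notes used.

```stata
* Assumes the sem model above has just been estimated.
* The estat stdize: prefix applies the test to standardized rather
* than raw coefficients.
estat stdize: test _b[r2workhours80:r1allparhelptw] = _b[r2allparhelptw:r1workhours80]
```

Without the estat stdize: prefix, test would compare the unstandardized coefficients, which are on different scales here.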
There are a number of advantages to using SEM for two-wave analysis (e.g., construction of
latent variables, direct modeling of mediation, management of missing data via MLMV, etc.),
but one of the most practical for us here is the diagramming of paths. Stata allows you to specify
SEM models not only with syntax, but also by using path diagrams via its SEM Builder. (Many
SEM software packages will produce path diagrams as output along with results tables; Stata
only produces path diagram outputs if you specify the model using the SEM Builder.)
Using the dropdown menus, select: Statistics > SEM (structural equation modeling) > Model
building and estimation. As a note, in SEM measured variables are represented using rectangles,
while latent variables are represented by ellipses. Ordinary regression only uses measured
variables, so for our purposes here all you need to know is that our variables will be represented
using rectangles. In the SEM Builder, we can specify a model that matches the path diagram in
the notes above:
Click the “Estimate” button in the upper right-hand corner and hit “OK” for Maximum
likelihood estimation, and Stata will perform this simple cross-lagged SEM model.
Structural equation model Number of obs = 5651
Estimation method = ml
Log likelihood = -78231.18
--------------------------------------------------------------------------------------
| OIM
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
---------------------+----------------------------------------------------------------
Structural |
r2workhours80 <- |
r1workhours80 | .7377979 .0095035 77.63 0.000 .7191713 .7564244
r1allparhelptw | -.1516555 .0723852 -2.10 0.036 -.2935278 -.0097831
_cons | 5.366347 .3672892 14.61 0.000 4.646473 6.086221
-------------------+----------------------------------------------------------------
r2allparhelptw <- |
r1workhours80 | -.0008713 .0019916 -0.44 0.662 -.0047748 .0030323
r1allparhelptw | .2616838 .0151696 17.25 0.000 .2319519 .2914157
_cons | 1.133529 .0769721 14.73 0.000 .9826666 1.284392
---------------------+----------------------------------------------------------------
Mean |
r1workhours80 | 30.77579 .2983912 103.14 0.000 30.19096 31.36063
r1allparhelptw | .6459817 .039176 16.49 0.000 .569198 .7227653
---------------------+----------------------------------------------------------------
Variance |
e.r2workhours80 | 255.4756 4.8062 246.2272 265.0714
e.r2allparhelptw | 11.22017 .2110823 10.81399 11.6416
r1workhours80 | 503.1497 9.465631 484.9353 522.0483
r1allparhelptw | 8.672941 .1631619 8.358973 8.998701
---------------------+----------------------------------------------------------------
Covariance |
e.r2workhours80 |
e.r2allparhelptw | -1.446145 .7124758 -2.03 0.042 -2.842572 -.0497185
-------------------+----------------------------------------------------------------
r1workhours80 |
r1allparhelptw | -4.739734 .8810168 -5.38 0.000 -6.466495 -3.012972
Note that the results are exactly the same whether the model is estimated using syntax or the
SEM Builder.
The SEM Builder can also report standardized coefficients in the diagram, even when the
“standardized” option wasn’t requested at estimation. To display them on the paths, select
View > Standardized estimates.
Continuity of causal process: The cross-lagged model assumes that the causal processes are
continuous and ongoing, so that we can observe them at any time.
Equality of causal lags: We assume that the A→B and B→A causal lags are of the same length.
Since the vast majority of the models we discussed can be estimated using OLS regression,
diagnostics should be conducted the same way as they are for OLS.
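As a minimal sketch of what that could look like after one of the models above, assuming the wide-format data are in memory (the new variable names rstud and lev are hypothetical):

```stata
* Assumes wide-format two-wave data are already in memory.
regress r2workhours80 r1workhours80 r1allparhelptw

* Residual-versus-fitted plot: look for nonlinearity and unequal spread.
rvfplot, yline(0)
* Breusch-Pagan / Cook-Weisberg test for heteroskedasticity.
estat hettest
* Variance inflation factors for multicollinearity.
estat vif
* Studentized residuals and leverage for spotting outliers and
* influential cases.
predict rstud, rstudent
predict lev, leverage
```

These are standard postestimation tools for regress; which ones matter most depends on the model and the data.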