Organizational Research Methods, published online 13 June 2012. DOI: 10.1177/1094428112447816. Online version: http://orm.sagepub.com/content/early/2012/06/11/1094428112447816
Using the Propensity Score Method to Estimate Causal Effects: A Review and Practical Guide
Mingxiang Li1
Organizational Research Methods 00(0) 1-39. © The Author(s) 2012. DOI: 10.1177/1094428112447816
Abstract

Evidence-based management requires management scholars to draw causal inferences. Researchers generally rely on observational data sets and regression models, in which the independent variables have not been exogenously manipulated, to estimate causal effects; however, using such models on observational data sets can produce a biased effect size of the treatment intervention. This article introduces the propensity score method (PSM), which has previously been widely employed in social science disciplines such as public health and economics, to the management field. This research reviews the PSM literature, develops a procedure for applying the PSM to estimate the causal effects of intervention, elaborates on the procedure using an empirical example, and discusses the potential application of the PSM in different management fields. The implementation of the PSM in the management field will increase researchers' ability to draw causal inferences using observational data sets.

Keywords: causal effect, propensity score method, matching
Management scholars are interested in drawing causal inferences (Mellor & Mark, 1998). One example of a causal inference that researchers might try to determine is whether a specific management practice, such as group training or a stock option plan, increases organizational performance. Typically, management scholars rely on observational data sets to estimate the causal effects of a management practice. Yet endogeneity, which occurs when a predictor variable correlates with the error term, prevents scholars from drawing correct inferences (Antonakis, Bendahan, Jacquart, & Lalive, 2010; Wooldridge, 2002). Econometricians have proposed a number of techniques to deal
Department of Management and Human Resources, University of Wisconsin-Madison, Madison, WI, USA
Corresponding Author: Mingxiang Li, Department of Management and Human Resources, University of Wisconsin-Madison, 975 University Avenue, 5268 Grainger Hall, Madison, WI 53706, USA. Email: [email protected]
with endogeneity, including selection models, fixed effects models, and instrumental variables, all of which have been used by management scholars. In this article, I introduce the propensity score method (PSM) as another technique that can be used to calculate causal effects. In management research, many scholars are interested in evidence-based management (Rynes, Giluk, & Brown, 2007), which "derives principles from research evidence and translates them into practices that solve organizational problems" (Rousseau, 2006, p. 256). To contribute to evidence-based management, scholars must be able to draw correct causal inferences. Cox (1992) defined a cause as an intervention that brings about a change in the variable of interest, compared with a baseline control condition. A causal effect can be simply defined as the average effect due to a certain intervention or treatment. For example, researchers might be interested in the extent to which training influences future earnings. While the field experiment is one approach that can be used to correctly estimate causal effects, in many situations field experiments are impractical. This has prompted scholars to rely on observational data, which makes it difficult to gauge unbiased causal effects. The PSM is a technique that, if used appropriately, can increase scholars' ability to draw causal inferences using observational data. Though widely implemented in other social science fields, the PSM has generally been overlooked by management scholars. Since it was introduced by Rosenbaum and Rubin (1983), the PSM has been widely used by economists (Dehejia & Wahba, 1999) and medical scientists (Wolfe & Michaud, 2004) to estimate causal effects. Recently, finance scholars (Campello, Graham, & Harvey, 2010), sociologists (Gangl, 2006; Grodsky, 2007), and political scientists (Arceneaux, Gerber, & Green, 2006) have implemented the PSM in their empirical studies.
A Google Scholar search in early 2012 showed that over 7,300 publications cited Rosenbaum and Rubin's classic 1983 article that introduced the PSM. An additional Web of Science analysis indicated that over 3,000 academic articles cited this influential article. Of these citations, 20% of the publications were in economics, 14% were in statistics, 10% were in methodological journals, and the remaining 56% were in health-related fields. Despite the widespread use of the PSM across a variety of disciplines, it has not been employed by management scholars, prompting Gerhart's (2007) conclusion that "to date, there appear to be no applications of propensity score in the management literature" (p. 563). This article begins with an overview of the counterfactual model, experiments, regression, and endogeneity. This section illustrates why the counterfactual model is important for estimating causal effects and why regression models sometimes cannot successfully reconstruct counterfactuals. This is followed by a short review of the PSM and a discussion of the reasons for using it. The third section employs a detailed example to illustrate how a treatment effect can be estimated using the PSM. The following section presents a short summary of the empirical studies that have used the PSM in other social science fields, along with a description of potential implementations of the PSM in the management field. Finally, this article concludes with a discussion of the pros and cons of using the PSM to estimate causal effects.
Counterfactual Model
To better understand causal effects, it is important to discuss counterfactuals. In Rubin's causal model (see Rubin, 2004, for a summary), Y1i and Y0i are potential earnings for individual i when i receives (Y1i) or does not receive (Y0i) training. The fundamental problem of making a causal inference is how to reconstruct the outcomes that are not observed, sometimes called counterfactuals, because they are not what happened. Conceptually, either the treatment or the nontreatment is not observed and hence is missing (Morgan & Winship, 2007). Specifically, if i received training at time t, the earnings for i at t + 1 are Y1i. But if i instead did not receive training at time t, the potential earnings for i at t + 1 are Y0i. The effect of training can then be simply expressed as Y1i − Y0i. Yet, because it is impossible for i to simultaneously receive (Y1i) and not receive (Y0i) the training, scholars need to find other ways to overcome this fundamental problem. One can also understand this fundamental issue as the what-if problem: that is, what if individual i does not receive training? Hence, reconstructing the counterfactuals is crucial for estimating unbiased causal effects. The counterfactual model shows that it is impossible to calculate individual-level treatment effects, and therefore scholars have to calculate aggregated treatment effects (Morgan & Winship, 2007). There are two major versions of aggregated treatment effects: the average treatment effect (ATE) and the average treatment effect on the treated group (ATT). A simple definition of the ATE can be written as

ATE = E(Y1i | Ti = 1, 0) − E(Y0i | Ti = 1, 0),   (1.1a)
where E(.) represents the expectation in the population and Ti denotes the treatment, with a value of 1 for the treated group and 0 for the control group. In other words, the ATE can be defined as the average effect that would be observed if everyone in the treated and the control groups received treatment, compared with if no one in either group received treatment (Harder, Stuart, & Anthony, 2010). The definition of the ATT can be expressed as

ATT = E(Y1i | Ti = 1) − E(Y0i | Ti = 1).   (1.1b)
In contrast to the ATE, the ATT refers to the average difference that would be found if everyone in the treated group received treatment compared with if none of these individuals in the treated group received treatment. The value for the ATE will be the same as that for the ATT when the research design is experimental.1
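The distinction between Equations 1.1a and 1.1b can be made concrete with a small simulation. The sketch below (plain Python, with entirely made-up earnings numbers) generates both potential outcomes for every individual, so the ATE and the ATT can be computed exactly, and contrasts them with the naive observed-data difference a researcher would actually see:

```python
import random

random.seed(42)

# Hypothetical simulation: both potential outcomes are generated for every
# individual (never possible with real data), so ATE and ATT are exact.
n = 10_000
people = []
for _ in range(n):
    ability = random.gauss(0, 1)
    y0 = 20_000 + 5_000 * ability + random.gauss(0, 1_000)  # earnings without training
    y1 = y0 + 3_000 + 1_000 * ability                       # training helps high-ability more
    treated = ability + random.gauss(0, 1) > 0              # self-selection on ability
    people.append((y1, y0, treated))

# ATE (Equation 1.1a): mean of Y1 - Y0 over the whole population.
ate = sum(y1 - y0 for y1, y0, _ in people) / n

# ATT (Equation 1.1b): mean of Y1 - Y0 over the treated group only.
treated_grp = [(y1, y0) for y1, y0, t in people if t]
control_grp = [(y1, y0) for y1, y0, t in people if not t]
att = sum(y1 - y0 for y1, y0 in treated_grp) / len(treated_grp)

# Naive observed-data contrast E[Y1 | T = 1] - E[Y0 | T = 0], biased by selection.
naive = (sum(y1 for y1, _ in treated_grp) / len(treated_grp)
         - sum(y0 for _, y0 in control_grp) / len(control_grp))
```

Because higher-ability individuals select into training and benefit more from it, the ATT exceeds the ATE in this setup, and the naive treated-versus-control comparison overstates both.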
Experiment
There are different ways to estimate treatment effects other than PSM. Of these, the experiment is the gold standard (Antonakis et al., 2010). If the participants are randomly assigned to the treated or the control group, then the treatment effect can simply be estimated by comparing the mean difference between these two groups. Experimental data can generate an unbiased estimator for causal effects because the randomized design ensures the equivalent distributions of the treated and the control groups on all observed and unobserved characteristics. Thus, any observed difference on outcome can be caused only by the treatment difference. Because randomized experiments can successfully reconstruct counterfactuals, the causal effect generated by experiment is unbiased.
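A minimal illustration of why randomization works: under coin-flip assignment, the treated and control groups have the same ability distribution in expectation, so the simple difference in group means recovers the treatment effect. The numbers below are invented for the sketch:

```python
import random

random.seed(0)

# Hypothetical randomized experiment: earnings depend heavily on unobserved
# ability, but assignment ignores ability entirely.
n = 20_000
true_effect = 3_000.0
treated_y, control_y = [], []
for _ in range(n):
    ability = random.gauss(0, 1)
    y0 = 20_000 + 5_000 * ability   # earnings without training
    y1 = y0 + true_effect           # earnings with training
    if random.random() < 0.5:       # coin-flip assignment
        treated_y.append(y1)
    else:
        control_y.append(y0)

# Difference in group means: an unbiased estimate of the effect.
estimate = sum(treated_y) / len(treated_y) - sum(control_y) / len(control_y)
```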
Regression
In situations when the causal effects of training cannot be studied using an experimental design, scholars want to examine whether receiving training (T) has any effect on future earnings (Y). In this case, scholars generally rely on potentially biased observational data sets to investigate the causal
effect. For example, one can use a simple regression model by regressing future earnings (Y) on training (T) and demographic variables such as age (x1) and race (x2):

Y = b0 + b1x1 + b2x2 + tT + e.   (1.2)
Scholars then interpret the results by saying that, ceteris paribus, the effect due to training is t, and typically assume that t is the causal effect of the management intervention. Indeed, regression or structural equation modeling (SEM) (cf. Duncan, 1975; James, Mulaik, & Brett, 1982) remains a dominant approach for estimating treatment effects.2 Yet regression cannot detect whether the cases are comparable in terms of distribution overlap on observed characteristics. Thus, regression models are unable to reconstruct counterfactuals. One can easily find many empirical studies that seek to estimate causal effects by regressing an outcome variable on an intervention dummy variable. The findings of these studies, which used observational data sets, could be wrong because they did not adjust for differences in covariate distributions between the treated and control groups.
Endogeneity
In addition to the nonequivalence of distributions between the control and treated groups, another severe problem that prevents scholars from calculating unbiased causal effects is endogeneity. This occurs when the predictor T correlates with the error term e in Equation 1.2. A number of review articles have described the endogeneity problem and warned management scholars of its biasing effects (e.g., Antonakis et al., 2010; Hamilton & Nickerson, 2003). Endogeneity arises from measurement error, simultaneity, and omitted variables. Measurement error typically attenuates the effect size of regression estimators of the explanatory variables. Simultaneity happens when at least one of the predictors is determined simultaneously along with the dependent variable; an example is the estimation of price in a supply and demand model (Greene, 2008). An omitted variable appears when one does not control for additional variables that correlate with the explanatory as well as the dependent variables. Of these three sources of endogeneity, omitted variable bias has probably received the most attention from management scholars. Returning to the earlier training example, suppose the researcher controls only for demographic variables but does not control for an individual's ability. If training correlates with ability and ability correlates with future earnings, the result will be biased because of endogeneity: omitting ability causes a correlation between the training dummy T and the residuals e. This violates the assumption of strict exogeneity for linear regression models, so the estimated causal effect (t in Equation 1.2) will be biased. If the omitted variable is time-invariant, one can use the fixed effects model to deal with endogeneity (Allison, 2009). Beck, Bruderl, and Woywode's (2008) simulation showed that the fixed effects model corrects for biased estimation due to an omitted time-invariant variable.
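The omitted-ability scenario can be simulated directly. The sketch below (pure Python, with hypothetical coefficients) first regresses earnings on the training dummy alone, then controls for ability by residualizing both variables on ability (the Frisch-Waugh step), which recovers the true effect:

```python
import random

random.seed(1)

# Hypothetical simulation of omitted-variable bias: training (T) is driven by
# ability, ability also raises earnings, and ability is omitted from the model.
n = 20_000
T, A, Y = [], [], []
for _ in range(n):
    a = random.gauss(0, 1)
    t = 1.0 if a + random.gauss(0, 1) > 0 else 0.0        # selection on ability
    y = 1_000.0 * t + 5_000.0 * a + random.gauss(0, 500)  # true training effect = 1,000
    T.append(t); A.append(a); Y.append(y)

def ols_slope(x, y):
    """Slope from regressing y on x with an intercept: cov(x, y) / var(x)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var = sum((xi - mx) ** 2 for xi in x)
    return cov / var

def residuals(y, x):
    """Residuals from regressing y on x with an intercept."""
    b = ols_slope(x, y)
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return [yi - (my + b * (xi - mx)) for yi, xi in zip(y, x)]

# Naive regression of Y on T alone: T proxies for the omitted ability term,
# so the training coefficient is badly inflated.
naive_t = ols_slope(T, Y)

# Frisch-Waugh step: residualize T and Y on ability, then regress residual on
# residual; this equals the T coefficient from the full model Y ~ T + A.
adjusted_t = ols_slope(residuals(T, A), residuals(Y, A))
```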
One can also view nonrandom sample selection as a special case of the omitted variable problem. Taking the effect of training on earnings as an example, one can observe earnings only for individuals who are employed, and employed individuals could be a nonrandom subset of the population. One can write the nonrandom selection process as Equation 1.3:

D = aZ + u,   (1.3)
where D is a latent selection variable (observed as 1 for employed individuals), Z represents a vector of variables (e.g., education level) that predicts selection, and u denotes the disturbance. One can call Equation 1.2 the substantive equation and Equation 1.3 the selection equation. Sample selection bias is likely to materialize when there is a correlation between the disturbances of the substantive (e) and selection (u) equations (Antonakis et al., 2010, p. 1094; Berk, 1983; Heckman, 1979). When there is a correlation between e and u, the Heckman selection model, rather than the PSM, should be used to calculate
causal effects (Antonakis et al., 2010). To correct for sample selection bias, one first fits the selection model using a probit or logit model. The predicted values from the selection model are then used to compute the density and distribution values, from which the inverse Mills ratio (λ), the ratio of the density value to the distribution value, is calculated. Finally, the inverse Mills ratio is included in the substantive Equation 1.2 to correct for the bias of t due to selection. For more information on two-stage selection models, readers can consult Berk (1983).
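The correction term itself is easy to compute once the first-stage index is in hand. The sketch below implements the inverse Mills ratio λ(z) = φ(z)/Φ(z) in plain Python; the example indices are hypothetical stand-ins for fitted probit values:

```python
import math

def norm_pdf(z):
    """Standard normal density phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """Standard normal distribution function Phi(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def inverse_mills(z):
    """Inverse Mills ratio lambda(z) = phi(z) / Phi(z) for a selected case
    with first-stage probit index z (the aZ of Equation 1.3)."""
    return norm_pdf(z) / norm_cdf(z)

# Hypothetical first-stage indices for three employed individuals; in practice
# these come from a fitted probit selection model.
indices = [-1.0, 0.0, 1.5]
lambdas = [inverse_mills(z) for z in indices]
# lambda decreases in z: cases that were barely selected carry the largest correction.
```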
at or near this level. The PSM can easily detect a lack of covariate distribution overlap between the two groups and adjust the distributions accordingly. Third, linear or logistic models have been used to adjust for confounding covariates, but such models rely on assumptions regarding functional form. For example, one assumption required for a linear model to produce an unbiased estimator is that it does not suffer from the aforementioned problem of endogeneity. Although the procedure used to calculate propensity scores is parametric, using propensity scores to compute a causal effect is largely nonparametric. Thus, using the PSM to calculate the causal effect is less susceptible to violations of model assumptions. Overall, when one is interested in investigating the effectiveness of a certain management practice but is unable to collect experimental data, the PSM should be used, at least as a robustness test to justify the findings estimated by parametric models.
Given observational covariates X, the potential outcomes (Y1 and Y0) are independent of treatment assignment ((Y1, Y0) ⊥ T | X). This assumption simply asserts that the researcher can observe all variables that need to be adjusted. The overlap assumption means that, given covariates X, a person has a positive probability of being assigned to either the treated group or the control group: 0 < Pr(T = 1 | X) < 1. The strongly ignorable assumption rules out systematic, pretreatment, unobserved differences between the treated and the control subjects that participate in the study (Joffe & Rosenbaum, 1999). Given the strongly ignorable assumption, the ATT defined in Equation 1.1b can be estimated using the balancing score. Because the propensity score e(x) is one form of balancing score, one can estimate the ATT by subtracting the average outcome of the control group from that of the treated group at a particular propensity score. Thus, Equation 1.1b can be rewritten as ATT = E{Y | T = 1, e(x)} − E{Y | T = 0, e(x)}. If there are unobserved variables that simultaneously affect the treatment assignment and the outcome variable, the treatment assignment is not strongly ignorable. One can compare the failure of the strongly ignorable assumption with endogeneity in mis-specified econometric models; one can view it as the omitted or unmeasured variable problem (cf. James, 1980). Specifically, when one calculates the propensity scores, one or more variables that may affect treatment assignment and outcomes are omitted. For example, suppose an unobserved variable partially determines treatment assignment. In this case, two individuals with the same values of the observed covariates will receive the same propensity score, despite the fact that they have different values of the unobserved covariate and thus should receive different propensity scores. If the strongly ignorable assumption is violated, the PSM will produce biased causal effects.
[Figure 1 fragment: once the covariates are balanced, conduct sensitivity tests: (1) multiple comparison groups, (2) specification, (3) instrumental variables, (4) Rosenbaum bounds.]
participants. Dehejia and Wahba (1999, 2002) reconstructed Lalonde's original NSW data by including individuals who attended the program early enough for retrospective 1974 earnings information to be obtained. The final NSW sample includes 185 treated and 260 control individuals. Lalonde's (1986) observational data consisted of two distinct comparison groups covering the years 1975 to 1979: the Population Survey of Income Dynamics (PSID-1) and the Current Population Survey-Social Security Administration File (CPS-1). Initiated in 1968, the PSID is a nationally representative longitudinal database that interviews individuals and families for information on the dynamics of employment, income, and earnings. The CPS, a monthly survey conducted by the Bureau of the Census for the Bureau of Labor Statistics, provides comprehensive information on the unemployment, income, and poverty of the nation's population. Lalonde further extracted four data sets (denoted PSID-2, PSID-3, CPS-2, and CPS-3) that resemble the treatment group based on
Li
Table 1a. Description of Data Sets and Definition of Variables

Data sets (sample size):
- NSW Treated (185): National Supported Work Demonstration (NSW) data were collected using an experimental design, in which qualified individuals were randomly assigned to training positions to receive pay and accumulate experience.
- NSW Control (260): Experimental control group; qualified individuals were randomly assigned to this group and so had no opportunity to receive the benefit of the NSW program.
- PSID-1 (2,490): Nonexperimental control group; 1975-1979 Population Survey of Income Dynamics (PSID), all male household heads under age 55 who did not classify as retired in 1975.
- PSID-2: Selected from PSID-1; men who were not working in the spring of 1976.
- PSID-3: Selected from PSID-2; men who were not working in the spring of 1975.
- CPS-1: Nonexperimental control group; 1975-1979 Current Population Survey (CPS), all participants under age 55.
- CPS-2: Selected from CPS-1; men who were not working when surveyed in March 1976.
- CPS-3: Selected from CPS-2; unemployed men in 1976 whose income in 1975 was below the poverty line.

Variables:
- Treatment: Set to 1 if the participant comes from the NSW treated data set, 0 otherwise
- Age: Age of the participant (in years)
- Education: Number of years of schooling
- Black: Set to 1 for Black participants, 0 otherwise
- Hispanic: Set to 1 for Hispanic participants, 0 otherwise
- Married: Set to 1 for married participants, 0 otherwise
- Nodegree: Set to 1 for participants with no high school degree, 0 otherwise
- RE74: Earnings in 1974
- RE75: Earnings in 1975
- RE78: Earnings in 1978, the outcome variable
simple pre-intervention characteristics (e.g., age or employment status). Table 1a reports the details of the data sets and the definitions of the variables.
Table 1b. Means, Standard Deviations, and Standardized Bias (PSID-1 excerpt)

Variable | NSW Treated, M (SD) | NSW Control, M (SD); SB | PSID-1(a), M (SD); SB | PSID-1M(b), M (SD); SB | % reduction in SB
Age  | 25.82 (7.16) | 25.05 (7.06); 10.73 | 34.85 (10.44); 100.94 | 30.96 (9.46); 61.35 | 39.22
RE74 | 2,095.57 (4,886.62) | 2,107.03 (5,687.91); 0.22 | 19,428.75 (13,406.88); 171.78 | 11,386.48 (9,326.64); 124.79 | 27.36
RE75 | 1,532.06 (3,219.25) | 1,266.91 (3,102.98); 8.39 | 19,063.34 (13,596.95); 177.44 | 9,528.64 (8,222.72); 128.07 | 27.82
N    | 185 | 260 | 2,490 | 1,103 |

Note: SB = standardized bias estimated using Formula 2.1; N = number of cases.
a PSID-1: All male household heads under age 55 who did not classify as retired.
b PSID-1M is the subsample of PSID-1 that is matched to the treatment group (NSW treated).
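The SB column of Table 1b follows the conventional standardized-bias definition (the mean difference scaled by the pooled standard deviation, times 100). A quick check in Python using the Age row of Table 1b reproduces the reported value up to rounding of the inputs:

```python
import math

def standardized_bias(mean_t, var_t, mean_c, var_c):
    """Standardized bias in the usual Rosenbaum-Rubin form: the mean
    difference as a percentage of the pooled standard deviation.
    (Formula 2.1 in the text is assumed to take this conventional form.)"""
    return 100.0 * (mean_t - mean_c) / math.sqrt((var_t + var_c) / 2.0)

# Age row of Table 1b: NSW treated 25.82 (SD 7.16) vs. NSW control 25.05 (SD 7.06).
sb_age = standardized_bias(25.82, 7.16**2, 25.05, 7.06**2)
# roughly 10.8, close to the 10.73 reported (the gap reflects rounding of the inputs)
```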
Morral, 2004). Boosted regression can simplify the process of achieving balance in each stratum. Appendix A provides further discussion of this technique. Steiner, Cook, Shadish, and Clark (2010) replicated a prior study to show the importance of appropriately selecting covariates. They summarized three strategies for covariate selection: First, select covariates that are correctly measured and modeled. Second, choose covariates that reduce selection bias; these will be covariates that are highly correlated with the treatment (best predict treatment) and with the outcomes (best predict outcomes). Finally, if there is no theoretically or empirically sound guidance for covariate selection (e.g., the research question is very new), scholars can measure a rich set of covariates to increase the likelihood of including covariates that satisfy the strongly ignorable assumption. After specifying the observational covariates, the propensity scores can be estimated using these variables. This article summarizes four different approaches that can be used to estimate propensity scores. If there is only one treatment (e.g., training), then one can use a logistic model, probit model, or prepared program.3 If the treatment has more than two ordered versions (e.g., individuals receive several doses of medicine), then an ordinal logistic model can be used (Joffe & Rosenbaum, 1999); the treatment levels must be ordered based on certain threshold values. If there is more than one treatment and the treatments are discrete choices (e.g., Group 1 receives payment, Group 2 receives training), the propensity scores can be estimated using a multinomial logistic model. Receiving treatment does not need to happen at the same time for everyone: for many treatments, a decision needs to be made regarding whether to treat now or to wait and treat later, and that decision is driven by the participants' preferences.
Under this condition, one can use the Cox proportional hazards model to compute the propensity scores. Li, Propert, and Rosenbaum (2001) demonstrated that the hazard model has properties similar to those of propensity scores. Except for the Cox model, which uses partial likelihood (PL) and does not require one to specify the baseline hazard function, the estimation technique used in the aforementioned models is maximum likelihood estimation (MLE) (see Greene, 2008, Chapter 16, for more information on MLE). The logistic models and the hazard model all assume a latent variable (Y*) that represents an underlying propensity or probability to receive treatment. Long (1997) argues that one can view a binary outcome variable as a latent variable. When the estimated probability is greater than a certain threshold or cut point (t), one observes the treatment (Y* > t; T = 1). For an ordinal logistic model, one can
understand the latent variable as having multiple thresholds and observe the treatment according to those thresholds (e.g., t1 < Y* < t2; T = 2). The multinomial logistic model can simply be viewed as a model that simultaneously estimates a binary model for all possible comparisons among outcome categories (Long, 1997), but it is more efficient to use a multinomial logistic model than multiple binary models. It is somewhat tricky to generate the predicted probability from the Cox model because it is semiparametric, with no assumption about the distribution of the baseline hazard. Two alternative choices can be used to derive probabilities for a survival model: (1) one can rely on a parametric survival model that specifies the baseline hazard, or (2) one can transform the data in order to use a discrete-time model. To illustrate how to calculate propensity scores, this study employed treatment group data from the NSW and control group data extracted from the PSID-2. Following Dehejia and Wahba (1999), I selected age, education, no degree, Black, Hispanic, RE74, RE75, age squared, RE74 squared, RE75 squared, and RE74 × Black as covariates to calculate propensity scores. To compute propensity scores, one can first run a logistic or probit model using a treatment dummy (whether an individual received training) as the dependent variable and the aforementioned covariates as the independent variables. Propensity scores can then be obtained by calculating the fitted values from the logistic or probit model (use "predict mypscore, p" in STATA). Readers can refer to Hoetker (2007) for more information on calculating probabilities from logit or probit models. Appendix B lists a randomly selected sample (n = 50) from the combined NSW and PSID-2 data set, together with the calculated propensity scores. Readers can obtain the data for Appendix B, NSW treated, and PSID-2 from the author.
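For readers without access to STATA's predict command, the fitted-probability step can be sketched in plain Python. The toy logistic fit below uses gradient ascent on two made-up covariates; in real work one would use a proper statistical package, and the covariate list and assignment rule here are illustrative only:

```python
import math
import random

random.seed(7)

def fit_logit(X, y, lr=0.1, epochs=2000):
    """Fit a logistic model by plain gradient ascent on the average
    log-likelihood; returns weights with the intercept as the last entry.
    A sketch only -- real analyses should use a statistical package."""
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + w[-1]
            p = 1.0 / (1.0 + math.exp(-z))
            for j in range(k):
                grad[j] += (yi - p) * xi[j]
            grad[-1] += yi - p
        w = [wj + lr * gj / n for wj, gj in zip(w, grad)]
    return w

def pscore(w, xi):
    """Fitted probability Pr(T = 1 | X) -- the propensity score."""
    z = sum(wj * xj for wj, xj in zip(w, xi)) + w[-1]
    return 1.0 / (1.0 + math.exp(-z))

# Toy covariates (standardized age and education); treatment is made more
# likely for the young and educated under this hypothetical assignment rule.
X, T = [], []
for _ in range(500):
    age, edu = random.gauss(0, 1), random.gauss(0, 1)
    p_true = 1.0 / (1.0 + math.exp(-(-0.8 * age + 1.2 * edu)))
    X.append([age, edu])
    T.append(1 if random.random() < p_true else 0)

w = fit_logit(X, T)
scores = [pscore(w, xi) for xi in X]
```

At the logistic maximum likelihood solution the fitted probabilities sum to the number of treated cases, which provides a simple sanity check on convergence.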
where X̄1M (V1M) and X̄0M (V0M) denote the means (variances) for the treated group and the matched control group, respectively. In addition to these two widely used tests, the Kolmogorov-Smirnov two-sample test can also be used to investigate the overlap of the covariates between the treated and the control groups. Balanced strata between the treated and the matched control group ensure a minimal distance in the marginal distributions of the covariates. If any pretreatment variable is not balanced in a particular block, one needs to subclassify the block into additional blocks until all blocks are balanced. To obtain strata balance, researchers sometimes need to add higher-order covariates and recalculate the propensity scores. Rosenbaum and Rubin (1984) detailed the process of cycling between checking for balance within strata and reformulating the propensity model. Two guidelines for adding higher-order covariates have been proposed: (1) when the variances of a critical covariate are found
to differ dramatically between the treatment and the control group, the squared term of the covariate needs to be included in the revised propensity score model, and (2) when the correlation between two important covariates differs greatly between the groups, the interaction of the covariates can be added to the propensity score model. Appendix B shows a simple example of stratifying data into five blocks after calculating the propensity scores. For this illustration, I stratified the 50 cases into five groups. I first identified the cases with propensity scores smaller than 0.05, which were classified as unmatched. When the propensity scores were smaller than 0.2 but larger than 0.05, I coded this as block 1 (Block ID = 1). When the propensity scores were smaller than 0.4 but larger than 0.2, this was coded as block 2. This process was repeated until I had created five blocks, and then I conducted a t-test within each block to detect any significant difference in propensity scores between the treated and control groups. T-values for each block appear in the columns next to the Block ID column. Overall, the t-tests reveal that the differences in propensity scores between the treated and control groups are statistically insignificant. If the t-test shows statistically significant differences in propensity scores, one should either change the threshold values of the propensity scores in each block or change the covariates and recalculate the propensity scores. When the propensity scores in each stratum are balanced, all covariates in each stratum should also achieve equivalence of distribution. To confirm this, one can conduct a t-test for each observational variable. To illustrate how balance of propensity scores within strata helps to achieve distribution overlap for the other covariates, Appendix B reports the values for one continuous variable, age.
One can conduct a t-test to ensure that there is no age difference between the treated and control groups within each stratum. The column Tage reports the t-test for age within each stratum. After balancing each block's propensity scores, the age difference between the treated and control groups in each block became statistically insignificant. I recommend that readers use a prepared statistical package to stratify propensity scores, as a program can simultaneously categorize propensity scores and conduct balance tests. For instance, one can use the -pscore- program in STATA (Becker & Ichino, 2002) to estimate, stratify, and test the balance of propensity scores. To further illustrate how the PSM can achieve strata balance, I replicated the aforementioned two procedures for the combined experimental data set and each of the observational data sets in Table 1a. Following Dehejia and Wahba's (1999) suggestions on the choice of covariates, I first computed propensity scores for each data set. Then the propensity scores were stratified and tested for balance within each stratum. When the propensity scores achieved balance within each stratum, I plotted the means of the propensity scores in each stratum for each matched data set. Figure 2 provides evidence that the means of the propensity scores are almost the same for each sample within each balanced block. To demonstrate the effectiveness of the PSM in adjusting for the balance of the other covariates, Table 1b summarizes the means, standard errors, and SB of the matched sample. Comparing the results between the matched and unmatched samples, one can see that the difference in most observed characteristics between the experimental design and the nonexperimental design is reduced dramatically.
For instance, the PSID-1 columns of Table 1b report absolute SB values ranging from 12.86 to 184.23 (before propensity score matching), whereas the PSID-1M columns of Table 1b show an absolute minimum SB of 3.48 and an absolute maximum SB of 128.07. Furthermore, the t-test and the Kolmogorov-Smirnov two-sample test were conducted to examine the balance of each variable. As reported in Table 2, for the PSID-1 sample, except for RE74 in Block 3, one cannot see a p value smaller than 0.1. For simplicity, Table 2 uses only the continuous variables included in estimating the propensity scores to illustrate the effectiveness of the PSM in increasing the distribution overlap between the treated group and the matched control group. Overall, Table 2 shows strong evidence that after obtaining balance of propensity scores within a stratum, the covariates achieve overlap in terms of distribution.
Figure 2. Means of propensity scores in balanced strata
Note: PSID = Population Survey of Income Dynamics (PSID-1); CPS = Current Population Survey-Social Security Administration File (CPS-1).
Table 2. Test of Strata Balance for the Matched Sample (PSID-1)

p values by block:

                Block 1  Block 2  Block 3  Block 4  Block 5  Block 6  Block 7
t-test
  Age            0.800    0.856    0.834    0.853    0.341    0.353    0.603
  Education      0.995    0.319    0.765    0.378    0.816    0.196    0.574
  RE74           0.283    0.632    0.077    0.744    0.711    0.888    0.791
  RE75           0.685    0.627    0.641    0.874    0.113    0.956    0.747
KS test^a
  Age            0.566    0.998    0.832    0.954    0.613    0.950    0.280
  Education      1.000    0.894    1.000    0.999    0.844    0.942    0.828
  RE74           0.697    0.983    0.044    0.949    0.512    0.466    1.000
  RE75           0.984    0.998    0.851    0.754    0.026    0.878    1.000

Note: The table reports the p value of each variable for each stratum between the National Supported Work Demonstration (NSW) treated and matched control groups. PSID-1 = 1975-1979 Population Survey of Income Dynamics (PSID) sample of all male household heads under age 55 who did not classify themselves as retired in 1975.
a. KS (Kolmogorov-Smirnov) two-sample test between the NSW treated and matched control groups.
Table 2 report statistics only for PSID-1. Readers can obtain a full version of these two tables by contacting the author. The aforementioned evidence generally supports that the covariates are balanced for the treated and control groups.
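To make the stratify-and-test loop concrete, here is a minimal Python sketch on simulated data (the block boundaries, the covariate, and the treatment rate are all hypothetical; the article's own empirical work uses the -pscore- program in STATA instead):

```python
import math
import random
from statistics import mean, variance

random.seed(0)

def welch_t(a, b):
    """Welch t statistic for the difference in means of two samples."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

# Hypothetical cases: (propensity score, treated flag, age)
sample = [(random.random(), random.random() < 0.4, random.gauss(35.0, 8.0))
          for _ in range(500)]

# Stratify on the propensity score into five equal-width blocks
blocks = {}
for ps, treated, age in sample:
    blocks.setdefault(min(int(ps * 5), 4), []).append((treated, age))

# Balance check: within each block, compare mean age of treated vs. control
for block_id in sorted(blocks):
    treated_ages = [age for is_treated, age in blocks[block_id] if is_treated]
    control_ages = [age for is_treated, age in blocks[block_id] if not is_treated]
    if len(treated_ages) > 1 and len(control_ages) > 1:
        print(f"block {block_id}: t = {welch_t(treated_ages, control_ages):.2f}")
```

In practice one would also re-estimate the blocks (splitting or merging them) until every within-block t statistic is insignificant, which is what -pscore- automates.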
Stratified Matching
After achieving strata balance, one can apply stratified matching to calculate the ATT. In each balanced block, the average difference in outcomes between the treated group and the matched control group is calculated. The ATT is then estimated as the sum of these within-block mean differences, each weighted by the fraction of treated cases in its block. The ATT can be expressed as

    ATT = \sum_{q=1}^{Q} \left( \frac{\sum_{i \in I_q} Y_i^T}{N_q^T} - \frac{\sum_{j \in I_q} Y_j^C}{N_q^C} \right) \frac{N_q^T}{N^T},    (2.2)

where Q denotes the number of blocks with balanced propensity scores, N_q^T and N_q^C refer to the number of cases in the treated and the control groups for matched block q, Y_i^T and Y_j^C represent the observed outcomes for case i in the matched treated group q and case j in the matched control group q, respectively, and N^T stands for the total number of cases in the treated group.
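As a concrete illustration, Formula 2.2 takes only a few lines of Python (a minimal sketch with made-up block data, not the article's STATA or SPSS code):

```python
def stratified_att(blocks):
    """Formula 2.2: weight each block's treated-control mean difference
    by that block's share of all treated cases."""
    n_treated_total = sum(len(treated) for treated, _ in blocks.values())
    att = 0.0
    for treated, control in blocks.values():
        diff = sum(treated) / len(treated) - sum(control) / len(control)
        att += diff * len(treated) / n_treated_total
    return att

# Hypothetical balanced blocks: block id -> (treated outcomes, control outcomes)
blocks = {1: ([2.0, 4.0], [1.0, 1.0]),  # mean difference 2.0, weight 2/3
          2: ([6.0], [2.0])}            # mean difference 4.0, weight 1/3
print(stratified_att(blocks))  # (2/3) * 2.0 + (1/3) * 4.0, roughly 2.67
```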
Table 3. Estimation Results

Sample       Unadjusted^a (1)  Adjusted^b (2)       Stratified^c ATT (3)  N^d (4)  Neighbor ATT (5)     N^d (6)  Radius^f ATT (7)     N^d (8)  Kernel^e ATT (9)     Covariate Adj.^g ATT (11)
NSW             1,794.34       1,676.34 (638.68)
PSID-1^h      -15,204.78         751.95 (915.26)    1,637.43 (805.43)       1,288  1,654.57 (1,174.63)     273   1,871.44 (5,837.10)      79   1,507.10 (826.11)    1,952.23 (791.45)
PSID-2^h       -3,646.81       1,873.77 (1,060.56)  1,467.04 (1,461.75)       308  1,604.09 (1,092.40)     271   1,519.60 (2,110.71)      53   1,712.18 (1,226.90)  1,593.32 (1,476.54)
PSID-3^h        1,069.85       1,833.13 (1,159.78)  1,843.20 (981.42)         250  1,522.23 (1,920.24)     280   1,632.74 (1,598.12)     102   1,776.37 (1,425.32)  1,583.41 (1,866.46)
CPS-1^i        -8,497.52         699.13 (547.64)    1,488.29 (716.79)       4,563  1,600.74 (957.05)       217   1,890.13 (1,993.50)     167   1,513.78 (726.47)    1,634.81 (515.58)
CPS-2^i        -3,821.97       1,172.70 (645.86)    1,676.43 (796.62)       1,438  1,638.74 (1,014.64)     231   1,775.99 (2,286.23)      77   1,590.49 (736.85)    1,550.90 (625.04)
CPS-3^i          -635.03       1,548.24 (781.28)    1,505.49 (1,065.52)       508  1,376.65 (1,129.24)     248   1,307.63 (2,821.56)      37   1,166.93 (864.38)    1,572.09 (943.65)
Mean^j         -5,122.71       1,313.15             1,602.98                       1,566.17                      1,666.26                      1,544.47             1,647.80
Variance^j  35,078,950.9     270,327.32            21,084.82                      10,712.11                     51,101.52                     45,779.09            23,016.46

Note: Bootstrap with 100 replications was used to estimate standard errors for the propensity score matching; standard errors in parentheses.
a. The mean difference between the treatment group (NSW Treated) and the corresponding control groups (NSW Control, PSID-1 to CPS-3).
b. Least squares regression: regress RE78 (earnings in 1978) on age, treatment dummy, education, no degree, Black, Hispanic, RE74 (earnings in 1974), and RE75 (earnings in 1975).
c. Blocks are stratified based on propensity scores, and Formula 2.2 is then used to estimate the ATT (average treatment effect on the treated).
d. The total number of observations, including observations in NSW Treated and the corresponding matched control groups.
e. For kernel matching, when the number of cases is small, a narrower bandwidth (.01) is used instead of .06.
f. The radius value ranges from .0001 to .0000025.
g. Regression with weights defined by the number of treated observations in each balanced propensity score block.
h. Observational covariates: age, treatment dummy, education, no degree, Black, Hispanic, RE74, and RE75. Higher order covariates: age², RE74², RE75², RE74 × Black.
i. Observational covariates: same as h; higher order covariates: age², education², RE74², RE75², education × RE74.
j. Mean and variance are calculated using the estimated ATT for each technique.
After stratifying the data into different blocks, one can calculate the ATT using the data listed in Appendix B. First, one can compute \sum_{i \in I_1} Y_i^T (the summation of the outcome variable in each block for each of the treated cases, denoted as Y_i^T in Appendix B) and \sum_{j \in I_1} Y_j^C (the summation of the outcome variable in each block for each of the control cases, denoted as Y_j^C in Appendix B). For example, in block 1 the summation of the outcome for the two treated cases is 49,237.66, and the summation of the outcome for the five control cases is 31,301.69. The number of cases in the treatment group (N_1^T) and the control group (N_1^C) for matched block 1 is 2 and 5, respectively. One then can calculate the ATT for each block. For instance, ATT for block 1 = 49,237.66/2 - 31,301.69/5 = 18,358.492. After computing the ATT for each block, one can get the weighted ATTs using the weight given by the fraction of treated cases in each block. For example, the weight for block 1 is 0.08 (two treated cases in block 1 divided by 25 treated cases in total). The final ATT is estimated by taking the summation of the weighted ATTs ($1,702.321), which means that individuals who received training will, on average, earn around $1,702 more per year than their counterparts who did not obtain governmental training. The estimated ATT using simple regression is $2,316.414. Comparing this with the true treatment effect in Table 3 ($1,676.34), one can see that the PSM produces an ATT substantively similar to the actual causal effect, given that the propensity scores of every block are balanced. I also conducted another simulation, drawing 200 randomly selected cases from NSW and PSID-2 50 times. The average ATT calculated by the PSM is $1,376.713, whereas the average ATT computed by regression analysis is $709.039. Clearly, the PSM produces an ATT closer to the true causal effect than does ordinary least squares (OLS). I further examined the balance test for each of these 50 randomly drawn data sets. Thirteen of the 50 data sets did not achieve strata balance. For those 13 data sets, the average ATT calculated by the PSM was $979.612, and the average ATT calculated by OLS was $697.626.
For the remaining 37 data sets that achieved strata balance, the average ATT calculated by the PSM was $1,516.23, and the average ATT calculated by OLS was $713.04. Therefore, achieving balance of propensity scores in each stratum is very important for obtaining a less biased estimator of the causal effect. I also provide SPSS code in Appendix C and STATA code in Appendix D, which readers can adapt to other statistical packages for stratified matching. The code shows how to fit the logit model, calculate propensity scores, stratify propensity scores, conduct the balance test, and compute the ATT using stratified matching. It is also convenient to implement the procedure in Excel after calculating the propensity scores in another statistical package. Readers who are interested in the Excel calculation can contact the author directly to obtain the original file for the calculation in Appendix B. Moreover, Appendix E presents a table of prewritten PSM software in R, SAS, SPSS, and STATA so that readers can conveniently find appropriate statistical packages. Combining NSW Treated with the other observational data sets, column 3 of Table 3 further details the estimated ATT using stratified matching. Column 3 shows that the lowest estimated result is $1,467.04 (PSID-2) and the highest estimate of the treatment effect is $1,843.20 (PSID-3). Overall, stratified matching produces an ATT relatively close to the unbiased ATT ($1,676.34).
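The block-1 arithmetic from Appendix B can be checked directly (the two sums and the 25-case treated total below are the figures quoted in the text):

```python
# Block 1 of Appendix B: 2 treated cases summing to 49,237.66 and
# 5 control cases summing to 31,301.69
block1_att = 49237.66 / 2 - 31301.69 / 5   # within-block mean difference
weight = 2 / 25                            # treated share: 2 of 25 treated cases
contribution = block1_att * weight         # block 1's term in Formula 2.2
print(block1_att, weight)
```

Summing such contributions across all balanced blocks yields the final stratified-matching ATT.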
Nearest Neighbor and Radius Matching

The ATT can be expressed as

    ATT = \frac{1}{N^T} \sum_{i \in T} \left( Y_i^T - \frac{1}{N_i^C} \sum_{j \in C} Y_j^C \right),    (2.3)
where N^T is the number of cases in the treated group and N_i^C is given by a weighting scheme that depends on the specific algorithm (e.g., for nearest neighbor matching, N_i^C will be the n comparison units with the closest propensity scores). For more information, readers can consult Heckman et al. (1997). For NN matching, one can randomly draw either backward or forward matches. For example, in Appendix B, for case 7 (propensity score = 0.101), one can draw forward matches and find the control case (case 2) with the closest propensity score (0.109). Drawing backward matches, one can find case 1 with the closest propensity score (0.076). After repeating this for each treated case, one can calculate the ATT using Formula 2.3. For radius matching, one needs to specify the radius first. For example, suppose one sets the radius at 0.01; then the only matched case for case 7 is case 2, because the absolute value of the difference in propensity scores between case 7 and case 2 is 0.008 (|0.101 - 0.109|), smaller than the radius value of 0.01. One can repeat this matching procedure for each of the treated cases and use Formula 2.3 to estimate the ATT. In Table 3, column 5 reports the estimated ATT using NN matching, which produced an ATT with a range from $1,376.65 (CPS-3) to $1,654.57 (PSID-1). Column 7 describes the estimated ATT using radius matching, which generated an ATT with a range from $1,307.63 (CPS-3) to $1,890.13 (CPS-1).
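Both matching rules can be sketched in a few lines of Python (the propensity scores below echo the Appendix B example of cases 1, 2, and 7, but the outcome values are hypothetical):

```python
def att_nn(treated, control):
    """Formula 2.3 with one nearest neighbor per treated case,
    matched with replacement on the propensity score."""
    diffs = []
    for ps_t, y_t in treated:
        _, y_c = min(control, key=lambda c: abs(c[0] - ps_t))
        diffs.append(y_t - y_c)
    return sum(diffs) / len(diffs)

def att_radius(treated, control, radius):
    """Radius matching: average all controls whose propensity score lies
    within `radius` of the treated case; unmatched treated cases drop out."""
    diffs = []
    for ps_t, y_t in treated:
        matches = [y for ps_c, y in control if abs(ps_c - ps_t) <= radius]
        if matches:
            diffs.append(y_t - sum(matches) / len(matches))
    return sum(diffs) / len(diffs)

control = [(0.076, 4000.0), (0.109, 6000.0)]   # cases 1 and 2
treated = [(0.101, 10000.0)]                   # case 7
print(att_nn(treated, control))            # case 2 (0.109) is the closest match
print(att_radius(treated, control, 0.01))  # only case 2 falls inside the radius
```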
Kernel Matching
Kernel matching is another nonparametric estimation technique, one that matches each treated unit with a weighted average of all controls. The weighting value is determined by the distance between propensity scores, a bandwidth parameter h_n, and a kernel function K(·). Scholars can specify the Gaussian kernel and an appropriate bandwidth parameter to estimate the treatment effect using Formula 2.4:

    ATT = \frac{1}{N^T} \sum_{i \in T} \left\{ Y_i^T - \sum_{j \in C} Y_j^C \, K\!\left( \frac{e_j(x) - e_i(x)}{h_n} \right) \bigg/ \sum_{k \in C} K\!\left( \frac{e_k(x) - e_i(x)}{h_n} \right) \right\},    (2.4)

where e_j(x) denotes the propensity score of case j in the control group, e_i(x) denotes the propensity score of case i in the treated group, and e_j(x) - e_i(x) represents the distance between the propensity scores. When one applies kernel matching, one downweights cases in the control group that lie far from the case in the treated group: the kernel function K(·) in Equation 2.4 takes large values when e_j(x) is close to e_i(x). To see how this happens, suppose one chooses the Gaussian density function K(z) = (1/\sqrt{2\pi}) e^{-z^2/2}, where z = (e_j(x) - e_i(x))/h_n, sets h_n = 0.05, and wants to match treated case 14 with control cases 10 and 11 (Appendix B). One can then compute z values for case 10 ([0.282 - 0.312]/0.05 = -0.6) and case 11 ([0.313 - 0.312]/0.05 = 0.02). The weights for cases 10 and 11 are 0.33 (K(-0.6)) and 0.40 (K(0.02)), respectively. Clearly, the weight is low for case 10 (0.33), whose propensity score lies far from that of treated case 14 (0.282 - 0.312 = -0.03), whereas the weight is relatively large for case 11 (0.40), whose propensity score lies close to that of case 14 (0.313 - 0.312 = 0.001). For more information on kernel matching, readers can refer to Heckman et al. (1998). In Table 3, column 9 shows the results for kernel matching. The estimated ATT using the kernel matching technique ranges from $1,166.93 (CPS-3) to $1,776.37 (PSID-3).
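Formula 2.4 and the worked weights above can be reproduced directly (a minimal sketch; the `att_kernel` usage data are hypothetical):

```python
import math

def gaussian_kernel(z):
    """Gaussian density K(z) = exp(-z^2 / 2) / sqrt(2 * pi)."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def att_kernel(treated, control, bandwidth):
    """Formula 2.4: each treated case is compared with a kernel-weighted
    average of all control cases."""
    diffs = []
    for ps_t, y_t in treated:
        weights = [gaussian_kernel((ps_c - ps_t) / bandwidth) for ps_c, _ in control]
        counterfactual = sum(w * y for w, (_, y) in zip(weights, control)) / sum(weights)
        diffs.append(y_t - counterfactual)
    return sum(diffs) / len(diffs)

# The worked weights from the text: control scores 0.282 and 0.313 against
# treated score 0.312 with bandwidth 0.05
print(round(gaussian_kernel((0.282 - 0.312) / 0.05), 2))  # 0.33
print(round(gaussian_kernel((0.313 - 0.312) / 0.05), 2))  # 0.4
```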
Covariance Adjustment
Covariance adjustment is a type of regression adjustment that weights the regression using propensity scores. The matching process does not consider the variance in the observational variables, because the PSM can balance the differences in the pretreatment variables in each block. Therefore, the observational variables in the balanced strata do not contribute to the treatment assignment and the potential outcome. Although each block has balanced propensity scores, the pretreatment variables may not have exactly the same distributions in the treatment group and the control group. Table 2 provides evidence that although the propensity scores are balanced in each stratum, the distributions of some variables do not fully overlap. For example, RE74 is statistically different between the treated and the matched control group for PSID-1. Covariate adjustment is achieved by using a matched sample to regress the treatment outcome on the covariates with appropriate weights for unmatched cases and duplicated cases. Dehejia and Wahba (1999) estimated the causal effect by conducting within-stratum regressions and taking a weighted sum over the strata. Imbens (2000) proposed that one can use the inverse of one minus the propensity score as the weight for each control case and the inverse of the propensity score as the weight for each treated case. Rubin (2001) provided additional discussion on covariate adjustment. Unlike matched sampling, covariance adjustment is a hybrid technique that combines nonparametric propensity matching with parametric regression. Column 11 of Table 3 reports the results of the covariance adjustment, which were produced by regressing RE78 on all observational variables, weighted by the number of treated cases in each block. This approach generates an ATT ranging from $1,550.90 (CPS-2) to $1,952.23 (PSID-1). Researchers have suggested two ways to calculate the variance of the nonparametric estimators of the ATT.
First, Imbens (2004) suggested that one can estimate the variance by calculating each of the five components included in the variance formula; the asymptotic variance can generally be estimated consistently using kernel methods. The bootstrap is the second nonparametric approach to calculating the variance (Efron & Tibshirani, 1997). Efron and Tibshirani (1997) argued that 50 bootstrap replications can produce a good estimator for standard errors, yet a much larger number of replications is needed to determine the bootstrap confidence interval. In Table 3, 100 bootstrap replications were used to calculate the standard errors for the matching techniques. In addition to calculating the variance nonparametrically, one can also compute it parametrically if covariance adjustment is used to produce the ATT. The standard errors in column 11 of Table 3 for the covariate adjustment technique were generated by linear regression.
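The bootstrap procedure for matching standard errors can be sketched as a generic resampling helper (the simple mean-difference estimator and the outcome data below are hypothetical stand-ins for any of the matching estimators above):

```python
import random
from statistics import stdev

def bootstrap_se(treated, control, estimator, reps=100, seed=1):
    """Resample treated and control groups with replacement, re-estimate
    the ATT each time, and take the standard deviation of the estimates."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        t_star = [rng.choice(treated) for _ in treated]
        c_star = [rng.choice(control) for _ in control]
        estimates.append(estimator(t_star, c_star))
    return stdev(estimates)

# Simple mean-difference estimator on hypothetical outcome data
mean_diff = lambda t, c: sum(t) / len(t) - sum(c) / len(c)
treated = [9.0, 11.0, 10.0, 12.0, 8.0]
control = [6.0, 7.0, 5.0, 8.0, 6.0]
print(bootstrap_se(treated, control, mean_diff))
```

With `reps=100` this mirrors the 100 replications used for the standard errors in Table 3; a much larger number of replications would be needed for bootstrap confidence intervals.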
Choosing Techniques
This article has reviewed different techniques for gauging the ATT. The performance of these strategies differs case by case and depends on data structure. Dehejia and Wahba (2002) demonstrated that when there is substantial overlap in the distribution of propensity scores (or balanced strata) between the treated and control groups, most matching techniques will produce similar results. Imbens (2004) remarked that there are no fully applicable versions of tools that do not require applied researchers to specify smoothing parameters. Specifically, little is still known about the optimal bandwidth, radius, and number of matches. That being said, scholars still need to consider particular issues in choosing the techniques that their research will employ. For nearest neighbor matching, it is important to determine how many comparison units match each treated unit. Increasing comparison units decreases the variance of the estimator but increases the bias of the estimator. Furthermore, one needs to choose between matching with replacement and
Figure 3. Issues to consider before choosing a matching technique. In summary: for nearest neighbor matching, increasing the number of matched neighbors raises bias and lowers variance, and matching without replacement raises bias and lowers variance; for radius matching, a larger maximum radius raises bias and lowers variance; for kernel matching, weighting follows a kernel function (e.g., Gaussian), and a wider bandwidth raises bias and lowers variance; stratified matching requires balanced strata and weights by the fraction of treated cases within each stratum; covariate adjustment weights by the number of treated cases in each stratum or by the inverse of the propensity score for treated cases.
matching without replacement (Dehejia & Wahba, 2002). When there are few comparison units, matching without replacement will force one to match treated units to comparison units that are quite different in propensity scores. This enhances the likelihood of bad matches (increasing the bias of the estimator), but it could also decrease the variance of the estimator. Thus, matching without replacement decreases the variance of the estimator at the cost of increasing the estimation bias. In contrast, because matching with replacement allows one comparison unit to be matched more than once with each nearest treatment unit, matching with replacement can minimize the distance between the treatment unit and the matched comparison unit. This will reduce the bias of the estimator but increase its variance. In regard to radius matching, it is important to choose the maximum value of the radius. The larger the radius is, the more matches can be found. More matches typically increase the likelihood of finding bad matches, which raises the bias of the estimator but decreases its variance. As far as kernel matching is concerned, choosing an appropriate bandwidth is also crucial, because a wider bandwidth produces a smoother function at the cost of tracking the data less closely. Typically, a wider bandwidth increases the chance of bad matches, so the bias of the estimator will also be high. Yet the additional comparison units brought in by a wider bandwidth will also decrease the variance of the estimator. Figure 3 summarizes the issues that scholars need to consider before choosing appropriate techniques. For organizational scholars, I recommend using stratified matching and covariate adjustment for the following reasons: First, these two techniques do not require scholars to choose specific smoothing parameters. The estimation of the ATT from these two techniques requires minimal statistical knowledge.
Second, the weighting parameters can easily be constructed from the data. One can use a similar weighting parameter (the number of treated cases in each block) for both techniques. For stratified matching, one counts the treated cases in each stratum and then computes the proportion of treated cases. For covariate adjustment, one can use the number of treated cases as weights in the regression model. Finally, the performance of these two approaches (Table 3) is relatively close to that of the other matching techniques. Overall, these two techniques are not only relatively simple but can also produce a reliable ATT.
Table 4a. Sensitivity Test

Sample     Stratified ATT (1)   N (2)   Neighbor ATT (3)     N (4)   Radius ATT (5)       N (6)   Kernel ATT (7)       N (8)   Covariate Adj. ATT (9)  N (10)
PSID-1     1,342.40 (763.09)    1,345   1,545.52 (1,093.77)    257     835.68 (3,877.08)     21     831.12 (805.65)     1,260   2,328.20 (693.69)        1,345
PSID-2       813.20 (1,081.68)    369     996.59 (1,643.11)    232   2,110.03 (2,999.31)     17   1,778.12 (1,000.81)     357   2,145.41 (1,143.55)        369
PSID-3     1,035.09 (1,091.28)    270   1,855.61 (1,703.87)    229   1,764.55 (1,269.51)    219   1,724.97 (1,283.44)     269   1,535.83 (1,400.24)        270
CPS-1      1,348.56 (651.14)    5,961   1,765.35 (869.69)      380   1,194.55 (1,855.94)    129   1,186.89 (578.68)     5,851   1,342.50 (470.60)        5,961
CPS-2      1,301.86 (714.36)    1,747   1,108.86 (995.48)      297   1,296.92 (2,341.93)     79   1,049.00 (654.90)     1,742   1,570.37 (478.94)        1,747
CPS-3      1,077.56 (707.68)      557   1,346.78 (1,019.54)    284     868.22 (2,752.29)     53   1,269.21 (704.80)       554   1,357.84 (685.77)          557
Mean       1,153.11                     1,436.45                     1,344.99                     1,306.55                      1,713.36
Variance  46,267.12                   120,918.64                   254,592.59                   141,108.36                    176,117.80

Note: All the sensitivity tests used only the observational covariates: age, education, no degree (no high school degree), Black, Hispanic, RE74 (earnings in 1974), and RE75 (earnings in 1975). No higher order covariates were included. Bootstrap with 100 replications was used to estimate standard errors for the propensity score matching; ATT = average treatment effect on the treated. Standard errors in parentheses.
Table 4b. Sensitivity Test

             PSID-1                                      CPS-2
G        p-critical^a  Lower Bound  Upper Bound     p-critical^a  Lower Bound  Upper Bound
1.00        0.042        216.997     1,752.880         0.006        641.387     2,089.060
1.05        0.074         57.226     1,941.530         0.013        468.296     2,262.150
1.10        0.119        -26.215     2,090.720         0.025        320.627     2,413.840
1.15        0.177       -188.640     2,293.670         0.044        196.642     2,545.930
1.20        0.246       -343.541     2,478.540         0.072         43.579     2,741.260
1.25        0.325       -455.599     2,627.530         0.110         -4.340     2,894.800
1.30        0.409       -621.988     2,778.500         0.157       -112.684     3,039.860

Note: G = the odds ratio that individuals will receive treatment.
a. The Wilcoxon signed-rank test gives the significance test for the upper bound.
One can compare the estimate of the causal effect from the PSM with IV estimators to determine the accuracy of the estimators calculated by the PSM. Unfortunately, the limited number of covariates in these data sets prevents me from using the IV approach to conduct the sensitivity analysis. Readers who are interested in this topic can find examples in Angrist et al. (1996) and DiPrete and Gangl (2004). Wooldridge (2002) provides further theoretical background on how IV can be used when one suspects the failure of the strongly ignorable assumption. Finally, Rosenbaum (2002, Chapter 4) proposed a bounding approach to test for hidden bias, which could make the estimated treatment effect biased. Suppose u1i and u0j are unobserved characteristics for individuals i and j in the treated group and the control group. G
refers to the effect of these unobserved variables on treatment assignment. The odds ratio that individuals receive treatment can be written simply as G = exp(u1i - u0j). If the unobserved variables u1i and u0j are uninformative, then the assignment process is random (G = 1) and the estimated ATT and confidence intervals are unbiased. When the unobserved variables are informative, the confidence intervals of the ATT become wider and the likelihood of finding support for the null hypothesis increases. The Rosenbaum bounding (RB) sensitivity test varies the effect of the unobserved variables on the treatment assignment to determine the point at which the significance test leads one to accept the null hypothesis. DiPrete and Gangl (2004) implemented the procedure in STATA for testing continuous outcomes; however, their program works only for one-to-one matching. Becker and Caliendo (2007) also implemented this method in STATA, but for testing dichotomous outcomes. Table 4b presents an example of using the RB test. The table reports the test only for PSID-1 and CPS-2 because, for these samples, the t values for the ATT estimated using stratified matching show strong evidence of a treatment effect. By varying the value of G, Table 4b reports the p value as well as the upper and lower bounds of the ATT. The Wilcoxon signed-rank test generates a significance test at a given level of hidden bias specified by the parameter G (DiPrete & Gangl, 2004). As reported in Table 4b, the estimated ATT is very sensitive to hidden bias. As far as PSID-1 is concerned, when the critical value of G is between 1.05 and 1.10 (i.e., when the unobserved variables cause the odds ratio of being assigned to the treated group rather than the control group to be about 1.10), one needs to question the conclusion of a positive effect of training on salary in the year 1978. In regard to the CPS-2 sample, when the critical value of G is between 1.20 and 1.25, one should question the positive effect of training on future salary.
Yet a value of G of 1.25 in CPS-2 does not mean that there is no positive effect of training on future earnings; it only means that if unobserved variables shifted the odds of treatment assignment by a factor of 1.25, the confidence interval for the salary effect would include zero, and such hidden bias would require unobserved covariates that almost perfectly determine the future salary in each matched case. RB presents a worst-case scenario that assumes treatment assignment is influenced by unobserved covariates. This sensitivity test conveys important information about how the level of uncertainty involved in matching estimators can undermine the conclusions of matched sampling analyses. The simple test in Table 4b generally reveals that the estimated causal effect of training is very sensitive to hidden biases that could influence the odds of treatment assignment.
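The RB logic for matched-pair differences can be sketched as follows (a normal approximation to the Wilcoxon signed-rank statistic in the spirit of DiPrete and Gangl's procedure; ties are ranked arbitrarily in this sketch, and the pair differences are hypothetical):

```python
import math

def rosenbaum_upper_p(pair_diffs, gamma):
    """Upper-bound one-sided p value for a positive treatment effect when
    hidden bias can shift the odds of treatment by at most `gamma`."""
    diffs = [d for d in pair_diffs if d != 0]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0] * len(diffs)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    t_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    p = gamma / (1 + gamma)  # worst-case probability that a pair is positive
    mean = p * sum(ranks)
    var = p * (1 - p) * sum(r * r for r in ranks)
    z = (t_plus - mean) / math.sqrt(var)
    return 0.5 * math.erfc(z / math.sqrt(2))  # one-sided upper tail

# Hypothetical matched-pair outcome differences (treated minus control)
diffs = [1200, 800, -300, 950, 400, 1500, -150, 700, 600, 1000]
for gamma in (1.0, 1.25, 1.5):
    print(gamma, round(rosenbaum_upper_p(diffs, gamma), 3))
```

As G (here `gamma`) grows, the upper-bound p value rises; the critical G in Table 4b is the point at which it crosses the chosen significance level.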
published before 2002, yet around 300 articles were published between 2009 and 2011. I first randomly selected one to two empirical studies from each of these top economics journals: American Economic Review, Econometrica, Quarterly Journal of Economics, and Review of Economic Studies. I then randomly selected one to two empirical articles from two top sociology journals: American Journal of Sociology and American Sociological Review. I finally randomly selected one to two studies from three top finance journals: Journal of Finance, Journal of Financial Economics, and Review of Financial Studies. Table 5 summarizes the data, analytical techniques, and key findings of these empirical articles employing the PSM in their fields. Given that management scholars have long relied on observational data sets, using the PSM will be fundamentally helpful in discovering the effectiveness of management interventions in areas such as strategy, entrepreneurship, and human resource management. For strategy scholars, future research can use the PSM to examine whether firms that adopt long-term incentive plans (e.g., stock options and stock ownership) increase overall performance. Clearly, the data used in this type of study are not experimental. Future research can use the PSM to adjust the distributions between firms using long-term incentive policies and those that have not adopted such policies. Indeed, the PSM can be widely used by strategy scholars who want to examine the outcomes of certain strategies. For example, one can examine whether duality (the practice of the CEO also being the chairman of the board) has real implications for stock price and long-term performance. The PSM can also be used in entrepreneurship research. Wasserman (2003) documented the paradox of success: founders were more likely to be replaced by professional managers when they led their firms to an important breakthrough (e.g., the receipt of additional funding from an external source).
Future research can further explore this question by investigating which types of funding lead to turnover in the top management teams of newly founded firms. For example, scholars can examine whether funding received from venture capitalists (VCs) has a different effect on executive turnover than funding obtained from a Small Business Innovation Research (SBIR) program. Similarly, using the PSM, scholars can examine how other interventions, such as a business plan, affect entrepreneurial performance. Like strategy scholars, entrepreneurship researchers can apply the PSM to many other questions. The PSM can also be widely implemented by strategic human resource management (SHRM) scholars. A major interest in the SHRM literature is whether HR practices contribute to firm performance. One can implement the PSM to investigate whether HR practices (e.g., downsizing) contribute to firm performance. When the strongly ignorable assumption is satisfied, the PSM provides an opportunity for HR scholars to document a less biased effect size for the relationship between HR practices and firm performance. HR researchers can adjust the distributions of the observational variables and then estimate the ATT of the HR practices on firm performance. In conclusion, the PSM is an effective technique for scholars to reconstruct counterfactuals using observational data sets.
Discussion
Research in other academic fields has documented the effectiveness of the PSM. Yet, like other methods, the PSM has its strengths and weaknesses. The first advantage of using the PSM is that it simplifies the matching procedure. The PSM can reduce k-dimensional observable variables into one dimension. Therefore, scholars can match observational data sets with k-dimensional covariates without sacrificing many observations or worrying about computational complexity. Second, the PSM eliminates two sources of bias (Heckman et al., 1998): bias from nonoverlapping supports and bias from different density weighting. The PSM increases the likelihood of achieving distribution overlap between the treated and control groups. Moreover, this technique reweights nonparticipant
24 Data Because of the nonrandom selection issues in the labor market, the propensity score matching technique and instrumental variables were used to examine the voluntary military service on earnings. Analytical Technique Key Findings Soldiers serving in the military in the early 1980s were paid more than comparable civilians. Military service increased the employment rate for veterans after service. Military service led to only a modest long-run increase in earnings for non-White veterans, but reduced the civilian earnings of White veterans. Credit constrained firms burned more cash, sold more assets to fund their operation, drew more heavily on lines of credit, and planned deeper cuts in spending. In addition, inability to borrow forced many firms to bypass lucrative investment opportunities. CFOs were asked to report whether their firms were credit constrained or not. Demographics of asset size, ownership form, and credit ratings were used to predict propensity scores. Average treatment effects of constrained credit were estimated by comparing the difference of spending between constrained and unconstrained firms. Propensity score matching on observable variables was used to reduce individual heterogeneity. Propensity score estimators calculating average treatment effects on treated (ATT) and the average difference-indifference showed that earning losses were 33% at the time of mass layoff and 12% 6 years later. (continued)
Author(s)
Military data come from Defense Manpower Data Center. Earnings data come from Social Security Administration.
Table 5. (continued) Data Propensity score matching was used to match non-current loans to currents loans. Propensity score is calculated using observational variables including credit rating, firm industry, and other variables. Analytical Technique Key Findings
Author(s)
They combined data sets from multiple databases. They collected data on seasoned equity issuers, including credit rating, stock return, lending history, and insurance history.
Data were collected from New Immigrant Survey with around 1,000 cases.
They used ordinal logistic model to calculate propensity scores, which were used to estimate the effect of skin color on earnings.
Survey of Income and Program Participation (SIPP) and European Community Household Panel (ECHP)
Data came from a number of sources, including the representative samples of students who completed high school in 1972, 1982, and 1992.
In the first stage, propensity score was used to adjust for selection on observational variables. In the second stage, the author examined the type of college a student will attend controlling for propensity scores.
Overall, underwriters (commercial banks and investment banks) engaged in concurrent lending and provide discounts. In addition, concurrent lending helped underwriters build relationships, which help underwriters increase the probability of receiving current and future business. They found an average difference of $2,435.63 difference between lighter and darker skinned individuals. In other words, darker skin individuals earn around $2,500 less per year than counterparts. Gangl found strong evidence that postunemployment losses are largely permanent, and such effect is particularly significant for older and high-wage workers as well as for female employees. The author found the evidence that a wide range of institutions engage in affirmative action for African American students as well as for Hispanic students. (continued)
Table 5. (continued)

Heckman, Ichimura, Smith, and Todd (1998). Data: The National Job Training Partnership Act (JTPA) study and the Survey of Income and Program Participation (SIPP). Analytical technique: Propensity score matching and the nonparametric conditional difference-in-difference estimator. Key findings: After decomposing program evaluation bias into a number of components, it was found that selection bias due to unobservable variables is less important than the other components; the matching technique can potentially eliminate much of the bias.

Lechner (2002). Analytical technique: A multinomial model was used to estimate propensity scores for discrete program choices (basic training, further training, employment program, and temporary wage subsidy). Key findings: The empirical evidence supports propensity score matching as an informative tool for adjusting for individual heterogeneity when individuals can select among multiple programs.

Malmendier and Tate (2009). Data: Hand-collected list of the winners of CEO awards between 1975 and 2002. Analytical technique: Propensity score matching was used to create a counterfactual sample of nonwinning CEOs; nearest neighbor matching, both with and without bias adjustment, identified the counterfactual sample. Key findings: Award-winning CEOs underperform over the 3 years following the award; relative underperformance is between 15% and 26%.

Xuan (2009). Analytical technique: Ordinary least squares was the major technique; the propensity score method was used as a robustness check to address the endogenous selection of CEOs. Key findings: Specialist CEOs, defined as CEOs promoted from certain divisions of their firm, negatively affect segment investment efficiency.
data to obtain equal distribution between the treated and control groups. Third, if treatment assignment is strongly ignorable, scholars can use the PSM on observational data sets to estimate an ATT that is reasonably close to the ATT calculated from experiments. Fourth, the matching technique is by nature nonparametric. Like other nonparametric approaches, it does not suffer from problems that are prevalent in most parametric models, such as distributional assumptions, and it generally outperforms simple regression analysis when the true functional form of the regression is nonlinear (Morgan & Harding, 2006). Finally, the PSM is an intuitively sounder method for dealing with covariates than traditional regression analysis. For example, the idea that the covariates in the treated group and the control group have the same distribution is much easier to understand than an interpretation that "controls all other variables at their means" or holds them ceteris paribus. Moreover, without appropriately adjusting for the covariate distribution, regression can produce an ATT estimate even when no meaningful ATT exists.

Despite its many advantages, the PSM also has limitations. Like other nonparametric techniques, the PSM generally has no test statistics. Although the bootstrap can be used to estimate the variance, such techniques are not fully justified or widely accepted by researchers (Imbens, 2004). Hence, the use of the PSM may be limited: while it can help scholars draw causal inferences, it cannot help with drawing statistical inferences. Another key hurdle is that there are currently no established procedures for investigating whether treatment assignment is strongly ignorable. Heckman et al. (1998) demonstrated that the PSM cannot eliminate bias due to unobservable differences across groups: the PSM can reweight observed covariates, but it cannot deal with unobservable variables.
Some unobservable variables (e.g., environmental context, region) can increase the bias of the ATT estimated using the PSM. Third, even when treatment assignment is strongly ignorable, the accuracy of the ATT estimated by the PSM depends on the quality of the observational data. Thus, measurement error (cf. Gerhart, Wright, & McMahan, 2000) and nonrandom missing values can affect the estimated ATT. Finally, although there are a number of propensity score matching techniques, there is little guidance on which types of matching work best for different applications.

Overall, despite its shortcomings, the PSM can be employed by management scholars to investigate the ATT of management interventions. Appropriately used, the PSM can eliminate bias due to nonoverlapping distributions between the treatment and control groups, and it can reduce the problem of unfair comparisons. However, scholars must be careful about the quality of their data because the effectiveness of the PSM depends on the observed covariates. Research using objective measures will be an optimal setting for the PSM. In empirical settings with low-quality data, scholars can implement the nonparametric PSM as a robustness test to corroborate the parametric findings generated by traditional econometric models.

To draw meaningful and honest causal inferences, one must choose the technique that works best for testing the causal relationship at hand. When one has collected panel data and believes that the omitted variables are time-invariant, the fixed effects model is the best choice for removing bias due to an omitted variable (Allison, 2009; Beck et al., 2008). When one finds one or more valid instrumental variables, two-stage least squares (2SLS) can likewise address the bias of causal effects calculated through conventional regression techniques.
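As a toy illustration of the 2SLS logic just mentioned (this sketch is illustrative and not from the article): with one endogenous regressor and one instrument, the just-identified 2SLS estimate reduces to cov(z, y)/cov(z, x). The data below are a deliberately constructed example in which the unobserved confounder u is correlated with x but, by design, orthogonal to the instrument z, so ordinary regression is biased while the IV estimate recovers the true slope.

```python
# Illustrative 2SLS/IV sketch with hypothetical toy data (not the article's).

def cov(a, b):
    """Population covariance of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

z = [1, 2, 3, 4]                                # instrument
u = [1, -1, -1, 1]                              # unobserved confounder, orthogonal to z
x = [zi + ui for zi, ui in zip(z, u)]           # endogenous regressor
y = [2 * xi + 3 * ui for xi, ui in zip(x, u)]   # true causal slope = 2

beta_ols = cov(x, y) / cov(x, x)   # biased upward by the confounder (= 3.33 here)
beta_iv = cov(z, y) / cov(z, x)    # just-identified 2SLS = IV estimator (= 2.0)
```

Because u loads into both x and y, the OLS slope overstates the causal effect, while the IV ratio isolates only the variation in x induced by z.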
When the endogenous variable suffers only from measurement error and the reliability coefficient is known, one can use regression analysis and correct the bias using that coefficient. Almost no technique is perfect for drawing an unbiased causal inference, including experimental design. Heckman and Vytlacil (2007) remarked that explicitly manipulating treatment assignment cannot always represent the real-world problem, because experimentation naturally discards information contained in a real-world context, such as dropout, self-selection, and noncompliance. Sometimes a combination of techniques is recommended. For example, to alleviate extrapolation bias in regression models, Imbens and Wooldridge (2009) recommend using matching to generate a balanced sample. Similarly, Rosenbaum and Rubin (1983) suggested that differences due to unobserved heterogeneity should be addressed after balancing the observed covariates. Additionally, the PSM can be incorporated in studies using a longitudinal design. Readers interested in estimating the ATT from longitudinal data can refer to the nonparametric conditional difference-in-difference model (Heckman et al., 1997) and the semiparametric conditional difference-in-difference model (Heckman et al., 1998). To conclude, drawing the best causal inference requires choosing the appropriate method, and among the various techniques, the PSM should be a candidate.
Conclusion
The purpose of this article is to introduce the PSM to the management field. The article makes several contributions to the organizational research methods literature. First, it not only advances management scholars' understanding of a neglected method for estimating causal effects but also discusses some of the technique's limitations. Second, by integrating previous work on the PSM, it provides a step-by-step flowchart that management scholars can easily implement in their empirical studies. The attached data set, with SPSS and STATA stratified matching code, helps management scholars calculate the ATT. Readers can make context-dependent decisions and choose the matching algorithm that best serves their objectives. Finally, a brief review of applications of the PSM in other social science fields, together with a discussion of its potential usage in management, provides an overview of how management scholars can employ the PSM in future empirical studies.
Appendix B. A Small Data Set for Manually Calculating Average Treatment Effect on the Treated Group (ATT)

Step 1

Case ID  Outcome     Treatment  Age  PScore  Block ID
1        10,048.54   0          50   0.076   1
2        0           0          19   0.109   1
3        20,688.17   0          26   0.128   1
4        0           0          44   0.140   1
5        664.977     0          39   0.177   1
6        36,646.95   1          35   0.075   1
7        12,590.71   1          33   0.101   1
8        24,642.57   0          32   0.265   2
9        10,344.09   0          44   0.268   2
10       9,788.461   0          41   0.282   2
11       0           0          33   0.313   2
12       0           0          20   0.365   2
13       13,167.52   1          22   0.261   2
14       4,321.705   1          26   0.312   2
15       12,558.02   1          46   0.361   2
16       12,418.07   1          46   0.392   2
17       0           0          40   0.412   3
18       17,732.72   0          26   0.456   3
19       4,433.18    0          30   0.481   3
20       0           0          21   0.513   3
21       17,732.72   0          20   0.558   3
22       7,284.986   1          41   0.511   3
23       5,522.788   1          17   0.525   3
24       20,505.93   1          24   0.547   3
25       0           1          27   0.590   3
26       2,364.363   0          41   0.678   4
27       22,165.90   0          23   0.727   4
28       7,447.742   0          24   0.746   4
29       2,164.022   1          21   0.654   4
30       11,141.39   1          23   0.739   4
31       3,462.564   1          29   0.758   4
32       559.443     1          20   0.759   4
33       4,279.613   1          19   0.764   4
34       0           1          23   0.768   4
35       5,615.361   0          28   0.923   5
36       0           0          23   0.954   5
37       13,385.86   1          18   0.913   5
38       8,472.158   1          27   0.948   5
39       0           1          18   0.954   5
40       6,181.88    1          17   0.959   5
41       289.79      1          21   0.961   5
42       17,814.98   1          37   0.965   5
43       9,265.788   1          17   0.966   5
44       1,923.938   1          25   0.970   5
45       8,124.715   1          25   0.987   5
46       11,821.81   0          53   0.001   Unmatched
47       24,825.81   0          52   0.003   Unmatched
48       33,987.71   0          28   0.009   Unmatched
49       33,987.71   0          41   0.013   Unmatched
50       54,675.88   0          38   0.016   Unmatched

Steps 2 and 3: Balance Statistics and Causal Effect Estimate by Block

Block ID  Tage  Tpscore  YiT         YjC         NqT  NqC  ATTq         Weight  ATTq x Weight
1         1.32  0.167    49,237.660  31,401.687  2    5    18,338.493   0.08    1,467.079
2         1.00  0.136    42,465.315  44,775.121  4    5    1,661.305    0.16    265.809
3         1.86  0.025    33,313.704  39,898.620  4    5    348.702      0.16    55.792
4         --    --       21,607.032  31,978.005  6    3    -7,058.163   0.24    -1,693.959
5         1.23  0.552    65,459.109  5,615.361   9    2    4,465.554    0.36    1,607.599

ATT (sum of ATTq x Weight over the five balanced blocks) = 1,702.321
Note: PScore = propensity score; Tage/Tpscore = t statistics for age and propensity score in each balanced block; YiT = sum of the outcome variable over treated cases in each block; YjC = sum of the outcome variable over control cases in each block; NqT = total number of treated cases in each block; NqC = total number of control cases in each block; ATTq (q = 1, ..., 5) = YiT/NqT - YjC/NqC, the average treatment effect for each balanced block; Weight = number of treated cases in the block divided by the total number of treated cases in the sample.
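The stratification arithmetic defined in the note above can be sketched in a few lines of code. The following is an illustrative implementation (not the article's SPSS/Stata code) applied to block 1 of Appendix B; it computes ATTq = YiT/NqT - YjC/NqC per balanced block and weights each block by its share of all treated cases.

```python
# Illustrative stratification-matching ATT estimator (sketch, not the
# article's code). Input: (outcome, treatment, block_id) tuples.

def stratified_att(cases):
    sums = {}
    for outcome, treated, block in cases:
        s, n = sums.get((block, treated), (0.0, 0))
        sums[(block, treated)] = (s + outcome, n + 1)
    total_treated = sum(n for (_, t), (_, n) in sums.items() if t == 1)
    att = 0.0
    for block in {b for b, _ in sums}:
        s1, n1 = sums.get((block, 1), (0.0, 0))
        s0, n0 = sums.get((block, 0), (0.0, 0))
        if n1 == 0 or n0 == 0:
            continue  # block without both groups contributes nothing
        att_q = s1 / n1 - s0 / n0            # ATTq = YiT/NqT - YjC/NqC
        att += att_q * n1 / total_treated    # weight = NqT / total treated
    return att

# Block 1 of Appendix B: five control cases and two treated cases.
block1 = [
    (10048.54, 0, 1), (0, 0, 1), (20688.17, 0, 1), (0, 0, 1),
    (664.977, 0, 1), (36646.95, 1, 1), (12590.71, 1, 1),
]
print(round(stratified_att(block1), 4))  # prints 18338.4926, the ATTq for block 1
```

Running the same function over all five balanced blocks of Appendix B reproduces the overall ATT of 1,702.321.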
The above code calculates the predicted probability using a number of observed variables (e.g., X1, X2, and X3). Readers can change the variable names accordingly.
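For readers working outside SPSS, the predicted probability is simply the logistic transform of the fitted linear predictor. A minimal sketch follows; the coefficients b0 through b3 are hypothetical placeholders (they are not values from the article) and would in practice come from a fitted logit model of treatment on X1, X2, and X3.

```python
import math

def propensity(x1, x2, x3, b0=-1.0, b1=0.5, b2=0.3, b3=-0.2):
    """Logistic (sigmoid) transform of the linear predictor.

    b0..b3 are hypothetical placeholders for fitted logit coefficients.
    """
    eta = b0 + b1 * x1 + b2 * x2 + b3 * x3
    return 1.0 / (1.0 + math.exp(-eta))
```

The returned value always lies strictly between 0 and 1, which is what makes it usable as a propensity score for stratification in the next step.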
*Step 2: Stratify into five blocks.
compute blockid=0.
if (pscore < .2) & (pscore > .05) blockid=1.
if (pscore < .4) & (pscore > .2) blockid=2.
if (pscore < .6) & (pscore > .4) blockid=3.
if (pscore < .8) & (pscore > .6) blockid=4.
if (pscore > .8) blockid=5.
execute.

*Perform t test for each block.
*Split file first, and then execute t test.
SORT CASES BY blockid.
SPLIT FILE SEPARATE BY blockid.
T-TEST GROUPS=treatment(0 1)
  /MISSING=ANALYSIS
  /VARIABLES=age pscore
  /CRITERIA=CI(.95).
The above code first stratifies cases into five blocks and then carries out the t-test within each block. SPSS has no if option for the t-test, so it is important to split the data by block ID first and then conduct the t-test.
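The within-block balance check can also be sketched outside SPSS. The function below (an illustrative sketch, not the article's code) computes the pooled-variance two-sample t statistic, the equal-variances form reported in SPSS T-TEST output; within a balanced block, a nonsignificant t for each covariate indicates balance between treated and control cases.

```python
import math

def two_sample_t(group0, group1):
    """Pooled-variance (equal-variances) two-sample t statistic,
    as used to check covariate balance within one block."""
    n0, n1 = len(group0), len(group1)
    m0 = sum(group0) / n0
    m1 = sum(group1) / n1
    ss0 = sum((x - m0) ** 2 for x in group0)
    ss1 = sum((x - m1) ** 2 for x in group1)
    sp2 = (ss0 + ss1) / (n0 + n1 - 2)  # pooled variance estimate
    return (m0 - m1) / math.sqrt(sp2 * (1 / n0 + 1 / n1))
```

For example, comparing the ages of control and treated cases within a block would call two_sample_t(ages_control, ages_treated) and compare the result against the t distribution with n0 + n1 - 2 degrees of freedom.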
*Step 3: Perform Stratification Matching Procedure.
*Calculate YiT and YjC in Appendix B.
AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=blockid treatment
  /outcome_sum=SUM(outcome).
*Calculate NqT and NqC in Appendix B.
AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=blockid treatment
  /N_BREAK=N.
*Calculate total number of treatment cases.
AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=
  /N_Treatment=SUM(treatment).
COMPUTE ATTQ=outcome_sum/N_BREAK.
EXECUTE.
DATASET DECLARE agg_all.
AGGREGATE
  /OUTFILE=agg_all
  /BREAK=treatment blockid
  /N_Block_T=MEAN(N_BREAK)
  /ATTQ_T=MEAN(ATTQ)
  /N_Treatment=MEAN(N_Treatment).
DATASET ACTIVATE agg_all.
DATASET COPY agg_treat.
DATASET ACTIVATE agg_treat.
FILTER OFF.
USE ALL.
SELECT IF (treatment=1).
EXECUTE.
DATASET ACTIVATE agg_all.
DATASET COPY agg_control.
DATASET ACTIVATE agg_control.
FILTER OFF.
USE ALL.
SELECT IF (treatment=0 & blockid<6).
EXECUTE.
DATASET ACTIVATE agg_control.
RENAME VARIABLES (N_Block_T ATTQ_T = N_Block_C ATTQ_C).
MATCH FILES
  /FILE=*
  /FILE=agg_treat
  /RENAME=(blockid N_Treatment treatment = d0 d1 d2)
  /DROP=d0 d1 d2.
EXECUTE.
COMPUTE ATTQ=ATTQ_T-ATTQ_C.
EXECUTE.
COMPUTE weight=N_Block_T/N_Treatment.
EXECUTE.
COMPUTE ATTxweight=ATTQ*weight.
EXECUTE.
AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES OVERWRITEVARS=YES
  /BREAK=
  /ATTxweight_sum=SUM(ATTxweight).
DATASET CLOSE agg_all.
DATASET CLOSE agg_control.
DATASET CLOSE agg_treat.
This step computes each of the components in Equation 2.2. It first calculates the number of treated cases and the number of control cases in each matched block, then computes the sum of the outcome in each balanced block. The code then extracts the necessary components into two data sets, agg_control and agg_treat. Finally, the code matches these two data sets on block ID and estimates the ATT. The final result is displayed in the variable ATTxweight.
*STEP 2: t test for balance in each block
foreach var of varlist age pscore {
    forvalues i = 1/5 {
        ttest `var' if blockid == `i', by(treatment)
    }
}

*STEP 3: Estimate causal effects using stratified matching
sort blockid treatment
gen YTQ = .   // YiT in Appendix B table
gen TTN = 1   // NqT in Appendix B table
gen YCQ = .   // YjC in Appendix B table
gen TCN = 1   // NqC in Appendix B table
forvalues i = 1/5 {
    *Get sum for outcome in each treated block
    sum outcome if treatment == 1 & blockid == `i'
    replace YTQ = r(sum) if blockid == `i'
    *Number of treated cases in each block
    sum TTN if treatment == 1 & blockid == `i'
    replace TTN = r(sum) if blockid == `i'
    *Get sum for outcome in each control block
    sum outcome if treatment == 0 & blockid == `i'
    replace YCQ = r(sum) if blockid == `i'
    *Number of control cases in each block
    sum TCN if treatment == 0 & blockid == `i'
Appendix E. Software Packages for Applying the Propensity Score Method (PSM)

Environment: R

Matching (Sekhon, 2007). Relies on an automated procedure to detect matches based on a number of univariate and multivariate metrics. It performs propensity matching, primarily 1:M matching, and allows matching with and without replacement. Download source: http://sekhon.berkeley.edu/matching/ Document: http://cran.r-project.org/web/packages/Matching/Matching.pdf

PSAgraphics (Helmreich & Pruzek, 2009). Provides enriched graphical tools to test within-strata balance, as well as graphical tools to inspect covariate distributions across strata. Download source: http://cran.r-project.org/web/packages/PSAgraphics/index.html

twang (Ridgeway, McCaffrey, & Morral, 2006). Includes propensity score estimation and weighting. Generalized boosted regression is used to estimate propensity scores, thus simplifying the estimation procedure. Download source: http://cran.r-project.org/web/packages/twang/index.html

Environment: SAS

Greedy matching (gmatch) (Kosanke & Bergstralh, 2004). Performs 1:1 nearest neighbor matching. Download source: http://mayoresearch.mayo.edu/mayo/research/biostat/upload/gmatch.sas

OneToManyMTCH (Parsons, 2004). Allows users to specify propensity score matching from 1:1 to 1:M. Download source: http://www2.sas.com/proceedings/sugi29/165-29.pdf

Environment: SPSS

SPSS macro for propensity score matching (Painter, 2004). Performs nearest neighbor propensity score matching; it appears to do only 1:1 matching without replacement. Download source: http://www.unc.edu/~painter/SPSSsyntax/propen.txt

Environment: STATA

pscore (Becker & Ichino, 2002). Estimates propensity scores and conducts a number of matching procedures, such as radius, nearest neighbor, kernel, and stratified matching. Download source: http://www.lrz.de/~sobecker/pscore.html

psmatch2 (Leuven & Sianesi, 2003). Allows a number of matching procedures, including kernel matching and k:1 matching. It also supports common support graphs and balance testing. Download source: http://ideas.repec.org/c/boc/bocode/s432001.html
Acknowledgments
Special thanks to Barry Gerhart for his invaluable support and to Associate Editor James LeBreton and the anonymous reviewers for their constructive feedback. This article has also benefited from suggestions by Russ Coff, Jose Cortina, Cindy Devers, Jon Eckhardt, Phil Kim, and seminar participants at the 2011 AOM conference.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Notes
1. Harder, Stuart, and Anthony (2010) argued that the propensity score method (PSM) can be used to estimate the average treatment effect on the treated group (ATT) and that subclassifying on the propensity score can be used to calculate the average treatment effect (ATE). However, economists have typically viewed the PSM as a technique for estimating the ATT (Dehejia & Wahba, 1999, 2002). Following Dehejia and Wahba (1999, 2002), the remaining sections regard the PSM as a way to calculate the ATT and use causal effects, treatment effects, and ATT interchangeably.
2. Psychology scholars also extended this work to develop the causal steps approach for drawing mediating causal inferences (e.g., Baron & Kenny, 1986). It is beyond the scope of this article to fully discuss mediation; interested readers can consult LeBreton, Wu, and Bing (2008) and Wood, Goodman, Beckmann, and Cook (2008) for surveys.
3. Becker and Ichino (2002) have written a useful STATA program (pscore) to estimate the propensity score. The convenience of pscore is that it can stratify propensity scores into a specified number of blocks and test the balance of propensity scores within each block. However, when there is more than one treatment, it is inappropriate to use pscore to estimate the propensity score.
4. Propensity score matching is one of many matched sampling techniques. One can use exact matching based simply on one or more covariates; for example, scholars may match samples on standard industry classification (SIC) code and firm size rather than on propensity scores.
5. These components are: the variance of the covariates in the control group, the variance of the covariates in the treated group, the mean of the covariates in the control group, the mean of the covariates in the treated group, and the estimated propensity score. The variances of the covariates in the treated and control groups are weighted by the propensity score.
6. An instrumental variable (IV) is typically used under conditions of simultaneity. Because of the difficulty of finding an IV, it is not viewed as a general remedy for endogeneity issues.
References
Allison, P. (2009). Fixed effects regression models. Newbury Park, CA: Sage. Angrist, J. (1998). Estimating the labor market impact of voluntary military service using social security data on military applicants. Econometrica, 66, 249-288. Angrist, J. D., Imbens, G. W., & Rubin, D. B. (1996). Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91, 444-455. Antonakis, J., Bendahan, S., Jacquart, P., & Lalive, R. (2010). On making causal claims: A review and recommendations. The Leadership Quarterly, 21(6), 1086-1120. Arceneaux, K., Gerber, A., & Green, D. (2006). Comparing experimental and matching methods using a large-scale voter mobilization experiment. Political Analysis, 14, 1-26. Baron, R. M., & Kenny, D. A. (1986). The moderator-mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology, 51(6), 1173-1182. Beck, N., Bruderl, J., & Woywode, M. (2008). Momentum or deceleration? Theoretical and methodological reflections on the analysis of organizational change. Academy of Management Journal, 51(3), 413-435. Becker, S., & Caliendo, M. (2007). Sensitivity analysis for average treatment effects. Stata Journal, 7(1), 71-83. Becker, S., & Ichino, A. (2002). Estimation of average treatment effects based on propensity scores. The Stata Journal, 2, 358-377. Berk, R. A. (1983). An introduction to sample selection bias in sociological data. American Sociological Review, 48(3), 386-398. Campello, M., Graham, J., & Harvey, C. (2010). The real effects of financial constraints: Evidence from a financial crisis. Journal of Financial Economics, 97, 470-487. Cochran, W. (1957). Analysis of covariance: Its nature and uses. Biometrics, 13(3), 261-281. Cochran, W. (1968). The effectiveness of adjustment by subclassification in removing bias in observational studies. Biometrics, 24, 295-313. Couch, K. A., & Placzek, D. W. 
(2010). Earnings losses of displaced workers revisited. American Economic Review, 100, 572-589. Cox, D. (1992). Causality: Some statistical aspects. Journal of the Royal Statistical Society, Series A (Statistics in Society), 155, 291-301. Dehejia, R., & Wahba, S. (1999). Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. Journal of the American Statistical Association, 94, 1053-1062. Dehejia, R., & Wahba, S. (2002). Propensity score-matching methods for nonexperimental causal studies. Review of Economics and Statistics, 84, 151-161. DiPrete, T. A., & Gangl, M. (2004). Assessing bias in the estimation of causal effects: Rosenbaum bounds on matching estimators and instrumental variables estimation with imperfect instruments. Sociological Methodology, 34, 271-310. Duncan, O. D. (1975). Introduction to structural equation models. San Diego, CA: Academic Press.
Drucker, S., & Puri, M. (2005). On the benefits of concurrent lending and underwriting. Journal of Finance, 60(6), 2763-2799. Efron, B., & Tibshirani, R. (1997). An introduction to the bootstrap. London: Chapman & Hall. Frank, R., Akresh, I. R., & Lu, B. (2010). Latino Immigrants and the US racial order: How and where do they fit in? American Sociological Review, 75(3), 378-401. Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29, 1189-1232. Gangl, M. (2006). Scar effects of unemployment: An assessment of institutional complementarities. American Sociological Review, 71(6), 986-1013. Gerhart, B. (2007). Modeling human resource management and performance linkages. In P. Boxall, J. Purcell, & P. Wright (Eds.), The Oxford handbook of human resource management (pp. 552-580). Oxford: Oxford University Press. Gerhart, B., Wright, P., & McMahan, G. (2000). Measurement error in research on the human resources and firm performance relationship: Further evidence and analysis. Personnel Psychology, 53, 855-872. Greene, W. (2008). Econometric analysis (6th ed.). Upper Saddle River, NJ: Prentice Hall. Grodsky, E. (2007). Compensatory sponsorship in higher education. American Journal of Sociology, 112(6), 1662-1712. Gu, X., & Rosenbaum, P. (1993). Comparison of multivariate matching methods: Structures, distances, and algorithms. Journal of Computational and Graphical Statistics, 2, 405-420. Harder, V. S., Stuart, E. A., & Anthony, J. C. (2010). Propensity score techniques and the assessment of measured covariate balance to test causal associations in psychological research. Psychological Methods, 15, 234-249. Hamilton, B. H., & Nickerson, J. A. (2003). Correcting for endogeneity in strategic management research. Strategic Organization, 1, 51-78. Heckman, J. (1979). Sample selection bias as a specification error. Econometrica, 47, 153-161. Heckman, J., & Hotz, V. (1989). 
Choosing among alternative nonexperimental methods for estimating the impact of social programs: The case of manpower training. Journal of the American Statistical Association, 84, 862-874. Heckman, J., Ichimura, H., Smith, J., & Todd, P. (1998). Characterizing selection bias using experimental data. Econometrica, 66, 1017-1098. Heckman, J., Ichimura, H., & Todd, P. E. (1997). Matching as an econometric evaluation estimator: Evidence from evaluating job training program. Review of Economic Studies, 64, 605-654. Heckman, J. J., & Vytlacil, E. J. (2007). Econometric evaluation of social programs, part II: Using the marginal treatment effect to organize alternative econometric estimators to evaluate social programs, and to forecast their effects in new environments. Handbook of Econometrics, 6, 4875-5143. Helmreich, J. E., & Pruzek, R. M. (2009). PSAgraphics: An R package to support propensity score analysis. Journal of Statistical Software, 29, 1-23. Hoetker, G. (2007). The use of logit and probit models in strategic management research: Critical issues. Strategic Management Journal, 28(4), 331-343. Imbens, G. (2000). The role of the propensity score in estimating dose-response functions. Biometrika, 87(3), 706-710. Imbens, G. W. (2004). Nonparametric estimation of average treatment effects under exogeneity: A review. The Review of Economics and Statistics, 86, 4-29. Imbens, G. W., & Wooldridge, J. M. (2009). Recent developments in the econometrics of program evaluation. Journal of Economic Literature, 47(1), 5-86. James, L. R. (1980). The unmeasured variables problem in path analysis. Journal of Applied Psychology, 65(4), 415-421. James, L. R., Mulaik, S. A., & Brett, J. M. (1982). Causal analysis: Assumptions, models, and data. Thousand Oaks, CA: Sage.
Joffe, M. M., & Rosenbaum, P. R. (1999). Invited commentary: Propensity scores. American Journal of Epidemiology, 150, 327-333. King, G., Keohane, R. O., & Verba, S. (1994). Designing social inquiry: Scientific inference in qualitative research. Princeton, NJ: Princeton University Press. Kosanke, J., & Bergstralh, E. (2004). gmatch: Match 1 or more controls to cases using the GREEDY algorithm. Retrieved from http://mayoresearch.mayo.edu/mayo/research/biostat/upload/gmatch.sas (accessed May 15, 2012) Lalonde, R. J. (1986). Evaluating the econometric evaluations of training programs with experimental data. American Economic Review, 76, 604-620. LeBreton, J. M., Wu, J., & Bing, M. N. (2008). The truth(s) on testing for mediation in the social and organizational sciences. In C. E. Lance, & R. J. Vandenberg (Eds.), Statistical and methodological myths and urban legends (pp. 107-140). New York, NY: Routledge. Lechner, M. (2002). Program heterogeneity and propensity score matching: An application to the evaluation of active labor market policies. Review of Economics and Statistics, 84, 205-220. Leuven, E., & Sianesi, B. (2003). PSMATCH2: Stata module to perform full Mahalanobis and propensity score matching, common support graphing, and covariate imbalance testing [Statistical software components]. Boston, MA: Boston College. Li, Y., Propert, K., & Rosenbaum, P. (2001). Balanced risk set matching. Journal of the American Statistical Association, 96, 870-882. Long, J. S. (1997). Regression models for categorical and limited dependent variables. Thousand Oaks, CA: Sage. Malmendier, U., & Tate, G. (2009). Superstar CEOs. The Quarterly Journal of Economics, 124(4), 1593-1638. McCaffrey, D. F., Ridgeway, G., & Morral, A. R. (2004). Propensity score estimation with boosted regression for evaluating causal effects in observational studies. Psychological Methods, 9, 403-425. Mellor, S., & Mark, M. M. (1998). 
A quasi-experimental design for studies on the impact of administrative decisions: Applications and extensions of the regression-discontinuity design. Organizational Research Methods, 1(3), 315-333. Morgan, S. L., & Harding, D. J. (2006). Matching estimators of causal effects: Prospects and pitfalls in theory and practice. Sociological Methods & Research, 35, 3-60. Morgan, S. L., & Winship, C. (2007). Counterfactuals and causal inference: Methods and principles for social research. Cambridge, UK: Cambridge University Press. Painter, J. (2004). SPSS syntax for nearest neighbor propensity score matching. Retrieved from http://www.unc.edu/~painter/SPSSsyntax/propen.txt (accessed May 15, 2012) Parsons, L. (2004). Performing a 1:N case-control match on propensity score. Proceedings of the 29th Annual SAS Users Group International Conference, SAS Institute, Montreal, Canada. Ridgeway, G., McCaffrey, D., & Morral, A. (2006). Toolkit for weighting and analysis of nonequivalent groups: A tutorial for the twang package. Santa Monica, CA: RAND Corporation. Rosenbaum, P. (1987). The role of a second control group in an observational study. Statistical Science, 2, 292-306. Rosenbaum, P. (2002). Observational studies. New York, NY: Springer-Verlag. Rosenbaum, P. (2004). Matching in observational studies. In A. Gelman & X. Meng (Eds.), Applied Bayesian modeling and causal inference from an incomplete-data perspective (pp. 15-24). New York, NY: Wiley. Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41-55. Rosenbaum, P., & Rubin, D. (1984). Reducing bias in observational studies using subclassification on the propensity score. Journal of the American Statistical Association, 79, 516-524. Rosenbaum, P., & Rubin, D. (1985). Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. 
American Statistician, 39, 33-38.
Rousseau, D. (2006). Is there such a thing as evidence-based management? Academy of Management Review, 31, 256-269. Rubin, D. (1997). Estimating causal effects from large data sets using propensity scores. Annals of Internal Medicine, 127, 757-763. Rubin, D. (2001). Using propensity scores to help design observational studies: Application to the tobacco litigation. Health Services and Outcomes Research Methodology, 2(3), 169-188. Rubin, D. (2004). Teaching statistical inference for causal effects in experiments and observational studies. Journal of Educational and Behavioral Statistics, 29, 343-367. Rynes, S., Giluk, T., & Brown, K. (2007). The very separate worlds of academic and practitioner periodicals in human resource management: Implications for evidence-based management. Academy of Management Journal, 50(5), 987-1008. Schonlau, M. (2005). Boosted regression (boosting): An introductory tutorial and a Stata plugin. Stata Journal, 5, 330-354. Sekhon, J. S. (2007). Multivariate and propensity score matching software with automated balance optimization: The matching package for R. Journal of Statistical Software, 10(2), 1-51. Smith, J., & Todd, P. E. (2005). Does matching overcome LaLonde's critique of nonexperimental estimators? Journal of Econometrics, 125, 305-353. Steiner, P. M., Cook, T. D., Shadish, W. R., & Clark, M. H. (2010). The importance of covariate selection in controlling for selection bias in observational studies. Psychological Methods, 15, 250-267. Wasserman, N. (2003). Founder-CEO succession and the paradox of entrepreneurial success. Organization Science, 14(2), 149-172. Wolfe, F., & Michaud, K. (2004). Heart failure in rheumatoid arthritis: Rates, predictors, and the effect of anti-tumor necrosis factor therapy. American Journal of Medicine, 116, 305-311. Wood, R. E., Goodman, J. S., Beckmann, N., & Cook, A. (2008). Mediation testing in management research: A review and proposals. Organizational Research Methods, 11(2), 270-295. 
Wooldridge, J. (2002). Econometric analysis of cross section and panel data. Cambridge, MA: MIT Press. Xuan, Y. (2009). Empire-building or bridge-building? Evidence from new CEOs' internal capital allocation decisions. Review of Financial Studies, 22, 4919-4948.
Bio
Mingxiang Li is a doctoral candidate at the Wisconsin School of Business, University of Wisconsin-Madison. In addition to research methods, his current research interests include corporate governance, social networks, and entrepreneurship.