Design-Based Causal Inference in Bipartite Experiments∗
Sizhu Lu† Lei Shi‡ Yue Fang§ Wenxin Zhang¶ Peng Ding‖
Abstract
Bipartite experiments are widely used across various fields, yet existing methods often rely on strong
assumptions about modeling the potential outcomes and exposure mapping. In this paper, we explore
design-based causal inference in bipartite experiments, where treatments are randomized over one set of
units, while outcomes are measured over a separate set of units. We first formulate the causal inference
problem under a design-based framework that generalizes the classic assumption to account for bipartite
interference. We then propose a point estimator for the total treatment effect, establish a
central limit theorem for it, and propose a conservative variance estimator. Additionally, we
discuss a covariate adjustment strategy to enhance estimation efficiency.
Key Words: bipartite interference; finite population; covariate adjustment; randomization inference
1 Introduction
Bipartite experiments have gained increasing recognition for their utility in various fields, including digital ex-
perimentation (Harshaw et al., 2023; Shi et al., 2024) and environmental science (Zigler and Papadogeorgou,
2021). In bipartite experiments, the treatments are randomized over one set of units, called treatment units,
while the outcomes are measured over a separate set of units, called the outcome units. This is different from
the classic experiment settings where we randomly assign units to different treatment groups and measure
their outcome of interest after the initiation of treatment. As illustrated in Figure 1, the treatment units and
outcome units are connected through a known fixed bipartite graph, and causal dependencies are represented
by the bipartite network, leading to what is known as bipartite interference (Zigler and Papadogeorgou,
2021).
[Figure 1: A bipartite graph connecting treatment units to outcome units.]
∗ The first two authors contributed equally to this work.
† Department of Statistics, University of California, Berkeley, CA 94720, USA. Email: sizhu [email protected]
‡ Division of Biostatistics, University of California, Berkeley, CA 94720, USA. Email: [email protected]
§ School of Management and Economics, The Chinese University of Hong Kong, Shenzhen, 518172 China. Email: [email protected]
¶ Division of Biostatistics, University of California, Berkeley, CA 94720, USA. Email: wenxin [email protected]
‖ Department of Statistics, University of California, Berkeley, CA 94720, USA. Email: [email protected]
There has been extensive research on bipartite experiments. A well-known special case is cluster
randomization, which has been studied in theory and adopted in practice (Donner, 1998; Donner et al., 2000;
Su and Ding, 2021). The cluster experiment setup corresponds to a situation where each outcome unit is
connected to exactly one treatment unit. More recent works have focused on the general bipartite network, where
each outcome unit may be connected to multiple treatment units, and vice versa. Zigler and Papadogeorgou
(2021) formulated a set of causal estimands in bipartite experiments and proposed an inverse probability-
weighted estimator for observational studies. Doudchenko et al. (2020) leveraged the generalized propen-
sity score to obtain unbiased estimates of causal effects. Harshaw et al. (2023) explored the estimation
and inference of the average total treatment effect under a linear exposure-response model. Shi et al.
(2024) extended Harshaw et al. (2023) by studying covariate adjustment under a double linear model.
Song and Papadogeorgou (2024) examined bipartite experiments in time series and random network settings
using exposure mapping and matching estimators for observational studies. Several works have also addressed
the design of experiments in the presence of bipartite interference. For example, Pouget-Abadie et al. (2019),
Harshaw et al. (2023), and Brennan et al. (2022) investigated methods for constructing better experimental
designs in such contexts.
In this work, we conduct a design-based analysis for bipartite experiments, where we define causal
parameters based on fixed potential outcomes and derive properties of estimators for causal effects under
the randomness of treatment assignment (Imbens and Rubin, 2015; Ding, 2024). Many existing approaches
(e.g., Harshaw et al., 2023; Doudchenko et al., 2020; Lu et al., 2024) have touched on such a perspective for
bipartite experiments. However, these works typically rely on strong modeling assumptions on the potential
outcomes. For example, Harshaw et al. (2023) adopted a linear exposure mapping (Aronow and Samii, 2017;
Forastiere et al., 2021) with a linear outcome model. Doudchenko et al. (2020) used exposure mapping for
estimation along with a linear model for variance estimation. Lu et al. (2024) applied a heterogeneous
additive effect model on the potential outcomes. Estimation and inference using these model-based methods
depend heavily on correct model specifications, and a more flexible design-based framework that requires
fewer assumptions has not yet been rigorously discussed in the context of bipartite experiments. There
are several challenges to adopting such a framework in this context. For instance, how can we establish
central limit theorems and construct valid variance estimators with strong dependencies across outcome
units? Furthermore, if covariate information is available, how should we perform covariate adjustment in a
model-agnostic fashion without relying on parametric assumptions on the outcomes and covariates, following
the spirit of Lin (2013)?
Our contributions. First, we formulate the causal inference problem in bipartite experiments under the
design-based framework. We generalize the stable unit treatment value assumption (SUTVA) to account
for bipartite interference. This generalization is tailored to the bipartite network structure, enabling the
identification of the total treatment effect as a function of observed data. Unlike model-based frameworks,
our approach avoids the strong assumptions on modeling the potential outcomes or exposure mapping.
Second, we propose a Hájek estimator for the total treatment effect and prove its consistency and asymp-
totic normality under mild assumptions on the network structure. We also propose a conservative variance
estimator that ensures valid inference by accounting for the complexity of the network.
Third, we present a model-agnostic covariate-adjusted estimator that is asymptotically no less efficient
than the Hájek-type estimator while maintaining valid inference.
Organization of the paper. The rest of the paper is organized as follows. Section 2 introduces the design-
based setup of the bipartite experiments. Section 3 discusses the estimation of the total treatment effect
under bipartite interference. Section 4 presents a covariate adjustment strategy for constructing point and
variance estimators and proves their asymptotic properties. Section 5 reports numerical experiments
to validate the proposed methods and theoretical results. Section 6 concludes the paper and outlines future
research directions. All proofs and technical details are in the Supplementary Material.
Notation. We will use the following notation. Let 1{·} denote the indicator function. Let plim denote
the probability limit, avar(·) and acov(·, ·) denote the asymptotic variance and covariance, respectively, and
≍ denote asymptotically the same order as the sample size increases to infinity. For any positive integer K,
denote [K] = {1, . . . , K} as the set of all positive integers smaller than or equal to K. Write bn = O(an ) if
bn /an is bounded and bn = o(an ) if bn /an converges to 0 as n → ∞. Write bn = Op (an ) if bn /an is bounded
in probability and bn = op (an ) if bn /an converges to 0 in probability.
2 Setup
Example 1 (Power plant). Zigler and Papadogeorgou (2021) study how installing a selective
noncatalytic reduction system (treatment) in upwind power plants (treatment units)
causally affects the hospitalization rates in the neighborhoods (outcome units). In this case, a
neighborhood can be affected by the treatments of multiple upwind power plants, while one power plant may
affect a set of neighborhoods. We will revisit this example in Section 5.2.
Example 2 (Facebook Group). Shi et al. (2024) consider a bipartite experiment where treatments are
randomized across Facebook Groups and outcomes are measured by user-level engagement. The outcome of
each user is affected by interventions on a set of groups they belong to, while treatment in each group affects
all users within that group.
Example 3 (Amazon market). Harshaw et al. (2023) simulate a bipartite experiment on the Amazon marketplace
to evaluate the impact of new pricing mechanisms (treatments) randomized across items (treatment
units) on the level of satisfaction (outcome) of the customers (outcome units). In this scenario, items with
new pricing mechanisms may influence the group of customers who view them, while each customer may
encounter a variety of items subject to different pricing strategies.
Assumption 1. The potential outcomes of unit i depend only on the treatment status of the groups to which
it belongs. Formally, Yi (z) = Yi (zSi ), where zSi denotes the subvector of z corresponding to groups in Si .
Here the potential outcome Yi (z) depends on the treatment vector zSi , whose dimension varies across
units, unlike the classic setting. We focus on the total treatment effect
\[
\tau = n^{-1} \sum_{i=1}^{n} \{ Y_i(1) - Y_i(0) \},
\]
which captures the difference in the average potential outcomes when all groups are treated versus when
all groups are in the control condition. It is a widely studied estimand in settings with interference such as bipartite spatial
experiments (Zigler and Papadogeorgou, 2021; Harshaw et al., 2023). Denoting $\mu_1 = n^{-1} \sum_{i=1}^{n} Y_i(1)$ and
$\mu_0 = n^{-1} \sum_{i=1}^{n} Y_i(0)$, we have $\tau = \mu_1 - \mu_0$.
Assumption 2 (Bernoulli randomization). Each group is randomly assigned to the treatment group with
probability p and to the control group with probability 1 − p, i.e., $Z_k \overset{\text{iid}}{\sim} \mathrm{Bern}(p)$.
Under Assumption 2, we have $E(T_i) = p^{|S_i|}$ and $E(C_i) = (1-p)^{|S_i|}$. A natural Horvitz-Thompson-type
estimator $n^{-1} \sum_{i=1}^{n} T_i Y_i / p^{|S_i|} - n^{-1} \sum_{i=1}^{n} C_i Y_i / (1-p)^{|S_i|}$ is unbiased for $\tau$. However, throughout this paper,
we focus on the following Hájek-type weighting estimator $\hat\tau = \hat\mu_1 - \hat\mu_0$, where
\[
\hat\mu_1 = n^{-1} \sum_{i=1}^{n} \frac{T_i Y_i}{p^{|S_i|}} \Big/ \, n^{-1} \sum_{i=1}^{n} \frac{T_i}{p^{|S_i|}},
\qquad
\hat\mu_0 = n^{-1} \sum_{i=1}^{n} \frac{C_i Y_i}{(1-p)^{|S_i|}} \Big/ \, n^{-1} \sum_{i=1}^{n} \frac{C_i}{(1-p)^{|S_i|}},
\]
because of its better finite-sample performance compared with the unbiased Horvitz-Thompson estimator.
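The Hájek-type estimator can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the membership matrix `A` (with `A[i, k] = 1` when unit i belongs to group k) and the function name `hajek_estimate` are hypothetical, and the code assumes that T_i indicates that all groups in S_i are treated and C_i that all of them are in control, which is the reading consistent with E(T_i) = p^{|S_i|} and E(C_i) = (1 − p)^{|S_i|}.

```python
import numpy as np

def hajek_estimate(A, z, y, p):
    """Hajek-type estimate of the total treatment effect (illustrative sketch).

    A : (n, m) 0/1 array; A[i, k] = 1 if outcome unit i belongs to group k
    z : (m,) 0/1 array of group-level treatment indicators
    y : (n,) array of observed outcomes
    p : treatment probability under Bernoulli randomization
    """
    A = np.asarray(A, dtype=int)
    z = np.asarray(z, dtype=int)
    y = np.asarray(y, dtype=float)
    size = A.sum(axis=1)                   # |S_i|
    n_treated = A @ z                      # number of treated groups in S_i
    T = (n_treated == size).astype(float)  # assumed: every group in S_i is treated
    C = (n_treated == 0).astype(float)     # assumed: every group in S_i is in control
    w1 = T / p ** size                     # inverse-probability weights, treated arm
    w0 = C / (1.0 - p) ** size             # inverse-probability weights, control arm
    mu1 = np.sum(w1 * y) / np.sum(w1)      # undefined if no unit is fully treated
    mu0 = np.sum(w0 * y) / np.sum(w0)      # undefined if no unit is fully in control
    return mu1 - mu0, mu1, mu0

# Toy example: 4 outcome units, 3 groups.
A = [[1, 0, 0],
     [1, 1, 0],
     [0, 1, 1],
     [0, 0, 1]]
z = [1, 1, 0]
y = [3.0, 5.0, 2.0, 1.0]
tau_hat, mu1_hat, mu0_hat = hajek_estimate(A, z, y, p=0.5)
```

In the toy example, units 1 and 2 are fully treated and unit 4 is fully in control, while unit 3 has mixed exposure and receives zero weight in both arms.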
3.2 Consistency of τ̂
In this subsection, we establish the consistency of τ̂ . We need the following regularity conditions.
Assumption 3 restricts the density of the bipartite graph as n increases. We require the maximum number
of groups each unit belongs to to be bounded by a constant, while allowing the maximum number of units each
group contains to increase with n, but at a slower rate.
We impose the boundedness of the potential outcomes and covariates in Assumption 4 to prove the
limit theorems. We can relax it to some moment conditions but keep the form of Assumption 4 to
simplify the presentation.
We have the following consistency result for τ̂ .
3.3 Asymptotic distribution
In this section, we establish the asymptotic normality of the point estimator τ̂ . For this purpose, we further
assume the following condition on the density of the bipartite graph.
Assumption 5 (Sparse bipartite graph). We say that groups j1 and j2 are connected if there exists at least one
unit belonging to both groups. Assume that for any group k, the total number of groups that are connected to k
is bounded by an absolute constant B:
\[
\sum_{j \in [m] \setminus \{k\}} 1\{\text{groups } j \text{ and } k \text{ are connected}\} \le B .
\]
Assumption 5 imposes some sparsity conditions on the degree of the group network. Assuming a bounded
degree simplifies the presentation of the theoretical results, though our theory allows B to grow in some
polynomial order of n. This sparsity condition can be justified in many bipartite experiments. For instance,
in Example 1, two power plants are defined as connected only if there is at least one neighborhood monitored
within a certain distance of both power plants. If two power plants are far away from each other, they are
not connected. Therefore, such a geographical network formation naturally keeps the network degrees
small. However, there are examples where this assumption is less likely to hold. In Example 3,
the items are connected in a dense pattern because each customer may encounter a wide range of items while
browsing the website, and the browsing lists from different customers can have many overlaps. We might
need different estimation strategies and theoretical tools to analyze bipartite experiments with such dense
network formations.
We introduce some additional notation for the potential outcomes. Let $Y(z) = (Y_1(z), \ldots, Y_n(z))$
denote the vector consisting of all potential outcomes under treatment assignment z, and
$\tilde Y(z) = Y(z) - n^{-1} \sum_{i=1}^{n} Y_i(z)$ denote the centered potential outcome vector. Moreover, for i, j = 1, . . . , n, define the
following matrices:
\[
(\Lambda_1)_{i,j} = p^{-|S_i \cap S_j|} - 1, \quad
(\Lambda_0)_{i,j} = (1-p)^{-|S_i \cap S_j|} - 1, \quad
(\Lambda_\tau)_{i,j} = 1\{S_i \cap S_j \neq \emptyset\}.
\]
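The three matrices depend on the graph only through the overlap counts |S_i ∩ S_j|, so they can be computed from a membership matrix in one matrix product. The following is a minimal sketch; the 0/1 membership matrix `A` and the function name `lambda_matrices` are hypothetical names introduced for illustration.

```python
import numpy as np

def lambda_matrices(A, p):
    """Return (Lambda_1, Lambda_0, Lambda_tau) for a 0/1 membership matrix A.

    A[i, k] = 1 if outcome unit i belongs to group k, so (A @ A.T)[i, j]
    counts the groups shared by units i and j, i.e. |S_i ∩ S_j|.
    """
    A = np.asarray(A, dtype=int)
    overlap = (A @ A.T).astype(float)         # |S_i ∩ S_j|
    lam1 = p ** (-overlap) - 1.0              # p^{-|S_i ∩ S_j|} - 1
    lam0 = (1.0 - p) ** (-overlap) - 1.0      # (1-p)^{-|S_i ∩ S_j|} - 1
    lam_tau = (overlap > 0).astype(float)     # 1{S_i ∩ S_j is non-empty}
    return lam1, lam0, lam_tau

# Toy example: two clusters, units 0 and 1 in the first, unit 2 in the second.
A = [[1, 0],
     [1, 0],
     [0, 1]]
p = 0.25
lam1, lam0, lam_tau = lambda_matrices(A, p)
```

When every unit belongs to exactly one group, as in this toy example, the entries of `lam1` and `lam0` are constant multiples of those of `lam_tau`, with factors (1 − p)/p and p/(1 − p).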
\[
\frac{v_n}{\sqrt{m} \cdot (\bar D / n)^2} \to \infty, \tag{1}
\]
Remark 3.1 (Non-degeneracy of vn ). In Theorem 3.2, we impose the non-degeneracy condition (1) to rule
out the cases where the asymptotic variance (2) is of too small an order. The rate is motivated by our central
limit theorem established in Theorem A.1 in the supplementary material. Naturally, it requires the potential
outcomes to have a non-degenerate covariance structure. In the classic Bernoulli randomized experiment
setting, D̄ = 1 and n = m, and (1) requires vn to have a larger order than $n^{-3/2}$. This is automatically
satisfied according to the standard results in the literature (e.g., Li and Ding, 2017), which typically give the
rate of $v_n = O(n^{-1})$ under mild assumptions on the potential outcomes.
As an additional sanity check, we justify the nondegeneracy condition in some random data examples.
Consider a network where D̄ and S̄ are both finite. Then m ≍ n. If the potential outcomes (Yi (1), Yi (0)) are
generated independently from a bivariate normal distribution $N(0_2, \sigma_i^2 I_2)$, then the quantities in vn have the
following orders:
\[
\begin{aligned}
n^{-2} \tilde Y(1)^t \Lambda_1 \tilde Y(1) &= n^{-2} \sum_{i=1}^{n} (p^{-|S_i|} - 1) \sigma_i^2 + o_p(n^{-1}), \\
n^{-2} \tilde Y(0)^t \Lambda_0 \tilde Y(0) &= n^{-2} \sum_{i=1}^{n} \{(1-p)^{-|S_i|} - 1\} \sigma_i^2 + o_p(n^{-1}), \\
n^{-2} \tilde Y(1)^t \Lambda_\tau \tilde Y(0) &= o_p(n^{-1}).
\end{aligned}
\]
Below we use classic Bernoulli randomization and cluster randomization as examples to illustrate the
variance formula in Theorem 3.2. Our Theorem 3.2 recovers the existing results.
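Example 4 (Classic Bernoulli randomized experiment). In classic Bernoulli randomization where the
randomization units are identical to the outcome units,
\[
|S_i \cap S_j| =
\begin{cases}
1, & \text{if } i = j, \\
0, & \text{if } i \neq j.
\end{cases}
\]
Consequently, $\Lambda_1$, $\Lambda_0$, and $\Lambda_\tau$ are all diagonal, and the asymptotic variance reduces to
\[
v_n = n^{-2} p(1-p) \sum_{i=1}^{n} \left\{ \frac{\tilde Y_i(1)}{p} + \frac{\tilde Y_i(0)}{1-p} \right\}^2,
\]
which recovers the classic result of Bernoulli randomization in Miratrix et al. (2012, Theorem 1).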
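Example 5 (Cluster randomization). In a cluster randomization setting with m clusters and the treatment
assignment $Z_k \overset{\text{iid}}{\sim} \mathrm{Bern}(p)$ for k = 1, . . . , m, we have
\[
|S_i \cap S_j| =
\begin{cases}
1, & \text{if } i, j \text{ belong to the same group}, \\
0, & \text{otherwise}.
\end{cases}
\]
If we order the units according to the cluster they belong to, then
\[
\Lambda_\tau =
\begin{pmatrix}
1_{n_1} & 0 & \cdots & 0 \\
0 & 1_{n_2} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1_{n_m}
\end{pmatrix},
\quad
\Lambda_1 = \frac{1-p}{p} \Lambda_\tau,
\quad
\Lambda_0 = \frac{p}{1-p} \Lambda_\tau,
\]
where $1_{n_k}$ is an $n_k \times n_k$ matrix with all entries equal to 1 and $n_k$ is the total number of units
in cluster k for k = 1, . . . , m. Therefore, the asymptotic variance in equation (2) reduces to
\[
v_n = n^{-2} p(1-p) \sum_{k=1}^{m} \left[ \sum_{i \in D_k} \left\{ \frac{\tilde Y_i(1)}{p} + \frac{\tilde Y_i(0)}{1-p} \right\} \right]^2 .
\]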
The variance formula in (3) involves double summations over units i and j, where the two parts inside
brackets are sample analogs of n−2 Ỹ (1)t Λ1 Ỹ (1) and n−2 Ỹ (0)t Λ0 Ỹ (0), respectively.
The following Theorem 3.3 shows that v̂ converges in probability and is conservative for the true asymptotic
variance avar(τ̂ ). Therefore, we can construct the Wald-type large-sample confidence interval
$[\hat\tau - q_{\alpha/2} \hat v^{1/2}, \ \hat\tau + q_{\alpha/2} \hat v^{1/2}]$, which ensures valid Type I error control in large samples, where $q_{\alpha/2}$ denotes the upper
α/2 quantile of a standard Gaussian distribution.
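For concreteness, the Wald-type interval can be computed as follows. This is a small sketch using only the Python standard library; the function name `wald_interval` and the numerical inputs are hypothetical and do not come from the paper.

```python
from statistics import NormalDist

def wald_interval(tau_hat, v_hat, alpha=0.05):
    """Wald-type interval tau_hat ± q_{alpha/2} * sqrt(v_hat)."""
    q = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # upper alpha/2 Gaussian quantile
    half_width = q * v_hat ** 0.5
    return tau_hat - half_width, tau_hat + half_width

# Hypothetical point estimate and conservative variance estimate.
lower, upper = wald_interval(tau_hat=1.0, v_hat=0.04)
```

Because the variance estimator is conservative, the resulting interval is expected to be somewhat wider than one based on the true asymptotic variance.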
Condition (4) quantifies the requirement to achieve consistent variance estimation, which is derived based
on the Cauchy–Schwarz inequality. It depends on both the values of the potential outcomes and the structure
of the networks. We revisit the two Examples 4 and 5 to provide more intuition of the equality condition in
special cases.
Continuation of Example 4 (Classic Bernoulli randomized experiment). In the classic Bernoulli
randomized experiment setting, condition (4) reduces to
\[
\sum_{i=1}^{n} \tilde Y_i(1) \tilde Y_i(0) = \left\{ \sum_{i=1}^{n} \tilde Y_i(1)^2 \sum_{i=1}^{n} \tilde Y_i(0)^2 \right\}^{1/2},
\]
which is equivalent to $\tilde Y_i(1) = \zeta_1 \tilde Y_i(0)$ for all i = 1, . . . , n, where $\zeta_1 > 0$ is a constant. A special
case that satisfies this condition is the constant treatment effect case with Yi (1) − Yi (0) = τ for all units
i = 1, . . . , n.
Continuation of Example 5 (Cluster randomization). In the cluster experiment setting, condition (4)
reduces to
\[
\sum_{k=1}^{m} \left\{ \sum_{i \in D_k} \tilde Y_i(1) \right\} \left\{ \sum_{i \in D_k} \tilde Y_i(0) \right\}
= \left[ \sum_{k=1}^{m} \left\{ \sum_{i \in D_k} \tilde Y_i(1) \right\}^2 \sum_{k=1}^{m} \left\{ \sum_{i \in D_k} \tilde Y_i(0) \right\}^2 \right]^{1/2},
\]
which is equivalent to $\sum_{i \in D_k} \tilde Y_i(1) = \zeta_2 \sum_{i \in D_k} \tilde Y_i(0)$ for all k = 1, . . . , m, where $\zeta_2 > 0$ is a constant.
A special case that satisfies this condition is when the cluster-specific average treatment effect on each cluster
is a constant, i.e., $|D_k|^{-1} \{ \sum_{i \in D_k} Y_i(1) - \sum_{i \in D_k} Y_i(0) \} = \tau$ for all clusters k = 1, . . . , m.
4 Covariate adjustment
4.1 Methodology
In this section, we propose a covariate adjustment strategy to improve efficiency in bipartite experiments.
Covariate adjustment is a classic topic in randomized experiments. For completely randomized experiments,
Fisher (1925) proposed to use the analysis of covariance (ANCOVA) to improve estimation efficiency.
Freedman (2008a,b) later reanalyzed the ANCOVA and found that it does not guarantee efficiency improvement
in completely randomized experiments. In response to this critique, Lin (2013) proposed another covariate
adjustment strategy that guarantees asymptotic efficiency gains, and the resulting estimator is usually called
Lin’s estimator in the literature. Ding (2024, Chapter 6) reviewed the intuition of Lin (2013) from different
points of view. We will generalize the idea of Lin (2013) to obtain a covariate adjustment strategy in
the current setting.
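Let $\tilde X_i$ denote the centered covariates, i.e., $\tilde X_i = X_i - n^{-1} \sum_{i=1}^{n} X_i$. Consider a class of linearly adjusted
estimators indexed by $(\beta_1, \beta_0)$:
\[
\hat\tau(\beta_1, \beta_0)
= n^{-1} \sum_{i=1}^{n} \frac{T_i (Y_i - \beta_1^t \tilde X_i)}{p^{|S_i|}} \Big/ \, n^{-1} \sum_{i=1}^{n} \frac{T_i}{p^{|S_i|}}
- n^{-1} \sum_{i=1}^{n} \frac{C_i (Y_i - \beta_0^t \tilde X_i)}{(1-p)^{|S_i|}} \Big/ \, n^{-1} \sum_{i=1}^{n} \frac{C_i}{(1-p)^{|S_i|}},
\]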
where we replace Yi in the naive estimator τ̂ with linearly adjusted residuals. Further denote by
$\tilde X = (\tilde X_1, \ldots, \tilde X_n)^t$ the centered covariate matrix including all n units. The covariate adjustment estimator
Proposition 4.1 (Consistency and asymptotic distribution of τ̂ (β1 , β0 )). Under Assumptions 1–4, for any
fixed (β1 , β0 ), τ̂ (β1 , β0 ) converges in probability to τ . If Assumption 5 also holds, then the variance of
τ̂ (β1 , β0 ) has the order var{τ̂ (β1 , β0 )}/vn (β1 , β0 ) = 1 + o(1), where
\[
\begin{aligned}
v_n(\beta_1, \beta_0) = n^{-2} \Big[ & \{\tilde Y(1) - \tilde X \beta_1\}^t \Lambda_1 \{\tilde Y(1) - \tilde X \beta_1\}
+ \{\tilde Y(0) - \tilde X \beta_0\}^t \Lambda_0 \{\tilde Y(0) - \tilde X \beta_0\} \\
& + 2 \{\tilde Y(1) - \tilde X \beta_1\}^t \Lambda_\tau \{\tilde Y(0) - \tilde X \beta_0\} \Big].
\end{aligned}
\tag{5}
\]
in distribution.
Proposition 4.1 states the analogous results to Theorem 3.2 on the asymptotic distribution of the class of
covariate-adjusted estimators. The results follow directly when we treat the linearly adjusted residuals of the
potential outcomes, Yi (1) − β1t X̃i and Yi (0) − β0t X̃i , as “pseudo potential outcomes” and apply Theorem 3.2.
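The "pseudo potential outcome" view translates directly into code: subtract the linear adjustment from the observed outcomes in each arm, then apply the same Hájek weighting. The sketch below assumes a hypothetical 0/1 membership matrix `A`, takes T_i and C_i to indicate that all groups in S_i are treated or all are in control, and uses fixed coefficient vectors chosen by the caller; none of these names come from the paper.

```python
import numpy as np

def adjusted_estimate(A, z, y, X, beta1, beta0, p):
    """Linearly adjusted Hajek estimate for fixed (beta1, beta0) (illustrative)."""
    A = np.asarray(A, dtype=int)
    z = np.asarray(z, dtype=int)
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                       # centered covariates
    size = A.sum(axis=1)                          # |S_i|
    hits = A @ z                                  # treated groups in S_i
    w1 = (hits == size) / p ** size               # T_i / p^{|S_i|}
    w0 = (hits == 0) / (1.0 - p) ** size          # C_i / (1-p)^{|S_i|}
    pseudo1 = y - Xc @ np.asarray(beta1, float)   # adjusted residuals, treated arm
    pseudo0 = y - Xc @ np.asarray(beta0, float)   # adjusted residuals, control arm
    mu1 = np.sum(w1 * pseudo1) / np.sum(w1)
    mu0 = np.sum(w0 * pseudo0) / np.sum(w0)
    return mu1 - mu0

# Toy example with one covariate.
A = [[1, 0, 0], [1, 1, 0], [0, 1, 1], [0, 0, 1]]
z = [1, 1, 0]
y = [3.0, 5.0, 2.0, 1.0]
X = [[0.0], [1.0], [2.0], [3.0]]
tau_unadjusted = adjusted_estimate(A, z, y, X, [0.0], [0.0], p=0.5)
tau_adjusted = adjusted_estimate(A, z, y, X, [1.0], [1.0], p=0.5)
```

Setting both coefficient vectors to zero recovers the unadjusted Hájek-type estimator, which is a convenient sanity check.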
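Similarly, we can construct a conservative variance estimator for τ̂ (β1 , β0 ). Denote the upper bound of
vn (β1 , β0 ) as
\[
v_{n,ub}(\beta_1, \beta_0) = \left( \left[ n^{-2} \{\tilde Y(1) - \tilde X \beta_1\}^t \Lambda_1 \{\tilde Y(1) - \tilde X \beta_1\} \right]^{1/2}
+ \left[ n^{-2} \{\tilde Y(0) - \tilde X \beta_0\}^t \Lambda_0 \{\tilde Y(0) - \tilde X \beta_0\} \right]^{1/2} \right)^2,
\]
and estimate it by
\[
\hat v_{n,ub}(\beta_1, \beta_0) = \left( \left[ n^{-2} \sum_{i,j} \frac{T_i T_j (Y_i - \hat\mu_1 - \beta_1^t \tilde X_i)(Y_j - \hat\mu_1 - \beta_1^t \tilde X_j)(\Lambda_1)_{i,j}}{p^{|S_i \cup S_j|}} \right]^{1/2}
+ \left[ n^{-2} \sum_{i,j} \frac{C_i C_j (Y_i - \hat\mu_0 - \beta_0^t \tilde X_i)(Y_j - \hat\mu_0 - \beta_0^t \tilde X_j)(\Lambda_0)_{i,j}}{(1-p)^{|S_i \cup S_j|}} \right]^{1/2} \right)^2 .
\]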
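A plug-in version of the upper bound can be sketched as below. Several things here are assumptions rather than statements from the text: the membership matrix `A` and the function name are hypothetical, T_i and C_i are taken to indicate "all groups in S_i treated" and "all in control", the treated-arm term is assumed to mirror the control-arm term with p in place of 1 − p, and negative finite-sample quadratic forms are truncated at zero before taking square roots (a safeguard added here, not part of the formula).

```python
import numpy as np

def conservative_variance(A, z, y, p, X=None, beta1=None, beta0=None):
    """Plug-in estimate of the variance upper bound (illustrative sketch)."""
    A = np.asarray(A, dtype=int)
    z = np.asarray(z, dtype=int)
    y = np.asarray(y, dtype=float)
    n = A.shape[0]
    size = A.sum(axis=1)                               # |S_i|
    overlap = (A @ A.T).astype(float)                  # |S_i ∩ S_j|
    union = size[:, None] + size[None, :] - overlap    # |S_i ∪ S_j|
    hits = A @ z
    T = (hits == size).astype(float)
    C = (hits == 0).astype(float)
    w1, w0 = T / p ** size, C / (1.0 - p) ** size
    mu1 = w1 @ y / w1.sum()                            # Hajek arm means
    mu0 = w0 @ y / w0.sum()
    r1, r0 = y - mu1, y - mu0                          # centered outcomes
    if X is not None:                                  # optional linear adjustment
        Xc = np.asarray(X, float) - np.mean(X, axis=0)
        r1 = r1 - Xc @ np.asarray(beta1, float)
        r0 = r0 - Xc @ np.asarray(beta0, float)
    lam1 = p ** (-overlap) - 1.0
    lam0 = (1.0 - p) ** (-overlap) - 1.0
    a1, a0 = T * r1, C * r0                            # zero out unobserved arms
    q1 = a1 @ (lam1 / p ** union) @ a1 / n ** 2
    q0 = a0 @ (lam0 / (1.0 - p) ** union) @ a0 / n ** 2
    q1, q0 = max(q1, 0.0), max(q0, 0.0)                # guard against negative values
    return (np.sqrt(q1) + np.sqrt(q0)) ** 2

# Toy check in the classic setting where each unit is its own group.
v_hat = conservative_variance(np.eye(4), [1, 0, 1, 0], [2.0, 1.0, 4.0, 3.0], p=0.5)
```

The double sums are written as quadratic forms in dense n × n matrices, which is convenient for small examples; a sparse representation would be needed for large graphs.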
To gain the best asymptotic efficiency by using covariate adjustment estimators, ideally, we want to
minimize the asymptotic variance in (5) over (β1 , β0 ). However, the third term in (5) is not estimable
because it depends on the joint distribution of the potential outcomes. Instead, the improvement in the
asymptotic variance of the covariate-adjusted estimator, τ̂ (β1 , β0 ), compared with that of the naive estimator
τ̂ , is estimable. Denote
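\[
L(\beta_1, \beta_0) = n^{-2}
\begin{pmatrix} \beta_1 \\ \beta_0 \end{pmatrix}^t
\begin{pmatrix}
\tilde X^t \Lambda_1 \tilde X & \tilde X^t \Lambda_\tau \tilde X \\
\tilde X^t \Lambda_\tau \tilde X & \tilde X^t \Lambda_0 \tilde X
\end{pmatrix}
\begin{pmatrix} \beta_1 \\ \beta_0 \end{pmatrix}
- 2 n^{-2}
\begin{pmatrix}
\tilde X^t \Lambda_1 \tilde Y(1) + \tilde X^t \Lambda_\tau \tilde Y(0) \\
\tilde X^t \Lambda_0 \tilde Y(0) + \tilde X^t \Lambda_\tau \tilde Y(1)
\end{pmatrix}^t
\begin{pmatrix} \beta_1 \\ \beta_0 \end{pmatrix}.
\]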
We can verify that L(β1 , β0 ) is the difference between the two asymptotic variances vn (β1 , β0 ) and vn (0, 0),
and thus measures the efficiency gain of τ̂ (β1 , β0 ) compared with τ̂ . We formalize it in Lemma 4.1.
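The verification is a direct expansion of (5). The sketch below uses only the symmetry of Λ1, Λ0, and Λτ, which follows from their entrywise definitions, and groups the result into the quadratic and linear parts of L(β1, β0).

```latex
\begin{aligned}
n^{2}\{v_n(\beta_1,\beta_0) - v_n(0,0)\}
={}& \beta_1^t \tilde X^t \Lambda_1 \tilde X \beta_1 - 2\beta_1^t \tilde X^t \Lambda_1 \tilde Y(1)
   + \beta_0^t \tilde X^t \Lambda_0 \tilde X \beta_0 - 2\beta_0^t \tilde X^t \Lambda_0 \tilde Y(0) \\
 & + 2\beta_1^t \tilde X^t \Lambda_\tau \tilde X \beta_0
   - 2\beta_1^t \tilde X^t \Lambda_\tau \tilde Y(0)
   - 2\beta_0^t \tilde X^t \Lambda_\tau \tilde Y(1) \\
={}& \underbrace{\beta_1^t \tilde X^t \Lambda_1 \tilde X \beta_1
   + 2\beta_1^t \tilde X^t \Lambda_\tau \tilde X \beta_0
   + \beta_0^t \tilde X^t \Lambda_0 \tilde X \beta_0}_{\text{quadratic form in } (\beta_1,\beta_0)} \\
 & - 2\beta_1^t\{\tilde X^t \Lambda_1 \tilde Y(1) + \tilde X^t \Lambda_\tau \tilde Y(0)\}
   - 2\beta_0^t\{\tilde X^t \Lambda_0 \tilde Y(0) + \tilde X^t \Lambda_\tau \tilde Y(1)\}
 = n^{2} L(\beta_1,\beta_0).
\end{aligned}
```

The terms involving only the potential outcomes cancel in the difference, which is why L(β1, β0) contains no cross term between the two potential outcome vectors.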
By construction, the improvement in asymptotic variance is guaranteed. We formalize this result in the
following proposition.
Proposition 4.2 (Improvement in asymptotic variance). The covariate adjustment estimator τ̂ (β̃1 , β̃0 ) has
an asymptotic variance no larger than that of τ̂ , i.e., vn (β̃1 , β̃0 ) ≤ vn .
We propose to estimate the vector which includes unobserved potential outcomes using the sample analog by
inverse probability weighting, similar to the trick used for variance estimation in Section 3.4. We construct
the following estimator of (β̃1 , β̃0 ),
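\[
\begin{pmatrix} \hat\beta_1 \\ \hat\beta_0 \end{pmatrix}
=
\begin{pmatrix}
\tilde X^t \Lambda_1 \tilde X & \tilde X^t \Lambda_\tau \tilde X \\
\tilde X^t \Lambda_\tau \tilde X & \tilde X^t \Lambda_0 \tilde X
\end{pmatrix}^{-1}
\begin{pmatrix}
\sum_{i,j} \dfrac{T_i T_j \tilde X_i (Y_j - \hat\mu_1)(\Lambda_1)_{i,j}}{p^{|S_i \cup S_j|}}
+ \sum_{i,j} \dfrac{C_i C_j \tilde X_i (Y_j - \hat\mu_0)(\Lambda_\tau)_{i,j}}{(1-p)^{|S_i \cup S_j|}} \\[2ex]
\sum_{i,j} \dfrac{T_i T_j \tilde X_i (Y_j - \hat\mu_1)(\Lambda_\tau)_{i,j}}{p^{|S_i \cup S_j|}}
+ \sum_{i,j} \dfrac{C_i C_j \tilde X_i (Y_j - \hat\mu_0)(\Lambda_0)_{i,j}}{(1-p)^{|S_i \cup S_j|}}
\end{pmatrix}.
\]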
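A direct translation of this estimator into code might look as follows. It is a sketch under stated assumptions: the membership matrix `A` and the function name `estimate_beta` are hypothetical, and T_i and C_i are taken to indicate that all groups in S_i are treated or all are in control. One deliberate deviation is that the code uses a pseudo-inverse in place of the matrix inverse, because the block matrix is singular in some special cases (for example, when every unit belongs to exactly one group); the two coincide whenever the inverse exists.

```python
import numpy as np

def estimate_beta(A, z, y, X, p):
    """Plug-in estimate of the adjustment coefficients (illustrative sketch)."""
    A = np.asarray(A, dtype=int)
    z = np.asarray(z, dtype=int)
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    d = X.shape[1]
    Xc = X - X.mean(axis=0)                            # centered covariates
    size = A.sum(axis=1)
    overlap = (A @ A.T).astype(float)                  # |S_i ∩ S_j|
    union = size[:, None] + size[None, :] - overlap    # |S_i ∪ S_j|
    hits = A @ z
    T = (hits == size).astype(float)
    C = (hits == 0).astype(float)
    w1, w0 = T / p ** size, C / (1.0 - p) ** size
    r1 = y - w1 @ y / w1.sum()                         # Y_j - mu1_hat
    r0 = y - w0 @ y / w0.sum()                         # Y_j - mu0_hat
    lam1 = p ** (-overlap) - 1.0
    lam0 = (1.0 - p) ** (-overlap) - 1.0
    lam_tau = (overlap > 0).astype(float)
    W1 = np.outer(T, T) / p ** union                   # T_i T_j / p^{|S_i ∪ S_j|}
    W0 = np.outer(C, C) / (1.0 - p) ** union           # C_i C_j / (1-p)^{|S_i ∪ S_j|}
    G = np.block([[Xc.T @ lam1 @ Xc, Xc.T @ lam_tau @ Xc],
                  [Xc.T @ lam_tau @ Xc, Xc.T @ lam0 @ Xc]])
    top = Xc.T @ (W1 * lam1) @ r1 + Xc.T @ (W0 * lam_tau) @ r0
    bottom = Xc.T @ (W1 * lam_tau) @ r1 + Xc.T @ (W0 * lam0) @ r0
    # Pseudo-inverse (assumption of this sketch): equals the inverse when G is invertible.
    beta = np.linalg.pinv(G, rcond=1e-10) @ np.concatenate([top, bottom])
    return beta[:d], beta[d:]

# Toy check in the classic setting (each unit is its own group), where G is singular.
b1, b0 = estimate_beta(np.eye(4), [1, 0, 1, 0], [2.0, 1.0, 4.0, 3.0],
                       [[0.0], [1.0], [2.0], [3.0]], p=0.5)
```

In the toy check the pseudo-inverse returns the minimum-norm solution, which assigns the same coefficient to both arms.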
We next establish the asymptotic properties of the covariate-adjusted estimator τ̂ (β̂1 , β̂0 ). To simplify the
discussion, we introduce the following assumption that imposes the limits for several population quantities.
Assumption 6 requires the weighted covariance matrices of potential outcomes and covariates to have
limiting values not depending on n as n → ∞. In the special case of completely randomized experiments
without interference, it reduces to the assumption in Theorem 5 in Li and Ding (2017).
The following theorem shows the consistency and asymptotic distribution of τ̂ (β̂1 , β̂0 ).
Theorem 4.1 (Consistency and asymptotic distribution of τ̂ (β̂1 , β̂0 )). Under Assumptions 1–4 and 6,
τ̂ (β̂1 , β̂0 ) converges in probability to τ . If Assumption 5 also holds, then
\[
\left[ \mathrm{avar}\{\hat\tau(\hat\beta_1, \hat\beta_0)\} \right]^{-1/2} \left\{ \hat\tau(\hat\beta_1, \hat\beta_0) - \tau \right\} \to N(0, 1)
\]
in distribution.
Combined with Proposition 4.2, Theorem 4.1 suggests the covariate-adjusted estimator can reduce the
asymptotic variance compared with the unadjusted estimator.
Now following our discussion in Section 4.1, we can use the variance estimator v̂n,ub (β̂1 , β̂0 ) by plugging
in the estimated coefficients. The following theorem establishes the convergence and conservativeness of this
variance estimator.
Theorem 4.2 (Conservative variance estimator for τ̂ adj ). Under Assumptions 1–6, the variance estimator
v̂n,ub (β̂1 , β̂0 ) is conservative: v̂n,ub (β̂1 , β̂0 )/vn,ub (β̃1 , β̃0 ) converges
in probability to 1, and vn,ub (β̃1 , β̃0 ) ≥ avar{τ̂ (β̃1 , β̃0 )}.
Theorem 4.2 proves the conservativeness of the variance estimator, which directly motivates a valid
confidence interval: $[\hat\tau(\hat\beta_1, \hat\beta_0) - q_{\alpha/2} \hat v_{n,ub}^{1/2}, \ \hat\tau(\hat\beta_1, \hat\beta_0) + q_{\alpha/2} \hat v_{n,ub}^{1/2}]$.
We consider three different data-generating regimes. For each regime, we generate covariates
$X_i = (X_{1i}, X_{2i}) \sim (U[0, 1])^2$ and the potential outcomes Yi (1) and Yi (0) from the conditional
distributions summarized in Table 1, with $\gamma = (1, 1)^t$ and $\alpha_i \sim U[0, 0.5]$. The treatment indicators are
$Z_k \overset{\text{iid}}{\sim} \mathrm{Bern}(p)$ with p = 0.5.
Table 2 reports the finite-sample performance of the two estimators τ̂ and τ̂ adj with n = 5000, m = 1500,
and S̄ = 5 based on 1000 Monte Carlo replications. In all three regimes, both estimators have small
finite-sample bias, and the proposed variance estimators are conservative, leading to valid inference with
coverage rates larger than 95%. Compared with the naive estimator τ̂ , the covariate-adjusted estimator has
a smaller standard error and higher power under all regimes. Although our theory guarantees efficiency
improvement only in asymptotic variance, in the numerical studies, we also observe smaller variance estimates
and thus shorter confidence intervals under all three regimes.
units and air pollution monitors as outcome units.
We construct our dataset using the power plant dataset from Papadogeorgou et al. (2019) and 2004 air
pollution data at the monitor level from the United States Environmental Protection Agency’s website.
Additionally, we incorporate population information for the counties where the monitors are located. The
initial dataset of outcome units includes 95,762 air quality monitors, and the treatment units correspond to
473 coal or natural gas-burning power plants operating in the continental U.S. during the summer of 2004.
To prepare the dataset, we remove outcome units with an “Arithmetic Mean” above the 90th percentile
among all observations, outcome units with an “Arithmetic Mean” around 0 (we choose
2 as the threshold in this application), and outcome units with a population size exceeding $10^6$. To address
computational constraints, we randomly select 10% of the remaining monitors. We calculate the distances
between monitors and power plants using their longitude and latitude coordinates. A bipartite graph is then
constructed by connecting monitors to power plants located within 15 km. Finally, we remove monitors and
power plants that are not connected to any other units, resulting in a dataset comprising 795 outcome units
and 228 treatment units. The maximum degree of outcome units is restricted to be 2.
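The graph construction step can be illustrated with a short sketch. The text does not say how distances were computed from the coordinates; the code below assumes great-circle (haversine) distances with a mean Earth radius of 6371 km, and the function names and toy coordinates are hypothetical.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def build_bipartite_graph(monitors, plants, radius_km=15.0):
    """Connect each monitor to every plant within radius_km; drop isolated units.

    monitors, plants : arrays of (latitude, longitude) rows in degrees.
    Returns the 0/1 membership matrix and boolean masks of the kept rows/columns.
    """
    monitors = np.asarray(monitors, dtype=float)
    plants = np.asarray(plants, dtype=float)
    dist = haversine_km(monitors[:, None, 0], monitors[:, None, 1],
                        plants[None, :, 0], plants[None, :, 1])
    A = (dist <= radius_km).astype(int)
    keep_monitors = A.sum(axis=1) > 0      # monitors linked to at least one plant
    keep_plants = A.sum(axis=0) > 0        # plants linked to at least one monitor
    return A[keep_monitors][:, keep_plants], keep_monitors, keep_plants

# Hypothetical coordinates: two nearby monitors, one remote monitor, one remote plant.
monitors = [(40.00, -75.0), (40.05, -75.0), (45.0, -100.0)]
plants = [(40.00, -75.1), (30.0, -90.0)]
A, keep_monitors, keep_plants = build_bipartite_graph(monitors, plants)
```

Dropping isolated units in a single pass is sufficient, because removing a unit with no edges does not change the degree of any remaining unit.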
We assume the potential outcomes to be Yi (1) = γ1t Xi + ε1 and Yi (0) = γ0t Xi + ε0 , where γ1 =
(2, −2, −2)t , γ0 = (1, −1, −1)t , and ε1 , ε0 ∼ U [0, 15]. This data generation process is designed to simulate
the distribution of the observed “Arithmetic Mean” in the pollution dataset. To standardize the covariates
and ensure numerical stability, we scale the population size of the county where monitor i is located by
dividing it by $10^6$ (X1i ), and the distance between monitor i and its closest power plant by dividing it by 30
(X2i ). The third covariate, X3i , represents the number of power plants connected to monitor i. We consider
1000 Monte Carlo replications. In each replication, treatment units are randomly assigned to treatment with
a probability of p = 0.5. When applying the covariate adjustment estimator, we include the scaled covariates
X1i , X2i , and X3i in the model. The true total treatment effect is −1.266.
Table 3 reports the simulation results based on the real-world bipartite graph. Both
the naive estimator and the covariate-adjusted estimator have small biases for estimating the true treatment
effect, and both strategies lead to valid yet slightly conservative confidence intervals. Nevertheless, by
applying the covariate adjustment strategy introduced in Section 4, we observe a reduction in both
the standard deviation of the point estimator and the estimated variance, which leads to a substantial improvement
in power.
Table 3: Simulation results based on real bipartite graph in the power plant application
Note: we report two point estimators (with and without covariate adjustment), their standard error se, standard error
estimator $\widehat{\mathrm{se}}$, the coverage rate of the 95% confidence interval constructed using the conservative variance estimator,
and their power.
6 Discussion
We propose a design-based causal inference framework for bipartite experiments. We generalize the classic
SUTVA to the bipartite experiment setting and provide point and variance estimators for estimating the
total treatment effect. These estimators are based on theoretical results that guarantee the consistency and
asymptotic normality of the point estimator and the conservativeness of the variance estimator. We also
propose covariate adjustment strategies that improve the efficiency of the point estimator. This framework
extends the design-based causal inference frameworks for completely randomized experiments and cluster
randomized experiments.
While this framework is useful for estimating causal effects in many general scenarios involving bipartite
experiments, there are several directions for further investigation. First, we focus on the total treatment
effect, which compares the all-treated and all-control regimes. There are more general causal parameters
of interest that we can explore. Second, we only discuss Bernoulli randomization,
leaving other more complex bipartite intervention strategies undeveloped. Third, we mainly focus on the
outcome unit-level covariates for the covariate adjustment strategy. When treatment unit-level covariates
are also available, as an ad-hoc strategy, we can incorporate them by using a summary at the outcome-unit
level, for instance, taking the average or sum of the covariate values for the groups that each unit is connected
to. However, a more rigorous and systematic way of incorporating treatment-unit level covariates is unclear.
We leave them for future research.
References
Aronow, P. M. and Samii, C. (2017). Estimating average causal effects under general interference, with
application to a social network experiment. Annals of Applied Statistics, 11(4):1912–1947.
Brennan, J., Mirrokni, V., and Pouget-Abadie, J. (2022). Cluster randomized designs for one-sided bipartite
experiments. Advances in Neural Information Processing Systems, 35:37962–37974.
Donner, A. (1998). Some aspects of the design and analysis of cluster randomization trials. Journal of the
Royal Statistical Society: Series C (Applied Statistics), 47(1):95–113.
Donner, A., Klar, N., and Klar, N. S. (2000). Design and analysis of cluster randomization trials in health
research, volume 27. Arnold Publishers: London.
Doudchenko, N., Zhang, M., Drynkin, E., Airoldi, E., Mirrokni, V., and Pouget-Abadie, J. (2020). Causal
inference with bipartite designs. arXiv preprint arXiv:2010.02108.
Fisher, R. A. (1925). Statistical Methods for Research Workers. Oliver and Boyd, Edinburgh, 1st edition.
Forastiere, L., Airoldi, E. M., and Mealli, F. (2021). Identification and estimation of treatment and in-
terference effects in observational studies on networks. Journal of the American Statistical Association,
116(534):901–918.
Freedman, D. A. (2008a). On regression adjustments in experiments with several treatments.
Hall, P. and Heyde, C. C. (2014). Martingale limit theory and its application. Academic Press.
Harshaw, C., Sävje, F., Eisenstat, D., Mirrokni, V., and Pouget-Abadie, J. (2023). Design and analysis of
bipartite experiments under a linear exposure-response model. Electronic Journal of Statistics, 17(1):464–
518.
Imbens, G. W. and Rubin, D. B. (2015). Causal inference in statistics, social, and biomedical sciences.
Cambridge University Press.
Li, X. and Ding, P. (2017). General forms of finite population central limit theorems with applications to
causal inference. Journal of the American Statistical Association, 112(520):1759–1769.
Lin, W. (2013). Agnostic notes on regression adjustments to experimental data: Reexamining Freedman’s
critique. The Annals of Applied Statistics, 7(1):295–318.
Lu, X., Li, H., and Liu, H. (2024). Estimation and inference of average treatment effects under heterogeneous
additive treatment effect model. arXiv preprint arXiv:2408.17205.
Miratrix, L. W., Sekhon, J. S., and Yu, B. (2012). Adjusting treatment effect estimates by post-stratification
in randomized experiments. Journal of the Royal Statistical Society Series B: Statistical Methodology,
75(2):369–396.
Papadogeorgou, G., Choirat, C., and Zigler, C. M. (2019). Adjusting for unmeasured spatial confounding
with distance adjusted propensity score matching. Biostatistics, 20(2):256–272.
Pouget-Abadie, J., Aydin, K., Schudy, W., Brodersen, K., and Mirrokni, V. (2019). Variance reduction in
bipartite experiments through correlation clustering. Advances in Neural Information Processing Systems,
32.
Shi, L., Bakhitov, E., Hung, K., Karrer, B., Walker, C., Bhole, M., and Schrijvers, O. (2024). Scalable
analysis of bipartite experiments. arXiv preprint arXiv:2402.11070.
Song, Z. and Papadogeorgou, G. (2024). Bipartite causal inference with interference, time series data, and
a random network. arXiv preprint arXiv:2404.04775.
Su, F. and Ding, P. (2021). Model-assisted analyses of cluster-randomized experiments. Journal of the Royal
Statistical Society Series B: Statistical Methodology, 83(5):994–1015.
Zigler, C. M. and Papadogeorgou, G. (2021). Bipartite causal inference with interference. Statistical Science,
36(1):109–123.
Supplementary Material
Section A provides a general theory of establishing central limit theorems for bipartite experiments under
Bernoulli randomization.
Section B provides proofs of all theorems in the main text.
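\[
\Gamma = \sum_{k_1} a_{k_1}\tilde{Z}_{k_1} + \sum_{k_1<k_2} a_{k_1k_2}\tilde{Z}_{k_1}\tilde{Z}_{k_2} + \cdots + \sum_{k_1<\cdots<k_{\bar{S}}} a_{k_1\ldots k_{\bar{S}}}\tilde{Z}_{k_1}\cdots\tilde{Z}_{k_{\bar{S}}}. \tag{A.1}
\]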
Here $\{a_{k_1\ldots k_s} : k_1,\ldots,k_s \in [m],\ k_1 \neq \cdots \neq k_s\}$ is an $s$-dimensional array that is symmetric in its indices,
i.e., its entries are invariant under any permutation of the indices.
As a convention, we use $(k_1\ldots k_s)$ to denote an unordered $s$-tuple with $k_1 \neq \cdots \neq k_s$. Moreover, the $\tilde{Z}_k$'s are
i.i.d. copies of a random variable $\tilde{Z}$ with mean zero, variance $\sigma^2$, and fourth moment bounded by $E(\tilde{Z}^4) \le \nu_4^4$.
Note that here we do not require $\tilde{Z}$ to be a centered Bernoulli variable.
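For the statistic in (A.1), we have $E(\Gamma) = 0$ and
\[
v_\Gamma = \operatorname{var}(\Gamma) = \sum_{k_1} a_{k_1}^2\sigma^2 + \sum_{k_1<k_2} a_{k_1k_2}^2\sigma^4 + \cdots + \sum_{k_1<\cdots<k_{\bar{S}}} a_{k_1\ldots k_{\bar{S}}}^2\sigma^{2\bar{S}}.
\]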
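The variance formula can be checked by brute force on a small instance. The sketch below is an illustration only: it assumes the $\tilde{Z}_k$ are centered Bernoulli($p$) draws (one admissible choice, not required by the text), uses 0-based indices, and stores hypothetical coefficients in a dictionary `coef` keyed by sorted index tuples. It enumerates all $2^m$ outcomes and compares the exact variance of $\Gamma$ with $\sum a^2\sigma^{2s}$.

```python
from itertools import combinations, product
from math import isclose


def gamma_moments_exact(coef, m, p):
    """Exact mean and variance of Gamma by enumerating all 2^m Bernoulli(p) draws."""
    mean = 0.0
    second = 0.0
    for z in product([0, 1], repeat=m):
        prob = 1.0
        for zk in z:
            prob *= p if zk else 1 - p
        zt = [zk - p for zk in z]  # centered draws, mean zero
        g = 0.0
        for idx, a in coef.items():
            term = a
            for k in idx:
                term *= zt[k]
            g += term
        mean += prob * g
        second += prob * g * g
    return mean, second - mean ** 2


def gamma_variance_formula(coef, p):
    """Sum of a^2 * sigma^(2s) over all index tuples, with sigma^2 = p(1 - p)."""
    sigma2 = p * (1 - p)
    return sum(a * a * sigma2 ** len(idx) for idx, a in coef.items())


m, p = 5, 0.3
coef = {}
for s in (1, 2, 3):  # polynomial of degree 3 in the centered draws
    for j, idx in enumerate(combinations(range(m), s)):
        coef[idx] = ((-1) ** j) * (0.5 + 0.1 * j)  # arbitrary illustrative values

mean, v_exact = gamma_moments_exact(coef, m, p)
v_formula = gamma_variance_formula(coef, p)
print(mean, v_exact, v_formula)
```

The two variances agree because products of independent mean-zero variables over distinct index sets are uncorrelated, which is exactly why no cross terms appear in $v_\Gamma$.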
\[
\sum 1\{|a_{kk_1\ldots k_s}| \neq 0\} \le B;
\]
We will use the martingale central limit theorem in Hall and Heyde (2014) to prove Theorem A.1. For
completeness of our proof, we first restate the martingale central limit theorem as Proposition A.1.
Proposition A.1 (Theorem 3.2 of Hall and Heyde (2014)). Let $\{S_{ni}, \mathcal{F}_{ni}, 1 \le i \le k_n, n \ge 1\}$ be a zero-mean,
square-integrable martingale array with differences $\Delta_{ni}$, and let $\eta^2$ be an almost surely finite random
variable. Suppose the following conditions hold:
1. Squared sum convergence:
\[
\sum_i E(\Delta_{ni}^2 \mid \mathcal{F}_{n,i-1}) \to \eta^2 \tag{A.2}
\]
in probability,
2. Lindeberg condition: for every $\varepsilon > 0$,
\[
\sum_i E\{\Delta_{ni}^2 1(|\Delta_{ni}| > \varepsilon) \mid \mathcal{F}_{n,i-1}\} \to 0 \tag{A.3}
\]
in probability,
and the σ-fields are nested: $\mathcal{F}_{n,i} \subseteq \mathcal{F}_{n+1,i}$, for $1 \le i \le k_n$, $n \ge 1$. Then $S_{nk_n} = \sum_i \Delta_{ni}$ converges in
distribution (stably) to a random variable with characteristic function $E\{\exp(-\eta^2t^2/2)\}$.
In particular, a sufficient condition for the Lindeberg condition (A.3) is given by the following Lyapunov
condition:
\[
\text{For some } \delta > 0, \quad \sum_{i=1}^{n} E\{|\Delta_{ni}|^{2+\delta}\} \to 0. \tag{A.4}
\]
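The step from the Lyapunov condition to the Lindeberg condition is a one-line Markov-type bound. The display below is a sketch added for the reader. It assumes the Lindeberg condition takes the usual conditional form, $\sum_i E\{\Delta_{ni}^2 1(|\Delta_{ni}| > \varepsilon) \mid \mathcal{F}_{n,i-1}\} \to 0$ in probability for every $\varepsilon > 0$; the symbol $\varepsilon$ is introduced here for illustration. On the event $|\Delta_{ni}| > \varepsilon$ we have $1 \le (|\Delta_{ni}|/\varepsilon)^{\delta}$, so

\[
E\Big[\sum_i E\{\Delta_{ni}^2 1(|\Delta_{ni}| > \varepsilon) \mid \mathcal{F}_{n,i-1}\}\Big]
= \sum_i E\{\Delta_{ni}^2 1(|\Delta_{ni}| > \varepsilon)\}
\le \varepsilon^{-\delta} \sum_i E\{|\Delta_{ni}|^{2+\delta}\} \to 0.
\]

The quantity inside the outer expectation is nonnegative, so it converges to zero in $L^1$ and hence in probability.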
Proof of Theorem A.1. We prove Theorem A.1 following three steps. We first construct a martingale dif-
ference sequence based on Γ. Next, we check the convergence of the summation of the conditional squared
differences in equation (A.2). Finally, we check the Lyapunov condition in equation (A.4).
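Step 1. Construct a martingale difference sequence based on Γ. Let $\mathcal{F}_{m,k}$ be the σ-algebra
generated by $\tilde{Z}_1, \ldots, \tilde{Z}_k$, i.e., $\mathcal{F}_{m,k} = \sigma\{\tilde{Z}_1, \ldots, \tilde{Z}_k\}$. For ease of notation, for any $k_1, \ldots, k_\ell \in [m]$, we
denote $\tilde{Z}_{k_1\ldots k_\ell} = \tilde{Z}_{k_1}\cdots\tilde{Z}_{k_\ell}$. Let
\[
\Delta_{mk} = v_\Gamma^{-1/2} \sum_{s=1}^{\bar{S}\wedge k}\ \sum_{(k_1\ldots k_{s-1})\subset[k-1]} a_{k_1\ldots k_{s-1}k}\tilde{Z}_{k_1\ldots k_{s-1}k},
\]
so that each term of Γ is assigned to its largest index and
\[
v_\Gamma^{-1/2}\,\Gamma = \sum_{k=1}^{m} \Delta_{mk}.
\]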
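The construction in Step 1 groups the terms of Γ by their largest index. A minimal sketch of this grouping follows, under assumptions made only for illustration: centered Bernoulli($p$) draws, 0-based indices, hypothetical coefficients in `coef`, and the common scaling $v_\Gamma^{-1/2}$ omitted since it does not affect either check. The code verifies that the (unscaled) differences sum to Γ and that each difference has conditional mean zero given the earlier draws.

```python
from itertools import combinations, product


def term_value(idx, a, zt):
    """Value of a * prod_{k in idx} zt[k]."""
    value = a
    for k in idx:
        value *= zt[k]
    return value


def gamma(coef, zt):
    return sum(term_value(idx, a, zt) for idx, a in coef.items())


def differences(coef, zt):
    """Unscaled difference k collects every term whose largest index is k."""
    diffs = [0.0] * len(zt)
    for idx, a in coef.items():
        diffs[max(idx)] += term_value(idx, a, zt)
    return diffs


m, p = 4, 0.4
index_sets = [c for s in (1, 2, 3) for c in combinations(range(m), s)]
coef = {idx: 1.0 + 0.25 * j for j, idx in enumerate(index_sets)}  # illustrative

max_sum_gap = 0.0    # largest |sum_k diff_k - Gamma| over all outcomes
max_cond_mean = 0.0  # largest |E(diff_k | earlier draws)| over all outcomes
for z in product([0, 1], repeat=m):
    zt = [zk - p for zk in z]
    d = differences(coef, zt)
    max_sum_gap = max(max_sum_gap, abs(sum(d) - gamma(coef, zt)))
    for k in range(m):
        # Average difference k over the two values of draw k, earlier draws fixed.
        treated = zt[:k] + [1 - p] + zt[k + 1:]
        control = zt[:k] + [0 - p] + zt[k + 1:]
        cond = p * differences(coef, treated)[k] + (1 - p) * differences(coef, control)[k]
        max_cond_mean = max(max_cond_mean, abs(cond))

print(max_sum_gap, max_cond_mean)
```

Difference $k$ depends only on draws $1,\ldots,k$ and is linear in draw $k$, which is why averaging over that draw alone returns zero.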
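Step 2. Check the convergence of the summation of the conditional squared differences in
equation (A.2). We show equation (A.2) by computing the variance of its LHS,
\begin{align*}
&\operatorname{var}\Big\{\sum_k E(\Delta_{mk}^2 \mid \mathcal{F}_{m,k-1})\Big\} \\
&= \frac{\sigma^4}{v_\Gamma^2}\operatorname{var}\Big\{\sum_k \sum_{s,r}^{\bar{S}\wedge k}\ \sum_{\substack{(k_1\ldots k_s)\subset[k-1]\\ (k_1'\ldots k_r')\subset[k-1]}} a_{k_1\ldots k_sk}\,a_{k_1'\ldots k_r'k}\,\tilde{Z}_{k_1\ldots k_s}\tilde{Z}_{k_1'\ldots k_r'}\Big\} \\
&= \frac{\sigma^4}{v_\Gamma^2}\sum_{k,\ell}\ \sum_{s,r}^{\bar{S}\wedge k}\ \sum_{t,u}^{\bar{S}\wedge\ell}\ \sum_{\substack{(k_1\ldots k_s)\subset[k-1]\\ (k_1'\ldots k_r')\subset[k-1]\\ (\ell_1\ldots\ell_t)\subset[\ell-1]\\ (\ell_1'\ldots\ell_u')\subset[\ell-1]}} a_{k_1\ldots k_sk}\,a_{k_1'\ldots k_r'k}\,a_{\ell_1\ldots\ell_t\ell}\,a_{\ell_1'\ldots\ell_u'\ell}\,\operatorname{cov}\big(\tilde{Z}_{k_1\ldots k_s}\tilde{Z}_{k_1'\ldots k_r'},\ \tilde{Z}_{\ell_1\ldots\ell_t}\tilde{Z}_{\ell_1'\ldots\ell_u'}\big). \tag{A.5}
\end{align*}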
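Note that $\operatorname{cov}(\tilde{Z}_{k_1\ldots k_s}\tilde{Z}_{k_1'\ldots k_r'},\ \tilde{Z}_{\ell_1\ldots\ell_t}\tilde{Z}_{\ell_1'\ldots\ell_u'}) \neq 0$ only if
\[
\le \frac{\sigma^4\,\nu_4^{4(\bar{S}-1)}\,\bar{a}_m^4\,(B^4\bar{S}^6)\,m}{v_\Gamma^2} = o(1),
\]
\[
\sum_k E(\Delta_{mk}^2 \mid \mathcal{F}_{m,k-1}) \to 1
\]
in probability.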
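Step 3. Check the Lyapunov condition in equation (A.4). We have
\begin{align*}
\sum_{k=1}^{m} E(\Delta_{mk}^4)
&= \frac{1}{v_\Gamma^2}\sum_{s,r,t,u}^{\bar{S}\wedge k}\ \sum_{\substack{(k_1\ldots k_s)\subset[k-1]\\ (k_1'\ldots k_r')\subset[k-1]\\ (k_1''\ldots k_t'')\subset[k-1]\\ (k_1'''\ldots k_u''')\subset[k-1]}} a_{k_1\ldots k_sk}\,a_{k_1'\ldots k_r'k}\,a_{k_1''\ldots k_t''k}\,a_{k_1'''\ldots k_u'''k}\,E\big(\tilde{Z}_{k_1\ldots k_s}\tilde{Z}_{k_1'\ldots k_r'}\tilde{Z}_{k_1''\ldots k_t''}\tilde{Z}_{k_1'''\ldots k_u'''}\big) \\
&\le \frac{\nu_4^{4\bar{S}}}{v_\Gamma^2}\sum_{s,r,t,u}^{\bar{S}\wedge k}\ \sum_{\substack{(k_1\ldots k_s)\subset[k-1]\\ (k_1'\ldots k_r')\subset[k-1]\\ (k_1''\ldots k_t'')\subset[k-1]\\ (k_1'''\ldots k_u''')\subset[k-1]}} |a_{k_1\ldots k_sk}|\,|a_{k_1'\ldots k_r'k}|\,|a_{k_1''\ldots k_t''k}|\,|a_{k_1'''\ldots k_u'''k}| \\
&\le \frac{\nu_4^{4\bar{S}}\,\bar{a}^4}{v_\Gamma^2}\sum_{\substack{G_1,G_2,G_3,G_4\subset[\bar{S}]:\\ G_1\cap G_2\cap G_3\cap G_4\neq\emptyset}} 1\{|a_{G_1}|\neq 0\}\,1\{|a_{G_2}|\neq 0\}\,1\{|a_{G_3}|\neq 0\}\,1\{|a_{G_4}|\neq 0\}
\end{align*}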
Combining results in Steps 1–3 and Proposition A.1, we prove the results in Theorem A.1.
B Proofs
B.1 Lemmas
We first introduce two lemmas in order to simplify the proofs for the main theorems across the paper.
Lemma B.1. For any two arrays $\{a_i\}_{i=1}^n$ and $\{b_i\}_{i=1}^n$, we have
\[
n^{-2}\sum_{i,j} a_ib_j\,(p^{-|S_i\cap S_j|} - 1) \le n^{-1}(p^{-\bar{S}} - 1)\,(\max_i a_i)(\max_i b_i)\,\bar{S}\bar{D}.
\]
Proof of Lemma B.1. $p^{-|S_i\cap S_j|} - 1$ is nonzero if and only if $S_i\cap S_j \neq \emptyset$. For each unit $i$, the number of
groups $i$ belongs to is no larger than $\bar{S}$, and there are at most $\bar{D}$ units in each group. Therefore, for each $i$,
\[
\Big|\sum_j b_j\,(p^{-|S_i\cap S_j|} - 1)\Big| \le (p^{-\bar{S}} - 1)(\max_i b_i)\,\bar{S}\bar{D},
\]
thus
\[
\Big|\sum_{i,j} a_ib_j\,(p^{-|S_i\cap S_j|} - 1)\Big| \le n\,(p^{-\bar{S}} - 1)(\max_i a_i)(\max_i b_i)\,\bar{S}\bar{D}.
\]
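The counting argument can be illustrated numerically. The sketch below is not part of the proof: it builds a hypothetical random bipartite graph, represents each $S_i$ as a set of group labels, takes nonnegative arrays `a` and `b` (an assumption made here so that maxima bound every entry), and computes both sides of the inequality in Lemma B.1, with $\bar{S}$ and $\bar{D}$ taken as the realized maxima.

```python
import random


def lemma_b1_sides(S, a, b, p):
    """Return (lhs, rhs) of the Lemma B.1 inequality for unit-to-group sets S."""
    n = len(S)
    s_bar = max(len(s) for s in S)  # max number of groups per unit
    members = {}
    for i, s in enumerate(S):
        for k in s:
            members.setdefault(k, []).append(i)
    d_bar = max(len(units) for units in members.values())  # max units per group
    lhs = sum(
        a[i] * b[j] * (p ** -len(S[i] & S[j]) - 1)
        for i in range(n)
        for j in range(n)
    ) / n ** 2
    rhs = (p ** -s_bar - 1) * max(a) * max(b) * s_bar * d_bar / n
    return lhs, rhs


rng = random.Random(0)
n, m, p = 40, 12, 0.5
# Each outcome unit is connected to between one and three treatment groups.
S = [frozenset(rng.sample(range(m), rng.randint(1, 3))) for _ in range(n)]
a = [rng.random() for _ in range(n)]
b = [rng.random() for _ in range(n)]

lhs, rhs = lemma_b1_sides(S, a, b, p)
print(lhs, rhs)
```

Pairs with disjoint group sets contribute exactly zero, so only the at most $\bar{S}\bar{D}$ neighbours of each unit enter the double sum.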
Similar results also hold for the quantities defined by the $C_i$'s.
Proof of Lemma B.2. By the fact that $E(T_iT_j) = p^{|S_i\cup S_j|}$, we have
\[
E\Big\{n^{-2}\sum_{i,j}\frac{T_iT_j\,a_ib_j\,(p^{-|S_i\cap S_j|} - 1)}{p^{|S_i\cup S_j|}}\Big\} = n^{-2}\sum_{i,j} a_ib_j\,(p^{-|S_i\cap S_j|} - 1).
\]
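The moment identity used here can be confirmed by enumeration. The following sketch assumes, for illustration, that $T_i$ is the indicator that every group connected to unit $i$ is treated and that groups are treated independently with probability $p$; the sets `S_i` and `S_j` are hypothetical.

```python
from itertools import product
from math import isclose


def expected_TiTj(S_i, S_j, m, p):
    """Exact E(T_i T_j) under independent Bernoulli(p) assignment of m groups."""
    total = 0.0
    for z in product([0, 1], repeat=m):
        prob = 1.0
        for zk in z:
            prob *= p if zk else 1 - p
        t_i = all(z[k] for k in S_i)  # unit i fully exposed to treatment
        t_j = all(z[k] for k in S_j)
        total += prob * (t_i and t_j)
    return total


m, p = 5, 0.3
S_i, S_j = {0, 1}, {1, 2, 3}
exact = expected_TiTj(S_i, S_j, m, p)
closed_form = p ** len(S_i | S_j)
print(exact, closed_form)
```

Both units are fully treated exactly when every group in $S_i\cup S_j$ is treated, which has probability $p^{|S_i\cup S_j|}$; dividing by this factor is what makes the weighted sum unbiased.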
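Without loss of generality, assume that $(S_i\cup S_j)\cap S_u \neq \emptyset$. We have
\begin{align*}
&\sum_{i,j,u,v} |\operatorname{cov}(T_iT_j, T_uT_v)\,a_ib_ja_ub_v\,(\Lambda_1)_{i,j}(\Lambda_1)_{u,v}| \\
&\le \sum_{i,j} |a_ib_j|(\Lambda_1)_{i,j}\sum_{u,v} |\operatorname{cov}(T_iT_j, T_uT_v)\,a_ub_v\,(\Lambda_1)_{u,v}| \\
&\le \sum_{i,j} |a_ib_j|(\Lambda_1)_{i,j}\sum_{u} |a_u|\sum_{v} |\operatorname{cov}(T_iT_j, T_uT_v)\,b_v\,(\Lambda_1)_{u,v}| \\
&\le \sum_{i,j} |a_ib_j|(\Lambda_1)_{i,j}\sum_{u} |a_u|\,(p^{-\bar{S}} - 1)(\max_v b_v)\,\bar{S}\bar{D}\,1\{(S_i\cup S_j)\cap S_u \neq \emptyset\} \\
&\le \sum_{i,j} |a_ib_j|(\Lambda_1)_{i,j}\,(p^{-\bar{S}} - 1)(\max_u a_u)(\max_v b_v)\,\bar{S}^2\bar{D}^2
\end{align*}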
where the third inequality follows from $(S_i\cup S_j)\cap S_u \neq \emptyset$ and an argument similar to the proof of Lemma B.1
showing that, for each $u$, $|\sum_v b_v(\Lambda_1)_{u,v}| \le (p^{-\bar{S}} - 1)(\max_v b_v)\bar{S}\bar{D}$; the fourth inequality follows from the fact that $u$
has to be connected to either $i$ or $j$, so the total number of $u$ such that $1\{(S_i\cup S_j)\cap S_u \neq \emptyset\}$ is nonzero is
no larger than $\bar{S}\bar{D}$; and the last equality follows from Lemma B.1. Plugging this back into equation (B.8) gives
the second inequality in Lemma B.2.
where the second-to-last equality follows from the fact that $\sum_{i,j}\{Y_i(1) - \mu_1\}\{Y_j(1) - \mu_1\} = 0$, and the
last inequality follows from Lemma B.1. Therefore, the numerator of $\hat{\mu}_1 - \mu_1$ converges in probability to 0
by Chebyshev’s inequality. Similarly, the denominator of $\hat{\mu}_1 - \mu_1$ has mean 1 and variance converging
to 0. This proves that $\hat{\mu}_1$ converges in probability to $\mu_1$. Analogously, $\hat{\mu}_0$ converges
in probability to $\mu_0$, which concludes the proof of Theorem 3.1.
By symmetry, we have
\[
\operatorname{avar}(\hat{\mu}_0) = n^{-2}\sum_{i,j}\{Y_i(0) - \mu_0\}\{Y_j(0) - \mu_0\}(\Lambda_0)_{i,j}.
\]
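\begin{align*}
&\quad + 2n^{-2}\sum_{i,j}\{Y_i(1) - \mu_1\}\{Y_j(0) - \mu_0\}(\Lambda_\tau)_{i,j} \\
&= n^{-2}\big\{\tilde{Y}(1)^t\Lambda_1\tilde{Y}(1) + \tilde{Y}(0)^t\Lambda_0\tilde{Y}(0) + 2\tilde{Y}(1)^t\Lambda_\tau\tilde{Y}(0)\big\}.
\end{align*}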
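\begin{align*}
& n^{-1}\sum_{i=1}^{n}\frac{1\{\sum_{k=1}^{m} W_{ik}(1 - Z_{ik}) = 0\}\{Y_i(1) - \mu_1\}}{p^{|S_i|}} \\
&= \sum_{k_1}\frac{Z_{k_1}}{np}\sum_{i=1}^{n} 1\{|S_i| = 1\}W_{ik_1}\{Y_i(1) - \mu_1\} \\
&\quad + \sum_{k_1<k_2}\frac{Z_{k_1}Z_{k_2}}{np^2}\sum_{i=1}^{n} 1\{|S_i| = 2\}W_{ik_1}W_{ik_2}\{Y_i(1) - \mu_1\} \\
&\quad + \cdots \\
&\quad + \sum_{k_1<\cdots<k_{\bar{S}}}\frac{Z_{k_1}\cdots Z_{k_{\bar{S}}}}{np^{\bar{S}}}\sum_{i=1}^{n} 1\{|S_i| = \bar{S}\}W_{ik_1}\cdots W_{ik_{\bar{S}}}\{Y_i(1) - \mu_1\} \\
&= \sum_{k_1}\frac{\tilde{Z}_{k_1} + p}{np}\sum_{i=1}^{n} 1\{|S_i| = 1\}W_{ik_1}\{Y_i(1) - \mu_1\} \\
&\quad + \sum_{k_1<k_2}\frac{(\tilde{Z}_{k_1} + p)(\tilde{Z}_{k_2} + p)}{np^2}\sum_{i=1}^{n} 1\{|S_i| = 2\}W_{ik_1}W_{ik_2}\{Y_i(1) - \mu_1\} \\
&\quad + \cdots \\
&\quad + \sum_{k_1<\cdots<k_{\bar{S}}}\frac{(\tilde{Z}_{k_1} + p)\cdots(\tilde{Z}_{k_{\bar{S}}} + p)}{np^{\bar{S}}}\sum_{i=1}^{n} 1\{|S_i| = \bar{S}\}W_{ik_1}\cdots W_{ik_{\bar{S}}}\{Y_i(1) - \mu_1\}. \tag{B.10}
\end{align*}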
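\begin{align*}
&\sum_{k_1<\cdots<k_s}\frac{(\tilde{Z}_{k_1} + p)\cdots(\tilde{Z}_{k_s} + p)}{np^s}\sum_{i=1}^{n} 1\{|S_i| = s\}W_{ik_1}\cdots W_{ik_s}\{Y_i(1) - \mu_1\} \\
&= \frac{1}{s!}\sum_{k_1\neq\cdots\neq k_s}\frac{(\tilde{Z}_{k_1} + p)\cdots(\tilde{Z}_{k_s} + p)}{np^s}\sum_{i=1}^{n} 1\{|S_i| = s\}W_{ik_1}\cdots W_{ik_s}\{Y_i(1) - \mu_1\} \\
&= \frac{1}{s!}\binom{s}{1}\sum_{k_1}\frac{\tilde{Z}_{k_1}}{np}\sum_{i=1}^{n} 1\{|S_i| = s\}W_{ik_1}\{Y_i(1) - \mu_1\}\sum_{\substack{k_2\neq\cdots\neq k_s,\\ k_u\neq k_1,\ \forall 1<u\le s}} W_{ik_2}\cdots W_{ik_s} \\
&\quad + \cdots \\
&\quad + \frac{1}{s!}\binom{s}{\ell}\sum_{k_1\neq\cdots\neq k_\ell}\frac{\tilde{Z}_{k_1}\cdots\tilde{Z}_{k_\ell}}{np^\ell}\sum_{i=1}^{n} 1\{|S_i| = s\}W_{ik_1}\cdots W_{ik_\ell}\{Y_i(1) - \mu_1\}\sum_{\substack{k_{\ell+1}\neq\cdots\neq k_s,\\ k_u\neq k_1,\ldots,k_\ell,\ \forall \ell<u\le s}} W_{ik_{\ell+1}}\cdots W_{ik_s} \\
&\quad + \cdots \\
&\quad + \frac{1}{s!}\binom{s}{s-1}\sum_{k_1\neq\cdots\neq k_{s-1}}\frac{\tilde{Z}_{k_1}\cdots\tilde{Z}_{k_{s-1}}}{np^{s-1}}\sum_{i=1}^{n} 1\{|S_i| = s\}W_{ik_1}\cdots W_{ik_{s-1}}\{Y_i(1) - \mu_1\}\sum_{\substack{k_s,\\ k_s\neq k_1,\ldots,k_{s-1}}} W_{ik_s} \\
&\quad + \frac{1}{s!}\sum_{k_1\neq\cdots\neq k_s}\frac{\tilde{Z}_{k_1}\cdots\tilde{Z}_{k_s}}{np^s}\sum_{i=1}^{n} 1\{|S_i| = s\}W_{ik_1}\cdots W_{ik_s}\{Y_i(1) - \mu_1\}. \tag{B.11}
\end{align*}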
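For the ℓ-th term, we have
\begin{align*}
&\frac{1}{s!}\binom{s}{\ell}\sum_{k_1\neq\cdots\neq k_\ell}\frac{\tilde{Z}_{k_1}\cdots\tilde{Z}_{k_\ell}}{np^\ell}\sum_{i=1}^{n} 1\{|S_i| = s\}W_{ik_1}\cdots W_{ik_\ell}\{Y_i(1) - \mu_1\}\sum_{\substack{k_{\ell+1}\neq\cdots\neq k_s,\\ k_u\neq k_1,\ldots,k_\ell,\ \forall \ell<u\le s}} W_{ik_{\ell+1}}\cdots W_{ik_s} \\
&= \frac{1}{s!}\binom{s}{\ell}\sum_{k_1\neq\cdots\neq k_\ell}\frac{\tilde{Z}_{k_1}\cdots\tilde{Z}_{k_\ell}}{np^\ell}\sum_{i=1}^{n} 1\{|S_i| = s\}W_{ik_1}\cdots W_{ik_\ell}\{Y_i(1) - \mu_1\}\,(s - \ell)! \\
&= \frac{1}{s!}\binom{s}{\ell}(s - \ell)!\,\ell!\sum_{k_1<\cdots<k_\ell}\frac{\tilde{Z}_{k_1}\cdots\tilde{Z}_{k_\ell}}{np^\ell}\sum_{i=1}^{n} 1\{|S_i| = s\}W_{ik_1}\cdots W_{ik_\ell}\{Y_i(1) - \mu_1\} \\
&= \sum_{k_1<\cdots<k_\ell}\frac{\tilde{Z}_{k_1}\cdots\tilde{Z}_{k_\ell}}{np^\ell}\sum_{i=1}^{n} 1\{|S_i| = s\}W_{ik_1}\cdots W_{ik_\ell}\{Y_i(1) - \mu_1\}.
\end{align*}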
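The two counting facts behind this simplification can be checked directly. The sketch below uses a hypothetical incidence row `W_i` (0-based group labels) for a unit with $|S_i| = s$. It verifies that, once $\ell$ connected groups are fixed, the inner sum over the remaining ordered distinct groups equals $(s-\ell)!$, and that the combinatorial prefactor $\binom{s}{\ell}(s-\ell)!\,\ell!/s!$ equals one for every $\ell$.

```python
from itertools import permutations
from math import comb, factorial


def inner_sum(W_i, prefix, s):
    """Sum of W_i products over ordered distinct groups avoiding the prefix."""
    others = [k for k in range(len(W_i)) if k not in prefix]
    total = 0
    for tup in permutations(others, s - len(prefix)):
        prod = 1
        for k in tup:
            prod *= W_i[k]  # zero unless every group in tup is connected to i
        total += prod
    return total


W_i = [1, 0, 1, 1, 0, 1, 0]  # unit connected to groups 0, 2, 3, 5
s = sum(W_i)                 # |S_i| = 4
prefix = (0, 3)              # ell = 2 connected groups already fixed

completions = inner_sum(W_i, prefix, s)
prefactors = [
    comb(s, l) * factorial(s - l) * factorial(l) // factorial(s)
    for l in range(1, s + 1)
]
print(completions, prefactors)
```

The factor $\ell!$ comes from replacing the sum over ordered distinct tuples $k_1\neq\cdots\neq k_\ell$ by the sum over $k_1<\cdots<k_\ell$, which is valid because the summand is symmetric in its indices.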
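\begin{align*}
&\sum_{k_1<\cdots<k_s}\frac{(\tilde{Z}_{k_1} + p)\cdots(\tilde{Z}_{k_s} + p)}{np^s}\sum_{i=1}^{n} 1\{|S_i| = s\}W_{ik_1}\cdots W_{ik_s}\{Y_i(1) - \mu_1\} \\
&= \sum_{\ell=1}^{s}\ \sum_{k_1<\cdots<k_\ell}\frac{\tilde{Z}_{k_1}\cdots\tilde{Z}_{k_\ell}}{np^\ell}\sum_{i=1}^{n} 1\{|S_i| = s\}W_{ik_1}\cdots W_{ik_\ell}\{Y_i(1) - \mu_1\},
\end{align*}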
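\begin{align*}
&\sum_{k_1}\frac{\tilde{Z}_{k_1}}{np}\sum_{i=1}^{n} W_{ik_1}\{Y_i(1) - \mu_1\}\sum_{s=1}^{\bar{S}} 1\{|S_i| = s\} \\
&\quad + \sum_{k_1<k_2}\frac{\tilde{Z}_{k_1}\tilde{Z}_{k_2}}{np^2}\sum_{i=1}^{n} W_{ik_1}W_{ik_2}\{Y_i(1) - \mu_1\}\sum_{s=2}^{\bar{S}} 1\{|S_i| = s\} \\
&\quad + \cdots \\
&\quad + \sum_{k_1<\cdots<k_{\bar{S}}}\frac{\tilde{Z}_{k_1}\cdots\tilde{Z}_{k_{\bar{S}}}}{np^{\bar{S}}}\sum_{i=1}^{n} W_{ik_1}\cdots W_{ik_{\bar{S}}}\{Y_i(1) - \mu_1\}\sum_{s=\bar{S}}^{\bar{S}} 1\{|S_i| = s\}.
\end{align*}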
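The expansion above rests on a unit-level product identity: writing $Z_k = \tilde{Z}_k + p$, the inverse-probability-weighted exposure indicator $T_i/p^{|S_i|} = \prod_{k\in S_i}(1 + \tilde{Z}_k/p)$ expands into a sum over subsets $G\subseteq S_i$ of $\prod_{k\in G}\tilde{Z}_k/p^{|G|}$. The sketch below checks this for every assignment on a hypothetical unit; the function names and 0-based labels are illustrative. The empty subset contributes the constant 1, which presumably drops out of the numerator after centering if the $Y_i(1) - \mu_1$ sum to zero over units.

```python
from itertools import combinations, product
from math import isclose


def weighted_indicator(S_i, z, p):
    """T_i / p^{|S_i|}, where T_i = 1 if all groups in S_i are treated."""
    return float(all(z[k] for k in S_i)) / p ** len(S_i)


def subset_expansion(S_i, z, p):
    """Sum over subsets G of S_i of prod_{k in G} (z_k - p) / p."""
    members = sorted(S_i)
    total = 0.0
    for size in range(len(members) + 1):
        for G in combinations(members, size):
            term = 1.0
            for k in G:
                term *= (z[k] - p) / p
            total += term
    return total


m, p = 4, 0.3
S_i = {0, 2, 3}
max_gap = max(
    abs(weighted_indicator(S_i, z, p) - subset_expansion(S_i, z, p))
    for z in product([0, 1], repeat=m)
)
print(max_gap)
```

Collecting the subset terms by their size $\ell$ and summing over units gives the degree-$\ell$ coefficients of the polynomial in the centered assignments.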
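\begin{align*}
& n^{-1}\sum_{i=1}^{n}\frac{1\{\sum_{k=1}^{m} W_{ik}Z_{ik} = 0\}\{Y_i(0) - \mu_0\}}{(1 - p)^{|S_i|}} \\
&= -\sum_{k_1}\frac{\tilde{Z}_{k_1}}{n(1 - p)}\sum_{i=1}^{n} W_{ik_1}\{Y_i(0) - \mu_0\}\sum_{s=1}^{\bar{S}} 1\{|S_i| = s\} \\
&\quad + (-1)^2\sum_{k_1<k_2}\frac{\tilde{Z}_{k_1}\tilde{Z}_{k_2}}{n(1 - p)^2}\sum_{i=1}^{n} W_{ik_1}W_{ik_2}\{Y_i(0) - \mu_0\}\sum_{s=2}^{\bar{S}} 1\{|S_i| = s\} \\
&\quad + \cdots \\
&\quad + (-1)^{\bar{S}}\sum_{k_1<\cdots<k_{\bar{S}}}\frac{\tilde{Z}_{k_1}\cdots\tilde{Z}_{k_{\bar{S}}}}{n(1 - p)^{\bar{S}}}\sum_{i=1}^{n} W_{ik_1}\cdots W_{ik_{\bar{S}}}\{Y_i(0) - \mu_0\}\sum_{s=\bar{S}}^{\bar{S}} 1\{|S_i| = s\}.
\end{align*}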
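Define
\begin{align*}
a_{1,k_1\cdots k_\ell} &= \frac{1}{np^\ell}\sum_{i=1}^{n} W_{ik_1}\cdots W_{ik_\ell}\{Y_i(1) - \mu_1\}\sum_{s=\ell}^{\bar{S}} 1\{|S_i| = s\}, \\
a_{0,k_1\cdots k_\ell} &= \frac{(-1)^\ell}{n(1 - p)^\ell}\sum_{i=1}^{n} W_{ik_1}\cdots W_{ik_\ell}\{Y_i(0) - \mu_0\}\sum_{s=\ell}^{\bar{S}} 1\{|S_i| = s\}.
\end{align*}
\[
\sum_{\ell=1}^{\bar{S}}\ \sum_{k_1<\cdots<k_\ell} a_{z,k_1\cdots k_\ell}\tilde{Z}_{k_1}\cdots\tilde{Z}_{k_\ell}
\]
for $z = 1, 0$.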
Step 2. We now consider any linear combination of the numerators of $\hat{\mu}_1 - \mu_1$ and $\hat{\mu}_0 - \mu_0$. We show it can
be reformulated in the form of Γ defined in equation (A.1). Consider any $c_1$ and $c_0$ satisfying $c_1^2 + c_0^2 = 1$.
Define
\[
n^{-1}\sum_{i=1}^{n}\Big[\frac{c_1T_i\{Y_i(1) - \mu_1\}}{p^{|S_i|}} + \frac{c_0C_i\{Y_i(0) - \mu_0\}}{(1 - p)^{|S_i|}}\Big] = \sum_{\ell=1}^{\bar{S}}\ \sum_{k_1<\cdots<k_\ell} a_{k_1\cdots k_\ell}\tilde{Z}_{k_1}\cdots\tilde{Z}_{k_\ell}. \tag{B.12}
\]
We will apply Theorem A.1 to establish a central limit theorem for (B.12). We check the two conditions
required in Theorem A.1.
We first show the boundedness of the $a$'s. Note that
\[
a_{k_1\cdots k_\ell} = \sum_{i=1}^{n} W_{ik_1}\cdots W_{ik_\ell}\Big[\frac{c_1\{Y_i(1) - \mu_1\}}{np^\ell} + \frac{(-1)^\ell c_0\{Y_i(0) - \mu_0\}}{n(1 - p)^\ell}\Big]\sum_{s=\ell}^{\bar{S}} 1\{|S_i| = s\}.
\]
The summand indexed by $i$ is nonzero only if unit $i$ belongs to groups $k_1, \ldots, k_\ell$. By Assumption 3, for each
$k_1, \ldots, k_\ell$, we have at most $\bar{D}$ such units. Hence we obtain
where the last inequality holds because, by Assumption 5, there are at most $B$ groups connected to group $k$,
thus the number of combinations $(k_1, \ldots, k_s)$ such that all of them are connected to $k$ is upper bounded by
$\binom{B}{s} \le B^s$.
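Therefore, by Step 2, we conclude that the numerators of $\hat{\mu}_1 - \mu_1$ and $\hat{\mu}_0 - \mu_0$ converge jointly to a
bivariate standard normal distribution, after standardization via
\[
\begin{pmatrix}
n^{-2}\tilde{Y}(1)^t\Lambda_1\tilde{Y}(1) & n^{-2}\tilde{Y}(1)^t\Lambda_\tau\tilde{Y}(0) \\
n^{-2}\tilde{Y}(1)^t\Lambda_\tau\tilde{Y}(0) & n^{-2}\tilde{Y}(0)^t\Lambda_0\tilde{Y}(0)
\end{pmatrix}.
\]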
Moreover, the denominators of µ̂1 and µ̂0 are converging in probability to 1, thus the asymptotic distri-
bution in Theorem 3.2 holds by Slutsky’s Theorem.
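then we have $\hat{v} = (\hat{v}_1^{1/2} + \hat{v}_0^{1/2})^2$. We prove the convergence of $\hat{v}/\operatorname{plim}(\hat{v})$ by showing that $\hat{v}_1/\operatorname{plim}(\hat{v}_1) =
1 + o_p(1)$ and $\hat{v}_0/\operatorname{plim}(\hat{v}_0) = 1 + o_p(1)$, where
Rewrite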
and use $T_1, T_2, T_3$ to denote the three terms in (B.13)–(B.15), respectively. By the fact that $E(T_iT_j) =
p^{|S_i\cup S_j|}$, we have $E(T_1) = \operatorname{plim}(\hat{v}_1)$. The variance of $T_1$ satisfies
\[
\operatorname{var}(T_1) \le n^{-3}p^{-4\bar{S}}\big[\max_i\{Y_i(1) - \mu_1\}^4\big]\bar{S}^3\bar{D}^3(p^{-\bar{S}} - 1)^2 = O_p(n^{-3}\bar{D}^3)
\]
by taking $a_i = Y_i(1) - \mu_1$ and $b_i = 1$. By the proof of Theorem 3.2, we have $\hat{\mu}_1 - \mu_1 = O_p(n^{-1/2}\bar{D}^{1/2})$, and
$T_2 = E(T_2) + O_p\{\operatorname{var}(T_2)^{1/2}\}$ gives us
\[
T_2 = O_p(n^{-1/2}\bar{D}^{1/2})\cdot O_p(n^{-1}\bar{D}) + O_p[\{n^{-1}\bar{D}\cdot n^{-3}\bar{D}^3\}^{1/2}] = o_p(n^{-1}\bar{D}).
\]
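Also, we have
\begin{align*}
E\Big\{n^{-2}\sum_{i,j}\frac{T_iT_j\,(p^{-|S_i\cap S_j|} - 1)}{p^{|S_i\cup S_j|}}\Big\} &= O_p(n^{-1}\bar{D}), \\
\operatorname{var}\Big\{n^{-2}\sum_{i,j}\frac{T_iT_j\,(p^{-|S_i\cap S_j|} - 1)}{p^{|S_i\cup S_j|}}\Big\} &= O_p(n^{-3}\bar{D}^3)
\end{align*}
\[
T_3 = O_p(n^{-1}\bar{D})\cdot O_p(n^{-1}\bar{D}) + O_p[\{n^{-2}\bar{D}^2\cdot n^{-3}\bar{D}^3\}^{1/2}] = o_p(n^{-1}\bar{D}).
\]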
Under the regularity condition that the weighted covariance matrix of the potential outcomes $Y_i(1)$ and
$Y_i(0)$ is non-degenerate, $\operatorname{plim}(\hat{v}_1) = O_p(n^{-1}\bar{D})$, thus $\hat{v}_1/\operatorname{plim}(\hat{v}_1) = 1 + o_p(1)$. Analogously, $\hat{v}_0/\operatorname{plim}(\hat{v}_0) =
1 + o_p(1)$. By the continuous mapping theorem, $\hat{v}/\operatorname{plim}(\hat{v})$ converges in probability to 1.
Next, we prove that $\operatorname{plim}(\hat{v}) \ge \operatorname{avar}(\hat{\tau})$. Recall that $\operatorname{plim}(\hat{v}) = \{\operatorname{plim}(\hat{v}_1)^{1/2} + \operatorname{plim}(\hat{v}_0)^{1/2}\}^2$; by the
Cauchy-Schwarz inequality, we have
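\begin{align*}
n^2\{v_n(\beta_1, \beta_0) - v_n(0, 0)\}
&= \{\tilde{Y}(1) - \tilde{X}\beta_1\}^t\Lambda_1\{\tilde{Y}(1) - \tilde{X}\beta_1\} + \{\tilde{Y}(0) - \tilde{X}\beta_0\}^t\Lambda_0\{\tilde{Y}(0) - \tilde{X}\beta_0\} \\
&\quad + 2\{\tilde{Y}(1) - \tilde{X}\beta_1\}^t\Lambda_\tau\{\tilde{Y}(0) - \tilde{X}\beta_0\} - v_n(0, 0) \\
&= \tilde{Y}(1)^t\Lambda_1\tilde{Y}(1) + \tilde{Y}(0)^t\Lambda_0\tilde{Y}(0) + 2\tilde{Y}(1)^t\Lambda_\tau\tilde{Y}(0) \\
&\quad + \beta_1^t\tilde{X}^t\Lambda_1\tilde{X}\beta_1 + \beta_0^t\tilde{X}^t\Lambda_0\tilde{X}\beta_0 + 2\beta_1^t\tilde{X}^t\Lambda_\tau\tilde{X}\beta_0 \\
&\quad - \beta_1^t\tilde{X}^t\{\Lambda_1\tilde{Y}(1) + \Lambda_\tau\tilde{Y}(0)\} + \beta_0^t\tilde{X}^t\{\Lambda_0\tilde{Y}(0) + \Lambda_\tau\tilde{Y}(1)\} \\
&\quad - \beta_1^t\tilde{X}^t\Lambda_1\tilde{X}\beta_1 + \beta_0^t\tilde{X}^t\Lambda_0\tilde{X}\beta_0 + 2\beta_1^t\tilde{X}^t\Lambda_\tau\tilde{X}\beta_0 \\
&= n^2L(\beta_1, \beta_0).
\end{align*}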
B.8 Proof of Theorem 4.1
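Convergence of the regression coefficients. Define the population limit counterpart for the closed-form
solution $(\tilde{\beta}_1, \tilde{\beta}_0)$:
\[
\begin{pmatrix}\beta_1^\star \\ \beta_0^\star\end{pmatrix} = \Omega_{xx}^{-1}\Omega_{yx}, \tag{B.16}
\]
where
\[
\Omega_{xx} = \begin{pmatrix}\Omega_{xx,11} & \Omega_{xx,10} \\ \Omega_{xx,01} & \Omega_{xx,00}\end{pmatrix}, \qquad
\Omega_{yx} = \begin{pmatrix}\Omega_{yx,11} + \Omega_{yx,01} \\ \Omega_{yx,00} + \Omega_{yx,10}\end{pmatrix}. \tag{B.17}
\]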
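By Assumption 6, we have
\[
\begin{pmatrix}\hat{\beta}_1 \\ \hat{\beta}_0\end{pmatrix} =
\begin{pmatrix}\tilde{X}^t\Lambda_1\tilde{X} & \tilde{X}^t\Lambda_\tau\tilde{X} \\ \tilde{X}^t\Lambda_\tau\tilde{X} & \tilde{X}^t\Lambda_0\tilde{X}\end{pmatrix} \to \Omega_{xx}.
\]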
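By similar arguments as in Theorem 3.3, under Assumption 6, the following holds asymptotically in probability:
\[
\begin{pmatrix}
\displaystyle\sum_{i,j}\frac{T_iT_j\tilde{X}_i(Y_j - \hat{\mu}_1)(\Lambda_1)_{i,j}}{p^{|S_i\cup S_j|}} + \sum_{i,j}\frac{C_iC_j\tilde{X}_i(Y_j - \hat{\mu}_0)(\Lambda_\tau)_{i,j}}{(1 - p)^{|S_i\cup S_j|}} \\[2ex]
\displaystyle\sum_{i,j}\frac{T_iT_j\tilde{X}_i(Y_j - \hat{\mu}_1)(\Lambda_\tau)_{i,j}}{p^{|S_i\cup S_j|}} + \sum_{i,j}\frac{C_iC_j\tilde{X}_i(Y_j - \hat{\mu}_0)(\Lambda_0)_{i,j}}{(1 - p)^{|S_i\cup S_j|}}
\end{pmatrix}
\to
\begin{pmatrix}\Omega_{yx,11} + \Omega_{yx,01} \\ \Omega_{yx,00} + \Omega_{yx,10}\end{pmatrix}.
\]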
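Consistency and asymptotic distribution of $\hat{\tau}(\hat{\beta}_1, \hat{\beta}_0)$. The difference between $\hat{\tau}(\hat{\beta}_1, \hat{\beta}_0)$ and $\hat{\tau}(\beta_1^\star, \beta_0^\star)$
is
\[
\hat{\tau}(\hat{\beta}_1, \hat{\beta}_0) - \hat{\tau}(\beta_1^\star, \beta_0^\star)
= n^{-1}\sum_{i=1}^{n}\frac{T_i(\hat{\beta}_1 - \beta_1^\star)^t\tilde{X}_i}{p^{|S_i|}}\Big/\, n^{-1}\sum_{i=1}^{n}\frac{T_i}{p^{|S_i|}}
\;-\; n^{-1}\sum_{i=1}^{n}\frac{C_i(\hat{\beta}_0 - \beta_0^\star)^t\tilde{X}_i}{(1 - p)^{|S_i|}}\Big/\, n^{-1}\sum_{i=1}^{n}\frac{C_i}{(1 - p)^{|S_i|}}.
\]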
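Following similar arguments as in the proof of Theorem 3.2, we can conclude that
\[
|\hat{\tau}(\hat{\beta}_1, \hat{\beta}_0) - \hat{\tau}(\beta_1^\star, \beta_0^\star)| = o_p\big(n^{-1/2}\bar{D}^{1/2}\big).
\]
By Proposition 4.1, $\hat{\tau}(\beta_1^\star, \beta_0^\star)$ converges in probability to $\tau$. Hence $\hat{\tau}(\hat{\beta}_1, \hat{\beta}_0)$ is also consistent for $\tau$.
The asymptotic distribution of $\hat{\tau}(\hat{\beta}_1, \hat{\beta}_0)$ follows from Slutsky’s Theorem and the fact that
\[
\{v_n(\beta_1^\star, \beta_0^\star)\}^{-1/2}\big\{\hat{\tau}(\hat{\beta}_1, \hat{\beta}_0) - \tau\big\} \to N(0, 1)
\]
in distribution.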
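Meanwhile, due to the fact that $v_1^\star(\beta_1) \asymp n^{-1}\bar{D}$, we have
and similarly
Therefore,
\[
\big\{\hat{v}_{1,n}(\hat{\beta}_1)\big\}^{1/2} + \big\{\hat{v}_{0,n}(\hat{\beta}_0)\big\}^{1/2} = \{v_1^\star(\beta_1^\star)\}^{1/2} + \{v_0^\star(\beta_0^\star)\}^{1/2} + O_p(n^{-3/4}\bar{D}^{-3/4}),
\]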
and thus
by the fact that $v_1^\star(\beta_1^\star) \asymp n^{-1}\bar{D}$ and $v_0^\star(\beta_0^\star) \asymp n^{-1}\bar{D}$. Therefore, $\hat{v}_{n,ub}(\hat{\beta}_1, \hat{\beta}_0)/v_{n,ub}(\beta_1^\star, \beta_0^\star)$ converges in
probability to 1.
The conservativeness of $v_{ub}^\star(\beta_1^\star, \beta_0^\star)$ for the true variance $v^\star(\beta_1^\star, \beta_0^\star)$ can be established similarly to
Theorem 3.3 when no covariates are adjusted. The trick is to take $Y(1) - \tilde{X}\beta_1^\star$ and $Y(0) - \tilde{X}\beta_0^\star$ as pseudo
potential outcomes and apply the Cauchy-Schwarz inequality for the covariance. Details are omitted.
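The Cauchy-Schwarz step can be sketched in generic notation. The symbols $A$ and $B$ below are placeholders introduced for illustration; they stand for the two arm-specific terms whose difference is being analyzed and do not appear in the original argument. For any two random variables with finite variances,

\[
\operatorname{var}(A - B) = \operatorname{var}(A) + \operatorname{var}(B) - 2\operatorname{cov}(A, B)
\le \operatorname{var}(A) + \operatorname{var}(B) + 2\{\operatorname{var}(A)\operatorname{var}(B)\}^{1/2}
= \big\{\operatorname{var}(A)^{1/2} + \operatorname{var}(B)^{1/2}\big\}^2,
\]

since $|\operatorname{cov}(A, B)| \le \{\operatorname{var}(A)\operatorname{var}(B)\}^{1/2}$. Applied with the pseudo potential outcomes in place of the original ones, this would give an upper bound of the same squared-sum form as the conservative variance, which only requires estimating the two arm-specific variances and not the cross term.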