Design-based causal inference in bipartite experiments


Sizhu Lu† Lei Shi‡ Yue Fang§ Wenxin Zhang¶ Peng Ding‖

January 20, 2025


arXiv:2501.09844v1 [stat.ME] 16 Jan 2025

Abstract

Bipartite experiments are widely used across various fields, yet existing methods often rely on strong
assumptions about modeling the potential outcomes and exposure mapping. In this paper, we explore
design-based causal inference in bipartite experiments, where treatments are randomized over one set of
units, while outcomes are measured over a separate set of units. We first formulate the causal inference
problem under a design-based framework that generalizes the classic assumption to account for bipartite
interference. We then propose point and variance estimators for the total treatment effect, establish a
central limit theorem for the estimator, and propose a conservative variance estimator. Additionally, we
discuss a covariate adjustment strategy to enhance estimation efficiency.

Key Words: bipartite interference; finite population; covariate adjustment; randomization inference

1 Introduction
Bipartite experiments have gained increasing recognition for their utility in various fields, including digital ex-
perimentation (Harshaw et al., 2023; Shi et al., 2024) and environmental science (Zigler and Papadogeorgou,
2021). In bipartite experiments, the treatments are randomized over one set of units, called treatment units,
while the outcomes are measured over a separate set of units, called the outcome units. This is different from
the classic experiment settings where we randomly assign units to different treatment groups and measure
their outcome of interest after the initiation of treatment. As illustrated in Figure 1, the treatment units and
outcome units are connected through a known fixed bipartite graph, and causal dependencies are represented
by the bipartite network, leading to what is known as bipartite interference (Zigler and Papadogeorgou,
2021).
There has been extensive research on bipartite experiments. A well-known special case is cluster ran-
domization, which has been studied in theory and adopted in practice (Donner, 1998; Donner et al., 2000;
∗ The first two authors contributed equally to this work.
† Department of Statistics, University of California, Berkeley, CA 94720, USA. Email: sizhu [email protected]
‡ Division of Biostatistics, University of California, Berkeley, CA 94720, USA. Email: [email protected]
§ School of Management and Economics, The Chinese University of Hong Kong, Shenzhen, 518172 China. Email:
[email protected]
¶ Division of Biostatistics, University of California, Berkeley, CA 94720, USA. Email: wenxin [email protected]
‖ Department of Statistics, University of California, Berkeley, CA 94720, USA. Email: [email protected]

[Figure: bipartite graph with treatment units on one side and outcome units on the other]

Figure 1: Illustration of a bipartite experiment with n = 4 and m = 5

Su and Ding, 2021). The cluster experiment setup corresponds to a situation where each outcome unit is con-
nected to exactly one treatment unit. More recent works have focused on the general bipartite network, where
each outcome unit may be connected to multiple treatment units, and vice versa. Zigler and Papadogeorgou
(2021) formulated a set of causal estimands in bipartite experiments and proposed an inverse probability-
weighted estimator for observational studies. Doudchenko et al. (2020) leveraged the generalized propen-
sity score to obtain unbiased estimates of causal effects. Harshaw et al. (2023) explored the estimation
and inference of the average total treatment effect under a linear exposure-response model. Shi et al.
(2024) extended Harshaw et al. (2023) by studying covariate adjustment under a double linear model.
Song and Papadogeorgou (2024) examined bipartite experiments in time series and random network settings
using exposure mapping and matching estimators for observational studies. Several works have also addressed
the design of experiments in the presence of bipartite interference. For example, Pouget-Abadie et al. (2019),
Harshaw et al. (2023), and Brennan et al. (2022) investigated methods for constructing better experimental
designs in such contexts.
In this work, we conduct a design-based analysis for bipartite experiments, where we define causal
parameters based on fixed potential outcomes and derive properties of estimators for causal effects under
the randomness of treatment assignment (Imbens and Rubin, 2015; Ding, 2024). Many existing approaches
(e.g., Harshaw et al., 2023; Doudchenko et al., 2020; Lu et al., 2024) have touched on such a perspective for
bipartite experiments. However, these works typically rely on strong modeling assumptions on the potential
outcomes. For example, Harshaw et al. (2023) adopted a linear exposure mapping (Aronow and Samii, 2017;
Forastiere et al., 2021) with a linear outcome model. Doudchenko et al. (2020) used exposure mapping for
estimation along with a linear model for variance estimation. Lu et al. (2024) applied a heterogeneous
additive effect model on the potential outcomes. Estimation and inference using these model-based methods
depend heavily on correct model specifications, and a more flexible design-based framework that requires
fewer assumptions has not yet been rigorously discussed in the context of bipartite experiments. There
are several challenges to adopting such a framework in this context. For instance, how can we establish
central limit theorems and construct valid variance estimators with strong dependencies across outcome
units? Furthermore, if covariate information is available, how should we perform covariate adjustment in a
model-agnostic fashion without relying on parametric assumptions on the outcomes and covariates, following
the spirit of Lin (2013)?

Our contributions. First, we formulate the causal inference problem in bipartite experiments under the
design-based framework. We generalize the stable unit treatment value assumption (SUTVA) to account
for bipartite interference. This generalization is tailored to the bipartite network structure, enabling the
identification of the total treatment effect as a function of observed data. Unlike model-based frameworks,
our approach avoids the strong assumptions on modeling the potential outcomes or exposure mapping.
Second, we propose a Hájek estimator for the total treatment effect and prove its consistency and asymp-
totic normality under mild assumptions on the network structure. We also propose a conservative variance
estimator that ensures valid inference by accounting for the complexity of the network.
Third, we present a model-agnostic covariate-adjusted estimator that is asymptotically no less efficient
than the Hájek-type estimator while maintaining valid inference.

Organization of the paper. The rest of the paper is organized as follows. Section 2 introduces the design-
based setup of the bipartite experiments. Section 3 discusses the estimation of the total treatment effect
under bipartite interference. Section 4 presents a covariate adjustment strategy for constructing point and
variance estimators and proves their asymptotic properties. Section 5 conducts many numerical experiments
to validate the proposed methods and theoretical results. Section 6 concludes the paper and outlines future
research directions. All proofs and technical details are in the Supplementary Material.

Notation. We will use the following notation. Let 1{·} denote the indicator function. Let plim denote
the probability limit, avar(·) and acov(·, ·) denote the asymptotic variance and covariance, respectively, and
≍ denote asymptotically the same order as the sample size increases to infinity. For any positive integer K,
denote [K] = {1, . . . , K} as the set of all positive integers smaller than or equal to K. Write bn = O(an ) if
bn /an is bounded and bn = o(an ) if bn /an converges to 0 as n → ∞. Write bn = Op (an ) if bn /an is bounded
in probability and bn = op (an ) if bn /an converges to 0 in probability.

2 Setup

2.1 Motivating examples


We first present several motivating examples that highlight the applicability of bipartite experiments.

Example 1 (Power plant). Zigler and Papadogeorgou (2021) study how installing a selective noncatalytic
reduction system (treatment) at upwind power plants (treatment units) causally affects hospitalization
rates in the neighborhoods (outcome units). In this case, a neighborhood can be affected by the treatments
of multiple upwind power plants, while one power plant may affect a set of neighborhoods. We will revisit
this example in Section 5.2.

Example 2 (Facebook Group). Shi et al. (2024) consider a bipartite experiment where treatments are
randomized across Facebook Groups and outcomes are measured by user-level engagement. The outcome of
each user is affected by interventions on a set of groups they belong to, while treatment in each group affects
all users within that group.

Example 3 (Amazon market). Harshaw et al. (2023) simulate a bipartite experiment on the Amazon marketplace
to evaluate the impact of new pricing mechanisms (treatments) randomized across items (treatment
units) on the level of satisfaction (outcome) of the customers (outcome units). In this scenario, items with
new pricing mechanisms may influence the group of customers who view them, while each customer may
encounter a variety of items subject to different pricing strategies.

2.2 Setup of bipartite experiments


Consider a finite population with m treatment units and n outcome units. For simplicity, we abbreviate the
terminology and call the treatment units “groups” and the outcome units “units”. The units and groups
are connected through a bipartite network, summarized by a known n × m adjacency matrix W, where $W_{ik}$
equals 1 if unit i is in group k, and 0 otherwise, for i = 1, . . . , n and k = 1, . . . , m. Let $S_i \subset \{1, \ldots, m\}$
denote the set of indices of all groups that unit i is in, i.e., $k \in S_i$ if and only if $W_{ik} = 1$. Let
$|S_i| = \sum_{k=1}^m W_{ik}$ denote the total number of groups unit i belongs to, and let $\bar S = \max_i |S_i|$ denote the maximum
number of groups the units belong to.
As introduced in Section 1, in a bipartite experiment the treatment is randomly
assigned at the group level, while the outcome of interest is measured on the unit side. On the group side,
we randomly assign the m groups to treatment and control arms. Let $Z_k$ denote the binary treatment status
of group k, where $Z_k$ equals 1 if group k is assigned to treatment, and 0 otherwise. Let $D_k \subset \{1, \ldots, n\}$
denote the set of indices of all units that belong to group k, let $|D_k| = \sum_{i=1}^n W_{ik}$ denote
the number of units group k contains, and let $\bar D = \max_k |D_k|$ denote the maximum number of units the groups
contain.
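As an illustration of this notation, the following Python sketch computes the index sets and degree summaries from an adjacency matrix. The toy matrix `W` and all variable names are hypothetical choices made for this example and do not come from the paper.

```python
import numpy as np

# Hypothetical toy adjacency matrix W (n = 5 units, m = 4 groups);
# W[i, k] = 1 means unit i belongs to group k.
W = np.array([
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
])

S_sizes = W.sum(axis=1)   # |S_i|: number of groups each unit belongs to
D_sizes = W.sum(axis=0)   # |D_k|: number of units each group contains
S_bar = S_sizes.max()     # maximum of |S_i| over units
D_bar = D_sizes.max()     # maximum of |D_k| over groups

# Index sets S_i (groups of unit i) and D_k (units of group k), 0-based.
S = [np.flatnonzero(W[i]) for i in range(W.shape[0])]
D = [np.flatnonzero(W[:, k]) for k in range(W.shape[1])]
```

Row sums of `W` give the unit degrees and column sums give the group sizes, so both summaries come from the same matrix.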
For each unit i, there are $2^m$ potential outcomes $Y_i(z)$, where $z = (z_1, \ldots, z_m)$ and $z_k \in \{0, 1\}$ for k = 1, . . . , m.
The observed outcome is Y = Y (Z), where Z = (Z1 , · · · , Zm ). We make the following assumption:

Assumption 1. The potential outcomes of unit i depend only on the treatment status of the groups to which
it belongs. Formally, Yi (z) = Yi (zSi ), where zSi denotes the subvector of z corresponding to groups in Si .

Here the potential outcome $Y_i(z)$ depends on the treatment vector $z_{S_i}$, whose dimension varies across
units, unlike the classic setting. We focus on the total treatment effect

$$\tau = n^{-1} \sum_{i=1}^{n} \{Y_i(1) - Y_i(0)\},$$

which captures the difference in the average potential outcomes when all groups are treated versus when
none are treated. It is a widely studied estimand in settings with interference such as bipartite spatial
experiments (Zigler and Papadogeorgou, 2021; Harshaw et al., 2023). Denoting $\mu_1 = n^{-1} \sum_{i=1}^{n} Y_i(1)$ and
$\mu_0 = n^{-1} \sum_{i=1}^{n} Y_i(0)$, we have $\tau = \mu_1 - \mu_0$.

3 Estimation and inference

3.1 Point estimator


Let $T_i = 1\{\sum_{k=1}^{m} W_{ik}(1 - Z_k) = 0\}$ and $C_i = 1\{\sum_{k=1}^{m} W_{ik} Z_k = 0\}$ denote the indicators that all groups
which unit i belongs to were assigned to the treatment group and the control group, respectively. Throughout
the paper, we focus on Bernoulli randomization as the treatment assignment mechanism on the group level,
formally defined in the following assumption:

Assumption 2 (Bernoulli randomization). Each group is randomly assigned to the treatment group with
probability p and to the control group with probability 1 − p, i.e., $Z_k \overset{\mathrm{iid}}{\sim} \mathrm{Bern}(p)$.

Under Assumption 2, we have $E(T_i) = p^{|S_i|}$ and $E(C_i) = (1 - p)^{|S_i|}$. A natural Horvitz-Thompson-type
estimator $n^{-1} \sum_{i=1}^{n} T_i Y_i / p^{|S_i|} - n^{-1} \sum_{i=1}^{n} C_i Y_i / (1 - p)^{|S_i|}$ is unbiased for τ. However, throughout this paper,
we focus on the following Hájek-type weighting estimator $\hat\tau = \hat\mu_1 - \hat\mu_0$, where

$$\hat\mu_1 = n^{-1} \sum_{i=1}^{n} \frac{T_i Y_i}{p^{|S_i|}} \Big/ n^{-1} \sum_{i=1}^{n} \frac{T_i}{p^{|S_i|}},
\qquad
\hat\mu_0 = n^{-1} \sum_{i=1}^{n} \frac{C_i Y_i}{(1 - p)^{|S_i|}} \Big/ n^{-1} \sum_{i=1}^{n} \frac{C_i}{(1 - p)^{|S_i|}},$$

because of its better finite-sample performance compared with the unbiased Horvitz-Thompson estimator.
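The two estimators can be sketched in a few lines of Python. This is a minimal illustration under Assumption 2, not the authors' implementation. The function names and the toy inputs are hypothetical, and the sketch assumes every unit belongs to at least one group and that at least one unit has $T_i = 1$ and at least one has $C_i = 1$ (otherwise the Hájek ratios are undefined).

```python
import numpy as np

def exposure_indicators(W, Z):
    """T_i = 1 if all groups of unit i are treated; C_i = 1 if all are in control."""
    T = (W @ (1 - Z) == 0).astype(float)
    C = (W @ Z == 0).astype(float)
    return T, C

def ht_and_hajek(W, Z, Y, p):
    """Return the Horvitz-Thompson-type and Hajek-type estimates of tau."""
    n = len(Y)
    s = W.sum(axis=1)                 # |S_i|
    T, C = exposure_indicators(W, Z)
    w1 = T / p**s                     # inverse-probability weights, treated side
    w0 = C / (1.0 - p)**s             # inverse-probability weights, control side
    tau_ht = (w1 @ Y - w0 @ Y) / n
    tau_hajek = (w1 @ Y) / w1.sum() - (w0 @ Y) / w0.sum()
    return tau_ht, tau_hajek

# Hypothetical toy data: 5 units, 4 groups, groups 0 and 1 treated.
W = np.array([[1, 0, 0, 0],
              [1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 0, 0, 1]])
Z = np.array([1, 1, 0, 0])
Y = np.array([3.0, 4.0, 2.0, 1.0, 0.0])
tau_ht, tau_hajek = ht_and_hajek(W, Z, Y, p=0.5)
```

In this toy assignment, unit 2 belongs to one treated group and one control group, so it has $T_i = C_i = 0$ and contributes to neither weighted mean.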

3.2 Consistency of τ̂
In this subsection, we establish the consistency of τ̂ . We need the following regularity conditions.

Assumption 3. S̄ = O(1) and D̄/n = o(1).

Assumption 3 restricts the density of the bipartite graph as n increases. We require the maximum number
of groups each unit belongs to to be bounded by a constant, while allowing the maximum number of units each
group contains to increase with n, but at a slower rate.

Assumption 4. The potential outcomes and the covariates are bounded.

We impose the boundedness of the potential outcomes and covariates in Assumption 4 to prove the
limit theorems. It can be relaxed to moment conditions, but we keep the form of Assumption 4 to
simplify the presentation.
We have the following consistency result for τ̂ .

Theorem 3.1 (Consistency of τ̂ ). Under Assumptions 1–4, τ̂ converges in probability to τ .

3.3 Asymptotic distribution
In this section, we establish the asymptotic normality of the point estimator τ̂ . For this purpose, we further
assume the following condition on the density of the bipartite graph.

Assumption 5 (Sparse bipartite graph). Say that groups j1 and j2 are connected if there exists at least one
unit belonging to both groups. Assume that, for any group k, the total number of groups that are connected to k
is bounded by an absolute constant B:

$$\sum_{j \in [m] \setminus \{k\}} 1\{j, k \text{ are connected}\} \le B, \qquad k = 1, \ldots, m.$$

Assumption 5 imposes some sparsity conditions on the degree of the group network. Assuming a bounded
degree simplifies the presentation of the theoretical results, though our theory allows B to grow in some
polynomial order of n. This sparsity condition can be justified in many bipartite experiments. For instance,
in Example 1, two power plants are defined as connected only if there is at least one neighborhood monitored
within a certain distance of both power plants. If two power plants are far away from each other, they are
not connected. Therefore, such a geographical network formation naturally restricts the sparsity of the
network degrees. However, there are examples where this assumption is less likely to hold. In Example 3,
the items are connected in a dense pattern because each customer may encounter a wide range of items while
browsing the website, and the browsing lists from different customers can have many overlaps. We might
need different estimation strategies and theoretical tools to analyze bipartite experiments with such dense
network formations.
We introduce some additional notation for the potential outcomes. Let $Y(z) = (Y_1(z), \ldots, Y_n(z))$
denote the vector consisting of all potential outcomes under treatment assignment z, and let
$\tilde Y(z) = Y(z) - n^{-1} \sum_{i=1}^{n} Y_i(z)$ denote the centered potential outcome vector. Moreover, for i, j = 1, . . . , n, define the
following matrices:

$$(\Lambda_1)_{i,j} = p^{-|S_i \cap S_j|} - 1, \qquad (\Lambda_0)_{i,j} = (1 - p)^{-|S_i \cap S_j|} - 1, \qquad (\Lambda_\tau)_{i,j} = 1\{S_i \cap S_j \ne \emptyset\}.$$

The following theorem formally establishes the asymptotic normality of τ̂ :

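Theorem 3.2 (Asymptotic normality of τ̂ ). Under Assumptions 1–5, we have

$$v_n^{-1/2} (\hat\tau - \tau) \to N(0, 1)$$

in distribution if the asymptotic variance $v_n$ is nondegenerate in the sense that

$$\frac{v_n}{\sqrt{m} \cdot (\bar D / n)^2} \to \infty, \tag{1}$$

where $v_n$ has the following expression:

$$v_n = n^{-2} \left\{ \tilde Y(1)^t \Lambda_1 \tilde Y(1) + \tilde Y(0)^t \Lambda_0 \tilde Y(0) + 2 \tilde Y(1)^t \Lambda_\tau \tilde Y(0) \right\}. \tag{2}$$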
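For readers who want to evaluate (2) numerically, the sketch below builds the three matrices from the adjacency matrix and computes $v_n$. It is an oracle quantity because it uses both potential outcome vectors, which are never jointly observed. The function names and toy inputs are hypothetical and only serve the illustration.

```python
import numpy as np

def lambda_matrices(W, p):
    """Build Lambda_1, Lambda_0 and Lambda_tau from the n x m adjacency matrix W."""
    overlap = (W @ W.T).astype(float)       # overlap[i, j] = |S_i ∩ S_j|
    lam1 = p ** (-overlap) - 1.0
    lam0 = (1.0 - p) ** (-overlap) - 1.0
    lam_tau = (overlap > 0).astype(float)   # 1{S_i ∩ S_j is non-empty}
    return lam1, lam0, lam_tau

def asymptotic_variance(W, p, Y1, Y0):
    """Evaluate v_n in (2) from the full potential outcome vectors Y(1) and Y(0)."""
    n = len(Y1)
    lam1, lam0, lam_tau = lambda_matrices(W, p)
    y1 = Y1 - Y1.mean()                     # centered potential outcomes
    y0 = Y0 - Y0.mean()
    quad = y1 @ lam1 @ y1 + y0 @ lam0 @ y0 + 2.0 * (y1 @ lam_tau @ y0)
    return quad / n**2

# Hypothetical toy inputs (not from the paper).
W = np.array([[1, 0, 0],
              [1, 1, 0],
              [0, 1, 1],
              [0, 0, 1]])
Y1 = np.array([2.0, 3.0, 1.0, 4.0])
Y0 = np.array([1.0, 1.5, 0.5, 2.0])
v_n = asymptotic_variance(W, 0.5, Y1, Y0)
```

Passing an identity matrix for `W` gives the classic Bernoulli design, which is a convenient way to check the function against the familiar per-unit variance formula.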

Remark 3.1 (Non-degeneracy of vn ). In Theorem 3.2, we impose the non-degeneracy condition (1) to rule
out the cases where the asymptotic variance (2) is of too small an order. The rate is motivated by our central
limit theorem established in Theorem A.1 in the supplementary material. Naturally, it requires the potential
outcomes to have a non-degenerate covariance structure. In the classic Bernoulli randomized experiment
setting, $\bar D = 1$ and n = m, and (1) requires $v_n$ to have a larger order than $n^{-3/2}$. This is automatically
satisfied according to the standard results in the literature (e.g., Li and Ding, 2017), which typically give the
rate $v_n = O(n^{-1})$ under mild assumptions on the potential outcomes.
As an additional sanity check, we justify the nondegeneracy condition in some random data examples.
Consider a network where D̄ and S̄ are both finite. Then m ≍ n. If the potential outcomes (Yi (1), Yi (0)) are
generated independently from a bivariate normal distribution $N(0_2, \sigma_i^2 I_2)$, then the quantities in $v_n$ have the
following orders:

$$n^{-2} \tilde Y(1)^t \Lambda_1 \tilde Y(1) = n^{-2} \sum_{i=1}^{n} (p^{-|S_i|} - 1) \sigma_i^2 + o_p(n^{-1}),$$

$$n^{-2} \tilde Y(0)^t \Lambda_0 \tilde Y(0) = n^{-2} \sum_{i=1}^{n} \{(1 - p)^{-|S_i|} - 1\} \sigma_i^2 + o_p(n^{-1}),$$

$$n^{-2} \tilde Y(1)^t \Lambda_\tau \tilde Y(0) = o_p(n^{-1}),$$

which ensure that

$$v_n = n^{-2} \sum_{i=1}^{n} \{p^{-|S_i|} + (1 - p)^{-|S_i|} - 2\} \sigma_i^2 + o_p(n^{-1}).$$

When the $\sigma_i^2$ are bounded away from zero, the leading term is of order $n^{-1}$, and thus Condition (1) is satisfied.

Below we use classic Bernoulli randomization and cluster randomization as examples to illustrate the
variance formula in Theorem 3.2. Our Theorem 3.2 recovers the existing results.

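Example 4 (Classic Bernoulli randomized experiment). In classic Bernoulli randomization where the
randomization units are identical to the outcome units,

$$|S_i \cap S_j| = \begin{cases} 1, & \text{if } i = j, \\ 0, & \text{if } i \ne j. \end{cases}$$

Hence $\Lambda_1 = \{(1 - p)/p\} I_n$, $\Lambda_0 = \{p/(1 - p)\} I_n$ and $\Lambda_\tau = I_n$, and the asymptotic variance in equation (2) reduces to

$$v_n = n^{-2} \, p(1 - p) \sum_{i=1}^{n} \left\{ \frac{\tilde Y_i(1)}{p} + \frac{\tilde Y_i(0)}{1 - p} \right\}^2,$$

which recovers the classic result of Bernoulli randomization in Miratrix et al. (2012, Theorem 1).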

Example 5 (Cluster randomization). In a cluster randomization setting with m clusters and the treatment
assignment $Z_k \overset{\mathrm{iid}}{\sim} \mathrm{Bern}(p)$ for k = 1, . . . , m, we have

$$|S_i \cap S_j| = \begin{cases} 1, & \text{if } i, j \text{ belong to the same group}, \\ 0, & \text{otherwise}. \end{cases}$$

If we order the units according to the cluster they belong to, then

$$\Lambda_\tau = \begin{pmatrix} 1_{n_1} & 0 & \cdots & 0 \\ 0 & 1_{n_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1_{n_m} \end{pmatrix},
\qquad \Lambda_1 = \frac{1 - p}{p} \Lambda_\tau, \qquad \Lambda_0 = \frac{p}{1 - p} \Lambda_\tau,$$

where $1_{n_k}$ is an $n_k \times n_k$ matrix with all entries equal to 1 and $n_k$ is the total number of units
in cluster k for k = 1, . . . , m. Therefore, the asymptotic variance in equation (2) reduces to

$$v_n = n^{-2} \, p(1 - p) \sum_{k=1}^{m} \left[ \sum_{i \in D_k} \left\{ \frac{\tilde Y_i(1)}{p} + \frac{\tilde Y_i(0)}{1 - p} \right\} \right]^2.$$

3.4 Variance estimation


To conduct Wald-type inference based on the central limit theorem in Theorem 3.2, we need to estimate the
asymptotic variance $v_n$. We propose the following variance estimator

$$\hat v = \left[ \left\{ n^{-2} \sum_{i,j} \frac{T_i T_j (Y_i - \hat\mu_1)(Y_j - \hat\mu_1)(\Lambda_1)_{i,j}}{p^{|S_i \cup S_j|}} \right\}^{1/2}
+ \left\{ n^{-2} \sum_{i,j} \frac{C_i C_j (Y_i - \hat\mu_0)(Y_j - \hat\mu_0)(\Lambda_0)_{i,j}}{(1 - p)^{|S_i \cup S_j|}} \right\}^{1/2} \right]^2. \tag{3}$$

The variance formula in (3) involves double summations over units i and j, where the two parts inside
brackets are sample analogs of $n^{-2} \tilde Y(1)^t \Lambda_1 \tilde Y(1)$ and $n^{-2} \tilde Y(0)^t \Lambda_0 \tilde Y(0)$, respectively.
The following Theorem 3.3 shows that $\hat v$ converges in probability to a limit that is no smaller than the true
asymptotic variance avar(τ̂ ), i.e., $\hat v$ is conservative. Therefore, we can construct the Wald-type large-sample
confidence interval $[\hat\tau - q_{\alpha/2} \hat v^{1/2}, \hat\tau + q_{\alpha/2} \hat v^{1/2}]$, which ensures valid Type I error control
in large samples, where $q_{\alpha/2}$ denotes the upper α/2 quantile of a standard Gaussian distribution.
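The variance estimator in (3) and the resulting Wald interval can be sketched as follows. The function names and the toy data are hypothetical. Clipping the two quadratic forms at zero before taking square roots is a numerical safeguard added for this sketch and is not part of the paper's estimator.

```python
import numpy as np
from statistics import NormalDist

def conservative_variance(W, Z, Y, p):
    """Return the Hajek estimate and the variance estimate v_hat from (3)."""
    n = len(Y)
    s = W.sum(axis=1).astype(float)               # |S_i|
    overlap = (W @ W.T).astype(float)             # |S_i ∩ S_j|
    union = s[:, None] + s[None, :] - overlap     # |S_i ∪ S_j|
    T = (W @ (1 - Z) == 0).astype(float)
    C = (W @ Z == 0).astype(float)
    w1, w0 = T / p**s, C / (1.0 - p) ** s
    mu1, mu0 = (w1 @ Y) / w1.sum(), (w0 @ Y) / w0.sum()
    r1 = T * (Y - mu1)                            # residuals of fully treated units
    r0 = C * (Y - mu0)                            # residuals of fully control units
    k1 = (p ** (-overlap) - 1.0) / p**union
    k0 = ((1.0 - p) ** (-overlap) - 1.0) / (1.0 - p) ** union
    q1 = (r1 @ k1 @ r1) / n**2
    q0 = (r0 @ k0 @ r0) / n**2
    # Safeguard assumed for this sketch: clip at zero before the square root.
    v_hat = (np.sqrt(max(q1, 0.0)) + np.sqrt(max(q0, 0.0))) ** 2
    return mu1 - mu0, v_hat

def wald_ci(tau_hat, v_hat, alpha=0.05):
    """Two-sided large-sample interval based on the Gaussian quantile."""
    q = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half = q * np.sqrt(v_hat)
    return tau_hat - half, tau_hat + half

# Hypothetical toy data: 5 units, 4 groups, groups 0 and 1 treated.
W = np.array([[1, 0, 0, 0],
              [1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 0, 0, 1]])
Z = np.array([1, 1, 0, 0])
Y = np.array([3.0, 4.0, 2.0, 1.0, 0.0])
tau_hat, v_hat = conservative_variance(W, Z, Y, p=0.5)
lo, hi = wald_ci(tau_hat, v_hat)
```

The double sums are written as quadratic forms, which costs O(n²) memory. For large n with a sparse network, one would likely restrict the sums to pairs with overlapping group sets, since $(\Lambda_1)_{i,j}$ and $(\Lambda_0)_{i,j}$ vanish otherwise.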

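Theorem 3.3 (Conservative variance estimator for τ̂ ). Under Assumptions 1–5,

(a) $\hat v / \mathrm{plim}(\hat v)$ converges in probability to 1, where

$$\mathrm{plim}(\hat v) = \left[ \left\{ n^{-2} \tilde Y(1)^t \Lambda_1 \tilde Y(1) \right\}^{1/2} + \left\{ n^{-2} \tilde Y(0)^t \Lambda_0 \tilde Y(0) \right\}^{1/2} \right]^2.$$

(b) $\mathrm{plim}(\hat v) \ge \mathrm{avar}(\hat\tau)$, where the equality holds if and only if

$$\tilde Y(1)^t \Lambda_\tau \tilde Y(0) = \{\tilde Y(1)^t \Lambda_1 \tilde Y(1)\}^{1/2} \{\tilde Y(0)^t \Lambda_0 \tilde Y(0)\}^{1/2}. \tag{4}$$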

Condition (4) quantifies the requirement to achieve consistent variance estimation, which is derived based
on the Cauchy–Schwarz inequality. It depends on both the values of the potential outcomes and the structure
of the networks. We revisit Examples 4 and 5 to provide more intuition for the equality condition in
special cases.
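One way to see where the inequality in Theorem 3.3(b) and condition (4) come from is to compare the probability limit of the variance estimator with $v_n$ in (2). The shorthand a, b, c, A, B below is introduced only for this sketch and is not the paper's notation.

$$a = n^{-2} \tilde Y(1)^t \Lambda_1 \tilde Y(1), \qquad b = n^{-2} \tilde Y(0)^t \Lambda_0 \tilde Y(0), \qquad c = n^{-2} \tilde Y(1)^t \Lambda_\tau \tilde Y(0),$$

$$\mathrm{plim}(\hat v) - v_n = (a^{1/2} + b^{1/2})^2 - (a + b + 2c) = 2 \{ (ab)^{1/2} - c \}.$$

Under Bernoulli randomization, with $A = n^{-1} \sum_i T_i \tilde Y_i(1) / p^{|S_i|}$ and $B = n^{-1} \sum_i C_i \tilde Y_i(0) / (1 - p)^{|S_i|}$, a direct calculation gives $\mathrm{var}(A) = a$, $\mathrm{var}(B) = b$ and $\mathrm{cov}(A, B) = -c$. The Cauchy–Schwarz inequality $|\mathrm{cov}(A, B)| \le \{\mathrm{var}(A) \mathrm{var}(B)\}^{1/2}$ then yields $c \le (ab)^{1/2}$, so the difference above is nonnegative, and it vanishes exactly when $c = (ab)^{1/2}$, which is condition (4).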

Continuation of Example 4 (Classic Bernoulli randomized experiment). In the classic Bernoulli randomized
experiment setting, condition (4) reduces to

$$\sum_{i=1}^{n} \tilde Y_i(1) \tilde Y_i(0) = \left\{ \sum_{i=1}^{n} \tilde Y_i(1)^2 \sum_{i=1}^{n} \tilde Y_i(0)^2 \right\}^{1/2},$$

which is equivalent to $\tilde Y_i(1) = \zeta_1 \tilde Y_i(0)$ for all i = 1, . . . , n, where $\zeta_1 > 0$ is a constant. A special
case that satisfies this condition is the constant treatment effect case with $Y_i(1) - Y_i(0) = \tau$ for all units
i = 1, . . . , n.

Continuation of Example 5 (Cluster randomization). In the cluster experiment setting, condition (4)
reduces to

$$\sum_{k=1}^{m} \Big\{ \sum_{i \in D_k} \tilde Y_i(1) \Big\} \Big\{ \sum_{i \in D_k} \tilde Y_i(0) \Big\}
= \left[ \sum_{k=1}^{m} \Big\{ \sum_{i \in D_k} \tilde Y_i(1) \Big\}^2 \sum_{k=1}^{m} \Big\{ \sum_{i \in D_k} \tilde Y_i(0) \Big\}^2 \right]^{1/2},$$

which is equivalent to $\sum_{i \in D_k} \tilde Y_i(1) = \zeta_2 \sum_{i \in D_k} \tilde Y_i(0)$ for all k = 1, . . . , m, where $\zeta_2 > 0$ is a constant.
A special case that satisfies this condition is when the cluster-specific average treatment effect on each cluster
is a constant, i.e., $|D_k|^{-1} \{ \sum_{i \in D_k} Y_i(1) - \sum_{i \in D_k} Y_i(0) \} = \tau$ for all clusters k = 1, . . . , m.

4 Covariate adjustment

4.1 Methodology
In this section, we propose a covariate adjustment strategy to improve efficiency in bipartite experiments.
Covariate adjustment is a classic topic in randomized experiments. For completely randomized experiments,
Fisher (1925) proposed to use the analysis of covariance (ANCOVA) to improve estimation efficiency.
Freedman (2008a,b) later reanalyzed the ANCOVA and found that it does not guarantee efficiency improvement
in completely randomized experiments. In response to these critiques, Lin (2013) proposed another covariate
adjustment strategy that guarantees asymptotic efficiency gains, and the resulting estimator is usually called
Lin’s estimator in the literature. Ding (2024, Chapter 6) reviewed the intuition of Lin (2013) from different
points of view. We will generalize the idea of Lin (2013) to obtain a covariate adjustment strategy in
the current setting.
Let $\tilde X_i$ denote the centered covariates, i.e., $\tilde X_i = X_i - n^{-1} \sum_{i=1}^{n} X_i$. Consider a class of linearly adjusted
estimators indexed by $(\beta_1, \beta_0)$:

$$\hat\tau(\beta_1, \beta_0) = n^{-1} \sum_{i=1}^{n} \frac{T_i (Y_i - \beta_1^t \tilde X_i)}{p^{|S_i|}} \Big/ n^{-1} \sum_{i=1}^{n} \frac{T_i}{p^{|S_i|}}
- n^{-1} \sum_{i=1}^{n} \frac{C_i (Y_i - \beta_0^t \tilde X_i)}{(1 - p)^{|S_i|}} \Big/ n^{-1} \sum_{i=1}^{n} \frac{C_i}{(1 - p)^{|S_i|}},$$

where we replace $Y_i$ in the naive estimator $\hat\tau$ with linearly adjusted residuals. Further denote by
$\tilde X = (\tilde X_1, \ldots, \tilde X_n)^t$ the centered covariate matrix including all n units. The covariate adjustment estimator
$\hat\tau(\beta_1, \beta_0)$ has the following properties for fixed $(\beta_1, \beta_0)$.

Proposition 4.1 (Consistency and asymptotic distribution of τ̂ (β1 , β0 )). Under Assumptions 1–4, for any
fixed $(\beta_1, \beta_0)$, $\hat\tau(\beta_1, \beta_0)$ converges in probability to τ. If Assumption 5 also holds, the variance of
$\hat\tau(\beta_1, \beta_0)$ satisfies $\mathrm{var}\{\hat\tau(\beta_1, \beta_0)\} / v_n(\beta_1, \beta_0) = 1 + o(1)$, where

$$v_n(\beta_1, \beta_0) = n^{-2} \Big[ \{\tilde Y(1) - \tilde X \beta_1\}^t \Lambda_1 \{\tilde Y(1) - \tilde X \beta_1\} + \{\tilde Y(0) - \tilde X \beta_0\}^t \Lambda_0 \{\tilde Y(0) - \tilde X \beta_0\}
+ 2 \{\tilde Y(1) - \tilde X \beta_1\}^t \Lambda_\tau \{\tilde Y(0) - \tilde X \beta_0\} \Big]. \tag{5}$$

Further assuming the non-degeneracy of $v_n(\beta_1, \beta_0)$, i.e., $v_n(\beta_1, \beta_0) \asymp \bar D / n$, we have

$$v_n(\beta_1, \beta_0)^{-1/2} \{\hat\tau(\beta_1, \beta_0) - \tau\} \to N(0, 1)$$

in distribution.

Proposition 4.1 states results analogous to Theorem 3.2 on the asymptotic distribution of the class of
covariate-adjusted estimators. The results follow directly when we treat the linearly adjusted residuals of the
potential outcomes, $Y_i(1) - \beta_1^t \tilde X_i$ and $Y_i(0) - \beta_0^t \tilde X_i$, as “pseudo potential outcomes” and apply Theorem 3.2.
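Similarly, we can construct a conservative variance estimator for $\hat\tau(\beta_1, \beta_0)$. Denote the upper bound of
$v_n(\beta_1, \beta_0)$ as

$$v_{n,\mathrm{ub}}(\beta_1, \beta_0) = \left( \Big[ n^{-2} \{\tilde Y(1) - \tilde X \beta_1\}^t \Lambda_1 \{\tilde Y(1) - \tilde X \beta_1\} \Big]^{1/2}
+ \Big[ n^{-2} \{\tilde Y(0) - \tilde X \beta_0\}^t \Lambda_0 \{\tilde Y(0) - \tilde X \beta_0\} \Big]^{1/2} \right)^2$$

and the corresponding consistent estimator of the upper bound as

$$\hat v_{n,\mathrm{ub}}(\beta_1, \beta_0) = \left[ \left\{ n^{-2} \sum_{i,j} \frac{T_i T_j (Y_i - \hat\mu_1 - \beta_1^t \tilde X_i)(Y_j - \hat\mu_1 - \beta_1^t \tilde X_j)(\Lambda_1)_{i,j}}{p^{|S_i \cup S_j|}} \right\}^{1/2}
+ \left\{ n^{-2} \sum_{i,j} \frac{C_i C_j (Y_i - \hat\mu_0 - \beta_0^t \tilde X_i)(Y_j - \hat\mu_0 - \beta_0^t \tilde X_j)(\Lambda_0)_{i,j}}{(1 - p)^{|S_i \cup S_j|}} \right\}^{1/2} \right]^2.$$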

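To gain the best asymptotic efficiency by using covariate adjustment estimators, ideally, we want to
minimize the asymptotic variance in (5) over $(\beta_1, \beta_0)$. However, the third term in (5) is not estimable
because it depends on the joint distribution of the potential outcomes. Instead, the improvement in the
asymptotic variance of the covariate-adjusted estimator, $\hat\tau(\beta_1, \beta_0)$, compared with that of the naive estimator
$\hat\tau$, is estimable. Denote

$$L(\beta_1, \beta_0) = n^{-2} \begin{pmatrix} \beta_1 \\ \beta_0 \end{pmatrix}^t
\begin{pmatrix} \tilde X^t \Lambda_1 \tilde X & \tilde X^t \Lambda_\tau \tilde X \\ \tilde X^t \Lambda_\tau \tilde X & \tilde X^t \Lambda_0 \tilde X \end{pmatrix}
\begin{pmatrix} \beta_1 \\ \beta_0 \end{pmatrix}
- 2 n^{-2} \begin{pmatrix} \tilde X^t \Lambda_1 \tilde Y(1) + \tilde X^t \Lambda_\tau \tilde Y(0) \\ \tilde X^t \Lambda_0 \tilde Y(0) + \tilde X^t \Lambda_\tau \tilde Y(1) \end{pmatrix}^t
\begin{pmatrix} \beta_1 \\ \beta_0 \end{pmatrix}.$$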

We can verify that $L(\beta_1, \beta_0)$ is the difference between the two asymptotic variances $v_n(\beta_1, \beta_0)$ and $v_n(0, 0)$,
and thus measures the efficiency gain of $\hat\tau(\beta_1, \beta_0)$ compared with $\hat\tau$. We formalize it in Lemma 4.1.

Lemma 4.1. We have L(β1 , β0 ) = vn (β1 , β0 ) − vn (0, 0).
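The identity in Lemma 4.1 can be checked by expanding the three quadratic forms in (5) term by term. The following worked expansion is a sketch that uses only the symmetry of $\Lambda_1$, $\Lambda_0$ and $\Lambda_\tau$.

$$\{\tilde Y(1) - \tilde X \beta_1\}^t \Lambda_1 \{\tilde Y(1) - \tilde X \beta_1\} = \tilde Y(1)^t \Lambda_1 \tilde Y(1) - 2 \beta_1^t \tilde X^t \Lambda_1 \tilde Y(1) + \beta_1^t \tilde X^t \Lambda_1 \tilde X \beta_1,$$

$$\{\tilde Y(0) - \tilde X \beta_0\}^t \Lambda_0 \{\tilde Y(0) - \tilde X \beta_0\} = \tilde Y(0)^t \Lambda_0 \tilde Y(0) - 2 \beta_0^t \tilde X^t \Lambda_0 \tilde Y(0) + \beta_0^t \tilde X^t \Lambda_0 \tilde X \beta_0,$$

$$2 \{\tilde Y(1) - \tilde X \beta_1\}^t \Lambda_\tau \{\tilde Y(0) - \tilde X \beta_0\} = 2 \tilde Y(1)^t \Lambda_\tau \tilde Y(0) - 2 \beta_1^t \tilde X^t \Lambda_\tau \tilde Y(0) - 2 \beta_0^t \tilde X^t \Lambda_\tau \tilde Y(1) + 2 \beta_1^t \tilde X^t \Lambda_\tau \tilde X \beta_0.$$

The terms free of $(\beta_1, \beta_0)$ sum to $n^2 v_n(0, 0)$. The terms quadratic in $(\beta_1, \beta_0)$ form the block quadratic form in $L(\beta_1, \beta_0)$, and the linear terms collect into $-2 \beta_1^t \{\tilde X^t \Lambda_1 \tilde Y(1) + \tilde X^t \Lambda_\tau \tilde Y(0)\} - 2 \beta_0^t \{\tilde X^t \Lambda_0 \tilde Y(0) + \tilde X^t \Lambda_\tau \tilde Y(1)\}$. Multiplying by $n^{-2}$ gives $v_n(\beta_1, \beta_0) - v_n(0, 0) = L(\beta_1, \beta_0)$.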

Define the optimization problem and its solution as

$$(\tilde\beta_1, \tilde\beta_0) = \arg\min_{\beta_1, \beta_0} L(\beta_1, \beta_0). \tag{6}$$

By construction, the improvement in asymptotic variance is guaranteed. We formalize this result in the
following proposition.

Proposition 4.2 (Improvement in asymptotic variance). The covariate adjustment estimator $\hat\tau(\tilde\beta_1, \tilde\beta_0)$ has
an asymptotic variance no larger than that of $\hat\tau$, i.e., $v_n(\tilde\beta_1, \tilde\beta_0) \le v_n$.

4.2 Estimation and inference based on covariate adjustment


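The optimization in (6) is a population-level convex problem which has a closed-form global optimal solution
$(\tilde\beta_1, \tilde\beta_0)$,

$$\begin{pmatrix} \tilde\beta_1 \\ \tilde\beta_0 \end{pmatrix}
= \begin{pmatrix} \tilde X^t \Lambda_1 \tilde X & \tilde X^t \Lambda_\tau \tilde X \\ \tilde X^t \Lambda_\tau \tilde X & \tilde X^t \Lambda_0 \tilde X \end{pmatrix}^{-1}
\begin{pmatrix} \tilde X^t \Lambda_1 \tilde Y(1) + \tilde X^t \Lambda_\tau \tilde Y(0) \\ \tilde X^t \Lambda_0 \tilde Y(0) + \tilde X^t \Lambda_\tau \tilde Y(1) \end{pmatrix}.$$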
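The closed form follows from the first-order condition of the quadratic objective. In the sketch below, the shorthand β, G and b is introduced only for illustration: β stacks $(\beta_1, \beta_0)$, G is the symmetric block matrix built from $\tilde X^t \Lambda_1 \tilde X$, $\tilde X^t \Lambda_\tau \tilde X$ and $\tilde X^t \Lambda_0 \tilde X$, and b is the stacked vector in the linear term of $L$. We assume G is invertible.

$$L(\beta) = n^{-2} (\beta^t G \beta - 2 b^t \beta), \qquad \nabla L(\beta) = 2 n^{-2} (G \beta - b) = 0 \iff \beta = G^{-1} b,$$

$$L(G^{-1} b) = -n^{-2} \, b^t G^{-1} b \le 0 = L(0, 0).$$

The last inequality holds when G is positive definite, and combined with Lemma 4.1 it is the reason the minimizer cannot increase the asymptotic variance relative to the unadjusted estimator.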

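We propose to estimate the vector which includes unobserved potential outcomes using the sample analog by
inverse probability weighting, similar to the trick used for variance estimation in Section 3.4. We construct
the following estimator of $(\tilde\beta_1, \tilde\beta_0)$,

$$\begin{pmatrix} \hat\beta_1 \\ \hat\beta_0 \end{pmatrix}
= \begin{pmatrix} \tilde X^t \Lambda_1 \tilde X & \tilde X^t \Lambda_\tau \tilde X \\ \tilde X^t \Lambda_\tau \tilde X & \tilde X^t \Lambda_0 \tilde X \end{pmatrix}^{-1}
\begin{pmatrix}
\sum_{i,j} \dfrac{T_i T_j \tilde X_i (Y_j - \hat\mu_1)(\Lambda_1)_{i,j}}{p^{|S_i \cup S_j|}} + \sum_{i,j} \dfrac{C_i C_j \tilde X_i (Y_j - \hat\mu_0)(\Lambda_\tau)_{i,j}}{(1 - p)^{|S_i \cup S_j|}} \\[2ex]
\sum_{i,j} \dfrac{T_i T_j \tilde X_i (Y_j - \hat\mu_1)(\Lambda_\tau)_{i,j}}{p^{|S_i \cup S_j|}} + \sum_{i,j} \dfrac{C_i C_j \tilde X_i (Y_j - \hat\mu_0)(\Lambda_0)_{i,j}}{(1 - p)^{|S_i \cup S_j|}}
\end{pmatrix}.$$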
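A compact way to read the estimator above is as a linear solve with an inverse-probability-weighted right-hand side, followed by plugging the coefficients into the adjusted Hájek estimator. The Python sketch below follows that reading. The function name and argument layout are hypothetical, and the use of `numpy.linalg.lstsq` in place of an explicit matrix inverse is a choice made for this sketch so that a rank-deficient block matrix does not raise an error.

```python
import numpy as np

def adjusted_estimator(W, Z, Y, X, p):
    """Plug-in covariate-adjusted estimate; returns (tau_adj, beta1_hat, beta0_hat)."""
    s = W.sum(axis=1).astype(float)               # |S_i|
    overlap = (W @ W.T).astype(float)             # |S_i ∩ S_j|
    union = s[:, None] + s[None, :] - overlap     # |S_i ∪ S_j|
    lam1 = p ** (-overlap) - 1.0
    lam0 = (1.0 - p) ** (-overlap) - 1.0
    lam_tau = (overlap > 0).astype(float)

    T = (W @ (1 - Z) == 0).astype(float)
    C = (W @ Z == 0).astype(float)
    w1, w0 = T / p**s, C / (1.0 - p) ** s
    mu1, mu0 = (w1 @ Y) / w1.sum(), (w0 @ Y) / w0.sum()

    Xc = X - X.mean(axis=0)                       # centered covariates, n x d
    G = np.block([[Xc.T @ lam1 @ Xc, Xc.T @ lam_tau @ Xc],
                  [Xc.T @ lam_tau @ Xc, Xc.T @ lam0 @ Xc]])

    def ipw_cross(ind, resid, lam, prob):
        # sum_{i,j} ind_i ind_j Xc_i resid_j lam_ij / prob^{|S_i ∪ S_j|}
        kernel = lam / prob**union
        return (Xc * ind[:, None]).T @ kernel @ (ind * resid)

    rhs = np.concatenate([
        ipw_cross(T, Y - mu1, lam1, p) + ipw_cross(C, Y - mu0, lam_tau, 1.0 - p),
        ipw_cross(T, Y - mu1, lam_tau, p) + ipw_cross(C, Y - mu0, lam0, 1.0 - p),
    ])
    # Sketch-level choice: least squares instead of an explicit inverse of G.
    beta = np.linalg.lstsq(G, rhs, rcond=None)[0]
    d = X.shape[1]
    b1, b0 = beta[:d], beta[d:]
    tau_adj = (w1 @ (Y - Xc @ b1)) / w1.sum() - (w0 @ (Y - Xc @ b0)) / w0.sum()
    return tau_adj, b1, b0
```

Because the covariates are centered inside the function, adding a constant to any column of `X` leaves the output unchanged. As with the unadjusted estimator, the sketch assumes at least one fully treated and one fully control unit.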

We next establish the asymptotic properties of the covariate-adjusted estimator τ̂ (β̂1 , β̂0 ). To simplify the
discussion, we introduce the following assumption that imposes the limits for several population quantities.

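Assumption 6 (Existence of limiting values). Assume that

$$(\bar D / n) \begin{pmatrix} \tilde Y(1) & \tilde X \end{pmatrix}^t \Lambda_1 \begin{pmatrix} \tilde Y(1) & \tilde X \end{pmatrix}
\to \begin{pmatrix} \Omega_{yy,11} & \Omega_{yx,11} \\ \Omega_{yx,11}^t & \Omega_{xx,11} \end{pmatrix} =: \Omega_{11},$$

$$(\bar D / n) \begin{pmatrix} \tilde Y(0) & \tilde X \end{pmatrix}^t \Lambda_0 \begin{pmatrix} \tilde Y(0) & \tilde X \end{pmatrix}
\to \begin{pmatrix} \Omega_{yy,00} & \Omega_{yx,00} \\ \Omega_{yx,00}^t & \Omega_{xx,00} \end{pmatrix} =: \Omega_{00},$$

$$(\bar D / n) \begin{pmatrix} \tilde Y(1) & \tilde X \end{pmatrix}^t \Lambda_\tau \begin{pmatrix} \tilde Y(0) & \tilde X \end{pmatrix}
\to \begin{pmatrix} \Omega_{yy,10} & \Omega_{yx,10} \\ \Omega_{yx,01}^t & \Omega_{xx,10} \end{pmatrix} =: \Omega_{10}.$$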

Assumption 6 requires the weighted covariance matrices of potential outcomes and covariates to have
limiting values not depending on n as n → ∞. In the special case of completely randomized experiments
without interference, it reduces to the assumption in Theorem 5 in Li and Ding (2017).
The following theorem shows the consistency and asymptotic distribution of τ̂ (β̂1 , β̂0 ).

Theorem 4.1 (Consistency and asymptotic distribution of τ̂ (β̂1 , β̂0 )). Under Assumptions 1–4 and 6,
$\hat\tau(\hat\beta_1, \hat\beta_0)$ converges in probability to τ. If Assumption 5 also holds, then

$$\left[ \mathrm{avar}\{\hat\tau(\hat\beta_1, \hat\beta_0)\} \right]^{-1/2} \left\{ \hat\tau(\hat\beta_1, \hat\beta_0) - \tau \right\} \to N(0, 1)$$

in distribution.

Combined with Proposition 4.2, Theorem 4.1 suggests the covariate-adjusted estimator can reduce the
asymptotic variance compared with the unadjusted estimator.
Now following our discussion in Section 4.1, we can use the variance estimator v̂n,ub (β̂1 , β̂0 ) by plugging
in the estimated coefficients. The following theorem establishes the convergence and conservativeness of this
variance estimator.

Theorem 4.2 (Conservative variance estimator for $\hat\tau^{\mathrm{adj}}$). Under Assumptions 1–6, the variance estimator
$\hat v_{n,\mathrm{ub}}(\hat\beta_1, \hat\beta_0)$ is conservative: $\hat v_{n,\mathrm{ub}}(\hat\beta_1, \hat\beta_0) / v_{n,\mathrm{ub}}(\tilde\beta_1, \tilde\beta_0)$ converges
in probability to 1, and $v_{n,\mathrm{ub}}(\tilde\beta_1, \tilde\beta_0) \ge \mathrm{avar}\{\hat\tau(\tilde\beta_1, \tilde\beta_0)\}$.

Theorem 4.2 proves the conservativeness of the variance estimator, which directly motivates a valid
confidence interval: $[\hat\tau(\hat\beta_1, \hat\beta_0) - q_{\alpha/2} \hat v_{n,\mathrm{ub}}^{1/2}, \ \hat\tau(\hat\beta_1, \hat\beta_0) + q_{\alpha/2} \hat v_{n,\mathrm{ub}}^{1/2}]$.

5 Simulation and application

5.1 Simulated bipartite graph


We conduct simulation studies to evaluate the finite-sample performance of our proposed estimators in this
subsection. In the simulation, we start by creating two types of nodes: individual nodes and group nodes. We
generate the degree of each individual node $|S_i|$ by sampling from a Gaussian distribution $N((\bar S + 1)/2, (\bar S - 1)/6)$
and rounding the sampled value to the nearest integer. We choose this distribution with the specified mean
and standard deviation to ensure that the sampled degrees predominantly fall between 1 and
$\bar S$, minimizing the need for truncation or clipping. Next, we randomly connect each individual node i to $|S_i|$
different groups from the set of group nodes. Moreover, we make the following adjustments to ensure
the graph satisfies the sparsity condition in Assumption 5. For each degree s, we examine the number of
connected group sets through individuals with degree s. If the count surpasses a predefined upper limit,
we break a random subset of the connections. Specifically, we break links between individuals and groups
that are part of the same overly connected set. We then randomly establish new connections with other
group sets that were not previously overly connected, ensuring that the total degree of each individual is
unchanged. We keep the bipartite graph fixed after building it up.
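The first two steps of this construction (degree sampling and random attachment) can be sketched as follows. The function name, the seed handling and the explicit clipping to the range from 1 to $\bar S$ are assumptions of this sketch. The rewiring step that enforces Assumption 5 is deliberately omitted, because the text does not specify the upper limit or the exact rewiring rule.

```python
import numpy as np

def simulate_bipartite_graph(n, m, s_bar, seed=0):
    """Sample an n x m adjacency matrix with unit degrees in {1, ..., s_bar}.

    Degrees are drawn from a Gaussian with mean (s_bar + 1) / 2 and standard
    deviation (s_bar - 1) / 6, rounded to the nearest integer, then clipped.
    Requires s_bar <= m. The sparsity-enforcing rewiring step is not included.
    """
    rng = np.random.default_rng(seed)
    mean = (s_bar + 1) / 2
    sd = (s_bar - 1) / 6
    degrees = np.rint(rng.normal(mean, sd, size=n)).astype(int)
    degrees = np.clip(degrees, 1, s_bar)      # guard against rare out-of-range draws
    W = np.zeros((n, m), dtype=int)
    for i in range(n):
        groups = rng.choice(m, size=degrees[i], replace=False)
        W[i, groups] = 1
    return W

W = simulate_bipartite_graph(n=200, m=60, s_bar=5, seed=1)
```

With this standard deviation the interval from 1 to $\bar S$ spans three standard deviations on each side of the mean, which is why clipping is rarely triggered.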

We consider three different regimes of the data generating process. For each regime, we generate covariates
$X_i = (X_{1i}, X_{2i}) \sim (U[0,1])^2$ and the potential outcomes $Y_i(1)$ and $Y_i(0)$ from the conditional
distributions summarized in Table 1, with $\gamma = (1,1)^t$ and $\alpha_i \overset{\text{iid}}{\sim} U[0, 0.5]$. The treatment indicator $Z_k \sim \mathrm{Bern}(p)$ with $p = 0.5$.

Table 1: Three regimes of data generating process

Regime   Yi(1)                             Yi(0)
R1       N(0.25 + γ^t Xi, 1)               N(γ^t Xi, 1)
R2       N(αi + γ^t Xi, 1)                 N(γ^t Xi, 1)
R3       N(0.1|Si| + 1.1 γ^t Xi, 1.5)      N(γ^t Xi, 1.5)
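
The three regimes in Table 1 can be sketched as follows. The function name is hypothetical, and one reading is assumed rather than confirmed by the text: the second argument of $N(\cdot,\cdot)$ is treated as a standard deviation, matching the convention used for the degree distribution in this subsection.

```python
import random


def draw_potential_outcomes(regime, degrees, seed=0):
    """Draw covariates and potential outcomes for one regime of Table 1.

    degrees[i] plays the role of |S_i|. Assumption of this sketch: the
    second argument of N(., .) in Table 1 is a standard deviation.
    """
    rng = random.Random(seed)
    x, y1, y0 = [], [], []
    for d in degrees:
        xi = (rng.random(), rng.random())      # X_i ~ (U[0, 1])^2
        lin = 1.0 * xi[0] + 1.0 * xi[1]        # gamma^t X_i with gamma = (1, 1)
        if regime == "R1":
            mean1, sd = 0.25 + lin, 1.0
        elif regime == "R2":
            mean1, sd = rng.uniform(0.0, 0.5) + lin, 1.0  # alpha_i ~ U[0, 0.5]
        elif regime == "R3":
            mean1, sd = 0.1 * d + 1.1 * lin, 1.5
        else:
            raise ValueError("regime must be 'R1', 'R2', or 'R3'")
        x.append(xi)
        y1.append(rng.gauss(mean1, sd))
        y0.append(rng.gauss(lin, sd))
    return x, y1, y0


x, y1, y0 = draw_potential_outcomes("R1", degrees=[3] * 1000)
```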

Table 2 reports the finite-sample performance of the two estimators τ̂ and τ̂ adj with n = 5000, m = 1500,
and S̄ = 5 based on 1000 Monte Carlo replications. In all three regimes, the two estimators both have small
finite sample bias, and the proposed variance estimators are conservative, leading to valid inference with
coverage rates larger than 95%. Compared with the naive estimator τ̂ , the covariate-adjusted estimator has
a smaller standard error and higher power under all regimes. Although our theory guarantees efficiency
improvement only in asymptotic variance, in the numerical studies we also observe smaller estimated variances
and thus shorter confidence intervals under all three regimes.

Table 2: Finite sample performance of estimators τ̂ and τ̂^adj.

                 naive estimator                           covariate adjustment
Regime  τ      τ̂      se(τ̂)  ŝe(τ̂)  coverage  power     τ̂^adj  se(τ̂^adj)  ŝe(τ̂^adj)  coverage  power
R1      0.221  0.223  0.059  0.086  99.7%     82.3%     0.223  0.055      0.080      99.5%     89.3%
R2      0.256  0.255  0.062  0.085  98.8%     92.8%     0.254  0.058      0.079      98.8%     96.0%
R3      0.355  0.358  0.085  0.124  99.6%     90.6%     0.358  0.082      0.119      99.5%     93.4%

Note: For each regime of data generating process, we report the true total treatment effect τ , the two point estimators,
their standard error se(·), standard error estimator ŝe(·), the coverage rate of the 95% confidence interval constructed
using the conservative variance estimator, and their power.

5.2 Data analysis based on real-world bipartite graph


In this subsection, we apply our estimators to analyze a real-world bipartite graph. We revisit the application
discussed in Zigler and Papadogeorgou (2021) and Papadogeorgou et al. (2019), which studies the causal
effect of the installation of selective non-catalytic reduction system at a power plant on the air pollution
level in the nearby areas. The intervention is at the power plant level, i.e., each power plant may be
assigned to the implementation of the new system (Zk = 1) or not (Zk = 0). Since multiple power plants
simultaneously influence a given area, and each power plant can potentially affect multiple areas, it forms a
natural bipartite graph. To model the scenario, we simulate a bipartite randomized experiment based on the
real-world bipartite structure between power plants and nearby areas. We take power plants as treatment
units and air pollution monitors as outcome units.
We construct our dataset using the power plant dataset from Papadogeorgou et al. (2019) and 2004 air
pollution data at the monitor level from the United States Environmental Protection Agency’s website.
Additionally, we incorporate population information for the counties where the monitors are located. The
initial dataset of outcome units includes 95,762 air quality monitors, and the treatment units correspond to
473 coal or natural gas-burning power plants operating in the continental U.S. during the summer of 2004.
To prepare the dataset, we remove outcome units with an “Arithmetic Mean” above the 90th percentile
among all observations. We also exclude outcome units with an “Arithmetic Mean” close to 0 (we choose
2 as the threshold in this application) and outcome units with a population size exceeding $10^6$. To address
computational constraints, we randomly select 10% of the remaining monitors. We calculate the distances
between monitors and power plants using their longitude and latitude coordinates. A bipartite graph is then
constructed by connecting monitors to power plants located within 15 km. Finally, we remove monitors and
power plants that are not connected to any other units, resulting in a dataset comprising 795 outcome units
and 228 treatment units. The maximum degree of outcome units is restricted to be 2.
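
The graph construction step can be sketched as below. The text only says that distances are computed from longitude and latitude, so the haversine great-circle formula and the Earth radius used here are assumptions of this sketch, and the function names and coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def build_graph(monitors, plants, radius_km=15.0):
    """Connect each monitor to the plants within radius_km.

    Returns (edges, used_plants): edges maps a monitor index to its plant
    indices, and isolated monitors and plants are dropped.
    """
    edges = {}
    for i, (mlat, mlon) in enumerate(monitors):
        near = [k for k, (plat, plon) in enumerate(plants)
                if haversine_km(mlat, mlon, plat, plon) <= radius_km]
        if near:  # drop monitors with no nearby plant
            edges[i] = near
    used_plants = sorted({k for ks in edges.values() for k in ks})
    return edges, used_plants


# Made-up coordinates, for illustration only.
edges, used = build_graph(
    monitors=[(40.0, -75.0), (30.0, -90.0)],
    plants=[(40.05, -75.0), (41.0, -75.0)],
)
```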
We assume the potential outcomes to be $Y_i(1) = \gamma_1^t X_i + \varepsilon_1$ and $Y_i(0) = \gamma_0^t X_i + \varepsilon_0$, where $\gamma_1 =
(2, -2, -2)^t$, $\gamma_0 = (1, -1, -1)^t$, and $\varepsilon_1, \varepsilon_0 \sim U[0, 15]$. This data generation process is designed to mimic
the distribution of the observed “Arithmetic Mean” in the pollution dataset. To standardize the covariates
and ensure numerical stability, we scale the population size of the county where monitor $i$ is located by
dividing it by $10^6$ ($X_{1i}$), and the distance between monitor $i$ and its closest power plant by dividing it by 30
($X_{2i}$). The third covariate, $X_{3i}$, represents the number of power plants connected to monitor $i$. We consider
1000 Monte Carlo replications. In each replication, treatment units are randomly assigned to treatment with
a probability of p = 0.5. When applying the covariate adjustment estimator, we include the scaled covariates
X1i , X2i , and X3i in the model. The true total treatment effect is −1.266.
Table 3 reports the simulation results based on the real-world bipartite graph. Both
the naive estimator and the covariate-adjusted estimator have small biases for estimating the true treatment
effect, and both strategies lead to valid yet slightly conservative confidence intervals. Nevertheless,
applying the covariate adjustment strategy introduced in Section 4 reduces both
the standard deviation of the point estimator and the estimated variance, which raises the power
from 80.6% to 86.3%.

Table 3: Simulation results based on real bipartite graph in the power plant application

estimator                      point estimator   se      ŝe      coverage   power
naive estimator τ̂              −1.251            0.136   0.227   98.2%      80.6%
covariate adjustment τ̂^adj     −1.202            0.116   0.170   97.7%      86.3%

Note: we report two point estimators (with and without covariate adjustment), their standard error se, standard error
estimator ŝe, the coverage rate of the 95% confidence interval constructed using the conservative variance estimator,
and their power.

6 Discussion
We propose a design-based causal inference framework for bipartite experiments. We generalize the classic
SUTVA to the bipartite experiment setting and provide point and variance estimators for estimating the
total treatment effect. These estimators are based on theoretical results that guarantee the consistency and
asymptotic normality of the point estimator and the conservativeness of the variance estimator. We also
propose covariate adjustment strategies that improve the efficiency of the point estimator. This framework
extends the design-based causal inference frameworks for completely randomized experiments and cluster
randomized experiments.
While this framework is useful for estimating causal effects in many general scenarios involving bipartite
experiments, several directions remain for further investigation. First, we focus on the total average
treatment effect, which compares the all-versus-nothing treatment regimes. There are more general causal parameters
of interest that we can explore. Second, we only discuss the Bernoulli randomization treatment regime,
leaving other more complex bipartite intervention strategies undeveloped. Third, we mainly focus on the
outcome unit-level covariates for the covariate adjustment strategy. When treatment unit-level covariates
are also available, as an ad-hoc strategy, we can incorporate them by using a summary at the outcome-unit
level, for instance, taking the average or sum of the covariate values for the groups that each unit is connected
to. However, a more rigorous and systematic way of incorporating treatment-unit level covariates is unclear.
We leave them for future research.

References
Aronow, P. M. and Samii, C. (2017). Estimating average causal effects under general interference, with
application to a social network experiment. Annals of Applied Statistics, 11(4):1912–1947.

Brennan, J., Mirrokni, V., and Pouget-Abadie, J. (2022). Cluster randomized designs for one-sided bipartite
experiments. Advances in Neural Information Processing Systems, 35:37962–37974.

Ding, P. (2024). A first course in causal inference. CRC Press.

Donner, A. (1998). Some aspects of the design and analysis of cluster randomization trials. Journal of the
Royal Statistical Society: Series C (Applied Statistics), 47(1):95–113.

Donner, A., Klar, N., and Klar, N. S. (2000). Design and analysis of cluster randomization trials in health
research, volume 27. Arnold Publishers: London.

Doudchenko, N., Zhang, M., Drynkin, E., Airoldi, E., Mirrokni, V., and Pouget-Abadie, J. (2020). Causal
inference with bipartite designs. arXiv preprint arXiv:2010.02108.

Fisher, R. A. (1925). Statistical Methods for Research Workers. Oliver and Boyd, Edinburgh, 1st edition.

Forastiere, L., Airoldi, E. M., and Mealli, F. (2021). Identification and estimation of treatment and in-
terference effects in observational studies on networks. Journal of the American Statistical Association,
116(534):901–918.

Freedman, D. A. (2008a). On regression adjustments in experiments with several treatments.

Freedman, D. A. (2008b). On regression adjustments to experimental data. Advances in Applied Mathematics,
40(2):180–193.

Hall, P. and Heyde, C. C. (2014). Martingale limit theory and its application. Academic Press.

Harshaw, C., Sävje, F., Eisenstat, D., Mirrokni, V., and Pouget-Abadie, J. (2023). Design and analysis of
bipartite experiments under a linear exposure-response model. Electronic Journal of Statistics, 17(1):464–
518.

Imbens, G. W. and Rubin, D. B. (2015). Causal inference in statistics, social, and biomedical sciences.
Cambridge University Press.

Li, X. and Ding, P. (2017). General forms of finite population central limit theorems with applications to
causal inference. Journal of the American Statistical Association, 112(520):1759–1769.

Lin, W. (2013). Agnostic notes on regression adjustments to experimental data: Reexamining Freedman’s
critique. The Annals of Applied Statistics, 7(1):295–318.

Lu, X., Li, H., and Liu, H. (2024). Estimation and inference of average treatment effects under heterogeneous
additive treatment effect model. arXiv preprint arXiv:2408.17205.

Miratrix, L. W., Sekhon, J. S., and Yu, B. (2012). Adjusting treatment effect estimates by post-stratification
in randomized experiments. Journal of the Royal Statistical Society Series B: Statistical Methodology,
75(2):369–396.

Papadogeorgou, G., Choirat, C., and Zigler, C. M. (2019). Adjusting for unmeasured spatial confounding
with distance adjusted propensity score matching. Biostatistics, 20(2):256–272.

Pouget-Abadie, J., Aydin, K., Schudy, W., Brodersen, K., and Mirrokni, V. (2019). Variance reduction in
bipartite experiments through correlation clustering. Advances in Neural Information Processing Systems,
32.

Shi, L., Bakhitov, E., Hung, K., Karrer, B., Walker, C., Bhole, M., and Schrijvers, O. (2024). Scalable
analysis of bipartite experiments. arXiv preprint arXiv:2402.11070.

Song, Z. and Papadogeorgou, G. (2024). Bipartite causal inference with interference, time series data, and
a random network. arXiv preprint arXiv:2404.04775.

Su, F. and Ding, P. (2021). Model-assisted analyses of cluster-randomized experiments. Journal of the Royal
Statistical Society Series B: Statistical Methodology, 83(5):994–1015.

Zigler, C. M. and Papadogeorgou, G. (2021). Bipartite causal inference with interference. Statistical Science,
36(1):109–123.

Supplementary Material
Section A provides a general theory of establishing central limit theorems for bipartite experiments under
Bernoulli randomization.
Section B provides proofs of all theorems in the main text.

A A useful central limit theorem


To prove the central limit theorem for τ̂ , we first prove a central limit theorem for a general statistic defined
as follows:

\[
\Gamma = \sum_{k_1} a_{k_1}\tilde Z_{k_1} + \sum_{k_1<k_2} a_{k_1k_2}\tilde Z_{k_1}\tilde Z_{k_2} + \cdots + \sum_{k_1<\cdots<k_{\bar S}} a_{k_1\ldots k_{\bar S}}\tilde Z_{k_1}\cdots\tilde Z_{k_{\bar S}}. \tag{A.1}
\]

Here $\{a_{k_1\ldots k_s} : k_1,\ldots,k_s\in[m],\ k_1\neq\cdots\neq k_s\}$ is an $s$-dimensional array that is symmetric in its indices,
i.e.,

\[
a_{k_1\ldots k_s} = a_{k_1'\ldots k_s'} \quad \text{if } \{k_1,\ldots,k_s\} = \{k_1',\ldots,k_s'\}.
\]

As a convention, we use $(k_1\ldots k_s)$ to denote an unordered $s$-tuple with $k_1\neq\cdots\neq k_s$. Moreover, the $\tilde Z_k$'s are
i.i.d. copies of a random variable $\tilde Z$ with mean zero, variance $\sigma^2$, and fourth moment bounded by $E(\tilde Z^4)\le\nu_4^4$.
Note that here we do not require $\tilde Z$ to be a centered Bernoulli variable.
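For the statistic in (A.1), we have $E(\Gamma) = 0$ and

\[
v_\Gamma = \mathrm{var}(\Gamma) = \sum_{k_1} a_{k_1}^2\sigma^2 + \sum_{k_1<k_2} a_{k_1k_2}^2\sigma^4 + \cdots + \sum_{k_1<\cdots<k_{\bar S}} a_{k_1\ldots k_{\bar S}}^2\sigma^{2\bar S}.
\]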
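
The variance formula follows because products of centered independent variables over distinct index sets are uncorrelated. A small numerical check is sketched below, assuming for illustration that $\tilde Z_k = Z_k - p$ with $Z_k \sim \mathrm{Bern}(p)$, so that $\sigma^2 = p(1-p)$. The coefficient array and all function names are hypothetical.

```python
from itertools import product
from math import prod


def gamma_stat(coef, z_tilde):
    """Evaluate Gamma: coef maps an index tuple (k1, ..., ks) to a_{k1...ks}."""
    return sum(a * prod(z_tilde[k] for k in idx) for idx, a in coef.items())


def var_formula(coef, sigma2):
    """Closed form: sum of a^2 * sigma^(2s) over all index tuples of size s."""
    return sum(a * a * sigma2 ** len(idx) for idx, a in coef.items())


def var_exact(coef, m, p):
    """Exact variance of Gamma by enumerating all 2^m Bernoulli assignments."""
    mean = second = 0.0
    for z in product((0, 1), repeat=m):
        weight = prod(p if zk else 1 - p for zk in z)
        g = gamma_stat(coef, [zk - p for zk in z])
        mean += weight * g
        second += weight * g * g
    return second - mean ** 2


coef = {(0,): 1.0, (1,): -0.5, (0, 1): 2.0, (1, 2): 0.7}  # toy array with S-bar = 2
p = 0.3
gap = abs(var_formula(coef, p * (1 - p)) - var_exact(coef, m=3, p=p))
```

The two quantities agree up to floating-point error, and the enumeration also confirms that $E(\Gamma) = 0$ for this example.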

We have the following central limit theorem for Γ:

Theorem A.1. Assume that

1. the elements of the array $a$ are bounded by some constant $\bar a_m$ that possibly depends on $m$;
2. there exists a universal constant $B$ such that for all $k\in[m]$ and $s\in[\bar S]$,

\[
\sum_{(k_1,\ldots,k_s)\subset[m]\setminus\{k\}} 1\{|a_{kk_1\ldots k_s}|\neq 0\} \le B;
\]

3. the variance is nondegenerate: $v_\Gamma/(m^{1/2}\bar a_m^2)$ goes to $\infty$.

Then $v_\Gamma^{-1/2}\Gamma \to N(0,1)$ in distribution.

We will use the martingale central limit theorem in Hall and Heyde (2014) to prove Theorem A.1. For
completeness of our proof, we first review the martingale central limit theorem as the following
Proposition A.1.

Proposition A.1 (Theorem 3.2 of Hall and Heyde (2014)). Let $\{S_{ni}, \mathcal F_{ni}, 1\le i\le k_n, n\ge 1\}$ be a
zero-mean, square-integrable martingale array with differences $\Delta_{ni}$, and let $\eta^2$ be an almost surely finite random
variable. Suppose the following conditions hold:
1. Squared sum convergence:

\[
\sum_i E(\Delta_{ni}^2 \mid \mathcal F_{n,i-1}) \to \eta^2 \tag{A.2}
\]

in probability,
2. Lindeberg condition:

\[
\sum_i E(\Delta_{ni}^2 1\{|\Delta_{ni}|>\varepsilon\} \mid \mathcal F_{n,i-1}) \to 0 \quad \text{for all } \varepsilon>0, \tag{A.3}
\]

in probability,
and the σ-fields are nested: $\mathcal F_{n,i}\subseteq\mathcal F_{n+1,i}$ for $1\le i\le k_n$, $n\ge 1$. Then $S_{nk_n} = \sum_i \Delta_{ni}$ converges in
distribution (stably) to some random variable with characteristic function $E\{\exp(-\tfrac12\eta^2t^2)\}$.

In particular, a sufficient condition for the Lindeberg condition (A.3) is given by the following Lyapunov
condition:

\[
\text{For some } \delta>0, \quad \sum_{i=1}^{n} E\{|\Delta_{ni}|^{2+\delta}\} \to 0. \tag{A.4}
\]

Proof of Theorem A.1. We prove Theorem A.1 following three steps. We first construct a martingale dif-
ference sequence based on Γ. Next, we check the convergence of the summation of the conditional squared
differences in equation (A.2). Finally, we check the Lyapunov condition in equation (A.4).
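Step 1. Construct a martingale difference sequence based on Γ. Let $\mathcal F_{m,k}$ be the σ-algebra
generated by $\tilde Z_1,\ldots,\tilde Z_k$, i.e., $\mathcal F_{m,k} = \sigma\{\tilde Z_1,\ldots,\tilde Z_k\}$. For ease of notation, for any $k_1,\ldots,k_\ell\in[m]$, we
denote $\tilde Z_{k_1\ldots k_\ell} = \tilde Z_{k_1}\cdots\tilde Z_{k_\ell}$. Let

\[
\Delta_{mk} = v_\Gamma^{-1/2}\sum_{s=1}^{\bar S\wedge k}\ \sum_{(k_1\ldots k_{s-1})\subset[k-1]} a_{k_1\ldots k_{s-1}k}\,\tilde Z_{k_1\ldots k_{s-1}k},
\]

with $a_{\emptyset k} = a_k$ and $\tilde Z_{\emptyset} = 1$. Then $\{\Delta_{mk}, \mathcal F_{m,k}\}_{k=1}^m$ forms a martingale difference sequence and,
because of the normalization by $v_\Gamma^{-1/2}$ in the definition of $\Delta_{mk}$,

\[
v_\Gamma^{-1/2}\,\Gamma = \sum_{k=1}^m \Delta_{mk}.
\]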

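Step 2. Check the convergence of the summation of the conditional squared differences in
equation (A.2). We show equation (A.2) by computing the variance of its LHS,

\[
\begin{aligned}
\mathrm{var}\Big\{\sum_k E(\Delta_{mk}^2\mid\mathcal F_{m,k-1})\Big\}
&= \frac{\sigma^4}{v_\Gamma^2}\,\mathrm{var}\Bigg\{\sum_k\sum_{s,r}^{\bar S\wedge k}\sum_{\substack{(k_1\ldots k_s)\subset[k-1]\\ (k_1'\ldots k_r')\subset[k-1]}} a_{k_1\ldots k_sk}\,a_{k_1'\ldots k_r'k}\,\tilde Z_{k_1\ldots k_s}\tilde Z_{k_1'\ldots k_r'}\Bigg\}\\
&= \frac{\sigma^4}{v_\Gamma^2}\Bigg\{\sum_{k,\ell}\sum_{s,r}^{\bar S\wedge k}\sum_{t,u}^{\bar S\wedge\ell}\sum_{\substack{(k_1\ldots k_s)\subset[k-1]\\ (k_1'\ldots k_r')\subset[k-1]\\ (\ell_1\ldots\ell_t)\subset[\ell-1]\\ (\ell_1'\ldots\ell_u')\subset[\ell-1]}} a_{k_1\ldots k_sk}\,a_{k_1'\ldots k_r'k}\,a_{\ell_1\ldots\ell_t\ell}\,a_{\ell_1'\ldots\ell_u'\ell}\,\mathrm{cov}\big(\tilde Z_{k_1\ldots k_s}\tilde Z_{k_1'\ldots k_r'},\ \tilde Z_{\ell_1\ldots\ell_t}\tilde Z_{\ell_1'\ldots\ell_u'}\big)\Bigg\}. 
\end{aligned}
\tag{A.5}
\]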

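Note that $\mathrm{cov}(\tilde Z_{k_1\ldots k_s}\tilde Z_{k_1'\ldots k_r'},\ \tilde Z_{\ell_1\ldots\ell_t}\tilde Z_{\ell_1'\ldots\ell_u'}) \neq 0$ only if

\[
\{(k_1\ldots k_s)\cup(k_1'\ldots k_r')\}\cap\{(\ell_1\ldots\ell_t)\cup(\ell_1'\ldots\ell_u')\}\neq\emptyset.
\]

For the nonzero covariance, we have

\[
\begin{aligned}
\mathrm{cov}(\tilde Z_{k_1\ldots k_s}\tilde Z_{k_1'\ldots k_r'},\ \tilde Z_{\ell_1\ldots\ell_t}\tilde Z_{\ell_1'\ldots\ell_u'})
&\le \big\{\mathrm{var}(\tilde Z_{k_1\ldots k_s}\tilde Z_{k_1'\ldots k_r'})\,\mathrm{var}(\tilde Z_{\ell_1\ldots\ell_t}\tilde Z_{\ell_1'\ldots\ell_u'})\big\}^{1/2}\\
&\le \big\{E(\tilde Z_{k_1\ldots k_s}^2\tilde Z_{k_1'\ldots k_r'}^2)\,E(\tilde Z_{\ell_1\ldots\ell_t}^2\tilde Z_{\ell_1'\ldots\ell_u'}^2)\big\}^{1/2}\\
&\le E(\tilde Z_{k_1\ldots k_s}^4)^{1/4}\,E(\tilde Z_{k_1'\ldots k_r'}^4)^{1/4}\,E(\tilde Z_{\ell_1\ldots\ell_t}^4)^{1/4}\,E(\tilde Z_{\ell_1'\ldots\ell_u'}^4)^{1/4}\\
&\le \nu_4^{s+r+t+u} \le \nu_4^{4(\bar S-1)}.
\end{aligned}
\]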

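Therefore, we can further bound (A.5) as

\[
\begin{aligned}
\mathrm{var}\Big\{\sum_k E(\Delta_{mk}^2\mid\mathcal F_{m,k-1})\Big\}
&\le \frac{\sigma^4\nu_4^{4(\bar S-1)}\bar a_m^4}{v_\Gamma^2}\Bigg\{\sum_{\substack{G_1,G_2,G_3,G_4\subset[\bar S]:\\ G_1\cap G_2\neq\emptyset,\\ G_2\cap G_3\neq\emptyset,\\ G_3\cap G_4\neq\emptyset}} 1\{|a_{G_1}|\neq 0\}1\{|a_{G_2}|\neq 0\}1\{|a_{G_3}|\neq 0\}1\{|a_{G_4}|\neq 0\}\Bigg\}\\
&\le \frac{\sigma^4\nu_4^{4(\bar S-1)}\bar a_m^4(B\bar S^2)}{v_\Gamma^2}\Bigg\{\sum_{\substack{G_1,G_2,G_3\subset[\bar S]:\\ G_1\cap G_2\neq\emptyset,\\ G_2\cap G_3\neq\emptyset}} 1\{|a_{G_1}|\neq 0\}1\{|a_{G_2}|\neq 0\}1\{|a_{G_3}|\neq 0\}\Bigg\}\\
&\le \frac{\sigma^4\nu_4^{4(\bar S-1)}\bar a_m^4(B^2\bar S^4)}{v_\Gamma^2}\Bigg\{\sum_{\substack{G_1,G_2\subset[\bar S]:\\ G_1\cap G_2\neq\emptyset}} 1\{|a_{G_1}|\neq 0\}1\{|a_{G_2}|\neq 0\}\Bigg\}\\
&\le \frac{\sigma^4\nu_4^{4(\bar S-1)}\bar a_m^4(B^4\bar S^6)\,m}{v_\Gamma^2}\\
&= o(1),
\end{aligned}
\]

under the third assumed condition.

Also, we have

\[
E\Big\{\sum_k E(\Delta_{mk}^2\mid\mathcal F_{m,k-1})\Big\} = 1.
\]

Therefore, by Chebyshev’s inequality,

\[
\sum_k E(\Delta_{mk}^2\mid\mathcal F_{m,k-1}) \to 1
\]

in probability.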
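Step 3. Check the Lyapunov condition in equation (A.4). Taking $\delta = 2$ and summing over $k$, we have

\[
\begin{aligned}
\sum_{k=1}^m E(\Delta_{mk}^4)
&= \frac{1}{v_\Gamma^2}\sum_{k=1}^m\Bigg\{\sum_{s,r,t,u}^{\bar S\wedge k}\sum_{\substack{(k_1\ldots k_s)\subset[k-1]\\ (k_1'\ldots k_r')\subset[k-1]\\ (k_1''\ldots k_t'')\subset[k-1]\\ (k_1'''\ldots k_u''')\subset[k-1]}} a_{k_1\ldots k_sk}\,a_{k_1'\ldots k_r'k}\,a_{k_1''\ldots k_t''k}\,a_{k_1'''\ldots k_u'''k}\,E\big(\tilde Z_{k_1\ldots k_s}\tilde Z_{k_1'\ldots k_r'}\tilde Z_{k_1''\ldots k_t''}\tilde Z_{k_1'''\ldots k_u'''}\big)\Bigg\}\\
&\le \frac{\nu_4^{4\bar S}}{v_\Gamma^2}\sum_{k=1}^m\Bigg\{\sum_{s,r,t,u}^{\bar S\wedge k}\sum_{\substack{(k_1\ldots k_s)\subset[k-1]\\ (k_1'\ldots k_r')\subset[k-1]\\ (k_1''\ldots k_t'')\subset[k-1]\\ (k_1'''\ldots k_u''')\subset[k-1]}} |a_{k_1\ldots k_sk}|\,|a_{k_1'\ldots k_r'k}|\,|a_{k_1''\ldots k_t''k}|\,|a_{k_1'''\ldots k_u'''k}|\Bigg\}\\
&\le \frac{\nu_4^{4\bar S}\bar a_m^4}{v_\Gamma^2}\Bigg\{\sum_{\substack{G_1,G_2,G_3,G_4\subset[\bar S]:\\ G_1\cap G_2\cap G_3\cap G_4\neq\emptyset}} 1\{|a_{G_1}|\neq 0\}1\{|a_{G_2}|\neq 0\}1\{|a_{G_3}|\neq 0\}1\{|a_{G_4}|\neq 0\}\Bigg\}\\
&\le \frac{\nu_4^{4\bar S}\bar a_m^4(B\bar S)^4\,m}{v_\Gamma^2} = o(1).
\end{aligned}
\]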

Combining results in Steps 1–3 and Proposition A.1, we prove the results in Theorem A.1.

B Proofs

B.1 Lemmas
We first introduce two lemmas in order to simplify the proofs for the main theorems across the paper.

Lemma B.1. For any two arrays $\{a_i\}_{i=1}^n$ and $\{b_i\}_{i=1}^n$, we have

\[
n^{-2}\sum_{i,j} a_ib_j\,(p^{-|S_i\cap S_j|}-1) \le n^{-1}(p^{-\bar S}-1)(\max_i a_i)(\max_i b_i)\,\bar S\bar D.
\]

Proof of Lemma B.1. $p^{-|S_i\cap S_j|}-1$ is nonzero if and only if $S_i\cap S_j\neq\emptyset$. For each unit $i$, the number of
groups $i$ belongs to is no larger than $\bar S$, and there are at most $\bar D$ units in each group. Therefore, for each $i$,

\[
\Big|\sum_j b_j\,(p^{-|S_i\cap S_j|}-1)\Big| \le (p^{-\bar S}-1)(\max_i b_i)\,\bar S\bar D,
\]

thus

\[
\Big|\sum_{i,j} a_ib_j\,(p^{-|S_i\cap S_j|}-1)\Big| \le n\,(p^{-\bar S}-1)(\max_i a_i)(\max_i b_i)\,\bar S\bar D.
\]

Lemma B.2. Recall that $T_i = 1\{\sum_{k=1}^m W_{ik}(1-Z_k) = 0\}$ and $C_i = 1\{\sum_{k=1}^m W_{ik}Z_k = 0\}$ denote the indicators
that all groups which unit $i$ belongs to were assigned to the treatment group and the control group, respectively,
as introduced in the beginning of Section 3.1. We have

\[
E\Bigg\{n^{-2}\sum_{i,j}\frac{T_iT_j\,a_ib_j\,(p^{-|S_i\cap S_j|}-1)}{p^{|S_i\cup S_j|}}\Bigg\} \le n^{-1}(p^{-\bar S}-1)(\max_i a_i)(\max_i b_i)\,\bar S\bar D, \tag{B.6}
\]

\[
\mathrm{var}\Bigg\{n^{-2}\sum_{i,j}\frac{T_iT_j\,a_ib_j\,(p^{-|S_i\cap S_j|}-1)}{p^{|S_i\cup S_j|}}\Bigg\} \le n^{-3}p^{-4\bar S}(\max_i a_i^2)(\max_i b_i^2)\,\bar S^3\bar D^3(p^{-\bar S}-1)^2. \tag{B.7}
\]

Similar results also hold for the quantities defined by the $C_i$'s.

Proof of Lemma B.2. By the fact that $E(T_iT_j) = p^{|S_i\cup S_j|}$, we have

\[
E\Bigg\{n^{-2}\sum_{i,j}\frac{T_iT_j\,a_ib_j\,(p^{-|S_i\cap S_j|}-1)}{p^{|S_i\cup S_j|}}\Bigg\} = n^{-2}\sum_{i,j} a_ib_j\,(p^{-|S_i\cap S_j|}-1).
\]

Equation (B.6) holds by Lemma B.1.

Next, we prove equation (B.7). Recall that we denote $(\Lambda_1)_{i,j} = p^{-|S_i\cap S_j|}-1$. We have

\[
\begin{aligned}
\mathrm{var}\Bigg\{n^{-2}\sum_{i,j}\frac{T_iT_j\,a_ib_j\,(p^{-|S_i\cap S_j|}-1)}{p^{|S_i\cup S_j|}}\Bigg\}
&= n^{-4}\sum_{i,j,u,v}\frac{\mathrm{cov}(T_iT_j,T_uT_v)\,a_ib_ja_ub_v\,(\Lambda_1)_{i,j}(\Lambda_1)_{u,v}}{p^{|S_i\cup S_j|+|S_u\cup S_v|}}\\
&\le n^{-4}p^{-4\bar S}\sum_{i,j,u,v}\big|\mathrm{cov}(T_iT_j,T_uT_v)\,a_ib_ja_ub_v\,(\Lambda_1)_{i,j}(\Lambda_1)_{u,v}\big|.
\end{aligned}
\tag{B.8}
\]

If $(S_i\cup S_j)\cap(S_u\cup S_v) = \emptyset$, then $T_iT_j$ and $T_uT_v$ are independent, thus $\mathrm{cov}(T_iT_j,T_uT_v) = 0$. Therefore,
$\mathrm{cov}(T_iT_j,T_uT_v)(\Lambda_1)_{i,j}(\Lambda_1)_{u,v}$ is nonzero if and only if $S_i\cap S_j\neq\emptyset$, $S_u\cap S_v\neq\emptyset$, and $(S_i\cup S_j)\cap(S_u\cup S_v)\neq\emptyset$.
Without loss of generality, assume that $(S_i\cup S_j)\cap S_u\neq\emptyset$. We have

\[
\begin{aligned}
&\sum_{i,j,u,v}\big|\mathrm{cov}(T_iT_j,T_uT_v)\,a_ib_ja_ub_v\,(\Lambda_1)_{i,j}(\Lambda_1)_{u,v}\big|\\
&\le \sum_{i,j}|a_ib_j|(\Lambda_1)_{i,j}\sum_{u,v}\big|\mathrm{cov}(T_iT_j,T_uT_v)\,a_ub_v\,(\Lambda_1)_{u,v}\big|\\
&\le \sum_{i,j}|a_ib_j|(\Lambda_1)_{i,j}\sum_u|a_u|\sum_v\big|\mathrm{cov}(T_iT_j,T_uT_v)\,b_v\,(\Lambda_1)_{u,v}\big|\\
&\le \sum_{i,j}|a_ib_j|(\Lambda_1)_{i,j}\sum_u|a_u|\,(p^{-\bar S}-1)(\max_v b_v)\,\bar S\bar D\,1\{(S_i\cup S_j)\cap S_u\neq\emptyset\}\\
&\le \sum_{i,j}|a_ib_j|(\Lambda_1)_{i,j}\,(p^{-\bar S}-1)(\max_u a_u)(\max_v b_v)\,\bar S^2\bar D^2\\
&\le n\,(p^{-\bar S}-1)^2(\max_u a_u^2)(\max_v b_v^2)\,\bar S^3\bar D^3,
\end{aligned}
\]

where the third inequality follows from $(S_i\cup S_j)\cap S_u\neq\emptyset$ and an argument similar to the proof of Lemma B.1,
namely that for each $u$, $|\sum_v b_v(\Lambda_1)_{u,v}| \le (p^{-\bar S}-1)(\max_v b_v)\bar S\bar D$; the fourth inequality follows from the fact that $u$
has to be connected to either $i$ or $j$, so the total number of $u$ such that $1\{(S_i\cup S_j)\cap S_u\neq\emptyset\}$ is nonzero is
no larger than $\bar S\bar D$; and the last inequality follows from Lemma B.1. Plugging this back into equation (B.8) gives
the second inequality in Lemma B.2.
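
The identity $E(T_iT_j) = p^{|S_i\cup S_j|}$ that drives this proof can be checked by brute force on a toy example. The sketch below enumerates all Bernoulli assignments; the function name and the example group sets are hypothetical.

```python
from itertools import product
from math import prod


def expect_TiTj(S_i, S_j, m, p):
    """E(T_i T_j) under Bernoulli(p) assignment, by enumerating {0, 1}^m."""
    total = 0.0
    for z in product((0, 1), repeat=m):
        weight = prod(p if zk else 1 - p for zk in z)
        T_i = all(z[k] == 1 for k in S_i)  # every group of unit i is treated
        T_j = all(z[k] == 1 for k in S_j)
        total += weight * (T_i and T_j)
    return total


S_i, S_j, p = {0, 1}, {1, 2}, 0.5
gap = abs(expect_TiTj(S_i, S_j, m=4, p=p) - p ** len(S_i | S_j))
```

Here $|S_i\cup S_j| = 3$, so both sides equal $0.5^3 = 0.125$.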

B.2 Proof of Theorem 3.1


We first show that $\hat\mu_1$ is consistent for $\mu_1$ by showing that the numerator of $\hat\mu_1-\mu_1$ converges in probability to 0 and
the denominator converges in probability to 1. The numerator of $\hat\mu_1-\mu_1$ has mean zero and variance equal
to

\[
\begin{aligned}
\mathrm{var}\Bigg\{n^{-1}\sum_{i=1}^n\frac{T_i(Y_i-\mu_1)}{p^{|S_i|}}\Bigg\}
&= E\Bigg[\Bigg\{n^{-1}\sum_{i=1}^n\frac{\{Y_i(1)-\mu_1\}\,1\{\sum_{k=1}^m W_{ik}(1-Z_k)=0\}}{p^{|S_i|}}\Bigg\}^2\Bigg]\\
&= n^{-2}\sum_{i,j}\frac{1}{p^{|S_i|+|S_j|}}E\Bigg[\{Y_i(1)-\mu_1\}\{Y_j(1)-\mu_1\}\,1\Big\{\sum_{k=1}^m W_{ik}(1-Z_k)=0\Big\}1\Big\{\sum_{k=1}^m W_{jk}(1-Z_k)=0\Big\}\Bigg]\\
&= n^{-2}\sum_{i,j}\frac{\{Y_i(1)-\mu_1\}\{Y_j(1)-\mu_1\}\,p^{|S_i\cup S_j|}}{p^{|S_i|+|S_j|}}\\
&= n^{-2}\sum_{i,j}\{Y_i(1)-\mu_1\}\{Y_j(1)-\mu_1\}\,p^{-|S_i\cap S_j|}\\
&= n^{-2}\sum_{i,j}\{Y_i(1)-\mu_1\}\{Y_j(1)-\mu_1\}\,(p^{-|S_i\cap S_j|}-1)\\
&\le n^{-1}(p^{-\bar S}-1)(\max_i a_i)(\max_i b_i)\,\bar S\bar D,
\end{aligned}
\]

where the second-to-last equality follows from the fact that $\sum_{i,j}\{Y_i(1)-\mu_1\}\{Y_j(1)-\mu_1\} = 0$, and the
last inequality follows from Lemma B.1 with $a_i = b_i = Y_i(1)-\mu_1$. Therefore, the numerator of $\hat\mu_1-\mu_1$ converges in probability to 0
by Chebyshev’s inequality. Similarly, the denominator of $\hat\mu_1-\mu_1$ has mean 1 and variance converging
to 0. This concludes the proof that $\hat\mu_1$ converges in probability to $\mu_1$. Analogously, $\hat\mu_0$ converges
in probability to $\mu_0$, which concludes the proof of Theorem 3.1.
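
The numerator and denominator used in this proof correspond to a ratio (Hájek-type) form of $\hat\mu_1$ and $\hat\mu_0$. A minimal sketch of the resulting point estimator is given below, assuming that form; the function name, the data layout (`S[i]` is the list of groups of unit `i`), and the toy inputs are hypothetical.

```python
def hajek_total_effect(S, Z, Y, p):
    """Ratio-form estimate of the total treatment effect.

    S[i]: groups connected to unit i; Z[k]: 0/1 assignment of group k;
    Y[i]: observed outcome of unit i; p: Bernoulli treatment probability.
    """
    num1 = den1 = num0 = den0 = 0.0
    for groups, y in zip(S, Y):
        d = len(groups)
        if all(Z[k] == 1 for k in groups):   # T_i = 1: all groups treated
            w = p ** (-d)
            num1 += w * y
            den1 += w
        if all(Z[k] == 0 for k in groups):   # C_i = 1: all groups in control
            w = (1 - p) ** (-d)
            num0 += w * y
            den0 += w
    if den1 == 0 or den0 == 0:
        raise ValueError("no fully treated or no fully controlled units")
    return num1 / den1 - num0 / den0


tau_hat = hajek_total_effect(
    S=[[0], [0, 1], [2], [2, 3]], Z=[1, 1, 0, 0], Y=[3.0, 5.0, 1.0, 2.0], p=0.5
)
```

Units whose groups received mixed assignments have $T_i = C_i = 0$ and do not contribute to either arm.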

B.3 Proof of Theorem 3.2


We first compute the asymptotic variance of the proposed estimator $\hat\mu_1$. The denominator of $\hat\mu_1$ converges
in probability to 1. By Slutsky’s theorem, we have

\[
\begin{aligned}
\mathrm{avar}(\hat\mu_1)
&= \mathrm{var}\Bigg[n^{-1}\sum_{i=1}^n\frac{\{Y_i(1)-\mu_1\}\,1\{\sum_{k=1}^m W_{ik}(1-Z_k)=0\}}{p^{|S_i|}}\Bigg]\\
&= E\Bigg[\Bigg\{n^{-1}\sum_{i=1}^n\frac{\{Y_i(1)-\mu_1\}\,1\{\sum_{k=1}^m W_{ik}(1-Z_k)=0\}}{p^{|S_i|}}\Bigg\}^2\Bigg]\\
&= n^{-2}\sum_{i,j}\frac{1}{p^{|S_i|+|S_j|}}E\Bigg[\{Y_i(1)-\mu_1\}\{Y_j(1)-\mu_1\}\,1\Big\{\sum_{k=1}^m W_{ik}(1-Z_k)=0\Big\}1\Big\{\sum_{k=1}^m W_{jk}(1-Z_k)=0\Big\}\Bigg]\\
&= n^{-2}\sum_{i,j}\frac{\{Y_i(1)-\mu_1\}\{Y_j(1)-\mu_1\}\,p^{|S_i\cup S_j|}}{p^{|S_i|+|S_j|}}\\
&= n^{-2}\sum_{i,j}\{Y_i(1)-\mu_1\}\{Y_j(1)-\mu_1\}\,p^{-|S_i\cap S_j|}\\
&= n^{-2}\sum_{i,j}\{Y_i(1)-\mu_1\}\{Y_j(1)-\mu_1\}\,(\Lambda_1)_{i,j}.
\end{aligned}
\]

By symmetry, we have

\[
\mathrm{avar}(\hat\mu_0) = n^{-2}\sum_{i,j}\{Y_i(0)-\mu_0\}\{Y_j(0)-\mu_0\}\,(\Lambda_0)_{i,j}.
\]

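Next, we compute the asymptotic covariance between $\hat\mu_1$ and $\hat\mu_0$. Since both numerators have mean zero,

\[
\begin{aligned}
\mathrm{acov}(\hat\mu_1,\hat\mu_0)
&= n^{-2}E\Bigg[\sum_{i=1}^n\frac{\{Y_i(1)-\mu_1\}\,1\{\sum_{k=1}^m W_{ik}(1-Z_k)=0\}}{p^{|S_i|}}\cdot\sum_{i=1}^n\frac{\{Y_i(0)-\mu_0\}\,1\{\sum_{k=1}^m W_{ik}Z_k=0\}}{(1-p)^{|S_i|}}\Bigg]\\
&= n^{-2}\sum_{i,j}E\Bigg[\frac{\{Y_i(1)-\mu_1\}\{Y_j(0)-\mu_0\}\,1\{\sum_{k=1}^m W_{ik}(1-Z_k)=0\}\,1\{\sum_{k=1}^m W_{jk}Z_k=0\}}{p^{|S_i|}(1-p)^{|S_j|}}\Bigg]\\
&= n^{-2}\sum_{i,j}\{Y_i(1)-\mu_1\}\{Y_j(0)-\mu_0\}\,1\{S_i\cap S_j=\emptyset\}\\
&= -n^{-2}\sum_{i,j}\{Y_i(1)-\mu_1\}\{Y_j(0)-\mu_0\}\,1\{S_i\cap S_j\neq\emptyset\},
\end{aligned}
\]

where the last equality uses $\sum_{i,j}\{Y_i(1)-\mu_1\}\{Y_j(0)-\mu_0\} = 0$.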

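Combining the results, we have

\[
\begin{aligned}
\mathrm{avar}(\hat\tau)
&= \mathrm{avar}(\hat\mu_1) + \mathrm{avar}(\hat\mu_0) - 2\,\mathrm{acov}(\hat\mu_1,\hat\mu_0)\\
&= n^{-2}\sum_{i,j}\{Y_i(1)-\mu_1\}\{Y_j(1)-\mu_1\}(\Lambda_1)_{i,j} + n^{-2}\sum_{i,j}\{Y_i(0)-\mu_0\}\{Y_j(0)-\mu_0\}(\Lambda_0)_{i,j}\\
&\quad + 2n^{-2}\sum_{i,j}\{Y_i(1)-\mu_1\}\{Y_j(0)-\mu_0\}(\Lambda_\tau)_{i,j}\\
&= n^{-2}\big\{\tilde Y(1)^t\Lambda_1\tilde Y(1) + \tilde Y(0)^t\Lambda_0\tilde Y(0) + 2\tilde Y(1)^t\Lambda_\tau\tilde Y(0)\big\}.
\end{aligned}
\]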
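
The closed form can be compared with a brute-force variance on a toy graph. The sketch below makes two assumptions that are inferred from the derivation rather than restated here: $(\Lambda_0)_{i,j} = (1-p)^{-|S_i\cap S_j|}-1$ by symmetry with $\Lambda_1$, and $(\Lambda_\tau)_{i,j} = 1\{S_i\cap S_j\neq\emptyset\}$. The exact variance is that of the difference of the two centered numerators, which is the quantity the asymptotic variance describes. All names and inputs are hypothetical.

```python
from itertools import product
from math import prod


def avar_formula(S, y1, y0, p):
    """n^{-2} {Y1' L1 Y1 + Y0' L0 Y0 + 2 Y1' Ltau Y0} with centered outcomes."""
    n = len(S)
    mu1, mu0 = sum(y1) / n, sum(y0) / n
    total = 0.0
    for i in range(n):
        for j in range(n):
            c = len(set(S[i]) & set(S[j]))
            if c == 0:
                continue  # all three Lambda entries vanish for disjoint units
            total += (y1[i] - mu1) * (y1[j] - mu1) * (p ** (-c) - 1)
            total += (y0[i] - mu0) * (y0[j] - mu0) * ((1 - p) ** (-c) - 1)
            total += 2 * (y1[i] - mu1) * (y0[j] - mu0)
    return total / n ** 2


def exact_variance(S, y1, y0, m, p):
    """Exact variance of the difference of centered numerators over {0, 1}^m."""
    n = len(S)
    mu1, mu0 = sum(y1) / n, sum(y0) / n
    mean = second = 0.0
    for z in product((0, 1), repeat=m):
        weight = prod(p if zk else 1 - p for zk in z)
        d = 0.0
        for i in range(n):
            if all(z[k] == 1 for k in S[i]):
                d += (y1[i] - mu1) / p ** len(S[i])
            if all(z[k] == 0 for k in S[i]):
                d -= (y0[i] - mu0) / (1 - p) ** len(S[i])
        d /= n
        mean += weight * d
        second += weight * d * d
    return second - mean ** 2


S = [[0], [0, 1], [1, 2], [2]]
y1, y0 = [1.0, 2.0, 4.0, 3.0], [0.0, 1.0, 1.0, 2.0]
gap = abs(avar_formula(S, y1, y0, 0.4) - exact_variance(S, y1, y0, m=3, p=0.4))
```

Agreement up to floating-point error on such examples is a quick sanity check of the signs of the three terms, in particular the positive sign of the cross term.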

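We then apply Theorem A.1 following two steps.

Step 1. We first give an alternative representation of the numerator of $\hat\mu_1-\mu_1$, which is equal to

\[
\begin{aligned}
&n^{-1}\sum_{i=1}^n\frac{1\{\sum_{k=1}^m W_{ik}(1-Z_k)=0\}\{Y_i(1)-\mu_1\}}{p^{|S_i|}}\\
&= \sum_{k_1}\frac{Z_{k_1}}{np}\sum_{i=1}^n 1\{|S_i|=1\}W_{ik_1}\{Y_i(1)-\mu_1\}\\
&\quad + \sum_{k_1<k_2}\frac{Z_{k_1}Z_{k_2}}{np^2}\sum_{i=1}^n 1\{|S_i|=2\}W_{ik_1}W_{ik_2}\{Y_i(1)-\mu_1\}\\
&\quad + \cdots\\
&\quad + \sum_{k_1<\cdots<k_{\bar S}}\frac{Z_{k_1}\cdots Z_{k_{\bar S}}}{np^{\bar S}}\sum_{i=1}^n 1\{|S_i|=\bar S\}W_{ik_1}\cdots W_{ik_{\bar S}}\{Y_i(1)-\mu_1\}\\
&= \sum_{k_1}\frac{\tilde Z_{k_1}+p}{np}\sum_{i=1}^n 1\{|S_i|=1\}W_{ik_1}\{Y_i(1)-\mu_1\}\\
&\quad + \sum_{k_1<k_2}\frac{(\tilde Z_{k_1}+p)(\tilde Z_{k_2}+p)}{np^2}\sum_{i=1}^n 1\{|S_i|=2\}W_{ik_1}W_{ik_2}\{Y_i(1)-\mu_1\}\\
&\quad + \cdots\\
&\quad + \sum_{k_1<\cdots<k_{\bar S}}\frac{(\tilde Z_{k_1}+p)\cdots(\tilde Z_{k_{\bar S}}+p)}{np^{\bar S}}\sum_{i=1}^n 1\{|S_i|=\bar S\}W_{ik_1}\cdots W_{ik_{\bar S}}\{Y_i(1)-\mu_1\}.
\end{aligned}
\tag{B.10}
\]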
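By binomial expansion, for any $s\in[\bar S]$, we have

\[
\begin{aligned}
&\sum_{k_1<\cdots<k_s}\frac{(\tilde Z_{k_1}+p)\cdots(\tilde Z_{k_s}+p)}{np^s}\sum_{i=1}^n 1\{|S_i|=s\}W_{ik_1}\cdots W_{ik_s}\{Y_i(1)-\mu_1\}\\
&= \frac{1}{s!}\sum_{k_1\neq\cdots\neq k_s}\frac{(\tilde Z_{k_1}+p)\cdots(\tilde Z_{k_s}+p)}{np^s}\sum_{i=1}^n 1\{|S_i|=s\}W_{ik_1}\cdots W_{ik_s}\{Y_i(1)-\mu_1\}\\
&= \frac{1}{s!}\binom{s}{1}\sum_{k_1}\frac{\tilde Z_{k_1}}{np}\sum_{i=1}^n 1\{|S_i|=s\}W_{ik_1}\{Y_i(1)-\mu_1\}\sum_{\substack{k_2\neq\cdots\neq k_s,\\ k_u\neq k_1,\ \forall 1<u\le s}}W_{ik_2}\cdots W_{ik_s}\\
&\quad + \cdots\\
&\quad + \frac{1}{s!}\binom{s}{\ell}\sum_{k_1\neq\cdots\neq k_\ell}\frac{\tilde Z_{k_1}\cdots\tilde Z_{k_\ell}}{np^\ell}\sum_{i=1}^n 1\{|S_i|=s\}W_{ik_1}\cdots W_{ik_\ell}\{Y_i(1)-\mu_1\}\sum_{\substack{k_{\ell+1}\neq\cdots\neq k_s,\\ k_u\neq k_1,\ldots,k_\ell,\ \forall \ell<u\le s}}W_{ik_{\ell+1}}\cdots W_{ik_s}\\
&\quad + \cdots\\
&\quad + \frac{1}{s!}\binom{s}{s-1}\sum_{k_1\neq\cdots\neq k_{s-1}}\frac{\tilde Z_{k_1}\cdots\tilde Z_{k_{s-1}}}{np^{s-1}}\sum_{i=1}^n 1\{|S_i|=s\}W_{ik_1}\cdots W_{ik_{s-1}}\{Y_i(1)-\mu_1\}\sum_{\substack{k_s,\\ k_s\neq k_1,\ldots,k_{s-1}}}W_{ik_s}\\
&\quad + \frac{1}{s!}\sum_{k_1\neq\cdots\neq k_s}\frac{\tilde Z_{k_1}\cdots\tilde Z_{k_s}}{np^s}\sum_{i=1}^n 1\{|S_i|=s\}W_{ik_1}\cdots W_{ik_s}\{Y_i(1)-\mu_1\}.
\end{aligned}
\tag{B.11}
\]

For the ℓ-th term, we have

\[
\begin{aligned}
&\frac{1}{s!}\binom{s}{\ell}\sum_{k_1\neq\cdots\neq k_\ell}\frac{\tilde Z_{k_1}\cdots\tilde Z_{k_\ell}}{np^\ell}\sum_{i=1}^n 1\{|S_i|=s\}W_{ik_1}\cdots W_{ik_\ell}\{Y_i(1)-\mu_1\}\sum_{\substack{k_{\ell+1}\neq\cdots\neq k_s,\\ k_u\neq k_1,\ldots,k_\ell,\ \forall \ell<u\le s}}W_{ik_{\ell+1}}\cdots W_{ik_s}\\
&= \frac{1}{s!}\binom{s}{\ell}\sum_{k_1\neq\cdots\neq k_\ell}\frac{\tilde Z_{k_1}\cdots\tilde Z_{k_\ell}}{np^\ell}\sum_{i=1}^n 1\{|S_i|=s\}W_{ik_1}\cdots W_{ik_\ell}\{Y_i(1)-\mu_1\}\,(s-\ell)!\\
&= \frac{1}{s!}\binom{s}{\ell}(s-\ell)!\,\ell!\sum_{k_1<\cdots<k_\ell}\frac{\tilde Z_{k_1}\cdots\tilde Z_{k_\ell}}{np^\ell}\sum_{i=1}^n 1\{|S_i|=s\}W_{ik_1}\cdots W_{ik_\ell}\{Y_i(1)-\mu_1\}\\
&= \sum_{k_1<\cdots<k_\ell}\frac{\tilde Z_{k_1}\cdots\tilde Z_{k_\ell}}{np^\ell}\sum_{i=1}^n 1\{|S_i|=s\}W_{ik_1}\cdots W_{ik_\ell}\{Y_i(1)-\mu_1\}.
\end{aligned}
\]

Plugging back to equation (B.11), we have

\[
\begin{aligned}
&\sum_{k_1<\cdots<k_s}\frac{(\tilde Z_{k_1}+p)\cdots(\tilde Z_{k_s}+p)}{np^s}\sum_{i=1}^n 1\{|S_i|=s\}W_{ik_1}\cdots W_{ik_s}\{Y_i(1)-\mu_1\}\\
&= \sum_{\ell=1}^{s}\sum_{k_1<\cdots<k_\ell}\frac{\tilde Z_{k_1}\cdots\tilde Z_{k_\ell}}{np^\ell}\sum_{i=1}^n 1\{|S_i|=s\}W_{ik_1}\cdots W_{ik_\ell}\{Y_i(1)-\mu_1\},
\end{aligned}
\]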

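thus, the summation in (B.10) is equal to

\[
\begin{aligned}
&\sum_{k_1}\frac{\tilde Z_{k_1}}{np}\sum_{i=1}^n W_{ik_1}\{Y_i(1)-\mu_1\}\sum_{s=1}^{\bar S}1\{|S_i|=s\}\\
&+ \sum_{k_1<k_2}\frac{\tilde Z_{k_1}\tilde Z_{k_2}}{np^2}\sum_{i=1}^n W_{ik_1}W_{ik_2}\{Y_i(1)-\mu_1\}\sum_{s=2}^{\bar S}1\{|S_i|=s\}\\
&+ \cdots\\
&+ \sum_{k_1<\cdots<k_{\bar S}}\frac{\tilde Z_{k_1}\cdots\tilde Z_{k_{\bar S}}}{np^{\bar S}}\sum_{i=1}^n W_{ik_1}\cdots W_{ik_{\bar S}}\{Y_i(1)-\mu_1\}\sum_{s=\bar S}^{\bar S}1\{|S_i|=s\}.
\end{aligned}
\]

By symmetry, the numerator of $\hat\mu_0-\mu_0$ equals

\[
\begin{aligned}
&n^{-1}\sum_{i=1}^n\frac{1\{\sum_{k=1}^m W_{ik}Z_k=0\}\{Y_i(0)-\mu_0\}}{(1-p)^{|S_i|}}\\
&= -\sum_{k_1}\frac{\tilde Z_{k_1}}{n(1-p)}\sum_{i=1}^n W_{ik_1}\{Y_i(0)-\mu_0\}\sum_{s=1}^{\bar S}1\{|S_i|=s\}\\
&\quad + (-1)^2\sum_{k_1<k_2}\frac{\tilde Z_{k_1}\tilde Z_{k_2}}{n(1-p)^2}\sum_{i=1}^n W_{ik_1}W_{ik_2}\{Y_i(0)-\mu_0\}\sum_{s=2}^{\bar S}1\{|S_i|=s\}\\
&\quad + \cdots\\
&\quad + (-1)^{\bar S}\sum_{k_1<\cdots<k_{\bar S}}\frac{\tilde Z_{k_1}\cdots\tilde Z_{k_{\bar S}}}{n(1-p)^{\bar S}}\sum_{i=1}^n W_{ik_1}\cdots W_{ik_{\bar S}}\{Y_i(0)-\mu_0\}\sum_{s=\bar S}^{\bar S}1\{|S_i|=s\}.
\end{aligned}
\]

Define

\[
\begin{aligned}
a_{1,k_1\cdots k_\ell} &= \frac{1}{np^\ell}\sum_{i=1}^n W_{ik_1}\cdots W_{ik_\ell}\{Y_i(1)-\mu_1\}\sum_{s=\ell}^{\bar S}1\{|S_i|=s\},\\
a_{0,k_1\cdots k_\ell} &= \frac{(-1)^\ell}{n(1-p)^\ell}\sum_{i=1}^n W_{ik_1}\cdots W_{ik_\ell}\{Y_i(0)-\mu_0\}\sum_{s=\ell}^{\bar S}1\{|S_i|=s\}.
\end{aligned}
\]

To summarize, we have shown that the numerator of $\hat\mu_z-\mu_z$ is equal to

\[
\sum_{\ell=1}^{\bar S}\sum_{k_1<\cdots<k_\ell} a_{z,k_1\cdots k_\ell}\,\tilde Z_{k_1}\cdots\tilde Z_{k_\ell}
\]

for z = 1, 0.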
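
The representation in Step 1 rests on the unit-level identity $T_i/p^{|S_i|} = \prod_{k\in S_i}(\tilde Z_k+p)/p = \sum_{G\subseteq S_i}\prod_{k\in G}\tilde Z_k/p^{|G|}$, where the empty set contributes 1 (this constant term drops out after summing over $i$ because the outcomes are centered). A brute-force check of the identity is sketched below; the function names and the example group set are hypothetical.

```python
from itertools import combinations, product
from math import prod


def weight_direct(S_i, z, p):
    """T_i / p^{|S_i|}, with T_i = 1 when every group in S_i is treated."""
    return float(all(z[k] == 1 for k in S_i)) / p ** len(S_i)


def weight_expanded(S_i, z, p):
    """Sum over subsets G of S_i of prod_{k in G} (z_k - p) / p^{|G|}."""
    total = 0.0
    for size in range(len(S_i) + 1):
        for G in combinations(S_i, size):
            total += prod(z[k] - p for k in G) / p ** size  # empty G gives 1
    return total


S_i, p = (0, 2), 0.3
max_gap = max(
    abs(weight_direct(S_i, z, p) - weight_expanded(S_i, z, p))
    for z in product((0, 1), repeat=3)
)
```

The analogous identity for $C_i/(1-p)^{|S_i|}$ replaces $\tilde Z_k/p$ by $-\tilde Z_k/(1-p)$, which is where the factors $(-1)^\ell$ in $a_{0,k_1\cdots k_\ell}$ come from.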
Step 2. We now consider any linear combination of the numerators of $\hat\mu_1-\mu_1$ and $\hat\mu_0-\mu_0$. We show it can
be reformulated in the form of Γ defined in equation (A.1). Consider any $c_1$ and $c_0$ that satisfy $c_1^2+c_0^2 = 1$.
Define

\[
a_{k_1\cdots k_\ell} = c_1a_{1,k_1\cdots k_\ell} + c_0a_{0,k_1\cdots k_\ell}.
\]

Then we can write

\[
n^{-1}\sum_{i=1}^n\Bigg[\frac{c_1T_i\{Y_i(1)-\mu_1\}}{p^{|S_i|}} + \frac{c_0C_i\{Y_i(0)-\mu_0\}}{(1-p)^{|S_i|}}\Bigg] = \sum_{\ell=1}^{\bar S}\sum_{k_1<\cdots<k_\ell} a_{k_1\cdots k_\ell}\,\tilde Z_{k_1}\cdots\tilde Z_{k_\ell}. \tag{B.12}
\]

We will apply Theorem A.1 to establish a central limit theorem for (B.12). We check the two conditions
required in Theorem A.1.
We first show the boundedness of the $a$'s. Note that

\[
a_{k_1\cdots k_\ell} = \sum_{i=1}^n W_{ik_1}\cdots W_{ik_\ell}\Bigg[\frac{c_1\{Y_i(1)-\mu_1\}}{np^\ell} + \frac{(-1)^\ell c_0\{Y_i(0)-\mu_0\}}{n(1-p)^\ell}\Bigg]\sum_{s=\ell}^{\bar S}1\{|S_i|=s\}.
\]

The summand indexed by $i$ is nonzero only if unit $i$ belongs to groups $k_1,\ldots,k_\ell$. By Assumption 3, for each
$k_1,\ldots,k_\ell$, we have at most $\bar D$ such units. Hence we obtain

\[
|a_{k_1\cdots k_\ell}| \le \frac{\bar D\max_i\{|Y_i(1)-\mu_1|,\ |Y_i(0)-\mu_0|\}}{n}\big\{p^{-\bar S}+(1-p)^{-\bar S}\big\} := \bar a_m.
\]

Second, we verify the limited overlapping condition $\sum_{(k_1\cdots k_s)\subset[m]\setminus\{k\}}1\{|a_{kk_1\ldots k_s}|\neq 0\}\le B$. For any
$(k_1\cdots k_s)\subset[m]$ and $k$, we have $1\{|a_{kk_1\ldots k_s}|\neq 0\} = 1\{\exists\, i \text{ such that } W_{ik_1}\cdots W_{ik_s}W_{ik} = 1\}$, which is
nonzero if and only if $k_1,\ldots,k_s$ are all connected to group $k$. Therefore,

\[
\sum_{(k_1\cdots k_s)\subset[m]\setminus\{k\}}1\{|a_{kk_1\ldots k_s}|\neq 0\} \le \sum_{(k_1\cdots k_s)\subset[m]\setminus\{k\}}1\{k_1,\ldots,k_s \text{ are all connected to group } k\} \le B^s,
\]

where the last inequality holds because by Assumption 5, there are at most $B$ groups connected to group $k$,
thus the number of combinations $(k_1,\ldots,k_s)$ such that all of them are connected to $k$ is upper bounded by
$\binom{B}{s}\le B^s$.
Therefore, Theorem A.1 applies to every linear combination considered in Step 2, and by the Cramér-Wold device we
conclude that the numerators of $\hat\mu_1-\mu_1$ and $\hat\mu_0-\mu_0$ converge jointly to a
bivariate standard normal distribution, after standardization via

\[
\begin{pmatrix}
n^{-2}\tilde Y(1)^t\Lambda_1\tilde Y(1) & n^{-2}\tilde Y(1)^t\Lambda_\tau\tilde Y(0)\\
n^{-2}\tilde Y(1)^t\Lambda_\tau\tilde Y(0) & n^{-2}\tilde Y(0)^t\Lambda_0\tilde Y(0)
\end{pmatrix}.
\]

Moreover, the denominators of $\hat\mu_1$ and $\hat\mu_0$ converge in probability to 1, thus the asymptotic
distribution in Theorem 3.2 holds by Slutsky’s Theorem.

B.4 Proof of Theorem 3.3


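We first prove the convergence of $\hat v/\mathrm{plim}(\hat v)$. Denote

\[
\hat v_1 = n^{-2}\sum_{i,j}\frac{T_iT_j(Y_i-\hat\mu_1)(Y_j-\hat\mu_1)(\Lambda_1)_{i,j}}{p^{|S_i\cup S_j|}},\qquad
\hat v_0 = n^{-2}\sum_{i,j}\frac{C_iC_j(Y_i-\hat\mu_0)(Y_j-\hat\mu_0)(\Lambda_0)_{i,j}}{(1-p)^{|S_i\cup S_j|}},
\]

then we have $\hat v = (\hat v_1^{1/2}+\hat v_0^{1/2})^2$. We prove the convergence of $\hat v/\mathrm{plim}(\hat v)$ by showing that $\hat v_1/\mathrm{plim}(\hat v_1) =
1+o_p(1)$ and $\hat v_0/\mathrm{plim}(\hat v_0) = 1+o_p(1)$, where

\[
\mathrm{plim}(\hat v_1) = \mathrm{avar}(\hat\mu_1) = n^{-2}\tilde Y(1)^t\Lambda_1\tilde Y(1),\qquad
\mathrm{plim}(\hat v_0) = \mathrm{avar}(\hat\mu_0) = n^{-2}\tilde Y(0)^t\Lambda_0\tilde Y(0).
\]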
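Rewrite

\[
\begin{aligned}
\hat v_1 &= n^{-2}\sum_{i,j}\frac{T_iT_j\{Y_i-\mu_1+(\mu_1-\hat\mu_1)\}\{Y_j-\mu_1+(\mu_1-\hat\mu_1)\}(p^{-|S_i\cap S_j|}-1)}{p^{|S_i\cup S_j|}}\\
&= n^{-2}\sum_{i,j}\frac{T_iT_j(Y_i-\mu_1)(Y_j-\mu_1)(p^{-|S_i\cap S_j|}-1)}{p^{|S_i\cup S_j|}} && \text{(B.13)}\\
&\quad + 2(\mu_1-\hat\mu_1)\,n^{-2}\sum_{i,j}\frac{T_iT_j(Y_i-\mu_1)(p^{-|S_i\cap S_j|}-1)}{p^{|S_i\cup S_j|}} && \text{(B.14)}\\
&\quad + (\mu_1-\hat\mu_1)^2\,n^{-2}\sum_{i,j}\frac{T_iT_j(p^{-|S_i\cap S_j|}-1)}{p^{|S_i\cup S_j|}}, && \text{(B.15)}
\end{aligned}
\]

and use $T_1, T_2, T_3$ to denote the three terms in (B.13)–(B.15), respectively. By the fact that $E(T_iT_j) =
p^{|S_i\cup S_j|}$, we have $E(T_1) = \mathrm{plim}(\hat v_1)$. The variance of $T_1$ satisfies

\[
\mathrm{var}(T_1) \le n^{-3}p^{-4\bar S}\big[\max_i\{Y_i(1)-\mu_1\}^4\big]\bar S^3\bar D^3(p^{-\bar S}-1)^2 = O_p(n^{-3}\bar D^3)
\]

by Lemma B.2 when taking $a_i = b_i = Y_i(1)-\mu_1$. Thus,

\[
T_1 = E(T_1) + O_p\{\mathrm{var}(T_1)^{1/2}\} = \mathrm{plim}(\hat v_1) + O_p(n^{-3/2}\bar D^{3/2}) = \mathrm{plim}(\hat v_1) + o_p(n^{-1}\bar D).
\]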

Similarly, by Lemma B.2, we have

\[
E\Bigg\{n^{-2}\sum_{i,j}\frac{T_iT_j(Y_i-\mu_1)(p^{-|S_i\cap S_j|}-1)}{p^{|S_i\cup S_j|}}\Bigg\} = O_p(n^{-1}\bar D),\qquad
\mathrm{var}\Bigg\{n^{-2}\sum_{i,j}\frac{T_iT_j(Y_i-\mu_1)(p^{-|S_i\cap S_j|}-1)}{p^{|S_i\cup S_j|}}\Bigg\} = O_p(n^{-3}\bar D^3),
\]

by taking $a_i = Y_i(1)-\mu_1$ and $b_i = 1$. By the proof of Theorem 3.2, we have $\hat\mu_1-\mu_1 = O_p(n^{-1/2}\bar D^{1/2})$, and
$T_2 = E(T_2) + O_p\{\mathrm{var}(T_2)^{1/2}\}$ gives us

\[
T_2 = O_p(n^{-1/2}\bar D^{1/2})\cdot O_p(n^{-1}\bar D) + O_p[\{n^{-1}\bar D\cdot n^{-3}\bar D^3\}^{1/2}] = o_p(n^{-1}\bar D).
\]

Also, we have

\[
E\Bigg\{n^{-2}\sum_{i,j}\frac{T_iT_j(p^{-|S_i\cap S_j|}-1)}{p^{|S_i\cup S_j|}}\Bigg\} = O_p(n^{-1}\bar D),\qquad
\mathrm{var}\Bigg\{n^{-2}\sum_{i,j}\frac{T_iT_j(p^{-|S_i\cap S_j|}-1)}{p^{|S_i\cup S_j|}}\Bigg\} = O_p(n^{-3}\bar D^3)
\]

by taking $(a_i, b_i) = (1, 1)$ in Lemma B.2. Again, we have

\[
T_3 = O_p(n^{-1}\bar D)\cdot O_p(n^{-1}\bar D) + O_p[\{n^{-2}\bar D^2\cdot n^{-3}\bar D^3\}^{1/2}] = o_p(n^{-1}\bar D).
\]

Combining the three terms $T_1$–$T_3$, we have

\[
\hat v_1 = \mathrm{plim}(\hat v_1) + o_p(n^{-1}\bar D).
\]

Under the regularity condition that the weighted covariance matrix of the potential outcomes $Y_i(1)$ and
$Y_i(0)$ is nondegenerate, $\mathrm{plim}(\hat v_1) = O_p(n^{-1}\bar D)$, thus $\hat v_1/\mathrm{plim}(\hat v_1) = 1+o_p(1)$. Analogously, $\hat v_0/\mathrm{plim}(\hat v_0) =
1+o_p(1)$. By the continuous mapping theorem, $\hat v/\mathrm{plim}(\hat v)$ converges in probability to 1.
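Next, we prove that $\mathrm{plim}(\hat v)\ge\mathrm{avar}(\hat\tau)$. Recall that $\mathrm{plim}(\hat v) = \{\mathrm{plim}(\hat v_1)^{1/2}+\mathrm{plim}(\hat v_0)^{1/2}\}^2$. By the
Cauchy-Schwarz inequality, we have

\[
\begin{aligned}
\mathrm{avar}(\hat\tau) &= \mathrm{avar}(\hat\mu_1) + \mathrm{avar}(\hat\mu_0) - 2\,\mathrm{acov}(\hat\mu_1,\hat\mu_0)\\
&\le \mathrm{plim}(\hat v_1) + \mathrm{plim}(\hat v_0) + 2\,\mathrm{plim}(\hat v_1)^{1/2}\mathrm{plim}(\hat v_0)^{1/2} = \mathrm{plim}(\hat v).
\end{aligned}
\]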
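
A direct, unoptimized sketch of the estimator $\hat v = (\hat v_1^{1/2}+\hat v_0^{1/2})^2$ analyzed in this proof is given below. It assumes the ratio-form $\hat\mu_1$ and $\hat\mu_0$ and takes $(\Lambda_0)_{i,j} = (1-p)^{-|S_i\cap S_j|}-1$ by symmetry with $\Lambda_1$. Truncating a negative $\hat v_z$ at zero before taking the square root is a safeguard added in this sketch, not part of the derivation. The function name and inputs are hypothetical.

```python
def conservative_variance(S, Z, Y, p):
    """Compute v_hat = (sqrt(v1_hat) + sqrt(v0_hat))^2 on observed data."""
    n = len(S)
    sets = [set(g) for g in S]
    T = [all(Z[k] == 1 for k in g) for g in S]
    C = [all(Z[k] == 0 for k in g) for g in S]

    def arm(indicator, q):
        # Ratio-form mean of the arm, then the double sum defining v_hat_z.
        active = [i for i in range(n) if indicator[i]]
        weights = {i: q ** (-len(sets[i])) for i in active}
        mu = sum(weights[i] * Y[i] for i in active) / sum(weights.values())
        total = 0.0
        for i in active:
            for j in active:
                inter = len(sets[i] & sets[j])
                if inter == 0:
                    continue  # Lambda entry is zero for disjoint units
                union = len(sets[i] | sets[j])
                total += (Y[i] - mu) * (Y[j] - mu) * (q ** (-inter) - 1) / q ** union
        return max(total / n ** 2, 0.0)  # sketch-only safeguard

    v1, v0 = arm(T, p), arm(C, 1 - p)
    return (v1 ** 0.5 + v0 ** 0.5) ** 2


v_hat = conservative_variance(
    S=[[0], [0, 1], [2], [2, 3]], Z=[1, 1, 0, 0], Y=[3.0, 5.0, 1.0, 2.0], p=0.5
)
```

On this toy input the two arm-level terms are 1/9 and 1/36, so the combined estimate is $(1/3 + 1/6)^2 = 0.25$. The double loop costs $O(n^2)$; a practical implementation would iterate only over pairs of units that share a group.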

B.5 Proof of Proposition 4.1


The proof for the consistency and asymptotic normality of $\hat\tau(\beta_1,\beta_0)$ is analogous to that of Theorems 3.1
and 3.2. We only need to treat $Y_i(1)-\tilde X_i^t\beta_1$ and $Y_i(0)-\tilde X_i^t\beta_0$ as pseudo potential outcomes. The
remaining step is to check that the conditions in Theorems 3.1 and 3.2 still hold with the pseudo potential
outcomes. Assumptions 1, 3 and 5 still hold because the network structure remains the same. To check
Assumption 4, suppose $|Y_i(z)|\le a_Y$ and $|X_{ik}|\le a_X$; then $|Y_i-\beta^t\tilde X_i|\le a_Y+\|\beta\|_1a_X$, so the pseudo
potential outcomes are also bounded.

B.6 Proof of Lemma 4.1


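We have

\[
\begin{aligned}
n^2 \{v_n(\beta_1, \beta_0) - v_n(0, 0)\}
&= \{\tilde Y(1) - \tilde X \beta_1\}^t \Lambda_1 \{\tilde Y(1) - \tilde X \beta_1\} + \{\tilde Y(0) - \tilde X \beta_0\}^t \Lambda_0 \{\tilde Y(0) - \tilde X \beta_0\} \\
&\quad + 2 \{\tilde Y(1) - \tilde X \beta_1\}^t \Lambda_\tau \{\tilde Y(0) - \tilde X \beta_0\} - n^2 v_n(0, 0) \\
&= \tilde Y(1)^t \Lambda_1 \tilde Y(1) + \tilde Y(0)^t \Lambda_0 \tilde Y(0) + 2 \tilde Y(1)^t \Lambda_\tau \tilde Y(0) \\
&\quad + \beta_1^t \tilde X^t \Lambda_1 \tilde X \beta_1 + \beta_0^t \tilde X^t \Lambda_0 \tilde X \beta_0 + 2 \beta_1^t \tilde X^t \Lambda_\tau \tilde X \beta_0 \\
&\quad - 2 \beta_1^t \tilde X^t \{\Lambda_1 \tilde Y(1) + \Lambda_\tau \tilde Y(0)\} - 2 \beta_0^t \tilde X^t \{\Lambda_0 \tilde Y(0) + \Lambda_\tau \tilde Y(1)\} \\
&\quad - \{\tilde Y(1)^t \Lambda_1 \tilde Y(1) + \tilde Y(0)^t \Lambda_0 \tilde Y(0) + 2 \tilde Y(1)^t \Lambda_\tau \tilde Y(0)\} \\
&= n^2 L(\beta_1, \beta_0),
\end{aligned}
\]

where the second equality expands the quadratic forms, using that $\Lambda_1$, $\Lambda_0$ and $\Lambda_\tau$ are symmetric.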
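The expansion can be checked numerically. The sketch below assumes symmetric weight matrices and uses random stand-ins for $\tilde Y(1)$, $\tilde Y(0)$, $\tilde X$, $\Lambda_1$, $\Lambda_0$ and $\Lambda_\tau$ (all values and helper names are hypothetical). It compares the difference of the two variance forms with the quadratic part minus twice the linear part in $(\beta_1, \beta_0)$.

```python
import random


def matvec(matrix, vector):
    """Return the matrix-vector product as a list."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def dot(u, v):
    """Return the inner product of two lists."""
    return sum(a * b for a, b in zip(u, v))


def bilinear(u, matrix, v):
    """Return u^t M v."""
    return dot(u, matvec(matrix, v))


def random_symmetric(size, rng):
    """Draw a random symmetric matrix (a stand-in for a Lambda matrix)."""
    m = [[rng.uniform(-1, 1) for _ in range(size)] for _ in range(size)]
    return [[(m[i][j] + m[j][i]) / 2 for j in range(size)] for i in range(size)]


def variance_form(r1, r0, lam1, lam0, lam_tau):
    """Return r1^t L1 r1 + r0^t L0 r0 + 2 r1^t Ltau r0."""
    return (bilinear(r1, lam1, r1) + bilinear(r0, lam0, r0)
            + 2 * bilinear(r1, lam_tau, r0))


rng = random.Random(1)
n, k = 6, 2
x = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(n)]
y1 = [rng.uniform(-1, 1) for _ in range(n)]
y0 = [rng.uniform(-1, 1) for _ in range(n)]
beta1 = [rng.uniform(-1, 1) for _ in range(k)]
beta0 = [rng.uniform(-1, 1) for _ in range(k)]
lam1, lam0, lam_tau = (random_symmetric(n, rng) for _ in range(3))

xb1, xb0 = matvec(x, beta1), matvec(x, beta0)
r1 = [a - b for a, b in zip(y1, xb1)]   # adjusted treated outcomes
r0 = [a - b for a, b in zip(y0, xb0)]   # adjusted control outcomes

# Left side: adjusted variance form minus unadjusted variance form.
lhs = (variance_form(r1, r0, lam1, lam0, lam_tau)
       - variance_form(y1, y0, lam1, lam0, lam_tau))

# Right side: quadratic part in beta minus twice the linear part.
lin1 = [a + b for a, b in zip(matvec(lam1, y1), matvec(lam_tau, y0))]
lin0 = [a + b for a, b in zip(matvec(lam0, y0), matvec(lam_tau, y1))]
rhs = (variance_form(xb1, xb0, lam1, lam0, lam_tau)
       - 2 * dot(xb1, lin1) - 2 * dot(xb0, lin0))
```

If $\Lambda_\tau$ were not symmetric, the cross term $\tilde Y(1)^t \Lambda_\tau \tilde X \beta_0$ would have to be kept with $\Lambda_\tau^t$ instead.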

B.7 Proof of Proposition 4.2


By the minimization step, $L(\tilde\beta_1, \tilde\beta_0) \le L(0, 0) = 0$. By Lemma 4.1, it holds that

\[
n^2 v_n(\tilde\beta_1, \tilde\beta_0) = n^2 \{v_n + L(\tilde\beta_1, \tilde\beta_0)\} \le n^2 v_n,
\]

where the last inequality follows from the constraint in (6).

B.8 Proof of Theorem 4.1
Convergence of the regression coefficients. Define the population limit counterpart for the closed-form solution $(\tilde\beta_1, \tilde\beta_0)$:

\[
\begin{pmatrix} \beta_1^\star \\ \beta_0^\star \end{pmatrix} = \Omega_{xx}^{-1} \Omega_{yx}, \tag{B.16}
\]
where
   
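\[
\Omega_{xx} = \begin{pmatrix} \Omega_{xx,11} & \Omega_{xx,10} \\ \Omega_{xx,01} & \Omega_{xx,00} \end{pmatrix}, \qquad
\Omega_{yx} = \begin{pmatrix} \Omega_{yx,11} + \Omega_{yx,01} \\ \Omega_{yx,00} + \Omega_{yx,10} \end{pmatrix}. \tag{B.17}
\]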
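A solution of the form $\Omega_{xx}^{-1} \Omega_{yx}$ is what one obtains from the first-order condition of a convex quadratic objective. The sketch below illustrates this for a single covariate, so that $(\beta_1, \beta_0)$ is a 2-vector. It assumes an objective of the form $\beta^t \Omega_{xx} \beta - 2 \beta^t \Omega_{yx}$ with a positive definite $\Omega_{xx}$; the numerical values and function names are hypothetical.

```python
def solve_2x2(a, b):
    """Solve the 2 x 2 linear system a @ beta = b by Cramer's rule."""
    (a11, a12), (a21, a22) = a
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("matrix is singular")
    return ((a22 * b[0] - a12 * b[1]) / det,
            (a11 * b[1] - a21 * b[0]) / det)


def objective(beta, omega_xx, omega_yx):
    """Return beta^t Omega_xx beta - 2 beta^t Omega_yx."""
    quad = sum(beta[i] * omega_xx[i][j] * beta[j]
               for i in range(2) for j in range(2))
    lin = sum(beta[i] * omega_yx[i] for i in range(2))
    return quad - 2.0 * lin


# Hypothetical limits: a positive definite Omega_xx and an arbitrary Omega_yx.
omega_xx = ((2.0, 0.5), (0.5, 1.0))
omega_yx = (1.0, 0.3)

beta_star = solve_2x2(omega_xx, omega_yx)
best = objective(beta_star, omega_xx, omega_yx)

# Any perturbation of beta_star should not decrease the objective.
perturbed = [
    objective((beta_star[0] + d1, beta_star[1] + d0), omega_xx, omega_yx)
    for d1 in (-0.1, 0.0, 0.1) for d0 in (-0.1, 0.0, 0.1)
]
```

Perturbing `beta_star` by $\delta$ raises the objective by exactly $\delta^t \Omega_{xx} \delta$, which is why positive definiteness matters.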

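By Assumption 6, we have

\[
\begin{pmatrix} \tilde X^t \Lambda_1 \tilde X & \tilde X^t \Lambda_\tau \tilde X \\ \tilde X^t \Lambda_\tau \tilde X & \tilde X^t \Lambda_0 \tilde X \end{pmatrix} \to \Omega_{xx}.
\]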

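By arguments similar to those in the proof of Theorem 3.3, under Assumption 6, the following convergence holds in probability:

\[
\begin{pmatrix}
\displaystyle \sum_{i,j} \frac{T_i T_j \tilde X_i (Y_j - \hat\mu_1) (\Lambda_1)_{i,j}}{p^{|S_i \cup S_j|}} + \sum_{i,j} \frac{C_i C_j \tilde X_i (Y_j - \hat\mu_0) (\Lambda_\tau)_{i,j}}{(1-p)^{|S_i \cup S_j|}} \\[2ex]
\displaystyle \sum_{i,j} \frac{T_i T_j \tilde X_i (Y_j - \hat\mu_1) (\Lambda_\tau)_{i,j}}{p^{|S_i \cup S_j|}} + \sum_{i,j} \frac{C_i C_j \tilde X_i (Y_j - \hat\mu_0) (\Lambda_0)_{i,j}}{(1-p)^{|S_i \cup S_j|}}
\end{pmatrix}
\to
\begin{pmatrix} \Omega_{yx,11} + \Omega_{yx,01} \\ \Omega_{yx,00} + \Omega_{yx,10} \end{pmatrix}.
\]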

Therefore, we conclude that

\[
(\hat\beta_1, \hat\beta_0) - (\beta_1^\star, \beta_0^\star) = o_p(1). \tag{B.18}
\]

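Consistency and asymptotic distribution of $\hat\tau(\hat\beta_1, \hat\beta_0)$. The difference between $\hat\tau(\hat\beta_1, \hat\beta_0)$ and $\hat\tau(\beta_1^\star, \beta_0^\star)$ is

\[
\hat\tau(\hat\beta_1, \hat\beta_0) - \hat\tau(\beta_1^\star, \beta_0^\star)
= n^{-1} \sum_{i=1}^n \frac{T_i (\hat\beta_1 - \beta_1^\star)^t \tilde X_i}{p^{|S_i|}} \Bigg/ n^{-1} \sum_{i=1}^n \frac{T_i}{p^{|S_i|}}
- n^{-1} \sum_{i=1}^n \frac{C_i (\hat\beta_0 - \beta_0^\star)^t \tilde X_i}{(1-p)^{|S_i|}} \Bigg/ n^{-1} \sum_{i=1}^n \frac{C_i}{(1-p)^{|S_i|}}.
\]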

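By the consistency of the optimization solutions

\[
\hat\beta_1 - \beta_1^\star = o_p(1), \qquad \hat\beta_0 - \beta_0^\star = o_p(1),
\]

and the facts that

\[
\begin{aligned}
n^{-1} \sum_{i=1}^n \frac{T_i \tilde X_i}{p^{|S_i|}} &= O_p\left(n^{-1/2} \bar D^{1/2}\right), &
n^{-1} \sum_{i=1}^n \frac{C_i \tilde X_i}{(1-p)^{|S_i|}} &= O_p\left(n^{-1/2} \bar D^{1/2}\right), \\
n^{-1} \sum_{i=1}^n \frac{T_i}{p^{|S_i|}} &= 1 + O_p\left(n^{-1/2} \bar D^{1/2}\right), &
n^{-1} \sum_{i=1}^n \frac{C_i}{(1-p)^{|S_i|}} &= 1 + O_p\left(n^{-1/2} \bar D^{1/2}\right),
\end{aligned}
\]

following arguments similar to those in the proof of Theorem 3.2, we can conclude that

\[
|\hat\tau(\hat\beta_1, \hat\beta_0) - \hat\tau(\beta_1^\star, \beta_0^\star)| = o_p\left(n^{-1/2} \bar D^{1/2}\right).
\]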
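The normalizing averages above concentrate around 1 because each weight $T_i / p^{|S_i|}$ has mean 1. The sketch below checks this by exact enumeration on a toy graph. It assumes, which this appendix does not restate, independent Bernoulli($p$) assignment of the randomization units and $T_i = 1$ exactly when every unit in $S_i$ is treated; the graph `groups` and the function name are hypothetical.

```python
from itertools import product


def expected_normalizer(groups, m, p):
    """Exact E[n^{-1} sum_i T_i / p^{|S_i|}] over all 2^m assignments."""
    n = len(groups)
    total = 0.0
    for z in product((0, 1), repeat=m):
        # Probability of this assignment under independent Bernoulli(p).
        prob = 1.0
        for zk in z:
            prob *= p if zk == 1 else 1.0 - p
        # Sum of inverse-probability weights over fully treated outcome units.
        weight_sum = 0.0
        for s in groups:
            if all(z[k] == 1 for k in s):
                weight_sum += 1.0 / p ** len(s)
        total += prob * weight_sum / n
    return total


# Toy bipartite graph: five outcome units linked to three randomization units.
groups = [{0}, {0, 1}, {1, 2}, {2}, {0, 1, 2}]
value = expected_normalizer(groups, m=3, p=0.4)
```

The same enumeration with $C_i$ and $(1-p)^{|S_i|}$ gives the control-arm analogue.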

By Proposition 4.1, $\hat\tau(\beta_1^\star, \beta_0^\star)$ converges in probability to $\tau$. Hence $\hat\tau(\hat\beta_1, \hat\beta_0)$ is also consistent for $\tau$. The asymptotic distribution of $\hat\tau(\hat\beta_1, \hat\beta_0)$ follows from Slutsky's Theorem and the fact that

\[
\{v_n(\beta_1^\star, \beta_0^\star)\}^{-1/2} \left\{ \hat\tau(\hat\beta_1, \hat\beta_0) - \tau \right\} \to N(0, 1)
\]

in distribution.

B.9 Proof of Theorem 4.2

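We have the decomposition

\[
\begin{aligned}
& n^{-2} \sum_{i,j} \frac{T_i T_j (Y_i - \hat\mu_1 - \hat\beta_1^t \tilde X_i)(Y_j - \hat\mu_1 - \hat\beta_1^t \tilde X_j)(\Lambda_1)_{i,j}}{p^{|S_i \cup S_j|}} \\
&= n^{-2} \sum_{i,j} \frac{T_i T_j \{\tilde Y_i - \beta_1^{\star t} \tilde X_i - (\hat\mu_1 - \mu_1) - (\hat\beta_1 - \beta_1^\star)^t \tilde X_i\}\{\tilde Y_j - \beta_1^{\star t} \tilde X_j - (\hat\mu_1 - \mu_1) - (\hat\beta_1 - \beta_1^\star)^t \tilde X_j\}(\Lambda_1)_{i,j}}{p^{|S_i \cup S_j|}} \\
&= n^{-2} \sum_{i,j} \frac{T_i T_j (\tilde Y_i - \beta_1^{\star t} \tilde X_i)(\tilde Y_j - \beta_1^{\star t} \tilde X_j)(\Lambda_1)_{i,j}}{p^{|S_i \cup S_j|}} \\
&\quad - 2 n^{-2} \sum_{i,j} \frac{T_i T_j (\tilde Y_i - \beta_1^{\star t} \tilde X_i)\{(\hat\mu_1 - \mu_1) + (\hat\beta_1 - \beta_1^\star)^t \tilde X_j\}(\Lambda_1)_{i,j}}{p^{|S_i \cup S_j|}} \\
&\quad + n^{-2} \sum_{i,j} \frac{T_i T_j \{(\hat\mu_1 - \mu_1) + (\hat\beta_1 - \beta_1^\star)^t \tilde X_i\}\{(\hat\mu_1 - \mu_1) + (\hat\beta_1 - \beta_1^\star)^t \tilde X_j\}(\Lambda_1)_{i,j}}{p^{|S_i \cup S_j|}} \\
&= \mathrm{I} + \mathrm{II} + \mathrm{III},
\end{aligned}
\]

where $\mathrm{I}$, $\mathrm{II}$ and $\mathrm{III}$ denote the three terms in the second-to-last expression, each including its sign.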

Following a proof similar to that of Theorem 3.3, we have

\[
\mathrm{I} = v_1^\star(\beta_1^\star) + O_p(n^{-3/2} \bar D^{-3/2}), \qquad \mathrm{II} = o_p(n^{-3/2} \bar D^{-3/2}), \qquad \mathrm{III} = o_p(n^{-3/2} \bar D^{-3/2}).
\]

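Meanwhile, due to the fact that $v_1^\star(\beta_1^\star) \asymp n^{-1} \bar D$, we have

\[
\hat v_{1,n}(\hat\beta_1) = v_1^\star(\beta_1^\star) + O_p(n^{-3/2} \bar D^{-3/2})
\]

and similarly

\[
\hat v_{0,n}(\hat\beta_0) = v_0^\star(\beta_0^\star) + O_p(n^{-3/2} \bar D^{-3/2}).
\]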

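Therefore,

\[
\{\hat v_{1,n}(\hat\beta_1)\}^{1/2} + \{\hat v_{0,n}(\hat\beta_0)\}^{1/2} = \{v_1^\star(\beta_1^\star)\}^{1/2} + \{v_0^\star(\beta_0^\star)\}^{1/2} + O_p(n^{-3/4} \bar D^{-3/4}),
\]

and thus

\[
\frac{\{\hat v_{1,n}(\hat\beta_1)\}^{1/2} + \{\hat v_{0,n}(\hat\beta_0)\}^{1/2}}{\{v_1^\star(\beta_1^\star)\}^{1/2} + \{v_0^\star(\beta_0^\star)\}^{1/2}} = 1 + O_p(n^{-1/4} \bar D^{-1/4})
\]

by the fact that $v_1^\star(\beta_1^\star) \asymp n^{-1} \bar D$ and $v_0^\star(\beta_0^\star) \asymp n^{-1} \bar D$. Therefore, $\hat v_{n,ub}(\hat\beta_1, \hat\beta_0) / v_{n,ub}(\beta_1^\star, \beta_0^\star)$ converges in probability to 1.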

The conservativeness of $v_{ub}(\beta_1^\star, \beta_0^\star)$ for the true variance $v^\star(\beta_1^\star, \beta_0^\star)$ can be established similarly to Theorem 3.3 when no covariates are adjusted. The trick is to take $Y(1) - \tilde X \beta_1^\star$ and $Y(0) - \tilde X \beta_0^\star$ as pseudo potential outcomes and apply the Cauchy-Schwarz inequality for the covariance. Details are omitted.
