
Identifying Causal-Effect Inference Failure with
Uncertainty-Aware Models
Andrew Jesson∗
Department of Computer Science
University of Oxford
Oxford, UK OX1 3QD

Sören Mindermann∗
Department of Computer Science
University of Oxford
Oxford, UK OX1 3QD

Uri Shalit
Technion
Haifa, Israel 3200003

Yarin Gal
Department of Computer Science
University of Oxford
Oxford, UK OX1 3QD
Abstract
Recommending the best course of action for an individual is a major application
of individual-level causal effect estimation. This application is often needed in
safety-critical domains such as healthcare, where estimating and communicating
uncertainty to decision-makers is crucial. We introduce a practical approach for
integrating uncertainty estimation into a class of state-of-the-art neural network
methods used for individual-level causal estimates. We show that our methods
enable us to deal gracefully with situations of “no-overlap”, common in high-
dimensional data, where standard applications of causal effect approaches fail.
Further, our methods allow us to handle covariate shift, where the train and test
distributions differ, common when systems are deployed in practice. We show that
when such a covariate shift occurs, correctly modeling uncertainty can keep us from
giving overconfident and potentially harmful recommendations. We demonstrate
our methodology with a range of state-of-the-art models. Under both covariate shift
and lack of overlap, our uncertainty-equipped methods can alert decision makers
when predictions are not to be trusted while outperforming standard methods that
use the propensity score to identify lack of overlap.
1 Introduction
Learning individual-level causal effects is concerned with how units of interest respond
to interventions or treatments. These could be medications prescribed to particular patients,
training programs offered to job seekers, or educational courses taken by students. Ideally, such causal effects
would be estimated from randomized controlled trials, but in many cases, such trials are unethical
or expensive: researchers cannot randomly prescribe smoking to assess health risks. Observational
data offers an alternative, with typically larger sample sizes, lower costs, and greater relevance
to the target population. However, the price we pay for using observational data is lower certainty
in our causal estimates, due to the possibility of unmeasured confounding and the measured and
unmeasured differences between the populations subjected to different treatments.
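To make the confounding problem concrete, the following minimal simulation (illustrative only, not from the paper) shows how a covariate that drives both treatment assignment and outcome biases the naive difference-in-means estimate, while adjusting for that covariate recovers the true effect:

```python
# Hypothetical simulation of confounding in observational data: a single
# covariate x drives both treatment assignment and the outcome, so the
# naive difference in means over-estimates the true treatment effect.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)                      # confounder
p = 1.0 / (1.0 + np.exp(-2.0 * x))          # treatment more likely when x is high
t = rng.binomial(1, p)                      # observed (non-randomized) treatment
true_effect = 1.0
y = true_effect * t + 2.0 * x + rng.normal(size=n)  # x also raises the outcome

# Naive estimate: mixes the effect of t with the effect of x.
naive = y[t == 1].mean() - y[t == 0].mean()

# Adjusting for the measured confounder (here, a linear regression of y
# on t and x) recovers an estimate close to the true effect of 1.0.
beta = np.linalg.lstsq(np.column_stack([t, x, np.ones(n)]), y, rcond=None)[0]
print(naive, beta[0])
```

With unmeasured confounding, of course, no such adjustment is possible, which is exactly the source of the extra uncertainty discussed above.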
Progress in learning individual-level causal effects is being accelerated by deep learning approaches
to causal inference [27, 36, 3, 48]. Such neural networks can be used to learn causal effects from
∗Equal contribution.
34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.
arXiv:2007.00163v2 [cs.LG] 22 Oct 2020