From Optimizing Engagement to Measuring Value

Smitha Milli, Luca Belli, and Moritz Hardt

∗ Work done while the author was an intern at Twitter.
† MH is a paid consultant at Twitter. Work performed while consulting for Twitter.

ABSTRACT
Most recommendation engines today are based on predicting user engagement, e.g. predicting whether a user will click on an item or not. However, there is potentially a large gap between engagement signals and a desired notion of value that is worth optimizing for. We use the framework of measurement theory to (a) confront the designer with a normative question about what the designer values, (b) provide a general latent variable model approach that can be used to operationalize the target construct and directly optimize for it, and (c) guide the designer in evaluating and revising their operationalization. We implement our approach on the Twitter platform on millions of users. In line with established approaches to assessing the validity of measurements, we perform a qualitative evaluation of how well our model captures a desired notion of "value".

ACM Reference Format:
Smitha Milli, Luca Belli, and Moritz Hardt. 2021. From Optimizing Engagement to Measuring Value. In Conference on Fairness, Accountability, and Transparency (FAccT '21), March 3–10, 2021, Virtual Event, Canada. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3442188.3445933

1 INTRODUCTION
Most recommendation engines today are based on predicting user engagement, e.g. predicting whether a user will click an item or not. However, there is potentially a large gap between engagement signals and a desired notion of value that is worth optimizing for [1]. Just because a user engages with an item doesn't mean they value it. A user might reply to an item because they are angry about it, or click an item in order to gain more information about it [2], or watch addictive videos out of temptation.

It is clear that engagements provide some signal for "value", but are not equivalent to it. Further, different types of engagement may provide differing levels of evidence for value. For example, if a user explicitly likes an item, we are more likely to believe that they value it, compared to if they had merely clicked on it. Ideally, we want the objective for our recommender system to take engagement signals into account, but only insofar as they relate to a desired notion of "value". However, directly specifying such an objective is a non-trivial problem. Exactly how much should we rely on likes versus clicks versus shares and so on? How do we evaluate whether our designed objective captures our intended notion of "value"?

1.1 Our contributions
We make three primary contributions.

1. We propose measurement theory as a principled approach to aggregating engagement signals into an objective function that captures a desired notion of "value". The resulting objective function can be optimized from data, serving as a plug-in replacement for the ad-hoc objectives typically used in engagement optimization frameworks.

2. Our approach is based on the creation of a latent variable model that relates value to various observed engagement signals. We devise a new identification strategy for the latent variable model tailored to the intended use case of online recommendation systems. Our identification strategy needs only a single robust engagement signal for which we know the conditional probability of value given the signal.

3. We implemented our approach on the Twitter platform on millions of users. In line with an established validity framework for measurement theory, we conduct a qualitative analysis of how well our model captures "value".

1.2 Measurement theory and latent variable models
The framework of measurement theory [3, 4] is widely used in the social sciences as a guide to measuring unobservable theoretical constructs like "quality of life", "political ideology", or "socio-economic status". Under the measurement approach, theoretical constructs are operationalized as latent variables, which are related to observable data through a latent variable model (LVM).

Similarly, we treat the "value" of a recommendation as a theoretical construct, which we operationalize as a (binary) latent variable 𝑉. We represent the LVM as a Bayesian network [5] that contains 𝑉 as well as each of the possible types of user engagements (clicks, shares, etc.). The structure of the Bayesian network allows us to specify conditional independences between variables, enabling us to capture dependencies such as needing to click an item before replying to it.

Under the measurement approach, the ideal objective becomes clear: P(𝑉 = 1 | Behaviors), the probability that the user values the item given their engagements with it. Such an objective uses all engagement signals, but only insofar as they provide evidence of Value 𝑉. If we can identify P(𝑉 = 1 | Behaviors), then it can be used as a drop-in replacement for any objective that scores items based on engagement signals.

Our key insight is that we can identify P(𝑉 | Behaviors), the probability of Value given all behaviors, through the use of a single anchor variable 𝐴 for which we know P(𝑉 = 1 | 𝐴 = 1).
The anchor variable, together with the structure of the Bayesian network, is what gives "value" its meaning. Through the choice of the anchor variable and the structure of the Bayesian network, the designer has the flexibility to give "value" subtly different meanings.

Recommendation engines have natural candidates for anchor variables: strong, explicit feedback from the user. For example, strong negative feedback could include downvoting or reporting a content item, or blocking another user. Strong positive feedback could be explicitly liking or upvoting an item. For negative feedback, we make the assumption that P(𝑉 = 1 | 𝐴 = 1) = 𝜖 for 𝜖 ≈ 0, while for positive feedback we make the assumption that P(𝑉 = 1 | 𝐴 = 1) = 1 − 𝜖.
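To make the resulting objective concrete, the following is a minimal sketch of a toy LVM over three behaviors, with P(𝑉 = 1 | behaviors) computed by exact enumeration. The network structure and all probabilities here are illustrative assumptions only; a deployed model would estimate them from data as described in Section 3.

```python
# Toy LVM: binary Value V with three binary behaviors (Click, Fav, SLO).
# All numbers below are illustrative assumptions, not deployed estimates.
P_V1 = 0.5                                # prior P(V = 1)
P_CLICK = {0: 0.10, 1: 0.40}              # P(Click = 1 | V)
P_FAV = {(0, 0): 0.00, (0, 1): 0.02,      # P(Fav = 1 | V, Click):
         (1, 0): 0.00, (1, 1): 0.30}      # favoriting requires a click
P_SLO = {(0, 0): 0.02, (0, 1): 0.10,      # P(SLO = 1 | V, Click):
         (1, 0): 0.00, (1, 1): 0.00}      # anchor: P(V = 1 | SLO = 1) = 0

def joint(v, click, fav, slo):
    """Joint probability of one full assignment under the toy DAG."""
    p = P_V1 if v else 1.0 - P_V1
    p *= P_CLICK[v] if click else 1.0 - P_CLICK[v]
    p *= P_FAV[v, click] if fav else 1.0 - P_FAV[v, click]
    p *= P_SLO[v, click] if slo else 1.0 - P_SLO[v, click]
    return p

def p_value(click, fav, slo):
    """P(V = 1 | behaviors): the drop-in ranking objective."""
    num = joint(1, click, fav, slo)
    den = num + joint(0, click, fav, slo)
    return num / den if den > 0 else 0.0

print(p_value(click=1, fav=0, slo=0))  # a click alone: moderate evidence
print(p_value(click=1, fav=1, slo=0))  # an added Fav: strong evidence
print(p_value(click=1, fav=0, slo=1))  # SLO overrides: probability 0
```

Items would then be ranked by this probability rather than by raw engagement counts.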
1.3 A case study on the Twitter platform
We implemented our approach on the Twitter platform on millions of users. On Twitter, there are numerous user behaviors: clicks, favorites, retweets, replies, and many more. It would be difficult to directly specify an objective that properly trades off all these behaviors. Instead, we identify a natural anchor variable. On Twitter, users can give explicit feedback on tweets by clicking "See less often" (SLO) on them. We use SLO as our anchor and assume that the user does not value tweets they click "See less often" on. After specifying the anchor variable and the Bayesian network, we are able to learn P(𝑉 | Behaviors) from data.

The model automatically learns a natural ordering of which behaviors should provide stronger evidence for Value 𝑉, e.g. P(𝑉 = 1 | Retweet = 1) > P(𝑉 = 1 | Reply = 1) > P(𝑉 = 1 | Click = 1). Furthermore, it learns complex inferences about the evidence provided by combinations of behavior. Such inferences would not be possible under the standard approach, which uses a linear combination of behaviors as the objective.

Unlike other work on recommender systems, we do not evaluate through engagement metrics. If we believe that engagement is not the same as the construct "value", then we cannot evaluate our approach merely by reporting engagement numbers. Instead, we must take a more holistic approach. We discuss established approaches to assessing the validity [6–8] of a measurement, and explain how they translate to the recommender system setting by using Twitter as an example.

2 RELATED WORK
In the social sciences, especially in psychology, education, and political science, measurement theory [3] has long been used to operationalize constructs like "personality", "intelligence", "political ideology", etc. Often the operationalization of such constructs is heavily contested, and many types of evidence for validity and reliability are used to evaluate the match between a construct and its operationalization [6, 7].

Recently, Jacobs and Wallach [9] introduced the language of measurement in the context of computer science. They argue that many harms effected by computational systems are the direct result of a mismatch between a theoretical construct and its operationalization. In the context of recommender systems, many have argued that the engagement metrics used in practice are a poor operationalization of "value" [1].

We use measurement theory as a principled way to disentangle latent value from observed engagement. We provide a general latent variable model approach in which an anchor variable provides the key link between the latent variable and the observed behaviors. The term anchor variable has been used in various ways in prior work on factor models [10–13]; our usage is most similar to [13]. Our use of the anchor variable is also similar to the use of a proxy variable to identify causal effects under unobserved confounding [14, 15].

3 IDENTIFICATION OF THE LVM WITH ANCHOR 𝐴
We now describe our general approach to operationalizing a target construct through a latent variable model (LVM) with an anchor variable. We operationalize the construct for value through an LVM in which the construct is represented through an unobserved, binary latent variable 𝑉 that the other binary, observed behaviors provide evidence for. We assume there is one observed behavior, an anchor variable 𝐴, for which we know P(𝑉 = 1 | 𝐴 = 1). We represent all other observed behaviors in the binary random vector B = (𝐵1, . . . , 𝐵𝑛). We refer to 𝐴 as an anchor variable because it will provide the crucial link to identifying P(𝑉 | 𝐴, B). In other words, it will anchor the other observed behaviors B to Value 𝑉.

We represent the LVM as a Bayesian network. A Bayesian network is a directed acyclic graph (DAG) that graphically encodes a factorization of the joint distribution of the variables in the network. In particular, the DAG encodes all conditional independences among the nodes through the 𝑑-separation rule [5]. This is important because in most real-world settings, the observed behaviors have complex dependencies among each other (e.g. one may need to click on an item before replying to it). Through our choice of the DAG we can model both the dependencies among the observed behaviors as well as the dependence of the unobserved variable 𝑉 on the observed behaviors.

Our goal is to determine P(𝑉 | 𝐴, B) so that it can later be used downstream as a target for optimization. We now discuss sufficient conditions for identifying the conditional distribution P(𝑉 | 𝐴, B). There are three assumptions on the anchor variable 𝐴 that we will consider in turn.

Notation. We use Pa(𝑋) to denote the parents of a node 𝑋 and Pa−𝑉(𝑋) = Pa(𝑋) \ {𝑉} to denote all parents of 𝑋 except for 𝑉.

Assumption 1 (Value-sensitive). For every realization 𝑏 of the random vector B, we have that P(𝐴 = 1 | B = 𝑏, 𝑉 = 1) ≠ P(𝐴 = 1 | B = 𝑏, 𝑉 = 0).

Assumption 1 simply means that the anchor 𝐴 carries signal about Value 𝑉, regardless of what the other variables B are.¹

Assumption 2 (No children). The anchor variable 𝐴 has no children.

Since the anchor 𝐴 is chosen to be a strong type of explicit feedback, it is usually the last type of behavior the user engages in on a content item (e.g. a "report" button that removes the content from the user's timeline), and thus, it typically makes sense to model 𝐴 as having no children.

¹ When combined with Assumption 2, Assumption 1 simplifies to the condition P(𝐴 = 1 | Pa−𝑉(𝐴) = 𝑧, 𝑉 = 1) ≠ P(𝐴 = 1 | Pa−𝑉(𝐴) = 𝑧, 𝑉 = 0) for every realization 𝑧 of Pa−𝑉(𝐴), the parents of 𝐴 excluding 𝑉.
Assumption 3 (One-sided conditional independence). Let Pa−𝑉(𝐴) be all parents of 𝐴 excluding 𝑉. Value 𝑉 is independent of Pa−𝑉(𝐴) given that 𝐴 = 1:

P(𝑉 = 1 | 𝐴 = 1, Pa−𝑉(𝐴)) = P(𝑉 = 1 | 𝐴 = 1).

Assumption 3 means that when the user has opted to give feedback (𝐴 = 1), the level of information that feedback contains about Value 𝑉 does not depend on the other parents of 𝐴. The assumption rests on the fact that 𝐴 is a strong type of feedback that the user only provides when they are confident of their assessment.

3.1 Conditions for identification
The next theorem establishes that under A1, the distribution of observable behaviors P(𝐴, B) and the conditional distribution P(𝐴 | 𝑉, B) are sufficient for identifying the conditional distribution P(𝑉 | 𝐴, B). The proof uses a matrix adjustment method (Rothman et al. [16], pg. 360) and is very similar to those in Pearl [14] and Kuroki and Pearl [15].

Theorem 1. Let 𝑉 and 𝐴 be binary random variables and let B = (𝐵1, . . . , 𝐵𝑛) be a binary random vector. If A1 holds, then the distributions P(𝐴, B) and P(𝐴 | 𝑉, B) uniquely identify the conditional distribution P(𝑉 | 𝐴, B).

Proof. Since the conditional distribution P(𝑉 | 𝐴, B) is equal to P(B, 𝑉) · P(𝐴 | B, 𝑉) / P(𝐴, B), we can reduce the problem to determining the distribution P(B, 𝑉). We can relate P(B, 𝑉) to the given distributions, P(𝐴, B) and P(𝐴 | B, 𝑉), via the law of total probability:

P(𝐴, B) = Σ𝑣∈{0,1} P(B, 𝑉 = 𝑣) P(𝐴 | B, 𝑉 = 𝑣). (1)

For every realization 𝑏 of the random vector B, we can write Equation 1 as 𝑧𝑏 = P𝑏 𝜇𝑏, where the matrix P𝑏 ∈ [0, 1]^{2×2} and the vectors 𝜇𝑏, 𝑧𝑏 ∈ [0, 1]^2 are defined as

P𝑏,𝑖𝑗 = P(𝐴 = 𝑖 | B = 𝑏, 𝑉 = 𝑗), 𝜇𝑏,𝑗 = P(B = 𝑏, 𝑉 = 𝑗), 𝑧𝑏,𝑖 = P(𝐴 = 𝑖, B = 𝑏) for 𝑖, 𝑗 ∈ {0, 1}.

By Assumption 1, the two columns of P𝑏 differ, so P𝑏 is invertible and 𝜇𝑏 = P𝑏⁻¹ 𝑧𝑏 is uniquely determined. □
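Numerically, the matrix step of the proof amounts to a 2 × 2 linear solve. The following sketch uses illustrative numbers of our own choosing for P(𝐴 | B = 𝑏, 𝑉) and P(𝐴, B = 𝑏); they are not estimates from the paper:

```python
import numpy as np

# For a fixed realization b of B, Equation (1) reads z_b = P_b @ mu_b:
#   P_b[i, j] = P(A = i | B = b, V = j)   (given),
#   z_b[i]    = P(A = i, B = b)           (observable),
#   mu_b[j]   = P(B = b, V = j)           (to be recovered).
P_b = np.array([[0.95, 0.40],   # row A = 0; columns V = 0, V = 1
                [0.05, 0.60]])  # row A = 1
z_b = np.array([0.14, 0.06])

# Assumption 1 says the two columns of P_b differ, which for a
# column-stochastic 2x2 matrix is exactly invertibility.
mu_b = np.linalg.solve(P_b, z_b)   # [P(B=b, V=0), P(B=b, V=1)]

# Target conditional for this realization, evaluated at A = 1:
p_v1 = mu_b[1] * P_b[1, 1] / z_b[1]   # P(V=1 | A=1, B=b)
print(mu_b, p_v1)
```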
Corollary 1. If the joint distribution P(𝑉, 𝐴, B) is Markov with respect to a DAG 𝐺 in which A1 and A2 hold, then the distributions P(𝐴, B) and P(𝐴 | Pa(𝐴)) uniquely identify the conditional distribution P(𝑉 | 𝐴, B).

Proof. In a Bayesian network, the Markov blanket for a variable 𝑋 is the set of variables MB(𝑋) ⊆ Z that shield 𝑋 from all other variables Z in the DAG, i.e. P(𝑋 | Z) = P(𝑋 | MB(𝑋)) [5]. The Markov blanket for a variable 𝑋 consists of its parents, children, and parents of its children. Since the anchor 𝐴 has no children, P(𝐴 | 𝑉, B) = P(𝐴 | MB(𝐴)) = P(𝐴 | Pa(𝐴)). Thus, by Theorem 1, P(𝐴 | Pa(𝐴)) and P(𝐴, B) identify the conditional distribution P(𝑉 | 𝐴, B). □

Finally, when we add Assumption 3, one-sided conditional independence, then the distributions P(𝑉), P(𝐴, B), P(𝑉 = 1 | 𝐴 = 1), and P(Pa−𝑉(𝐴) | 𝑉) are sufficient. The proof follows from Corollary 1 because, under Assumption 3, the distributions P(𝑉 = 1 | 𝐴 = 1), P(Pa−𝑉(𝐴) | 𝑉), and P(𝑉) identify P(𝐴 | Pa(𝐴)).

Corollary 2. If the joint distribution P(𝑉, 𝐴, B) is Markov with respect to a DAG 𝐺 in which A1–3 hold, then P(𝑉), P(𝐴, B), P(𝑉 = 1 | 𝐴 = 1), and P(Pa−𝑉(𝐴) | 𝑉) uniquely identify the conditional distribution P(𝑉 | 𝐴, B).

Proof. We will show that, under Assumption 3, the distributions P(𝑉 = 1 | 𝐴 = 1), P(Pa−𝑉(𝐴) | 𝑉), and P(𝑉) identify P(𝐴 | Pa(𝐴)). The proof then follows from Corollary 1.

We show that we can identify P(𝐴 | Pa(𝐴)) by solving a set of linear equations. As shorthand, let 𝑝𝑤,𝑎,𝑣 = P(Pa−𝑉(𝐴) = 𝑤, 𝐴 = 𝑎, 𝑉 = 𝑣). For any realization 𝑤, by marginalizing over 𝐴 and 𝑉, we can derive the following four equations for the four unknown probabilities 𝑝𝑤,0,0, 𝑝𝑤,0,1, 𝑝𝑤,1,0, 𝑝𝑤,1,1:

P(Pa−𝑉(𝐴) = 𝑤, 𝐴 = 0) = 𝑝𝑤,0,0 + 𝑝𝑤,0,1 (2)
P(Pa−𝑉(𝐴) = 𝑤, 𝐴 = 1) = 𝑝𝑤,1,0 + 𝑝𝑤,1,1 (3)
P(Pa−𝑉(𝐴) = 𝑤, 𝑉 = 0) = 𝑝𝑤,0,0 + 𝑝𝑤,1,0 (4)
P(Pa−𝑉(𝐴) = 𝑤, 𝑉 = 1) = 𝑝𝑤,0,1 + 𝑝𝑤,1,1 (5)
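These four equations are linearly dependent on their own (each pair sums to P(Pa−𝑉(𝐴) = 𝑤)), and one way to close the system, consistent with Assumption 3, is that the known P(𝑉 = 1 | 𝐴 = 1) pins down 𝑝𝑤,1,1. A sketch with marginals of our own choosing:

```python
# Knowns for one realization w of Pa_-V(A). Illustrative values,
# not estimates from the paper:
m_a0 = 0.90          # P(Pa=w, A=0), from the observed P(A, B)
m_a1 = 0.10          # P(Pa=w, A=1)
m_v0 = 0.55          # P(Pa=w, V=0) = P(Pa=w | V=0) P(V=0)
m_v1 = 0.45          # P(Pa=w, V=1) = P(Pa=w | V=1) P(V=1)
P_V1_GIVEN_A1 = 0.0  # the known anchor conditional, e.g. for SLO

# Under Assumption 3, P(V=1 | A=1, Pa=w) = P(V=1 | A=1), so
# p_{w,1,1} = P(V=1 | A=1) * P(Pa=w, A=1); the rest follow in turn.
p_w11 = P_V1_GIVEN_A1 * m_a1   # closes the system
p_w10 = m_a1 - p_w11           # Equation (3)
p_w01 = m_v1 - p_w11           # Equation (5)
p_w00 = m_v0 - p_w10           # Equation (4)

# The anchor's conditional distribution given its parents follows:
for v, (p0, p1) in enumerate([(p_w00, p_w10), (p_w01, p_w11)]):
    print(f"P(A=1 | Pa=w, V={v}) = {p1 / (p0 + p1):.3f}")
```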
Figure 1: A workflow of how users can interact with ML-based notifications on Twitter. To view the tweet, the user can either "open" the notification from the home screen on their phone or "click" on it from the notifications tab within the app. If the user sees the tweet from their notifications tab, they can also click "See Less Often" on it. Once the user has opened or clicked on the notification, they can engage with the tweet in many ways, e.g. replying, retweeting, or favoriting. At any point, the user can opt out of notifications altogether.
Figure 2: Bayesian network for Twitter notifications. An arrow from a node 𝑋 to a box means that the node 𝑋 is a parent of all the nodes in the box, e.g. Click and Open are parents of Fav, RT, ..., Linger > 6s. The latent variable Value is a parent of everything except NTabView. The measurement node SLO is highlighted in pink.
When videos on Twitter began to auto-play, the signal of whether or not a user watched a video presumably became less relevant. The reality is that the objective is never static: how users interact with the platform is constantly changing, and the objective must change accordingly.

Our approach provides a principled solution to objective specification. We directly operationalize our intended construct "value" as a latent variable 𝑉. The meaning of Value 𝑉 is defined by the Bayesian network and the anchor variable 𝐴, a behavior that we believe provides strong evidence for value or the lack of it. On Twitter, the user can provide strong, explicit feedback by clicking "See less often" (SLO) on a tweet. We use SLO as our anchor 𝐴 and assume that if a user clicks "See less often" on a tweet, they do not value it: P(𝑉 = 1 | SLO = 1) = 0.
Under this approach, there is no need to manually specify how all the behaviors should factor into the objective. Having operationalized Value, the ideal objective to use is clear: P(𝑉 = 1 | B, 𝐴), the probability of Value 𝑉 given the observed behaviors. As discussed in Section 4, we can directly estimate P(𝑉 = 1 | B, 𝐴) from data. Furthermore, presuming that the anchor and structure of the Bayesian network remain stable, we can regularly re-estimate the model with new data at any point, allowing us to account for change in user behavior on the platform.

The Bayesian network. We applied our approach to ML-driven notifications on Twitter. These notifications have various forms, e.g. "Users A, B, C just liked User Z's tweet", "User A just tweeted after a long time", or "Users A, B, C followed User Z". Figure 1 shows an example notification and how a user can interact with it. The Bayesian network in Figure 2 succinctly encodes the dependencies between different types of interactions users can have with notifications.⁷

Notifications are sent both to the user's home screen on their mobile phone, as well as to the notifications tab within the Twitter app. The user can start their interaction either by seeing the notification in their notifications tab (NTabView), and then clicking on it (Click), or by seeing the notification on their phone home screen and opening it from there directly (Open). After clicking or opening the notification, the user can engage in many more interactions: they can favorite (Fav), retweet (RT), quote retweet (Quote), or reply (Reply) to the tweet; if the tweet has a link, they can click on it (LinkClick); if it has a video, they can watch it (VidWatch). In addition, other implicit signals are logged: whether the amount the user lingered on the tweet exceeds certain thresholds (Linger > 6s, Linger > 12s, Linger > 20s) and whether the number of user active minutes (UAM) spent in the app after clicking/opening the notification exceeds a threshold.

Furthermore, when the user is in the notifications tab, the user can provide explicit feedback on a particular notification by clicking "See Less Often" (SLO) on it. Notably, unlike other types of behavior, the user does not need to actually click or open the notification before clicking SLO. However, we found empirically that users are more likely to click SLO after clicking or opening the notification, probably because they need to gain more information before making an assessment. Thus, in addition to NTabView, we also model Click and Open as parents of SLO.

Finally, at any time the user can opt out of notifications to their phone home screen (OptOut). When the user decides to opt out, it is attributed to any ML-based notification seen within a day of choosing to opt out. Since ML-based notifications are relatively rare on Twitter (users usually get less than one a day), there are usually at most one or two notifications attributed to an opt-out event.

We model the latent variable 𝑉 as being a parent of all behaviors except NTabView (whether or not the user saw the notification in their notifications tab). Since users may check their notifications tab for many other notifications, it is difficult to attribute NTabView to a particular notification, and so we consider it to be an exogenous, random event.

⁷ The network can be interpreted as a causal Bayesian network [5], although for our purposes, we do not strictly need the causal interpretation.
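For reference, the structure of Figure 2 can be written down as a simple parents map. The sketch below is our reading of the figure, with conditional probability tables omitted since they are estimated from data:

```python
# Structure of the Bayesian network in Figure 2 as a parents map
# (our reading of the figure; CPTs are estimated from data, not listed).
ENGAGEMENTS = ["Fav", "RT", "Quote", "Reply", "LinkClick", "VidWatch",
               "Linger>6s", "Linger>12s", "Linger>20s", "UAM"]

parents = {
    "NTabView": [],                       # exogenous, random event
    "Open": ["Value"],                    # opened from the phone home screen
    "Click": ["Value", "NTabView"],       # clicked from the notifications tab
    "OptOut": ["Value"],
    "SLO": ["Value", "NTabView", "Click", "Open"],  # anchor; no children
}
for b in ENGAGEMENTS:
    # Engagements require viewing the tweet first, so Click and Open
    # are modeled as parents alongside Value.
    parents[b] = ["Value", "Click", "Open"]
```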
Identifying the joint distribution. We fit our model on three days of data containing all user interactions with ML-based push notifications on Twitter. In Section 3, we proved that the target objective, the conditional distribution P(𝑉 = 1 | B, 𝐴), is uniquely identified from P(𝑉 = 1 | 𝐴 = 1), P(𝑉), P(B, 𝐴), and P(Pa−𝑉(𝐴) | 𝑉) (see Corollary 2). We set the four distributions as follows. We used SLO as our anchor variable 𝐴 and assumed that P(𝑉 = 1 | 𝐴 = 1) = 0, i.e. a user never says "See less often" if they value the notification. The prior distribution of value P(𝑉) was set to be uniform. The distribution of observed behaviors P(B, 𝐴) was set to the empirical distribution. The distribution P(Pa−𝑉(𝐴) | 𝑉) was estimated as described in Section 3.2 by using two sources of historical data, one in which notifications were sent at random and the other in which notifications were sent according to a recommendation algorithm.⁸

⁸ We assume that the dataset of randomized notifications has a prior probability P𝑅(𝑉 = 1) = 0 and the dataset of algorithmically chosen notifications has a prior probability P𝐶(𝑉 = 1) = 0.5.
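A sketch of how the empirical inputs could be assembled; the log format and the helper identify_conditional (standing in for the identification procedure of Section 3) are hypothetical:

```python
from collections import Counter

BEHAVIORS = ["Click", "Open", "Fav", "RT", "Reply", "SLO"]  # abbreviated

def empirical_joint(interaction_logs):
    """Estimate P(B, A) as empirical frequencies of behavior patterns.

    interaction_logs is a hypothetical iterable of dicts, one per sent
    notification, mapping behavior names to 0/1 indicators.
    """
    counts = Counter(tuple(log[b] for b in BEHAVIORS)
                     for log in interaction_logs)
    total = sum(counts.values())
    return {pattern: n / total for pattern, n in counts.items()}

# Remaining inputs of Corollary 2, as set in the text:
P_V1_GIVEN_SLO = 0.0      # anchor assumption: SLO implies no value
P_V = {0: 0.5, 1: 0.5}    # uniform prior on Value

# identify_conditional is a hypothetical stand-in for the procedure of
# Section 3, returning the target objective P(V = 1 | B, A):
# value_model = identify_conditional(empirical_joint(logs), P_V,
#                                    P_V1_GIVEN_SLO, parents)
```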
Evaluation of internal structure. Assessing our measure of "value" for validity will necessarily be an ongoing and multi-faceted process. We do not, as is typical of papers on recommendation, report engagement metrics. The reason is that if we expect our measure of "value" to differ from engagement, we cannot evaluate it by simply reporting engagement metrics. The evaluation of a measurement necessitates a more holistic approach. In Section 5, we describe the five categories of evidence for validity described by the Standards for educational and psychological testing, the handbook considered the gold standard on approaches to testing [6].

Here, we focus on evaluating what is known as evidence based on internal structure, i.e. whether expected theoretical relationships between the variables in the model hold. To justify why the structure of our Bayesian network is necessary, we compare our full model from Figure 2 to two other models: a naive Bayes model and the full model but without arrows from Open and Click to SLO. In Table 1, we show P(𝑉 = 1 | Behavior = 1) for all behaviors and models. As noted by prior work [5, 13], matrix adjustment methods can result in negative values when conditional independence assumptions are not satisfied. To address this, we clamp all inferences to the interval [0, 1]. We include the table of non-clamped inferences in the appendix (Table 2).

The first, simple theoretical relationship we expect to hold is that compared to observing no user interaction, observing any user behavior besides opt-out should increase the probability that the user values the tweet, i.e. P(𝑉 = 1 | Behavior = 1) > P(𝑉 = 1) = 0.5 for all Behavior ≠ OptOut. Furthermore, we also expect some behaviors to provide stronger signals of value than others, e.g. that P(𝑉 = 1 | Fav = 1) > P(𝑉 = 1 | Click = 1).

The first model is the naive Bayes model, which simply assumes that all behaviors are conditionally independent given Value 𝑉. It does extremely poorly: almost all inferences have negative values and are clamped to zero, indicating that the conditional independence assumptions are unrealistic.

The second model is the full model except without arrows from Click and Open to SLO. It models all pre-requisite relationships between behaviors, i.e. if a behavior 𝑋 is required for another behavior 𝑌, then there is an arrow from 𝑋 to 𝑌.
P(𝑉 = 1 | Behavior = 1)

Behavior       Naive Bayes   Click, Open ↛ SLO   Full Model
OptOut         0             0                   0
Click          0             0.316               0.652
Open           0             0.442               0.685
UAM            0             0.157               0.719
VidWatch       0             0.254               0.772
Linger > 6s    0             0.264               0.802
LinkClick      0             0.320               0.836
Reply          0.358         0.570               0.932
Linger > 12s   0             0.245               0.948
Fav            0.579         0.672               0.949
RT             0.680         0.720               0.956
Linger > 20s   0.019         0.296               0.991
Quote          1.0           1.0                 1.0

Table 1: The inferences made by LVMs with different DAGs. For each model and for each behavior, we list P(𝑉 = 1 | Behavior = 1): how much evidence the model learns that a behavior provides for Value 𝑉 (when all other behaviors are marginalized over).
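Each table entry conditions on a single behavior and marginalizes over the rest. A sketch of that computation from an explicit joint table follows; the dict-based representation is a hypothetical stand-in for however the fitted model stores P(𝑉, 𝐴, B):

```python
def p_value_given_behavior(joint, names, behavior):
    """P(V = 1 | behavior = 1), marginalizing over all other behaviors.

    joint is a hypothetical dict mapping full assignments
    (v, b_1, ..., b_n) to probabilities; names lists b_1..b_n.
    """
    idx = names.index(behavior) + 1   # offset 0 holds the value bit
    num = sum(p for assign, p in joint.items()
              if assign[0] == 1 and assign[idx] == 1)
    den = sum(p for assign, p in joint.items() if assign[idx] == 1)
    return num / den if den > 0 else float("nan")

def clamp01(x):
    """Matrix adjustment can yield inferences outside [0, 1] when its
    conditional independence assumptions fail, so, as in Table 1,
    inferences are clamped to the unit interval."""
    return min(1.0, max(0.0, x))
```

The same routine, extended to condition on a pair of behaviors, yields inferences like P(𝑉 = 1 | Open = 1, UAM = 1) discussed below.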
Compared to the naive Bayes model, the second model makes far fewer negative-valued inferences, indicating that its conditional independence assumptions are more realistic. However, relative to the prior, most behaviors actually reduce the probability of Value, rather than increase it!

After investigation, we realized that although users were not technically required to click or open the notification before clicking SLO, in practice, they were more likely to do so, probably because they needed to gain information before making an assessment. We found that explicitly modeling the connection, i.e. adding arrows from Click and Open to SLO, was critical for making reasonable inferences. We believe this takeaway will apply across recommender systems. The user never has perfect information and may need to engage with an item before providing explicit feedback [2]. It is important to model the relationship between information-gaining behavior and explicit feedback in the Bayesian network.

Our full model satisfies the theoretical relationships we expect. All the behaviors that we expect to increase the probability of Value 𝑉 do indeed do so. Furthermore, the relative strength of different types of behavior seems reasonable as well, e.g. P(𝑉 = 1 | Fav = 1) and P(𝑉 = 1 | RT = 1) are higher than P(𝑉 = 1 | VidWatch = 1) and P(𝑉 = 1 | LinkClick = 1).

The full model also makes more nuanced theoretical inferences. Recall that UAM is whether or not the user had high user active minutes after either clicking the notification from the notifications tab or opening the notification from their phone home screen. The model learns that UAM is a highly indicative signal after Open, but not after Click: P(𝑉 = 1 | Open = 1, UAM = 1) = 0.906 and P(𝑉 = 1 | Click = 1, UAM = 1) = 0.641. This makes sense because if the user clicks from the notifications tab, it means they were already in the app, and it is difficult to attribute their high UAM to the notification in particular. On the other hand, if the user enters the app because of the notification, the attribution is much more direct.

It is clear that manually specifying the inferences our model makes would be very difficult. The advantage of our approach is that after specifying (a) the anchor variable and (b) the Bayesian network, we can automatically learn these inferences from data. Further, the model is able to learn complex inferences (e.g. that UAM is more reliable after Open than Click) that would be impossible to specify under the typical linear weighting of behaviors.

5 ASSESSING VALIDITY
Thus far, we have described our framework for designing a measure of "value", which can be used as a principled replacement for the ad-hoc objectives ordinarily used in engagement optimization. How do we evaluate such a measure? Notably, we do not advocate evaluating the measure purely through engagement metrics. If we expect our measure of "value" to differ from engagement, then we cannot evaluate it by simply reporting engagement metrics. Instead, the assessment of any measure is necessarily an ongoing, multi-faceted, and interdisciplinary process.

To complete the presentation of our framework, we now discuss approaches to assess the validity [6–8] of a measurement. In the most recent (2014) edition of the Standards for educational and psychological testing, the handbook considered the gold standard on approaches to testing, there are five categories of evidence for validity [6]. We visit each in turn, and describe how they translate to the recommender system setting, using Twitter as an example.

Evidence based on content refers to whether the content of a measurement is sufficient to fully capture the target construct. For example, we may question whether a measure of "socio-economic status" that includes income, but does not account for wealth, accurately captures the content of the construct [9].
In the recommender engine setting, content-based evidence asks us to reflect on whether the behaviors available on the platform are sufficient to capture a worthy notion of the construct "value". For example, if the only behavior observed on the platform were clicks by the user, then we may be skeptical of any measurement of "value" derived from user behavior. What content-based evidence makes clear is that to measure any worthy notion of "value", it is essential to design platforms in which users are empowered with richer channels of feedback. Otherwise, no measurement derived from user behavior will accurately capture the construct.

Evidence based on cognitive processes. Measurements derived from human behavior are often based on implicit assumptions about the cognitive processes subjects engage in. Cognitive process evidence refers to evidence about such assumptions, often derived from explicit studies with subjects. For example, consider a reading comprehension test. We assume that high-scoring students succeed by using critical reading skills, rather than a superficial heuristic like picking the answers with the longest length. To gain evidence about whether this assumption holds, we might, for instance, ask students to take the test while verbalizing what they are thinking. Similarly, in the recommender engine setting, we want to verify whether user behaviors occur for the reasons we think they do. On Twitter, one might think to use Favorite as an anchor for Value 𝑉, assuming that P(𝑉 = 1 | Favorite = 1) ≈ 1. However, users actually favorite items for reasons that may not reflect value, like to bookmark a tweet or to stop a conversation. Cognitive process evidence highlights the importance of user research in assessing the validity of any measure of "value".

Evidence based on internal structure refers to whether the observations the measurement is derived from conform to expected theoretical relationships. For example, for a test with questions which we expect to be of increasing difficulty, we would assess whether students actually perform worse on later questions, compared to earlier ones. In the recommender system context, we may have expectations on which types of user behaviors should provide stronger signal for value. In Section 4, we evaluated internal structure by comparing P(𝑉 = 1 | Behavior = 1) for all behaviors.

Evidence based on relations with other variables is concerned with the relationships between the measurement and other variables that are external to the measurement. The external variables could be variables which the measurement is expected to be similar to or predict, as well as variables which the measurement is expected to differ from. For example, a new measure of depression should correlate with other, existing measures of depression, but correlate less with measures of other disorders. In the recommender system context, we might look at whether our derived measurement of "value" is predictive of answers that users give in explicit surveys about content they value. We could also verify that our measure of "value" does not differ based on protected attributes, like the sex or race of the author of the content.

Evidence based on consequences. Finally, the consequences of a measurement cannot be separated from its validity. Consider a test to measure student mathematical ability. The test is used to sort students into beginner or advanced classes with the hypothesis that all students will do better after being sorted into their appropriate class. If it turns out that students sorted by the test do not perform better, that may give us reason to reassess the original test. In the recommender system context, if we find that after using our measurement of value to optimize recommendations, more users complain or quit the platform, then we would have reason to revise our measurement.

6 SUMMARY
We have presented a framework for designing an objective function that captures a desired notion of "value". In line with the principles of measurement theory, we treat "value" as a theoretical construct which must be operationalized. Our framework allows the designer to operationalize "value" in a principled manner by specifying only an anchor variable and the structure of the Bayesian network. Through these two choices, the designer has the flexibility to give "value" subtly different meanings.

We applied our approach on the Twitter platform on millions of users. We do not, as is typical of papers on recommendation, report engagement metrics. The reason is that if we expect our measure of "value" to differ from engagement, we cannot evaluate it simply by reporting engagement metrics. Instead, we discussed established ways to assess the validity of a measurement and how they translate to the recommendation system setting. For the scope of this work, we focused on assessing evidence based on internal structure and found that our measure of "value" satisfied many desired theoretical relationships.

ACKNOWLEDGEMENTS
We thank Naz Erkan for giving us the opportunity and freedom to conduct this work through her bold leadership and savvy managerial support. We thank Prakhar Biyani for his extensive effort in helping us apply our approach at scale at Twitter. We thank Tom Everitt for feedback on a draft of the paper.

REFERENCES
[1] Michael D Ekstrand and Martijn C Willemsen. Behaviorism is not enough: Better recommendations through listening to users. In Proceedings of the 10th ACM Conference on Recommender Systems, pages 221–224, 2016.
[2] Hongyi Wen, Longqi Yang, and Deborah Estrin. Leveraging post-click feedback for content recommendations. In Proceedings of the 13th ACM Conference on Recommender Systems, pages 278–286, 2019.
[3] David J Hand. Measurement Theory and Practice: The World Through Quantification. Arnold, London, 2004.
[4] Simon Jackman. Measurement. In The Oxford Handbook of Political Methodology, chapter 9. Oxford University Press, 2009. ISBN 9780199286546.
[5] Judea Pearl. Causality. Cambridge University Press, 2009.
[6] American Educational Research Association, American Psychological Association, National Council on Measurement in Education, Joint Committee on Standards for Educational and Psychological Testing. Standards for Educational and Psychological Testing. AERA, 2014.
[7] Samuel Messick. Validity. ETS Research Report Series, 1987(2):i–208, 1987.
[8] Todd D Reeves and Gili Marbach-Ad. Contemporary test validity in theory and practice: A primer for discipline-based education researchers. CBE—Life Sciences Education, 15(1):rm1, 2016.
[9] Abigail Z Jacobs and Hanna Wallach. Measurement and fairness. arXiv preprint arXiv:1912.05511, 2019.
[10] Sanjeev Arora, Rong Ge, and Ankur Moitra. Learning topic models: Going beyond SVD. In 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science, pages 1–10. IEEE, 2012.
[11] Sanjeev Arora, Rong Ge, Yonatan Halpern, David Mimno, Ankur Moitra, David Sontag, Yichen Wu, and Michael Zhu. A practical algorithm for topic modeling with provable guarantees. In International Conference on Machine Learning, pages 280–288, 2013.
[12] Yoni Halpern, Steven Horng, Youngduck Choi, and David Sontag. Electronic medical record phenotyping using the anchor and learn framework. Journal of the American Medical Informatics Association, 23(4):731–740, 2016.
[13] Yoni Halpern, Steven Horng, and David Sontag. Clinical tagging with joint probabilistic models. In Conference on Machine Learning for Health Care, 2016.
[14] Judea Pearl. On measurement bias in causal inference. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, pages 425–432. AUAI Press, 2010.
[15] Manabu Kuroki and Judea Pearl. Measurement bias and effect restoration in causal inference. Biometrika, 101(2):423–437, 2014.
[16] Kenneth J Rothman, Sander Greenland, and Timothy L Lash. Modern Epidemiology, volume 3. Wolters Kluwer Health/Lippincott Williams & Wilkins, Philadelphia, 2008.
P(𝑉 = 1 | Behavior = 1)

Behavior       Naive Bayes   Click, Open ↛ SLO   Full Model
OptOut         -99.74        -0.932              -0.072
Click          -1.194        0.316               0.652
Open           -0.366        0.442               0.685
UAM            -1.092        0.157               0.719
VidWatch       -0.475        0.254               0.772
Linger > 6s    -0.525        0.264               0.802
LinkClick      -0.302        0.320               0.836
Reply          0.358         0.570               0.932
Linger > 12s   -0.254        0.245               0.948
Fav            0.579         0.672               0.949
RT             0.680         0.720               0.956
Linger > 20s   0.019         0.296               0.991
Quote          1.0           1.0                 1.0

Table 2: The same inferences as in Table 1, except without clamping to [0, 1].