Arthur Charpentier
Insurance, Biases, Discrimination and Fairness
Springer Actuarial
Editors-in-Chief
Hansjoerg Albrecher, Department of Actuarial Science, University of Lausanne,
Lausanne, Switzerland
Michael Sherris, School of Risk & Actuarial, UNSW Australia, Sydney, NSW,
Australia
Series Editors
Daniel Bauer, Wisconsin School of Business, University of Wisconsin-Madison,
Madison, WI, USA
Stéphane Loisel, ISFA, Université Lyon 1, Lyon, France
Alexander J. McNeil, University of York, York, UK
Antoon Pelsser, Maastricht University, Maastricht, The Netherlands
Gordon Willmot, University of Waterloo, Waterloo, ON, Canada
Hailiang Yang, Department of Statistics & Actuarial Science, The University of
Hong Kong, Hong Kong, Hong Kong
This is a series on actuarial topics in a broad and interdisciplinary sense, aimed at
students, academics and practitioners in the fields of insurance and finance.
Springer Actuarial informs timely on theoretical and practical aspects of topics like risk management, internal models, solvency, asset-liability management,
market-consistent valuation, the actuarial control cycle, insurance and financial
mathematics, and other related interdisciplinary areas.
The series aims to serve as a primary scientific reference for education, research,
development and model validation.
The type of material considered for publication includes lecture notes, monographs and textbooks. All submissions will be peer-reviewed.
Arthur Charpentier
Department of Mathematics
UQAM
Montreal, QC, Canada
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland
AG 2024
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
“14 litres d’encre de chine, 30 pinceaux, 62 crayons à mine grasse, 1 crayon à mine
dure, 27 gommes à effacer, 38 kilos de papier, 16 rubans de machine à écrire, 2
machines à écrire, 67 litres de bière ont été nécessaires à la réalisation de cette
aventure,” Goscinny and Uderzo (1965), Astérix et Cléopâtre.1
Discrimination is a complicated concept. The most neutral definition is, accord-
ing to Merriam-Webster (2022), simply “the act (or power) of distinguishing.”
Amnesty International (2023) adds that it is an “unjustified distinction.” And indeed,
most of the time, the word has a negative connotation, because discrimination
is associated with some prejudice. An alternative definition, still according to
Merriam-Webster (2022), is that discrimination is the “act of discriminating cate-
gorically rather than individually.” This corresponds to “statistical discrimination”
but also actuarial pricing. Because actuaries do discriminate. As Lippert-Rasmussen
(2017) clearly states, “insurance discrimination seems immune to some of the stan-
dard objections to discrimination.” Avraham (2017) goes further: “what is unique
about insurance is that even statistical discrimination which by definition is absent
of any malicious intentions, poses significant moral and legal challenges. Why?
Because on the one hand, policy makers would like insurers to treat their insureds
equally, without discriminating based on race, gender, age, or other characteristics,
even if it makes statistical sense to discriminate (...) On the other hand, at the core
of insurance business lies discrimination between risky and non-risky insureds. But
riskiness often statistically correlates with the same characteristics policy makers
would like to prohibit insurers from taking into account.” This is precisely the
purpose of this book, to dig further into those issues, to understand the seeming
oxymoron “fair discrimination” used in insurance, to weave together the multiple
perspectives that have been posed on discrimination in insurance, linking a legal
and a statistical view, an economic and an actuarial view, all in a context where
computer scientists have also recently brought an enlightened eye to the question
1 14 liters of ink, 30 brushes, 62 grease pencils, 1 hard pencil, 27 erasers, 38 kilos of paper, 16
typewriter ribbons, 2 typewriters, 67 liters of beer were necessary to realize this adventure.
2 The ILB (Institut Louis Bachelier) is a nonprofit organization created in 2008. Its activities are
aimed at engaging academic researchers, as well as public authorities and private companies in
research projects in economics and finance with a focus on four societal transitions: environmental,
digital, demographic, and financial. The ILB is, thus, fully involved in the design of research
programs and initiatives aimed at promoting sustainable development in economics and finance.
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 A Brief Overview on Discrimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Discrimination? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 Legal Perspective on Discrimination . . . . . . . . . . . . . . . . . . . . . . 4
1.1.3 Discrimination from a Philosophical Perspective . . . . . . . . . 5
1.1.4 From Discrimination to Fairness. . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1.5 Economics Perspective on Efficient Discrimination . . . . . . 10
1.1.6 Algorithmic Injustice and Fairness
of Predictive Models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.1.7 Discrimination Mitigation and Affirmative Action . . . . . . . 15
1.2 From Words and Concepts to Mathematical Formalism . . . . . . . . . . . 15
1.2.1 Mathematical Formalism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.2.2 Legitimate Segmentation and Unfair Discrimination . . . . . 16
1.3 Structure of the Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.4 Datasets and Case Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Part II Data
5 What Data?. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
5.1 Data (a Brief Introduction). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
5.2 Personal and Sensitive Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
5.2.1 Personal and Nonpersonal Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
5.2.2 Sensitive and Protected Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
5.2.3 Sensitive Inferences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
5.2.4 Privacy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
5.2.5 Right to be Forgotten . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
5.3 Internal and External Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
5.3.1 Internal Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
5.3.2 Connecting Internal and External Data . . . . . . . . . . . . . . . . . . . . 192
5.3.3 External Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
5.4 Typology of Ratemaking Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
5.4.1 Ratemaking Variables in Motor Insurance . . . . . . . . . . . . . . . . 199
5.4.2 Criteria for Variable Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
5.4.3 An Actuarial Criterion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
5.4.4 An Operational Criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
5.4.5 A Criterion of Social Acceptability . . . . . . . . . . . . . . . . . . . . . . . . 204
5.4.6 A Legal Criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
5.5 Behaviors and Experience Rating . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
5.6 Omitted Variable Bias and Simpson’s Paradox . . . . . . . . . . . . . . . . . . . . . 205
5.6.1 Omitted Variable in a Linear Model . . . . . . . . . . . . . . . . . . . . . . . 205
5.6.2 School Admission and Affirmative Action . . . . . . . . . . . . . . . . 207
5.6.3 Survival of the Sinking of the Titanic. . . . . . . . . . . . . . . . . . . . . . 208
5.6.4 Simpson’s Paradox in Insurance . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
5.6.5 Ecological Fallacy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
5.7 Self-Selection, Feedback Bias, and Goodhart’s Law . . . . . . . . . . . . . . . 211
5.7.1 Goodhart’s Law . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
5.7.2 Other Biases and “Dark Data” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
6 Some Examples of Discrimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
6.1 Racial Discrimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
6.1.1 A Sensitive Variable Difficult to Define . . . . . . . . . . . . . . . . . . . 218
6.1.2 Race and Risk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
6.2 Sex and Gender Discrimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
6.2.1 Sex or Gender? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
6.2.2 Sex, Risk and Insurance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
6.2.3 The “Gender Directive” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
6.3 Age Discrimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
6.3.1 Young or Old? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
Part IV Mitigation
10 Pre-processing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
10.1 Removing Sensitive Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386
10.2 Orthogonalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386
10.2.1 General Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386
10.2.2 Binary Sensitive Attribute . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
10.3 Weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 388
10.4 Application to toydata2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
10.5 Application to the GermanCredit Dataset . . . . . . . . . . . . . . . . . . . . . . . 393
11 In-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
11.1 Adding a Group Discrimination Penalty . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
11.2 Adding an Individual Discrimination Penalty . . . . . . . . . . . . . . . . . . . . . . 400
11.3 Application on toydata2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
11.3.1 Demographic Parity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
11.3.2 Equalized Odds and Class Balance . . . . . . . . . . . . . . . . . . . . . . . . 403
11.4 Application to the GermanCredit Dataset . . . . . . . . . . . . . . . . . . . . . . . 407
11.4.1 Demographic Parity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407
11.4.2 Equalized Odds and Class Balance . . . . . . . . . . . . . . . . . . . . . . . . 410
12 Post-Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
12.1 Post-Processing for Binary Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
12.2 Weighted Averages of Outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
12.3 Average and Barycenters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
12.4 Application to toydata1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 479
Mathematical Notation

1 : indicator, 1_A(x) = 1(x ∈ A) = 1 if x ∈ A, 0 otherwise; or vector of ones, 1 = (1, 1, ..., 1) ∈ R^n, in linear algebra
I : identity matrix, square matrix with 1 entries on the diagonal, 0 elsewhere
{A, B} : values taken by a binary sensitive attribute s
A : adjacency matrix
a_j(x_j*) : accumulated local function, for variable j at location x_j*
ACC : accuracy, (TP+TN)/(P+N)
argmax : arguments (of the maxima)
ATE : average treatment effect
AUC : area under the (ROC) curve
B(n, p) : binomial distribution, or Bernoulli B(p) when n = 1
C : ROC curve, t → TPR ∘ FPR⁻¹(t)
C̄ : convex hull of the ROC curve
CATE : conditional average treatment effect
Cor[X, Y], r : Pearson's correlation, r = Cor[X, Y] = Cov[X, Y] / √(Var[X] · Var[Y])
Cov[X, Y] : covariance, Cov[X, Y] = E[(X − E[X])(Y − E[Y])]
d(x₁, x₂) : distance between two points in X
D(p₁, p₂) : divergence
d : vector of degrees, in a network
D : training dataset, or D_n
Δ_{j|S}(x*) : contribution of the j-th variable, at x*, conditional on a subset of variables, S ⊂ {1, ..., k}\{j}
do(X = x) : do operator (for an intervention in causal inference)
E[X] : expected value (under probability P)
f : density associated with cumulative distribution function F
F : cumulative distribution function, with respect to probability P
F⁻¹ : (generalized) quantile, F⁻¹(p) = inf{x ∈ R : p ≤ F(x)}
FN : false negative (from confusion matrix)
FNR : false-negative rate, also called miss rate
FP : false positive (from confusion matrix)
FPR : false-positive rate, also called fall-out
FPR(t) : function [0, 1] → [0, 1], FPR(t) = P[m(X) > t | Y = 0]
γ : Gini mean difference, E|Y − Y′| where Y, Y′ ∼ F are independent copies
γ_j^bd(x*) : breakdown contribution of the j-th variable, for individual x*
γ_j^shap(x*) : Shapley contribution of the j-th variable, for individual x*
G : Gini index
G : some graph, with nodes and edges
GUI : group fairness index
i : index for individuals (usually, i ∈ {1, ..., n})
j : index for variables (usually, j ∈ {0, 1, ..., k})
k : number of features used in a model
L : Lorenz curve
L : likelihood, L(θ; y)
ℓ(y₁, y₂) : loss function, as in ℓ(y, ŷ)
ℓ₁ : absolute deviation loss function, ℓ₁(y, ŷ) = |y − ŷ|
ℓ₂ : quadratic loss function, ℓ₂(y, ŷ) = (y − ŷ)²
ℓ_j(x_j*) : local dependence plot, for variable j at location x_j*
λ : tuning parameter associated with a penalty in an optimization problem (Lagrangian)
log : natural logarithm (with log(eˣ) = x, ∀x ∈ R)
m(z) : predictive model, m : Z → Y, possibly a score in [0, 1]
m̂(z) : fitted predictive model from data D (collection of (y_i, z_i)'s)
m_t(z) : classifier based on model m(·) and threshold t ∈ (0, 1), m_t : Z → {0, 1}, m_t = 1(m > t)
m_{x*,j}(z) : ceteris paribus profile
M : set of possible predictive models
μ(x) : regression function, E[Y | X = x]
n : number of observations in the training sample
n_A, n_B : number of observations in the training sample, respectively with s = A and s = B
N : set of natural numbers, or non-negative integers (0 ∈ N)
N : normal (Gaussian) distribution, N(μ, σ²) or N(μ, Σ)
P : true probability measure
p : probability, p ∈ [0, 1]
p_j(x_j*) : partial dependence plot, for variable j at location x_j*
P(λ) : Poisson distribution with average λ
PDP : partial dependence plot
PPV : precision, or positive predictive value (from confusion matrix), TP/(TP+FP)
Π_Z : orthogonal projection matrix, Π_Z = Z(Z⊤Z)⁻¹Z⊤
Π(p, q) : set of multivariate distributions with “margins” p and q
Q : some probability measure
r(X, Y) : Pearson's correlation
r*(X, Y) : maximal correlation
R : set of real numbers
R^d : standard vector space of dimension d
R^{n₀×n₁} : set of real-valued matrices n₀ × n₁
R(m) : risk of a model m, associated with loss ℓ
R̂_n(m) : empirical risk of a model m, for a sample of size n
S, s : sensitive attribute
s : collection of sensitive attributes, in {0, 1}^n
s, S : scoring rule, and expected scoring rule (in Sect. 4.2)
S : usually {A, B} in the book, or {0, 1}, so that s ∈ S
S_d : standard probability simplex (S_d ⊂ R^d)
T : treatment variable in causal inference
T : transport / coupling mapping, X → X or Y → Y
T# : push-forward operator, P₁(A) = T#P₀(A) = P₀(T⁻¹(A))
t : threshold, cut-off for a classifier, ŷ = 1(m(x) > t)
TN : true negative (from confusion matrix)
TNR : true-negative rate, also called specificity or selectivity, TN/(TN+FP)
TP : true positive (from confusion matrix)
TPR : true-positive rate, also called sensitivity or recall, TP/(TP+FN)
TPR(t) : function [0, 1] → [0, 1], TPR(t) = P[m(X) > t | Y = 1]
θ, 𝜽 : latent unobservable risk factor (𝜽 for multivariate latent factors) or unknown parameter in a parametric model
u : utility function
U : uniform distribution
U(a₀, a₁) : set of matrices {M ∈ R₊^{n₀×n₁} : M 1_{n₁} = a₀ and M⊤ 1_{n₀} = a₁}
U_{n₀,n₁} : set of matrices U(1_{n₀}/n₀, 1_{n₁}/n₁)
V : some value function on a subset of indices, in {1, ..., k}
Var[X] : variance, Var[X] = E[(X − E[X])²]
Var[X] : covariance matrix (for a random vector X), Var[X] = E[(X − E[X])(X − E[X])⊤]
W : Wasserstein distance (W₂ if no index is specified)
w, ω : weight (ω ≥ 0) or wealth (in the economic model)
Ω : theoretical sample space associated with a probabilistic space
𝛀 : weight matrix
x, x_i : collection of explanatory variables for a single individual, in X ⊂ R^k
x_j : collection of observations, for variable j
X : subset of R^k, so that x ∈ X = X₁ × ... × X_k
Y, y : variable of interest
Y^{T←t} : potential outcome of y if treatment T had taken value t
y : collection of observations, in Y^n
ŷ : prediction of the variable of interest
Y : subset of R, so that y ∈ Y but also ŷ, m(x) ∈ Y
Z, z : information, z = (x, s), including legitimate and protected features
z : collection of observations z = (x, s), in Z^n
Z : set X × S
1.1.1 Discrimination?
Fig. 1.1 Map (freely) inspired by a Home Owners’ Loan Corporation map from 1937, where red
is used to identify neighborhoods where investment and lending were discouraged, on the left-
hand side (see Crossney 2016 and Rhynhart 2020). In the middle, some risk-related variable (a
fictitious “unsanitary index”) per neighborhood of the city is presented, and on the right-hand side,
a sensitive variable (the proportion of Black people in the neighborhood). Those maps are fictitious
(see Charpentier et al. 2023b)
In the 1970s, when looking at census data, sociologists noticed that red areas,
where insurers did not want to offer coverage, were also those with a high
proportion of Black people, and following the work of John McKnight and Andrew
Gordon, “redlining” received more interest. On the map in the middle, we can
observe information about the proportion of Black people. Thus, on the one hand,
it could be seen as “legitimate” to have a premium for households that could
reflect somehow the general conditions of houses. On the other hand, it would
be discriminatory to have a premium that is a function of the ethnic origin of the
policyholder. Here, the neighborhood, the “unsanitary index,” and the proportion
of Black people are strongly correlated variables. Of course, there could be non-
Black people living in dilapidated houses outside of the red area, Black people living
in wealthy houses inside the red area, etc. If we work using aggregated data, it is
difficult to disentangle information about sanitary conditions and racial information,
to distinguish “legitimate” and “nonlegitimate” discrimination, as discussed in
Hellman (2011). Note that within the context of “redlining,” the utilization of census
and aggregated data may introduce the potential for the occurrence of an “ecological
fallacy” (as discussed in King et al. (2004) or Gelman (2009)). In the 2020s, we now
have much more information (so called “big data era”) and more complex models
(machine-learning literature), and we will see how to disentangle this complex
problem, even if dealing with discrimination in insurance is probably still an ill-
defined unsolvable problem, with strong identification issues. Nevertheless, as we
will see, there are many ways of looking at this problem, and we try, here, to connect
the dots, to explain different perspectives.
In Kansas, more than 100 years ago, a law was passed, allowing an insurance
commissioner to review rates to ensure that they were not “excessive, inadequate, or
unfairly discriminatory with regards to individuals,” as mentioned in Powell (2020).
Since then, the idea of “unfairly discriminatory” insurance rates has been discussed
in many States in the USA (see Box 1.1).
other status” (see Joseph and Castan 2013). Such lists do not really address the
question of what discrimination is. But looking for common features among those
variables can be used to explain what discrimination is. For instance, discrimination
is necessarily oriented toward some people based on their membership of a
certain type of social group, with reference to a comparison group. Hence, our
discourse should not center around the absolute assessment of how effectively an
individual within a specific group is treated but rather on the comparison of the
treatment that an individual receives relative to someone who could be perceived
as “similar” within the reference group. Furthermore, the significance of this
reference group is paramount, as discrimination does not merely entail disparate
treatment, it necessitates the presence of a favored group and a disfavored group,
thus characterizing a fundamentally asymmetrical dynamic. As Altman (2011),
wrote, “as a reasonable first approximation, we can say that discrimination consists
of acts, practices, or policies that impose a relative disadvantage on persons based
on their membership in a salient social group.”
As mentioned already, we should not expect to have universal rules about discrim-
ination. For instance, Supreme Court Justice Thurgood Marshall claimed once that
“a sign that says ‘men only’ looks very different on a bathroom door than on a
courthouse door,” as reported in Hellman (2011). Nevertheless, philosophers have
suggested definitions, starting with a distinction between “direct” and “indirect”
discrimination. As mentioned in Lippert-Rasmussen (2014), it would be too simple
to consider direct discrimination as intentional discrimination. A classic example
would be a paternalistic employer who intends to help women by hiring them
only for certain jobs, or for a promotion, as discussed in Jost et al. (2009). In that
case, acts of direct discrimination can be unconscious in the sense that agents are
unaware of the discriminatory motive behind decisions (related to the “implicit bias”
discussed in Brownstein and Saul (2016a,b)). Indirect discrimination corresponds to
decisions with disproportionate effects, that might be seen as discriminatory even if
that is not the objective of the decision process mechanism. A standard example
could be the one where the only way to enter a public building is by a set of
stairs, which could be seen as discrimination against people with disabilities who
use wheelchairs, as they would be unable to enter the building; or if there were a
minimum height requirement for a job where height is not relevant, which could
be seen as discrimination against women, as they are generally shorter than men.
On the one hand, for Young (1990), Cavanagh (2002), or Eidelson (2015), indirect
discrimination should not be considered discrimination, which should be strictly
limited to “intentional and explicitly formulated policies of exclusion or preference.”
For Cavanagh (2002), in many cases, “it is not discrimination they object to, but its
effects; and these effects can equally be brought about by other causes.” On the other
hand, Rawls (1971) considered structural indirect discrimination, that is, when the
Humans have an innate sense of fairness and justice: studies show that even 3-year-old children consider merit when sharing rewards (Kanngiesser and Warneken 2012), as do chimpanzees and primates (Brosnan 2006), and many other animal species. And given that this trait is largely innate, it is difficult to define what is “fair,” although many scientists have attempted to define notions of “fair” sharing, as Brams et al. (1996) recall.
In a first sense, “fair” refers to legality (and to human justice, translated into a set of laws and regulations); in a second sense, “fair” refers to an ethical or moral concept (and to an idea of natural justice). The second reading of the word
“fairness” is the most important here. According to one dictionary, fairness “consists
in attributing to each person what is due to him by reference to the principles of
natural justice.” And being “just” raises questions related to ethics and morality
(we do not differentiate here between ethics and morality).
This has to be related to a concept introduced in Feinberg (1970), called “desert
theory,” corresponding to the moral obligation that good actions must lead to better
results. A student should deserve a good grade by virtue of having written a good
paper, the victim of an industrial accident should deserve substantial compensation
owing to the negligence of his or her employer. For Leibniz or Kant, a person
is supposed to deserve happiness in virtue for being morally good. In Feinberg
(1970)’s approach, “deserts” are often seen as positive, but they are also sometimes
negative, like fines, dishonor, sanctions, condemnations, etc. (see Feldman (1995),
Arneson (2007) or Haas (2013)). The concept of “desert” generally consists of a
relationship among three elements: an agent, a deserved treatment or good, and the
basis on which the agent is deserving.
We evoke in this book the “ethics of models,” or, as coined by Mittelstadt et al.
(2016) or Tsamados et al. (2021), the “ethics of algorithms.” A nuance exists with
respect to the “ethics of artificial intelligence,” which deals with our behavior or
choices (as human beings) in relation to autonomous cars, for example, and which
will attempt to answer questions such as “should a technology be adopted if it
is more efficient?” The ethics of algorithms questions the choices made “by the
machine” (even if they often reflect choices—or objectives—imposed by the person
who programmed the algorithm), or by humans, when choices can be guided by
some algorithm.
Programming an algorithm in an ethical way must be done according to a certain
number of standards. Two types of norms are generally considered by philosophers.
The first is related to conventions, i.e., the rules of the game (chess or Go), or the
rules of the road (for autonomous cars). The second is made up of moral norms,
which must be respected by everyone, and are aimed at the general interest. These
norms must be universal, and therefore not favor any individual, or any group of
individuals. This universality is fundamental for Singer (2011), who asks not to
judge a situation with his or her own perspective, or that of a group to which one
belongs, but to take a “neutral” and “fair” point of view.
1 See https://avalon.law.yale.edu/18th_century/rightsof.asp.
2 See https://www.moralmachine.net/.
outside the dedicated crossings). These questions are important for self-driving cars,
as mentioned by Thornton et al. (2016).
For a philosopher, the question “How fair is this model to this group?” will
always be followed by “How fair by what normative principle?” Measuring the
overall effects on all those affected by the model (and not just the rights of a few) will
lead to incorporating measures of fairness into an overall calculation of social costs
and benefits. If we choose one approach, others will suffer. But this is the nature
of moral choices, and the only responsible way to mitigate negative headlines is to
develop a coherent response to these dilemmas, rather than ignore them. To speak
of the ethics of models poses philosophical questions from which we cannot free
ourselves, because, as we have said, a model is aimed at representing reality, “what
is.” To fight against discrimination, or to invoke notions of fairness, is to talk about
“what should be.” We are once again faced with the famous opposition of Hume
(1739). It is a well-known property of statistical models, as well as of machine-
learning ones. As Chollet (2021) wrote: “Keep in mind that machine learning can
only be used to memorize patterns that are present in your training data. You can
only recognize what you’ve seen before. Using machine learning trained on past
data to predict the future is making the assumption that the future will behave
like the past.” For when we speak of “norm,” it is important not to confuse the
descriptive and the normative, or with other words, statistics (which tells us how
things are) and ethics (which tells us how things should be). Statistical law is about
“what is” because it has been observed to be so (e.g., humans are bigger than
dogs). Human (divine, or judicial) law pertains to what is because it has been decreed, and therefore ought to be (e.g., humans are free and equal or humans are
good). One can see the “norm” as a regularity of cases, observed with the help of
frequencies (or averages, as mentioned in the next chapter), for example, on the
height of individuals, the length of sleep, in other words, data that make up the
description of individuals. Therefore, anthropometric data have made it possible
to define, for example, an average height of individuals in a given population,
according to their age; in relation to this average height, a deviation of 20% more
or less determines gigantism or dwarfism. If we think of road accidents, it may be
considered “abnormal” to have a road accident in a given year, at an individual
(micro) level, because the majority of drivers do not have an accident. However,
from the insurer’s perspective (macro), the norm is that 10% of drivers have an
accident. It would therefore be abnormal for no one to have an accident. This is
the argument found in Durkheim (1897). From the singular act of suicide, if it is
considered from the point of view of the individual who commits it, Durkheim tries
to see it as a social act, therefore falling within a real norm, within a given society.
From then on, suicide becomes, according to Durkheim, a “normal” phenomenon.
Statistics then make it possible to quantify the tendency to commit suicide in a
given society, as soon as one no longer observes the irregularity that appears in the
singularity of an individual history, but a “social normality” of suicide. Abnormality
is defined as “contrary to the usual order of things” (this might be considered an
empirical, statistical notion), or “contrary to the right order of things” (this notion of
right probably implies a normative definition), but also not conforming to the model.
While jurists used the term “rational discrimination,” economists used the term “efficient” or “statistical discrimination,” as in Phelps (1972) or Arrow (1973), following early work by Edgeworth (1922). Following Becker (1957), economists have tended to define discrimination as a situation where people who are “the
same” (with respect to legitimate covariates) are treated differently. Hence, a
“discrimination” corresponds here to some “disparity,” but we will frequently
use the term “discrimination.” More precisely, it is necessary to distinguish two
standards. One standard corresponds to “disparate treatment,” corresponding to
“any economic agent who applies different rules to people in protected groups is
practicing discrimination” as defined in Yinger (1998). The second discriminatory
standard corresponds to “disparate impact.” This corresponds to practices that seem
to be neutral, but have the effect of disadvantaging one group more than others.
Definition 1.4 (Disparate Treatment (Merriam-Webster 2022)) Disparate treat-
ment corresponds to the treatment of an individual (as an employee or prospective
juror) that is less favorable than treatment of others for discriminatory reasons (such
as race, religion, national origin, sex, or disability).
Definition 1.5 (Disparate Impact (Merriam-Webster 2022)) Disparate impact
corresponds to an unnecessary discriminatory effect on a protected class caused by
a practice or policy (as in employment or housing) that appears to be nondiscrimi-
natory.
In labor economics, wages should be a function of productivity, which is
unobservable when signing a contract, and therefore, as discussed in Riley (1975),
Kohlleppel (1983) or Quinzii and Rochet (1985), employers try to find signals.
As claimed in Lippert-Rasmussen (2013), statistical discrimination occurs when
“there is statistical evidence which suggests that a certain group of people differs
from other groups in a certain dimension, and its members are being treated
disadvantageously on the basis of this information.” Those signals are observable
variables that correlate with productivity.
In the most common version of the model, employers use observable group
membership as a proxy for unobservable skills, and rely on their beliefs about pro-
Fig. 1.2 Two analyses of the same descriptive statistics of compas data, with the number of defendants as a function of (1) the race of the defendant (Black and white), (2) the risk category obtained from a classifier (binary: low and high), and (3) the indicator that the defendants re-offended, or not. On the left-hand side, the analysis of Dieterich et al. (2016) and on the right-hand side, that of Feller et al. (2016)
recidivism than they actually were.” On the other hand (on the right-hand side of
the figure), as Dieterich et al. (2016) observed:
• For Black people, among those who were classified as high risk, 35% did not
re-offend.
• For white people, among those who were classified as high risk, 40% did not
re-offend.
Therefore, as the rate of recidivism is approximately equal at each risk score level,
irrespective of race, it should not be claimed that the algorithm is racist. The initial
approach is called “false positive rate parity,” whereas the second one is called
“predictive parity.” Obviously, there are reasonable arguments in favor of both
contradictory positions. From this simple example, we see that having a valid and
common definition of “fairness” or “parity” will be complicated.
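To see concretely how the two criteria can diverge, here is a small R sketch with made-up confusion-matrix counts for two groups (illustrative numbers, not the actual compas figures): predictive parity holds, since the positive predictive values are equal, while the false-positive rates differ markedly.
> PPV <- function(TP, FP) TP / (TP + FP)   # P[re-offend | classified high risk]
> FPR <- function(FP, TN) FP / (FP + TN)   # P[classified high risk | did not re-offend]
> c(PPV(300, 150), FPR(150, 250))          # group 1: 0.667 and 0.375
> c(PPV(120, 60), FPR(60, 470))            # group 2: 0.667 and 0.113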
Since then, many books and articles have addressed the issues highlighted in this
article, namely the increasing power of these predictive decision-making tools, their
ever-increasing opacity, the discrimination they replicate (or amplify), the ‘biased’
data used to train or calibrate these algorithms, and the sense of unfairness they
produce. For instance, Kirkpatrick (2017) pointed out that “the algorithm itself may
not be biased, but the data used by predictive policing algorithms is colored by years
of biased police practices.”
And justice is not the only area where such techniques are used. In the context
of predictive health systems, Obermeyer et al. (2019) observed that a widely used
health risk prediction tool (predicting how sick individuals are likely to be, and the
associated health care cost), that is applied to roughly 200 million individuals in the
USA per year, exhibited significant racial bias. More precisely, 17.7% of patients
that the algorithm assigned to receive “extra care” were Black, and if the bias in the system were corrected for, as reported by Ledford (2019), the percentage would increase to 46.5%. Those “correction” techniques will be discussed in Part IV of this book,
when presenting “mitigation.”
Massive data, and machine-learning techniques, have provided an opportunity
to revisit a topic that has been explored by lawyers, economists, philosophers, and
statisticians for the past 50 years or longer. The aim here is to revisit these ideas, to
shed new light on them, with a focus on insurance, and explore possible solutions.
Lawyers, in particular, have discussed these predictive models, this “actuarial
justice,” as Thomas (2007), Harcourt (2011), Gautron and Dubourg (2015), or
Rothschild-Elyassi et al. (2018) coined it.
The idea of bias and algorithmic discrimination is not a new one, as shown for
instance by Pedreshi et al. (2008). However, over the past 20 years, the number of
examples has continued to increase, with more and more interest in the media. “AI
biases caused 80% of black mortgage applicants to be rejected” in Hale (2021),
or “How the use of AI risks recreating the inequity of the insurance industry of
the previous century” in Ito (2021). Pursuing David’s 2015 analysis, McKinsey
(2017) announced that artificial intelligence would disrupt the workplace (including
the insurance and banking sectors, Mundubeltz-Gendron (2019)) particularly to
3 Even if it seems exaggerated, because on the contrary, it is often humans who perform the
repetitive tasks to help robots: “in most cases, the task is repetitive and mechanical. One worker
explained that he once had to listen to recordings to find those containing the name of singer Taylor
Swift in order to teach the algorithm that it is a person” as reported by Radio Canada in April 2019.
4 Member of the Chamber of Deputies from 1885 to 1893 and then Prime Minister of France
The starting point of any statistical or actuarial model is to suppose that observations are realizations of random variables, in some probabilistic space (Ω, F, P) (see Rolski et al. (2009), for example, or any actuarial textbook). Therefore, let P denote the “true” probability measure, associated with random variables (Z, Y) = (S, X, Y). Here, features Z can be split into a couple (S, X), where X is the nonsensitive information whereas S is the sensitive attribute.5 Y is the outcome we want to model, which would correspond to the annual loss of a given insurance policy (insurance pricing), the indicator of a false claim (fraud detection), the number of visits to
5 For simplicity, in most of the book, we discuss the case where S is a single sensitive attribute.
the dentist (partial information for insurance pricing), the occurrence of a natural catastrophe (claims management), the indicator that the policyholder will purchase insurance from a competitor (churn model), etc. Thus, here, we have a triplet (S, X, Y), defined on S × X × Y, following some unknown distribution P. And classically, D_n = {(z_i, y_i)} = {(s_i, x_i, y_i)}, where i = 1, 2, ..., n, will denote a dataset, and P_n will denote the empirical probabilities associated with sample D_n.
In the previous section, we have tried to explain that there could be “legitimate” and
“illegitimate” discrimination, “fair” and “unfair.” We consider here a first attempt
to illustrate that issue, with a very simple dataset (with simulated data). Consider
a risk, and let y denote the occurrence of that risk (hence, y is binary). As we
discuss in Chap. 2, it is legitimate to ask policyholders to pay a premium that is
proportional to P[Y = 1], the probability that the risk occurs (which will be the idea of “actuarial fairness”). Assume now that this occurrence is related to a single feature x: the larger x, the more likely the risk will occur. A classic example could be the occurrence of the death of a person, where x is the age of that person. Here, the correlation between y and x comes from a common (unobserved) factor, x₀. In a small dataset, toydata1 (divided into a training dataset, toydata1_train, and a validation dataset, toydata1_validation), we have simulated values, where the confounding variable X₀ (that will not be observed, and therefore cannot be used in the modeling process) is a Gaussian variable, X₀ ∼ N(0, 1), and then

X = X₀ + ε, where ε ∼ N(0, (1/2)²),

while s and y are generated from X₀ through probit models.
The sensitive attribute s, which takes values 0 (or A) and 1 (or B), does not
influence y, and therefore it might not be legitimate to use it (it could be seen as
an “illegitimate discrimination”). Note that x₀ influences all variables, x, s, and y (with a probit model for the last two), and because of that unobserved confounding variable x₀, all variables are here (strongly) correlated. In Fig. 1.3, we can visualize
the dependence between x and y (via boxplots of x given y) on the left-hand side,
Fig. 1.3 On top, boxplot of x conditional on y, with y ∈ {0, 1} on the left-hand side, and conditional on s, with s ∈ {A, B} on the right-hand side, from the toydata1 dataset. Below, the curve on the left-hand side is x → P[Y = 1|X = x] whereas the curve on the right-hand side is x → P[S = A|X = x]. Hence, when x = −1, P[Y = 1|X = x] ∼ 25%, and therefore P[Y = 0|X = x] ∼ 75% (on the left-hand side), whereas when x = +1, P[S = A|X = x] ∼ 95%, and therefore P[S = B|X = x] ∼ 5% (on the right-hand side)
and between x and s (via boxplots of x given s) on the right-hand side. For example,
if x ∼ −1, then y takes values in {0, 1} with 75% and 25% chance, respectively. It is a 25% and 75% chance if x ∼ +1. Similarly, when x ∼ −1, s is four times more likely to be in group A than in group B.
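To make the construction concrete, here is a minimal R sketch of a comparable generating process (the probit specifications below are illustrative assumptions, not the exact values used to build toydata1):
> set.seed(123)
> n <- 600
> x0 <- rnorm(n)                         # unobserved confounding factor
> x <- x0 + rnorm(n, sd = 1/2)           # observed feature
> s <- factor(ifelse(runif(n) < pnorm(x0), "A", "B"), levels = c("B", "A"))
> y <- factor(rbinom(n, 1, pnorm(x0)))   # outcome, driven by x0 only, not by s
> df <- data.frame(x = x, s = s, y = y)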
When fitting a logistic regression to predict y based on both x and s, from
toydata1_train, observe that variable x is clearly significant, but not s (using
glm in R, see Sect. 3.3 for more details about standard classifiers, starting with the
logistic regression):
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.2983 0.2083 -1.432 0.152
x 1.0566 0.1564 6.756 1.41e-11 ***
s == A 0.2584 0.2804 0.922 0.357
Fitted without the sensitive attribute, the score is

m̂(x) = exp[−0.1390 + 1.1344 x] / (1 + exp[−0.1390 + 1.1344 x]).
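For completeness, a minimal sketch of the corresponding R calls (glm with family = binomial uses the logit link by default; the toydata1_train dataset is described in Sect. 1.4):
> library(InsurFair)
> fit_xs <- glm(y ~ x + s, data = toydata1_train, family = binomial)
> fit_x <- glm(y ~ x, data = toydata1_train, family = binomial)
> summary(fit_xs)$coefficients   # coefficient table for the model with x and s
> coef(fit_x)                    # intercept and slope of the score without s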
But it does not mean that this model is perceived as “fair” by everyone. In Fig. 1.4, we can visualize the probability that scores m̂ exceed a given threshold t, here 50%. Even without using s as a feature in the model, P[m̂(X) > t | S = s] does depend on s, whatever the threshold t. And if E[m̂(X)] ∼ 50%, observe that E[m̂(X) | S = A] ∼ 65% while E[m̂(X) | S = B] ∼ 25%. With our premium interpretation, it means that, on average, people who belong to group A pay a premium at least twice that paid by people in group B. Of course, ceteris paribus, this is not the case, as individuals with the same x have the same prediction, whatever s, but overall, we observe a clear difference. One can easily transfer this simple example to many real-life applications.
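These group-conditional quantities can be computed directly from the fitted score; a short sketch, reusing fit_x from the previous snippet, and assuming the validation sample is named toydata1_valid (see Sect. 1.4):
> m_hat <- predict(fit_x, newdata = toydata1_valid, type = "response")
> tapply(m_hat, toydata1_valid$s, mean)        # E[m(X) | S = A] versus E[m(X) | S = B]
> tapply(m_hat > .5, toydata1_valid$s, mean)   # P[m(X) > t | S = s], with t = 50%, cf. Fig. 1.4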
Throughout this book, we provide examples of such situations, then formalize
some measures of fairness, and finally discuss methods used to mitigate a possible
discrimination in a predictive model m̂, even if m̂ is not a function of the sensitive attribute (fairness through unawareness).
Fig. 1.4 Distribution of the score m̂(X, S), conditional on A and B, on the left-hand side, and distribution of the score m̂(X) without the sensitive variable, conditional on A and B, on the right-hand side (fictitious example). In both cases, logistic regressions are considered. From this score, we can get a classifier ŷ = 1(m̂(z) > t) (where z is either (x, s), on the left-hand side, or simply x, on the right-hand side). Here, we consider cut-off t = 50%. Areas on the right of the vertical line (at t = 50%) correspond to the proportion of individuals classified as ŷ = 1, in both groups, A and B
In the following chapters, and more specifically in Parts III and IV, we use both
generated data and publicly available real datasets to illustrate various techniques,
either to quantify a potential discrimination (in Part III) or to mitigate it (in Part IV).
All the datasets are available from the GitHub repository,6 in R.
> library(devtools)
> devtools::install_github("freakonometrics/InsurFair")
> library(InsurFair)
The first toy dataset is the one discussed previously in Sect. 1.2.2, with
toydata1_train and toydata1_valid, with (only) three variables y
(binary outcome), s (binary sensitive attribute) and x (drawn from a Gaussian
variable).
> str(toydata1_train)
'data.frame': 600 obs. of 3 variables:
$ x : num 0.7939 0.5735 0.9569 0.1299 -0.0606 ...
$ s : Factor w/ 2 levels "B","A": 1 1 2 1 2 2 2 1 1 1 ...
$ y : Factor w/ 2 levels "0","1": 1 1 2 2 2 2 2 1 2 1 ...
As discussed, the three variables are correlated, as they are all based on the unobserved common variable x₀.
The toydata2 dataset consists of two generated samples: n = 5000 observations are used as a training sample, and n = 1000 are used for validation. The process used to generate the data is the following:
• The binary sensitive attribute, s ∈ {A, B}, is drawn, with respectively 60% and 40% of individuals in each group
• (x₁, x₃) ∼ N(μₛ, Σₛ), with some correlation of 0.4 when s = A and 0.7 when s = B
6 See Charpentier (2014) for a general overview on the use of R in actuarial science. Note that some packages mentioned here also exist in Python, in scikit-learn, as well as packages dedicated to fairness (such as fairlearn or aif360).
Fig. 1.5 Scatter plot on toydata2, with x₁ on the x-axis and x₂ on the y-axis, with, on the left-hand side, colors depending on the outcome y (y ∈ {GOOD, BAD}, or y ∈ {0, 1}), and depending on the sensitive attribute s (s ∈ {A, B}) on the right-hand side
Fig. 1.6 Level curves of (x₁, x₂) → μ(x₁, x₂, A) on the left-hand side and (x₁, x₂) → μ(x₁, x₂, B) on the right-hand side, the true probabilities used to generate the toydata2 dataset. The blue area in the lower-left corner corresponds to y close to 0 (a GOOD risk, in blue), whereas the red area in the upper-right corner corresponds to y close to 1 (a BAD risk, in red)
gender of the person (binary, with 69% women (B) and 31% men (A)), but we can also use age, treated as categorical.
The FrenchMotor datasets, from Charpentier (2014), are in personal motor
insurance, with underwriting data, and information about claim occurrence (here
considered as binary). It is obtained as the aggregation of freMPL1, freMPL2,
Predictive modeling involves the use of data to forecast future events. It relies on capturing
relationships between explanatory variables and the predicted variables from past occur-
rences and exploiting these relationships to predict future outcomes. Forecasting future
financial events is a core actuarial skill—actuaries routinely apply predictive modeling
techniques in insurance and other risk management applications, Frees et al. (2014a).
The sciences do not try to explain, they hardly even try to interpret, they mainly make
models. By a model is meant a mathematical construct which, with the addition of
certain verbal interpretations, describes observed phenomena. The justification of such a
mathematical construct is solely and precisely that it is expected to work—that is, correctly
to describe phenomena from a reasonably wide area, Von Neumann (1955).
In economic theory, as in Harry Potter, the Emperor’s New Clothes or the tales of King
Solomon, we amuse ourselves in imaginary worlds. Economic theory spins tales and calls
them models. An economic model is also somewhere between fantasy and reality. Models
can be denounced for being simplistic and unrealistic, but modeling is essential because
it is the only method we have of clarifying concepts, evaluating assumptions, verifying
conclusions and acquiring insights that will serve us when we return from the model to
real life. In modern economics, the tales are expressed formally: words are represented by
letters. Economic concepts are housed within mathematical structures, Rubinstein (2012).
Chapter 2
Fundamentals of Actuarial Pricing
Abstract “Insurance is the contribution of the many to the misfortune of the few” is a simple way to describe what insurance is. But it doesn’t say what the
“contribution” should be, to be fair. In this chapter, we return to the fundamentals of
pricing and risk sharing, and at the end we mention other models used in insurance
(to predict future payments to be provisioned, to create a fraud score, etc.).
Even though insurers will not be able to predict which of their clients will suffer a loss, they should be capable of estimating the probability that a client files a claim, and possibly the distribution of their aggregate losses, with an acceptable margin of error, and of budgeting accordingly. The role of actuaries is to run statistical analyses to measure individual risk and price it.
2.1 Insurance
the application, i.e., the higher the risk that they bring to the pool, the higher the
premium required.
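A back-of-the-envelope illustration of this risk-proportional premium (with purely illustrative numbers): the pure premium is the estimated claim probability multiplied by the expected claim cost.
> p_claim <- 0.10        # illustrative probability that a policyholder files a claim
> avg_cost <- 2500       # illustrative expected cost of a claim
> p_claim * avg_cost     # pure premium of 250, before expense and risk loadings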
Through effective underwriting, Wilkie (1997) claims that “the risk is evaluated
by the insurer as thoroughly as possible, based on all the facts that are relevant
and available.” Participation in mutual insurance schemes is voluntary and the
amount of cover that the individual purchases is discretionary. An essential feature
of mutual insurance is segmentation, or discrimination in underwriting, leading
to significant differences in premium rates for the same amount of life cover
for different participants. Viswanathan (2006) gives several examples. The second
concept is “solidarity.”
Definition 2.2 (Solidarity (Wilkie 1997)) Solidarity is the basis of most national
or social insurance schemes. Participation in such state-run schemes is generally
compulsory and individuals have no discretion over their level of cover. All
participants normally have the same level of cover. In solidarity schemes the
contributions are not based on the expected risk of each participant.
In those state-run schemes, contributions are often just equal for all, or it can
be according to the individual ability to pay (such as percentage of income). As
everybody pays the same contribution rate, the low-risk participants are effectively
subsidizing the high-risk participants. With an insurance economics perspective,
agents make decisions individually, forgetting that the decisions they make often go
beyond their narrow self-interest, reflecting instead broader community and social
interests, even in situations where they are not known to each other. This is not
altruism, per se, but rather a notion of strong reciprocity, the “predisposition to
cooperate even when there is no apparent benefit in doing so,” as formalized in
Gintis (2000) and Bowles and Gintis (2004).
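A small numeric illustration of this cross-subsidy (with made-up figures): consider two equally sized groups with claim probabilities of 5% and 15% and a claim cost of 1000; under a flat contribution, low-risk participants pay more than their expected loss, and high-risk participants pay less.
> p <- c(low = 0.05, high = 0.15)   # illustrative claim probabilities per group
> risk_based <- 1000 * p            # mutuality: premiums of 50 and 150
> flat <- 1000 * mean(p)            # solidarity: everyone contributes 100
> flat - risk_based                 # low-risk pay 50 too much, high-risk 50 too little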
Solidarity is important in insurance. In most countries, employer-based health
insurance includes maternity benefits for everyone. In the USA, a federal law says
it is discriminatory not to do so (the “Pregnancy Discrimination Act” (PDA) is an
amendment to the Civil Rights Act of 1964 that was enacted in 1978). “Yes, men should pay for pregnancy coverage, and here’s why,” said Hiltzik (2013): “it takes two to tango.” No man has ever given birth to a baby, but it’s also true that no baby has
ever been born without a man being involved somewhere along the line. “Society
has a vested interest in healthy babies and mothers” and “universal coverage is the
only way to make maternity coverage affordable”; therefore, solidarity is imposed,
and men should pay for pregnancy coverage.
One should probably stress here that insurance is not used to eliminate the
risk, but to transfer it, and this transfer is done according to a social philosophy
chosen by the insurer. With “public insurance,” as Ewald (1986) reminds us, the
goal is to transfer risk from individuals to a wider social group, by “socialising,” or
redistributing risk “more fairly within the population.” Thus, low-risk individuals
pay insurance premiums at a higher rate than their risk profile would suggest,
even if this seems “inefficient” from an economic point of view. Social insurance
is organized according to principles of solidarity, where access and coverage are
independent of risk status, and sometimes of ability to pay (as noted by Mittra (2007)). Nevertheless, in many cases, the premium is proportional to the income of the policyholder, and such insurance is usually provided by public rather than private entities. For some
social goods, such as health care, long-term care, and perhaps even basic mortgage
life insurance, it may simply be inappropriate to provide such products through a
mutuality-based model that inevitably excludes some individuals, as “primary social
goods, because they are defined as something to which everyone has an inalienable
right, cannot be distributed through a system that excludes individuals based on
their risk status or ability to pay.” Mutual insurance companies are often seen as an
intermediary between such public insurance and for-profit insurance companies.
And as Lasry (2015) points out, “insurance has long been faced with a dilemma:
on the one hand, better knowledge of a risk allows for better pricing; better
knowledge of risk factors can also encourage prevention; on the other hand,
mutualization, which is the basis of insurance, can only subsist in most cases in a
situation of relative ignorance (or even a legal obligation of ignorance).” Actuaries
will then seek to classify or segment risks, all based on the idea of mutualization.
We shall return to the mathematical formalism of this dilemma. De Pril and
Dhaene (1996) point out that segmentation is a technique that the insurer uses to
differentiate the premium and possibly the cover, according to a certain number of
specific characteristics of the risk of being the policyholder (hereinafter referred
to as segmentation criteria), with the aim of achieving a better match between
the estimated cost of the claim and the burdens that a given person places on the
community of policyholders and the premium that this person has to pay for the
cover offered. In Box 2.1, Rodolphe Bigot responds to general legal considerations
regarding risk selection in insurance (in France, but most principles can be observed
elsewhere). Underwriting is the term used to describe the decision-making process
by which insurers determine whether to offer, or refuse, an insurance policy to an
individual based on the available information (and the requested amount). Gandy
(2016) asserts that the "right to underwrite" is basically a right to discriminate. Hence,
a "higher premium" corresponds to a rating decision, an "exclusion waiver" is a coverage
decision, whereas a "denial" is an underwriting decision.
Box 2.1 Insurance & Underwriting (in French Law), by Rodolphe Bigot1
The insurance transaction and the underlying mutualization are based on so-
called risk selection. Apart from most group insurance policies, which consist
of a kind of mutualization within a mutualization, the insurer refuses or
accepts that each applicant for insurance enters the mutualization constituted
1 Lecturer in private law, UFR of Law, Le Mans University, member of Thémis-UM and Ceprisca.
Fig. 2.1 Coverage selected by auto insurance policyholders based on age, with the basic manda-
tory coverage, “third party insurance” and the broadest coverage known as “fully comprehensive.”
(Source: personal communication, real data from an insurance company in France)
about how to build a fair tariff. A difficult task lies in the fact that insurers have
incomplete information about their customers. It is well known that the observable
characteristics of policyholders (which can be used in pricing) explain only a small
proportion of the risks they represent. The only remedy for this imperfection is to
let policyholders self-select, by differentiating the cover offered to them, i.e., by offering a nonlinear
scale linking the premium to be paid to the amount of the deductible accepted. As
mentioned in the previous chapter, observe that there is a close analogy between
this concept of “fair tariff” and “actuarial fairness,” or that of “equilibrium with
signal” proposed by Spence (1974, 1976) to describe the functioning of certain
labor markets. Riley (1975) proposed a more general model that could be applied
to insurance markets, among others. Cresta and Laffont (1982) proved the existence
of fair insurance rates for a single risk. Although the structure of equilibrium with
signal is now well understood in the case of a one-dimensional parameter, the same
cannot be said for cases where several parameters are involved. Kohlleppel (1983)
gave an example of the non-existence of such an equilibrium in a model satisfying
the natural extension of Spence’s hypotheses. As insurance is generally a highly
competitive and regulated market, the insurer must use all the statistical tools and
data at its disposal to build the best possible rates. At the same time, its premiums
must be aligned with the company’s strategy and take into account competition.
Because of the important role played by insurance in society, premiums are also
scrutinized by regulators. They must be transparent, explainable, and ethical. Thus,
pricing is not only statistical, it also carries strategic and societal issues. These
different issues can push the insurer to offer fairer premiums in relation to a given
variable. For example, regulations may require insurers to present fair premiums
with respect to the gender of the policyholder, while their own strategies may lead them to offer fair
premiums with respect to age. Regardless of the reason why an insurance player must
present fairer pricing in relation to a variable, it must be able to define, measure,
and then mitigate the ethical bias of its pricing while preserving its consistency and
performance.
See Feller (1957) or Denuit and Charpentier (2004) for more details about this
quantity, which exists only if the sum or the integral is finite. Risks with infinite
expected values exhibit unexpected properties. If this quantity exists, because of
the law of large numbers (Proposition 3.1), this corresponds to the probabilistic
counterpart of the average of n values y1 , · · · , yn obtained as independent draws of
random variable Y . Interestingly, as discussed in the next chapter, this quantity can
be obtained as the solution of an optimization problem. More precisely,
$$\bar{y} = \underset{m \in \mathbb{R}}{\operatorname{argmin}}\left\{\sum_{i=1}^{n} (y_i - m)^2\right\} \quad\text{and}\quad \mathbb{E}[Y] = \underset{m \in \mathbb{R}}{\operatorname{argmin}}\left\{\mathbb{E}\big[(Y - m)^2\big]\right\}.$$
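As a quick numerical check of this characterization, here is a minimal sketch in R, with arbitrary simulated losses rather than insurance data: the value of m that minimizes the sum of squared deviations coincides with the empirical mean.

set.seed(123)
y <- rgamma(1000, shape = 2, scale = 500)    # hypothetical individual losses
# numerical minimization of the sum of squared deviations
m_star <- optimize(function(m) sum((y - m)^2), interval = range(y))$minimum
c(argmin = m_star, mean = mean(y))           # the two values coincide (up to numerical error)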
2 γένος is the etymological source of “gender,” and not “gene,” based on γενεά from the aorist
infinitive of γίγνομαι—I come into being.
3 Denuit and Charpentier (2004) discuss the mathematical formalism that allows such a writing.
for some discount rate r. However, this assumption of homogeneous risks proves
to be overly simplistic in numerous insurance scenarios. Take death insurance, for
example, where the law of T , representing the occurrence of death, should ideally
be influenced by factors such as the policyholder’s age at the time of contract. This
specific aspect will be explored further in Sect. 2.3.3. But before, let us get back to
classical concepts about economic decisions, when facing uncertain events.
4 Without discounting, as death is (at an infinite time horizon) certain, the pure premium would be equal to the insured capital itself.
All actuaries have been raised on Akerlof's 1970 fable of "lemons". The insurance
market is characterized by information asymmetries. From the insurer’s point of
view, these asymmetries mainly concern the need to find adequate information
on the customer’s risk profile. A decisive factor in the success of an insurance
business model is the insurer’s ability to estimate the cost of risk as accurately
as possible. Although in the case of some simple product lines, such as motor
insurance, the estimation of the cost of risk can be largely or fully automated and
managed in-house, in areas with complex risks the assistance of an expert third
party can mitigate this type of information asymmetry. With Akerlof’s terminology,
some insurance buyers are considered low-risk peaches, whereas others are high-
risk lemons. In some cases, insurance buyers know (to some extent) whether they
are lemons or peaches. If the insurance company could tell the difference between
lemons and peaches, it would have to charge peaches a premium related to the risk
of the peaches and lemons a premium related to the risk of the lemons, according
to a concept of actuarial fairness, as Baker (2011) reminds us. But if actuaries are
not able to differentiate between lemons and peaches, then they will have to charge
the same price for an insurance contract. The main difference between the market
described by Akerlof (1970) (in the original fable it was a market for used cars)
and an insurance market is that the information asymmetry was initially (in the car
example) in favor of the seller of an asset. In the field of insurance, the situation is
often more complex. In the field of car insurance, Dalziel and Job (1997) pointed
out the optimism bias of most drivers who all think they are “good risks.” The
same bias will be found in many other examples, as mentioned by Royal and Walls
(2019), but excluding health insurance, where the policyholder may indeed have
more information than the insurer.
To use the description given by Chassagnon (1996), let us suppose that an
insurer covers a large number of agents who are heterogeneous in their probability
of suffering a loss. The insurer proposes a single price that reflects the average
probability of loss of the agent representative of this economy, and it becomes
unattractive for agents whose probability of suffering an accident is low to insure
themselves. A phenomenon of selection by price therefore occurs and it is said
to be unfavorable because it is the bad agents who remain. To guard against
this phenomenon of anti-selection, risk selection and premium segmentation are
necessary. "Adverse selection disappears when risk analysis becomes sufficiently
effective for markets to be segmented efficiently," says Picard (2003); "doesn't the
difficulty econometricians have in highlighting real anti-selection situations in
the car insurance market reflect the increasingly precise evaluation of the risks
underwritten by insurers?"
It is important to have models that can capture this heterogeneity (from ἑτερογενής,
heterogenes, "of different kinds," from ἕτερος, heteros, "other, another, different"
and γένος, genos, "kind"). To get back to our introductory example, if $T_x$ is the age
at (random) death of the policyholder of age x at the time the contract was taken out
(so that .Tx − x is the residual life span), then the pure premium corresponds to the
expected present value of future flows, i.e.,
$$a_x = \mathbb{E}\left[\frac{100}{(1+r)^{T_x - x}}\right] = \sum_{t=0}^{\infty} \frac{100}{(1+r)^{t}}\cdot \mathbb{P}[T_x = x + t],$$
where $L_t$ is the number of people alive at the end of t years in a cohort that we
would follow, so that $L_{x+t-1} - L_{x+t}$ is the number of people alive at the end of
$x+t-1$ years but not $x+t$ years (and therefore dead in their t-th year). It is De Witt
(1671) who first proposed this premium for a life insurance, where discriminating
according to age seems legitimate.
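To make the formula concrete, here is a minimal sketch in R, using a small hypothetical vector of survivors L (a toy Gompertz-like table, not one of the regulatory tables discussed below) to compute the pure premium for a benefit of 100 paid at death.

# hypothetical survivor counts L_0, L_1, ..., L_110 (toy table, illustration only)
age <- 0:110
L   <- round(100000 * exp(-0.0003 * (exp(0.09 * age) - 1) / 0.09))
r   <- 0.03   # discount rate
x   <- 40     # age at subscription
t   <- 0:(length(L) - x - 2)
# probability of dying in year t for someone alive at age x, from the survivor counts
p_death <- (L[x + t + 1] - L[x + t + 2]) / L[x + 1]
a_x <- sum(100 / (1 + r)^t * p_death)   # expected present value of the benefit
a_x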
But we can go further, because $\mathbb{P}[T_x = x + t]$, the probability that the policyholder
of age x at the time of subscription will die in t years, could also depend on his or
her gender, his or her health history, and probably on other variables that the insurer
might know. And in this case, it is appropriate to calculate conditional probabilities,
$\mathbb{P}[T_x = x + t \mid \text{woman}]$ or $\mathbb{P}[T_x = x + t \mid \text{man, smoker}]$.
As surprising as it may seem, Pradier (2011) noted that before the end of the
eighteenth century, in the UK and in France, the price of life annuities hardly
ever depended on the sex of the subscriber. However, the first separate mortality
tables, between men and women, constituted as early as 1740 by Nicolas Struyck
(published in the appendices of a geography article, Struyck (1740)) showed that
Table 2.1 Excerpt from the Men and Women life tables in 1720 (Source: Struyck (1912), page
231), for pseudo-cohorts of one thousand people (.L0 = 1000)
(columns, for men and for women: age x, survivors $L_x$, and the 5-year death probability ${}_5p_x$ defined in the text)
women generally lived longer than men (Table 2.1). Struyck (1740) (translated in
Struyck (1912)) shows that the residual life expectancy at age 20 is 30¾ years for
men and 35½ years for women. It also provides life annuity tables by gender.
For a 50-year-old woman, a life annuity was worth 969 florins, compared with 809
florins for a man of the same age. This substantial difference seemed to legitimize
a differentiation of premiums. Here, 424 men and 468 women (out of one
thousand respective births, $L_0 = 1000$) had reached 40 years of age ($x = 40$). And among those
who had reached 40 years of age, 12.5% of men and 9.6% of women would die
within 5 years (denoted ${}_5p_x = \mathbb{P}[T \le x + 5 \mid T > x]$ in Table 2.1).
According to Pradier (2012), it was not until the Duchy of Calenberg’s widows’
fund went bankrupt in 1779 that the age and sex of subscribers were used in
conjunction to calculate annuity prices. In France, in 1984, the regulatory authorities
of the insurance markets decided to use regulatory tables established for the general
population by INSEE, based on the population observed over 4 years, namely the
PM 73-77 table for men and the PF 73-77 table for women, renamed TD and TV
73-77 tables respectively (with an analytical extension beyond 99 years). Although
the primary factor in mortality is age, gender is also an important factor, as shown
in the TD-TV Table. For more than a century, the mortality rate for men has been
higher than that of women in France.
In practice, however, the actuarial pricing of life insurance policies continued
to be established without taking into account the gender of the policyholder. In fact,
the reason why two tables were used was that the male table was the regulatory
table for coverage in case of death (PM became TD, for "table de décès," or "death table"), and
the female table became the regulatory table for coverage in case of life, such as annuities (PF became TV, for "table de
vie," or "life table"). In 1993, the TD and TV 88-90 tables replaced the two previous
tables, with the same principle, i.e., the use of a table built on a male population for
coverage in case of death, and a table built on a female population for coverage in case of life. From a
prudential point of view, the female table models a population that has, on average,
a lower mortality rate, and therefore lives longer.
In 2005, the TH and TF 00-02 tables were used as regulatory tables, still with
tables founded on different populations, namely men and women respectively. But
this time, the terms men (H, for hommes) and women (F, for femmes) were retained,
as regulations allowed for the possibility of different pricing for men and women.
A ruling by the Court of Justice of the European Union on 1 March 2011, however,
made gender-differentiated pricing impossible (as of 21 December 2012), on the
grounds that it was discriminatory. In comparison, recent (French) INED tables
are also mentioned in Table 2.2, on the right-hand side.
Beyond gender, all sorts of “discriminating variables” have been studied, in order
to build, for example, mortality tables depending on whether the person is a smoker
or not, as in Benjamin and Michaelson (1988), in Table 2.3. Indeed, since Hoffman
(1931) or Johnston (1945), actuaries had observed that exposure to tobacco, and
smoking, had an important impact on the policyholder’s health. As Miller and
Gerstein (1983) wrote, “it is clear that smoking is an important cause of mortality.”
There are also mortality tables (or calculations of residual life expectancy)
by level of body mass index (BMI, introduced by Adolphe Quetelet in the mid-
nineteenth century), as calculated by Steensma et al. (2013) in Canada. A “normal”
index refers to people with an index between 18.5 and 25 kg/m²; "overweight"
refers to an index between 25 and 30 kg/m²; obesity level I refers to an index
between 30 and 35 kg/m², and obesity level II refers to an index exceeding
35 kg/m². Table 2.4 shows some of these values. These orders of magnitude are
comparable with those in Fontaine et al. (2003), among the pioneering studies, Finkelstein
et al. (2010), or more recently Stenholm et al. (2017). Although Adolphe Quetelet
introduced the index, it became popular in the 1970s, when "Dr. Keys was irritated
that life insurance companies were estimating people's body fat—and hence,
their risk of dying—by comparing their weights with the average weights of others of
the same height, age and gender," as Callahan (2021) explains. In Keys et al. (1972),
with “more than 7000 healthy, mostly middle-aged men, Dr. Keys and his colleagues
showed that the body mass index was a more accurate—and far simpler—predictor
of body fat than the methods used by the insurance industry.” Nevertheless, this
measure is now known to have many flaws, as explained in Ahima and Lazar (2013)
(Table 2.4).
Higher incomes are associated with longer life expectancy, as mentioned already
in Kitagawa and Hauser (1973) with probably the first documented analysis. But
despite the importance of socioeconomic status to mortality and survival, Yang
Table 2.2 Excerpt from French tables, with TD and TV 73-77 on the left-hand side, TD and TV 88-90 in the center, and INED 2017-2019 on the right-hand
side
Age    TD 73-77   TV 73-77   TD 88-90   TV 88-90   INED men   INED women
  0     100000     100000     100000     100000     100000     100000
 10      97961      98447      98835      99129      99486      99578
 20      97105      98055      98277      98869      99281      99471
 30      95559      97439      96759      98371      98656      99247
 40      93516      96419      94746      97534      97661      98810
 50      88380      94056      90778      95752      95497      97645
 60      77772      89106      81884      92050      90104      94777
 70      57981      78659      65649      84440      78947      89145
 80      28364      52974      39041      65043      59879      77161
 90       4986      14743       9389      24739      25123      44236
100        103        531        263       1479       1412       4874
110          0          0          0          2          -          -
Table 2.3 Residual life expectancy (in years) by age (25–65 years) for smokers and nonsmokers
(Source: Benjamin and Michaelson (1988), for 1970–1975 data in the USA)
       Men                      Women
Age    Nonsmoker   Smoker      Nonsmoker   Smoker
25     48.4        42.8        52.8        49.8
35     38.7        33.3        43.0        40.1
45     29.2        24.2        33.5        31.0
55     20.3        16.5        24.5        22.6
65     12.8        10.4        16.2        15.1
Table 2.4 Residual life expectancy (in years), as a function of age (between 20 and 70 years) as
a function of BMI level (Source: Steensma et al. (2013))
       Men                                         Women
Age    Normal   Overweight   Obese I   Obese II    Normal   Overweight   Obese I   Obese II
20     57.2     61.0         59.1      53.5        62.8     66.5         64.6      59.3
30     47.6     51.4         49.4      44.1        53.0     56.7         54.8      49.5
40     38.1     41.7         39.9      34.7        43.3     46.9         45.0      39.9
50     28.9     32.4         30.6      25.8        33.8     37.3         35.5      30.6
60     20.4     23.6         21.9      17.6        24.9     28.1         26.4      21.9
70     13.2     15.8         14.4      10.9        16.8     19.7         18.2      14.3
Table 2.5 Excerpt of life tables per wealth quantile and gender in France (Source: Blanpain
(2018))
       Men                             Women
Age    0–5%     45–50%    95–100%      0–5%     45–50%    95–100%
0      100000   100000    100000       100000   100000    100000
10     99299    99566     99619        99385    99608     99623
20     99024    99396     99469        99227    99506     99526
30     97930    98878     99094        98814    99302     99340
40     95595    98058     98627        97893    98960     99074
50     90031    96172     97757        95021    97959     98472
60     77943    91050     95649        88786    95543     97192
70     59824    79805     90399        79037    90408     94146
80     38548    59103     76115        63224    79117     85825
90     13337    23526     38837        31190    45750     55918
100    530      1308      3231         2935     5433      8717
et al. (2012), Chetty et al. (2016), and Demakakos et al. (2016) stressed that wealth
has been under-investigated as a predictor of mortality. Duggan et al. (2008) and
Waldron (2013) used social security data in the USA. In France, disparities of life
expectancy by social category are well known. Recently, Blanpain (2018) created
life tables per wealth quantile. An excerpt is shown in Table 2.5,
with men on the left-hand side and women on the right-hand side, for fictional cohorts,
with the poorest 5% of the population on the left-hand side ("0–5%") and the richest 5% on the right-hand side ("95–100%").
Fig. 2.2 Force of mortality (log scale) for men on the left-hand side and women on the right-hand
side, for various income quantiles (bottom, medium, and upper 10%), in France. (Data source:
Blanpain (2018))
A multitude of criteria can be used to create rate classes, as we have seen in the
context of mortality. To get a good predictive model, as in standard regression
models, we simply look for variables that correlate significantly with the variable
of interest, as mentioned by Wolthuis (2004). For instance, in the case of car
insurance, the following information was proposed in Bailey and Simon (1959):
use (leisure—pleasure—or professional—business), age (under 25 or not), gender
and marital status (married or not). Specifically, five risk classes are considered,
with rate surcharges relative to the first class (which is used here as a reference):
– “pleasure, no male operator under 25,” (reference),
– “pleasure, nonprincipal male operator under 25,” .+65%,
– “business use,” .+65%,
– “married owner or principal operator under 25,” .+65%,
– “unmarried owner or principal operator under 25,” .+140%.
In the 1960s, the rate classes resembled those that would be produced by
classification (or regression) trees such as those introduced by Breiman et al. (1984).
But even when more advanced algorithms are used, as Davenport (2006) points out, when an
actuary creates risk classes and rate groups, in most cases these "groups" are
not self-aware; they are not conscious (at most, the actuary will try to describe
them by looking at the averages of the different variables). These groups, or risk
classes, are built on the basis of available data, and exist primarily as the product
of actuarial models. And as Gandy (2016) points out, there is no “physical basis”
for group members to identify other members of their group, in the sense that they
usually don’t share anything, except some common characteristics. As discussed in
Sect. 3.2, these risk groups, developed at a particular point in time, create a transient
collusion between policyholders, who are likely to change groups as they move,
change cars, or even simply grow older.
If Y denotes the life expectancy at the birth of an individual, the literal translation of
the previous expression is that the life expectancy at birth of a randomly selected
individual (on the left) is a weighted average of the life expectancies at birth
of females and males, the weights being the respective proportions of males and
females in the population. And as $\mathbb{E}[Y]$ is an average of the two, it necessarily lies between the two conditional life expectancies.
The law of total expectations (Proposition 2.2) can be written, with that notation,
$$\mathbb{E}_Y[Y] = \mathbb{E}_X\big[\mathbb{E}_{Y|X}[Y\mid X]\big].$$
An alternative is to write, with synthetic notations, $\mathbb{E}[Y] = \mathbb{E}\big[\mathbb{E}[Y\mid X]\big]$, where the
same notation, $\mathbb{E}$, is used indifferently to describe the same operator on different
probability measures.
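As a small worked illustration of this decomposition (with purely illustrative numbers, not official statistics), suppose that women represent 51% of a population, with a life expectancy at birth of 85 years, against 79 years for men; then
$$\mathbb{E}[Y] = 0.51 \times \mathbb{E}[Y \mid \text{woman}] + 0.49 \times \mathbb{E}[Y \mid \text{man}] = 0.51 \times 85 + 0.49 \times 79 \approx 82.1 \text{ years},$$
which indeed lies between the two conditional life expectancies.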
With the pricing function $\mu(x) = \mathbb{E}[Y \mid X = x]$, this becomes
$$\mathbb{E}_Y[Y] = \mathbb{E}_X\big[\mathbb{E}_{Y|X}[Y\mid X]\big] = \mathbb{E}_X\big[\mu(X)\big],$$
which is a desirable property we want to have for any pricing function m (also called
"globally unbiased," see Definition 4.26).
Definition 2.8 (Balance Property) A pricing function m satisfies the balance
property if $\mathbb{E}_X[m(X)] = \mathbb{E}_Y[Y]$.
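As an illustration of the balance property, here is a minimal sketch in R, on simulated data with hypothetical covariates (age and an urban/rural zone), not on any real portfolio: a Poisson GLM with log link and an intercept, fitted by maximum likelihood, satisfies the property exactly on the training sample.

set.seed(42)
n    <- 5000
age  <- sample(18:80, n, replace = TRUE)
zone <- factor(sample(c("urban", "rural"), n, replace = TRUE))
lambda <- exp(-2 + 0.01 * (50 - age) * (age < 50) + 0.3 * (zone == "urban"))
y    <- rpois(n, lambda)                              # simulated claim counts
fit  <- glm(y ~ age + zone, family = poisson(link = "log"))
c(mean_premium = mean(fitted(fit)), mean_loss = mean(y))   # equal, up to numerical tolerance

This equality holds by construction for GLMs with canonical link and an intercept, which is one reason such models remain popular for ratemaking.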
The name “balance property” comes from accounting, as we want assets (what
comes in, or premiums, .m(x)) to equal liabilities (what goes out, or losses y) on
average. This concept, as it appears in economics in Borch (1962), corresponds to
“actuarial fairness,” and is based on a match between the total value of collected
premiums and the total amount of legitimate claims made by the policyholder. As
it is impossible for the insurer to know what future claims will actually be like,
it is considered actuarially fair to set the level of premiums on the basis of the
historical claims record of people in the same (assumed) risk class. It is on this
basis that discrimination is considered “fair” in distributional terms, as explained
in Meyers and Van Hoyweghen (2018). Otherwise, the redistribution would be
considered “unfair,” with forced solidarity from the low-risk group to the high-
risk group. This “fairness” was undermined in the 1980s, when private insurers
limited access to insurance for people with AIDS, or at risk of developing it,
as Daniels (1990) recalls. Feiring (2009) goes further in the context of genetic
information, “since the individual has no choice in selecting his genotype or its
expression, it is unfair to hold him responsible for the consequences of the genes
he inherits—just as it is unfair to hold him responsible for the consequences of any
distribution of factors that are the result of a natural lottery.” In the late 1970s (see
Boonekamp and Donaldson (1979), Kimball (1979) or Maynard (1979)), the idea
that the proportionality between the premium and the risk incurred would guarantee
fairness between policyholders began to be translated into conditional expectation
(conditional on the risk factors retained).
As discussed in Meyers and Van Hoyweghen (2018), who trace the emergence of
actuarial fairness from its conceptual origins in the early 1960s to its position at the
heart of insurance thinking in the 1980s, the concept of “actuarial fairness” appeared
as more and more countries adopted anti-discrimination legislation. At that time,
insurers positioned “actuarial fairness” as a fundamental principle that would be
jeopardized if the industry did not benefit from exemptions to such legislation. For
instance, according to the Equality Act 2010 in the U.K. “it is not a contravention
(...) to do anything in connection with insurance business if (a) that thing is done
by reference to information that is both relevant to the assessment of the risk to be
insured and from a source on which it is reasonable to rely, and (b) it is reasonable
to do that thing ,” as Thomas (2017) wrote.
In most applications, there is a strong heterogeneity within the population, with
respect to risk occurrence and risk costs. For example, when modeling mortality,
the probability of dying within a given year can be above 50% for very old and
sick people, and less than 0.001% for pre-teenagers. Formally, the heterogeneity
will be modeled by a latent factor .Θ. If y designates the occurrence (or not) of
an accident, y is seen as the realization of a random variable Y , which follows a
Bernoulli distribution, .B(Θ), where .Θ is a non-observable latent variable (as in
Gourieroux (1999) or Gourieroux and Jasiak (2007)). If y denotes the number of
accidents occurring during the year, Y follows a Poisson distribution, .P(Θ) (or a
negative binomial model, or a zero-inflated model, etc., as in
Denuit et al. (2007)). If y denotes the total cost, Y follows a Tweedie distribution,
or more generally a compound Poisson distribution, which we denote by .L(Θ, ϕ),
where .L denotes a distribution with mean .Θ, and where .ϕ is a dispersion parameter
(see Definition 3.13 for more details). The goal of the segmentation is to constitute
ratemaking classes (denoted .Bi previously) in an optimal way, i.e., by ensuring
that one class does not subsidize the other, from observable characteristics, noted
$x = (x_1, x_2, \cdots, x_k)$. Crocker and Snow (2013) speak of "categorization based on
$$p_x = \frac{\exp(x^\top\beta)}{1+\exp(x^\top\beta)} \quad\text{or}\quad p_x = \Phi(x^\top\beta),$$
5 $\Phi$ is here the cumulative distribution function of the standard normal distribution, $\mathcal{N}(0, 1)$.
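To illustrate how such scores are estimated in practice, here is a minimal sketch in R, on simulated data with hypothetical covariates (driver age and an urban indicator), fitting both the logistic and the probit versions of the score above with a GLM.

set.seed(1)
n   <- 10000
x1  <- runif(n, 18, 80)                    # age of the driver (hypothetical covariate)
x2  <- rbinom(n, 1, 0.3)                   # urban area indicator (hypothetical covariate)
eta <- -2 + 0.02 * (50 - x1) + 0.5 * x2    # linear predictor
y   <- rbinom(n, 1, exp(eta) / (1 + exp(eta)))   # occurrence of a claim
logit_fit  <- glm(y ~ x1 + x2, family = binomial(link = "logit"))
probit_fit <- glm(y ~ x1 + x2, family = binomial(link = "probit"))
# predicted probabilities for a new (hypothetical) policyholder
newdata <- data.frame(x1 = 22, x2 = 1)
c(logit  = predict(logit_fit,  newdata, type = "response"),
  probit = predict(probit_fit, newdata, type = "response"))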
Table 2.6 Individual loss, its expected value, and its variance, for the policyholder on the left-
hand side and the insurer on the right-hand side. .E[Y ] is the premium paid, and Y the total loss,
from De Wit and Van Eeghen (1984) and Denuit and Charpentier (2004)
Policyholder Insurer
Loss .E[Y ] .Y − E[Y ]
Average loss .E[Y ] 0
Variance 0 .Var[Y ]
Table 2.7 Individual loss, its expected value, and its variance, for the policyholder on the left-
hand side and the insurer on the right-hand side. .E[Y |Θ] is the premium paid, and Y the total loss,
from De Wit and Van Eeghen (1984) and Denuit and Charpentier (2004)
Policyholder Insurer
Loss .E[Y |Θ] .Y − E[Y |Θ]
Average loss .E[Y ] 0
Variance .Var[E[Y |Θ]] .Var[Y − E[Y |Θ]]
Table 2.8 Individual loss, its expected value, and its variance, for the policyholder on the left-
hand side, and the insurer on the right-hand side. .E[Y |X] is the premium paid, and Y the total loss,
from De Wit and Van Eeghen (1984) and Denuit and Charpentier (2004)
Policyholder Insurer
Loss .E[Y |X] .Y − E[Y |X]
Average loss .E[Y ] 0
Variance .Var[E[Y |X]] .E[Var[Y |X]]
where
comfortable” (as Stone (1993) put it). The danger is that, in this way, the allocation
of each person’s contributions to mutuality would be the result of an actuarial
calculation, as Stone (1993) put it. Porter (2020) said that this process was “a way
of making decisions without seeming to decide.” We review this point when we
discuss exclusions and the interpretability of models. The insurer then uses proxies
to capture this heterogeneity, as we have just seen. A proxy (one might call it a
“proxy variable”) is a variable that is not significant in its own right, but which
replaces a useful but unobservable, or unmeasurable, variable, according to Upton
and Cook (2014).
Most of our discussion focuses on tariff discrimination, and more precisely on
the “technical” tariff. As mentioned in the introduction, from the point of view
of the policyholder, this is not the most relevant variable. Indeed, in addition to
the actuarial premium (the pure premium mentioned earlier), there is a commercial
component, as an insurance agent may decide to offer a discount to one policyholder
or another, taking into account a different risk aversion or a greater or lesser
price elasticity (see Meilijson 2006). But an important underlying question is “is
the provided service the same?” Ingold and Soper (2016) review the example of
Amazon not offering the same services to all its customers, in particular same-
day-delivery offers, offered in certain neighborhoods, chosen by an algorithm that
ultimately reinforced racial bias (by never offering same-day delivery in neighbor-
hoods composed mainly of minority groups). A naive reading of prices on Amazon
would be biased because of this important bias in the data, which should be taken
into account. As Calders and Žliobaite (2013) remind us, "unbiased computational
processes can lead to discriminative decision procedures.” In insurance, one could
imagine that a claims manager does not offer the same compensation to people
with different profiles—some people being less likely to dispute than others. It is
important to better understand the relationship between the different concepts.
A large part of the actuary’s job is to motivate, and explain, a segmentation. Some
authors, such as Pasquale (2015), Castelvecchi (2016), or Kitchin (2017), have
pointed out that machine-learning algorithms are characterized by their opacity and
their “incomprehensibility,” sometimes called “black box” (or opaque) properties.
And it is essential to explain them, to tell a story. For Rubinstein (2012), as
mentioned earlier, models are “fables”: “economic theory spins tales and calls them
models. An economic model is also somewhere between fantasy and reality (...) the
word model sounds more scientific than the word fable or tale, but I think we are
talking about the same thing.” In the same way, the actuary will have to tell the
story of his or her model, before convincing the underwriting and insurance agents
to adopt it. But this narrative is necessarily imprecise. As Saint Augustine said,
“What is time? If no one asks me, I know. But if someone asks me and I want to
explain it, then I don’t know anymore.”
Fig. 2.3 The evolution of auto insurance claim frequency as a function of primary driver age,
relative to overall annual frequency, with a Poisson regression in yellow, a smoothed regression in
red, a smoothed regression with a too small smoothing bandwidth in blue, and with a regression tree
in green. Dots on the left are the predictions for a 22-year-old driver. (Data source: CASdataset
R package, see Charpentier (2014))
One often hears that age must be involved in the prediction of claim frequency
in car insurance, and indeed, as we see in Fig. 2.3, the prediction will not
be the same at 18, 25, or 55 years of age. Quite naturally, a premium surcharge
for young drivers can be legitimized by their limited driving experience,
coupled with reflexes that are not yet fully acquired. But this story does not tell us at what order
of magnitude this surcharge would seem legitimate. Going further, the choice
of model is far from neutral with respect to the prediction: for a 22-year-old policyholder,
relatively simple models propose an extra premium of +27%, +73%, +82%, or
+110% (compared with the average premium for the entire population). Although
age discrimination may seem logical, how much of a difference can be allowed here, and
would be perceived as “quantitatively legitimate”? In Sect. 4.1, we present standard
approaches used to interpret actuarial predictive models, and explain predicted
outcomes.
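The sensitivity of the young-driver surcharge to the choice of model can be illustrated with a minimal sketch in R, on simulated data (so the percentages obtained are not those of Fig. 2.3), comparing a Poisson regression that is log-linear in age with a piecewise-constant one built on hypothetical age bands.

set.seed(7)
n      <- 20000
age    <- sample(18:90, n, replace = TRUE)
lambda <- 0.07 + 0.10 * exp(-(age - 18) / 8)   # annual claim frequency, decreasing with age
y      <- rpois(n, lambda)                      # simulated claim counts, one-year exposure
d      <- data.frame(y = y, age = age)
glm_lin  <- glm(y ~ age, family = poisson, data = d)                    # log-linear effect of age
glm_band <- glm(y ~ cut(age, c(17, 21, 25, 30, 45, 60, 75, 91)),
                family = poisson, data = d)                             # piecewise-constant effect
new22 <- data.frame(age = 22)
pred  <- c(linear = predict(glm_lin,  new22, type = "response"),
           banded = predict(glm_band, new22, type = "response"))
round(100 * (pred / mean(y) - 1))   # relative surcharge (in %) at age 22, vs the portfolio average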
$$u(w - \pi) \ge \mathbb{E}\big[u(w - Y)\big].$$
The utility that they have when paying the premium (on the left-hand side) exceeds
the expected utility that they have when keeping the risk (on the right-hand side).
Thus, an insurer, also with perfect knowledge of the wealth and utility of the agent
(or his or her risk aversion), could ask for the following premium, called the "indifference
premium".
Definition 2.9 (Indifference Utility Principle) Let Y be the non-negative random
variable corresponding to the total annual loss associated with a given policy; for a
policyholder with utility u and wealth w, the indifference premium is
$$\pi = w - u^{-1}\big(\mathbb{E}\big[u(w - Y)\big]\big).$$
The technical pure premium is here .π0 = E(Y ) = py2 + (1 − p)y1 = pw, and
when paying that premium, the wealth would be .w − π0 = (1 − p)w.
Fig. 2.4 Utility and (ex-post) wealth, with an increasing concave utility function u, whereas the
straight line corresponds to a linear utility .u0 (risk neutral). Starting with initial wealth .ω, the
agent will have random wealth W after 1 year, with two possible states: either .w1 (complete loss,
on the left part of the x-axis) or .w2 = ω (no loss, on the right part of the x-axis). Complete loss
occurs with .40% chance (.2/5). .π0 is the pure premium (corresponding to a linear utility) whereas
.π is the commercial premium. All premiums in the colored area are high enough for the insurance
company, and low enough for the policyholder
If the agent is (strictly) risk averse, $u(w - \pi_0) > \mathbb{E}[u(w - Y)]$, in the sense that
the insurance company can ask for a higher premium than the pure premium,
$$\begin{cases} \pi_0 = \mathbb{E}[Y] > 0 & : \text{ actuarial (pure) premium}\\ \pi - \pi_0 = w - \mathbb{E}[Y] - u^{-1}\big(\mathbb{E}\big[u(w - Y)\big]\big) \ge 0 & : \text{ commercial loading.} \end{cases}$$
Fig. 2.5 On the left, the same graph as in Fig. 2.4, with utility on the y-axis and (ex-post) wealth
on the x-axis, with an increasing concave utility function u, and an ex-post random wealth W
after 1 year, with two possible states: either .w1 (complete loss) or .w2 = ω (no loss). Complete
loss occurs with .40% chance (.2/5) on the left-hand side, whereas complete loss occurs with .60%
chance (.3/5) on the right-hand side. Agents have identical risk aversion and wealth, on both graphs.
Indifference premium is larger when the risk is more likely (with the additional black part on the
technical pure premium; here, the commercial loading is almost the same)
In the context of heterogeneity of the underlying risk only, consider the case in
which heterogeneity is captured through covariates .X and where agents have the
same wealth w and the same utility u,
$$\begin{cases} \pi_0(x) = \mathbb{E}[Y \mid X = x] & : \text{ actuarial premium}\\ \pi - \pi_0 = w - \mathbb{E}[Y \mid X = x] - u^{-1}\big(\mathbb{E}\big[u(w - Y) \mid X = x\big]\big) & : \text{ commercial loading.} \end{cases}$$
For example, in Fig. 2.5, we have on the left, the same example as in Fig. 2.4,
corresponding to some “good” risk,
$$Y = \begin{cases} y_2 = w & \text{ with probability } p = 2/5\\ y_1 = 0 & \text{ with probability } 1 - p = 3/5. \end{cases}$$
On the right, we have some “bad risk,” where the value of the loss is unchanged, but
the probability of claiming a loss is higher (.p' > p). In Fig. 2.5
$$Y = \begin{cases} y_2 = w & \text{ with probability } p' = 3/5 > 2/5\\ y_1 = 0 & \text{ with probability } 1 - p' = 2/5. \end{cases}$$
In that case, it could be seen as legitimate, and fair, to ask a higher technical
premium, and possibly to add the appropriate loading then.
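A minimal sketch in R, with an arbitrary square-root utility and wealth w = 100 (both illustrative choices, not values taken from the figure), shows how the indifference premium of Definition 2.9 increases with the loss probability, for the "good" risk (p = 2/5) and the "bad" risk (p' = 3/5) of Fig. 2.5.

u     <- sqrt              # a concave (risk-averse) utility, chosen for illustration
u_inv <- function(z) z^2   # its inverse
w     <- 100               # initial wealth; the potential loss is total (Y = w or Y = 0)
indiff_premium <- function(p) {
  EuY <- p * u(w - w) + (1 - p) * u(w - 0)   # E[u(w - Y)]
  w - u_inv(EuY)                             # pi = w - u^{-1}(E[u(w - Y)])
}
pure_premium <- function(p) p * w            # pi_0 = E[Y]
rbind(good_risk = c(pure = pure_premium(2/5), indifference = indiff_premium(2/5)),
      bad_risk  = c(pure = pure_premium(3/5), indifference = indiff_premium(3/5)))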
If the heterogeneity is no longer in the underlying risk, but in the risk aversion (or
possibly the wealth), so that the utility is now a function $u_x$ of some covariates $x$, we should write
$$\begin{cases} \pi_0(x) = \mu(x) = \mathbb{E}[Y] > 0 & : \text{ actuarial premium}\\ \pi - \pi_0 = w - \mathbb{E}[Y] - u_x^{-1}\big(\mathbb{E}\big[u_x(w - Y)\big]\big) \ge 0 & : \text{ commercial loading.} \end{cases}$$
Here, we used the expected utility approach from Von Neumann and Morgenstern
(1953) to illustrate, but alternatives could be considered.
The approach described previously is also named “differential pricing,” where
customers with a similar risk are charged different premiums (for reasons other than
risk occurrence and magnitude). Along these lines, Central Bank of Ireland (2021)
considered “price walking” as discriminatory. “Price walking” corresponds to the
case where longstanding, loyal policyholders are charged higher prices for the same
services than customers who have just switched to that provider. This is a well-
documented practice in the telecommunications industry that can also be observed
in insurance (see Guelman et al. (2012) or He et al. (2020), who model attrition rate,
or “customer churn”). According to Central Bank of Ireland (2021), the practice of
price walking is “unfair” and could result in unfair outcomes for some groups of
consumers, both in the private motor insurance and household insurance markets.
For example, long-term customers (those who stayed with the same insurer for 9
years or more) pay, on average, 14% more for private car insurance and 32% more
for home insurance than the equivalent customer renewing for the first time.
“We define price optimization in P&C insurance [property and casualty insurance,
or nonlife insurance6 ] as the supplementation of traditional supply-side actuarial
models with quantitative customer demand models,” explained Bugbee et al. (2014).
Duncan and McPhail (2013), Guven and McPhail (2013), and Spedicato et al. (2018)
mention that such practices are intensively discussed by practitioners, even if they
did not get much attention in the academic journals. Notable exceptions would
be Morel et al. (2003), who introduced myopic pricing, whereas more realistic
approaches, named “semi-myopic pricing strategies”, were discussed in Krikler
et al. (2004) or more recently Ban and Keskin (2021).
6 Formally, property covers a home (physical building) and the belongings in it from all losses such
as fire, theft, etc., or covers damage to a car when involved in an accident, including protection from
damage/loss caused by other factors such as fire, vandalism, etc. Casualty involves coverage if one
is being held responsible for someone injuring themselves on his or her property, or if one were to
cause any damage to someone else’s property, and coverage if one gets into an accident and causes
injuries to someone else or damage to their car.
So far, we have discussed only premium principles, but predictive models are used
almost everywhere in insurance.
Fraud is not self-revealed, and therefore it must be investigated, as Guillen (2006)
and Guillen and Ayuso (2008) put it. Tools for detecting fraud span all kinds of actions
undertaken by insurers. They may involve human resources, data mining, external
advisors, statistical analyses, and monitoring. The methods currently available for
detecting fraudulent or suspicious claims based on human resources rely on video
or audiotape surveillance, manual indicator cards, internal audits, and information
collected from agents or informants. Methods based on data analysis draw on external
and internal data sources. Automated methods use various machine-learning
techniques, such as fuzzy set clustering in Derrig and Ostaszewski (1995),
simple regression models in Derrig and Weisberg (1998), or GLMs, with a logistic
regression in Artís et al. (1999) and Artís et al. (2002) and a probit model in Belhadji
et al. (2000), or neural networks, in Brockett et al. (1998) or Viaene et al. (2005).
2.7.3 Mortality
The modeling and forecasting of mortality rates have been subject to extensive
research in the past. The most widely used approach is the “Lee-Carter Model,”
from Lee and Carter (1992) with its numerous extensions. More recent approaches
involve nonlinear regression and GLMs. But recently, many machine-learning
algorithms have been used to detect (unknown) patterns, such as Levantesi and
Pizzorusso (2019), with decision trees, random forests, and gradient boosting. Perla
et al. (2021) generalized the Lee-Carter Model with a simple convolutional network.
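To give an idea of the structure of the Lee-Carter Model, here is a minimal sketch in R, on simulated death rates rather than real data: log mortality is written log m(x,t) = a_x + b_x k_t, and the classical estimation takes a_x as the row averages of the log rates and extracts (b_x, k_t) from the first term of a singular value decomposition of the centered matrix.

set.seed(2024)
ages  <- 0:90 ; years <- 1980:2019
# simulated log death rates with a downward trend (illustrative values only)
a_true <- -9 + 0.09 * ages
k_true <- seq(10, -10, length.out = length(years))
b_true <- rep(1 / length(ages), length(ages))
logm   <- outer(a_true, rep(1, length(years))) + outer(b_true, k_true) +
          matrix(rnorm(length(ages) * length(years), sd = 0.02), length(ages))
# Lee-Carter estimation via SVD
a_hat <- rowMeans(logm)
Z     <- sweep(logm, 1, a_hat)               # centered log rates
s     <- svd(Z)
b_hat <- s$u[, 1] / sum(s$u[, 1])            # identification constraint: sum(b_x) = 1
k_hat <- s$d[1] * s$v[, 1] * sum(s$u[, 1])   # rescaled so that b_x * k_t is unchanged
plot(years, k_hat, type = "l", xlab = "year", ylab = expression(k[t]))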
Parametric insurance is also an area where predictive models are important. Here,
we consider guaranteed payment of a predetermined amount of an insurance claim
upon the occurrence of a stipulated triggering event, which must be some predefined
parameter or metric specifically related to the insured person’s particular risk
exposure, as explained in Hillier (2022) or Jerry (2023).
So far, we mentioned the use of data and models in the context of estimating a
“fair premium.” But it should also be highlighted that insurance companies have
helped to improve the quality of life in many countries, using data that they
pressure were more important than changes in systolic blood pressure in predicting
mortality. For insurers, this information, although measured on an ad hoc basis, was
sufficient to exclude certain individuals or to increase their insurance premiums.
The designation of hypertension as a risk factor for reduced life expectancy was not
based on research into the risk factors for hypertension, but on a simple measure of
correlation and risk analysis. And the existence of a correlation did not necessarily
indicate a causal link, but this was not the concern of the many physicians working
for insurers. Medical research was then able to work on a better understanding of
these phenomena, observed by the insurers, who had access to these data (because
they had the good idea of collecting them).
Chapter 3
Models: Overview on Predictive Models
We will not start a philosophical discussion about risk and uncertainty here. How-
ever, in actuarial science, all stories begin with a probabilistic model. “Probability is
the most important concept in modern science, especially as nobody has the slightest
notion what it means” said Bertrand Russell in a conference, back in the early 1930s,
quoted in Bell (1945). Very often, the “physical” probabilities receive an objective
value, on the basis of the law of large numbers, as the empirical frequency converges
toward "the probability" (frequentist theory of probabilities).
$$\underbrace{\frac{1}{n}\sum_{i=1}^{n} \boldsymbol{1}(X_i \in A)}_{\text{(empirical) frequency}} \ \xrightarrow{\ \text{a.s.}\ }\ \underbrace{\mathbb{P}(\{X \in A\}) = \mathbb{P}[A]}_{\text{probability}}, \quad \text{as } n \to \infty.$$
Proof Strong law of large numbers (also called Kolmogorov’s law), see Loève
(1977), or any probability textbook.
This is a so-called "physical," or "objective," probability. It means that if we throw
a die a large number of times (here n), the "probability" of obtaining a 6 with this die is the
empirical frequency of 6s we obtained. Of course, with a perfectly balanced die, there
is no need to repeat throws to affirm that the probability of obtaining a 6 on a given
throw is equal to 1/6 (by the symmetry of the cube). But if we repeat the
experiment of throwing a die millions of times, 1/6 should be close to the frequency
of appearance of 6, corresponding to the "frequentist" definition of the concept
of "probability." Almost 200 years ago, Cournot (1843) already distinguished an
"objective meaning" of probability (as a measure of the physical possibility of
realization of a random event) and a "subjective meaning" (the probability being a
judgment made about an event, this judgment being linked to the ignorance of the
conditions of the realization of the event).
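A minimal sketch in R illustrates this frequentist reading: with a large number of simulated throws of a balanced die, the empirical frequency of 6 approaches 1/6, whereas for a (hypothetical) loaded die it approaches whatever physical probability that die actually has.

set.seed(6)
n <- 1e6
fair   <- sample(1:6, n, replace = TRUE)                                  # balanced die
loaded <- sample(1:6, n, replace = TRUE, prob = c(rep(0.16, 5), 0.20))    # loaded die (illustrative weights)
c(fair = mean(fair == 6), theory = 1/6, loaded = mean(loaded == 6))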
If we use that “frequentist” definition (also coined “long-run probability” in
Kaplan (2023) as Proposition 3.1 is an asymptotic result), we are unable to make
sense of the probability of a “single singular event,” as noted by von Mises (1928,
1939): “When we speak of the ‘probability of death’, the exact meaning of this
expression can be defined in the following way only. We must not think of an
individual, but of a certain class as a whole, e.g., ‘all insured men forty-one years
old living in a given country and not engaged in certain dangerous occupations’.
A probability of death is attached to the class of men or to another class that can
be defined in a similar way. We can say nothing about the probability of death of
an individual even if we know his condition of life and health in detail. The phrase
‘probability of death’, when it refers to a single person, has no meaning for us at
all.” And there are even deeper paradoxes, that can be related to latent risk factors
discussed in the previous chapter, and the “true underlying probability” (to claim
a loss, or to die). In a legal context, Fenton and Neil (2018) quoted a judge, who
was told that a person was less than .50% guilty: “look, the guy either did it or
he did not do it. If he did then he is 100% guilty and if he did not then he is
0% guilty; so giving the chances of guilt as a probability somewhere in between
makes no sense and has no place in the law.” The main difference with actuarial
pricing is that we should estimate probabilities associated with future events. But
still, one can wonder if “the true probability” is a concept that makes sense when
signing a contract. Thus, the goal here will be to train a model that will compute
a score that might be interpreted as a "probability" (this will raise the question
of the "calibration" of a model, i.e., the connection between that score and the "observed
frequencies," interpreted as probabilities, as discussed in Sect. 4.3.3).
Given a probability measure .P, one can define “conditional probabilities,” the
standard notation being the vertical bar. .P[A|B] is the conditional probability that
event .A occurs given the information that event .B occurred. It is the ratio of the
probability that both .A and .B occurred (corresponding to .P[A ∩ B]) over the
probability that .B occurred. Based on that definition, we can derive Bayes formula.
Proposition 3.2 (Bayes Formula) Given two events A and B such that $\mathbb{P}[B] \neq 0$,
$$\mathbb{P}[A\mid B] = \frac{\mathbb{P}[B\mid A]\cdot\mathbb{P}[A]}{\mathbb{P}[B]} \propto \mathbb{P}[B\mid A]\cdot\mathbb{P}[A].$$
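As a small numerical illustration of the formula (with hypothetical figures, chosen only for the sketch), suppose A is the event "the claim is fraudulent," with prior probability 5%, and B the event "the claim is flagged by an automated detector" that flags 80% of fraudulent claims and 10% of legitimate ones; a minimal computation in R gives the posterior probability of fraud given a flag.

p_A            <- 0.05     # prior probability of fraud (hypothetical)
p_B_given_A    <- 0.80     # detection rate on fraudulent claims (hypothetical)
p_B_given_notA <- 0.10     # false-alarm rate on legitimate claims (hypothetical)
p_B <- p_B_given_A * p_A + p_B_given_notA * (1 - p_A)   # total probability of a flag
p_A_given_B <- p_B_given_A * p_A / p_B                  # Bayes formula
p_A_given_B                                             # about 0.30: most flagged claims are not fraudulent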
Another close example would be one where .B is the result of a test, and
But because there is competition in the market, $\mathbb{P}_n$ can be different from $\mathbb{P}$, the
probability measure for the entire population,
$$\mathbb{P}[X \in [18;25] \mid S = A] = 20\% \ \text{ and } \ \mathbb{P}[\widehat{Y} > 50\% \mid S = A] = 20\%$$
$$\mathbb{P}[X \in [18;25] \mid S = B] = 15\% \ \text{ and } \ \mathbb{P}[\widehat{Y} > 50\% \mid S = B] = 15\%.$$
There could also be some target probability measure $\mathbb{P}^{\star}$, as underwriters can be
willing to target some specific segments of the population, as discussed in Chaps. 7
and 12 (on change of measures),
$$\mathbb{P}^{\star}[X \in [18;25] \mid S = A] = 25\% \ \text{ and } \ \mathbb{P}^{\star}[\widehat{Y} > 50\% \mid S = A] = 25\%$$
$$\mathbb{P}^{\star}[X \in [18;25] \mid S = B] = 20\% \ \text{ and } \ \mathbb{P}^{\star}[\widehat{Y} > 50\% \mid S = B] = 20\%.$$
It is also possible to mention here the fact that the model is fitted on past data,
associated with probability measure $\mathbb{P}_n$, but because of the competition on the
market, or because of the general economic context, the structure of the portfolio
might change. The probability measure for next year will then be $\widetilde{\mathbb{P}}_n$ (with
$$\widetilde{\mathbb{P}}_n[X \in [18;25] \mid S = A] = 35\% \ \text{ and } \ \widetilde{\mathbb{P}}_n[\widehat{Y} > 50\% \mid S = A] = 27\%$$
$$\widetilde{\mathbb{P}}_n[X \in [18;25] \mid S = B] = 20\% \ \text{ and } \ \widetilde{\mathbb{P}}_n[\widehat{Y} > 50\% \mid S = B] = 27\%,$$
if our score, used to assess whether we give a loan to some clients attracts
more young (and more risky) people. We do not discuss this issue here, but the
“generalization” property should be with respect to a new unobservable and hard-
to-predict probability measure $\widetilde{\mathbb{P}}_n$ (and not $\mathbb{P}$ as usually considered in machine
learning, as discussed in the next sections).
3.1.2 Models
Fig. 3.1 A simple linear model, a piecewise constant (green) model, or a complex model
(nonlinear but continuous), from ten observations .(xi , yi ), where x is a temperature in degrees
Fahrenheit and y is the temperature in degrees Celsius, at the same location i
1 As discussed previously, notation .z is also used later on, and we distinguish admissible predictors
.x,
and sensitive ones .s. In this chapter, we mainly use .x, as in most textbooks.
2 "In that Empire, the Art of Cartography achieved such Perfection that the map of a single
Province occupied an entire City, and the map of the Empire, an entire Province. In time, these
inordinate maps were not satisfactory and the Colleges of Cartographers drew up a map of the
Empire, which was the size of the Empire and coincided exactly with it" [personal translation].
$$y \leftarrow \frac{5}{9}\,(x - 32).$$
A machine learning (or artificial intelligence) approach offers a very different
solution. Instead of coding the rule into the machine (what computer scientists might
call "Good Old-Fashioned Artificial Intelligence," as in Haugeland (1989)), we simply
give the machine several examples of matched temperatures in Fahrenheit
and Celsius, $(x_i, y_i)$. We enter the data into a training dataset, and the algorithm
will learn a conversion rule by itself, looking for the closest candidate function to
the data. We can then find an example like the one in Fig. 3.1, with some data and
different models (one simple (linear) and one more complex).
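A minimal sketch in R reproduces this idea on ten (noiseless) observations: an ordinary least-squares fit recovers, up to rounding, the slope 5/9 (about 0.556) and the intercept -160/9 (about -17.78) of the conversion rule, without the rule ever being coded explicitly.

x <- c(14, 23, 32, 41, 50, 59, 68, 77, 86, 95)   # temperatures in degrees Fahrenheit
y <- (x - 32) * 5 / 9                            # the same temperatures in degrees Celsius
fit <- lm(y ~ x)                                 # "learning" the conversion rule from examples
coef(fit)                                        # intercept close to -17.78, slope close to 0.556
predict(fit, data.frame(x = 100))                # prediction for a new temperature (about 37.8 C)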
It is worth noting that the “complexity” of certain algorithms, or their “opacity”
(which leads to the term “black box”), has nothing to do with the optimization
algorithm used (in deep learning, back-propagation is simply an iterative mechanism
for optimizing a clearly described objective). It is mainly that the model obtained
may seem complex, or impenetrable, because it takes into account possible interactions
between the predictor variables, for example. For the sake of completeness, a
distinction should be made between classical supervised machine-learning algo-
rithms and reinforcement learning techniques. The latter case describes sequential
(or iterative) learning methods, where the algorithm learns by experimenting, as
described in Charpentier et al. (2021). We find these algorithms in automatic driving,
for example, or if we wanted to correctly model the links between the data, the
constructed model, the new connected data, the update of the model, etc. But we
will not insist more on this class of models here.
To conclude this first section, let us stress that in insurance models, the goal
is not to predict "who" will die, or get involved in a car accident. Actuaries
create scores that are interpreted as the probability of dying, or the probability of
getting a bodily injury claim, in order to compute “fair” premiums. To use a typical
statistical example, let y denote the face of a die, potentially loaded. If p is the
(true) probability of falling on 6 (say 14.5752%), we assume first that we are able
to acquire information about the way the die was made, about its roughness, its
imperfections, that will allow us to refine our knowledge of this probability, but
also that we have a model able to link this information in an adequate way. Knowing
the probability of falling on 6 better does not guarantee that the die will fall on
6: the random component does not disappear, and never will. Translated
into the insurance problem, p might denote the "true" probability that a specific
policyholder will get involved in a car accident. Based on external information
$x$, some model will predict that the probability of being in an accident is $\widehat{p}_x$
(say 11.1245%). As mentioned by French statistician Alfred Sauvy, "dans toute
statistique, l'inexactitude du nombre est compensée par la précision des décimales"
(or "in all statistics, the inaccuracy of the number is compensated for by the
precision of the decimals," infinite precision we might add). The goal is not to find a
model that returns either 0% or 100% (this happens with several machine-learning
algorithms), but simply to assess with confidence a valid probability, used to compute
a "fair" premium. And an easy way is perhaps to use simple categories: the
probability of getting a 6 is less than 15% (less than for a fair die), between 15% and
18.5% (close to a fair die), or more than 18.5% (more than for a fair die).
donativa, which the emperors attributed to the soldiers on various occasions, was
not handed over to the beneficiaries in cash but was deposited to the account of each
soldier in his legion’s savings bank. This could be seen as an insurance scheme, and
risks against which a member was insured were diverse. In the case of retirement,
upon the completion of his term of service, the soldier would receive a lump sum that
helped him to somewhat arrange the rest of his life. The membership in a collegium
gave him a mutual insurance against “unforeseen risks.” These collegia, besides
being cooperative insurance companies, had other functions. And because of the
structure of those collegia based on corporatism, members were quite homogeneous.
Sometime in the early 1660s, the Pirate’s Code was supposedly written by
the Portuguese buccaneer Bartolomeu Português. And interestingly, a section is
explicitly dedicated to insurance and benefits: “a standard compensation is provided
for maimed and mutilated buccaneers. Thus they order for the loss of a right arm six
hundred pieces of eight, or six slaves; for the loss of a left arm five hundred pieces
of eight, or five slaves; for a right leg five hundred pieces of eight, or five slaves;
for the left leg four hundred pieces of eight, or four slaves; for an eye one hundred
pieces of eight, or one slave; for a finger of the hand the same reward as for the
eye,” see Barbour (1911) (or more recently Leeson (2009) and Fox (2013) about
this piratical scheme).
In the nineteenth century, in Europe, mutual aid societies involved a group of
individuals who made regular payments into a common fund in order to provide for
themselves in later, unforeseeable moments of financial hardship or of old age. As
mentioned by Garrioch (2011), in 1848, there were 280 mutual aid societies in Paris
with well over 20,000 members. For example, the Société des Arts Graphiques was
created in 1808. It admitted only men over 20 and under 50, and it charged much
higher admission and annual fees for those who joined at a more advanced age. In
return, they received benefits if they were unable to work, reducing over a period of
time, but in the case of serious illness the Society would pay the admission fee for a
hospice. In England, there were “friendly societies,” as described in Ismay (2018).
In France, after the 1848 revolution and Louis Napoléon Bonaparte’s coup d’état
in 1851, mutual funds were seen as a means of harmonizing social classes. The
money collected through contributions came to the rescue of unfortunate workers,
who would no longer have any reason to radicalize. It was proposed that insurance
should become compulsory (Bismarck proposed this in Germany in 1883), but the
idea was rejected in favor of giving workers the freedom to contribute, as the only
way to moralize the working classes, as Da Silva (2023) explains. In 1852, of the
236 mutual funds created, 21 were on a professional basis, whereas the other 215
were on a territorial basis. And from 1870 onward, mutual funds diversified the
professional profile of contributors beyond blue-collar workers, and expanded to
include employees, civil servants, the self-employed, and artists. But the amount of
the premium was not linked to the risk. As Da Silva (2023) puts it, "mutual insurers
see in the actuarial figure the programmed end of solidarity.” For mutual funds,
solidarity is essential, with everyone contributing according to their means and
receiving according to their needs. At around the same time, in France, the first
insurance companies appeared, based on risk selection, and the first mathematical
Once heterogeneity with respect to the risk was observed in portfolios, insurers have
operated by categorizing individuals into risk classes and assigning corresponding
tariffs. This ongoing process of categorization ensures that the sums collected,
on average, are sufficient to address the realized risks within specific groups.
The aim of risk classification, as explained in Wortham (1986), is to identify the
specific characteristics that are supposed to determine an individual’s propensity to
suffer an adverse event, forming groups within which the risk is (approximately)
equally shared. The problem, of course, is that the characteristics associated with
various types of risk are almost infinite; as they cannot all be identified and priced
in every risk classification system, there will necessarily be unpriced sources of
heterogeneity between individuals in a given risk class.
In 1915, as mentioned in Rothstein (2003), the president of the Association of
Life Insurance Medical Directors of America noted that the question asked almost
universally of the Medical Examiner was “What is your opinion of the risk? Good,
bad, first-class, second-class, or not acceptable?” Historically, insurance prices
were a (finite) collection of prices (maybe more than the two classes mentioned,
“first-class” and “second-class”). In Box 3.1, in the early 1920s, Albert Henry
Mowbray, who worked for the New York Life Insurance Company and later Liberty
Mutual (and was also an actuary for state-level insurance commissions in North
Carolina and California, and the National Council on Workmen's Insurance) gives
his perspective on insurance rate making.
that scoring technologies are continually swapping predictors, “shuffling the cards,”
so that there is no stable basis for constructing group memberships, or a coherent
sense.
Harry S. Havens in the late 1970s gave the description mentioned in Box 3.2.
In Box 3.3, a paragraph from Casey et al. (1976) provides some historical
perspective, by Barbara Casey, Jacques Pezier and Carl Spetzler.
Here, the variance of the outcome Y is decomposed into two parts, one representing the variance due to the variability of the underlying risk factor, and one reflecting the inherent variability of Y if that risk factor did not vary (the homogeneous case). One can
recognize that a similar idea is the basis for analysis of variance (ANOVA) models
(as formalized in Fisher (1921) and Fisher and Mackenzie (1923)) where the total
variability is split into the “within groups” and the “between groups.” The “one-way
ANOVA" is a technique that can be used to test whether the means of two (or more) samples are significantly different. If the outcome y is continuous (extensions can be obtained for binary variables, or counts), suppose that
$$y_{i,j} = \mu_j + \varepsilon_{i,j},$$
where i is the index over individuals, and j the index over groups (with $j = 1, 2, \cdots, J$). $\mu_j$ is the mean of the observations for group j, and the errors $\varepsilon_{i,j}$ are supposed to be zero-mean (normally distributed, as a classical assumption). One could also write
$$y_{i,j} = \mu + \alpha_j + \varepsilon_{i,j},$$
where $\mu$ is the overall mean, whereas $\alpha_j$ is the deviation from the overall mean, for group j. Of course, one can generalize that model to multiple factors. In the "two-way ANOVA," two types of groups are considered,
$$y_{i,j,k} = \mu_{j,k} + \varepsilon_{i,j,k},$$
where j is the index over groups according to the first factor, whereas k is the index over groups according to the second factor. $\mu_{j,k}$ is the mean of the observations for groups j and k, and errors $\varepsilon_{i,j,k}$ are supposed to be zero-mean. We can write the mean as a linear combination of factors, in the sense that
$$\mu_{j,k} = \mu + \alpha_j + \beta_k + \gamma_{j,k},$$
where $\mu$ is still the overall mean, whereas $\alpha_j$ and $\beta_k$ correspond to the deviations from the overall mean, and $\gamma_{j,k}$ is the non-additive interaction effect. In order to have identifiability of the model, some "sum-to-zero" constraints are added, as previously,
$$\sum_{j=1}^{J}\alpha_j = \sum_{j=1}^{J}\gamma_{j,k} = 0 \quad\text{and}\quad \sum_{k=1}^{K}\beta_k = \sum_{k=1}^{K}\gamma_{j,k} = 0.$$
A more modern way to consider those models is to use linear models. For example, for the "one-way ANOVA," we can write $y = X\beta + \varepsilon$, where
$$y = (y_{1,1},\cdots,y_{n_1,1},y_{1,2},\cdots,y_{n_2,2},\cdots,y_{1,J},\cdots,y_{n_J,J})^\top,$$
$$\varepsilon = (\varepsilon_{1,1},\cdots,\varepsilon_{n_1,1},\varepsilon_{1,2},\cdots,\varepsilon_{n_2,2},\cdots,\varepsilon_{1,J},\cdots,\varepsilon_{n_J,J})^\top,\qquad \beta = (\beta_0,\beta_1,\cdots,\beta_J)^\top,$$
and
$$X = [\mathbf{1}_n, A] \ \text{ where } \ A = \begin{pmatrix} \mathbf{1}_{n_1} & 0 & \cdots & 0\\ 0 & \mathbf{1}_{n_2} & \cdots & 0\\ \vdots & & \ddots & \vdots\\ 0 & 0 & \cdots & \mathbf{1}_{n_J}\end{pmatrix}, \ \text{ i.e. } \ X = \begin{pmatrix} \mathbf{1}_{n_1} & \mathbf{1}_{n_1} & 0 & \cdots & 0\\ \mathbf{1}_{n_2} & 0 & \mathbf{1}_{n_2} & \cdots & 0\\ \vdots & \vdots & & \ddots & \vdots\\ \mathbf{1}_{n_J} & 0 & 0 & \cdots & \mathbf{1}_{n_J}\end{pmatrix},$$
so that $\widehat{\mu}_j = \overline{y}_{\cdot j}$ is simply the average within group j. In the second case, if $y_{i,j} = \mu + \alpha_j + \varepsilon_{i,j}$, where $\alpha_1 + \alpha_2 + \cdots + \alpha_J = 0$, we can prove that
$$\widehat{\mu} = \overline{y} = \frac{1}{J}\sum_{j=1}^{J}\overline{y}_{\cdot j} \quad\text{and}\quad \widehat{\alpha}_j = \overline{y}_{\cdot j} - \overline{y},$$
or, with weights proportional to the group sizes,
$$\widehat{\mu} = \overline{y} = \frac{1}{n}\sum_{j=1}^{J} n_j\,\overline{y}_{\cdot j} \quad\text{and}\quad \widehat{\beta}_j = \overline{y}_{\cdot j} - \overline{y}.$$
Let $j\in\{1,2,\cdots,J\}$ and $k\in\{1,2,\cdots,K\}$, and let $n_{jk}$ denote the number of observations in group j for the first factor, and k for the second. Define the averages
$$\overline{y}_{\cdot jk} = \frac{1}{n_{jk}}\sum_{i=1}^{n_{jk}} y_{ijk},\qquad \overline{y}_{\cdot j\cdot} = \frac{1}{n_{j\cdot}}\sum_{k=1}^{K}\sum_{i=1}^{n_{jk}} y_{ijk},\qquad \overline{y}_{\cdot\cdot k} = \frac{1}{n_{\cdot k}}\sum_{j=1}^{J}\sum_{i=1}^{n_{jk}} y_{ijk}.$$
For the model with interaction,
$$y_{ijk} = \mu + \alpha_j + \beta_k + \gamma_{jk} + \varepsilon_{ijk},$$
the estimators are
$$\widehat{\mu} = \overline{y},\quad \widehat{\alpha}_j = \overline{y}_{j\cdot} - \overline{y},\quad \widehat{\beta}_k = \overline{y}_{\cdot k} - \overline{y},\quad \widehat{\gamma}_{jk} = \overline{y}_{jk} - \overline{y}_{j\cdot} - \overline{y}_{\cdot k} + \overline{y},$$
whereas, without the interaction term, the model becomes $y_{ijk} = \mu + \alpha_j + \beta_k + \varepsilon_{ijk}$.
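To make this concrete, here is a minimal R sketch of a one-way ANOVA seen as a linear model (simulated data and hypothetical group labels, not taken from the book's datasets):

set.seed(1)
df <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 50)),
  y     = rnorm(150, mean = rep(c(10, 12, 15), each = 50), sd = 2)
)
tapply(df$y, df$group, mean)                  # the group means mu_j
# same model written y = X beta + eps, with sum-to-zero constraints
m <- lm(y ~ group, data = df, contrasts = list(group = "contr.sum"))
coef(m)      # intercept = overall mean of group means, then the alpha_j's
anova(m)     # "between groups" vs "within groups" decomposition of the variance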
Such models were used historically in claims reserving (see Kremer (1982) for a
formal connection), and, of course, in ratemaking. As explained in Bennett (1978),
“in a rating structure used in motor insurance there may typically be about eight
factors, each having a number of different levels into which risks may be classified
and then be charged different rates of premium,” with either an “additive model” or
a "multiplicative model" for the premium $\mu$ (with the notations of Bennett (1978)),
$$\mu_{jk\cdots} = \mu + \alpha_j + \beta_k + \cdots \quad\text{(additive)},\qquad \mu_{jk\cdots} = \mu\cdot\alpha_j\cdot\beta_k\cdots \quad\text{(multiplicative)},$$
where $\alpha_j$ is a parameter value for the j-th level of the first risk factor, etc., and $\mu$ is a constant corresponding to some "overall average level."
Historically, classification relativities were determined one dimension at a time
(see Feldblum and Brosius (2003), and the appendices to McClenahan (2006) and
Finger (2006) for some illustration of the procedure). Then, Bailey and Simon
(1959, 1960) introduced the “minimum bias procedure.”
In Fig. 3.2, we can visualize a dozen classes associated with credit risk (on the GermanCredit database), with, on the x-axis, predictions given by two models, and, on the y-axis, the empirical default probability (that will correspond to a discrete version of the calibration plot described in Sect. 4.3.3).
As discussed in Agresti (2012, 2015), there are strong connections between
those approaches based on groups and linear models, and actuarial research started
to move toward “continuous” models. Nevertheless, the use of categories has
been popular in the industry for several decades. For example, Siddiqi (2012)
recommends cutting all continuous variables into bins, using a so-called “weight-
of-evidence binning” technique, usually seen as an “optimal binning” for numerical
and categorical variables using methods including tree-like segmentation, or Chi-
squared merge. In R, it can be performed using the woebin function of the
scorecard package. For example, on the GermanCredit dataset, three con-
tinuous variables are divided into bins, as in Fig. 3.3. For the duration (in months),
bins are A = [0, 8), B = [8, 16), C = [16, 34), D = [34, 44), and E = [44, 80); for
Fig. 3.3 From continuous variables to categories (five categories .{A, B, C, D, E}), for three
continuous variables of the GermanCredit dataset: duration of the credit, amount of credit,
and age of the applicant. Bars in the background are the number of applicants in each bin (y-axis
on the left), and the line is the probability of having a default (y-axis on the right)
the credit amount, bins are A = [0, 1400), B = [1400, 1800), C = [1800, 4000), D = [4000, 9200), and E = [9200, 20000); and for the age of the applicant, A = [19, 26), B = [26, 28), C = [28, 35), D = [35, 37), and E = [37, 80). The use of categorical
features, to create ratemaking classes is now obsolete, as more and more actuaries
consider “individual” pricing models.
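As an illustration of the binning step described above, here is a minimal R sketch using the woebin function of the scorecard package (already mentioned), on the germancredit data bundled with that package; the column names are those of the packaged dataset and may differ from the book's GermanCredit data.

library(scorecard)
data(germancredit)
bins <- woebin(germancredit, y = "creditability",
               x = c("duration.in.month", "credit.amount", "age.in.years"))
bins$duration.in.month     # cut points, counts and weight-of-evidence per bin
# woebin_plot(bins)        # bar/line displays similar to Fig. 3.3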
Instead of considering risk classes, the measurement of risk can take a very different form, which we could call "individualization", or "personalization", as in Barry and Charpentier (2020). In many respects, the latter is a kind of asymptotic limit of the first one, when the number of classes increases. By significantly reducing the population through the assignment of individuals to exclusive categories, and ensuring that each category consists of a single individual, the processes of "categorization" and "individualization" begin to converge. Besides computational aspects (discussed in the next section), this approach fundamentally alters the logical properties of reasoning, as discussed in François (2022) and Krippner (2023). When individualized measures are employed, they are situated on a continuous scale: individuals are assigned highly precise scores, which, occasionally, may be shared with others, but which mainly enable the ranking of individuals in relation to one another. These scores are no longer discrete, discontinuous, qualitative categories, but rather numerical values that can, therefore, be subjected to calculations (as explained in the previous chapter). Furthermore,
these numbers possess cardinal value in the sense that they not only facilitate the
ranking of risks in comparison with one another but also denote a quantity (of
risk) amenable to reasoning and notably computation. Last, probabilities can be
associated with these numbers, which are not the property of a group but that of
an individual: risk measurement is no longer intended to designate the probability
of an event occurring within a group once in a thousand trials; it is aimed at
providing a quantified assessment of the default risk associated with a specific
individual, in their idiosyncrasy and irreducible singularity. Risk measurement has
now evolved into an individualized measure, as François (2022) claims. Thanks to those scores, individual policyholders are directly comparable. As Friedler et al. (2016)
explained, “the notion of the group ceases to be a stable analytical category and
becomes a speculative ensemble assembled to inform a decision and enable a course
of action (.. . .) Ordered for a different purpose, the groups scatter and reassemble
differently.” In the next section, we present techniques used by actuaries to model
risks, and compute premiums.
While we mainly discuss insurance pricing models here, i.e., supervised models where the variable of interest y is the occurrence of a claim in the coming year, the number of claims, or the total charge, it is worth keeping in mind that the input data ($x$) can be the predictions of a model. For example $x_1$ could be
with standard notation. Those functions are interesting as we have the following decomposition
$$Y = \mathbb{E}[Y|X=x] + \underbrace{Y - \mathbb{E}[Y|X=x]}_{=\varepsilon},$$
where $\mathbb{E}[\varepsilon|X=x] = 0$. It should be stressed that the extension to the case where X is absolutely continuous is formally slightly complicated since $\{X=x\}$ is an event with probability 0, and then $\mathbb{P}[Y\in A|X=x]$ is not properly defined (in Bayes
for some random variable $Y:(\Omega,\mathcal{F},\mathbb{P})\to\mathbb{R}$, with law $\mathbb{P}_Y$. And we will take even more freedom when conditioning. As discussed in Proschan and Presnell (1998),
“statisticians make liberal use of conditioning arguments to shorten what would
otherwise be long proofs,” and we do the same here. Heuristically, (the proof can
be found in Pfanzagl (1979) and Proschan and Presnell (1998)), a version of .P(Y ∈
A|X = x) can be obtained as a limit of conditional probabilities given that X lies
in small neighborhoods of x, the limit being taken as the size of the neighborhood
tends to 0,
$$\mathbb{P}\big[Y\in A\mid X=x\big] = \lim_{\epsilon\to 0}\frac{\mathbb{P}(\{Y\in A\}\cap\{|X-x|\le\epsilon\})}{\mathbb{P}(\{|X-x|\le\epsilon\})} = \lim_{\epsilon\to 0}\mathbb{P}\big[Y\in A\mid |X-x|\le\epsilon\big],$$
that can be extended into a higher dimension, using some distance between $X$ and $x$, and then use that approach to define4 "$\mathbb{E}[Y|X=x]$." In Sect. 4.1, we have a brief discussion of loss functions, such as the quadratic loss
$$\ell_2(y,\widehat{y}) = (y-\widehat{y})^2.$$
The fact that the expected value minimizes the expected loss for some loss function
(here $\ell_2$) is named "elicitable" in Gneiting et al. (2007). From this property, we
can understand why the expected value is also called “best estimate” (see also the
connection to Bregman distance, in Definition 3.12). As discussed in Huttegger
(2013), the use of a quadratic loss function gives rise to a rich geometric structure,
for variables that are squared integrable, which is essentially very close to the
geometry of Euclidean spaces (.L2 being a Hilbert space, with an inner product, and
a projection operator; we come back to this point in Chap. 10, in “pre-processing”
approaches). Up to a monotonic transformation (the square root function), the
distance here is the expectation of the quadratic loss function. With the terminology
of Angrist and Pischke (2009), the regression function .μ is the function of .x that
serves as “the best predictor of y, in the mean-squared error sense.”
The quantile loss $\ell_{q,\alpha}$, for some $\alpha\in(0,1)$, is
$$\ell_{q,\alpha}(y,\widehat{y}) = \max\big\{\alpha(y-\widehat{y}),\,(1-\alpha)(\widehat{y}-y)\big\} = (y-\widehat{y})\big(\alpha - \mathbf{1}(y<\widehat{y})\big).$$
For example, Kudryavtsev (2009) used a quantile loss function in the context of ratemaking. It is called "quantile" loss as
$$Q(\alpha) = F^{-1}(\alpha) \in \underset{q\in\mathbb{R}}{\operatorname{argmin}}\big\{R_{q,\alpha}(q)\big\} = \underset{q\in\mathbb{R}}{\operatorname{argmin}}\big\{\mathbb{E}\big[\ell_{q,\alpha}(Y,q)\big]\big\}.$$
Indeed,
$$\underset{q}{\operatorname{argmin}}\,R_{q,\alpha}(q) = \underset{q}{\operatorname{argmin}}\left\{(\alpha-1)\int_{-\infty}^{q}(y-q)\,dF_Y(y) + \alpha\int_{q}^{\infty}(y-q)\,dF_Y(y)\right\},$$
and by computing the derivative of the expected loss via an application of the Leibniz integral rule,
$$0 = (1-\alpha)\int_{-\infty}^{q} dF_Y(y) - \alpha\int_{q}^{\infty} dF_Y(y).$$
More generally, given a loss $\ell$, the empirical risk of a model $\widehat{m}$ is
$$\widehat{R}_n(\widehat{m}) = \frac{1}{n}\sum_{i=1}^{n}\ell\big(\widehat{m}(x_i), y_i\big).$$
Again, in the regression context, with a quadratic loss function, the empirical risk is the mean squared error (MSE), defined as
$$\widehat{R}_n(\widehat{m}) = \text{MSE}_n = \frac{1}{n}\sum_{i=1}^{n}\big(y_i - \widehat{m}(x_i)\big)^2.$$
In the context of a classifier, the loss is a function on $\mathcal{Y}\times\mathcal{Y}$, i.e., $\{0,1\}\times\{0,1\}$, taking values in $\mathbb{R}_+$. But in many cases, we want to compute a "loss" between y and an estimation of $\mathbb{P}[Y=1]$, instead of a predicted class $\widehat{y}\in\{0,1\}$; therefore, it will be a function defined on $\{0,1\}\times[0,1]$. That will correspond to a "scoring rule" (see Definition 4.16). The empirical risk associated with the $\ell_{0/1}$ loss is the proportion of misclassified individuals, also named "classifier error rate." But it is possible to get more information: given a sample of size n, it is possible to compute the "confusion matrix," which is simply the contingency table of the pairs $(y_i,\widehat{y}_i)$, as in Figs. 3.4 and 3.5.
Given a threshold t, one will get the confusion matrix, and various quantities can
be computed. To illustrate, consider a simple logistic regression model, on .x (and
not s), and get predictions on .n = 40 observations from toydata2 (as in Table
8.1). To illustrate, two values are considered for t, .30% and .50%.
$$\text{FPR} = \frac{\text{FP}}{\text{FP}+\text{TN}},\qquad \text{TPR} = \frac{\text{TP}}{\text{TP}+\text{FN}},\qquad \text{TNR} = \frac{\text{TN}}{\text{FP}+\text{TN}},\qquad \text{FNR} = \frac{\text{FN}}{\text{TP}+\text{FN}}.$$
Fig. 3.6 Confusion matrices with threshold 30% and 50% for $n = 40$ observations from the toydata2 dataset, and a logistic regression for $\widehat{m}$ (at threshold 30%: TP = 17, FP = 8, FN = 3, TN = 12; at threshold 50%: TP = 15, FP = 2, FN = 5, TN = 18)
From Fig. 3.6, we can compute various quantities (as explained in Figs. 3.4
and 3.5). Sensitivity (true positive rate) is the probability of a positive test result,
conditioned on the individual truly being positive. Thus, here we have
$$\text{TPR}(30\%) = \frac{17}{3+17} = 0.85 \quad\text{and}\quad \text{TPR}(50\%) = \frac{15}{5+15} = 0.75,$$
$$\text{FNR}(30\%) = \frac{3}{3+17} = 0.15 \quad\text{and}\quad \text{FNR}(50\%) = \frac{5}{5+15} = 0.25.$$
Specificity (true negative rate) is the probability of a negative test result, conditioned
on the individual truly being negative,
$$\text{TNR}(30\%) = \frac{12}{8+12} = 0.6 \quad\text{and}\quad \text{TNR}(50\%) = \frac{18}{2+18} = 0.9,$$
$$\text{FPR}(30\%) = \frac{8}{8+12} = 0.4 \quad\text{and}\quad \text{FPR}(50\%) = \frac{2}{2+18} = 0.1.$$
The negative and positive predictive values (NPV and PPV) are
$$\text{NPV}(30\%) = \frac{12}{12+3} = 0.8 \quad\text{and}\quad \text{NPV}(50\%) = \frac{18}{18+5} = 0.7826,$$
$$\text{PPV}(30\%) = \frac{17}{17+8} = 0.68 \quad\text{and}\quad \text{PPV}(50\%) = \frac{15}{15+2} = 0.8824.$$
The accuracy is
$$\text{ACC}(30\%) = \frac{12+17}{12+8+3+17} = 0.725 \quad\text{and}\quad \text{ACC}(50\%) = \frac{18+15}{18+2+5+15} = 0.825,$$
whereas the "balanced accuracy" (see Langford and Schapire 2005) is the average of the true positive rate (TPR) and the true negative rate (TNR), here $(0.85+0.60)/2 = 0.725$ and $(0.75+0.90)/2 = 0.825$. Cohen's $\kappa$, which compares the observed accuracy with the accuracy expected under independence (here $20/40 = 50\%$), is
$$\kappa(30\%) = \frac{\frac{29}{40}-\frac{20}{40}}{1-\frac{20}{40}} = 0.45 \quad\text{and}\quad \kappa(50\%) = \frac{\frac{33}{40}-\frac{20}{40}}{1-\frac{20}{40}} = 0.65.$$
One issue here is that the sample used to compute the empirical risk is the same as the one used to fit the model; this is the so-called "in-sample risk"
$$\widehat{R}^{\text{is}}_n(m) = \frac{1}{n}\sum_{i=1}^{n}\ell\big(m(x_i), y_i\big).$$
Thus, if we consider
$$\widehat{m}_n = \underset{m\in\mathcal{M}}{\operatorname{argmin}}\big\{\widehat{R}^{\text{is}}_n(m)\big\},$$
and if the class $\mathcal{M}$ is flexible enough, we may end up with
$$\widehat{R}^{\text{is}}_n(\widehat{m}) = \frac{1}{n}\sum_{i=1}^{n}\ell\big(\widehat{m}(x_i), y_i\big) = 0.$$
Fig. 3.7 Two fitted models from a (fake) dataset $(x_1,y_1),\cdots,(x_{10},y_{10})$, with a linear model on the left, and a polynomial model on the right, such that for both the in-sample risk is null, $\widehat{R}^{\text{is}}_n(\widehat{m}) = 0$
To avoid this problem, randomly divide the initial database into a training dataset and a validation dataset. The training dataset, with $n_T < n$ observations, will be used to estimate the parameters of the model,
$$\widehat{m}_{n_T} = \underset{m\in\mathcal{M}}{\operatorname{argmin}}\big\{\widehat{R}^{\text{is}}_{n_T}(m)\big\}.$$
Then, the validation dataset, with $n_V = n - n_T$ observations, will be used to select the model, using the "out-of-sample risk"
$$\widehat{R}^{\text{os}}_{n_V}(\widehat{m}_{n_T}) = \frac{1}{n_V}\sum_{i=1}^{n_V}\ell\big(\widehat{m}_{n_T}(x_i), y_i\big),$$
an empirical counterpart of the theoretical risk $R(\widehat{m}) = \mathbb{E}\big[\ell(Y,\widehat{m}(X))\big]$, that cannot be calculated without knowing the true distribution of $(Y,X)$.
If $\ell$ is the quadratic loss, $\ell_2(y,\widehat{y}) = (y-\widehat{y})^2$,
$$R_2(\widehat{m}) = \mathbb{E}_{\mathcal{D}_n}\,\mathbb{E}_{Y|X}\big[\ell_2\big(Y,\widehat{m}(X)\big)\,\big|\,\mathcal{D}_n\big] = \big(\mathbb{E}_{Y|X}[Y] - \mathbb{E}_{\mathcal{D}_n}[\widehat{m}(X)]\big)^2 + \mathbb{E}_{Y|X}\big[(Y - \mathbb{E}_{Y|X}[Y])^2\big] + \mathbb{E}_{\mathcal{D}_n}\big[(\widehat{m}(X) - \mathbb{E}_{\mathcal{D}_n}[\widehat{m}(X)])^2\big].$$
We recognize the square of the bias (bias$^2$), the stochastic error, and the variance of the estimator.
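A minimal R sketch of the training/validation split, and of the in-sample versus out-of-sample quadratic risks (simulated data, hypothetical variable names):

set.seed(42)
n  <- 1000
df <- data.frame(x = runif(n))
df$y <- sin(2 * pi * df$x) + rnorm(n, sd = 0.3)
idT   <- sample(seq_len(n), size = 0.7 * n)     # training indices, n_T = 0.7 n
train <- df[idT, ]; valid <- df[-idT, ]
fit <- lm(y ~ poly(x, 5), data = train)
mean((train$y - predict(fit, train))^2)         # in-sample risk
mean((valid$y - predict(fit, valid))^2)         # out-of-sample risk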
Here, so far, all observations in the training dataset have the same importance.
But it is possible to include weights, for each observation, in the optimization
procedure. A classic example could be the weighted least squares,
$$\sum_{i=1}^{n}\omega_i\big(y_i - x_i^\top\beta\big)^2,$$
for some positive (or null) weights $(\omega_1,\cdots,\omega_n)\in\mathbb{R}_+^n$. The weighted least squares estimator is
$$\widehat{\beta} = (X^\top\Omega X)^{-1}X^\top\Omega y,\quad\text{where } \Omega = \operatorname{diag}(\omega).$$
More generally, the weighted empirical risk is
$$\widehat{R}_\omega(\widehat{m}) = \sum_{i=1}^{n}\omega_i\,\ell\big(\widehat{m}(x_i), y_i\big).$$
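A minimal R sketch of the weighted least squares estimator, both through the closed form above and through lm(..., weights = ) (simulated data; the heteroskedastic noise is only there to motivate the weights):

set.seed(7)
n <- 200
x <- rnorm(n); w <- runif(n)                        # positive weights
y <- 1 + 2 * x + rnorm(n, sd = 1 / sqrt(w))
X <- cbind(1, x); Omega <- diag(w)
solve(t(X) %*% Omega %*% X, t(X) %*% Omega %*% y)   # closed-form WLS estimator
coef(lm(y ~ x, weights = w))                        # same estimates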
$$d_H(p,q)^2 = \frac{1}{2}\sum_{i}\Big(\sqrt{p(i)} - \sqrt{q(i)}\Big)^2 = 1 - \sum_{i}\sqrt{p(i)q(i)} \in [0,1].$$
For example, for two Gaussian distributions with means $\mu_1,\mu_2$ and variances $\sigma_1^2,\sigma_2^2$,
$$d_H^2(p_1,p_2) = 1 - \sqrt{\frac{2\sigma_1\sigma_2}{\sigma_1^2+\sigma_2^2}}\,\exp\left(-\frac{1}{4}\,\frac{(\mu_1-\mu_2)^2}{\sigma_1^2+\sigma_2^2}\right)$$
(that can be extended into a higher dimension, as in Pardo (2018)), whereas for two exponential distributions with means $\mu_1$ and $\mu_2$,
$$d_H^2(p_1,p_2) = 1 - \frac{2\sqrt{\mu_1\mu_2}}{\mu_1+\mu_2}.$$
A few years later, Saks (1937) introduced the concept of "total variation"
(between measures) in the context of signed measures on a measurable space, and
it can be used to define a total variation distance between probability measures (see
Dudley 2010). Quite generally, given two discrete distributions p and q, the total
variation is the largest possible difference between the probabilities that the two
probability distributions can assign to the same event.
Definition 3.6 (Total Variation (Jordan 1881; Rudin 1966)) For two univariate distributions p and q, the total variation distance between p and q is
$$d_{TV}(p,q) = \sup_{A\subset\mathbb{R}}\big|p(A) - q(A)\big|.$$
It should be stressed here that in the context of discrimination, Zafar et al. (2019) or Zhang and Bareinboim (2018) suggest removing the symmetry property, to take into account that there is a favored and a disfavored group, and therefore to consider
$$D_{TV}(p\|q) = \sup_{A\subset\mathbb{R}}\big\{p(A) - q(A)\big\}.$$
Removing the standard symmetry property (that we have for distances) yields the concept of "divergence," which is still a non-negative function, positive definite (in the sense that it is null if and only if "$p=q$"), but for which the triangle inequality is not satisfied (even if it could satisfy some sort of Pythagorean theorem, when an "inner product" can be derived). As Amari (1982) explains, this is mainly because divergences are generalizations of "squared distances," not "linear distances."
Definition 3.7 (Kullback–Leibler (Kullback and Leibler 1951)) For two discrete distributions p and q, the Kullback–Leibler divergence of p with respect to q is
$$D_{KL}(p\|q) = \sum_{i} p(i)\,\log\frac{p(i)}{q(i)},$$
with an analogous integral version for densities, possibly in higher dimension.
This corresponds to the relative entropy from q to p. Interestingly, Kullback (2004) mentioned that he preferred the term "discrimination information." Notice that the ratio $\log(p/q)$ is sometimes called "weight-of-evidence," following Good (1950) and Ayer (1972); see also Wod (1985) or Weed (2005) for some surveys. Again, this is not a distance (even if it satisfies the nice property that $p = q$ if and only if $D_{KL}(p\|q) = 0$), so we use the term "divergence" (and notation D instead of d).
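For two discrete distributions on the same support, these quantities are one-liners in R; a minimal sketch:

p <- c(0.1, 0.4, 0.5)
q <- c(0.2, 0.3, 0.5)
1 - sum(sqrt(p * q))      # squared Hellinger distance d_H^2(p, q)
0.5 * sum(abs(p - q))     # total variation distance d_TV(p, q)
sum(p * log(p / q))       # Kullback-Leibler divergence D_KL(p || q)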
Definition 3.8 (Mutual Information (Shannon and Weaver 1949)) For a pair of discrete variables x and y with joint distribution $p_{xy}$ (and marginal ones $p_x$ and $p_y$), the mutual information is
$$IM(x,y) = D_{KL}\big(p_{xy}\|p_{xy}^{\perp}\big) = \sum_{i,j} p_{xy}(i,j)\,\log\frac{p_{xy}(i,j)}{p_{xy}^{\perp}(i,j)} = \sum_{i,j} p_{xy}(i,j)\,\log\frac{p_{xy}(i,j)}{p_x(i)\,p_y(j)},$$
Definition 3.9 (Population Stability Index (PSI) (Siddiqi 2012)) PSI is a measure of population stability between two population samples, closely related to the symmetrized divergence
$$D_{JS}(p_1,p_2) = \frac{1}{2}D_{KL}(p_1\|q) + \frac{1}{2}D_{KL}(p_2\|q),$$
where $q = \frac{1}{2}(p_1+p_2)$.
Another popular distance is the Wasserstein distance,5 also called Mallows’
distance, from Mallows (1972).
Definition 3.11 (Wasserstein (Wasserstein 1969)) Consider two measures p and q on $\mathbb{R}^d$, with a norm $\|\cdot\|$ (on $\mathbb{R}^d$). Then define
$$W_k(p,q) = \left(\inf_{\pi\in\Pi(p,q)}\int_{\mathbb{R}^d\times\mathbb{R}^d}\|x-y\|^k\,d\pi(x,y)\right)^{1/k},$$
where $\Pi(p,q)$ denotes the set of couplings of p and q. Note that the total variation distance can also be written as such a coupling problem,
$$d_{TV}(p,q) = \inf_{\pi\in\Pi(p,q)}\big\{\mathbb{P}[X\neq Y],\ (X,Y)\sim\pi\big\} = \inf_{\pi\in\Pi(p,q)}\big\{\mathbb{E}[\ell_{0/1}(X,Y)],\ (X,Y)\sim\pi\big\}.$$
5 The original name, Vaseršteĭn, is also written "Vaserstein," but as the distance is usually denoted "W," we write "Wasserstein."
An optimal transport $T^\star$ (in Brenier's sense, from Brenier (1991), see Villani (2009) or Galichon (2016)) from $P_0$ toward $P_1$ will be the solution of
$$T^\star \in \underset{T:\,T_{\#}P_0 = P_1}{\operatorname{arginf}}\left\{\int_{\mathbb{R}^k}\ell\big(x, T(x)\big)\,dP_0(x)\right\}.$$
In dimension 1 (distributions on $\mathbb{R}$), let $F_0$ and $F_1$ denote the cumulative distribution functions, and $F_0^{-1}$ and $F_1^{-1}$ the quantile functions. Then
$$W_k(p_0,p_1) = \left(\int_0^1\big|F_0^{-1}(u) - F_1^{-1}(u)\big|^k\,du\right)^{1/k},$$
and one can prove that the optimal transport $T^\star$ is a monotone transformation. More precisely,
$$T^\star: x_0 \mapsto x_1 = F_1^{-1}\circ F_0(x_0).$$
Observe that, for two (univariate) Gaussian distributions, and the Euclidean distance,
$$W_2(p_0,p_1)^2 = (\mu_1-\mu_0)^2 + (\sigma_1-\sigma_0)^2.$$
And in that Gaussian case, there is an explicit expression for the optimal transport, which is simply an affine map (see Villani (2003) for more details). In the univariate case, $x_1 = T_{\mathcal{N}}(x_0) = \mu_1 + \dfrac{\sigma_1}{\sigma_0}(x_0-\mu_0)$, whereas in the multivariate case, an analogous expression can be derived,
$$x_1 = T_{\mathcal{N}}(x_0) = \mu_1 + A(x_0-\mu_0),$$
where $\mu_0$ and $\mu_1$ are the means of $p_0$ and $p_1$, and $\bar{p}_0$ and $\bar{p}_1$ are the corresponding centered probabilities. And if the measures are not Gaussian, but have variances $\Sigma_0$ and $\Sigma_1$, Gelbrich (1990) proved that
$$W_2(p_0,p_1)^2 \ge \|\mu_1-\mu_0\|_2^2 + \operatorname{tr}\left(\Sigma_0 + \Sigma_1 - 2\Big(\Sigma_1^{1/2}\Sigma_0\Sigma_1^{1/2}\Big)^{1/2}\right).$$
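In dimension 1, both the distance and the monotone transport map are easily approximated from samples with empirical quantile functions; a minimal R sketch (simulated Gaussian samples, so that the result can be checked against the closed form above):

set.seed(1)
x0 <- rnorm(1000, mean = 0, sd = 1)       # sample from p_0
x1 <- rnorm(1000, mean = 2, sd = 2)       # sample from p_1
u  <- (1:1000 - 0.5) / 1000
sqrt(mean((quantile(x0, u) - quantile(x1, u))^2))    # ~ sqrt((2-0)^2 + (2-1)^2)
Transport <- function(x) quantile(x1, ecdf(x0)(x))   # empirical F_1^{-1} o F_0
Transport(c(-1, 0, 1))                               # ~ mu_1 + (sigma_1/sigma_0)(x - mu_0)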
To conclude this part, Banerjee et al. (2005) suggested loss functions named
“Bregman distance functions.”
Definition 3.12 (Bregman Distance Functions (Banerjee et al. 2005)) Given a strictly convex differentiable function $\phi:\mathbb{R}\to\mathbb{R}$,
$$B_\phi(y_1,y_2) = \phi(y_1) - \phi(y_2) - \phi'(y_2)(y_1-y_2),$$
or, if $\phi:\mathbb{R}^d\to\mathbb{R}$,
$$B_\phi(y_1,y_2) = \phi(y_1) - \phi(y_2) - \big\langle\nabla\phi(y_2),\,y_1-y_2\big\rangle.$$
Note that $B_\phi$ is positive, with $B_\phi(y_1,y_2) = 0$ if and only if $y_1 = y_2$ (though it is generally not symmetric). For example, if $\phi(t) = t^2$, $B_\phi(y_1,y_2) = \ell_2(y_1,y_2) = (y_1-y_2)^2$. Huttegger (2017) pointed out that those functions have a "nice epistemological motivation." Consider
some very general distance function $\psi$, such that, for any random variable Z,
$$\mathbb{E}\big[\psi(Y,\mathbb{E}[Y])\big] \le \mathbb{E}\big[\psi(Y,Z)\big],$$
meaning that $\mathbb{E}[Y]$ is the "best estimate" of Y (according to this distance $\psi$). If $Y = \mathbf{1}_A$, it means that
$$\mathbb{E}\big[\psi(\mathbf{1}_A,\mathbb{P}[A])\big] \le \mathbb{E}\big[\psi(\mathbf{1}_A,Z)\big],$$
meaning that $\mathbb{P}[A]$ is the "best" degree of belief in $\mathbf{1}_A$. If we suppose that $\psi$ is continuously differentiable in its first argument, and $\psi(0,0) = 0$, then $\psi$ is a Bregman distance function. And one can write, if $\phi:\mathbb{R}^d\to\mathbb{R}$,
$$B_\phi(y_1,y_2) = \frac{1}{2}(y_1-y_2)^\top\nabla^2\phi(y_t)(y_1-y_2),$$
where $\nabla^2$ denotes the Hessian matrix, and where $y_t = ty_1 + (1-t)y_2$, for some $t\in[0,1]$. We recognize some sort of local Mahalanobis distance, induced by $\nabla^2\phi(y_t)$.
More specifically, in this GLM framework, different quantities are used: the canonical parameter is $\theta_i$, the prediction for $y_i$ is
$$\mu_i = \mathbb{E}(Y_i) = b'(\theta_i),$$
and the score (or linear predictor) is
$$\eta_i = x_i^\top\beta,\quad\text{with } g(\mu_i) = \eta_i \text{ for some link function } g.$$
For the "canonical link function," $g^{-1} = b'$ and then $\theta_i = x_i^\top\beta = \eta_i$ and $\mu_i = \mathbb{E}(Y_i) = g^{-1}(x_i^\top\beta)$. Inference is performed by finding the maximum of the log-likelihood, that is
$$\log\mathcal{L} = \frac{1}{\varphi}\sum_{i=1}^{n}\Big[y_i\,x_i^\top\beta - b(x_i^\top\beta)\Big] + \underbrace{\sum_{i=1}^{n} c(y_i,\varphi)}_{\text{independent of }\beta},$$
and if $\widehat{\beta}$ denotes the optimal parameter, the prediction is $\widehat{y}_i = \widehat{m}(x_i) = g^{-1}(x_i^\top\widehat{\beta})$.
In Fig. 3.8, we have an explanatory diagram of a GLM, starting from some predictor variables $x = (x_1,\cdots,x_k)$ (on the left) and a target variable y (on the right). The score, $\eta = x^\top\beta$, is created from the predictors $x$, and then the prediction is obtained by a nonlinear transformation, $m(x) = g^{-1}(x^\top\beta)$.
Fig. 3.8 Explanatory diagram of a generalized linear model (GLM), starting from some predictor
variables .x = (x1 , · · · , xk ) and a target variable y
With the canonical link function, the first-order condition is simply (with a standard matrix formulation)
$$\nabla\log\mathcal{L} = X^\top(y - \widehat{y}) = X^\top\big(y - g^{-1}(X\widehat{\beta})\big) = 0.$$
This is the equation solved numerically when calling glm in R (using Fisher scoring, an iterative algorithm equivalent to a Newton–Raphson-type descent in which an explicit expression of the Hessian is used). In a sense, the probabilistic construction is simply a way of interpreting
the derivation. For example, a Poisson regression can be performed on positive
observations .yi (not necessarily integers), which makes sense if we focus only on
solving the first-order condition (as done by the computer), not if we care about the
interpretation. And actually, we can see this approach as a “machine-learning” one.
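A minimal R sketch of this point (simulated data, hypothetical variable names): glm only solves the first-order condition above, so a Poisson regression can be fitted to positive, non-integer outcomes.

set.seed(1)
n  <- 500
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- rgamma(n, shape = 2, rate = 2 / exp(0.5 + 0.3 * df$x1))   # positive, non-integer
fit <- glm(y ~ x1 + x2, family = poisson(link = "log"), data = df)
coef(fit)   # the score equation is solved (R only warns about non-integer responses)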
For convenience, let us write, quite generally, $\log\mathcal{L}_i(\widehat{y}_i)$ for the contribution of the i-th observation to the log-likelihood.
As mentioned previously, with a machine-learning approach, the in-sample empirical risk is
$$\widehat{R}_n(\widehat{m}) = \frac{1}{n}\sum_{i=1}^{n}\ell\big(\widehat{m}(x_i), y_i\big),$$
and, with a statistical approach, the (scaled) deviance is
$$D^\star = \sum_{i=1}^{n} d^\star(\widehat{y}_i, y_i),\quad\text{where } d^\star(\widehat{y}_i, y_i) = 2\big[\log\mathcal{L}_i(y_i) - \log\mathcal{L}_i(\widehat{y}_i)\big].$$
Here, the first term $\log\mathcal{L}_i(y_i)$ corresponds to the log-likelihood of a "perfect fit" (as $\widehat{y}_i$ is supposed to be equal to $y_i$), also called "saturated model." The unscaled deviance, $d = \varphi\,d^\star$, is used as a loss function. For the Gaussian model, $\ell(y_i,\widehat{y}_i) = (y_i-\widehat{y}_i)^2$, and the deviance corresponds to the $\ell_2$ loss. For the Poisson distribution (with a log-link), the loss would be
$$\ell(y_i,\widehat{y}_i) = \begin{cases} 2\big(y_i\log y_i - y_i\log\widehat{y}_i - y_i + \widehat{y}_i\big) & \text{if } y_i > 0,\\[4pt] 2\,\widehat{y}_i & \text{if } y_i = 0,\end{cases}$$
and, more generally, for some power parameter a,
$$\ell_a(y,\widehat{y}) = \frac{\widehat{y}^{\,2-a}}{2-a} - y\,\frac{\widehat{y}^{\,1-a}}{1-a}.$$
for some $\lambda > 0$ and $\pi\in(0,1)$, whereas for the zero-inflated Poisson,
$$\mathbb{P}(Y_i = y_i) = \begin{cases} \pi + (1-\pi)e^{-\lambda} & \text{if } y_i = 0,\\[4pt] (1-\pi)\dfrac{\lambda^{y_i}e^{-\lambda}}{y_i!} & \text{if } y_i = 1,2,\cdots\end{cases}$$
A zero-inflated model can only increase the probability of having .y = 0, but this
is not a restriction in hurdle models. From those distributions, the idea here is to
consider a logistic regression for the binomial component, so that $\operatorname{logit}(\pi_i) = x_i^\top\beta_b$, and a Poisson regression for the counts, so that $\lambda_i = \exp(x_i^\top\beta_p)$. For the hurdle model,
because there are two parameters, the log-likelihood can be separated in two terms
(the parameters are therefore estimated independently), so that one can derive the
associated loss functions. For example, the R package countreg contains loss
functions that can be used in the package mboost for boosting.
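A minimal R sketch of zero-inflated and hurdle Poisson fits, here with the pscl package (an assumption on our side; the countreg functions mentioned above have a very similar interface), on simulated data with hypothetical variable names:

library(pscl)
set.seed(1)
n  <- 2000
x1 <- rnorm(n)
pi0    <- plogis(-0.5 + x1)               # probability of a structural zero
lambda <- exp(0.2 + 0.5 * x1)
y  <- ifelse(runif(n) < pi0, 0, rpois(n, lambda))
df <- data.frame(y = y, x1 = x1)
zip <- zeroinfl(y ~ x1 | x1, data = df, dist = "poisson")   # count part | zero part
hur <- hurdle(y ~ x1 | x1, data = df, dist = "poisson")
coef(zip); coef(hur)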
A strong assumption of linear models is, precisely, linearity. Actually, the term "linear" is ambiguous, because $\eta_i = x_i^\top\beta = \langle x_i,\beta\rangle$ (using geometric notations for inner products on vector spaces) is linear both in $x$ and in $\beta$. If x is the age of a policyholder, we can consider a linear model in x but also a linear model in $\log(x)$, $\sqrt{x}$, $(x-20)_+$, or any nonlinear transformation.
“Natura non facit saltus” (nature does not make jumps) as claimed by Leibniz,
meaning that in most applications, we consider continuous transformations. For
$$\eta_i = g(\mu_i) = x_i^\top\beta = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k,$$
will become
$$\eta_i = g(\mu_i) = \beta_0 + s_1(x_1) + \cdots + s_k(x_k),$$
where each function $s_j$ is some unspecified function, introduced to make the model more flexible. Note that it is still additive here, so those models are named GAM, for "generalized additive models." In the Gaussian case, if g is the identity function, instead of seeking a model $m(x_1,\cdots,x_k) = \beta_0 + s_1(x_1) + \cdots + s_k(x_k)$ such that
$$\widehat{m} \in \operatorname{argmin}\left\{\sum_{i=1}^{n}\Big(y_i - \big(\beta_0 + s_1(x_{1i}) + \cdots + s_k(x_{ki})\big)\Big)^2\right\},$$
Fig. 3.9 Explanatory diagram of a generalized additive model (GAM), starting from the same predictor variables $x = (x_1,\cdots,x_k)$ (on the left) and with the same target variable y (on the right). Each continuous variable $x_j$ is converted into a function $h(x_j)$, expressed in some basis of functions (such as splines), $h_1(x_j),\cdots,h_k(x_j)$
Fig. 3.10 Evolution of $(x_1,x_2)\mapsto\widehat{m}(x_1,x_2,A)$, on the toydata2, with a plain logistic regression on the left (generalized linear model [GLM]), and a generalized additive model (GAM) on the right, fitted on the toydata2 training dataset. The area in the lower left corner corresponds to low probabilities for $\mathbb{P}[Y=1|X_1=x_1,X_2=x_2]$, whereas the area in the upper right corner corresponds to high probabilities. True values of $(x_1,x_2)\mapsto\mu(x_1,x_2,A) = \mathbb{E}[Y|x_1,x_2,A]$ are on the left of Fig. 1.6. The scale can be visualized on the right (in %)
$$\mathbb{E}\left[\left(\frac{\partial^2 m(x)}{\partial x_j\,\partial x_{j'}}\bigg|_{x=X}\right)^2\right] > 0.$$
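A minimal R sketch of a GAM fit next to a plain logistic GLM, using the mgcv package (an assumption on our side; the book does not specify the package), on simulated data standing in for toydata2:

library(mgcv)
set.seed(1)
n  <- 1000
df <- data.frame(x1 = rnorm(n), x2 = runif(n, 0, 10))
df$y <- rbinom(n, 1, plogis(-1 + sin(df$x1) + 0.3 * (df$x2 - 5)))
fit_glm <- glm(y ~ x1 + x2, family = binomial, data = df)
fit_gam <- gam(y ~ s(x1) + s(x2), family = binomial, data = df)
# plot(fit_gam, pages = 1)   # the estimated smooth functions s_1 and s_2
predict(fit_gam, newdata = data.frame(x1 = 0, x2 = 5), type = "response")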
$$y = X\beta + Z\gamma + \varepsilon,$$
where $X$ and $\beta$ are the fixed-effects design matrix and the fixed effects (fixed but unknown), respectively, whereas $Z$ and $\gamma$ are the random-effects design matrix and the random effects, respectively. The latter are used to capture some remaining heterogeneity that was not captured by the $x$ variables. Here, $\varepsilon$ contains the residual components, assumed to be independent of $\gamma$. And naturally, one can define a "generalized linear mixed model" (GLMM), as studied in McCulloch and Searle (2004),
$$g\big(\mathbb{E}[Y|\gamma]\big) = x^\top\beta + z^\top\gamma,$$
as in Jiang and Nguyen (2007) and Antonio and Beirlant (2007). It is possible to
make connections between credibility and a GLMM, as in Klinker (2010). The R
package glmm can be used here.
$$\widehat{\beta}_\lambda^{\text{ridge}} = (X^\top X + \lambda\mathbb{I})^{-1}X^\top y,$$
which can also be written
$$\widehat{\beta}_\lambda^{\text{ridge}} = \operatorname{argmin}\Big\{\underbrace{\big\|y - X\beta\big\|_{\ell_2}^2}_{\text{empirical risk}} + \underbrace{\lambda\|\beta\|_{\ell_2}^2}_{\text{penalty}}\Big\}.$$
For the interpretation, variables should be on the same scale, so classically, variables are standardized to have unit variance. In an OLS context, we want to solve:
Definition 3.14 (Ridge Estimator (OLS) (Hoerl and Kennard 1970))
$$\widehat{\beta}_\lambda^{\text{ridge}} = \underset{\beta\in\mathbb{R}^k}{\operatorname{argmin}}\left\{\frac{1}{2}\sum_{i=1}^{n}\big(y_i - x_i^\top\beta\big)^2 + \lambda\sum_{j=1}^{k}\beta_j^2\right\}.$$
See van Wieringen (2015) for many more results on ridge regression. The "least absolute shrinkage and selection operator" regression, also called "LASSO," was introduced in Santosa and Symes (1986), and popularized by Tibshirani (1996), which extended Breiman (1995). Heuristically, the best subset selection problem can be expressed as
$$\min\Big\{\big\|y - X\beta\big\|_{\ell_2}^2\Big\} \quad\text{subject to}\quad \|\beta\|_{\ell_0}\le\kappa,$$
which can be relaxed by considering constraints of the form $\|\beta\|_{\ell_p}\le\kappa$, with $\|\beta\|_{\ell_p} = \big(\sum_{j=1}^{k}|\beta_j|^p\big)^{1/p}$. On the one hand, if $p\le 1$, the optimization problem can be seen as a variable selection technique, as the optimal parameter has some null components (this corresponds to the statistical concept of "sparsity", see Hastie et al. (2015)). On the other hand, if $p\ge 1$, it is a convex constraint (strictly convex if $p>1$), which simplifies computations. Thus, $p=1$ is an interesting case. When the objective is the sum of the squares of residuals, we want to solve:
Definition 3.16 (LASSO Estimator (OLS) (Tibshirani 1996))
$$\widehat{\beta}_\lambda^{\text{lasso}} = \underset{\beta\in\mathbb{R}^k}{\operatorname{argmin}}\left\{\frac{1}{2}\sum_{i=1}^{n}\big(y_i - x_i^\top\beta\big)^2 + \lambda\sum_{j=1}^{k}|\beta_j|\right\}.$$
And it is actually possible to consider the "elastic net method" that (linearly) combines the $\ell_1$ and $\ell_2$ penalties of the LASSO and ridge methods. Starting from the LASSO penalty, Zou and Hastie (2005) suggested adding a quadratic penalty term that serves to enforce strict convexity of the loss function, resulting in a unique minimum. Within the OLS framework, consider
$$\widehat{\beta}_{\lambda_1,\lambda_2}^{\text{elastic}} = \underset{\beta\in\mathbb{R}^k}{\operatorname{argmin}}\left\{\frac{1}{2}\sum_{i=1}^{n}\big(y_i - x_i^\top\beta\big)^2 + \lambda_1\sum_{j=1}^{k}|\beta_j| + \lambda_2\sum_{j=1}^{k}\beta_j^2\right\}.$$
In R, the package glmnet can be used to estimate those models, as in Fig. 3.11, on the GermanCredit training dataset.6 We can visualize the shrinkage and the variable selection: if we want to consider only two variables, the indicator associated with having no checking account ($\mathbf{1}$(no checking account)) and the duration of the credit (duration) are supposed to be the "best" two. Note that in those algorithms, variables y and $x$ are usually centered, to remove the intercept $\beta_0$, and are also scaled. If we further assume that the variables $x$ are orthogonal with unit $\ell_2$ norm, $X^\top X = \mathbb{I}$, then
$$\widehat{\beta}_\lambda^{\text{ridge}} = \frac{1}{1+\lambda}\,\widehat{\beta}^{\text{ols}},$$
which is a "shrinkage estimator" of the least squares estimator (as the proportionality coefficient is smaller than 1). Considering $\lambda>0$ will induce a bias, $\mathbb{E}[\widehat{\beta}^{\text{ols}}] \neq \mathbb{E}[\widehat{\beta}_\lambda^{\text{ridge}}]$, but at the same time it could (hopefully) "reduce" the variance, in the sense that $\operatorname{Var}[\widehat{\beta}^{\text{ols}}] - \operatorname{Var}[\widehat{\beta}_\lambda^{\text{ridge}}]$ is a positive matrix. Theobald (1974) and Farebrother (1976) proved that such a property holds for some $\lambda>0$.
6 This is a simple example, with covariates that are both continuous and categorical. See Friedman
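A minimal R sketch of ridge, LASSO and elastic net fits with glmnet (simulated design matrix; alpha = 0, 1 and an intermediate value correspond to the three penalties):

library(glmnet)
set.seed(1)
n <- 200; k <- 10
X <- scale(matrix(rnorm(n * k), n, k))
y <- X[, 1] - 2 * X[, 2] + rnorm(n)
fit_ridge <- glmnet(X, y, alpha = 0)       # ell_2 penalty
fit_lasso <- glmnet(X, y, alpha = 1)       # ell_1 penalty
fit_enet  <- glmnet(X, y, alpha = 0.5)     # elastic net
cv <- cv.glmnet(X, y, alpha = 1)           # lambda chosen by cross-validation
coef(cv, s = "lambda.min")                 # sparse vector of coefficients
# plot(fit_lasso, xvar = "lambda")         # shrinkage paths, as in Fig. 3.11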
The variance reduction of those estimators comes at a price: the estimators of $\beta$ are deliberately biased, as well as the predictions $\widehat{y}_i = x_i^\top\widehat{\beta}_\lambda$ (whatever the penalty considered); consequently, those models are not well calibrated (in
the sense discussed in Sect. 4.3.3). Nevertheless, as discussed for instance in
Steyerberg et al. (2001), in a classification context, those techniques might actually
improve the calibration of predictions, especially when the number of covariates
is large. And as proved by Fan and Li (2001), the LASSO estimate satisfies a nice
consistency property, in the sense that the probability of estimating 0’s for zero-
valued parameters tends to one, when .n → ∞. The algorithm selects the correct
variables and provides unbiased estimates of selected parameters, satisfying an
“oracle property.”
It is also possible to make a connection between credibility and penalized regression, as shown in Miller (2015a) and Fry (2015). Recall that Bühlmann credibility is
the solution of a Bayesian model whose prior depends on the likelihood hypothesis
for the target, as discussed in Jewell (1974), Klugman (1991) or Bühlmann and
Gisler (2005). And penalized regression is the solution of a Bayesian model with
either a normal (for ridge) or a Laplace (for LASSO) prior.
We refer to Denuit et al. (2019b) for more details, and applications in actuarial
science. Neural networks are first an architecture that can be seen as an extension of
the one we have seen with GLMs and GAMs. In Fig. 3.12, we have a neural network
with two “hidden layers,” between the predictor variables .x = (x1 , · · · , xk ), and
the output. The first layer consists of three "neurons" (or latent variables), and the second layer consists of two neurons.
To get a more visual understanding, one can consider the use of principal
component analysis (PCA) to reduce dimension in a GLM, as in Fig. 3.13. The
single layer consists here of the collection of the k principal components, obtained using simple algebra, so that each component $z_j$ is a linear combination of the predictors $x$. In this architecture, we consider a single layer, with k neurons, and
only two are used afterward. Here, we keep the idea of using a linear combination
of the variables.
Once the architecture is fixed, we do not try to construct (a priori) interpretable neurons in the intermediate layer: we care only about accuracy, and the construction of intermediate variables is optimized (this is called "back propagation"). As a starting
point, consider some binary explanatory variables, $x\in\{0,1\}^k$; McCulloch and Pitts (1943) suggested a simple model, with threshold b,
$$y_i = h\left(\sum_{j=1}^{k} x_{j,i}\right),\quad\text{where } h(x) = \mathbf{1}(x\ge b),$$
Fig. 3.12 Explanatory diagram of a neural network, starting from the same predictor variables
.x = (x1 , · · · , xk ) (actually, a normalized version of those variables, as explained in Friedman
et al. (2001)) and with the same target variable y. The intermediate layers (of neurons) can be
considered the constitution of intermediate features that are then aggregated
Fig. 3.13 Explanatory diagram showing the use of principal component analysis (PCA, see Sect. 3.4) to reduce dimension in a GLM, seen as a neural network architecture, starting from the same predictor variables $x = (x_1,\cdots,x_k)$ and with the same target variable y. The intermediate layer consists of the k principal components. Then, the GLM is considered not on the k predictors but on the first two principal components
or (equivalently)
$$y_i = h\left(\omega + \sum_{j=1}^{k} x_{j,i}\right),\quad\text{where } h(x) = \mathbf{1}(x\ge 0),$$
with weight .ω = −b. The trick of adding 1 as an input was very important (as the
intercept in the linear regression) and can lead to simple interpretation. For instance,
if .ω = −1 we recognize the “or” logical operator (.yi = 1 if .∃j such that .xj,i = 1),
whereas if .ω = −k, we recognize the “and” logical operator (.yi = 1 if .∀j , .xj,i =
1). Unfortunately, it is not possible to get the "xor" logical operator (the "exclusive or," i.e., $y_i = 1$ if $x_{1,i}\neq x_{2,i}$) with this architecture. Rosenblatt (1961) considered
the extension where .x’s are real-valued (instead of binary), with “weight” .ω ∈ Rk
(the word is between quotation marks because here, weights can be negative)
$$y_i = h\left(\sum_{j=1}^{k}\omega_j\,x_{j,i}\right),\quad\text{where } h(x) = \mathbf{1}(x\ge b).$$
The threshold function h can also be replaced by a smooth activation function, such as the hyperbolic tangent
$$h(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}},$$
or the popular ReLU function (“rectified linear unit”). So here, for a classification
problem,
$$\widehat{y}_i = h\left(\omega_0 + \sum_{j=1}^{k}\omega_j\,x_{j,i}\right) = m(x_i).$$
But so far, there is nothing more, compared with an econometric model (actually we have less, since there are no probabilistic foundations, hence no confidence intervals, for instance). The interesting idea was to replicate those models, in a network.
Instead of mapping .x and y, the idea is to map .x and some latent variables .zj , and
then to map .z and y, using the same structure. And possibly, we can use a deeper
network, not with one single layer, but many more. Consider Fig. 3.14, which is a
simplified version of Fig. 3.12.
Fig. 3.14 Neural network, starting from the same predictor variables .x = (x1 , · · · , xk ) and with
the same target variable y, and a single layer, with variables .z = (z1 , · · · , zJ )
Our model $m_\omega$ is now based on $(k+1)(J+1)$ parameters. Given a model $m_\omega$, we can compute the quadratic loss function,
$$\sum_{i=1}^{n}\big(y_i - m_\omega(x_i)\big)^2,$$
the cross-entropy,
$$-\sum_{i=1}^{n}\Big(y_i\log m_\omega(x_i) + [1-y_i]\log\big[1-m_\omega(x_i)\big]\Big),$$
or, more generally, some loss
$$\sum_{i=1}^{n}\ell\big(y_i, m_\omega(x_i)\big).$$
The idea is then to get optimal weights $\widehat{\omega} = (w, w_1,\cdots,w_J)$, by solving
$$\widehat{\omega} = \underset{\omega}{\operatorname{argmin}}\left\{\sum_{i=1}^{n}\ell\big(y_i, m_\omega(x_i)\big)\right\}.$$
hidden layers than the number of predictors. The point is to capture nonlinearity and
nonmonotonicity.
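A minimal R sketch of a single-hidden-layer network, here fitted with the nnet package (an assumption on our side; weights are obtained by minimizing the cross-entropy), on simulated data with hypothetical variable names:

library(nnet)
set.seed(1)
n  <- 1000
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- rbinom(n, 1, plogis(sin(2 * df$x1) + df$x2^2 - 1))
fit <- nnet(y ~ x1 + x2, data = df, size = 3,      # J = 3 neurons in the hidden layer
            entropy = TRUE, decay = 1e-3, maxit = 500, trace = FALSE)
head(predict(fit, df))                             # fitted probabilities m_omega(x)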
Without going too much further (see Denuit et al. (2019b) or Wüthrich and Merz
(2022) for more details), let us stress here that neural networks are intensively used
because they have strong theoretical foundations and some “universal approxima-
tion theorems,” even in very general frameworks. Those results were obtained at
the end of the 1980s, with Cybenko (1989), with an arbitrary number of artificial
neurons, or Hornik et al. (1989); Hornik (1991), with multilayer feed-forward
networks and only one single hidden layer. Later on, Leshno et al. (1993) proved that
the “universal approximation theorem” was equivalent to having a nonpolynomial
activation function (see Haykin (1998) for more details on theoretical properties).
Those theorems, which provide a guarantee of having a uniform approximation of $\mu$ by some $m_w$ (in the sense that, for any $\epsilon > 0$, one can find $w$—and an appropriate
Again, let us briefly explain the general idea (see Denuit et al. (2020) for more
details). Decision trees appeared in the statistical literature in the 1970s and the
1980s, with Messenger and Mandell (1972) with Theta Automatic Interaction
Detection (THAID), then Breiman and Stone (1977) and Breiman et al. (1984)
with Classification And Regression Trees (CART), as well as Quinlan (1986, 1987,
1993) with Iterative Dichotomiser 3 (ID3) that later became C4.5 (“a landmark
decision tree program that is probably the machine learning workhorse most widely
used in practice to date” as Witten et al. (2016) wrote). One should probably
mention here the fact that the idea of “decision trees” was mentioned earlier in
psychology, as discussed in Winterfeldt and Edwards (1986), but without any details
about algorithmic construction (see also Lima (2014) for old representations of
"decision trees"). Indeed, as already noted in Laurent and Rivest (1976), constructing optimal binary decision trees is long and time consuming, and using some backward procedure will speed up the process. Starting with the entire training set, we select
the most predictive variable (with respect to some criteria, as discussed later) and
we split the population in two, using an appropriate threshold and this predictive
variable. And then we iterate within each sub-population. Heuristically, each sub-
population should be as homogeneous as possible, with respect to y. In the methods
considered previously, we contemplated all explanatory variables together, using
linear algebra techniques to solve optimization problems more efficiently, but here,
variables are used sequentially (or to be more specific, binary step functions, as .xj
becomes .1(xj ≤ t) for some optimally selected threshold). After some iterations,
we end up with some subgroups, or sub-regions in .X called either “leaves” or
terminal nodes, whereas intermediary splits are called internal nodes. In the visual
representation, segments connecting nodes are called “branches”, and to continue
with the arboricultural and forestry metaphors, we evoke pruning when we trim the
trees to avoid possible overfit.
As mentioned, after several iterations, we split the population and the space $\mathcal{X}$ into a partition of J regions, $R_1,\cdots,R_J$, such that $R_i\cap R_j = \emptyset$ when $i\neq j$ and $R_1\cup R_2\cup\cdots\cup R_J = \mathcal{X}$. Then, in each region, prediction is performed simply by averaging,
$$\widehat{y}_{R_j} = \frac{1}{|R_j|}\sum_{i:\,x_i\in R_j} y_i,$$
or using a majority rule for a classifier. Classically, regions are (hyper) rectangles
in .X ⊂ Rk (or orthants), in order to simplify the construction of the tree, and to
have a simple and graphical interpretation of the model. In a regression context,
the classical strategy is to minimize the mean squared error, which is the in-sample empirical risk for the $\ell_2$-norm,
$$\text{MSE} = \sum_{j=1}^{J}\text{MSE}_j \quad\text{where}\quad \text{MSE}_j = \sum_{i:\,x_i\in R_j}\big(y_i - \widehat{y}_{R_j}\big)^2,$$
where $\widehat{y}_{R_j}$ is the prediction in region $R_j$. For a classification problem, observe that
$$\text{MSE}_j = \sum_{i:\,x_i\in R_j}\big(y_i - \widehat{y}_{R_j}\big)^2 = n_{0,j}\big(0 - \widehat{y}_{R_j}\big)^2 + n_{1,j}\big(1 - \widehat{y}_{R_j}\big)^2,$$
so that, if $n_{0,j}$ and $n_{1,j}$ denote the number of observations such that $y=0$ and $y=1$ respectively in the region $R_j$, with $\widehat{y}_{R_j} = n_{1,j}/(n_{0,j}+n_{1,j})$,
$$\text{MSE}_j = n_{0,j}\left(\frac{n_{1,j}}{n_{0,j}+n_{1,j}}\right)^2 + n_{1,j}\left(\frac{n_{0,j}}{n_{0,j}+n_{1,j}}\right)^2 = \frac{n_{0,j}\,n_{1,j}}{n_{0,j}+n_{1,j}},$$
and therefore
$$\text{MSE} = \sum_{j=1}^{J}\frac{n_{0,j}\,n_{1,j}}{n_{0,j}+n_{1,j}} = \sum_{j=1}^{J} n_j\cdot p_{1,j}(1-p_{1,j}),$$
where $p_{1,j}$ denotes the proportion of observations with $y=1$ in region $R_j$.
Instead of considering all possible partitions of $\mathcal{X}$ that are rectangles, some "top-down greedy approach" is usually considered, with some recursive binary splitting. The algorithm is top-down as we start with all observations in the same class (usually called the "root" of the tree) and then we split into regions that are smaller and smaller. And it is greedy because optimization is performed without looking back at the past. Formally, at the first stage, select variable $x_\kappa$ and some cutoff point t, and consider the half spaces $\{x\in\mathcal{X}: x_\kappa\le t\}$ and $\{x\in\mathcal{X}: x_\kappa > t\}$. Then find the best variable and the best cutoff, solution of
$$\min_{\kappa=1,\cdots,k}\left\{\inf_{t\in\mathcal{X}_\kappa}\big\{\text{MSE}_\kappa(t)\big\}\right\}.$$
empirical risk and yield the split of $\mathcal{X}_j$ into two half spaces. And iterate. Ultimately, each leaf contains a single observation, which corresponds to a null empirical risk on the training dataset, but will hardly generalize. To avoid this overfit, it will be necessary either to introduce a stopping criterion or to prune the complete tree. In practice, we stop when the leaves have reached a minimum number of observations, set beforehand. An alternative is to stop if the variation (relative or absolute) of the objective function does not decrease enough. Formally, to decide whether a leaf $\{N\}$ should be divided into $\{N_L,N_R\}$, compute the impurity variation
$$\Delta I(N_L,N_R) = I(N) - I(N_L,N_R) = \psi(\widehat{y}_N) - \left[\frac{n_L}{n}\,\psi(\widehat{y}_{N_L}) + \frac{n_R}{n}\,\psi(\widehat{y}_{N_R})\right].$$
• if $x_2 < 5.9$ (first node, first branch, 43%, $\overline{y} = 12\%$, final leaf)
• if $x_1 < -0.39$ (first node, second branch, 21%, $\overline{y} = 38\%$, final leaf)
• if $x_1 \ge -0.39$ (second node, second branch, 9%, $\overline{y} = 63\%$, final leaf)
Fig. 3.15 Classification tree on the toydata2 dataset, with default pruning parameters of rpart. At the top, the entire population (100%, $\overline{y} = 40\%$) and at the bottom six leaves, on the left, 43% of the population with $\overline{y} = 12\%$ and on the right, 16% of the population with $\overline{y} = 89\%$. The six regions (in space $\mathcal{X}$), associated with the six leaves, can be visualized on the right-hand side of Fig. 3.18
Table 3.1 Predictions for two individuals Andrew and Barbara, with models trained on
toydata2, with the true value .μ (used to generate data), then a plain logistic one (glm), an
additive logistic one (gam), a classification tree (cart), a random forest (rf), and a boosting model
(gbm, as in Fig. 3.21)
x1  x2  x3  s  $\mu(x,s)$  $\widehat{m}^{\text{glm}}(x)$  $\widehat{m}^{\text{gam}}(x)$  $\widehat{m}^{\text{cart}}(x)$  $\widehat{m}^{\text{rf}}(x)$  $\widehat{m}^{\text{gbm}}(x)$
Andrew  −1  8  −2  A  0.366  0.379  0.372  0.384  0.310  0.354
Barbara  1  4  2  B  0.587  0.615  0.583  0.434  0.622  0.595
Fig. 3.16 Classification tree on the GermanCredit dataset, with default pruning parameters of rpart. At the top, the entire population (100%, $\overline{y} = 30\%$) and at the bottom nine leaves, on the left, 45% of the population with $\overline{y} = 13\%$ and on the right, 3% of the population with $\overline{y} = 35\%$
• if $x_2 \ge 1.6$ (second node, fourth branch, 5%, $\overline{y} = 83\%$, final leaf)
• if $x_2 \ge 4.3$ (second node, third branch, 16%, $\overline{y} = 89\%$, final leaf)
Observe that only variables $x_1$ and $x_2$ are used here. It is then possible to visualize the prediction for a given pair $(x_1,x_2)$ (the values of $x_3$ and s will have no influence), as in Fig. 3.15. It is possible to visualize two specific predictions, as in Table 3.1 (we discuss further interpretations for those two specific individuals in Sect. 4.1).
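A minimal R sketch of a classification tree fitted with rpart (default-style pruning via the complexity parameter cp), on simulated data standing in for toydata2, which is not reproduced here:

library(rpart)
set.seed(1)
n  <- 1000
df <- data.frame(x1 = rnorm(n), x2 = runif(n, 0, 10), x3 = rnorm(n))
df$y <- rbinom(n, 1, plogis(-2 + 0.8 * df$x1 + 0.4 * df$x2))
tree <- rpart(factor(y) ~ x1 + x2 + x3, data = df, method = "class",
              control = rpart.control(cp = 0.01, minbucket = 20))
print(tree)                      # splits, node proportions and class frequencies
# rpart.plot::rpart.plot(tree)   # a display similar to Fig. 3.15
predict(tree, newdata = data.frame(x1 = -1, x2 = 8, x3 = -2), type = "prob")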
In Fig. 3.16 we can visualize the classification tree obtained on the
GermanCredit dataset. Recall that in the entire training population (.n = 700),
.y = 30.1%. Then the tree grows as follows,
• if Duration < 22.5 (first node, second branch, 43%, $\overline{y} = 12\%$, final leaf)
• ...
For the pruning procedure, create a very large and deep tree, and then cut some branches. Formally, given a large tree $T_0$, identify a subtree $T\subset T_0$ that minimizes
$$\sum_{m=1}^{|T|}\sum_{i:\,x_i\in R_m}\big(y_i - \widehat{y}_{R_m}\big)^2 + \alpha|T|,$$
where $\alpha$ is some complexity parameter, and $|T|$ is the number of leaves in the subtree T. Observe that it is similar to the penalized methods described previously.
We have seen so far how to estimate various models, and then we use some metrics to select "the best one." But rather than choosing the best among different models, it could be more efficient to combine them. Among those "ensemble methods," there are bagging, random forests, and boosting (see Sollich and Krogh (1995), Opitz and Maclin (1999) or Zhou (2012)). Those techniques can be related to "Bayesian model averaging", which linearly combines submodels of the same family, with the posterior probabilities of each model, as coined in Raftery et al. (1997) or Wasserman (2000), and "stacking", which involves training a model to combine the predictions of several other learning algorithms, as described in Wolpert (1992) or Breiman (1996c).
It should be stressed here that a "weak learner" is defined as a classifier that correlates only slightly with the true classification (it can label examples slightly better than random guessing, so to speak). In contrast, a "strong learner" is a classifier that is arbitrarily well correlated with the true classification. Long story short, we discuss in this section the idea that combining "weak learners" could yield better results than seeking a "strong learner".
A first approach is the one described in Fig. 3.17: consider a collection of predictions, $\{\widehat{y}^{(1)},\cdots,\widehat{y}^{(k)}\}$, obtained using k models (possibly from different families, GLM, trees, neural networks, etc.), consider a linear combination of those models, and solve a problem like
$$\min_{\alpha\in\mathbb{R}^k}\left\{\sum_{i=1}^{n}\ell\Big(y_i,\ \alpha^\top\widehat{\boldsymbol{y}}_i\Big)\right\},\quad\text{where } \widehat{\boldsymbol{y}}_i = \big(\widehat{y}_i^{(1)},\cdots,\widehat{y}_i^{(k)}\big).$$
Fig. 3.17 Explanatory diagram of parallel training, or “bagging” (bootstrap and aggregation),
starting from the same predictor variables .x = (x1 , · · · , xk ) and with the same target variable
y. Different models .mj (x) are fitted, and the outcome is usually the average of the models
For example, with $k = 11$ judges, if each judge has a 30% chance of making the wrong decision, the majority is wrong with only an 8% chance. This probability decreases with k, unless there is a strong correlation. So in an ideal situation, ensemble techniques work better if predictions are independent from each other.
Galton (1907) suggested that technique while trying to "guess the weight of an ox" in a county fair in Cornwall, England, as recalled in Wallis (2014) and Surowiecki (2004). $k = 787$ participants provided guesses $\widehat{y}_1,\cdots,\widehat{y}_k$. The ox weighed 1198 pounds, and the average of the estimates was 1197 pounds. Francis Galton compared two strategies, either picking a single prediction $\widehat{y}_j$ (at random) or considering the average of the guesses, $\overline{\widehat{y}}$. For a true value t, if J is drawn uniformly in $\{1,\cdots,k\}$,
$$\mathbb{E}\big[(\widehat{y}_J - t)^2\big] = \big(\overline{\widehat{y}} - t\big)^2 + \frac{1}{k}\sum_{i=1}^{k}\big(\widehat{y}_i - \overline{\widehat{y}}\big)^2,$$
so clearly, using the average prediction is better than seeking the “best one.” From
a statistical perspective, hopefully, ensemble generalizes better than a single chosen
model. From a computational perspective, averaging should be faster than solving
an optimization problem (seeking the “best” model). And interestingly, it will take
us outside classical model classes.
In ensemble learning, we want models to be reasonably accurate, and as inde-
pendent as possible. A popular example is “bagging” (for bootstrap aggregating),
introduced to increase the stability of predictions and accuracy. It can also reduce
variance and overfit. The algorithm would be
1. Generate k training datasets using bootstrap (resampling), as subdividing the database into k smaller (independent) datasets would lead to training datasets that are too small;
2. Use each dataset as a training sample to fit a model $\widehat{m}^{(j)}$, with $j = 1,\cdots,k$, such as (deep) trees, with small bias and large variance (variance will decrease with aggregation, as shown below);
3. For a new observation $x$, the aggregated prediction will be
$$\widehat{m}^{\text{bagging}}(x) = \frac{1}{k}\sum_{j=1}^{k}\widehat{m}^{(j)}(x).$$
Observe that here, we use the same weights for all models. The simple version of
“random forest” is based on the aggregation of trees. In Fig. 3.18 we can visualize
the predictions on the toydata2 dataset, as a function of .x1 and .x2 (and .x3 = 0),
with a classification tree on the left-hand side, and the aggregation of .k = 500 trees
on the right.
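A minimal R sketch of bagging "by hand" (bootstrap samples, deep trees, averaged predictions) and of the shortcut offered by the randomForest package (an assumption on our side), on simulated data with hypothetical variable names:

library(rpart); library(randomForest)
set.seed(1)
n  <- 1000
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- rbinom(n, 1, plogis(df$x1 * df$x2))
k    <- 100
newx <- data.frame(x1 = 0.5, x2 = 0.5)
preds <- replicate(k, {
  boot <- df[sample(n, n, replace = TRUE), ]                      # bootstrap sample
  tree <- rpart(factor(y) ~ x1 + x2, data = boot, method = "class",
                control = rpart.control(cp = 0, minbucket = 5))   # deep tree
  predict(tree, newx, type = "prob")[, "1"]
})
mean(preds)                                                       # bagged prediction
rf <- randomForest(factor(y) ~ x1 + x2, data = df, ntree = 500)
predict(rf, newx, type = "prob")[, "1"]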
As explained in Friedman et al. (2001), the variance of aggregated models7 is
$$\operatorname{Var}\big[\widehat{m}^{\text{bagging}}(x)\big] = \operatorname{Var}\left[\frac{1}{k}\sum_{j=1}^{k}\widehat{m}^{(j)}(x)\right] = \frac{1}{k^2}\sum_{j_1=1}^{k}\sum_{j_2=1}^{k}\operatorname{Cov}\big[\widehat{m}^{(j_1)}(x),\widehat{m}^{(j_2)}(x)\big] \le \frac{1}{k^2}\,k^2\operatorname{Var}\big[\widehat{m}^{(j)}(x)\big] = \operatorname{Var}\big[\widehat{m}^{(j)}(x)\big],$$
as, when $j_1\neq j_2$, $\operatorname{Corr}\big[\widehat{m}^{(j_1)}(x),\widehat{m}^{(j_2)}(x)\big]\le 1$. And actually, if $\operatorname{Var}[\widehat{m}^{(j)}(x)] = \sigma^2(x)$ and $\operatorname{Corr}[\widehat{m}^{(j_1)}(x),\widehat{m}^{(j_2)}(x)] = r(x)$ for $j_1\neq j_2$,
$$\operatorname{Var}\big[\widehat{m}^{\text{bagging}}(x)\big] = r(x)\,\sigma^2(x) + \frac{1-r(x)}{k}\,\sigma^2(x).$$
The variance is lower when “more different” models are aggregated. And more
generally, as explained in (Denuit et al. 2020, Sect. 4.3.3),
$$\mathbb{E}\Big[\ell\big(Y,\widehat{m}^{\text{bagging}}(X)\big)\Big] \le \mathbb{E}\Big[\ell\big(Y,\widehat{m}^{(j)}(X)\big)\Big].$$
Fig. 3.19 Explanatory diagram of sequential learning ("boosting"), starting from the same predictor variables $x = (x_1,\cdots,x_k)$ and with the same target variable y, using "weak" learners $h_t$ learning from the residuals of the previous model, to improve the model sequentially ($h_{t+1}$ is fitted to explain the residuals $y - m_t(x)$, and the update is $m_{t+1} = m_t + h_{t+1}$)
or, more generally,
1. Initialization: k (number of trees), $\gamma$ (learning rate), and $m_0(x) = \underset{\varphi}{\operatorname{argmin}}\sum_{i=1}^{n}\ell(y_i,\varphi)$;
2. At stage $t\ge 1$,
   2.1 Compute the "residuals" $r_{i,t} \leftarrow -\dfrac{\partial\ell(y_i,\widehat{y})}{\partial\widehat{y}}\bigg|_{\widehat{y}=m_{t-1}(x_i)}$;
   2.2 Fit a model $r_{i,t}\sim h_t(x_i)$ for some weak learner $h\in\mathcal{H}$, $h_t = \underset{h\in\mathcal{H}}{\operatorname{argmin}}\sum_{i=1}^{n}\ell\big(r_{i,t}, h(x_i)\big)$;
   2.3 Update $m_t(\cdot) = m_{t-1}(\cdot) + \gamma\,h_t(\cdot)$;
   2.4 Loop ($t\leftarrow t+1$ and return to 2.1); a short R sketch of this procedure is given below.
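A minimal R sketch of boosting with the gbm package (an assumption on our side, consistent with the gbm label used in Table 3.1; Bernoulli deviance, the learning rate being the shrinkage parameter), on simulated data with hypothetical variable names:

library(gbm)
set.seed(1)
n  <- 2000
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- rbinom(n, 1, plogis(sin(2 * df$x1) - df$x2))
fit <- gbm(y ~ x1 + x2, data = df, distribution = "bernoulli",
           n.trees = 500, shrinkage = 0.1, interaction.depth = 2,
           cv.folds = 5, verbose = FALSE)
best_t <- gbm.perf(fit, method = "cv", plot.it = FALSE)   # "optimal" number of trees
predict(fit, newdata = df[1:3, ], n.trees = best_t, type = "response")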
In the context of classification, Freund and Schapire (1997) introduced the
“adaboost” algorithm (for “adaptive boosting”), based on updating weights
(see also Schapire (2013) for additional heuristics). Consider some binary clas-
sification problem (where y takes values 0 or 1) with a training dataset .D =
({yi ; x i }, i = 1, . . . , n). Following Algorithm 10.1 in Friedman et al. (2001),
1. Set weights $\omega_{i,1} = 1/n$, for $i = 1,\ldots,n$, $\gamma > 0$, $m_0(x) = 0$, and define $\omega_1$;
2. At stage $t\ge 1$, fit a weak classifier $h_t$ using the weights $\omega_t$, and compute the weighted error rate
$$\bar{\varepsilon}_t \leftarrow \frac{\displaystyle\sum_{i=1}^{n}\omega_{i,t}\cdot\mathbf{1}\big(h_t(x_i)\neq y_i\big)}{\displaystyle\sum_{i=1}^{n}\omega_{i,t}}$$
Fig. 3.20 Evolution of Bernoulli deviance of .mt with a boosting learning (adaboost) on the
toydata2 dataset, as a function of t, on the validation sample (on top) and on the training sample
(below), with two different learning rates (on the left and on the right). Vertical lines correspond to
the “optimal” number of (sequential) trees
Fig. 3.21 Evolution of .mt (x) with a boosting learning (adaboost) on the toydata2 dataset,
as a function of t (for the models .mt from Fig. 3.20), with .xB on top and .xA below, with different
learning rates
are respectively obtained with 450 and 100 trees, as those were the values that minimize the overall error on the validation samples (in Fig. 3.21).
Fig. 3.22 Roughness of predictive models, with, on the right-hand side, the interpolation $t\mapsto\widehat{m}(x_t)$, where $x_t = t\,x_B + (1-t)\,x_A$, for $t\in[0,1]$, for five models on the toydata2 dataset, corresponding to some individual in-between Andrew and Barbara. The $x_t$ correspond to fictitious individuals on the segment $[x_A,x_B]$ on the left, connecting Andrew and Barbara
Fig. 3.23 Confusion matrices with threshold .30% (on top) and .50% (below) on 300 observations
from the GermanCredit dataset, with a logistic regression on the left, and a random forest on
the right (without the sensitive attribute, gender)
In Table 3.2 different metrics are used on a plain logistic regression and a random
forest approach, with the accuracy (with a confidence interval) as well as specificity
and sensitivity, computed with thresholds 70%, 50%, and 30%. On the left, models
with all features (including sensitive ones) are considered. Then, from the left to the
right, we consider models (plain logistic and random forest) without gender, without
age, and without both.
Table 3.2 Various statistics for two classifiers, a logistic regression and a random forest on
the GermanCredit dataset (accuracy—with a confidence interval (lower and upper bounds)—
specificity and sensitivity are computed with a 70%, 50%, and 30% threshold), where all variables
are considered, on the left. Then, from the left to the right, without gender, without age, and without
both
All variables – gender – age – gender and age
GLM RF GLM RF GLM RF GLM RF
AUC 0.793 0.776 0.790 0.783 0.794 0.790 0.794 0.790
τ = 70% Accuracy 0.730 0.723 0.737 0.727 0.740 0.737 0.740 0.737
Accuracy− 0.676 0.669 0.683 0.672 0.686 0.683 0.686 0.683
Accuracy+ 0.779 0.773 0.786 0.776 0.789 0.786 0.789 0.786
Specificity 0.654 1.000 0.692 1.000 0.704 0.917 0.704 0.917
Sensitivity 0.737 0.718 0.741 0.720 0.744 0.729 0.744 0.729
τ = 50% Accuracy 0.757 0.773 0.760 0.767 0.753 0.787 0.753 0.787
Accuracy− 0.704 0.722 0.708 0.715 0.701 0.736 0.701 0.736
Accuracy+ 0.804 0.819 0.807 0.813 0.801 0.832 0.801 0.832
Specificity 0.618 0.756 0.623 0.721 0.609 0.778 0.609 0.778
Sensitivity 0.797 0.776 0.801 0.774 0.797 0.788 0.797 0.788
τ = 30% Accuracy 0.723 0.677 0.733 0.680 0.723 0.690 0.723 0.690
Accuracy− 0.669 0.621 0.679 0.624 0.669 0.634 0.669 0.634
Accuracy+ 0.773 0.729 0.783 0.732 0.773 0.742 0.773 0.742
Specificity 0.526 0.469 0.541 0.472 0.526 0.485 0.526 0.485
Sensitivity 0.844 0.835 0.847 0.832 0.844 0.847 0.844 0.847
Receiver operating characteristic (ROC) curves associated with the plain logistic
regression and the random forest model (on the validation dataset) can be visualized
in Fig. 3.24, with models including all features, and then when possibly sensitive
attributes are not used.
More precisely, it is possible to compare models (here the plain logistic and the random forest) not with global metrics, but by plotting $\widehat{m}^{\text{rf}}(x_i,s_i)$ against $\widehat{m}^{\text{glm}}(x_i,s_i)$, on the left-hand side of Fig. 3.25. On the right-hand side, we can compare ranks among predictions. The (outlier) point in the lower right corner corresponds to some individual i for whom $\widehat{m}^{\text{glm}}(x_i,s_i)$ lies around the 80th percentile of predictions (and would be seen as a "large risk," in the worst 20% tail, with a plain logistic model) whereas $\widehat{m}^{\text{rf}}(x_i,s_i)$ lies around the 20th percentile (and would be seen as a "small risk," in the best 20% tail, with a random forest model).
Supervised learning corresponds to the case where we want to model and predict a target variable y (as explained in the previous sections). In the case of unsupervised learning, there is no target, and we only have a collection of variables $x$. The two general problems we want to solve are dimension reduction (where we want to use a smaller number of features) and cluster construction (where we try to group individuals together to create groups).
Fig. 3.24 Receiver operating characteristic curves, for different models, plain logistic regression and random forest on GermanCredit, with all variables $(x, s)$ at the top left, and on $x$ at the top right, without gender and age. Below, models are on $(x, s)$ with a single sensitive attribute. GLM = generalized linear model
For cluster analysis, see Hartigan (1975), Jain and Dubes (1988) or Gan and
Valdez (2020) for theoretical foundations. Campbell (1986) applied cluster analysis
to identify groups of car models with similar technical attributes for the purpose of
estimating risk premium for individual car models, whereas Yao (2016) explored
territory clustering for ratemaking in motor insurance.
Reducing dimension simply means that, instead of our initial data vectors $x_i$, we want to consider lower dimensional vectors $\widetilde{x}_i$. Instead of matrix $X$, we consider $\widetilde{X}$, of lower rank. In the case of a linear transformation, corresponding to PCA (see
Fig. 3.25 Scatterplot of $(\widehat{m}^{\text{glm}}(x_i), \widehat{m}^{\text{rf}}(x_i))$ on the left, on the GermanCredit dataset, and the associated ranks on the right (with a linear transformation to have "ranks" on 0–100, corresponding to the empirical copula). GLM = generalized linear model
Fig. 3.26 Nonlinear autoencoder at the top (with transformations $\varphi$ and $\psi$), and a linear autoencoder at the bottom (with transformations $P^\top$ and $P$, corresponding to principal component analysis). $X$ is the original dataset, $Z$ the embedded version in a smaller latent space (principal factors), and $\widetilde{X}$ is the reconstruction
Jolliffe (2002) or Hastie et al. (2015)) or linear auto-encoder (see Sakurada and Yairi
(2014) or Wang et al. (2016)), the error is
$$\|X - \widetilde{X}\|_F^2 = \|X - PP^\top X\|_F^2 = \sum_{i=1}^n \big(PP^\top x_i - x_i\big)^\top\big(PP^\top x_i - x_i\big),$$
while for a nonlinear autoencoder (with encoder $\varphi$ and decoder $\psi$),
$$\|X - \widetilde{X}\|_F^2 = \|X - \psi\circ\varphi(X)\|_F^2 = \sum_{i=1}^n \big(\psi\circ\varphi(x_i) - x_i\big)^\top\big(\psi\circ\varphi(x_i) - x_i\big).$$
In the linear case, the error can be written
$$\sum_{i=1}^n \mathrm{trace}\Big[\big(PP^\top - \mathbb{I}\big)x_i x_i^\top\big(PP^\top - \mathbb{I}\big)\Big] = \mathrm{trace}\Big[\big(PP^\top - \mathbb{I}\big)\Big(\sum_{i=1}^n x_i x_i^\top\Big)\big(PP^\top - \mathbb{I}\big)\Big],$$
where $\|\cdot\|_F$ denotes the Frobenius norm, corresponding to the elementwise $\ell_2$ norm of a matrix,
$$\|M\|_F^2 = \sum_{i,j} M_{i,j}^2 = \mathrm{trace}\big(MM^\top\big), \quad \text{where } M = (M_{i,j}),$$
or
$$\max_{P\in\mathcal{P}}\big\{\mathrm{trace}(SP)\big\}, \quad \text{where } \mathcal{P} = \big\{P \text{ symmetric} : \text{eigenvalues}(P) \in \{0,1\} \text{ and } \mathrm{trace}(P) = k\big\}.$$
As explained in Samadi et al. (2018), if $\|X - \widetilde{X}\|_F^2$ can be seen as the reconstruction error of $\widetilde{X}$ with respect to $X$, the way to define a rank-k reconstruction loss is to consider
Chapter 4
Models: Interpretability, Accuracy, and Calibration

Abstract In this chapter, we present important concepts when dealing with predictive models. We start with a discussion about the interpretability and explainability of models and algorithms, presenting different tools that could help us to understand "why" the predicted outcome of the model is the one we got. Then, we discuss accuracy, which is usually the ultimate target of most machine-learning techniques. But as we shall see, the most important concept is the "good calibration" of the model, which means that we want to have, locally, a balanced portfolio, and that the probability predicted by the model is, indeed, related to the true risk.
Fig. 4.1 On the left, the ceteris paribus approach (only the direct relationship from $x_1$ to y is considered, and $x_2$ is supposed to remain unchanged) and the mutatis mutandis approach (a change in $x_1$ has a direct impact on y, and there could be an additional effect via $x_2$)
Definition 4.1 (Ceteris Paribus (Marshall 1890)) Ceteris paribus (or more pre-
cisely ceteris paribus sic stantibus) is a Latin phrase, meaning “all other things
being equal” or “other things held constant.”
The ceteris paribus approach is commonly used to consider the effects of a cause,
in isolation, by assuming that any other relevant conditions are absent. In Fig. 4.1,
the output of a model, $\widehat{y}$, can be influenced by $x_1$ and $x_2$, and in the ceteris paribus analysis of the influence of $x_1$ on $\widehat{y}$, we isolate the effect of $x_1$ on $\widehat{y}$. In the mutatis mutandis approach, if $x_1$ and $x_2$ are correlated, we add to the "direct effect" (from $x_1$ to $\widehat{y}$) a possible "indirect effect" (through $x_2$).
Definition 4.2 (Mutatis Mutandis) Mutatis mutandis is a Latin phrase meaning
“with things changed that should be changed” or “once the necessary changes have
been made.”
In order to illustrate, let $(X_1, X_2, \varepsilon)^\top$ denote some Gaussian random vector, where the first two components are correlated, and $\varepsilon$ is some unpredictable random noise, independent of the pair $(X_1, X_2)^\top$,
$$\begin{pmatrix} X_1 \\ X_2 \\ \varepsilon \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \mu_1 \\ \mu_2 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & r\sigma_1\sigma_2 & 0 \\ r\sigma_1\sigma_2 & \sigma_2^2 & 0 \\ 0 & 0 & \sigma^2 \end{pmatrix} \right),$$
and consider the linear model $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$. Therefore,
$$\mathbb{E}_{Y|X_1}[Y|x_1^*] = \beta_0 + \beta_1 x_1^* + \beta_2\Big(\mu_2 + \frac{r\sigma_2}{\sigma_1}(x_1^* - \mu_1)\Big) \ : \ \text{mutatis mutandis}.$$
On the other hand, in the ceteris paribus approach, "isolating" the effect of $x_1$ from other possible causes means that we pretend that $X_1$ and $X_2$ are now independent. Therefore, formally, instead of $(X_1, X_2)$, we consider $(X_1^\perp, X_2^\perp)$, a "copy" with independent components and the same marginal distributions,¹ so that $\mathbb{E}_{X_2^\perp|X_1^\perp}[X_2^\perp|x_1^*] = \mu_2$, and
$$\mathbb{E}_{Y|X_1^\perp}[Y|x_1^*] = \beta_0 + \beta_1 x_1^* + \beta_2\,\mu_2 \ : \ \text{ceteris paribus}.$$
Therefore, we clearly have the direct effect (ceteris paribus), and the indirect effect,
$$\underbrace{\mathbb{E}_{Y|X_1}[Y|x_1^*]}_{\text{mutatis mutandis}} = \underbrace{\mathbb{E}_{Y|X_1^\perp}[Y|x_1^*]}_{\text{ceteris paribus}} + \beta_2\,\frac{r\sigma_2}{\sigma_1}(x_1^* - \mu_1).$$
As expected, if variables $x_1$ and $x_2$ are independent, $r = 0$, and the mutatis mutandis and the ceteris paribus approaches are identical. Later on, when presenting various techniques in this chapter, we may use the notation $\mathbb{E}_{X_1}$ and $\mathbb{E}_{X_1^\perp}$, instead of $\mathbb{E}_{Y|X_1}$ and $\mathbb{E}_{Y|X_1^\perp}$ respectively, to avoid notations that are too heavy.
And more generally, from a statistical perspective, if we consider a nonlinear model $\mathbb{E}_{Y|X}[Y|x^*] = \mathbb{E}_{X}[Y|x_1^*, x_2^*] = m(x_1^*, x_2^*)$, a natural ceteris paribus estimate of the effect of $x_1$ on the prediction is
$$\mathbb{E}_{Y|X_1^\perp}\big[m(X_1^\perp, X_2^\perp)\,\big|\,x_1^*\big] \approx \frac{1}{n}\sum_{i=1}^n m(x_1^*, x_{i,2})$$
(the average on the right being the empirical counterpart of the expected value on the left), whereas to estimate mutatis mutandis, we need a local version, to take into account a possible (local) correlation between $x_1$ and $x_2$, i.e.,
$$\mathbb{E}_{Y|X_1}\big[m(X_1, X_2)\,\big|\,x_1^*\big] \approx \frac{1}{\|V_\epsilon(x_1^*)\|}\sum_{i\in V_\epsilon(x_1^*)} m(x_1^*, x_{i,2}),$$
where $V_\epsilon(x_1^*) = \{i : |x_{i,1} - x_1^*| \leq \epsilon\}$ is a neighborhood of $x_1^*$. It should be stressed that the notations "$\mathbb{E}_{Y|X_1}[m(X_1, X_2)|x_1^*]$" and "$\mathbb{E}_{Y|X_1^\perp}[m(X_1^\perp, X_2^\perp)|x_1^*]$" do not have measure-theoretic foundations, but they will be useful to highlight that in some cases, metrics and mathematical objects "pretend" that explanatory variables are independent.
¹ In the sense that $X_2^\perp \stackrel{\mathcal{L}}{=} X_2$, $X_1^\perp = X_1$ almost surely, and $X_1^\perp \perp\!\!\!\perp X_2^\perp$.
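To make these two estimates concrete, here is a minimal R sketch (not taken from the book's own code); the simulated data frame df, the fitted model fit, and the evaluation point x1_star are purely illustrative:

set.seed(1)
n  <- 1000
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n)                  # x2 is correlated with x1
y  <- rbinom(n, 1, plogis(x1 + x2))
df <- data.frame(y = y, x1 = x1, x2 = x2)
fit <- glm(y ~ x1 + x2, family = binomial, data = df)

x1_star <- 1
# ceteris paribus estimate: average m(x1*, x_{i,2}) over all observations
cp <- mean(predict(fit, newdata = data.frame(x1 = x1_star, x2 = df$x2),
                   type = "response"))
# mutatis mutandis estimate: same average, restricted to a neighborhood of x1*
eps <- 0.2
V   <- which(abs(df$x1 - x1_star) <= eps)
mm  <- mean(predict(fit, newdata = data.frame(x1 = x1_star, x2 = df$x2[V]),
                    type = "response"))
c(ceteris_paribus = cp, mutatis_mutandis = mm)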
Fig. 4.2 Variable importance for different models trained on toydata2, without the sensitive attribute s, with four variables, $x_1$, $x_2$, $x_3$, and s. CART = classification and regression trees, GAM = generalized additive model, GLM = generalized linear model, RF = random forest
Fig. 4.3 Variable importance for different models trained on toydata2, with the sensitive attribute s, with four variables, $x_1$, $x_2$, $x_3$, and s. CART = classification and regression trees, GAM = generalized additive model, GLM = generalized linear model, RF = random forest
Instead of a global measure, some local metrics can be considered. Goldstein et al.
(2015) defined the “individual conditional expectation” directly derived from ceteris
paribus functions, coined “ceteris paribus profile” in Biecek and Burzykowski
(2021).
Definition 4.4 (Ceteris Paribus Profile $z \mapsto m_{x^*,j}(z)$ (Goldstein et al. 2015)) Given $x^* \in \mathcal{X}$, define on $\mathcal{X}_j$
$$m_{x^*,j}(z) = m(x^*_{-j}, z).$$
Here, it is a ceteris paribus profile in the sense that $x_j^*$ changes (and takes variable value z) whereas all other components remain unchanged. Define then the difference when component j takes generic value z instead of $x_j^*$,
$$\delta m_{x^*,j}(z) = m_{x^*,j}(z) - m_{x^*,j}(x_j^*) = m(x^*_{-j}, z) - m(x^*_{-j}, x_j^*).$$
Definition 4.5 ($dm^{\mathrm{cp}}_j(x^*)$) The mean absolute deviation associated with the j-th variable, at $x^*$, is
$$dm^{\mathrm{cp}}_j(x^*) = \mathbb{E}\big[|\delta m_{x^*,j}(X_j)|\big] = \mathbb{E}\big[|m(x^*_{-j}, X_j) - m(x^*_{-j}, x_j^*)|\big].$$
Definition 4.6 ($\widehat{dm}^{\mathrm{cp}}_j(x^*)$) The empirical mean absolute deviation associated with the j-th variable, at $x^*$, is
$$\widehat{dm}^{\mathrm{cp}}_j(x^*) = \frac{1}{n}\sum_{i=1}^n \big|m(x^*_{-j}, x_{i,j}) - m(x^*_{-j}, x_j^*)\big|.$$
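As an illustration, a minimal R sketch of such a profile and of the empirical mean absolute deviation of Definition 4.6 is given below, reusing the illustrative objects fit and df from the previous sketch (the reference point x_star is also illustrative):

x_star <- data.frame(x1 = -1, x2 = 2)

cp_profile <- function(model, x_star, j, z_grid) {
  sapply(z_grid, function(z) {
    newx      <- x_star
    newx[[j]] <- z                     # only the j-th coordinate changes
    predict(model, newdata = newx, type = "response")
  })
}
z_grid  <- seq(-3, 3, by = 0.1)
profile <- cp_profile(fit, x_star, "x1", z_grid)

# empirical mean absolute deviation: replace x*_j by the observed x_{i,j}'s
dm_cp <- mean(abs(cp_profile(fit, x_star, "x1", df$x1) -
                  as.numeric(predict(fit, newdata = x_star, type = "response"))))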
In Figs. 4.4 and 4.5, we can visualize "ceteris paribus profiles" for our four models, on toydata2, with $j = 1$ (variable $x_1$), that is $z \mapsto m_{x^*,1}(z)$, for the plain logistic regression, the GAM, the classification tree, and the random forest. In Fig. 4.4, it is the profile associated with Andrew (when $(x^\star, s^\star) = (-1, 8, -2, \text{A})$), and in Fig. 4.5, the one associated with Barbara (when $(x^\star, s^\star) = (1, 4, 2, \text{B})$). Bullet points indicate the values $m_{x^*,1}(x_1^*)$ for Andrew in Fig. 4.4, and Barbara in Fig. 4.5. On the top left, the function is monotonic, with a "logistic" shape. On the right, we see that a GLM will probably miss a nonlinear effect, with a (capped) J shape.
Fig. 4.4 "Ceteris paribus profiles" for Andrew for different models trained on toydata2 (see Table 3.1 for numerical values, for variable $x_1$; here $z^\star = (x^\star, s^\star) = (-1, 8, -2, \text{A})$)
Fig. 4.5 "Ceteris paribus profiles" for Barbara for different models trained on toydata2 (see Table 3.1 for numerical values, for variable $x_1$; here $z^\star = (x^\star, s^\star) = (1, 4, 2, \text{B})$)
4.1.3 Breakdowns
For a linear model, the prediction can be decomposed as
$$\widehat{m}(x^*) = \widehat{\beta}_0 + \widehat{\beta}^\top x^* = \widehat{\beta}_0 + \sum_{j=1}^k \widehat{\beta}_j x_j^* = \overline{y} + \sum_{j=1}^k \underbrace{\widehat{\beta}_j\big(x_j^* - \overline{x}_j\big)}_{=v_j(x^*)},$$
where $v_j(x^*)$ is interpreted as the contribution of the j-th variable to the prediction for an individual with characteristics $x^*$. More generally, Robnik-Šikonja and Kononenko (1997, 2003, 2008) defined the (additive) contribution of the j-th variable to the prediction for an individual with characteristics $x^*$ so that
$$m(x^*) = \mathbb{E}\big[m(X)\big] + \sum_{j=1}^k v_j(x^*),$$
and for the linear model, $v_j(x^*) = \beta_j\big(x_j^* - \mathbb{E}_{X_j^\perp|X_{-j}}[X_j^\perp|X_{-j} = x^*_{-j}]\big)$, and $\widehat{v}_j(x^*) = \widehat{\beta}_j\big(x_j^* - \overline{x}_j\big)$. More generally, $v_j(x^*) = m(x^*) - \mathbb{E}_{X_j^\perp|X_{-j}}\big[m(x^*_{-j}, X_j)\big]$, where we can write $m(x^*)$ as $\mathbb{E}\big[m(X)\,\big|\,x^*\big]$, i.e.,
$$v_j(x^*) = \begin{cases} \mathbb{E}\big[m(X)\,\big|\,x_1^*,\ldots,x_k^*\big] - \mathbb{E}_{X_j^\perp|X_{-j}}\big[m(X)\,\big|\,x_1^*,\ldots,x_{j-1}^*,x_{j+1}^*,\ldots,x_k^*\big] \\ \mathbb{E}\big[m(X)\,\big|\,x^*\big] - \mathbb{E}_{X_j^\perp|X_{-j}}\big[m(X)\,\big|\,x^*_{-j}\big]. \end{cases}$$
Definition 4.7 ($\gamma_j^{\mathrm{bd}}(x^*)$ (Biecek and Burzykowski 2021)) The breakdown contribution of the j-th variable, at $x^*$, is
$$\gamma_j^{\mathrm{bd}}(x^*) = v_j(x^*) = \mathbb{E}\big[m(X)\,\big|\,x^*\big] - \mathbb{E}_{X_j^\perp|X_{-j}}\big[m(X)\,\big|\,x^*_{-j}\big].$$
“In other words, the contribution of the j -th variable is the difference between
the expected value of the model’s prediction conditional on setting the values of the
first j variables equal to their values in .x ∗ and the expected value conditional on
setting the values of the first .j − 1 variables equal to their values in .x ∗ ,” as Biecek
and Burzykowski (2021) said.
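In practice, these conditional expectations are approximated by averaging predictions over the sample. A minimal R sketch (not the DALEX implementation) is given below, reusing the illustrative fit, df and x_star objects from the earlier sketches; the variable ordering is the one passed in vars:

breakdown <- function(model, df, x_star, vars) {
  e_prev  <- mean(predict(model, newdata = df, type = "response"))  # E[m(X)]
  contrib <- numeric(length(vars)); names(contrib) <- vars
  newdf <- df
  for (j in seq_along(vars)) {
    newdf[[vars[j]]] <- x_star[[vars[j]]]      # fix the first j variables at x*
    e_curr     <- mean(predict(model, newdata = newdf, type = "response"))
    contrib[j] <- e_curr - e_prev
    e_prev     <- e_curr
  }
  contrib
}
breakdown(fit, df, x_star, c("x1", "x2"))
# the contributions sum to m(x*) - E[m(X)], as in the decomposition above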
properties. The Shapley value describes the contribution to the payout, weighted and summed over all possible feature value combinations, as follows,
$$\phi_j(\mathcal{V}) = \sum_{S \subseteq \{1,\ldots,k\}\setminus\{j\}} \frac{|S|!\,(k-|S|-1)!}{k!}\,\big(\mathcal{V}(S\cup\{j\}) - \mathcal{V}(S)\big).$$
As explained in Ichiishi (2014), if we suppose that coalitions are being formed one player at a time, at step j, it should be fair for player j to be given $\mathcal{V}(S\cup\{j\}) - \mathcal{V}(S)$ as a fair compensation for joining the coalition. Then, for each player, we take the average of this contribution over all the possible different permutations in which the coalition can be formed. This is exactly the expression above, which we can rewrite
$$\phi_j(\mathcal{V}) = \frac{1}{k}\sum_{S \ni j} \binom{k-1}{|S|-1}^{-1}\big(\mathcal{V}(S) - \mathcal{V}(S\setminus\{j\})\big),$$
where the sum is over all coalitions S including player j, and k is the number of players.
The goal, in Shapley (1953), was to find contributions $\phi_j(\mathcal{V})$, for some value function $\mathcal{V}$, that satisfy a series of desirable properties, namely
• "Efficiency": $\sum_{j=1}^k \phi_j(\mathcal{V}) = \mathcal{V}(\{1,\ldots,k\})$,
• "Symmetry": if $\mathcal{V}(S\cup\{j\}) = \mathcal{V}(S\cup\{j'\})$ for all S, then $\phi_j = \phi_{j'}$,
• "Dummy" (or "null player"): if $\mathcal{V}(S\cup\{j\}) = \mathcal{V}(S)$ for all S, then $\phi_j = 0$,
• "Additivity" (or "linearity"): if $\mathcal{V}_{(1)}$ and $\mathcal{V}_{(2)}$ have decompositions $\phi(\mathcal{V}_{(1)})$ and $\phi(\mathcal{V}_{(2)})$, then $\mathcal{V}_{(1)}+\mathcal{V}_{(2)}$ has decomposition $\phi(\mathcal{V}_{(1)}+\mathcal{V}_{(2)}) = \phi(\mathcal{V}_{(1)})+\phi(\mathcal{V}_{(2)})$.
Interestingly, for a linear regression with k uncorrelated features, and mean-centered variables, the Shapley contribution coincides with the breakdown contribution, $\gamma_j^{\mathrm{shap}}(x^*) = \widehat{\beta}_j\big(x_j^* - \overline{x}_j\big)$. Observe that $\gamma_j^{\mathrm{shap}}(x^*) \approx m(x_i^{*+}) - m(x_i^{*-})$, and therefore
$$\widehat{\gamma}_j^{\mathrm{shap}}(x^*) = \frac{1}{s}\sum_{i\in\{1,\ldots,n\}} \big(m(x_i^{*+}) - m(x_i^{*-})\big),$$
where $x_i^{*+}$ and $x_i^{*-}$ are hybrid observations, built from $x^*$ and observation $x_i$, that differ only through their j-th component.
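A minimal R sketch of this sampling approximation (in the spirit of the permutation-based estimator, not the book's own code) is given below, reusing the illustrative fit, df and x_star objects defined earlier:

shapley_mc <- function(model, df, x_star, j, vars, B = 500) {
  contrib <- numeric(B)
  for (b in 1:B) {
    perm <- sample(vars)                        # random ordering of the features
    z    <- df[sample(nrow(df), 1), vars]       # random observation (one-row data frame)
    pos  <- which(perm == j)
    before_j <- if (pos > 1) perm[1:(pos - 1)] else character(0)
    x_plus  <- z                                # features before j, and j, taken from x*
    x_plus[, c(before_j, j)] <- x_star[, c(before_j, j)]
    x_minus <- z                                # features before j from x*, j kept from z
    if (length(before_j) > 0) x_minus[, before_j] <- x_star[, before_j]
    contrib[b] <- predict(model, newdata = x_plus,  type = "response") -
                  predict(model, newdata = x_minus, type = "response")
  }
  mean(contrib)
}
shapley_mc(fit, df, x_star, "x1", c("x1", "x2"))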
Fig. 4.8 Shapley contributions $\widehat{\gamma}_j^{\mathrm{shap}}(z^\star_A)$ for Andrew for different models trained on toydata2 (see Table 3.1 for numerical values, here $z^\star = (x^\star, s^\star) = (-1, 8, -2, \text{A})$)
Fig. 4.9 Shapley contributions $\widehat{\gamma}_j^{\mathrm{shap}}(z^\star_B)$ for Barbara for different models trained on toydata2 (see Table 3.1 for numerical values; here, $z^\star = (x^\star, s^\star) = (1, 4, 2, \text{B})$)
Definition 4.10 (Shapley Contribution $\overline{\gamma}_j^{\mathrm{shap}}$) The contribution of the j-th variable is
$$\overline{\gamma}_j^{\mathrm{shap}} = \frac{1}{n}\sum_{i=1}^n \gamma_j^{\mathrm{shap}}(x_i).$$
One interesting feature about the Shapley value is that the contribution can be
extended, from a single player j to any coalition, for example, two players .{i, j }.
This yields the concept of “Shapley interaction.”
Definition 4.11 (Shapley Interaction $\gamma_{i,j}^{\mathrm{shap}}(x^*)$) The interaction contribution between the i-th and the j-th variable, at $x^*$, is obtained by averaging, with Shapley weights over coalitions $S \subseteq \{1,\ldots,k\}\setminus\{i,j\}$, the quantities $\Delta_{i,j|S}(x^*)$, where
$$\Delta_{i,j|S}(x^*) = \mathbb{E}_{X^\perp_{S\cup\{i,j\}}}\big[m(X)\,\big|\,x^*_{S\cup\{i,j\}}\big] - \mathbb{E}_{X^\perp_{S\cup\{j\}}}\big[m(X)\,\big|\,x^*_{S\cup\{j\}}\big] - \mathbb{E}_{X^\perp_{S\cup\{i\}}}\big[m(X)\,\big|\,x^*_{S\cup\{i\}}\big] + \mathbb{E}_{X^\perp_{S}}\big[m(X)\,\big|\,x^*_{S}\big].$$
The “partial dependence plot,” formally defined and coined in Friedman (2001), is
simply the average of “ceteris paribus profiles.”
Definition 4.12 (PDP $p_j(x_j^*)$ and $\widehat{p}_j(x_j^*)$) The partial dependence plot associated with the j-th variable is the function $\mathcal{X}_j \to \mathbb{R}$ defined as
$$p_j(x_j^*) = \mathbb{E}_{X_j^\perp}\big[m(X)\,\big|\,x_j^*\big], \quad \text{with empirical version} \quad \widehat{p}_j(x_j^*) = \frac{1}{n}\sum_{i=1}^n m(x_j^*, x_{i,-j}).$$
See Greenwell (2017) for the implementation in R, with the pdp package.
One can also use type = "partial" in the predict_parts function of
the DALEX package, as in Biecek and Burzykowski (2021). In Fig. 4.10 we can
visualize $\widehat{p}_1$ (associated with variable $x_1$) in dataset toydata2, the average of the $m(x_j^*, x_{i,-j})$ when $i = 1, \ldots, n$, including all the $m(x_j^*, x_{i,-j})$s in Fig. 4.11.
Interestingly, instead of the sum over the n predictions, subsums can be consid-
ered, with respect to some criteria. For example, in Figs. 4.12, 4.13 and 4.14, sums
over .si = A or .si = B are considered,
$$\widehat{p}_j^{A}(x_j^*) = \frac{1}{n_A}\sum_{i: s_i = \text{A}} m(x_j^*, x_{i,-j}) \quad\text{and}\quad \widehat{p}_j^{B}(x_j^*) = \frac{1}{n_B}\sum_{i: s_i = \text{B}} m(x_j^*, x_{i,-j}).$$
On the toydata2 data, the three variables (namely $x_1$, $x_2$ and $x_3$) are used, in Figs. 4.12, 4.13 and 4.14 respectively. Even though $x_3$ has a very flat impact (in Fig. 4.14), with little influence on the outcome, one can observe that $\widehat{p}_3^{A}(x_3^*)$ and $\widehat{p}_3^{B}(x_3^*)$ are significantly different.
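A minimal R sketch of these (group-specific) partial dependence computations is given below; it reuses the illustrative fit and df objects from the earlier sketches, with a synthetic binary attribute s added for the sake of the example:

df$s <- ifelse(df$x2 > 0, "A", "B")        # illustrative binary attribute
pdp <- function(model, df, j, z_grid, idx = seq_len(nrow(df))) {
  sapply(z_grid, function(z) {
    newdf      <- df[idx, ]
    newdf[[j]] <- z                        # set the j-th variable to z everywhere
    mean(predict(model, newdata = newdf, type = "response"))
  })
}
z_grid  <- seq(-3, 3, by = 0.25)
p_hat   <- pdp(fit, df, "x1", z_grid)                       # overall PDP
p_hat_A <- pdp(fit, df, "x1", z_grid, which(df$s == "A"))   # group A
p_hat_B <- pdp(fit, df, "x1", z_grid, which(df$s == "B"))   # group B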
But instead of those ceteris paribus dependence plots, it could be interesting to
consider some local versions, or mutatis mutandis dependence plots. Apley and Zhu
(2020) introduced the “local dependence plot” and the “accumulated local plot,”
defined as follows,
Definition 4.13 (Local Dependence Plot $\ell_j(x_j^*)$ and $\widehat{\ell}_j(x_j^*)$) The local dependence plot is defined as
$$\ell_j(x_j^*) = \mathbb{E}_{X_j}\big[m(X)\,\big|\,x_j^*\big],$$
with empirical version
$$\widehat{\ell}_j(x_j^*) = \frac{1}{\mathrm{card}(V(x_j^*))}\sum_{i\in V(x_j^*)} m(x_j^*, x_{i,-j}), \quad \text{where } V(x_j^*) = \big\{i : d(x_{i,j}, x_j^*) \leq \epsilon\big\},$$
or
$$\widehat{\ell}_j(x_j^*) = \frac{1}{\sum_i \omega_i(x_j^*)}\sum_{i=1}^n \omega_i(x_j^*)\, m(x_j^*, x_{i,-j}), \quad \text{where } \omega_i(x_j^*) = K_h(x_j^* - x_{i,j}),$$
and the accumulated local effect of the j-th variable is
$$a_j(x_j^*) = \int_{-\infty}^{x_j^*} \mathbb{E}_{X_j}\Big[\frac{\partial m(x_j, X_{-j})}{\partial x_j}\,\Big|\,x_j\Big]\, dx_j.$$
Empirically, using a partition $a_0 < a_1 < \cdots$ of the range of $x_j$, with $n_u$ observations in bin $(a_{u-1}, a_u]$ and $k_j^*$ the index of the bin containing $x_j^*$,
$$\widehat{a}_j(x_j^*) = \alpha + \sum_{u=1}^{k_j^*} \frac{1}{n_u}\sum_{i: x_{i,j}\in(a_{u-1},a_u]} \big[m(a_u, x_{i,-j}) - m(a_{u-1}, x_{i,-j})\big],$$
for some normalizing constant $\alpha$.
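The following minimal R sketch implements this binned estimator (with the accumulated values simply recentered by their mean, one possible choice of the constant); as before, fit and df are the illustrative objects from the earlier sketches:

ale <- function(model, df, j, K = 10) {
  breaks <- unique(quantile(df[[j]], probs = seq(0, 1, length.out = K + 1)))
  bin    <- cut(df[[j]], breaks = breaks, include.lowest = TRUE, labels = FALSE)
  local_effect <- numeric(length(breaks) - 1)
  for (u in seq_along(local_effect)) {
    idx <- which(bin == u)
    if (length(idx) == 0) next
    lower <- df[idx, ]; lower[[j]] <- breaks[u]       # bin lower bound
    upper <- df[idx, ]; upper[[j]] <- breaks[u + 1]   # bin upper bound
    local_effect[u] <- mean(predict(model, newdata = upper, type = "response") -
                            predict(model, newdata = lower, type = "response"))
  }
  accumulated <- cumsum(local_effect)
  data.frame(x = breaks[-1], ale = accumulated - mean(accumulated))
}
head(ale(fit, df, "x1"))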
          s       Age  Housing   Property     Credit amount  Account status  Duration  Employment since  Job               Purpose  y
Barbara   Female  63   For free  No property  13,756         None            60        ≥ 7 years         Highly qualified  New car  Good (0)
Andrew    Male    28   Rent      Car          4,113          [0, 200]        24        < 1 year          Skilled employee  Old car  Bad (1)
Fig. 4.16 Breakdown decomposition $\widehat{\gamma}_j^{\mathrm{bd}}(z^*_A)$ for Andrew, GermanCredit, on four models. CART = classification and regression trees, GBM = gradient boosting model, GLM = generalized linear model, RF = random forest
Fig. 4.17 Breakdown decomposition $\widehat{\gamma}_j^{\mathrm{bd}}(z^*_B)$ for Barbara, GermanCredit, on four models. CART = classification and regression trees, GBM = gradient boosting model, GLM = generalized linear model, RF = random forest
Fig. 4.18 Shapley contributions for Andrew, on the GermanCredit dataset. CART = classification and regression trees, GBM = gradient boosting model, GLM = generalized linear model, RF = random forest
Fig. 4.19 Shapley contributions for Barbara, on the GermanCredit dataset. CART = classification and regression trees, GBM = gradient boosting model, GLM = generalized linear model, RF = random forest
In Figs. 4.21 and 4.22, we can visualize breakdown decompositions $\widehat{\gamma}_j^{\mathrm{bd}}(z^*_A)$ for Andrew and $\widehat{\gamma}_j^{\mathrm{bd}}(z^*_B)$ for Barbara respectively, on four models trained on the FrenchMotor dataset.
In Figs. 4.23 and 4.24, we can visualize Shapley contributions (including confidence intervals) $\widehat{\gamma}_j^{\mathrm{shap}}(z^*_A)$ for Andrew and $\widehat{\gamma}_j^{\mathrm{shap}}(z^*_B)$ for Barbara respectively, for models estimated using the FrenchMotor dataset. One could also use Shapley in the iml R package (see Molnar et al. 2018).
Figure 4.25 is the partial dependence plot for a specific (continuous) variable, the
license age (license_age), with the average of ceteris paribus profiles, for males
and females respectively. Observe that, even if s is not included in the models, partial
dependence plots are different in the two groups: for GLM and GAM, predicted
probabilities for males are higher than for females, whereas for the random forest,
predicted probabilities for females are higher than for males.
Note that there are more connections between interpretation and causal models,
as discussed in Feder et al. (2021); Geiger et al. (2022) or Wu et al. (2022). We return
to those approaches in Chaps. 7 (on experimental and observational data) and 9 (on
individual fairness, and counterfactual).
To conclude this section, let us briefly mention here a concept that will be
discussed further in the context of fairness and discrimination, related to the idea
of “counterfactuals” (as named in Lewis 1973). The word “counterfactual” can be
either an adjective describing something “thinking about what did not happen but
could have happened, or relating to this kind of thinking,” or a noun defined as
“something such as a piece of writing or an argument that considers what would
have been the result if events had happened in a different way to how they actually
Table 4.2 Information about Barbara and Andrew, two policyholders (names are fictional) from the FrenchMotor dataset. CSP1 corresponds to
‘farmers and breeders’ (employees of their farm) whereas CSP5 corresponds to ‘employees.’ The risk variable (RiskVar) is an internal score going from 1
(low) to 20 (high). A bonus of 50 is the best (lowest) level, and 100 is usually the entry level for new drivers. Neither has a garage. For the predictions, a ‘rank’
of 11% means that the policyholder is perceived as almost in the top 10% of all drivers in the validation database
          s (Gender)  Age (years)  Marital status  Social category  License (months)  Car age  Car use           Car class  Car gas  Bonus malus  Risk score  y (Claim)
Barbara   Female      26           Alone           CSP5             67                10+      Private + office  M1         Regular  76           7           0
Andrew    Male        36           Couple          CSP1             206               0        Professional      M2         Diesel   78           19          0
Fig. 4.21 Breakdown decomposition .γ jbd (z∗A ) for Andrew for different models trained on
FrenchMotor. CART = classification and regression trees, GAM = generalized additive model,
GLM = generalized linear model, RF = random forest
Fig. 4.22 Breakdown decomposition .γjbd (z∗B ) for Barbara for different models trained on
FrenchMotor. CART = classification and regression trees, GAM = generalized additive model,
GLM = generalized linear model, RF = random forest
Fig. 4.23 Shapley contributions $\widehat{\gamma}_j^{\mathrm{shap}}(z^\star_A)$ for Andrew, on the FrenchMotor dataset. CART = classification and regression trees, GAM = generalized additive model, GLM = generalized linear model, RF = random forest
Fig. 4.24 Shapley contributions $\widehat{\gamma}_j^{\mathrm{shap}}(z^\star_B)$ for Barbara, on the FrenchMotor dataset. CART = classification and regression trees, GAM = generalized additive model, GLM = generalized linear model, RF = random forest
For classification problems, calibration measures how well your model’s scores
can be interpreted as probabilities, according to Cameron (2004) or Gross (2017).
Accuracy measures how often a model produces correct answers, as defined in
Schilling (2006).
Accurate, from the Latin accuratus (past participle of accurare), means "done with care." From a statistical perspective, accuracy is how far off a prediction ($\widehat{y}$) is from its true value (y). Thus, a model is accurate if the errors ($\widehat{\varepsilon} = y - \widehat{y}$) are small. In the least-squares approach, accuracy can be measured simply by looking at the loss, or mean squared error, that is, the empirical risk
$$\widehat{R}_n = \frac{1}{n}\sum_{i=1}^n \ell_2(y_i, \widehat{y}_i) = \frac{1}{n}\sum_{i=1}^n \widehat{\varepsilon}_i^{\,2}.$$
But those metrics are based on a loss function defined on .Y × Y, that measures
a distance between the observation y and the prediction . y , seen as a pointwise
prediction. But in many applications, the prediction can be a distribution. So instead
of a loss defined on .Y×Y, one could consider a scoring function defined on .Py ×Y,
where .Py denotes a set of distributions on .Y, as discussed in Sect. 3.3.1. Following
Gneiting and Raftery (2007), define
Definition 4.16 (Scoring Rules (Gneiting and Raftery 2007)) A scoring rule is a
function .s : Y×Py → R that quantifies the error of reporting .Py when the outcome
is y. The expected score when belief is .Q is .S(Py , Q) = EQ s(Py , Y ) .
Definition 4.17 (Proper Scoring Rules (Gneiting and Raftery 2007)) A scoring
rule is proper if .S(Py , Q) ≥ S(Q, Q) for all .Py , Q ∈ Py , and strictly proper if
equality holds only when .Py = Q.
Let us start with the binary case, when $y \in \mathcal{Y} = \{0, 1\}$ and $\mathcal{P}_y$ is the set of Bernoulli distributions, $\mathcal{B}(p)$, with $p \in [0, 1]$. The scoring rule can be denoted $s(p, y) \in \mathbb{R}$, and the expected score is $S(p, q) = \mathbb{E}(s(p, Y))$ with $Y \sim \mathcal{B}(q)$, for some $q \in [0, 1]$. For example, the Brier scoring rule is defined as follows: let $f_q(y) = q^y(1-q)^{1-y}$, and define
$$s(f_q, y) = -2 f_q(y) + \sum_{y'=0}^{1} f_q(y')^2,$$
scoring rule if and only if there exists a concave function G such that
$$d(p, q) = q\log\frac{q}{p} + (1 - q)\log\frac{1-q}{1-p},$$
$$\mathrm{accuracy}(m_t) = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}},$$
where $\widehat{y} = m_t(x) = \mathbf{1}_{m(x) > t}$ for some threshold t.
The receiver operating characteristic (ROC) curve is the curve obtained by repre-
senting the true-positive rates according to the false-positive rates, by changing the
threshold. This can be related to the “discriminant curve” in the context of credit
scores, in Gourieroux and Jasiak (2007).
Definition 4.19 (ROC Curve) The ROC curve is the parametric curve
$$\big\{\big(\mathbb{P}[m(X) > t \mid Y = 0],\ \mathbb{P}[m(X) > t \mid Y = 1]\big),\ t \in [0, 1]\big\},$$
when the score $m(X)$ and Y evolve in the same direction (a high score indicates a high risk),
where
$$\mathrm{FPR}(t) = \mathbb{P}[m(X) > t \mid Y = 0] = \mathbb{P}[m_0(X) > t] \quad\text{and}\quad \mathrm{TPR}(t) = \mathbb{P}[m(X) > t \mid Y = 1] = \mathbb{P}[m_1(X) > t].$$
In other words, the ROC curve is obtained from the two survival functions of $m(X)$, FPR and TPR (respectively conditional on $Y = 0$ and $Y = 1$). The AUC is the area under this curve.
In Fig. 4.26, we can visualize on the left-hand side a classification tree, when
we try to predict the gender of a driver using telematic information (from the
telematic dataset), and on the right-hand side, ROC curves associated with three
models, a smooth logistic regression (GAM), adaboost (boosting, GBM) and a
random forest, trained on 824 observations, and ROC curves are based on the 353
observations of the validation dataset.
The AUC of a classifier is equal to the probability that the classifier will rank a randomly chosen positive example higher than a randomly chosen negative example. Indeed, assume for simplicity that the score (actually $m_0$ and $m_1$) has a derivative, so that the true-positive rate and the false-positive rate are given by
$$\mathrm{TPR}(t) = \int_t^1 m_1'(x)\,dx \quad\text{and}\quad \mathrm{FPR}(t) = \int_t^1 m_0'(x)\,dx,$$
Fig. 4.26 On the right, ROC curves $\widehat{C}_n(t)$, for various models on the telematic (validation) dataset, where we try to predict the gender of the driver using telematic data (and the age). The area under the curve (AUC) using a generalized additive model (GAM) is close to 69%. A classification tree is also plotted on the left. GBM = gradient boosting model, RF = random forest
then
$$\mathrm{AUC} = \int_0^1 \mathrm{TPR}\big(\mathrm{FPR}^{-1}(t)\big)\,dt = \int_{\infty}^{-\infty} \mathrm{TPR}(u)\,\mathrm{FPR}'(u)\,du = \mathbb{P}[M_1 > M_0],$$
where $M_1$ is the score for a positive instance and $M_0$ is the score for a negative instance. Therefore, as discussed in Calders and Jaroszewicz (2007), the AUC is very close to the Mann–Whitney U test, used to test the null hypothesis that, for randomly selected values $Z_0$ and $Z_1$ from two populations, the probability of $Z_0$ being greater than $Z_1$ is equal to the probability of $Z_1$ being greater than $Z_0$, which is written, empirically,
$$\frac{1}{n_0\,n_1}\sum_{i: y_i = 0}\ \sum_{j: y_j = 1} \mathbf{1}(z_j > z_i),$$
where $z_i$ denotes the score of observation i.
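A minimal R sketch of this rank-based computation is given below (ignoring ties), reusing the illustrative fit and df objects from the earlier sketches:

score <- predict(fit, type = "response")
y_obs <- df$y
# proportion of concordant (positive, negative) pairs
auc_pairs <- mean(outer(score[y_obs == 1], score[y_obs == 0], ">"))
# equivalent rank-based (Mann-Whitney) formula, without the pairwise loop
n1 <- sum(y_obs == 1); n0 <- sum(y_obs == 0)
auc_rank <- (sum(rank(score)[y_obs == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
c(auc_pairs, auc_rank)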
Therefore, the ROC curve and the AUC have more to do with the rankings of individuals than with the values of predictions: if h is some strictly increasing function, m and $h \circ m$ have the same ROC curves. For example, if $\widehat{m}$ is a trained model, using any technique discussed previously, then both $\widehat{m}^{1/2}$ and $\widehat{m}^{2}$ are valid models, in the sense that they have the same AUC, the same ROC curve, and can be considered as scores as they both take values in $[0, 1]$. Thus, accuracy for classifiers has to do with the ordering of predictions, not their actual value (which is related to calibration, and discussed in Sect. 4.3.3). Austin and Steyerberg (2012) used the term "concordance statistic" for the AUC. Note that the AUC is also related to the Gini index (discussed in the next section) (Table 4.3).

Table 4.3 Area under the curve for various models on the toydata2
                 Training data                     Validation data
                 GLM     CART    GAM     RF        GLM     CART    GAM     RF
 m(x, s)         85.3    82.7    86.1    100.0     86.0    81.4    86.3    83.6
 m(x)            85.0    82.7    85.9    100.0     85.5    81.4    85.9    83.6
CART = classification and regression tree, GAM = generalized additive model, GLM = generalized linear model, RF = random forest

Table 4.4 Area under the curve for various models on the GermanCredit (validation subsample) dataset, predicting the default. At the top, models including the sensitive variable (the gender), and at the bottom, models without the sensitive variable, corresponding to fairness through unawareness
                 GLM        Tree       Boosting   Bagging
 m(x, s)         79.339%    72.922%    79.488%    77.914%
 m(x)            78.992%    72.922%    79.035%    78.287%
GLM = generalized linear model
On the GermanCredit dataset, the variable of interest y is the default, taking
values in .{0, 1}, and the protected attribute p is the gender (binary, with male and
female). We consider four models, either on both .x and p, or only on .x (without
the sensitive attribute, corresponding to fairness through unawareness, as defined in
Chap. 8) : (1) a logistic regression, or GLM (2) a classification tree, (3) a boosted
model, and (4) a bagging model, corresponding to a random forest. The AUC for
those models is given in Table 4.4 and ROC curves are in Fig. 4.27.
If y is a categorical variable in more than 2 classes, different scoring rules can be
used, as suggested by Kull and Flach (2015).
Fig. 4.27 Receiver operating characteristic curves .C n (t), for various models on the
GermanCredit (validation) dataset, predicting a default. Thick lines correspond to models
including the sensitive variable (gender), and thin lines, models without the sensitive variable,
corresponding to “fairness through unawareness”. GLM = generalized linear model
classifier should classify the samples such that among the samples to which it gave
a [predicted probability] value close to 0.8, approximately 80% actually belong to
the positive class.”
As defined in Kuhn et al. (2013), when seeking for good calibration, “we desire
that the estimated class probabilities are reflective of the true underlying probability
of the sample.” This was initially written in the context of posterior distribution (in
a Bayesian setting), as in Dawid (1982): “suppose that a forecaster sequentially
assigns probabilities to events. He is well calibrated if, for example, of those events
to which he assigns a probability 30 percent, the long-run proportion that actually
occurs turns out to be 30 percent.” Van Calster et al. (2019) gave a nice state of the
art.
Before properly defining “calibration,” let us mention the Lorenz curve, as well as
“concentration curves,” in the context of actuarial models. Frees et al. (2011, 2014b)
suggested using the Lorenz curve and the Gini index to provide accuracy measures
for a classifier, and a regression. In economics, the Lorenz curve is a graphical
representation of the distribution of income or of wealth, and it was popularized
to represent inequalities in wealth distribution. It simply shows the proportion of
overall income, or wealth, owned by the bottom u% of the people. The first diagonal
(obtained when $L(u) = u$) is called the "line of perfect equality," in the sense that the bottom u%
of the population always gets u% of the income. Inequalities arise when the bottom
u% of the population gets v% of the income, with .v < u (because the population
is sorted based on incomes, the Lorenz curve cannot rise above the line of perfect
equality). Formally, we have the following definition.
Definition 4.21 (Lorenz Curve (Lorenz 1905)) If Y is a positive random variable, with cumulative distribution function F,
$$L(u) = \frac{\mathbb{E}\big[Y\cdot\mathbf{1}(Y \leq F^{-1}(u))\big]}{\mathbb{E}[Y]} = \frac{\int_0^u F^{-1}(t)\,dt}{\int_0^1 F^{-1}(t)\,dt},$$
with empirical version
$$\widehat{L}_n(u) = \frac{\sum_{i=1}^{[nu]} y_{(i)}}{\sum_{i=1}^{n} y_{(i)}},$$
for a sample $\{y_1, \ldots, y_n\}$, where $y_{(i)}$ denote the order statistics, in the sense that $y_{(1)} \leq y_{(2)} \leq \cdots \leq y_{(n-1)} \leq y_{(n)}$.
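A minimal R sketch of this empirical Lorenz curve, and of the Gini index as one minus twice the area under it, is given below (the log-normal sample is purely illustrative, and the area is computed by a crude Riemann approximation):

lorenz <- function(y, u) {
  y_sorted <- sort(y)
  sapply(u, function(p) sum(y_sorted[seq_len(floor(p * length(y)))]) / sum(y))
}
u  <- seq(0, 1, by = 0.01)
Lu <- lorenz(rlnorm(1000, 0, 1), u)     # illustrative log-normal "losses"
gini <- 1 - 2 * mean(Lu)                # G = 1 - 2 x (area under the Lorenz curve)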
In Fig. 4.28, we can visualize some Lorenz curves on various models fitted on
the toydata2 (validation) dataset. Those functions can be obtained using the Lc
function in the ineq R package.
In order to measure the concentration of wealth, Lorenz (1905) suggested the
following function .[0, 1] → [0, 1], for positive variable .yi .
The expression of the Lorenz curve, $u \mapsto L(u)$, reminds us of the expected shortfall, where the denominator is the quantile, and not the expected value. For example, if Y has a log-normal distribution $\mathcal{LN}(\mu, \sigma^2)$, $L(u) = \Phi\big(\Phi^{-1}(u) - \sigma\big)$, while if Y has a Pareto distribution with tail index $\alpha$ (i.e., $\mathbb{P}[Y > y] = (y/y_0)^{-\alpha}$), the Lorenz curve is $L(u) = 1 - (1-u)^{(\alpha-1)/\alpha}$ (see Cowell 2011). It is rather common to summarize the Lorenz curve into a single parameter, the Gini index, introduced in Gini (1912), corresponding to a linear transformation of the area under the Lorenz curve. More precisely, $G = 1 - 2\,\mathrm{AUC}$, so that a small AUC corresponds to a large Gini index.
The numerator in the computation of the Gini index is the mean absolute difference, also named the "Gini mean difference" in Yitzhaki and Schechtman (2013),
$$\gamma = \mathbb{E}\big[|Y - Y'|\big], \quad \text{where } Y, Y' \sim F \text{ are independent copies.}$$
For a Bernoulli variable with mean p,
$$\gamma = 2p(1-p) = 1 - p^2 - (1-p)^2,$$
which corresponds to the Brier score as in Gneiting and Raftery (2007). Consider
now more generally some classification problem, with a training sample .(x i , yi ),
Murphy (1973), Murphy and Winkler (1987), Dawid (2004), and more recently Pohle (2020), introduced a so-called "Murphy decomposition." For a squared loss,
$$\mathbb{E}\big[(X - Y)^2\big] = \mathrm{Var}[Y] - \mathrm{Var}\big[\mathbb{E}[Y|X]\big] + \mathbb{E}\Big[\big(X - \mathbb{E}[Y|X]\big)^2\Big],$$
where the first term, UNC, is the unconditional entropy uncertainty, which rep-
resents the “uncertainty” in the variable of interest and does not depend on the
predictions (also called “pure randomness”); the second component, RES, is called
“resolution”, and corresponds to the part of the uncertainty in Y that can be
explained by the prediction, so it should reduce the expected score by that amount
(compared with the unconditional forecast); the last part, CAL corresponds to
“miscalibration,” or “reliability.”
As explained in Murphy (1996), an original decomposition was derived as a
partition of the Brier score, which can be interpreted as the MSE for probability
forecasts. For the Brier score in a binary setting, as discussed in Bröcker (2009),
with a calibration function $\pi(p) = \mathbb{P}[Y = 1|p]$, where p is the forecast probability, and y the observed value, if $\overline{\pi} = \mathbb{P}[Y = 1]$, the Brier score is decomposed as
$$\underbrace{\overline{\pi}(1 - \overline{\pi})}_{\text{UNC}} - \underbrace{\mathbb{E}\big[(\pi(p) - \overline{\pi})^2\big]}_{\text{RES}} + \underbrace{\mathbb{E}\big[(p - \pi(p))^2\big]}_{\text{CAL}}.$$
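A minimal R sketch of this decomposition, with the calibration function estimated by binning the forecast probabilities (so that the identity only holds approximately), is given below, reusing the illustrative fit and df objects:

p    <- predict(fit, type = "response")
yobs <- df$y
bin  <- cut(p, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)
pi_p   <- ave(yobs, bin, FUN = mean)    # estimated calibration function pi(p)
pi_bar <- mean(yobs)
UNC <- pi_bar * (1 - pi_bar)
RES <- mean((pi_p - pi_bar)^2)
CAL <- mean((p - pi_p)^2)
c(brier = mean((p - yobs)^2), decomposition = UNC - RES + CAL)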
As the true model m is unknown, this problem cannot be solved, and a "natural" idea is to replace $m(x_i)$ by the true observed values, that is, to solve
$$\min_{\widehat{m}}\left\{\frac{-1}{n}\sum_{i=1}^n \Big[y_i\log\widehat{m}(x_i) + (1 - y_i)\log\big(1 - \widehat{m}(x_i)\big)\Big]\right\},$$
which means that we select the model that minimizes Bernoulli deviance loss, in the
training sample.
Logistic regression satisfies the "balance property," which could be seen as some global "unbiased" estimator property, as discussed in Sect. 3.3.2,
$$\frac{1}{n}\sum_{i=1}^n \big(y_i - \widehat{m}(x_i)\big) = 0,$$
where $\widehat{m}_i = \widehat{m}(x_i)$ and $\widehat{m}_{(i)}$ are the order statistics. Observe that some mirrored Lorenz curve can also be used, with $\widehat{M} : u \mapsto 1 - \widehat{L}(1 - u)$,
$$u \mapsto \widehat{M}(u) = \frac{\sum_{i=[n(1-u)]}^{n} \widehat{m}_{(i)}}{\sum_{i=1}^{n} \widehat{m}_{(i)}} = \frac{\sum_{i=1}^{n} \widehat{m}_i\,\mathbf{1}\big(\widehat{m}_i > \widehat{m}_{(i_u^\star)}\big)}{\sum_{i=1}^{n} \widehat{m}_i},$$
with $i_u^\star = [n(1-u)]$, so that the curve should be in the upper corner (as for the ROC curve), as suggested in Tasche (2008). If $\widehat{m}_i$ is identical for all individuals, $u \mapsto \widehat{M}(u)$ is on the first diagonal. With a "perfect model" (or "saturated model"), when $\widehat{m}_i = y_i$, we have a piecewise linear function, from $(0, 0)$ to $(\overline{y}, 1)$ and from $(\overline{y}, 1)$ to $(1, 1)$. In the context of motor insurance, it does not mean that we have perfectly estimated the probability, but it means that our model was able to predict without any mistakes who would have claimed a loss.
Ling and Li (1998) introduced a "gain curve," also called "(cumulative) lift curve," defined as
$$u \mapsto \widehat{\Gamma}(u) = \frac{\sum_{i=1}^{n} y_i\,\mathbf{1}\big(\widehat{m}_i > \widehat{m}_{(i_u^\star)}\big)}{\sum_{i=1}^{n} y_i},$$
where observed outcomes $y_i$ are aggregated in the ordering of their predictions $\widehat{m}_i = \widehat{m}(x_i)$.
Definition 4.22 (Concentration Curve (Gourieroux and Jasiak 2007; Frees et al. 2011, 2014b)) If Y is a positive random variable, observed jointly with $X$, and if $m(X)$ and $\mu(X)$ denote a predictive model and the regression curve, the concentration curve of $\mu$ with respect to m is
$$\Gamma(u) = \frac{\mathbb{E}\big[\mu(X)\cdot\mathbf{1}(m(X) \leq Q_m(u))\big]}{\mathbb{E}\big[\mu(X)\big]} = \frac{\mathbb{E}\big[Y\cdot\mathbf{1}(m(X) \leq Q_m(u))\big]}{\mathbb{E}[Y]},$$
with empirical version
$$\widehat{\Gamma}_n(u) = \frac{\sum_{i=1}^{[nu]} y_{(i):m}}{\sum_{i=1}^{n} y_{(i):m}},$$
for a sample $\{y_1, \ldots, y_n\}$, where the $y_{(i):m}$ are ordered with respect to m, in the sense that $m(x_{(1):m}) \leq m(x_{(2):m}) \leq \cdots \leq m(x_{(n-1):m}) \leq m(x_{(n):m})$.
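A minimal R sketch of this empirical concentration (or cumulative lift) curve is given below: observations are ordered by their predicted value and the cumulative share of observed outcomes is plotted; fit and df are the illustrative objects from the earlier sketches:

concentration <- function(y, m_hat, u) {
  y_ordered <- y[order(m_hat)]          # y_{(i):m}, ordered by the model
  sapply(u, function(p) sum(y_ordered[seq_len(floor(p * length(y)))]) / sum(y))
}
u <- seq(0, 1, by = 0.01)
gamma_hat <- concentration(df$y, predict(fit, type = "response"), u)
plot(u, gamma_hat, type = "l", xlab = "u", ylab = "Concentration curve")
abline(0, 1, lty = 2)                   # curve obtained with a non-informative model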
This function could be seen as the extension of the Lorenz curve, in the sense
that if .L(u) was the proportion of wealth owned by the lower u% of the population,
.𝚪(u) would represent the proportion of the total true expected loss, corresponding to
According to Kuhn et al. (2013), Section 11.1, “we desire that the estimated class
probabilities are reflective of the true underlying probability of the sample. That
is, the predicted class probability (or probability-like value) needs to be well-
calibrated. To be well-calibrated, the probabilities must effectively reflect the true
likelihood of the event of interest." That is the definition we consider here.
Definition 4.23 (Well-Calibrated (1) (Van Calster et al. 2019; Krüger and Ziegel 2021)) The forecast X of Y is a well-calibrated forecast of Y if $\mathbb{E}(Y|X) = X$ almost surely, or $\mathbb{E}[Y|X = x] = x$, for all x.
Definition 4.24 (Well-Calibrated (2) (Zadrozny and Elkan 2002; Cohen and Goldszmidt 2004)) The prediction $m(X)$ of Y is a well-calibrated prediction if $\mathbb{E}[Y|m(X) = \widehat{y}] = \widehat{y}$, for all $\widehat{y}$.
Definition 4.25 (Calibration Plot) The calibration plot associated with model m is the function $\widehat{y} \mapsto \mathbb{E}(Y|m(X) = \widehat{y})$. The empirical version is some local regression on $\{(y_i, \widehat{m}(x_i))\}$.
Empirically, a well-calibrated model should therefore satisfy, on each bin $I_k$ of predicted values,
$$\frac{\sum_{i=1}^n \widehat{m}(x_i)\,\mathbf{1}\big(\widehat{m}(x_i) \in I_k\big)}{\sum_{i=1}^n \mathbf{1}\big(\widehat{m}(x_i) \in I_k\big)} \approx \frac{\sum_{i=1}^n y_i\,\mathbf{1}\big(\widehat{m}(x_i) \in I_k\big)}{\sum_{i=1}^n \mathbf{1}\big(\widehat{m}(x_i) \in I_k\big)}.$$
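A minimal R sketch of this binned comparison (a crude calibration plot, with illustrative fit and df objects) is given below; a well-calibrated model has points close to the diagonal:

m_hat <- predict(fit, type = "response")
bins  <- cut(m_hat, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)
avg_pred <- tapply(m_hat, bins, mean)   # left-hand side, per bin
avg_obs  <- tapply(df$y, bins, mean)    # right-hand side, per bin
plot(avg_pred, avg_obs, xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Average prediction", ylab = "Observed frequency")
abline(0, 1, lty = 2)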
One can also consider the k-nearest neighbors approach, or a local regression (using the locfit R package), as in Denuit et al. (2021). In Figs. 4.29, 4.30 and 4.31, we can visualize some empirical calibration plots on the toydata2 dataset (Fig. 4.29), for the GLM and the random forest, and on the four models on the GermanCredit dataset (Figs. 4.30 and 4.31). The calibPlot function in package predtools can also be used.
Fig. 4.30 The blue line is the empirical calibration plot $\widehat{y} \mapsto \widehat{\mathbb{E}}[Y|\widehat{Y} = \widehat{y}]$, on the GermanCredit dataset (based on k-nearest neighbors), with a GLM and a classification tree (on top), and a boosting and a bagging model (below), when $\widehat{m}(x, s)$ includes the sensitive attribute (here the gender, s)
We introduced in Sect. 2.5.2 the idea of a “balanced” model, with Definition 2.8,
which corresponds to a property of “globally unbiased.”
Definition 4.26 (Globally Unbiased Model m (Denuit et al. 2021)) Model m is
globally unbiased if .E[Y ] = E[m(X)].
But it is possible to consider a local version,
Definition 4.27 (Locally Unbiased Model m (Denuit et al. 2021)) Model m is locally unbiased at $\widehat{y}$ if $\mathbb{E}[Y|m(X) = \widehat{y}] = \widehat{y}$.
It means that the model is balanced "locally" on the group of individuals whose $x$'s satisfy $\widehat{m}(x) = \widehat{y}$ (and therefore there is no cross-financing between groups).
In the GLM framework, different quantities are used, namely the natural parameter for $y_i$ ($\theta_i$), the prediction for $y_i$ ($\widehat{y}_i = \mu_i = \mathbb{E}(Y_i) = b'(\theta_i)$), the score associated with $y_i$ ($\eta_i = x_i^\top\beta$), and the link function g, such that $\eta_i = g(\mu_i) = g(b'(\theta_i))$. The first-order conditions can be written, using the standard chain rule technique,
$$\nabla_{\beta}\log\mathcal{L} = \sum_{i=1}^n \frac{y_i - \mu_i}{V(\mu_i)\,g'(\mu_i)}\,x_i = 0.$$
In that case, with the canonical link $g_\star = b'^{-1}$, i.e., $\eta_i = \theta_i$, the first-order condition is (with the notation $\widehat{y} = \mu$)
$$\nabla\log\mathcal{L} = X^\top(y - \widehat{y}) = 0,$$
so, if there is an intercept, $\mathbf{1}^\top(y - \widehat{y}) = 0$, i.e., $\sum_{i=1}^n y_i = \sum_{i=1}^n \widehat{y}_i$, which is the empirical version (on the training dataset) of $\mathbb{E}[Y] = \mathbb{E}[\widehat{Y}]$. If a noncanonical link function is used (which is the case for the Tweedie model or the gamma model with a logarithmic link function), the first-order condition is
$$\nabla\log\mathcal{L} = X^\top\Omega(y - \widehat{y}) = 0,$$
where $\Omega$ is a diagonal matrix,² so we no longer have $\mathbb{E}[Y] = \mathbb{E}[\widehat{Y}]$ (unless we consider an appropriate change of measure).
² $\Omega = W\Delta$, where $W = \mathrm{diag}\big((V(\mu_i)g'(\mu_i)^2)^{-1}\big)$ and $\Delta = \mathrm{diag}\big(g'(\mu_i)\big)$, so that we recognize the Fisher information (corresponding to the Hessian matrix, up to a negative sign), $X^\top WX$.
Table 4.5 Balance property on the FrenchMotor dataset, where y is the occurrence of a claim within the year, with $\frac{1}{n}\sum_{i=1}^n \widehat{m}(x_i, s_i)$ on top, and $\frac{1}{n}\sum_{i=1}^n \widehat{m}(x_i)$ below (in %). Recall that on the training dataset $\frac{1}{n}\sum_{i=1}^n y_i = 8.73\%$
             Training data                          Validation data
             ȳ      GLM    CART   GAM    RF         ȳ      GLM    CART   GAM    RF
 m(x, s)     8.73   8.73   8.73   8.73   8.27       8.55   9.05   9.03   8.84   8.70
 m(x)        8.73   8.73   8.73   8.73   8.29       8.55   9.05   9.03   8.84   8.73
Proposition 4.1 In the GLM framework with the canonical link function, $\widehat{m}(x) = g_\star^{-1}(x^\top\widehat{\beta})$ is balanced, or globally unbiased, but possibly locally biased.
This property can be visualized in Table 4.5.
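A minimal R sketch illustrating this proposition is given below, again on the illustrative df object: with the canonical logit link for a Bernoulli response, the average prediction matches the average outcome on the training data exactly, while a non-canonical link (here cloglog) need not:

fit_canonical <- glm(y ~ x1 + x2, family = binomial(link = "logit"),   data = df)
fit_other     <- glm(y ~ x1 + x2, family = binomial(link = "cloglog"), data = df)
c(mean(df$y),
  mean(predict(fit_canonical, type = "response")),   # equal to mean(df$y)
  mean(predict(fit_other,     type = "response")))   # close, but not exactly equal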
If a model is not well-calibrated, several techniques have been considered in the literature, such as a logistic correction, Platt scaling, and the "isotonic regression," as discussed in Niculescu-Mizil and Caruana (2005b), in the context of boosting. Friedman et al. (2000) showed that adaboost builds an additive logistic regression model, and that the optimal m satisfies
$$m(x) = \frac{1}{2}\log\frac{\mathbb{P}[Y = +1|x]}{\mathbb{P}[Y = -1|x]},$$
which suggests applying a "logistic correction" in order to get back the conditional probability. Platt et al. (1999) suggested the use of a sigmoid function (coined "Platt scaling"),
$$m'(x) = \frac{1}{1 + \exp[a\,m(x) + b]},$$
where a and b are obtained using maximum likelihood techniques. Finally, the isotonic (or monotonic) regression, introduced in Robertson et al. (1988) and Zadrozny and Elkan (2001), considers a nonparametric increasing transformation from the $m(x_i)$s to the $y_i$s (that can be performed either with the Iso or rfUtilities R packages, and the probability.calibration function).
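A minimal R sketch of these two recalibration techniques (Platt scaling via a logistic regression on the scores, and an isotonic fit with the base-R isoreg function rather than the packages cited above) is given below, on the illustrative fit and df objects:

score <- predict(fit, type = "response")          # raw scores from some model
# Platt scaling: m'(x) = 1 / (1 + exp(a m(x) + b)), with (a, b) fitted by MLE
platt <- glm(df$y ~ score, family = binomial)
score_platt <- predict(platt, type = "response")
# isotonic regression: nonparametric increasing map from scores to frequencies
o   <- order(score)
iso <- isoreg(score[o], df$y[o])                  # x sorted, so yf aligns with score[o]
score_iso <- numeric(length(score))
score_iso[o] <- iso$yf                            # recalibrated scores, original order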
In Fig. 4.32, we can visualize smooth calibration curves (estimated using local regression, with the locfit package) on three models estimated from the toydata1 dataset. At the top, crude calibration curves, fitted on the pairs $(\widehat{y}_i, y_i)$, and at the bottom, calibration curves after some "isotonic correction" (obtained with the probability.calibration function from the rfUtilities R package).
1 (.Αισωπος) Aesop’s original fables did not use this motivation as casus belli, even if the translation
by Jacobs (1894) is “if it was not you, it was your father, and that is all one”.
Chapter 5
What Data?
should be created in the first place (. . . ) “Data has an annoying way of conforming
itself to support whatever point of view we want it to support.”
“All data is credit data,” said Merrill (2012) at a conference. And if credit
institutions collect a lot of “information,” so do insurance companies, to assess
and prevent risk, target ideal customers, accurately price policies, provide quotes,
conduct investigations, follow trends, create new products, etc. Such information is
now called “data” (from the Latin datum, data being the plural, past participle of
dare “to give,” used in the XVII-th century to designate a fact given as the basis for
calculation, in mathematical problems1 )
Definition 5.1 (Data Wikipedia 2023) In common usage, data are a collection
of discrete or continuous values that convey information, describing the quantity,
quality, fact, statistics, other basic units of meaning, or simply sequences of symbols
that may be further interpreted.
A few years ago, the term “statistics” was also popular, as introduced in
Achenwall (1749). It was based on the Latin "statisticum (collegium)," meaning
“(lecture course on) state affairs,” and Italian statista “one skilled in statecraft,”
and the German term “Statistik” designated the analysis of data about the state,
signifying the “science of state” (corresponding to “political arithmetic” in English).
So in a sense, “statistics” was used to designate official data, collected for instance
by the Census, with a strict protocol, as explained in Bouk (2022), whereas the term
“data” would correspond to any kind of information that could be scraped, stored,
and used.
The “big data” hype has given us the opportunity to talk about not only its large
volume and value but also its variety (and all sorts of words beginning with the letter
“v”). Although for actuaries, data have often been “tabular data,” corresponding to
matrices of numbers, as seen in Part I; in the last few years, the variety of data types has
become more apparent. There will naturally be text, starting with a name, an address
(which can be converted into spatial coordinates), but also possibly drug names in
prescriptions, telephone conversations with an underwriter or claims manager, or in
the case of companies, contracts with digitized clauses, etc. We can have images,
such as a picture of the car after a fender-bender or of the roof of a house after a
fire, medical images (X-rays, MRI), a satellite image of a field for a crop insurance
contract, or of a village after a flood, etc. Finally, there will also be information
associated with connected objects, data obtained from devices in a car fleet, from
a water leak detector or from chimney monitoring and control devices. However,
statistical summaries of “scores” are often based on these raw data (frequently
not available to the insurer) such as the number of kilometers driven in a given
week by a car insurance policyholder, or an acceleration score. These data, which
are much more extensive than tabular variables with predefined fields (admittedly,
sometimes with definition issues, as Desrosières 1998 points out), can provide
sensitive information that can be exploited by an opaque algorithm, possibly without
the knowledge of the actuary.
In non-commercial insurance, the policyholder is an individual, a person (even in
property insurance), and some part of the information collected will be considered
as “personal data,” as much of the information collected is sometimes considered
“sensitive" or “protected." In Europe, “personal data” is any information relating
to a natural person who is identified or can be identified, directly or indirectly. The
definition of personal data is specified in Article 4 of the General Data Protection
Regulation (GDPR). This information can be an identifier (a name, an identification
number, location data, for example) or one or more specific elements specific to the
physical, physiological, genetic, mental, economic, cultural or social identity of the
person. Among the (non-exhaustive) list given by the French CNIL,2 there may be
the surname, first name, telephone number, license plate, social security number,
postal address, e-mail, a voice recording, a photograph, etc. Such information is
relevant, and important, in the insurance business.
“Sensitive data” constitute a subset of personal data that include religious
beliefs, sexual orientation, union involvement, ethnicity, medical status, criminal
convictions and offences, biometric data, genetic information, or sexual activities.
According to the GDPR, in 2016,3 “processing of personal data revealing racial
or ethnic origin, political opinions, religious or philosophical beliefs or trade
union membership, as well as processing of genetic data, biometric data for the
purpose of uniquely identifying a natural person, data concerning health or data
concerning the sex life or sexual orientation of a natural person are prohibited.”
Such information is considered “sensitive.” In Europe, the 20184 “Convention 108”
(or “convention for the protection of individuals with regard to automatic processing
of personal data”) further clarifies the contours.
Informatics and Liberty), an independent French administrative regulatory body whose mission is
to ensure that data privacy law is applied to the collection, storage, and use of personal data, in
France.
3 See https://gdpr-info.eu/.
4 See https://www.coe.int/en/web/data-protection/convention108-and-protocol.
From Avraham et al. (2013), in the USA, we can get Fig. 5.1, which provides a
general perspective about variables that are prohibited across all States, with the
“highest average level of strictness,” for different types of insurance. If there is a
strong consensus about religion and “race,” it is more complicated for other criteria.
Fig. 5.1 (U.S.) State insurance antidiscrimination laws. A factor is either considered "permitted," or there is no specific regulation (green filled circle), usually because the factor is not relevant to the risk; "prohibited" (red filled cross); or there could be variation across states (open circle). ∗ means limited regulation; + means specifically permitted because of adverse selection (source: Avraham et al. 2013)
(Fig. 5.2 lists, for the factors gender, age, driving experience, credit history, education, profession, employment, family, housing, and address/ZIP code, whether each is permitted or prohibited in CA, HI, GA, NC, NY, MA, PA, FL, TX, AL, ON, NB, NL, and QC; the individual markers are not reproduced here.)
Fig. 5.2 A factor is considered “permitted” (green filled circle) when there are no laws or
regulatory policies in the state or province that prohibit insurers from using that factor. Otherwise,
it will be “prohibited” (red filled cross). In North Carolina, age is only allowed when giving a
discount to drivers 55 years of age and older. In Pennsylvania, credit score can be used for new
business and to reduce rates at renewal, but not to increase rates at renewal. In Alberta, credit score
and driver’s license seniority cannot be used for mandatory coverage (but can be used on optional
coverage). In Labrador, age cannot be used before 55, and beyond that, it must be a discount (as in
North Carolina) (source: in the United States, The Zebra 2022 and in Canada, Insurance Bureau of
Canada 2021)
Figure 5.2 presents some variables traditionally used in car insurance, in the
United States5 and in Canada.6 Unlike most European countries, which have a civil
law culture (where the main source of law is found in legal codes), the Canadian
provinces and the states of the United States of America have a common law system
(where rules are primarily made by the courts as individual decisions are made).
Québec uses a mixed law. Most states and provinces have documents listing the
Prohibited Rating Variables, such as Alberta’s (Automobile Insurance Rate Board,
5 CA: California, HI: Hawaii, GA: Georgia, NC: North Carolina, NY: New York, MA: Mas-
A variable that has been intensively discussed is ‘gender,” from the Council
Directive 2004/113/EC (see Sect. 6.2). In France, Article L. 111-7 of the Insurance
Code states that “the Minister in charge of the economy may authorize, by decree,
differences in premiums and benefits based on the taking into account of sex and
proportionate to the risks when relevant and precise actuarial and statistical data
establish that sex is a determining factor in the evaluation of the insurance risk.”
Here, “determining factor” echoes the “causal” effect required in California. In
Box 5.2, we have a description of legal aspects regarding discrimination in the context of insurance, in France, by Rodolphe Bigot, where several criteria are explicitly described: age, family status, and sexual orientation (page 185); pregnancy, maternity and "social insecurity" (page 186); and finally sex (page 187).
premiums based on the size of a car’s engine, even though the most powerful
cars are in fact bought more by men” (Lamy Assurances, 2021, n.◦ 3803).
To bring French law into compliance with European rules, Article L. 111-
7 of the Insurance Code was rewritten with the law of 26 July 2013. A
paragraph II bis was added: “The derogation provided in the last paragraph
of I is applicable to contracts and memberships in group insurance contracts
concluded or made no later than December 20, 2012 and to such contracts
and memberships tacitly renewed after that date. The derogation does not
apply to contracts and memberships mentioned in the first paragraph of
this IIa which have been substantially modified after December 20, 2012,
requiring the parties’ agreement, other than a modification that at least one
of the parties cannot refuse.” In terms of collective supplementary employee
benefits, no discrimination on the basis of gender can be made. However,
insurers still have the possibility of offering options in policies or insurance
products according to gender in order to cover conditions that exclusively or
mainly concern men or women. Differentiated coverage is therefore possible
for breast cancer, uterine cancer, or prostate cancer.
Sometimes, data are “derived” (e.g., country of residence derived from the subject’s
postcode) or “inferred” data (e.g., credit score, outcome of a health assessment,
results of a personalization or recommendation process) and not “provided” by the
data subject actively or passively, but rather created by a data controller or third
party from data provided by the data subject and, in some cases, other background
data, as explained in Wachter and Mittelstadt (2019). According to Abrams (2014),
inferences can be considered personal data as they derive from personal data.
Another precaution that should be kept in mind relates to the distinction between
what “reveals” and what “ is likely to reveal,” as Debet (2007) states (see also
Van Deemter 2010, “in praise of vagueness ”). Some information is self reported,
and some is inferred data. For instance, it could be possible to ask for “sex at birth”
to collect a sex variable, but in most cases a variable is based on civility (where
5.2 Personal and Sensitive Data 189
“Mrs” or “Mr” are proposed), so the information is more a gender variable. But
it can be more complex, as some models can be used to infer some information.
One can imagine that “being pregnant” could be a sensitive piece of information in
many situations. This information exists in some health organization databases (or
health care reimbursements). But as shown by Duhigg (2019) (in how companies
learn your secrets), there are organizations that try to infer this information from
purchases. This is the famous story of the man, in the Minneapolis area, who was
surprised that coupons for various products for young mothers were addressed to
his daughter. In this story, the inference by the model had been correct. We can
imagine, for marketing reasons, that some insurers might be interested in knowing
such information.
Recently, Lovejoy (2021) recalls that in June 2020, LinkedIn had a massive
breach (exposing the data of 700 million users), with a database of records including
phone numbers, physical addresses, geolocation data and... “inferred salaries.”
Again, knowing the salary of policyholders could be seen as interesting for some
insurers (any micro-economic model used to study insurance demand is based on
the “wealth” of the policyholder, as in Sect. 2.6.1). Much more information can
be inferred from telematics data, albeit with varying degrees of uncertainty. As
mentioned by Bigot et al. (2019), by observing that a person parks almost every
Friday morning near a mosque, one could say that there is a high probability that he
or she is Muslim (based on surveys of Muslim practices). But it is possible that this
inference is completely wrong, and that this person actually goes to the gym, across
the street from the mosque, and moreover, is a regular attendee. Facebook may be
able to infer protected attributes such as sexual orientation, race (Speicher et al.
2018), as well as political views (Tufekci 2018) and impending suicide attempts
(Constine 2017), whereas third parties have used Facebook data to decide eligibility
for loans (Taylor and Sadowski 2015) and infer political positions on abortion
(Coutts 2016). Susceptibility to depression can also be inferred from Facebook and
Twitter usage data. Microsoft can also predict Parkinson’s (Allerhand et al. 2018)
and Alzheimer’s (White et al. 2018) disease from search engine interactions. Other
recent invasive applications include predicting pregnancy by Target, and assessing
user satisfaction from mouse tracking (Chen et al. 2017). Even if such inferences are
impossible to understand (as they are provided by opaque models), or refute most
of the time, they could impact our private lives, reputation, and identity deeply, as
discussed in Wachter and Mittelstadt (2019). Therefore, inferences are very close to
the data available, and to “attacks,” as coined in the literature about privacy.
5.2.4 Privacy
As explained by Kelly (2021), “often the location data is used to determine what
stores people visit. Things like sexual orientation are used to determine what
demographics to target,” (in a marketing context). Each type of data can reveal
something about our interests and preferences, our opinions, our hobbies, and
190 5 What Data?
our social interactions. For example, an MIT study7 demonstrated how email
metadata can be used to map our lives, showing the changing dynamics of our
professional and personal networks. These data can be used to infer personal
information, including a person’s background, religion or beliefs, political views,
sexual orientation and gender identity, social relationships, or health. For example,
it is possible to infer our specific health conditions simply by connecting the dots
between a series of phone calls. For Mayer et al. (2016), the law currently treats
call content and metadata separately and makes it easier for government agencies
to obtain metadata, in part because it assumes that it should not be possible to infer
specific sensitive details about individuals from metadata alone. Chakraborty et al.
(2013) reminds us that current approaches to privacy protection, typically defined in
multi-user contexts, rely on anonymization to prevent such sensitive behavior from
being traced back to the user—a strategy that does not apply if the user’s identity is
already known. In time, a tracking system may be accurate enough to place a person
in the vicinity of a bank, bar, mosque, clinic or other privacy-sensitive location. In
2015, as told in Miracle (2016), Noah Deneau wondered if it would be possible to
identify devout Muslim drivers in New York City by looking at anonymized data
and inactive drivers during the five times of the day when they are supposed to
pray. He quickly searched for drivers who were not very active during the 30- to
45-min Muslim prayer period and was able to find four examples of drivers who
might fit this pattern. This brings to mind Gambs et al. (2010), who conducted
an investigation on a dataset containing mobility data of taxi drivers in the San
Francisco Bay Area. By finding places where the taxi’s GPS sensor was turned off
for a long period of time (e.g., 2 h), they were able to infer the interests of the
drivers. For 20 of the 90 analyzed users, they were able to locate a plausible home
in a small neighborhood. They even confirmed these results for 10 users by using
a satellite view of the area: It showed the presence of a yellow taxi parked in front
of the driver’s supposed home. Dalenius (1977) introduced an interesting concept
of privacy. Nothing about an individual should be learned from a dataset if it cannot
be learned without having access to the dataset. We will return to this idea when we
define the fairness criteria, and when we require that the protected variable s cannot
be predicted from the data, and from the predictions.
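To make this requirement concrete, here is a minimal sketch (not code from the book; the variable names and the use of scikit-learn are hypothetical) of such an "inference attack" audit: a simple classifier tries to recover a protected attribute s from a model's scores, and an AUC well above 0.5 signals that information about s leaks through the predictions.

# Minimal sketch (illustrative only): can the protected attribute s be
# recovered from a model's predictions? All names are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 5_000
s = rng.integers(0, 2, size=n)                 # protected attribute
x = rng.normal(size=n) + 0.8 * s               # feature correlated with s
y_score = 1 / (1 + np.exp(-(0.5 + 1.2 * x)))   # scores of some pricing model

# "Attack": try to infer s from the predictions alone
clf = LogisticRegression().fit(y_score.reshape(-1, 1), s)
auc = roc_auc_score(s, clf.predict_proba(y_score.reshape(-1, 1))[:, 1])
print(f"AUC for predicting s from the scores: {auc:.2f}")  # well above 0.5 here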
The “right to be forgotten” is the right to have private information about a person be
removed from various directories, as discussed in Rosen (2011), Mantelero (2013),
or Jones (2016). It is also named "right to oblivion" in de Andrade (2012) and "right
to erasure” in Ausloos (2020).
7 Project https://immersion.media.mit.edu/.
The process of collecting “internal data” typically begins with a “form,” derived
from the Latin word forma meaning form, contour, figure, or shape. In the XIV-th
century, the term started to refer to a legal agreement.8 Forms are used by insurers
in the underwriting process or to handle claims. “Think of a form to be filled in,
on paper or a screen, intended to gather information that can later be quantified,”
wrote Bouk (2022). Almost paraphrasing Christensen et al. (2016), he adds that
“someone, somewhere, designed that form, deciding on the set of questions to be
asked, or the spaces to be left blank. Maybe they also listed some possible answers
or limited the acceptable responses (...) The final resulting form and all that is
written upon it as well as all negotiations that shaped it, whether backstage or
offscreen, so to speak—all of this is data too. The data behind the numbers. To
find stories in the data, we must widen our lens to take in not only the numbers but
also the processes that generated those numbers.”
Traditionally, and somewhat simplifying, insurance companies use two
kinds of databases: an underwriting database (one line represents a policy, with
information on the policyholder, the insured property, etc.) and a claims database
(one line corresponds to a claim, with the policy number, and the last view of
the associated expenses), as in Fig. 5.3. These two bases are linked by the policy
number. But it is possible to use other secondary “keys” (as coined in database
management systems), corresponding to a single variable (or set of them) that can
uniquely identify rows. For internal data, classical keys are the policy number (to
connect the underwriting and the claim database), a "client number" (to connect
different policies held by the same person), or a claim number (usually to
connect the claims database to financial records). But it is also possible to use some
“keys” to connect to other databases, such as the license plate of the car, the model
of a car (“2018 Honda Civic Touring Sedan”), the address of a house, etc.
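As a rough illustration of how such keys work in practice, the following sketch (hypothetical table and column names, using pandas) joins a toy claims database to a toy underwriting database through the policy number, and then aggregates claims back at the policy level.

# Minimal sketch of linking the two internal databases by a key (the policy
# number); tables and columns are hypothetical, not from an actual insurer.
import pandas as pd

underwriting = pd.DataFrame({
    "policy_id": [101, 102, 103],
    "car_model": ["2018 Honda Civic Touring Sedan", "2015 Toyota Yaris", "2020 Tesla Model 3"],
    "address":   ["12 Main St", "34 Oak Ave", "56 Pine Rd"],
})
claims = pd.DataFrame({
    "claim_id":  [9001, 9002, 9003],
    "policy_id": [101, 101, 103],
    "paid":      [1200.0, 300.0, 5400.0],
})

# Left join: one line per claim, enriched with underwriting information
merged = claims.merge(underwriting, on="policy_id", how="left")

# Aggregate back to the policy level (claim count and total cost per policy)
per_policy = merged.groupby("policy_id").agg(
    n_claims=("claim_id", "count"), total_paid=("paid", "sum"))
print(per_policy)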
In recent years, however, companies have increasingly relied on data obtained from
a wide variety of external sources. These data are either on the insured property, with
8 Instead of the Latin formula that could designate a contract. Actually, “formula” nowadays
refers to “mathematical formulas” as seen in Chaps. 3 and 4, or “magic formulas,” the two being
very close for many people (see for example the introduction of O’Neil (2016) explaining how
mathematics “was not only deeply entangled in the world’s problems but also fueling many of
them. The housing crisis, the collapse of major financial institutions, the rise of unemployment—
all had been aided and abetted by mathematicians wielding magic formulas.").
Fig. 5.3 The databases of an insurer, with an underwriting database (in the center), with one line
per insurance policy. This database is linked to the claims database, which contains all the claims,
and an associated policy number. Other data can be used to enhance the database, for example,
based on the insured person's home address, with socio-economic information (aggregated by
neighborhood: wealth, number of crimes, etc.), as well as other information extracted from maps or
satellite images (distance to the nearest fire hydrant, presence of a swimming pool, etc.). In car
insurance, it is possible to find the value of a car from its characteristics, etc.
information about the car model, or about the house, obtained from the address,
as in Fig. 5.3. The address historically allowed (aggregated) information on the
neighborhood to be obtained, on the number of violations, on past floods, on the
distance to the nearest fire station, etc. We can also use satellite images (via Google
Earth) or information from OpenStreetMap (we will mention those in Sect. 6.8).
And insurers rely on data that are becoming increasingly extensive, with sensors
deployed everywhere, in cars or in cell phones, as Prince and Schwarcz (2019)
recall. This “data boom” raises the question of whether an increasingly detailed
insight into the lives of policyholders can lead to more accurate pricing of risks.
Following Heidorn (2008), Hand (2020) discussed the concept of “dark data,”
noting that all data has a hidden side, which could potentially generate bias. We
formalize in the following sections these notions of bias, some of which can
be visualized in Fig. 5.4, inspired by Suresh and Guttag (2019). The “historical
bias” is the one that exists in the world as it is. This is the bias evoked in Garg
et al. (2018) in the context of textual analysis ("word embeddings"), where
the vectorization reflects existing biases between men and women (the word
"nurse" (nongendered) is often associated with words related to women,
whereas "doctor" is often associated with words related to men), or toward
minorities. The “sampling bias” is the one mentioned in Ribeiro et al. (2016), where
a classification algorithm is trained, to distinguish dogs from wolves, except that
Fig. 5.4 Bias in data generation, and in model building (loosely based on Suresh and Guttag
2019). (a) Data generation. (b) Modeling process
all the images of wolves are taken in the snow, and the algorithm just looks at
the background of the image to assign a label. For “measurement bias,” Dressel
and Farid (2018) refers to reoffending in predictive justice, which is sometimes
measured not as a new conviction but as a second arrest. The “cultural bias” (called
“aggregation bias” in Suresh and Guttag 2019) refers to the following problem:
a particular dataset might represent people or groups with different backgrounds,
cultures, or norms, and a given variable can mean something quite different for
them. Examples include irony in textual analysis, or a cultural reference (which
the algorithm cannot understand). Hooker et al. (2020) observe that compression
amplifies existing algorithmic biases (where compression is similar to tree pruning,
when attempting to simplify models). Another example is Bagdasaryan et al. (2019)
who point out that data anonymization techniques can be problematic. Differential
privacy learning mechanisms such as gradient clipping and noise addition have a
disproportionate effect on under-represented and more complex subgroups. This
phenomenon is called “learning bias.” Evaluation bias takes place when the
reference data used for a particular task is not representative. This can be seen in
facial recognition algorithms, trained on a population that is very different from
the real one. Buolamwini and Gebru (2018) note that darker-skinned women make
up 7.4% of the Adience database (published in Eidinger et al. 2014, with 26,580
photos across 2,284 subjects with a binary gender label and one label from eight
different age groups), and this lack of representativeness of certain populations
can be an issue (e.g., for an algorithm designed to detect skin cancers). Finally,
“deployment bias” refers to the gap between the problem a model is expected to
solve, and the way it is actually used. This is what Collins (2018) or Stevenson
(2018) show, describing the harmful consequences of risk assessment tools for
actuarial sentencing, particularly in justifying an increase in incarceration on the
basis of individual characteristics. In Box 5.3, Olivier l’Harridon discusses “decision
bias.”
Insurers have the feeling that they can use external data to get valuable information
about their policyholders. And not only insurers: all major companies are fascinated
by external data, especially those collected by big tech companies. As Hill (2022)
wrote, “Facebook defines who we are, Amazon defines what we want, and Google
defines what we think.”
But it is not new. Scism and Maremont (2010a,b) reported that a US insurer
had teamed up with a consulting firm, to look at 60,000 recent insurance applicants
(health related) and they found that a predictive model based partly on consumer-
marketing data was "persuasive" in its ability to replicate traditional underwriting
techniques (based on costly blood and urine testing). So-called “external informa-
tion” was personal and family medical history, as well as information shared by
the industry from previous insurance applications, and data provided by Equifax
Inc., such as likely hobbies, television viewing habits, estimated income, etc. Some
examples of “good” risk assessment factors were being a “foreign traveler,” making
healthy food choices, being an outdoor enthusiast, and having strong ties to the
community. On the other hand, “bad” risk factors included having a long commute,
high television consumption, and making purchases associated with obesity, among
many others.
This is possible because of “data brokers” or “data aggregators,” as discussed
in Beckett (2014) and Spender et al. (2019). These companies independently collect
data on a grand scale from various sources; they clean the data, link it, and sell it on.
Banham (2015) and Karapiperis et al. (2015), cited in Kiviat (2021), mentioned
that in car insurance, policymakers have investigated the use of credit scores, web
browser history, grocery store purchases, zip codes, the number of past addresses,
speeding tickets, education level, data from devices that track driving in real time,
etc. In Box 5.5, we reproduce a list of variables given by David Backer, an insurer in
Florida, USA.
9 See https://vimeo.com/123722811.
we have to accept that they don’t want to pay for ours, in return. Furthermore, our
behaviors can certainly be influenced by our own decisions, but they can also most
certainly be influenced by the decisions of others. Our behaviors are not isolated.
They are the product of our own actions and the product of our interactions with
others. And this applies when we are driving a car, as well as in a million and one
other activities.
As explained in Finger (2006) and Cummins et al. (2013), the risk factors
of a risk classification system have to meet many requirements: they have to be
(i) statistically (actuarially) relevant, (ii) accepted by society, (iii) operationally
suitable, and (iv) legally accepted.
Experience rating occurs when the premium charged for an insurance policy is explicitly linked
to the previous claim record of the policy or the policyholder. Actually, claims
history has been the most important rating factor in motor insurance, over the past
60 years. Lemaire et al. (2016) argued that annual mileage and claims history (such
as a bonus-malus class) are the two most powerful rating variables.
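For illustration only, the following sketch mimics a bonus-malus mechanism of the kind used in French motor insurance (loosely based on the "coefficient de réduction-majoration": roughly 0.95 per claim-free year, 1.25 per at-fault claim, within regulatory bounds). It is a simplified example, not the system described in the references above.

# Illustrative sketch of a bonus-malus ("merit rating") update rule.
# Assumed figures: x0.95 per claim-free year, x1.25 per at-fault claim,
# coefficient floored at 0.50 and capped at 3.50.
def update_crm(coef: float, at_fault_claims: int) -> float:
    """One policy year: apply the malus for each at-fault claim, else the bonus."""
    if at_fault_claims == 0:
        coef *= 0.95
    else:
        coef *= 1.25 ** at_fault_claims
    return round(min(max(coef, 0.50), 3.50), 2)

coef = 1.00
history = [0, 0, 1, 0, 0, 0]          # at-fault claims per policy year
for year, n in enumerate(history, 1):
    coef = update_crm(coef, n)
    print(f"year {year}: coefficient = {coef:.2f}")
# The premium paid is the base premium multiplied by the coefficient.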
Experience rating has to do with “merit-based” insurance pricing, “merit rating”
as coined by Van Schaack (1926), studied in Wilcox (1937) or Rubinow (1936), and
popularized by Bailey and Simon (1960). It seems like “merit” has always been seen
as a morally valid predictor. But recently, Sandel (2020) criticized this "tyranny
of merit,” “the meritocratic ideal places great weight on the notion of personal
responsibility. Holding people responsible for what they do is a good thing, up to
a point. It respects their capacity to think and act for themselves, as moral agents
and as citizens. But it is one thing to hold people responsible for acting morally;
it is something else to assume that we are, each of us, wholly responsible for our
lot in life.” “What matters for a meritocracy is that everyone has an equal chance
to climb the ladder of success; it has nothing to say about how far apart the rungs
of the ladder should be. The meritocratic ideal is not a remedy for inequality; it is
a justification of inequality.” Hence, behavioral fairness corresponds to the fairness
of merit. But as pointed out by Szalavitz (2017), the narrative that legitimizes the
idea of merit is an often biased narrative, with a classical “actor-observer bias.”
Personalization is close to this idea, as it means looking at an individual based on
his or her own risk profile, claims experience, and characteristics, rather than viewing him
or her as part of a set of similar risks. Hyper-personalization takes personalization
further.
Omitted variable bias occurs when a regression model is fitted without considering
an important (predictor) variable. For example, Pradier (2011) noted that actuarial
textbooks (such as Depoid 1967) state that “the pure premium of women in the North
American market would be equal to that of men if it were conditioned on mileage.”
The sub-identification corresponds to the case where the true model would be $y_i = \beta_0 + \boldsymbol{x}_1^\top\boldsymbol{\beta}_1 + \boldsymbol{x}_2^\top\boldsymbol{\beta}_2 + \varepsilon_i$, but the estimated model is $y_i = b_0 + \boldsymbol{x}_1^\top\boldsymbol{b}_1 + \eta_i$ (i.e., the variables $\boldsymbol{x}_2$ are not used in the regression). The least squares estimate of $\boldsymbol{b}_1$, in the mis-specified model, is (with the standard matrix notation in econometrics, as in Davidson et al. 2004 or Charpentier et al. 2018)
$$
\widehat{\boldsymbol{b}}_1 = (\boldsymbol{X}_1^\top\boldsymbol{X}_1)^{-1}\boldsymbol{X}_1^\top\boldsymbol{y}
= (\boldsymbol{X}_1^\top\boldsymbol{X}_1)^{-1}\boldsymbol{X}_1^\top[\boldsymbol{X}_1\boldsymbol{\beta}_1 + \boldsymbol{X}_2\boldsymbol{\beta}_2 + \boldsymbol{\varepsilon}]
= (\boldsymbol{X}_1^\top\boldsymbol{X}_1)^{-1}\boldsymbol{X}_1^\top\boldsymbol{X}_1\boldsymbol{\beta}_1 + (\boldsymbol{X}_1^\top\boldsymbol{X}_1)^{-1}\boldsymbol{X}_1^\top\boldsymbol{X}_2\boldsymbol{\beta}_2 + (\boldsymbol{X}_1^\top\boldsymbol{X}_1)^{-1}\boldsymbol{X}_1^\top\boldsymbol{\varepsilon},
$$
so that $\mathbb{E}[\widehat{\boldsymbol{b}}_1] = \boldsymbol{\beta}_1 + \boldsymbol{\beta}_{12}$, the bias (which we have noted $\boldsymbol{\beta}_{12}$) being null only in the case where $\boldsymbol{X}_1^\top\boldsymbol{X}_2 = \boldsymbol{0}$ (that is to say $\boldsymbol{X}_1 \perp \boldsymbol{X}_2$): we find here a consequence of the Frisch–Waugh theorem (from Frisch and Waugh 1933). If we simplify a little, let us suppose that the real underlying model of the data is
$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon,
$$
where $x_1$ and $x_2$ are explanatory variables, $y$ is the target variable, and $\varepsilon$ is random noise. The estimated model by removing $x_2$ gives
$$
y = \widehat{b}_0 + \widehat{b}_1 x_1.
$$
One can think of a missing significant variable $x_2$, or the case where $x_2$ is a protected variable. Estimates of the regression coefficients obtained by least squares are (usually) biased, in the sense that
$$
\widehat{b}_1 = \frac{\widehat{\operatorname{cov}}[x_1, y]}{\widehat{\operatorname{Var}}[x_1]} = \frac{\widehat{\operatorname{cov}}[x_1, \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon]}{\widehat{\operatorname{Var}}[x_1]},
$$
or
$$
\widehat{b}_1 = \beta_1 \cdot \underbrace{\frac{\widehat{\operatorname{cov}}[x_1, x_1]}{\widehat{\operatorname{Var}}[x_1]}}_{=1} + \beta_2 \cdot \frac{\widehat{\operatorname{cov}}[x_1, x_2]}{\widehat{\operatorname{Var}}[x_1]} + \underbrace{\frac{\widehat{\operatorname{cov}}[x_1, \varepsilon]}{\widehat{\operatorname{Var}}[x_1]}}_{=0} = \beta_1 + \beta_2 \cdot \frac{\widehat{\operatorname{cov}}[x_1, x_2]}{\widehat{\operatorname{Var}}[x_1]}.
$$
When $x_2$ is omitted, $\widehat{b}_1$ is biased, especially because $x_1$ and $x_2$ are correlated.
Therefore, in most realistic cases, removing the sensitive variable (if $x_2 = p$) not
only fails to make the regression models fair but, on the contrary, it is likely to
amplify discrimination. For example, in labor economics, if immigrants tend to have
lower levels of education, then the regression model would “punish” low education
even more by offering even lower wages to those with low levels of education (who
are mostly immigrants). Žliobaite and Custers (2016) suggests that a better strategy
for sorting regression models would be to learn a model on complete data that
includes the sensitive variable, then remove the component containing the sensitive
variable and replace it with a constant that does not depend on the sensitive variable.
A study on discrimination prevention for regression, Calders and Žliobaite (2013), is
related to the topic, but with a different focus. Their goal is to analyze the role of the
sensitive variable in suppressing discrimination, and to demonstrate the need to use
it for discrimination prevention. Calders and Žliobaite (2013) does, of course, use
the sensitive variable to formulate nondiscriminatory constraints, which are applied
during model fitting. But a discussion of the role of the sensitive variable is not the
focus of their study. Similar approaches have been discussed in economic modeling,
as in Pope and Sydnor (2011), where the focus was on the sanitization of regression
models; the focus of Žliobaite and Custers (2016) is on the implications for data regulations.
Returning to our example, there are cases where $\widehat{b}_1 < 0$ whereas, in the true model, $\beta_1 > 0$. This is called a (Simpson) paradox, or spurious correlation (in ecological inference), in the sense that the direction of impact of a predictor variable is not clear.
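The sign reversal described above can be reproduced with a short simulation (an illustrative sketch, not code from the book, with arbitrary parameters): the true coefficient of $x_1$ is positive, but once the correlated variable $x_2$ is dropped, the least squares estimate becomes negative, matching the bias term $\beta_2\,\widehat{\operatorname{cov}}[x_1,x_2]/\widehat{\operatorname{Var}}[x_1]$ derived earlier.

# Minimal simulation of omitted variable bias (illustrative sketch).
# True model: y = 1 + 1*x1 - 3*x2 + eps, with x1 and x2 positively correlated,
# so omitting x2 flips the sign of the estimated coefficient on x1.
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)   # correlated with x1
y = 1 + 1.0 * x1 - 3.0 * x2 + rng.normal(size=n)

# Full model (x1 and x2): approximately unbiased estimates
X_full = np.column_stack([np.ones(n), x1, x2])
beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)

# Mis-specified model (x2 omitted): biased estimate of the x1 coefficient
X_red = np.column_stack([np.ones(n), x1])
beta_red, *_ = np.linalg.lstsq(X_red, y, rcond=None)

bias_theory = -3.0 * np.cov(x1, x2)[0, 1] / np.var(x1)    # beta2 * cov / var
print("full model      :", np.round(beta_full, 2))        # ~ [1, 1, -3]
print("x2 omitted      :", np.round(beta_red, 2))         # x1 coefficient ~ 1 - 2.4 < 0
print("theoretical bias:", round(bias_theory, 2))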
Fig. 5.5 Admission statistics for the six largest graduate programs at Berkeley, with the number
of admissions out of the number of applications, and the corresponding admission rate. The bold
numbers indicate, for each row, whether men or women have the higher admission rate. The proportion
column shows the male–female proportions among the applications submitted. The total corresponds
to the 12,763 applications in 85 graduate programs; the six largest programs are detailed below,
and the "top 6" line is the total of these six programs, i.e., 4,526 applications (source: Bickel et al.
1975)
Formally, the paradox corresponds to situations where
$$
\frac{a_1}{c_1} < \frac{a_2}{c_2} \quad\text{and}\quad \frac{b_1}{d_1} < \frac{b_2}{d_2}, \quad\text{while}\quad \frac{a_1+b_1}{c_1+d_1} > \frac{a_2+b_2}{c_2+d_2}.
$$
The conclusion of Bickel et al. (1975) emphasizes “the bias in the aggregated
data stems not from any pattern of discrimination on the part of admissions com-
mittees, which seems quite fair on the whole, but apparently from prior screening at
earlier levels of the educational system. Women are shunted by their socialization
and education toward fields of graduate study that are generally more crowded,
less productive of completed degrees, and less well funded, and that frequently offer
poorer professional employment prospects.” In other words, the source of the gender
bias in admissions was a field problem: through no fault of the departments, women
were "shunted by their socialization," which occurred at an earlier stage in their
lives.
Another illustration is found in the Titanic data, specifically when comparing crew
members with passengers. Figure 5.6 shows the same paradox, in the
context of survival following the sinking.
But let us look at the survival rates for men and women separately, as presented
in Fig. 5.6. Among the crew, there were 885 men, of whom 192 survived, a rate of
21.7%. Among the third-class passengers, 462 were men, and 75 survived, a rate of
16.2%. Among the crew, there were 23 women, and 20 of them survived, a rate of
87.0%. And among the third-class passengers, 165 were women, and 76 survived,
a rate of 46.1%. In other words, for males and females separately, the crew had
a higher survival rate than the third-class passengers; but overall, the crew had a
lower survival rate than the third-class passengers. As with the admissions, there is
no miscalculation, or catch. There is simply a misinterpretation, because gender $x_2$
and status $x_1$ (passenger/crew member) are not independent, just as gender $x_2$ and
survival y are not independent. Indeed, although women represent 22% of the total
population, they represent more than 50% of the survivors... and 2.5% of the crew.
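The arithmetic of this reversal can be checked directly from the counts quoted in the preceding paragraph; the short script below (a sketch in plain Python, with the figures transcribed from the text) recomputes the group-wise and aggregated survival rates.

# Recomputing the Titanic illustration from the counts quoted in the text
# (survivors and totals, for crew and third-class passengers, by gender).
survived = {("crew", "male"): 192, ("crew", "female"): 20,
            ("third", "male"): 75,  ("third", "female"): 76}
total    = {("crew", "male"): 885, ("crew", "female"): 23,
            ("third", "male"): 462, ("third", "female"): 165}

# Within each gender, the crew has the higher survival rate
for sex in ["male", "female"]:
    for grp in ["crew", "third"]:
        rate = survived[(grp, sex)] / total[(grp, sex)]
        print(f"{sex:6s} {grp:5s}: {rate:.1%}")

# Aggregated over gender, the ordering is reversed (Simpson's paradox)
for grp in ["crew", "third"]:
    s = survived[(grp, "male")] + survived[(grp, "female")]
    t = total[(grp, "male")] + total[(grp, "female")]
    print(f"all    {grp:5s}: {s / t:.1%}")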
Fig. 5.6 Survival statistics for Titanic passengers conditional on two factors, crew/passenger ($x_1$)
and gender ($x_2$)
Problems similar to Simpson’s paradox also occur in other forms. For example,
the ecological paradox (analyzed by Freedman 1999, Gelman 2009, and King et al.
2004) describes a contradiction between a global correlation and a correlation within
groups. A typical example was described by Robinson (1950). The correlation
between the percentage of the foreign-born population and the percentage of literate
people in the 48 states of the USA in 1930 was $+53\%$. This means that states
with a higher proportion of foreign-born people were also more likely to have
higher literacy rates (more people who could read, at least in American English).
Superficially, this value suggests that being foreign born means that people are
more likely to be literate. But if we look at the state level, the picture is quite
different. Within states, the average correlation is $-11\%$. The negative value means
that being foreign born means that people are less likely to be literate. If the within-
state information had not been available, an erroneous conclusion could have been
drawn about the relationship between country of birth and literacy.
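A small simulation (illustrative only, with arbitrary parameters) shows how such an ecological reversal arises: within each group the two variables are negatively correlated, yet because the group means move together, the correlation computed on the pooled data is strongly positive.

# Toy illustration of the ecological paradox: the correlation on pooled data
# has the opposite sign of the correlation within each group.
import numpy as np

rng = np.random.default_rng(1)
groups, n_per_group = 20, 200
group_means = rng.uniform(0, 10, size=groups)

x_all, y_all, within = [], [], []
for m in group_means:
    x = m + rng.normal(size=n_per_group)
    y = 2 * m - 0.8 * (x - m) + rng.normal(size=n_per_group)  # negative slope within a group
    x_all.append(x)
    y_all.append(y)
    within.append(np.corrcoef(x, y)[0, 1])

x_all, y_all = np.concatenate(x_all), np.concatenate(y_all)
print("average within-group correlation:", round(float(np.mean(within)), 2))              # negative
print("overall correlation             :", round(float(np.corrcoef(x_all, y_all)[0, 1]), 2))  # positive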
Fig. 5.7 Annual mortality rate for women, Costa Rica and Sweden, and the age pyramid for both countries (inspired by Cohen 1986, data source: Keyfitz et al.
1968)
Within the subset of the students enrolled in the program (i.e., $\{i : y_{3,i} = 1\}$), the variables
$x_1$ and $x_2$ are strongly but negatively correlated this time.
10 Equivalent, in the UK, of 911 in North America, 112 in many European countries, or 0118 999
Fig. 5.8 Relationship between the two variables, $x_1$ (GPA) and $x_2$ (SAT), as a function of the
population studied: total population top left ($r \sim 0.55$), observable population top right, i.e.,
students who applied to the program ($r \sim 0.45$), population of students admitted to the program
at the bottom left ($r \sim -0.05$), and population of students enrolled in the program at the bottom
right ($r \sim -0.7$) (source: author, dummy data). GPA = grade point average, SAT = scholastic
aptitude test
11 To continue the analogy, in credit risk, we find the three previous levels, with (1) those who do
not apply for credit, (2) those to whom the institution does not offer credit, and (3) those who are
not interested in the offer made.
in the database, and that some information is only partially reported, such as
information related to blood alcohol content, as pointed out by Carnis and Lassarre
(2019): does missing information mean that the test was negative or that it was not
done? It is crucial to know how the data were collected before you begin to analyze
it. Missing information is also common in health insurance, as medical records are
often relatively complex, with all kinds of procedure codes (that vary from one
hospital to another, from one insurer to another). In France, the majority of drugs
are pre-packaged (in boxes of 12, 24, 36), which makes it difficult to quantify the
“real” consumption of drugs.
When measuring the wealth of a population, there are two common approaches:
surveys and tax data. Surveys can be expensive and logistically complex to conduct.
On the other hand, tax data can be biased, particularly when it comes to capturing
very high incomes owing to tax optimization strategies that may vary over time.
For instance, in the UK, individuals can circumvent inheritance tax by utilizing
strategies such as borrowing against a taxable asset (e.g., their home) and investing
the loan in a nontaxable asset such as woodland. One can also evade taxes in the
UK by purchasing property through an offshore company, as non-UK companies
and residents are exempt from UK taxation. When loopholes in a tax system are
identified and individuals start to exploit them extensively, it often results in the
creation of more complex structures that, in turn, possess their own loopholes. This
phenomenon is known as “Goodhart’s Law.”
According to Marilyn Strathern, Goodhart's Law states that "when a measure
becomes a target, it ceases to be a good measure." In US health care, Poku (2016)
notes that starting in 2012, under the Affordable Care Act, Medicare began charging
financial penalties to hospitals with “higher than expected” 30-day readmission
rates. Consequently, the average 30-day hospital readmission rate for fee-for-service
beneficiaries decreased. Is this due to improved efforts by hospitals to transition and
coordinate care, or is it related to the increase in "observation" stays over the same
period? Very often, setting a target based on a precise measure (here the 30-day
readmission rate) makes this variable completely unusable to quantify the risk of
getting sick again, but also has a direct impact on other variables (in this case the
number of “observation” stays), making it difficult to monitor over time. On the
Internet, algorithms are increasingly required to sort content, judge the defamatory
or racist nature of Tweets, see if a video is a deepfake, put a reliability score on
a Facebook account, etc. There is a growing demand from many individuals to
make algorithms transparent, allowing them to understand the process behind the
creation of these scores. Unfortunately, as noted by Dwoskin (2018) “not knowing
how [Facebook is] judging us is what makes us uncomfortable. But the irony is that
they can’t tell us how they are judging us—because if they do, the algorithms that
they built will be gamed,” exactly as Goodhart’s law implies.
lot behind the supermarket.” Policyholders are rewarded when they improve their
driving behavior, “relative to the broader policy holder pool” as stated by Friedman
and Canaan (2014). This approach, sometimes referred to as “gamification,” may
even encourage drivers to change their behavior and risks. Jarvis et al. (2019) goes
so far as to assert that “insurers can eliminate uncertainty by shaping behaviour.”
In attempting to typify dark data biases, Hand (2020) listed dozens of other existing
biases. Beyond the missing variables mentioned above, there is a particularly impor-
tant selection bias. In 2017, during one of the debates at the NeurIPS conference on
interpretability,12 an example of pneumonia detection was mentioned: a deep neural
network is trained to distinguish low-risk patients from high-risk patients, in order to
determine who to treat first. The model was extremely accurate on the training data.
Upon closer inspection, it turns out that the neural network found out that patients
with a history of asthma were extremely low risk, and did not require immediate
treatment. This may seem counter-intuitive, as pneumonia is a lung disease and
patients with asthma tend to be more prone to it (typically making them high-risk
patients). Looking in more detail, asthma patients in the training data did have a
low risk of pneumonia, as they tended to seek medical attention much earlier than
non-asthma patients. In contrast, non-asthmatics tended to wait until the problem
became more severe before seeking care.
Survival bias is another type of bias that is relatively well known and doc-
umented. The best-known example is that presented by Mangel and Samaniego
(1984). During World War II, engineers and statisticians were asked how
to strengthen bombers that were under enemy fire. The statistician Abraham Wald
began to collect data on the location of impacts on the aircraft. To everyone's surprise, he recommended
armoring the aircraft areas that showed the least damage. Indeed, the aircraft used
in the sample had a significant bias: only aircraft that returned from the theatre
of operations were considered. If they were able to come back with holes in the
wingtips, it meant that the parts were strong enough. And because no aircraft came
back with holes in the propeller engines, those are the parts that needed to be
reinforced. Another example is patients with advanced cancer. To determine which
of two treatments is more effective in extending life spans, patients are randomly
assigned to the two treatments and the average survival times in the two groups are
compared. But inevitably, some patients survive for a long time—perhaps decades—
and we don’t want to wait decades to find out which treatment is better. So the study
will probably end before all the patients have died. That means we won’t know the
12 Called “The Great AI Debate: Interpretability is necessary for machine learning,” opposing Rich
Caruana and Patrice Simard (for) to Kilian Weinberger and Yann LeCun (against), https://youtu.be/93Xv8vJ2acI.
survival times of patients who have lived past the study end date. Another concern
is that over time, patients may die of causes other than cancer. And again, the data
telling us how long they would have survived before dying of cancer is missing.
Finally, some patients might drop out (for reasons unrelated to the study, or not).
Once again, their survival times are missing data. Related to this example, we can
return to another important question: why are more people dying from Alzheimer's
disease than in the past? One answer may seem paradoxical: it is due to the progress
of medical science. Thanks to medical advances, people who would have died young
are now surviving long enough to be vulnerable to potentially long-lasting diseases,
such as Alzheimer’s. This raises all sorts of interesting questions, including the
consequences of living longer.
Chapter 6
Some Examples of Discrimination
Abstract We return here to the usual protected, or sensitive, variables that can
lead to discrimination in insurance. We mention direct discrimination, with race and
ethnic origin, gender and sex, or age. We also discuss genetic-related discrimination,
and as several official protected attributes are not related to biology but to social
identity, we return to this concept. We also discuss other inputs used by insurers that
could be related to sensitive attributes, such as text, pictures, and spatial information,
and could be seen as a form of discrimination by proxy. We also mention the use of
credit scores and network data.
According to Dennis (2004), “racism is the idea that there is a direct cor-
respondence between a group’s values, behavior and attitudes, and its physical
features ... Racism is also a relatively new idea: its birth can be traced to
the European colonization of much of the world, the rise and development of
European capitalism, and the development of the European and US slave trade.”
Nevertheless, we can probably also mention Aristotle (350–320 before our era),
according to whom, Greeks (or Hellenes, Ελληνες) were free by nature, whereas
“barbarians” (βάρβαρος, non-Greeks) were slaves by nature. But in a discussion
on Aristotle and racism, Puzzo (1964) claimed that “racism rests on two basic
assumptions: that a correlation exists between physical characteristics and moral
qualities; that mankind is divisible into superior and inferior stocks. Racism, thus
defined, is a modern concept, for prior to the XVI-th century there was virtually
nothing in the life and thought of the West that can be described as racist. To
prevent misunderstanding, a clear distinction must be made between racism and
ethnocentrism (...) The Ancient Hebrews, in referring to all who were not Hebrews
as Gentiles, were indulging in ethnocentrism, not in racism (...) So it was with the
Hellenes who denominated all non-Hellenes.” We could also mention the “blood
purity laws” (“limpieza de sangre”) that were once commonplace in the Spanish
empire (that differentiated converted Jews and Moors (conversos and moriscos)
from the majority Christians, as discussed in Kamen (2014)), and required each
candidate to prove, with family tree in hand, the reliability of his or her identity
through public disclosure of his or her origins.
Fig. 6.1 On the right-hand side, color chart showing various shades of skin color, by L’Oréal
(2022), as well as the Fitzpatrick Skin Scale at the bottom (six levels), and on the left-hand side,
the RGB (red-green-blue) decomposition of those colors (source: Telles (2014))
black race, including all sub-Saharan Africans), and the “American” (or red race,
including all Native Americans), as discussed in Rupke and Lauer (2018). Johann
Friedrich Blumenbach and Carl Linnaeus investigated the idea of “human race,”
from an empirical perspective, and at the same time, Immanuel Kant became one
of the most notable Enlightenment thinkers to defend racism, from a philosophical
and scientific perspective, as discussed in Eze (1997) or Mills (2017) (even if Kant
(1795) ultimately rejected racial hierarchies and European colonialism).
A more simple version of “racism” is discrimination based on skin color, also
known as “colorism,” or “shadeism,” which is a form of prejudice and discrimination
in which people who are perceived as belonging to a darker skinned race are treated
differently based on their darker skin color. Somehow, this criterion could be seen
as more objective, because it is based on a color. Telles (2014) relies on skin color as
defined by the Fitzpatrick Skin Scale, used by dermatologists and researchers. Types 1 through 3
are widely considered lighter skin (on the left), and 4 through 6 as darker skin (on
the right-hand side). One could also consider a larger palette, as in Fig. 6.1.
This connection between “racism” and “colorism” is an old one. Kant (1775),
entitled "Of the Different Human Races" as translated into English (the original title
in German was "Von den verschiedenen Racen der Menschen"), was the preliminary work for his lectures
on physical geography (collected in Rink (1805)). In Kant (1785), “Determination
of the Concept of a Human Race,” his initial position, on the existence of human
races, was confirmed. The first essay was published a few months before Johann
Friedrich Blumenbach's "De generis humani varietate nativa" (of the native variety
of the human race), and proposed a classification system less complex than the one
in Blumenbach (1775), based almost solely on color. Both used color as a way
of differentiating the races, even if it was already being criticized. For Immanuel
Kant, there were four “races”: whites, Black people, Hindustani, and Kalmuck, and
the explanation of such differences was the effect of air and sun. For example,
he argued that by the solicitude of nature, human beings were equipped with seeds
(“keime”) and natural predispositions (“anlagen”) that were developed, or held back,
depending on climate.
The use of colors could be seen as simple and practical, with a more objective
definition than the one underlying the concept of “race,” but it is also very
problematic. For example, Hunter (1775) classified as “light brown” Southern
Europeans, Sicilians, Abyssinians, the Spanish, Turks, and Laplanders, and as
“brown” Tartars, Persians, Africans on the Mediterranean, and the Chinese. In the
XX-th century in the USA, simple categories such as "white" and "black" were not
enough. As mentioned by Marshall (1993), in New York City, populations who
spoke Spanish were not usually referred to as belonging either to the "white race" or
the "black race," but were designated as "Spanish," "Cuban," or "Puerto Rican."
Obviously, the popular racial typologies used in the USA were not based on any
competent genetic studies. And it was the same almost everywhere. In Brazil for
instance, descent played a negligible role in establishing racial identity. Harris
(1970) has shown that siblings could be assigned to different racial categories.
He counted more than 40 racial categories that were used in Brazil. Phenotypical
attributes (such as skin color, hair form, and nose or mouth shape) entered into
the Brazilian racial classification, but the most important determinant of racial status
was socio-economic position. It was the same for the Eta in Japan, as discussed in Smythe (1952)
and Donoghue (1957). As Atkin (2012) wrote, an "east African will be classified as
‘black’ under our ordinary concept but this person shares a skin colour with people
from India and a nose shape with people from northern Europe. This makes our
ordinary concept of race look to be in bad shape as an object for scientific study—it
fails to divide the world up as it suggests it should.”
From a scientific and biological standpoint, races do not exist as there are no
inherent “biobehavioral racial essences,” as described by Mallon (2006). Instead,
races are sociological constructs that are created and perpetuated by racism. Racism
here is a set of processes that create (or perpetuate) inequalities based on racializing
some groups, with a "privileged" group that will be favored, and a "racialized"
group that will be disadvantaged. Given the formal link between racism and
perceived discrimination, it is natural to start with this protected variable (even if
it is not clearly defined at the moment). Historically, in the USA, the notion of
race has been central in discussions on discrimination (Anderson (2004) provides a
historical review in the United States), see also Box 6.1, by Elisabeth Vallet.
more refined analysis of other causes of (possible) excess mortality (this is also the
argument put forward by O’Neil (2016)). At that time, in the United States, several
states were passing anti-discrimination laws, prohibiting the charging of different
premiums on the basis of racial information. For example, as Wiggins (2013) points
out, in the summer of 1884, the Massachusetts state legislature passed the Act to
Prevent Discrimination by Life Insurance Companies Against People of Color. This
law prevented life insurers operating in the state from making any distinction or
discrimination between white persons and persons of color wholly or partially of
African descent, as to the premiums or rates charged for policies upon the lives
of such persons. The law also required insurers to pay full benefits to African
American policyholders. It was on the basis of these laws that the uninsurability
argument was made: insuring Black people at the same rate as white people would
be statistically unfair, argued Hoffman (1896), and not insuring Black people was
the only way to comply with the law (see also Heen (2009)). As Bouk (2015)
points out “industrial insurers operated a high-volume business; so to simplify sales
they charged the same nickel to everyone. The home office then calculated benefits
according to actuarially defensible discrimination, by age initially and then by race.
In November 1881, Metropolitan decided to mimic Prudential, allowing policies to
be sold to African Americans once again, but with the understanding that Black
policyholders’ survivors only received two-thirds of the standard benefit.”
In the credit market, Bartlett et al. (2021), following up on Bartlett et al.
(2018), shows that in the USA, discrimination based on ethnicity has continued
to exist in the US mortgage market (for African Americans and Latinos), in both
traditional and algorithm-based lending. But algorithms have changed the nature
of discrimination from one based on prejudice, or human dislike, to illegitimate
applications of statistical discrimination. Moreover, algorithms discriminate not by
denying loans, as traditional lenders do, but by setting higher prices or interest rates.
In health care, Obermeyer et al. (2019) shows that there is discrimination, based on
ethnicity or “racial” bias in a commercial software program widely used to assign
patients requiring intensive medical care to a managed care program. White patients
were more likely to be assigned to the care program than Black patients with a
comparable health status. The assignment was made using a risk score generated
by an algorithm. The calculation included data on total medical expenditures in a
given year and fine-grained data on health service use in the previous year. The score
therefore does not reflect expected health status, but predicts the cost of treatment.
Bias, stereotyping, and clinical uncertainty on the part of health care providers
can contribute to racial and ethnic disparities in health care, as noted by Nelson
(2002). Finally, in auto insurance, Heller (2015) found that predominantly African
American neighborhoods pay 70% more, on average, for auto insurance premiums
than other neighborhoods. In response, the Property Casualty Insurers Association
of America responded2 in November 2015 that “insurance rates are color-blind and
solely based on risk.” This position is still held by actuarial associations in the USA,
for whom questions about discrimination are meaningless. Larson et al. (2017)
obtained 30 million premium quotes, by zip code, for major insurance companies
across the USA, and confirmed that a gap existed, albeit a smaller one. Also, in
Illinois, insurance companies charged on average more than 10% more in auto
liability premiums for "majority minority" zip codes (in the sense that minorities
make up the majority of the population) than in majority white zip codes. Historically, as recalled
by Squires (2003), many financial institutions have used such discrimination by
refusing to serve predominantly African American geographic areas.
Although such analyses have recently proliferated (Klein (2021) provides a
relatively comprehensive literature review), this potential racial discrimination issue
was analyzed by Klein and Grace (2001), for instance, who offered the possibility of
controlling covariates correlated with race, and showed that there was no statistical
evidence of geographic redlining. This conclusion was consistent with the analysis
of Harrington and Niehaus (1998), and was subsequently echoed by Dane (2006),
Ong and Stoll (2007), or Lutton et al. (2020) among many others. It should be noted
here that redlining is not only associated with a racial criterion, but very often
with an economic criterion. A recent case of statistical discrimination is currently
being investigated in Belgium, as mentioned by Orwat (2020). In this country, the
energy supplier EDF Luminus refuses to supply electricity to people living in a
certain postal code district. For the energy supplier, this postal code area represents
a zone where many people with a bad credit history live.
While the term "ethnic statistics" is a sensitive subject in France, the censuses have
traditionally (for more than a century) asked about nationality at birth, therefore
distinguishing French by birth from French by adoption. And since 1992, the variable
“parents’ country of birth” has been introduced in a growing number of public
surveys. In French statistics, the word “ethnic” in the anthropological sense (sub-
national or supra-national human groups whose existence is proved even though
they do not have a state) has long had a place, particularly in surveys on migration
between Africa and Europe. The 1978 Data Protection Act therefore uses the
expression “racial or ethnic origins.” In this sense, “ethnic origin” means any
reference to a foreign origin, whether it is a question of nationality at birth, the
parents’ nationality, or “reconstructions” based on the family name or physical
appearance. Some general intelligence and judicial police files contain personal
information on an individual’s physical characteristics, and in particular on their
skin color, as recalled by Debet (2007). Some medical research files (e.g., in
dermatology) may contain similar information. The French National Institute of
Statistics (INSEE) had initially refused to introduce a question on the parents'
birthplace in its 1990 family survey, which could have served as a sampling frame. It
was not until the 1999 survey that it was introduced, as recalled by Tribalat (2016).
Another concern that may arise is that the difference that may exist in insurance
premiums between ethnic origins is not a reflection of different risks but of different
treatments. Indeed, Hoffman et al. (2016) shows that racial prejudice and false
beliefs about biological differences between Black and white people continue to
shape the way we perceive and treat the former—they are associated with racial
disparities in pain assessment and treatment recommendations.
Insights into the distinction between gender and sex, exploring various perspectives
from different fields of knowledge and public action are given in Box 6.3. These
discussions shed light on the complex nature of gender and sex, moving beyond
binary categorizations and acknowledging the diversity of sexual development. By
examining social, legal, and statistical aspects, as well as their implications in
sociology and economics, the box delves into the multifaceted nature of gender and sex.
4 From https://science.gc.ca/site/science/en/interagency-research-funding/policies-and-guidelines/self-identification-data-collection-support-equity-diversity-and-inclusion/.
Many books, published in the XIX-th century, mention that men and women have
very different behaviors when it comes to insurance. According to Fish (1868),
“upon no class of society do the blessings of life insurance fall so sweetly as
upon women. And yet” agents “have more difficulty in winning them to their cause
than their husbands.” Phelps (1895) asked explicitly “do women like insurance?”
whereas Alexander (1924) collects fables and short stories, published by insurance
companies, with the idea of scaring women by dramatizing the “disastrous economic
consequences of their stubbornness,” as Zelizer (2018) named it.
Women live longer than men across the world, and scientists have by and large
linked the sex differences in longevity to biological foundations of survival. A
new study of wild mammals has found considerable differences in life span and
aging across mammalian species. Among humans, women's life span is on average
almost 8% longer than men's. But among wild mammals, females in
60% of the studied species have, on average, 18.6% longer lifespans. Everywhere
in the world women live longer than men—but this was not always the case (see
also life tables in Chap. 2). In Fig. 6.2, we can compare, on the left-hand side,
the probability (at birth) of dying between 30 and 70 years of age, in several
countries, for women (x-axis) and men (y-axis). On the right-hand side, we compare
life expectancy at birth.
Fig. 6.2 Probability of dying between 30 and 70 years old, on the left-hand side, for most
countries, and life expectancy at birth, on the right-hand side. Women are on the x-axis whereas
men are on the y-axis. The size of the dots is related to the population size (data source: Ortiz-
Ospina and Beltekian (2018))
The 2004 EU Goods and Services Directive, Council of the European Union (2004),
was aimed at reducing gender gaps in access to all goods and services, discussed for
example by Thiery and Van Schoubroeck (2006). A special derogation in Article
5(2) allowed insurers to set gender-based prices for men and women. Indeed,
“Member States may decide (...) to allow proportionate differences in premiums
and benefits for individuals where the use of sex is a determining factor in the
assessment of risk, on the basis of relevant and accurate actuarial and statistical
data.” In other words, this clause allowed an exception for insurance companies,
provided that they produced actuarial and statistical data that established that sex
was an objective factor in the assessment of risk. The European Court of Justice
canceled this legal exception in 2011, in a ruling discussed at length by Schmeiser
et al. (2014) or Rebert and Van Hoyweghen (2015), for example. This regulation,
which generated a lot of debate in Europe in 2007 and then in 2011, also raised
many questions in the USA, several decades earlier, in the late 1970s, with Martin
(1977), Hedges (1977), and Myers (1977). For example, in the case of the Los Angeles
Department of Water and Power vs Manhart, the Supreme Court considered a
pension system in which female employees made higher contributions than males
for the same monthly benefit because of longer life expectancy. The majority
ultimately determined that the plan violated Title VII of the Civil Rights Act of
1964 because it assumes that individuals conform to the broader trends associated
with their gender. The court suggested that such discrimination might be troubling
from a civil rights perspective because it does not treat individuals as such, as
opposed to merely members of a group that they belong to. These laws were
driven, in part, by the fact that employment decisions are generally individual: a
specific person is hired, fired, or demoted, based on his or her past or expected
contribution to the employer’s mission. In contrast, stereotyping of individuals based
on group characteristics is generally more tolerated in fields such as insurance,
where individualized decision making does not make sense.
In Box 6.4, Avner Bar-Hen discusses measurement of gender-related inequalities.
As Robbins (2015) said, “if you are not already part of a group disadvantaged by
prejudice, just wait a couple of decades—you will be.” As explained in Ayalon and
Tesch-Römer (2018), “ageism” is a recent concept, defined in a very neutral way in
Butler (1969) as “a prejudice by one age group against another age group.” Butler
(1969) argued that ageism represents discrimination by the middle-aged group
against the younger and older groups in society, because the middle-aged group is
responsible for the welfare of the younger and older age groups, which are seen as
dependent. But according to Palmore (1978), ageing is seen as a loss of functioning
and abilities, and therefore, it carries a negative connotation. Accordingly, terms
such as “old” or “elderly” have negative connotations and thus should be avoided.
Age contrasts with race and sex in that our membership of age-based groups
changes over time, as mentioned in Daniels (1998). And unlike caterpillars becom-
ing butterflies, human aging is usually considered as a continuous process. There-
fore, the boundaries of age-defined categories (such as the “under 18” or the “over
65") will always be somewhat arbitrary.
In Fig. 6.3, the screening process used by the NHS during the COVID-19
pandemic is explained.6 It is explicitly based on a score, with points added as a function
of the age of the person arriving at the hospital. Thus, the first treatment received
will be based on the age of the person, which corresponds to a strong selection bias,
based on a possibly sensitive attribute.
6 From https://www.nhsdghandbook.co.uk/wp-content/uploads/2020/04/COVID-Decision-Support-Tool.pdf.
Fig. 6.3 COVID-19 Decision Support Tool used in England, in March 2020, provided by the NHS
(National Health Service)
which focused primarily on ethnic and racial issues, as shown in the example of
Macnicol (2006). In the majority of cases, age discrimination is considered from
an employment perspective, as in Duncan and Loretto (2004) or Adams (2004).
In terms of insurance, age is considered “less discriminatory” than gender, as we
have seen, because as Macnicol (2006) observes, age is not a club in which one
enters at birth, and it will change with time. We can observe some insurers refusing
to discriminate according to age, presenting this as part of their "raison d'être" (in
the sense given by the 2019 French Pact law). In France, for example, Mutuelle
Générale is committed to strengthening solidarity between generations, which is
reflected in a refusal to discriminate, or segment, according to this criterion.
But just as a distinction exists between biological sex and gender, some suggest
distinguishing between biological age and perceived (or subjective) age, such as
Stephan et al. (2015) or Kotter-Grühn et al. (2016). Uotinen et al. (2005) showed
that this subjective age would be a better predictor of mortality than biological
age. As Beider (1987) points out, it can be argued that people do not have a
fair chance based on their age, because not everyone ages equally, and people die at
different ages. Bidadanure (2017) reminds us that age discrimination is always
perceived as less “preoccupying” than other kinds. The aging process, from birth to
adulthood, correlates with various developmental and cognitive processes that make
it relevant to assign different responsibilities, consent capacities, and autonomy
to children, young adults, or the elderly. But unlike sex and race, age is not a
Fig. 6.4 Number of crashes (left) and number of fatalities (right), per million miles driven, for
both males and females (males in blue and females in red), by driver age. The reference (0) is
men aged 30–60 years. The number of accidents is three times higher (+200%) for those over 85,
and the number of deaths more than ten times higher (+900%) (data source: Li et al. (2003))
Fig. 6.5 From left to right, average written premium (Canadian dollars), claims frequency, and average claims cost (Canadian dollars), by age group (x-axis)
and gender (males in blue and females in red) in Quebec (data source: Groupement des Assureurs Automobiles (2021))
on the basis of disability if the person is allowed to drive. But for degenerative
diseases, a few laws explicitly prohibit driving, for example, for someone with an
established disease, such as Parkinson's disease (Crizzle et al. 2012). The fact that
older people are more responsible for accidents raises many moral questions, as
putting oneself at risk as a driver is one thing, but potentially injuring or killing
others is less acceptable.
Genetic discrimination, or “genoism,” occurs when people treat others (or are
treated) differently because they have or are perceived to have a gene mutation(s)
that causes or increases the risk of an inherited disorder. According to Ajunwa
(2014, 2016), “genetic discrimination should be defined as when an individual
is subjected to negative treatment, not as a result of the individual’s physical
manifestation of disease or disability, but solely because of the individual’s genetic
composition.” This concept is related to “genetic determinism” (as defined in
de Melo-Martín (2003) and Harden (2023)) or more recently “genetic essentialism”
(as in Peters (2014)). Dar-Nimrod and Heine (2011) defined “genetic essentialism”
as the belief that people of the same group share some set of genes that make
them physically, cognitively, and behaviorally uniform, but different from others.
Consequently, “genetic essentialists” believe that some traits are not influenced (or
only a little) by the social environment. But as explained in Jackson and Depew
(2017), essentialism is genetically inaccurate because it not only overestimates the
amount of genetic differentiation between human races but it also underestimates
the amount of genetic variation among same-race individuals (Rosenberg 2011;
Graves Jr 2015).
An important element of Ajunwa (2014)’s definition is that the negative treatment results not from any physical manifestation of disease or disability, but solely from the individual’s genetic composition. And that is usually difficult to assess (as we can see, for instance, with obesity). But other definitions are more vague.
For example, according to Erwin et al. (2010), “the denial of rights, privileges,
or opportunities or other adverse treatment based solely on genetic information,
including family history or genetic test results,” could be seen as “genetic discrim-
ination.” According to the legislation in Florida, USA (Title XXXVII, Chapter 627,
Section 4301, 2017) “genetic information means information derived from genetic
testing to determine the presence or absence of variations or mutations, including
carrier status, in an individual’s genetic material or genes that are scientifically or
medically believed to cause a disease, disorder, or syndrome, or are associated with
a statistically increased risk of developing a disease, disorder, or syndrome, which
is asymptomatic at the time of testing. Such testing does not include routine physical
examinations or chemical, blood, or urine analysis unless conducted purposefully
to obtain genetic information, or questions regarding family history.” Box 6.5 provides an excerpt from Bélisle-Pipon et al. (2019) explaining what “genetic tests” are.
According to Rawls (1971), the starting point for each person in society is
the result of a social lottery (the political, social, and economic circumstances in
which each person is born) and a natural lottery (the biological potentials with
which each person is born—recently, Harden (2023) revisited this genetic lottery
and its social consequences). John Rawls argues that the outcome of each person’s
social and natural lottery is a matter of good or bad “fortune” or “chance” (like
ordinary lotteries). And as it is impossible to deserve the outcome of this lottery,
discrimination resulting from these lotteries should not exist. For egalitarians (“luck-
egalitarians”), it is appropriate to eliminate the differential effects on people’s
interests that, from their perspective, are a matter of luck. Affirmative action in
favor of women is a means of neutralizing the effects of sexist discrimination. Stone
(2007) revisits the idea that this ex ante equality is part of what makes lotteries
fair and appealing. Abraham (1986) discusses the consequences of natural lotteries
in insurance. Around the same time, Wortham (1986) stated, “those suffering from
disease, a genetic defect, or disability on the basis of a natural lottery should not be
penalized in insurance.”
As Natowicz et al. (1992) explained, “People at risk for genetic discrimina-
tion are (1) those individuals who are asymptomatic but carry a gene(s) that
increases the probability that they will develop some disease, (2) individuals who
are heterozygotes (carriers) for some recessive or X-linked genetic condition but
who are and will remain asymptomatic, (3) individuals who have one or more
genetic polymorphisms that are not known to cause any medical condition, and (4)
immediate relatives of individuals with known or presumed genetic conditions.”
Williams (2017) and Liu (2017) discussed discrimination based on people’s appearance, termed “lookism.” To quote Liu (2015), “everybody deserves to be treated based on what kind of person he or she is, not based on what kind of person other people are.” This is probably one reason why discrimination against “overweight people” is a challenging case. As explained in Loos and Yeo (2022), although it is undeniable that
changes in the environment have played a significant role in the rapid rise of obesity,
it is important to recognize that obesity is the outcome of a complex interplay
between environmental factors and inherent biological traits. And it is the “social norm” that exposes overweight people to possible discrimination.
The stereotype “fat is bad” has existed in the medical field for decades, as
Nordholm (1980) reminds us. Further study is needed to ascertain how this affects
practice. It appears that obese individuals, as a group, avoid seeking medical
care because of their weight. As claimed by Czerniawski (2007), “with the rise
of actuarial science, weight became a criterion insurance companies used to
assess risk. Used originally as a tool to facilitate the standardization of the
medical selection process throughout the life insurance industry, these tables later
operationalized the notion of ideal weight and became recommended guidelines
for body weights.” At the end of the 1950s, 26 insurance companies cooperated to
determine the mortality of policyholders according to their body weight, as shown
in Wiehl (1960). The conclusion is clear with regard to mortality: “Studies bring out
the clear-cut disadvantage of overweight-mortality ratios rising in every instance
with increase in degree of overweight.” This is also what Baird (1994) observed 40 years later: “obesity is regarded by insurance companies as a substantial risk for both life and disability policies.” This risk increases proportionally with the degree
of obesity (the same conclusion is found in Lew and Garfinkel (1979) or Must et al.
(1999)).
In statistics, as explained by Upton and Cook (2014), a “proxy” (we will also use
“substitute variable”) is a variable that, in a predictive model, replaces a useful but
unobservable, unmeasurable variable.
Definition 6.1 (Proxy (Upton and Cook 2014)) A proxy is a measurable variable
that is used in place of a variable that cannot be measured.
And for a variable to be a good proxy, it must correlate well with the variable of interest. A relatively popular example is the fact that, in elementary school, shoe size is often a good proxy for reading ability. In reality, foot size has little to do with cognitive ability, but in children, foot size correlates strongly with age, which in turn correlates strongly with cognitive ability. This concept is closely related to notions
of causality, as discussed in the next chapter.
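To make this mechanism concrete, the shoe-size example can be reproduced with a small simulation in R; the numbers below are purely illustrative. Reading ability and shoe size are both driven by age, so their marginal correlation is strong, but shoe size adds essentially nothing once age is taken into account.

# Minimal simulation of the shoe-size/reading-ability example (illustrative numbers):
# both variables are driven by age, so shoe size "works" as a proxy despite having
# no causal effect on reading ability.
set.seed(123)
n    <- 1000
age  <- runif(n, 6, 11)                        # age in years
shoe <- 20 + 2 * age + rnorm(n, sd = 1)        # shoe size grows with age
read <- 10 * age + rnorm(n, sd = 5)            # reading score grows with age
cor(shoe, read)                                # strong marginal correlation (about 0.9)
summary(lm(read ~ shoe + age))$coefficients    # shoe size adds (almost) nothing
                                               # once age is controlled for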
We mentioned earlier the economists’ vision of discrimination, such as Gary Becker’s, and its link with a form of rationality and efficiency. And indeed, for many authors, what matters is that the association between variables is strong enough to provide a reliable predictor. For Norman (2003), group membership provides reliable
information on the group, and by extension on any individual who is a member
of it; the systematic use of this information (the generalization and stereotyping
discussed in Schauer (2006) and Puddifoot (2021)) can be economically efficient.
Taking an ethical counterpoint, Greenland (2002) reminds us that some information
sources should be excluded from our decision making because they are irrelevant, or
noncausal, even though they may provide fairly reliable information because of their
strong correlation with another indicator. The central argument is that if variables
are noncausal, then they lack moral justification. And proxy discrimination raises
complex ethical issues. As Birnbaum (2020) states, “if discriminating intentionally
on the basis of prohibited classes is prohibited—e.g., insurers are prohibited from
using race, religion or national origin as underwriting, tier placement or rating
factors—why would practices that have the same effect be permitted?” In other
words, it is not enough to compare paid premiums, but the narrative process
of modeling (i.e., the notions of interpretability and explicability, or the “fable”
mentioned in the introduction) is equally important in judging whether a model
is discriminatory or not. And the difficulty, as Seicshnaydre (2007) said, is that it
is not about looking for proof of a sexist or racist intention or motivation, but about establishing that an algorithm discriminates according to a prohibited (because
protected) criterion.
Obermeyer et al. (2019) reports that in 2019, a large health care company had
used medical expenses as a proxy for medical condition severity. The use of this
proxy resulted in a racially discriminatory algorithm, because although the medical
condition may not be racially discriminatory, health care spending is (in the USA
at least). More generally, all sorts of proxies are used, more or less correlated
with the variables of interest. For example, a person’s (or household’s) income is
estimated by income tax, or by living conditions (or the neighborhood where the
person lives). The first version of that paper speaks of “indirect risk factors.” We return to the importance of causal graphs for understanding whether one variable causes another, or whether they simply correlate, in Sect. 7.1.
There are also certain quantities that are essential for modeling decision making
in an uncertain context, but that are difficult to measure. This is the case of the
abstract concept of “risk aversion” (widely discussed by Menezes and Hanson
(1970) and Slovic (1987)). Hofstede (1995) proposes an uncertainty avoidance
index, calculated from survey data. Two early studies, Outreville (1990, 1996), suggested using education level to assess risk aversion. According to
Outreville (1996), education promotes an understanding of risk and therefore an
increase in the demand for insurance, for example (although an inverse relationship
could exist, if one assumes that increased levels of education are associated with an
increase in transferable human capital, which induces greater risk-taking).
Recently, the Court of Justice of the European Union issued a ruling on 1 August
2022 with potentially far-reaching implications about inferred sensitive data (Case
C-184/20, ECLI:EU:C:2022:601). In essence, the question posed to the European
court was whether the disclosure of information such as the name of a partner
or spouse would constitute processing of sensitive data within the meaning of the
GDPR, even though such data is not in itself directly sensitive, but only allows the
indirect inference of sensitive information, such as the sexual orientation of the data
subject. More precisely, the question was “The referring court asks, in substance,
whether (...) the publication (...) of personal data likely to disclose indirectly the political opinions, trade-union membership or sexual orientation of a natural person constitutes processing of special categories of personal data (...).” The
Court of Justice of the European Union has made a clear ruling on this issue, stating
that processing of sensitive data must be considered to be “processing not only of
intrinsically sensitive data, but also of data which indirectly reveals, by means of
an intellectual deduction or cross-checking operation, information of that kind.”
For example, location data indicating places of worship or health facilities visited
by an individual could now be qualified as sensitive data, as could the record of a vegetarian meal ordered through a food delivery application. Typically, although the pages “liked” by a social network user or the groups to which they belong are not technically sensitive data, membership in a support group for pregnant women, or “likes” placed on the pages of politically oriented newspapers, allows quite precise sensitive information to be deduced about a person’s state of health or political positions.
“Humans think in stories rather than facts, numbers or equations—and the
simpler the story, the better,” said Harari (2018), but for insurers, it is often a mixture
of both. For Glenn (2000), like the Roman god Janus, an insurer’s risk selection
process has two sides: the one presented to regulators and policyholders, and the
other presented to underwriters. On the one hand, there is the face of numbers,
statistics, and objectivity. On the other, there is the face of stories, character, and
subjective judgment. The rhetoric of insurance exclusion—numbers, objectivity,
and statistics—forms what Brian Glenn calls “the myth of the actuary,” “a powerful
rhetorical situation in which decisions appear to be based on objectively determined
criteria when they are also largely based on subjective ones” or “the subjective
nature of a seemingly objective process”. And for Daston (1992), this alleged
objectivity of the process is false, and dangerous, as also pointed out by Desrosières
(1998). Glenn (2003) claimed that there are many ways to rate accurately. Insurers
can rate risks in many different ways depending on the stories they tell about which
characteristics are important and which are not. “The fact that the selection of risk
factors is subjective and contingent upon narratives of risk and responsibility has in
the past played a far larger role than whether or not someone with a wood stove is
charged higher premiums.” Going further, “virtually every aspect of the insurance
industry is predicated on stories first and then numbers.” We remember Box et al.
(2011)’s “all models are wrong but some models are useful,” in other words, any
model is at best a useful fable.
As Bernstein (2013) reminds us, the word “stereotype” merges a Greek adjective
meaning solid, στερεός, with a noun meaning a mold, τύπος. Combining the two
terms, the word refers to a hard molding, something that can leave a mark, which gave rise to a printing term, namely the printing form used for letterpress printing. In 1802, the dictionary of the French Academy mentions, under the word “stereotype,” “a new word which is said of stereotyped books, or printed with solid forms or plates,” that is, an image perpetuated without change. The American journalist and
public intellectual Walter Lippmann gave the word its contemporary meaning in
Lippmann (1922). For him, it was a description of how human beings fit “the
world outside” into “ the pictures in our heads,” which form simplified descriptive
categories by which we seek to locate others or groups of individuals. Walter
Lippmann tried to explain how images that spontaneously arise in people’s minds
become concrete. Stereotypes, he observed, are “the subtlest and most pervasive of
all influences.” A few years later, the first experiments to better understand this concept began. One could observe that Lippmann (1922) was one of the first books
on public opinion, manipulation, and storytelling. Therefore, it is natural to see
connections between the word “stereotype” and storytelling, as well as explanations
and interpretations.
The importance of stereotypes in understanding many decision-making processes
is analyzed in detail in Kahneman (2011), inspired in large part by Bruner (1957),
and more recently, Hamilton and Gifford (1976) and especially Devine (1989). For
Daniel Kahneman, schematically, two types of mechanisms are used to make a
decision. System 1 is used for rapid decision making: it allows us to recognize
people and objects, helps us to direct our attention, and encourages us to fear spiders.
It is based on knowledge stored in the memory and accessible without intention, and
without effort. It can be contrasted with System 2, which allows decision making
in a more complex context, requiring discipline and sequential thinking. In the first
case, our brain uses the stereotypes that govern representativeness judgments, and
uses this heuristic to make decisions. If I cook fish for friends, I open a bottle of white wine. The stereotype “fish goes well with white wine” allows me to make a decision quickly, without having to think. Stereotypes are statements about a group that are accepted (at least provisionally) as facts about every member.
Whether correct or not, stereotypes are the basic tool for thinking about categories in
System 1. But often, further and more deliberate thinking (corresponding to System 2) leads to a better, even optimal decision: without picking just any red wine, a pinot noir, say, would also be perfectly suited to grilled mullet. As Fricker
(2007) asserted, “stereotypes are [only] widely held associations between a given
social group and one or more attributes.” Isn’t this what actuaries do every day?
In the “Ten Oever” judgment (Gerardus Cornelis Ten Oever v Stichting Bedrijfspensioenfonds voor het Glazenwassers- en Schoonmaakbedrijf, April 1993),
the Advocate General Van Gerven argued that “the fact that women generally
live longer than men has no significance at all for the life expectancy of a
specific individual and it is not acceptable for an individual to be penalized on
account of assumptions which are not certain to be true in his specific case,”
as mentioned in De Baere and Goessens (2011). Schanze (2013) used the term
“injustice by generalization.” But at the same time, as explained by Schauer (2006),
this “generalization” is probably the reason for the actuary’s existence: “To be an
actuary is to be a specialist in generalization, and actuaries engage in a form of
decision-making that is sometimes called actuarial.” This idea can be found in
actuarial justice, for example in Harcourt (2008). Schauer (2006) reported that we
might be led to believe that it is better to have airline pilots with good vision than pilots with poor vision (this point is also raised in the context of driving and car insurance, Owsley
and McGwin Jr (2010)). This criterion could be used in hiring, and would, of course,
constitute a kind of discrimination, distinguishing “good” pilots from “bad” ones,
pilots with good vision from others. Some airlines might impose a maximum age
for airline pilots (55 or 60, for example), age being a reliable, if imperfect, indicator
of poor vision (or more generally of poor health, with impaired hearing or slower
reflexes). If we exclude the elderly from being commercial airline pilots, we will
end up, ceteris paribus, with a cohort of airline pilots who have better vision, better
hearing, and faster reflexes. The explanation given here is clearly causal, and the
underlying goals of the discrimination then seem clearly legitimate, so that even the
use of age becomes “proxy discrimination” in the sense of Hellman (1998), called
“statistical discrimination” in Schauer (2017).
For Thiery and Van Schoubroeck (2006) (but also Worham (1985)), lawyers
and actuaries have fundamentally different conceptions of discrimination and
segmentation in insurance, one being individual, the other collective, as illustrated in the USA by the Manhart and Norris cases (Hager and Zimpleman 1982; Bayer
1986). In the Manhart case in 1978, the Court ruled illegal an annuity plan in which women had to make larger contributions than men to receive equal benefits at retirement. In 1983, the Supreme Court ruled in the Norris
case that the use of gender-differentiated actuarial factors for benefits in pension
plans was illegal because it fell within the prohibition against discrimination. These
two decisions are a legal affirmation that insurance technique could not always be
used as a guarantee to justify differential treatment of members of certain groups
in the context of insurance premium segmentation. Indeed, legally, the right to
equal treatment is one that is granted to a person in his or her capacity as an
individual, not as a member of a racial, sexual, religious, or ethnic group. Therefore,
an individual cannot be treated differently because of his or her membership in
such a group, particularly one to which he or she has not chosen to belong. These
rulings emphasize that individuals cannot be treated as mere components of a racial,
6.6 Names, Text, and Language
The first name, also known as a personal name, is an individual’s given name used
to specifically identify them, alongside their patronymic or family name. In the
majority of Indo-European languages, the first name is positioned before the family
name, known as the Western order. This differs from the Eastern order, where the
family name precedes the first name. Unlike the family name (usually inherited from the father in patriarchal societies), the first name is chosen by the parents at birth
(or before), according to criteria influenced by the law and/or social conventions,
pressures and/or trends. More precisely, at birth (and/or baptism), each person is
usually given one or more first names, of which only one (which can be made up)
will be used afterward: the usual first name.
The use of family names appeared in Venice in the IX-th century (Brunet and Bideau
2000; Ahmed 2010) and came into use across Europe in the later Middle Ages
(beginning roughly in the XI-th century), according to the “Family names” entry of
the Encyclopædia Britannica. Family names seem to have originated in aristocratic
families and in big cities, as having a hereditary surname that develops into a
family name preserves the continuity of the family and facilitates official property
records and other matters. Sources of family names are original nicknames (Biggs,
Little, Grant), occupations (Archer, Clark), place names (Wallace, Murray, Hardes,
Whitney, Fields, Holmes, Brookes, Woods), as mentioned in McKinley (2014).
In English, the suffix -son has also been very popular (Richardson, Dickson, Harrison, Gibson), as has the prefix Fitz- (Fitzgerald), which goes back to Norman French fis (fils in modern French), meaning “son,” as explained in McKinley (2014). In Russian, as discussed in Plakans and Wetherell (2000), the suffix -ov (“son of”) was also used, as in “Ivan Petrov (Иван Петров),” for “Ivan, the son of Piotr (Пётр + ov),” with the possibility of designating successive fathers through patronymics: Vasily Ivanovich Petrov (Василий Иванович Петров) is Vasily (Василий), son of Ivan (Иван + ович), descended from the ancestor Piotr (Пётр + ов).
Icelandic surnames differ from most other naming systems in the modern Western world in being patronymic or occasionally matronymic, as mentioned in
Willson (2009) and Johannesson (2013): they indicate the father (or mother) of
the child and not the historic family lineage. Generally, with few exceptions, a
person’s last name indicates the first name of their father (patronymic) or in some
cases mother (matronymic) in the genitive, followed by—son “(son”) or—dóttir
(“daughter”). For instance, in 2017, Iceland’s national Women’s soccer team players
were Agla Maria Albertsdóttir, Sigridur Gardarsdóttir, Ingibjorg Sigurdardóttir,
Glodis Viggosdóttir, Dagny Brynjarsdóttir, Sara Bjork Gunnarsdóttir, Fanndis
Table 6.1 Last name and racial proportions in the USA, from Gaddis (2017) (data from US Census (2012))
Name         Rank   White (%)   Black (%)   Hispanic (%)
Washington    138      5.2        89.9          1.5
Jefferson     594     18.7        75.2          1.6
Booker        902     30.0        65.6          1.5
Banks         278     41.3        54.2          1.5
Jackson        18     41.9        53.0          1.5
Mosley        699     42.7        52.8          1.5
Becker        315     96.4         0.5          1.4
Meyer         163     96.1         0.5          1.6
Walsh         265     95.9         1.0          1.4
Larsen        572     95.6         0.4          1.5
Nielsen       765     95.6         0.3          1.7
McGrath       943     95.9         0.6          1.6
Stein         720     95.6         0.9          1.6
Decker        555     95.4         0.8          1.7
Andersen      954     95.5         0.6          1.7
Hartman       470     95.4         1.5          1.2
Orozco        690      3.9         0.1         95.1
Velazquez     789      4.0         0.5         94.9
Gonzalez       23      4.8         0.4         94.0
Hernandez      15      4.6         0.4         93.8
As Bosmajian (1974) states, “an individual has no definition, no validity for himself,
without a name. His name is his badge of individuality, the means whereby he
identifies himself and enters upon a truly subjective existence.” Names are often the
first information people have in a social interaction. Sometimes we know individuals
by name even before we meet them in person, as Erwin (1995) reminds us. First and
last names can carry a lot of information, as shown by Hargreaves et al. (1983),
Dinur et al. (1996), or Daniel and Daniel (1998). To quote Young et al. (1993), “the
name Bertha might be judged as belonging to an older Caucasian woman of lower-
middle class social status, with attitudes common to those of an older generation
(...) a person with a name such as Fred, Frank, Edith, or Norma is likely to be
judged, at least in the absence of other information, to be either less intelligent, less
popular, or less creative than would a person with a name such as Kevin, Patrick,
Michelle, or Jennifer.”
As discussed in Riach and Rich (1991) and Rorive (2009), a popular technique to
test for discrimination (in a real-life context) is to use “practice testing” or “situation
testing”. This probably started in the 1960s in the UK, with Daniel et al. (1968),
who aimed to compare the likelihood of success among three fictitious individuals,
British citizens originally from Hungary, the Caribbean, and England, across various
domains including employment, housing, and insurance, as noted by Héran (2010).
Smith (1977) also references a protocol used in 1974, which involved random job
offer selections and the submission of identical written applications. Concurrently in
the USA, Fix and Turner (1998) and Blank et al. (2004) mention “affirmative action
(pair-testing).” These so-called “scientific” tests rely on stringent, demanding, and
often costly protocols. In the context of insurance, in France, Petit et al. (2015)
and l’Horty et al. (2019) organized such experiments, where three profiles were
used: the first corresponds to the candidate whose name and surname are North
African sounding (for example, Abdallah Kaïdi, Soufiane Aazouz, or Medji Ben
Chargui), the second corresponds to the candidate whose first name is French
sounding and whose surname is North African sounding (for example, François El
Hadj, Olivier Ait Ourab, or Nicolas Mekhloufi), and finally the one whose first name
and last name are French sounding (for example Julien Dubois, Bruno Martin, or
Thomas Lecomte). Amadieu (2008) mentions that the first names (male) Sébastien,
Mohammed, and Charles-Henri are used for tests. Table 6.2 shows the main first
names in three generations of immigrants.
In Box 6.6, Baptiste Coulmont discusses the use of the first name as a proxy for
some sensitive attribute.
Table 6.2 Top 3 first names by sex and generations in France, according to the origin (Southern Europe or Maghreb) of grandparents, Coulmont and Simon (2019)
                       Immigrants                 Children of immigrants     Grandchildren of immigrants
Southern   Men         José, Antonio, Manuel      Jean, David, Alexandre     Thomas, Lucas, Enzo
Europe     Women       Maria, Marie, Ana          Marie, Sandrine, Sandra    Laura, Léa, Camille
Maghreb    Men         Mohamed, Ahmed, Rachid     Mohamed, Karim, Mehdi      Yanis, Nicolas, Mehdi
           Women       Fatima, Fatiha, Khaduja    Sarah, Nadia, Myriam       Sarah, Ines, Lina
Similarly, in “Are Emily and Greg More Employable than Lakisha and Jamal?,”
Bertrand and Mullainathan (2004) randomly assigned African American or white-
sounding names in resumes to manipulate the perception of race. “White names”
received 50% more callbacks for interviews. Voicu (2018) presents the Bayesian
Improved First Name Surname Geocoding (BIFSG) model to use first names to
improve the classification of race and ethnicity in a mortgage-lending context,
drawing on Coldman et al. (1988) and Fiscella and Fremont (2006). Analyzing data
from the German Socio-Economic Panel, Tuppat and Gerhards (2021) show that
immigrants with first names considered uncommon in the host country dispropor-
tionately complain of discrimination. When names are used as markers indicating
ethnicity, it has been observed that highly educated immigrants tend to report
perceiving discrimination in the host country more frequently than less educated
immigrants. This phenomenon is referred to as the “discrimination paradox.”
Rubinstein and Brenner (2014) show that the Israeli labor market discriminates
on the basis of perceived ethnicity (between Sephardic and Ashkenazi-sounding
surnames). Carpusor and Loges (2006) analyzes the impact of first and last names
on the rental market, whereas Sweeney (2013) analyze their impact on online
advertising.
Chatbots raise critical ethical challenges and hold implications for the democratization of technology, and research addressing these issues is important. Chatbots permit users to interact through natural language, and are consequently a potential low-threshold means of accessing information and services and of promoting inclusion. However, owing to technological limitations and
design choices, they can be the means of perpetuating and even reinforcing existing
biases in society, excluding or discriminating against some user groups, as discussed
in Harwell et al. (2018) or Feine et al. (2019), and over-representing or enshrining
specific values.
If we translate Turkish sentences that use the gender-neutral pronoun ‘o’ into English, we
obtain outputs such as the ones in Table 6.3.
If the models make such choices, it is because the corpora used to train them encode sexist (and possibly also ageist or racist) associations. Therefore, the words we use can be a proxy for some sensitive attributes. This is related to “social norm bias” as defined in Cheng et al. (2023), inspired by Antoniak and Mimno (2021). Social norm bias refers to the association between an algorithm’s predictions and individuals’ adherence to inferred social norms; penalizing individuals for their adherence to, or deviation from, such norms is a form of algorithmic unfairness. More precisely, social norm bias occurs when an algorithm is more likely to correctly classify the women in an occupation who adhere to gendered norms than the women in the same occupation who do not. Tang et al. (2017) used manually compiled lists of gendered words, relying only on the frequency of these words. “Occupations are
socially and culturally ‘gendered’” wrote Stark et al. (2020); many jobs (for instance
in science, technology, and engineering) are traditionally masculine (Light 1999; Ensmenger 2015). Different English words have gender- and age-related
connotations, as shown in Moon (2014), inspired by Williams and Bennett (1975).
Based on a large reference corpus (the 450-million-word Bank of English, BoE),
Moon (2014) observed that the most frequent adjectives co-occurring with “young”
are: inexperienced, beautiful, fresh, attractive, healthy, vulnerable, pretty, naive, tal-
ented, impressionable, energetic, crazy, single, dynamic, fit, strong, trendy, innocent,
foolish, handsome, hip, stupid, ambitious, free, full (of life/ideas/hope/etc.), lovely,
enthusiastic, eager, small, vibrant, gifted, immature, slim, good-looking. In contrast,
the most frequent adjectives co-occurring with “old” are: sick, tired, infirm, frail,
gray, fat, worn(-out), decrepit, disabled, wrinkly/wrinkled, slow, poor, weak, wise,
beautiful, rare, ugly. The following associations were observed:
• Precocious, shy: teens, tailing off in the twenties
• Pretty, promising: peaking with teens, twenties, tailing off in the thirties
• Beautiful, fresh-faced, stunning: mainly teens and twenties
• Blonde: strongly associated with women in their twenties; to a lesser extent teens
and thirties
• Ambitious, brilliant, talented: peaking in the twenties
• Attractive: peaking in the twenties, tailing off in the thirties
• Handsome: peaking in the twenties, tailing off in the forties
• Balding, dapper, formidable, genial, portly, paunchy: mainly forties and older
• Sprightly/spritely: beginning to appear in the sixties, stronger in the seven-
ties/eighties
• Frail: mainly the seventies and older
Crawford et al. (2004) provide a corpus of 600 words and human-labeled gender
scores, as scored on a scale of 1–5 (1 is the most feminine, 5 is the most masculine)
by undergraduates at US universities. They find that words referring to explicitly
gendered roles such as wife, husband, princess, and prince are the most strongly
gendered, whereas words such as pretty and handsome also skew strongly in the
feminine and masculine directions respectively.
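As a minimal sketch of such frequency-based approaches, one can score a short text against a small lexicon of gendered words; the lexicon below is purely hypothetical, using the 1 (feminine) to 5 (masculine) scale just described.

# Score a short text by the words it contains from a small, hypothetical gendered
# lexicon (scores on the 1 = feminine to 5 = masculine scale).
lexicon <- c(wife = 1, princess = 1, pretty = 2,      # feminine-leaning words
             husband = 5, prince = 5, handsome = 4)   # masculine-leaning words
text    <- "A handsome prince and his wife"
tokens  <- tolower(unlist(strsplit(text, "[^a-zA-Z]+")))
scores  <- lexicon[tokens[tokens %in% names(lexicon)]]
mean(scores)    # average gender score of the text (here, masculine-leaning)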
The voice is also an important element, one that used to be judged very subjectively during in-person meetings with agents, but that is now analyzed by algorithms, in particular when an insurer uses conversational robots (we sometimes speak of a “chatbot”), as analyzed
by Hunt (2016), Koetter et al. (2019), Nuruzzaman and Hussain (2020), Oza et al.
(2020), or Rodríguez Cardona et al. (2021). In France, one can note the virtual assistant Maxime, launched by Axa Legal Protection and described by Chardenon (2019).
In the fall of 2020, in France, a bill penalizing discrimination based on accent was presented to the parliament, as reported by Le Monde (2021). Linguists
would speak of phonostyle discrimination, as in Léon (1993), or of “diastratic
variation,” with differences between usages by gender, age, and social background
(in the broad sense), in Gadet (2007), or of “glottophobia”, a term introduced by
Blanchet (2017). Glottophobia can be defined as “contempt, hatred, aggression,
6.7 Pictures
More than a century ago, first Lombroso (1876), and then Bertillon and Chervin
(1909), laid the foundations of phrenology and the “born criminal” theory, which
assumes that physical characteristics correlate with psychological traits and criminal
inclinations. The idea was to build classifications of human types on the basis of
morphological characteristics, in order to explain and predict differences in morals
and behavior. One could speak of the invention of a “prima facie”. We can also
mention “ugly laws,” in the sense of Schweik (2009), taking up a term used by
Burgdorf and Burgdorf Jr (1974) to describe laws in force in several cities in the
USA until the 1920s, but some of which lasted until 1974. These laws allowed
people with “unsightly” marks and scars on their bodies to be banned from public
places, especially parks. In New York, in the XVIII-th century, Browne (2015) recalls
that “lantern laws” demanded that Black, mixed-race, and Indigenous enslaved
people carry candle lanterns with them if they walked about the city after sunset when not in the company of a white person. The law prescribed various punishments
for those who did not carry this supervisory device.
These debates are resurfacing as the number of applications of facial recognition
technology increases, thanks to improvements in the quality of the images, the
algorithms used, and the processing power of computers. The potential of these
facial recognition tools to perform health assessment is demonstrated in Boczar et al.
(2021). Historically, Anguraj and Padma (2012) had proposed a diagnostic tool for
facial paralysis, and recently, Hong et al. (2021) use the fact that many genetic syndromes come with facial dysmorphism and/or facial gestures that can serve as a diagnostic tool to recognize a syndrome. As shown in Fig. 6.6, many online applications can label pictures for free and extract personal, if not sensitive, information, such as gender (with a “confidence” value), age, and even some sort of emotion.
In Chap. 3, we explained that in the context of risk modeling, a “classifier” that simply assigns a picture to a class is not very interesting; having the probability of belonging to each class is more useful. In Fig. 6.7, we challenge the
“confidence” value given by Picpurify, using pictures generated by a generative
adversarial network (used in Hill and White (2020) to generate faces), with a
“continuous” transformation from a picture (top left) to another one (bottom right).
Based on the terminology we use later, when using barycenters in Chap. 12, we
have here some sort of “geodesic” between the picture of a woman and a picture of
a man. We would expect the “confidence” (for identifying a “man”) to increase continuously from a low value to a higher one, but surprisingly this is not the case: the algorithm predicts with very high confidence that the person at the top right is “female,” and also with very high confidence that the person at the bottom left is “male” (with only very few changes in the pixels between the two pictures).
More generally, beyond medical considerations, Wolffhechel et al. (2014)
reminds us that several personality traits can be read from a face, and that facial
Fig. 6.6 Faces generated by Karras et al. (2020). Gender and age were provided by https://gender.toolpie.com/ and facelytics, https://www.picpurify.com/demo-face-gender-age.html with a “confidence,” and https://cloud.google.com/vision/, https://howolddoyoulook.com/ and https://www.facialage.com/ (accessed in January 2023)
features influence first impressions. That said, the prediction model considered fails
to reliably predict personality traits from facial features. However, recent technical
developments, accompanied by the development of large image banks, have made it
possible, as claimed by Kachur et al. (2020), to predict multidimensional personality
profiles from static facial images, using neural networks trained on large labeled
data sets. About 10 years ago, Cao et al. (2011) proposed predicting the gender of a
person from facial images (Rattani et al. 2017 or Rattani et al. 2018), and recently
Kosinski (2021) used a facial recognition algorithm to predict political orientation
(in a binary context, opposing liberals and conservatives, in the spirit of Rule and
Ambady (2010)). Wang and Kosinski (2018) and Leuner (2019) proposed using
Fig. 6.7 GAN used in Hill and White (2020) to generate faces, with a “continuous” transformation
from a picture (top left) to another one (bottom right), and then gender predicted using https://www.
picpurify.com/demo-face-gender-age.html with a “confidence” (accessed in March 2023)
Fig. 6.8 Labels from Google API Cloud Vision https://cloud.google.com/vision/, (accessed in
January 2023) (source: personal collection)
Fig. 6.9 Examples of images associated with a building search, with street photos on the left-hand
side, with Google Street View, and aerial imagery, with Google Earth, on the right-hand side
Fig. 6.10 Examples of images associated with building search, with street photos on the left, from
Google Street View, with different views from 2012 until 2018 on top, and from 2009 until 2018
below
Fig. 6.12 Some polygons of building contours, in Montréal (Canada), extracted from Open-
StreetMap (getbb function of osmdata package)
Fig. 6.13 Locations of fire hydrants in a neighborhood in Montréal (Canada), with a satellite
picture from GoogleView on the right-hand side (qmap function of ggmap package)
6.8.1 Redlining
Community groups in Chicago’s Austin neighborhood coined the word “redlining” in the late 1960s, referring literally to the red lines that lenders and insurance providers admitted drawing around areas they would not service. “The practice of redlining
was first identified and named in the Chicago neighborhood of Austin in the late
1960s. Saving and loan associations, at the time the primary source of residential
mortgages, drew red lines around neighborhoods they thought were susceptible to
racial change and refused to make mortgages in those neighborhoods,” explained
Squires (2011). And because those areas also had a higher proportion of African
American people, “redlining” started to be perceived as a discriminatory practice.
More generally, the use of geographic attributes may hide (intentionally or not) the
fact that some neighborhoods are populated mainly by people of a specific race, or
minority.
8 IRIS = Îlots Regroupés pour l’Information Statistique, a division of the French territory into
grids of homogeneous size, with a reference size of 2000 inhabitants per elementary grid. France
has 16,100 IRIS, including 650 in the overseas departments, and Paris (intra muros) has 992 IRIS.
Fig. 6.15 Median income per household, proportion of elderly people, proportion of dwellings
according to their size, by neighborhood (IRIS), in Paris, France. This statistical information on
the neighborhood (and not the insured person) can be used to rate a home insurance policy, for
example (data source: INSEE open data)
cars detected on Google Street View images. For example, in the USA, if the
number of sedans in a neighborhood is greater than the number of pickup trucks, that neighborhood is likely to vote Democrat in the next presidential election (88% chance); otherwise, it is likely to vote Republican (82% chance).
As Law et al. (2019) puts it, when an individual buys a home, they are
simultaneously buying its structural characteristics, its accessibility to work, and
the neighborhood’s amenities. Some amenities, such as air quality, are measurable,
whereas others, such as the prestige or visual impression of a neighborhood, are
difficult to quantify. Rundle et al. (2011) notes that Google Street View provides a
sense of neighborhood safety (related to traffic safety), if looking for crosswalks, the
presence of parks and green spaces, etc. Using street and satellite image data, Law
et al. (2019) show that it is possible to capture some of these hard-to-quantify features and improve the estimation of housing prices in London, UK. A neural network, with
input from traditional housing characteristics such as age, size and accessibility, as
well as visual features from Google Street View and aerial images, is trained to
estimate house prices. It is also possible to infer some possibly sensitive personal
information, such as a possible disability with the presence of a ramp for the house
(the methodology is detailed in Hara et al. (2014)), or sexual orientation with the
presence of a rainbow flag in the window, or political orientation with a Confederate flag (as mentioned in Mas (2020)). Ilic et al. (2019) also evoke this “deep mapping”
of environmental attributes. A Siamese Convolutional Neural Network (SCNN)
is trained on temporal sequences of Google Street View images. The evolution
over time confirms some urban areas known to be undergoing gentrification, while
6.9 Credit Scores
Several authors have argued that discrimination based on wealth or income should also be taken into account, as advocated by Brudno (1976), Gino and Pierce (2010), or more recently Paugam et al. (2017). That criterion, however, is outside the scope of this discussion. Nevertheless, these authors raise pertinent questions about the intricate interplay between discrimination, economic disparities, and the principle of meritocracy, connections that have also been underscored by Dubet (2014) and further elaborated upon in Dubet (2016).
In the brief section “how insurers determine your premium,” in the National
Association of Insurance Commissioners (2011, 2022) reports, it is explained that
“most insurers use the information in your credit report to calculate a credit-based
insurance score. They do this because studies show a correlation between this score
and the likelihood of filing a claim. Credit-based insurance scores are different from
other credit scores.” And as mentioned in Kuhn (2020), credit scoring is allowed
in 47 states in the USA (all except California, Massachusetts, and Hawaii), and it
is used by the 15 largest auto insurers in the country and over 90% of all US auto
insurers.
More generally, credit scores are an important piece of individual information
in the USA or Canada, and are widely used in many lines of insurance, not just
loan insurance. As noted, a (negative) credit event, such as a default (or late)
payment on a mortgage, or a bankruptcy, can impact an individual for a considerable
period of time. These credit scores are numbers that represent an assessment of
a person’s creditworthiness, or the likelihood that he or she will pay back their
debts. But increasingly, these scores are being used in quite different contexts,
such as insurance. As mentioned by Kiviat (2019), “the field of property and
casualty insurance is dominated by the idea that it is fair to use data for pricing
if the data actuarially relate to loss insurers expect to incur.” And as she explains,
actuaries, underwriters, and policymakers seem to have gone along with that, as it conforms with their sense of “moral deservingness” (using the term introduced by
Watkins-Hayes and Kovalsky (2016)). Guseva and Rona-Tas (2001) recalls that
in North America, Experian, Equifax, and TransUnion keep records of a person’s
borrowing and repayment activities. And the Fair Isaac Corporation (FICO) has
developed a formula (not publicly disclosed) that uses these records to calculate a score based on debt and available credit, income, or rather their variations (along with payment history, the number of recent credit applications, negative events such as bankruptcy or foreclosure, and changes in income due to changes in employment or family situation). The FICO score starts at 300 and goes up to
850, with a poor score below 600, and an “exceptional” score above 800. The
average score is around 711. This score, created for banking institutions, is now
used in pre-employment screening, as Bartik and Nelson (2016) reminds us. For
O’Neil (2016), this use of credit scores in hiring and promotion creates a dangerous
vicious cycle in terms of poverty. Credit-based insurance scores cannot use any personal information other than what appears on the credit report; in particular, they cannot use race and ethnicity, religion, gender, marital status, age, employment history, occupation, place of residence, child/family support obligations, or rental agreements. In Table 6.4, based on Solution (2020), we can see how the proposed credit rate varies with the credit score, for a 30-year loan, from a “good risk” (3.2%) to a “bad risk” (4.8%), the resulting total cost of the credit (a “bad risk” pays an extra cost of about 20%), and the insurance premium for the same profile (a “bad risk” pays an extra premium of about 70%).
In Table 6.4, we have, at the top, the cost of a 30-year credit for an amount of $150,000, according to the credit score (with five categories, going from the most risky on the left-hand side to the least risky on the right-hand side), together with the average interest rate; and, at the bottom, the insurance premium charged to a 30-year-old driver with no convictions, driving an average car 12,000 miles per year, in the city.
Table 6.4 Top, cost of a 30-year credit, for an amount of $150,000, according to the credit score,
with the average interest rate. Bottom, insurance premium charged to a 30-year-old driver, with
no convictions, driving an average car, 12,000 miles per year, in the city (source: InCharge Debt
Solutions)
300–619 620–659 660–699 700–759 760–850
Total credit cost $283,319 $251,617 $245,508 $239,479 $233,532
Rates 4.8% 3.8% 3.6% 3.4% 3.2%
Motor insurance premium $2580 $2400 $2150 $2000 $1500
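The “total credit cost” row of Table 6.4 can be recovered, up to rounding, from the standard fixed-rate annuity formula, with a monthly payment m = P (r/12) / (1 − (1 + r/12)^(−n)) over n = 360 months; a minimal sketch in R:

# Reproduce the "total credit cost" row of Table 6.4 with the annuity formula.
loan  <- 150000
rates <- c("300-619" = 0.048, "620-659" = 0.038, "660-699" = 0.036,
           "700-759" = 0.034, "760-850" = 0.032)
n <- 30 * 12
monthly <- loan * (rates / 12) / (1 - (1 + rates / 12)^(-n))
round(n * monthly)   # total amounts repaid over 30 years
# ~ 283319 251617 245508 239479 233532  (matches the table, up to rounding)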
had little to do with wealth; thus, Herodotus is astonished that the winners of the Olympian games were content with an olive wreath and a “glorious renown” (περὶ ἀρετῆς). In the Greek ethical vision, especially among the Stoics, a “good life” does not depend on material wealth, a precept pushed to its height by Diogenes who, seeing a child drinking from his hands at the fountain, throws away the bowl he owned, telling himself that it, too, is useless wealth.
Greek society, despite its orientation toward values beyond mere material wealth,
remained profoundly hierarchical. This prompts the inquiry of pinpointing the
juncture in Western history when affluence emerged as the paramount metric for
evaluating all aspects of life. Max Weber’s theory, as expounded in Weber (1904),
is illuminating in this context: he posits that the Protestant work ethic fostered a
mindset where worldly achievements and success were considered indicative of
divine predestination for the afterlife. Concomitantly, the affluent individuals in the
present realm were perceived as the chosen ones for the future. Adam Smith, a
discerning critic of the nascent capitalist society in his era, devoted a chapter in “The
Theory of Moral Sentiments,” as delineated in Smith (1759), to the subject titled “of the corruption of our moral feelings occasioned by that disposition to admire the rich and great, and to despise or neglect the poor and lowly.” This underscores
the historical backdrop in which adulation of the wealthy and powerful, coupled
with indifference or disdain for the impoverished and humble, began to take root.
Today, the veneration of wealth appears to have reached unprecedented levels, with
material success nearly exalted to the status of a moral virtue. Conversely, poverty
has metamorphosed into a stigmatized condition that is arduous to transcend.
Nonetheless, historical context reminds us that these circumstances are not inherent
or immutable.
Indeed, the poor have not always been “bad.” In Europe, the Church has largely
contributed to disseminating the image of the “good poor,” as it appears in the
Gospels: “happy are you poor, the kingdom of God is yours,” or “God could have
made all men rich, but he wanted there to be poor people in this world, so that the
rich would have an opportunity to redeem their sins.” Beyond this, the poor are seen
as an image of Christ, Jesus having said “whatever you do to the least of these, you
will do to me.” Helping the poor, doing a work of mercy, is a means of salvation.
For Saint Thomas Aquinas, charity is thus essential to correct social inequalities
by redistributing wealth through almsgiving. In the Middle Ages, merchants
were seen as useful, even virtuous, as they allowed wealth to circulate within the
community. Priests played the role of social assistants, helping the sick, the elderly,
and people with disabilities. The hospices and “xenodochia” of the Middle Ages
(ξενοδοχει̃ον, the “place for strangers,” ξένος) are the symbol of this care of the
poor. And quite often, poverty is not limited to material capital, but extends to social and cultural capital, to use more contemporary terminology.
Toward the end of the Middle Ages, the figure of the “bad poor,” the parasitic and
dangerous vagabond, appeared. Brant (1494) denounced these welfare recipients,
“some become beggars at an age when, young and strong, and in full health, one
could work: why bother.” This mistrust was reinforced by the great pandemic of the
Black Death. The hygienic theories of the end of the XIX-th century added the final
touch: if fevers and diseases were caused by insalubrity and poor living conditions,
then by keeping the poor out, the rich were protected from disease.
In the words of Mollat and du Jourdin (1986), “the poor are those who,
permanently or temporarily, find themselves in a situation of weakness, dependence,
humiliation, characterized by the deprivation of means, variable according to the
times and the societies, of power and social consideration.” Recently, Cortina
(2022) proposed the term “aporophobia,” or “pauvrophobia,” to describe a whole
set of prejudices that exist toward the poor. The unemployed are said to be
welfare recipients and lazy. These prejudices, which stigmatize a group, “the poor,”
lead to fear or hatred, generating an important cleavage, and finally a form of
discrimination. Cortina (2022)’s “pauvrophobia” is discrimination against social
precariousness, which would be almost more important than standard forms of
discrimination, such as racism or xenophobia. Cortina (2022) ironically notes that
rich foreigners are often not rejected.
But these prejudices also turn into accusations. Szalavitz (2017) thus abruptly
asks the question, “why do we think poor people are poor because of their own
bad choices?” The “actor-observer” bias provides one element of an answer: we
often think that it is circumstances that constrain our own choices, but that it is the
behavior of others that changes theirs. In other words, others are poor because they
made bad choices, but if I am poor, it is because of an unfair system. This bias is
also valid for the rich: winners often tend to believe that they got where they are by
their own hard work, and that they therefore deserve what they have.
Social science studies show, however, that the poor are rarely poor by choice,
and increasing inequality and geographic segregation do not help. The lack of
empathy then leads to more polarization, more rejection, and, in a vicious circle,
even less empathy.
To discriminate is to distinguish (exclude or prefer) a person because of his/her
“personal characteristics.” Can we then speak of discrimination against the poor? Is
poverty (like gender or skin color) a personal characteristic? In Québec, the concept
of “social condition” (which explicitly encompasses poverty) is considered a protected
attribute. Consequently, discrimination based on that condition is legally prohibited.
A correlation between wealth and risk exists in various contexts. In France, for
instance, there is a notable disparity in road accident deaths, with “executives” accounting for approximately 3% of such deaths and “workers” for 15%, although each group represents nearly 20% of the working population. Additionally, Blanpain
(2018) highlights that there is a significant 13-year gap in life expectancy at
birth between the most affluent and the most economically disadvantaged men, as
discussed in Chap. 2.
6.10 Networks
A graph is composed of a set of nodes (or vertices, often denoted as V) and a set of connections between pairs of nodes (the edges, denoted as E). These mathematical structures serve as the foundation
for defining networks, which are graphs where nodes or edges possess attributes. In
a social network, nodes are individuals (or policyholders) and edges denote some
sort of “connections” between individuals. On social media, such as Facebook or
LinkedIn, an edge indicates that two individuals are friends or colleagues, a reciprocal connection. On Twitter (now X),
connections are directional, in the sense that one account is following another one
(but of course, some edges could be bidirectional). In genealogical trees, nodes are
individuals, and connections usually denote parent-child relationships. Other popular network structures are “bipartite graphs,” where nodes are divided into two disjoint and independent sets V1 and V2, and edges connect nodes between V1 and V2 (and not within). Classical examples are “affiliation networks” where employees and employers are connected, or policyholders and brokers, car accidents and experts,
disability claims and medical doctors, etc. It is also possible to have two groups of
nodes, such as disease and drugs, and edges denote associations. Based on such a
graph, insurance companies may infer diseases based on drugs purchased.
Historically, connections among people were used to understand informal risk
sharing. “Villagers experience health problems, crop failures, wildly changing
employment, in addition to a variety of needs for cash, such as dowries. They
don’t have insurance or much, if any, savings: they rely on each other for help,”
said Jackson (2019). Increasingly, insurers try to extract information from various
networks. And as Bernstein (2007) claimed, “network and data analyses compound
and reflect discrimination embedded within society.” This can be related to “peer-
to-peer” or “decentralized insurance,” as studied in Feng (2023).
Scism (2019) presented a series of “life hacks,” including tips on how to behave on
social media in order to bypass insurers’ profile evaluations. For example “do not
post photos of yourself smoking,” “post pictures of yourself exercising (but not while
engaging in a risky sport),” “use fitness tracking devices to show you are concerned
about your health,” “purchase food from healthy online meal-preparation services,”
and “visit the gym with mobile location-tracking enabled (while leaving your phone
at home when you go to a bar).”
Social networks are also important to analyse fraud. Fraud is often committed
through illegal set-ups with many accomplices. When traditional analytical tech-
niques fail to detect fraud owing to a lack of evidence, social network analysis
may give new insights by investigating how people influence each other. These are
the so-called guilt-by-associations, where we assume that fraudulent influences run
through the network. For example, insurance companies often have to deal with
groups of fraudsters trying to swindle by resubmitting the same claim using different
people. Suspicious claims often involve the same claimers, claimees, vehicles,
witnesses, and so on. By creating and analyzing an appropriate network, inspectors
270 6 Some Examples of Discrimination
may gain new insight in the suspiciousness of the claim and can prevent pursuit of
the claim. In many applications, it may be useful to integrate a second node type
in the network. Affiliation or bipartite networks represent the reason why people
connect to each other, and include the events that network objects—such as people
or companies—attend or share. An event can for example refer to a paper (scientific
fraud), a resource (social security fraud), an insurance claim (insurance fraud), a
store (credit card fraud), and so on. Adding a new type of node to the network
not only enriches the expressive power of graphs but also creates new insights into
the network structure and provides additional information that was previously
neglected. On the other hand, including a second type of node increases the
complexity of the analysis.
As mentioned by the National Association of Insurance Commissioners (2011,
2022), “insurance companies can base premiums on all insured drivers in your
household, including those not related by blood, such as roommates.” And Boyd
et al. (2014) asserted that there is a new kind of discrimination associated not
with personal characteristics (like those discussed in the previous section) but with
personal networks. Beyond their personal characteristics (such as race or gender), an
important source of information is “who they know.” In the context of usage-based
auto insurance, the nuance is that personal networks are not those represented by
driving behavior (strictly speaking), but those defined by the places to which people
physically go.
In many countries, when it comes to employment, most companies are required
to respect equal opportunity: discrimination on the basis of race, gender, beliefs, reli-
gion, color, and national origin is prohibited. Additional regulations prohibit many
employers from discriminating on the basis of age, disability, genetic information,
military history, and sexual orientation. However, nothing prevents an employer
from discriminating based on a person’s personal network. And increasingly, as
Boyd et al. (2014) reminds us, technical decision-making tools are providing new
mechanisms by which this can happen. Some employers use LinkedIn (and other
social networking sites) to determine a candidate’s “cultural fit” for hire, including
whether or not a candidate knows people already known to the company. Although
hiring on the basis of personal relationships is by no means new, it takes on new
meaning when it becomes automated and occurs on a large scale. Algorithms
that identify our networks, or predict our behavior based on them, offer new
opportunities for discrimination and unfair treatment.
Feld (1991) has shown that in any network the average degree (i.e., the number
of neighbors, or connections) of the neighbors of a node is at least as large as
the average degree of nodes in the network as a whole. Applied to networks of
friendship, this translates simply as "on average your friends have more friends
than you do." This phenomenon is known as the "friendship paradox." A related
formal statement can be proved directly.
Proof Define the differences $\Delta_i$ between the average of node $i$'s neighbors' degrees and its own degree,
$$\Delta_i = \frac{1}{d_i}\sum_{j=1}^n A_{ij}\, d_j - d_i,$$
where we suppose here that all nodes have nonzero degree (at least one neighbor).
The friendship paradox states that the average of $\Delta_i$ across all nodes is greater than
zero. In order to prove it, write the average as
$$\frac{1}{n}\sum_{i=1}^n \Delta_i = \frac{1}{n}\sum_{i=1}^n\Big(\frac{1}{d_i}\sum_{j=1}^n A_{ij}\, d_j - d_i\Big) = \frac{1}{n}\sum_{i,j} A_{ij}\,\frac{d_j}{d_i} - \frac{1}{n}\sum_{i,j} A_{ij},$$
which yields
$$\frac{1}{n}\sum_{i=1}^n \Delta_i = \frac{1}{n}\sum_{i,j} A_{ij}\Big(\frac{d_j}{d_i}-1\Big)\quad\text{but also}\quad \frac{1}{n}\sum_{i,j} A_{ij}\Big(\frac{d_i}{d_j}-1\Big),$$
by symmetry of $A$. Averaging the two expressions,
$$\frac{1}{n}\sum_{i=1}^n \Delta_i = \frac{1}{2n}\sum_{i,j} A_{ij}\Big(\frac{d_j}{d_i}+\frac{d_i}{d_j}-2\Big) = \frac{1}{2n}\sum_{i,j} A_{ij}\Big(\sqrt{\frac{d_j}{d_i}}-\sqrt{\frac{d_i}{d_j}}\Big)^2 \ge 0.$$
Observe that exact equality holds only when $d_i = d_j$ for all pairs of neighbors,
corresponding to the case where the network is a regular graph (or possibly a
union of disjoint regular graphs).
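As an illustration, the friendship paradox can be checked numerically on a small simulated network; the following sketch uses base R only, and the network size and the edge probability are arbitrary choices for the illustration.

    set.seed(123)
    n <- 200                                 # number of nodes (arbitrary)
    p <- 0.05                                # edge probability (arbitrary)
    A <- matrix(rbinom(n * n, 1, p), n, n)
    A[lower.tri(A, diag = TRUE)] <- 0
    A <- A + t(A)                            # symmetric adjacency matrix, no self-loops
    d <- rowSums(A)                          # degrees
    keep <- d > 0                            # keep nodes with at least one neighbor
    A <- A[keep, keep]; d <- d[keep]
    neighbor_avg <- as.vector(A %*% d) / d   # average degree of each node's neighbors
    mean(d)                                  # average degree
    mean(neighbor_avg)                       # average of neighbors' degrees, larger in general
    mean(neighbor_avg - d)                   # average of the Delta_i's, nonnegative

On such a simulation, the last quantity is positive unless the generated graph happens to be regular.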
For the generalized friendship paradox, which considers attributes other than
degree, as in Cantwell et al. (2021), one can define an analogous quantity $\Delta_i^{(x)}$ for
some attribute x (such as wealth),
$$\Delta_i^{(x)} = \frac{1}{d_i}\sum_j A_{ij}\, x_j - x_i,$$
which measures the difference between the average of the attribute over node i's
neighbors and the value for i itself. When the average of this quantity over all nodes
is positive, one may say that the generalized friendship paradox holds. In contrast
to the case of degree, this is not always true—the average of $\Delta_i^{(x)}$ can be zero or
negative—but we can write the average as
$$\frac{1}{n}\sum_i \Delta_i^{(x)} = \frac{1}{n}\sum_i\Big(\frac{1}{d_i}\sum_j A_{ij}\, x_j - x_i\Big) = \frac{1}{n}\sum_i x_i\sum_j \frac{A_{ij}}{d_j} - \frac{1}{n}\sum_i x_i,$$
where the second equality follows from interchanging the summation indices. Defining the new quantity
$$\delta_i = \sum_j \frac{A_{ij}}{d_j},$$
whose average is
$$\frac{1}{n}\sum_i \delta_i = \frac{1}{n}\sum_{i,j}\frac{A_{ij}}{d_j} = \frac{1}{n}\sum_j \frac{1}{d_j}\sum_i A_{ij} = 1,$$
the average of the $\Delta_i^{(x)}$'s is simply the empirical covariance between x and $\delta$.
Thus, we will have a generalized friendship paradox in the sense defined here if (and
only if) x and $\delta$ correlate positively. But this is not always the case: even though
$\mathrm{Cov}(d,\delta)\ge 0$ always holds,
$$\mathrm{Cov}(d,\delta)\ge 0\ \text{ and }\ \mathrm{Cov}(d,x)\ge 0\quad\not\Rightarrow\quad \mathrm{Cov}(x,\delta)\ge 0.$$
French poet Paul Verlaine warned us,9 “il ne faut jamais juger les gens sur leurs
fréquentations. Tenez, Judas, par exemple, il avait des amis irréprochables.” Never-
theless, several organizations have proposed to use information about our “friends”
in order to learn more about us, following a homophily principle (in the sense of
McPherson et al. (2001)), because as the popular saying goes, “birds of a feather
flock together.” Therefore, Bhattacharya (2015) noted that “you apply for a loan
and your would-be lender somehow examines the credit ratings of your Facebook
friends. If the average credit rating of these members is at least a minimum credit
score, the lender continues to process the loan application. Otherwise, the loan
application is rejected.” As we can see, mathematical guarantees are not strong here,
and it is very likely that such a strategy creates more biases.
9 "you should never judge people by who they associate with. Take Judas, for example, he had irreproachable friends."
Chapter 7
Observations or Experiments: Data in Insurance

Abstract An important challenge for actuaries is that they need to answer causal
questions with observational data. After a brief discussion about correlation and
causality, we describe the “causation ladder,” and the three rungs: association or
correlation (“what if I see...”), intervention (“what if I do...”), and counterfactuals
(“what if I had done...”). Counterfactuals are important for quantifying discrimina-
tion.
In Sect. 3.4, we have described various predictive models, where some variables,
called “explanatory variables” or “predictors,” .x are used to “predict” a variable y,
through some function m. We needed variables .x to be as correlated as possible with
y, so that $m(x)$ also, hopefully, correlates with y. This correlation makes them appear
as valid predictors, and that is usually sufficient motivation to use them as pricing
variables. “Databases about people are full of correlations, only some of which
meaningfully reflect the individual’s actual capacity or needs or merit, and even
fewer of which reflect relationships that are causal in nature,” claimed (Williams
et al. 2018). In Sect. 4.1, we discussed interpretability, and explainability, we try to
explain why a variable is legitimate in a pricing model, and why a specific individual
gets a large prediction (and therefore is asked a larger premium). But as mentioned
by Kiviat (2019), the causal effect from $x$ to y was traditionally not a worthwhile
question for insurers, and the concept of “actuarial fairness” allows this problem to
be swept under the rug. But more and more “policymakers want(ed) to understand
why” some variable “ predict(ed) insurance loss in order to determine if any links
in the causal chain held the wrong people to account, and as a consequence gave
them prices they did not deserve," recalls Kiviat (2019). That is actually simply a
regulatory requirement across many countries, states, and provinces, if one wants to
prove that rates are not “unfairly discriminatory” (Fig. 7.1).
Everyone is familiar with the adage “correlation does not imply causation.” And
indeed, a classical question for researchers, but also policy makers, is whether
a significant correlation between two variables represents a causal effect. As
mentioned in Traag and Waltman (2022), a similar question arises when we observe
a difference in outcomes y between two groups: does the difference represent a bias
Table 7.1 In-hospital mortality for all patients undergoing major surgery, by major payer group.
(Source LaPar et al. (2010), Tables 4 and 5)

                                      Medicare      Medicaid      Uninsured     Insurance
  In-hospital mortality               4.4%          3.7%          3.2%          1.3%
    Pulmonary resection               4.3%          4.3%          6.2%          2.0%
    Esophagectomy                     8.7%          7.5%          6.5%          3.0%
    Colectomy                         7.5%          5.4%          3.9%          1.8%
    Pancreatectomy                    6.1%          5.8%          8.4%          2.7%
    Gastrectomy                       10.8%         5.4%          5.0%          3.5%
    Aortic aneurysm                   12.4%         14.5%         14.8%         7.0%
    Hip replacement                   0.4%          0.2%          0.1%          0.1%
    Coronary artery bypass grafting   4.0%          2.8%          2.3%          1.4%
  Number of cases                     491,829       40,259        24,035        337,535
  Age (years)                         73.5 ± 8.6    49.8 ± 16.4   51.8 ± 12.8   55.5 ± 11.4
  Women                               49.6%         48.8%         35.8%         39.7%
  Length of stay (days)               9.5 ± 0.1     12.7 ± 0.4    10.1 ± 0.3    7.4 ± 0.1
  Total cost ($)                      76,374 ± 53.1 93,567 ± 251.4 78,279 ± 231.0 63,057 ± 53.0
  Rural location                      10.1%         8.5%          9.8%          6.6%
In 2008, owing to severe budget constraints, Oregon found itself with a Medicaid
waiting list of 90,000 people and only enough money to cover 10,000 of them.
So the state created a lottery to randomly select people who would qualify for
Medicaid, thereby recreating the preconditions of a randomized experiment. The reality, however,
was a bit more complex, as many of the lottery winners were not eligible for
Medicaid or chose not to submit their paperwork to enroll in the program. Compared
with the control group (people who did not have access to Medicaid), Finkelstein
et al. (2012) observed that the treatment group had substantially and statistically
significantly higher health care use (including primary and preventive care as well
as hospitalizations), lower out-of-pocket medical expenditures and medical debt
(including fewer bills sent to collection), and better self-reported physical and
mental health. These “experiments,” which are often difficult to implement (for
financial and sometimes ethical reasons), make it possible to bypass the bias of
administrative data. Having a non-null correlation is quite easy, but proving a causal
effect is difficult.
Definition 7.1 (Common Cause (Reichenbach 1956)) If X and Y are not independent, $X \not\perp\!\!\!\perp Y$, then either
$$\begin{cases} X \text{ causes } Y,\\ Y \text{ causes } X,\\ \text{or there exists } Z \text{ such that } Z \text{ causes both } X \text{ and } Y.\end{cases}$$
Before defining causality in the context of individual data, let us recall that, in a
context of temporal data, Granger (1969) introduced a concept of “causality” that
takes a relatively simple form. Sequences of observations are useful to properly
capture this “causal” effect. Consider here a standard bivariate autoregressive time
series, where a regression of variables at time .t + 1 on the same variables at time t
is performed
$$\begin{cases} x_{1,t+1} = c_1 + a_{1,1}\, x_{1,t} + a_{1,2}\, x_{2,t} + \varepsilon_{1,t+1}\\ x_{2,t+1} = c_2 + a_{2,1}\, x_{1,t} + a_{2,2}\, x_{2,t} + \varepsilon_{2,t+1},\end{cases}$$
also noted $x_{t+1} = c + A\, x_t + \varepsilon_{t+1}$, where the off-diagonal terms of the autoregression
matrix $A$ allow us to quantify the lagged causality, i.e., a lagged causal effect
(between t and $t+1$) with respectively $x_1 \to x_2$ or $x_1 \leftarrow x_2$ (see Hamilton (1994)
or Reinsel (2003) for more details). For example, Fig. 7.2 shows the scatterplots
$(x_{2,t}, x_{1,t+1})$ and $(x_{1,t}, x_{2,t+1})$, left and right, where $x_1$ denotes the
number of cyclists in Helsinki, per day, in 2014 (at a given road intersection) and
$x_2$ denotes the (average) temperature on the same day. The graph on the left is
equivalent to asking whether the temperature "causes" the number of cyclists (if
the temperature rises, the number of cyclists on the roads rises) and the graph on the
right-hand side is equivalent to asking whether the number of cyclists "causes" the
temperature (if the number of cyclists on the roads rises, the temperature rises). In
both cases, we can estimate
$$\begin{cases} x_1 \to x_2: & x_{2,t+1} = \gamma_1 + \alpha_{2,1}\, x_{1,t} + \eta_{1,t}\\ x_1 \leftarrow x_2: & x_{1,t+1} = \gamma_2 + \alpha_{1,2}\, x_{2,t} + \eta_{2,t}.\end{cases}$$
We can then use the Granger test (see Hamilton 1994), on the data of Fig. 7.2, for the two
causal hypotheses (not on the levels but on the daily variations of the number of
cyclists and of the temperature),
$$\begin{cases} x_1 \to x_2: & H_0: a_{2,1} = 0,\quad p\text{-value} = 56.66\%\\ x_1 \leftarrow x_2: & H_0: a_{1,2} = 0,\quad p\text{-value} = 0.004\%.\end{cases}$$
Fig. 7.2 Number of cyclists per day ($x_1$), in 2014, in Helsinki (Finland), and average daily
temperature ($x_2$), respectively at time t on the x-axis and t + 1 on the y-axis. The regression lines
are estimated only on days when the temperature exceeded 0°C. (Data source: https://www.reddit.
com/r/dataisbeautiful/comments/8k40wl )
In other words, temperature is causally related to the presence of cyclists on the road
(the temperature “causes” the number of cyclists, according to Granger’s approach),
but not vice versa.
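A Granger-type comparison of this kind can be reproduced with a few lines of base R; the series below are simulated (the Helsinki data are not reproduced here), with the "temperature" driving the "cyclists" series, and the p-values of the lagged cross coefficients play the role of the tests above. The lmtest package also provides a dedicated grangertest function.

    set.seed(1)
    n    <- 365
    temp <- as.numeric(arima.sim(list(ar = 0.7), n))   # x2: simulated "temperature"
    bike <- 0.5 * c(0, temp[-n]) + rnorm(n)            # x1: simulated "cyclists", driven by lagged temperature
    d <- data.frame(x1  = bike[-1], x2  = temp[-1],    # values at time t + 1
                    x1l = bike[-n], x2l = temp[-n])    # values at time t
    # x1 -> x2 : is the lagged number of cyclists significant for the temperature?
    summary(lm(x2 ~ x1l + x2l, data = d))$coefficients["x1l", ]
    # x1 <- x2 : is the lagged temperature significant for the number of cyclists?
    summary(lm(x1 ~ x1l + x2l, data = d))$coefficients["x2l", ]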
In a nondynamic context, defining causality is a more perilous exercise. In
the upcoming sections, we revisit the concept of the “causation ladder” initially
introduced by Pearl (2009b) and more recently discussed in Pearl and Mackenzie
(2018). The first stage (Sect. 7.2), referred to as “association,” represents the most
basic level where we observe a connection or correlation between two or more
variables. Moving on to the second stage (Sect. 7.3), known as “intervention,”
we encounter situations where we not only observe an association but also have
the ability to alter the world through suitable interventions (or experiments). The
third stage (Sect. 7.4) focuses on “counterfactuals.” According to Holland (1986),
the “fundamental problem of causal inference” arises from the fact that we are
confined to observing just one realization, despite the potential existence of several
alternative scenarios that could have been observed. In this section, we employ
counterfactuals to estimate the causal effect, which quantifies the disparity between
what we observed and the “potential outcome” if the individual had received the
treatment.
Two variables are independent when the value of one gives no information about the
value of the other. In everyday language, dependence, association, and correlation
are used interchangeably. Technically, however, association is synonymous with
dependence (or non-independence) and is different from correlation. "Association
is a very general relationship: one variable provides information about another,"
as explained in Altman and Krzywinski (2015), in the sense that there could be
association even if the "correlation coefficient" is statistically null.
Here, we have to study the joint distribution of some variables, and their conditional
distribution.
Proposition 7.1 (Chain Rule) In dimension 2, for all sets $A$ and $B$,
$$\mathbb{P}\big[X\in A,\, Y\in B\big] = \mathbb{P}\big[Y\in B\mid X\in A\big]\cdot\mathbb{P}\big[X\in A\big].$$
Proof This chain rule is a simple consequence of the definition of conditional probability. □
Definition 7.2 (Independence (Dimension 2)) X and Y are independent, denoted
$X \perp\!\!\!\perp Y$, if for any sets $A, B\subset\mathbb{R}$,
$$\mathbb{P}\big[X\in A,\, Y\in B\big] = \mathbb{P}\big[X\in A\big]\cdot\mathbb{P}\big[Y\in B\big].$$
Proof Hirschfeld (1935), Gebelein (1941), Rényi (1959) and Sarmanov (1963)
introduced the concept of "maximal correlation," defined as
$$r^*(X,Y) = \max_{\varphi,\psi}\big\{\mathrm{Corr}[\varphi(X),\psi(Y)]\big\}$$
(provided that expected values exist and are well defined). And those authors proved
that $X \perp\!\!\!\perp Y$ if and only if $r^*(X,Y)=0$. □
Definition 7.4 (Linear Independence) In a general context, consider two random
vectors $X$ and $Y$, in $\mathbb{R}^{d_x}$ and $\mathbb{R}^{d_y}$ respectively. $X\perp Y$ if and only if, for any $a\in\mathbb{R}^{d_x}$ and $b\in\mathbb{R}^{d_y}$,
$$\mathrm{Cov}\big[a^\top X,\, b^\top Y\big]=0.$$
Proposition 7.3 Consider two random vectors $X$ and $Y$. $X\perp\!\!\!\perp Y$ if and only if, for
any functions $\varphi:\mathbb{R}^{d_x}\to\mathbb{R}$ and $\psi:\mathbb{R}^{d_y}\to\mathbb{R}$ (such that the expected values below
exist and are well defined),
$$\mathbb{E}\big[\varphi(X)\,\psi(Y)\big] = \mathbb{E}\big[\varphi(X)\big]\cdot\mathbb{E}\big[\psi(Y)\big],\quad\text{or equivalently}\quad \mathrm{Cov}\big[\varphi(X),\psi(Y)\big]=0.$$
More generally, the random variables $Y_1,\cdots,Y_d$ are (mutually) independent if, for any sets $A_1,\cdots,A_d\subset\mathbb{R}$,
$$\mathbb{P}\Big[(Y_1,\cdots,Y_d)\in\prod_{i=1}^d A_i\Big] = \prod_{i=1}^d \mathbb{P}\big[Y_i\in A_i\big].$$
In Sect. 4.1, we have discussed the difference between ceteris paribus ("all other
things being equal" or "other things held constant") and mutatis mutandis ("once the
necessary changes have been made"). As discussed, ceteris paribus has to do with
versions of some random vector with some independent components (we consider
some explanatory variables as if they were independent of the other ones).
Definition 7.8 (Version of Some Random Vector with Independent Margin)
Let $Y = (Y_1,\cdots,Y_d)$ denote some random vector. $(Y_1^{\perp}, Y_2,\cdots,Y_d)$ is a version
of $Y$ with independent first margin if
$$\begin{cases} Y_1^{\perp} \perp\!\!\!\perp Y_{-1} = (Y_2,\cdots,Y_d)\\ Y_1^{\perp} \overset{\mathcal{L}}{=} Y_1.\end{cases}$$
One can easily extend the previous concept to some subset of indices $J\subset
\{1,\cdots,d\}$. All those concepts can be extended to the case of conditional
independence.
Definition 7.9 (Conditional Independence (Dimension 2)) X and Y are independent conditionally on Z, denoted $X\perp\!\!\!\perp Y\mid Z$, if for any sets $A, B, C\subset\mathbb{R}$,
$$\mathbb{P}\big[X\in A,\, Y\in B\mid Z\in C\big] = \mathbb{P}\big[X\in A\mid Z\in C\big]\cdot\mathbb{P}\big[Y\in B\mid Z\in C\big].$$
See Dawid (1979) for various properties on conditional independence. All those
concepts are well known in actuarial science, but there still are several pitfalls. So
let us recall some simple properties.
Proposition 7.4 Consider three random variables X, Y, and Z. If $X\perp Z$ and
$Y\perp Z$, then $aX + bY \perp Z$, for any $a, b\in\mathbb{R}$.
Proof This follows directly from the bilinearity of the covariance, since
$\mathrm{Cov}(aX+bY, Z) = a\,\mathrm{Cov}(X,Z) + b\,\mathrm{Cov}(Y,Z) = 0$. □
Proposition 7.5 Consider three random variables X, Y, and Z. If $X\perp Z$ and
$Y\perp Z$, it does not imply that $\psi(X,Y)\perp Z$, for any $\psi:\mathbb{R}^2\to\mathbb{R}$.
Proof Consider
$$(X,Y,Z) = \begin{cases}(0,0,0) & \text{with probability } 1/4,\\ (0,1,1) & \text{with probability } 1/4,\\ (1,0,1) & \text{with probability } 1/4,\\ (1,1,0) & \text{with probability } 1/4.\end{cases}$$
Then $\mathrm{Cov}(X,Z)=0$, and similarly for the pair $(Y,Z)$, so $X\perp Z$ and $Y\perp Z$.
On the other hand, $XY\not\perp Z$ as $\mathrm{Cov}(XY,Z)\neq 0$, because $\mathbb{E}[XYZ]=0$ while
$\mathbb{E}[XY]\cdot\mathbb{E}[Z]=\frac14\cdot\frac12=\frac18$, so that $\mathrm{Cov}(XY,Z)=-\frac18$. □
Proposition 7.6 Consider a random vector $X$ in $\mathbb{R}^k$, and a random variable Z.
$X\perp Z$ does not imply that $\psi(X)\perp Z$, for any $\psi:\mathbb{R}^k\to\mathbb{R}$.
so $X\perp\!\!\!\perp Z$ and $Y\perp\!\!\!\perp Z$. But
$$(XY, Z) = \begin{cases}(0,0) & \text{with probability } 1/4,\\ (0,1) & \text{with probability } 1/2,\\ (1,0) & \text{with probability } 1/4,\end{cases}$$
and
$$\mathbb{P}\big[XY=1,\, Z=0\big] = \frac{1}{4} \neq \frac{1}{4}\cdot\frac{1}{2} = \mathbb{P}\big[XY=1\big]\cdot\mathbb{P}\big[Z=0\big].\qquad\square$$
Proposition 7.8 Consider a random vector $X$ in $\mathbb{R}^k$, and a random variable Z.
Componentwise independence, $X_j\perp\!\!\!\perp Z$ for each component $X_j$ of $X$, does not
imply that $\psi(X)\perp Z$ or that $\psi(X)\perp\!\!\!\perp Z$ for every $\psi:\mathbb{R}^k\to\mathbb{R}$.
In the context of fairness, even if we were able to ensure that each component of $X^{\perp}$
satisfies $X_j^{\perp}\perp\!\!\!\perp S$, we could still have $\widehat{Y}=m(X^{\perp})\not\perp\!\!\!\perp S$ (and even
$\widehat{Y}=m(X^{\perp})\not\perp S$).
It may be difficult to "define" what causality is, and it would probably be simpler
to “axiomatize” it. In other words, the approach involves identifying the essential
attributes that must be present for a relationship to be classified as “causal,”
expressing these properties in mathematical terms, and then assessing whether these
axioms lead to the meaningful characterization of a causal relationship.
For example, it seems legitimate that these relations are transitive: if .x1 causes
.x2 and if .x2 causes .x3 , then it must also be true that .x1 causes .x3 . We could then
talk about global causality. But a local version seems to exist: if .x1 causes .x3 only
through .x2 , then it is possible to block the influence of .x1 on .x3 if we prevent .x2 from
being influenced by .x1 . One could also ask that the causal relation be irreflexive, in
the sense that .x1 cannot cause itself. The danger of this property is that it tends to
seek a causal explanation for any variable. Finally, an asymmetrical property of the
relation is often desired, in the sense that .x1 causes .x2 implies that .x2 cannot cause
.x1 . Here again, precision is necessary, especially if the variables are dynamic: this
property does not prevent a lagged feedback effect. It is indeed possible for .x1,t to
cause .x2,t+1 and for .x2,t+1 to cause .x1,t+2 , but not for .x1,t to cause .x2,t and for
$x_{2,t}$ to cause $x_{1,t}$, with this approach. As noted by Wright (1921), well before Pearl
(1988), the most natural tool to describe these causal relations visually and simply
is probably through a directed graph. Within this conceptual framework, a variable
is depicted as a node within the network (e.g., .x1 or .x2 ), and a causal relationship,
such as “.x1 causes .x2 ,” is visually represented by an arrow that directs from .x1 to .x2
(akin to our previous approach in analyzing time series).
Definition 7.11 (Confounder) A variable is a confounder when it influences both
the dependent variable and the independent variable, causing a spurious association.
The existence of confounders is an important quantitative explanation why “corre-
lation does not imply causation.” For example, in Fig. 7.3a, .x1 is a confounder for
.x2 and .x3 . The term “fork” is also used.
Fig. 7.3 Some examples of directed (acyclical) graphs, with three nodes, and two connections. (a)
corresponds to the case where .x1 is a confounder for .x2 and .x3 , corresponding to a common shock
or mutual dependence (also called “fork”), (b) corresponds to the case where .x2 is a mediator for
.x1 and .x3 (also called “chain”), and (c) corresponds to the case where .x3 is a collider for .x1 and
.x2 , corresponding to a mutual causation case. (a) Confounder (b) Mediator (c) Collider
Two nodes are connected when there is a path between the two variables, i.e., a succession of links. If there is no path between
two nodes, we say that the two variables are causally independent. And if there is
a directed path from node .x1 to node .x2 , then .x1 causally affects .x2 : if .x1 had been
different, .x2 would also have been different. Therefore, causality has a direction.
A collider is a variable that is the consequence of two or more variables, like .x3
in (c). A noncollider is a variable influenced by only one variable, and it allows
a consequence to be causally transmitted along a path. The causal variables that
influence the collider are themselves not necessarily associated, together. If this is
the case, the collider is said to be shielded and the variable is the vertex of a triangle.
For (a), $x_2\not\perp\!\!\!\perp x_3$ whereas $x_2\perp\!\!\!\perp x_3\mid x_1$; for (b), $x_1\not\perp\!\!\!\perp x_3$ whereas $x_1\perp\!\!\!\perp x_3\mid x_2$; and
for (c), $x_1\perp\!\!\!\perp x_2$ whereas $x_1\not\perp\!\!\!\perp x_2\mid x_3$.
In Fig. 7.4, $x_4$ is a "descendant" of $x_1$, a child of $x_2$ (and $x_3$), a parent of $x_5$ (and
$x_6$), and an "ancestor" of $x_7$. The variables $x_3$ and $x_5$ are not causally independent.
$x_4$ is a collider, but $x_6$ is not. $x_4$ is an unshielded collider because $x_2$ and $x_3$ (the two
parents of $x_4$) are not connected.
Fig. 7.5 On the directed graph of Fig. 7.4, examples of blocking a path. Path
.π= {x1 → x2 → x4 → x5 } (blue), from .x1 (blue) to .x5 (blue) is blocked by .x2 (red) (on
the left, (a)), and not blocked by .x3 (blue) (on the right, (b))
Definition 7.14 (Path) A path .π from a node .xi to another node .xj is a sequence
of nodes and edges starting at .xi and ending at .xj .
On the causal graph of Fig. 7.4, .π = {x1 → x2 → x4 → x5 } is a path from node
x1 to .x5 . To go further, a conditioning set .x c is simply a collection of nodes.
.
Definition 7.15 (Blocking Path) A path $\pi$ from a node $x_i$ to another node $x_j$ is
blocked by $x_c$ whenever there is a node $x_k$ such that either $x_k\in x_c$ and
$$\{x_{k^-}\to x_k\to x_{k^+}\}\quad\text{or}\quad \{x_{k^-}\leftarrow x_k\leftarrow x_{k^+}\}\quad\text{or}\quad \{x_{k^-}\leftarrow x_k\to x_{k^+}\},$$
or $x_k\notin x_c$, as well as all descendants of $x_k$, and $\{x_{k^-}\to x_k\leftarrow x_{k^+}\}$. In that case, write
$x_i\perp_{\mathcal{G}-\pi} x_j\mid x_c$.
On the causal graph of Fig. 7.5, the path .π = {x1 → x2 → x4 → x5 } (from .x1
to .x5 ) is blocked by .x2 , on the left-hand side, (a), and not blocked by .x3 , on the
right-hand side, (b).
Definition 7.16 (d-separation (nodes)) A node $x_i$ is said to be d-separated from
another node $x_j$ by $x_c$ whenever every path from $x_i$ to $x_j$ is blocked by $x_c$. We will
simply denote $x_i\perp_{\mathcal{G}} x_j\mid x_c$.
On the causal graph of Fig. 7.6, at the top, nodes .x1 and .x5 are d-separated by .x4 ,
at the top left (a), as no path that does not contain .x4 connects .x1 and .x5 . Similarly,
they are d-separated by nodes .(x2 , x3 ), at the top right (b), as there are no paths that
do not contain .x2 and .x3 that connects .x1 and .x5 . At the bottom, nodes .x1 and .x5 are
neither blocked by .x3 , bottom left (c), nor the pair .(x3 , x6 ), bottom right (d), as there
is a path that does not contain .x3 and .(x3 , x6 ) respectively, that connects .x1 and .x5 .
In both cases, path .π = {x1 → x2 → x4 → x5 } can be considered (it connects .x1
and .x5 , and does not contain .x3 and .x6 ).
Definition 7.17 (d-separation (sets)) A set of nodes $\boldsymbol{x}_i$ is said to be d-separated
from another set of nodes $\boldsymbol{x}_j$ by $\boldsymbol{x}_c$ whenever every path from any $x_i\in\boldsymbol{x}_i$ to any
$x_j\in\boldsymbol{x}_j$ is blocked by $\boldsymbol{x}_c$. We simply denote $\boldsymbol{x}_i\perp_{\mathcal{G}}\boldsymbol{x}_j\mid\boldsymbol{x}_c$.
Fig. 7.6 On the directed graph of Fig. 7.4, examples of d-separation. Nodes .x1 (blue) and .x5 (blue)
are d-separated by .x4 (red) (top left (a)), by .(x2 , x3 ) (red) (top right (b)), and not blocked by .x3
(blue) (bottom left (c)) and .(x3 , x6 ) (blue) (bottom right (d))
Proposition 7.9 Two nodes .xi and .xj are d-separated by .x c if and only if members
of .x c block all paths from .xi to .xj .
Definition 7.18 (Markov Property) Given a causal graph $\mathcal{G}$ with nodes $x$, the joint
distribution of $X$ satisfies the (global) Markov property with respect to $\mathcal{G}$ if, for any
disjoint sets $x_1$, $x_2$ and $x_c$,
$$x_1 \perp_{\mathcal{G}} x_2 \mid x_c \;\Rightarrow\; X_1 \perp\!\!\!\perp X_2 \mid X_c.$$
Recall the chain rule for probabilities,
$$\mathbb{P}[A_1\cap\cdots\cap A_n] = \mathbb{P}[A_1]\times\mathbb{P}_{A_1}[A_2]\times\mathbb{P}_{A_1\cap A_2}[A_3]\times\cdots\times\mathbb{P}_{A_1\cap\cdots\cap A_{n-1}}[A_n],$$
But this writing is not unique, because we could also write (for example)
The idea here is to write conditional probabilities involving only the variables
and their causal parents. For example, the graph (e) in Fig. 7.7 would correspond to
whereas graph (f) of Fig. 7.7 would be associated with the writing
because .x1 and .x2 are assumed to be independent. It is not uncommon to add a
Markovian assumption, corresponding to the case where each variable is indepen-
dent of all its ancestors conditional on its parents. For example, on the graph (e) of
Fig. 7.7, the Markov hypothesis allows
whereas graph (f) of Fig. 7.7 would be associated with the writing
For instance, in the diagram (a) of Fig. 7.3, the factorization is
$\mathbb{P}[x_1,x_2,x_3]=\mathbb{P}[x_1]\cdot\mathbb{P}[x_2\mid x_1]\cdot\mathbb{P}[x_3\mid x_1]$, such that
$$\mathbb{P}[x_2,x_3\mid x_1] = \frac{\mathbb{P}[x_1,x_2,x_3]}{\mathbb{P}[x_1]} = \mathbb{P}[x_2\mid x_1]\cdot\mathbb{P}[x_3\mid x_1],$$
and therefore $x_2\perp\!\!\!\perp x_3$ conditionally on $x_1$. In the diagram (b) of Fig. 7.3, the
factorization is $\mathbb{P}[x_1,x_2,x_3]=\mathbb{P}[x_1]\cdot\mathbb{P}[x_2\mid x_1]\cdot\mathbb{P}[x_3\mid x_2]$, such that
$$\mathbb{P}[x_1,x_3\mid x_2] = \frac{\mathbb{P}[x_1,x_2,x_3]}{\mathbb{P}[x_2]} = \frac{\mathbb{P}[x_1]\,\mathbb{P}[x_2\mid x_1]}{\mathbb{P}[x_2]}\,\mathbb{P}[x_3\mid x_2] = \mathbb{P}[x_1\mid x_2]\cdot\mathbb{P}[x_3\mid x_2],$$
and therefore $x_1\perp\!\!\!\perp x_3$ conditionally on $x_2$. Finally, in the diagram (c) of Fig. 7.3, the
factorization is $\mathbb{P}[x_1,x_2,x_3]=\mathbb{P}[x_1]\cdot\mathbb{P}[x_2]\cdot\mathbb{P}[x_3\mid x_1,x_2]$, such that
$$\mathbb{P}[x_1,x_2\mid x_3] = \frac{\mathbb{P}[x_1]\,\mathbb{P}[x_2]\,\mathbb{P}[x_3\mid x_1,x_2]}{\mathbb{P}[x_3]} \neq \mathbb{P}[x_1\mid x_3]\cdot\mathbb{P}[x_2\mid x_3]\ \text{ in general},$$
and therefore $x_1$ and $x_2$ are not independent, conditional on $x_3$.
See Côté et al. (2023) for more details about causal graphs, conditional indepen-
dence and fairness, in the context of insurance.
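The three structures of Fig. 7.3 can also be explored by simulation; in the sketch below (base R, arbitrary Gaussian noises), marginal dependence is measured with a plain correlation, and conditional dependence with the correlation of regression residuals, which is a valid check in this linear Gaussian setting.

    set.seed(42)
    n <- 1e5
    # (a) fork: x1 -> x2 and x1 -> x3
    x1 <- rnorm(n); x2 <- x1 + rnorm(n); x3 <- x1 + rnorm(n)
    c(cor(x2, x3), cor(resid(lm(x2 ~ x1)), resid(lm(x3 ~ x1))))  # dependent, independent given x1
    # (b) chain: x1 -> x2 -> x3
    x1 <- rnorm(n); x2 <- x1 + rnorm(n); x3 <- x2 + rnorm(n)
    c(cor(x1, x3), cor(resid(lm(x1 ~ x2)), resid(lm(x3 ~ x2))))  # dependent, independent given x2
    # (c) collider: x1 -> x3 <- x2
    x1 <- rnorm(n); x2 <- rnorm(n); x3 <- x1 + x2 + rnorm(n)
    c(cor(x1, x2), cor(resid(lm(x1 ~ x3)), resid(lm(x2 ~ x3))))  # independent, dependent given x3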
Fig. 7.8 Illustration of the .do operator, with two forks, (a) and (d), z being a collider on (a) and
a confounder on (d), and two chains, (b) and (c). At the top, the causal graphs, and at the bottom,
the implied graphs when an intervention on x, corresponding to “.do(x)” is considered. (a) Fork z
collider (b) Chain (c) Chain (d) Fork z confounder
the intervention .do(X = x) means that all incoming edges to x are cut. Hence,
.P [Y ∈ A|do(X = x)] can be seen as a .Q[Y ∈ A|X = x], where the causal graph
has been manipulated. On the two graphs on the left-hand side of Fig. 7.8a and b,
$\mathbb{P}[Y\in A\mid do(X = x)] = \mathbb{P}[Y\in A\mid X = x]$. On the two graphs on the right-hand side (Fig. 7.8c and d), $\mathbb{P}[Y\in A\mid do(X = x)]$ and $\mathbb{P}[Y\in A\mid X = x]$ differ, in general.
For simplicity, we present here linear Gaussian structural causal models, associated
with an acyclic causal graph. The definition is very close to the "algorithmic"
definition of a Markov chain, where
$$\big(X_t\mid X_{t-1}=x_{t-1},\, X_{t-2}=x_{t-2},\cdots, X_1=x_1,\, X_0=x_0\big)\overset{\mathcal{L}}{=}\big(X_t\mid X_{t-1}=x_{t-1}\big),$$
and equivalently, there are independent variables .(Ut ) and a measurable function h
such that .Xt = h(Xt−1 , Ut ). In the case of a causal graph, quite naturally, if C is a
cause, and E the effect, we should expect to have .E = h(C, U ) for some measurable
function h and some random noise U . This is the idea of structural models.
Definition 7.19 (Structural Causal Models (SCM) (Pearl 2009b)) In a simple
causal graph, with two nodes C (the cause) and E (the effect), the causal graph
is .C → E, and the mathematical interpretation can be summarized in two
assignments:
$$\begin{cases} C = h_c(U_C)\\ E = h_e(C, U_E),\end{cases}$$
where .UC and .UE are two independent random variables, .UC ⊥⊥ UE .
More generally, a structural causal model is a triplet .(U , V , h), as in Pearl
(2010) or Halpern (2016). The variables in .U are called exogenous variables, in
other words, they are external to the model (we do not have to explain how they
are caused). The variables in .V are called endogenous. Each endogenous variable
is a descendant of at least one exogenous variable. Exogenous variables cannot
be descendants of any other variable and, in particular, cannot be descendants
of an endogenous variable. Also, they have no ancestors and are represented as
roots in causal graphs. Finally, if we know the value of each exogenous variable,
we can, using .h functions, determine with perfect certainty the value of each
endogenous variable. The causal graphs we have described consist of a set of n
nodes representing the variables in .U and .V , and a set of edges between the n
nodes representing the functions in .h. Observe that we consider acyclical graphs,
not only for a mathematical reason (to ensure that the model is solvable) but also
for interpretation: a cycle between x, y, and z would mean that x causes y, y causes
z, and z causes x. In a static setting (such as the one we consider here), that is not
possible.
In the causal diagram (a) in Fig. 7.9, we have two endogenous variables, x and
y, and two exogenous variables, $u_x$ and $u_y$. The diagram (a) is a representation
of the real world, but we assume here that it is possible to make interventions,
and to change the value of x, all other things remaining equal. We use here
the notation $Y^*$ to denote the "potential" outcome if an intervention were to be
considered.
Fig. 7.10 Two causal diagrams, $x\to y$, with a mediator m on the left-hand side ((a) and (b)) and
a confounding factor w on the right-hand side ((c) and (d)), with an intervention on x. Variables u
are exogenous
In the causal diagram (a) in Fig. 7.10, we have three endogenous variables, x, y,
and a mediator m, and three exogenous variables, $u_x$, $u_y$ and $u_m$. The diagram (a) is
a representation of the real world, but as before, it is assumed here that it is possible
to make interventions on X. In the causal diagram (c) of Fig. 7.10, we have three
endogenous variables, x, y, and a confounding factor w, and three exogenous variables,
$u_x$, $u_y$ and $u_w$. Diagram (c) is again a representation of the real world, where it is
assumed possible to make interventions on X. In other words,
$$\begin{cases}\text{mediator}: & \mathbb{P}[Y^*_x=1] = \mathbb{P}[Y=1\mid do(X=x)] = \mathbb{P}[Y=1\mid X=x]\\ \text{confounder}: & \mathbb{P}[Y^*_x=1] = \mathbb{P}[Y=1\mid do(X=x)] \neq \mathbb{P}[Y=1\mid X=x].\end{cases}$$
In the case of a confounder, the interventional probability is obtained with the adjustment (back-door) formula,
$$\mathbb{P}[Y=1\mid do(X=x)] = \sum_w \mathbb{P}[Y=1\mid W=w, X=x]\cdot\mathbb{P}[W=w] = \mathbb{E}\big(\mathbb{P}[Y=1\mid W, X=x]\big).$$
If a logistic regression of y on (x, w) is estimated, write
$$\widehat{\mu}_x(w) = \frac{\exp[\widehat{\beta}_0+\widehat{\beta}_x x+\widehat{\beta}_w w]}{1+\exp[\widehat{\beta}_0+\widehat{\beta}_x x+\widehat{\beta}_w w]},$$
and the average causal effect can be estimated as
$$\widehat{\mathrm{ACE}} = \frac{1}{n}\sum_{i=1}^n\big(\widehat{\mu}_1(w_i)-\widehat{\mu}_0(w_i)\big).$$
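A minimal sketch of this adjustment estimator in R, on simulated data with a binary treatment x, a confounder w and a binary outcome y (the data-generating process and all parameter values are assumptions of the illustration, not taken from the text):

    set.seed(1)
    n <- 1e4
    w <- rnorm(n)                                     # confounder
    x <- rbinom(n, 1, plogis(w))                      # treatment, influenced by w
    y <- rbinom(n, 1, plogis(-1 + x + 1.5 * w))       # outcome
    fit <- glm(y ~ x + w, family = binomial)
    mu1 <- predict(fit, newdata = data.frame(x = 1, w = w), type = "response")
    mu0 <- predict(fit, newdata = data.frame(x = 0, w = w), type = "response")
    mean(mu1 - mu0)                                   # adjusted (ACE) estimate
    mean(y[x == 1]) - mean(y[x == 0])                 # naive contrast, distorted by the confounder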
Pearl (1998) writes "$do(X = x)$" (or simply $do(x)$) for the intervention assigning the
value x to X (historically, from Wright (1921) to Holland (1986), via Neyman et al.
(1923) or Rubin (1974), various notations have been proposed). To use the notation
and interpretation of Pearl (2010), "Y would be $y^*_x$, had X been x in situation $u_y$"
will be written $Y^*_x(u_y) = y$, since the structural error $u_y$ has not been impacted by
an intervention on x.
In probabilistic terminology, .P[Y = y|X = x] denotes the population
distribution of Y among individuals whose X value is x. Here, .P[Y = y|do(X = x)]
represents the population distribution of Y if all individuals in the population had
their X value set to x. And more generally, .P(Y = y|do(X = x), Z = z)
denotes the conditional probability that .Y = y, given .Z = z, in the distribution
created by the intervention $do(X = x)$. Also, in the literature, the average causal
effect (ACE) corresponds to $\mathbb{E}[Y\mid do(X=1)] - \mathbb{E}[Y\mid do(X=0)]$, or $\overline{Y}_1 - \overline{Y}_0$ if
$\overline{Y}_x = \mathbb{E}[Y\mid do(X=x)]$ (which we also note hereafter $Y^*_{X\leftarrow 1} - Y^*_{X\leftarrow 0}$, as in
Russell). The interventional distribution can be obtained by adjusting for the direct
causes of X,
$$\mathbb{P}[y\mid do(x)] = \sum_{z}\mathbb{P}[y\mid x, z]\cdot\mathbb{P}[z],$$
where $PA$ denotes the parents of x, and z covers all combinations of values that
the variables in $PA$ can take. A sufficient condition for identifying the causal effect
.P(y|do(x)) is that each path between X and one of its children traces at least one
arrow emanating from the measured variable, as in Tian and Pearl (2002).
To illustrate this difference between the intervention (via the do operator) and
conditioning, consider the causal graph (c) of Fig. 7.10, discussed in de Lara (2023),
based on the following structural causal model, with additive functions,
$$\begin{cases} W = h_w(u_w) = u_w\\ X = h_x(w, u_x) = w + u_x\\ Y = h_y(x, w, u_y) = x + w + u_y.\end{cases}$$
As mentioned in Bongers et al. (2021), structural causal models are not always
solvable; this is why the "acyclicity" assumption is important, because it ensures
unique solvability. If we consider now a "do intervention," with $do(x = 0)$, we have
$$\begin{cases} X = 0\\ W^*_{x\leftarrow 0} = u_w\\ Y^*_{x\leftarrow 0} = x + w + u_y,\end{cases}\qquad\text{or}\qquad\begin{cases} X = 0\\ W^*_{x\leftarrow 0} = u_w\\ Y^*_{x\leftarrow 0} = u_w + u_y.\end{cases}$$
Thus, on the one hand, observe that $W\mid X=0$ has the same distribution as $U_w$
conditional on $U_x + U_w = 0$, i.e., $W\mid X=0$ has the same distribution as $-U_x$. On
the other hand, the distribution of $W^*_{x\leftarrow 0}$ is that of $U_w$. Thus, generally,
$(W\mid X=0)\overset{\mathcal{L}}{\neq} W^*_{x\leftarrow 0}$.
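The difference between conditioning and intervening in this example is easy to visualize by simulation; a sketch in base R, where the exogenous noises are taken standard Gaussian (an assumption of the illustration only):

    set.seed(1)
    n  <- 1e5
    uw <- rnorm(n); ux <- rnorm(n); uy <- rnorm(n)
    # observational world
    W <- uw; X <- W + ux; Y <- X + W + uy
    sel <- abs(X) < 0.05                # conditioning on X close to 0
    c(mean(W[sel]), var(W[sel]))        # W | X = 0: variance about 1/2, not 1
    # interventional world, do(X = 0): incoming edges into X are cut
    Wstar <- uw; Xstar <- rep(0, n); Ystar <- Xstar + Wstar + uy
    c(mean(Wstar), var(Wstar))          # W* keeps the distribution of Uw (variance 1)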
7.4.1 Counterfactuals
Pearl and Mackenzie (2018) noted that causal inference was intended to answer the
question “what would have happened if...?” This question is central in epidemiology
(what would have happened if this person had received the treatment?) or as soon as
we try to evaluate the impact of a public policy (e.g., what would have happened
if we had not removed this tax?). But we note that this is the question we ask
as soon as we talk about discrimination (e.g., what would have happened if this
woman had been a man?). In causal inference, in order to quantify the effect of
a drug or a public policy measure, two groups are constituted, one that receives
the treatment and another that does not, and therefore serves as a counterfactual, in
order to answer the question “what would have happened if the same person had
had access to the treatment?” When analyzing discrimination, similar questions are
asked, for example, “would the price of risk be different if the same person had been
a man and not a woman?,” except that here gender is not a matter of choice, of an
arbitrary assignment to a treatment (random in so-called randomized experiments).
In fact, this parallel between discrimination analysis and causal inference was
initially criticized: changing treatment is possible, whereas sex change is a product
of the imagination. One can also think of the questions regarding the links between
smoking and certain cancers: seeing smoking as a “treatment” may make sense
mathematically, but ethically, one could not force someone to smoke just to quantify
the probability of getting cancer a few years later1 (whereas in a clinical experiment,
one could imagine that a patient is given the blue pills, instead of the red pills). We
enter here into the category of the so-called “quasi-experimental” approaches, in the
sense of Cook et al. (2002) and DiNardo (2016).
In the data, y (often called “outcome”) is the variable that we seek to model
and predict, and which will serve as a measure of treatment effectiveness. The
potential outcomes are the outcomes that would be observed under each possible
treatment, and we note $y^*_t$ the outcome that would be observed if the treatment T
had taken the value t. And the counterfactual outcomes are what would have been
observed if the treatment had been different; in other words, for a person of type t,
the counterfactual outcome is $y^*_{1-t}$ (because t takes the values $\{0, 1\}$). The typical
example is that of a person who received a vaccine ($t = 1$), who did not get sick
($y = 0$), whose counterfactual outcome would be $y^*_0$, sometimes noted $y^*_{t\leftarrow 0}$. Before
launching the vaccine efficacy study, the two outcomes are potential, $y^*_0$ and $y^*_1$.
Once the study is launched, the observed outcome will be y and the counterfactual
outcome will be $y^*_{1-t}$. Note that different notations are used in the literature, $y(1)$
and $y(0)$ in Imbens and Rubin (2015), $y^1$ and $y^0$ in Cunningham (2021), or $y_{t=1}$ and
$y_{t=0}$ in Pearl and Mackenzie (2018). Here, we use $y^*_{t\leftarrow 1}$ and $y^*_{t\leftarrow 0}$, the star being a
reminder that those quantities are potential outcomes, as shown in Table 7.2.
The treatment formally corresponds, in our vaccine example, to an intervention,
which is formally a shot given to a person, or a pill that the person must swallow.
In this section, it is not possible to manipulate the variable whose causal effect we
want to measure. In the Introduction, we mentioned the idea that body mass index
(BMI) could have an impact on health status, but BMI is not a pill, it is an observed
quantity. It could be possible to manipulate variables that will have an impact on the
1 In a humorous article, Smith and Pell (2003) asked the question of setting up randomized
experiments to prove the causal link between having a parachute and surviving a plane crash.
Table 7.2 Excerpt of a standard table, with observed data $t_i$, $x_i$, $y_i$, and potential outcomes
$y^*_{i,T\leftarrow 0}$ and $y^*_{i,T\leftarrow 1}$, respectively when treatment (t) is either 0 or 1. The question mark ?
corresponds to the unobserved outcome, and will correspond to the counterfactual value of the
observed outcome

  i   Treatment t_i   Outcome y_i   y*_{i,T←0}   y*_{i,T←1}   Features x_i   ···
  1   0               75            75           ?            172            ···
  2   1               52            ?            52           161            ···
  3   1               57            ?            57           163            ···
  4   0               78            78           ?            183            ···
index (by forcing a person to practice sports regularly, change their eating habits,
etc.), so that one is not measuring strictly speaking the causal effect of the BMI,
but rather that of the interventions that influence the index. In the same way, it is
impossible to intervene on certain variables, said to be immutable, such as gender
or racial origin. The counterfactual is then purely hypothetical. Dawid (2000) was
very critical of the idea that we can create (or observe) a counterfactual, because “by
definition, we can never observe such [counterfactual] quantities, nor can we assess
empirically the validity of any modelling assumption we may make about them, even
though our conclusions may be sensitive to these assumptions.”
We will say that there is a causal effect (or “identified causal effect”) of a (binary)
treatment t on an outcome y if y0* and y1* are significantly different. And as we
cannot observe these variables at the individual level, we compare the effect on sub-
populations, as shown by Rubin (1974), Hernán and Robins (2010), or Imai (2018).
Quite naturally, one might want to measure the causal effect as the difference in y
between the two groups, the treated (t = 1) and the untreated (t = 0), but unless
additional assumptions are made, this difference does not correspond to the average
causal effect (ATE, “average treatment effect”). But let us formalize a little bit more
the different concepts used here.
Definition 7.22 (Average Treatment Effect (Holland 1986)) Given a treatment
T, the average treatment effect on outcome Y is
$$\tau = \mathrm{ATE} = \mathbb{E}\big[Y^*_{t\leftarrow 1} - Y^*_{t\leftarrow 0}\big].$$
A naive estimate of this quantity is $\widehat{\tau}_{\text{naive}} = \overline{y}_1 - \overline{y}_0$,
where $\overline{y}_1$ is the average outcome of treated observations ($t_i = 1$), and $\overline{y}_0$ is the
average outcome of individuals in the control group ($t_i = 0$). Observe that $\overline{y}_1$ and
$\overline{y}_0$ are unbiased estimates of $\mathbb{E}[Y\mid T=1]$ and $\mathbb{E}[Y\mid T=0]$ respectively. Therefore,
$\widehat{\tau}_{\text{naive}}$ is an unbiased estimate of $\mathbb{E}[Y\mid T=1]-\mathbb{E}[Y\mid T=0]$, which differs, in general, from the average treatment effect.
For simplicity, assume that all confounders have been identified, and are
categorical variables. Within a class (or stratum) x, there is no confounding effect,
and therefore, the causal effect can be identified by naive estimation, and the overall
average treatment effect is identified by aggregation: by the law of total probability,
$\mathbb{P}[Y^*_{T\leftarrow 1}=y]$ is equal to
$$\sum_x \mathbb{P}\big[Y^*_{T\leftarrow 1}=y\mid X=x\big]\,\mathbb{P}[X=x] = \sum_x \mathbb{P}\big[Y^*_{T\leftarrow 1}=y\mid X=x, T=1\big]\,\mathbb{P}[X=x].$$
The empirical counterpart, where probabilities are replaced by empirical proportions
and group averages, is also called the "exact matching estimate." Here, $\widehat{p}_n(X=x)$ is the
proportion of stratum x in the training dataset (or $\mathbb{P}_n(X=x)$ with the notations of Part I).
Unfortunately, strata might be sparse, and that will generate a lot of variability.
So, instead of considering all possible strata, it is possible to create a score, such
that conditional on the score, we would have independence. This is the idea of a
“balancing score.”
Definition 7.27 (Balancing Score (Rosenbaum and Rubin 1983)) A balancing
score is a function b(X) such that
$$X \perp\!\!\!\perp T \mid b(X).$$
Proof The complete proof can be found in Rosenbaum and Rubin (1983) and
Borgelt et al. (2009). It is a direct consequence of the so called “contraction”
property, in the sense that Y ⊥⊥ T | B and Y ⊥⊥ B imply Y ⊥⊥ (T , B). See Zenere
et al. (2022) for more details about balancing scores and conditional independence
properties. □
A popular balancing score is the "propensity score." With our previous notations,
let $n(x)$ be the number of observations (out of n) within stratum x, so that
$\widehat{p}_n(x) = n(x)/n$. The propensity score estimate will be $\widehat{e}_n(x) = n_1(x)/n(x)$, where
$n_1(x)$ is the number of treated individuals in stratum x. And one can write
$$\widehat{\tau}_{\text{strata}} = \sum_x \frac{n(x)}{n}\,\widehat{\tau}_{\text{naive}}(x) = \frac{1}{n}\sum_{i=1}^n\left(\frac{y_i\,\mathbf{1}(t_i=1)}{\widehat{e}(x_i)} - \frac{y_i\,\mathbf{1}(t_i=0)}{1-\widehat{e}(x_i)}\right).$$
With a few calculations (see the next section for the probabilistic version), we can
actually write the latter as
$$\widehat{\tau}_{\text{strata}} = \frac{1}{n}\sum_e\ \sum_{i:\,\widehat{e}(x_i)=e}\left(\frac{y_i\,\mathbf{1}(t_i=1)}{e} - \frac{y_i\,\mathbf{1}(t_i=0)}{1-e}\right) = \sum_e \frac{n(e)}{n}\,\widehat{\tau}_{\text{naive}}(e),$$
where we recognize a matching estimator, not on the strata x but on the score.
The interpretation is that, conditional on the score, we can pretend that data were
collected through some randomization process.
Definition 7.28 (Propensity Score (Rosenbaum and Rubin 1983)) The propensity
score $e(x)$ is the probability of being assigned to a particular treatment given a
set of observed covariates. For a binary treatment ($t\in\{0,1\}$), $e(x) = \mathbb{P}[T=1\mid X=x]$.
Define a new probability measure $\mathbb{Q}$ by re-weighting $\mathbb{P}$, with
$$\frac{d\mathbb{Q}}{d\mathbb{P}} = \frac{1}{2}\Big(\frac{T}{e(X)}+\frac{1-T}{1-e(X)}\Big),$$
so that $\mathbb{Q}$ is such that $\mathbb{Q}(T=0)=\mathbb{Q}(T=1)$, and under $\mathbb{Q}$, $T\perp\!\!\!\perp X$. This means
that the pseudo population (obtained by re-weighting) looks as if the treatment was
randomly allocated by tossing an unbiased coin, and
$$\begin{aligned}
\mathbb{E}_{\mathbb{Q}}[Y\mid T=1] &= \frac{\mathbb{E}_{\mathbb{Q}}[TY]}{\mathbb{Q}(T=1)} = 2\cdot\frac{1}{2}\cdot\mathbb{E}_{\mathbb{P}}\Big[Y\,T\Big(\frac{T}{e(X)}+\frac{1-T}{1-e(X)}\Big)\Big]\\
&= \mathbb{E}_{\mathbb{P}}\Big[\frac{T}{e(X)}\cdot Y\Big] = \mathbb{E}_{\mathbb{P}}\Big[\mathbb{E}_{\mathbb{P}}\Big[\frac{T}{e(X)}\cdot Y\,\Big|\,X\Big]\Big] = \mathbb{E}_{\mathbb{P}}\Big[\frac{\mathbb{E}_{\mathbb{P}}[T\,Y\mid X]}{e(X)}\Big]\\
&= \mathbb{E}_{\mathbb{P}}\Big[\frac{\mathbb{E}_{\mathbb{P}}[T\,Y^*_{T\leftarrow 1}\mid X]}{e(X)}\Big] = \mathbb{E}_{\mathbb{P}}\Big[\frac{e(X)\cdot\mathbb{E}_{\mathbb{P}}[Y^*_{T\leftarrow 1}\mid X]}{e(X)}\Big]\\
&= \mathbb{E}_{\mathbb{P}}\big[\mathbb{E}_{\mathbb{P}}[Y^*_{T\leftarrow 1}\mid X]\big] = \mathbb{E}_{\mathbb{P}}\big[Y^*_{T\leftarrow 1}\big],
\end{aligned}$$
and
$$\mathbb{E}_{\mathbb{Q}}[Y\mid T=0] = \mathbb{E}_{\mathbb{P}}\Big[\frac{1-T}{1-e(X)}\cdot Y\Big] = \mathbb{E}_{\mathbb{P}}\big[Y^*_{T\leftarrow 0}\big].$$
Thus, if we combine,
$$\mathbb{E}_{\mathbb{P}}\big[Y^*_{T\leftarrow 1}-Y^*_{T\leftarrow 0}\big] = \mathbb{E}_{\mathbb{P}}\Big[\frac{T}{e(X)}\cdot Y\Big] - \mathbb{E}_{\mathbb{P}}\Big[\frac{1-T}{1-e(X)}\cdot Y\Big].$$
The price to pay to be able to identify the average treatment effect under .P is that
we need to estimate the propensity score e (see Kang and Schafer (2007) or Imai
and Ratkovic (2014)). We return to the Radon–Nikodym derivative and weights in
Sect. 12.2.
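A minimal sketch of this weighting estimator in R, where the propensity score is estimated with a logistic regression (a common choice for the illustration, not prescribed by the text) on simulated data with a known treatment effect:

    set.seed(2)
    n  <- 1e4
    x  <- rnorm(n)                                   # confounder
    tr <- rbinom(n, 1, plogis(x))                    # treatment
    y  <- 2 * tr + x + rnorm(n)                      # outcome, true ATE equal to 2
    e  <- fitted(glm(tr ~ x, family = binomial))     # estimated propensity score
    mean(tr * y / e) - mean((1 - tr) * y / (1 - e))  # inverse-propensity weighted ATE estimate
    mean(y[tr == 1]) - mean(y[tr == 0])              # naive estimate, biased upwards here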
Importance sampling is a classical technique, popular when considering Monte
Carlo simulations to compute some quantities efficiently. Recall that Monte Carlo is
simply based on the law of large numbers: if we can draw i.i.d. copies of a random
variable .Xi ’s, under probability .P, then
$$\frac{1}{n}\sum_{i=1}^n h(x_i) \to \mathbb{E}_{\mathbb{P}}[h(X)],\quad\text{as } n\to\infty.$$
And much more can be obtained, as the empirical distribution .Pn (associated with
sample .{x1 , · · · , xn }) converges to .P as .n → ∞ (see, for example, Van der Vaart
2000).
Now, assume that we have an algorithm to draw efficiently i.i.d. copies of a
random variable .Xi , under probability .Q, and we still want to compute .EP [h(X)].
The idea of importance sampling is to use some weights,
$$\frac{1}{n}\sum_{i=1}^n \underbrace{\frac{d\mathbb{P}(x_i)}{d\mathbb{Q}(x_i)}}_{\omega_i}\, h(x_i) \to \mathbb{E}_{\mathbb{P}}[h(X)],\quad\text{as } n\to\infty,$$
where the weights are simply based on the likelihood ratio of $\mathbb{P}$ over $\mathbb{Q}$. To introduce
notations that we use afterwards, define
$$\widehat{\mu}_{is} = \frac{1}{n}\sum_{i=1}^n \frac{d\mathbb{P}(x_i)}{d\mathbb{Q}(x_i)}\, h(x_i).$$
At the top of Fig. 7.11, we supposed that we had a nice code to generate a Poisson
distribution .P(8); unfortunately, we want to generate some Poisson .P(5). At the
bottom, we consider the opposite: we can generate some .P(5) variable, but we want
a .P(8). On the left, the values of the weights, .dP(x)/dQ(x), with .x ∈ N. In the
center, the histogram of .n = 500 observations from the algorithm we have (.P(8) at
the top, .P(5) at the bottom), and on the right, a weighted histogram for observations
that we wish we had, mixing the first sample and appropriate weights (.P(5) at the
top, .P(8) below). Below, we generate data from .P(5), and the largest observation
was here 13 (before, all values from 0 to 11 were obtained). As we can see on the
right, it is not possible to get data outside the range of data initially obtained. Clearly,
this approach works well only when the supports are close. The weighted histogram
was obtained using wtd.hist, in package weights.
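The weights of Fig. 7.11 can be reproduced in a few lines of base R; the sketch below re-weights a sample drawn from P(8) to approximate an expectation under P(5) (only a weighted mean is computed, rather than the weighted histogram of the figure).

    set.seed(3)
    n <- 500
    x <- rpois(n, 8)                   # sample from the available generator, P(8)
    w <- dpois(x, 5) / dpois(x, 8)     # importance weights dP(x)/dQ(x)
    mean(w * x)                        # importance sampling estimate of E[X] under P(5), close to 5
    mean(x)                            # plain sample mean, close to 8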
In our context, one can define the importance sampling estimator of $\mathbb{E}[Y^*_{T\leftarrow 1}]$ as
$$\widehat{\mu}_{is}\big(Y^*_{T\leftarrow 1}\big) = \frac{1}{n_1}\sum_{t_i=1}\frac{y_i}{e(x_i)}\cdot\frac{n_1}{n} = \frac{1}{n}\sum_{t_i=1}\frac{y_i}{e(x_i)},$$
where $n_1$ denotes the number of treated individuals.
Fig. 7.11 Illustration of the importance sampling procedure, and the use of weights
(.dP(x)/dQ(x)). At the top, we have an algorithm to generate a Poisson distribution .P(8) (in
the middle), where thin lines represent the theoretical histogram, the plain boxes the empirical
histogram). We distort that sample to generate some Poisson .P(5), and we have the histogram
on the right-hand side (using function wtd.hist from package weights), with the empirical
histogram in plain boxes. Thin lines represent the theoretical histogram associated with the .P(5)
distribution. Below, it is the opposite, we can generate .P(5) samples, and distort them to get a .P(8)
sample
The effect of health insurance coverage on an individual's health can then be decomposed
into an indirect effect, through the incidence of a regular medical checkup (X), and a direct
effect entailing any other causal mechanisms. Whether or not an individual undergoes routine checkups
appears to be an interesting mediator, as it is likely to be affected by health insurance
coverage and may itself have an impact on the individual’s health (simply because
checkups can help to identify medical conditions before they become serious).
Another classical application of causal inference and predictive modeling could
be "uplift modeling." The idea is to model the impact of a treatment (such as
a direct marketing action) on an individual's behavior. Those ideas were formalized
more than 20 years ago, in Hansotia and Rukstales (2002) or Hanssens et al.
(2003). In Radcliffe and Surry (1999), the term "true response modelling" was used,
Lo (2002) used "true lift," and finally Radcliffe (2007) suggested techniques for
(2002) used two models, estimated separately (namely two logistic regressions), one
for the treated individuals, and one for the nontreated ones and Lo (2002) suggested
an interaction model, where interaction terms between predictive variables .x and
the treatment t are added. Over the past 20 years, several papers appeared to apply
those techniques, in personalized medicine, such as Nassif et al. (2013), but also in
insurance, with Guelman et al. (2012, 2014) and Guelman and Guillén (2014).
Part III
Fairness
For leaders today—both in business and regulation—the dominant theme of 21st century
financial services is fast turning out to be a complicated question of fairness, Wheatley
(2013), Chief Executive of the FCA, at Mansion House, London
When you can measure what you are speaking about, and express it in numbers, you
know something about it; but when you cannot measure it, when you cannot express
it in numbers, your knowledge is of a meagre and unsatisfactory kind1 : it may be the
beginning of knowledge, but you have scarcely, in your thoughts, advanced to the stage of
science, whatever the matter may be, Lord Kelvin, Thomson (1883)
Voulez-vous croire au réel, mesurez-le,
Do you want to believe in reality? Measure it, Bachelard (1927)
1 Sociologist William Ogburn, a onetime head of the Social Sciences Division at the University of
Chicago, was responsible for perhaps the most contentious carving on campus. Curving around an
oriel window facing 59th Street is the quote from Lord Kelvin: “when you cannot measure, your
knowledge is meager and unsatisfactory”. Leu (2015) mentioned that in a 1939 symposium,
economist Frank Knight Jr. snarkily suggested that the quote should be changed to “if you cannot
measure, measure anyhow”.
Chapter 8
Group Fairness
Sect. 8.1) when s was not used (also denoted “without sensitive” in the applications).
But it is also possible to include s, and the score is .m(x, s) ∈ [0, 1]. Four different
models are considered here (and trained on the complete toydata2 dataset): a
plain logistic regression (GLM, fitted with glm), an additive logistic regression
Fig. 8.1 Scores $m(x_i, s_i)$, on top ("with sensitive" attribute) or $m(x_i)\in[0,1]$ at the bottom
("without sensitive" attribute), for the different models, fitted on the toydata2 dataset. Colors
correspond to the value of y, with $\{0,1\}$, and the shape corresponds to the value of s (filled square
and filled circle correspond to A and B respectively)
(GAM, with splines, for the three continuous variables, using gam from the mgcv
package), a classification tree (CART, fitted using rpart), and a random forest
(fitted using randomForest). Details on those .n = 40 individuals are given in
Table 8.1. In Fig. 8.1, at the top, the x-values (from the left to the right) of the .n = 40
points correspond to values of $m(x_i)$'s in $[0,1]$. Colors correspond to the value of
y, with $\{0,1\}$, and the shape corresponds to the value of s (█ and • correspond to A
and B respectively). As expected, individuals associated with $y_i = 0$ are more likely
to be on the left-hand side (small .m(x i )’s).
In Fig. 8.2, instead of visualizing .n = 40 individuals on a scatter-plot (as
in Fig. 8.1), we can visualize the distribution of scores, that is, the distribution
of .m(x i , si ) (“with sensitive” attribute) or .m(x i ) ∈ [0, 1] (“without sensitive”
attribute) using box plots respectively when .yi = 0 and .yi = 1 at the top, and
respectively when .si = A and .si = B at the bottom. For example at the bottom of the
two graphs, the two boxes correspond to prediction .m(x i ) when a logistic regression
is considered. At the top, we have the distinction .yi = 0 and .yi = 1: for individuals
.yi = 0, the median value for .m(x i ) is 23% with the logistic regression, whereas it is
60% for individuals .yi = 1. All models here were able to “discriminate” according
to the risk of individuals. Below, we have the distinction .si = A and .si = B: for
individuals .si = A, the median value for .m(x i ) is 30%, whereas it is 47% for
individuals $s_i = B$. All models here also "discriminate," to some extent, according to
the sensitive attribute. Based on the discussion we had in the introduction, in Sect. 1.1.6 (and
Fig. 1.2 on the compas dataset respectively with Dieterich et al. (2016) and Feller
et al. (2016) interpretations), if the premium is proportional to .m(x i ), individuals in
group .B would be asked, on average, a higher premium than individuals in group
.A, and that difference could be perceived as discriminatory. When comparing box
plots at the top and at the bottom, at least, observe that all models “discriminate”
more, between groups, based on the true risk y than based on the sensitive attribute
s. In Fig. 8.3 we can visualize survival functions (on the small dataset, with .n = 40
Table 8.1 The small version of toydata2, with .n = 40 observations. p corresponds to the
“true probability” (used to generate y), and .m(x) is the predicted probability, from a plain logistic
regression
.x1 -1.220 -1.280 -0.930 -0.330 0.050 1.000 -0.070 -1.340 -1.800 -1.890
.x2 3.700 9.600 4.500 5.800 2.800 1.200 1.400 3.800 5.900 3.200
.x3 -1.190 -0.410 -1.050 -0.660 -0.890 0.250 -1.530 -0.300 -1.780 -0.570
s A A A A A A A A A A
y 0 0 0 0 0 0 0 0 0 0
p 0.090 0.500 0.130 0.270 0.14 0.220 0.070 0.090 0.160 0.060
.m(x) 0.099 0.506 0.167 0.379 0.22 0.312 0.121 0.102 0.116 0.049
.x1 -0.330 0.800 -1.040 -0.160 -0.990 -0.440 -0.220 -3.200 -1.720 -1.090
.x2 7.900 9.200 9.400 9.800 6.700 7.900 9.700 3.700 2.700 9.600
.x3 -0.230 -0.650 -1.530 -1.650 -1.150 -0.030 -0.040 -3.440 -0.700 -1.620
s A A A A A A A A A A
y 1 1 1 1 1 1 1 1 1 1
p 0.460 0.840 0.500 0.690 0.260 0.440 0.660 0.100 0.050 0.510
.m(x) 0.585 0.867 0.511 0.739 0.297 0.565 0.759 0.012 0.047 0.515
.x1 1.480 0.720 0.400 -0.470 -0.230 -0.820 0.740 1.440 0.200 -0.610
.x2 3.000 0.000 4.800 5.300 6.500 0.500 0.700 1.200 3.000 3.900
.x3 3.520 1.690 0.080 0.910 0.220 1.220 0.560 1.620 0.190 -0.010
s B B B B B B B B B B
y 0 0 0 0 0 0 0 0 0 0
p 0.670 0.160 0.460 0.310 0.470 0.050 0.210 0.480 0.260 0.190
.m(x) 0.681 0.209 0.485 0.350 0.494 0.063 0.233 0.453 0.287 0.199
.x1 0.820 2.040 1.740 1.580 1.550 1.030 0.440 3.090 2.180 2.300
.x2 4.200 2.800 0.400 2.300 9.600 8.500 4.200 5.900 7.100 4.900
.x3 2.920 2.480 2.010 1.570 2.830 1.240 0.570 3.510 1.260 1.200
s B B B B B B B B B B
y 1 1 1 1 1 1 1 1 1 1
p 0.540 0.840 0.540 0.650 0.970 0.900 0.420 1.000 0.980 0.960
.m(x) 0.619 0.75 0.463 0.587 0.961 0.889 0.455 0.968 0.936 0.878
Fig. 8.2 Box plot of the scores .m(x i , si ) (with sensitive attribute) or .m(x i ) ∈ [0, 1] (without
sensitive attribute), for the different models, conditional on .y ∈ {0, 1} at the top and conditional on
.s ∈ {A, B} at the bottom
Table 8.2 Kolmogorov–Smirnov test, to compare the conditional distribution of .m(x i , si ) (“with
sensitive” attribute) or .m(x i ) ∈ [0, 1] (“without sensitive” attribute), conditional on the value
of y at the top, and s at the bottom. The distance is the maximum distance between the two
survival functions, and the p-value is the one obtained with a two-sided alternative hypothesis.
.H0 corresponds to the hypothesis that both distributions are identical (“no discrimination”)
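The comparisons reported in Table 8.2 rely on the two-sample Kolmogorov–Smirnov test, available in base R as ks.test; the sketch below uses simulated scores and groups (placeholders, not the book's toydata2 dataset) to show the mechanics.

    set.seed(4)
    n <- 1000
    s <- sample(c("A", "B"), n, replace = TRUE)   # hypothetical sensitive attribute
    m <- plogis(rnorm(n) + 0.5 * (s == "B"))      # hypothetical scores, shifted for group B
    y <- rbinom(n, 1, m)                          # outcomes drawn from the scores
    ks.test(m[y == 0], m[y == 1])                 # distance between score distributions, given y
    ks.test(m[s == "A"], m[s == "B"])             # distance between score distributions, given s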
8.1 Fairness Through Unawareness

Fig. 8.3 Survival distribution of the scores $m(x_i)\in[0,1]$ ("without sensitive" attribute), for the
different models (plain logistic regression on the left-hand side, and random forest on the right-hand
side), conditional on $y\in\{0,1\}$ at the top and conditional on $s\in\{A,B\}$ at the bottom

The very first concept we discuss is based on "blindness," also coined "fairness
through unawareness" in Dwork et al. (2012), and is based on the (naïve) idea that
a model fitted on the subpart of the dataset that contains only legitimate attributes
(and not sensitive ones) is "fair." As discussed previously, $x\in\mathcal{X}$ denotes the legitimate
features, whereas .s ∈ S = {A, B} denotes the protected attribute. As discussed in
Chap. 3, we may use here .z as a generic notation, to denote either .x, or .(x, s).
Definition 8.1 (Fairness Through Unawareness (Dwork et al. 2012)) A model m satisfies the fairness through unawareness criterion, with respect to a sensitive attribute .s ∈ S, if .m : X → Y.
Based on that idea, we extend the notion of “regression function” (from
Definition 3.1), and distinguish “aware” and “unaware regression functions.”
Definition 8.2 (Aware and Unaware Regression Functions .μ) The aware regres-
sion function is .μ(x, s) = E[Y |X = x, S = s] and the unaware regression function
is .μ(x) = E[Y |X = x].
Fig. 8.4 Scatterplot with the aware model (with the sensitive attribute s) on the x-axis and the unaware model (without the sensitive attribute s) on the y-axis, i.e., .{μ̂(x i , si ), μ̂(x i )}, for the GLM, the GAM and the Random Forest models, on the toydata2 dataset, .n = 1000 individuals
As pointed out by Caton and Haas (2020), there are at least a dozen ways to formally define the fairness of a classifier, or more generally of a model. For example, one can wish for independence between the score and the group membership, .m(Z) ⊥⊥ S, or between the prediction (as a class) and the protected variable, .Ŷ ⊥⊥ S. Observe that this was implicitly done in the introduction of this chapter, for example in Table 8.2, when we compared the conditional distributions of the score .m(x i ) in the two groups, .s = A and .s = B. Inspired by Darlington (1971), we define the following, for a classifier .mt :
Definition 8.4 (Demographic Parity (Calders and Verwer 2010; Corbett-Davies et al. 2017)) A decision function .ŷ (or a classifier .mt , taking values in .{0, 1}) satisfies demographic parity, with respect to some sensitive attribute S, if (equivalently)
.P[Ŷ = 1|S = A] = P[Ŷ = 1|S = B] = P[Ŷ = 1],
.E[Ŷ|S = A] = E[Ŷ|S = B] = E[Ŷ],
.P[mt (Z) = 1|S = A] = P[mt (Z) = 1|S = B] = P[mt (Z) = 1].
In the regression case, when y is continuous, there are two possible definitions of demographic parity. If .ŷ = m(z), we can ask only for the equality of the conditional expectations (weak notion), or for the equality of the conditional distributions (strong notion).
Definition 8.5 (Weak Demographic Parity) A decision function .ŷ satisfies weak demographic parity if
.E[Ŷ|S = A] = E[Ŷ|S = B],
and the strong notion requires the equality of the conditional distributions,
.P[Ŷ ∈ A|S = A] = P[Ŷ ∈ A|S = B], ∀A ⊂ Y.
If y and .ŷ are binary, the two definitions are equivalent, but this is usually not the case. When the score is used to select clients, so as to authorize the granting of a loan by a bank or a financial institution, this “demographic parity” concept (also called “statistical fairness,” “equal parity,” “equal acceptance rate,” or simply “independence,” as mentioned in Calders and Verwer (2010)) simply requires the same acceptance rate in the two groups, and there is a violation when
.P[Ŷ = 1|S = A] ≠ P[Ŷ = 1|S = B].
In Table 8.3, the small dataset with .n = 40 is at the top, and the entire dataset (.n = 1000) is at the bottom. One can observe an important difference between the ratios, for identical thresholds t. This is simply because the small dataset (.n = 40) is a distorted version of the entire one, with selection bias. Recall that in the entire dataset,
.P[Y = 0, S = A] ∼ 47%,  P[Y = 1, S = B] ∼ 27%,
.P[Y = 1, S = A] ∼ 13%,  P[Y = 0, S = B] ∼ 13%,
whereas in our small dataset, all probabilities are .25% (in order to have exactly 10
individuals in each group). So clearly, selection bias has an impact on discrimination
assessment.
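The ratios reported in Table 8.3 can be computed by hand. The function below is a minimal sketch in base R (it is not the code used for the tables in this chapter, which relies on the fairness package); the vectors score and s are illustrative names for the predicted probabilities and the sensitive attribute.
dem_parity_ratio <- function(score, s, t = 0.5) {
  pA <- mean(score[s == "A"] > t)   # P[ŷ = 1 | S = A]
  pB <- mean(score[s == "B"] > t)   # P[ŷ = 1 | S = B]
  c(p_A = pA, p_B = pB, ratio = pB / pA)
}
# illustration on simulated data, with group sizes mimicking toydata2
set.seed(1)
n <- 1000
s <- sample(c("A", "B"), size = n, replace = TRUE, prob = c(0.6, 0.4))
score <- ifelse(s == "A", rbeta(n, 2, 5), rbeta(n, 4, 3))   # group B scored higher
dem_parity_ratio(score, s, t = 0.30)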
In Fig. 8.5 we can visualize .t |→ P[mt (Z) = 0|S = A]/P[mt (Z) = 0|S = B] for aware and unaware models (here a plain logistic regression) on the left-hand side, and more specifically .t |→ P[mt (Z) = 0|S = A] and .t |→ P[mt (Z) = 0|S = B] in the middle and on the right-hand side. The model without the sensitive attribute is fairer, with respect to the demographic parity criterion, than the one with the sensitive attribute. The dots are the values obtained on the small dataset, with .n = 40. As mentioned previously, on that dataset, it seems that there is less discrimination (with respect to s), probably because of some selection bias. When .t = 30% for the plain unaware logistic regression, out of 300,000 individuals in group A, there are 100,644 “positive,” which is a 33.55% proportion (.P[Ŷ = 1|S = A]), and out of 200,000 individuals in group B, there are 163,805 “positive,” which is an 81.9% proportion (.P[Ŷ = 1|S = B]). Thus, the ratio .P[Ŷ = 1|S = B]/P[Ŷ = 1|S = A] is 2.44. As t increases, both proportions (of “positive”) decrease, but the ratio .P[Ŷ = 1|S = B]/P[Ŷ = 1|S = A] increases. In Fig. 8.6, we visualize on the right-hand side .P[Ŷ = 1|S = B] and .P[Ŷ = 1|S = A], and on the left-hand side, .t |→ P[mt (X) ≤ t|S = A] and .t |→ P[mt (X) ≤ t|S = B].
Table 8.3 Quantifying demographic parity on toydata2, using dem_parity from R package fairness. The ratio .P[Ŷ = 1|S = B]/P[Ŷ = 1|S = A] (and .P[Ŷ = 0|S = A]/P[Ŷ = 0|S = B]) should be equal to 1 to satisfy the demographic parity criterion
Unaware (without s) Aware (with s)
GLM GAM CART RF GLM GAM CART RF
.n = 40, .t = 50%, ratio .P[Ŷ = 1|S = B]/P[Ŷ = 1|S = A]
.P[Ŷ = 1|S = A] 40% 35% 20% 30% 30% 25% 20% 30%
.P[Ŷ = 1|S = B] 45% 55% 40% 55% 60% 55% 40% 55%
Ratio 1.125 1.571 2.000 1.833 2.000 2.200 2.000 1.833
Fig. 8.5 Demographic parity as a function of the threshold t, for classifier .mt (x), when m is a plain logistic regression—with and without the sensitive attribute
s—with groups A and B, on toydata2, using dem_parity from R package fairness. Here, .n = 500,000 simulated data are considered for plain lines,
whereas dots on the left-hand side are empirical values obtained on the smaller subset, with .n = 40, as in Table 8.3. On the left-hand side, evolution of the ratio
.P[Ŷ = 1|S = B]/P[Ŷ = 1|S = A]. The horizontal line (at .y = 1) corresponds to perfect demographic parity. In the middle .t |→ P[mt (X) > t|S = B] and
.t |→ P[mt (X) > t|S = A] on the model with s, and on the right-hand side without s
Fig. 8.6 Alternative demographic parity graphs (compared with Fig. 8.5), with ratio .P[Y = 0|S = A]/P[Y = 0|S = B], on the left-hand side, with and without
the sensitive attribute s. In the middle and on the right-hand side .t |→ P[mt (X) ≤ t|S = B] and .t |→ P[mt (X) ≤ t|S = A], respectively with and without the
sensitive attribute s
Fig. 8.7 Receiver operating characteristic curve on the plain logistic regression on the left, on the
small dataset with .n = 40 individuals, with the optimal threshold (.t = 50%) and the evolution of
the rate of error on the right, and the optimal threshold (also .t = 50%)
As we can see in Definition 8.4, this measure is based on .mt , and not on m. A classical choice for t is .50% (used in the majority rule in bagging approaches), but other choices are possible. An alternative is to choose t so that .P[m(Z) > t] is close to .P[Y = 1] (at least on the training dataset). In Sect. 8.9, we consider the probability of claiming a loss in motor insurance (on the FrenchMotor dataset). In the training subset, 8.72% of the policyholders claimed a loss, whereas in the validation dataset, the percentage was slightly lower, at 8.55%. With the logistic regression model, on the validation dataset, the average prediction (.m(z)) is 9% and the median one is 8%. With a threshold .t = 16% and a logistic regression, about 10% of the policyholders get .ŷ = 1 (which is close to the claim frequency in the dataset), and .9% with a classification tree. Therefore, another natural threshold is the quantile of the scores .m(zi ) of order equal to the proportion of 0s among the .yi . Finally, as discussed in Chapter 9 of Krzanowski and Hand (2009), several other criteria can be considered.
And inspired by Shannon and Weaver (1949), it is also possible to define some mutual information, based on the Kullback–Leibler divergence between the joint distribution and the independent version (see Definition 3.7),
.IM(ŷ, s) = ∑i∈{0,1} ∑s∈{A,B} P(Ŷ = i, S = s) log [ P(Ŷ = i, S = s) / (P(Ŷ = i) P(S = s)) ].
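This mutual information can be estimated from the empirical joint distribution of .(ŷ, s), as in the short sketch below (an illustration, not taken from the book's code); .0 · log 0 terms are simply discarded.
mutual_information <- function(yhat, s) {
  joint <- prop.table(table(yhat, s))               # estimates P(Ŷ = i, S = s)
  indep <- outer(rowSums(joint), colSums(joint))    # P(Ŷ = i) * P(S = s)
  sum(joint * log(joint / indep), na.rm = TRUE)     # Kullback–Leibler divergence
}
# e.g., with classes obtained by thresholding a score at t = 50%:
# mutual_information(yhat = as.numeric(score > 0.5), s = s)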
Fig. 8.8 Matching between .m(x, s = A) (distribution on the x-axis) and .m(x, s = B) (distribution on the y-axis), where m is (from the left to the right) GLM,
GBM, and RF. The solid line is the (monotonic) optimal transport .T⋆ for different models
In Fig. 8.8, the x-axis represents the distribution of scores in group A, whereas the y-axis represents the distribution of scores in group B.
The solid line depicts the optimal transport .T⋆ , which follows a monotonic pattern.
If this line aligns with the diagonal, it indicates that the model m satisfies the “strong
demographic parity” criteria and is considered fair.
One shortcoming with those approaches is that “demographic parity” is simply based on the independence between the protected variable s and the prediction .ŷ. And this does not take into account the fact that the outcome y may correlate with the sensitive variable s. In other words, if the groups induced by the sensitive attribute s have different underlying distributions for y, ignoring this dependence may lead to misleading conclusions about fairness. Therefore, quite naturally, an extension of the independence property is the “separation” criterion, which adds the value of the outcome y. More precisely, we require independence between the prediction .ŷ and the sensitive variable s, conditional on the value of the outcome variable y, or formally .Ŷ ⊥⊥ S conditional on Y.
which corresponds to the parity of true positives, in the two groups, .{A, B}.
Fig. 8.9 Receiver operating characteristic curves (true-positive rates against false-positive rates)
for the plain logistic regression m, on the toydata2 dataset. Percentages indicated are thresholds
t used, in each group (A and B), with the false-positive rate (on the x-axis) and the true-positive
rate (on the y-axis)
The previous property can also be named “weak equal opportunity,” and as for
demographic parity, a stronger property can be defined.
Definition 8.10 (Strong Equal Opportunity) A classifier .m(·), taking values in .{0, 1}, satisfies strong equal opportunity, with respect to some sensitive attribute S, if
.P[Ŷ ∈ A|S = A, Y = 1] = P[Ŷ ∈ A|S = B, Y = 1], ∀A ⊂ Y.
In Fig. 8.9, point .• in the top left corner corresponds to the case where the
threshold t is 50% in group B, on the left-hand side, and point .◦ in the bottom left
corner corresponds to the case where the threshold t is 50% in group A. As those
two points are not on the same vertical or horizontal line, .mt satisfies neither “true-positive equality” nor “false-positive equality” when .t = 50%. Nevertheless, if we allow the threshold t to differ between the two groups, it is possible to achieve either property (but not both at the same time). If
we use a threshold .t = 24.1% in group A, we have “false-positive equality” with
the classifier obtained when using a threshold .t = 50% in group B. And if we use a
threshold .t = 15.2% in group A, we have “true-positive equality” with the classifier
obtained when using a threshold .t = 50% in group B. When examining the graph
on the right, we can identify specific thresholds that need to be employed in group
B to achieve either “true-positive equality” or “false-positive equality” when the
threshold .t = 50% is used in group A.
If the two properties are satisfied at the same time, we have “equalized odds.”
Definition 8.12 (Equalized Odds (Hardt et al. 2016)) A decision function .ŷ (or a classifier .mt (·) taking values in .{0, 1}) satisfies the equalized odds constraint, with respect to some sensitive attribute S, if
.P[Ŷ = 1|S = A, Y = y] = P[Ŷ = 1|S = B, Y = y] = P[Ŷ = 1|Y = y], ∀y ∈ {0, 1},
.P[mt (Z) = 1|S = A, Y = y] = P[mt (Z) = 1|S = B, Y = y], ∀y ∈ {0, 1},
which corresponds to parity of true positives and false positives, in the two groups.
Note that instead of using the value of .ŷ (conditional on y and s), “equalized odds” means, for some specific threshold t,
.P[m(X) ∈ A|Y = y, S = A] = P[m(X) ∈ A|Y = y, S = B], ∀A, ∀y ∈ {0, 1}.
Table 8.4 False-positive and false-negative metrics, on our .n = 40 individuals from dataset
toydata2, using fpr_parity and fnr_parity from R package fairness. The ratio
(group B against group A) should be close to 1 to have either false-positive fairness (at the top) or
false-negative fairness (at the bottom)
Unaware (without s) Aware (with s)
GLM GAM CART RF GLM GAM CART RF
False-positive rate ratio, various t
.t = 50% 1.50 4.00 2.00 – 4.00 5.00 2.00 Inf
.t = 70% 1.75 1.75 3.00 2.67 1.75 2.25 3.00 2.00
False-negative rate ratio, various t
.t = 30% 0.60 0.75 0.60 1.00 0.33 0.20 0.60 0.50
.t = 50% 1.00 0.50 0.00 2.00 0.00 0.00 0.00 1.00
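The ratios of Table 8.4 can be recovered directly from the confusion matrix in each group, as in the following hand-coded sketch (the table itself was produced with fpr_parity and fnr_parity from the fairness package, whose arguments are not reproduced here).
error_rate_ratios <- function(score, y, s, t = 0.5) {
  rates <- sapply(c("A", "B"), function(g) {
    yhat <- as.numeric(score[s == g] > t)
    yg   <- y[s == g]
    c(FPR = mean(yhat[yg == 0] == 1),   # P[ŷ = 1 | y = 0, S = g]
      FNR = mean(yhat[yg == 1] == 0))   # P[ŷ = 0 | y = 1, S = g]
  })
  list(rates = rates, ratio_B_over_A = rates[, "B"] / rates[, "A"])
}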
.P[Ŷ = Y|S = A] = P[Ŷ = Y|S = B] = P[Ŷ = Y],
.P[mt (X) = Y|S = A] = P[mt (X) = Y|S = B] = P[mt (X) = Y],
.P[Y − Ŷ ∈ A|S = A] = P[Y − Ŷ ∈ A|S = B] = P[Y − Ŷ ∈ A], ∀A.
Fig. 8.10 On the right-hand side, evolution of the false-positive rates, in groups A and B, for .mt (x)—without the sensitive attribute s—as a function of
threshold t (on a plain logistic regression), on toydata2, using fpr_parity from R package fairness
Fig. 8.11 On the right-hand side, evolution of the false-negative rates, in groups A and B, for .mt (x)—without the sensitive attribute s—as a function of
threshold t (on plain logistic regression), on toydata2, using fnr_parity from R package fairness. Here, .n = 500,000 simulated data are considered
Table 8.5 “equalized odds” on the .n = 40 subset of toydata2, using equal_odds from R
package fairness
Unaware (without s) Aware (with s)
GLM GAM CART RF GLM GAM CART RF
.t = 30% 1.400 1.167 1.400 1.000 2.000 1.800 1.400 1.333
.t = 50% 1.000 1.125 1.111 0.889 1.429 1.250 1.111 1.000
.t = 70% 1.000 1.111 1.000 0.900 1.000 1.000 1.000 0.900
.E[|Y − Ŷ|a |S = A] = E[|Y − Ŷ|a |S = B] = E[|Y − Ŷ|a ], ∀a > 0,
.P[Y ≠ Ŷ|S = A, Y = 0] = P[Y ≠ Ŷ|S = B, Y = 0],
.P[Y ≠ Ŷ|S = A, Y = 1] = P[Y ≠ Ŷ|S = B, Y = 1],
for example.
One can also use any metrics based on confusion matrices, such as .φ, introduced by Matthews (1975), also denoted MCC, for “Matthews correlation coefficient,” in Baldi et al. (2000) or Tharwat (2021).
Definition 8.15 (.φ-Fairness (Chicco and Jurman 2020)) We have .φ-fairness if .φA = φB , where .φs denotes the Matthews correlation coefficient for group s.
The evolution of .φA /φB as a function of the threshold is reported in Table 8.6. For example, with a 50% threshold t, in group .A, the MCC is 0.612 with a plain logistic regression, whereas it is 0.704 in group .B, with the same model. When employing a baseline value of 1 for the MCC in group B, the corresponding value in group A is 0.870 (see Table 3.6 for details on computations, based on confusion matrices). In Fig. 8.13, we can look at this metric, as a function of the threshold t, when n is very large.
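The .φ-fairness ratio can be computed directly from the two 2 × 2 confusion matrices, as in the minimal sketch below (an illustration, not the mcc_parity code of the fairness package).
mcc <- function(yhat, y) {
  TP <- sum(yhat == 1 & y == 1); TN <- sum(yhat == 0 & y == 0)
  FP <- sum(yhat == 1 & y == 0); FN <- sum(yhat == 0 & y == 1)
  (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
}
phi_fairness <- function(score, y, s, t = 0.5) {
  phiA <- mcc(as.numeric(score[s == "A"] > t), y[s == "A"])
  phiB <- mcc(as.numeric(score[s == "B"] > t), y[s == "B"])
  c(phi_A = phiA, phi_B = phiB, ratio = phiA / phiB)
}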
Finally, observe that instead of asking for true-positive rates and false-positive
rates to be equal, one can ask to have identical ROC curves, in the two groups.
As a reminder, we have defined (see Definition 4.19) the ROC curve as .C(t) =
TPR ◦ FPR−1 (t), where .FPR(t) = P[m(X) > t|Y = 0] and .TPR(t) = P[m(X) >
t|Y = 1].
Fig. 8.12 On the right-hand side, evolution of the equalized odds metrics, in groups A and B, for .mt (x)—without the sensitive attribute s—as a function of
threshold t (on plain logistic regression), on toydata2, using equal_odds from R package fairness
Definition 8.16 (Equality of ROC Curves (Vogel et al. 2021)) Let .FPRs (t) = P[m(X) > t|Y = 0, S = s] and .TPRs (t) = P[m(X) > t|Y = 1, S = s], where .s ∈ {A, B}. Set .ΔTPR (t) = TPRB ◦ TPRA−1 (t) − t and .ΔFPR (t) = FPRB ◦ FPRA−1 (t) − t.
Recall that the separation criterion can be written
.Ŷ ⊥⊥ S | Y = y, ∀y ∈ {0, 1}.
Fig. 8.13 On the right-hand side, evolution of the .φ-fairness metric, in groups A and B, for .mt (x)—without the sensitive attribute s—as a function of threshold
t (on plain logistic regression), on toydata2, using mcc_parity from R package fairness. Here, .n = 500,000 simulated data are considered. In the
middle evolution of MCC, in groups A and B, for .mt (x, s)—with the sensitive attribute s, as a function of t. On the left, evolution of the ratio between groups
A and B, with and without the use of the sensitive attribute in .mt respectively, as a function of t. Dots on the left are empirical values obtained on a smaller
subset, as in Table 8.6
Table 8.7 AUC fairness on the small version of toydata2, using roc_parity (which actually compares AUCs rather than ROC curves), based on the ratio of the AUCs in the two groups, from R package fairness
Unaware (without s) Aware (with s)
GLM GAM CART RF GLM GAM CART RF
Ratio of AUC 0.837 0.839 0.913 0.768 0.857 0.860 0.913 0.763
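The ratio of Table 8.7 can be approximated using the rank-based (Mann–Whitney) expression of the AUC in each group, as in the sketch below; the function names are illustrative, and this is not the internal code of roc_parity.
auc <- function(score, y) {
  r  <- rank(score)
  n1 <- sum(y == 1); n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
auc_ratio <- function(score, y, s) {
  aucA <- auc(score[s == "A"], y[s == "A"])
  aucB <- auc(score[s == "B"], y[s == "B"])
  c(auc_A = aucA, auc_B = aucB, ratio = aucB / aucA)
}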
for some appropriate threshold .tA and .tB . We can visualize this in Fig. 8.14. On
the left-hand side of Fig. 8.14, the thresholds are chosen so that the rate of false
positives is the same for both populations (A and B). In other words, if our model
is related to loan acceptance, we must have the same proportion of individuals who were offered a loan in each group (advantaged and disadvantaged). For example, here, we keep the threshold of 50% on the score for group B (corresponding to a true-positive rate of about 31.8%), and we must use a threshold of about 24.1% on the score for group A. On the right-hand side, the true-positive rates are considered instead.
Definition 8.18 (Equal Treatment (Berk et al. 2021a)) Equal treatment is achieved when the rates of false positives and false negatives are identical within the protected groups,
.P[Ŷ = 1|S = A, Y = 0] / P[Ŷ = 0|S = A, Y = 1] = P[Ŷ = 1|S = B, Y = 0] / P[Ŷ = 0|S = B, Y = 1].
Berk et al. (2021a) use the term “treatment” in connection with causal inference,
which we discuss next. When the classifier yields a higher number of false
negatives for the supposedly privileged group, it indicates that a larger proportion
of disadvantaged individuals are receiving favorable outcomes compared with the
opposite scenario of false positives. A slightly different version had been proposed
by Jung et al. (2020).
Definition 8.19 (Equalizing Disincentives (Jung et al. 2020)) The difference
between the true-positive rate and the false-positive rate must be the same in the
protected groups,
.P[Ŷ = 1|S = A, Y = 1] − P[Ŷ = 1|S = A, Y = 0]
= P[Ŷ = 1|S = B, Y = 1] − P[Ŷ = 1|S = B, Y = 0].
Before moving to the third criterion, it should be stressed that “equalized odds” may not be a legitimate criterion, because equal error rates could simply reproduce existing biases. If there is a disparity in y between the groups defined by s, then equal error rates could simply reproduce this disparity in the prediction .ŷ. Therefore, correcting that disparity actually requires different error rates for different values of s.
Fig. 8.14 Densities of scores (solid lines), conditional on .y = 0 on the left-hand side, and .y = 1 on the right-hand side, in groups A and B, on the large toydata2 dataset. Dotted lines are unconditional on y. We consider here threshold .t = 50%, and positive predictions .ŷ = 1, i.e., .m(x) > t. At the top, threshold .t = 50% is kept in group B, and we select another threshold for group A (to have the same proportion of .ŷ = 1), whereas at the bottom, threshold .t = 50% is kept in group A, and we select another threshold for group B
.Y ⊥⊥ S | Ŷ (or m(X)).
We can go further by asking for a little more, not only for parity but also for a
good calibration.
Definition 8.22 (Good Calibration (Kleinberg et al. 2017; Verma and Rubin 2018)) Fairness of good calibration is met if
.P[Y = 1|Ŷ = 1, S = A] = P[Y = 1|Ŷ = 1, S = B].
.PPVs = TPR · P[S = s] / (TPR · P[S = s] + FPR · (1 − P[S = s])), ∀s ∈ {A, B},
such that .PPVA = PPVB implies that either .TPR or .FPR is zero, and as the negative predictive value can be written similarly, we require
.P[Y = 1|S = A, Ŷ = 1] = P[Y = 1|S = B, Ŷ = 1] (positive prediction),
.P[Y = 1|S = A, Ŷ = 0] = P[Y = 1|S = B, Ŷ = 0] (negative prediction),
or
.P[Y = 1|S = A, Ŷ = ŷ] = P[Y = 1|S = B, Ŷ = ŷ], ∀ŷ ∈ {0, 1}.
Finally, let us note that Kleinberg et al. (2017) introduced a notion of balance for the positive/negative class,
.E[m(X)|Y = 1, S = B] = E[m(X)|Y = 1, S = A] (balance for the positive class),
.E[m(X)|Y = 0, S = B] = E[m(X)|Y = 0, S = A] (balance for the negative class).
In Table 8.8, we use the “accuracy parity metric,” as described by Kleinberg et al. (2016) and Friedler et al. (2019). In groups A and B, the accuracy metrics are 80% and 85%, respectively, in the .n = 40 dataset. Therefore, if group B is taken as the baseline (value 1), the value in group A would be 0.941. In Fig. 8.15, we can look at this metric, as a function of the threshold, when n is very large.
Another approach can be inspired by Kim (2017), for whom another way of
defining if a classification is fair or not is to say that we cannot tell from the result if
the subject was member of a protected group or not. In other words, if an individual’s
score does not allow us to predict that individual’s attributes better than guessing
them without any information, we can say that the score was assigned fairly.
Definition 8.25 (Non-Reconstruction of the Protected Attribute (Kim 2017)) If we cannot tell from the result (.x, .m(x), y and .ŷ) whether the subject was a member of a protected group or not, we talk about fairness by non-reconstruction of the protected attribute.
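One way to operationalize this definition is to try to predict s from the characteristics, the score, and the predicted class, and to compare the accuracy of that auxiliary classifier with the base rate, as in the sketch below. The data frame db (with columns x1, x2, x3 and s, as in toydata2) and the vector of fitted scores m_x are assumptions, used only for illustration.
check_reconstruction <- function(db, m_x, t = 0.5) {
  db$score <- m_x
  db$yhat  <- as.numeric(m_x > t)
  fit  <- glm(I(s == "B") ~ x1 + x2 + x3 + score + yhat,
              data = db, family = binomial)
  acc  <- mean((predict(fit, type = "response") > 0.5) == (db$s == "B"))
  base <- max(mean(db$s == "B"), mean(db$s == "A"))
  c(accuracy = acc, base_rate = base)   # similar values suggest non-reconstruction
}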
Table 8.8 Accuracy parity on the small subset of toydata2, using acc_parity from R
package fairness. 1.071 means that accuracy is 7.1% higher in group A than in group B
Unaware (without s) Aware (with s)
GLM GAM CART RF GLM GAM CART RF
.t = 30% 0.933 0.875 1.071 0.833 1.071 1.067 1.071 0.938
.t = 50% 0.941 0.882 0.875 0.632 1.000 0.882 0.875 0.737
.t = 70% 0.812 0.867 0.647 0.647 0.812 0.688 0.647 0.688
Fig. 8.15 On the right-hand side, evolution of accuracy, in groups A and B, for .mt (x)—without the sensitive attribute s—as a function of threshold t (on
plain logistic regression), on toydata2, using acc_parity from R package fairness. Here, .n = 500,000 simulated data are considered. In the middle
evolution of accuracy, in groups A and B, for .mt (x, s)—with the sensitive attribute s, as a function of t
The different notions of “group fairness” can be summarized in Table 8.9. And as
we now see, those notions are incompatible.
Proposition 8.3 Suppose that a model m satisfies the independence condition (in
Sect. 8.2) and the sufficiency property (in Sect. 8.4), with respect to a sensitive
attribute s, then necessarily, .Y ⊥⊥ S.
Proof From the sufficiency property (Sect. 8.4), .S ⊥⊥ Y | m(Z), and from the independence property (Sect. 8.2), .m(Z) ⊥⊥ S, so that .P[S = s|m(Z)] = P[S = s] almost surely, and therefore
.P[mt (Z) = ŷ] = P[mt (Z) = ŷ|S = s] = E[ P[mt (Z) = ŷ|Y, S = s] ],
or
.P[mt (Z) = ŷ] = ∑y P[mt (Z) = ŷ|Y = y] · P[Y = y|S = s],
almost surely. And since we assumed that y was a binary variable, .P[Y = 0] = 1 − P[Y = 1], as well as .P[Y = 0|S = s] = 1 − P[Y = 1|S = s], and therefore
.P[mt (Z) = ŷ|Y = 1] · (P[Y = 1|S = s] − P[Y = 1]) = −P[mt (Z) = ŷ|Y = 0] · (P[Y = 0|S = s] − P[Y = 0]),
which can be written
.P[mt (Z) = ŷ|Y = 1] · (P[Y = 1|S = s] − P[Y = 1]) = P[mt (Z) = ŷ|Y = 0] · (P[Y = 1|S = s] − P[Y = 1]),
i.e.,
.(P[mt (Z) = ŷ|Y = 1] − P[mt (Z) = ŷ|Y = 0]) · (P[Y = 1|S = s] − P[Y = 1]) = 0,
and a similar property holds if .S = s′. Observe further that .P[mt (Z) = 1|Y = 1] is the true-positive rate (TPR), whereas .P[mt (Z) = 1|Y = 0] is the false-positive rate (FPR). Let .ps = P[Y = 1|S = s], so that
.P[Y = 1|S = s, mt (Z) = 1] = ps · TPR / (ps · TPR + (1 − ps ) · FPR),
which should not depend on s (from the sufficiency property), that is,
.ps · TPR / (ps · TPR + (1 − ps ) · FPR) = ps′ · TPR / (ps′ · TPR + (1 − ps′ ) · FPR).
Suppose that .TPR ≠ 0 (otherwise .TPR = P[mt (Z) = 1|Y = 1] = 0, as stated in the proposition); then .ps = ps′ for all .s, s′, that is, .Y ⊥⊥ S.
We had seen that demographic fairness translates into the following equality,
.P[Ŷ = 1|S = A]/P[Ŷ = 1|S = B] = 1 = P[Ŷ = 1|S = B]/P[Ŷ = 1|S = A].
While this approach is interesting, the statistical reality is that a perfect equality between two (predicted) probabilities is usually impossible to achieve. It is actually possible to relax that equality, as follows.
Definition 8.26 (Disparate Impact (Feldman et al. 2015)) A decision function .Ŷ has a disparate impact, for a given threshold .τ , if
.min{ P[Ŷ = 1|S = A]/P[Ŷ = 1|S = B], P[Ŷ = 1|S = B]/P[Ŷ = 1|S = A] } < τ (usually 80%).
This so-called “four-fifths rule,” coupled with the .τ = 80% threshold, was
originally defined by the State of California Fair Employment Practice Commission
Technical Advisory Committee on Testing, which issued the California State
Guidelines on Employee Selection Procedures in October 1972, as recalled in
Feldman et al. (2015), Mercat-Bruns (2016), Biddle (2017), or Lipton et al. (2018).
This standard was later adopted in the 1978 Uniform Guidelines on Employee
Selection Procedures, used by the Equal Employment Opportunity Commission,
the US Department of Labor, and the US Department of Justice. An important point
here is that this form of discrimination occurred even when the employer did not
intend to discriminate, but by looking at employment statistics (on gender or racial
grounds), it was possible to observe (and correct) discriminatory bias.
For example, on the toydata2 dataset with .n = 1000 individuals,
.P[Ŷ = 1|S = A]/P[Ŷ = 1|S = B] = (134/600)/(270/400) = 134/405 ∼ 33.1% ⪡ 80%.
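The “four-fifths” check can be written in a few lines; the sketch below simply reproduces, as an illustration, the computation on the counts quoted above (600 individuals in group A with 134 positive predictions, 400 in group B with 270).
disparate_impact <- function(yhat, s, tau = 0.8) {
  pA <- mean(yhat[s == "A"]); pB <- mean(yhat[s == "B"])
  di <- min(pA / pB, pB / pA)
  c(p_A = pA, p_B = pB, disparate_impact = di, below_threshold = di < tau)
}
disparate_impact(yhat = c(rep(1, 134), rep(0, 466), rep(1, 270), rep(0, 130)),
                 s    = c(rep("A", 600), rep("B", 400)))   # ratio close to 33.1%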
Another approach, suggested to relax the equality .P(Ŷ = 1|S = A) = P(Ŷ = 1|S = B), consists in introducing a notion of “approximate fairness,” as in Holzer and Neumark (2000), Collins (2007) and Feldman et al. (2015), or .ε-fairness, as in Hu (2022),
.|P(Ŷ = 1|S = A) − P(Ŷ = 1|S = B)| < ε.
For the stronger, distributional version, the equality of the score distributions in the two groups can be relaxed in the same way,
.W2 (PA , PB ) < ε,
which can be used to construct a confidence interval for T (Besse et al. (2018) propose an asymptotic test, but using resampling techniques it is also possible to get confidence intervals).
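In the univariate case, the Wasserstein distance between the score distributions of the two groups can be estimated from empirical quantiles, and resampling then provides confidence intervals of the kind mentioned above; the sketch below is an illustration, not the procedure of Besse et al. (2018).
w2_scores <- function(score, s, grid = seq(0.005, 0.995, by = 0.01)) {
  qA <- quantile(score[s == "A"], probs = grid)
  qB <- quantile(score[s == "B"], probs = grid)
  sqrt(mean((qA - qB)^2))   # approximation of W2 via the quantile formulation
}
# bootstrap confidence interval (illustration):
# boot <- replicate(500, { id <- sample(length(score), replace = TRUE)
#                          w2_scores(score[id], s[id]) })
# quantile(boot, c(0.025, 0.975))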
While the three previous approaches (independence, separation, and sufficiency) are now quite popular in the machine-learning literature, other techniques to quantify discrimination have been introduced in econometrics. A classical case in labor economics
is the gender wage gap. Such a gap has been observed for decades and economists
have tried to explain the difference in average wages between men and women. In a
nutshell, as in insurance, such a gap could be a “fair demographic discrimination” if
there were group differences in wage determinants, that is, in characteristics that are
relevant to wages, such as education. This is called “compositional differences.”
But that gap can also be associated with “unfair discrimination,” if there was
a differential compensation for these determinants, such as different returns to
education for men and women. This is called “differential mechanisms.” In order
to construct a counterfactual state—to answer a question “If women had the same
characteristics as men, what would be their wage?”—economists have considered
the decomposition method, to recover the causal effect of a sensitive attribute. Cain
(1986) and Fortin et al. (2011) have provided the state of the art on those techniques.
While the seminal work by Kitagawa (1955) and Solow (1957) introduced the “decomposition method,” Oaxaca (1973) and Blinder (1973) laid the foundations for the decomposition approach to analyze mean wage differences between groups, based either on gender or on race. This approach offers a straightforward means of disentangling cause and effect within the context of linear models,
.yi = γ 1B (si ) + x i┬ β + εi ,
where y is the salary, s denotes the binary sensitive attribute .1B (s), and .x is a collection of predictive variables (or control variables, that might have an influence on the salary, such as education or work experience). In such a model, .γ̂ can be used to answer the question asked earlier: “what would the wage be if women (.s = B) had the same characteristics .x as men (.s = A)?” To introduce the Blinder–Oaxaca approach, suppose that we consider two (linear) models, for men and women respectively,
.yA:i = x A:i┬ β A + εA:i (group A),
.yB:i = x B:i┬ β B + εB:i (group B).
Using ordinary least squares estimates (and standard properties of linear models),
.ȳA = x̄ A┬ β̂ A (group A),  ȳB = x̄ B┬ β̂ B (group B),
so that .ȳA − ȳB = x̄ A┬ β̂ A − x̄ B┬ β̂ B , which we can write (by adding and subtracting .x̄ A┬ β̂ B )
.ȳA − ȳB = (x̄ A − x̄ B )┬ β̂ B [characteristics] + x̄ A┬ (β̂ A − β̂ B ) [coefficients],   (8.1)
where the first term is the “characteristics effect,” which describes how much the
difference in outcome y (on average) is due to the differences in the levels of
explanatory variables .x, whereas the second term is the “coefficients effect,” which
describes how much the difference in outcome y (on average) is due to differences in
the magnitude of regression coefficients .β. The first one is the legitimate component,
also called “endowment effect” in Woodhams et al. (2021) or “composition effect” in
Hsee and Li (2022), whereas the second one can be interpreted as some illegitimate
discrimination, and is called “returns effect” in Agrawal (2013) or “structure effect”
in Firpo (2017). For the first component,
.(x̄ A − x̄ B )┬ β̂ B = (x̄ A:1 − x̄ B:1 ) β̂B:1 + · · · + (x̄ A:j − x̄ B:j ) β̂B:j + · · · + (x̄ A:k − x̄ B:k ) β̂B:k ,
where, for the j-th term, we explicitly see the average difference between the two groups, .x̄ A:j − x̄ B:j , whereas for the second component,
.x̄ A┬ (β̂ A − β̂ B ) = x̄ A:1 (β̂A:1 − β̂B:1 ) + · · · + x̄ A:j (β̂A:j − β̂B:j ) + · · · + x̄ A:k (β̂A:k − β̂B:k ),
which is analogous to the previous decomposition, in Eq. (8.1), but can be rather
different. This is more or less the same as the regression on categorical variables:
changing the reference does not change the prediction, only the interpretation of
the coefficient. To visualize it, consider the case where there is only one single
characteristic x, as in Fig. 8.16.
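The two components of Eq. (8.1) can be obtained from two separate regressions, as in the following sketch (an illustration only; the data frame wages and its columns y, s, x1 and x2 are assumptions).
oaxaca_blinder <- function(wages) {
  fitA <- lm(y ~ x1 + x2, data = subset(wages, s == "A"))
  fitB <- lm(y ~ x1 + x2, data = subset(wages, s == "B"))
  xbarA <- colMeans(model.matrix(fitA))   # average characteristics (with intercept)
  xbarB <- colMeans(model.matrix(fitB))
  characteristics <- sum((xbarA - xbarB) * coef(fitB))       # "endowment" effect
  coefficients    <- sum(xbarA * (coef(fitA) - coef(fitB)))  # "returns" effect
  c(gap = mean(subset(wages, s == "A")$y) - mean(subset(wages, s == "B")$y),
    characteristics = characteristics, coefficients = coefficients)
}
By construction, the two components sum to the raw gap .ȳA − ȳB .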
In the context of categorical variables, it is rather common to use a contrast approach where all quantities are expressed with respect to some “average benchmark.” We do the same here, except that the “average benchmark” is now a “fair” benchmark,
.y⋆ = x ┬ β ⋆ + ε,
where the “coefficient effect” component is now decomposed into two parts: an
illegitimate discrimination in favor of group A, and an illegitimate discrimination
against group B. This approach can be used to get a better interpretation of the first
two models. In fact, if we assume that there is only discrimination against group B,
and no discrimination in favor of group A, then .β ⋆ = β A and we obtain Eq. (8.1),
whereas if we assume that there is only discrimination in favor of group A, and no
discrimination against group B, then .β ⋆ = β B and we obtain Eq. (8.2).
As for the contrast approach, one can consider an average approach for the fair benchmark. Reimers (1983) suggested considering the average coefficient between the two groups,
.β̂ ⋆ = (1/2) β̂ A + (1/2) β̂ B ,
whereas Cotton (1988) suggested a weighted average, based on population sizes in the two groups,
.β̂ ⋆ = (nA /(nA + nB )) β̂ A + (nB /(nA + nB )) β̂ B .
Or, introducing a weight .ω ∈ [0, 1],
.β̂ ⋆ω = ω β̂ A + (1 − ω) β̂ B ,
so that
.ȳA − ȳB = (x̄ A − x̄ B )┬ [ω β̂ A + (1 − ω) β̂ B ] + [(1 − ω) x̄ A + ω x̄ B ]┬ (β̂ A − β̂ B ),   (8.4)
or more generally
.ȳA − ȳB = (x̄ A − x̄ B )┬ [Ω β̂ A + (I − Ω) β̂ B ] + [(I − Ω) x̄ A + Ω x̄ B ]┬ (β̂ A − β̂ B ),   (8.5)
for some .p × p matrix .Ω, that corresponds to relative weights given to the coefficients of group A. Oaxaca and Ransom (1994) suggested using .Ω = (X ┬ X)−1 (X A┬ X A ), where .X is the design matrix over the pooled sample and .X A the one restricted to group A.
But this approach suffers several drawbacks, related to the “omitted variable bias”
discussed in Sect. 5.6 (see also Jargowsky (2005) or more recently Cinelli and
Hazlett (2020)). Therefore, Jann (2008) suggested simply using a pooled regression
over the two groups, controlled by the group membership, that is
.yi = γ 1B (si ) + x i┬ β ⋆ + εi ,
so that the average wage gap between men and women in the labor market is
.ȳB − ȳA = ∑j ( pB:j ȳB:j − pA:j ȳA:j ),
and we decompose this wage gap into industry wage differentials, and the probability of entering a certain industry. With personal data, one can consider some multinomial logit model, so that the probability of an individual with characteristics .x i joining industry j would be
.pj,i = exp(x i┬ β) / (1 + exp(x i┬ β)),
or, within group s,
.ps:j,i = exp(x i┬ β s ) / (1 + exp(x i┬ β s )),
and the counterfactual probability, had the coefficients been those of group A, is
.p⋆B:j,i = exp(x i┬ β̂ A ) / (1 + exp(x i┬ β̂ A )).
The contribution of industry j to the wage gap can then be decomposed in two parts,
.dj = (pB:j − pA:j ) ȳB:j + pA:j (ȳB:j − ȳA:j ),
where
.ȳB:j − ȳA:j = x̄ B:j┬ β̂ B:j − x̄ A:j┬ β̂ A:j ,
i.e.,
.dj = pA:j (x̄ B:j − x̄ A:j )┬ β̂ B:j [legitimate within industries]
+ pA:j x̄ A:j┬ (β̂ B:j − β̂ A:j ) [illegitimate within industries]
+ (pB:j − p⋆A:j ) ȳB:j [legitimate across industries]
+ (p⋆A:j − pA:j ) ȳB:j [illegitimate across industries].
One can also test whether the distributions of male and female qualifications are the same, given incomes, which corresponds to a reverse regression,
.E[Y |X = x] = β0 + x ┬ β 1 + β2 s and E[X|Y = y] = α0 + α1 y + α2 s,
or
.yi = β0 + x i┬ β 1 + β2 si + εi and xi = α0 + α1 yi + α2 si + ηi .
The idea underlying both proposals is the intuitively appealing one that if the
protected class is underpaid, that is, .β2 < 0, then we should expect its members to
be overqualified in the jobs that they hold, that is, .α2 > 0. Kamalich and Polachek
(1982) suggested a multivariate extension, with
.x1,i = α1,0 + α1,1 yi + α1,2 si + α1,3,2 x2,i + α1,3,3 x3,i + · · · + α1,3,k xk,i + η1,i
.x2,i = α2,0 + α2,1 yi + α2,2 si + α2,3,1 x1,i + α2,3,3 x3,i + · · · + α2,3,k xk,i + η2,i
⋮
.xk,i = αk,0 + αk,1 yi + αk,2 si + αk,3,1 x1,i + αk,3,2 x2,i + · · · + αk,3,k−1 xk−1,i + ηk,i ,
whereas Conway and Roberts (1983) suggested using some aggregated index .zi of .x i . More precisely,
.yi = β0 + x i┬ β 1 + β2 si + εi and zi = α0 + α1 yi + α2 si + ηi .
A semiparametric version can also be considered,
.yi = g(x i ) + β2 si and xi = h(yi ) + α2 si .
First, the unknown conditional means are nonparametrically estimated. Then they
are substituted for the unknown functions, and least squares is used to estimate the remaining parameters.
Robinson (1988) proved that these estimates are asymptotically equivalent to those
obtained using the true conditional mean functions for estimation.
In Sect. 4.2, four models were presented on the GermanCredit dataset (logistic
regression, classification tree, boosting, and bagging), with and without the sensitive
variable (gender). Cumulative distribution functions of the scores, for the plain
logistic regression and the boosting algorithm, can be visualized in Fig. 8.17.
In the training subset of GermanCredit 30.1% of people got a default (.y =
BAD), 29.7% in the validation dataset (.y). If we consider predictions from the
logistic regression model, with the sensitive attribute, on the validation dataset, the
average prediction (.m(x)) is 28.7% and the median one is 20.4%. With a (classifier)
threshold .t = 20%, we have a balanced dataset, with .ŷ = 1 for .50% of the risky individuals (see Fig. 8.17). With a threshold .t = 40%, 30% of the individuals get .ŷ = 1 (which is close to the default frequency in the dataset). In Tables 8.10 and 8.11, we use thresholds .t = 20% and .40%, respectively.
In Sect. 4.2, four models were presented on the FrenchMotor dataset (logistic
regression, classification tree, boosting, and bagging), with and without the sen-
sitive variable (gender). In the training subset of FrenchMotor 8.72% of the
policyholders claimed a loss, 8.55% in the validation dataset (.y). If we consider
predictions from the logistic regression model, with the sensitive attribute, on the
validation dataset, the average prediction (.m(x)) is 9% and the median one is 8%.
With a threshold .t = 8%, we have a balanced dataset, with .ŷ = 1 for .50% of the
Fig. 8.17 Distributions of the score .m(z), on the GermanCredit dataset, conditional on y on the
left-hand side, and s on the right-hand side, when m is a plain logistic regression without sensitive
attribute s at the top, and boosting without sensitive attribute s at the bottom, with threshold .t =
40%
risky drivers (see Fig. 8.18). With a threshold .t = 16%, 10% of the policyholders get .ŷ = 1 (which is close to the claim frequency in the dataset). In Tables 8.12 and 8.13, we use thresholds .t = 8% and .16%, respectively.
Table 8.10 Fairness metrics on the GermanCredit dataset, with the fairness R package,
by Varga and Kozodoi (2021), for women, the reference being men, with the threshold at .20%
With sensitive Without sensitive
GLM Tree Boosting Bagging GLM Tree Boosting Bagging
.P[m(X) > t] 51.7% 28.0% 54.7% 61.7% 50.7% 28.0% 56.0% 60.7%
Predictive rate parity 0.992 1.190 0.992 1.050 0.957 1.190 1.041 1.037
Demographic parity 0.998 1.091 1.159 1.027 1.213 1.091 1.112 1.208
FNR parity 1.398 0.740 1.078 1.124 1.075 0.740 1.064 0.970
Proportional parity 0.922 1.008 1.071 0.949 1.121 1.008 1.027 1.116
Equalized odds 0.816 1.069 0.947 0.888 0.956 1.069 0.953 1.031
Accuracy parity 0.843 1.181 0.912 0.904 0.896 1.181 0.943 0.966
FPR parity 1.247 0.683 1.470 0.855 2.004 0.683 0.962 1.069
NPV parity 0.676 1.141 0.763 0.772 0.735 1.141 0.799 0.823
Specificity parity 0.941 1.439 0.930 1.028 0.851 1.439 1.007 0.990
ROC AUC parity 0.928 1.162 0.997 1.108 0.926 1.162 1.004 1.090
MCC parity 0.604 2.013 0.744 0.851 0.639 2.013 0.884 0.930
Table 8.11 Fairness metrics on the GermanCredit dataset, with the fairness R package,
by Varga and Kozodoi (2021), for women, the reference being men, with the threshold at .40%
With sensitive Without sensitive
GLM Tree Boosting Bagging GLM Tree Boosting Bagging
.P[m(X) > t] 30.3% 26.0% 27.7% 25.7% 30.7% 26.0% 28.0% 27.0%
Predictive rate parity 1.030 1.179 1.110 1.182 1.034 1.179 1.111 1.200
Demographic parity 1.090 1.062 1.074 1.069 1.108 1.062 1.044 1.019
FNR parity 1.533 0.851 1.110 0.781 1.342 0.851 1.322 0.962
Proportional parity 1.007 0.981 0.992 0.987 1.024 0.981 0.964 0.942
Equalized odds 0.925 1.032 0.982 1.041 0.944 1.032 0.955 1.008
Accuracy parity 0.949 1.154 1.054 1.164 0.963 1.154 1.038 1.159
FPR parity 1.118 0.703 0.820 0.653 1.118 0.703 0.784 0.641
NPV parity 0.738 1.080 0.890 1.108 0.766 1.080 0.848 1.082
Specificity parity 0.935 1.470 1.169 1.480 0.935 1.470 1.203 1.652
ROC AUC parity 0.928 1.162 0.997 1.108 0.926 1.162 1.004 1.090
MCC parity 0.745 1.817 1.105 1.754 0.779 1.817 1.056 2.055
Fig. 8.18 Distributions of the score .m(z), on the FrenchMotor dataset, conditional on y on the
left-hand side, and s on the right-hand side, when m is a plain logistic regression without sensitive
attribute s at the top, and boosting without sensitive attribute s at the bottom, with threshold .t = 8%
Table 8.12 Fairness metrics on the FrenchMotor dataset, with the fairness R package, by
Varga and Kozodoi (2021), for women, the reference being men, with the threshold at .t = 8%
With sensitive Without sensitive
GLM Tree Boosting Bagging GLM Tree Boosting Bagging
.P[m(X) > t] 51.1% 29.2% 49.6% 18.7% 50.8% 29.2% 51.6% 18.6%
Predictive rate parity 1.019 1.021 1.017 1.011 1.018 1.021 1.027 1.012
Demographic parity 0.673 0.588 0.700 0.589 0.649 0.588 0.693 0.588
FNR parity 0.833 0.900 0.789 0.813 0.865 0.900 0.806 0.818
Proportional parity 1.182 1.034 1.231 1.035 1.141 1.034 1.217 1.033
Equalized odds 1.187 1.040 1.234 1.031 1.145 1.040 1.232 1.030
Accuracy parity 1.161 1.051 1.198 1.037 1.125 1.051 1.205 1.036
FPR parity 1.004 0.886 1.125 0.775 0.975 0.886 0.956 0.727
NPV parity 1.004 1.054 0.986 1.071 0.982 1.054 1.060 1.074
Specificity parity 0.998 1.141 0.927 1.079 1.012 1.141 1.026 1.091
ROC AUC parity 1.023 1.098 1.027 1.059 1.023 1.098 1.046 1.063
MCC parity 1.482 1.496 1.505 1.128 1.394 1.496 2.273 1.136
Table 8.13 Fairness metrics on the FrenchMotor dataset, with the fairness R package, by
Varga and Kozodoi (2021), for women, the reference being men, with the threshold at .t = 16%
With sensitive Without sensitive
GLM Tree Boosting Bagging GLM Tree Boosting Bagging
.P[m(X) > t] 10.0% 9.2% 6.6% 14.6% 10.2% 9.2% 5.6% 14.5%
Predictive rate parity 1.011 1.016 1.009 1.022 1.014 1.016 1.010 1.020
Demographic parity 0.596 0.591 0.587 0.577 0.588 0.591 0.592 0.577
FNR parity 0.618 0.620 0.642 0.819 0.710 0.620 0.478 0.827
Proportional parity 1.048 1.039 1.032 1.014 1.034 1.039 1.040 1.014
Equalized odds 1.045 1.040 1.027 1.021 1.033 1.040 1.034 1.020
Accuracy parity 1.050 1.050 1.032 1.038 1.043 1.050 1.040 1.036
FPR parity 1.071 1.003 1.090 0.569 1.015 1.003 1.090 0.613
NPV parity 1.011 1.259 0.652 1.160 1.092 1.259 0.847 1.143
Specificity parity 0.748 0.987 0.467 1.256 0.944 0.987 0.467 1.234
ROC AUC parity 1.023 1.098 1.027 1.059 1.023 1.098 1.046 1.063
MCC parity 0.993 1.452 0.354 1.289 1.236 1.452 0.610 1.265
Chapter 9
Individual Fairness
values in .{A, B}, A being the favored group (or supposed to be), and B the disfavored
one.
The natural idea, formalized in Luong et al. (2011), is that two “close” individuals
(in the sense of unprotected characteristics .x) must have the same prediction.
Definition 9.1 (Similarity Fairness (Luong et al. 2011; Dwork et al. 2012)) Consider two metrics, one on .Y × Y (or on .[0, 1], and not .{0, 1}, for a classifier) denoted .Dy , and one on .X denoted .Dx . We have similarity fairness on a database of n individuals if
.Dy (m(x i , si ), m(x j , sj )) / Dx (x i , x j ) ≤ L, ∀i, j = 1, · · · , n,
for some finite constant L.
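Empirically, one may estimate the smallest constant L compatible with a fitted model, as in the sketch below, where the Euclidean distance is used for .Dx and the absolute difference for .Dy (these metric choices are assumptions, since the definition leaves them open).
lipschitz_constant <- function(X, m_x) {
  Dx <- as.matrix(dist(X))      # pairwise distances between characteristics
  Dy <- as.matrix(dist(m_x))    # pairwise distances between predictions
  ratio <- Dy / Dx
  max(ratio[upper.tri(ratio) & Dx > 0])   # empirical Lipschitz constant
}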
Definition 9.2 (Local Individual Fairness (Petersen et al. 2021)) Consider two metrics, one on .Y (.[0, 1] for a classifier, and not .{0, 1}) denoted .Dy , and one on .X denoted .Dx . A model m is locally individually fair if
.E(X,S) [ limsupx′ :Dx (X,x′ )→0 Dy (m(X, S), m(x′ , S)) / Dx (X, x′ ) ] ≤ L < ∞.
Heidari and Krause (2018) and Gupta and Kamble (2021) considered some
dynamic extension of that rule, to define fair rules in reinforcement learning, for
example, inspired by Bolton et al. (2003) who found that customers’ impression of
fairness of prices critically relies on past prices that act as reference points.
Based on the previous definition, a quite natural extension is the local one,
Definition 9.5 (Counterfactual Fairness (Kusner et al. 2017)) We achieve counterfactual fairness for an individual with characteristics .x if
.CATE(x) = E[ Y⋆S←A − Y⋆S←B | X = x ] = 0.
Observe that there are several variations in the literature around those definitions.
Kusner et al. (2017) formally defined counterfactual fairness “conditional on a
factual condition,” whereas Wu et al. (2019) considered “path-specific causal fair-
ness.” Zhang and Bareinboim (2018) distinguished “counterfactual direct effect,”
“counterfactual indirect effect” and “counterfactual spurious effect.” To explain
quickly the differences, following Avin et al. (2005), let us define the idea of “path-
specific causal effect” (studied in Zhang et al. (2016) and Chiappa (2019)).
Definition 9.6 (Path-Specific Effect (Avin et al. 2005)) Given a causal diagram, and a path .π from s to y, the .π-effect of a change of s from B to A on y is
.E[Y |doπ (S = A)] − E[Y |doπ (S = B)],
where “.doπ (S = A)” denotes the intervention on s transmitted only along path .π .
Transmission along a path (in a causal graph) was introduced with Defini-
tion 7.14. Then, following Wu et al. (2019), define the “path-specific counterfactual
effect”
Definition 9.7 (Path-Specific Counterfactual Effect (Wu et al. 2019)) Given a causal diagram, a factual condition (denoted .F), and a path .π from s to y, the .π-effect of a change of s from B to A on y is
.E[Y |doπ (S = A), F] − E[Y |doπ (S = B), F].
Recall that the Wasserstein distance between two measures p and q on .Rd is defined as
.Wk (p, q)k = infπ ∈Π(p,q) ∫Rd ×Rd ‖x − y‖k dπ(x, y),
which, in the univariate case, can be written
.Wk (p, q)k = ∫01 |Fp−1 (u) − Fq−1 (u)|k du.
Fig. 9.1 (a) Causal graph used to generate variables in toydata2. (b) Simple causal graph that
can be used on toydata2, where the sensitive attribute may actually cause the outcome y, either
directly (upper arrow), or indirectly, through .x1 , a mediator variable. (c) Causal graph where the
sensitive attribute s may cause the outcome y, either directly or indirectly, via two possible paths
and two mediator variables, .x1 and .x2
in the sense .u = F1B (x1 ), then the counterfactual should be associated with the quantile (in group A) with the same probability u. Thus, the counterfactual of .(x1 , B) would be .(T(x1 ), A), where .T = F1A−1 ◦ F1B . Following Berk et al. (2021b) and Charpentier et al. (2023a), it is possible to define a “quantile-based counterfactual” (or “adaptation with quantile preservation,” as defined in Plečko et al. (2021)), as follows.
Definition 9.8 (Quantile-Based Counterfactual) The counterfactual of .(x1 , B) is .(T(x1 ), A), where .T(x1 ) = F1A−1 ◦ F1B (x1 ).
Definition 9.9 (Quantile-Based Counterfactual Discrimination) There is counterfactual discrimination with model m for individual .(x1 , B) if
.m(x1 , B) ≠ m(T(x1 ), A), where T = F1A−1 ◦ F1B .
For example, if
.XA = (X|S = A) ∼ N(μA , σA2 ) and XB = (X|S = B) ∼ N(μB , σB2 ),
then
.T(XB ) = μA + σA · (XB − μB )/σB
has the same distribution as .XA .
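Definition 9.8 can be implemented with empirical distribution and quantile functions, as in the following sketch (x1_B and x1_A denote the observed values of .x1 in groups B and A; the names are illustrative).
quantile_counterfactual <- function(x1, x1_B, x1_A) {
  u <- ecdf(x1_B)(x1)                     # probability level of x1 within group B
  as.numeric(quantile(x1_A, probs = u))   # quantile of the same level within group A
}
# with Gaussian samples, this converges to the affine map mu_A + sigma_A (x1 - mu_B) / sigma_B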
The empirical version of the problem described in the previous section could be expressed as follows: consider two samples with identical size n, denoted .{x 1A , · · · , x nA } and .{x 1B , · · · , x nB }. For each individual .x iB , the counterfactual is an individual in the other group, .x jA , with two constraints: (1) it should be a one-to-one matching: each individual in group B should be associated with a single observation in group A, and conversely; (2) individuals should be matched with a “close” one, in the other group. Stuart (2010) used the name “1:1 nearest-neighbor matching” to describe that matching procedure (see also Dehejia and Wahba (1999) or Ho et al. (2007)). The first condition imposes that a matching is simply a permutation .σ of .{1, 2, · · · , n}, so that, for all i, the counterfactual of .x iB will be .x σ(i)A . Recall that the optimal matching solves
.P ⋆ ∈ argminP ∈P 〈P , C〉,   (9.1)
where .P is the set of .n × n permutation matrices. The solution is very intuitive, and based on the following rearrangement inequality.
Lemma 9.2 (Hardy–Littlewood–Pólya Inequality (1)) Given .x1 ≤ · · · ≤ xn and .y1 ≤ · · · ≤ yn , two ordered sets of n real numbers, for every permutation .σ of .{1, 2, · · · , n},
.∑i xi yn+1−i ≤ ∑i xi yσ (i) ≤ ∑i xi yi , the sums being over .i = 1, · · · , n.
where .z ∧ z′ and .z ∨ z′ denote respectively the componentwise minimum and maximum. If .−Ф is supermodular, .Ф is said to be submodular. From Topkis’ characterization theorem (see Topkis 1998), if .Ф : R × R → R is twice differentiable, .Ф is supermodular if and only if .∂ 2 Ф/∂x∂y ≥ 0 everywhere. And as mentioned in Galichon (2016), many popular functions in applied
economics are supermodular. If .Ф : R × R → R is supermodular, then, for every permutation .σ ,
.∑i Ф(xi , yn+1−i ) ≤ ∑i Ф(xi , yσ (i) ) ≤ ∑i Ф(xi , yi ),
whereas if .Ф : R × R → R is submodular,
.∑i Ф(xi , yi ) ≤ ∑i Ф(xi , yσ (i) ) ≤ ∑i Ф(xi , yn+1−i ).
In particular, for a submodular cost .Ф,
.∑i Ф(xi , yi ) ≥ ∑i Ф(xi , yσ ⋆ (i) ),
where .σ ⋆ is the permutation such that the rank of .yσ ⋆ (i) (among the y) is equal to the
rank of .xi (among the x), as discussed in Chapter 2 of Santambrogio (2015). This
corresponds to a “monotone rearrangement.”
A numerical illustration is provided in Table 9.1. There are .n = 6 individuals
in class B (per row) and in class A (per column). Table on the left-hand side is the
distance matrix .C, between .xi and .xj , whereas the table on the right-hand side is
the optimal permutation .P , solution of Eq. (9.1). Here, individual .i = 3, in group B,
is matched with individual .j = 9, in group A. Thus, in this very specific example,
model m would be seen to be fair for individual .i = 3 if .m(x 3 , B) = m(x 9 , A). Observe that fairness can be assessed here only for individuals who belong to the training dataset (and not for any fictional individual .(x, B)).
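The matching of Table 9.1 is a linear assignment problem. A possible sketch, using the Hungarian algorithm from the clue package (an alternative to the transport package mentioned in the text), is given below; xB and xA are assumed to be matrices of legitimate characteristics, one row per individual, with the same number of rows.
library(clue)
optimal_matching <- function(xB, xA) {
  # cost matrix: distances between individuals in group B (rows) and group A (columns)
  C <- as.matrix(dist(rbind(xB, xA)))[seq_len(nrow(xB)),
                                      nrow(xB) + seq_len(nrow(xA))]
  sigma <- solve_LSAP(C)   # permutation minimizing the total cost, as in Eq. (9.1)
  data.frame(i_B = seq_len(nrow(xB)), matched_A = as.integer(sigma))
}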
A more general setting could be considered, where the two groups no longer have
the same size n. Consider two groups, A and B. Given .ν B ∈ R+nB and .ν A ∈ R+nA such
Table 9.1 Optimal matching, .n = 6 individuals in class B (per row) and in class A (per column).
The table on the left-hand side is the distance matrix .C, whereas the table on the right-hand side is
the optimal permutation .P ⋆ , solution of Eq. (9.1)
7 8 9 10 11 12 7 8 9 10 11 12
1 0.41 0.55 0.22 0.64 0.04 0.25 1 .· .· .· .· 1 .· 1 .→ 11
2 0.28 0.24 0.73 0.22 0.64 0.80 2 .· 1 .· .· .· .· 2 .→ 8
3 0.28 0.47 0.32 0.52 0.16 0.37 3 .· .· 1 .· .· .· 3 .→ 9
4 0.28 0.62 0.81 0.25 0.64 0.85 4 1 .· .· .· .· .· 4 .→ 7
5 0.41 0.37 0.89 0.25 0.81 0.97 5 .· .· .· 1 .· .· 5 .→ 10
6 0.66 0.76 0.21 0.89 0.22 0.14 6 .· .· .· .· .· 1 6 .→ 12
Table 9.2 Optimal matching, .nB = 6 individuals in class B (per row) and .nA = 10 individuals in
class A (per column). The table on the left-hand side is the distance matrix .C, whereas the table on
the right-hand side is the optimal weight matrix .P ⋆ in .U6,10 , solution of Eq. (9.2)
7 8 9 10 11 12 13 14 15 16 7 8 9 10 11 12 13 14 15 16
1 0.41 0.55 0.22 0.64 0.04 0.25 0.24 0.77 0.74 0.55 1 .· .· .1/5 .· .3/5 .· .1/5 .· .· .·
2 0.28 0.24 0.73 0.22 0.64 0.80 0.76 0.76 0.12 0.10 2 .· .2/5 .· .· .· .· .· .· .· .3/5
3 0.28 0.47 0.32 0.52 0.16 0.37 0.27 0.68 0.63 0.45 3 .3/5 .· .· .· .· .· .2/5 .· .· .·
4 0.28 0.62 0.81 0.25 0.64 0.85 0.58 0.32 0.51 0.48 4 .· .· .· .2/5 .· .· .· .3/5 .· .·
5 0.41 0.37 0.89 0.25 0.81 0.97 0.91 0.81 0.05 0.25 5 .· .1/5 .· .1/5 .· .· .· .· .3/5 .·
6 0.66 0.76 0.21 0.89 0.22 0.14 0.33 0.96 0.99 0.79 6 .· .· .2/5 .· .· .3/5 .· .· .· .·
that .ν B┬ 1nB = ν A┬ 1nA (identical sums), define
.U(ν B , ν A ) = { M ∈ R+nB ×nA : M1nA = ν B and M ┬ 1nB = ν A },
where .R+nB ×nA is the set of .nB × nA matrices with positive entries. This set of matrices is a convex polytope (see Brualdi 2006).
In our case, let us denote .U(1B , (nB /nA ) 1A ) as .UnB ,nA . Then the problem we want to solve is simply
.P ⋆ ∈ argminP ∈UnB ,nA 〈P , C〉 or argminP ∈UnB ,nA { ∑i=1..nB ∑j=1..nA Pi,j Ci,j }.   (9.2)
Here, .P ⋆ is no longer a permutation matrix, but .P ⋆ ∈ UnB ,nA , so that sums per row are equal to one (with positive entries), and can be considered as weights (as with permutation matrices); but here, sums per column are not equal to one; they are equal to the ratio .nB /nA . To illustrate, consider Table 9.2, with .nB = 6 individuals in class B (per row) and .nA = 10 individuals in class A (per column). Consider individual .i = 3, in group B. She is matched with a weighted sum of individuals in group A, namely .j = 7 and .13, with respective weights .3/5 and .2/5. Thus, in this very specific example, model m would be seen to be fair for individual .i = 3 if
.m(x 3 , B) = (3/5) m(x 7 , A) + (2/5) m(x 13 , A).
A transport .T is a deterministic function that couples .x0 and .x1 (or more generally vectors in the same space) in the sense that .x1 = T(x0 ). And .(X0 , T(X0 )) is a coupling of two measures .P0 and .P1 (for a more general framework than the p and q considered above), which can be related to the equation below, seen as the continuous (and more general) version of Eq. (9.2): a coupling .ν is optimal if
.ν⋆ ∈ arginfν∈Π(P0 ,P1 ) E[ Ф(X 0 , X 1 ) ], where (X 0 , X 1 ) ∼ ν.
After defining the optimal mapping in the univariate case using quantiles, we mentioned that the optimal transport between two univariate Gaussian distributions (.N(μ0 , σ02 ) and .N(μ1 , σ12 ), respectively) is
.x1 = T⋆N (x0 ) = μ1 + (σ1 /σ0 )(x0 − μ0 ),
and, in the multivariate Gaussian case,
.x 1 = T⋆N (x 0 ) = μ1 + A(x 0 − μ0 ),
for an appropriate (symmetric positive definite) matrix .A.
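For completeness, a sketch of that multivariate Gaussian transport is given below. The closed-form expression used for A, namely .A = Σ0−1/2 (Σ01/2 Σ1 Σ01/2 )1/2 Σ0−1/2 , is the standard optimal transport result between Gaussian measures; it is stated here as an assumption, since the explicit formula is not reproduced above.
matrix_sqrt <- function(M) {   # symmetric square root, via the eigendecomposition
  e <- eigen(M, symmetric = TRUE)
  e$vectors %*% diag(sqrt(pmax(e$values, 0))) %*% t(e$vectors)
}
gaussian_transport <- function(x0, mu0, Sigma0, mu1, Sigma1) {
  S0h  <- matrix_sqrt(Sigma0)
  S0hi <- solve(S0h)
  A <- S0hi %*% matrix_sqrt(S0h %*% Sigma1 %*% S0h) %*% S0hi
  mu1 + A %*% (x0 - mu0)
}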
Plečko et al. (2021) considered a more general approach to solving problems such as the one in Fig. 9.2d: as previously, s causes x1 and x2 , but x2 is also influenced by x1 . Therefore, when transporting on x2 , we should consider not the quantile function of x2 conditional on s, but a quantile regression function, of x2 conditional on s and x1 .
A model satisfies the “counterfactual fairness” property (Definition 9.5) if, “had the protected attributes (e.g., race) of the individual been different, other things being equal, the decision would have remained the same.” This is a ceteris paribus definition of counterfactual fairness. But a mutatis mutandis definition of the conditional average treatment effect can be considered (as in Berk et al. (2021b) and Charpentier et al. (2023a)).
Definition 9.12 (Mutatis Mutandis Counterfactual Fairness (Kusner et al. 2017)) If the prediction in the real world is the same as the prediction in the counterfactual world, mutatis mutandis, where the individual would have belonged to a different demographic group, we have counterfactual fairness, i.e.,
.E[Y⋆S←A |X = T(x)] = E[Y⋆S←B |X = x], ∀x.
Table 9.3 Definitions of individual fairness, see Carey and Wu (2022) for a complete list
Similarity Fairness (Lipschitz), Dwork et al. (2012), (9.1): .Dy (ŷi , ŷj ) ≤ Dx (x i , x j ), ∀i, j
Proxy-Based Fairness, Kilbertus et al. (2017), (9.3): .E[Y |do(S = A)] = E[Y |do(S = B)]
Fairness on Average Treatment Effect, Kusner et al. (2017), (9.4): .E[Y⋆S←A ] = E[Y⋆S←B ]
Counterfactual Fairness, Kusner et al. (2017), (9.5): .E[Y⋆S←A |X = x] = E[Y⋆S←B |X = x]
Path-Specific Effect, Avin et al. (2005), (9.6): .E[Y |doπ (S = A)] = E[Y |doπ (S = B)]
Path-Specific Counterfactual Effect, Wu et al. (2019), (9.7): .E[Y |doπ (S = A), F] = E[Y |doπ (S = B), F]
Mutatis Mutandis Counterfactual, Kusner et al. (2017), (9.12): .E[Y⋆S←A |X = T(x)] = E[Y⋆S←B |X = x]
the top, we have Betty, Brienne, and Beatrix, and for those three individuals, we
want to quantify some possible individual discrimination. At the bottom, we have
Alex, Ahmad, and Anthony, who are somehow “male versions” of Betty, Brienne,
and Beatrix, in the sense that they share the same legitimate characteristics .x. Even
if the distance between .x is null (as in the Lipschitz property) within each pair, a
“proper counterfactual” of Betty is not Alex (neither is Ahmad the counterfactual of
Brienne and neither is Anthony the counterfactual of Beatrix). We use the techniques
mentioned previously to construct counterfactuals to those individuals.
For the first block, we simply use marginal transformations, as in Definition 9.8.
For example, for Brienne, .x1 = 1, which is the median value among women (group
B), if Brienne had been a man (group A), and if we assume that she would have
kept the same relative position (the median), her corresponding value for .x1 would
have been .−1. Similarly for .x3 . But for .x2 , 5 is also the median in group B, and as
the median is almost the same in group A, .x2 remains the same. Thus, if Brienne,
characterized by .(x1 , x2 , x3 , B), had been a man, there would be discrimination with
model m if .m(T1 (x1 ), T2 (x2 ), T3 (x3 ), A) < m(x1 , x2 , x3 , B).
Instead of marginal transformations, it is possible to consider the two other
techniques mentioned previously, optimal transport (using transport in R) and
a multivariate Gaussian transport. Formally, we consider here causal graphs of
Fig. 9.1, in the sense that s has a causal impact on both .x1 and .x2 , and not on
.x3 . In Table 9.4, counterfactuals are created for Betty, Brienne, and Beatrix, with quantile-based transformations (Definition 9.8) and with multivariate Gaussian transport (based on Proposition 9.2). For example, Beatrix, corresponding to .(2, 8), is mapped to .(−0.27, 7.9) in the first case and .(−0.38, 7.82) in the second case.
With the fairadapt function, from the fairadapt package based on Plečko et al. (2021), it is also possible to create some counterfactuals. To do so, we consider
the causal networks of Fig. 9.2e and f, the first one being that used to generate
data. Compared with the causal networks of Fig. 9.1, here, we take into account
the existing correlation between .x1 and .x3 , in the sense that an intervention on s
changes .x1 , and even if s has no direct impact on .x3 , it will go through the path
.π = {s, x1 , x3 }. Then, all variables may have an impact on y. To model properly that
Fig. 9.2 (d) Causal graph with no direct impact of s on y, but two mediators, and possibly, .x1 may
cause .x2 . (e) is similar to (c) with an additional indirect connection from .x1 to y, via mediator .x3 .
(f) is similar to (d) with an additional indirect connection from .x1 to y, via mediator .x3
Fig. 9.3 Optimal transport on a .(x1 , x2 ) grid, from B to A, using fairadapt on the left-hand
side, and, on the right-hand side using a parametric estimate of .T⋆ (based on Gaussian quantiles).
Only individuals in group B are transported here
On the left-hand side of Fig. 9.6, we have the scatterplot on .(x1 , x2 ), with points
in the A group mainly on the left-hand side and in the group B on the right-hand side.
On the right-hand side of Fig. 9.6, gray segments map individuals in group B and in
group A. The three points on the right-hand side, .• are Betty, Brienne, and Beatrix,
and on the left-hand side .• they denote the three matched individuals in group A.
In Fig. 9.7, we can visualize the optimal transport on a .(x1 , x2 ) grid, from B to A,
using a nonparametric estimate of .T⋆ (based on empirical quantiles) on the left-hand
side, and a Gaussian distribution on the right-hand side.
Fig. 9.4 Scatterplot .(m(x i ), m(T⋆ (x i ))), adapted prediction against the original prediction, for
individuals in groups A and B, on the toydata2 dataset. Transformation is from group B to
group A (therefore predictions in group A remain unchanged). Model m is, at the top, plain logistic
regression (GLM) and at the bottom, a random forest
The GermanCredit dataset was considered in Sect. 8.8, where group fairness
metrics were computed. Recall that the proportion of empirical defaults in the
training dataset was 35% for males (group .M) whereas it was 28% for females
(group .F). And here, we try to quantify potential individual discrimination against
women. In Fig. 9.8, we have a simple causal graph on the germancredit
dataset, with .s → {duration, credit}, and .{duration, credit} → y,
as well as all other variables (that could cause y) (Fig. 9.9). The causal graph of
Fig. 9.10 is the one used in Watson et al. (2021). On that second causal graph,
.s → {savings, job} and then multiple causal relationships. Finally, on the
Fig. 9.5 The plain line is the density of .x1 for group A, whereas the plain area corresponds to
group B, on the toydata2 dataset. On the left-hand side, we have the distribution on the training
dataset, and on the right-hand side, the density of the adapted variables .x1 (with a transformation
from group B to group A)
Fig. 9.6 Optimal matching, of individuals in group B to individuals in group A, on the right, where
points .• are Betty, Brienne, and Beatrix, and .• their counterfactual version in group A
causal graph of Fig. 9.11, four causal links are added, to the previous one, .s →
{duration, credit_Amount} and .{duration, credit_Amount} → y.
Based on the causal graph of Fig. 9.8, when computing the counterfactual, most
variables that are children of the sensitive attribute s are adjusted (here .x1 is
Duration and .x2 is Credit_Amount). As in the previous section, we consider
three pairs of individuals; within each pair, individuals share the same .x, only
s changes (more details on the other features are also given). And again, the
counterfactual version of Betty (.F) is not Alex (.M). For instance, the amount of credit
that was initially 1262 for Betty should be different if Betty had been a man. If we
were using some quantile-based transport, as in Table 9.5, 1262 corresponds to the
25% quantile for men, and 17% quantile for women. So, if Betty were considered a
man, with a credit amount corresponding to the 17% quantile within men, it should
Fig. 9.7 Optimal transport on a .(x1 , x2 ) grid, from B to A, using a nonparametric estimate of .T⋆
(based on empirical quantiles) on the left-hand side, and a Gaussian distribution on the right-hand
side
Fig. 9.8 Simple causal graph on the GermanCredit dataset, where all variables could have
a direct impact on y (or default), except for the sensitive attribute (s), which has an indirect
impact through two possible paths, via either the duration of the credit (duration) or the amount
(credit_Amount)
be a smaller amount, about 1074. Again, after that transformation, when everyone
is considered as a man, distributions of credit and duration are identical.
Here, eight models are considered: a logistic regression (GLM), a boosting (with
Adaboost algorithm, as presented in Sect. 3.3.6), a classification tree (as in Fig. 3.25)
and a random forest (RF, with 500 trees), each time with the unaware version (based on
.x only, not s) and the aware version (including s). For all unaware models, Alex (.M)
and Betty (.F) get the same prediction. But for most models, aware models yield
different predictions when individuals have a different gender. Predictions of those
six individuals are given in Table 9.7. Those individual predictions can be visualized
in Fig. 9.9, when the causal graph is the one of Fig. 9.8. Lighter points correspond
Fig. 9.9 Scatterplot .(m(x i ), m(T⋆ (x i ))) for individuals in groups M and F, on the
GermanCredit dataset. Transformation is only from group F to group M, so that all individuals
are seen as men, on the y-axis. Model m is, from top to bottom, a plain logistic regression (GLM),
a boosting model (GBM) and a random forest (RF). We used fairadapt codes and the causal
graph of Fig. 9.8
Fig. 9.10 Causal graph on the germancredit dataset, from Watson et al. (2021)
Fig. 9.11 Causal graph on the germancredit dataset, obtained from that of Fig. 9.10 by adding the links .s → {duration, credit_Amount} and .{duration, credit_Amount} → y
Table 9.5 Counterfactuals based on the causal graph of Fig. 9.8, based on marginal quantile
transformations. We consider here only counterfactual versions of women, considered as the
"disfavored"

                               Alex   Ahmad  Anthony  Betty  Brienne  Beatrix
s (gender)                     M      M      M        F      F        F
.x1 Duration                   12     18     30       12     18       30
.u = F_{1|s}(x1)               36%    57%    86%      34%    50%      79%
.T(x1) = F^{-1}_{1|s=M}(u)     12     18     30       12     18       24
.x2 Credit                     1262   2319   4720     1262   2319     4720
.u = F_{2|s}(x2)               25%    55%    82%      17%    45%      76%
.T(x2) = F^{-1}_{2|s=M}(u)     1262   2319   4720     1074   1855     3854
to the entire dataset, and darker points are the six individuals. Predictions with the
random forest are not very robust here.
For the logistic regression and the boosting algorithm (see Table 9.7), counter-
factual predictions are rather close to the original ones. For example, if Brienne
had been a man, mutatis mutandis, the default probability would have been .23.95%
instead of .24.30% (as Ahmad), when considering the unaware logistic regression.
Table 9.6 Predictions, with .95% confidence intervals, based on the causal graph of Fig. 9.8, using
marginal quantile transformations. We consider here only counterfactual versions of women,
considered as the "disfavored"

                                   Betty                   Brienne                 Beatrix
Unaware logistic .m(x)             39.7% [23.9%; 57.9%]    24.3% [13.8%; 39.1%]    30.9% [15.7%; 51.7%]
Unaware logistic .m(T(x))          39.5% [23.8%; 57.8%]    24.0% [13.6%; 38.7%]    24.9% [12.2%; 44.1%]
Aware logistic .m(x, s = F)        36.7% [21.1%; 55.6%]    22.6% [12.5%; 37.4%]    30.1% [15.2%; 50.8%]
Aware logistic .m(x, s = M)        42.1% [25.4%; 60.8%]    26.8% [15.0%; 43.2%]    35.1% [17.6%; 57.8%]
Aware logistic .m(T(x), s = M)     41.9% [25.2%; 60.7%]    26.5% [14.8%; 42.8%]    28.6% [13.7%; 50.2%]
The impact is larger for Beatrix, who had an initial prediction of default of .30.88%,
as a woman (even if the logistic regression is gender blind), and the same model, on
the counterfactual version of Beatrix, would have predicted .24.91%. It could be seen
as sufficiently different to consider that Beatrix could legitimately feel discriminated
against because of her gender. But as shown in Table 9.6, we can easily compute confidence
intervals for predictions with a GLM (it would be more complicated for boosting
algorithms and random forests), and the difference is not statistically significant.
More complex models can be considered. Using function fairTwins, it is
possible to get a counterfactual for each individual in the dataset, as suggested in
Szepannek and Lübke (2021), even when we consider categorical covariates. The
first “realistic” causal graph we consider is the one used in Watson et al. (2021), on
that same dataset, that can be visualized in Fig. 9.10.
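Below is a minimal sketch of how such adapted (counterfactual) data can be obtained with the fairadapt package; the small synthetic dataset, the variable names and the adjacency matrix mimic the causal graph of Fig. 9.8 but are purely illustrative, and the exact arguments may have to be adapted to the version of the package installed,

library(fairadapt)   # Plecko et al. (2021)

set.seed(42)
n        <- 500
sex      <- factor(sample(c("M", "F"), n, replace = TRUE))
duration <- round(12 + 6 * (sex == "F") + rexp(n, 1/6))
credit   <- round(1000 + 50 * duration + rnorm(n, 0, 500))
default  <- factor(rbinom(n, 1, plogis(-3 + 0.03 * duration + 0.0002 * credit)))
train    <- data.frame(default, sex, duration, credit)

# adjacency matrix of the causal graph: sex -> {duration, credit} -> default
vars <- c("sex", "duration", "credit", "default")
adj  <- matrix(0, 4, 4, dimnames = list(vars, vars))
adj["sex", c("duration", "credit")]     <- 1
adj[c("duration", "credit"), "default"] <- 1

fit <- fairadapt(default ~ ., train.data = train,
                 prot.attr = "sex", adj.mat = adj)

# adapted (counterfactual) training data, everyone seen at the baseline level of sex
head(adaptedData(fit))

The fairTwins function of the same package can then be used to extract, for each individual, its counterfactual "twin".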
In the GermanCredit dataset, several variables in .x are categorical. For
ordered categorical variables (such as Savings_Bonds, taking values < 100
DM, 100 <= ...< 500 DM, etc.) it is possible to adapt optimal transport
techniques, as suggested in Plečko et al. (2021), assuming that ordered categories
are .{1, 2, · · · , m} and using a cost function .γ (i, j ) = |i − j |k . For non-ordered
categorical variables (such as Job, or Housing, the latter taking values such as
‘rent’, ‘own’, or ‘for free’), a cost function .γ(i, j) = 1(i ≠ j) is used.
And for continuous variables (Age, Credit_Amount or Duration), the previous
techniques can be used. As Age is not on any path from s to y in the causal graphs, it
will not change. In Fig. 9.12 we can visualize the distributions of x conditional on
.s = M and .s = F respectively when x is Credit_Amount and Duration.
Fig. 9.12 Distributions of x (Credit_Amount at the top and Duration at the bottom)
conditional on .s = A and .s = B
Predictions of the six individuals from Table 9.7 can be visualized in Fig. 9.13,
when the causal graph is that of Fig. 9.10. As in the previous section, there are
three pairs of individuals; within each pair, individuals share the same .x, only s
changes. Then, eight models are considered, a logistic regression (GLM), a boosting
(with Adaboost algorithm, as presented in Sect. 3.3.6), a classification tree (as in
Fig. 3.25), and a random forest (RF, with 500 trees), each time, with the unaware
version (based on .x only, not s) and the aware version (including s). For all unaware
models, Alex (.M) and Betty (.F) get the same prediction. But for most models,
aware models yield different predictions when individuals have a different gender.
Table 9.7 Creating counterfactuals for Betty, Brienne, and Beatrix in the GermanCredit dataset

Firstname  s   Firstname  s   Job                        Savings           Housing    Purpose
Alex       M   Betty      F   highly qualified employee  100 DM            rent       radio/television
Ahmad      M   Brienne    F   skilled employee           100<=...<500 DM   own        furniture
Anthony    M   Beatrix    F   unskilled - resident       no savings        for free   car (new)

Original data
          s  Age  Duration  Credit  .m_glm(x)  .m_glm(x,s)  .m_gbm(x)  .m_gbm(x,s)  .m_cart(x)  .m_cart(x,s)  .m_rf(x)  .m_rf(x,s)
Betty     F  26   12        1262    39.69%     36.66%       42.30%     43.26%       31.75%      31.75%        25.40%    23.20%
Brienne   F  33   18        2320    24.30%     22.61%       23.88%     21.08%       21.31%      21.31%        43.60%    33.60%
Beatrix   F  45   30        4720    30.88%     30.08%       28.49%     30.42%       15.38%      15.38%        23.40%    25.80%
Alex      M  26   12        1262    39.69%     42.10%       42.30%     44.86%       31.75%      31.75%        25.40%    21.60%
Ahmad     M  33   18        2320    24.30%     26.84%       23.88%     22.18%       21.31%      21.31%        43.60%    31.00%
Anthony   M  45   30        4720    30.88%     35.08%       28.49%     31.82%       15.38%      15.38%        23.40%    31.60%

Counterfactual adjusted data, with marginal quantile transformations, causal graph from Fig. 9.8
Betty     M  26   12        1074    39.51%     41.90%       40.69%     44.86%       31.75%      31.75%        23.80%    25.60%
Brienne   M  33   18        1855    23.95%     26.46%       23.88%     22.18%       21.31%      21.31%        43.00%    38.60%
Beatrix   M  45   24        3854    24.91%     28.58%       20.55%     20.31%       21.31%      21.31%        17.60%    24.80%

Counterfactual adjusted data, with fairAdapt, causal graph from Fig. 9.8
Betty     M  26   12        1110    42.73%     45.18%       44.24%     46.64%       31.75%      31.75%        22.2%     25.6%
Brienne   M  33   18        1787    23.90%     26.40%       23.88%     22.18%       21.31%      21.31%        43.2%     38.2%
Beatrix   M  45   24        3990    25.01%     28.70%       22.17%     23.60%       21.31%      21.31%        19.6%     26.4%

Counterfactual adjusted data, with fairAdapt, causal graph from Fig. 9.10
Betty     M  26   18        1778    52.23%     54.03%       40.05%     46.81%       21.31%      21.31%        34.80%    31.80%
Brienne   M  33   15        1864    32.25%     35.85%       31.60%     25.97%       21.31%      21.31%        23.00%    20.40%
Beatrix   M  45   21        3599    39.70%     43.16%       28.36%     28.90%       21.31%      21.31%        10.60%    13.40%

Counterfactual adjusted data, with fairAdapt, causal graph from Fig. 9.11
Betty     M  26   15        1882    49.05%     50.86%       35.32%     40.12%       21.31%      21.31%        27.8%     30.0%
Brienne   M  33   18        1881    50.76%     53.49%       43.00%     38.77%       21.31%      21.31%        10.8%     13.8%
Beatrix   M  45   24        3234    24.20%     26.23%       14.63%     16.84%       21.31%      21.31%        22.4%     19.0%
Fig. 9.13 Scatterplot .(m(x i ), m(T(x i ))) for individuals in groups M and F, on the
GermanCredit dataset. Transformation is only from group F to group M, so that all individuals
are seen as men, on the y-axis. Model m is, from top to bottom, a plain logistic regression (GLM),
a boosting model (GBM). We used fairadapt codes and the causal graph of Fig. 9.10
Part IV
Mitigation
1 Also called “Kranzberg’s First Law”. “By that I mean that technology’s interaction with the social
ecology is such that technical developments frequently have environmental, social, and human
consequences that go far beyond the immediate purposes of the technical devices and practices
themselves, and the same technology can have quite different results when introduced into different
contexts or under different circumstances,” Kranzberg (1995).
Chapter 10
Pre-processing
10.2 Orthogonalization
This approach is a classical one in the econometric literature, when linear models
are considered. Interestingly, it is possible to use it even if there are multiple
sensitive attributes. It is also the one discussed in Frees and Huang (2023). Write
the .n × k matrix .S as a collection of k vectors in .Rn, .S = (s_1 · · · s_k), that
will correspond to k sensitive attributes. The orthogonal projection on variables
.{s_1, · · · , s_k} is associated with the matrix .Π_S = S(S^⊤S)^{−1}S^⊤, while the projection
on the orthogonal complement of .S is .Π_{S⊥} = I − Π_S (see Gram–Schmidt orthogonalization, and
Fig. 10.1). Let .S̃ denote the collection of centered vectors (using matrix notations,
.S̃ = HS, where .H = I − (11^⊤)/n).
Write the .n × p matrix .X as a collection of p vectors in .Rn, .X = (x_1 · · · x_p).
For any .x_j, define

.x_j^⊥ = Π_{S̃⊥} x_j = x_j − S̃(S̃^⊤S̃)^{−1}S̃^⊤ x_j.

Observe that

.Cov(s, x_j^⊥) = (1/n) s^⊤ H x_j^⊥ = (1/n) s̃^⊤ Π_{S̃⊥} x_j = 0.

And similarly, the centered version of .x_j^⊥ is then also orthogonal to any .s. From
an econometric perspective, .x_j^⊥ can be seen as the residual of the regression of .x_j
against .s, obtained from least-squares estimation,

.x_j = S̃ β̂_j + x_j^⊥.
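In practice, this orthogonalization is simply a column-by-column least-squares residual; a minimal R sketch, on illustrative simulated data, is the following,

set.seed(1)
n  <- 500
s  <- factor(sample(c("A", "B"), n, replace = TRUE))
x1 <- rnorm(n, mean = ifelse(s == "B", 2, 0))
x2 <- rnorm(n, mean = 5, sd = 2)
X  <- cbind(x1, x2)

# x_j^perp = residual of the least-squares regression of x_j on s
# (equivalently, I - Pi_S applied to each column)
X_perp <- apply(X, 2, function(xj) residuals(lm(xj ~ s)))

# after orthogonalization, the correlation with 1_B(s) is (numerically) zero
round(cor(X_perp, as.numeric(s == "B")), 10)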
Asking for orthogonality, .x_j ⊥ s, is related to a null correlation between the two
variables. But if S is binary, this can be simplified. Recall that if .x = (x_1, · · · , x_n),
then

.x̄ = (1/n)(1^⊤x) and Var(x) = (1/n) x^⊤Hx, where H = I − (1/n)(11^⊤),

while

.Cov(s, x) = (1/n) s^⊤Hx.

Observe that Pearson's linear correlation is

.Cor(s, x) = (Hs/‖Hs‖)^⊤ (Hx/‖Hx‖) = s^⊤Hx / (‖Hs‖ ‖Hx‖),

where

.s̄_j = n_j/n and Δx̄_j = (1/n_j) Σ_{i: s_i = j} (x_i − x̄).

Observe that to ensure that our predictor is fair, we need to compute the
correlation between y and s. If y is also binary, .y ∈ {0, 1}^n, the covariance between
s and y can be written
10.3 Weights
The propensity score was introduced in Rosenbaum and Rubin (1983), as the
probability of treatment assignment conditional on observed baseline covariates.
The propensity score exists in both randomized experiments and in observational
studies, the difference is that in randomized experiments, the true propensity score
is known (and is related to the design of the experiment) whereas in observational
studies, the propensity score is unknown, and should be estimated. While logistic
regression is the classical technique used to estimate that conditional probability,
McCaffrey et al. (2004) suggested a boosted regression approach, whereas Lee
et al. (2010) suggested bagging techniques.
In Fig. 10.2, we can visualize .ω |→ Cor[- mω (x), 1B (s)], on two datasets,
toydata2 (on the left) and germancredit (on the right), where .m -ω is a logistic
regression with weights proportional to 1 in class A and .ω in class B. The large dot
is the plain logistic regression (with identical weights).
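A minimal R sketch of this re-weighting strategy is given below, on illustrative simulated data (not the actual toydata2 generator): group A individuals keep weight 1, group B individuals receive weight .ω, and we monitor the correlation between the fitted scores and .1_B(s),

set.seed(1)
n  <- 2000
s  <- factor(sample(c("A", "B"), n, replace = TRUE))
x  <- rnorm(n, mean = ifelse(s == "B", 1, -1))
y  <- rbinom(n, 1, plogis(-0.5 + x))
df <- data.frame(y = y, x = x, s = s)

cor_weighted <- function(omega) {
  w   <- ifelse(df$s == "B", omega, 1)
  # quasibinomial family, so that non-integer weights do not trigger warnings
  fit <- glm(y ~ x, family = quasibinomial, data = df, weights = w)
  cor(fitted(fit), as.numeric(df$s == "B"))
}

omega <- c(0.1, 0.5, 1, 2, 5, 10)
round(sapply(omega, cor_weighted), 3)   # omega = 1 is the plain (unweighted) logistic fit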
Fig. 10.2 .ω |→ Cor[- mω (x, s), 1B (s)], on two datasets, toydata2 (on the left-hand side)
and GermanCredit (on the right-hand side), where .m -ω is a logistic regression with weights
proportional to 1 in class A and .ω in class B
In the first block, at the top of Table 10.1, we have predictions for six individuals,
using logistic regression trained on toydata2, including the sensitive attribute. On
the left are the original values of the features .(x i , si ). The second part corresponds
to the transformation of the features, .x ⊥ i , so that each feature is orthogonal to the
sensitive attribute (.x ⊥j is then orthogonal to any .s). Observe that features have on
average the same values when conditioning with respect to the sensitive attribute.
Based on the orthogonalized features, .m̂^⊥(x^⊥) is the fitted unaware logistic regression.
This model is fair with respect to demographic parity, as the demographic parity
ratio, .E[m̂(Z)|S = B]/E[m̂(Z)|S = A], is close to one. Observe further that the
empirical correlation between .m̂(x_i, s_i) and .1_B(s_i) was initially .0.72, and after
orthogonalization, the empirical correlation between .m̂^⊥(x_i^⊥) and .1_B(s_i) is .0.01.
At the bottom (third block), we can also visualize statistics about equalized odds,
with ratios .E[m̂(Z)|S = B, Y = y]/E[m̂(Z)|S = A, Y = y] for .y = 0 (at the
top) and .y = 1 (at the bottom). Values should be close to one to have equalized
odds, which is not the case here. On the right-hand side, in the very last column, we
consider a model based on weights in the training sample, .m̂_ω(x). For the choice of
the weight, Fig. 10.2 suggested using very large weights for individuals in class B.
This model is slightly fairer than the original plain logistic regression.
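The ratios reported in Table 10.1 can be computed with a small helper function such as the one below (a sketch, assuming m, s and y are vectors of predictions, group labels and observed outcomes),

# demographic parity and equalized odds ratios, plus correlation with 1_B(s)
fairness_ratios <- function(m, s, y) {
  dp <- mean(m[s == "B"]) / mean(m[s == "A"])
  eo <- sapply(0:1, function(k)
    mean(m[s == "B" & y == k]) / mean(m[s == "A" & y == k]))
  c(dem.parity = dp, eq.odds.y0 = eo[1], eq.odds.y1 = eo[2],
    cor.with.1B = cor(m, as.numeric(s == "B")))
}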
In Figs. 10.3 and 10.4, we can visualize the orthogonalization method, with, in
Fig. 10.3, the optimal transport plot, between distributions of .m̂(x_i, s_i) (on the x-axis)
to .m̂^⊥(x_i^⊥) (on the y-axis), for individuals in group A on the left-hand side, and
in group B on the right-hand side. In Fig. 10.4, on the left, we can visualize densities
of .m̂^⊥(x_i^⊥) from individuals in group A and in group B (thin lines are densities
of scores from the original plain logistic regression .m̂(x_i, s_i)). On the right-hand
side, we have the scatterplot of points .(m̂(x_i, s_i = A), m̂^⊥(x_i^⊥)) and .(m̂(x_i, s_i =
B), m̂^⊥(x_i^⊥)). The six individuals at the top of Table 10.1 are emphasized, with
Table 10.1 Predictions using logistic regressions on toydata2, with two "pre-processing" approaches, with orthogonalized features .m̂^⊥(x^⊥) and using
weights .m̂_ω(x). The first six rows describe six individuals. The next three rows are a check for demographic parity, with averages conditional on s, whereas
the last six rows are a check for equalized odds, with averages conditional on s and y

                         x1      x2     x3      s  y      .m̂(x,s)  x1^⊥    x2^⊥    x3^⊥    .m̂^⊥(x^⊥)  .m̂_ω(x)
Alex                     0       2      0       A  GOOD   0.138     0.957   −2.924  0.968   0.355      0.221
Betty                    0       2      0       B  BAD    0.274     −0.958  −3.128  −0.938  0.123      0.221
Ahmad                    1       5      1       A  BAD    0.546     1.957   0.076   1.968   0.709      0.761
Brienne                  1       5      1       B  GOOD   0.739     0.042   −0.128  0.062   0.383      0.761
Anthony                  2       8      2       A  GOOD   0.900     2.957   3.076   2.968   0.915      0.973
Beatrix                  2       8      2       B  BAD    0.955     1.042   2.872   1.062   0.733      0.973
Average | S = A          −0.967  4.919  −0.976     0.223  0.223     0.223   −0.010  −0.006  0.402      0.288
Average | S = B          0.958   5.132  0.935      0.675  0.675     0.223   −0.010  −0.006  0.408      0.674
(Difference)             +2      ∼0     +2         ×3.027 ×3.027    ∼0      ∼0      ∼0      ×1.015     ×2.340
Average | S = A, Y = 0   −1.018  4.357  −0.993     0.000  0.184     0.000   −0.061  −0.567  0.362      0.239
Average | S = B, Y = 0   0.194   3.569  0.437      0.000  0.463     0.000   −0.061  −0.567  0.228      0.435
(Difference)                                              ×2.516    ∼0      ∼0      ∼0      ×0.630     ×1.820
Average | S = A, Y = 1   −0.788  6.873  −0.917     1.000  0.362     1.000   0.169   1.949   0.540      0.458
Average | S = B, Y = 1   1.325   5.884  1.175      1.000  0.777     1.000   0.169   1.949   0.494      0.789
(Difference)                                              ×2.146    ∼0      ∼0      ∼0      ×0.915     ×1.723
Fig. 10.4 On the left-hand side, densities of .m̂^⊥(x_i^⊥) from individuals in group A and in B
(thin lines are densities of scores from the plain logistic regression .m̂(x_i, s_i)). On the right-hand
side, scatterplot of points .(m̂(x_i, s_i = A), m̂^⊥(x_i^⊥)) and .(m̂(x_i, s_i = B), m̂^⊥(x_i^⊥)), from the
toydata2 dataset
vertical segments showing the (individual) difference between the initial model and
the fair one.
In Figs. 10.5 and 10.6, we can visualize similar plots for the weight method,
with, in Fig. 10.5, the optimal transport plot between distributions of .m̂(x_i, s_i) (on
the x-axis) to .m̂_ω(x_i) (on the y-axis), for individuals in group A on the left-hand
side, and in group B on the right-hand side. In Fig. 10.6, on the left-hand side, we can
visualize densities of .m̂_ω(x_i) from individuals in group A and in group B (again, thin
lines are the original densities of scores from the plain logistic regression .m̂(x_i, s_i)).
On the right-hand side, we have the scatterplot of points .(m̂(x_i, s_i = A), m̂_ω(x_i))
and .(m̂(x_i, s_i = B), m̂_ω(x_i)). We emphasize again the six individuals at the top of
Table 10.1.
Fig. 10.6 On the left-hand side, densities of .m̂_ω(x_i) from individuals in group A and in B (thin
lines are densities of scores from the plain logistic regression .m̂(x_i, s_i)). On the right-hand side,
scatterplot of points .(m̂(x_i, s_i = A), m̂_ω(x_i)) and .(m̂(x_i, s_i = B), m̂_ω(x_i)), from the toydata2
dataset
Fig. 10.7 Optimal transport between distributions of .m̂^⊥(x_i^⊥) for individuals in group A (x-axis)
to individuals in group B (y-axis) on the left-hand side, and of .m̂_ω(x_i) on the right-hand side
The same analysis can be performed on the GermanCredit dataset, with plain
logistic regression here too, but other techniques described in Chap. 3 could be
considered. For the orthogonalization, it is performed on the .X design matrix,
which contains indicators for all factors (but the reference level) for categorical variables.
Observe that the empirical correlation between .m -(x i , si ) and .1B (si ) was initially
.−0.195. After orthogonalization, the empirical correlation between .m -⊥ (x ⊥i ) and
.1B (si ) is now .0.009. In Table 10.2, we have averages of score predictions using the
Table 10.2 Averages of score predictions using the plain logistic regressions on
GermanCredit at the top, with two "pre-processing" approaches, with orthogonalized
features .m̂^⊥(x^⊥) and using weights .m̂_ω(x), at the bottom. The first block of columns is a check
for demographic parity, with averages conditional on s, whereas the next two blocks are a check
for equalized odds, with averages conditional on s and y

              Demographic parity        Equalized odds (.y = 0)    Equalized odds (.y = 1)
              A      B      (Ratio)     A      B      (Ratio)      A      B      (Ratio)
.m̂(x, s)      0.352  0.277  ×0.787      0.299  0.237  ×0.792       0.448  0.381  ×0.850
.m̂^⊥(x^⊥)     0.298  0.301  ×1.010      0.249  0.260  ×1.045       0.387  0.408  ×1.054
.m̂_ω(x)       0.289  0.277  ×0.958      0.249  0.235  ×0.946       0.363  0.386  ×1.065
Fig. 10.9 On the left-hand side, densities of .m̂^⊥(x_i^⊥) from individuals in group A and in B (thin
lines are densities of scores from the plain logistic regression .m̂(x_i, s_i)). On the right-hand side,
the scatterplot of points .(m̂(x_i, s_i = A), m̂^⊥(x_i^⊥)) and .(m̂(x_i, s_i = B), m̂^⊥(x_i^⊥)), from the
germancredit dataset
Fig. 10.11 On the left-hand side, densities of .m̂_ω(x_i) from individuals in group A and in group B
(thin lines are densities of scores from the plain logistic regression .m̂(x_i, s_i)). On the right-hand
side, the scatterplot of points .(m̂(x_i, s_i = A), m̂_ω(x_i)) and .(m̂(x_i, s_i = B), m̂_ω(x_i)), from the
GermanCredit dataset
and in group B on the right-hand side. In Fig. 10.9, on the left-hand side, we can
visualize densities of .m̂^⊥(x_i^⊥) from individuals in group A and in group B (thin
lines are densities of scores from the original plain logistic regression .m̂(x_i, s_i)).
On the right-hand side, we have the scatterplot of points .(m̂(x_i, s_i = A), m̂^⊥(x_i^⊥))
and .(m̂(x_i, s_i = B), m̂^⊥(x_i^⊥)). The six individuals mentioned (in Table 9.7) are
again emphasized.
In Figs. 10.10 and 10.11, we can visualize similar plots for the weight method,
with, in Fig. 10.10, the optimal transport plot, between distributions of .m̂(x_i, s_i)
(on the x-axis) to .m̂_ω(x_i) (on the y-axis), for individuals in group A on the left-hand
side, and in group B on the right-hand side. In Fig. 10.11, on the left-hand side,
we can visualize densities of .m̂_ω(x_i) from individuals in group A and in group B
(again, thin lines are the original densities of scores from the plain logistic regression
Fig. 10.12 Optimal transport between distributions of .m̂^⊥(x_i^⊥) for individuals in group A (x-axis)
to individuals in group B (y-axis) on the left-hand side, and of .m̂_ω(x_i) on the right-hand side, on
the GermanCredit dataset
.m̂(x_i, s_i)). On the right-hand side, we have the scatterplot of points .(m̂(x_i, s_i =
A), m̂_ω(x_i)) and .(m̂(x_i, s_i = B), m̂_ω(x_i)).
In Fig. 10.12, we can visualize the optimal transport between distributions of
.m̂^⊥(x_i^⊥) for individuals in group A (x-axis) to individuals in group B (y-axis) on
the left, and of .m̂_ω(x_i) on the right-hand side. If the "transport line" is on the first
diagonal, the Wasserstein distance between the two conditional distributions is close
to 0, meaning that demographic parity is satisfied.
Chapter 11
In-processing
Classically, we have seen (see Definition 3.3) that models were obtained by
minimizing the empirical risk, the sample-based version of .R(m),

.m̂ ∈ argmin_{m∈M} {R(m)}, where R(m) = E[ℓ(Y, m(X, S))],

for some set of models, .M. Quite naturally, we could consider a constrained
optimization problem,

.m̂ ∈ argmin_{m∈M} {R(m)}, s.t. m fair,

for some fairness criterion. Using standard results in optimization, one could
consider a penalized version of that problem,

.m̂ ∈ argmin_{m∈M} {R(m) + λℛ(m)},

where .ℛ(m) denotes a positive regularizer, indicating the extent to which the
fairness criterion is violated, as suggested in Zafar et al. (2017), Zhang and
Bareinboim (2018), Agarwal et al. (2018), Kearns et al. (2018), Li and Fan (2020),
and Li and Liu (2022). As a (technical) drawback, adding a regularizer that is
nonconvex could increase the complexity of optimization, as mentioned in Roth
et al. (2017) and Cotter et al. (2019).
or, with synthetic notations, .P_A[A] = P_B[A] = P[A]. Thus, a classical metric for
strong demographic parity would be a weighted distance between the conditional
distributions, for some appropriate weights .w_A and .w_B. Strong equal opportunity is
achieved if

.P[m(X, S) ∈ A | S = s, Y = 1] = P[m(X, S) ∈ A | Y = 1], ∀s, ∀A.

Consider a logistic regression, with log-likelihood

.log L(β) = Σ_{i=1}^n y_i log[m(x_i)] + (1 − y_i) log[1 − m(x_i)],

and .m(x_i) = exp[x_i^⊤β]/(1 + exp[x_i^⊤β]). Inspired by Zafar et al. (2019), fairness constraints
related to disparate impact and disparate mistreatment criteria discussed previously,
which should be strictly satisfied, could be introduced. For example, weak demographic
parity is achieved if .E[m(X)|S = A] = E[m(X)|S = B], and therefore

.β∗ = argmin_β {−log L(β)}, s.t. E[m(X)|S = A] = E[m(X)|S = B].
Quite naturally, one can consider a penalized version of that constrained optimization
problem,

.β_λ = argmin_β {−log L(β) + λ |E[m(X)|S = A] − E[m(X)|S = B]|}.

Following Scutari et al. (2022), observe that one could also write

.β̂ = argmin_β {−log L(β)}, s.t. |cov[m_β(x), 1_B(s)]| ≤ ε,

where .m_β(x_i) = exp[x_i^⊤β]/(1 + exp[x_i^⊤β]).
The sample-based version of the penalty would be

.β_λ = argmin_β {−log L(β) + λ |(1/n_A) Σ_{i:s_i=A} m_β(x_i) − (1/n_B) Σ_{i:s_i=B} m_β(x_i)|},

or, because

.cov[m_β(x), 1_B(s)] = E[m_β(x)(1_B(s) − E[1_B(s)])] − E[1_B(s) − E[1_B(s)]] · E[m_β(x)],

where the second term is null, the empirical version of the covariance is

.cov[m_β(x), 1_B(s)] = (1/n) Σ_{i=1}^n (1_B(s_i) − 1̄_B) · m_β(x_i), where 1̄_B = n_B/n.
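A minimal R sketch of this penalized maximum likelihood problem, solved numerically with optim on illustrative simulated data (the data generator and the value of .λ are arbitrary), is the following,

set.seed(1)
n  <- 1000
s  <- sample(c("A", "B"), n, replace = TRUE)
x1 <- rnorm(n, mean = ifelse(s == "B", 1, -1))
x2 <- rnorm(n, mean = 5, sd = 2)
y  <- rbinom(n, 1, plogis(-2 + x1 + 0.3 * x2))
Z  <- cbind(1, x1, x2, as.numeric(s == "B"))   # aware design matrix (intercept, x1, x2, 1_B(s))

obj <- function(beta, lambda) {
  p <- plogis(drop(Z %*% beta))
  # negative log-likelihood of the logistic regression
  nll <- -sum(y * log(p) + (1 - y) * log(1 - p))
  # demographic parity penalty: |cov(m_beta(x), 1_B(s))|
  nll + lambda * abs(cov(p, as.numeric(s == "B")))
}

beta0     <- rep(0, ncol(Z))
fit_plain <- optim(beta0, obj, lambda = 0,     control = list(maxit = 5000))
fit_fair  <- optim(beta0, obj, lambda = 20000, control = list(maxit = 5000))

dp_ratio <- function(beta) {
  p <- plogis(drop(Z %*% beta))
  mean(p[s == "B"]) / mean(p[s == "A"])
}
c(plain = dp_ratio(fit_plain$par), penalized = dp_ratio(fit_fair$par))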
Komiyama et al. (2018) actually initiated that idea: it regresses the fitted values
against the sensitive attributes and the response, and it bounds the proportion
of the variance explained by the sensitive attributes over the total explained
variance in that model. In R, functions frrm and nclm in the package fairml
perform such regression, where in the former, the constraint is actually written as
a Ridge penalty (see Definition 3.15). Demographic parity is achieved with the
option "sp-komiyama" whereas equalized odds are obtained with the option
"eo-komiyama".
An alternative, suggested from Proposition 7.2, is to use the maximal correlation.
Recall that given two random variables U and V, the maximal correlation (also
coined "HGR," as it was introduced in Hirschfeld (1935), Gebelein (1941), and
Rényi (1959)) is

.r⋆(U, V) = HGR(U, V) = max_{f ∈ F_U, g ∈ G_V} E[f(U) g(V)],

where .F_U and .G_V denote sets of measurable functions f and g such that .f(U) and
.g(V) have zero mean and unit variance.
Mary et al. (2019) estimated the maximal correlation using (Gaussian) kernel
density estimates, in the context of fairness constraints (instead of a plain linear
correlation as done previously).
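The maximal correlation can also be approximated by restricting f and g to finite-dimensional families of functions; the crude sketch below uses polynomial bases and canonical correlations (it is not the kernel-based estimator of Mary et al. (2019)),

hgr_poly <- function(u, v, degree = 5) {
  # restrict f and g to polynomials of a given degree; the maximal correlation
  # over those spaces is then the first canonical correlation
  cc <- cancor(poly(u, degree), poly(v, degree))
  cc$cor[1]
}

set.seed(1)
u <- rnorm(2000)
v <- u^2 + rnorm(2000, sd = 0.2)   # nonlinear dependence, linear correlation close to 0
c(pearson = cor(u, v), hgr = hgr_poly(u, v))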
.s′ being the other category, and where .r(x, s, x′, s′) is a penalty on the
difference between the outcomes for an individual and its counterfactual version.
Russell et al. (2017) considered

.r(x, s, x′, s′) = |m̂(x, s) − m̂(x′, s′)|.
In Figs. 11.1 and 11.2, we can visualize .AUC(m̂_λ) against the demographic parity
ratio .E[m̂_βλ(Z)|S = B]/E[m̂_βλ(Z)|S = A], in Fig. 11.1 when .Z = (X, S) and
in Fig. 11.2, when .Z = X (unaware model), where, with general notations, the
prediction is obtained from a logistic regression,

.m̂_βλ(z) = exp[z^⊤β̂_λ] / (1 + exp[z^⊤β̂_λ]),
Fig. 11.1 On the left-hand side, accuracy–fairness trade-off plot (based on Table 11.1), with the
AUC of m βλ on the y-axis and the fairness ratio (demographic parity) on the x-axis. Top left
βλ (x i , si ) (with a logistic
corresponds to accurate and fair. On the right-hand side, evolution of m
regression) for three individuals in group A and three in B, on the toydata2 dataset
Fig. 11.2 On the left-hand side, accuracy–fairness (demographic parity) trade-off plot (based on
Table 11.1), with the AUC of m βλ on the y-axis and the fairness ratio on the x-axis. The thin line
is the one with the model taking into account s (from Fig. 11.1). On the right-hand side, evolution
of mβλ (x i ) (with a logistic regression) for three individuals in group A and three in B, on the
toydata2 dataset
Table 11.1 Penalized logistic regression, for different values of .λ, including s on the left-hand
side and excluding s on the right-hand side. At the top, values of .β̂_λ in the first block, and
predictions .m̂_βλ(x_i, s_i) and .m̂_βλ(x_i) for a series of individuals in the second block. At the bottom,
the demographic parity ratio, .E[m̂_βλ(Z)|S = B]/E[m̂_βλ(Z)|S = A] (fairness is achieved when the
ratio is 1), and AUC (the higher the value, the higher the accuracy of the model)

                       .m̂(x, s), aware                       .m̂(x), unaware
                       ← less fair          more fair →      ← less fair    more fair →
β0 (Intercept)         −2.55  −2.29  −1.97  −1.51  −1.03     −2.14  −1.98  −1.78  −1.63
β1 (x1)                0.88   0.88   0.85   0.77   0.62      1.01   0.84   0.57   0.26
β2 (x2)                0.37   0.37   0.35   0.32   0.25      0.37   0.35   0.31   0.24
β3 (x3)                0.02   0.02   0.02   0.02   0.03      0.15   0.02   −0.15  −0.29
βB (1B)                0.82   0.44   −0.03  −0.70  −1.31     –      –      –      –
Betty                  0.27   0.25   0.22   0.17   0.14      0.20   0.22   0.24   0.24
Brienne                0.74   0.71   0.66   0.54   0.40      0.70   0.66   0.55   0.38
Beatrix                0.95   0.95   0.93   0.87   0.73      0.96   0.93   0.82   0.55
Alex                   0.14   0.17   0.22   0.29   0.37      0.20   0.22   0.24   0.24
Ahmad                  0.55   0.61   0.66   0.70   0.71      0.70   0.66   0.55   0.38
Anthony                0.90   0.92   0.93   0.93   0.91      0.96   0.93   0.82   0.55
E[m̂(x_i, s_i)|S = A]   0.23   0.26   0.31   0.36   0.42      0.25   0.30   0.37   0.41
E[m̂(x_i, s_i)|S = B]   0.67   0.65   0.61   0.53   0.42      0.64   0.61   0.54   0.41
(ratio)                ×2.97  ×2.49  ×2.01  ×1.46  ×1.00     ×2.53  ×2.02  ×1.48  ×1.00
AUC                    0.86   0.86   0.85   0.82   0.74      0.86   0.85   0.82   0.70
where .β̂_λ is a solution of a penalized maximum likelihood problem,

.β̂_λ ∈ argmin_β {−log L(β) + λ · |cov[m_β(x), 1_B(s)]|}.

In Table 11.1, we have the outcome of penalized logistic regressions, for different
values of .λ, including s on the left-hand side and excluding s on the right-hand side.
At the top, values of .β̂_λ in the first block, and predictions .m̂_βλ(x_i, s_i) and .m̂_βλ(x_i)
for a series of individuals in the second block. At the bottom, the demographic parity
ratio, .E[m̂_βλ(Z)|S = B]/E[m̂_βλ(Z)|S = A] (fairness is achieved when the ratio is
1), and AUC (the higher the value, the higher the accuracy of the model).
In Fig. 11.3, we have the optimal transport plot, between distributions of
.m̂_βλ(x_i, s_i) from individuals in group A and in B, for different values of .λ (low
value on the left-hand side and high value on the right-hand side), associated with a
demographic parity penalty criterion.
In Fig. 11.4, we have, on the left-hand side, densities of .m̂_βλ(x_i, s_i) from
individuals in group A and in B (thin lines are densities of .m̂_β(x_i, s_i)). On the
right-hand side, the scatterplot of points .(m̂_β(x_i, s_i = A), m̂_βλ(x_i)), s = A, and
.(m̂_β(x_i, s_i = B), m̂_βλ(x_i)), s = B, where .m̂_β is the plain logistic regression, and
.m̂_βλ is the penalized logistic regression, from the toydata2 dataset, associated
with a demographic parity penalty criterion.
Fig. 11.3 Optimal transport between distributions of .m βλ (x i , si ) from individuals in group A and
in B, for different values of .λ (low value on the left-hand side and high value on the right-hand
side), associated with a demographic parity penalty criterion
In Fig. 11.5, we have the optimal transport plot, from the distribution of .m̂_β(x_i, s_i) to the
distribution of .m̂_βλ(x_i, s_i), for individuals in group A on the left-hand side, and in
group B on the right-hand side, for different values of .λ, with a low value at the top
and a high value at the bottom (fair model, associated with a demographic parity
penalty criterion).
In Fig. 11.6, we can visualize .AUC(m̂_λ) against the demographic parity ratio
.E[m̂_βλ(Z)|S = B]/E[m̂_βλ(Z)|S = A], where, with general notations, the prediction is
obtained from a logistic regression,

.m̂_βλ(z) = exp[z^⊤β̂_λ] / (1 + exp[z^⊤β̂_λ]),

where .β̂_λ is a solution of a penalized maximum likelihood problem,

.β̂_λ ∈ argmin_β {−log L(β) + λ · |cov[m_β(x), 1_B(s)]|}.
In Table 11.2, we can visualize the outcome of some penalized logistic regressions,
for different values of .λ, including s on the left-hand side and excluding s
on the right-hand side. At the top, values of .β̂_λ in the first block, and predictions
.m̂_βλ(x_i, s_i) and .m̂_βλ(x_i) for a series of individuals in the second block. At the
bottom, the class balance ratios, .E[m̂_βλ(Z)|S = B, Y = y]/E[m̂_βλ(Z)|S = A, Y = y]
(fairness is achieved when the ratio is 1), and AUC (the higher the value, the
higher the accuracy of the model).
Fig. 11.4 On the left-hand side, densities of .m̂_βλ(x_i, s_i) from individuals in group A and in B (thin
lines are densities of .m̂_β(x_i, s_i)). On the right-hand side, the scatterplot of points .(m̂_β(x_i, s_i =
A), m̂_βλ(x_i)), s = A, and .(m̂_β(x_i, s_i = B), m̂_βλ(x_i)), s = B, where .m̂_β is the plain logistic
regression, and .m̂_βλ is the penalized logistic regression, from the toydata2 dataset, associated
with a demographic parity penalty criterion
Fig. 11.5 Optimal transport from the distribution of .m β(x i , si ) to the distribution of .m
βλ (x i , si ),
for individuals in group A on the left-hand side, and in group B on the right-hand side, for different
values of .λ, with a low value at the top and a high value at the bottom (fair model, associated with
a demographic parity penalty criterion)
Fig. 11.6 On the left-hand side, accuracy–fairness trade-off plot (based on Table 11.1), with the
AUC of .mβλ on the y-axis and the fairness ratio (class balance) on the x-axis. Top left corresponds
βλ (x i , si ) (with a logistic regression) for
to accurate and fair. On the right-hand side, evolution of .m
three individuals in group A and three in B, in the toydata2 dataset
Table 11.2 Penalized logistic regression, for different values of .λ, including s on the left-hand
side, and excluding s on the right-hand side. At the top, values of .β̂_λ in the first block, and
predictions .m̂_βλ(x_i, s_i) and .m̂_βλ(x_i) for a series of individuals in the second block. At the bottom,
the class balance ratios, .E[m̂_βλ(Z)|S = B, Y = y]/E[m̂_βλ(Z)|S = A, Y = y] (fairness is achieved
when the ratio is 1), and AUC (the higher the value, the higher the accuracy of the model)

                              .m̂(x, s), aware
                              ← less fair                                more fair →
β0 (Intercept)                −2.55  −2.45  −2.34  −2.21  −2.27  −2.08  −1.44  −2.61
β1 (x1)                       0.89   0.90   0.92   0.94   0.88   0.82   0.52   0.39
β2 (x2)                       0.37   0.37   0.37   0.37   0.41   0.40   0.30   0.39
β3 (x3)                       0.02   0.03   0.03   0.03   0.01   0.00   0.04   −0.42
βB (1B)                       0.81   0.63   0.39   0.11   −0.03  −0.48  −0.60  0.40
Betty                         0.27   0.25   0.23   0.20   0.19   0.15   0.19   0.20
Brienne                       0.74   0.72   0.70   0.67   0.66   0.57   0.51   0.44
Beatrix                       0.95   0.95   0.95   0.94   0.94   0.91   0.81   0.71
Alex                          0.14   0.15   0.17   0.19   0.19   0.22   0.30   0.14
Ahmad                         0.55   0.58   0.61   0.65   0.66   0.68   0.65   0.33
Anthony                       0.90   0.91   0.93   0.94   0.94   0.94   0.89   0.60
E[m̂(x_i, s_i)|S = A, Y = 0]   0.19   0.20   0.21   0.23   0.26   0.29   0.36   0.33
E[m̂(x_i, s_i)|S = B, Y = 0]   0.46   0.44   0.42   0.39   0.38   0.32   0.33   0.33
(Ratio)                       2.47   2.25   2.00   1.74   1.47   1.10   0.91   1.00
E[m̂(x_i, s_i)|S = A, Y = 1]   0.37   0.38   0.40   0.43   0.48   0.52   0.55   0.54
E[m̂(x_i, s_i)|S = B, Y = 1]   0.78   0.76   0.75   0.73   0.72   0.66   0.59   0.54
(Ratio)                       2.12   2.00   1.86   1.72   1.50   1.26   1.09   1.00
(Global ratio)                2.47   2.25   2.00   1.74   1.50   1.26   1.09   1.00
AUC                           0.86   0.86   0.86   0.86   0.85   0.83   0.80   0.75
In Figs. 11.9 and 11.10, we can visualize .AUC(m̂_λ) against the demographic parity
ratio .E[m̂_βλ(Z)|S = B]/E[m̂_βλ(Z)|S = A], in Fig. 11.1 when .Z = (X, S) and
in Fig. 11.2 when .Z = X (unaware model), where, with general notations, the
prediction is obtained from a logistic regression,

.m̂_βλ(z) = exp[z^⊤β̂_λ] / (1 + exp[z^⊤β̂_λ]),

where .β̂_λ is a solution of a penalized maximum likelihood problem,

.β̂_λ ∈ argmin_β {−log L(β) + λ · |cov[m_β(x), 1_B(s)]|}.
Fig. 11.7 On the left-hand side, densities of m βλ (x i , si ) from individuals in group A and in B (thin
lines are densities of m β(x i , si )). On the right-hand side, the scatterplot of points ( mβ(x i , si =
βλ (x i ), s = A) and (
A), m mβ(x i , si = B), m βλ (x i ), s = B), where m β is the plain logistic
regression, and m βλ is the penalized logistic regression, from the toydata2 dataset, associated
with a class balance penalty criterion
Fig. 11.8 Optimal transport from the distribution of m β(x i , si ) to the distribution of m
βλ (x i , si ),
for individuals in group A on the left-hand side, and in group B on the right-hand side, for different
values of λ, with a low value at the top and a high value at the bottom (fair model, associated with
a class balance (equalized odds) penalty criteria)
Fig. 11.9 On the left-hand side, accuracy–fairness trade-off plot (based on Table 11.1), with the
AUC of m βλ on the y-axis and the fairness ratio (demographic parity) on the x-axis. Top left
corresponds to accurate and fair. On the right-hand side, evolution of mβλ (x i , si ) (with a logistic
regression) for three individuals in group A and three in B, in the toydata2 dataset
Fig. 11.10 Optimal transport between distributions of m βλ (x i , si ) from individuals in group A
and in B, for different values of λ (low value on the left-hand side and high value on the right-hand
side), associated with a demographic parity penalty criterion
In Figs. 11.13 and 11.14, we can visualize .AUC(m̂_λ) against the demographic parity
ratio .E[m̂_βλ(Z)|S = B]/E[m̂_βλ(Z)|S = A], in Fig. 11.1 when .Z = (X, S) and
in Fig. 11.2 when .Z = X (unaware model), where, with general notations, the
prediction is obtained from a logistic regression,

.m̂_βλ(z) = exp[z^⊤β̂_λ] / (1 + exp[z^⊤β̂_λ]),
Fig. 11.11 On the left-hand side, densities of .m̂_βλ(x_i, s_i) from individuals in group A and
in B (thin lines are densities of .m̂_β(x_i, s_i)). On the right-hand side, the scatterplot of points
.(m̂_β(x_i, s_i = A), m̂_βλ(x_i)), s = A, and .(m̂_β(x_i, s_i = B), m̂_βλ(x_i)), s = B, where .m̂_β is the plain
logistic regression, and .m̂_βλ is the penalized logistic regression, from the toydata2 dataset,
associated with a demographic parity penalty criterion
Fig. 11.13 On the left-hand side, accuracy–fairness trade-off plot (based on Table 11.1), with the
AUC of m βλ on the y-axis and the fairness ratio (class balance) on the x-axis. Top left corresponds
to accurate and fair. On the right-hand side, evolution of mβλ (x i , si ) (with a logistic regression) for
three individuals in group A and three in B, in the toydata2 dataset
Fig. 11.14 Optimal transport between distributions of m βλ (x i , si ) from individuals in group A
and in B, for different values of λ (low value on the left-hand side and high value on the right-hand
side), associated with a class balance penalty criterion
where .β̂_λ is a solution of a penalized maximum likelihood problem,

.β̂_λ ∈ argmin_β {−log L(β) + λ · |cov[m_β(x), 1_B(s)]|}.
Fig. 11.15 On the left-hand side, densities of .m̂_βλ(x_i, s_i) from individuals in group A and
in B (thin lines are densities of .m̂_β(x_i, s_i)). On the right-hand side, the scatterplot of points
.(m̂_β(x_i, s_i = A), m̂_βλ(x_i)), s = A, and .(m̂_β(x_i, s_i = B), m̂_βλ(x_i)), s = B, where .m̂_β is the plain
logistic regression, and .m̂_βλ is the penalized logistic regression, from the toydata2 dataset,
associated with a class balance penalty criterion
Chapter 12
Post-Processing

If weak demographic parity is not satisfied, in the sense that .E_{X|S=A}[m(X)] ≠
.E_{X|S=B}[m(X)], a simple technique to get a fair model is to consider

.m⋆(x, s) = (E[m(X, S)] / E[m(X, s)]) · m(x, s), for a policyholder in group s.
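A one-line R sketch of this proportional correction (assuming m is a vector of predictions and s the corresponding group labels) is the following,

# rescale predictions so that group averages match the overall average
# (weak demographic parity), as in the display above
rescale_dp <- function(m, s) {
  m * mean(m) / ave(m, s, FUN = mean)
}

# quick check on illustrative numbers
m <- c(.05, .10, .20, .05, .10, .20)
s <- c("A", "A", "A", "B", "B", "B")
tapply(rescale_dp(m, s), s, mean)   # equal group averages after rescaling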
.W = dP[S = s] / dP[S = s|X = x],

and therefore

.E[W] = ∫∫ dP[S = s] dP[X = x] = 1.
Similarly,

.E[SW] = ∫∫ s w dP[S = s, X = x] = ∫∫ s w dP[S = s|X = x] dP[X = x],

and

.E[SW] = ∫∫ s dP[S = s] dP[X = x] = ∫ E[S] dP[X = x] = E[S],

while

.E[XSW] = ∫ x E[S] dP[X = x] = E[X] E[S].
for some distance (or divergence) d, as in Nielsen and Boltz (2011). Those are
also called "centroids" associated with measures .P = {P1, · · · , Pn} and weights
.ω. Instead of theoretical measures .Pi, the idea of "averaging histograms" (or
empirical measures) has been considered in Nielsen and Nock (2009) using “gener-
alized Bregman centroid,” and in Nielsen (2013), who introduced the “generalized
Kullback–Leibler centroid,” based on Jeffreys divergence, introduced in Jeffreys
(1946), which corresponds to a symmetric divergence, extending Kullback–Leibler
divergence (see Definition 3.7).
An alternative (see Agueh and Carlier (2011) and Definition 3.11) is to use the
Wasserstein distance .W2. As shown in Santambrogio (2015), if one of the measures
.Pi is absolutely continuous, the minimization problem has a unique solution. In the
Gaussian case, with measures .N(μi, Σi), the Jeffreys–Kullback–Leibler centroid is

.N(μ∗, Σ∗), where μ∗ = ω1 μ1 + · · · + ωn μn and Σ∗ = ω1 Σ1 + · · · + ωn Σn,

whereas the Wasserstein barycenter is

.N(μ̄, Σ̄), where μ̄ = ω1 μ1 + · · · + ωn μn,

and where .Σ̄ satisfies

.Σ̄^{1/2} = ω1 Σ1^{1/2} + · · · + ωn Σn^{1/2}.
Fig. 12.1 Wasserstein barycenter and Jeffreys-Kullback–Leibler centroid of two Gaussian dis-
tributions on the left, and empirical estimate of the density of the Wasserstein barycenter and
Jeffreys–Kullback–Leibler centroid of two samples .x1 and .x2 (drawn from normal distributions)
In the univariate case, with two Gaussian measures, the difference is that in the
first case, the variance is the average of variances, whereas in the second case, the
standard deviation is the average of standard deviations,

.σ∗ = (ω1 σ1² + ω2 σ2²)^{1/2} : Jeffreys–Kullback–Leibler centroid
.σ̄ = ω1 σ1 + ω2 σ2 : Wasserstein barycenter.
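A minimal R sketch of these two aggregations, for two univariate Gaussian distributions with arbitrary (illustrative) parameters, is the following,

# two Gaussian distributions N(mu1, sd1^2) and N(mu2, sd2^2), equal weights
mu <- c(0, 3); sd <- c(1, 2); w <- c(.5, .5)

mu_bar  <- sum(w * mu)            # common mean for both aggregations
sd_jkl  <- sqrt(sum(w * sd^2))    # Jeffreys-Kullback-Leibler centroid: average of variances
sd_wass <- sum(w * sd)            # Wasserstein barycenter: average of standard deviations

curve(dnorm(x, mu_bar, sd_jkl), -4, 8, lty = 2, ylab = "density")
curve(dnorm(x, mu_bar, sd_wass), add = TRUE)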
In Fig. 12.1 we can visualize the barycenters of two Gaussian distributions, on the
left, and the empirical version with kernel density estimators on the right (based
on two Gaussian samples of sizes .n = 1000). More specifically, if .f1 is the kernel
density estimator of sample .x1, if .F̂1 is the integral of .f1, and if

.T(x) = ω1 x + ω2 F̂2^{−1}(F̂1(x)),
Fig. 12.2 Empirical Jeffreys–Kullback–Leibler centroid of two samples generated from beta
distributions on the left-hand side (based on histograms) and in the middle (based on the estimated
density of the beta kernel ), the Wasserstein barycenter on the right-hand side (based on the
estimated density of the beta kernel)
Fig. 12.3 Optimal transport for two samples drawn from two beta distributions (one skewed to
the left (on the left-hand side) and one to the right (on the right-hand side), on the x-axis) to the
barycenter (on the y-axis)
The transport, from .f1 or .f2 to .f (all three on the right-hand side of Fig. 12.2)
can be visualized in Fig. 12.3, on the left-hand side and on the right-hand side
respectively.
Given two scoring functions .m(x, s = A) and .m(x, s = B), it is possible, post-
processing, to construct a fair score .m using the approach we just described.
Definition 12.1 (Fair Barycenter Score) Given two scores .m(x, s = A) and
.m(x, s = B), the "fair barycenter score" is

.m⋆(x, s = A) = P[S = A] · m(x, s = A) + P[S = B] · F_B^{−1} ◦ F_A(m(x, s = A))
.m⋆(x, s = B) = P[S = A] · F_A^{−1} ◦ F_B(m(x, s = B)) + P[S = B] · m(x, s = B).
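A minimal R sketch of Definition 12.1, with empirical distribution and quantile functions, and purely illustrative scores (not those of toydata1), is the following,

set.seed(1)
nA <- 600; nB <- 400
mA <- plogis(rnorm(nA, -0.5, 1))   # scores m(x, s = A)
mB <- plogis(rnorm(nB,  0.5, 1))   # scores m(x, s = B)
pA <- nA / (nA + nB); pB <- 1 - pA

FA <- ecdf(mA); FB <- ecdf(mB)
QA <- function(p) quantile(mA, p, names = FALSE)
QB <- function(p) quantile(mB, p, names = FALSE)

# Definition 12.1: push each group towards the (weighted) barycenter
m_star_A <- pA * mA + pB * QB(FA(mA))
m_star_B <- pA * QA(FB(mB)) + pB * mB

# the two adjusted distributions now (almost) coincide
ks.test(m_star_A, m_star_B)$statistic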
probability of claiming a loss in motor insurance when s is the gender of the driver,
from the left to the right.
In Fig. 12.6, we can visualize the distribution of the "fair score", defined as the
barycenter of the distributions of scores .m(x_i, s_i = A) and .m(x_i, s_i = B) in
the two groups. On the left, the Wasserstein barycenter, and then two computations
of the Jeffreys–Kullback–Leibler centroid.
In Fig. 12.7, we have the optimal transport plot, with the optimal matching between .m(x)
and .m⋆(x) for individuals in group .s = A, on the left-hand side, and between .m(x)
and .m⋆(x) for individuals in group .s = B on the right-hand side, with unaware
scores on the toydata1 dataset.
In Fig. 12.8, we have the optimal transport plot, with matching between .m(x, s = A) and
.m⋆(x) for individuals in group .s = A, on the left-hand side, and between .m(x, s =
B) and .m⋆(x) for individuals in group .s = B on the right-hand side, with aware
scores on the toydata1 dataset (plain lines, thin lines are the unaware score from
Fig. 12.7).
In Table 12.1, we can visualize the predictions for six individuals, with an
aware model .m̂(x, s), an unaware model .m̂(x), and two barycenters, the Wasserstein
barycenter .m⋆_w(x) and the Jeffreys–Kullback–Leibler centroid, .m⋆_jkl(x).
In Fig. 12.9, we have distributions of the scores in the two groups, A and B, after
optimal transport to the barycenter, with the Jeffreys–Kullback–Leibler centroid at
the top and the Wasserstein barycenter at the bottom, with Definition 12.1. Given
Fig. 12.4 Matching between .m(x, s = A) and .m (x), at the top, and between .m(x, s = B) and .m (x), at the bottom, on the probability of claiming a loss in
motor insurance when s is the gender of the driver, from FrenchMotor
Fig. 12.5 Scatterplot of points .(m(x_i, s_i = A), m⋆(x_i, s = A)) and .(m(x_i, s_i = B), m⋆(x_i, s = B)),
with three models (GLM, GBM, and RF), on the probability of claiming a loss in motor insurance
when s is the gender of the driver
Fig. 12.6 Barycenter of the two distributions of scores .m(x_i, s_i = A) and .m(x_i, s_i = B). On the
left, the Wasserstein barycenter, and then two computations of the Jeffreys–Kullback–Leibler
centroid
two scores .m(x, s = A) and .m(x, s = B), the "fair barycenter score" being that of
Definition 12.1,

.m⋆(x, s = A) = P[S = A] · m(x, s = A) + P[S = B] · F_B^{−1} ◦ F_A(m(x, s = A))
.m⋆(x, s = B) = P[S = A] · F_A^{−1} ◦ F_B(m(x, s = B)) + P[S = B] · m(x, s = B).
Fig. 12.7 Matching between .m(x) and .m (x) for individuals in group .s = A, on the left-hand
side, and between .m(x) and .m (x) for individuals in group .s = B on the right-hand side, with
unaware scores on the toydata1 dataset
Fig. 12.8 Matching between .m(x, s = A) and .m (x) for individuals in group .s = A, on the left-
hand side, and between .m(x, s = B) and .m (x) for individuals in group .s = B on the right-hand
side, with aware scores on the toydata1 dataset (plain lines, thin lines are the unaware score
from Fig. 12.7)
In the entire dataset, we have 64% men (7973) and 36% women (4464) registered
as “main driver.” Overall, if we consider “weak demographic parity,” .8.2% of
women claim a loss, as opposed to .8.9% of men. In Table 12.2, we can visualize
“gender-neutral” predictions, derived from the logistic regression (GLM), a boost-
Table 12.1 Individual predictions for six fictitious individuals. With two models (aware and
unaware) and two barycenters, on toydata1

           x    s   y      .m̂(x, s)  .m̂(x)   .m⋆_w(x)  .m⋆_jkl(x)
Alex       −1   A   0.475  0.250      0.219   0.154     0.094
Betty      −1   B   0.475  0.205      0.219   0.459     0.357
Ahmad      0    A   0.475  0.490      0.465   0.341     0.279
Brienne    0    B   0.475  0.426      0.465   0.719     0.692
Anthony    +1   A   0.475  0.734      0.730   0.571     0.521
Beatrix    +1   B   0.475  0.681      0.730   0.842     0.932
Fig. 12.9 Distributions of the scores in the two groups, A and B, after optimal transport to the
barycenter, with Jeffreys–Kullback–Leibler centroid at the top and the Wasserstein barycenter at
the bottom
ing algorithm (GBM), and a random forest (RF). The first column corresponds to
the proportional approach discussed in Sect. 12.1.
In Table 12.2, we have, for the two groups, the global correction discussed in
Sect. 12.2, with .−6% for the men (.×0.94, group A) and .+11% for the women
(.×1.11, group B). As on average, women have fewer accidents than men, they need
Table 12.2 "Gender-free" prediction if the initial prediction was 5% (at the top), 10% (in the
middle), and 20% (at the bottom). The first approach is the simple "benchmark" based on .P[Y =
1]/P[Y = 1|S = s], and then three models are considered, GLM, GBM, and RF

               A (men)                              B (women)
               ×0.94    GLM     GBM     RF          ×1.11    GLM     GBM     RF
.m(x) = 5%     4.73%    4.94%   4.80%   4.42%       5.56%    5.16%   5.25%   6.15%
.m(x) = 10%    9.46%    9.83%   9.66%   8.92%       11.12%   10.38%  10.49%  12.80%
.m(x) = 20%    18.91%   19.50%  18.68%  18.26%      22.25%   20.77%  21.63%  21.12%
bottom, respectively with .s = 1(age > 65) (discrimination against old people) and
.s = 1(age < 30) (discrimination against young people).
Table 12.3 "Age-free" prediction (against old drivers) if the initial prediction was 5% (at the top),
10% (in the middle), and 20% (at the bottom)

               A (younger < 65)                     B (old > 65)
               ×1.01    GLM     GBM     RF          ×0.94    GLM     GBM     RF
.m(x) = 5%     5.05%    5.17%   5.10%   5.27%       4.71%    3.84%   3.84%   3.96%
.m(x) = 10%    10.09%   10.37%  10.16%  11.00%      9.42%    7.81%   9.10%   6.88%
.m(x) = 20%    20.19%   19.98%  19.65%  21.26%      18.85%   19.78%  23.79%  12.54%
Table 12.4 "Age-free" prediction (against young drivers) if the initial prediction was 5% (at the
top), 10% (in the middle), and 20% (at the bottom)

               A (young < 25)                       B (older > 25)
               ×0.74    GLM     GBM     RF          ×1.06    GLM     GBM     RF
.m(x) = 5%     3.71%    3.61%   4.45%   2.41%       5.29%    5.29%   5.14%   6.05%
.m(x) = 10%    7.42%    7.89%   8.69%   5.17%       10.59%   10.29%  10.19%  11.95%
.m(x) = 20%    14.84%   21.82%  18.09%  9.93%       21.17%   19.87%  20.33%  21.29%
Fig. 12.10 Matching between .m(x, s = A) and .m (x, s = A), at the top, and between .m(x, s =
B) and .m (x, s = B), at the bottom, based on the probability of claiming a loss in motor insurance
when s is the indicator that the driver is “old” .1(age > 65)
probability of claiming a loss in motor insurance when s is the indicator that the
driver is “old” .1(age > 65) (more than 65 years old).
In Fig. 12.12, we can visualize the optimal transport plot, with the matching
between .m(x, s = A) and .m (x, s = A), at the top, and between .m(x, s = B)
and .m (x, s = B), at the bottom, based on the probability of claiming a loss in
motor insurance when s is the indicator that the driver is “young” .1(age < 30)
(less than 30 years old).
In Fig. 12.13, we have the scatterplot of points .(m(x i , si = A), m( x i )) and
.(m(x i , si = B), m( x i )), with three models (GLM, GBM, and RF), based on the
probability of claiming a loss in motor insurance when s is the indicator that the
driver is “young” .1(age < 30).
Fig. 12.11 Scatterplot of points .(m(x i , si = A), m( x i )) and .(m(x i , si = B), m( x i )), with three
models (GLM, GBM, and RF), based on the probability of claiming a loss in motor insurance when
s is the indicator that the driver is “old” .1(age > 65)
Fig. 12.12 Matching between .m(x, s = A) and .m (x, s = A), at the top, and between .m(x, s =
B) and .m (x, s = B), at the bottom, based on the probability of claiming a loss in motor insurance
when s is the indicator that the driver is “young” .1(age < 30)
12.6 Penalized Bagging
Fig. 12.13 Scatterplot of points .(m(x i , si = A), m( x i )) and .(m(x i , si = B), m( x i )), with three
models (GLM, GBM, and RF), based on the probability of claiming a loss in motor insurance when
s is the indicator that the driver is “young” .1(age < 30)
Let .M_ω(x) = Σ_{j=1}^k ω_j m_j(x) = ω^⊤m(x). For example, with random forests, k is
large, and .ω_j = 1/k. But we can consider an ensemble approach, on a few models.
Recall that "demographic parity" (Sect. 8.2) is achieved if

.E[Ŷ | S = B] = E[Ŷ | S = A] = E[Ŷ],

meaning here

.E[ω^⊤m(X) | S = A] = E[ω^⊤m(X) | S = B].
where the empirical risk (associated with accuracy) could be, if the risk is associated
with loss .ℓ,

.R(ω) = (1/n) Σ_{i=1}^n ℓ(ω^⊤m(x_i), y_i),

and where .ℛ(ω) is some fairness criterion as discussed in Sect. 11.1. Following
Friedler et al. (2019), an alternative could be to consider the .α-disparate impact,

.R_α = E[|Ŷ|^α | S = A] / E[|Ŷ|^α | S = B], for α > 0.
As in Sect. 11.1, we want to find some weights that give a trade-off between
fairness and accuracy. The only difference is that in Chap. 11, "in-processing," we
were still training the model. Here, we already have a collection of models, we
just want to consider a weighted average of the models. Given a sample .{(y_i, x_i)},
consider the following penalized problem, where the fairness criteria are related to
.α-"demographic parity," and accuracy is characterized by some loss .ℓ,

.min_{ω∈S_k} { |(1/n_A) Σ_{i:S_i=A} ω^⊤m(x_i) − (1/n) Σ_i ω^⊤m(x_i)|^α + (λ/n) Σ_i ℓ(ω^⊤m(x_i), y_i) },
The solution is then

.ω_dp = Q^{−1}1 / (1^⊤Q^{−1}1), where Q = AA^⊤ + λB,

where

.A = (1/n_A) Σ_{i:S_i=A} m(x_i) − (1/n) Σ_i m(x_i)

and

.B = (1/n) Σ_{i=1}^n (m(x_i) − y_i 1)(m(x_i) − y_i 1)^⊤.
The tuning parameter .λ is positive, and selecting .λ > 0 yields a unique solution.
For .β-"equalized odds," the optimization problem is

.min_{ω∈S_k} { (1/n) Σ_{i=1}^n |ê^{(A)}(y_i) − ê(y_i)|^α + (λ/n) Σ_i ℓ(ω^⊤m(x_i), y_i) },

where

.ê(y) ∝ Σ_{i=1}^n ω^⊤m(x_i) K_h(y_i − y) = ω^⊤v_h(y),

and

.ê^{(A)}(y) ∝ Σ_{i:s_i=A} ω^⊤m(x_i) K_h(y_i − y) = ω^⊤v_h^{(A)}(y),

for some kernel .K_h. The solution is then

.ω_eo = Q^{−1}1 / (1^⊤Q^{−1}1), where Q = (1/n) Σ_{i=1}^n γ_i γ_i^⊤ + λB,

where

.γ_i = v_h^{(A)}(y_i) − v_h(y_i)

and

.B = (1/n) Σ_{i=1}^n (m(x_i) − y_i 1)(m(x_i) − y_i 1)^⊤.
We should keep in mind here that we can always solve this problem numerically.
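For instance, a minimal R sketch of the demographic parity weights is given below, with a handful of illustrative base models; the closed-form expression above is used, and the simplex constraint is handled only through a final normalization (a sketch, not a full constrained solver),

set.seed(1)
n <- 1000
s <- sample(c("A", "B"), n, replace = TRUE)
x <- rnorm(n, mean = ifelse(s == "B", 1, -1))
y <- rbinom(n, 1, plogis(x))

# k = 3 base models (columns of M): two logistic fits and a constant one
M <- cbind(m1 = fitted(glm(y ~ x, family = binomial)),
           m2 = fitted(glm(y ~ I(x^2), family = binomial)),
           m3 = rep(mean(y), n))

lambda <- 1
A <- colMeans(M[s == "A", ]) - colMeans(M)   # k-vector, group A average minus overall average
B <- crossprod(M - y) / n                    # (1/n) sum_i (m(x_i) - y_i 1)(m(x_i) - y_i 1)^T
Q <- tcrossprod(A) + lambda * B
w <- solve(Q, rep(1, ncol(M)))
w <- w / sum(w)                              # normalized weights (no positivity constraint enforced)

round(w, 3)
c(groupA = mean((M %*% w)[s == "A"]), groupB = mean((M %*% w)[s == "B"]))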
References
Aas K, Jullum M, Løland A (2021) Explaining individual predictions when features are dependent:
More accurate approximations to Shapley values. Artif Intell 298:103502
Abraham K (1986) Distributing risk: Insurance, legal theory and public policy. Yale University
Press, Yale
Abrams M (2014) The origins of personal data and its implications for governance. SSRN 2510927
Achenwall G (1749) Abriß der neuesten Staatswissenschaft der vornehmsten Europäischen Reiche
und Republicken zum Gebrauch in seinen Academischen Vorlesungen. Schmidt
Adams SJ (2004) Age discrimination legislation and the employment of older workers. Labour
Econ 11(2):219–241
Agarwal A, Beygelzimer A, Dudík M, Langford J, Wallach H (2018) A reductions approach to fair
classification. In: Dy J, Krause A (eds) International Conference on Machine Learning, Pro-
ceedings of Machine Learning Research, Stockholmsmässan, Stockholm Sweden, Proceedings
of Machine Learning Research, vol 80, pp 60–69
Agrawal T (2013) Are there glass-ceiling and sticky-floor effects in India? An empirical examination. Oxford Dev Stud 41(3):322–342
Agresti A (2012) Categorical data analysis. Wiley, New York
Agresti A (2015) Foundations of linear and generalized linear models. Wiley, New York
Agueh M, Carlier G (2011) Barycenters in the Wasserstein space. SIAM J Math Anal 43(2):904–
924
Ahima RS, Lazar MA (2013) The health risk of obesity–better metrics imperative. Science
341(6148):856–858
Ahmed AM (2010) What is in a surname? the role of ethnicity in economic decision making. Appl
Econ 42(21):2715–2723
Aigner DJ, Cain GG (1977) Statistical theories of discrimination in labor markets. Ind Labor Relat
Rev 30(2):175–187
Ajunwa I (2014) Genetic testing meets big data: Tort and contract law issues. Ohio State Law J
75:1225
Ajunwa I (2016) Genetic data and civil rights. Harvard Civil Rights-Civil Liberties Law Rev 51:75
Akerlof GA (1970) The market for “lemons”: Quality uncertainty and the market mechanism. Q J
Econ 84(3):488–500
Al Ramiah A, Hewstone M, Dovidio JF, Penner LA (2010) The social psychology of discrimi-
nation: Theory, measurement and consequences. In: Making equality count, pp 84–112. The
Liffey Press, Dublin
Alexander L (1992) What makes wrongful discrimination wrong? biases, preferences, stereotypes,
and proxies. Univ Pennsylvania Law Rev 141(1):149–219
Alexander W (1924) Insurance fables for life underwriters. The Spectator Company, London
Alipourfard N, Fennell PG, Lerman K (2018) Can you trust the trend? discovering simpson’s
paradoxes in social data. In: Proceedings of the Eleventh ACM International Conference on
Web Search and Data Mining, pp 19–27
Allen CG (1975) Plato on women. Feminist Stud 2(2):131
Allerhand L, Youngmann B, Yom-Tov E, Arkadir D (2018) Detecting Parkinson’s disease from
interactions with a search engine: Is expert knowledge sufficient? In: Proceedings of the 27th
ACM International Conference on Information and Knowledge Management, pp 1539–1542
Altman A (2011) Discrimination. Stanford Encyclopedia of Philosophy
Altman N, Krzywinski M (2015) Association, correlation and causation. Nature Methods
12(10):899–900
Alvarez-Esteban PC, del Barrio E, Cuesta-Albertos JA, Matrán C (2018) Wide consensus aggrega-
tion in the Wasserstein space. application to location-scatter families. Bernoulli 24:3147–3179
Amadieu JF (2008) Vraies et fausses solutions aux discriminations. Formation emploi Revue
française de sciences sociales 101:89–104
Amari SI (1982) Differential geometry of curved exponential families-curvatures and information
loss. Ann Stat 10(2):357–385
American Academy of Actuaries (2011) Market consistent embedded value. Life Financial
Reporting Committee
Amnesty International (2023) Discrimination. https://www.amnesty.org/en/what-we-do/
discrimination/
Amossé T, De Peretti G (2011) Hommes et femmes en ménage statistique: une valse à trois temps.
Travail, genre et sociétés 2:23–46
Anderson TH (2004) The pursuit of fairness: A history of affirmative action. Oxford University
Press, Oxford
de Andrade N (2012) Oblivion: The right to be different from oneself-reproposing the right to be
forgotten. In: Cerrillo Martínez A, Peguera Poch M, Peña López I, Vilasau Solana M (eds) VII
international conference on internet, law & politics. Net neutrality and other challenges for the
future of the Internet, IDP. Revista de Internet, Derecho y Política, 13, pp 122–137
Angrist JD, Pischke JS (2009) Mostly harmless econometrics: An empiricist’s companion.
Princeton University Press, Princeton
Anguraj K, Padma S (2012) Analysis of facial paralysis disease using image processing technique.
Int J Comput Appl 54(11). https://doi.org/10.5120/8607-2455
Angwin J, Larson J, Mattu S, Kirchner L (2016) Machine bias: There’s software used across the
country to predict future criminals and it’s biased against blacks. ProPublica May 23
Antoniak M, Mimno D (2021) Bad seeds: Evaluating lexical methods for bias measurement. In:
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics
and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long
Papers), pp 1889–1904
Antonio K, Beirlant J (2007) Actuarial statistics with generalized linear mixed models. Insurance
Math Econ 40(1):58–76
Apfelbaum EP, Pauker K, Sommers SR, Ambady N (2010) In blind pursuit of racial equality?
Psychol Sci 21(11):1587–1592
Apley DW, Zhu J (2020) Visualizing the effects of predictor variables in black box supervised
learning models. J Roy Stat Soc B (Stat Methodol) 82(4):1059–1086
Aran XF, Such JM, Criado N (2019) Attesting biases and discrimination using language semantics.
arXiv 1909.04386
Agüera y Arcas B, Todorov A, Mitchell M (2018) Do algorithms reveal sexual orientation or just
expose our stereotypes. Medium January 11
Armstrong JS (1985) Long-range Forecasting: from crystal ball to computer. Wiley, New York
Arneson RJ (1999) Egalitarianism and responsibility. J Ethics 3:225–247
Arneson RJ (2007) Desert and equality. In: Egalitarianism: New essays on the nature and value of
equality, pp 262–293. Oxford University Press, Oxford
Arneson RJ (2013) Discrimination, disparate impact, and theories of justice. In: Hellman D,
Moreau S (eds) Philosophical foundations of discrimination law, vol 87, p 105. Oxford
University Press, Oxford
Arrow KJ (1963) Uncertainty and the welfare economics of medical care. Am Econ Rev 53:941–
973
Arrow KJ (1973) The theory of discrimination. In: Ashenfelter O, Rees A (eds) Discrimination in
labor markets. Princeton University Press, Princeton
Artıs M, Ayuso M, Guillen M (1999) Modelling different types of automobile insurance fraud
behaviour in the spanish market. Insurance Math Econ 24(1–2):67–81
Artís M, Ayuso M, Guillén M (2002) Detection of automobile insurance fraud with discrete choice
models and misclassified claims. J Risk Insurance 69(3):325–340
Ashenfelter O, Oaxaca R (1987) The economics of discrimination: Economists enter the court-
room. Am Econ Rev 77(2):321–325
Ashley F (2018) Man who changed legal gender to get cheaper insurance exposes the unreliability
of gender markers. CBC (Canadian Broadcasting Corporation) - Radio Canada July 28
Atkin A (2012) The philosophy of race. Acumen
Ausloos J (2020) The right to erasure in EU data protection law. Oxford University Press, Oxford
Austin PC, Steyerberg EW (2012) Interpreting the concordance statistic of a logistic regression
model: relation to the variance and odds ratio of a continuous explanatory variable. BMC Med
Res Methodol 12:1–8
Austin R (1983) The insurance classification controversy. Univ Pennsylvania Law Rev 131(3):517–
583
Automobile Insurance Rate Board (2022) Technical guidance: Change in rates and rating pro-
grams. Albera AIRB
Autor D (2003) Lecture note: the economics of discrimination-theory. Graduate Labor Economics,
Massachusetts Institute of Technology, pp 1–18
Avery RB, Calem PS, Canner GB (2004) Consumer credit scoring: do situational circumstances
matter? J Bank Finance 28(4):835–856
Avin C, Shpitser I, Pearl J (2005) Identifiability of path-specific effects. In: IJCAI International
Joint Conference on Artificial Intelligence, pp 357–363
Avraham R (2017) Discrimination and insurance. In: Lippert-Rasmussen K (ed) Handbook of the
Ethics of Discrimination, Routledge, pp 335–347
Avraham R, Logue KD, Schwarcz D (2013) Understanding insurance antidiscrimination law. South
California Law Rev 87:195
Avraham R, Logue KD, Schwarcz D (2014) Towards a universal framework for insurance anti-
discrimination laws. Connecticut Insurance Law J 21:1
Ayalon L, Tesch-Römer C (2018) Introduction to the section: Ageism–concept and origins.
Contemporary perspectives on ageism, pp 1–10
Ayer AJ (1972) Probability and evidence. Columbia University Press, New York
Azen R, Budescu DV (2003) The dominance analysis approach for comparing predictors in
multiple regression. Psychol Methods 8(2):129
Bachelard G (1927) Essai sur la connaissance approchée. Vrin
Backer DC (2017) Risk profiling in the auto insurance industry. Gracey-Backer, Inc Blog March
14
Baer BR, Gilbert DE, Wells MT (2019) Fairness criteria through the lens of directed acyclic
graphical models. arXiv 1906.11333
Bagdasaryan E, Poursaeed O, Shmatikov V (2019) Differential privacy has disparate impact on
model accuracy. Adv Neural Inf Process Syst 32:15479–15488
Bailey RA, Simon LJ (1959) An actuarial note on the credibility of experience of a single private
passenger car. Proc Casualty Actuarial Soc XLVI:159
Bailey RA, Simon LJ (1960) Two studies in automobile insurance ratemaking. ASTIN Bull J IAA
1(4):192–217
Baird IM (1994) Obesity and insurance risk. Pharmacoeconomics 5(1):62–65
Baker T (2011) Health insurance, risk, and responsibility after the patient protection and affordable
care act. University of Pennsylvania Law Review 1577–1622
Baker T, McElrath K (1997) Insurance claims discrimination. In: Insurance redlining: Disinvest-
ment, reinvestment, and the evolving role of financial institutions, pp 141–156. The Urban
Institute Press, Washington, DC
Baker T, Simon J (2002) Embracing risk: the changing culture of insurance and responsibility.
University Chicago Press, Chicago
Baldi P, Brunak S, Chauvin Y, Andersen CA, Nielsen H (2000) Assessing the accuracy of
prediction algorithms for classification: an overview. Bioinformatics 16(5):412–424
Ban GY, Keskin NB (2021) Personalized dynamic pricing with machine learning: High-
dimensional features and heterogeneous elasticity. Manag Sci 67(9):5549–5568
Banerjee A, Guo X, Wang H (2005) On the optimality of conditional expectation as a bregman
predictor. IEEE Trans Inf Theory 51(7):2664–2669
Banham R (2015) Price optimization or price discrimination? regulators weigh in. Carrier
Management May 17
Barbosa JJR (2019) The business opportunities of implementing wearable based products in the
health and life insurance industries. PhD thesis, Universidade Católica Portuguesa
Barbour V (1911) Privateers and pirates of the west indies. Am Hist Rev 16(3):529–566
Barocas S, Selbst AD (2016) Big data’s disparate impact. California Law Rev 104:671–732
Barocas S, Hardt M, Narayanan A (2017) Fairness in machine learning. Nips Tutor 1:2017
Barocas S, Hardt M, Narayanan A (2019) Fairness and machine learning. fairmlbook.org
Barry L (2020a) Insurance, big data and changing conceptions of fairness. Eur J Sociol 61:159–184
Barry L (2020b) L’invention du risque catastrophes naturelles. Chaire PARI, Document de Travail
18
Barry L, Charpentier A (2020) Personalization as a promise: Can big data change the practice of
insurance? Big Data Soc 7(1):2053951720935143
Bartik A, Nelson S (2016) Deleting a signal: Evidence from pre-employment credit checks. SSRN
2759560
Bartlett R, Morse A, Stanton R, Wallace N (2018) Consumer-lending discrimination in the era of
fintech. University of California, Berkeley, Working Paper
Bartlett R, Morse A, Stanton R, Wallace N (2021) Consumer-lending discrimination in the fintech
era. J Financ Econ 140:30–56
Bath C, Edgar K (2010) Time is money: Financial responsibility after prison. Prison Reform Trust,
London
Baumann J, Loi M (2023) Fairness and risk: An ethical argument for a group fairness definition
insurers can use. Philos Technol 36(3):45
Bayer PB (1986) Mutable characteristics and the definition of discrimination under title vii. UC
Davis Law Rev 20:769
Bayes T (1763) An essay towards solving a problem in the doctrine of chances. Philos Trans Roy
Soc Lond (53):370–418
Becker GS (1957) The economics of discrimination. University of Chicago Press, Chicago
Beckett L (2014) Everything we know about what data brokers know about you. ProPublica June
13
Beider P (1987) Sex discrimination in insurance. J Appl Philos 4:65–75
Belhadji EB, Dionne G, Tarkhani F (2000) A model for the detection of insurance fraud. Geneva
papers on risk and insurance issues and practice, pp 517–538
Bélisle-Pipon JC, Vayena E, Green RC, Cohen IG (2019) Genetic testing, insurance discrimination
and medical research: what the united states can learn from peer countries. Nature Med
25(8):1198–1204
Bell ET (1945) The development of mathematics. Courier Corporation, Chelmsford
Bender M, Dill C, Hurlbert M, Lindberg C, Mott S (2022) Understanding potential influences of
racial bias on p&c insurance: Four rating factors explored. CAS research paper series on race
and insurance pricing
Beniger J (2009) The control revolution: Technological and economic origins of the information
society. Harvard University Press, Harvard
Benjamin B, Michaelson R (1988) Mortality differences between smokers and non-smokers. J Inst
Actuaries 115(3):519–525
Bennett M (1978) Models in motor insurance. J Staple Inn Actuarial Soc 22:134–160
Bergstrom CT, West JD (2021) Calling bullshit: the art of skepticism in a data-driven world.
Random House Trade Paperbacks
Berk R, Heidari H, Jabbari S, Joseph M, Kearns M, Morgenstern J, Neel S, Roth A (2017) A
convex framework for fair regression. arXiv 1706.02409
Berk R, Heidari H, Jabbari S, Kearns M, Roth A (2021a) Fairness in criminal justice risk
assessments: The state of the art. Sociol Methods Res 50(1):3–44
Berk RA, Kuchibhotla AK, Tchetgen ET (2021b) Improving fairness in criminal justice algorith-
mic risk assessments using optimal transport and conformal prediction sets. arXiv 2111.09211
Berkson J (1944) Application of the logistic function to bio-assay. J Am Stat Assoc 39(227):357–
365
Bernard DS, Farr SL, Fang Z (2011) National estimates of out-of-pocket health care expenditure
burdens among nonelderly adults with cancer: 2001 to 2008. J Clin Oncol 29(20):2821
Bernoulli J (1713) Ars conjectandi: opus posthumum: accedit Tractatus de seriebus infinitis; et
Epistola gallice scripta de ludo pilae reticularis. Impensis Thurnisiorum
Bernstein A (2013) What’s wrong with stereotyping. Arizona Law Rev 55:655
Bernstein E (2007) Temporarily yours: intimacy, authenticity, and the commerce of sex. University
of Chicago Press, Chicago
Bertillon A, Chervin A (1909) Anthropologie métrique: conseils pratiques aux missionnaires
scientifiques sur la manière de mesurer, de photographier et de décrire des sujets vivants et
des pièces anatomiques. Imprimerie nationale. Paris, France
Bertrand M, Duflo E (2017) Field experiments on discrimination. Handbook Econ Field Exp
1:309–393
Bertrand M, Mullainathan S (2004) Are emily and greg more employable than lakisha and jamal?
a field experiment on labor market discrimination. Am Econ Rev 94(4):991–1013
Besnard P, Grange C (1993) La fin de la diffusion verticale des gouts?(prénoms de l’élite et du
vulgum). L’Année sociologique, pp 269–294
Besse P, del Barrio E, Gordaliza P, Loubes JM (2018) Confidence intervals for testing disparate
impact in fair learning. arXiv 1807.06362
Beutel A, Chen J, Doshi T, Qian H, Wei L, Wu Y, Heldt L, Zhao Z, Hong L, Chi EH, et al.
(2019) Fairness in recommendation ranking through pairwise comparisons. In: Proceedings of
the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining,
pp 2212–2220
Bhattacharya A (2015) Facebook patent: Your friends could help you get a loan - or not. CNN
Business 2015/08/04
Bickel PJ, Hammel EA, O’Connell JW (1975) Sex bias in graduate admissions: Data from
berkeley. Science 187(4175):398–404
Bidadanure J (2017) Discrimination and age. In: Lippert-Rasmussen K (ed) Handbook of the ethics
of discrimination, Routledge, pp 243–253
Biddle D (2017) Adverse impact and test validation: A practitioner’s guide to valid and defensible
employment testing. Routledge
Biecek P, Burzykowski T (2021) Explanatory model analysis: explore, explain, and examine
predictive models. CRC Press, Boca Raton
Bielby WT, Baron JN (1986) Men and women at work: Sex segregation and statistical discrimina-
tion. Am J Sociol 91(4):759–799
Biemer PP, Christ SL (2012) Weighting survey data. In: International handbook of survey
methodology, Routledge, pp 317–341
Bigot R, Cayol A (2020) Le droit des assurances en tableaux. Ellipses
Boyd D, Levy K, Marwick A (2014) The networked nature of algorithmic discrimination. Data
and Discrimination: Collected Essays Open Technology Institute
Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press, Cambridge
Brams SJ, Brams SJ, Taylor AD (1996) Fair division: from cake-cutting to dispute resolution.
Cambridge University Press, Cambridge
Brant S (1494) Das Narrenschiff. von Jakob Locher
Breiman L (1995) Better subset regression using the nonnegative garrote. Technometrics
37(4):373–384
Breiman L (1996a) Bagging predictors. Mach Learn 24:123–140
Breiman L (1996b) Bias, variance, and arcing classifiers. Tech. rep., University of California,
Berkeley
Breiman L (1996c) Stacked regressions. Mach Learn 24:49–64
Breiman L (2001) Random forests. Machine learning 45:5–32
Breiman L, Stone C (1977) Parsimonious binary classification trees. Tech. rep., Technology Service Corporation, Santa Monica, CA
Breiman L, Friedman J, Olshen R, Stone C (1984) Classification and regression trees. Taylor &
Francis, London
Brenier Y (1991) Polar factorization and monotone rearrangement of vector-valued functions.
Commun Pure Appl Math 44(4):375–417
Brilmayer L, Hekeler RW, Laycock D, Sullivan TA (1979) Sex discrimination in employer-
sponsored insurance plans: A legal and demographic analysis. Univ Chicago Law Rev 47:505
Brilmayer L, Laycock D, Sullivan TA (1983) The efficient use of group averages as nondiscrimi-
nation: A rejoinder to professor benston. Univ Chicago Law Rev 50(1):222–249
Bröcker J (2009) Reliability, sufficiency, and the decomposition of proper scores. Q J Roy Meteorol
Soc J Atmos Sci Appl Meteorol Phys Oceanogr 135(643):1512–1519
Brockett PL, Golden LL (2007) Biological and psychobehavioral correlates of credit scores and
automobile insurance losses: Toward an explication of why credit scoring works. J Risk
Insurance 74(1):23–63
Brockett PL, Xia X, Derrig RA (1998) Using kohonen’s self-organizing feature map to uncover
automobile bodily injury claims fraud. J Risk Insurance, 245–274
Brosnan SF (2006) Nonhuman species’ reactions to inequity and their implications for fairness.
Social Justice Res 19(2):153–185
Brown RL, Charters D, Gunz S, Haddow N (2007) Colliding interests–age as an automobile
insurance rating variable: Equitable rate-making or unfair discrimination? J Bus Ethics
72(2):103–114
Brown RS, Moon M, Zoloth BS (1980) Incorporating occupational attainment in studies of male-
female earnings differentials. J Human Res, 3–28
Browne S (2015) Dark matters: On the surveillance of blackness. Duke University Press, Durham
Brownstein M, Saul J (2016a) Implicit bias and philosophy, volume 1: Metaphysics and epistemol-
ogy. Oxford University Press, Oxford
Brownstein M, Saul J (2016b) Implicit bias and philosophy, volume 2: Moral responsibility,
structural injustice, and ethics. Oxford University Press, Oxford
Brualdi RA (2006) Combinatorial matrix classes, vol 13. Cambridge University Press, Cambridge
Brubaker R (2015) Grounds for difference. Harvard University Press, Harvard
Brudno B (1976) Poverty, inequality, and the law. West Publishing Company, Eagan
Bruner JS (1957) Going beyond the information given. In: Bruner J, Brunswik E, Festinger L,
Heider F, Muenzinger K, Osgood C, Rapaport D (eds) Contemporary approaches to cognition,
pp 119–160. Harvard University Press, Harvard
Brunet G, Bideau A (2000) Surnames: history of the family and history of populations. Hist Family
5(2):153–160
Buchanan R, Priest C (2006) Deductible. Encyclopedia of Actuarial Science
Budd LP, Moorthi RA, Botha H, Wicks AC, Mead J (2021) Automated hiring at Amazon.
Universiteit van Amsterdam E-0470
Bugbee M, Matthews B, Callanan S, Ewert J, Guven S, Boison L, Liao C (2014) Price optimization
overview. Casualty Actuarial Society
Bühlmann H, Gisler A (2005) A course in credibility theory and its applications, vol 317. Springer,
New York
Buntine WL, Weigend AS (1991) Bayesian back-propagation. Complex Syst 5:603–643
Buolamwini J, Gebru T (2018) Gender shades: Intersectional accuracy disparities in commercial
gender classification. In: Conference on Fairness, Accountability and Transparency, Proceed-
ings of Machine Learning Research, pp 77–91
Burgdorf MP, Burgdorf Jr R (1974) A history of unequal treatment: The qualifications of
handicapped persons as a suspect class under the equal protection clause. Santa Clara Lawyer
15:855
Butler P, Butler T (1989) Driver record: A political red herring that reveals the basic flaw in
automobile insurance pricing. J Insurance Regulat 8(2):200–234
Butler RN (1969) Age-ism: Another form of bigotry. Gerontologist 9(4_Part_1):243–246
Cain GG (1986) The economic analysis of labor market discrimination: A survey. Handbook Labor
Econ 1:693–785
Calders T, Jaroszewicz S (2007) Efficient auc optimization for classification. In: Knowledge
Discovery in Databases: PKDD 2007: 11th European Conference on Principles and Practice
of Knowledge Discovery in Databases, Warsaw, Poland, September 17–21, 2007. Proceedings
11, pp 42–53. Springer, New York
Calders T, Verwer S (2010) Three naive Bayes approaches for discrimination-free classification.
Data Mining Knowl Discovery 21(2):277–292
Calders T, Žliobaite I (2013) Why unbiased computational processes can lead to discriminative
decision procedures. In: Discrimination and privacy in the information society, pp 43–57.
Springer, New York
Calisher CH (2007) Taxonomy: what’s in a name? Doesn’t a rose by any other name smell as sweet? Croatian Med J 48(2):268
Callahan A (2021) Is bmi a scam. The New York Times May 18th
Calmon FP, Wei D, Ramamurthy KN, Varshney KR (2017) Optimized data pre-processing for
discrimination prevention. arXiv 1704.03354
Cameron J (2004) Calibration - i. Encyclopedia of Statistical Sciences, 2nd edn.
Campbell M (1986) An integrated system for estimating the risk premium of individual car models
in motor insurance. ASTIN Bull J IAA 16(2):165–183
Candille G, Talagrand O (2005) Evaluation of probabilistic prediction systems for a scalar variable.
Q J Roy Meteor Soc J Atmos Sci Appl Meteo Phys Oceanogr 131(609):2131–2150
Cantwell GT, Kirkley A, Newman MEJ (2021) The friendship paradox in real and model networks.
J Complex Netw 9(2):cnab011
Cao D, Chen C, Piccirilli M, Adjeroh D, Bourlai T, Ross A (2011) Can facial metrology predict
gender? In: 2011 International Joint Conference on Biometrics (IJCB), pp 1–8. IEEE
Cardano G (1564) Liber de ludo aleae. Franco Angeli
Cardon D (2019) Culture numérique. Presses de Sciences Po
Carey AN, Wu X (2022) The causal fairness field guide: Perspectives from social and formal
sciences. Front Big Data 5. https://doi.org/10.3389/fdata.2022.892837
Carlier G, Chernozhukov V, Galichon A (2016) Vector quantile regression: an optimal transport
approach. Ann Stat 44:1165–1192
Carnis L, Lassarre S (2019) Politique et management de la sécurité routière. In: Laurent C,
Catherine G, Marie-Line G (eds) La sécurité routière en France, Quand la recherche fait son
bilan et trace des perspectives, L’Harmattan
Carpusor AG, Loges WE (2006) Rental discrimination and ethnicity in names. J Appl Soc Psychol
36(4):934–952
Carrasco V (2007) Le pacte civil de solidarité: une forme d’union qui se banalise. Infostat Justice
97(4)
Cartwright N (1983) How the laws of physics lie. Oxford University Press, Oxford
Casella G, Berger RL (1990) Statistical Inference. Duxbury Advanced Series
Casey B, Pezier J, Spetzler C (1976) The role of risk classification in property and casualty
insurance: a study of the risk assessment process : final report. Stanford Research Institute,
Stanford
Cassedy JH (2013) Demography in early America. Harvard University Press, Harvard
Castelvecchi D (2016) Can we open the black box of ai? Nature News 538(7623):20
Caton S, Haas C (2020) Fairness in machine learning: A survey. arXiv 2010.04053
Cavanagh M (2002) Against equality of opportunity. Clarendon Press, Oxford, England
Central Bank of Ireland (2021) Review of differential pricing in the private car and home insurance
markets. Central Bank of Ireland Publications, Dublin, Ireland
Chakraborty S, Raghavan KR, Johnson MP, Srivastava MB (2013) A framework for context-aware
privacy of sensor data on mobile systems. In: Proceedings of the 14th Workshop on Mobile
Computing Systems and Applications, Association for Computing Machinery, HotMobile ’13
Chardenon A (2019) Voici maxime, le chatbot juridique d’axa, fruit d’une démarche collaborative.
L’usine digitale 12 février
Charles KK, Guryan J (2011) Studying discrimination: Fundamental challenges and recent
progress. Annu Rev Econ 3(1):479–511
Charpentier A (2014) Computational actuarial science with R. CRC Press, Boca Raton
Charpentier A, Flachaire E, Ly A (2018) Econometrics and machine learning. Economie et
Statistique 505(1):147–169
Charpentier A, Élie R, Remlinger C (2021) Reinforcement learning in economics and finance.
Comput Econ 10014
Charpentier A, Flachaire E, Gallic E (2023a) Optimal transport for counterfactual estimation: A
method for causal inference. In: Thach NN, Kreinovich V, Ha DT, Trung ND (eds) Optimal
transport statistics for economics and related topics. Springer, New York
Charpentier A, Hu F, Ratz P (2023b) Mitigating discrimination in insurance with Wasserstein
barycenters. BIAS, 3rd Workshop on Bias and Fairness in AI, International Workshop of ECML
PKDD
Chassagnon A (1996) Sélection adverse: modèle générique et applications. PhD thesis, Paris,
EHESS
Chassonnery-Zaïgouche C (2020) How economists entered the ‘numbers game’: Measuring
discrimination in the us courtrooms, 1971–1989. J Hist Econ Thought 42(2):229–259
Chatterjee S, Barcun S (1970) A nonparametric approach to credit screening. J Am Stat Assoc
65(329):150–154
Chaufton A (1886) Les assurances, leur passé, leur présent, leur avenir, au point de vue rationnel,
technique et pratique, moral, économique et social, financier et administratif, légal, législatif et
contractuel, en France et à l’étranger. Chevalier-Marescq
Chen SX (1999) Beta kernel estimators for density functions. Comput Stat Data Anal 31(2):131–
145
Chen Y, Liu Y, Zhang M, Ma S (2017) User satisfaction prediction with mouse movement
information in heterogeneous search environment. IEEE Trans Knowl Data Eng 29(11):2470–
2483
Cheney-Lippold J (2017) We are data. New York University Press, New York
Cheng M, De-Arteaga M, Mackey L, Kalai AT (2023) Social norm bias: residual harms of fairness-
aware algorithms. Data Mining and Knowledge Discovery, pp 1–27
Chetty R, Stepner M, Abraham S, Lin S, Scuderi B, Turner N, Bergeron A, Cutler D (2016)
The association between income and life expectancy in the United States, 2001–2014. JAMA
315(16):1750–1766
Cheung I, McCartt AT (2011) Declines in fatal crashes of older drivers: Changes in crash risk and
survivability. Accident Anal Prevent 43(3):666–674
Chiappa S (2019) Path-specific counterfactual fairness. Proc AAAI Confer Artif Intell
33(01):7801–7808
Chicco D, Jurman G (2020) The advantages of the matthews correlation coefficient (mcc) over f1
score and accuracy in binary classification evaluation. BMC Genom 21(1):1–13
Chollet F (2021) Deep learning with Python. Simon and Schuster, New York
Chouldechova A (2017) Fair prediction with disparate impact: A study of bias in recidivism
prediction instruments. Big Data 5(2):153–163
Christensen CM, Dillon K, Hall T, Duncan DS (2016) Competing against luck: The story of
innovation and customer choice. Harper Business, New York
Churchill G, Nevin JR, Watson RR (1977) The role of credit scoring in the loan decision. Credit
World 3(3):6–10
Cinelli C, Hazlett C (2020) Making sense of sensitivity: Extending omitted variable bias. J Roy
Stat Soc B (Stat Methodol) 82(1):39–67
Clark G, Clark GW (1999) Betting on lives: the culture of life insurance in England, 1695–1775.
Manchester University Press, Manchester
Clarke DD, Ward P, Bartle C, Truman W (2010) Older drivers’ road traffic crashes in the UK.
Accident Anal Prevent 42(4):1018–1024
Cohen I, Goldszmidt M (2004) Properties and benefits of calibrated classifiers. In: 8th European
Conference on Principles and Practice of Knowledge Discovery in Databases, vol 3202,
pp 125–136. Springer, New York
Cohen J (1960) A coefficient of agreement for nominal scales. Educat Psychol Measur 20(1):37–46
Cohen JE (1986) An uncertainty principle in demography and the unisex issue. Am Stat 40(1):32–
39
Coldman AJ, Braun T, Gallagher RP (1988) The classification of ethnic status using name
information. J Epidemiol Community Health 42(4):390–395
Collins BW (2007) Tackling unconscious bias in hiring practices: The plight of the rooney rule.
New York University Law Rev 82:870
Collins E (2018) Punishing risk. Georgetown Law J 107:57
de Condorcet N (1785) Essai sur l’application de l’analyse à la probabilité des décisions rendues à
la pluralité des voix. Imprimerie royale, Paris
Constine J (2017) Facebook rolls out AI to detect suicidal posts before they’re reported. Techcrunch
November 27
Conway DA, Roberts HV (1983) Reverse regression, fairness, and employment discrimination. J
Bus Econ Stat 1(1):75–85
Cook TD, Campbell DT, Shadish W (2002) Experimental and quasi-experimental designs for
generalized causal inference. Houghton Mifflin Boston, MA
Cooper DN, Krawczak M, Polychronakos C, Tyler-Smith C, Kehrer-Sawatzki H (2013) Where
genotype is not predictive of phenotype: towards an understanding of the molecular basis of
reduced penetrance in human inherited disease. Human Genetics 132:1077–1130
Cooper PJ (1990) Differences in accident characteristics among elderly drivers and between elderly
and middle-aged drivers. Accident Anal Prevention 22(5):499–508
Corbett-Davies S, Pierson E, Feller A, Goel S, Huq A (2017) Algorithmic decision making and the
cost of fairness. arXiv 1701.08230
Corlier F (1998) Segmentation : le point de vue de l’assureur. In: Cousy H, Classens H,
Van Schoubroeck C (eds) Compétitivité, éthique et assurance. Academia Bruylant
Cornell B, Welch I (1996) Culture, information, and screening discrimination. J Polit Econ
104(3):542–571
Correll J, Judd CM, Park B, Wittenbrink B (2010) Measuring prejudice, stereotypes and discrimi-
nation. The SAGE handbook of prejudice, stereotyping and discrimination, pp 45–62
Correll SJ, Benard S (2006) Biased estimators? comparing status and statistical theories of gender
discrimination. In: Advances in group processes, vol 23, pp 89–116. Emerald Group Publishing
Limited, Leeds, England
Cortina A (2022) Aporophobia: why we reject the poor instead of helping them. Princeton
University Press, Princeton
Côté O (2023) Methodology applied to build a non-discriminatory general insurance rate according
to a pre-specified sensitive variable. MSc Thesis, Université Laval
Côté O, Côté MP, Charpentier’ A (2023) A fair price to pay: exploiting directed acyclic graphs for
fairness in insurance. Mimeo
Cotter A, Jiang H, Gupta MR, Wang S, Narayan T, You S, Sridharan K (2019) Optimization with
non-differentiable constraints with applications to fairness, recall, churn, and other goals. J
Mach Learn Res 20(172):1–59
Cotton J (1988) On the decomposition of wage differentials. Rev Econ Stat, 236–243
Coulmont B, Simon P (2019) Quels prénoms les immigrés donnent-ils à leurs enfants en France?
Populat Soc (4):1–4
Council of the European Union (2004) Council directive 2004/113/ec of 13 december 2004
implementing the principle of equal treatment between men and women in the access to and
supply of goods and services. Official J Eur Union 373:37–43
Cournot AA (1843) Exposition de la théorie des chances et des probabilités. Hachette
Coutts S (2016) Anti-choice groups use smartphone surveillance to target ‘abortion-minded
women’during clinic visits. Rewire News Group May 25
Cowell F (2011) Measuring inequality. Oxford University Press, Oxford
Cragg JG (1971) Some statistical models for limited dependent variables with application to the
demand for durable goods. Econometrica J Econometric Soc, 829–844
Crawford JT, Leynes PA, Mayhorn CB, Bink ML (2004) Champagne, beer, or coffee? a corpus of
gender-related and neutral words. Behav Res Methods Instrum Comput 36:444–458
Cresta J, Laffont J (1982) The value of statistical information in insurance contracts. GREMAQ
Working Paper 8212
Crizzle AM, Classen S, Uc EY (2012) Parkinson disease and driving: an evidence-based review.
Neurology 79(20):2067–2074
Crocker KJ, Snow A (2013) The theory of risk classification. In: Loubergé H, Dionne G (eds)
Handbook of insurance, pp 281–313. Springer, New York
Crossney KB (2016) Redlining. https://philadelphiaencyclopedia.org/essays/redlining/
Cudd AE, Jones LE (2005) Sexism. A companion to applied ethics, pp 102–117
Cummins JD, Smith BD, Vance RN, Vanderhel J (2013) Risk classification in life insurance, vol 1.
Springer Science & Business Media, New York
Cunha HS, Sclauser BS, Wildemberg PF, Fernandes EAM, Dos Santos JA, Lage MdO, Lorenz
C, Barbosa GL, Quintanilha JA, Chiaravalloti-Neto F (2021) Water tank and swimming pool
detection based on remote sensing and deep learning: Relationship with socioeconomic level
and applications in dengue control. Plos One 16(12):e0258681
Cunningham S (2021) Causal inference. Yale University Press, Yale
Cybenko G (1989) Approximation by superpositions of a sigmoidal function. Math Control Signals
Syst 2(4):303–314
Czerniawski AM (2007) From average to ideal: The evolution of the height and weight table in the
United States, 1836–1943. Soc Sci Hist 31(2):273–296
Da Silva N (2023) La bataille de la Sécu: une histoire du système de santé. La fabrique éditions
Dalenius T (1977) Towards a methodology for statistical disclosure control. statistik Tidskrift
15(429–444):2–1
Dalziel JR, Job RS (1997) Motor vehicle accidents, fatigue and optimism bias in taxi drivers.
Accident Analy Prevent 29(4):489–494
Dambrum M, Despres G, Guimond S (2003) On the multifaceted nature of prejudice: Psy-
chophysiological responses to ingroup and outgroup ethnic stimuli. Current Res Soc Psychol
8(14):187–206
Dane SM (2006) The potential for racial discrimination by homeowners insurers through the use
of geographic rating territories. J Insurance Regulat 24(4):21
Daniel JE, Daniel JL (1998) Preschool children’s selection of race-related personal names. J Black
Stud 28(4):471–490
Daniel WW, et al. (1968) Racial discrimination in England: based on the PEP report. Penguin
Books, Harmondsworth
Daniels N (1990) Insurability and the hiv epidemic: ethical issues in underwriting. Milbank Q,
497–525
Daniels N (1998) Am I my parents’ keeper? An essay on justice between the young and the old.
Oxford University Press, Oxford
Dar-Nimrod I, Heine SJ (2011) Genetic essentialism: on the deceptive determinism of dna. Psychol
Bull 137(5):800
Darlington RB (1971) Another look at ‘cultural fairness’ 1. J Educat Measur 8(2):71–82
Daston L (1992) Objectivity and the escape from perspective. Soc Stud Sci 22(4):597–618
Davenport T (2006) Competing on analytics. Harvard Bus Rev 84:1–10
David H (2015) Why are there still so many jobs? The history and future of workplace automation.
J Econ Perspect 29(3):3–30
Davidson R, MacKinnon JG, et al. (2004) Econometric theory and methods, vol 5. Oxford
University Press, New York
Davis GA (2004) Possible aggregation biases in road safety research and a mechanism approach
to accident modeling. Accident Anal Prevent 36(6):1119–1127
Dawid AP (1979) Conditional independence in statistical theory. J Roy Stat Soc B (Methodologi-
cal) 41(1):1–15
Dawid AP (1982) The well-calibrated Bayesian. J Am Stat Assoc 77(379):605–610
Dawid AP (2000) Causal inference without counterfactuals. J Am Stat Assoc 95(450):407–424
Dawid AP (2004) Probability forecasting. Encyclopedia of Statistical Sciences 10
De Alba E (2004) Bayesian claims reserving. Encyclopedia of Actuarial Science
De Baere G, Goessens E (2011) Gender differentiation in insurance contracts after the judgment in
case c-236/09, Association Belge des Consommateurs Test-Achats asbl v. conseil des ministres.
Colum J Eur L 18:339
De Pril N, Dhaene J (1996) Segmentering in verzekeringen. DTEW Research Report 9648, pp 1–56
De Wit GW, Van Eeghen J (1984) Rate making and society’s sense of fairness. ASTIN Bull J IAA
14(2):151–163
De Witt J (1671) Value of life annuities in proportion to redeemable annuities. Originally in Dutch; translated in Hendriks (1853), pp 232–49
Dean LT, Nicholas LH (2018) Using credit scores to understand predictors and consequences. Am
J Public Health 108(11):1503–1505
Dean LT, Schmitz KH, Frick KD, Nicholas LH, Zhang Y, Subramanian S, Visvanathan K (2018)
Consumer credit as a novel marker for economic burden and health after cancer in a diverse
population of breast cancer survivors in the USA. J Cancer Survivorship 12(3):306–315
Debet A (2007) Mesure de la diversité et protection des données personnelles. Commission
Nationale de l’Informatique et des Libertés 16/05/2007 08:40 DECO/IRC
Défenseur des droits (2020) Algorithmes: prévenir l’automatisation des discriminations. https://
www.defenseurdesdroits.fr/sites/default/files/2023-07/ddd_rapport_algorithmes_2020_EN_
20200531.pdf
Dehejia RH, Wahba S (1999) Causal effects in nonexperimental studies: Reevaluating the
evaluation of training programs. J Am Stat Assoc 94(448):1053–1062
Delaporte P (1962) Sur l’efficacité des critères de tarification de l’assurance contre les accidents
d’automobiles. ASTIN Bull J IAA 2(1):84–95
Delaporte PJ (1965) Tarification du risque individuel d’accidents d’automobiles par la prime
modelée sur le risque. ASTIN Bull J IAA 3(3):251–271
Demakakos P, Biddulph JP, Bobak M, Marmot MG (2016) Wealth and mortality at older ages: a
prospective cohort study. J Epidemiol Community Health 70(4):346–353
Dennis RM (2004) Racism. In: Kuper A, Kuper J (eds) The social science encyclopedia, Routledge
Denuit M, Charpentier A (2004) Mathématiques de l’assurance non-vie: Tome I Principes
fondamentaux de théorie du risque. Economica. Paris, France
Denuit M, Charpentier A (2005) Mathématiques de l’assurance non-vie: Tome II Tarification et
provisionnement. Economica. Paris, France
Denuit M, Maréchal X, Pitrebois S, Walhin JF (2007) Actuarial modelling of claim counts: Risk
classification, credibility and bonus-malus systems. Wiley, New York
Denuit M, Hainault D, Trufin J (2019a) Effective statistical learning methods for actuaries I (GLMs
and extensions). Springer, New York
Denuit M, Hainault D, Trufin J (2019b) Effective statistical learning methods for actuaries III
(neural networks and extensions). Springer, New York
Denuit M, Hainault D, Trufin J (2020) Effective statistical learning methods for actuaries II (tree-
based methods and extensions). Springer, New York
Denuit M, Charpentier A, Trufin J (2021) Autocalibration and tweedie-dominance for insurance
pricing with machine learning. Insurance Math Econ
Depoid P (1967) Applications de la statistique aux assurances accidents et dommages: cours professé à l’Institut de statistique de l’Université de Paris, 2e édition revue et augmentée. Berger-Levrault
Derrig RA, Ostaszewski KM (1995) Fuzzy techniques of pattern recognition in risk and claim
classification. J Risk Insurance, 447–482
Derrig RA, Weisberg HI (1998) Aib pip claim screening experiment final report. understanding
and improving the claim investigation process. AIB Filing on Fraudulent Claims Payment
Desrosières A (1998) The politics of large numbers: A history of statistical reasoning. Harvard
University Press, Harvard
Devine PG (1989) Stereotypes and prejudice: Their automatic and controlled components. J
Personality Soc Psychol 56(1):5
Dice LR (1945) Measures of the amount of ecologic association between species. Ecology
26(3):297–302
Dierckx G (2006) Logistic regression model. Encyclopedia of Actuarial Science
Dieterich W, Mendoza C, Brennan T (2016) Compas risk scales: Demonstrating accuracy equity
and predictive parity. Northpointe Inc 7(7.4):1
Dilley S, Greenwood G (2017) Abandoned 999 calls to police more than double. BBC 19
September 2017
DiNardo J (2016) Natural experiments and quasi-natural experiments, pp 1–12. Palgrave Macmil-
lan UK, London
DiNardo J, Fortin N, Lemieux T (1995) Labor market institutions and the distribution of wages,
1973–1992: A semiparametric approach. National Bureau of Economic Research (NBER)
Dingman H (1927) Insurability, prognosis and selection. The spectator company
Dinur R, Beit-Hallahmi B, Hofman JE (1996) First names as identity stereotypes. J Soc Psychol
136(2):191–200
Dionne G (2000) Handbook of insurance. Springer, New York
Dionne G (2013) Contributions to insurance economics. Springer, New York
Dionne G, Harrington SE (1992) An introduction to insurance economics. Springer, New York
Dobbin F (2001) Do the social sciences shape corporate anti-discrimination practice: The United
States and France. Comparative Labor Law Pol J 23:829
Donoghue JD (1957) An eta community in japan: the social persistence of outcaste groups. Am
Anthropol 59(6):1000–1017
Dorlin E (2005) Sexe, genre et intersexualité: la crise comme régime théorique. Raisons Politiques
2:117–137
Dostie G (1974) Entrevue de michèle lalonde. Le Journal 1er juin 1974
Dressel J, Farid H (2018) The accuracy, fairness, and limits of predicting recidivism. Sci Adv
4(1):eaao5580
Du Bois W (1896) Review of race traits and tendencies of the American negro. Ann Am Acad,
127–33
Duan T, Anand A, Ding DY, Thai KK, Basu S, Ng A, Schuler A (2020) Ngboost: Natural gradient
boosting for probabilistic prediction. In: International Conference on Machine Learning,
PMLR, pp 2690–2700
Dubet F (2014) La Préférence pour l’inégalité. Comprendre la crise des solidarités: Comprendre la
crise des solidarités. Seuil - La République des idées
Dubet F (2016) Ce qui nous unit : Discriminations, égalité et reconnaissance. Seuil - La République
des idées
Dublin L (1925) Report of the joint committee on mortality of the association of life insurance
medical directors. The Actuarial Society of America
Dudley RM (2010) Distances of probability measures and random variables. In: Selected works of
RM dudley, pp 28–37. Springer, New York
Duggan JE, Gillingham R, Greenlees JS (2008) Mortality and lifetime income: evidence from us
social security records. IMF Staff Papers 55(4):566–594
Duhigg C (2019) How companies learn your secrets. The New York Times 02-16-2019
Duivesteijn W, Feelders A (2008) Nearest neighbour classification with monotonicity constraints.
In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases,
pp 301–316. Springer, New York
Dulisse B (1997) Older drivers and risk to other road users. Accident Anal Prevent 29(5):573–582
Dumas A, Allodji R, Fresneau B, Valteau-Couanet D, El-Fayech C, Pacquement H, Laprie A,
Nguyen TD, Bondiau PY, Diallo I, et al. (2017) The right to be forgotten: a change in access to
insurance and loans after childhood cancer? J Cancer Survivorship 11:431–437
Duncan A, McPhail M (2013) Price optimization for the us market. techniques and implementation
strategies.”. In: Ratemaking and Product Management Seminar
Duncan C, Loretto W (2004) Never the right age? gender and age-based discrimination in
employment. Gender Work Organization 11(1):95–115
Durkheim É (1897) Le suicide: étude sociologique. Félix Alcan Editeur
Durry G (2001) La sélection de la clientèle par l’assureur : aspects juridiques. Risques 45:65–71
Dwivedi M, Malik HS, Omkar S, Monis EB, Khanna B, Samal SR, Tiwari A, Rathi A (2021) Deep
learning-based car damage classification and detection. In: Advances in Artificial Intelligence
and Data Engineering: Select Proceedings of AIDE 2019, pp 207–221. Springer, New York
Dwork C, Hardt M, Pitassi T, Reingold O, Zemel R (2012) Fairness through awareness. In: Pro-
ceedings of the 3rd Innovations in Theoretical Computer Science Conference, vol 1104.3913,
pp 214–226
Dwoskin E (2018) Facebook is rating the trustworthiness of its users on a scale from zero to one.
Washington Post 21-08
Eco U (1992) Comment voyager avec un saumon. Grasset
Edgeworth FY (1922) Equal pay to men and women for equal work. Econ J 32(128):431–457
Edwards J (1932) Ten years of rates and rating bureaus in ontario, applied to automobile insurance.
Proc Casualty Actuarial Soc 19:22–64
Eidelson B (2015) Discrimination and disrespect. Oxford University Press, Oxford
Eidinger E, Enbar R, Hassner T (2014) Age and gender estimation of unfiltered faces. IEEE Trans
Inf Forens Secur 9(12):2170–2179
Eisen R, Eckles DL (2011) Insurance economics. Springer, New York
Ekeland I (1995) Le chaos. Flammarion
England P (1994) Neoclassical economists’ theories of discrimination. In: Equal employment
opportunity: Labor market discrimination and public policy, Aldine de Gruyter, pp 59–70
Ensmenger N (2015) “beards, sandals, and other signs of rugged individualism”: masculine culture
within the computing professions. Osiris 30(1):38–65
Epstein L, King G (2002) The rules of inference. The University of Chicago Law Review, pp 1–133
Erwin C, Williams JK, Juhl AR, Mengeling M, Mills JA, Bombard Y, Hayden MR, Quaid
K, Shoulson I, Taylor S, et al. (2010) Perception, experience, and response to genetic
discrimination in huntington disease: The international respond-hd study. Am J Med Genet
B Neuropsychiatric Genet 153(5):1081–1093
Erwin PG (1995) A review of the effects of personal name stereotypes. Representative Research in
Social Psychology
European Commission (1995) Directive 95/46/ec of the european parliament and of the council of
24 october 1995 on the protection of individuals with regard to the processing of personal data
and on the free movement of such data. Official J Eur Communit 38(281):31–50
Council of the European Union (2018) Proposal for a council directive on implementing the principle of equal treatment between persons irrespective of religion or belief, disability, age or sexual orientation. Proceedings of Council of the European Union 11015/08
Ewald F (1986) Histoire de l’Etat providence: les origines de la solidarité. Grasset
Eze EC (1997) Race and the enlightenment: A reader. Wiley, New York
Fagyal Z (2010) Accents de banlieue. Aspects prosodiques du français populaire en contact avec
les langues de l’immigration, L’Harmattan
Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties.
J Am Stat Assoc 96(456):1348–1360
Farbmacher H, Huber M, Lafférs L, Langen H, Spindler M (2022) Causal mediation analysis with
double machine learning. Economet J 25(2):277–300
Farebrother RW (1976) Further results on the mean square error of ridge regression. J Roy Stat
Soc B (Methodological), 248–250
Feder A, Oved N, Shalit U, Reichart R (2021) causalm: Causal model explanation through
counterfactual language models. Comput Linguist 47(2):333–386
Feeley MM, Simon J (1992) The new penology: Notes on the emerging strategy of corrections and
its implications. Criminology 30(4):449–474
Feinberg J (1970) Justice and personal desert. In: Feinberg J (ed) Doing and deserving. Princeton
University Press, Princeton
Feine J, Gnewuch U, Morana S, Maedche A (2019) Gender bias in chatbot design. In: International
workshop on chatbot research and design, pp 79–93. Springer, New York
Feiring E (2009) Reassessing insurers’ access to genetic information: genetic privacy, ignorance, and injustice. Bioethics 23(5):300–310
Feld SL (1991) Why your friends have more friends than you do. Am J Sociol 96(6):1464–1477
Feldblum S (2006) Risk classification, pricing aspects. Encyclopedia of actuarial science
Feldblum S, Brosius JE (2003) The minimum bias procedure: A practitioner’s guide. In: Pro-
ceedings of the Casualty Actuarial Society, Casualty Actuarial Society Arlington, vol 90, pp
196–273
Feldman F (1995) Desert: Reconsideration of some received wisdom. Mind 104(413):63–77
Feldman M, Friedler SA, Moeller J, Scheidegger C, Venkatasubramanian S (2015) Certifying
and removing disparate impact. In: Proceedings of the 21th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, vol 1412.3756, pp 259–268
Feller A, Pierson E, Corbett-Davies S, Goel S (2016) A computer program used for bail and
sentencing decisions was labeled biased against blacks. it’s actually not that clear. The
Washington Post October 17
Feller W (1957) An introduction to probability theory and its applications. Wiley, New York
Feng R (2023) Decentralized insurance. Springer, New York
Fenton N, Neil M (2018) Risk assessment and decision analysis with Bayesian networks. CRC
Press, Boca Raton
Ferber MA, Green CA (1982a) Employment discrimination: Reverse regression or reverse logic.
Working Paper, University of Illinois, Champaign
Ferber MA, Green CA (1982b) Traditional or reverse sex discrimination? a case study of a large
public university. Ind Labor Relat Rev 35(4):550–564
Fermanian JD, Guegan D (2021) Fair learning with bagging. SSRN 3969362
Finger RJ (2006) Risk classification. In: Bass I, Basson S, Bashline D, Chanzit L, Gillam W,
Lotkowski E (eds) Foundations of Casualty Actuarial Science, Casualty Actuarial Society, pp
287–341
Finkelstein A, Taubman S, Wright B, Bernstein M, Gruber J, Newhouse JP, Allen H, Baicker K,
Group OHS (2012) The oregon health insurance experiment: evidence from the first year. Q J
Econ 127(3):1057–1106
Finkelstein EA, Brown DS, Wrage LA, Allaire BT, Hoerger TJ (2010) Individual and aggregate
years-of-life-lost associated with overweight and obesity. Obesity 18(2):333–339
Finkelstein MO (1980) The judicial reception of multiple regression studies in race and sex
discrimination cases. Columbia Law Rev 80(4):737–754
Firpo SP (2017) Identifying and measuring economic discrimination. IZA World of Labor
Fiscella K, Fremont AM (2006) Use of geocoding and surname analysis to estimate race and
ethnicity. Health Ser Res 41(4p1):1482–1500
Fish HC (1868) The agent’s manual of life assurance. Wynkoop & Hallenbeck
Fisher A, Rudin C, Dominici F (2019) All models are wrong, but many are useful: Learning a
variable’s importance by studying an entire class of prediction models simultaneously. J Mach
Learn Res 20(177):1–81
Fisher FM (1980) Multiple regression in legal proceedings. Columbia Law Rev 80:702
Fisher RA (1921) Studies in crop variation. i. an examination of the yield of dressed grain from
broadbalk. J Agricultural Sci 11(2):107–135
Fisher RA, Mackenzie WA (1923) Studies in crop variation. ii. The manurial response of different
potato varieties. J Agricultural Sci 13(3):311–320
Fix M, Turner MA (1998) A National Report Card on Discrimination in America: The Role of
Testing. ERIC: Proceedings of the Urban Institute Conference (Washington, DC, March 1998)
Flanagan T (1985) Insurance, human rights, and equality rights in canada: When is discrimination
“reasonable?”. Canad J Polit Sci/Revue canadienne de science politique 18(4):715–737
Fleurbaey M, Maniquet F (1996) A theory of fairness and social welfare. Cambridge University
Press, Cambridge
Flew A (1993) Three concepts of racism. Int Soc Sci Rev 68(3):99
Fong C, Hazlett C, Imai K (2018) Covariate balancing propensity score for a continuous treatment:
Application to the efficacy of political advertisements. Ann Appl Stat 12(1):156–177
Fontaine H (2003) Driver age and road traffic accidents: what is the risk for seniors? Recherche-
transports-sécurité
Fontaine KR, Redden DT, Wang C, Westfall AO, Allison DB (2003) Years of life lost due to
obesity. J Am Med Assoc 289(2):187–193
Foot P (1967) The problem of abortion and the doctrine of the double effect. Oxford Rev 5
Forfar DO (2006) Life table. Encyclopedia of Actuarial Science
Fortin N, Lemieux T, Firpo S (2011) Decomposition methods in economics. In: Handbook of labor
economics, vol 4, pp 1–102. Elsevier
Fourcade M (2016) Ordinalization: Lewis a. coser memorial award for theoretical agenda setting
2014. Sociological Theory 34(3):175–195
Fourcade M, Healy K (2013) Classification situations: Life-chances in the neoliberal era. Account
Organizations Soc 38(8):559–572
Fox ET (2013) ’Piratical Schemes and Contracts’: Pirate Articles and Their Society 1660–1730.
PhD Thesis, University of Exeter
François P (2022) Catégorisation, individualisation. retour sur les scores de crédit. hal 03508245
Freedman DA (1999) Ecological inference and the ecological fallacy. Int Encyclopedia Soc Behav
Sci 6(4027-4030):1–7
Freedman DA, Berk RA (2008) Weighting regressions by propensity scores. Evaluat Rev
32(4):392–409
Freeman S (2007) Rawls. Routledge
Frees EW (2006) Regression models for data analysis. Encyclopedia of Actuarial Science
Frees EW, Huang F (2023) The discriminating (pricing) actuary. North American Actuarial J
27(1):2–24
Frees EW, Meyers G, Cummings AD (2011) Summarizing insurance scores using a gini index. J
Am Stat Assoc 106(495):1085–1098
Frees EW, Derrig RA, Meyers G (2014a) Predictive modeling applications in actuarial science,
vol 1. Cambridge University Press, Cambridge
Frees EW, Meyers G, Cummings AD (2014b) Insurance ratemaking and a gini index. J Risk
Insurance 81(2):335–366
Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an
application to boosting. J Comput Syst Sci 55(1):119–139
Frezal S, Barry L (2019) Fairness in uncertainty: Some limits and misinterpretations of actuarial
fairness. J Bus Ethics 167(1):127–136
Fricker M (2007) Epistemic injustice: Power and the ethics of knowing. Oxford University Press,
Oxford
Friedler SA, Scheidegger C, Venkatasubramanian S (2016) On the (im) possibility of fairness.
arXiv 1609.07236
Friedler SA, Scheidegger C, Venkatasubramanian S, Choudhary S, Hamilton EP, Roth D (2019) A
comparative study of fairness-enhancing interventions in machine learning. In: Proceedings of
the Conference on Fairness, Accountability, and Transparency, pp 329–338
Friedman J (2001) Greedy function approximation: a gradient boosting machine. Ann Stat, 1189–
1232
Friedman J, Popescu BE (2008) Predictive learning via rule ensembles. Ann Appl Stat, 916–954
Friedman J, Hastie T, Tibshirani R (2000) Additive logistic regression: a statistical view of boosting
(with discussion and a rejoinder by the authors). Ann Stat 28(2):337–407
Friedman J, Hastie T, Tibshirani R, et al. (2001) The elements of statistical learning. Springer, New
York
Friedman S, Canaan M (2014) Overcoming speed bumps on the road to telematics. In: Challenges
and opportunities facing auto insurers with and without usage-based programs, Deloitte
Frisch R, Waugh FV (1933) Partial time regressions as compared with individual trends.
Econometrica, 387–401
Froot KA, Kim M, Rogoff KS (1995) The law of one price over 700 years. National Bureau of
Economic Research (NBER) 5132
Fry T (2015) A discussion on credibility and penalised regression, with implications for actuarial
work. Actuaries Institute
Fryer Jr RG, Levitt SD (2004) The causes and consequences of distinctively black names. Q J Econ
119(3):767–805
Gaddis SM (2017) How black are lakisha and jamal? Racial perceptions from names used in
correspondence audit studies. Sociol Sci 4:469–489
Gadet F (2007) La variation sociale en français. Editions Ophrys
Gal Y, Ghahramani Z (2016) Dropout as a Bayesian approximation: Representing model
uncertainty in deep learning. In: International Conference on Machine Learning, PMLR, pp
1050–1059
Galichon A (2016) Optimal transport methods in economics. Princeton University Press, Princeton
Galindo C, Moreno P, González J, Arevalo V (2009) Swimming pools localization in colour
high-resolution satellite images. In: 2009 IEEE International Geoscience and Remote Sensing
Symposium, vol 4, pp IV–510. IEEE
Galles D, Pearl J (1998) An axiomatic characterization of causal counterfactuals. Foundations Sci
3:151–182
Galton F (1907) Vox populi. Nature 75(7):450–451
Gambs S, Killijian MO, del Prado Cortez MNn (2010) Show me how you move and i will tell you
who you are. In: Proceedings of the 3rd ACM International Workshop on Security and Privacy
in GIS and LBS
Gan G, Valdez EA (2020) Data clustering with actuarial applications. North American Actuarial J
24(2):168–186
Gandy OH (2016) Coming to terms with chance: Engaging rational discrimination and cumulative
disadvantage, Routledge
Garg N, Schiebinger L, Jurafsky D, Zou J (2018) Word embeddings quantify 100 years of gender
and ethnic stereotypes. Proc Natl Acad Sci 115(16):E3635–E3644
Garrioch D (2011) Mutual aid societies in eighteenth-century paris. French History & Civiliza-
tion 4
Gautron V, Dubourg É (2015) La rationalisation des outils et méthodes d’évaluation: de l’approche
clinique au jugement actuariel. Criminocorpus Revue d’Histoire de la justice, des crimes et des
peines
Gebelein H (1941) Das statistische Problem der Korrelation als Variations- und Eigenwertproblem und sein Zusammenhang mit der Ausgleichsrechnung. ZAMM - Journal of Applied Mathematics and Mechanics / Zeitschrift für Angewandte Mathematik und Mechanik 21(6):364–379
Gebru T, Krause J, Wang Y, Chen D, Deng J, Aiden EL, Fei-Fei L (2017) Using deep learning and
Google Street View to estimate the demographic makeup of neighborhoods across the United
States. Proc Natl Acad Sci 114(50):13108–13113
Geenens G (2014) Probit transformation for kernel density estimation on the unit interval. J Am
Stat Assoc 109(505):346–358
Gouriéroux C (1999) The econometrics of risk classification in insurance. Geneva Papers Risk
Insurance Theory 24(2):119–137
Gourieroux C (1999) Statistique de l’assurance. Economica
Gourieroux C, Jasiak J (2007) The econometrics of individual risk. Princeton University Press,
Princeton
Gower JC (1971) A general coefficient of similarity and some of its properties. Biometrics, 857–
871
Gowri A (2014) The irony of insurance: Community and commodity. PhD thesis, University of
Southern California
Granger CW (1969) Investigating causal relations by econometric models and cross-spectral
methods. Econometrica J Econ Soc, 424–438
Graves Jr JL (2015) Great is their sin: Biological determinism in the age of genomics. Ann Am
Acad Polit Soc Sci 661(1):24–50
Greene WH (1984) Reverse regression: The algebra of discrimination. J Bus Econ Stat 2(2):117–
120
Greenland S (2002) Causality theory for policy uses. In: Murray C (ed) Summary measures of
population, pp 291–302. Harvard University Press, Harvard
Greenwell BM (2017) pdp: an R package for constructing partial dependence plots. R J 9(1):421
Grobon S, Mourlot L (2014) Le genre dans la statistique publique en France. Regards croisés sur
l’économie 2:73–79
Gross ST (2017) Well-calibrated forecasts. In: Balakrishnan N, Colton T, Everitt B, Piegorsch W, Ruggeri F, Teugels JL (eds) Wiley StatsRef: Statistics Reference Online. https://doi.org/10.1002/9781118445112.stat00252.pub2
Groupement des Assureurs Automobiles (2021) Plan statistique automobile, résultats généraux,
voitures de tourisme. GAA
Guelman L, Guillén M (2014) A causal inference approach to measure price elasticity in
automobile insurance. Exp Syst Appl 41(2):387–396
Guelman L, Guillén M, Pérez-Marín AM (2012) Random forests for uplift modeling: an
insurance customer retention case. In: International conference on modeling and simulation
in engineering, economics and management, pp 123–133. Springer, New York
Guelman L, Guillén M, Perez-Marin AM (2014) A survey of personalized treatment models for
pricing strategies in insurance. Insurance Math Econ 58:68–76
Guillen M (2006) Fraud in insurance. Encyclopedia of Actuarial Science
Guillen M, Ayuso M (2008) Fraud in insurance. Encyclopedia of Quantitative Risk Analysis and
Assessment 2
Guo C, Pleiss G, Sun Y, Weinberger KQ (2017) On calibration of modern neural networks. In:
International conference on machine learning, pp 1321–1330. PMLR
Gupta S, Kamble V (2021) Individual fairness in hindsight. J Mach Learn Res 22(1):6386–6420
Guseva A, Rona-Tas A (2001) Uncertainty, risk, and trust: Russian and American credit card
markets compared. Am Sociol Rev, 623–646
Guven S, McPhail M (2013) Beyond the cost model: Understanding price elasticity and its
applications. In: Casualty actuarial society E-forum, Spring 2013, Citeseer
Haas D (2013) Merit, fit, and basic desert. Philos Explorat 16(2):226–239
Haberman S, Renshaw AE (1996) Generalized linear models and actuarial science. J Roy Stat Soc
D (Statistician) 45(4):407–436
Hacking I (1990) The taming of chance, vol 17. Cambridge University Press, Cambridge
Hager WD, Zimpleman L (1982) The Norris decision, its implications and application. Drake Law
Rev 32:913
Hale K (2021) A.I. bias caused 80% of Black mortgage applicants to be denied. Forbes 09/2021
Halley E (1693) An estimate of the degrees of the mortality of mankind, drawn from curious
tables of the births and funerals at the city of Breslaw; with an attempt to ascertain the price of
annuities upon lives. Philos Trans Roy Soc 17:596–610
Halpern JY (2016) Actual causality. MIT Press, Cambridge, MA
Hamilton DL, Gifford RK (1976) Illusory correlation in interpersonal perception: A cognitive basis
of stereotypic judgments. J Exp Soc Psychol 12(4):392–407
Hamilton JD (1994) Time series analysis. Princeton University Press, Princeton
Hand DJ (2020) Dark data: why what you don’t know matters. Princeton University Press,
Princeton
Hansotia BJ, Rukstales B (2002) Direct marketing for multichannel retailers: Issues, challenges
and solutions. J Database Market Customer Strat Manag 9:259–266
Hanssens DM, Parsons LJ, Schultz RL (2003) Market response models: Econometric and time
series analysis, vol 2. Springer Science & Business Media, New York
Hara K, Sun J, Moore R, Jacobs DW, Froehlich J (2014) Tohme: detecting curb ramps in Google
Street View using crowdsourcing, computer vision, and machine learning. In: Proceedings of
the 27th Annual ACM Symposium on User Interface Software and Technology
Harari YN (2018) 21 Lessons for the 21st century. Random House, New York
Harcourt BE (2008) Against prediction. University of Chicago Press, Chicago
Harcourt BE (2011) Surveiller et punir à l’âge actuariel. Déviance et Société 35:163
Harcourt BE (2015a) Exposed: Desire and disobedience in the digital age. Harvard University
Press, Harvard
Harcourt BE (2015b) Risk as a proxy for race: The dangers of risk assessment. Federal Sentencing
Rep 27(4):237–243
Harden KP (2023) Genetic determinism, essentialism and reductionism: semantic clarity for
contested science. Nature Rev Genet 24(3):197–204
Hardt M, Price E, Srebro N (2016) Equality of opportunity in supervised learning. Adv Neural Inf
Process Syst 29:3315–3323
Hardy GH, Littlewood JE, Pólya G (1952) Inequalities. Cambridge University Press, Cambridge
Hargreaves DJ, Colman AM, Sluckin W (1983) The attractiveness of names. Human Relat
36(4):393–401
Harrington SE, Niehaus G (1998) Race, redlining, and automobile insurance prices. J Bus
71(3):439–469
Harris M (1970) Referential ambiguity in the calculus of Brazilian racial identity. Southwestern J
Anthropol 26(1):1–14
Hartigan JA (1975) Clustering algorithms. Wiley, New York
Harwell D, Mayes B, Walls M, Hashemi S (2018) The accent gap. The Washington Post July 19
Hastie T, Tibshirani R (1987) Generalized additive models: some applications. J Am Stat Assoc
82(398):371–386
Hastie T, Tibshirani R, Wainwright M (2015) Statistical learning with sparsity. Monogr Stat Appl
Probab 143:143
Haugeland J (1989) Artificial intelligence: The very idea. MIT Press
Havens HV (1979) Issues and needed improvements in state regulation of the insurance business.
US General Accounting Office
Haykin S (1998) Neural networks: a comprehensive foundation. Prentice Hall PTR, Indianapolis,
IN
He Y, Xiong Y, Tsai Y (2020) Machine learning based approaches to predict customer churn for
an insurance company. In: 2020 Systems and Information Engineering Design Symposium
(SIEDS), pp 1–6. IEEE
Heckert NA, Filliben JJ, Croarkin CM, Hembree B, Guthrie WF, Tobias P, Prinz J, et al. (2002)
Handbook 151: SEMATECH e-handbook of statistical methods. NIST
Hedden B (2021) On statistical criteria of algorithmic fairness. Philos Public Affairs 49(2):209–
231
Hedges BA (1977) Gender discrimination in pension plans: Comment. J Risk Insurance 44(1):141–
144
Heen ML (2009) Ending Jim Crow life insurance rates. Northwestern J Law Soc Policy 4:360
Heidari H, Krause A (2018) Preventing disparate treatment in sequential decision making. In:
IJCAI, pp 2248–2254
Heidorn PB (2008) Shedding light on the dark data in the long tail of science. Library Trends
57(2):280–299
Heimer CA (1985) Reactive risk and rational action. University of California Press, California
Heller D (2015) High price of mandatory auto insurance in predominantly African American
communities. Tech. rep., Consumer Federation of America
Hellinger E (1909) Neue begründung der theorie quadratischer formen von unendlichvielen
veränderlichen. Journal für die reine und angewandte Mathematik 1909(136):210–271
Hellman D (1998) Two types of discrimination: The familiar and the forgotten. California Law
Rev 86:315
Hellman D (2011) When is discrimination wrong? Harvard University Press, Harvard
Helton JC, Davis F (2002) Illustration of sampling-based methods for uncertainty and sensitivity
analysis. Risk Anal 22(3):591–622
Henley A (2014) Abolishing the stigma of punishments served: Andrew Henley argues that those
who have been punished should be free from future discrimination. Criminal Justice Matters
97(1):22–23
Henriet D, Rochet JC (1987) Some reflections on insurance pricing. Eur Econ Rev 31(4):863–885
Héran F (2010) Inégalités et discriminations: Pour un usage critique et responsable de l’outil
statistique
Heras AJ, Pradier PC, Teira D (2020) What was fair in actuarial fairness? Hist Human Sci
33(2):91–114
Hernán MA, Robins JM (2010) Causal inference
Hertz J, Krogh A, Palmer RG (1991) Introduction to the theory of neural computation. CRC Press,
Boca Raton
Hesselager O, Verrall R (2006) Reserving in non-life insurance. Encyclopedia of Actuarial Science
Higham NJ (2008) Functions of matrices: theory and computation. SIAM, Philadelphia, PA
Hilbe JM (2014) Modeling count data. Cambridge University Press, Cambridge
Hill K (2022) A dad took photos of his naked toddler for the doctor. Google flagged him as a
criminal. The New York Times August 25
Hill K, White J (2020) Designed to deceive: do these people look real to you? The New York Times
11(21)
Hillier R (2022) The legal challenges of insuring against a pandemic. In: Pandemics: insurance and
social protection, pp 267–286. Springer, Cham
Hiltzik M (2013) Yes, men should pay for pregnancy coverage and here’s why. Los Angeles Times
November 1st
Hirschfeld HO (1935) A connection between correlation and contingency. Math Proc Camb Philos
Soc 31(4):520–524
Hitchcock C (1997) Probabilistic causation. Stanford Encyclopedia of Philosophy
Ho DE, Imai K, King G, Stuart EA (2007) Matching as nonparametric preprocessing for reducing
model dependence in parametric causal inference. Polit Anal 15(3):199–236
Hoerl AE, Kennard RW (1970) Ridge regression: Biased estimation for nonorthogonal problems.
Technometrics 12(1):55–67
Hoffman FL (1896) Race traits and tendencies of the American Negro, vol 11. American Economic
Association
Hoffman FL (1931) Cancer and smoking habits. Ann Surg 93(1):50
Hoffman KM, Trawalter S, Axt JR, Oliver MN (2016) Racial bias in pain assessment and treatment
recommendations, and false beliefs about biological differences between blacks and whites.
Proc Natl Acad Sci 113(16):4296–4301
Hofmann HJ (1990) Die Anwendung des CART-Verfahrens zur statistischen Bonitätsanalyse von Konsumentenkrediten. Zeitschrift für Betriebswirtschaft 60:941–962
Hofstede G (1995) Insurance as a product of national values. Geneva Papers Risk Insurance-Issues
Pract 20(4):423–429
Holland PW (1986) Statistics and causal inference. J Am Stat Assoc 81(396):945–960
Holland PW (2003) Causation and race. ETS Research Report Series RR-03-03
Holzer H, Neumark D (2000) Assessing affirmative action. J Econ Liter 38(3):483–568
Homans S, Phillips GW (1868) Tontine dividend life assurance policies. Equitable Life Assurance
Society of the United States
Hong D, Zheng YY, Xin Y, Sun L, Yang H, Lin MY, Liu C, Li BN, Zhang ZW, Zhuang J, et al.
(2021) Genetic syndromes screening by facial recognition technology: VGG-16 screening model
construction and evaluation. Orphanet J Rare Dis 16(1):1–8
Hooker S, Moorosi N, Clark G, Bengio S, Denton E (2020) Characterising bias in compressed
models. arXiv 2010.03058
Hornik K (1991) Approximation capabilities of multilayer feedforward networks. Neural Networks
4(2):251–257
Hornik K, Stinchcombe M, White H (1989) Multilayer feedforward networks are universal
approximators. Neural Networks 2(5):359–366
Horowitz MAC (1976) Aristotle and woman. J Hist Biol 9:183–213
Hosmer DW, Lemesbow S (1980) Goodness of fit tests for the multiple logistic regression model.
Commun Stat Theory Methods 9(10):1043–1069
Houston R (1992) Mortality in early modern Scotland: the life expectancy of advocates. Continuity
Change 7(1):47–69
Hsee CK, Li X (2022) A framing effect in the judgment of discrimination. Proc Natl Acad Sci
119(47):e2205988119
Hu F (2022) Semi-supervised learning in insurance: fairness and active learning. PhD thesis,
Institut polytechnique de Paris
Hu F, Ratz P, Charpentier A (2023a) Fairness in multi-task learning via Wasserstein barycenters.
Joint European Conference on Machine Learning and Knowledge Discovery in Databases –
ECML PKDD
Hu F, Ratz P, Charpentier A (2023b) A sequentially fair mechanism for multiple sensitive attributes.
ArXiv 2309.06627
Hubbard GN (1852) De l’organisation des sociétés de bienfaisance ou de secours mutuels et des
bases scientifiques sur lesquelles elles doivent être établies. Guillaumin, Paris
Huber PJ (1964) Robust estimation of a location parameter. Ann Math Stat 35:73–101
Hume D (1739) A treatise of human nature. Cambridge University Press, Cambridge
Hume D (1748) An enquiry concerning human understanding. Cambridge University Press,
Cambridge
Hunt E (2016) Tay, Microsoft's AI chatbot, gets a crash course in racism from Twitter. Guardian
24(3):2016
Hunter J (1775) Inaugural disputation on the varieties of man. In: Blumenbach JF (ed) De generis
humani varietate nativa. Vandenhoek & Ruprecht, Gottingae
Huttegger SM (2013) In defense of reflection. Philos Sci 80(3):413–433
Huttegger SM (2017) The probabilistic foundations of rational learning. Cambridge University
Press, Cambridge
Ichiishi T (2014) Game theory for economic analysis. Academic Press, Cambridge, MA
Ilic L, Sawada M, Zarzelli A (2019) Deep mapping gentrification in a large Canadian city using
deep learning and Google Street View. PloS One 14(3):e0212814
Imai K (2018) Quantitative social science: an introduction. Princeton University Press, Princeton
Imai K, Ratkovic M (2014) Covariate balancing propensity score. J Roy Stat Soc B Stat Methodol,
243–263
Imbens GW, Rubin DB (2015) Causal inference in statistics, social, and biomedical sciences.
Cambridge University Press, Cambridge
Ingold D, Soper S (2016) Amazon doesn't consider the race of its customers. Should it? Bloomberg
April 21st
Institute and Faculty of Actuaries (2021) The hidden risks of being poor: the poverty premium in
insurance. Faculty of Actuaries Report
Insurance Bureau of Canada (2021) Facts of the property and casualty insurance industry in
Canada. Insurance Bureau of Canada
Ismay P (2018) Trust among strangers: friendly societies in modern Britain. Cambridge University
Press, Cambridge
Iten R, Wagner J, Zeier Röschmann A (2021) On the identification, evaluation and treatment of
risks in smart homes: A systematic literature review. Risks 9(6):113
Ito J (2021) Supposedly ‘fair’ algorithms can perpetuate discrimination. Wired 02.05.2019
Jaccard P (1901) Distribution de la flore alpine dans le bassin des dranses et dans quelques régions
voisines. Bull de la Société Vaudoise de Sci Nature 37:241–272
Jackson JP, Depew DJ (2017) Darwinism, democracy, and race: American anthropology and
evolutionary biology in the twentieth century. Taylor & Francis, London
Jackson MO (2019) The human network: How your social position determines your power, beliefs,
and behaviors. Vintage
Jacobs J (1894) Aesop’s fables: selected and told anew. Capricorn Press, Santa Barbara, CA
Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice-Hall, Hoboken, NJ
Jann B (2008) The Blinder–Oaxaca decomposition for linear regression models. Stata J 8(4):453–
479
Jargowsky PA (2005) Omitted variable bias. Encyclopedia Social Measur 2:919–924
Jarvis B, Pearlman RF, Walsh SM, Schantz DA, Gertz S, Hale-Pletka AM (2019) Insurance rate
optimization through driver behavior monitoring. Google Patents 10,169,822
Jean N, Burke M, Xie M, Davis WM, Lobell DB, Ermon S (2016) Combining satellite imagery
and machine learning to predict poverty. Science 353(6301):790–794
Jeffreys H (1946) An invariant form for the prior probability in estimation problems. Proc Roy Soc
Lond A Math Phys Sci 186(1007):453–461
Jerry RH (2023) Understanding parametric insurance: A potential tool to help manage pandemic
risk. In: Covid-19 and insurance, pp 17–62. Springer, New York
Jewell WS (1974) Credible means are exact Bayesian for exponential families. ASTIN Bull J IAA
8(1):77–90
Jiang H, Nachum O (2020) Identifying and correcting label bias in machine learning. In:
International Conference on Artificial Intelligence and Statistics, Proceedings of Machine
Learning Research, pp 702–712
Jiang J, Nguyen T (2007) Linear and generalized linear mixed models and their applications, vol 1.
Springer, New York
Jo HH, Eom YH (2014) Generalized friendship paradox in networks with tunable degree-attribute
correlation. Phys Rev E 90(2):022809
Johannesson GT (2013) The history of Iceland. ABC-CLIO
Johnston L (1945) Effects of tobacco smoking on health. British Med J 2(4411):98
Jolliffe IT (2002) Principal component analysis. Springer, New York
Jones EE, Nisbett RE (1971) The actor and the observer: Divergent perceptions of the causes of
behavior. General Learning Press, New York
Jones ML (2016) Ctrl + Z: The Right to Be Forgotten. New York University Press, New York
Jordan A, Krüger F, Lerch S (2019) Evaluating probabilistic forecasts with scoringRules. J
Stat Softw 90:1–37
Jordan C (1881) Sur la série de Fourier. Comptes Rendus Hebdomadaires de l'Académie des Sci
92:228–230
Joseph S, Castan M (2013) The international covenant on civil and political rights: cases, materials,
and commentary. Oxford University Press, Oxford
Jost JT, Rudman LA, Blair IV, Carney DR, Dasgupta N, Glaser J, Hardin CD (2009) The existence
of implicit bias is beyond reasonable doubt: A refutation of ideological and methodological
objections and executive summary of ten studies that no manager should ignore. Res Organizat
Behav 29:39–69
Jung C, Kearns M, Neel S, Roth A, Stapleton L, Wu ZS (2019a) An algorithmic framework for
fairness elicitation. arXiv 1905.10660
Jung C, Kearns M, Neel S, Roth A, Stapleton L, Wu ZS (2019b) Eliciting and enforcing subjective
individual fairness. arXiv:1905.10660
Jung C, Kannan S, Lee C, Pai MM, Roth A, Vohra R (2020) Fair prediction with endogenous
behavior. arXiv 2002.07147
Kachur A, Osin E, Davydov D, Shutilov K, Novokshonov A (2020) Assessing the big five
personality traits using real-life static facial images. Sci Rep 10(1):1–11
Kaganoff BC (1996) A dictionary of Jewish names and their history. Jason Aronson, Lanham, MD
Kahlenberg RD (1996) The remedy: Class, race and affirmative action. Basic Books, New York
Kahneman D (2011) Thinking, fast and slow. Farrar, Straus and Giroux
Kamalich RF, Polachek SW (1982) Discrimination: Fact or fiction? An examination using an
alternative approach. Southern Econ J, 450–461
Kamen H (2014) The Spanish Inquisition: a historical revision. Yale University Press, Yale
Kamiran F, Calders T (2012) Data preprocessing techniques for classification without discrimina-
tion. Knowl Inf Syst 33(1):1–33
Kang JD, Schafer JL (2007) Demystifying double robustness: A comparison of alternative
strategies for estimating a population mean from incomplete data. Stat Sci 22(4):523–539
Kanngiesser P, Warneken F (2012) Young children consider merit when sharing resources with
others. PLOS ONE 8(8):e43979
Kant I (1775) Über die verschiedenen Rassen der Menschen. Nicolovius Edition
Kant I (1785) Bestimmung des Begriffs einer Menschenrace. Haude und Spener, Berlin
Kant I (1795) Zum ewigen Frieden. Ein philosophischer Entwurf. Nicolovius Edition
Kantorovich LV, Rubinshtein S (1958) On a space of totally additive functions. Vestnik of the St
Petersburg Univ Math 13(7):52–59
Kaplan D (2023) Bayesian statistics for the social sciences. Guilford Publications, New York
Karapiperis D, Birnbaum B, Brandenburg A, Castagna S, Greenberg A, Harbage R, Obersteadt
A (2015) Usage-based insurance and vehicle telematics: insurance market and regulatory
implications. CIPR Study Ser 1:1–79
Karimi H, Khan MFA, Liu H, Derr T, Liu H (2022) Enhancing individual fairness through
propensity score matching. In: 2022 IEEE 9th International Conference on Data Science and
Advanced Analytics (DSAA), pp 1–10. IEEE
Karras T, Laine S, Aittala M, Hellsten J, Lehtinen J, Aila T (2020) Analyzing and improving the
image quality of stylegan. In: Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pp 8110–8119
Kearns M, Roth A (2019) The ethical algorithm: The science of socially aware algorithm design.
Oxford University Press, Oxford
Kearns M, Valiant L (1989) Cryptographic limitations on learning Boolean formulae and finite
automata. J ACM 21(1):433–444
Kearns M, Neel S, Roth A, Wu ZS (2018) Preventing fairness gerrymandering: Auditing and
learning for subgroup fairness. In: International Conference on Machine Learning, Proceedings
of Machine Learning Research, vol 1711.05144, pp 2564–2572
Keffer R (1929) An experience rating formula. Trans Actuarial Soc Am 30:130–139
Keita SOY, Kittles RA, Royal CD, Bonney GE, Furbert-Harris P, Dunston GM, Rotimi CN (2004)
Conceptualizing human variation. Nature Genet 36(Suppl 11):S17–S20
Kekes J (1995) The injustice of affirmative action involving preferential treatment. In: Cahn S (ed)
The Affirmative Action Debate, Routledge, pp 293–304
Kelly H (2021) A priest’s phone location data outed his private life. it could happen to anyone. The
Washington Post 22-07-2021
Kelly M, Nielson N (2006) Age as a variable in insurance pricing and risk classification. Geneva
Papers Risk Insurance Issues Pract 31(2):212–232
Keyfitz K, Flieger W, et al. (1968) World population: an analysis of vital data. The University of
Chicago Press, Chicago
Keys A, Fidanza F, Karvonen MJ, Kimura N, Taylor HL (1972) Indices of relative weight and
obesity. J Chronic Dis 25(6–7):329–343
Kiiveri H, Speed T (1982) Structural analysis of multivariate data: A review. Sociological Methodol
13:209–289
Kilbertus N, Rojas-Carulla M, Parascandolo G, Hardt M, Janzing D, Schölkopf B (2017) Avoiding
discrimination through causal reasoning. arXiv 1706.02744
Kranzberg M (1986) Technology and history: “Kranzberg’s laws”. Technol Culture 27(3):544–560
Kranzberg M (1995) Technology and history: “Kranzberg’s laws”. Bull Sci Technol Soc 15(1):5–
13
Krasanakis E, Spyromitros-Xioufis E, Papadopoulos S, Kompatsiaris Y (2018) Adaptive sensitive
reweighting to mitigate bias in fairness-aware classification. In: Proceedings of the 2018 World
Wide Web Conference, pp 853–862
Kremer E (1982) IBNR-claims and the two-way model of ANOVA. Scandinavian Actuarial J
1982(1):47–55
Krikler S, Dolberger D, Eckel J (2004) Method and tools for insurance price and revenue
optimisation. J Financ Serv Market 9(1):68–79
Krippner GR (2023) Unmasked: A history of the individualization of risk. Sociol Theory,
07352751231169012
Kroll JA, Huey J, Barocas S, Felten EW, Reidenberg JR, Robinson DG, Yu H (2017) Accountable
algorithms. Univ Pennsylvania Law Rev 165:633–705
Krüger F, Ziegel JF (2021) Generic conditions for forecast dominance. J Bus Econ Stat 39(4):972–
983
Krzanowski WJ, Hand DJ (2009) ROC curves for continuous data. CRC Press, Boca Raton
Kuczmarski J (2018) Reducing gender bias in Google Translate. Keyword 6:2018
Kudryavtsev AA (2009) Using quantile regression for rate-making. Insurance Math Econ
45(2):296–304
Kuhn M, Johnson K, et al. (2013) Applied predictive modeling, vol 26. Springer, New York
Kuhn T (2020) Root insurance commits to eliminate bias from its car insurance rates. Business
Wire August 6
Kull M, Flach P (2015) Novel decompositions of proper scoring rules for classification: Score
adjustment as precursor to calibration. In: Machine Learning and Knowledge Discovery in
Databases: European Conference, ECML PKDD 2015, Porto, Portugal, September 7–11, 2015,
Proceedings, Part I 15, pp 68–85. Springer, New York
Kullback S (2004) Minimum discrimination information (MDI) estimation. Encyclopedia Stat
Sci 7:4821–4823
Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22(1):79–86
Kusner MJ, Loftus J, Russell C, Silva R (2017) Counterfactual fairness. Adv Neural Inf Process
Syst 30:4067–4077
de La Fontaine J (1668) Fables. Barbin
Laffont JJ, Martimort D (2002) The theory of incentives: the principal-agent model, pp 273–306.
Princeton University Press, Princeton
Lahoti P, Gummadi KP, Weikum G (2019) Operationalizing individual fairness with pairwise fair
representations. arXiv 1907.01439
Lambert D (1992) Zero-inflated Poisson regression, with an application to defects in manufactur-
ing. Technometrics 34(1):1–14
Lamont M, Molnár V (2002) The study of boundaries in the social sciences. Annu Rev Sociol,
167–195
Lancaster R, Ward R (2002) The contribution of individual factors to driving behaviour: Implica-
tions for managing work-related road safety. HM Stationery Office
Landes X (2015) How fair is actuarial fairness? J Bus Ethics 128(3):519–533
Langford J, Schapire R (2005) Tutorial on practical prediction theory for classification. J Mach
Learn Res 6(3):273–306
LaPar DJ, Bhamidipati CM, Mery CM, Stukenborg GJ, Jones DR, Schirmer BD, Kron IL, Ailawadi
G (2010) Primary payer status affects mortality for major surgical operations. Ann Surg
252(3):544
Laplace PS (1774) Mémoire sur la probabilité des causes par les événements. Mémoire de
l’académie royale des sciences
de Lara L (2023) Counterfactual models for fair and explainable machine learning: A mass
transportation approach. PhD thesis, Institut de Mathématiques de Toulouse
Loftus JR, Russell C, Kusner MJ, Silva R (2018) Causal reasoning for algorithmic fairness. arXiv
1805.05859
Loi M, Christen M (2021) Choosing how to discriminate: navigating ethical trade-offs in fair
algorithmic design for the insurance sector. Philos Technol, 1–26
Lombroso C (1876) L’uomo delinquente. Hoepli
Loos RJ, Yeo GS (2022) The genetics of obesity: from discovery to biology. Nature Rev Genet
23(2):120–133
L’Oréal (2022) A new geography of skin color. https://www.loreal.com/en/articles/science-and-
technology/expert-inskin/
Lorenz MO (1905) Methods of measuring the concentration of wealth. Publ Am Stat Assoc
9(70):209–219
Lovejoy B (2021) LinkedIn breach reportedly exposes data of 92% of users, including inferred
salaries. 9to5mac 06/29
Lundberg SM, Lee SI (2017) A unified approach to interpreting model predictions. In: Guyon I,
Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Proceedings
of the 31st International Conference on Neural Information Processing Systems, vol 30, pp
4768–4777. Curran Associates, Inc.
Luong BT, Ruggieri S, Turini F (2011) k-NN as an implementation of situation testing for discrim-
ination discovery and prevention. In: Proceedings of the 17th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, pp 502–510
Lutton L, Fan A, Loury A (2020) Where banks don’t lend. WBEZ
MacIntyre AC (1969) Hume on ‘is’ and ‘ought’. In: The is-ought question, pp 35–50. Springer,
New York
MacKay DJ (1992) A practical Bayesian framework for backpropagation networks. Neural
Comput 4(3):448–472
Macnicol J (2006) Age discrimination: An historical and contemporary analysis. Cambridge
University Press, Cambridge
Maedche A (2020) Gender bias in chatbot design. Chatbot Research and Design, p 79
Mallasto A, Feragen A (2017) Learning from uncertain curves: The 2-Wasserstein metric for
Gaussian processes. Advances in Neural Information Processing Systems 30
Mallon R (2006) ‘race’: normative, not metaphysical or semantic. Ethics 116(3):525–551
Mallows CL (1972) A note on asymptotic joint normality. Ann Math Stat, 508–515
Mangel M, Samaniego FJ (1984) Abraham Wald's work on aircraft survivability. J Am Stat Assoc
79(386):259–267
Mantelero A (2013) The EU proposal for a general data protection regulation and the roots of the
‘right to be forgotten’. Comput Law Secur Rev 29(3):229–235
Marshall A (1890) General relations of demand, supply, and value. Principles of economics:
unabridged eighth edition
Marshall A (2021) AI comes to car repair, and body shop owners aren't happy. Wired April 13
Marshall GA (1993) Racial classifications: popular and scientific. In: Harding S (ed) The “racial”
economy of science: Toward a democratic future, pp 116–125. Indiana University Press,
Bloomington, IN
Martin GD (1977) Gender discrimination in pension plans: Author’s reply. J Risk Insurance
44(1):145–149
Mary J, Calauzènes C, El Karoui N (2019) Fairness-aware learning for continuous attributes and
treatments. In: Chaudhuri K, Salakhutdinov R (eds) Proceedings of the 36th International
Conference on Machine Learning, Proceedings of Machine Learning Research, Proceedings
of Machine Learning Research, vol 97, pp 4382–4391
Mas L (2020) A Confederate flag spotted in the window of police barracks in Paris. France 24 10/07
Massey DS (2007) Categorically unequal: The American stratification system. Russell Sage
Foundation
Matthews BW (1975) Comparison of the predicted and observed secondary structure of T4 phage
lysozyme. Biochimica et Biophysica Acta (BBA)-Protein Structure 405(2):442–451
Mayer J, Mutchler P, Mitchell JC (2016) Evaluating the privacy properties of telephone metadata.
Proc Natl Acad Sci 113(20):5536–5541
Maynard A (1979) Pricing, insurance and the National Health Service. J Soc Pol 8(2):157–176
Mayr E (1982) The growth of biological thought: Diversity, evolution, and inheritance. Harvard
University Press, Harvard
Mazieres A, Roth C (2018) Large-scale diversity estimation through surname origin inference.
Bull Sociol Methodol/Bull de Méthodologie Sociologique 139(1):59–73
Mbungo R (2014) L’approche juridique internationale du phénomène de discrimination fondée sur
le motif des antécédents judiciaires. Revue québécoise de droit international 27(2):59–97
McCaffrey DF, Ridgeway G, Morral AR (2004) Propensity score estimation with boosted
regression for evaluating causal effects in observational studies. Psychol Methods 9(4):403
McClenahan CL (2006) Ratemaking. In: Foundations of casualty actuarial science. Casualty
Actuarial Society
McCullagh P, Nelder J (1989) Generalized linear models. Chapman & Hall
McCulloch CE, Searle SR (2004) Generalized, linear, and mixed models. Wiley, New York
McCulloch WS, Pitts W (1943) A logical calculus of the ideas immanent in nervous activity. Bull
Math Biophys 5:115–133
McDonnell M, Baxter D (2019) Chatbots and gender stereotyping. Interact Comput 31(2):116–121
McFall L (2019) Personalizing solidarity? The role of self-tracking in health insurance pricing.
Econ Soc 48(1):52–76
McFall L, Meyers G, Hoyweghen IV (2020) Editorial: The personalisation of insurance: Data,
behaviour and innovation. Big Data Soc 7(2):1–11
McKinley R (2014) A history of British surnames. Routledge, Abingdon
McKinsey (2017) Technology, jobs and the future of work. McKinsey Global Institute
McPherson M, Smith-Lovin L, Cook JM (2001) Birds of a feather: Homophily in social networks.
Annu Rev Sociol 27(1):415–444
Meilijson I (2006) Risk aversion. Encyclopedia of Actuarial Science
Meinshausen N, Ridgeway G (2006) Quantile regression forests. J Mach Learn Res 7(6):983–999
de Melo-Martín I (2003) When is biology destiny? biological determinism and social responsibil-
ity. Philos Sci 70(5):1184–1194
Memmi A (2000) Racism. University of Minnesota Press, Minnesota
Menezes CF, Hanson DL (1970) On the theory of risk aversion. Int Econ Rev, 481–487
Mercat-Bruns M (2016) Discrimination at work. University of California Press, California
Mercat-Bruns M (2020) Les rapports entre vieillissement et discrimination en droit: une fertilisa-
tion croisée utile sur le plan individuel et collectif. La Revue des Droits de l’Homme 17
Merchant C (1980) The death of nature: Women, ecology, and the scientific revolution. HarperCollins
Merriam-Webster (2022) Dictionary. Merriam-Webster
Merrill D (2012) New credit scores in a new world: Serving the underbanked. TEDxNewWallStreet
Messenger R, Mandell L (1972) A modal search technique for predictive nominal scale multivari-
ate analysis. J Am Stat Assoc 67(340):768–772
Meuleners LB, Harding A, Lee AH, Legge M (2006) Fragility and crash over-representation among
older drivers in Western Australia. Accident Anal Prevent 38(5):1006–1010
Meyers G, Van Hoyweghen I (2018) Enacting actuarial fairness in insurance: From fair discrimi-
nation to behaviour-based fairness. Sci Culture 27(4):413–438
Michelbacher G (1926) ‘moral hazard’ in the field of casualty insurance. Proc Casualty Actuar Soc
13(27):448–471
Michelson S, Blattenberger G (1984) Reverse regression and employment discrimination. J Bus
Econ Stat 2(2):121–122
Milanković M (1920) Théorie mathématique des phénomènes thermiques produits par la radiation
solaire. Gauthier-Villars, Paris
Miller G, Gerstein DR (1983) The life expectancy of nonsmoking men and women. Public Health
Rep 98(4):343
Miller H (2015a) A discussion on credibility and penalised regression, with implications for
actuarial work. Actuaries Institute
Miller MJ, Smith RA, Southwood KN (2003) The relationship of credit-based insurance scores to
private passenger automobile insurance loss propensity. Actuarial Study, Epic Actuaries
Miller T (2015b) Price optimization. Commonwealth of Pennsylvania, Insurance Department
August 25
Mills CW (2017) Black rights/white wrongs: The critique of racial liberalism. Oxford University
Press, Oxford
Milmo D (2021) Working of algorithms used in government decision-making to be revealed. The
Guardian November 29
Milne J (1815) A Treatise on the Valuation of Annuities and Assurances on Lives and Survivor-
ships: On the Construction of Tables of Mortality and on the Probabilities and Expectations of
Life, vol 2. Longman, Hurst, Rees, Orme, and Brown
Minsky M, Papert S (1969) An introduction to computational geometry. MIT Press, Cambridge, MA
Miracle JM (2016) De-anonymization attack anatomy and analysis of Ohio nursing workforce data
anonymization. PhD thesis, Wright State University
Mittelstadt BD, Allo P, Taddeo M, Wachter S, Floridi L (2016) The ethics of algorithms: Mapping
the debate. Big Data Soc 3(2):2053951716679679
Mittra J (2007) Predictive genetic information and access to life assurance: The poverty of ‘genetic
exceptionalism’. BioSocieties 2(3):349–373
Mollat M, du Jourdin MM (1986) The poor in the Middle Ages: an essay in social history. Yale
University Press, Yale
Molnar C (2023) A guide for making black box models explainable. https://christophm.github.io/
interpretable-ml-book
Molnar C, Casalicchio G, Bischl B (2018) iml: An r package for interpretable machine learning. J
Open Source Soft 3(26):786
Monnet J (2017) Discrimination et assurance. Journal de Droit de la Santé et de l’Assurance
Maladie 16:13–19
Moodie EE, Stephens DA (2022) Causal inference: Critical developments, past and future. Canad
J Stat 50(4):1299–1320
Moon R (2014) From gorgeous to grumpy: adjectives, age and gender. Gender Lang 8(1):5–41
Moor L, Lury C (2018) Price and the person: Markets, discrimination, and personhood. J Cultural
Econ 11(6):501–513
Morel P, Stalk G, Stanger P, Wetenhall P (2003) Pricing myopia. The Boston Consulting Group
Perspectives
Morris DS, Schwarcz D, Teitelbaum JC (2017) Do credit-based insurance scores proxy for income
in predicting auto claim risk? J Empir Legal Stud 14(2):397–423
Morrison EJ (1996) Insurance discrimination against battered women: Proposed legislative
protections. Ind LJ 72:259
Moulin H (1992) An application of the shapley value to fair division with money. Econometrica,
1331–1349
Moulin H (2004) Fair division and collective welfare. MIT Press, Cambridge, MA
Mowbray A (1921) Classification of risks as the basis of insurance rate making with special
reference to workmen’s compensation. Proceedings of the Casualty Actuarial Society
Müller R, Kornblith S, Hinton GE (2019) When does label smoothing help? Adv Neural Inf Process
Syst 32:4694–4703
Mundubeltz-Gendron S (2019) Comment l’intelligence artificielle va bouleverser le monde du
travail dans l’assurance. L’Argus de l’Assurance 10/04
Murphy AH (1973) A new vector partition of the probability score. J Appl Meteor Climatol
12(4):595–600
Murphy AH (1996) General decompositions of mse-based skill scores: Measures of some basic
aspects of forecast quality. Month Weather Rev 124(10):2353–2369
Murphy AH, Winkler RL (1987) A general framework for forecast verification. Month Weather
Rev 115(7):1330–1338
Must A, Spadano J, Coakley EH, Field AE, Colditz G, Dietz WH (1999) The disease burden
associated with overweight and obesity. J Am Med Assoc 282(16):1523–1529
Myers RJ (1977) Gender discrimination in pension plans: Further comment. J Risk Insurance
44(1):144–145
Nadaraya EA (1964) On estimating regression. Theory Probab Appl 9(1):141–142
Nakashima R (2018) Google tracks your movements, like it or not. Associated Press August 14
Nassif H, Kuusisto F, Burnside ES, Page D, Shavlik J, Santos Costa V (2013) Score as you lift
(SAYL): A statistical relational learning approach to uplift modeling. In: Machine Learning and
Knowledge Discovery in Databases: European Conference, ECML PKDD 2013, Prague, Czech
Republic, September 23–27, 2013, Proceedings, Part III 13, pp 595–611. Springer, New York
Nathan EB (1925) Analysed mortality: the English Life No. 8a tables. Trans Faculty Actuaries
10:45–124
National Association of Insurance Commissioners (2011) A consumer’s guide to auto insurance.
NAIC Reports
National Association of Insurance Commissioners (2022) A consumer’s guide to auto insurance.
NAIC Reports
Natowicz MR, Alper JK, Alper JS (1992) Genetic discrimination and the law. Am J Human Genet
50(3):465
Neal RM (1992) Bayesian training of backpropagation networks by the hybrid monte carlo method.
Tech. rep., Citeseer
Neal RM (2012) Bayesian learning for neural networks, vol 118. Springer Science & Business
Media, New York
Neddermeyer JC (2009) Computationally efficient nonparametric importance sampling. J Am Stat
Assoc 104(486):788–802
Nelson A (2002) Unequal treatment: confronting racial and ethnic disparities in health care. J Natl
Med Assoc 94(8):666
Neyman J, Dabrowska DM, Speed T (1923) On the application of probability theory to agricultural
experiments. Essay on principles, section 9. Stat Sci, 465–472
Niculescu-Mizil A, Caruana R (2005a) Predicting good probabilities with supervised learning. In:
Proceedings of the 22nd International Conference on Machine Learning, pp 625–632
Niculescu-Mizil A, Caruana RA (2005b) Obtaining calibrated probabilities from boosting. In:
Proc. 21st Conference on Uncertainty in Artificial Intelligence (UAI’05), pp 413–420. AUAI
Press, Arlington, VA
Nielsen F (2013) Jeffreys centroids: A closed-form expression for positive histograms and a
guaranteed tight approximation for frequency histograms. IEEE Signal Process Lett 20(7):657–
660
Nielsen F, Boltz S (2011) The burbea-rao and bhattacharyya centroids. IEEE Trans Inf Theory
57(8):5455–5466
Nielsen F, Nock R (2009) Sided and symmetrized bregman centroids. IEEE Trans Inf Theory
55(6):2882–2904
Noguéro D (2010) Sélection des risques. discrimination, assurance et protection des personnes
vulnérables. Revue générale du droit des assurances 3:633–663
Nordholm LA (1980) Beautiful patients are good patients: evidence for the physical attractiveness
stereotype in first impressions of patients. Soc Sci Med Part A Med Psychol Med Sociol
14(1):81–83
Norman P (2003) Statistical discrimination and efficiency. Rev Econ Stud 70(3):615–627
Nuruzzaman M, Hussain OK (2020) Intellibot: A dialogue-based chatbot for the insurance
industry. Knowl Based Syst 196:105810
Oaxaca R (1973) Male-female wage differentials in urban labor markets. Int Econ Rev 14(3):693–
709
Oaxaca RL, Ransom MR (1994) On discrimination and the decomposition of wage differentials. J
Economet 61(1):5–21
Petersen A, Müller HG (2019) Fréchet regression for random objects with Euclidean predictors.
Ann Stat 47(2):691–719
Petersen F, Mukherjee D, Sun Y, Yurochkin M (2021) Post-processing for individual fairness. Adv
Neural Inf Process Syst 34:25944–25955
Petit P, Duguet E, L’Horty Y (2015) Discrimination résidentielle et origine ethnique: une étude
expérimentale sur les serveurs en Île-de-France. Economie Prevision (1):55–69
Pfanzagl P (1979) Conditional distributions as derivatives. Ann Probab 7(6):1046–1050
Pfeffermann D (1993) The role of sampling weights when modeling survey data. International
Statistical Review/Revue Internationale de Statistique, pp 317–337
Phelps ES (1972) The statistical theory of racism and sexism. Am Econ Rev 62(4):659–661
Phelps JT (1895) Life insurance sayings. Riverside Press
Picard P (2003) Les frontières de l’assurabilité. Risques 54:65–66
Pichard M (2006) Les droits à: étude de législation française. Economica
Pisu M, Azuero A, McNees P, Burkhardt J, Benz R, Meneses K (2010) The out of pocket cost of
breast cancer survivors: a review. J Cancer Survivorship 4(3):202–209
Plakans A, Wetherell C (2000) Patrilines, surnames, and family identity: A case study from the
Russian Baltic provinces in the nineteenth century. Hist Family 5(2):199–214
Platt J, et al. (1999) Probabilistic outputs for support vector machines and comparisons to
regularized likelihood methods. Adv Large Margin Classifiers 10(3):61–74
Plečko D, Bennett N, Meinshausen N (2021) fairadapt: Causal reasoning for fair data pre-
processing. arXiv 2110.10200
Pleiss G, Raghavan M, Wu F, Kleinberg J, Weinberger KQ (2017) On fairness and calibration.
arXiv 1709.02012
Pohle MO (2020) The Murphy decomposition and the calibration-resolution principle: A new
perspective on forecast evaluation. arXiv 2005.01835
Pojman LP (1998) The case against affirmative action. Int J Appl Philos 12(1):97–115
Poku M (2016) Campbell’s law: implications for health care. J Health Serv Res Policy 21(2):137–
139
Pope DG, Sydnor JR (2011) Implementing anti-discrimination policies in statistical profiling
models. Am Econ J Econ Policy 3(3):206–31
Porrini D, Fusco G (2020) Less discrimination, more gender inequality: The case of the Italian
motor-vehicle insurance. J Risk Manag Insurance 24(1):1–11
Porter TM (2020) Trust in numbers. Princeton University Press, Princeton
Powell L (2020) Risk-based pricing of property and liability insurance. J Insurance Regulat 1:1–23
Pradier PC (2011) (petite) histoire de la discrimination (dans les assurances). Risques 87:51–57
Pradier PC (2012) Les bénéfices terrestres de la charité. les rentes viagères des hôpitaux parisiens,
1660–1690. Histoire Mesure 26(XXVI-2):31–76
Prince AE, Schwarcz D (2019) Proxy discrimination in the age of artificial intelligence and big
data. Iowa Law Rev 105:1257
Proschan MA, Presnell B (1998) Expect the unexpected from conditional expectation. Am Stat
52(3):248–252
Puddifoot K (2021) How stereotypes deceive us. Oxford University Press, Oxford
Puzzo DA (1964) Racism and the western tradition. J Hist Ideas 25(4):579–586
Quinlan JR (1986) Induction of decision trees. Mach Learn 1:81–106
Quinlan JR (1987) Simplifying decision trees. Int J Man Mach Stud 27(3):221–234
Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann, San Francisco, CA
Quinzii M, Rochet JC (1985) Multidimensional signalling. J Math Econ 14(3):261–284
Racine J, Rilstone P (1995) The reverse regression problem: statistical paradox or artefact of
misspecification? Canad J Econ, 502–531
Radcliffe N (2007) Using control groups to target on predicted lift: Building and assessing uplift
model. Direct Market Anal J, 14–21
Radcliffe N, Surry P (1999) Differential response analysis: Modeling true responses by isolating
the effect of a single action. Credit Scoring and Credit Control IV
Raftery AE, Madigan D, Hoeting JA (1997) Bayesian model averaging for linear regression
models. J Am Stat Assoc 92(437):179–191
Ransom RL, Sutch R (1987) Tontine insurance and the Armstrong investigation: a case of stifled
innovation, 1868–1905. J Econ Hist 47(2):379–390
Rattani A, Reddy N, Derakhshani R (2017) Gender prediction from mobile ocular images: A
feasibility study. In: IEEE International Symposium on Technologies for Homeland Security
(HST), pp 1–6. IEEE
Rattani A, Reddy N, Derakhshani R (2018) Convolutional neural networks for gender prediction
from smartphone-based ocular images. IET Biometrics 7(5):423–430
Rawls J (1971) A theory of justice: Revised edition. Harvard University Press, Harvard
Rawls J (2001) Justice as fairness: A restatement. Harvard University Press, Harvard
Rebert L, Van Hoyweghen I (2015) The right to underwrite gender: The goods & services directive
and the politics of insurance pricing. Tijdschrift Voor Genderstudies 18(4):413–431
Reichenbach H (1956) The direction of time. University of California Press, Berkeley
Reijns T, Weurding R, Schaffers J (2021) Ethical artificial intelligence – the dutch insurance
industry makes it a mandate. KPMG Insights 03/2021
Reimers CW (1983) Labor market discrimination against Hispanic and Black men. Rev Econ Stat,
570–579
Reinsel GC (2003) Elements of multivariate time series analysis. Springer, New York
Rényi A (1959) On measures of dependence. Acta mathematica hungarica 10(3–4):441–451
Rescher N (2013) How wide is the gap between facts and values? In: Studies in Value Theory, pp
25–52. De Gruyter
Resnick S (2019) A probability path. Springer, New York
Rhynhart R (2020) Mapping the legacy of structural racism in Philadelphia. Office of the
Controller, Philadelphia
Riach PA, Rich J (1991) Measuring discrimination by direct experimental methods: Seeking
gunsmoke. J Post Keynesian Econ 14(2):143–150
Ribeiro MT, Singh S, Guestrin C (2016) “Why should I trust you?” Explaining the predictions of any
classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pp 1135–1144
Ridgeway CL (2011) Framed by gender: How gender inequality persists in the modern world.
Oxford University Press, Oxford
Rifkin R, Klautau A (2004) In defense of one-vs-all classification. J Mach Learn Res 5:101–141
Riley JG (1975) Competitive signalling. J Econ Theory 10(2):174–186
Rink FT (1805) Ansichten aus Immanuel Kant’s Leben. Göbbels und Unzer
Rivera LA (2020) Employer decision making. Annu Rev Sociol 46:215–232
Robbins LA (2015) The pernicious problem of ageism. Generations 39(3):6–9
Robertson T, Wright FT, Dykstra R (1988) Order restricted statistical inference. Wiley, New York
Robinson PM (1988) Root-n-consistent semiparametric regression. Econometrica, 931–954
Robinson WS (1950) Ecological correlations and the behavior of individuals. Am Sociol Rev
15(3):351–357
Robnik-Šikonja M, Kononenko I (1997) An adaptation of Relief for attribute estimation in
regression. In: Machine Learning: Proceedings of the Fourteenth International Conference
(ICML’97), vol 5, pp 296–304
Robnik-Šikonja M, Kononenko I (2003) Theoretical and empirical analysis of ReliefF and RReliefF.
Mach Learn 53(1):23–69
Robnik-Šikonja M, Kononenko I (2008) Explaining classifications for individual instances. IEEE
Trans Knowl Data Eng 20(5):589–600
Rodríguez Cardona D, Janssen A, Guhr N, Breitner MH, Milde J (2021) A matter of trust?
examination of chatbot usage in insurance business. In: Proceedings of the 54th Hawaii
International Conference on System Sciences, p 556
Rodríguez-Cuenca B, Alonso MC (2014) Semi-automatic detection of swimming pools from aerial
high-resolution images and lidar data. Remote Sens 6(4):2628–2646
Roemer JE (1996) Theories of distributive justice. Harvard University Press, Harvard
Saks S (1937) Theory of the integral. Instytut Matematyczny Polskiej Akademi Nauk, Warszawa-
Lwów
Sakurada M, Yairi T (2014) Anomaly detection using autoencoders with nonlinear dimensionality
reduction. In: Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for
Sensory Data Analysis, pp 4–11
Salimi B, Howe B, Suciu D (2020) Database repair meets algorithmic fairness. ACM SIGMOD
Record 49(1):34–41
Saltelli A, Ratto M, Andres T, Campolongo F, Cariboni J, Gatelli D, Saisana M, Tarantola S (2008)
Global sensitivity analysis: the primer. Wiley, New York
Samadi S, Tantipongpipat U, Morgenstern JH, Singh M, Vempala S (2018) The price of fair pca:
One extra dimension. Adv Neural Inf Process Syst 31:3992–4001
Sanche F, Roberge I (2023) La question de la semaine sur le casier judiciaire et les assurances.
Radio Canada January 31
Sandel MJ (2020) The tyranny of merit: What’s become of the common good? Penguin, UK
Santambrogio F (2015) Optimal transport for applied mathematicians. Birkhäuser, New York
Santosa F, Symes WW (1986) Linear inversion of band-limited reflection seismograms. SIAM J
Scientific Statist Comput 7(4):1307–1330
Sarmanov O (1963) Maximum correlation coefficient (nonsymmetric case). Sel Transl Math Stat
Probab 2:207–210
Schanze E (2013) Injustice by generalization: notes on the Test-Achats decision of the European
Court of Justice. German Law J 14(2):423–433
Schapire RE (1990) The strength of weak learnability. Mach Learn 5:197–227
Schapire RE (2013) Explaining adaboost. Empirical Inference: Festschrift in Honor of Vladimir N
Vapnik, pp 37–52
Schauer F (2006) Profiles, probabilities, and stereotypes. Harvard University Press, Harvard
Schauer F (2017) Statistical (and non-statistical) discrimination. In: Lippert-Rasmussen K (ed)
Handbook of the ethics of discrimination, pp 42–53. Routledge
Schilling E (2006) Accuracy and precision. Encyclopedia of Statistical Sciences, p 25
Schlesinger A, O'Hara KP, Taylor AS (2018) Let's talk about race: Identity, chatbots, and AI. In:
Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp 1–14
Schmeiser H, Störmer T, Wagner J (2014) Unisex insurance pricing: consumers’ perception and
market implications. Geneva Papers Risk Insurance Issues Pract 39(2):322–350
Schmidt KD (2006) Prediction. Encyclopedia of Actuarial Science
Schneier B (2015) Data and Goliath: The hidden battles to collect your data and control your world.
WW Norton & Company, New York
Schweik SM (2009) The ugly laws. New York University Press, New York
Scikit Learn (2017) Probability calibration. https://scikit-learn.org/stable/modules/calibration.html
Scism L (2019) New york insurers can evaluate your social media use–if they can prove why it’s
needed. The Wall Street Journal January 30
Scism L, Maremont M (2010a) Inside Deloitte's life-insurance assessment technology. Wall Street
Journal November 19
Scism L, Maremont M (2010b) Insurers test data profiles to identify risky clients. Wall Street
Journal November 19
Scutari M, Panero F, Proissl M (2022) Achieving fairness with a simple ridge penalty. Stat Comput
32(5):77
Seelye KQ (1994) Insurability for battered women. New York Times May 12
Segall S (2013) Equality and opportunity. Oxford University Press, Oxford
Seicshnaydre SE (2007) Is the road to disparate impact paved with good intentions: Stuck on state
of mind in antidiscrimination law. Wake Forest L Rev 42:1141
Selbst AD, Barocas S (2018) The intuitive appeal of explainable machines. Fordham Law Rev
87:1085
Seligman D (1983) Insurance and the price of sex. Fortune February 21st
Seresinhe CI, Preis T, Moat HS (2017) Using deep learning to quantify the beauty of outdoor
places. Roy Soc Open Sci 4(7):170170
Shadish WR, Luellen JK (2005) Quasi-experimental designs. Encyclopedia of Statistics in
Behavioral Science
Shannon CE, Weaver W (1949) The mathematical theory of communication. University of Illinois
Press, Urbana, IL
Shapley LS (1953) A value for n-person games. In: Kuhn HW, Tucker AW (eds) Contributions to
the theory of games II, pp 307–317. Princeton University Press, Princeton
Shapley LS, Shubik M (1969) Pure competition, coalitional power, and fair division. Int Econ Rev
10(3):337–362
Shikhare S (2021) Next generation LTC - life insurance underwriting using facial score model. In:
Insurance Data Science Conference
Siddiqi N (2012) Credit risk scorecards: developing and implementing intelligent credit scoring,
vol 3. Wiley, New York
Silver N (2012) The signal and the noise: Why so many predictions fail-but some don’t. Penguin,
Harmondsworth
Simon J (1987) The emergence of a risk society-insurance, law, and the state. Socialist Rev 95:60–
89
Simon J (1988) The ideological effects of actuarial practices. Law Soc Rev 22:771
Singer P (2011) Practical ethics. Cambridge University Press, Cambridge
Slovic P (1987) Perception of risk. Science 236(4799):280–285
Small ML, Pager D (2020) Sociological perspectives on racial discrimination. J Econ Perspect
34(2):49–67
Smith A (1759) The theory of moral sentiments. Penguin, Harmondsworth
Smith CS (2021) A.I. here, there, everywhere. New York Times (February 23)
Smith DJ (1977) Racial disadvantage in Britain: the PEP report. Penguin, Harmondsworth
Smith GC, Pell JP (2003) Parachute use to prevent death and major trauma related to gravitational
challenge: systematic review of randomised controlled trials. BMJ 327(7429):1459–1461
Smythe HH (1952) The Eta: a marginal Japanese caste. Am J Sociol 58(2):194–196
Sokolova M, Japkowicz N, Szpakowicz S (2006) Beyond accuracy, F-score and ROC: a family
of discriminant measures for performance evaluation. In: Australasian Joint Conference on
Artificial Intelligence, pp 1015–1021. Springer, New York
Sollich P, Krogh A (1995) Learning with ensembles: How overfitting can be useful. Adv Neural
Inf Process Syst 8:190–196
Solow RM (1957) Technical change and the aggregate production function. The Review of
Economics and Statistics, pp 312–320
Solution ICD (2020) How to increase credit score
Sorensen TA (1948) A method of establishing groups of equal amplitude in plant sociology based
on similarity of species content and its application to analyses of the vegetation on Danish
commons. Biol Skr 5:1–34
Spedicato GA, Dutang C, Petrini L (2018) Machine learning methods to perform pricing
optimization. A comparison with standard GLMs. Variance 12(1):69–89
Speicher T, Ali M, Venkatadri G, Ribeiro FN, Arvanitakis G, Benevenuto F, Gummadi KP, Loiseau
P, Mislove A (2018) Potential for discrimination in online targeted advertising. In: Conference
on Fairness, Accountability and Transparency, Proceedings of Machine Learning Research, pp
5–19
Spence M (1974) Competitive and optimal responses to signals: An analysis of efficiency and
distribution. J Econ Theory 7(3):296–332
Spence M (1976) Informational aspects of market structure: An introduction. Q J Econ, 591–597
Spender A, Bullen C, Altmann-Richer L, Cripps J, Duffy R, Falkous C, Farrell M, Horn T, Wigzell
J, Yeap W (2019) Wearables and the internet of things: Considerations for the life and health
insurance industry. Brit Actuarial J 24:e22
Spiegelhalter DJ, Dawid AP, Lauritzen SL, Cowell RG (1993) Bayesian analysis in expert systems.
Stat Sci, 219–247
Spirtes P, Glymour C, Scheines R (1993) Discovery algorithms for causally sufficient structures.
In: Causation, prediction, and search, pp 103–162. Springer, New York
Squires G (2011) Redlining to reinvestment. Temple University Press, Philadelphia, PA
Squires GD (2003) Racial profiling, insurance style: Insurance redlining and the uneven develop-
ment of metropolitan areas. J Urban Affairs 25(4):391–410
Squires GD, Chadwick J (2006) Linguistic profiling: A continuing tradition of discrimination in
the home insurance industry? Urban Affairs Rev 41(3):400–415
Squires GD, DeWolfe R (1981) Insurance redlining in minority communities. Rev Black Polit
Econ 11(3):347–364
Stark L, Stanhaus A, Anthony DL (2020) “I don't want someone to watch me while I'm working”:
Gendered views of facial recognition technology in workplace surveillance. J Assoc Inf Sci
Technol 71(9):1074–1088
Steensma C, Loukine L, Orpana H, Lo E, Choi B, Waters C, Martel S (2013) Comparing life
expectancy and health-adjusted life expectancy by body mass index category in adult Canadians:
a descriptive study. Populat Health Metrics 11(1):1–12
Stein A (1994) Will health care reform protect victims of abuse? Treating domestic violence as a
public health issue. Human Rights 21:16
Stenholm S, Head J, Aalto V, Kivimäki M, Kawachi I, Zins M, Goldberg M, Platts LG, Zaninotto
P, Hanson LM, et al. (2017) Body mass index as a predictor of healthy and disease-free life
expectancy between ages 50 and 75: a multicohort study. Int J Obesity 41(5):769–775
Stephan Y, Sutin AR, Terracciano A (2015) How old do you feel? The role of age discrimination
and biological aging in subjective age. PloS One 10(3):e0119293
Stevenson M (2018) Assessing risk assessment in action. Minnesota Law Rev 103:303
Steyerberg E, Eijkemans M, Habbema J (2001) Application of shrinkage techniques in logistic
regression analysis: a case study. Statistica Neerlandica 55(1):76–88
Stone DA (1993) The struggle for the soul of health insurance. J Health Polit Policy Law
18(2):287–317
Stone P (2007) Why lotteries are just. J Polit Philos 15(3):276–295
Štrumbelj E, Kononenko I (2010) An efficient explanation of individual classifications using game
theory. J Mach Learn Res 11:1–18
Štrumbelj E, Kononenko I (2014) Explaining prediction models and individual predictions with
feature contributions. Knowl Inf Syst 41:647–665
Struyck N (1740) Inleiding tot de algemeene geographie. Tirion 1740:231
Struyck N (1912) Les oeuvres de Nicolas Struyck (1687–1769): qui se rapportent au calcul des
chances, à la statistique générale, à la statistique des décès et aux rentes viagères. Société
générale néerlandaise d’assurances sur la vie et de rentes viagères
Stuart EA (2010) Matching methods for causal inference: A review and a look forward. Stat Sci
25(1):1
Suresh H, Guttag JV (2019) A framework for understanding sources of harm throughout the
machine learning life cycle. arXiv 1901.10002
Surowiecki J (2004) The wisdom of crowds: why the many are smarter than the few and how
collective wisdom shapes business, economies, societies and nations. Doubleday & Co, New
York
Sutton W (1874) On the method used by Dr. Price in the construction of the Northampton mortality
table. J Inst Actuaries 18(2):107–122
Swauger S (2021) The next normal: Algorithms will take over college, from admissions to
advising. Washington Post (November 12)
Sweeney L (2013) Discrimination in online ad delivery: Google ads, black names and white names,
racial discrimination, and click advertising. Queue 11(3):10–29
Szalavitz M (2017) Why do we think poor people are poor because of their own bad choices? The
Guardian July 5
Szepannek G, Lübke K (2021) Facing the challenges of developing fair risk scoring models. Front
Artif Intell 4:681915
Tajfel H, Turner JC, Worchel S, Austin WG, et al. (1986) Psychology of intergroup relations, pp
7–24. Nelson-Hall, Chicago
Tajfel HE (1978) Differentiation between social groups: Studies in the social psychology of
intergroup relations. Academic Press, Cambridge, MA
Tanaka K (2012) Surnames and gender in Japan: Women’s challenges in seeking own identity. J
Family Hist 37(2):232–240
Tang S, Zhang X, Cryan J, Metzger MJ, Zheng H, Zhao BY (2017) Gender bias in the job market:
A longitudinal analysis. Proc ACM Human Comput Interact 1(CSCW):1–19
Tasche D (2008) Validation of internal rating systems and PD estimates. In: The analytics of risk
model validation, pp 169–196. Elsevier
Taylor A, Sadowski J (2015) How companies turn your Facebook activity into a credit score. The
Nation May 27
Taylor S (2015) Price optimization ban. Government of the District of Columbia, Department of
Insurance August 25
Telles E (2014) Pigmentocracies: Ethnicity, race, and color in Latin America. UNC Press Books
Tharwat A (2021) Classification assessment methods. Appl Comput Inf 17(1):168–192
The Zebra (2022) Car insurance rating factors by state. https://www.thezebra.com/
Theobald CM (1974) Generalizations of mean square error applied to ridge regression. J Roy Stat
Soc B (Methodological) 36(1):103–106
Theodoridis S (2015) Machine learning: a Bayesian and optimization perspective. Academic Press,
Cambridge, MA
Thiery Y, Van Schoubroeck C (2006) Fairness and equality in insurance classification. Geneva
Papers Risk Insurance Issues Pract 31(2):190–211
Thomas G (2012) Non-risk price discrimination in insurance: market outcomes and public policy.
Geneva Papers Risk Insurance Issues Pract 37:27–46
Thomas G (2017) Loss coverage: Why insurance works better with some adverse selection.
Cambridge University Press, Cambridge
Thomas L, Crook J, Edelman D (2002) Credit scoring and its applications. SIAM
Thomas RG (2007) Some novel perspectives on risk classification. Geneva Papers Risk Insurance
Issues Pract 32(1):105–132
Thomson JJ (1976) Killing, letting die, and the trolley problem. Monist 59(2):204–217
Thomson W (1883) Electrical units of measurement. Popular Lect Addresses 1(73):73–136
Thornton SM, Pan S, Erlien SM, Gerdes JC (2016) Incorporating ethical considerations into
automated vehicle control. IEEE Trans Intell Transp Syst 18(6):1429–1439
Tian J, Pearl J (2002) A general identification condition for causal effects. In: Proceedings of the
Eighteenth National Conference on Artificial Intelligence, pp 567–573. MIT Press
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J Roy Stat Soc B
(Methodological) 58(1):267–288
Tilcsik A (2021) Statistical discrimination and the rationalization of stereotypes. Am Sociol Rev
86(1):93–122
Topkis DM (1998) Supermodularity and complementarity. Princeton University Press, Princeton
Torkamani A, Wineinger NE, Topol EJ (2018) The personal and clinical utility of polygenic risk
scores. Nature Rev Genet 19(9):581–590
Torous W, Gunsilius F, Rigollet P (2021) An optimal transport approach to causal inference. arXiv
2108.05858
Traag V, Waltman L (2022) Causal foundations of bias, disparity and fairness. arXiv 2207.13665
Tribalat M (2016) Statistiques ethniques, Une querelle bien française. L’Artilleur, Paris
Tsamados A, Aggarwal N, Cowls J, Morley J, Roberts H, Taddeo M, Floridi L (2021) The ethics
of algorithms: key problems and solutions. AI & Society, pp 1–16
Tsybakov AB (2009) Introduction to nonparametric estimation. Springer, New York
Tufekci Z (2018) Facebook’s surveillance machine. New York Times 19:1
Tukey JW (1961) Curves as parameters, and touch estimation. In: Neyman J (ed) Proceedings of
the 4th Berkeley Symposium on Mathematical Statistics and Probability, vol 1, pp 681–694.
University of California Press, California
Tuppat J, Gerhards J (2021) Immigrants’ first names and perceived discrimination: A contribution
to understanding the integration paradox. Eur Sociol Rev 37(1):121–135
Turner R (2015) The way to stop discrimination on the basis of race. Stanford J Civil Rights Civil
Liberties 11:45
Tweedie MCK (1984) An index which distinguishes between some important exponential families.
Statistics: applications and new directions (Calcutta, 1981), pp 579–604
Tzioumis K (2018) Demographic aspects of first names. Scientific Data 5(1):1–9
Uotinen V, Rantanen T, Suutama T (2005) Perceived age as a predictor of old age mortality: a
13-year prospective study. Age Ageing 34(4):368–372
Upton G, Cook I (2014) A dictionary of statistics, 3rd edn. Oxford University Press, Oxford
US Census (2012) Frequently occurring surnames from Census 2000, Census report data file A: Top
1000 names. Genealogy Data
Van Calster B, McLernon DJ, Van Smeden M, Wynants L, Steyerberg EW (2019) Calibration: the
Achilles heel of predictive analytics. BMC Med 17(1):1–7
Van Deemter K (2010) Not exactly: In praise of vagueness. Oxford University Press, Oxford
Van der Vaart AW (2000) Asymptotic statistics. Cambridge University Press, Cambridge
Van Gerven G (1993) Case C-109/91, Gerardus Cornelis Ten Oever v. Stichting bedrijfspensioen-
fonds voor het glazenwassers-en schoonmaakbedrijf. EUR-Lex 61991CC0109
Van Lancker W (2020) Automating the welfare state: Consequences and challenges for the
organisation of solidarity. In: Shifting solidarities, pp 153–173. Springer, New York
Van Parijs P (2002) Linguistic justice. Polit Philos Econ 1(1):59–74
Van Schaack D (1926) The part of the casualty insurance company in accident prevention. Ann
Am Acad Polit Soc Sci 123(1):36–40
Vandenhole W (2005) Non-discrimination and equality in the view of the UN human rights treaty
bodies. Intersentia nv
Varga TV, Kozodoi N (2021) fairness: algorithmic fairness R package. R Vignette
Vassy JL, Christensen KD, Schonman EF, Blout CL, Robinson JO, Krier JB, Diamond PM, Lebo
M, Machini K, Azzariti DR, et al. (2017) The impact of whole-genome sequencing on the
primary care and outcomes of healthy adult patients: a pilot randomized trial. Ann Int Med
167(3):159–169
Verboven K (2011) Introduction: Professional collegia: Guilds or social clubs? Ancient Society, pp
187–195
Verma S, Rubin J (2018) Fairness definitions explained. In: 2018 IEEE/ACM International
Workshop on Software Fairness (Fairware), pp 1–7. IEEE
Verrall R (1996) Claims reserving and generalised additive models. Insurance Math Econ
19(1):31–43
Viaene S, Dedene G, Derrig RA (2005) Auto claim fraud detection using Bayesian learning neural
networks. Expert Syst Appl 29(3):653–666
Vidoni P (2003) Prediction and calibration in generalized linear models. Ann Inst Stat Math
55:169–185
Villani C (2003) Topics in optimal transportation, vol 58. American Mathematical Society,
Providence, RI
Villani C (2009) Optimal transport: old and new, vol 338. Springer, New York
Villazor RC (2008) Blood quantum land laws and the race versus political identity dilemma.
California Law Rev 96:801
Viswanathan KS (2006) Demutualization. Encyclopedia of Actuarial Science
Vogel R, Bellet A, Clémençon S, et al. (2021) Learning fair scoring functions: Bipartite ranking
under ROC-based fairness constraints. In: International Conference on Artificial Intelligence and
Statistics, Proceedings of Machine Learning Research, pp 784–792
Voicu I (2018) Using first name information to improve race and ethnicity classification. Stat Public
Policy 5(1):1–13
Volkmer S (2015) Notice regarding unfair discrimination in rating: optimization. State of Califor-
nia, Department of Insurance February 18
von Mises R (1928) Wahrscheinlichkeit, Statistik und Wahrheit. Springer, New York
von Mises R (1939) Probability, statistics and truth. Macmillan, New York
Von Neumann J (1955) Collected works. Pergamon Press, Oxford, England
Von Neumann J, Morgenstern O (1953) Theory of games and economic behavior. Princeton
University Press, Princeton
Wachter S, Mittelstadt B (2019) A right to reasonable inferences: re-thinking data protection law
in the age of big data and AI. Columbia Bus Law Rev, 494
Wachter S, Mittelstadt B, Russell C (2017) Counterfactual explanations without opening the black
box: Automated decisions and the GDPR. Harvard J Law Technol 31:841
Wager S, Athey S (2018) Estimation and inference of heterogeneous treatment effects using
random forests. J Am Stat Assoc 113(523):1228–1242
Waldron H (2013) Mortality differentials by lifetime earnings decile: Implications for evaluations
of proposed social security law changes. Social Secur Bull 73:1
Wallis KF (2014) Revisiting Francis Galton’s forecasting competition. Stat Sci, 420–424
Wang DB, Feng L, Zhang ML (2021) Rethinking calibration of deep neural networks: Do not be
afraid of overconfidence. Adv Neural Inf Process Syst 34:11809–11820
Wang Y, Kosinski M (2018) Deep neural networks are more accurate than humans at detecting
sexual orientation from facial images. J Personality Soc Psychol 114(2):246
Wang Y, Yao H, Zhao S (2016) Auto-encoder based dimensionality reduction. Neurocomputing
184:232–242
Wasserman L (2000) Bayesian model selection and model averaging. J Math Psychol 44(1):92–107
Wasserstein LN (1969) Markov processes over denumerable products of spaces, describing large
systems of automata. Problemy Peredachi Informatsii 5(3):64–72
Watkins-Hayes C, Kovalsky E (2016) The discourse of deservingness. The Oxford Handbook of
the Social Science of Poverty 1
Watson DS, Gultchin L, Taly A, Floridi L (2021) Local explanations via necessity and sufficiency:
Unifying theory and practice. Uncertainty in Artificial Intelligence, pp 1382–1392
Watson GS (1964) Smooth regression analysis. Sankhyā Indian J Stat A, 359–372
Weber M (1904) Die protestantische ethik und der geist des kapitalismus. Archiv für Sozialwis-
senschaft und Sozialpolitik 20:1–54
Weed DL (2005) Weight of evidence: a review of concept and methods. Risk Anal Int J 25(6):1545–
1557
Weisberg HI, Tomberlin TJ (1982) A statistical perspective on actuarial methods for estimating
pure premiums from cross-classified data. J Risk Insurance, 539–563
Welsh AH, Cunningham RB, Donnelly CF, Lindenmayer DB (1996) Modelling the abundance of
rare species: statistical models for counts with extra zeros. Ecol Modell 88(1–3):297–308
Westreich D (2012) Berkson’s bias, selection bias, and missing data. Epidemiology 23(1):159
Wheatley M (2013) The fairness challenge. FCA Financial Conduct Authority Speeches October
10
White RW, Doraiswamy PM, Horvitz E (2018) Detecting neurodegenerative disorders from web
search signals. NPJ Digit Med 1(1):8
Wiehl DG (1960) Build and blood pressure. Society of Actuaries
van Wieringen WN (2015) Lecture notes on ridge regression. arXiv 1509.09169
Wiggins B (2013) Managing Risk, Managing Race: Racialized Actuarial Science in the United
States, 1881–1948. University of Minnesota PhD thesis
Wikipedia (2023) Data. Wikipedia, The Free Encyclopedia
Wilcox C (1937) Merit rating in state unemployment compensation laws. Am Econ Rev, 253–259
Wilkie D (1997) Mutuality and solidarity: assessing risks and sharing losses. Philos Trans Roy Soc
Lond B Biol Sci 352(1357):1039–1044
Williams BA, Brooks CF, Shmargad Y (2018) How algorithms discriminate based on data they
lack: Challenges, solutions, and policy implications. J Inf Policy 8:78–115
Williams G (2017) Discrimination and obesity. In: Lippert-Rasmussen K (ed) Handbook of the
ethics of discrimination, pp 264–275. Routledge
Williams JE, Bennett SM (1975) The definition of sex stereotypes via the adjective check list. Sex
Roles 1(4):
Willson K (2009) Name law and gender in Iceland. UCLA: Center for the Study of Women
Wilson EB, Worcester J (1943) The determination of LD50 and its sampling error in bio-assay. Proc
Natl Acad Sci 29(2):79–85
Wing-Heir L (2015) Price optimization in ratemaking. State of Alaska, Department of Commerce,
Community and Economic Development Bulletin B 15–12
Winter RA (2000) Optimal insurance under moral hazard. Handbook of insurance, pp 155–183
Winterfeldt D, Edwards W (1986) Decision analysis and behavioral research. Cambridge University Press, Cambridge
Witten IH, Frank E, Hall MA, Pal CJ (2016) Data mining: practical machine learning tools and
techniques. Morgan Kaufmann, Burlington, MA
Wod I (1985) Weight of evidence: A brief survey. Bayesian Stat 2:249–270
Wolff MJ (2006) The myth of the actuary: life insurance and Frederick L. Hoffman’s race traits
and tendencies of the American Negro. Public Health Rep 121(1):84–91
Wolffhechel K, Fagertun J, Jacobsen UP, Majewski W, Hemmingsen AS, Larsen CL, Lorentzen
SK, Jarmer H (2014) Interpretation of appearance: The effect of facial features on first
impressions and personality. PloS One 9(9):e107721
Wolpert DH (1992) Stacked generalization. Neural Networks 5(2):241–259
Wolthuis H (2004) Heterogeneity. Encyclopedia of Actuarial Science, pp 819–821
Woodhams C, Williams M, Dacre J, Parnerkar I, Sharma M (2021) Retrospective observational
study of ethnicity-gender pay gaps among hospital and community health service doctors in
England. BMJ Open 11(12):e051043
Wortham L (1985) Insurance classification: too important to be left to the actuaries. Univ Michigan
J Law Reform 19:349
Works R (1977) Whatever’s fair: adequacy, equity, and the underwriting prerogative in property
insurance markets. Nebraska Law Rev 56:445
Wortham L (1986) The economics of insurance classification: The sound of one invisible hand
clapping. Ohio State Law J 47:835
Wright S (1921) Correlation and causation. J Agricultural Res 20:557–585
Wu Y, Zhang L, Wu X, Tong H (2019) PC-fairness: A unified framework for measuring causality-
based fairness. Adv Neural Inf Process Syst 32:3404–3414
Wu Z, D’Oosterlinck K, Geiger A, Zur A, Potts C (2022) Causal proxy models for concept-based
model explanations. arXiv 2209.14279
Wüthrich MV, Merz M (2008) Stochastic claims reserving methods in insurance. Wiley, New York
Wüthrich MV, Merz M (2022) Statistical foundations of actuarial learning and its applications.
Springer Nature, New York
Yang TC, Chen VYJ, Shoff C, Matthews SA (2012) Using quantile regression to examine
the effects of inequality across the mortality distribution in the US counties. Soc Sci Med
74(12):1900–1910
Yao J (2016) Clustering in general insurance pricing. In: Frees E, Meyers G, Derrig R (eds)
Predictive modeling applications in actuarial science, pp 159–179. Cambridge University Press,
Cambridge
Yeung K (2018a) Algorithmic regulation: a critical interrogation. Regulation Governance
12(4):505–523
Yeung K (2018b) A study of the implications of advanced digital technologies (including AI
systems) for the concept of responsibility within a human rights framework. MSI-AUT (2018)
5
Yinger J (1998) Evidence on discrimination in consumer markets. J Econ Perspect 12(2):23–40
Yitzhaki S, Schechtman E (2013) The Gini methodology: a primer on a statistical methodology.
Springer, New York
Young IM (1990) Justice and the politics of difference. Princeton University Press, Princeton
Young RK, Kennedy AH, Newhouse A, Browne P, Thiessen D (1993) The effects of names on
perception of intelligence, popularity, and competence. J Appl Soc Psychol 23(21):1770–1788
Zack N (2014) Philosophy of science and race. Routledge
Zadrozny B, Elkan C (2001) Obtaining calibrated probability estimates from decision trees and
naive Bayesian classifiers. In: ICML, vol 1, pp 609–616
Zadrozny B, Elkan C (2002) Transforming classifier scores into accurate multiclass probability
estimates. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowl-
edge Discovery and Data Mining, pp 694–699
Zafar MB, Valera I, Rodriguez MG, Gummadi KP (2017) Fairness constraints: Mechanisms for
fair classification. arXiv 1507.05259:962–970
Zafar MB, Valera I, Gomez-Rodriguez M, Gummadi KP (2019) Fairness constraints: A flexible
approach for fair classification. J Mach Learn Res 20(1):2737–2778
Zafar SY, Abernethy AP (2013) Financial toxicity, part i: a new name for a growing problem.
Oncology 27(2):80
Zelizer VAR (2017) Morals and markets: The development of life insurance in the United States.
Columbia University Press, New York
Zelizer VAR (2018) Morals and markets. Columbia University Press, New York
Zenere A, Larsson EG, Altafini C (2022) Relating balance and conditional independence in
graphical models. Phys Rev E 106(4):044309
Zhang J, Bareinboim E (2018) Fairness in decision-making–the causal explanation formula. In:
Thirty-Second AAAI Conference on Artificial Intelligence
Zhang L, Wu Y, Wu X (2016) A causal framework for discovering and removing direct and indirect
discrimination. arXiv 1611.07509
Zhou ZH (2012) Ensemble methods: foundations and algorithms. CRC Press, Boca Raton
Žliobaite I (2015) On the relation between accuracy and fairness in binary classification. arXiv
1505.05723
Žliobaite I, Custers B (2016) Using sensitive personal data may be necessary for avoiding
discrimination in data-driven decision models. Artif Intell Law 24(2):183–201
Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J Roy Stat Soc B
(Stat Methodol) 67(2):301–320
Zuboff S (2019) The age of surveillance capitalism: The fight for a human future at the new frontier
of power. Public Affairs, New York
Index
Confusion matrix, 81
Consequentialism, 6
Contingency table, 81
Continuous, 68
Contribution
   Shapley, 136, 140
Conundrum, 123
Cooperation, 26
Counterfactual, 153, 369
Coverage, 28, 29
Credit, 201
Credit scoring, 222, 264
Criminal records, 191
Cross-financing, 42, 172

D
Data, 180
   big, 180, 244
   inferred, 188
   man made, 179
   personal, 181
   sensitive, 181
Decision, 240
Decision making, 240
Decision tree, 105
Deductible, 29
Demographic parity, 315, 322
Deviance, 93
Difference, 15
Directed acyclical graph (DAG), 284
Discrimination, v, 1
   direct, 5
   efficient, 10
   fair, v
   indirect, 5
   legitimate, 34
   by proxy, 244, 245
   rational, 10
   reverse, 15, 386
   statistical, 10
   unfair, 4
Diseases, 256
Disparate impact, 10
Disparate treatment, 10
Distance, 86
   Hellinger, 85
   Mahalanobis, 91
   total variation, 86
   Wasserstein, 88
Divergence, 86
   Jensen–Shannon, 88
   Kullback–Leibler, 86
Durkheim, Emile, 9

E
Econometrics, 34
Efficiency, 26
Egalitarian, 8
Elastic, 99
Elicitable, 79
Empirical risk, 80
Employment, 201
Entropy, 107
Equality, 15
   opportunity, 6
   treatment, 10
Equalized odds, 326
Equal opportunity, 6, 324
Equitable, 10
Ethics, 7
Europe, 184
Expected utility, 52
Explainability, 123, 276
Exponential family, 91

F
Fairness
   φ, 330
Fable, 23, 47
Facial, 256
Fair, 10
Fairness
   actuarial, 43, 276
   local individual, 359
   similarity, 358
Fat, 37
First name, 248
Fraud, 182
Friends, 273
Frisch–Waugh, 206

G
Gaussian, 87
Gender, 37, 201, 224
   fluid, 226
   non binary, 226
General Data Protection Regulation (GDPR), 181, 211
General insurance, 182
Generalization, 240, 241
Generalized additive models (GAM), 95
Generalized linear models (GLM), 91
   lasso, 99
   ridge, 98
Genetics, 43, 182
Geodesic, 254
M
Mahalanobis, 91, 358
Majority, 106
Majority rule, 111
Manhart, 241
Marital status, 202
Mean squared error, 80, 107
Merit, 6, 205, 264
Mileage, 29
Minimum bias, 74
Misclassification, 46
Mitigation, 15, 386
Mixed models, 97
Model, 63
   additive, 74
   bagging, 110, 112
   black box, 64
   boosting, 113
   ensemble, 110
   multiplicative, 74
   neural networks, 101
   random forest, 110
   trees, 105
Monge–Kantorovich, 367
Moral, 6, 7
Moral hazard, 32, 34
Moral machine, 8
Morgenstern, Oskar, 52
Murphy decomposition, 167
Mutatis mutandis, 124
Mutual aid, 66
Mutual information, 87
Mutuality, 25
Mutualization, 242

N
Names, 250
Network, 268, 270
Neural networks, 101
Neutral, v
Non-reconstruction, 337
Norm, 242
Normal, 56

O
Obesity, 37
Ockham, 63
Old, 232
Opaque, 64
OpenStreetMap, 259
Optimal
   coupling, 367
Orthogonalization, 389
Out-of-sample, 84
Overfit, 63, 83, 106, 115

P
Paradox
   Simpson, 209
Partial dependence plot (PDP), 140
Penalty, 98
Personalization, 76
Personalized premium, 44
Personalized pricing, 244
Phrenology, 254
Pirates, 66
Platt scaling, 105
Poisson, 94
Pooling, 242
Population Stability Index (PSI), 87
Predictive parity, 336, 337
Prejudice, v, 2
Pricing, v
Prima facie, 254
Principal component analysis, 120
Principal components, 101
Privacy, 189
Probability, 61
   conditional, 61
Propensity score, 301, 388
Propublica, 12
Proxy, 237, 244
Pruning, 106

Q
Quantile, 80, 89
Québec, 184

R
Race, 6
Raison d’être, 231
Random forest, 110, 113
Rawls, John, 8
Receiver operating characteristic (ROC), 118, 161
Reciprocity, 26
Redistribution, 26
Redlining, 261
Regression, 78
Reichenbach, Hans, 278
Religion, 6
Resampling, 110
Reserving, 54
Resolution, 167